Class ExemptionUrlFilter
- java.lang.Object
- org.apache.nutch.urlfilter.api.RegexURLFilterBase
- org.apache.nutch.urlfilter.regex.RegexURLFilter
- org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- All Implemented Interfaces:
Configurable, URLExemptionFilter, URLFilter, Pluggable
public class ExemptionUrlFilter extends RegexURLFilter implements URLExemptionFilter
This implementation of URLExemptionFilter uses regex configuration to check whether a URL is eligible for exemption from the db.ignore.external.links configuration property. When this filter is enabled, external URLs are checked against the configured sequence of regex rules.
The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be overridden using the configuration property db.ignore.external.exemptions.file.
The exemption rules are specified in a plain text file where each line is a rule. The format is the same as regex-urlfilter.txt. Each non-comment, non-blank line contains a regular expression prefixed by + or -. The first matching pattern in the file determines whether a URL is exempted or ignored. If no pattern matches, the URL is ignored.
- Since:
- Feb 10, 2016
- Version:
- 1
- See Also:
URLExemptionFilter, RegexURLFilter
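The first-match rule semantics in the class description can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: the class name ExemptionRuleSketch, its nested Rule type, and the sample rules are hypothetical, and the sketch assumes that comment lines start with # and that a pattern matches when it is found anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ExemptionRuleSketch {

    /** One parsed rule: '+' exempts the URL, '-' ignores it. */
    public static final class Rule {
        final boolean exempt;
        final Pattern pattern;

        Rule(boolean exempt, Pattern pattern) {
            this.exempt = exempt;
            this.pattern = pattern;
        }
    }

    /** Parses rule text, skipping blank lines and lines starting with '#' (assumed comment marker). */
    public static List<Rule> parse(String text) {
        List<Rule> rules = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
        return rules;
    }

    /** The first matching rule decides; if no rule matches, the URL is not exempted. */
    public static boolean isExempted(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.exempt;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules, written in the +/- prefix format described above.
        String rulesText = String.join("\n",
                "# never exempt archives or executables",
                "-\\.(exe|zip)$",
                "# exempt common image types",
                "+\\.(jpg|png|gif)$");
        List<Rule> rules = parse(rulesText);
        System.out.println(isExempted(rules, "http://example.org/logo.png"));
        System.out.println(isExempted(rules, "http://example.org/page.html"));
    }
}
```

Because evaluation stops at the first match, rule order matters: a broad + rule placed before a narrower - rule would shadow it.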
-
-
Field Summary
Modifier and Type, Field, Description
static String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE
-
Fields inherited from class org.apache.nutch.urlfilter.regex.RegexURLFilter
URLFILTER_REGEX_FILE, URLFILTER_REGEX_RULES
-
Fields inherited from class org.apache.nutch.urlfilter.api.RegexURLFilterBase
hasHostDomainRules
-
Fields inherited from interface org.apache.nutch.net.URLExemptionFilter
X_POINT_ID
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
-
Constructor Summary
Constructor, Description
ExemptionUrlFilter()
-
Method Summary
Modifier and Type, Method, Description
boolean filter(String fromUrl, String toUrl)
Checks if toUrl is exempted when ignoring external links is enabled
List<Pattern> getExemptions()
protected Reader getRulesReader(Configuration conf)
Gets reader for regex rules
static void main(String[] args)
-
Methods inherited from class org.apache.nutch.urlfilter.regex.RegexURLFilter
createRule, createRule
-
Methods inherited from class org.apache.nutch.urlfilter.api.RegexURLFilterBase
filter, getConf, main, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
-
-
-
Field Detail
-
DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE
public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE
- See Also:
- Constant Field Values
-
-
Method Detail
-
filter
public boolean filter(String fromUrl, String toUrl)
Description copied from interface: URLExemptionFilter
Checks if toUrl is exempted when ignoring external links is enabled
- Specified by:
filter in interface URLExemptionFilter
- Parameters:
fromUrl - the source URL that generated the outlink
toUrl - the destination URL that needs to be checked for exemption
- Returns:
- true when toUrl is exempted from db.ignore.external.links
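As a rough illustration of how a caller might act on this return value, the sketch below keeps an external outlink only when an exemption check approves it. It is an assumption about typical usage, not Nutch's actual code path: ExternalLinkDecisionSketch, host, and keepOutlink are hypothetical names, a BiPredicate stands in for a filter(fromUrl, toUrl) call, and "external" is simplified to "different host".

```java
import java.net.URI;
import java.util.Locale;
import java.util.function.BiPredicate;

public class ExternalLinkDecisionSketch {

    /** Hypothetical helper: lower-cased host of a URL, or "" if the URL has none. */
    public static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    /**
     * Decides whether an outlink is kept.
     * The exemption predicate stands in for a filter(fromUrl, toUrl) call.
     */
    public static boolean keepOutlink(boolean ignoreExternalLinks,
                                      String fromUrl,
                                      String toUrl,
                                      BiPredicate<String, String> exemption) {
        if (!ignoreExternalLinks) {
            return true; // external links are not being ignored at all
        }
        if (host(fromUrl).equals(host(toUrl))) {
            return true; // internal link, no exemption needed
        }
        // External link: keep it only if it is exempted.
        return exemption.test(fromUrl, toUrl);
    }

    public static void main(String[] args) {
        // Hypothetical exemption: external PNG images are allowed through.
        BiPredicate<String, String> exemptPng = (from, to) -> to.endsWith(".png");
        System.out.println(keepOutlink(true,
                "http://a.example/page", "http://b.example/img.png", exemptPng));
        System.out.println(keepOutlink(true,
                "http://a.example/page", "http://b.example/other.html", exemptPng));
    }
}
```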
-
getRulesReader
protected Reader getRulesReader(Configuration conf) throws IOException
Gets reader for regex rules
- Overrides:
getRulesReader in class RegexURLFilter
- Parameters:
conf - the current configuration
- Returns:
- a Reader over the resource containing the rules to use
- Throws:
IOException - if there is a fatal error obtaining the Reader
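The lookup described in the class description (a default file name that a configuration property can override, then a Reader opened from the classpath) could look roughly like the sketch below. It is not the actual method body: java.util.Properties stands in for the Hadoop Configuration, and RulesReaderSketch, resolveFileName, and openRules are hypothetical names. Only the property name and default file name come from the class description.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class RulesReaderSketch {

    // Property name and default file name as given in the class description.
    static final String FILE_PROPERTY = "db.ignore.external.exemptions.file";
    static final String DEFAULT_FILE = "db-ignore-external-exemptions.txt";

    /** Returns the configured rules file name, or the default when the property is unset. */
    public static String resolveFileName(Properties conf) {
        return conf.getProperty(FILE_PROPERTY, DEFAULT_FILE);
    }

    /** Opens the rules file from the classpath; fails with IOException if it is missing. */
    public static Reader openRules(Properties conf) throws IOException {
        String name = resolveFileName(conf);
        InputStream in = RulesReaderSketch.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new IOException("Rules file not found on classpath: " + name);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("Default rules file: " + resolveFileName(conf));

        // Hypothetical override pointing at a different rules file.
        conf.setProperty(FILE_PROPERTY, "my-exemptions.txt");
        try (Reader reader = openRules(conf)) {
            System.out.println("Opened " + resolveFileName(conf));
        } catch (IOException e) {
            System.out.println("Could not open rules: " + e.getMessage());
        }
    }
}
```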
-
main
public static void main(String[] args)
-
-