Package org.apache.nutch.parse
Interface HtmlParseFilter
- All Superinterfaces: Configurable, Pluggable
- All Known Implementing Classes: CCParseFilter, DebugParseFilter, HeadingsParseFilter, HTMLLanguageParser, JSParseFilter, MetaTagsParser, NaiveBayesParseFilter, RegexParseFilter, RelTagParser
public interface HtmlParseFilter extends Pluggable, Configurable
Extension point for DOM-based HTML parsers. It lets plugins add metadata to HTML parses. All plugins that implement this extension point are run sequentially on the parse.
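The sequential behaviour described above can be illustrated with a minimal, JDK-only sketch. It does not use the Nutch API: `SimpleParse`, `SimpleFilter`, and `runAll` are hypothetical stand-ins for `ParseResult`, `HtmlParseFilter`, and whatever code drives the registered plugins, and the real driver may apply additional rules (for example around failed parses) that are not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterChainSketch {

    /** Hypothetical stand-in for ParseResult: only a metadata map. */
    static final class SimpleParse {
        final Map<String, String> metadata = new LinkedHashMap<>();
    }

    /** Hypothetical stand-in for HtmlParseFilter, reduced to two arguments. */
    interface SimpleFilter {
        SimpleParse filter(String html, SimpleParse parse);
    }

    /** Runs the filters in list order; each one receives the previous one's result. */
    static SimpleParse runAll(List<SimpleFilter> filters, String html, SimpleParse parse) {
        SimpleParse current = parse;
        for (SimpleFilter filter : filters) {
            current = filter.filter(html, current);
        }
        return current;
    }

    public static void main(String[] args) {
        // First filter records the length of the raw markup.
        SimpleFilter length = (html, p) -> {
            p.metadata.put("length", String.valueOf(html.length()));
            return p;
        };
        // Second filter can observe what the first one added.
        SimpleFilter check = (html, p) -> {
            p.metadata.put("hasLength", String.valueOf(p.metadata.containsKey("length")));
            return p;
        };
        SimpleParse out = runAll(List.of(length, check), "<p>hi</p>", new SimpleParse());
        System.out.println(out.metadata);
    }
}
```

Because each filter receives the result of the one before it, a later filter can depend on metadata written by an earlier one, which is why ordering matters in a chain like this.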
Field Summary
Modifier and Type | Field | Description
static String | X_POINT_ID | The name of the extension point.
Method Summary
Modifier and Type | Method | Description
ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
filter
ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed during filtering
- Returns:
a filtered ParseResult
- See Also:
Parser.getParse(Content)
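As an illustration of the kind of DOM processing a filter might perform on the `doc` argument, the following sketch walks a `DocumentFragment` and collects the text of every element with a given tag name. It uses only the JDK's `org.w3c.dom` API. The class name `FragmentWalkSketch`, the helpers `collectText` and `sampleFragment`, and the hand-built sample tree are assumptions made for this example; a real filter would receive the fragment from the parser and would typically store what it finds in the `ParseResult` metadata.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class FragmentWalkSketch {

    /** Returns the trimmed text of every element under root whose name matches tagName. */
    static List<String> collectText(Node root, String tagName) {
        List<String> found = new ArrayList<>();
        collect(root, tagName, found);
        return found;
    }

    /** Depth-first walk in document order. */
    static void collect(Node node, String tagName, List<String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tagName.equalsIgnoreCase(node.getNodeName())) {
            found.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tagName, found);
        }
    }

    /** Builds a tiny fragment by hand; in a real filter the parser supplies it. */
    static DocumentFragment sampleFragment() {
        Document owner;
        try {
            owner = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        DocumentFragment fragment = owner.createDocumentFragment();
        Element body = owner.createElement("body");
        Element heading = owner.createElement("h1");
        heading.setTextContent("Title");
        Element paragraph = owner.createElement("p");
        paragraph.setTextContent("Body text");
        body.appendChild(heading);
        body.appendChild(paragraph);
        fragment.appendChild(body);
        return fragment;
    }

    public static void main(String[] args) {
        System.out.println(collectText(sampleFragment(), "h1"));
    }
}
```

The tag name is compared case-insensitively here because HTML parsers differ in whether they report element names in upper or lower case; that choice is a defensive assumption, not something stated by this interface.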