Package org.apache.nutch.indexer
Interface IndexingFilter
All Superinterfaces:
Configurable, Pluggable
All Known Implementing Classes:
AnchorIndexingFilter, ArbitraryIndexingFilter, BasicIndexingFilter, CCIndexingFilter, FeedIndexingFilter, GeoIPIndexingFilter, JexlIndexingFilter, LanguageIndexingFilter, LinksIndexingFilter, MetadataIndexer, MimeTypeIndexingFilter, MoreIndexingFilter, RelTagIndexingFilter, ReplaceIndexer, StaticFieldIndexer, SubcollectionIndexingFilter, TLDIndexingFilter, URLMetaIndexingFilter
public interface IndexingFilter extends Pluggable, Configurable
Extension point for indexing. Allows plugins to add metadata to the indexed fields. All plugins that implement this extension point are run sequentially on the parse.
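The sequential behaviour can be pictured with a small self-contained sketch. The types below (`SketchDoc`, `SketchFilter`, `ChainSketch`) are hypothetical stand-ins and not the Nutch classes: the real `filter` method also takes a `Parse`, `CrawlDatum` and `Inlinks`, and may throw `IndexingException`. The sketch assumes that the chain stops as soon as one filter returns null.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: field names mapped to lists of values.
class SketchDoc {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

// Hypothetical, simplified stand-in for IndexingFilter.
interface SketchFilter {
    SketchDoc filter(SketchDoc doc, String url);
}

class ChainSketch {
    // Runs the filters in order, feeding the output of one into the next.
    // Assumption: a null result ends the chain and discards the document.
    static SketchDoc runAll(List<SketchFilter> filters, SketchDoc doc, String url) {
        for (SketchFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        SketchFilter addUrl = (d, u) -> { d.add("url", u); return d; };
        SketchFilter addHost = (d, u) -> { d.add("host", URI.create(u).getHost()); return d; };
        SketchDoc out = runAll(Arrays.asList(addUrl, addHost), new SketchDoc(), "http://example.org/a");
        System.out.println(out.fields);
    }
}
```

Because each filter receives the document produced by the previous one, the order in which plugins are configured can affect the final set of fields.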
Field Summary
Modifier and Type: static String
Field: X_POINT_ID
Description: The name of the extension point.
Method Summary
Modifier and Type: NutchDocument
Method: filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
Description: Adds fields or otherwise modifies the document that will be indexed for a parse.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
filter
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page URL
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Returns:
modified (or a new) document instance, or null (meaning the document should be discarded)
Throws:
IndexingException - if an error occurs during filtering
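The three outcomes of `filter` (return the document, return null, or throw) can be illustrated with a minimal sketch. Everything here is hypothetical: `FilterSketchDoc` and `FilterSketchException` stand in for `NutchDocument` and `IndexingException`, the parsed text is passed as a plain `String` in place of a `Parse`, and the minimum-length rule is an invented example rather than an existing Nutch filter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for IndexingException (a checked exception).
class FilterSketchException extends Exception {
    FilterSketchException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for NutchDocument, reduced to single-valued fields.
class FilterSketchDoc {
    final Map<String, String> fields = new HashMap<>();
}

// Invented example filter: discards pages whose parsed text is too short.
class MinLengthFilterSketch {
    private final int minLength;

    MinLengthFilterSketch(int minLength) {
        this.minLength = minLength;
    }

    FilterSketchDoc filter(FilterSketchDoc doc, String parsedText, String url)
            throws FilterSketchException {
        if (url == null || url.isEmpty()) {
            // Error case: signal a failure during filtering.
            throw new FilterSketchException("missing url");
        }
        if (parsedText == null || parsedText.length() < minLength) {
            // Discard case: null means the document should not be indexed.
            return null;
        }
        // Normal case: add fields and hand the document back.
        doc.fields.put("url", url);
        doc.fields.put("contentLength", Integer.toString(parsedText.length()));
        return doc;
    }

    public static void main(String[] args) throws FilterSketchException {
        MinLengthFilterSketch f = new MinLengthFilterSketch(5);
        FilterSketchDoc kept = f.filter(new FilterSketchDoc(), "hello world", "http://example.org/");
        FilterSketchDoc dropped = f.filter(new FilterSketchDoc(), "hi", "http://example.org/");
        System.out.println("kept fields: " + kept.fields.keySet());
        System.out.println("dropped is null: " + (dropped == null));
    }
}
```

A caller should therefore check the returned reference before using it, and should use that reference rather than the instance it passed in, since a filter may return a new document.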