Class ArbitraryIndexingFilter

  • All Implemented Interfaces:
    Configurable, IndexingFilter, Pluggable

    public class ArbitraryIndexingFilter
    extends Object
    implements IndexingFilter
    Adds arbitrary searchable fields to a document, taking each value from a class and method that the user identifies in the configuration. For each field, the user supplies the name of the field to add together with the names of the class and method that supply its value. Example:

    <property>
    <name>index.arbitrary.function.count</name>
    <value>1</value>
    </property>

    <property>
    <name>index.arbitrary.fieldName.0</name>
    <value>advisors</value>
    </property>

    <property>
    <name>index.arbitrary.className.0</name>
    <value>com.example.arbitrary.AdvisorCalculator</value>
    </property>

    <property>
    <name>index.arbitrary.constructorArgs.0</name>
    <value>Kirk</value>
    </property>

    <property>
    <name>index.arbitrary.methodName.0</name>
    <value>countAdvisors</value>
    </property>

    <property>
    <name>index.arbitrary.methodArgs.0</name>
    <value>Spock,McCoy</value>
    </property>

    To set more than one arbitrary field value, increase index.arbitrary.function.count and repeat the remaining property blocks with successive integer values appended to the property names, e.g. index.arbitrary.fieldName.1 and index.arbitrary.methodName.1.
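As an illustration of the example configuration above, a class like the following could serve as the configured target. This is a minimal sketch: the body of AdvisorCalculator, its String-based constructor and method signatures, and the int return value are assumptions made for illustration, not the plugin's documented contract. The package declaration (com.example.arbitrary) is omitted to keep the sketch self-contained.

```java
// Hypothetical user-supplied class matching the example configuration.
// ASSUMPTION: constructor and method both receive their configured
// arguments as strings; check the plugin source for the real contract.
public class AdvisorCalculator {

    private final String captain;

    // Would receive the value of index.arbitrary.constructorArgs.0 ("Kirk").
    public AdvisorCalculator(String... constructorArgs) {
        this.captain = constructorArgs.length > 0 ? constructorArgs[0] : "";
    }

    public String getCaptain() {
        return captain;
    }

    // Would receive the split value of index.arbitrary.methodArgs.0
    // ("Spock", "McCoy") and return the value stored in the "advisors" field.
    public int countAdvisors(String... methodArgs) {
        return methodArgs.length;
    }

    public static void main(String[] args) {
        AdvisorCalculator calc = new AdvisorCalculator("Kirk");
        System.out.println(calc.countAdvisors("Spock", "McCoy"));
    }
}
```

Under these assumptions, the example configuration would yield a field named advisors whose value is the number of method arguments.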
    • Constructor Detail

      • ArbitraryIndexingFilter

        public ArbitraryIndexingFilter()
    • Method Detail

      • filter

        public NutchDocument filter(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Uses reflection to instantiate the configured class and invoke the configured method. The filter requires a few configuration settings in order to add arbitrary fields and values to the NutchDocument as searchable fields: index.arbitrary.function.count and, for each index from 0 to index.arbitrary.function.count - 1, the properties index.arbitrary.fieldName.index, index.arbitrary.className.index, index.arbitrary.constructorArgs.index, index.arbitrary.methodName.index, and index.arbitrary.methodArgs.index. These are set in nutch-default.xml or nutch-site.xml.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - The NutchDocument object
        parse - The relevant Parse object passing through the filter
        url - URL to be filtered by the user-specified class
        datum - The CrawlDatum entry
        inlinks - The Inlinks containing anchor text
        Returns:
        filtered NutchDocument
        Throws:
        IndexingException - if an error occurs during filtering
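The reflection step described for filter can be sketched in plain Java as follows. This is not the plugin's actual code: the Greeter class, the call helper, the String[] parameter types, and the comma-splitting of argument values are all assumptions chosen to show the general technique of instantiating a class by name and invoking a named method on it.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

public class ReflectiveCallSketch {

    // Hypothetical stand-in for a user-supplied class.
    public static class Greeter {
        private final String prefix;

        public Greeter(String[] args) {
            this.prefix = args[0];
        }

        public String greet(String[] args) {
            return prefix + " " + String.join(" and ", args);
        }
    }

    // Instantiate className with the comma-separated constructorArgs,
    // then invoke methodName with the comma-separated methodArgs.
    // ASSUMPTION: both the constructor and the method take a String[].
    public static Object call(String className, String constructorArgs,
                              String methodName, String methodArgs)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        Constructor<?> ctor = cls.getConstructor(String[].class);
        // Cast to Object so the array is passed as a single argument
        // rather than being spread across the varargs parameter.
        Object instance = ctor.newInstance((Object) constructorArgs.split(","));
        Method method = cls.getMethod(methodName, String[].class);
        return method.invoke(instance, (Object) methodArgs.split(","));
    }

    public static void main(String[] args) throws Exception {
        Object value = call(Greeter.class.getName(), "Kirk", "greet", "Spock,McCoy");
        System.out.println(value); // prints "Kirk Spock and McCoy"
    }
}
```

In a real filter, a failure in any of these reflective calls would presumably be reported to the caller, for instance by wrapping it in the IndexingException that filter declares.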
      • setIndexedConf

        public void setIndexedConf(Configuration conf,
                                   int ndx)
        Sets the Configuration object for one specific set of values in the configuration, identified by its index.
        Parameters:
        conf - The Configuration object holding values for the current arbitrary field.
        ndx - The ordinal counter value for the current arbitrary field appended to the base property names in the xml configuration file.
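How the ordinal ndx combines with the base property names can be sketched as below. This is an illustrative helper only: IndexedPropertySketch, key, and settingsFor are hypothetical names, and a plain Map stands in for the Hadoop Configuration object so that the example needs no external libraries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexedPropertySketch {

    static final String PREFIX = "index.arbitrary.";

    static final String[] BASE_NAMES = {
        "fieldName", "className", "constructorArgs", "methodName", "methodArgs"
    };

    // Build the full property name for a base name and ordinal,
    // e.g. key("fieldName", 0) gives "index.arbitrary.fieldName.0".
    public static String key(String baseName, int ndx) {
        return PREFIX + baseName + "." + ndx;
    }

    // Collect the settings for the ndx-th arbitrary field.
    // ASSUMPTION: conf is a plain map standing in for the real Configuration.
    public static Map<String, String> settingsFor(Map<String, String> conf, int ndx) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String base : BASE_NAMES) {
            settings.put(base, conf.get(key(base, ndx)));
        }
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("index.arbitrary.fieldName.0", "advisors");
        conf.put("index.arbitrary.methodName.0", "countAdvisors");
        System.out.println(settingsFor(conf, 0));
    }
}
```

Properties that are missing for a given ordinal come back as null in this sketch; how the real filter treats missing values is not stated here.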