Interface SegmentMergeFilter


  • public interface SegmentMergeFilter
    Interface used to filter segments during a segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular, it allows filtering based on metadata collected while parsing a page.
    • Field Detail

      • X_POINT_ID

        static final String X_POINT_ID
        The name of the extension point.
    • Method Detail

      • filter

        boolean filter(Text key,
                       CrawlDatum generateData,
                       CrawlDatum fetchData,
                       CrawlDatum sigData,
                       Content content,
                       ParseData parseData,
                       ParseText parseText,
                       Collection<CrawlDatum> linked)
        The filtering method which gets all information being merged for a given key (URL).
        Parameters:
        key - the segment record key
        generateData - directory and data produced by the generation phase
        fetchData - directory and data produced by the fetch phase
        sigData - directory and data produced by the parse phase
        content - directory and data produced by the fetch phase (the raw fetched content)
        parseData - directory and data produced by the parse phase
        parseText - directory and data produced by the parse phase
        linked - all LINKED values from the latest segment
        Returns:
        true if the values for this key (URL) should be merged into the new segment
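
The contract above can be illustrated with a minimal sketch of a filter that decides on parse metadata rather than on the URL. Everything other than the `filter` signature is an assumption made for illustration: the nested `Text`, `CrawlDatum`, `Content`, `ParseData` and `ParseText` classes are simplified stand-ins (not the real Hadoop/Nutch classes) so that the example compiles on its own, and the `parseMeta` map, the `x-no-merge` key and the `NoMergeFlagFilter` name are hypothetical. The sketch also assumes that an argument may be null when a segment lacks the corresponding part.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SegmentMergeFilterSketch {

    // Hypothetical stand-ins for the Hadoop/Nutch types named in the signature,
    // reduced to what this sketch needs so that it compiles on its own.
    static final class Text {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }
    static final class CrawlDatum { }
    static final class Content { }
    static final class ParseText { }
    static final class ParseData {
        // Assumption: parse metadata exposed as a plain string map.
        final Map<String, String> parseMeta = new HashMap<>();
    }

    // Same method signature as documented above.
    interface SegmentMergeFilter {
        boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                       CrawlDatum sigData, Content content, ParseData parseData,
                       ParseText parseText, Collection<CrawlDatum> linked);
    }

    // Drops a record when its parse metadata carries a (hypothetical) no-merge flag.
    static final class NoMergeFlagFilter implements SegmentMergeFilter {
        static final String FLAG = "x-no-merge"; // hypothetical metadata key

        @Override
        public boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                              CrawlDatum sigData, Content content, ParseData parseData,
                              ParseText parseText, Collection<CrawlDatum> linked) {
            if (parseData == null) {
                // No parse metadata to judge by: keep the record.
                return true;
            }
            // Returning false excludes the record from the merged segment.
            return !"true".equalsIgnoreCase(parseData.parseMeta.get(FLAG));
        }
    }

    public static void main(String[] args) {
        SegmentMergeFilter filter = new NoMergeFlagFilter();
        Text key = new Text("http://example.com/");
        Collection<CrawlDatum> noLinks = Collections.emptyList();

        ParseData flagged = new ParseData();
        flagged.parseMeta.put(NoMergeFlagFilter.FLAG, "true");

        // Flagged record is rejected; unflagged record is kept.
        System.out.println(filter.filter(key, null, null, null, null, flagged, null, noLinks));
        System.out.println(filter.filter(key, null, null, null, null, new ParseData(), null, noLinks));
    }
}
```

A real implementation would use the actual Nutch and Hadoop classes in place of the stand-ins and would be registered as a plugin for the extension point named by X_POINT_ID.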