Class OPICScoringFilter

  • All Implemented Interfaces:
    Configurable, Pluggable, ScoringFilter

    public class OPICScoringFilter
    extends Object
    implements ScoringFilter
    This plugin implements a variant of an Online Page Importance Computation (OPIC) score, described in this paper: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.
    Author:
    Andrzej Bialecki
    • Constructor Detail

      • OPICScoringFilter

        public OPICScoringFilter()
    • Method Detail

      • injectedScore

        public void injectedScore(Text url,
                                  CrawlDatum datum)
                           throws ScoringFilterException
        Description copied from interface: ScoringFilter
        Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.
        Specified by:
        injectedScore in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - new datum. Filters will modify it in-place.
        Throws:
        ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
      • initialScore

        public void initialScore(Text url,
                                 CrawlDatum datum)
                          throws ScoringFilterException
        Sets the score to 0.0f (unknown value); inlink contributions will bring it to the correct level. Newly discovered pages have at least one inlink.
        Specified by:
        initialScore in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - new datum. Filters will modify it in-place.
        Throws:
        ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
      • generatorSortValue

        public float generatorSortValue(Text url,
                                        CrawlDatum datum,
                                        float initSort)
                                 throws ScoringFilterException
        Specified by:
        generatorSortValue in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - page's datum, should not be modified
        initSort - initial sort value, or a value from previous filters in chain
        Returns:
        a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
        Throws:
        ScoringFilterException - if there is a fatal error preparing the sort value
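This page gives no description of how the sort value is derived. The following is a minimal sketch, assuming the value is the page's stored score scaled by initSort; that rule is an assumption, not something stated here. SortValueSketch and its methods are hypothetical stand-ins that use plain Java types instead of Text and CrawlDatum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in; does not use the real Nutch Text or CrawlDatum types. */
class SortValueSketch {

    /** Assumed rule: scale the page's score by the value handed down the filter chain. */
    static float generatorSortValue(float pageScore, float initSort) {
        return pageScore * initSort;
    }

    /** Returns the URLs of the topN pages, highest sort value first. */
    static List<String> topN(Map<String, Float> scoresByUrl, float initSort, int topN) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scoresByUrl.entrySet());
        // Sort descending by the computed sort value.
        entries.sort((a, b) -> Float.compare(
                generatorSortValue(b.getValue(), initSort),
                generatorSortValue(a.getValue(), initSort)));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Calling SortValueSketch.topN(scores, 1.0f, 2) on a map of URL scores returns the two URLs with the largest scores, which mirrors how a generator might pick the top N pages for a fetchlist.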
      • updateDbScore

        public void updateDbScore(Text url,
                                  CrawlDatum old,
                                  CrawlDatum datum,
                                  List<CrawlDatum> inlinked)
                           throws ScoringFilterException
        Increases the score by the sum of the inlinked scores.
        Specified by:
        updateDbScore in interface ScoringFilter
        Parameters:
        url - url of the page
        old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values - the datum parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
        datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
        inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
        Throws:
        ScoringFilterException - if there is a fatal error calculating the new score of a CrawlDatum during CrawlDb update
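The update rule can be sketched on plain floats. This sketch assumes the starting value is the old score when one exists (as the description of the old parameter suggests) and the new datum's score otherwise. UpdateScoreSketch is a hypothetical class that does not use the real CrawlDatum type.

```java
import java.util.List;

/** Hypothetical stand-in for the score arithmetic only. */
class UpdateScoreSketch {

    /**
     * @param oldScore     score of the existing CrawlDb entry, or null for a newly discovered page
     * @param datumScore   score carried by the new datum
     * @param inlinkScores scores contributed by inlinks found in the current update batch
     * @return the starting score plus the sum of all inlink scores
     */
    static float updatedScore(Float oldScore, float datumScore, List<Float> inlinkScores) {
        // Assumption: prefer the old score as the starting value when it is available.
        float base = (oldScore != null) ? oldScore : datumScore;
        float sum = 0.0f;
        for (float s : inlinkScores) {
            sum += s;
        }
        return base + sum;
    }
}
```

For a newly discovered page with an initial score of 0.0f and two inlinks scored 0.25f and 0.5f, the sketch yields 0.75f, so the page's score comes entirely from its inlinks.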
      • passScoreBeforeParsing

        public void passScoreBeforeParsing(Text url,
                                           CrawlDatum datum,
                                           Content content)
        Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
        Specified by:
        passScoreBeforeParsing in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - source datum. NOTE: modifications to this value are not persisted.
        content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
      • passScoreAfterParsing

        public void passScoreAfterParsing(Text url,
                                          Content content,
                                          Parse parse)
        Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
        Specified by:
        passScoreAfterParsing in interface ScoringFilter
        Parameters:
        url - page url
        content - original content. NOTE: modifications to this value are not persisted.
        parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
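The two pass-through steps (storing the score before parsing, then copying it after parsing) can be sketched with plain maps standing in for the Content and parse metadata. PassScoreSketch, the placeholder key string, and the null check are all assumptions made for illustration; the real value of Fetcher.SCORE_KEY is not given on this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in that uses String maps instead of Nutch metadata objects. */
class PassScoreSketch {

    // Placeholder key for illustration; not the real Fetcher.SCORE_KEY value.
    static final String SCORE_KEY = "example.score.key";

    /** Before parsing: store the datum's score as a string in the content metadata. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** After parsing: copy the stored score from content metadata to parse metadata. */
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        // Assumption: skip the copy when no score was stored.
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    /** Runs both steps and returns the score as seen on the parse side. */
    static float roundTrip(float datumScore) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(datumScore, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        return Float.parseFloat(parseMeta.get(SCORE_KEY));
    }
}
```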
      • distributeScoreToOutlinks

        public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
                                                    ParseData parseData,
                                                    Collection<Map.Entry<Text,CrawlDatum>> targets,
                                                    CrawlDatum adjust,
                                                    int allCount)
                                             throws ScoringFilterException
        Reads the float score stored under Fetcher.SCORE_KEY, divides it by the number of outlinks, and applies the resulting share to each target.
        Specified by:
        distributeScoreToOutlinks in interface ScoringFilter
        Parameters:
        fromUrl - url of the source page
        parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
        targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
        adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
        allCount - number of all collected outlinks from the source page
        Returns:
        if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
        Throws:
        ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
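The division step can be sketched as follows. The sketch assumes the divisor is allCount and that each target's score is increased by the resulting share; any further weighting an implementation may apply is left out. DistributeScoreSketch is a hypothetical class that uses a map of URL to score in place of the <url, CrawlDatum> pairs.

```java
import java.util.Map;

/** Hypothetical stand-in for the score arithmetic only. */
class DistributeScoreSketch {

    /**
     * Adds an equal share of pageScore to every target score, in place.
     *
     * @param pageScore    score read from the source page's metadata
     * @param targetScores url -> current score of each outlink target
     * @param allCount     number of all collected outlinks from the source page
     */
    static void distribute(float pageScore, Map<String, Float> targetScores, int allCount) {
        // Assumption: with no outlinks there is nothing to distribute.
        if (allCount <= 0) {
            return;
        }
        float share = pageScore / allCount;
        for (Map.Entry<String, Float> e : targetScores.entrySet()) {
            e.setValue(e.getValue() + share);
        }
    }
}
```

With a page score of 1.0f and allCount of 4, each target receives 0.25f, even if fewer than four targets were retained in the map.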
      • indexerScore

        public float indexerScore(Text url,
                                  NutchDocument doc,
                                  CrawlDatum dbDatum,
                                  CrawlDatum fetchDatum,
                                  Parse parse,
                                  Inlinks inlinks,
                                  float initScore)
                           throws ScoringFilterException
        Dampen the boost value by scorePower.
        Specified by:
        indexerScore in interface ScoringFilter
        Parameters:
        url - url of the page
        doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
        dbDatum - current page from CrawlDb. NOTE:
        • changes made to this instance are not persisted
        • may be null if indexing is done without CrawlDb or if the segment was not generated from the CrawlDb (e.g. via FreeGenerator).
        fetchDatum - datum from FetcherOutput (containing among others the fetching status)
        parse - parsing result. NOTE: changes made to this instance are not persisted.
        inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
        initScore - initial boost value for the indexed document.
        Returns:
        boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
        Throws:
        ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
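One way to read "dampen by scorePower" is to raise the page's CrawlDb score to that power and multiply by the incoming boost. This formula is an assumption, not confirmed by this page, as is the fallback to initScore when dbDatum is null. IndexerScoreSketch is a hypothetical class working on plain floats.

```java
/** Hypothetical stand-in for the boost arithmetic only. */
class IndexerScoreSketch {

    /**
     * @param dbScore    score of the page in CrawlDb, or null when no dbDatum is available
     * @param scorePower exponent used for dampening; values between 0 and 1 shrink large scores
     * @param initScore  boost handed down from previous filters in the chain
     */
    static float indexerScore(Float dbScore, float scorePower, float initScore) {
        // Assumption: without a CrawlDb score, pass the incoming boost through unchanged.
        if (dbScore == null) {
            return initScore;
        }
        // Assumed formula: dbScore^scorePower * initScore.
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }
}
```

With scorePower set to 0.5f, a score of 16.0f contributes a factor of about 4 rather than 16, so very high scores dominate the boost less.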