Package org.apache.nutch.scoring.opic
Class OPICScoringFilter
java.lang.Object
    org.apache.nutch.scoring.opic.OPICScoringFilter
All Implemented Interfaces: Configurable, Pluggable, ScoringFilter
public class OPICScoringFilter extends Object implements ScoringFilter
This plugin implements a variant of an Online Page Importance Computation (OPIC) score, described in this paper: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.
Author: Andrzej Bialecki
-
Field Summary
-
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
X_POINT_ID
-
-
Constructor Summary
Constructor: OPICScoringFilter()
-
Method Summary
Each entry lists the modifier and type, the method, and its description (all methods are concrete instance methods).
CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
    Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
float generatorSortValue(Text url, CrawlDatum datum, float initSort)
Configuration getConf()
float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
    Dampen the boost value by scorePower.
void initialScore(Text url, CrawlDatum datum)
    Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
void injectedScore(Text url, CrawlDatum datum)
    Set an initial score for newly injected pages.
void passScoreAfterParsing(Text url, Content content, Parse parse)
    Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
    Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
void setConf(Configuration conf)
void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
    Increase the score by a sum of inlinked scores.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.nutch.scoring.ScoringFilter
orphanedScore
-
Method Detail
-
getConf
public Configuration getConf()
Specified by: getConf in interface Configurable
-
setConf
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
-
injectedScore
public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.
Specified by: injectedScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
-
initialScore
public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. Newly discovered pages have at least one inlink.
Specified by: initialScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
-
generatorSortValue
public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
Specified by: generatorSortValue in interface ScoringFilter
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Returns:
a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
Throws:
ScoringFilterException - if there is a fatal error preparing the sort value
-
updateDbScore
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException
Increase the score by a sum of inlinked scores.
Specified by: updateDbScore in interface ScoringFilter
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values; the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when the fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum instances (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException - if there is a fatal error calculating a new score of CrawlDatum during CrawlDb update
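The update rule described above can be sketched in plain Java. This is a minimal illustration and not the plugin's actual code: `SimpleDatum` is a hypothetical stand-in for `CrawlDatum`, and the fallback to `datum` when `old` is null is an assumption drawn from the parameter notes.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone sketch; SimpleDatum is a hypothetical stand-in for CrawlDatum. */
class UpdateDbScoreSketch {

    static final class SimpleDatum {
        float score;

        SimpleDatum(float score) {
            this.score = score;
        }
    }

    /**
     * Adds the sum of the inlinked scores to the starting score.
     * The starting score comes from 'old' when it is available, because
     * 'datum' may hold a stale value; otherwise it comes from 'datum' (assumption).
     */
    static void updateDbScore(SimpleDatum old, SimpleDatum datum, List<SimpleDatum> inlinked) {
        float adjust = 0.0f;
        for (SimpleDatum linked : inlinked) {
            adjust += linked.score;
        }
        SimpleDatum base = (old != null) ? old : datum;
        datum.score = base.score + adjust; // updated in place, as the contract describes
    }

    public static void main(String[] args) {
        SimpleDatum old = new SimpleDatum(1.0f);
        SimpleDatum datum = new SimpleDatum(9.0f); // stale value, ignored because old != null
        updateDbScore(old, datum, Arrays.asList(new SimpleDatum(0.25f), new SimpleDatum(0.5f)));
        System.out.println(datum.score);
    }
}
```

A newly discovered page would reach this step with a score of 0.0f from initialScore, so its first real score is simply the sum of the contributions from the pages linking to it.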
-
passScoreBeforeParsing
public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
Specified by: passScoreBeforeParsing in interface ScoringFilter
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
-
passScoreAfterParsing
public void passScoreAfterParsing(Text url, Content content, Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
Specified by: passScoreAfterParsing in interface ScoringFilter
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
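Taken together, passScoreBeforeParsing and passScoreAfterParsing hand the score from the datum to the content metadata and then on to the parse data. The sketch below illustrates that hand-off with plain maps standing in for the Nutch metadata objects. The key string, the map types, and the null check are assumptions made for illustration; the real key is the value of Fetcher.SCORE_KEY.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone sketch; plain maps stand in for Content and ParseData metadata. */
class ScoreHandOffSketch {

    // Hypothetical placeholder for the value of Fetcher.SCORE_KEY.
    static final String SCORE_KEY = "sketch.score.key";

    /** Stores the datum's score as a string under the score key. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** Copies the stored value from the content metadata to the parse metadata. */
    static void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String value = contentMeta.get(SCORE_KEY);
        if (value != null) { // defensive check added for this sketch
            parseMeta.put(SCORE_KEY, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(0.75f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(Float.parseFloat(parseMeta.get(SCORE_KEY)));
    }
}
```

Carrying the score as metadata lets a later step, such as distributeScoreToOutlinks, read it from parseData without access to the original datum.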
-
distributeScoreToOutlinks
public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
Specified by: distributeScoreToOutlinks in interface ScoringFilter
Parameters:
fromUrl - url of the source page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
Returns:
if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
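The arithmetic behind this method can be sketched as follows. This is a simplified illustration and not the plugin's code: a map from URL to score stands in for the `<url, CrawlDatum>` pairs, and the divisor is passed in as a hypothetical `outlinkCount` parameter because the text above does not say whether the share is based on `allCount` or on the size of `targets`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone sketch; a URL-to-score map stands in for the target CrawlDatum entries. */
class DistributeScoreSketch {

    /**
     * Parses the stored score, divides it by the outlink count,
     * and assigns the resulting share to every target in place.
     */
    static void distribute(String storedScore, Map<String, Float> targets, int outlinkCount) {
        if (outlinkCount <= 0) {
            return; // nothing to distribute (guard added for this sketch)
        }
        float share = Float.parseFloat(storedScore) / outlinkCount;
        for (Map.Entry<String, Float> target : targets.entrySet()) {
            target.setValue(share);
        }
    }

    public static void main(String[] args) {
        Map<String, Float> targets = new LinkedHashMap<>();
        targets.put("http://a.example/", 0.0f);
        targets.put("http://b.example/", 0.0f);
        targets.put("http://c.example/", 0.0f);
        targets.put("http://d.example/", 0.0f);
        distribute("1.0", targets, targets.size());
        System.out.println(targets);
    }
}
```

Dividing by the outlink count means a page splits its score evenly among its outlinks, so a page with many links passes less to each one than a page with few.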
-
indexerScore
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Dampen the boost value by scorePower.
Specified by: indexerScore in interface ScoringFilter
Parameters:
url - url of the page
doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE:
    - changes made to this instance are not persisted
    - may be null if indexing is done without CrawlDb or if the segment is generated not from the CrawlDb (via FreeGenerator).
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the indexed document.
Returns:
boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
Throws:
ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
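One plausible reading of "dampen the boost value by scorePower" is sketched below. The power-function form, the way the result is combined with `initScore`, and the pass-through when there is no CrawlDb score are all assumptions made for illustration, since this page does not spell out the formula.

```java
/** Standalone sketch; assumes the dampening is a power function of the CrawlDb score. */
class IndexerScoreSketch {

    /**
     * Raises the CrawlDb score to scorePower and multiplies by the incoming boost.
     * A null dbScore models the documented case where dbDatum may be null;
     * the incoming boost is then returned unchanged (assumption).
     */
    static float indexerScore(Float dbScore, float initScore, float scorePower) {
        if (dbScore == null) {
            return initScore;
        }
        return (float) Math.pow(dbScore.floatValue(), scorePower) * initScore;
    }

    public static void main(String[] args) {
        // With scorePower = 0.5 the CrawlDb score contributes its square root.
        System.out.println(indexerScore(4.0f, 1.5f, 0.5f));
        System.out.println(indexerScore(null, 1.5f, 0.5f));
    }
}
```

With an exponent between 0 and 1, large scores are compressed more than small ones, which keeps a few very highly scored pages from dominating the index-time boost.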
-
-