Package org.apache.nutch.crawl
Class AbstractFetchSchedule
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.crawl.AbstractFetchSchedule
- All Implemented Interfaces:
Configurable, FetchSchedule
- Direct Known Subclasses:
AdaptiveFetchSchedule, DefaultFetchSchedule
public abstract class AbstractFetchSchedule extends Configured implements FetchSchedule
This class provides common methods for implementations of FetchSchedule.
- Author:
- Andrzej Bialecki
-
-
Field Summary
Modifier and Type    Field
protected int        defaultInterval
protected int        maxInterval
-
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
-
-
Constructor Summary
Constructor
AbstractFetchSchedule()
AbstractFetchSchedule(Configuration conf)
-
Method Summary
Modifier and Type, Method, Description
long calculateLastFetchTime(CrawlDatum datum)
    This method returns the last fetch time of the CrawlDatum.
CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap)
    This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
CrawlDatum initializeSchedule(Text url, CrawlDatum datum)
    Initialize fetch schedule related data.
void setConf(Configuration conf)
CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
    Sets the fetchInterval and fetchTime on a successfully fetched page.
CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
    This method specifies how to schedule refetching of pages marked as GONE.
CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
    This method adjusts the fetch schedule if fetching needs to be retried due to transient errors.
boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
    This method reports whether the page is suitable for selection in the current fetchlist.
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
-
Constructor Detail
-
AbstractFetchSchedule
public AbstractFetchSchedule()
-
AbstractFetchSchedule
public AbstractFetchSchedule(Configuration conf)
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
- Overrides:
setConf in class Configured
-
initializeSchedule
public CrawlDatum initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.
- Specified by:
initializeSchedule in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - datum instance to be initialized (modified in place).
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
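The default behavior described above can be sketched without the Nutch classes. This is only an illustration: `InitDatum` is a hypothetical stand-in for CrawlDatum, and the units (milliseconds for times, seconds for intervals) are assumptions, not taken from this page.

```java
// Hypothetical stand-in for CrawlDatum; field names and units are assumptions.
final class InitDatum {
    long fetchTimeMs;      // next scheduled fetch, epoch milliseconds (assumed unit)
    int fetchIntervalSec;  // re-fetch interval, seconds (assumed unit)
}

final class InitializeScheduleSketch {
    // Schedule a new page: due "now", with the default interval.
    static InitDatum initialize(InitDatum datum, long nowMs, int defaultIntervalSec) {
        datum.fetchTimeMs = nowMs;
        datum.fetchIntervalSec = defaultIntervalSec;
        return datum; // same instance, modified in place
    }

    public static void main(String[] args) {
        int thirtyDaysSec = 30 * 24 * 3600; // illustrative default, not a Nutch setting
        InitDatum d = initialize(new InitDatum(), System.currentTimeMillis(), thirtyDaysSec);
        System.out.println("fetchTimeMs=" + d.fetchTimeMs + " intervalSec=" + d.fetchIntervalSec);
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, is a choice made here to keep the sketch easy to test.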
-
setFetchSchedule
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.
- Specified by:
setFetchSchedule in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
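The note about calling super.setFetchSchedule() can be illustrated with a reduced model. The classes below are hypothetical stand-ins, not the Nutch types: the parameter list is shortened, and the fixed-interval policy in the subclass is an invented example of what an extending class might do.

```java
// Hypothetical stand-in for CrawlDatum; field names and units are assumptions.
final class FetchDatum {
    int retriesSinceFetch;
    int fetchIntervalSec;  // seconds (assumed unit)
    long fetchTimeMs;      // epoch milliseconds (assumed unit)
}

// Models the documented base behavior: the retry counter is reset.
class BaseScheduleSketch {
    FetchDatum setFetchSchedule(FetchDatum datum, long fetchTimeMs) {
        datum.retriesSinceFetch = 0;
        return datum;
    }
}

// An extending class that delegates to super first, so the reset is preserved.
class FixedIntervalScheduleSketch extends BaseScheduleSketch {
    @Override
    FetchDatum setFetchSchedule(FetchDatum datum, long fetchTimeMs) {
        datum = super.setFetchSchedule(datum, fetchTimeMs);
        // Illustrative policy: next fetch is one interval after this fetch.
        datum.fetchTimeMs = fetchTimeMs + datum.fetchIntervalSec * 1000L;
        return datum;
    }

    public static void main(String[] args) {
        FetchDatum d = new FetchDatum();
        d.retriesSinceFetch = 2;
        d.fetchIntervalSec = 3600;
        new FixedIntervalScheduleSketch().setFetchSchedule(d, System.currentTimeMillis());
        System.out.println("retries=" + d.retriesSinceFetch + " next=" + d.fetchTimeMs);
    }
}
```

Omitting the super call in the override would leave the retry counter at its old value, which is the behavior the note warns against.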
-
setPageGoneSchedule
public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, but the value may never exceed maxInterval.
- Specified by:
setPageGoneSchedule in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - datum instance to be adjusted.
prevFetchTime - previous fetch time.
prevModifiedTime - previous modified time.
fetchTime - current fetch time.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
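The "increase by 50%, never exceed maxInterval" rule is simple arithmetic, sketched below. The helper name and the plain clamp to the maximum are assumptions: this page does not say exactly which value the real implementation uses once the limit is reached.

```java
final class GoneScheduleSketch {
    // Grow the interval by 50% and clamp it to maxSec (intervals in seconds, assumed unit).
    static int nextIntervalSec(int currentSec, int maxSec) {
        long grown = currentSec + currentSec / 2L; // computed as long to avoid int overflow
        return (int) Math.min(grown, maxSec);
    }

    public static void main(String[] args) {
        int interval = 100;
        int max = 1000;
        // Repeated GONE results push the interval up until it hits the cap.
        for (int i = 0; i < 8; i++) {
            interval = nextIntervalSec(interval, max);
            System.out.println("after GONE #" + (i + 1) + ": " + interval + "s");
        }
    }
}
```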
-
setPageRetrySchedule
public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be retried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.
- Specified by:
setPageRetrySchedule in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - page information.
prevFetchTime - previous fetch time.
prevModifiedTime - previous modified time.
fetchTime - current fetch time.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
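A minimal sketch of the retry rule, under stated assumptions: `RetryDatum` is a hypothetical stand-in for CrawlDatum, times are taken to be in milliseconds, and "1 day in the future" is measured from the fetchTime argument. The local `SECONDS_PER_DAY` constant only mirrors the name of the constant inherited from FetchSchedule.

```java
// Hypothetical stand-in for CrawlDatum; field names and units are assumptions.
final class RetryDatum {
    long fetchTimeMs;       // epoch milliseconds (assumed unit)
    int retriesSinceFetch;
}

final class RetryScheduleSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // local constant for the sketch

    // Postpone the next attempt by one day and count the failed attempt.
    static RetryDatum retry(RetryDatum datum, long fetchTimeMs) {
        datum.fetchTimeMs = fetchTimeMs + SECONDS_PER_DAY * 1000L;
        datum.retriesSinceFetch++;
        return datum;
    }

    public static void main(String[] args) {
        RetryDatum d = retry(new RetryDatum(), System.currentTimeMillis());
        System.out.println("retries=" + d.retriesSinceFetch + " next=" + d.fetchTimeMs);
    }
}
```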
-
calculateLastFetchTime
public long calculateLastFetchTime(CrawlDatum datum)
This method returns the last fetch time of the CrawlDatum.
- Specified by:
calculateLastFetchTime in interface FetchSchedule
- Parameters:
datum - page information.
- Returns:
- the date as a long.
-
shouldFetch
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
This method reports whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it only allows the page to be included in the further selection process based on scores. The default implementation checks fetchTime: if it is later than curTime it returns false, and true otherwise. It also checks that fetchTime is not too remote (more than maxInterval in the future), in which case it lowers the interval and returns true.
- Specified by:
shouldFetch in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - datum instance.
curTime - reference time (usually set to the time when the fetchlist generation process was started).
- Returns:
- true, if the page should be considered for inclusion in the current fetchlist, otherwise false.
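The two checks described for the default implementation can be sketched as follows. This is a reading of the description, not the Nutch source: `ShouldFetchDatum` is a hypothetical stand-in, the units are assumed, and clamping the interval to maxInterval is one possible interpretation of "lowers the interval".

```java
// Hypothetical stand-in for CrawlDatum; field names and units are assumptions.
final class ShouldFetchDatum {
    long fetchTimeMs;      // epoch milliseconds (assumed unit)
    int fetchIntervalSec;  // seconds (assumed unit)
}

final class ShouldFetchSketch {
    static boolean shouldFetch(ShouldFetchDatum datum, long curTimeMs, int maxIntervalSec) {
        long maxMs = maxIntervalSec * 1000L;
        if (datum.fetchTimeMs - curTimeMs > maxMs) {
            // Scheduled too far ahead: lower the interval and make the page due now.
            datum.fetchIntervalSec = Math.min(datum.fetchIntervalSec, maxIntervalSec);
            datum.fetchTimeMs = curTimeMs;
        }
        // Due if the scheduled time is not later than the reference time.
        return datum.fetchTimeMs <= curTimeMs;
    }

    public static void main(String[] args) {
        ShouldFetchDatum d = new ShouldFetchDatum();
        d.fetchTimeMs = System.currentTimeMillis() - 1; // already due
        System.out.println(shouldFetch(d, System.currentTimeMillis(), 3600));
    }
}
```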
-
forceRefetch
public CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
- Specified by:
forceRefetch in interface FetchSchedule
- Parameters:
url - URL of the page.
datum - datum instance.
asap - if true, force refetch as soon as possible; this sets the fetchTime to now. If false, force refetch whenever the next fetch time is set.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
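A rough sketch of the reset, for illustration only. `RefetchDatum` is a hypothetical stand-in for CrawlDatum, and the concrete reset values (zero, null, clamping the interval to the maximum) are assumptions, since this page lists which fields are reset but not what they are reset to.

```java
// Hypothetical stand-in for CrawlDatum; field names and units are assumptions.
final class RefetchDatum {
    long fetchTimeMs;
    int fetchIntervalSec;
    long modifiedTimeMs;
    int retriesSinceFetch;
    byte[] signature;
}

final class ForceRefetchSketch {
    static RefetchDatum forceRefetch(RefetchDatum datum, boolean asap, long nowMs, int maxIntervalSec) {
        // Assumed reset values; the real defaults are not stated on this page.
        datum.fetchIntervalSec = Math.min(datum.fetchIntervalSec, maxIntervalSec);
        datum.modifiedTimeMs = 0L;
        datum.retriesSinceFetch = 0;
        datum.signature = null;
        if (asap) {
            datum.fetchTimeMs = nowMs; // due immediately
        }
        // With asap == false the existing fetchTime is left as scheduled.
        return datum;
    }

    public static void main(String[] args) {
        RefetchDatum d = new RefetchDatum();
        d.signature = new byte[] {1, 2, 3};
        forceRefetch(d, true, System.currentTimeMillis(), 7776000);
        System.out.println("signature cleared: " + (d.signature == null));
    }
}
```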
-