Package org.apache.nutch.crawl
Class AdaptiveFetchSchedule
- java.lang.Object
-
- org.apache.hadoop.conf.Configured
-
- org.apache.nutch.crawl.AbstractFetchSchedule
-
- org.apache.nutch.crawl.AdaptiveFetchSchedule
-
- All Implemented Interfaces:
Configurable, FetchSchedule
- Direct Known Subclasses:
MimeAdaptiveFetchSchedule
public class AdaptiveFetchSchedule extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. It works as follows:
- for pages that have changed since the last fetchTime, decrease their fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
  If the SYNC_DELTA property is true, then:
  - calculate a delta = fetchTime - modifiedTime
  - try to synchronize with the time of change, by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time, i.e. the next fetch time will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
  - if the adjusted fetch interval is bigger than the delta, then fetchInterval = delta.
- the minimum value of fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).
- the maximum value of fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).
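The rules above can be sketched in a few lines of standalone Java. This is a minimal illustration and not the Nutch implementation: the class and method names are hypothetical, all times are assumed to be in seconds, "by a factor of" is read as multiplying by (1 - DEC_FACTOR) or (1 + INC_FACTOR), the SYNC_DELTA steps follow the wording of the description literally, and the syncDeltaRate value used in main is an arbitrary example. The case of an unknown modification time is not handled.

```java
final class AdaptiveIntervalSketch {
    // Defaults quoted in the class description; seconds are an assumption of this sketch.
    static final float INC_FACTOR = 0.2f;
    static final float DEC_FACTOR = 0.2f;
    static final float MIN_INTERVAL = 60f;                 // 1 minute
    static final float MAX_INTERVAL = 365f * 24f * 3600f;  // 365 days

    /** Shrinks the interval for a changed page, grows it for an unchanged one, then clamps. */
    static float adjustInterval(float interval, boolean modified) {
        float next = modified
                ? interval * (1.0f - DEC_FACTOR)
                : interval * (1.0f + INC_FACTOR);
        return clamp(next);
    }

    /** SYNC_DELTA step: if the adjusted interval is bigger than delta, use delta instead. */
    static float capAtDelta(float interval, long fetchTime, long modifiedTime) {
        long delta = fetchTime - modifiedTime;
        return clamp(interval > delta ? (float) delta : interval);
    }

    /** Next fetch time: fetchTime + fetchInterval - delta * SYNC_DELTA_RATE. */
    static long nextFetchTime(long fetchTime, long modifiedTime,
                              float interval, float syncDeltaRate) {
        long delta = fetchTime - modifiedTime;
        return fetchTime + Math.round(interval) - Math.round(delta * syncDeltaRate);
    }

    /** Keeps the interval within [MIN_INTERVAL, MAX_INTERVAL]. */
    static float clamp(float interval) {
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
    }

    public static void main(String[] args) {
        float interval = adjustInterval(3600f, false);       // unchanged page: interval grows
        interval = capAtDelta(interval, 100_000L, 97_000L);  // delta is 3000 seconds here
        System.out.println("interval = " + interval);
        System.out.println("next fetch = "
                + nextFetchTime(100_000L, 97_000L, interval, 0.3f));
    }
}
```

Keeping the clamp in one helper means both the plain adjustment and the SYNC_DELTA cap respect MIN_INTERVAL and MAX_INTERVAL.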
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm, so that the fetch interval either increases or decreases without bound, with little relevance to the page changes. Please use the main(String[]) method to test the values before applying them in a production system.
- Author:
- Andrzej Bialecki
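The NOTE on DEC_FACTOR and INC_FACTOR can be explored with a small standalone simulation before touching a real crawl. The sketch below is not the class's own main(String[]) method; it is a hypothetical model in which a page changes once every changePeriod seconds, a fetch counts as "modified" when a change boundary was crossed since the previous fetch, and all times are in seconds. The names simulate, changePeriod and startInterval are assumptions made for illustration.

```java
final class FetchIntervalSimulation {
    static final double MIN_INTERVAL = 60.0;               // 1 minute
    static final double MAX_INTERVAL = 365.0 * 24 * 3600;  // 365 days

    /** Runs the given number of simulated fetches and returns the final interval in seconds. */
    static double simulate(double incFactor, double decFactor,
                           double startInterval, long changePeriod, int fetches) {
        double interval = startInterval;
        long prevFetch = 0L;
        for (int i = 0; i < fetches; i++) {
            long fetch = prevFetch + Math.max(1L, Math.round(interval));
            // The page counts as modified if a change boundary lies between the two fetches.
            boolean modified = (fetch / changePeriod) > (prevFetch / changePeriod);
            interval = modified
                    ? interval * (1.0 - decFactor)
                    : interval * (1.0 + incFactor);
            interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
            prevFetch = fetch;
        }
        return interval;
    }

    public static void main(String[] args) {
        // Compare several factor values for a page that changes once a day.
        for (double f : new double[] {0.1, 0.2, 0.4, 0.8}) {
            System.out.printf("factor=%.1f -> final interval=%.0f s%n",
                    f, simulate(f, f, 3600.0, 86_400L, 50));
        }
    }
}
```

Comparing the final intervals across factor values gives a rough feel for how strongly the schedule reacts; it does not replace testing with the class's own main(String[]) method.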
-
-
Field Summary
Fields
Modifier and Type, Field, Description
protected float DEC_RATE
protected float INC_RATE
-
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
-
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
-
-
Constructor Summary
Constructors
Constructor, Description
AdaptiveFetchSchedule()
-
Method Summary
Modifier and Type, Method, Description
static String getHostName(String url)
    Strip a URL, leaving only the host name.
float getMaxInterval(Text url, float defaultMaxInterval)
    Returns the max_interval for this URL, which might depend on the host.
float getMinInterval(Text url, float defaultMinInterval)
    Returns the min_interval for this URL, which might depend on the host.
static void main(String[] args)
void setConf(Configuration conf)
CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
    Sets the fetchInterval and fetchTime on a successfully fetched page.
-
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
-
-
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
- Overrides:
setConf in class AbstractFetchSchedule
-
getHostName
public static String getHostName(String url) throws URISyntaxException
Strip a URL, leaving only the host name.
- Parameters:
url - url to get hostname for
- Returns:
- hostname
- Throws:
URISyntaxException - if the given string violates RFC 2396
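As an illustration of the contract described for getHostName, the following standalone sketch extracts a host with java.net.URI, whose constructor throws URISyntaxException for strings it cannot parse. Whether AdaptiveFetchSchedule does exactly this is an assumption; the class and method names here are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostNameSketch {
    /**
     * Returns the host component of the URL.
     * URI.getHost() returns null when the URI has no host (for example "mailto:a@b").
     */
    static String hostOf(String url) throws URISyntaxException {
        return new URI(url).getHost();
    }

    public static void main(String[] args) {
        try {
            System.out.println(hostOf("https://nutch.apache.org/docs/index.html"));
            System.out.println(hostOf("not a valid url")); // spaces are rejected by the parser
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getInput());
        }
    }
}
```

Callers should be prepared for both outcomes: a checked exception for malformed input and a null result for URIs that parse but carry no host.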
-
getMaxInterval
public float getMaxInterval(Text url, float defaultMaxInterval)
Returns the max_interval for this URL, which might depend on the host.
- Parameters:
url - the URL to be scheduled
defaultMaxInterval - the value to which to default if max_interval has not been configured for this host
- Returns:
- the configured maximum interval or the default interval
-
getMinInterval
public float getMinInterval(Text url, float defaultMinInterval)
Returns the min_interval for this URL, which might depend on the host.
- Parameters:
url - the URL to be scheduled
defaultMinInterval - the value to which to default if min_interval has not been configured for this host
- Returns:
- the configured minimum interval or the default interval
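The "configured value or default" behavior described for getMaxInterval and getMinInterval can be pictured as a per-host lookup with a fallback. The sketch below is a hypothetical stand-in, not the Nutch code: it uses a plain String instead of Hadoop's Text, an in-memory map instead of whatever configuration source the class reads, and invented names throughout. It shows the maximum interval only; the minimum would follow the same pattern with a second map.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

final class HostIntervalLookupSketch {
    // Hypothetical per-host overrides, in seconds.
    private final Map<String, Float> maxIntervalByHost = new HashMap<>();

    void setMaxInterval(String host, float seconds) {
        maxIntervalByHost.put(host, seconds);
    }

    /** Returns the host-specific maximum interval, or the default if none is configured. */
    float maxIntervalFor(String url, float defaultMaxInterval) {
        try {
            String host = new URI(url).getHost();
            Float configured = (host == null) ? null : maxIntervalByHost.get(host);
            return (configured != null) ? configured : defaultMaxInterval;
        } catch (URISyntaxException e) {
            // An unparsable URL has no usable host, so fall back to the default.
            return defaultMaxInterval;
        }
    }

    public static void main(String[] args) {
        HostIntervalLookupSketch lookup = new HostIntervalLookupSketch();
        lookup.setMaxInterval("example.org", 86_400f);
        System.out.println(lookup.maxIntervalFor("http://example.org/news", 2_592_000f));
        System.out.println(lookup.maxIntervalFor("http://example.com/", 2_592_000f));
    }
}
```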
-
setFetchSchedule
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.
- Specified by:
setFetchSchedule in interface FetchSchedule
- Overrides:
setFetchSchedule in class AbstractFetchSchedule
- Parameters:
url - url of the page
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
-
-