Package org.apache.nutch.crawl
Class Generator
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.NutchTool
      org.apache.nutch.crawl.Generator
All Implemented Interfaces:
Configurable, Tool
public class Generator extends NutchTool implements Tool
Generates a subset of a crawl db to fetch. This version can generate fetchlists for several segments in one go. Unlike the initial version (OldGenerator), IP resolution is done ONLY on the entries that have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. The mode used to count URLs when limiting the entries (by domain or by host) can be chosen separately.
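The count-based limit described above can be illustrated with a small, self-contained sketch. It is not the Generator's actual code: the class name CountLimitSketch, the helper names, and the "last two labels" domain rule are assumptions made for illustration only (real domain detection needs a public-suffix list). It shows how the same list of URLs yields different selections depending on whether entries are counted per host or per domain.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; none of these names exist in Nutch.
class CountLimitSketch {

    /** Counting key for "host" mode: the lower-cased host part of the URL. */
    static String hostKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase();
    }

    /**
     * Counting key for "domain" mode. ASSUMPTION: the domain is simply the
     * last two labels of the host, which is wrong for suffixes like co.uk.
     */
    static String naiveDomainKey(String url) {
        String host = hostKey(url);
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /**
     * Keeps URLs in input order, but at most maxCount per counting key.
     * A negative maxCount means "no limit".
     */
    static List<String> select(List<String> urls,
                               Function<String, String> keyFn,
                               int maxCount) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String key = keyFn.apply(url);
            int seen = counts.getOrDefault(key, 0);
            if (maxCount < 0 || seen < maxCount) {
                counts.put(key, seen + 1);
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example.com/1",
            "http://a.example.com/2",
            "http://b.example.com/1",
            "http://a.example.com/3",
            "http://other.org/1");
        // Per host: a.example.com is capped at 2, so 4 URLs remain.
        System.out.println(select(urls, CountLimitSketch::hostKey, 2));
        // Per domain: example.com is capped at 2, so 3 URLs remain.
        System.out.println(select(urls, CountLimitSketch::naiveDomainKey, 2));
    }
}
```

Counting by domain is the stricter choice here because all subdomains share one budget; counting by host lets each subdomain contribute up to the limit.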
-
-
Nested Class Summary
Modifier and Type, Class, Description
static class Generator.CrawlDbUpdater
    Update the CrawlDB so that the next generate won't include the same URLs.
static class Generator.DecreasingFloatComparator
static class Generator.HashComparator
    Sort fetch lists by hash of URL.
static class Generator.PartitionReducer
static class Generator.Selector
    Selects entries due for fetch.
static class Generator.SelectorEntry
static class Generator.SelectorInverseMapper
static class Generator.SelectorMapper
    Select and invert subset due for fetch.
static class Generator.SelectorReducer
    Collect until limit is reached.
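The class names above (a decreasing float comparator, a selector, a reducer that collects until a limit is reached) suggest a "sort by score, then take the top N" selection. The sketch below illustrates that idea in plain Java under that assumption. It is not taken from the Nutch source: TopNSketch, Entry and selectTopN are hypothetical names, and the real job runs as MapReduce rather than on an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names exist in Nutch.
class TopNSketch {

    /** A candidate URL with the score used for ranking. */
    static final class Entry {
        final String url;
        final float score;

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Sorts candidates by decreasing score and collects until topN is reached. */
    static List<Entry> selectTopN(List<Entry> candidates, long topN) {
        List<Entry> sorted = new ArrayList<>(candidates);
        // Decreasing order: compare b against a.
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<Entry> selected = new ArrayList<>();
        for (Entry e : sorted) {
            if (selected.size() >= topN) {
                break; // limit reached, stop collecting
            }
            selected.add(e);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> candidates = List.of(
            new Entry("http://example.com/low", 0.2f),
            new Entry("http://example.com/high", 0.9f),
            new Entry("http://example.com/mid", 0.5f));
        for (Entry e : selectTopN(candidates, 2)) {
            System.out.println(e.score + " " + e.url);
        }
    }
}
```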
-
Field Summary
Modifier and Type, Field
static String GENERATE_UPDATE_CRAWLDB
static String GENERATOR_COUNT_MODE
static String GENERATOR_COUNT_VALUE_DOMAIN
static String GENERATOR_COUNT_VALUE_HOST
static String GENERATOR_CUR_TIME
static String GENERATOR_DELAY
static String GENERATOR_EXPR
static String GENERATOR_FETCH_DELAY_EXPR
static String GENERATOR_FILTER
static String GENERATOR_HOSTDB
static String GENERATOR_MAX_COUNT
static String GENERATOR_MAX_COUNT_EXPR
static String GENERATOR_MAX_NUM_SEGMENTS
static String GENERATOR_MIN_INTERVAL
static String GENERATOR_MIN_SCORE
static String GENERATOR_NORMALISE
static String GENERATOR_RESTRICT_STATUS
static String GENERATOR_TOP_N
protected static org.slf4j.Logger LOG
-
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
-
-
Constructor Summary
Constructor
Generator()
Generator(Configuration conf)
-
Method Summary
Modifier and Type, Method, Description
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime)
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
    Deprecated. Since 1.19, use generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String, String), or generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String) in the instance that no hostdb is available.
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments, String expr)
    This signature should be used in the instance that no hostdb is available.
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments, String expr, String hostdb)
    Generate fetchlists in one or more segments.
static String generateSegmentName()
static void main(String[] args)
    Generate a fetchlist from the crawldb.
int run(String[] args)
Map<String,Object> run(Map<String,Object> args, String crawlId)
    Runs the tool, using a map of arguments.
-
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
GENERATE_UPDATE_CRAWLDB
public static final String GENERATE_UPDATE_CRAWLDB
- See Also:
- Constant Field Values
-
GENERATOR_MIN_SCORE
public static final String GENERATOR_MIN_SCORE
- See Also:
- Constant Field Values
-
GENERATOR_MIN_INTERVAL
public static final String GENERATOR_MIN_INTERVAL
- See Also:
- Constant Field Values
-
GENERATOR_RESTRICT_STATUS
public static final String GENERATOR_RESTRICT_STATUS
- See Also:
- Constant Field Values
-
GENERATOR_FILTER
public static final String GENERATOR_FILTER
- See Also:
- Constant Field Values
-
GENERATOR_NORMALISE
public static final String GENERATOR_NORMALISE
- See Also:
- Constant Field Values
-
GENERATOR_MAX_COUNT
public static final String GENERATOR_MAX_COUNT
- See Also:
- Constant Field Values
-
GENERATOR_COUNT_MODE
public static final String GENERATOR_COUNT_MODE
- See Also:
- Constant Field Values
-
GENERATOR_COUNT_VALUE_DOMAIN
public static final String GENERATOR_COUNT_VALUE_DOMAIN
- See Also:
- Constant Field Values
-
GENERATOR_COUNT_VALUE_HOST
public static final String GENERATOR_COUNT_VALUE_HOST
- See Also:
- Constant Field Values
-
GENERATOR_TOP_N
public static final String GENERATOR_TOP_N
- See Also:
- Constant Field Values
-
GENERATOR_CUR_TIME
public static final String GENERATOR_CUR_TIME
- See Also:
- Constant Field Values
-
GENERATOR_DELAY
public static final String GENERATOR_DELAY
- See Also:
- Constant Field Values
-
GENERATOR_MAX_NUM_SEGMENTS
public static final String GENERATOR_MAX_NUM_SEGMENTS
- See Also:
- Constant Field Values
-
GENERATOR_EXPR
public static final String GENERATOR_EXPR
- See Also:
- Constant Field Values
-
GENERATOR_HOSTDB
public static final String GENERATOR_HOSTDB
- See Also:
- Constant Field Values
-
GENERATOR_MAX_COUNT_EXPR
public static final String GENERATOR_MAX_COUNT_EXPR
- See Also:
- Constant Field Values
-
GENERATOR_FETCH_DELAY_EXPR
public static final String GENERATOR_FETCH_DELAY_EXPR
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
Generator
public Generator()
-
Generator
public Generator(Configuration conf)
-
-
Method Detail
-
generate
public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime) throws IOException, InterruptedException, ClassNotFoundException
-
generate
@Deprecated public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force) throws IOException, InterruptedException, ClassNotFoundException
Deprecated. Since 1.19, use generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String, String), or generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String) when no hostdb is available.
This is an old signature kept for compatibility. It does not specify whether or not to normalise, and it sets the number of segments to 1.
Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
filter - whether to apply filtering operation
force - if true, and the target lockfile exists, consider it valid. If false and the target file exists, throw an IOException.
Returns:
Path to generated segment or null if no entries were selected
Throws:
IOException - if an I/O exception occurs.
InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
ClassNotFoundException - if runtime class(es) are not available
See Also:
LockUtil.createLockFile(Configuration, Path, boolean)
-
generate
public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments, String expr) throws IOException, InterruptedException, ClassNotFoundException
This signature should be used in the instance that no hostdb is available. Generate fetchlists in one or more segments. Whether to filter URLs or not is read from the "generate.filter" property set for the job from command-line. If the property is not found, the URLs are filtered. Same for the normalisation.
Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
filter - whether to apply filtering operation
norm - whether to apply normalization operation
force - if true, and the target lockfile exists, consider it valid. If false and the target file exists, throw an IOException.
maxNumSegments - maximum number of segments to generate
expr - a Jexl expression to use in the Generator job.
Returns:
Path to generated segment or null if no entries were selected
Throws:
IOException - if an I/O exception occurs.
InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
ClassNotFoundException - if runtime class(es) are not available
See Also:
JexlUtil.parseExpression(String), LockUtil.createLockFile(Configuration, Path, boolean)
-
generate
public Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments, String expr, String hostdb) throws IOException, InterruptedException, ClassNotFoundException
Generate fetchlists in one or more segments. Whether to filter URLs or not is read from the "generate.filter" property set for the job from command-line. If the property is not found, the URLs are filtered. Same for the normalisation.
Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
filter - whether to apply filtering operation
norm - whether to apply normalization operation
force - if true, and the target lockfile exists, consider it valid. If false and the target file exists, throw an IOException.
maxNumSegments - maximum number of segments to generate
expr - a Jexl expression to use in the Generator job.
hostdb - name of a hostdb from which to execute Jexl expressions in a bid to determine the maximum URL count and/or fetch delay per host.
Returns:
Path to generated segment or null if no entries were selected
Throws:
IOException - if an I/O exception occurs.
InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
ClassNotFoundException - if runtime class(es) are not available
See Also:
JexlUtil.parseExpression(String), LockUtil.createLockFile(Configuration, Path, boolean)
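The behaviour of the force parameter can be pictured with a local-filesystem sketch. This is only an illustration of the documented semantics (an existing lock file is accepted when the flag is true, and causes an IOException when it is false). It is not LockUtil's implementation: LockSketch and acquireLock are hypothetical names, and the sketch uses java.nio.file.Path rather than Hadoop's Path.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not the Nutch LockUtil code.
class LockSketch {

    /**
     * Creates lockFile if it does not exist. If it already exists, the lock
     * is treated as valid when accept is true; otherwise an IOException is thrown.
     */
    static void acquireLock(Path lockFile, boolean accept) throws IOException {
        try {
            Files.createFile(lockFile);
        } catch (FileAlreadyExistsException e) {
            if (!accept) {
                throw new IOException("lock file " + lockFile + " already exists");
            }
            // accept == true: an existing lock file is considered valid.
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lock-sketch");
        Path lock = dir.resolve(".locked");
        try {
            acquireLock(lock, false); // lock did not exist, so it is created
            acquireLock(lock, true);  // lock exists and is accepted
            try {
                acquireLock(lock, false); // lock exists and is refused
                System.out.println("unexpected: no exception");
            } catch (IOException e) {
                System.out.println("refused: " + e.getMessage());
            }
        } finally {
            // Clean up the temporary files created by this demo.
            Files.deleteIfExists(lock);
            Files.deleteIfExists(dir);
        }
    }
}
```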
-
generateSegmentName
public static String generateSegmentName()
-
main
public static void main(String[] args) throws Exception
Generate a fetchlist from the crawldb.
Parameters:
args - array of arguments for this job
Throws:
Exception - if there is an error running the job
-
run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.
Specified by:
run in class NutchTool
Parameters:
args - a Map of arguments to be run with the tool
crawlId - a crawl identifier to associate with the tool invocation
Returns:
Map results object if tool executes successfully otherwise null
Throws:
Exception - if there is an error during the tool execution
-
-