Package org.apache.nutch.crawl
Class Injector
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.NutchTool
      - org.apache.nutch.crawl.Injector
All Implemented Interfaces: Configurable, Tool
public class Injector extends NutchTool implements Tool
Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb. It is useful for bootstrapping a Nutch crawl. The URL files contain one URL per line, optionally followed by custom metadata. Metadata fields are separated by tabs, and each metadata key is separated from its value by '='.
Note that some metadata keys are reserved:
- nutch.score: sets a custom score for a specific URL
- nutch.fetchInterval: sets a custom fetch interval for a specific URL
- nutch.fetchInterval.fixed: sets a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule
Example:
http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
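The seed-line format described above can be illustrated with a small standalone parser. This is only a sketch of the format, not the code InjectMapper uses: the class name SeedLineSketch, the Seed holder, and the parse method are hypothetical, and the handling of fields without '=' (they are skipped) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the seed-line format; not part of Nutch.
class SeedLineSketch {

    /** A URL plus the metadata found on the same seed line. */
    static final class Seed {
        final String url;
        final Map<String, String> metadata;

        Seed(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Splits one seed line on tabs, then splits each metadata field on the first '='. */
    static Seed parse(String line) {
        String[] parts = line.split("\t");
        String url = parts[0].trim();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String field = parts[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields that have no key or no '='
            }
            metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return new Seed(url, metadata);
    }

    public static void main(String[] args) {
        Seed seed = parse("http://www.nutch.org/\tnutch.score=10"
                + "\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(seed.url);
        System.out.println(seed.metadata);
    }
}
```

In this sketch the reserved keys (nutch.score, nutch.fetchInterval, nutch.fetchInterval.fixed) end up in the same map as custom keys such as userType; how Injector treats them afterwards is not shown here.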
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- static class Injector.InjectMapper: InjectMapper reads the CrawlDb that seeds are injected into, and the plain-text seed files, parsing each line into the URL and metadata.
- static class Injector.InjectReducer: Combines multiple new entries for a URL.
Field Summary
Fields (Modifier and Type, Field, Description):
- static String nutchFetchIntervalMDName: metadata key reserved for setting a custom fetchInterval for a specific URL
- static String nutchFixedFetchIntervalMDName: metadata key reserved for setting a fixed custom fetchInterval for a specific URL
- static String nutchScoreMDName: metadata key reserved for setting a custom score for a specific URL
- static String URL_FILTER_NORMALIZE_ALL: property to pass the value of the command-line option -filterNormalizeAll to the mapper
Fields inherited from class org.apache.nutch.util.NutchTool:
currentJob, currentJobNum, numJobs, results, status
Constructor Summary
Constructors:
- Injector()
- Injector(Configuration conf)
Method Summary
Methods (Modifier and Type, Method, Description):
- void inject(Path crawlDb, Path urlDir)
- void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update)
- void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update, boolean normalize, boolean filter, boolean filterNormalizeAll)
- static void main(String[] args)
- int run(String[] args)
- Map<String,Object> run(Map<String,Object> args, String crawlId): Used by the Nutch REST service
- void usage()
Methods inherited from class org.apache.nutch.util.NutchTool:
getProgress, getStatus, killJob, setConf, stopJob
Methods inherited from class org.apache.hadoop.conf.Configured:
getConf
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable:
getConf, setConf
Field Detail
-
URL_FILTER_NORMALIZE_ALL
public static final String URL_FILTER_NORMALIZE_ALL
property to pass the value of the command-line option -filterNormalizeAll to the mapper
- See Also: Constant Field Values
-
nutchScoreMDName
public static String nutchScoreMDName
metadata key reserved for setting a custom score for a specific URL
-
nutchFetchIntervalMDName
public static String nutchFetchIntervalMDName
metadata key reserved for setting a custom fetchInterval for a specific URL
-
nutchFixedFetchIntervalMDName
public static String nutchFixedFetchIntervalMDName
metadata key reserved for setting a fixed custom fetchInterval for a specific URL
-
-
Constructor Detail
-
Injector
public Injector()
-
Injector
public Injector(Configuration conf)
-
-
Method Detail
-
inject
public void inject(Path crawlDb, Path urlDir) throws IOException, ClassNotFoundException, InterruptedException
-
inject
public void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update) throws IOException, ClassNotFoundException, InterruptedException
-
inject
public void inject(Path crawlDb, Path urlDir, boolean overwrite, boolean update, boolean normalize, boolean filter, boolean filterNormalizeAll) throws IOException, ClassNotFoundException, InterruptedException
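The urlDir argument of the inject methods refers to the seed file or folder described in the class description. As a minimal sketch of preparing such a folder, the following standalone code writes one seed file with tab-separated metadata. It uses java.nio.file.Path for local files, not the Hadoop Path type the inject methods take; the class name SeedDirSketch, its methods, and the file name seed.txt are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for preparing a seed folder; not part of Nutch.
class SeedDirSketch {

    /** Joins a URL and any number of key=value fields with tabs. */
    static String seedLine(String url, String... keyValues) {
        StringBuilder line = new StringBuilder(url);
        for (String kv : keyValues) {
            line.append('\t').append(kv);
        }
        return line.toString();
    }

    /** Writes the seed lines, one per line, to seed.txt inside urlDir. */
    static Path writeSeeds(Path urlDir, List<String> lines) throws IOException {
        Files.createDirectories(urlDir);
        Path seedFile = urlDir.resolve("seed.txt"); // file name is an assumption
        return Files.write(seedFile, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path urlDir = Files.createTempDirectory("urls");
        Path seedFile = writeSeeds(urlDir, Arrays.asList(
                seedLine("http://www.nutch.org/", "nutch.score=10", "userType=open_source")));
        System.out.println("Seed file written to " + seedFile);
    }
}
```

The resulting folder could then be handed to one of the inject methods as urlDir, after wrapping its location in a Hadoop Path.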
-
usage
public void usage()
-
run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Used by the Nutch REST service
- Specified by: run in class NutchTool
- Parameters:
  args - a Map of arguments to be run with the tool
  crawlId - a crawl identifier to associate with the tool invocation
- Returns: Map results object if the tool executes successfully, otherwise null
- Throws: Exception - if there is an error during the tool execution