Package org.apache.nutch.crawl
Class CrawlDbMerger
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.CrawlDbMerger

All Implemented Interfaces: Configurable, Tool
public class CrawlDbMerger extends Configured implements Tool
This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. The tool can also be used for filtering alone; in that case, specify only one CrawlDb in the arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older values.

- Author: Andrzej Bialecki
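The merge rule described above can be illustrated with a small in-memory model. This is a minimal sketch, not the actual Hadoop job: `Datum`, `CrawlDbMergeSketch`, and the `urlFilter` predicate are hypothetical stand-ins for CrawlDatum, the merger, and URLFilters, and each "CrawlDb" is assumed to be a plain map from URL to entry. How entries with equal fetch times are ordered is a choice made by this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for a CrawlDb entry; not the real CrawlDatum class. */
final class Datum {
    final long fetchTime;
    final Map<String, String> metadata;

    Datum(long fetchTime, Map<String, String> metadata) {
        this.fetchTime = fetchTime;
        this.metadata = new LinkedHashMap<>(metadata); // defensive copy
    }
}

/** Illustrative model of the documented merge semantics (assumption, not Nutch code). */
final class CrawlDbMergeSketch {

    /**
     * Merges in-memory "databases" keyed by URL. URLs rejected by urlFilter are
     * dropped. For each remaining URL the newest fetch time is kept, and metadata
     * from all versions is combined so that newer values override older ones.
     */
    static Map<String, Datum> merge(List<Map<String, Datum>> dbs, Predicate<String> urlFilter) {
        // Step 1: group every version of each accepted URL.
        Map<String, List<Datum>> versions = new LinkedHashMap<>();
        for (Map<String, Datum> db : dbs) {
            for (Map.Entry<String, Datum> e : db.entrySet()) {
                if (!urlFilter.test(e.getKey())) {
                    continue; // prohibited URL, skip it
                }
                versions.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }

        // Step 2: reduce each group to a single entry.
        Map<String, Datum> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<Datum>> e : versions.entrySet()) {
            List<Datum> sorted = new ArrayList<>(e.getValue());
            // Oldest first; the sort is stable, so ties keep input order in this sketch.
            sorted.sort(Comparator.comparingLong((Datum d) -> d.fetchTime));

            Map<String, String> metadata = new LinkedHashMap<>();
            for (Datum d : sorted) {
                metadata.putAll(d.metadata); // later (newer) values overwrite older ones
            }
            Datum newest = sorted.get(sorted.size() - 1);
            merged.put(e.getKey(), new Datum(newest.fetchTime, metadata));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> olderMeta = new LinkedHashMap<>();
        olderMeta.put("score", "old");
        olderMeta.put("lang", "en");
        Map<String, String> newerMeta = new LinkedHashMap<>();
        newerMeta.put("score", "new");

        Map<String, Datum> db1 = new LinkedHashMap<>();
        db1.put("http://example.com/", new Datum(100L, olderMeta));
        Map<String, Datum> db2 = new LinkedHashMap<>();
        db2.put("http://example.com/", new Datum(200L, newerMeta));

        Map<String, Datum> merged = merge(Arrays.asList(db1, db2), url -> true);
        Datum d = merged.get("http://example.com/");
        System.out.println(d.fetchTime + " " + d.metadata);
    }
}
```

Passing a single map together with a restrictive predicate models the filtering-only use mentioned above: nothing is merged, but prohibited URLs are removed from the result.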
Nested Class Summary

| Modifier and Type | Class | Description |
|---|---|---|
| static class | CrawlDbMerger.Merger | |
Constructor Summary

| Constructor | Description |
|---|---|
| CrawlDbMerger() | |
| CrawlDbMerger(Configuration conf) | |
Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| static Job | createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter) | |
| static void | main(String[] args) | Run the tool. |
| void | merge(Path output, Path[] dbs, boolean normalize, boolean filter) | |
| int | run(String[] args) | |
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Constructor Detail

CrawlDbMerger
public CrawlDbMerger()

CrawlDbMerger
public CrawlDbMerger(Configuration conf)
Method Detail

merge
public void merge(Path output, Path[] dbs, boolean normalize, boolean filter) throws Exception
- Throws: Exception

createMergeJob
public static Job createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter) throws IOException
- Throws: IOException
main
public static void main(String[] args) throws Exception
Run the tool.
- Parameters: args - job parameters
- Throws: Exception - if there is an issue executing this job