Package org.apache.nutch.crawl
Class DeduplicationJob
java.lang.Object
    org.apache.hadoop.conf.Configured
        org.apache.nutch.util.NutchTool
            org.apache.nutch.crawl.DeduplicationJob
All Implemented Interfaces: Configurable, Tool
public class DeduplicationJob extends NutchTool implements Tool
Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (the score in the crawldb, which is not necessarily the same as the indexed score). If two or more documents have the same score, the document with the latest timestamp is kept. If the timestamps are also equal, the document with the shortest URL is kept. The documents marked as duplicates can then be deleted with CleaningJob.
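The selection rule described above can be illustrated with a small standalone sketch. It is not the Nutch implementation (which runs as a MapReduce job over the crawldb); the class DedupSketch, the type Doc, and the methods compare and findDuplicates are hypothetical names introduced only for this example. It assumes the fixed order stated in the description: highest score, then latest timestamp, then shortest URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: none of these types are part of Nutch.
final class DedupSketch {

    // Hypothetical stand-in for a crawldb entry.
    static final class Doc {
        final String url;
        final String digest;
        final float score;
        final long fetchTime;

        Doc(String url, String digest, float score, long fetchTime) {
            this.url = url;
            this.digest = digest;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    // A negative result means "a is preferred over b".
    static int compare(Doc a, Doc b) {
        int c = Float.compare(b.score, a.score);          // higher score first
        if (c != 0) return c;
        c = Long.compare(b.fetchTime, a.fetchTime);       // later timestamp first
        if (c != 0) return c;
        return Integer.compare(a.url.length(), b.url.length()); // shorter URL first
    }

    // Returns the URLs that would be marked as duplicates.
    static Set<String> findDuplicates(List<Doc> docs) {
        // Group documents by content digest, preserving input order.
        Map<String, List<Doc>> byDigest = new LinkedHashMap<>();
        for (Doc d : docs) {
            byDigest.computeIfAbsent(d.digest, k -> new ArrayList<>()).add(d);
        }
        Set<String> duplicates = new LinkedHashSet<>();
        for (List<Doc> group : byDigest.values()) {
            // Pick the single document to keep in this group.
            Doc keep = group.get(0);
            for (Doc d : group) {
                if (compare(d, keep) < 0) keep = d;
            }
            // Everything else in the group is a duplicate.
            for (Doc d : group) {
                if (d != keep) duplicates.add(d.url);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://example.com/a", "d1", 1.0f, 100L),
            new Doc("http://example.com/a/copy", "d1", 1.0f, 100L),
            new Doc("http://example.org/x", "d1", 0.5f, 200L),
            new Doc("http://example.net/unique", "d2", 0.1f, 50L));
        // Prints: [http://example.com/a/copy, http://example.org/x]
        System.out.println(findDuplicates(docs));
    }
}
```

In this sketch the first document wins its group on URL length against the second (equal score and timestamp) and on score against the third, while the fourth has a unique digest and is never marked.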
Nested Class Summary
static class DeduplicationJob.DBFilter
static class DeduplicationJob.DedupReducer<K extends Writable>
static class DeduplicationJob.StatusUpdateReducer
    Combine multiple new entries for a url.
Field Summary
protected static String DEDUPLICATION_COMPARE_ORDER
protected static String DEDUPLICATION_GROUP_MODE
protected static Text urlKey
protected static String UTF_8
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
Constructor Summary
DeduplicationJob()
Method Summary
static void main(String[] args)
int run(String[] args)
Map<String,Object> run(Map<String,Object> args, String crawlId)
    Runs the tool, using a map of arguments.
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
urlKey
    protected static final Text urlKey
DEDUPLICATION_GROUP_MODE
    protected static final String DEDUPLICATION_GROUP_MODE
    See Also: Constant Field Values
DEDUPLICATION_COMPARE_ORDER
    protected static final String DEDUPLICATION_COMPARE_ORDER
    See Also: Constant Field Values
UTF_8
    protected static final String UTF_8
Method Detail
run
    public int run(String[] args) throws IOException
    Specified by: run in interface Tool
    Throws: IOException
run
    public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
    Description copied from class: NutchTool
    Runs the tool, using a map of arguments. May return results, or null.
    Specified by: run in class NutchTool
    Parameters:
        args - a Map of arguments to be run with the tool
        crawlId - a crawl identifier to associate with the tool invocation
    Returns: Map results object if the tool executes successfully, otherwise null
    Throws: Exception - if there is an error during the tool execution