DeduplicationJob (apache-nutch 1.20 API)

java.lang.Object
- org.apache.hadoop.conf.Configured
  - org.apache.nutch.util.NutchTool
    - org.apache.nutch.crawl.DeduplicationJob

All Implemented Interfaces:

Configurable, Tool
```
public class DeduplicationJob
extends NutchTool
implements Tool
```
Generic deduplicator that groups fetched URLs sharing the same digest and marks all but one document in each group as a duplicate. The document kept is the one with the highest score (the score stored in the crawldb, which is not necessarily the same as the indexed score). If two or more documents have the same score, the one with the latest timestamp is kept. If the timestamps are also equal, the one with the shortest URL is kept. Documents marked as duplicates can then be deleted with CleaningJob.
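The selection rule described above can be illustrated with a small, self-contained sketch. It is not the Nutch implementation: `DedupSketch`, `Doc`, `KEEP_FIRST`, and `duplicates` are hypothetical names introduced here, the records are plain in-memory objects rather than crawldb entries, and the tie-break order is hard-coded to the one given in the description (score, then timestamp, then URL length).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {

    /** Hypothetical stand-in for a crawldb entry; not a Nutch class. */
    static final class Doc {
        final String url;
        final String digest;
        final float score;
        final long timestamp;

        Doc(String url, String digest, float score, long timestamp) {
            this.url = url;
            this.digest = digest;
            this.score = score;
            this.timestamp = timestamp;
        }
    }

    /** Sorts the document to keep first: highest score, then latest timestamp, then shortest URL. */
    static final Comparator<Doc> KEEP_FIRST =
        Comparator.comparingDouble((Doc d) -> d.score).reversed()
            .thenComparing(Comparator.comparingLong((Doc d) -> d.timestamp).reversed())
            .thenComparingInt(d -> d.url.length());

    /** Returns the URLs that would be marked as duplicates, grouped by digest in input order. */
    static List<String> duplicates(List<Doc> docs) {
        // Group documents by digest, preserving first-seen order of the digests.
        Map<String, List<Doc>> byDigest = new LinkedHashMap<>();
        for (Doc d : docs) {
            byDigest.computeIfAbsent(d.digest, k -> new ArrayList<>()).add(d);
        }
        List<String> dups = new ArrayList<>();
        for (List<Doc> group : byDigest.values()) {
            group.sort(KEEP_FIRST);
            // Everything after the first element loses to the kept document.
            for (Doc d : group.subList(1, group.size())) {
                dups.add(d.url);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://example.com/a", "d1", 1.0f, 100L),
            new Doc("http://example.com/a?x=1", "d1", 2.0f, 50L),  // kept: highest score
            new Doc("http://example.com/c", "d2", 1.0f, 200L),     // kept: shortest URL
            new Doc("http://example.com/cc", "d2", 1.0f, 200L),
            new Doc("http://example.com/e", "d3", 1.0f, 10L));     // unique digest
        System.out.println(duplicates(docs));
        // prints [http://example.com/a, http://example.com/cc]
    }
}
```

The sketch sorts each digest group and treats everything after the first element as a duplicate, which keeps the three-level tie-break in one comparator. The fields `DEDUPLICATION_GROUP_MODE` and `DEDUPLICATION_COMPARE_ORDER` suggest that the real job lets grouping and comparison order be configured; the sketch does not model that.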

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`DeduplicationJob.DBFilter`
`static class`	`DeduplicationJob.DedupReducer<K extends Writable>`
`static class`	`DeduplicationJob.StatusUpdateReducer`	Combine multiple new entries for a url.

Field Summary

Fields
Modifier and Type	Field	Description
`protected static String`	`DEDUPLICATION_COMPARE_ORDER`
`protected static String`	`DEDUPLICATION_GROUP_MODE`
`protected static Text`	`urlKey`
`protected static String`	`UTF_8`

Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status

Constructor Summary

Constructors
Constructor	Description
`DeduplicationJob()`

Method Summary

Modifier and Type	Method	Description
`static void`	`main(String[] args)`
`int`	`run(String[] args)`
`Map<String,Object>`	`run(Map<String,Object> args, String crawlId)`	Runs the tool, using a map of arguments.

Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

- Field Detail
  - urlKey
```
protected static final Text urlKey
```
  - DEDUPLICATION_GROUP_MODE
```
protected static final String DEDUPLICATION_GROUP_MODE
```
    See Also:
    
    Constant Field Values
  - DEDUPLICATION_COMPARE_ORDER
```
protected static final String DEDUPLICATION_COMPARE_ORDER
```
    See Also:
    
    Constant Field Values
  - UTF_8
```
protected static final String UTF_8
```
- Constructor Detail
  - DeduplicationJob
```
public DeduplicationJob()
```
- Method Detail
  - run
```
public int run(String[] args)
        throws IOException
```
    Specified by:
    
    run in interface Tool
    
    Throws:
    
    IOException
  - main
```
public static void main(String[] args)
                 throws Exception
```
    Throws:
    
    Exception
  - run
```
public Map<String,Object> run(Map<String,Object> args,
                                    String crawlId)
                             throws Exception
```
    Description copied from class: NutchTool
    
    Runs the tool, using a map of arguments. May return results, or null.
    
    Specified by:
    
    run in class NutchTool
    
    Parameters:
    
    args - a Map of arguments to be run with the tool
    
    crawlId - a crawl identifier to associate with the tool invocation
    
    Returns:
    
    Map results object if tool executes successfully otherwise null
    
    Throws:
    
    Exception - if there is an error during the tool execution