Class DeduplicationJob

  • All Implemented Interfaces:
    Configurable, Tool

    public class DeduplicationJob
    extends NutchTool
    implements Tool
    Generic deduplicator which groups fetched URLs by digest and, within each group, marks every document as a duplicate except the one with the highest score. The score used is the one stored in the crawldb, which is not necessarily the same as the indexed score. If two or more documents have the same score, the document with the latest timestamp is kept. If the timestamps are also equal, the document with the shortest URL is kept. Documents marked as duplicates can then be deleted with CleaningJob.
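The selection order described above (highest score, then latest timestamp, then shortest URL) can be illustrated with a small standalone sketch. This is not the Nutch implementation: the `DedupSketch` class, the `Doc` type, and its fields are hypothetical stand-ins for crawldb entries that share one digest, and the sketch makes no claim about how the real job breaks a tie when URL lengths are also equal.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DedupSketch {

    /** Hypothetical stand-in for a crawldb entry; not a Nutch class. */
    static final class Doc {
        final String url;
        final float score;     // assumed to be the crawldb score
        final long timestamp;  // assumed to be an epoch-style value, larger = later

        Doc(String url, float score, long timestamp) {
            this.url = url;
            this.score = score;
            this.timestamp = timestamp;
        }
    }

    /** Orders documents so that the one to keep compares as the greatest. */
    static int compareForKeeping(Doc a, Doc b) {
        int c = Float.compare(a.score, b.score);      // higher score wins
        if (c != 0) {
            return c;
        }
        c = Long.compare(a.timestamp, b.timestamp);   // then the later timestamp
        if (c != 0) {
            return c;
        }
        // then the shorter URL: operands are swapped so shorter ranks higher
        return Integer.compare(b.url.length(), a.url.length());
    }

    /** Returns the document that would NOT be marked as a duplicate. */
    static Doc pickKeeper(List<Doc> sameDigestGroup) {
        return Collections.max(sameDigestGroup, DedupSketch::compareForKeeping);
    }

    public static void main(String[] args) {
        List<Doc> group = Arrays.asList(
            new Doc("http://example.com/a/index.html", 1.0f, 100L),
            new Doc("http://example.com/a/", 1.0f, 200L),
            new Doc("http://example.com/a", 1.0f, 200L));
        // All scores tie, two timestamps tie, so the shortest URL is kept.
        System.out.println(pickKeeper(group).url); // prints http://example.com/a
    }
}
```

Expressing the rules as one comparison function keeps the three tie-breakers in a single place, so the order of precedence is easy to read and to test in isolation.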
    • Constructor Detail

      • DeduplicationJob

        public DeduplicationJob()
    • Method Detail

      • run

        public Map<String,Object> run(Map<String,Object> args,
                                      String crawlId)
                               throws Exception
        Description copied from class: NutchTool
        Runs the tool, using a map of arguments. May return results, or null.
        Specified by:
        run in class NutchTool
        Parameters:
        args - a Map of arguments to be run with the tool
        crawlId - a crawl identifier to associate with the tool invocation
        Returns:
        a Map of results if the tool executes successfully, otherwise null
        Throws:
        Exception - if there is an error during the tool execution