Class CrawlDbMerger

  • All Implemented Interfaces:
    Configurable, Tool

    public class CrawlDbMerger
    extends Configured
    implements Tool
    This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.

    It is also possible to use this tool just for filtering: in that case only one CrawlDb needs to be specified in the arguments.

    If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older ones.
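
    Because CrawlDbMerger implements the Hadoop Tool interface, it can be driven programmatically through ToolRunner as well as from the command line. The sketch below is a minimal example under that assumption; the CrawlDb paths and the -filter option are illustrative placeholders, not values taken from this page.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.nutch.crawl.CrawlDbMerger;
        import org.apache.nutch.util.NutchConfiguration;

        public class MergeCrawlDbs {
          public static void main(String[] args) throws Exception {
            // Standard Nutch configuration (loads nutch-default.xml / nutch-site.xml).
            Configuration conf = NutchConfiguration.create();
            // Hypothetical paths: the first argument is the output CrawlDb,
            // the remaining path arguments are the CrawlDbs to merge.
            String[] mergerArgs = {
                "crawl/merged_crawldb",
                "crawl1/crawldb",
                "crawl2/crawldb",
                "-filter"   // assumed option: apply the configured URLFilters while merging
            };
            int res = ToolRunner.run(conf, new CrawlDbMerger(), mergerArgs);
            System.exit(res);
          }
        }

    For the filtering-only use case described above, the same invocation would pass a single input CrawlDb together with the filtering option.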

    Author:
    Andrzej Bialecki