Class Generator

  • All Implemented Interfaces:
    Configurable, Tool

    public class Generator
    extends NutchTool
    implements Tool
    Generates a subset of a crawl db to fetch. This version can generate fetchlists for several segments in one run. Unlike the initial version (OldGenerator), IP resolution is performed only on the entries that have been selected for fetching. Within a segment, the URLs are partitioned by IP, domain, or host. The way URLs are counted when limiting the number of entries (by domain or by host) can be chosen separately from the partitioning mode.
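
    The two ideas in the description above, limiting entries per host and keeping URLs of the same host in the same partition, can be illustrated with a small standalone sketch. This is not the Generator's actual code: the class FetchListSketch, its methods, and the parameter names maxPerHost and numPartitions are hypothetical, and the sketch assumes the input URLs are already ordered by priority.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch and not the Generator's implementation.
class FetchListSketch {

    // Returns the URLs that survive a per-host cap, preserving input order.
    // Assumption: the caller has already sorted the URLs by priority.
    static List<String> selectWithHostLimit(List<String> urls, int maxPerHost) {
        Map<String, Integer> countByHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs that carry no host component
            }
            int seen = countByHost.getOrDefault(host, 0);
            if (seen < maxPerHost) {
                countByHost.put(host, seen + 1);
                selected.add(url);
            }
        }
        return selected;
    }

    // Maps a host to one of numPartitions buckets, so that every URL
    // of a given host is assigned to the same partition.
    static int partitionForHost(String host, int numPartitions) {
        return Math.floorMod(host.hashCode(), numPartitions);
    }
}
```

    Counting by domain instead of by host would only change the key used in countByHost; the partitioning key can be chosen independently of it, which mirrors the separation described above.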