Class Injector

  • All Implemented Interfaces:
    Configurable, Tool

    public class Injector
    extends NutchTool
    implements Tool
    Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb. It is useful for bootstrapping a Nutch crawl. Each URL file contains one URL per line, optionally followed by custom metadata fields. Fields are separated by tabs, and each metadata key is separated from its value by '='.

    Note that some metadata keys are reserved:

    nutch.score
    sets a custom score for a specific URL
    nutch.fetchInterval
    sets a custom fetch interval for a specific URL
    nutch.fetchInterval.fixed
    sets a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule

    Example:

      http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
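
The seed line format described above can be illustrated with a small standalone parser. This is a minimal sketch, not the code Injector itself uses: the class `SeedLineSketch`, its nested `Seed` holder, and the `parse` method are hypothetical names introduced only for illustration. It assumes that `\t` in the example stands for a literal tab character, that whitespace around fields can be trimmed, and that a field without a `key=value` shape is simply skipped. If the fetch interval is expressed in seconds, the value 2592000 in the example corresponds to 30 days.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is NOT the parsing code inside Nutch's Injector.
final class SeedLineSketch {

    // Reserved metadata keys named in the Injector documentation.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";
    static final String FIXED_FETCH_INTERVAL_KEY = "nutch.fetchInterval.fixed";

    // Hypothetical holder for one parsed seed line.
    static final class Seed {
        final String url;
        final Map<String, String> reserved = new LinkedHashMap<>();
        final Map<String, String> custom = new LinkedHashMap<>();

        Seed(String url) {
            this.url = url;
        }
    }

    static boolean isReserved(String key) {
        return SCORE_KEY.equals(key)
                || FETCH_INTERVAL_KEY.equals(key)
                || FIXED_FETCH_INTERVAL_KEY.equals(key);
    }

    // Splits one seed line on tabs: the first field is the URL,
    // every following field is expected to look like key=value.
    static Seed parse(String line) {
        String[] fields = line.split("\t");
        Seed seed = new Seed(fields[0].trim());
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields that are not key=value
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            if (isReserved(key)) {
                seed.reserved.put(key, value);
            } else {
                seed.custom.put(key, value);
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        Seed seed = parse("http://www.nutch.org/\tnutch.score=10"
                + "\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println("url      = " + seed.url);
        System.out.println("reserved = " + seed.reserved);
        System.out.println("custom   = " + seed.custom);
    }
}
```

Keeping reserved and custom keys in separate maps mirrors the distinction the description draws: reserved keys influence how the URL is scored and scheduled, while other keys such as `userType` are carried along as plain metadata.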