Class SitemapProcessor

  • All Implemented Interfaces:
    Configurable, Tool

    public class SitemapProcessor
    extends Configured
    implements Tool

    Processes sitemaps by fetching the sitemap links, parsing their content, and merging the URLs they list (together with their metadata) into the CrawlDb.

    There are two use cases supported in Nutch's sitemap processing:

    1. Sitemaps are treated as "remote seed lists". Crawl administrators can prepare a list of sitemap links, then inject and fetch only the pages listed in those sitemaps. This is well suited to a targeted crawl of specific hosts.
    2. For an open web crawl, it is not feasible to track each host and collect its sitemap links manually. Nutch automatically detects the sitemaps for all hosts that have been seen in the crawl and are present in the HostDb, and injects the URLs from those sitemaps into the CrawlDb.
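
    The parse-and-merge step described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Nutch's implementation: the class SitemapMergeSketch, its methods, the in-memory map standing in for the CrawlDb, the "status" field, and the merge policy (known URLs keep their status and gain sitemap metadata, new URLs are added as unfetched) are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only: these names are hypothetical, not Nutch classes.
class SitemapMergeSketch {

    // Sitemap tags copied as metadata (an assumed subset of the sitemap protocol).
    private static final String[] META_TAGS = {"lastmod", "changefreq", "priority"};

    /** Parses a <urlset> document into a map of URL -> metadata. */
    static Map<String, Map<String, String>> parseSitemap(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since sitemaps are untrusted remote input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Map<String, Map<String, String>> entries = new LinkedHashMap<>();
            NodeList urls = doc.getElementsByTagName("url");
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                String loc = childText(url, "loc");
                if (loc == null || loc.isEmpty()) {
                    continue; // an entry without <loc> cannot be crawled
                }
                Map<String, String> meta = new LinkedHashMap<>();
                for (String tag : META_TAGS) {
                    String value = childText(url, tag);
                    if (value != null) {
                        meta.put(tag, value);
                    }
                }
                entries.put(loc, meta);
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }

    /** Returns the trimmed text of the first descendant with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /**
     * Merges sitemap entries into the stand-in CrawlDb. Assumed policy: a known
     * URL keeps its existing fields and gains the sitemap metadata; an unknown
     * URL is added with status "unfetched".
     */
    static void merge(Map<String, Map<String, String>> crawlDb,
                      Map<String, Map<String, String>> sitemapEntries) {
        for (Map.Entry<String, Map<String, String>> e : sitemapEntries.entrySet()) {
            Map<String, String> record = crawlDb.computeIfAbsent(e.getKey(), k -> {
                Map<String, String> fresh = new LinkedHashMap<>();
                fresh.put("status", "unfetched");
                return fresh;
            });
            record.putAll(e.getValue());
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com/a</loc><lastmod>2024-01-01</lastmod></url>"
                + "<url><loc>https://example.com/b</loc></url>"
                + "</urlset>";
        Map<String, Map<String, String>> crawlDb = new LinkedHashMap<>();
        merge(crawlDb, parseSitemap(xml));
        crawlDb.forEach((url, record) -> System.out.println(url + " " + record));
    }
}
```

    In the first use case the sitemap text would come from an administrator-supplied list of sitemap links; in the second it would come from sitemaps detected for hosts in the HostDb. The merge step is the same in both cases in this sketch.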
    See Also:
    SitemapFeature