Package org.apache.nutch.util
Class SitemapProcessor
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.SitemapProcessor
All Implemented Interfaces: Configurable, Tool
public class SitemapProcessor extends Configured implements Tool
Performs sitemap processing by fetching sitemap links, parsing their content, and merging the URLs found in the sitemaps (with their metadata) into the CrawlDb.
Nutch's sitemap processing supports two use cases:
- Sitemaps as "remote seed lists". Crawl administrators can prepare a list of sitemap links, then inject and fetch only the pages listed in those sitemaps. This is well suited to targeted crawls of specific hosts.
- Open web crawls. Here it is not feasible to track each host and collect its sitemap links manually. Instead, Nutch automatically detects the sitemaps for all hosts that were seen in the crawl and are present in the HostDb, and injects the URLs from those sitemaps into the CrawlDb.
See Also: SitemapFeature
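The two use cases described above differ mainly in which input is supplied alongside the CrawlDb: a directory of prepared sitemap links, or a HostDb. The following is a minimal, self-contained sketch of that distinction. `SitemapJobPlan` and its factory methods are hypothetical names invented for illustration and are not part of Nutch; the fields only mirror the parameter list of `sitemap(Path, Path, Path, boolean, boolean, boolean, int)`. Leaving the unused input as `null`, and defaulting `strict`, `filter` and `normalize` to `true`, are assumptions of this sketch rather than documented behavior.

```java
// Hypothetical sketch: SitemapJobPlan is NOT part of Nutch. It only mirrors the
// parameter list of SitemapProcessor.sitemap(...) to contrast the two use cases.
final class SitemapJobPlan {
    final String crawlDb;       // CrawlDb that receives the sitemap URLs (needed in both cases)
    final String hostDb;        // HostDb used for automatic sitemap detection, or null
    final String sitemapUrlDir; // directory holding a prepared list of sitemap links, or null
    final boolean strict;       // strict sitemap parsing
    final boolean filter;       // apply URL filtering
    final boolean normalize;    // apply URL normalizing
    final int threads;          // number of threads

    private SitemapJobPlan(String crawlDb, String hostDb, String sitemapUrlDir,
                           boolean strict, boolean filter, boolean normalize, int threads) {
        // Sanity checks belonging to this sketch only; Nutch may validate differently.
        if (crawlDb == null) {
            throw new IllegalArgumentException("crawlDb is required");
        }
        if (hostDb == null && sitemapUrlDir == null) {
            throw new IllegalArgumentException("need a HostDb or a sitemap URL directory");
        }
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be positive");
        }
        this.crawlDb = crawlDb;
        this.hostDb = hostDb;
        this.sitemapUrlDir = sitemapUrlDir;
        this.strict = strict;
        this.filter = filter;
        this.normalize = normalize;
        this.threads = threads;
    }

    /** Use case 1: sitemaps as "remote seed lists" (no HostDb involved). */
    static SitemapJobPlan remoteSeedList(String crawlDb, String sitemapUrlDir, int threads) {
        // Assumed defaults: strict parsing, filtering and normalizing all enabled.
        return new SitemapJobPlan(crawlDb, null, sitemapUrlDir, true, true, true, threads);
    }

    /** Use case 2: open web crawl, sitemaps detected for hosts in the HostDb. */
    static SitemapJobPlan hostDbDetection(String crawlDb, String hostDb, int threads) {
        return new SitemapJobPlan(crawlDb, hostDb, null, true, true, true, threads);
    }

    /** Human-readable summary of what the planned job would do. */
    String describe() {
        String source = (sitemapUrlDir != null)
                ? "sitemap links listed in " + sitemapUrlDir
                : "sitemaps detected for hosts in " + hostDb;
        return "merge URLs from " + source + " into " + crawlDb
                + " using " + threads + " thread(s)";
    }

    public static void main(String[] args) {
        // The paths below are placeholder examples.
        System.out.println(remoteSeedList("crawl/crawldb", "urls/sitemaps", 4).describe());
        System.out.println(hostDbDetection("crawl/crawldb", "crawl/hostdb", 4).describe());
    }
}
```

In an actual job, the equivalent values would be handed to `sitemap(...)` as `Path` arguments on a configured `SitemapProcessor` instance; the sketch uses plain strings only so that it runs without Hadoop or Nutch on the classpath.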
Field Summary
Fields (modifier and type, then field name):
- static String CURRENT_NAME
- static String LOCK_NAME
- static SimpleDateFormat sdf
- static String SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
- static String SITEMAP_OVERWRITE_EXISTING
- static String SITEMAP_REDIR_MAX
- static String SITEMAP_SIZE_MAX
- static String SITEMAP_STRICT_PARSING
- static String SITEMAP_URL_FILTERING
- static String SITEMAP_URL_NORMALIZING
Constructor Summary
- SitemapProcessor()
Method Summary
Methods (modifier and type, then method):
- static void main(String[] args)
- int run(String[] args)
- void sitemap(Path crawldb, Path hostdb, Path sitemapUrlDir, boolean strict, boolean filter, boolean normalize, int threads)
- static void usage()
Methods inherited from class org.apache.hadoop.conf.Configured
- getConf, setConf
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
- getConf, setConf
Field Detail
- sdf
  public static final SimpleDateFormat sdf
- CURRENT_NAME
  public static final String CURRENT_NAME
  See Also: Constant Field Values
- LOCK_NAME
  public static final String LOCK_NAME
  See Also: Constant Field Values
- SITEMAP_STRICT_PARSING
  public static final String SITEMAP_STRICT_PARSING
  See Also: Constant Field Values
- SITEMAP_URL_FILTERING
  public static final String SITEMAP_URL_FILTERING
  See Also: Constant Field Values
- SITEMAP_URL_NORMALIZING
  public static final String SITEMAP_URL_NORMALIZING
  See Also: Constant Field Values
- SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
  public static final String SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
  See Also: Constant Field Values
- SITEMAP_OVERWRITE_EXISTING
  public static final String SITEMAP_OVERWRITE_EXISTING
  See Also: Constant Field Values
- SITEMAP_REDIR_MAX
  public static final String SITEMAP_REDIR_MAX
  See Also: Constant Field Values
- SITEMAP_SIZE_MAX
  public static final String SITEMAP_SIZE_MAX
  See Also: Constant Field Values