Package org.apache.nutch.crawl
Class CrawlDb
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.NutchTool
      - org.apache.nutch.crawl.CrawlDb
- All Implemented Interfaces: Configurable, Tool
public class CrawlDb extends NutchTool implements Tool
This class takes the output of the fetcher and updates the crawldb accordingly.
-
Field Summary
Modifier and Type    Field
static String        CRAWLDB_ADDITIONS_ALLOWED
static String        CRAWLDB_PURGE_404
static String        CRAWLDB_PURGE_ORPHANS
static String        CURRENT_NAME
static String        LOCK_NAME
-
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
-
Constructor Summary
Constructor
CrawlDb()
CrawlDb(Configuration conf)
-
Method Summary
Modifier and Type     Method and Description
static Job            createJob(Configuration config, Path crawlDb)
static void           install(Job job, Path crawlDb)
static Path           lock(Configuration job, Path crawlDb, boolean force)
static void           main(String[] args)
int                   run(String[] args)
Map<String,Object>    run(Map<String,Object> args, String crawlId)
                      Runs the tool, using a map of arguments.
void                  update(Path crawlDb, Path[] segments, boolean normalize, boolean filter)
void                  update(Path crawlDb, Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force)
-
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
Field Detail
-
CRAWLDB_ADDITIONS_ALLOWED
public static final String CRAWLDB_ADDITIONS_ALLOWED
- See Also: Constant Field Values
-
CRAWLDB_PURGE_404
public static final String CRAWLDB_PURGE_404
- See Also: Constant Field Values
-
CRAWLDB_PURGE_ORPHANS
public static final String CRAWLDB_PURGE_ORPHANS
- See Also: Constant Field Values
-
CURRENT_NAME
public static final String CURRENT_NAME
- See Also: Constant Field Values
-
LOCK_NAME
public static final String LOCK_NAME
- See Also: Constant Field Values
-
Constructor Detail
-
CrawlDb
public CrawlDb()
-
CrawlDb
public CrawlDb(Configuration conf)
-
-
Method Detail
-
update
public void update(Path crawlDb, Path[] segments, boolean normalize, boolean filter) throws IOException, InterruptedException, ClassNotFoundException
-
update
public void update(Path crawlDb, Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force) throws IOException, InterruptedException, ClassNotFoundException
-
createJob
public static Job createJob(Configuration config, Path crawlDb) throws IOException
- Throws:
IOException
-
lock
public static Path lock(Configuration job, Path crawlDb, boolean force) throws IOException
- Throws:
IOException
-
install
public static void install(Job job, Path crawlDb) throws IOException
- Throws:
IOException
-
run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.
- Specified by:
run in class NutchTool
- Parameters:
args - a Map of arguments to be run with the tool
crawlId - a crawl identifier to associate with the tool invocation
- Returns:
Map results object if the tool executes successfully, otherwise null
- Throws:
Exception - if there is an error during the tool execution
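The contract described for run(Map<String,Object> args, String crawlId) leaves callers with three outcomes to handle: a results map, null, or an exception. The sketch below illustrates one way a caller might treat them. It does not use Nutch: MapTool is a hypothetical stand-in interface that only mirrors the method signature, and the argument key "example.arg" is a placeholder, not a documented CrawlDb argument.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: none of these types come from Nutch.
class RunContractSketch {

  /** Hypothetical stand-in mirroring run(Map<String,Object>, String). */
  interface MapTool {
    Map<String, Object> run(Map<String, Object> args, String crawlId) throws Exception;
  }

  /** Calls the tool and maps the "results, null, or exception" contract to a message. */
  static String invoke(MapTool tool, Map<String, Object> args, String crawlId) {
    try {
      Map<String, Object> results = tool.run(args, crawlId);
      if (results == null) {
        // The contract allows null, so check before dereferencing.
        return "no results";
      }
      return "results: " + results.size() + " entries";
    } catch (Exception e) {
      // run(...) declares Exception, so the caller must handle it.
      return "failed: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    Map<String, Object> toolArgs = new HashMap<>();
    toolArgs.put("example.arg", "value"); // placeholder key, not a real Nutch argument

    MapTool echo = (a, id) -> new HashMap<>(a);                   // returns a copy of its arguments
    MapTool silent = (a, id) -> null;                             // returns no results
    MapTool broken = (a, id) -> { throw new Exception("boom"); }; // always fails

    System.out.println(invoke(echo, toolArgs, "crawl-1"));
    System.out.println(invoke(silent, toolArgs, "crawl-1"));
    System.out.println(invoke(broken, toolArgs, "crawl-1"));
  }
}
```

With a real CrawlDb instance the same null check and exception handling would apply, but the argument keys it expects are not listed on this page and would need to be taken from the Nutch sources.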
-
-