Package org.apache.nutch.crawl
Class CrawlDbReader
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.AbstractChecker
      org.apache.nutch.crawl.CrawlDbReader

All Implemented Interfaces:
Closeable, AutoCloseable, Configurable, Tool

public class CrawlDbReader extends AbstractChecker implements Closeable
Read utility for the CrawlDB.
Author: Andrzej Bialecki

Nested Class Summary
Modifier and Type    Class
static class         CrawlDbReader.CrawlDatumCsvOutputFormat
static class         CrawlDbReader.CrawlDatumJsonOutputFormat
static class         CrawlDbReader.CrawlDbDumpMapper
static class         CrawlDbReader.CrawlDbStatMapper
static class         CrawlDbReader.CrawlDbStatReducer
static class         CrawlDbReader.CrawlDbTopNMapper
static class         CrawlDbReader.CrawlDbTopNReducer
static class         CrawlDbReader.JsonIndenter

Field Summary
Modifier and Type    Field
protected String     crawlDb

Fields inherited from class org.apache.nutch.util.AbstractChecker
keepClientCnxOpen, stdin, tcpPort, usage

Constructor Summary
Constructor
CrawlDbReader()

Method Summary
Modifier and Type    Method
void                 close()
CrawlDatum           get(String crawlDb, String url, Configuration config)
static void          main(String[] args)
protected int        process(String line, StringBuilder output)
void                 processDumpJob(String crawlDb, String output, Configuration config, String format, String regex, String status, Integer retry, String expr, Float sample)
void                 processStatJob(String crawlDb, Configuration config, boolean sort)
void                 processTopNJob(String crawlDb, long topN, float min, String output, Configuration config)
Object               query(Map<String,String> args, Configuration conf, String type, String crawlId)
void                 readUrl(String crawlDb, String url, Configuration config, StringBuilder output)
int                  run(String[] args)

Methods inherited from class org.apache.nutch.util.AbstractChecker
getProtocolOutput, parseArgs, processSingle, processStdin, processTCP, run

Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

Field Detail

crawlDb
protected String crawlDb

Method Detail

close
public void close()
Specified by: close in interface AutoCloseable
Specified by: close in interface Closeable

processStatJob
public void processStatJob(String crawlDb, Configuration config, boolean sort) throws IOException, InterruptedException, ClassNotFoundException

get
public CrawlDatum get(String crawlDb, String url, Configuration config) throws IOException
Throws: IOException

process
protected int process(String line, StringBuilder output) throws Exception
Specified by: process in class AbstractChecker
Throws: Exception

readUrl
public void readUrl(String crawlDb, String url, Configuration config, StringBuilder output) throws IOException
Throws: IOException

processDumpJob
public void processDumpJob(String crawlDb, String output, Configuration config, String format, String regex, String status, Integer retry, String expr, Float sample) throws IOException, ClassNotFoundException, InterruptedException

processTopNJob
public void processTopNJob(String crawlDb, long topN, float min, String output, Configuration config) throws IOException, ClassNotFoundException, InterruptedException
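
The `topN` and `min` parameters suggest a ranking with a score cutoff. This page does not document the semantics, so the following is only a rough sketch under the assumption that the job keeps entries scoring at least `min` and outputs the `topN` highest of them. The `TopNSketch` class and its in-memory `Map` of URL to score are hypothetical stand-ins and not part of Nutch; the real job runs as MapReduce over the CrawlDb.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT part of Nutch: top-N selection with a minimum score.
public class TopNSketch {

  // Returns up to topN URLs whose score is >= min, highest score first.
  // ASSUMPTION: this mirrors what processTopNJob computes; the page above does not confirm it.
  public static List<String> topN(Map<String, Float> scores, long topN, float min) {
    // Keep only the entries that reach the minimum score.
    List<Map.Entry<String, Float>> kept = new ArrayList<>();
    for (Map.Entry<String, Float> e : scores.entrySet()) {
      if (e.getValue() >= min) {
        kept.add(e);
      }
    }
    // Sort the remaining entries by score, descending.
    kept.sort(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()));
    // Take the first topN URLs.
    List<String> urls = new ArrayList<>();
    for (int i = 0; i < kept.size() && i < topN; i++) {
      urls.add(kept.get(i).getKey());
    }
    return urls;
  }

  public static void main(String[] args) {
    Map<String, Float> scores = new LinkedHashMap<>();
    scores.put("https://example.org/a", 0.2f);
    scores.put("https://example.org/b", 1.5f);
    scores.put("https://example.org/c", 0.9f);
    // Prints [https://example.org/b, https://example.org/c]
    System.out.println(topN(scores, 2, 0.5f));
  }
}
```

With `topN = 2` and `min = 0.5f`, the entry scoring 0.2 is dropped by the cutoff and the remaining two are returned in descending score order.
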

run
public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException, Exception
Specified by: run in interface Tool
Throws: IOException, InterruptedException, ClassNotFoundException, Exception
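
Because `run(String[] args)` takes a plain argument array, a caller has to assemble that array before invoking the tool. The sketch below shows one way to do so for a statistics run and a top-N run. The `ReadDbArgs` helper is hypothetical, and the flag names (`-stats`, `-sort`, `-topN`) are assumptions based on the `readdb` command's usage text rather than anything stated on this page, so verify them against the installed Nutch version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of Nutch: builds argument arrays for CrawlDbReader.run(String[]).
public class ReadDbArgs {

  // Arguments for a statistics run. ASSUMPTION: "-stats" and "-sort" are the flag names.
  public static String[] stats(String crawlDb, boolean sort) {
    List<String> args = new ArrayList<>(Arrays.asList(crawlDb, "-stats"));
    if (sort) {
      args.add("-sort");
    }
    return args.toArray(new String[0]);
  }

  // Arguments for a top-N run. ASSUMPTION: "-topN <n> <out_dir>" is the expected layout.
  public static String[] topN(String crawlDb, long topN, String output) {
    return new String[] {crawlDb, "-topN", Long.toString(topN), output};
  }

  public static void main(String[] args) {
    // Prints: crawl/crawldb -stats -sort
    System.out.println(String.join(" ", stats("crawl/crawldb", true)));
    // Prints: crawl/crawldb -topN 10 top10
    System.out.println(String.join(" ", topN("crawl/crawldb", 10, "top10")));
    // With Nutch and Hadoop on the classpath, the array would typically be passed to the tool:
    //   int exit = ToolRunner.run(NutchConfiguration.create(), new CrawlDbReader(), stats("crawl/crawldb", true));
  }
}
```

Keeping argument construction in small methods like these makes the mapping between command-line flags and the `processStatJob` and `processTopNJob` parameters easy to check in isolation.
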