Package org.apache.nutch.tools.warc
Class WARCExporter
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.tools.warc.WARCExporter

All Implemented Interfaces: Configurable, Tool
public class WARCExporter extends Configured implements Tool
MapReduce job that exports Nutch segments as WARC files. The file format is documented in the [ISO Standard](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). Generates elements of type 'response' if the configuration property 'store.http.headers' was set to true during fetching and the HTTP headers were stored verbatim; generates elements of type 'resource' otherwise.
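The rule for choosing the element type can be sketched in plain Java. This is an illustration of the condition stated above, not the Nutch implementation: the class WarcRecordTypeSketch, the method recordType and both parameter names are hypothetical and do not exist in Nutch.

```java
// Illustrative sketch only; none of these names are part of the Nutch API.
final class WarcRecordTypeSketch {

    /**
     * Hypothetical helper mirroring the documented rule.
     *
     * @param storeHttpHeaders      assumed to reflect 'store.http.headers' at fetch time
     * @param headersStoredVerbatim assumed to be true when the HTTP headers were kept verbatim
     * @return "response" when both conditions hold, "resource" otherwise
     */
    static String recordType(boolean storeHttpHeaders, boolean headersStoredVerbatim) {
        return (storeHttpHeaders && headersStoredVerbatim) ? "response" : "resource";
    }

    public static void main(String[] args) {
        System.out.println(recordType(true, true));   // prints "response"
        System.out.println(recordType(true, false));  // prints "resource"
        System.out.println(recordType(false, false)); // prints "resource"
    }
}
```

Only the combination of both conditions yields 'response'; every other case falls back to 'resource', matching the "otherwise" in the description.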
Nested Class Summary
Modifier and Type    Class
static class         WARCExporter.WARCMapReduce
Constructor Summary
Constructor
WARCExporter()
WARCExporter(Configuration conf)
Method Summary
Modifier and Type    Method
int                  generateWARC(String output, List<Path> segments, boolean onlySuccessfulResponses, boolean includeParseData, boolean includeParseText)
static void          main(String[] args)
int                  run(String[] args)
Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable: getConf, setConf
Constructor Detail

WARCExporter
public WARCExporter()

WARCExporter
public WARCExporter(Configuration conf)
Method Detail

generateWARC
public int generateWARC(String output, List<Path> segments, boolean onlySuccessfulResponses, boolean includeParseData, boolean includeParseText) throws IOException
Throws:
IOException