Package org.apache.nutch.tools
Class FileDumper
java.lang.Object
    org.apache.nutch.tools.FileDumper
public class FileDumper extends Object
The file dumper tool enables one to reverse generate the raw content from Nutch segment data directories. The tool has a number of immediate uses:
- one can see what a page looked like at the time it was crawled
- one can see different media types acquired as part of the crawl
- it enables us to see webpages before we augment them with additional metadata, which can be handy for providing a provenance trail for your crawl data.
Upon successful completion, the tool displays a JSON snippet listing the detected mimetype classifications and the number of documents that fall into each. An example is as follows:
INFO: File Types:
  TOTAL Stats:
    [
      {"mimeType":"application/xml","count":"19"}
      {"mimeType":"image/png","count":"47"}
      {"mimeType":"image/jpeg","count":"141"}
      {"mimeType":"image/vnd.microsoft.icon","count":"4"}
      {"mimeType":"text/plain","count":"89"}
      {"mimeType":"video/quicktime","count":"2"}
      {"mimeType":"image/gif","count":"63"}
      {"mimeType":"application/xhtml+xml","count":"1670"}
      {"mimeType":"application/octet-stream","count":"40"}
      {"mimeType":"text/html","count":"1863"}
    ]
  FILTER Stats:
    [
      {"mimeType":"image/png","count":"47"}
      {"mimeType":"image/jpeg","count":"141"}
      {"mimeType":"image/vnd.microsoft.icon","count":"4"}
      {"mimeType":"video/quicktime","count":"2"}
      {"mimeType":"image/gif","count":"63"}
    ]
In the case above, the tool would have been run with the -mimeType flag and the values image/png image/jpeg image/vnd.microsoft.icon video/quicktime image/gif.
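The entries in the sample output above are not separated by commas, so a strict JSON parser may reject the array as printed. As a minimal sketch, assuming the log text has already been captured as a string, the counts can be pulled out with a regular expression instead. The class name MimeStatsParser and its parse method are hypothetical helpers for illustration and are not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): extracts mimeType/count pairs
// from stats text shaped like the sample output shown above.
public class MimeStatsParser {

    // Matches one entry such as {"mimeType":"image/png","count":"47"}
    private static final Pattern ENTRY =
        Pattern.compile("\\{\"mimeType\":\"([^\"]+)\",\"count\":\"(\\d+)\"\\}");

    /** Returns every mimeType/count pair found in the text, in order of appearance. */
    public static Map<String, Integer> parse(String statsBlock) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(statsBlock);
        while (m.find()) {
            counts.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        // The FILTER Stats block from the sample output, joined onto one line.
        String filterStats = "FILTER Stats: [ "
            + "{\"mimeType\":\"image/png\",\"count\":\"47\"} "
            + "{\"mimeType\":\"image/jpeg\",\"count\":\"141\"} "
            + "{\"mimeType\":\"image/vnd.microsoft.icon\",\"count\":\"4\"} "
            + "{\"mimeType\":\"video/quicktime\",\"count\":\"2\"} "
            + "{\"mimeType\":\"image/gif\",\"count\":\"63\"} ]";

        Map<String, Integer> counts = parse(filterStats);
        int total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
            total += e.getValue();
        }
        System.out.println("Total filtered documents: " + total);
    }
}
```

Passing only the TOTAL Stats or only the FILTER Stats portion keeps the two sets of counts apart; if both blocks are passed together, a mimetype that appears in both is stored once, with the later value overwriting the earlier one.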
Constructor Summary
Constructors
Constructor    Description
FileDumper()
Method Summary
Modifier and Type    Method
void    dump(File outputDir, File segmentRootDir, String[] mimeTypes, boolean flatDir, boolean mimeTypeStats, boolean reverseURLDump)
    Dumps the reverse-engineered raw content from the provided segment directory. Pass a parent directory if it contains more than one segment; otherwise a single segment can be passed as the argument.
static void    main(String[] args)
    Main method for invoking this tool.
Method Detail
dump
public void dump(File outputDir, File segmentRootDir, String[] mimeTypes, boolean flatDir, boolean mimeTypeStats, boolean reverseURLDump) throws Exception
Dumps the reverse-engineered raw content from the provided segment directory. Pass a parent directory if it contains more than one segment; otherwise a single segment can be passed as the argument.
Parameters:
outputDir - the directory you wish to dump the raw content to. This directory will be created.
segmentRootDir - a directory containing one or more segments.
mimeTypes - an array of mime types to dump; all others will be filtered out.
flatDir - a boolean flag specifying whether the output directory should contain only files instead of using nested directories to prevent naming conflicts.
mimeTypeStats - a flag indicating whether mimetype stats should be displayed instead of dumping files.
reverseURLDump - whether to reverse the URLs when they are written to disk.
Throws:
Exception - if there is a fatal error dumping files to disk