Class FileDumper


  • public class FileDumper
    extends Object
    The file dumper tool reconstructs the raw content stored in Nutch segment data directories.

    The tool has a number of immediate uses:

    1. one can see what a page looked like at the time it was crawled
    2. one can see different media types acquired as part of the crawl
    3. one can inspect webpages before they are augmented with additional metadata, which is useful for providing a provenance trail for the crawl data.

    Upon successful completion, the tool prints a JSON-style snippet listing each mimetype classification and the number of documents that fall into it. An example follows:

     
     INFO: File Types: 
       TOTAL Stats:    
        [
         {"mimeType":"application/xml","count":"19"}
         {"mimeType":"image/png","count":"47"}
         {"mimeType":"image/jpeg","count":"141"}
         {"mimeType":"image/vnd.microsoft.icon","count":"4"}
         {"mimeType":"text/plain","count":"89"}
         {"mimeType":"video/quicktime","count":"2"}
         {"mimeType":"image/gif","count":"63"}
         {"mimeType":"application/xhtml+xml","count":"1670"}
         {"mimeType":"application/octet-stream","count":"40"}
         {"mimeType":"text/html","count":"1863"}
       ]
       
       FILTER Stats: 
       [
         {"mimeType":"image/png","count":"47"}
         {"mimeType":"image/jpeg","count":"141"}
         {"mimeType":"image/vnd.microsoft.icon","count":"4"}
         {"mimeType":"video/quicktime","count":"2"}
         {"mimeType":"image/gif","count":"63"}
       ]
     
     

    In the example above, the tool was run with the -mimeType flag and the values image/png image/jpeg image/vnd.microsoft.icon video/quicktime image/gif.
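
    The relationship between the TOTAL and FILTER blocks can be sketched as follows. This is a minimal illustration, not the tool's own code: the class MimeStatsSketch and the methods filterStats and formatEntry are hypothetical names, and the counts are a small subset of the example output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration only; not part of Nutch.
class MimeStatsSketch {

    // Keep only the entries whose mimetype was requested, preserving input order.
    static Map<String, Integer> filterStats(Map<String, Integer> total, Set<String> wanted) {
        Map<String, Integer> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            if (wanted.contains(e.getKey())) {
                filtered.put(e.getKey(), e.getValue());
            }
        }
        return filtered;
    }

    // Render one entry in the same shape as the lines in the example output.
    static String formatEntry(String mimeType, int count) {
        return "{\"mimeType\":\"" + mimeType + "\",\"count\":\"" + count + "\"}";
    }

    public static void main(String[] args) {
        // A few counts copied from the TOTAL block of the example.
        Map<String, Integer> total = new LinkedHashMap<>();
        total.put("application/xml", 19);
        total.put("image/png", 47);
        total.put("video/quicktime", 2);
        total.put("text/html", 1863);

        // The mimetypes that would have been requested on the command line.
        Set<String> wanted = Set.of("image/png", "video/quicktime");

        filterStats(total, wanted)
            .forEach((mime, count) -> System.out.println(formatEntry(mime, count)));
    }
}
```

    Every entry in the FILTER block also appears in the TOTAL block with the same count; the filter only narrows which mimetypes are reported.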

    • Constructor Detail

      • FileDumper

        public FileDumper()
    • Method Detail

      • dump

        public void dump(File outputDir,
                         File segmentRootDir,
                         String[] mimeTypes,
                         boolean flatDir,
                         boolean mimeTypeStats,
                         boolean reverseURLDump)
                  throws Exception
        Dumps the reverse-engineered raw content from the provided segment directories. Pass a parent directory when it contains more than one segment; otherwise, a single segment can be passed as the argument.
        Parameters:
        outputDir - the directory you wish to dump the raw content to. This directory will be created.
        segmentRootDir - a directory containing one or more segments.
        mimeTypes - an array of mime types to dump; all others will be filtered out.
        flatDir - a boolean flag specifying whether the output directory should contain only files instead of using nested directories to prevent naming conflicts.
        mimeTypeStats - a flag indicating whether mimetype stats should be displayed instead of dumping files.
        reverseURLDump - whether to reverse the URLs when they are written to disk.
        Throws:
        Exception - if there is a fatal error dumping files to disk
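
        The description of reverseURLDump does not say how a URL is reversed. One plausible reading, assumed here purely for illustration, is that the dot-separated labels of the host name are written in reverse order, so that pages from the same domain sort together. The class ReverseHostSketch and the method reverseHost below are hypothetical and are not part of this API; the layout the tool actually writes to disk may differ.

```java
import java.net.URI;

// Hypothetical sketch; the tool's real reversal logic is not documented here.
class ReverseHostSketch {

    // Reverse the dot-separated labels of a host name,
    // e.g. "www.example.com" becomes "com.example.www".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.example.com/images/logo.png");
        // prints "com.example.www"
        System.out.println(reverseHost(uri.getHost()));
    }
}
```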
      • main

        public static void main(String[] args)
                         throws Exception
        Main method for invoking this tool.
        Parameters:
        args - 1) output directory (which will be created) to host the raw data and 2) a directory containing one or more segments.
        Throws:
        Exception - if there is a fatal error running this tool