java.lang.Object
- org.apache.hadoop.conf.Configured
- - org.apache.nutch.util.NutchTool
  - - org.apache.nutch.tools.CommonCrawlDataDumper

All Implemented Interfaces: Configurable, Tool

public class CommonCrawlDataDumper
extends NutchTool
implements Tool

The Common Crawl Data Dumper tool reverse-generates the raw content from Nutch segment data directories into a common crawl data format that many applications can consume. The data is then serialized as CBOR.

Text content is stored in a structured document format. Below is a schema for storing the data and metadata related to a crawl request, with the response body truncated for readability. This document must be encoded using CBOR and should be compressed with gzip after encoding. The timestamped URL key for each record follows the same layout as the media file directory structure, with underscores in place of directory separators.

Thus, the timestamped URL key for the record is given below, followed by an example record:

 
 com_somepage_33a3e36bbef59c2a5242c2ccee59239ab30d51f3_1411623696000

     {
         "url": "http:\/\/somepage.com\/22\/14560817",
         "timestamp": "1411623696000",
         "request": {
             "method": "GET",
             "client": {
                 "hostname": "crawler01.local",
                 "address": "74.347.129.200",
                 "software": "Apache Nutch v1.10",
                 "robots": "classic",
                 "contact": {
                     "name": "Nutch Admin",
                     "email": "nutch.pro@nutchadmin.org"
                 }
             },
             "headers": {
                 "Accept": "text\/html,application\/xhtml+xml,application\/xml",
                 "Accept-Encoding": "gzip,deflate,sdch",
                 "Accept-Language": "en-US,en",
                 "User-Agent": "Mozilla\/5.0",
                 "...": "..."
             },
             "body": null
         },
         "response": {
             "status": "200",
             "server": {
                 "hostname": "somepage.com",
                 "address": "55.33.51.19"
             },
             "headers": {
                 "Content-Encoding": "gzip",
                 "Content-Type": "text\/html",
                 "Date": "Thu, 25 Sep 2014 04:16:58 GMT",
                 "Expires": "Thu, 25 Sep 2014 04:16:57 GMT",
                 "Server": "nginx",
                 "...": "..."
             },
             "body": "\r\n  <!DOCTYPE html PUBLIC ... \r\n\r\n  \r\n    </body>\r\n    </html>\r\n  \r\n\r\n"
         },
         "key": "com_somepage_33a3e36bbef59c2a5242c2ccee59239ab30d51f3_1411623696000",
         "imported": "1411623698000"
     }
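
The key layout described above (reversed host, a 40-character hexadecimal segment, and an epoch timestamp in milliseconds, joined by underscores) can be sketched as follows. This is a minimal illustration, not the Nutch implementation: the class `CrawlKeySketch` and its methods are hypothetical, and the sketch assumes the middle segment is a SHA-1 hex digest of the URL string, which the description above does not state. It is therefore not expected to reproduce the exact digest in the example key.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of Apache Nutch.
final class CrawlKeySketch {

    // Reverses the host labels and joins them with underscores,
    // e.g. "somepage.com" becomes "com_somepage".
    static String reverseHost(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('_');
            }
        }
        return sb.toString();
    }

    // ASSUMPTION: the middle key segment is a SHA-1 hex digest of the URL string.
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Joins the three segments: reversedHost_digest_epochMillis.
    static String buildKey(String url, long epochMillis) {
        return reverseHost(url) + "_" + sha1Hex(url) + "_" + epochMillis;
    }

    public static void main(String[] args) {
        System.out.println(buildKey("http://somepage.com/22/14560817", 1411623696000L));
    }
}
```

Keeping the host reversed at the front of the key means records from the same domain sort next to each other, which is the usual motivation for this kind of layout.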

Upon successful completion, the tool prints a JSON-style summary of the MIME type classifications and the number of documents that fall into each. An example is as follows:

 
 INFO: File Types:
   TOTAL Stats:    {
     {"mimeType":"application/xml","count":19"}
     {"mimeType":"image/png","count":47"}
     {"mimeType":"image/jpeg","count":141"}
     {"mimeType":"image/vnd.microsoft.icon","count":4"}
     {"mimeType":"text/plain","count":89"}
     {"mimeType":"video/quicktime","count":2"}
     {"mimeType":"image/gif","count":63"}
     {"mimeType":"application/xhtml+xml","count":1670"}
     {"mimeType":"application/octet-stream","count":40"}
     {"mimeType":"text/html","count":1863"}
   }
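
A per-type tally like the one above can be produced with a simple counting map. The sketch below is only an illustration of the idea: `MimeTallySketch`, its methods, and the one-object-per-line layout are hypothetical and are not the tool's own reporting code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the tool's own reporting code.
final class MimeTallySketch {

    // Counts how many documents fall under each MIME type.
    static Map<String, Integer> tally(List<String> mimeTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String mimeType : mimeTypes) {
            counts.merge(mimeType, 1, Integer::sum);
        }
        return counts;
    }

    // Renders one JSON object per line (the layout is this sketch's choice).
    static String render(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append("{\"mimeType\":\"").append(e.getKey())
              .append("\",\"count\":").append(e.getValue()).append("}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList("text/html", "image/png", "text/html");
        System.out.print(render(tally(seen)));
    }
}
```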

Field Summary
- Fields inherited from class org.apache.nutch.util.NutchTool
  currentJob, currentJobNum, numJobs, results, status

Constructor Summary

Constructors
Constructor	Description
`CommonCrawlDataDumper()`	Constructor
`CommonCrawlDataDumper(CommonCrawlConfig config)`	Configurable constructor

Method Summary

Modifier and Type	Method	Description
`void`	`dump(File outputDir, File segmentRootDir, File linkdb, boolean gzip, String[] mimeTypes, boolean epochFilename, String extension, boolean warc)`	Dumps the reverse-engineered CBOR content from the provided segment directories. Pass a parent directory if it contains more than one segment; otherwise a single segment can be passed as the argument.
`static void`	`main(String[] args)`	Main method for invoking this tool
`static String`	`reverseUrl(String urlString)`
`int`	`run(String[] args)`
`Map<String,Object>`	`run(Map<String,Object> args, String crawlId)`	Used by the REST service

Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

- Constructor Detail
  - CommonCrawlDataDumper
```
public CommonCrawlDataDumper(CommonCrawlConfig config)
```
    Configurable constructor
    
    Parameters:
    
    config - A populated CommonCrawlConfig
  - CommonCrawlDataDumper
```
public CommonCrawlDataDumper()
```
    Constructor
- Method Detail
  - main
```
public static void main(String[] args)
                 throws Exception
```
    Main method for invoking this tool
    
    Parameters:
    
    args - 1) an output directory (created if it does not already exist) to host the CBOR data and 2) a directory containing one or more segments from which to generate CBOR data. Optionally, 3) a list of MIME types and 4) the gzip option may be provided.
    
    Throws:
    
    Exception - if there is an error running this NutchTool
  - dump
```
public void dump(File outputDir,
                 File segmentRootDir,
                 File linkdb,
                 boolean gzip,
                 String[] mimeTypes,
                 boolean epochFilename,
                 String extension,
                 boolean warc)
          throws Exception
```
    Dumps the reverse-engineered CBOR content from the provided segment directories. Pass a parent directory if it contains more than one segment; otherwise a single segment can be passed as the argument. If the gzip flag is true, the CBOR output is also gzipped.
    
    Parameters:
    
    outputDir - the directory to dump the raw content to. It will be created if it does not already exist.
    
    segmentRootDir - a directory containing one or more segments.
    
    linkdb - Path to linkdb.
    
    gzip - a boolean flag indicating whether the CBOR content should also be gzipped.
    
    mimeTypes - a string array of MIME types to filter on; everything else is excluded.
    
    epochFilename - if true, output files will be named using the epoch time (in milliseconds).
    
    extension - a file extension to use with output documents.
    
    warc - if true, write the output in WARC format.
    
    Throws:
    
    Exception - if any exception occurs.
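
    The encode-then-compress order implied by the gzip parameter can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the tool's code: `CborGzipSketch` and its methods are hypothetical, and the hand-rolled encoder covers only a flat map of text strings (CBOR major types 5 and 3). A real pipeline would normally rely on a CBOR library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; handles only maps of text strings.
final class CborGzipSketch {

    // Writes a CBOR head: 3-bit major type plus a length or count argument.
    static void writeHead(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);
        } else if (n < 0x100) {
            out.write(mt | 24);
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // Major type 3: UTF-8 text string, prefixed with its byte length.
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Major type 5: map, prefixed with its number of key/value pairs.
    static byte[] encodeMap(Map<String, String> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Compresses the already-encoded bytes with gzip.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reverses gzip(), used here only to check the round trip.
    static byte[] gunzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int read;
            while ((read = gz.read(buf)) != -1) {
                out.write(buf, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://somepage.com/22/14560817");
        record.put("timestamp", "1411623696000");
        byte[] cbor = encodeMap(record);
        byte[] compressed = gzip(cbor);
        System.out.println(cbor.length + " CBOR bytes, " + compressed.length + " gzipped bytes");
    }
}
```

    For records this small the gzipped form may be larger than the raw CBOR because of the gzip header; compression pays off on full response bodies.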
  - reverseUrl
```
public static String reverseUrl(String urlString)
```
  - run
```
public int run(String[] args)
        throws Exception
```
    Specified by:
    
    run in interface Tool
    
    Throws:
    
    Exception
  - run
```
public Map<String,Object> run(Map<String,Object> args,
                                    String crawlId)
                             throws Exception
```
    Used by the REST service
    
    Specified by:
    
    run in class NutchTool
    
    Parameters:
    
    args - a Map of arguments to be run with the tool
    
    crawlId - a crawl identifier to associate with the tool invocation
    
    Returns:
    
    a Map of results if the tool executes successfully, otherwise null
    
    Throws:
    
    Exception - if there is an error during the tool execution
