Package org.apache.nutch.tools
Class CommonCrawlFormatWARC
- java.lang.Object
  - org.apache.nutch.tools.AbstractCommonCrawlFormat
    - org.apache.nutch.tools.CommonCrawlFormatWARC
-
- All Implemented Interfaces:
Closeable, AutoCloseable, CommonCrawlFormat
public class CommonCrawlFormatWARC extends AbstractCommonCrawlFormat
-
-
Field Summary
Fields
Modifier and Type  Field  Description
static String  MAX_WARC_FILE_SIZE
static String  TEMPLATE
-
Fields inherited from class org.apache.nutch.tools.AbstractCommonCrawlFormat
conf, content, inLinks, jsonArray, keyPrefix, LOG, metadata, reverseKey, reverseKeyValue, simpleDateFormat, url
-
-
Constructor Summary
Constructors
Constructor  Description
CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData)
CommonCrawlFormatWARC(Configuration nutchConf, CommonCrawlConfig config)
-
Method Summary
All Methods, Instance Methods, Concrete Methods
Modifier and Type  Method  Description
void  close()
    Optional method that could be implemented if the actual format needs some close procedure.
protected void  closeArray(String key, boolean nested, boolean newline)
protected void  closeObject(String key)
protected String  generateJson()
String  getJsonData()
    Get a string representation of the JSON structure of the URL content.
String  getJsonData(String url, Content content, Metadata metadata, ParseData parseData)
    Returns a string representation of the JSON structure of the URL content.
protected void  startArray(String key, boolean nested, boolean newline)
protected void  startObject(String key)
protected void  writeArrayValue(String value)
protected void  writeKeyNull(String key)
protected void  writeKeyValue(String key, String value)
protected URI  writeRequest(URI id)
protected URI  writeResponse()
-
Methods inherited from class org.apache.nutch.tools.AbstractCommonCrawlFormat
getImported, getInLinks, getJsonData, getKey, getMethod, getRequestAccept, getRequestAcceptEncoding, getRequestAcceptLanguage, getRequestContactEmail, getRequestContactName, getRequestHostAddress, getRequestHostName, getRequestRobots, getRequestSoftware, getRequestUserAgent, getResponseAddress, getResponseContent, getResponseContentEncoding, getResponseContentType, getResponseDate, getResponseHostName, getResponseServer, getResponseStatus, getTimestamp, getUrl, setInLinks
-
-
-
-
Field Detail
-
MAX_WARC_FILE_SIZE
public static final String MAX_WARC_FILE_SIZE
- See Also:
- Constant Field Values
-
TEMPLATE
public static final String TEMPLATE
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
CommonCrawlFormatWARC
public CommonCrawlFormatWARC(Configuration nutchConf, CommonCrawlConfig config) throws IOException
- Throws:
IOException
-
CommonCrawlFormatWARC
public CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData) throws IOException
- Throws:
IOException
-
-
Method Detail
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata, ParseData parseData) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration the Content, Metadata and ParseData.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Overrides:
getJsonData in class AbstractCommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
parseData - url ParseData
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData() throws IOException
Description copied from interface: CommonCrawlFormat
Get a string representation of the JSON structure of the URL content.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Overrides:
getJsonData in class AbstractCommonCrawlFormat
- Returns:
the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
writeResponse
protected URI writeResponse() throws IOException, ParseException
- Throws:
IOException
ParseException
-
writeRequest
protected URI writeRequest(URI id) throws IOException, ParseException
- Throws:
IOException
ParseException
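The two signatures above pair up: writeResponse() returns a URI and writeRequest(URI id) accepts one. The sketch below assumes, without confirmation from this page, that the URI is a WARC record identifier that lets one record refer to the other. WarcIdSketch, newRecordId and header are hypothetical names and are not part of Nutch; WARC-Record-ID and WARC-Concurrent-To are field names taken from the WARC specification.

```java
import java.net.URI;
import java.util.UUID;

final class WarcIdSketch {
    // Hypothetical helper: WARC record IDs are commonly written as urn:uuid URIs.
    static URI newRecordId() {
        return URI.create("urn:uuid:" + UUID.randomUUID());
    }

    // Formats one WARC named field, with the URI value in angle brackets.
    static String header(String name, URI value) {
        return name + ": <" + value + ">";
    }

    public static void main(String[] args) {
        // Assumption: stands in for the URI returned by writeResponse().
        URI responseId = newRecordId();
        // Assumption: stands in for the URI returned by writeRequest(responseId).
        URI requestId = newRecordId();
        System.out.println(header("WARC-Record-ID", requestId));
        System.out.println(header("WARC-Concurrent-To", responseId));
    }
}
```

Whether CommonCrawlFormatWARC links its records this way is not stated here; the sketch only shows one plausible reading of the parameter and return types.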
-
generateJson
protected String generateJson() throws IOException
- Specified by:
generateJson in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeKeyValue
protected void writeKeyValue(String key, String value) throws IOException
- Specified by:
writeKeyValue in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeKeyNull
protected void writeKeyNull(String key) throws IOException
- Specified by:
writeKeyNull in class AbstractCommonCrawlFormat
- Throws:
IOException
-
startArray
protected void startArray(String key, boolean nested, boolean newline) throws IOException
- Specified by:
startArray in class AbstractCommonCrawlFormat
- Throws:
IOException
-
closeArray
protected void closeArray(String key, boolean nested, boolean newline) throws IOException
- Specified by:
closeArray in class AbstractCommonCrawlFormat
- Throws:
IOException
-
writeArrayValue
protected void writeArrayValue(String value) throws IOException
- Specified by:
writeArrayValue in class AbstractCommonCrawlFormat
- Throws:
IOException
-
startObject
protected void startObject(String key) throws IOException
- Specified by:
startObject in class AbstractCommonCrawlFormat
- Throws:
IOException
-
closeObject
protected void closeObject(String key) throws IOException
- Specified by:
closeObject in class AbstractCommonCrawlFormat
- Throws:
IOException
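The methods from generateJson through closeObject are each specified by AbstractCommonCrawlFormat, which suggests a template-method arrangement: the base class decides the call order and subclasses decide the output syntax. The following is a minimal sketch of that pattern only. HookDemo, Base, JsonFormat and render are hypothetical names, the hook set is reduced to five methods, and nothing here describes what CommonCrawlFormatWARC actually does inside its own overrides.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class HookDemo {
    // Hypothetical base class (not Nutch code): a template method drives abstract hooks.
    abstract static class Base {
        protected abstract void startObject(String key) throws IOException;
        protected abstract void writeKeyValue(String key, String value) throws IOException;
        protected abstract void writeKeyNull(String key) throws IOException;
        protected abstract void closeObject(String key) throws IOException;
        protected abstract String generateJson() throws IOException;

        // Template method: fixes the call order, leaves the syntax to subclasses.
        final String render(String url, String title) {
            try {
                startObject(null);
                writeKeyValue("url", url);
                if (title == null) {
                    writeKeyNull("title");
                } else {
                    writeKeyValue("title", title);
                }
                closeObject(null);
                return generateJson();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Hypothetical subclass that renders the hook calls as JSON text.
    static final class JsonFormat extends Base {
        private final StringBuilder out = new StringBuilder();
        private boolean needComma = false;

        // Escapes only backslash and double quote; enough for this illustration.
        private static String quote(String s) {
            return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        }

        // Emits a comma before every member except the first one in an object.
        private void separator() {
            if (needComma) out.append(',');
            needComma = true;
        }

        @Override protected void startObject(String key) {
            separator();
            if (key != null) out.append(quote(key)).append(':');
            out.append('{');
            needComma = false;
        }

        @Override protected void writeKeyValue(String key, String value) {
            separator();
            out.append(quote(key)).append(':').append(quote(value));
        }

        @Override protected void writeKeyNull(String key) {
            separator();
            out.append(quote(key)).append(":null");
        }

        @Override protected void closeObject(String key) {
            out.append('}');
            needComma = true;
        }

        @Override protected String generateJson() {
            return out.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(new JsonFormat().render("http://example.org/", null));
    }
}
```

Keeping the ordering in the base class means a new output format only has to supply the hooks, which is one reason an abstract class might declare them the way this page lists them.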
-
close
public void close()
Description copied from interface: CommonCrawlFormat
Optional method that could be implemented if the actual format needs some close procedure.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface CommonCrawlFormat
- Overrides:
close in class AbstractCommonCrawlFormat
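Because the class implements AutoCloseable, an instance can be managed with try-with-resources. The sketch below uses a hypothetical stand-in, SketchFormat, so that it runs without Nutch on the classpath; CloseSketch and closedAfterUse are likewise illustrative names. The stand-in mirrors only two details from this page: getJsonData() declares IOException, and close() declares no checked exception.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CloseSketch {
    // Hypothetical stand-in for a format object; not the Nutch class.
    static final class SketchFormat implements Closeable {
        boolean closed = false;

        String getJsonData() throws IOException {
            if (closed) throw new IOException("format already closed");
            return "{}";
        }

        @Override
        public void close() {   // narrower than Closeable.close(): no checked exception
            closed = true;
        }
    }

    // Uses the format once and reports whether close() ran afterwards.
    static boolean closedAfterUse() {
        SketchFormat format = new SketchFormat();
        try (SketchFormat f = format) {
            f.getJsonData();
        } catch (IOException e) {
            // The resource is already closed by the time this handler runs.
            throw new UncheckedIOException(e);
        }
        return format.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed after block: " + closedAfterUse());
    }
}
```

With the real class the same shape would apply, although both constructors listed above declare IOException, so construction would also need to sit inside a handler or a method that propagates it.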
-
-