Package org.apache.nutch.tools
Class AbstractCommonCrawlFormat
- java.lang.Object
-
- org.apache.nutch.tools.AbstractCommonCrawlFormat
-
- All Implemented Interfaces:
Closeable, AutoCloseable, CommonCrawlFormat
- Direct Known Subclasses:
CommonCrawlFormatJackson, CommonCrawlFormatJettinson, CommonCrawlFormatSimple, CommonCrawlFormatWARC
public abstract class AbstractCommonCrawlFormat extends Object implements CommonCrawlFormat
Abstract class that implements the CommonCrawlFormat interface (see org.apache.nutch.tools.CommonCrawlFormat).
-
-
Field Summary
Modifier and Type, Field
protected Configuration
conf
protected Content
content
protected List<String>
inLinks
protected boolean
jsonArray
protected String
keyPrefix
protected static org.slf4j.Logger
LOG
protected Metadata
metadata
protected boolean
reverseKey
protected String
reverseKeyValue
protected boolean
simpleDateFormat
protected String
url
-
Constructor Summary
AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config)
-
Method Summary
Modifier and Type, Method, Description
void
close()
Optional method that could be implemented if the actual format needs some close procedure.
protected abstract void
closeArray(String key, boolean nested, boolean newline)
protected abstract void
closeObject(String key)
protected abstract String
generateJson()
protected String
getImported()
List<String>
getInLinks()
Gets the list of inlinks.
String
getJsonData()
Get a string representation of the JSON structure of the URL content.
String
getJsonData(String url, Content content, Metadata metadata)
Returns a string representation of the JSON structure of the URL content.
String
getJsonData(String url, Content content, Metadata metadata, ParseData parseData)
Returns a string representation of the JSON structure of the URL content.
protected String
getKey()
protected String
getMethod()
protected String
getRequestAccept()
protected String
getRequestAcceptEncoding()
protected String
getRequestAcceptLanguage()
protected String
getRequestContactEmail()
protected String
getRequestContactName()
protected String
getRequestHostAddress()
protected String
getRequestHostName()
protected String
getRequestRobots()
protected String
getRequestSoftware()
protected String
getRequestUserAgent()
protected String
getResponseAddress()
protected String
getResponseContent()
protected String
getResponseContentEncoding()
protected String
getResponseContentType()
protected String
getResponseDate()
protected String
getResponseHostName()
protected String
getResponseServer()
protected String
getResponseStatus()
protected String
getTimestamp()
protected String
getUrl()
void
setInLinks(List<String> inLinks)
Sets the inlinks of this document.
protected abstract void
startArray(String key, boolean nested, boolean newline)
protected abstract void
startObject(String key)
protected abstract void
writeArrayValue(String value)
protected abstract void
writeKeyNull(String key)
protected abstract void
writeKeyValue(String key, String value)
-
-
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
url
protected String url
-
content
protected Content content
-
metadata
protected Metadata metadata
-
conf
protected Configuration conf
-
keyPrefix
protected String keyPrefix
-
simpleDateFormat
protected boolean simpleDateFormat
-
jsonArray
protected boolean jsonArray
-
reverseKey
protected boolean reverseKey
-
reverseKeyValue
protected String reverseKeyValue
-
-
Constructor Detail
-
AbstractCommonCrawlFormat
public AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) throws IOException
- Throws:
IOException
-
-
Method Detail
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration both the Content and Metadata.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
- Returns:
- the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData(String url, Content content, Metadata metadata, ParseData parseData) throws IOException
Description copied from interface: CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content. Takes into consideration the Content, Metadata and ParseData.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Parameters:
url - the canonical url
content - url Content
metadata - url Metadata
parseData - url ParseData
- Returns:
- the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
getJsonData
public String getJsonData() throws IOException
Description copied from interface: CommonCrawlFormat
Get a string representation of the JSON structure of the URL content.
- Specified by:
getJsonData in interface CommonCrawlFormat
- Returns:
- the JSON URL content string
- Throws:
IOException - if there is a fatal I/O error obtaining JSON data
-
writeKeyValue
protected abstract void writeKeyValue(String key, String value) throws IOException
- Throws:
IOException
-
writeKeyNull
protected abstract void writeKeyNull(String key) throws IOException
- Throws:
IOException
-
startArray
protected abstract void startArray(String key, boolean nested, boolean newline) throws IOException
- Throws:
IOException
-
closeArray
protected abstract void closeArray(String key, boolean nested, boolean newline) throws IOException
- Throws:
IOException
-
writeArrayValue
protected abstract void writeArrayValue(String value) throws IOException
- Throws:
IOException
-
startObject
protected abstract void startObject(String key) throws IOException
- Throws:
IOException
-
closeObject
protected abstract void closeObject(String key) throws IOException
- Throws:
IOException
-
generateJson
protected abstract String generateJson() throws IOException
- Throws:
IOException
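
The abstract writer hooks above (startObject, writeKeyValue, writeKeyNull, startArray, writeArrayValue, closeArray, closeObject, generateJson) are what a concrete format has to supply. The following is a minimal sketch of how such hooks might fit together, not the Nutch implementation: `WriterHooksSketch` is a hypothetical stand-in that copies only the hook signatures listed on this page, `StringBuilderFormatSketch` and `demo` are illustrative names, and the `nested` and `newline` flags are ignored because this page does not document their meaning.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical stand-in: mirrors only the abstract hook signatures shown above.
// It is NOT org.apache.nutch.tools.AbstractCommonCrawlFormat.
abstract class WriterHooksSketch {
  protected abstract void writeKeyValue(String key, String value) throws IOException;
  protected abstract void writeKeyNull(String key) throws IOException;
  protected abstract void startArray(String key, boolean nested, boolean newline) throws IOException;
  protected abstract void closeArray(String key, boolean nested, boolean newline) throws IOException;
  protected abstract void writeArrayValue(String value) throws IOException;
  protected abstract void startObject(String key) throws IOException;
  protected abstract void closeObject(String key) throws IOException;
  protected abstract String generateJson() throws IOException;
}

// Assumed implementation for illustration: appends JSON text to a StringBuilder.
class StringBuilderFormatSketch extends WriterHooksSketch {
  private final StringBuilder out = new StringBuilder();
  private boolean needComma = false; // true when the next element needs a leading comma

  private void comma() {
    if (needComma) out.append(',');
    needComma = false;
  }

  // Minimal escaping (backslash and double quote only); control characters are not handled.
  private static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Writes the separator and, if a key is given, the quoted key followed by a colon.
  private void key(String key) {
    comma();
    if (key != null) out.append(quote(key)).append(':');
  }

  @Override protected void writeKeyValue(String key, String value) {
    key(key); out.append(quote(value)); needComma = true;
  }
  @Override protected void writeKeyNull(String key) {
    key(key); out.append("null"); needComma = true;
  }
  @Override protected void startArray(String key, boolean nested, boolean newline) {
    key(key); out.append('['); needComma = false; // flags ignored in this sketch
  }
  @Override protected void closeArray(String key, boolean nested, boolean newline) {
    out.append(']'); needComma = true;
  }
  @Override protected void writeArrayValue(String value) {
    comma(); out.append(quote(value)); needComma = true;
  }
  @Override protected void startObject(String key) {
    key(key); out.append('{'); needComma = false;
  }
  @Override protected void closeObject(String key) {
    out.append('}'); needComma = true;
  }
  @Override protected String generateJson() {
    return out.toString();
  }

  // Drives the hooks in the order a caller might use them for one document.
  static String demo(String url, List<String> inLinks) {
    StringBuilderFormatSketch f = new StringBuilderFormatSketch();
    f.startObject(null);            // a null key marks the root object in this sketch
    f.writeKeyValue("url", url);
    f.writeKeyNull("server");
    f.startArray("inlinks", false, false);
    for (String link : inLinks) f.writeArrayValue(link);
    f.closeArray("inlinks", false, false);
    f.closeObject(null);
    return f.generateJson();
  }
}
```

The overrides drop the `throws IOException` clause, which Java permits; a format that writes to a stream would presumably keep it.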
-
getUrl
protected String getUrl()
-
getTimestamp
protected String getTimestamp()
-
getMethod
protected String getMethod()
-
getRequestHostName
protected String getRequestHostName()
-
getRequestHostAddress
protected String getRequestHostAddress()
-
getRequestSoftware
protected String getRequestSoftware()
-
getRequestRobots
protected String getRequestRobots()
-
getRequestContactName
protected String getRequestContactName()
-
getRequestContactEmail
protected String getRequestContactEmail()
-
getRequestAccept
protected String getRequestAccept()
-
getRequestAcceptEncoding
protected String getRequestAcceptEncoding()
-
getRequestAcceptLanguage
protected String getRequestAcceptLanguage()
-
getRequestUserAgent
protected String getRequestUserAgent()
-
getResponseStatus
protected String getResponseStatus()
-
getResponseHostName
protected String getResponseHostName()
-
getResponseAddress
protected String getResponseAddress()
-
getResponseContentEncoding
protected String getResponseContentEncoding()
-
getResponseContentType
protected String getResponseContentType()
-
getInLinks
public List<String> getInLinks()
Description copied from interface: CommonCrawlFormat
Gets the list of inlinks.
- Specified by:
getInLinks in interface CommonCrawlFormat
- Returns:
- the inlinks of this document
-
setInLinks
public void setInLinks(List<String> inLinks)
Description copied from interface: CommonCrawlFormat
Sets the inlinks of this document.
- Specified by:
setInLinks in interface CommonCrawlFormat
- Parameters:
inLinks - list of inlinks
-
getResponseDate
protected String getResponseDate()
-
getResponseServer
protected String getResponseServer()
-
getResponseContent
protected String getResponseContent()
-
getKey
protected String getKey()
-
getImported
protected String getImported()
-
close
public void close()
Description copied from interface: CommonCrawlFormat
Optional method that could be implemented if the actual format needs some close procedure.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in interface CommonCrawlFormat
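
Because the class implements Closeable and AutoCloseable, a format instance can be managed with try-with-resources so that close() runs even if getJsonData() throws. The sketch below illustrates that pattern only: `FormatSketch` is a hypothetical stand-in copying two method signatures from this page, and `CloseDemo` is an illustrative name; neither is part of Nutch.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in mirroring getJsonData() and close() as declared on this page.
interface FormatSketch extends Closeable {
  String getJsonData() throws IOException;

  @Override
  void close(); // narrowed to throw no checked exception, matching the signature above
}

class CloseDemo {
  static boolean closed = false; // records whether close() was invoked

  static String run() throws IOException {
    // The resource is closed automatically when the try block exits.
    try (FormatSketch format = new FormatSketch() {
      @Override public String getJsonData() { return "{}"; }
      @Override public void close() { closed = true; }
    }) {
      return format.getJsonData();
    }
  }
}
```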
-
-