Package org.apache.nutch.tools
Class WARCUtils
- java.lang.Object
-
- org.apache.nutch.tools.WARCUtils
-
public class WARCUtils extends Object
-
Field Summary
Modifier and Type Field Description
static String
COLONSP
static String
CONFORMS_TO
static String
CRLF
static String
FORMAT
static org.archive.uid.UUIDGenerator
generator
static String
HOSTNAME
static String
HTTP_HEADER_FROM
static String
HTTP_HEADER_USER_AGENT
static String
IP
static String
OPERATOR
protected static Pattern
PROBLEMATIC_HEADERS
static String
ROBOTS
static String
SOFTWARE
protected static String
X_HIDE_HEADER
-
Constructor Summary
Constructor Description
WARCUtils()
-
Method Summary
Modifier and Type Method Description
static org.archive.io.warc.WARCRecordInfo
docToMetadata(NutchDocument doc)
static String
fixHttpHeaders(String headers, int contentLength)
Modify verbatim HTTP response headers: fix, remove or replace the headers Content-Length, Content-Encoding and Transfer-Encoding, which may confuse WARC readers.
static String
getAgentString(String name, String version, String description, String URL, String email)
static String
getHostname(Configuration conf)
static String
getIPAddress(Configuration conf)
static org.archive.util.anvl.ANVLRecord
getWARCInfoContent(Configuration conf)
static byte[]
toByteArray(org.archive.format.http.HttpHeaders headers)
-
Field Detail
-
SOFTWARE
public static final String SOFTWARE
- See Also:
- Constant Field Values
-
HTTP_HEADER_FROM
public static final String HTTP_HEADER_FROM
- See Also:
- Constant Field Values
-
HTTP_HEADER_USER_AGENT
public static final String HTTP_HEADER_USER_AGENT
- See Also:
- Constant Field Values
-
HOSTNAME
public static final String HOSTNAME
- See Also:
- Constant Field Values
-
ROBOTS
public static final String ROBOTS
- See Also:
- Constant Field Values
-
OPERATOR
public static final String OPERATOR
- See Also:
- Constant Field Values
-
FORMAT
public static final String FORMAT
- See Also:
- Constant Field Values
-
CONFORMS_TO
public static final String CONFORMS_TO
- See Also:
- Constant Field Values
-
IP
public static final String IP
- See Also:
- Constant Field Values
-
generator
public static final org.archive.uid.UUIDGenerator generator
-
CRLF
public static final String CRLF
- See Also:
- Constant Field Values
-
COLONSP
public static final String COLONSP
- See Also:
- Constant Field Values
-
PROBLEMATIC_HEADERS
protected static final Pattern PROBLEMATIC_HEADERS
-
X_HIDE_HEADER
protected static final String X_HIDE_HEADER
- See Also:
- Constant Field Values
-
Method Detail
-
getWARCInfoContent
public static final org.archive.util.anvl.ANVLRecord getWARCInfoContent(Configuration conf)
-
getHostname
public static final String getHostname(Configuration conf) throws UnknownHostException
- Throws:
UnknownHostException
-
getIPAddress
public static final String getIPAddress(Configuration conf) throws UnknownHostException
- Throws:
UnknownHostException
-
toByteArray
public static final byte[] toByteArray(org.archive.format.http.HttpHeaders headers) throws IOException
- Throws:
IOException
-
getAgentString
public static final String getAgentString(String name, String version, String description, String URL, String email)
-
docToMetadata
public static final org.archive.io.warc.WARCRecordInfo docToMetadata(NutchDocument doc) throws UnsupportedEncodingException
- Throws:
UnsupportedEncodingException
-
fixHttpHeaders
public static final String fixHttpHeaders(String headers, int contentLength)
Modify verbatim HTTP response headers: fix, remove or replace the headers Content-Length, Content-Encoding and Transfer-Encoding, which may confuse WARC readers. Ensure that the returned headers end with a single empty line (\r\n\r\n).
- Parameters:
headers - HTTP 1.1 or 1.0 response header string, CR-LF-separated lines, first line is the status line
contentLength - Effective uncompressed and unchunked length of the content
- Returns:
safe HTTP response header
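- Example (illustrative sketch):
The behaviour described above can be pictured with a small stand-alone sketch. It is not the Nutch implementation: the class HeaderFixSketch and its method fixHeaders are hypothetical names, and the policy shown (rewrite Content-Length to the effective length, drop Content-Encoding and Transfer-Encoding) is an assumption, only one of the "fix, remove or replace" strategies the description allows.

```java
import java.util.Locale;

// Hypothetical illustration only; not the code behind WARCUtils.fixHttpHeaders.
final class HeaderFixSketch {

    private static final String CRLF = "\r\n";

    // Assumed policy: replace Content-Length with the effective length and
    // drop Content-Encoding / Transfer-Encoding. All other lines are kept verbatim.
    static String fixHeaders(String headers, int contentLength) {
        StringBuilder out = new StringBuilder();
        String[] lines = headers.split(CRLF);
        boolean lengthWritten = false;
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // an empty line ends the header block
            }
            if (i == 0) {
                out.append(line).append(CRLF); // status line is copied unchanged
                continue;
            }
            int colon = line.indexOf(':');
            String name = colon < 0
                    ? ""
                    : line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            switch (name) {
                case "content-length":
                    if (!lengthWritten) {
                        // write the uncompressed, unchunked length exactly once
                        out.append("Content-Length: ").append(contentLength).append(CRLF);
                        lengthWritten = true;
                    }
                    break;
                case "content-encoding":
                case "transfer-encoding":
                    break; // no longer describes the stored payload, so drop it
                default:
                    out.append(line).append(CRLF);
            }
        }
        out.append(CRLF); // terminate with a single empty line (\r\n\r\n)
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK" + CRLF
                + "Content-Type: text/html" + CRLF
                + "Content-Encoding: gzip" + CRLF
                + "Content-Length: 42" + CRLF + CRLF;
        System.out.print(fixHeaders(raw, 1234));
    }
}
```

With Nutch on the classpath, the documented call has the same shape: WARCUtils.fixHttpHeaders(headers, contentLength), where contentLength is the length of the decoded payload. The sketch does not add a Content-Length header when the input has none; whether the real method does so is not stated here.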