Package org.apache.nutch.parse
Class ParserChecker
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.util.AbstractChecker
      org.apache.nutch.parse.ParserChecker

All Implemented Interfaces: Configurable, Tool
public class ParserChecker extends AbstractChecker
Parser checker, useful for testing parsers. It also reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data for any URL:
- contentType: The content type of the URL.
- signature: A digest used to identify pages (like a unique ID) and to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
- Version: From ParseData.
- Status: From ParseData.
- Title: The title of the URL.
- Outlinks: The outlinks associated with the URL.
- Content Metadata: Such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
- Parse Metadata: Such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
- ParseText: The parsed text of the page, which varies in length depending on the content.length configuration.
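To illustrate the signature entry above, here is a minimal sketch of a digest used as a page identifier for deduplication. It uses only the JDK's MessageDigest and is not Nutch's MD5Signature implementation: the class name SignatureSketch and the helper md5Hex are hypothetical, and hashing the raw content bytes directly is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    // Hypothetical helper (not part of Nutch): hex-encoded MD5 digest of raw bytes.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // %02x renders each byte as two lowercase hex digits.
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: two fetched pages with byte-identical content.
        byte[] pageA = "<html>same body</html>".getBytes(StandardCharsets.UTF_8);
        byte[] pageB = "<html>same body</html>".getBytes(StandardCharsets.UTF_8);
        byte[] pageC = "<html>other body</html>".getBytes(StandardCharsets.UTF_8);

        // Identical content yields identical signatures, so one copy can be dropped.
        System.out.println("A == B: " + md5Hex(pageA).equals(md5Hex(pageB)));
        System.out.println("A == C: " + md5Hex(pageA).equals(md5Hex(pageC)));
    }
}
```

A byte-level digest only matches exact duplicates. A text-profile approach such as TextProfileSignature is intended to tolerate small differences between near-identical pages, which this sketch does not attempt.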
-
-
Field Summary
Modifier and Type                    Field
protected boolean                    checkRobotsTxt
protected boolean                    dumpText
protected boolean                    followRedirects
protected String                     forceAsContentType
protected HashMap<String,String>     metadata
protected URLNormalizers             normalizers
-
Fields inherited from class org.apache.nutch.util.AbstractChecker
keepClientCnxOpen, stdin, tcpPort, usage
-
-
Constructor Summary
Constructor
ParserChecker()
-
Method Summary
Modifier and Type    Method
static void          main(String[] args)
protected int        process(String url, StringBuilder output)
int                  run(String[] args)
-
Methods inherited from class org.apache.nutch.util.AbstractChecker
getProtocolOutput, parseArgs, processSingle, processStdin, processTCP, run
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
normalizers
protected URLNormalizers normalizers
-
dumpText
protected boolean dumpText
-
followRedirects
protected boolean followRedirects
-
checkRobotsTxt
protected boolean checkRobotsTxt
-
forceAsContentType
protected String forceAsContentType
-
-
Method Detail
-
process
protected int process(String url, StringBuilder output) throws Exception
Specified by: process in class AbstractChecker
Throws: Exception
-
-