Class ParserChecker

  • All Implemented Interfaces:
    Configurable, Tool

    public class ParserChecker
    extends AbstractChecker
    Parser checker, useful for testing parsers. It also accurately reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data for any URL:
    1. contentType: The content type of the URL.
    2. signature: A digest that identifies a page (like a unique ID) and is used to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
    3. Version: From ParseData.
    4. Status: From ParseData.
    5. Title: The title of the page at the URL.
    6. Outlinks: The outlinks associated with the URL.
    7. Content Metadata: such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
    8. Parse Metadata: such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
    9. ParseText: The parsed text of the page, which varies in length depending on the content.length configuration.
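
To illustrate the signature entry in the list above, the following is a minimal sketch of how a content digest can act as a page ID for duplicate detection. It uses only the JDK's MessageDigest and is not Nutch's MD5Signature or TextProfileSignature implementation. The class name SignatureSketch and the helpers md5Hex and countUnique are hypothetical names introduced for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Nutch.
public class SignatureSketch {

    // Returns the hex-encoded MD5 digest of the raw content bytes.
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Counts distinct pages by digest: identical content collapses to one entry.
    public static int countUnique(String... pages) {
        Set<String> seen = new HashSet<>();
        for (String page : pages) {
            seen.add(md5Hex(page.getBytes(StandardCharsets.UTF_8)));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Two of the three sample pages have identical content.
        System.out.println(countUnique(
            "<html>a</html>", "<html>a</html>", "<html>b</html>")); // prints 2
    }
}
```

An exact digest such as this one only matches byte-identical content. A profile-based signature is presumably meant to tolerate small differences between near-duplicate pages, which this sketch does not attempt.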