apache-nutch 1.20 API
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.
Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.collection |
Subcollection is a subset of an index.
|
org.apache.nutch.crawl |
Crawl control code and tools to run the crawler.
|
org.apache.nutch.exchange |
Control code for the exchange component, which acts in the indexing job and decides to
which index writer a document should be routed, based on plugin behavior.
|
org.apache.nutch.exchange.jexl |
Plugin of Exchange component based on JEXL expressions.
|
org.apache.nutch.fetcher |
The Nutch multi-threaded fetching module
|
org.apache.nutch.hostdb | |
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.arbitrary |
Indexing filter to add arbitrary document data to the index
from the output of a user-specified class.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.feed |
Indexing filter to index meta data from RSS feeds.
|
org.apache.nutch.indexer.filter | |
org.apache.nutch.indexer.geoip |
This plugin implements an indexing filter which takes
advantage of the GeoIP2-java API.
|
org.apache.nutch.indexer.jexl |
This plugin implements a dynamic indexing filter which uses JEXL
expressions to allow filtering based on the page's metadata
|
org.apache.nutch.indexer.links | |
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
A more indexing plugin, adds "more" index fields: last modified
date, MIME type, content length.
|
org.apache.nutch.indexer.replace |
Indexing filter to allow pattern replacements on metadata.
|
org.apache.nutch.indexer.staticfield |
A simple plugin called at indexing that adds fields with static data.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.indexer.urlmeta |
URL Meta Tag Indexing Plugin
|
org.apache.nutch.indexwriter.cloudsearch | |
org.apache.nutch.indexwriter.csv |
Index writer plugin to write a plain CSV file.
|
org.apache.nutch.indexwriter.dummy |
Index writer plugin for debugging, writes pairs of <action, url> to a
text file, action is one of "add", "update", or "delete".
|
org.apache.nutch.indexwriter.elastic |
Index writer plugin for Elasticsearch.
|
org.apache.nutch.indexwriter.kafka |
Index writer plugin to produce JSON messages to Kafka.
|
org.apache.nutch.indexwriter.opensearch1x |
Index writer plugin for OpenSearch.
|
org.apache.nutch.indexwriter.rabbit | |
org.apache.nutch.indexwriter.solr |
Index writer plugin for Apache Solr.
|
org.apache.nutch.metadata |
A multi-valued metadata container, and a set
of constant fields for Nutch metadata.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.apache.nutch.net |
Web-related interfaces: URL filters and normalizers. |
org.apache.nutch.net.protocols |
Helper classes related to the Protocol interface,
see also org.apache.nutch.protocol. |
org.apache.nutch.net.urlnormalizer.ajax | |
org.apache.nutch.net.urlnormalizer.basic |
URL normalizer performing basic normalizations:
- remove default ports, e.g., port 80 for http:// URLs
- remove needless slashes and dot segments in the path component
- remove anchors
- use percent-encoding (only) where needed
E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor
is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
Optional and configurable normalizations are:
- convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn
- remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot
|
org.apache.nutch.net.urlnormalizer.host |
URL normalizer renaming hosts to a canonical form listed in the
configuration file.
|
org.apache.nutch.net.urlnormalizer.pass |
URL normalizer dummy which does not change URLs.
|
org.apache.nutch.net.urlnormalizer.protocol |
URL normalizer to normalize the protocol for all URLs of a given host or
domain.
|
org.apache.nutch.net.urlnormalizer.querystring |
URL normalizer which sorts the elements in the query part to avoid duplicates
by permutations.
|
org.apache.nutch.net.urlnormalizer.regex |
URL normalizer with configurable rules based on regular expressions
(Pattern). |
org.apache.nutch.net.urlnormalizer.slash | |
org.apache.nutch.parse |
The Parse interface and related classes. |
org.apache.nutch.parse.ext |
Parse wrapper to run an external command to do the parsing.
|
org.apache.nutch.parse.feed |
Parse RSS feeds.
|
org.apache.nutch.parse.headings |
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
|
org.apache.nutch.parse.html |
An HTML document parsing plugin.
|
org.apache.nutch.parse.js |
Parser and parse filter plugin to extract all (possible) links
from JavaScript files and embedded JavaScript code snippets.
|
org.apache.nutch.parse.metatags |
Parse filter to extract meta tags: keywords, description, etc.
|
org.apache.nutch.parse.tika |
Parse various document formats with the help of Apache Tika.
|
org.apache.nutch.parse.zip |
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
|
org.apache.nutch.parsefilter.debug |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
org.apache.nutch.parsefilter.naivebayes |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text. It uses a training
file in which you can give positive and negative example texts (see the
description of parsefilter.naivebayes.trainfile). If found irrelevant,
it gives the link a second chance if it contains any of the words from the
list given in parsefilter.naivebayes.wordlist.
|
org.apache.nutch.parsefilter.regex |
RegexParseFilter.
|
org.apache.nutch.plugin |
The Nutch Plugin System. |
org.apache.nutch.protocol |
Classes related to the Protocol interface,
see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file |
Protocol plugin which supports retrieving local file resources.
|
org.apache.nutch.protocol.ftp |
Protocol plugin which supports retrieving documents via the FTP protocol.
|
org.apache.nutch.protocol.htmlunit |
Protocol plugin which supports retrieving documents via HTTP/HTTPS using
Selenium and the HtmlUnitDriver web driver for the
HtmlUnit headless browser.
|
org.apache.nutch.protocol.http |
Protocol plugin which supports retrieving documents via the HTTP protocol.
|
org.apache.nutch.protocol.http.api |
Common API used by HTTP plugins (http, httpclient, etc.) |
org.apache.nutch.protocol.httpclient |
Protocol plugin which supports retrieving documents via the HTTP and HTTPS
protocols, optionally with Basic, Digest and NTLM authentication schemes for
web server as well as proxy server.
|
org.apache.nutch.protocol.interactiveselenium |
Protocol plugin which supports retrieving documents using and interacting
with Selenium.
|
org.apache.nutch.protocol.interactiveselenium.handlers |
Handler implementations to interact with Selenium for
org.apache.nutch.protocol.interactiveselenium. |
org.apache.nutch.protocol.okhttp |
Protocol plugin for HTTP/HTTPS based on
OkHttp, supports HTTP/1.1
and/or HTTP/2.
|
org.apache.nutch.protocol.selenium |
Protocol plugin which supports retrieving documents via
Selenium.
|
org.apache.nutch.publisher | |
org.apache.nutch.publisher.rabbitmq |
Publisher package to implement queues
|
org.apache.nutch.rabbitmq | |
org.apache.nutch.scoring |
The ScoringFilter interface. |
org.apache.nutch.scoring.depth |
Scoring filter to stop crawling at a configurable depth
(number of "hops" from seed URLs).
|
org.apache.nutch.scoring.link |
Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.metadata |
Metadata Scoring Plugin
|
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.orphan |
Scoring filter to modify score or status of orphaned pages (no inlinks found
for a configurable amount of time).
|
org.apache.nutch.scoring.similarity | |
org.apache.nutch.scoring.similarity.cosine |
Implements the cosine similarity metric for scoring relevant documents
|
org.apache.nutch.scoring.similarity.util |
Utility package for Lucene functions.
|
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.apache.nutch.scoring.urlmeta |
URL Meta Tag Scoring Plugin
|
org.apache.nutch.scoring.webgraph | |
org.apache.nutch.segment |
A segment stores all data from one generate/fetch/update cycle:
fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
|
org.apache.nutch.service | |
org.apache.nutch.service.impl | |
org.apache.nutch.service.model.request | |
org.apache.nutch.service.model.response | |
org.apache.nutch.service.resources | |
org.apache.nutch.tools |
Miscellaneous tools.
|
org.apache.nutch.tools.arc |
Tools to read the
Arc file format.
|
org.apache.nutch.tools.warc |
Tools to import / export between Nutch segments and
WARC archives.
|
org.apache.nutch.urlfilter.api |
Generic URL filter library,
abstracting away from regular expression implementations. |
org.apache.nutch.urlfilter.automaton |
URL filter plugin based on
dk.brics.automaton Finite-State Automata for Java™.
|
org.apache.nutch.urlfilter.domain |
URL filter plugin to include only URLs which match an element in a given list of
domain suffixes, domain names, and/or host names.
|
org.apache.nutch.urlfilter.domaindenylist |
URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
|
org.apache.nutch.urlfilter.fast |
URL filter plugin that first does fast exact suffix matches on host/domain
names before applying regular expressions to the path component of a URL.
|
org.apache.nutch.urlfilter.ignoreexempt |
URL filter plugin which identifies exemptions to external URLs
when external URLs are set to ignore.
|
org.apache.nutch.urlfilter.prefix |
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
|
org.apache.nutch.urlfilter.regex |
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
|
org.apache.nutch.urlfilter.suffix |
URL filter plugin to either exclude or include only URLs which match
one of the given (path) suffixes.
|
org.apache.nutch.urlfilter.validator |
URL filter plugin that validates given URLs.
|
org.apache.nutch.util |
Miscellaneous utility classes.
|
org.apache.nutch.util.domain |
Classes for domain name analysis.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
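
The basic normalizations listed for org.apache.nutch.net.urlnormalizer.basic (dropping default ports, collapsing needless slashes and dot segments, removing the anchor, and percent-encoding only where needed) can be illustrated with a small standalone sketch. This is not the plugin's actual code: the class name `BasicNormalizationSketch` and its helper methods are hypothetical, it relies only on `java.net.URI` from the JDK, and it assumes an http or https URL with a plain ASCII host (no IDN handling, no trailing-dot trimming).

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of the Nutch API.
class BasicNormalizationSketch {

  /** Default port for the two schemes handled here, or -1 if unknown. */
  static int defaultPort(String scheme) {
    if ("http".equalsIgnoreCase(scheme)) return 80;
    if ("https".equalsIgnoreCase(scheme)) return 443;
    return -1;
  }

  /** True for RFC 3986 "unreserved" characters, which never need escaping. */
  static boolean isUnreserved(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || c == '-' || c == '.' || c == '_' || c == '~';
  }

  /** Unescapes needlessly escaped characters and escapes non-ASCII ones as UTF-8. */
  static String fixEscapes(String s) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    while (i < s.length()) {
      char c = s.charAt(i);
      if (c == '%' && i + 2 < s.length()
          && Character.digit(s.charAt(i + 1), 16) >= 0
          && Character.digit(s.charAt(i + 2), 16) >= 0) {
        int value = Integer.parseInt(s.substring(i + 1, i + 3), 16);
        if (isUnreserved(value)) {
          sb.append((char) value);                     // e.g. %2D becomes -
        } else {
          sb.append(String.format("%%%02X", value));   // keep escape, upper-case hex
        }
        i += 3;
      } else if (c > 127) {
        int cp = s.codePointAt(i);
        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) sb.append(String.format("%%%02X", b & 0xFF));
        i += Character.charCount(cp);
      } else {
        sb.append(c);
        i++;
      }
    }
    return sb.toString();
  }

  /** Removes empty segments, "." and ".." from a path. */
  static String cleanPath(String path) {
    Deque<String> kept = new ArrayDeque<>();
    for (String segment : path.split("/")) {
      if (segment.isEmpty() || segment.equals(".")) continue;
      if (segment.equals("..")) { kept.pollLast(); continue; }
      kept.addLast(segment);
    }
    StringBuilder sb = new StringBuilder();
    for (String segment : kept) sb.append('/').append(segment);
    if (sb.length() == 0 || path.endsWith("/")) sb.append('/');
    return sb.toString();
  }

  static String normalize(String url) {
    URI uri = URI.create(url);
    StringBuilder sb = new StringBuilder();
    sb.append(uri.getScheme()).append("://").append(uri.getHost());
    int port = uri.getPort();
    if (port != -1 && port != defaultPort(uri.getScheme())) sb.append(':').append(port);
    sb.append(cleanPath(fixEscapes(uri.getRawPath())));
    if (uri.getRawQuery() != null) sb.append('?').append(fixEscapes(uri.getRawQuery()));
    return sb.toString();                              // the fragment (anchor) is dropped
  }

  public static void main(String[] args) {
    // \u00f1 is the letter n with tilde from the example in the table above
    System.out.println(normalize(
        "https://www.example.org/a/../b//./select%2Dlang.php?lang=espa\u00f1ol#anchor"));
  }
}
```

Running `main` prints `https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol`, the same result as the example given in the package description. Escapes are fixed before dot segments are removed so that an escaped dot segment is treated like a literal one; whether the real plugin orders these steps the same way is not stated here.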