All Classes
Class |
Description |
AbstractChecker |
Scaffolding class for the various Checker implementations.
|
AbstractCommonCrawlFormat |
Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
|
AbstractFetchSchedule |
This class provides common methods for implementations of
FetchSchedule .
|
AbstractResource |
|
AbstractScoringFilter |
|
AdaptiveFetchSchedule |
This class implements an adaptive re-fetch algorithm.
|
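A rough, standalone sketch of the adaptive re-fetch idea behind AdaptiveFetchSchedule: shorten the interval when the page is observed to have changed, lengthen it otherwise, and clamp the result to configured bounds. The constant names and values below are illustrative assumptions, not the plugin's actual configuration keys.

    // Illustrative only: field names and factors are assumed, not Nutch's.
    public class AdaptiveIntervalSketch {
      static final float INC_RATE = 0.2f;            // grow interval when page is unchanged
      static final float DEC_RATE = 0.2f;            // shrink interval when page has changed
      static final float MIN_INTERVAL = 60f;         // seconds
      static final float MAX_INTERVAL = 30f * 24 * 3600;

      static float nextInterval(float interval, boolean pageChanged) {
        if (pageChanged) {
          interval -= interval * DEC_RATE;           // re-fetch sooner next time
        } else {
          interval += interval * INC_RATE;           // back off
        }
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
      }

      public static void main(String[] args) {
        float daily = 24 * 3600f;
        System.out.println(nextInterval(daily, true));   // shorter than one day
        System.out.println(nextInterval(daily, false));  // longer than one day
      }
    }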
AdminResource |
|
AjaxURLNormalizer |
URLNormalizer capable of dealing with AJAX URLs.
|
AnchorIndexingFilter |
Indexing filter that offers an option to either index all inbound anchor text
for a document or deduplicate anchors.
|
ArbitraryIndexingFilter |
Adds arbitrary searchable fields to a document from the class and method
the user identifies in the config.
|
ArcInputFormat |
An input format that reads ARC files.
|
ArcRecordReader |
The ArcRecordReader class provides a record reader which reads
records from ARC files.
|
ArcSegmentCreator |
The ArcSegmentCreator is a replacement for the fetcher that takes
ARC files as input and produces a Nutch segment as output.
|
ArcSegmentCreator.ArcSegmentCreatorMapper |
|
AutomatonURLFilter |
RegexURLFilterBase implementation based on the dk.brics.automaton Finite-State
Automata for Java TM.
|
BasicIndexingFilter |
Adds basic searchable fields to a document.
|
BasicURLNormalizer |
Converts URLs to a normal form:
removes dot segments in the path (/./ or /../) and
removes default ports (e.g., port 80 for HTTP).
|
BlockedException |
|
CaseInsensitiveMetadata |
A decorator to Metadata that adds case-insensitive lookup of keys.
|
CCIndexingFilter |
Adds basic searchable fields to a document.
|
CCParseFilter |
Adds metadata identifying the Creative Commons license used, if any.
|
CCParseFilter.Walker |
Walks DOM tree, looking for RDF in comments and licenses in anchors.
|
CIDR |
Parse a CIDR block
notation and test whether an IP address is contained in the subnet range
defined by the CIDR.
|
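For illustration, a generic IPv4 containment check of the kind the CIDR class is described as performing; this is a self-contained sketch and does not reflect the class's actual API.

    // Generic IPv4 CIDR containment check (sketch, not Nutch's CIDR API).
    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class CidrSketch {
      static boolean contains(String cidr, String ip) throws UnknownHostException {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (toInt(parts[0]) & mask) == (toInt(ip) & mask);
      }

      static int toInt(String addr) throws UnknownHostException {
        byte[] b = InetAddress.getByName(addr).getAddress();
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
      }

      public static void main(String[] args) throws UnknownHostException {
        System.out.println(contains("192.168.0.0/16", "192.168.42.7")); // true
        System.out.println(contains("192.168.0.0/16", "10.0.0.1"));     // false
      }
    }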
CircularDependencyException |
CircularDependencyException will be thrown if a circular
dependency is detected.
|
Classify |
|
CleaningJob |
The class scans CrawlDB looking for entries with status DB_GONE (404) or
DB_DUPLICATE and sends delete requests to indexers for those documents.
|
CleaningJob.DBFilter |
|
CleaningJob.DeleterReducer |
|
Client |
Client.java encapsulates the functionality necessary for Nutch to get directory
listings and retrieve files from an FTP server.
|
CloudSearchConstants |
|
CloudSearchIndexWriter |
Writes documents to CloudSearch.
|
CloudSearchUtils |
|
CollectionManager |
|
CommandRunner |
|
CommonCrawlConfig |
|
CommonCrawlDataDumper |
The Common Crawl Data Dumper tool enables one to export the raw
content from Nutch segment data directories into a common crawl data
format consumed by many applications.
|
CommonCrawlFormat |
Interface for all CommonCrawl formatters.
|
CommonCrawlFormatFactory |
|
CommonCrawlFormatJackson |
This class provides methods to map crawled data to JSON using the Jackson Streaming API.
|
CommonCrawlFormatJettinson |
This class provides methods to map crawled data to JSON using the Jettison API.
|
CommonCrawlFormatSimple |
This class provides methods to map crawled data to JSON using a StringBuilder object.
|
CommonCrawlFormatWARC |
|
ConfigResource |
|
ConfManager |
|
ConfManagerImpl |
|
Content |
|
ContentAsTextInputFormat |
An input format that takes Nutch Content objects and converts them to text
while converting newline endings to spaces.
|
CosineSimilarity |
|
CrawlCompletionStats |
Extracts some simple crawl completion stats from the crawldb.
Stats will be sorted by host/domain and will be of the form:
1 www.spitzer.caltech.edu FETCHED
50 www.spitzer.caltech.edu UNFETCHED
|
CrawlCompletionStats.CrawlCompletionStatsCombiner |
|
CrawlDatum |
|
CrawlDatum.Comparator |
A Comparator optimized for CrawlDatum.
|
CrawlDatumProcessor |
These are instantiated once for each host.
|
CrawlDb |
This class takes the output of the fetcher and updates the crawldb
accordingly.
|
CrawlDbFilter |
This class provides a way to separate the URL normalization and filtering
steps from the rest of CrawlDb manipulation code.
|
CrawlDbMerger |
This tool merges several CrawlDb-s into one, optionally filtering URLs
through the current URLFilters, to skip prohibited pages.
|
CrawlDbMerger.Merger |
|
CrawlDbReader |
Read utility for the CrawlDB.
|
CrawlDbReader.CrawlDatumCsvOutputFormat |
|
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter |
|
CrawlDbReader.CrawlDatumJsonOutputFormat |
|
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter |
|
CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer |
|
CrawlDbReader.CrawlDbDumpMapper |
|
CrawlDbReader.CrawlDbStatMapper |
|
CrawlDbReader.CrawlDbStatReducer |
|
CrawlDbReader.CrawlDbTopNMapper |
|
CrawlDbReader.CrawlDbTopNReducer |
|
CrawlDbReader.JsonIndenter |
|
CrawlDbReducer |
Merge new page entries with existing entries.
|
CreativeCommons |
A collection of Creative Commons properties names.
|
CSVConstants |
|
CSVIndexWriter |
Write Nutch documents to a CSV file (comma separated values), i.e., dump
index as CSV or tab-separated plain text table.
|
DbQuery |
|
DbResource |
|
DebugParseFilter |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
DeduplicationJob |
Generic deduplicator which groups fetched URLs with the same digest and marks
all of them as duplicate except the one with the highest score (based on the
score in the crawldb, which is not necessarily the same as the score
indexed).
|
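A minimal sketch of the grouping idea described for DeduplicationJob: records sharing a digest form a group, the highest-scoring record is kept, and the rest are marked as duplicates. The Record type and the sample data are hypothetical, not Nutch's CrawlDatum model.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DedupSketch {
      record Record(String url, String digest, float score) {}

      public static void main(String[] args) {
        List<Record> records = List.of(
            new Record("http://a.example/", "d1", 1.5f),
            new Record("http://b.example/copy", "d1", 0.7f),
            new Record("http://c.example/", "d2", 1.0f));

        // Keep the highest-scoring record per digest.
        Map<String, Record> keep = new HashMap<>();
        for (Record r : records) {
          keep.merge(r.digest(), r, (a, b) -> a.score() >= b.score() ? a : b);
        }
        for (Record r : records) {
          boolean duplicate = keep.get(r.digest()) != r;
          System.out.println(r.url() + (duplicate ? " -> mark as duplicate" : " -> keep"));
        }
      }
    }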
DeduplicationJob.DBFilter |
|
DeduplicationJob.DedupReducer<K extends Writable> |
|
DeduplicationJob.StatusUpdateReducer |
Combine multiple new entries for a url.
|
DefalultMultiInteractionHandler |
This is a placeholder/example of a technique or use case where we perform multiple
interactions with the web driver and need data from each such interaction at the end.
|
DefaultClickAllAjaxLinksHandler |
This handler clicks all the anchor (&lt;a&gt;) tags
because it considers them not as usual links but as AJAX links/interactions.
|
DefaultFetchSchedule |
This class implements the default re-fetch schedule.
|
DefaultHandler |
|
DeflateUtils |
A collection of utility methods for working on deflated data.
|
DepthScoringFilter |
This scoring filter limits the number of hops from the initial seed urls.
|
DmozParser |
Utility that converts DMOZ
RDF into a flat file of URLs to be injected.
|
DocVector |
|
DomainDenylistURLFilter |
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
DomainStatistics |
Extracts some very basic statistics about domains from the crawldb
|
DomainStatistics.DomainStatisticsCombiner |
|
DomainStatistics.MyCounter |
|
DomainSuffix |
This class represents the last part of the host name, which is operated by
authorities, not individuals.
|
DomainSuffix.Status |
Enumeration of the status of the tld.
|
DomainSuffixes |
Storage class for DomainSuffix objects. Note: this class is a singleton.
|
DomainURLFilter |
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
DOMBuilder |
This class takes SAX events (in addition to some extra events that SAX
doesn't handle yet) and adds the result to a document or document fragment.
|
DOMContentUtils |
A collection of methods for extracting content from DOM trees.
|
DOMContentUtils |
A collection of methods for extracting content from DOM trees.
|
DOMContentUtils.LinkParams |
|
DomUtil |
|
DublinCore |
A collection of Dublin Core metadata names.
|
DummyConstants |
|
DummyIndexWriter |
DummyIndexWriter.
|
DummySSLProtocolSocketFactory |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DumpFileUtil |
|
ElasticConstants |
|
ElasticIndexWriter |
Sends NutchDocuments to a configured Elasticsearch index.
|
EncodingDetector |
A simple class for detecting character encodings.
|
Exchange |
|
ExchangeConfig |
|
Exchanges |
|
ExemptionUrlFilter |
This implementation of URLExemptionFilter uses a regex configuration
to check if a URL is eligible for exemption from 'db.ignore.external'.
|
Extension |
An Extension is a kind of listener descriptor that will be
installed on a concrete ExtensionPoint that acts as a kind of
Publisher.
|
ExtensionPoint |
The ExtensionPoint provides meta information about an extension
point.
|
ExtParser |
A wrapper that invokes an external command to do the real parsing job.
|
FastURLFilter |
Filters URLs based on a file of regular expressions using host/domains
matching first.
|
FastURLFilter.DenyAllRule |
Rule for DenyPath .* or DenyPath .?
|
FastURLFilter.DenyPathQueryRule |
|
FastURLFilter.DenyPathRule |
|
FastURLFilter.Rule |
|
Feed |
A collection of Feed property names extracted by the ROME library.
|
FeedIndexingFilter |
|
FeedParser |
|
Fetcher |
A queue-based fetcher.
|
Fetcher.FetcherRun |
|
Fetcher.InputFormat |
|
FetcherOutputFormat |
Splits FetcherOutput entries into multiple map files.
|
FetcherThread |
This class picks items from queues and fetches the pages.
|
FetcherThreadEvent |
This class is used to capture the various events occurring
at fetch time.
|
FetcherThreadEvent.PublishEventType |
Type of event to specify start, end or reporting of a fetch item.
|
FetcherThreadPublisher |
This class handles the publishing of the events to the queue implementation.
|
FetchItem |
This class describes the item to be fetched.
|
FetchItemQueue |
This class handles FetchItems which come from the same host ID (be it a
proto/hostname or proto/IP pair).
|
FetchItemQueues |
A collection of queues that keeps track of the total number of items, and
provides items eligible for fetching from any queue.
|
FetchNode |
|
FetchNodeDb |
|
FetchNodeDbInfo |
|
FetchOverdueCrawlDatumProcessor |
Simple custom crawl datum processor that counts the number of records that
are overdue for fetching.
|
FetchSchedule |
This interface defines the contract for implementations that manipulate fetch
times and re-fetch intervals.
|
FetchScheduleFactory |
|
FieldReplacer |
POJO to store a filename, its match pattern and its replacement string.
|
File |
This class is a protocol plugin used for file: scheme.
|
FileDumper |
The file dumper tool enables one to reverse generate the raw content from
Nutch segment data directories.
|
FileError |
Thrown for File error codes.
|
FileException |
|
FileResponse |
FileResponse.java mimics file replies as an HTTP response.
|
FreeGenerator |
This tool generates fetchlists (segments to be fetched) from plain text files
containing one URL per line.
|
FreeGenerator.FG |
|
FreeGenerator.FG.FGMapper |
|
FreeGenerator.FG.FGReducer |
|
FSUtils |
Utility methods for common filesystem operations.
|
Ftp |
This class is a protocol plugin used for ftp: scheme.
|
FtpError |
Thrown for Ftp error codes.
|
FtpException |
Superclass for important exceptions thrown during FTP talk, that must be
handled with care.
|
FtpExceptionBadSystResponse |
Exception indicating bad reply of SYST command.
|
FtpExceptionCanNotHaveDataConnection |
Exception indicating failure of opening data connection.
|
FtpExceptionControlClosedByForcedDataClose |
Exception indicating control channel is closed by server end, due to forced
closure of data channel at client (our) end.
|
FtpExceptionUnknownForcedDataClose |
Exception indicating unrecognizable reply from server after forced closure of
data channel by client (our) side.
|
FtpResponse |
FtpResponse.java mimics FTP replies as an HTTP response.
|
FtpRobotRulesParser |
This class is used for parsing robots.txt rules for URLs belonging to the FTP protocol.
|
Generator |
Generates a subset of a crawl db to fetch.
|
Generator.CrawlDbUpdater |
Update the CrawlDB so that the next generate won't include the same URLs.
|
Generator.CrawlDbUpdater.CrawlDbUpdateMapper |
|
Generator.CrawlDbUpdater.CrawlDbUpdateReducer |
|
Generator.DecreasingFloatComparator |
|
Generator.HashComparator |
Sort fetch lists by hash of URL.
|
Generator.PartitionReducer |
|
Generator.Selector |
Selects entries due for fetch.
|
Generator.SelectorEntry |
|
Generator.SelectorInverseMapper |
|
Generator.SelectorMapper |
Select and invert subset due for fetch.
|
Generator.SelectorReducer |
Collect until limit is reached.
|
GenericWritableConfigurable |
A generic Writable wrapper that can inject a Configuration into
Configurables.
|
GeoIPDocumentCreator |
|
GeoIPIndexingFilter |
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
|
GZIPUtils |
A collection of utility methods for working on GZIPed data.
|
HadoopFSUtil |
|
HeadingsParseFilter |
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
|
HostDatum |
|
HostURLNormalizer |
URL normalizer for mapping hosts to their desired form.
|
HTMLLanguageParser |
|
HTMLMetaProcessor |
Class for parsing META Directives from DOM trees.
|
HTMLMetaProcessor |
Class for parsing META Directives from DOM trees.
|
HTMLMetaTags |
This class holds the information about HTML "meta" tags extracted from a
page.
|
HtmlParseFilter |
Extension point for DOM-based HTML parsers.
|
HtmlParseFilters |
|
HtmlParser |
|
HtmlUnitWebDriver |
|
HtmlUnitWebWindowListener |
|
Http |
|
Http |
|
Http |
This class is a protocol plugin that configures an HTTP client for Basic,
Digest and NTLM authentication schemes for web server as well as proxy
server.
|
Http |
|
Http |
|
HttpAuthentication |
The base level of services required for Http Authentication
|
HttpAuthenticationException |
Can be used to identify problems during creation of Authentication objects.
|
HttpAuthenticationFactory |
Provides the Http protocol implementation with the ability to authenticate
when prompted.
|
HttpBase |
|
HttpBasicAuthentication |
Implementation of RFC 2617 Basic Authentication.
|
HttpDateFormat |
Parse and format HTTP dates in HTTP headers, e.g., used to fill the
"If-Modified-Since" request header field.
|
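As an illustration of the header-date handling described above, the standard library can format and parse RFC 1123 dates directly; this sketch is independent of the HttpDateFormat class itself.

    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    public class HttpDateSketch {
      public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        // Format a date the way it appears in HTTP headers.
        String header = DateTimeFormatter.RFC_1123_DATE_TIME.format(now);
        System.out.println("If-Modified-Since: " + header);

        // Parse it back and convert to epoch milliseconds.
        ZonedDateTime parsed =
            ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME);
        System.out.println(parsed.toInstant().toEpochMilli());
      }
    }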
HttpException |
|
HttpFormAuthConfigurer |
|
HttpFormAuthentication |
|
HttpHeaders |
A collection of HTTP header names.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
|
HttpResponse |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpRobotRulesParser |
This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol.
|
HttpWebClient |
|
IndexerMapReduce |
This class is typically invoked from within
IndexingJob and handles all MapReduce
functionality required when undertaking indexing.
|
IndexerMapReduce.IndexerMapper |
|
IndexerMapReduce.IndexerReducer |
|
IndexerOutputFormat |
|
IndexingException |
|
IndexingFilter |
Extension point for indexing.
|
IndexingFilters |
|
IndexingFiltersChecker |
Reads and parses a URL and runs the indexers on it.
|
IndexingJob |
Generic indexer which relies on the plugins implementing IndexWriter
|
IndexWriter |
|
IndexWriterConfig |
|
IndexWriterParams |
|
IndexWriters |
|
Injector |
Injector takes a flat text file of URLs (or a folder containing text files)
and merges ("injects") these URLs into the CrawlDb.
|
Injector.InjectMapper |
InjectMapper reads the CrawlDb the seeds are injected into
and the plain-text seed files, and parses each line into the URL and
metadata.
|
Injector.InjectReducer |
Combine multiple new entries for a url.
|
Inlink |
An incoming link to a page.
|
Inlinks |
|
InteractiveSeleniumHandler |
|
IPFilterRules |
Optionally limit or block connections to IP address ranges
(localhost/loopback or site-local addresses, subnet ranges given in CIDR
notation, or single IP addresses).
|
JexlExchange |
|
JexlIndexingFilter |
An IndexingFilter that allows filtering of
documents based on a JEXL expression.
|
JexlUtil |
Utility methods for handling JEXL expressions
|
JobConfig |
Job-specific configuration.
|
JobFactory |
|
JobInfo |
This is the response object containing Job information
|
JobInfo.State |
|
JobManager |
|
JobManager.JobType |
|
JobManagerImpl |
|
JobResource |
|
JobWorker |
|
JSParseFilter |
This class is a heuristic link extractor for JavaScript files and code
snippets.
|
KafkaConstants |
|
KafkaIndexWriter |
Sends Nutch documents to a configured Kafka Cluster
|
LanguageIndexingFilter |
|
LinkAnalysisScoringFilter |
|
LinkDatum |
A class for holding link information including the url, anchor text, a score,
the timestamp of the link and a link type.
|
LinkDb |
Maintains an inverted link map, listing incoming links for each url.
|
LinkDb.LinkDbMapper |
|
LinkDbFilter |
This class provides a way to separate the URL normalization and filtering
steps from the rest of LinkDb manipulation code.
|
LinkDbMerger |
This tool merges several LinkDb-s into one, optionally filtering URLs through
the current URLFilters, to skip prohibited URLs and links.
|
LinkDbMerger.LinkDbMergeReducer |
|
LinkDbReader |
Read utility for the LinkDb.
|
LinkDbReader.LinkDBDumpMapper |
|
LinkDumper |
The LinkDumper tool creates a database of node to inlink information that can
be read using the nested Reader class.
|
LinkDumper.Inverter |
Inverts outlinks from the WebGraph to inlinks and attaches node
information.
|
LinkDumper.Inverter.InvertMapper |
Wraps all values in ObjectWritables.
|
LinkDumper.Inverter.InvertReducer |
Inverts outlinks to inlinks while attaching node information to the
outlink.
|
LinkDumper.LinkNode |
Bean class which holds url to node information.
|
LinkDumper.LinkNodes |
Writable class which holds an array of LinkNode objects.
|
LinkDumper.Merger |
Merges LinkNode objects into a single array value per url.
|
LinkDumper.Reader |
Reader class which will print out the url and all of its inlinks to system
out.
|
LinkRank |
|
LinkReader |
|
LinksIndexingFilter |
An IndexingFilter that adds
outlinks and inlinks field(s) to the document.
|
LockUtil |
Utility methods for handling application-level locking.
|
LuceneAnalyzerUtil |
Creates a custom analyzer based on user provided inputs
|
LuceneAnalyzerUtil.StemFilterType |
|
LuceneTokenizer |
|
LuceneTokenizer.TokenizerType |
|
MD5Signature |
Default implementation of a page signature.
|
Metadata |
A multi-valued metadata container.
|
MetadataIndexer |
Indexer which can be configured to extract metadata from the crawldb, parse
metadata or content metadata.
|
MetadataScoringFilter |
|
MetaTagsParser |
Parse HTML meta tags (keywords, description) and store them in the parse
metadata so that they can be indexed with the index-metadata plugin with the
prefix 'metatag.'.
|
MetaWrapper |
This is a simple decorator that adds metadata to any Writable-s that can be
serialized by NutchWritable .
|
MimeAdaptiveFetchSchedule |
Extension of AdaptiveFetchSchedule that allows for more flexible
configuration of DEC and INC factors for various MIME types.
|
MimeTypeIndexingFilter |
An IndexingFilter that allows filtering
of documents based on the MIME Type detected by Tika
|
MimeUtil |
This is a facade class to insulate Nutch from its underlying Mime Type
substrate library, Apache Tika.
|
MissingDependencyException |
MissingDependencyException will be thrown if a plugin dependency
cannot be found.
|
Model |
This class creates a model used to store Document vector representation of the corpus.
|
MoreIndexingFilter |
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
|
NaiveBayesParseFilter |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text, using a training
file where you can give positive and negative example texts (see the
description of parsefilter.naivebayes.trainfile). A link found irrelevant
gets a second chance if it contains any of the words from the list
given in parsefilter.naivebayes.wordlist.
|
Node |
A class which holds the number of inlinks and outlinks for a given url along
with an inlink score from a link analysis program and any metadata.
|
NodeDumper |
A tool that dumps out the top URLs by number of inlinks, number of outlinks,
or by score, to a text file.
|
NodeDumper.Dumper |
Outputs the hosts or domains with an associated value.
|
NodeDumper.Dumper.DumperMapper |
Outputs the host or domain as key for this record and numInlinks,
numOutlinks or score as the value.
|
NodeDumper.Dumper.DumperReducer |
Outputs either the sum or the top value for this record.
|
NodeDumper.Sorter |
Outputs the top urls sorted in descending order.
|
NodeDumper.Sorter.SorterMapper |
Outputs the URL with the appropriate number of inlinks or outlinks, or its
score.
|
NodeDumper.Sorter.SorterReducer |
Flips and collects the url and numeric sort value.
|
NodeReader |
Reads and prints to system out information for a single node from the NodeDb
in the WebGraph.
|
NodeReader |
|
NodeWalker |
A utility class that allows the walking of any DOM tree using a stack instead
of recursion.
|
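A standalone sketch of the technique NodeWalker is described as using: traversing a DOM tree with an explicit stack rather than recursion. The XML snippet is only sample input.

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DomWalkSketch {
      public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>hello</p><p>world</p></body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();

        // Depth-first walk with an explicit stack instead of recursion.
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
          Node current = stack.pop();
          System.out.println(current.getNodeName());
          NodeList children = current.getChildNodes();
          for (int i = children.getLength() - 1; i >= 0; i--) {
            stack.push(children.item(i));   // reverse order preserves document order
          }
        }
      }
    }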
Nutch |
A collection of Nutch internal metadata constants.
|
NutchConfig |
|
NutchConfiguration |
Utility to create Hadoop Configurations that include Nutch-specific
resources.
|
NutchDocument |
|
NutchField |
This class represents a multi-valued field with a weight.
|
NutchIndexAction |
A NutchIndexAction is the new unit of indexing holding the document
and action information.
|
NutchJob |
|
NutchPublisher |
All publisher subscriber model implementations should implement this interface.
|
NutchPublishers |
|
NutchReader |
|
NutchServer |
|
NutchServerInfo |
|
NutchServerPoolExecutor |
|
NutchTool |
|
NutchWritable |
|
ObjectCache |
|
OkHttp |
|
OkHttpResponse |
|
OkHttpResponse.TruncatedContent |
Container to store whether and why content has been truncated
|
OpenSearch1xConstants |
|
OpenSearch1xIndexWriter |
Sends NutchDocuments to a configured OpenSearch index.
|
OPICScoringFilter |
This plugin implements a variant of an Online Page Importance Computation
(OPIC) score, described in this paper:
Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive
On-Line Page Importance Computation.
|
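A toy illustration of the OPIC idea cited above: each page holds some "cash" that is split evenly among its outlinks on every pass, while the cash it has distributed accumulates in its history (an estimate of importance). The three-page graph and variable names are invented for the example; this is not the plugin's code.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OpicSketch {
      public static void main(String[] args) {
        Map<String, List<String>> outlinks = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> cash = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        Map<String, Double> history = new HashMap<>();

        for (int pass = 0; pass < 10; pass++) {
          Map<String, Double> next = new HashMap<>();
          outlinks.keySet().forEach(p -> next.put(p, 0.0));
          for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            double share = cash.get(e.getKey()) / e.getValue().size();
            history.merge(e.getKey(), cash.get(e.getKey()), Double::sum);
            for (String target : e.getValue()) {
              next.merge(target, share, Double::sum);   // distribute cash to outlinks
            }
          }
          cash = next;
        }
        System.out.println(history);   // accumulated importance estimates
      }
    }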
OrphanScoringFilter |
Orphan scoring filter that determines whether a page has become orphaned,
i.e., it is no longer linked to by other pages.
|
Outlink |
An outgoing link from a page.
|
OutlinkExtractor |
Extractor to extract Outlinks / URLs from
plain text using regular expressions.
|
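A minimal sketch of extracting URLs from plain text with a regular expression, as described for OutlinkExtractor; the pattern below is a simplified stand-in, not the one the class uses.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OutlinkSketch {
      // Deliberately simple pattern; real extractors handle more edge cases.
      private static final Pattern URL =
          Pattern.compile("(https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

      public static void main(String[] args) {
        String text = "See https://nutch.apache.org/ and ftp://example.org/file for details.";
        Matcher m = URL.matcher(text);
        while (m.find()) {
          System.out.println(m.group());
        }
      }
    }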
Parse |
The result of parsing a page's raw content.
|
ParseData |
Data extracted from a page's content.
|
ParseException |
|
ParseImpl |
The result of parsing a page's raw content.
|
ParseOutputFormat |
|
Parser |
A parser for content generated by a
Protocol implementation.
|
ParserChecker |
Parser checker, useful for testing parser.
|
ParseResult |
A utility class that stores result of a parse.
|
ParserFactory |
Creates and caches Parser plugins.
|
ParserNotFound |
|
ParseSegment |
|
ParseSegment.ParseSegmentMapper |
|
ParseSegment.ParseSegmentReducer |
|
ParseStatus |
|
ParseText |
|
ParseUtil |
A utility class containing methods that simplify parsing tasks, such
as iterating through a preferred list of Parsers to obtain
Parse objects.
|
PassURLNormalizer |
This URLNormalizer doesn't change urls.
|
Pluggable |
Defines the capability of a class to be plugged into Nutch.
|
Plugin |
A nutch-plugin is a container for a set of custom logic that provides
extensions to the Nutch core functionality, or to another plugin that provides an
API for extending.
|
PluginClassLoader |
The PluginClassLoader is a child-first classloader that only
contains classes from the runtime libraries set up in the plugin manifest file
and the exported libraries of required plugins.
|
PluginDescriptor |
The PluginDescriptor provides access to all meta information of a
nutch-plugin, as well as to the internationalizable resources and the plugin's own
classloader.
|
PluginManifestParser |
The PluginManifestParser provides a mechanism for
parsing Nutch plugin manifest files ( plugin.xml ) contained
in a String of plugin directories.
|
PluginRepository |
The plugin repository is a registry of all plugins.
|
PluginRuntimeException |
PluginRuntimeException will be thrown when an exception in the
plugin management occurs.
|
PrefixStringMatcher |
A class for efficiently matching Strings against a set of
prefixes.
|
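A minimal trie-based prefix matcher illustrating the technique behind the prefix/suffix/trie string matchers listed here; it is a standalone sketch, not the classes' actual API.

    import java.util.HashMap;
    import java.util.Map;

    public class PrefixTrieSketch {
      static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
      }

      private final Node root = new Node();

      void add(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
          n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
      }

      boolean matchesAnyPrefix(String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
          if (n.terminal) return true;     // a stored prefix ends here
          n = n.children.get(c);
          if (n == null) return false;
        }
        return n.terminal;
      }

      public static void main(String[] args) {
        PrefixTrieSketch m = new PrefixTrieSketch();
        m.add("http://");
        m.add("https://");
        System.out.println(m.matchesAnyPrefix("https://nutch.apache.org/")); // true
        System.out.println(m.matchesAnyPrefix("mailto:user@example.org"));   // false
      }
    }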
PrefixURLFilter |
Filters URLs based on a file of URL prefixes.
|
PrintCommandListener |
This is a support class for logging all ftp command/reply traffic.
|
Protocol |
A retriever of url content.
|
ProtocolException |
Deprecated.
|
ProtocolException |
|
ProtocolFactory |
|
ProtocolLogUtil |
|
ProtocolNotFound |
|
ProtocolOutput |
Simple aggregate to pass from protocol plugins both content and protocol
status.
|
ProtocolStatus |
|
ProtocolStatusStatistics |
Extracts protocol status code information from the crawl database.
|
ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner |
|
ProtocolURLNormalizer |
URL normalizer to normalize the protocol for all URLs of a given host or
domain.
|
QuerystringURLNormalizer |
URL normalizer plugin for normalizing query strings by sorting query string
parameters.
|
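A simplified sketch of the normalization described above, sorting query-string parameters so that equivalent URLs compare equal; it ignores percent-encoding corner cases and is not the plugin's code.

    import java.util.Arrays;

    public class QuerySortSketch {
      static String normalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) return url;                       // no query string to sort
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params);
      }

      public static void main(String[] args) {
        System.out.println(normalize("http://example.org/page?b=2&a=1&c=3"));
        // -> http://example.org/page?a=1&b=2&c=3
      }
    }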
QueueFeeder |
This class feeds the queues with input items, and re-fills them as items
are consumed by FetcherThread-s.
|
RabbitIndexWriter |
|
RabbitMQClient |
Client for RabbitMQ
|
RabbitMQMessage |
|
RabbitMQPublisherImpl |
|
ReaderConfig |
|
ReaderResouce |
The Reader endpoint enables a user to read sequence files,
nodes and links from the Nutch webgraph.
|
ReadHostDb |
|
RegexParseFilter |
RegexParseFilter.
|
RegexRule |
A generic regular expression rule.
|
RegexURLFilter |
|
RegexURLFilterBase |
Generic URLFilter based on regular
expressions.
|
RegexURLNormalizer |
Allows users to do regex substitutions on all/any URLs that are encountered,
which is useful for stripping session IDs from URLs.
|
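As a sketch of the kind of rule this normalizer applies, the snippet below strips a jsessionid parameter with a regex substitution; the pattern is illustrative only, not one of the plugin's shipped rules.

    import java.util.regex.Pattern;

    public class SessionIdStripSketch {
      private static final Pattern SESSION_ID =
          Pattern.compile("(?i)([;?&])jsessionid=[^&#]*&?");

      static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$1")
            .replaceAll("[?&;]+$", "");   // drop any trailing separators
      }

      public static void main(String[] args) {
        System.out.println(normalize("http://example.org/a?jsessionid=ABC123&x=1"));
        System.out.println(normalize("http://example.org/a?x=1&jsessionid=ABC123"));
        // both print http://example.org/a?x=1
      }
    }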
RelTagIndexingFilter |
|
RelTagParser |
Adds microformat rel-tags of document if found.
|
ReplaceIndexer |
Do pattern replacements on selected field contents prior to indexing.
|
ResolverThread |
Simple runnable that performs DNS lookup for a single host.
|
ResolveUrls |
A simple tool that will spin up multiple threads to resolve urls to ip
addresses.
|
Response |
A response interface.
|
Response.TruncatedContentReason |
|
RobotRulesParser |
This class uses crawler-commons for handling the parsing of
robots.txt files.
|
ScoreUpdater |
Updates the score from the WebGraph node database into the crawl database.
|
ScoreUpdater.ScoreUpdaterMapper |
Changes input into ObjectWritables.
|
ScoreUpdater.ScoreUpdaterReducer |
Creates new CrawlDatum objects with the updated score from the NodeDb or
with a cleared score.
|
ScoringFilter |
A contract defining behavior of scoring plugins.
|
ScoringFilterException |
Specialized exception for errors during scoring.
|
ScoringFilters |
|
SeedList |
|
SeedManager |
|
SeedManagerImpl |
|
SeedResource |
|
SeedUrl |
|
SegmentChecker |
Checks whether a segment is valid, or has a certain status (generated,
fetched, parsed), or can be used safely for a certain processing step
(e.g., indexing).
|
SegmentMergeFilter |
Interface used to filter segments during segment merge.
|
SegmentMergeFilters |
This class wraps all SegmentMergeFilter extensions in a single object
so it is easier to operate on them.
|
SegmentMerger |
This tool takes several segments and merges their data together.
|
SegmentMerger.ObjectInputFormat |
Wraps inputs in a MetaWrapper to permit merging different types
in reduce and using additional metadata.
|
SegmentMerger.SegmentMergerMapper |
|
SegmentMerger.SegmentMergerReducer |
NOTE: in selecting the latest version we rely exclusively on the segment
name (not all segment data contain time information).
|
SegmentMerger.SegmentOutputFormat |
|
SegmentPart |
Utility class for handling information about segment parts.
|
SegmentReader |
Dump the content of a segment.
|
SegmentReader.InputCompatMapper |
|
SegmentReader.InputCompatReducer |
|
SegmentReader.SegmentReaderStats |
|
SegmentReader.TextOutputFormat |
Implements a text output format
|
SegmentReaderUtil |
|
SequenceReader |
Enables reading a sequence file; its methods provide different
ways to read the file.
|
ServiceConfig |
|
ServiceInfo |
|
ServicesResource |
The services resource defines an endpoint to enable the user to carry out
Nutch jobs like dump, commoncrawldump, etc.
|
ServiceWorker |
|
ShowProperties |
Tool to list properties and their values set by the current Nutch
configuration
|
Signature |
|
SignatureComparator |
|
SignatureFactory |
Factory class which instantiates a Signature implementation according to the
current Configuration.
|
SimilarityModel |
|
SimilarityScoringFilter |
|
SitemapProcessor |
Performs sitemap processing by fetching
sitemap links, parsing the content and merging the URLs from sitemaps (with
the metadata) into the CrawlDb.
|
SlashURLNormalizer |
|
SolrConstants |
|
SolrIndexWriter |
|
SolrUtils |
|
SpellCheckedMetadata |
A decorator to Metadata that adds spellchecking capabilities to property
names.
|
StaticFieldIndexer |
A simple plugin called at indexing that adds fields with static data.
|
StringUtil |
A collection of String processing utility methods.
|
Subcollection |
SubCollection represents a subset of the index; you can define URL patterns
that indicate that a particular page (URL) is part of the SubCollection.
|
SubcollectionIndexingFilter |
|
SuffixStringMatcher |
A class for efficiently matching Strings against a set of
suffixes.
|
SuffixURLFilter |
Filters URLs based on a file of URL suffixes.
|
TableUtil |
|
TextMD5Signature |
Implementation of a page signature.
|
TextProfileSignature |
An implementation of a page signature.
|
TikaParser |
Wrapper for Tika parsers.
|
TimingUtil |
|
TLDIndexingFilter |
Adds the top-level domain extensions to the index
|
TLDScoringFilter |
Scoring filter to boost top-level domains (TLDs).
|
TopLevelDomain |
(From wikipedia) A top-level domain (TLD) is the last part of an Internet
domain name; that is, the letters which follow the final dot of any domain
name.
|
TopLevelDomain.Type |
|
Train |
|
TrieStringMatcher |
TrieStringMatcher is a base class for simple tree-based string matching.
|
UpdateHostDb |
Tool to create a HostDB from the CrawlDB.
|
UpdateHostDbMapper |
Mapper ingesting HostDB and CrawlDB entries.
|
UpdateHostDbReducer |
|
URLExemptionFilter |
Interface used to allow exemptions to external domain resources by overriding db.ignore.external.links .
|
URLExemptionFilters |
|
URLFilter |
Interface used to limit which URLs enter Nutch.
|
URLFilterChecker |
Checks one given filter or all filters.
|
URLFilterException |
|
URLFilters |
Creates and caches plugins implementing URLFilter .
|
URLMetaIndexingFilter |
This is part of the URL Meta plugin.
|
URLMetaScoringFilter |
|
URLNormalizer |
Interface used to convert URLs to normal form and optionally perform
substitutions
|
URLNormalizerChecker |
Checks one given normalizer or all normalizers.
|
URLNormalizers |
This class uses a "chained filter" pattern to run defined normalizers.
|
URLPartitioner |
Partitions URLs by host, domain name, or IP, depending on the value of the
parameter 'partition.url.mode', which can be 'byHost', 'byDomain' or 'byIP'.
|
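A sketch of the by-host partitioning mode: the partition is derived from a hash of the host name modulo the number of reduce tasks, so all URLs of a host land in the same partition. Mode handling and the hashing details are simplified assumptions, not the class's exact logic.

    import java.net.URL;

    public class PartitionSketch {
      static int partitionByHost(String url, int numPartitions) throws Exception {
        String host = new URL(url).getHost().toLowerCase();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;   // non-negative bucket
      }

      public static void main(String[] args) throws Exception {
        System.out.println(partitionByHost("https://nutch.apache.org/docs", 16));
        System.out.println(partitionByHost("https://nutch.apache.org/api", 16));  // same host, same bucket
        System.out.println(partitionByHost("https://example.org/", 16));
      }
    }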
URLStreamHandlerFactory |
This URLStreamHandlerFactory knows about all the plugins
in use and thus can create the correct URLStreamHandler
even if it comes from a plugin classpath.
|
URLUtil |
Utility class for URL analysis
|
UrlValidator |
Validates URLs.
|
WARCExporter |
MapReduce job to export Nutch segments as WARC files.
|
WARCExporter.WARCMapReduce |
|
WARCExporter.WARCMapReduce.WARCMapper |
|
WARCExporter.WARCMapReduce.WARCReducer |
|
WARCUtils |
|
WebGraph |
Creates three databases, one for inlinks, one for outlinks, and a node
database that holds the number of inlinks and outlinks for a URL and the current
score for the URL.
|
WebGraph.OutlinkDb |
The OutlinkDb creates a database of all outlinks.
|
WebGraph.OutlinkDb.OutlinkDbMapper |
Passes through existing LinkDatum objects from an existing OutlinkDb and
maps out new LinkDatum objects from the ParseData of new crawls.
|
WebGraph.OutlinkDb.OutlinkDbReducer |
|
XMLCharacterRecognizer |
Class used to verify whether the specified ch conforms to the XML
1.0 definition of whitespace.
|
ZipParser |
ZipParser class based on MSPowerPointParser class by Stephan Strittmatter.
|
ZipTextExtractor |
|