All Classes
Class |
Description |
AbstractChecker |
Scaffolding class for the various Checker implementations.
|
AbstractCommonCrawlFormat |
Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
|
AbstractFetchSchedule |
This class provides common methods for implementations of
FetchSchedule .
|
AbstractResource |
|
AbstractScoringFilter |
|
AdaptiveFetchSchedule |
This class implements an adaptive re-fetch algorithm.
|
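A rough, standalone sketch of the adaptive re-fetch idea behind AdaptiveFetchSchedule: shorten the interval when the page is observed to have changed, lengthen it otherwise, and clamp the result to configured bounds. The constant names and values below are illustrative assumptions, not the plugin's actual configuration keys.

    // Illustrative only: field names and factors are assumed, not Nutch's.
    public class AdaptiveIntervalSketch {
      static final float INC_RATE = 0.2f;            // grow interval when page is unchanged
      static final float DEC_RATE = 0.2f;            // shrink interval when page has changed
      static final float MIN_INTERVAL = 60f;         // seconds
      static final float MAX_INTERVAL = 30f * 24 * 3600;

      static float nextInterval(float interval, boolean pageChanged) {
        if (pageChanged) {
          interval -= interval * DEC_RATE;           // re-fetch sooner next time
        } else {
          interval += interval * INC_RATE;           // back off
        }
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
      }

      public static void main(String[] args) {
        float daily = 24 * 3600f;
        System.out.println(nextInterval(daily, true));   // shorter than one day
        System.out.println(nextInterval(daily, false));  // longer than one day
      }
    }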
AdminResource |
|
AjaxURLNormalizer |
URLNormalizer capable of dealing with AJAX URLs.
|
AnchorIndexingFilter |
Indexing filter that offers an option to either index all inbound anchor text
for a document or deduplicate anchors.
|
ArbitraryIndexingFilter |
Adds arbitrary searchable fields to a document from the class and method
the user identifies in the config.
|
ArcInputFormat |
An input format that reads ARC files.
|
ArcRecordReader |
The ArcRecordReader class provides a record reader which reads
records from ARC files.
|
ArcSegmentCreator |
The ArcSegmentCreator is a replacement for the fetcher that takes
ARC files as input and produces a Nutch segment as output.
|
ArcSegmentCreator.ArcSegmentCreatorMapper |
|
AutomatonURLFilter |
RegexURLFilterBase implementation based on the dk.brics.automaton Finite-State
Automata for Java TM.
|
BasicIndexingFilter |
Adds basic searchable fields to a document.
|
BasicURLNormalizer |
Converts URLs to a normal form:
removes dot segments in the path (/./ or /../) and
removes default ports (e.g., port 80 for HTTP).
|
BlockedException |
|
CaseInsensitiveMetadata |
A decorator to Metadata that adds case-insensitive lookup of keys.
|
CCIndexingFilter |
Adds basic searchable fields to a document.
|
CCParseFilter |
Adds metadata identifying the Creative Commons license used, if any.
|
CCParseFilter.Walker |
Walks DOM tree, looking for RDF in comments and licenses in anchors.
|
CIDR |
Parse a CIDR block
notation and test whether an IP address is contained in the subnet range
defined by the CIDR.
|
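For illustration, a generic IPv4 containment check of the kind the CIDR class is described as performing; this is a self-contained sketch and does not reflect the class's actual API.

    // Generic IPv4 CIDR containment check (sketch, not Nutch's CIDR API).
    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class CidrSketch {
      static boolean contains(String cidr, String ip) throws UnknownHostException {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (toInt(parts[0]) & mask) == (toInt(ip) & mask);
      }

      static int toInt(String addr) throws UnknownHostException {
        byte[] b = InetAddress.getByName(addr).getAddress();
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
      }

      public static void main(String[] args) throws UnknownHostException {
        System.out.println(contains("192.168.0.0/16", "192.168.42.7")); // true
        System.out.println(contains("192.168.0.0/16", "10.0.0.1"));     // false
      }
    }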
CircularDependencyException |
CircularDependencyException will be thrown if a circular
dependency is detected.
|
Classify |
|
CleaningJob |
The class scans CrawlDB looking for entries with status DB_GONE (404) or
DB_DUPLICATE and sends delete requests to indexers for those documents.
|
CleaningJob.DBFilter |
|
CleaningJob.DeleterReducer |
|
Client |
Client.java encapsulates the functionality necessary for Nutch to get directory
listings and retrieve files from an FTP server.
|
CloudSearchConstants |
|
CloudSearchIndexWriter |
Writes documents to CloudSearch.
|
CloudSearchUtils |
|
CollectionManager |
|
CommandRunner |
|
CommonCrawlConfig |
|
CommonCrawlDataDumper |
The Common Crawl Data Dumper tool enables one to export the raw
content from Nutch segment data directories into a common crawl data
format consumed by many applications.
|
CommonCrawlFormat |
Interface for all CommonCrawl formatters.
|
CommonCrawlFormatFactory |
|
CommonCrawlFormatJackson |
This class provides methods to map crawled data to JSON using the Jackson Streaming API.
|
CommonCrawlFormatJettinson |
This class provides methods to map crawled data to JSON using the Jettison API.
|
CommonCrawlFormatSimple |
This class provides methods to map crawled data to JSON using a StringBuilder object.
|
CommonCrawlFormatWARC |
|
ConfigResource |
|
ConfManager |
|
ConfManagerImpl |
|
Content |
|
ContentAsTextInputFormat |
An input format that takes Nutch Content objects and converts them to text
while converting newline endings to spaces.
|
CosineSimilarity |
|
CrawlCompletionStats |
Extracts some simple crawl completion stats from the crawldb.
Stats will be sorted by host/domain and will be of the form:
1 www.spitzer.caltech.edu FETCHED
50 www.spitzer.caltech.edu UNFETCHED
|
CrawlCompletionStats.CrawlCompletionStatsCombiner |
|
CrawlDatum |
|
CrawlDatum.Comparator |
A Comparator optimized for CrawlDatum.
|
CrawlDatumProcessor |
These are instantiated once for each host.
|
CrawlDb |
This class takes the output of the fetcher and updates the crawldb
accordingly.
|
CrawlDbFilter |
This class provides a way to separate the URL normalization and filtering
steps from the rest of CrawlDb manipulation code.
|
CrawlDbMerger |
This tool merges several CrawlDb-s into one, optionally filtering URLs
through the current URLFilters, to skip prohibited pages.
|
CrawlDbMerger.Merger |
|
CrawlDbReader |
Read utility for the CrawlDB.
|
CrawlDbReader.CrawlDatumCsvOutputFormat |
|
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter |
|
CrawlDbReader.CrawlDatumJsonOutputFormat |
|
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter |
|
CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer |
|
CrawlDbReader.CrawlDbDumpMapper |
|
CrawlDbReader.CrawlDbStatMapper |
|
CrawlDbReader.CrawlDbStatReducer |
|
CrawlDbReader.CrawlDbTopNMapper |
|
CrawlDbReader.CrawlDbTopNReducer |
|
CrawlDbReader.JsonIndenter |
|
CrawlDbReducer |
Merge new page entries with existing entries.
|
CreativeCommons |
A collection of Creative Commons properties names.
|
CSVConstants |
|
CSVIndexWriter |
Write Nutch documents to a CSV file (comma separated values), i.e., dump
index as CSV or tab-separated plain text table.
|
DbQuery |
|
DbResource |
|
DebugParseFilter |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
DeduplicationJob |
Generic deduplicator which groups fetched URLs with the same digest and marks
all of them as duplicate except the one with the highest score (based on the
score in the crawldb, which is not necessarily the same as the score
indexed).
|
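A minimal sketch of the grouping idea described for DeduplicationJob: records sharing a digest form a group, the highest-scoring record is kept, and the rest are marked as duplicates. The Record type and the sample data are hypothetical, not Nutch's CrawlDatum model.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DedupSketch {
      record Record(String url, String digest, float score) {}

      public static void main(String[] args) {
        List<Record> records = List.of(
            new Record("http://a.example/", "d1", 1.5f),
            new Record("http://b.example/copy", "d1", 0.7f),
            new Record("http://c.example/", "d2", 1.0f));

        // Keep the highest-scoring record per digest.
        Map<String, Record> keep = new HashMap<>();
        for (Record r : records) {
          keep.merge(r.digest(), r, (a, b) -> a.score() >= b.score() ? a : b);
        }
        for (Record r : records) {
          boolean duplicate = keep.get(r.digest()) != r;
          System.out.println(r.url() + (duplicate ? " -> mark as duplicate" : " -> keep"));
        }
      }
    }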
DeduplicationJob.DBFilter |
|
DeduplicationJob.DedupReducer<K extends Writable> |
|
DeduplicationJob.StatusUpdateReducer |
Combine multiple new entries for a url.
|
DefalultMultiInteractionHandler |
This is a placeholder/example of a technique or use case where we perform multiple
interactions with the web driver and need data from each such interaction at the end.
|
DefaultClickAllAjaxLinksHandler |
This handler clicks all the anchor (&lt;a&gt;) tags
because it considers them not as usual links but as AJAX links/interactions.
|
DefaultFetchSchedule |
This class implements the default re-fetch schedule.
|
DefaultHandler |
|
DeflateUtils |
A collection of utility methods for working on deflated data.
|
DepthScoringFilter |
This scoring filter limits the number of hops from the initial seed urls.
|
DmozParser |
Utility that converts DMOZ
RDF into a flat file of URLs to be injected.
|
DocVector |
|
DomainDenylistURLFilter |
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
DomainStatistics |
Extracts some very basic statistics about domains from the crawldb
|
DomainStatistics.DomainStatisticsCombiner |
|
DomainStatistics.MyCounter |
|
DomainSuffix |
This class represents the last part of the host name, which is operated by
authorities, not individuals.
|
DomainSuffix.Status |
Enumeration of the status of the tld.
|
DomainSuffixes |
Storage class for DomainSuffix objects. Note: this class is a singleton.
|
DomainURLFilter |
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
DOMBuilder |
This class takes SAX events (in addition to some extra events that SAX
doesn't handle yet) and adds the result to a document or document fragment.
|
DOMContentUtils |
A collection of methods for extracting content from DOM trees.
|
DOMContentUtils |
A collection of methods for extracting content from DOM trees.
|
DOMContentUtils.LinkParams |
|
DomUtil |
|
DublinCore |
A collection of Dublin Core metadata names.
|
DummyConstants |
|
DummyIndexWriter |
DummyIndexWriter.
|
DummySSLProtocolSocketFactory |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DummyX509TrustManager |
|
DumpFileUtil |
|
ElasticConstants |
|
ElasticIndexWriter |
Sends NutchDocuments to a configured Elasticsearch index.
|
EncodingDetector |
A simple class for detecting character encodings.
|
Exchange |
|
ExchangeConfig |
|
Exchanges |
|
ExemptionUrlFilter |
This implementation of URLExemptionFilter uses a regex configuration
to check if a URL is eligible for exemption from 'db.ignore.external'.
|
Extension |
An Extension is a kind of listener descriptor that will be
installed on a concrete ExtensionPoint that acts as a kind of
Publisher.
|
ExtensionPoint |
The ExtensionPoint provides meta information about an extension
point.
|
ExtParser |
A wrapper that invokes an external command to do the real parsing job.
|
FastURLFilter |
Filters URLs based on a file of regular expressions using host/domains
matching first.
|
FastURLFilter.DenyAllRule |
Rule for DenyPath .* or DenyPath .?
|
FastURLFilter.DenyPathQueryRule |
|
FastURLFilter.DenyPathRule |
|
FastURLFilter.Rule |
|
Feed |
A collection of Feed property names extracted by the ROME library.
|
FeedIndexingFilter |
|
FeedParser |
|
Fetcher |
A queue-based fetcher.
|
Fetcher.FetcherRun |
|
Fetcher.InputFormat |
|
FetcherOutputFormat |
Splits FetcherOutput entries into multiple map files.
|
FetcherThread |
This class picks items from queues and fetches the pages.
|
FetcherThreadEvent |
This class is used to capture the various events occurring
at fetch time.
|
FetcherThreadEvent.PublishEventType |
Type of event to specify start, end or reporting of a fetch item.
|
FetcherThreadPublisher |
This class handles the publishing of the events to the queue implementation.
|
FetchItem |
This class describes the item to be fetched.
|
FetchItemQueue |
This class handles FetchItems which come from the same host ID (be it a
proto/hostname or proto/IP pair).
|
FetchItemQueues |
A collection of queues that keeps track of the total number of items, and
provides items eligible for fetching from any queue.
|
FetchNode |
|
FetchNodeDb |
|
FetchNodeDbInfo |
|
FetchOverdueCrawlDatumProcessor |
Simple custom crawl datum processor that counts the number of records that
are overdue for fetching.
|
FetchSchedule |
This interface defines the contract for implementations that manipulate fetch
times and re-fetch intervals.
|
FetchScheduleFactory |
|
FieldReplacer |
POJO to store a filename, its match pattern and its replacement string.
|
File |
This class is a protocol plugin used for file: scheme.
|
FileDumper |
The file dumper tool enables one to reverse generate the raw content from
Nutch segment data directories.
|
FileError |
Thrown for File error codes.
|
FileException |
|
FileResponse |
FileResponse.java mimics file replies as an HTTP response.
|
FreeGenerator |
This tool generates fetchlists (segments to be fetched) from plain text files
containing one URL per line.
|
FreeGenerator.FG |
|
FreeGenerator.FG.FGMapper |
|
FreeGenerator.FG.FGReducer |
|
FSUtils |
Utility methods for common filesystem operations.
|
Ftp |
This class is a protocol plugin used for ftp: scheme.
|
FtpError |
Thrown for Ftp error codes.
|
FtpException |
Superclass for important exceptions thrown during FTP talk, that must be
handled with care.
|
FtpExceptionBadSystResponse |
Exception indicating bad reply of SYST command.
|
FtpExceptionCanNotHaveDataConnection |
Exception indicating failure of opening data connection.
|
FtpExceptionControlClosedByForcedDataClose |
Exception indicating control channel is closed by server end, due to forced
closure of data channel at client (our) end.
|
FtpExceptionUnknownForcedDataClose |
Exception indicating unrecognizable reply from server after forced closure of
data channel by client (our) side.
|
FtpResponse |
FtpResponse.java mimics FTP replies as an HTTP response.
|
FtpRobotRulesParser |
This class is used for parsing robots.txt rules for URLs belonging to the FTP protocol.
|
Generator |
Generates a subset of a crawl db to fetch.
|
Generator.CrawlDbUpdater |
Update the CrawlDB so that the next generate won't include the same URLs.
|
Generator.CrawlDbUpdater.CrawlDbUpdateMapper |
|
Generator.CrawlDbUpdater.CrawlDbUpdateReducer |
|
Generator.DecreasingFloatComparator |
|
Generator.HashComparator |
Sort fetch lists by hash of URL.
|
Generator.PartitionReducer |
|
Generator.Selector |
Selects entries due for fetch.
|
Generator.SelectorEntry |
|
Generator.SelectorInverseMapper |
|
Generator.SelectorMapper |
Select and invert subset due for fetch.
|
Generator.SelectorReducer |
Collect until limit is reached.
|
GenericWritableConfigurable |
A generic Writable wrapper that can inject a Configuration into
Configurables.
|
GeoIPDocumentCreator |
|
GeoIPIndexingFilter |
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
|
GZIPUtils |
A collection of utility methods for working on GZIPed data.
|
HadoopFSUtil |
|
HeadingsParseFilter |
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
|
HostDatum |
|
HostURLNormalizer |
URL normalizer for mapping hosts to their desired form.
|
HTMLLanguageParser |
|
HTMLMetaProcessor |
Class for parsing META Directives from DOM trees.
|
HTMLMetaProcessor |
Class for parsing META Directives from DOM trees.
|
HTMLMetaTags |
This class holds the information about HTML "meta" tags extracted from a
page.
|
HtmlParseFilter |
Extension point for DOM-based HTML parsers.
|
HtmlParseFilters |
|
HtmlParser |
|
HtmlUnitWebDriver |
|
HtmlUnitWebWindowListener |
|
Http |
|
Http |
|
Http |
This class is a protocol plugin that configures an HTTP client for Basic,
Digest and NTLM authentication schemes for web server as well as proxy
server.
|
Http |
|
Http |
|
HttpAuthentication |
The base level of services required for Http Authentication
|
HttpAuthenticationException |
Can be used to identify problems during creation of Authentication objects.
|
HttpAuthenticationFactory |
Provides the Http protocol implementation with the ability to authenticate
when prompted.
|
HttpBase |
|
HttpBasicAuthentication |
Implementation of RFC 2617 Basic Authentication.
|
HttpDateFormat |
Parse and format HTTP dates in HTTP headers, e.g., used to fill the
"If-Modified-Since" request header field.
|
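As an illustration of the header-date handling described above, the standard library can format and parse RFC 1123 dates directly; this sketch is independent of the HttpDateFormat class itself.

    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    public class HttpDateSketch {
      public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        // Format a date the way it appears in HTTP headers.
        String header = DateTimeFormatter.RFC_1123_DATE_TIME.format(now);
        System.out.println("If-Modified-Since: " + header);

        // Parse it back and convert to epoch milliseconds.
        ZonedDateTime parsed =
            ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME);
        System.out.println(parsed.toInstant().toEpochMilli());
      }
    }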
HttpException |
|
HttpFormAuthConfigurer |
|
HttpFormAuthentication |
|
HttpHeaders |
A collection of HTTP header names.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
An HTTP response.
|
HttpResponse |
|
HttpResponse |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpResponse.Scheme |
|
HttpRobotRulesParser |
This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol.
|
HttpWebClient |
|
IndexerMapReduce |
This class is typically invoked from within
IndexingJob and handles all MapReduce
functionality required when undertaking indexing.
|
IndexerMapReduce.IndexerMapper |
|
IndexerMapReduce.IndexerReducer |
|
IndexerOutputFormat |
|
IndexingException |
|
IndexingFilter |
Extension point for indexing.
|
IndexingFilters |
|
IndexingFiltersChecker |
Reads and parses a URL and runs the indexers on it.
|
IndexingJob |
Generic indexer which relies on the plugins implementing IndexWriter
|
IndexWriter |
|
IndexWriterConfig |
|
IndexWriterParams |
|
IndexWriters |
|
Injector |
Injector takes a flat text file of URLs (or a folder containing text files)
and merges ("injects") these URLs into the CrawlDb.
|
Injector.InjectMapper |
InjectMapper reads the CrawlDb the seeds are injected into
and the plain-text seed files, and parses each line into the URL and
metadata.
|
Injector.InjectReducer |
Combine multiple new entries for a url.
|
Inlink |
An incoming link to a page.
|
Inlinks |
|
InteractiveSeleniumHandler |
|
IPFilterRules |
Optionally limit or block connections to IP address ranges
(localhost/loopback or site-local addresses, subnet ranges given in CIDR
notation, or single IP addresses).
|
JexlExchange |
|
JexlIndexingFilter |
An IndexingFilter that allows filtering of
documents based on a JEXL expression.
|
JexlUtil |
Utility methods for handling JEXL expressions
|
JobConfig |
Job-specific configuration.
|
JobFactory |
|
JobInfo |
This is the response object containing Job information
|
JobInfo.State |
|
JobManager |
|
JobManager.JobType |
|
JobManagerImpl |
|
JobResource |
|
JobWorker |
|
JSParseFilter |
This class is a heuristic link extractor for JavaScript files and code
snippets.
|
KafkaConstants |
|
KafkaIndexWriter |
Sends Nutch documents to a configured Kafka Cluster
|
LanguageIndexingFilter |
|
LinkAnalysisScoringFilter |
|
LinkDatum |
A class for holding link information including the url, anchor text, a score,
the timestamp of the link and a link type.
|
LinkDb |
Maintains an inverted link map, listing incoming links for each url.
|
LinkDb.LinkDbMapper |
|
LinkDbFilter |
This class provides a way to separate the URL normalization and filtering
steps from the rest of LinkDb manipulation code.
|
LinkDbMerger |
This tool merges several LinkDb-s into one, optionally filtering URLs through
the current URLFilters, to skip prohibited URLs and links.
|
LinkDbMerger.LinkDbMergeReducer |
|
LinkDbReader |
Read utility for the LinkDb.
|
LinkDbReader.LinkDBDumpMapper |
|
LinkDumper |
The LinkDumper tool creates a database of node to inlink information that can
be read using the nested Reader class.
|
LinkDumper.Inverter |
Inverts outlinks from the WebGraph to inlinks and attaches node
information.
|
LinkDumper.Inverter.InvertMapper |
Wraps all values in ObjectWritables.
|
LinkDumper.Inverter.InvertReducer |
Inverts outlinks to inlinks while attaching node information to the
outlink.
|
LinkDumper.LinkNode |
Bean class which holds url to node information.
|
LinkDumper.LinkNodes |
Writable class which holds an array of LinkNode objects.
|
LinkDumper.Merger |
Merges LinkNode objects into a single array value per url.
|
LinkDumper.Reader |
Reader class which will print out the url and all of its inlinks to system
out.
|
LinkRank |
|
LinkReader |
|
LinksIndexingFilter |
An IndexingFilter that adds
outlinks and inlinks field(s) to the document.
|
LockUtil |
Utility methods for handling application-level locking.
|
LuceneAnalyzerUtil |
Creates a custom analyzer based on user provided inputs
|
LuceneAnalyzerUtil.StemFilterType |
|
LuceneTokenizer |
|
LuceneTokenizer.TokenizerType |
|
MD5Signature |
Default implementation of a page signature.
|
Metadata |
A multi-valued metadata container.
|
MetadataIndexer |
Indexer which can be configured to extract metadata from the crawldb, parse
metadata or content metadata.
|
MetadataScoringFilter |
|
MetaTagsParser |
Parse HTML meta tags (keywords, description) and store them in the parse
metadata so that they can be indexed with the index-metadata plugin with the
prefix 'metatag.'.
|
MetaWrapper |
This is a simple decorator that adds metadata to any Writable-s that can be
serialized by NutchWritable .
|
MimeAdaptiveFetchSchedule |
Extension of AdaptiveFetchSchedule that allows for more flexible
configuration of DEC and INC factors for various MIME types.
|
MimeTypeIndexingFilter |
An IndexingFilter that allows filtering
of documents based on the MIME Type detected by Tika
|
MimeUtil |
This is a facade class to insulate Nutch from its underlying Mime Type
substrate library, Apache Tika.
|
MissingDependencyException |
MissingDependencyException will be thrown if a plugin dependency
cannot be found.
|
Model |
This class creates a model used to store Document vector representation of the corpus.
|
MoreIndexingFilter |
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
|
NaiveBayesParseFilter |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text, using a training
file where you can give positive and negative example texts (see the
description of parsefilter.naivebayes.trainfile). A link found irrelevant
gets a second chance if it contains any of the words from the list
given in parsefilter.naivebayes.wordlist.
|
Node |
A class which holds the number of inlinks and outlinks for a given url along
with an inlink score from a link analysis program and any metadata.
|
NodeDumper |
A tool that dumps out the top URLs by number of inlinks, number of outlinks,
or by score, to a text file.
|
NodeDumper.Dumper |
Outputs the hosts or domains with an associated value.
|
NodeDumper.Dumper.DumperMapper |
Outputs the host or domain as key for this record and numInlinks,
numOutlinks or score as the value.
|
NodeDumper.Dumper.DumperReducer |
Outputs either the sum or the top value for this record.
|
NodeDumper.Sorter |
Outputs the top urls sorted in descending order.
|
NodeDumper.Sorter.SorterMapper |
Outputs the URL with the appropriate number of inlinks or outlinks, or its
score.
|
NodeDumper.Sorter.SorterReducer |
Flips and collects the url and numeric sort value.
|
NodeReader |
Reads and prints to system out information for a single node from the NodeDb
in the WebGraph.
|
NodeReader |
|
NodeWalker |
A utility class that allows the walking of any DOM tree using a stack instead
of recursion.
|
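A standalone sketch of the technique NodeWalker is described as using: traversing a DOM tree with an explicit stack rather than recursion. The XML snippet is only sample input.

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DomWalkSketch {
      public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>hello</p><p>world</p></body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();

        // Depth-first walk with an explicit stack instead of recursion.
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
          Node current = stack.pop();
          System.out.println(current.getNodeName());
          NodeList children = current.getChildNodes();
          for (int i = children.getLength() - 1; i >= 0; i--) {
            stack.push(children.item(i));   // reverse order preserves document order
          }
        }
      }
    }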
Nutch |
A collection of Nutch internal metadata constants.
|
NutchConfig |
|
NutchConfiguration |
Utility to create Hadoop Configurations that include Nutch-specific
resources.
|
NutchDocument |
|
NutchField |
This class represents a multi-valued field with a weight.
|
NutchIndexAction |
A NutchIndexAction is the new unit of indexing holding the document
and action information.
|
NutchJob |
|
NutchPublisher |
All publisher subscriber model implementations should implement this interface.
|
NutchPublishers |
|
NutchReader |
|
NutchServer |
|
NutchServerInfo |
|
NutchServerPoolExecutor |
|
NutchTool |
|
NutchWritable |
|
ObjectCache |
|
OkHttp |
|
OkHttpResponse |
|
OkHttpResponse.TruncatedContent |
Container to store whether and why content has been truncated
|
OpenSearch1xConstants |
|
OpenSearch1xIndexWriter |
Sends NutchDocuments to a configured OpenSearch index.
|
OPICScoringFilter |
This plugin implements a variant of an Online Page Importance Computation
(OPIC) score, described in this paper:
Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive
On-Line Page Importance Computation.
|
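A toy illustration of the OPIC idea cited above: each page holds some "cash" that is split evenly among its outlinks on every pass, while the cash it has distributed accumulates in its history (an estimate of importance). The three-page graph and variable names are invented for the example; this is not the plugin's code.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OpicSketch {
      public static void main(String[] args) {
        Map<String, List<String>> outlinks = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> cash = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        Map<String, Double> history = new HashMap<>();

        for (int pass = 0; pass < 10; pass++) {
          Map<String, Double> next = new HashMap<>();
          outlinks.keySet().forEach(p -> next.put(p, 0.0));
          for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            double share = cash.get(e.getKey()) / e.getValue().size();
            history.merge(e.getKey(), cash.get(e.getKey()), Double::sum);
            for (String target : e.getValue()) {
              next.merge(target, share, Double::sum);   // distribute cash to outlinks
            }
          }
          cash = next;
        }
        System.out.println(history);   // accumulated importance estimates
      }
    }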
OrphanScoringFilter |
Orphan scoring filter that determines whether a page has become orphaned,
i.e., it is no longer linked to by other pages.
|
Outlink |
An outgoing link from a page.
|
OutlinkExtractor |
Extractor to extract Outlinks / URLs from
plain text using regular expressions.
|
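A minimal sketch of extracting URLs from plain text with a regular expression, as described for OutlinkExtractor; the pattern below is a simplified stand-in, not the one the class uses.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OutlinkSketch {
      // Deliberately simple pattern; real extractors handle more edge cases.
      private static final Pattern URL =
          Pattern.compile("(https?|ftp)://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

      public static void main(String[] args) {
        String text = "See https://nutch.apache.org/ and ftp://example.org/file for details.";
        Matcher m = URL.matcher(text);
        while (m.find()) {
          System.out.println(m.group());
        }
      }
    }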
Parse |
The result of parsing a page's raw content.
|
ParseData |
Data extracted from a page's content.
|
ParseException |
|
ParseImpl |
The result of parsing a page's raw content.
|
ParseOutputFormat |
|
Parser |
A parser for content generated by a
Protocol implementation.
|
ParserChecker |
Parser checker, useful for testing parser.
|
ParseResult |
A utility class that stores result of a parse.
|
ParserFactory |
Creates and caches Parser plugins.
|
ParserNotFound |
|
ParseSegment |
|
ParseSegment.ParseSegmentMapper |
|
ParseSegment.ParseSegmentReducer |
|
ParseStatus |
|
ParseText |
|
ParseUtil |
A utility class containing methods that simplify parsing tasks, such
as iterating through a preferred list of Parsers to obtain
Parse objects.
|
PassURLNormalizer |
This URLNormalizer doesn't change urls.
|
Pluggable |
Defines the capability of a class to be plugged into Nutch.
|
Plugin |
A nutch-plugin is a container for a set of custom logic that provides
extensions to the Nutch core functionality, or to another plugin that provides an
API for extending.
|
PluginClassLoader |
The PluginClassLoader is a child-first classloader that only
contains classes from the runtime libraries set up in the plugin manifest file
and the exported libraries of required plugins.
|
PluginDescriptor |
The PluginDescriptor provides access to all meta information of a
nutch-plugin, as well as to the internationalizable resources and the plugin's own
classloader.
|
PluginManifestParser |
The PluginManifestParser provides a mechanism for
parsing Nutch plugin manifest files ( plugin.xml ) contained
in a String of plugin directories.
|
PluginRepository |
The plugin repository is a registry of all plugins.
|
PluginRuntimeException |
PluginRuntimeException will be thrown when an exception in the
plugin management occurs.
|
PrefixStringMatcher |
A class for efficiently matching Strings against a set of
prefixes.
|
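A minimal trie-based prefix matcher illustrating the technique behind the prefix/suffix/trie string matchers listed here; it is a standalone sketch, not the classes' actual API.

    import java.util.HashMap;
    import java.util.Map;

    public class PrefixTrieSketch {
      static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
      }

      private final Node root = new Node();

      void add(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
          n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.terminal = true;
      }

      boolean matchesAnyPrefix(String s) {
        Node n = root;
        for (char c : s.toCharArray()) {
          if (n.terminal) return true;     // a stored prefix ends here
          n = n.children.get(c);
          if (n == null) return false;
        }
        return n.terminal;
      }

      public static void main(String[] args) {
        PrefixTrieSketch m = new PrefixTrieSketch();
        m.add("http://");
        m.add("https://");
        System.out.println(m.matchesAnyPrefix("https://nutch.apache.org/")); // true
        System.out.println(m.matchesAnyPrefix("mailto:user@example.org"));   // false
      }
    }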
PrefixURLFilter |
Filters URLs based on a file of URL prefixes.
|
PrintCommandListener |
This is a support class for logging all ftp command/reply traffic.
|
Protocol |
A retriever of url content.
|
ProtocolException |
Deprecated.
|
ProtocolException |
|
ProtocolFactory |
|
ProtocolLogUtil |
|
ProtocolNotFound |
|
ProtocolOutput |
Simple aggregate to pass from protocol plugins both content and protocol
status.
|
ProtocolStatus |
|
ProtocolStatusStatistics |
Extracts protocol status code information from the crawl database.
|
ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner |
|
ProtocolURLNormalizer |
URL normalizer to normalize the protocol for all URLs of a given host or
domain.
|
QuerystringURLNormalizer |
URL normalizer plugin for normalizing query strings by sorting query string
parameters.
|
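A simplified sketch of the normalization described above, sorting query-string parameters so that equivalent URLs compare equal; it ignores percent-encoding corner cases and is not the plugin's code.

    import java.util.Arrays;

    public class QuerySortSketch {
      static String normalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) return url;                       // no query string to sort
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params);
      }

      public static void main(String[] args) {
        System.out.println(normalize("http://example.org/page?b=2&a=1&c=3"));
        // -> http://example.org/page?a=1&b=2&c=3
      }
    }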
QueueFeeder |
This class feeds the queues with input items, and re-fills them as items
are consumed by FetcherThread-s.
|
RabbitIndexWriter |
|
RabbitMQClient |
Client for RabbitMQ
|
RabbitMQMessage |
|
RabbitMQPublisherImpl |
|
ReaderConfig |
|
ReaderResouce |
The Reader endpoint enables a user to read sequence files,
nodes and links from the Nutch webgraph.
|
ReadHostDb |
|
RegexParseFilter |
RegexParseFilter.
|
RegexRule |
A generic regular expression rule.
|
RegexURLFilter |
|
RegexURLFilterBase |
Generic URLFilter based on regular
expressions.
|
RegexURLNormalizer |
Allows users to do regex substitutions on all/any URLs that are encountered,
which is useful for stripping session IDs from URLs.
|
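As a sketch of the kind of rule this normalizer applies, the snippet below strips a jsessionid parameter with a regex substitution; the pattern is illustrative only, not one of the plugin's shipped rules.

    import java.util.regex.Pattern;

    public class SessionIdStripSketch {
      private static final Pattern SESSION_ID =
          Pattern.compile("(?i)([;?&])jsessionid=[^&#]*&?");

      static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$1")
            .replaceAll("[?&;]+$", "");   // drop any trailing separators
      }

      public static void main(String[] args) {
        System.out.println(normalize("http://example.org/a?jsessionid=ABC123&x=1"));
        System.out.println(normalize("http://example.org/a?x=1&jsessionid=ABC123"));
        // both print http://example.org/a?x=1
      }
    }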
RelTagIndexingFilter |
|
RelTagParser |
Adds microformat rel-tags of document if found.
|
ReplaceIndexer |
Do pattern replacements on selected field contents prior to indexing.
|
ResolverThread |
Simple runnable that performs DNS lookup for a single host.
|
ResolveUrls |
A simple tool that will spin up multiple threads to resolve urls to ip
addresses.
|
Response |
A response interface.
|
Response.TruncatedContentReason |
|
RobotRulesParser |
This class uses crawler-commons for handling the parsing of
robots.txt files.
|
ScoreUpdater |
Updates the score from the WebGraph node database into the crawl database.
|
ScoreUpdater.ScoreUpdaterMapper |
Changes input into ObjectWritables.
|
ScoreUpdater.ScoreUpdaterReducer |
Creates new CrawlDatum objects with the updated score from the NodeDb or
with a cleared score.
|
ScoringFilter |
A contract defining behavior of scoring plugins.
|
ScoringFilterException |
Specialized exception for errors during scoring.
|
ScoringFilters |
|
SeedList |
|
SeedManager |
|
SeedManagerImpl |
|
SeedResource |
|
SeedUrl |
|
SegmentChecker |
Checks whether a segment is valid, or has a certain status (generated,
fetched, parsed), or can be used safely for a certain processing step
(e.g., indexing).
|
SegmentMergeFilter |
Interface used to filter segments during segment merge.
|
SegmentMergeFilters |
This class wraps all SegmentMergeFilter extensions in a single object
so it is easier to operate on them.
|
SegmentMerger |
This tool takes several segments and merges their data together.
|
SegmentMerger.ObjectInputFormat |
Wraps inputs in a MetaWrapper to permit merging different types
in reduce and using additional metadata.
|
SegmentMerger.SegmentMergerMapper |
|
SegmentMerger.SegmentMergerReducer |
NOTE: in selecting the latest version we rely exclusively on the segment
name (not all segment data contain time information).
|
SegmentMerger.SegmentOutputFormat |
|
SegmentPart |
Utility class for handling information about segment parts.
|
SegmentReader |
Dump the content of a segment.
|
SegmentReader.InputCompatMapper |
|
SegmentReader.InputCompatReducer |
|
SegmentReader.SegmentReaderStats |
|
SegmentReader.TextOutputFormat |
Implements a text output format
|
SegmentReaderUtil |
|
SequenceReader |
Enables reading a sequence file; its methods provide different
ways to read the file.
|
ServiceConfig |
|
ServiceInfo |
|
ServicesResource |
The services resource defines an endpoint to enable the user to carry out
Nutch jobs like dump, commoncrawldump, etc.
|
ServiceWorker |
|
ShowProperties |
Tool to list properties and their values set by the current Nutch
configuration
|
Signature |
|
SignatureComparator |
|
SignatureFactory |
Factory class which instantiates a Signature implementation according to the
current Configuration.
|
SimilarityModel |
|
SimilarityScoringFilter |
|
SitemapProcessor |
Performs sitemap processing by fetching
sitemap links, parsing the content and merging the URLs from sitemaps (with
the metadata) into the CrawlDb.
|
SlashURLNormalizer |
|
SolrConstants |
|
SolrIndexWriter |
|
SolrUtils |
|
SpellCheckedMetadata |
A decorator to Metadata that adds spellchecking capabilities to property
names.
|
StaticFieldIndexer |
A simple plugin called at indexing that adds fields with static data.
|
StringUtil |
A collection of String processing utility methods.
|
Subcollection |
SubCollection represents a subset of the index; you can define URL patterns
that indicate that a particular page (URL) is part of the SubCollection.
|
SubcollectionIndexingFilter |
|
SuffixStringMatcher |
A class for efficiently matching Strings against a set of
suffixes.
|
SuffixURLFilter |
Filters URLs based on a file of URL suffixes.
|
TableUtil |
|
TextMD5Signature |
Implementation of a page signature.
|
TextProfileSignature |
An implementation of a page signature.
|
TikaParser |
Wrapper for Tika parsers.
|
TimingUtil |
|
TLDIndexingFilter |
Adds the top-level domain extensions to the index
|
TLDScoringFilter |
Scoring filter to boost top-level domains (TLDs).
|
TopLevelDomain |
(From wikipedia) A top-level domain (TLD) is the last part of an Internet
domain name; that is, the letters which follow the final dot of any domain
name.
|
TopLevelDomain.Type |
|
Train |
|
TrieStringMatcher |
TrieStringMatcher is a base class for simple tree-based string matching.
|
UpdateHostDb |
Tool to create a HostDB from the CrawlDB.
|
UpdateHostDbMapper |
Mapper ingesting HostDB and CrawlDB entries.
|
UpdateHostDbReducer |
|
URLExemptionFilter |
Interface used to allow exemptions to external domain resources by overriding db.ignore.external.links .
|
URLExemptionFilters |
|
URLFilter |
Interface used to limit which URLs enter Nutch.
|
URLFilterChecker |
Checks one given filter or all filters.
|
URLFilterException |
|
URLFilters |
Creates and caches plugins implementing URLFilter .
|
URLMetaIndexingFilter |
This is part of the URL Meta plugin.
|
URLMetaScoringFilter |
|
URLNormalizer |
Interface used to convert URLs to normal form and optionally perform
substitutions
|
URLNormalizerChecker |
Checks one given normalizer or all normalizers.
|
URLNormalizers |
This class uses a "chained filter" pattern to run defined normalizers.
|
URLPartitioner |
Partitions URLs by host, domain name, or IP, depending on the value of the
parameter 'partition.url.mode', which can be 'byHost', 'byDomain' or 'byIP'.
|
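A sketch of the by-host partitioning mode: the partition is derived from a hash of the host name modulo the number of reduce tasks, so all URLs of a host land in the same partition. Mode handling and the hashing details are simplified assumptions, not the class's exact logic.

    import java.net.URL;

    public class PartitionSketch {
      static int partitionByHost(String url, int numPartitions) throws Exception {
        String host = new URL(url).getHost().toLowerCase();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;   // non-negative bucket
      }

      public static void main(String[] args) throws Exception {
        System.out.println(partitionByHost("https://nutch.apache.org/docs", 16));
        System.out.println(partitionByHost("https://nutch.apache.org/api", 16));  // same host, same bucket
        System.out.println(partitionByHost("https://example.org/", 16));
      }
    }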
URLStreamHandlerFactory |
This URLStreamHandlerFactory knows about all the plugins
in use and thus can create the correct URLStreamHandler
even if it comes from a plugin classpath.
|
URLUtil |
Utility class for URL analysis
|
UrlValidator |
Validates URLs.
|
WARCExporter |
MapReduce job to export Nutch segments as WARC files.
|
WARCExporter.WARCMapReduce |
|
WARCExporter.WARCMapReduce.WARCMapper |
|
WARCExporter.WARCMapReduce.WARCReducer |
|
WARCUtils |
|
WebGraph |
Creates three databases, one for inlinks, one for outlinks, and a node
database that holds the number of inlinks and outlinks for a URL and the current
score for the URL.
|
WebGraph.OutlinkDb |
The OutlinkDb creates a database of all outlinks.
|
WebGraph.OutlinkDb.OutlinkDbMapper |
Passes through existing LinkDatum objects from an existing OutlinkDb and
maps out new LinkDatum objects from the ParseData of new crawls.
|
WebGraph.OutlinkDb.OutlinkDbReducer |
|
XMLCharacterRecognizer |
Class used to verify whether the specified ch conforms to the XML
1.0 definition of whitespace.
|
ZipParser |
ZipParser class based on MSPowerPointParser class by Stephan Strittmatter.
|
ZipTextExtractor |
|