
A

abort(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
 
abort(String, String) - Method in interface org.apache.nutch.service.JobManager
 
abort(String, String) - Method in class org.apache.nutch.service.resources.JobResource
 
AbstractChecker - Class in org.apache.nutch.util
Scaffolding class for the various Checker implementations.
AbstractChecker() - Constructor for class org.apache.nutch.util.AbstractChecker
 
AbstractCommonCrawlFormat - Class in org.apache.nutch.tools
Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
AbstractCommonCrawlFormat(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
AbstractFetchSchedule - Class in org.apache.nutch.crawl
This class provides common methods for implementations of FetchSchedule.
AbstractFetchSchedule() - Constructor for class org.apache.nutch.crawl.AbstractFetchSchedule
 
AbstractFetchSchedule(Configuration) - Constructor for class org.apache.nutch.crawl.AbstractFetchSchedule
 
AbstractResource - Class in org.apache.nutch.service.resources
 
AbstractResource() - Constructor for class org.apache.nutch.service.resources.AbstractResource
 
AbstractScoringFilter - Class in org.apache.nutch.scoring
 
AbstractScoringFilter() - Constructor for class org.apache.nutch.scoring.AbstractScoringFilter
 
accept - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The "Accept" request header value.
accept() - Method in class org.apache.nutch.urlfilter.api.RegexRule
Returns whether this rule is used for filtering in or filtering out.
accept(InetAddress) - Method in class org.apache.nutch.protocol.okhttp.IPFilterRules
 
acceptCharset - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The "Accept-Charset" request header value.
acceptLanguage - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The "Accept-Language" request header value.
ACCESS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Access denied - authorization required, but missing/incorrect.
action - Variable in class org.apache.nutch.indexer.NutchIndexAction
 
AdaptiveFetchSchedule - Class in org.apache.nutch.crawl
This class implements an adaptive re-fetch algorithm.
AdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
add(Object) - Method in class org.apache.nutch.indexer.NutchField
 
add(String, Object) - Method in class org.apache.nutch.indexer.NutchDocument
 
add(String, String) - Method in class org.apache.nutch.metadata.Metadata
Add a metadata name/value mapping.
add(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
 
add(Inlink) - Method in class org.apache.nutch.crawl.Inlinks
 
add(Inlinks) - Method in class org.apache.nutch.crawl.Inlinks
 
add(NutchDocument, String, String) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
 
ADD - Static variable in class org.apache.nutch.indexer.NutchIndexAction
 
addAll(Metadata) - Method in class org.apache.nutch.metadata.Metadata
Add all name/value mappings (merge two metadata mappings).
addAttribute(String, String) - Method in class org.apache.nutch.plugin.Extension
Adds an attribute; only used until model creation at plugin system startup.
addClue(String, String) - Method in class org.apache.nutch.util.EncodingDetector
 
addClue(String, String, int) - Method in class org.apache.nutch.util.EncodingDetector
 
addDependency(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
Adds a dependency
addEventData(String, Object) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Add new data to the eventData object.
addExportedLibRelative(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
Adds an exported library with a relative path to the plugin directory.
addExtension(Extension) - Method in class org.apache.nutch.plugin.ExtensionPoint
Installs a corresponding extension to this extension point.
addExtension(Extension) - Method in class org.apache.nutch.plugin.PluginDescriptor
Adds an extension.
addExtensionPoint(ExtensionPoint) - Method in class org.apache.nutch.plugin.PluginDescriptor
Adds an extension point.
addFetchItem(Text, CrawlDatum) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
addFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
addFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
addHeader(String, Object) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
addIfNotNull(NutchDocument, String, Object) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
Add field to document but only if value isn't null
addInProgressFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
addMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
Add metadata.
addNotExportedLibRelative(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
Adds a non-exported library with a relative path to the plugin directory.
addOutlinksToEventData(Collection<Outlink>) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Adds the given collection of outlinks to the outlink metadata.
addPatternBackward(String) - Method in class org.apache.nutch.util.TrieStringMatcher
Adds any necessary nodes to the trie so that the given String can be decoded in reverse and the first character is represented by a terminal node.
addPatternForward(String) - Method in class org.apache.nutch.util.TrieStringMatcher
Adds any necessary nodes to the trie so that the given String can be decoded and the last character is represented by a terminal node.
addRobotsContent(List<Content>, URL, Response) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
Append Content of robots.txt to robotsTxtContent
addUrlFeatures(NutchDocument, String) - Method in class org.creativecommons.nutch.CCIndexingFilter
Add the features represented by a license URL.
AdminResource - Class in org.apache.nutch.service.resources
 
AdminResource() - Constructor for class org.apache.nutch.service.resources.AdminResource
 
afterExecute(Runnable, Throwable) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
 
agentNames - Variable in class org.apache.nutch.protocol.RobotRulesParser
 
AJAX_URL_PART - Static variable in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
 
AjaxURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.ajax
URLNormalizer capable of dealing with AJAX URLs.
AjaxURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Default constructor.
allowForbidden - Variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
 
allowList - Variable in class org.apache.nutch.protocol.RobotRulesParser
Set of host names or IPs to be explicitly excluded from robots.txt checking.
analyze(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
Runs the complete link analysis job.
AnchorIndexingFilter - Class in org.apache.nutch.indexer.anchor
Indexing filter that offers an option to either index all inbound anchor text for a document or deduplicate anchors.
AnchorIndexingFilter() - Constructor for class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 
ANY - org.apache.nutch.service.model.response.JobInfo.State
 
append(Node) - Method in class org.apache.nutch.parse.html.DOMBuilder
Append a node to the current container.
ArbitraryIndexingFilter - Class in org.apache.nutch.indexer.arbitrary
Adds arbitrary searchable fields to a document from the class and method the user identifies in the config.
ArbitraryIndexingFilter() - Constructor for class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
 
ArcInputFormat - Class in org.apache.nutch.tools.arc
An input format that reads arc files.
ArcInputFormat() - Constructor for class org.apache.nutch.tools.arc.ArcInputFormat
 
ArcRecordReader - Class in org.apache.nutch.tools.arc
The ArcRecordReader class provides a record reader which reads records from arc files.
ArcRecordReader(Configuration, FileSplit) - Constructor for class org.apache.nutch.tools.arc.ArcRecordReader
Constructor that sets the configuration and file split.
ArcSegmentCreator - Class in org.apache.nutch.tools.arc
The ArcSegmentCreator is a replacement for the fetcher that takes arc files as input and produces a Nutch segment as output.
ArcSegmentCreator() - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator
 
ArcSegmentCreator(Configuration) - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator
Constructor that sets the job configuration.
ArcSegmentCreator.ArcSegmentCreatorMapper - Class in org.apache.nutch.tools.arc
 
ArcSegmentCreatorMapper() - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
 
areAvailableExchanges() - Method in class org.apache.nutch.exchange.Exchanges
 
ARG_CRAWLDB - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of crawldb for the REST endpoints
ARG_HOSTDB - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of hostdb for the REST endpoints
ARG_LINKDB - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of linkdb for the REST endpoints
ARG_SEEDDIR - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of the seed URL directory for the REST endpoints
ARG_SEEDNAME - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the name of a seed list for the REST endpoints
ARG_SEGMENTDIR - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of a directory of segments for the REST endpoints.
ARG_SEGMENTS - Static variable in interface org.apache.nutch.metadata.Nutch
Argument key to specify the location of an individual segment or a list of segments for the REST endpoints.
args - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
attrName - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
 
AUTH_HEADER_NAME - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
AUTH_HEADER_VALUE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
autoDetectClues(Content, boolean) - Method in class org.apache.nutch.util.EncodingDetector
 
AutomatonURLFilter - Class in org.apache.nutch.urlfilter.automaton
RegexURLFilterBase implementation based on the dk.brics.automaton Finite-State Automata for Java™.
AutomatonURLFilter() - Constructor for class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
AutomatonURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
autoResolveContentType(String, String, byte[]) - Method in class org.apache.nutch.util.MimeUtil
A facade interface for trying all the possible MIME type resolution strategies available within Tika.
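
The addPatternForward and addPatternBackward entries for TrieStringMatcher above describe inserting a string so that its last (or, in reverse, its first) character ends on a terminal node. The following is a minimal, self-contained sketch of that idea only. The class TrieSketch, its Node type, and the matchesPrefix/matchesSuffix helpers are hypothetical names chosen for illustration; this is not the Nutch implementation and does not use any Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a character trie with terminal nodes. */
class TrieSketch {

  /** One trie node; 'terminal' marks the end of an inserted pattern. */
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean terminal;
  }

  private final Node root = new Node();

  /** Inserts s first-to-last; the node for the last character is terminal. */
  void addPatternForward(String s) {
    Node node = root;
    for (int i = 0; i < s.length(); i++) {
      node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
    }
    node.terminal = true;
  }

  /** Inserts s last-to-first; the node for the first character is terminal. */
  void addPatternBackward(String s) {
    Node node = root;
    for (int i = s.length() - 1; i >= 0; i--) {
      node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
    }
    node.terminal = true;
  }

  /** True if a pattern added with addPatternForward is a prefix of input. */
  boolean matchesPrefix(String input) {
    Node node = root;
    for (int i = 0; i < input.length(); i++) {
      node = node.children.get(input.charAt(i));
      if (node == null) {
        return false;
      }
      if (node.terminal) {
        return true;
      }
    }
    return false;
  }

  /** True if a pattern added with addPatternBackward is a suffix of input. */
  boolean matchesSuffix(String input) {
    Node node = root;
    for (int i = input.length() - 1; i >= 0; i--) {
      node = node.children.get(input.charAt(i));
      if (node == null) {
        return false;
      }
      if (node.terminal) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Separate tries are used for forward and backward patterns in this sketch.
    TrieSketch prefixes = new TrieSketch();
    prefixes.addPatternForward("http://");
    System.out.println(prefixes.matchesPrefix("http://nutch.apache.org/"));

    TrieSketch suffixes = new TrieSketch();
    suffixes.addPatternBackward(".org");
    System.out.println(suffixes.matchesSuffix("nutch.apache.org"));
  }
}
```

Keeping forward and backward patterns in separate tries is a simplification made for this sketch; it avoids ambiguity about which direction a terminal node belongs to.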

B

BasicIndexingFilter - Class in org.apache.nutch.indexer.basic
Adds basic searchable fields to a document.
BasicIndexingFilter() - Constructor for class org.apache.nutch.indexer.basic.BasicIndexingFilter
 
BasicURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.basic
Converts URLs to a normal form: remove dot segments in path (/./ or /../), remove default ports, e.g.
BasicURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
BATCH_DUMP - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
 
beforeExecute(Thread, Runnable) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
 
bind(String, String, String, String, String, String) - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
Creates a relationship between an exchange and a queue.
BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Deprecated.
BlockedException - Exception in org.apache.nutch.protocol.http.api
 
BlockedException(String) - Constructor for exception org.apache.nutch.protocol.http.api.BlockedException
 
booleanValue() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
 
buffer - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
BUFFER_SIZE - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
 
BULK_CLOSE_TIMEOUT - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
BULK_CLOSE_TIMEOUT - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
bulkProcessorListener() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
Generates a default BulkProcessor.Listener
bulkProcessorListener() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
Generates a default BulkProcessor.Listener
bytes - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 

C

CACHE - Static variable in class org.apache.nutch.protocol.RobotRulesParser
 
CACHING_FORBIDDEN_ALL - Static variable in interface org.apache.nutch.metadata.Nutch
Don't show either original forbidden content or summaries.
CACHING_FORBIDDEN_CONTENT - Static variable in interface org.apache.nutch.metadata.Nutch
Don't show original forbidden content, but show summaries.
CACHING_FORBIDDEN_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
Sites may request that search engines don't provide access to cached documents.
CACHING_FORBIDDEN_NONE - Static variable in interface org.apache.nutch.metadata.Nutch
Show both original forbidden content and summaries (default).
calculate(Content, Parse) - Method in class org.apache.nutch.crawl.MD5Signature
 
calculate(Content, Parse) - Method in class org.apache.nutch.crawl.Signature
 
calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextMD5Signature
 
calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextProfileSignature
 
calculateLastFetchTime(CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
Returns the last fetch time of the CrawlDatum.
calculateLastFetchTime(CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
Calculates last fetch time of the given CrawlDatum.
canStop(boolean) - Method in class org.apache.nutch.service.NutchServer
 
CaseInsensitiveMetadata - Class in org.apache.nutch.metadata
A decorator for Metadata that adds case-insensitive lookup of keys.
CaseInsensitiveMetadata() - Constructor for class org.apache.nutch.metadata.CaseInsensitiveMetadata
Constructs a new, empty metadata.
CCIndexingFilter - Class in org.creativecommons.nutch
Adds basic searchable fields to a document.
CCIndexingFilter() - Constructor for class org.creativecommons.nutch.CCIndexingFilter
 
CCParseFilter - Class in org.creativecommons.nutch
Adds metadata identifying the Creative Commons license used, if any.
CCParseFilter() - Constructor for class org.creativecommons.nutch.CCParseFilter
 
CCParseFilter.Walker - Class in org.creativecommons.nutch
Walks DOM tree, looking for RDF in comments and licenses in anchors.
cdata(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of cdata.
CHAR_ENCODING_FOR_CONVERSION - Static variable in interface org.apache.nutch.metadata.Nutch
 
characters(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of character data.
charactersRaw(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
If available, when the disable-output-escaping attribute is used, output raw text without escaping.
chars - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
CHARSET_UTF8 - Static variable in class org.apache.nutch.parse.feed.FeedParser
 
checkAndReplace(String, String) - Method in class org.apache.nutch.indexer.replace.FieldReplacer
Return a replacement value for a field.
checkAny - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
 
checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
 
checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
 
checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
 
checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
 
checkExceptionThreshold(String) - Method in class org.apache.nutch.fetcher.FetchItemQueues
Increment the exception counter of a queue in case of an exception e.g.
checkExceptionThreshold(String, int, long) - Method in class org.apache.nutch.fetcher.FetchItemQueues
Increment the exception counter of a queue in case of an exception e.g.
checkFailed - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
checkKnown - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
checkNew - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
checkOutputSpecs(JobContext) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
 
checkOutputSpecs(JobContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
 
checkQueueMode(String) - Static method in class org.apache.nutch.fetcher.FetchItemQueues
Check whether queue mode is valid, fall-back to default mode if not.
checkRobotsTxt - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
checkRobotsTxt - Variable in class org.apache.nutch.parse.ParserChecker
 
checkSegmentDir(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
Check the segment to see if it is valid based on the sub directories.
checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
 
checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
 
checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
 
checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
 
checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
 
checkTimelimit() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
childLen - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
 
children - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
 
childrenList - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
 
chooseRepr(String, String, boolean) - Static method in class org.apache.nutch.util.URLUtil
Given two urls, a src and a destination of a redirect, it returns the representative url.
CIDR - Class in org.apache.nutch.protocol.okhttp
Parse a CIDR block notation and test whether an IP address is contained in the subnet range defined by the CIDR.
CIDR(String) - Constructor for class org.apache.nutch.protocol.okhttp.CIDR
 
CIDR(InetAddress, int) - Constructor for class org.apache.nutch.protocol.okhttp.CIDR
 
CircularDependencyException - Exception in org.apache.nutch.plugin
CircularDependencyException will be thrown if a circular dependency is detected.
CircularDependencyException(String) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
 
CircularDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
 
CLASS - org.apache.nutch.service.JobManager.JobType
 
CLASSIC - org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
 
classify(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Classify
 
classify(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
Classify - Class in org.apache.nutch.parsefilter.naivebayes
 
Classify() - Constructor for class org.apache.nutch.parsefilter.naivebayes.Classify
 
cleanField(String) - Static method in class org.apache.nutch.util.StringUtil
Simple character substitution which removes all � (U+FFFD, the Unicode replacement character) chars from a given String.
CleaningJob - Class in org.apache.nutch.indexer
The class scans CrawlDB looking for entries with status DB_GONE (404) or DB_DUPLICATE and sends delete requests to indexers for those documents.
CleaningJob() - Constructor for class org.apache.nutch.indexer.CleaningJob
 
CleaningJob.DBFilter - Class in org.apache.nutch.indexer
 
CleaningJob.DeleterReducer - Class in org.apache.nutch.indexer
 
cleanMimeType(String) - Static method in class org.apache.nutch.util.MimeUtil
Cleans a MimeType name by pulling out the actual MimeType from a string of the form:
cleanup(Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
 
cleanup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
 
cleanup(Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
Shut down all running threads and wait for completion.
cleanupAfterFailure(Path, FileSystem) - Static method in class org.apache.nutch.util.NutchJob
Clean up the file system in case of a job failure.
cleanupAfterFailure(Path, Path, FileSystem) - Static method in class org.apache.nutch.util.NutchJob
Clean up the file system in case of a job failure.
cleanUpDriver(WebDriver) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
 
cleanUpDriver(WebDriver) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
clear() - Method in class org.apache.nutch.crawl.Inlinks
 
clear() - Method in class org.apache.nutch.metadata.Metadata
Remove all mappings from metadata.
clearClues() - Method in class org.apache.nutch.util.EncodingDetector
Clears all clues.
Client - Class in org.apache.nutch.protocol.ftp
Client encapsulates the functionality Nutch needs to get a directory listing and retrieve files from an FTP server.
Client() - Constructor for class org.apache.nutch.protocol.ftp.Client
Public default constructor
CLIENT_TRANSFER_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
clone() - Method in class org.apache.nutch.crawl.CrawlDatum
 
clone() - Method in class org.apache.nutch.hostdb.HostDatum
 
clone() - Method in class org.apache.nutch.indexer.NutchDocument
 
clone() - Method in class org.apache.nutch.indexer.NutchField
 
close() - Method in class org.apache.nutch.crawl.CrawlDbReader
 
close() - Method in class org.apache.nutch.crawl.LinkDbReader
 
close() - Method in interface org.apache.nutch.indexer.IndexWriter
 
close() - Method in class org.apache.nutch.indexer.IndexWriters
 
close() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
close() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
close() - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
Closes the channel and the connection with the server.
close() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
close() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Closes the record reader resources.
close() - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
 
close() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Optional method that could be implemented if the actual format needs some close procedure.
close() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
close(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
 
close(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
 
closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
closeObject(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
closeReaders(MapFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
Closes a group of MapFile readers.
closeReaders(SequenceFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
Closes a group of SequenceFile readers.
CloudSearchConstants - Interface in org.apache.nutch.indexwriter.cloudsearch
 
CloudSearchIndexWriter - Class in org.apache.nutch.indexwriter.cloudsearch
Writes documents to CloudSearch.
CloudSearchIndexWriter() - Constructor for class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
CloudSearchUtils - Class in org.apache.nutch.indexwriter.cloudsearch
 
CloudSearchUtils() - Constructor for class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
 
COLLECTION - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
CollectionManager - Class in org.apache.nutch.collection
 
CollectionManager() - Constructor for class org.apache.nutch.collection.CollectionManager
Used for testing
CollectionManager(Configuration) - Constructor for class org.apache.nutch.collection.CollectionManager
 
COLONSP - Static variable in class org.apache.nutch.tools.WARCUtils
 
CommandRunner - Class in org.apache.nutch.util
 
CommandRunner() - Constructor for class org.apache.nutch.util.CommandRunner
 
comment(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
Report an XML comment anywhere in the document.
commit() - Method in interface org.apache.nutch.indexer.IndexWriter
 
commit() - Method in class org.apache.nutch.indexer.IndexWriters
 
commit() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
(nothing to commit)
commit() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
commit() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
COMMIT_SIZE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
CommonCrawlConfig - Class in org.apache.nutch.tools
 
CommonCrawlConfig() - Constructor for class org.apache.nutch.tools.CommonCrawlConfig
Default constructor
CommonCrawlConfig(InputStream) - Constructor for class org.apache.nutch.tools.CommonCrawlConfig
 
CommonCrawlDataDumper - Class in org.apache.nutch.tools
The Common Crawl Data Dumper tool enables one to reverse generate the raw content from Nutch segment data directories into a common crawling data format, consumed by many applications.
CommonCrawlDataDumper() - Constructor for class org.apache.nutch.tools.CommonCrawlDataDumper
Constructor
CommonCrawlDataDumper(CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlDataDumper
Configurable constructor
commoncrawlDump(ServiceConfig) - Method in class org.apache.nutch.service.resources.ServicesResource
 
CommonCrawlFormat - Interface in org.apache.nutch.tools
Interface for all CommonCrawl formatters.
CommonCrawlFormatFactory - Class in org.apache.nutch.tools
Factory class that creates new CommonCrawlFormat objects (a.k.a.
CommonCrawlFormatFactory() - Constructor for class org.apache.nutch.tools.CommonCrawlFormatFactory
 
CommonCrawlFormatJackson - Class in org.apache.nutch.tools
This class provides methods to map crawled data to JSON using the Jackson Streaming APIs.
CommonCrawlFormatJackson(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJackson
 
CommonCrawlFormatJackson(Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJackson
 
CommonCrawlFormatJettinson - Class in org.apache.nutch.tools
This class provides methods to map crawled data to JSON using the Jettison APIs.
CommonCrawlFormatJettinson(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
CommonCrawlFormatSimple - Class in org.apache.nutch.tools
This class provides methods to map crawled data to JSON using a StringBuilder object.
CommonCrawlFormatSimple(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatSimple
 
CommonCrawlFormatWARC - Class in org.apache.nutch.tools
 
CommonCrawlFormatWARC(String, Content, Metadata, Configuration, CommonCrawlConfig, ParseData) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatWARC
 
CommonCrawlFormatWARC(Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatWARC
 
Comparator() - Constructor for class org.apache.nutch.crawl.CrawlDatum.Comparator
 
compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.CrawlDatum.Comparator
 
compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
Compares two FloatWritables in decreasing order.
compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.HashComparator
 
compare(Object, Object) - Method in class org.apache.nutch.crawl.SignatureComparator
 
compare(WritableComparable, WritableComparable) - Method in class org.apache.nutch.crawl.Generator.HashComparator
 
compareOrder - Variable in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
compareTo(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
Sort two CrawlDatum objects by decreasing score.
compareTo(TrieStringMatcher.TrieNode) - Method in class org.apache.nutch.util.TrieStringMatcher.TrieNode
 
computeCosineSimilarity(DocVector) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
 
conf - Variable in class org.apache.nutch.crawl.Signature
 
conf - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
conf - Variable in class org.apache.nutch.plugin.Plugin
 
conf - Variable in class org.apache.nutch.protocol.RobotRulesParser
 
conf - Static variable in interface org.apache.nutch.service.NutchReader
 
conf - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
conf - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
configManager - Variable in class org.apache.nutch.service.resources.AbstractResource
 
ConfigResource - Class in org.apache.nutch.service.resources
 
ConfigResource() - Constructor for class org.apache.nutch.service.resources.ConfigResource
 
ConfManager - Interface in org.apache.nutch.service
 
ConfManagerImpl - Class in org.apache.nutch.service.impl
 
ConfManagerImpl() - Constructor for class org.apache.nutch.service.impl.ConfManagerImpl
 
CONFORMS_TO - Static variable in class org.apache.nutch.tools.WARCUtils
 
connectionFailures - Variable in class org.apache.nutch.hostdb.HostDatum
 
contains(InetAddress) - Method in class org.apache.nutch.protocol.okhttp.CIDR
 
containsWord(String, ArrayList<String>) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
content - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
Content - Class in org.apache.nutch.protocol
 
Content() - Constructor for class org.apache.nutch.protocol.Content
 
Content(String, String, byte[], String, Metadata, Configuration) - Constructor for class org.apache.nutch.protocol.Content
 
Content(String, String, byte[], String, Metadata, MimeUtil) - Constructor for class org.apache.nutch.protocol.Content
 
CONTENT_DISPOSITION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_LANGUAGE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_LENGTH - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_MD5 - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
 
CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
ContentAsTextInputFormat - Class in org.apache.nutch.segment
An input format that takes Nutch Content objects and converts them to text while converting newline endings to spaces.
ContentAsTextInputFormat() - Constructor for class org.apache.nutch.segment.ContentAsTextInputFormat
 
context - Variable in class org.apache.nutch.hostdb.ResolverThread
 
CONTRIBUTOR - Static variable in interface org.apache.nutch.metadata.DublinCore
An entity responsible for making contributions to the content of the resource.
COOKIE - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
 
CosineSimilarity - Class in org.apache.nutch.scoring.similarity.cosine
 
CosineSimilarity() - Constructor for class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
 
count(String) - Method in class org.apache.nutch.service.impl.LinkReader
 
count(String) - Method in class org.apache.nutch.service.impl.NodeReader
 
count(String) - Method in class org.apache.nutch.service.impl.SequenceReader
 
count(String) - Method in interface org.apache.nutch.service.NutchReader
 
count(CrawlDatum) - Method in interface org.apache.nutch.hostdb.CrawlDatumProcessor
Process a single crawl datum instance to aggregate custom counts.
count(CrawlDatum) - Method in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
COUNTRY - Enum constant in enum org.apache.nutch.util.domain.TopLevelDomain.Type
 
COVERAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
The extent or scope of the content of the resource.
CRAWL_ID_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
Used by the Nutch REST service.
CrawlCompletionStats - Class in org.apache.nutch.util
Extracts some simple crawl completion stats from the crawldb. Stats are sorted by host/domain and take the form: "1 www.spitzer.caltech.edu FETCHED"; "50 www.spitzer.caltech.edu UNFETCHED".
CrawlCompletionStats() - Constructor for class org.apache.nutch.util.CrawlCompletionStats
 
CrawlCompletionStats.CrawlCompletionStatsCombiner - Class in org.apache.nutch.util
 
CrawlCompletionStatsCombiner() - Constructor for class org.apache.nutch.util.CrawlCompletionStats.CrawlCompletionStatsCombiner
 
crawlDatum - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
CrawlDatum - Class in org.apache.nutch.crawl
 
CrawlDatum() - Constructor for class org.apache.nutch.crawl.CrawlDatum
 
CrawlDatum(int, int) - Constructor for class org.apache.nutch.crawl.CrawlDatum
 
CrawlDatum(int, int, float) - Constructor for class org.apache.nutch.crawl.CrawlDatum
 
CrawlDatum.Comparator - Class in org.apache.nutch.crawl
A Comparator optimized for CrawlDatum.
CrawlDatumCsvOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
 
CrawlDatumJsonOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat
 
CrawlDatumProcessor - Interface in org.apache.nutch.hostdb
These are instantiated once for each host.
crawlDatumProcessors - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
crawlDb - Variable in class org.apache.nutch.crawl.CrawlDbReader
 
CrawlDb - Class in org.apache.nutch.crawl
This class takes the output of the fetcher and updates the crawldb accordingly.
CrawlDb() - Constructor for class org.apache.nutch.crawl.CrawlDb
 
CrawlDb(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDb
 
CRAWLDB_ADDITIONS_ALLOWED - Static variable in class org.apache.nutch.crawl.CrawlDb
 
CRAWLDB_PURGE_404 - Static variable in class org.apache.nutch.crawl.CrawlDb
 
CRAWLDB_PURGE_ORPHANS - Static variable in class org.apache.nutch.crawl.CrawlDb
 
CrawlDbDumpMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
 
CrawlDbFilter - Class in org.apache.nutch.crawl
This class provides a way to separate the URL normalization and filtering steps from the rest of CrawlDb manipulation code.
CrawlDbFilter() - Constructor for class org.apache.nutch.crawl.CrawlDbFilter
 
CrawlDbMerger - Class in org.apache.nutch.crawl
This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
CrawlDbMerger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
 
CrawlDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
 
CrawlDbMerger.Merger - Class in org.apache.nutch.crawl
 
CrawlDbReader - Class in org.apache.nutch.crawl
Read utility for the CrawlDB.
CrawlDbReader() - Constructor for class org.apache.nutch.crawl.CrawlDbReader
 
CrawlDbReader.CrawlDatumCsvOutputFormat - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDatumJsonOutputFormat - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDbDumpMapper - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDbStatMapper - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDbStatReducer - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDbTopNMapper - Class in org.apache.nutch.crawl
 
CrawlDbReader.CrawlDbTopNReducer - Class in org.apache.nutch.crawl
 
CrawlDbReader.JsonIndenter - Class in org.apache.nutch.crawl
 
CrawlDbReducer - Class in org.apache.nutch.crawl
Merge new page entries with existing entries.
CrawlDbReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReducer
 
CrawlDbStatMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
 
CrawlDbStatReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
 
CrawlDbTopNMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
 
CrawlDbTopNReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
 
CrawlDbUpdateMapper() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateMapper
 
CrawlDbUpdater() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater
 
CrawlDbUpdateReducer() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
 
create() - Static method in class org.apache.nutch.util.NutchConfiguration
Create a Configuration for Nutch.
create(boolean, Properties) - Static method in class org.apache.nutch.util.NutchConfiguration
Create a Configuration from supplied properties.
create(Text, CrawlDatum, String) - Static method in class org.apache.nutch.fetcher.FetchItem
Create an item.
create(Text, CrawlDatum, String, int) - Static method in class org.apache.nutch.fetcher.FetchItem
Create an item.
create(JobConfig) - Method in class org.apache.nutch.service.impl.JobManagerImpl
 
create(JobConfig) - Method in interface org.apache.nutch.service.JobManager
Creates the specified job.
create(JobConfig) - Method in class org.apache.nutch.service.resources.JobResource
Create a new job.
create(NutchConfig) - Method in interface org.apache.nutch.service.ConfManager
 
create(NutchConfig) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
Creates a new configuration based on the values provided.
createChromeRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createChromeWebDriver(String, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createComponents(String) - Method in class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
 
createConfig(NutchConfig) - Method in class org.apache.nutch.service.resources.ConfigResource
Create new configuration.
createDefaultRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createDocFromCityDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromCityService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromConnectionDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromCountryService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromDomainDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromInsightsService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocFromIspDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
createDocVector(String, int, int) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
Used to create a DocVector from given String text.
createFileName(String, String, String) - Static method in class org.apache.nutch.util.DumpFileUtil
 
createFileNameFromUrl(String, String, String, String, String, boolean) - Static method in class org.apache.nutch.util.DumpFileUtil
 
createFirefoxRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createFirefoxWebDriver(String, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createJob(Configuration, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
 
createKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Creates a new instance of the Text object for the key.
createLockFile(Configuration, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
Create a lock file.
createLockFile(FileSystem, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
Create a lock file.
createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
 
createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.LinkDbMerger
 
createModel(Configuration) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
 
createParseResult(String, Parse) - Static method in class org.apache.nutch.parse.ParseResult
Convenience method for obtaining ParseResult from a single Parse output.
createRandomRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
createRecordReader(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
 
createRecordReader(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
 
createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Creates a new RegexRule.
createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Creates a new RegexRule.
createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
createSeedFile(SeedList) - Method in class org.apache.nutch.service.resources.SeedResource
Creates a seed list file and returns the temporary directory path.
createSegments(Path, Path) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
Creates the job that converts arc files to segments.
createSocket(String, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
 
createSocket(String, int, InetAddress, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
 
createSocket(String, int, InetAddress, int, HttpConnectionParams) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
Attempts to get a new socket connection to the given host within the given time limit.
createSocket(Socket, String, int, boolean) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
 
createSubCollection(String, String) - Method in class org.apache.nutch.collection.CollectionManager
Create a new subcollection.
createToolByClassName(String, Configuration) - Method in class org.apache.nutch.service.impl.JobFactory
 
createToolByType(JobManager.JobType, Configuration) - Method in class org.apache.nutch.service.impl.JobFactory
 
createTwoLevelsDirectory(String, String) - Static method in class org.apache.nutch.util.DumpFileUtil
 
createTwoLevelsDirectory(String, String, boolean) - Static method in class org.apache.nutch.util.DumpFileUtil
 
createURLStreamHandler(String) - Method in class org.apache.nutch.plugin.PluginRepository
Invoked whenever a URL needs to be instantiated.
createURLStreamHandler(String) - Method in class org.apache.nutch.plugin.URLStreamHandlerFactory
 
createValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Creates a new instance of the BytesWritable object for the value.
createWebGraph(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
Creates the three different WebGraph databases, Outlinks, Inlinks, and Node.
CreativeCommons - Interface in org.apache.nutch.metadata
A collection of Creative Commons property names.
CREATOR - Static variable in interface org.apache.nutch.metadata.DublinCore
An entity primarily responsible for making the content of the resource.
CRLF - Static variable in class org.apache.nutch.tools.WARCUtils
 
CSV_CHARSET - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_ESCAPECHARACTER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_FIELD_SEPARATOR - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_FIELDS - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_MAXFIELDLENGTH - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_MAXFIELDVALUES - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_OUTPATH - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_QUOTECHARACTER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_VALUESEPARATOR - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSV_WITHHEADER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
 
CSVConstants - Interface in org.apache.nutch.indexwriter.csv
 
CSVIndexWriter - Class in org.apache.nutch.indexwriter.csv
Writes Nutch documents to a CSV (comma-separated values) file, i.e., dumps the index as a CSV or tab-separated plain-text table.
CSVIndexWriter() - Constructor for class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
CSVIndexWriter.Separator - Class in org.apache.nutch.indexwriter.csv
Represents separators (also quote and escape characters) as char(s) and byte(s) in the output encoding for efficiency.
csvout - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
CURRENT_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
 
CURRENT_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
 
CURRENT_NAME - Static variable in class org.apache.nutch.util.SitemapProcessor
 
currentJob - Variable in class org.apache.nutch.util.NutchTool
 
currentJobNum - Variable in class org.apache.nutch.util.NutchTool
 

D

DATE - Static variable in interface org.apache.nutch.metadata.DublinCore
A date associated with an event in the life cycle of the resource.
dateFormatStr - Static variable in class org.apache.nutch.indexer.feed.FeedIndexingFilter
 
datum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
 
datum - Variable in class org.apache.nutch.hostdb.ResolverThread
 
DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE - Static variable in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
 
DBFilter() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DBFilter
 
DBFilter() - Constructor for class org.apache.nutch.indexer.CleaningJob.DBFilter
 
DbQuery - Class in org.apache.nutch.service.model.request
 
DbQuery() - Constructor for class org.apache.nutch.service.model.request.DbQuery
 
DbResource - Class in org.apache.nutch.service.resources
 
DbResource() - Constructor for class org.apache.nutch.service.resources.DbResource
 
DebugParseFilter - Class in org.apache.nutch.parsefilter.debug
Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
DebugParseFilter() - Constructor for class org.apache.nutch.parsefilter.debug.DebugParseFilter
 
DEC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
DecreasingFloatComparator() - Constructor for class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
 
DEDUP - Enum constant in enum org.apache.nutch.service.JobManager.JobType
 
DEDUPLICATION_COMPARE_ORDER - Static variable in class org.apache.nutch.crawl.DeduplicationJob
 
DEDUPLICATION_GROUP_MODE - Static variable in class org.apache.nutch.crawl.DeduplicationJob
 
DeduplicationJob - Class in org.apache.nutch.crawl
Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed).
DeduplicationJob() - Constructor for class org.apache.nutch.crawl.DeduplicationJob
 
DeduplicationJob.DBFilter - Class in org.apache.nutch.crawl
 
DeduplicationJob.DedupReducer<K extends Writable> - Class in org.apache.nutch.crawl
 
DeduplicationJob.StatusUpdateReducer - Class in org.apache.nutch.crawl
Combines multiple new entries for a URL.
DedupReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
DefalultMultiInteractionHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
This is a placeholder/example of a technique or use case where we perform multiple interactions with the web driver and need the data from each interaction at the end.
DefalultMultiInteractionHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
 
DEFAULT - Static variable in class org.apache.nutch.service.resources.ConfigResource
 
DEFAULT_BOOST - Static variable in class org.apache.nutch.util.domain.DomainSuffix
 
DEFAULT_FILE_NAME - Static variable in class org.apache.nutch.collection.CollectionManager
 
DEFAULT_ID - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
 
DEFAULT_MAX_DEPTH - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
DEFAULT_PLUGIN - Static variable in class org.apache.nutch.parse.ParserFactory
Wildcard for default plugins.
DEFAULT_STATUS - Static variable in class org.apache.nutch.util.domain.DomainSuffix
 
DefaultClickAllAjaxLinksHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
This handler clicks all the tags because it treats them as AJAX links/interactions rather than ordinary links.
DefaultClickAllAjaxLinksHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
 
DefaultFetchSchedule - Class in org.apache.nutch.crawl
This class implements the default re-fetch schedule.
DefaultFetchSchedule() - Constructor for class org.apache.nutch.crawl.DefaultFetchSchedule
 
DefaultHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
 
DefaultHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
 
defaultInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
 
defaultProtocolImplMapping - Variable in class org.apache.nutch.protocol.ProtocolFactory
 
DEFER_VISIT_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code.
deferVisits503 - Variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
 
deflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
Returns a deflated copy of the input array.
DeflateUtils - Class in org.apache.nutch.util
A collection of utility methods for working on deflated data.
DeflateUtils() - Constructor for class org.apache.nutch.util.DeflateUtils
 
delete(String) - Method in interface org.apache.nutch.indexer.IndexWriter
 
delete(String) - Method in class org.apache.nutch.indexer.IndexWriters
 
delete(String) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
(deletion of documents is not supported)
delete(String) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
delete(String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
delete(String) - Method in interface org.apache.nutch.service.ConfManager
 
delete(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
 
delete(String, boolean) - Method in class org.apache.nutch.indexer.CleaningJob
 
DELETE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
 
DELETE - Static variable in interface org.apache.nutch.indexwriter.dummy.DummyConstants
 
deleteConfig(String) - Method in class org.apache.nutch.service.resources.ConfigResource
Removes the configuration from the list of known configurations.
DELETED - Enum constant in enum org.apache.nutch.util.domain.DomainSuffix.Status
 
DeleterReducer() - Constructor for class org.apache.nutch.indexer.CleaningJob.DeleterReducer
 
deleteSeedList(String) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
 
deleteSeedList(String) - Method in interface org.apache.nutch.service.SeedManager
 
deleteSubCollection(String) - Method in class org.apache.nutch.collection.CollectionManager
Deletes the named subcollection.
DenyPathQueryRule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathQueryRule
 
DenyPathRule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathRule
 
DEPRECATED - Enum constant in enum org.apache.nutch.util.domain.DomainSuffix.Status
 
DEPTH_KEY - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
DEPTH_KEY_W - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
DepthScoringFilter - Class in org.apache.nutch.scoring.depth
This scoring filter limits the number of hops from the initial seed urls.
DepthScoringFilter() - Constructor for class org.apache.nutch.scoring.depth.DepthScoringFilter
 
describe() - Method in interface org.apache.nutch.indexer.IndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexer.IndexWriters
Lists the active IndexWriters and their configuration.
describe() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
describe() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
describe() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
Returns a Map with the specific parameters the IndexWriter instance can take.
DESCRIPTION - Static variable in interface org.apache.nutch.metadata.DublinCore
An account of the content of the resource.
DICTFILE_MODELFILTER - Static variable in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
DIR_NAME - Static variable in class org.apache.nutch.parse.ParseData
 
DIR_NAME - Static variable in class org.apache.nutch.parse.ParseText
 
DIR_NAME - Static variable in class org.apache.nutch.protocol.Content
 
disconnect() - Method in class org.apache.nutch.protocol.ftp.Client
Closes the connection to the FTP server and restores connection parameters to the default values.
DISCONNECT - Enum constant in enum org.apache.nutch.net.protocols.Response.TruncatedContentReason
network disconnect or timeout during fetch
displayFileTypes(Map<String, Integer>, Map<String, Integer>) - Static method in class org.apache.nutch.util.DumpFileUtil
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
Takes the metadata listed in your "scoring.parse.md" property and looks for it inside the parseData object.
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.ScoringFilter
Distribute score value from the current page to all its outlinked pages.
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.ScoringFilters
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
 
distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
DmozParser - Class in org.apache.nutch.tools
Utility that converts DMOZ RDF into a flat file of URLs to be injected.
DmozParser() - Constructor for class org.apache.nutch.tools.DmozParser
 
dnsFailures - Variable in class org.apache.nutch.hostdb.HostDatum
 
doc - Variable in class org.apache.nutch.indexer.NutchIndexAction
 
docToMetadata(NutchDocument) - Static method in class org.apache.nutch.tools.WARCUtils
 
DocVector - Class in org.apache.nutch.scoring.similarity.cosine
 
DocVector() - Constructor for class org.apache.nutch.scoring.similarity.cosine.DocVector
 
docVectors - Static variable in class org.apache.nutch.scoring.similarity.cosine.Model
 
doIndex - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
DomainDenylistURLFilter - Class in org.apache.nutch.urlfilter.domaindenylist
Filters URLs based on a file containing domain suffixes, domain names, and hostnames.
DomainDenylistURLFilter() - Constructor for class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
 
DomainStatistics - Class in org.apache.nutch.util.domain
Extracts some very basic statistics about domains from the crawldb.
DomainStatistics() - Constructor for class org.apache.nutch.util.domain.DomainStatistics
 
DomainStatistics.DomainStatisticsCombiner - Class in org.apache.nutch.util.domain
 
DomainStatistics.MyCounter - Enum in org.apache.nutch.util.domain
 
DomainStatisticsCombiner() - Constructor for class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
 
DomainSuffix - Class in org.apache.nutch.util.domain
This class represents the last part of the host name, which is operated by authorities, not individuals.
DomainSuffix(String) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
 
DomainSuffix(String, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
 
DomainSuffix.Status - Enum in org.apache.nutch.util.domain
Enumeration of the status of the tld.
DomainSuffixes - Class in org.apache.nutch.util.domain
Storage class for DomainSuffix objects. Note: this class is a singleton.
DomainURLFilter - Class in org.apache.nutch.urlfilter.domain
Filters URLs based on a file containing domain suffixes, domain names, and hostnames.
DomainURLFilter() - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
 
DOMBuilder - Class in org.apache.nutch.parse.html
This class takes SAX events (in addition to some extra events that SAX doesn't handle yet) and adds the result to a document or document fragment.
DOMBuilder(Document) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
DOMBuilder instance constructor...
DOMBuilder(Document, DocumentFragment) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
DOMBuilder instance constructor...
DOMBuilder(Document, Node) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
DOMBuilder instance constructor...
DOMContentUtils - Class in org.apache.nutch.parse.html
A collection of methods for extracting content from DOM trees.
DOMContentUtils - Class in org.apache.nutch.parse.tika
A collection of methods for extracting content from DOM trees.
DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils
 
DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.tika.DOMContentUtils
 
DOMContentUtils.LinkParams - Class in org.apache.nutch.parse.html
 
DomUtil - Class in org.apache.nutch.util
 
DomUtil() - Constructor for class org.apache.nutch.util.DomUtil
 
dotProduct(DocVector) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
DublinCore - Interface in org.apache.nutch.metadata
A collection of Dublin Core metadata names.
DummyConstants - Interface in org.apache.nutch.indexwriter.dummy
 
DummyIndexWriter - Class in org.apache.nutch.indexwriter.dummy
DummyIndexWriter.
DummyIndexWriter() - Constructor for class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
DummySSLProtocolSocketFactory - Class in org.apache.nutch.protocol.httpclient
 
DummySSLProtocolSocketFactory() - Constructor for class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
Constructor for DummySSLProtocolSocketFactory.
DummyX509TrustManager - Class in org.apache.nutch.protocol.htmlunit
 
DummyX509TrustManager - Class in org.apache.nutch.protocol.http
 
DummyX509TrustManager - Class in org.apache.nutch.protocol.httpclient
 
DummyX509TrustManager - Class in org.apache.nutch.protocol.interactiveselenium
 
DummyX509TrustManager - Class in org.apache.nutch.protocol.selenium
 
DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
Constructor for DummyX509TrustManager.
DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.http.DummyX509TrustManager
Constructor for DummyX509TrustManager.
DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
Constructor for DummyX509TrustManager.
DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
Constructor for DummyX509TrustManager.
DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.selenium.DummyX509TrustManager
Constructor for DummyX509TrustManager.
dump() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
dump() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
dump(File, File, File, boolean, String[], boolean, String, boolean) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
Dumps the reverse-engineered CBOR content from the provided segment directories if a parent directory contains more than one segment; otherwise a single segment can be passed as an argument.
dump(File, File, String[], boolean, boolean, boolean) - Method in class org.apache.nutch.tools.FileDumper
Dumps the reverse-engineered raw content from the provided segment directories if a parent directory contains more than one segment; otherwise a single segment can be passed as an argument.
dump(Path, Path) - Method in class org.apache.nutch.segment.SegmentReader
 
DUMP_DIR - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
 
Dumper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
 
DumperMapper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
 
DumperReducer() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
 
DumpFileUtil - Class in org.apache.nutch.util
 
DumpFileUtil() - Constructor for class org.apache.nutch.util.DumpFileUtil
 
dumpLinks(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
Runs the inverter and merger jobs of the LinkDumper tool to create the url to inlink node database.
dumpNodes(Path, NodeDumper.DumpType, long, Path, boolean, NodeDumper.NameType, NodeDumper.AggrType, boolean) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
Runs the process to dump the top urls out to a text file.
dumpText - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
dumpText - Variable in class org.apache.nutch.parse.ParserChecker
 
dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.NodeReader
Prints the content of the Node represented by the url to system out.
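
The DomainURLFilter entry above describes filtering URLs against a list of domain suffixes, domain names, and hostnames. The following is a minimal sketch of that idea, not the plugin's actual implementation: the class name DomainFilterSketch and the in-memory entry set are assumptions made for illustration (the real plugin reads its entries from a file). It follows the URLFilter convention listed in this index of returning the URL to keep it and null to drop it.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Nutch.
public class DomainFilterSketch {
    // Assumed in-memory list of suffixes, domains and hostnames.
    private final Set<String> entries;

    public DomainFilterSketch(Set<String> entries) {
        this.entries = entries;
    }

    /** Returns the URL unchanged if its host matches an entry, otherwise null. */
    public String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: drop it
        }
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        // Check the full host name first, then each shorter
        // dot-separated suffix (domain, then TLD).
        while (true) {
            if (entries.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

With the entries "example.org" and "net", this sketch keeps http://www.example.org/a and http://foo.net/ and drops http://example.com/. A deny-list variant such as DomainDenylistURLFilter would presumably invert the outcome, returning null on a match.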

E

elapsedTime(long, long) - Static method in class org.apache.nutch.util.TimingUtil
Calculate the elapsed time between two times specified in milliseconds.
ElasticConstants - Interface in org.apache.nutch.indexwriter.elastic
 
ElasticIndexWriter - Class in org.apache.nutch.indexwriter.elastic
Sends NutchDocuments to a configured Elasticsearch index.
ElasticIndexWriter() - Constructor for class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
elName - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
 
EMPTY_RESULT - org.apache.nutch.util.domain.DomainStatistics.MyCounter
 
EMPTY_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
emptyMetaDataWritableSerialized - Static variable in class org.apache.nutch.hostdb.HostDatum
 
emptyQueue() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
emptyQueues() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
enableCookieHeader - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Controls whether or not to set Cookie HTTP header based on CrawlDatum metadata
enableIfModifiedsinceHeader - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Configuration directive for If-Modified-Since HTTP header
encoding - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
Encoding of the CSV file.
EncodingDetector - Class in org.apache.nutch.util
A simple class for detecting character encodings.
EncodingDetector(Configuration) - Constructor for class org.apache.nutch.util.EncodingDetector
 
end - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
END - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
 
endCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the end of a CDATA section.
endDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of the end of a document.
endDTD() - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the end of DTD declarations.
endElement(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of the end of an element.
endEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the end of an entity.
ENDPOINT - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
 
endPrefixMapping(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
End the scope of a prefix-URI mapping.
ENGLISHMINIMALSTEM_FILTER - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
 
entityReference(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of an entityReference.
EQUAL_CHARACTER - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
 
equals(Object) - Method in class org.apache.nutch.crawl.CrawlDatum
 
equals(Object) - Method in class org.apache.nutch.crawl.Inlink
 
equals(Object) - Method in class org.apache.nutch.metadata.Metadata
 
equals(Object) - Method in class org.apache.nutch.parse.Outlink
 
equals(Object) - Method in class org.apache.nutch.parse.ParseData
 
equals(Object) - Method in class org.apache.nutch.parse.ParseStatus
 
equals(Object) - Method in class org.apache.nutch.parse.ParseText
 
equals(Object) - Method in class org.apache.nutch.plugin.PluginClassLoader
 
equals(Object) - Method in class org.apache.nutch.protocol.Content
 
equals(Object) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
 
equals(Object) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
equals(Object) - Method in class org.apache.nutch.service.model.request.SeedList
 
equals(Object) - Method in class org.apache.nutch.service.model.request.SeedUrl
 
escape(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Escape some exotic characters in the fragment part
ESCAPED_URL_PART - Static variable in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
 
evaluate() - Method in class org.apache.nutch.util.CommandRunner
 
EXCEPTION - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Unspecified exception occurred.
Exchange - Interface in org.apache.nutch.exchange
 
ExchangeConfig - Class in org.apache.nutch.exchange
 
Exchanges - Class in org.apache.nutch.exchange
 
Exchanges(Configuration) - Constructor for class org.apache.nutch.exchange.Exchanges
 
exec() - Method in class org.apache.nutch.util.CommandRunner
Execute the command
execute(JexlScript, String) - Method in class org.apache.nutch.crawl.CrawlDatum
 
executor - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
ExemptionUrlFilter - Class in org.apache.nutch.urlfilter.ignoreexempt
This implementation of URLExemptionFilter uses regex configuration to check if URL is eligible for exemption from 'db.ignore.external'.
ExemptionUrlFilter() - Constructor for class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
 
EXPONENTIAL_BACKOFF_MILLIS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
EXPONENTIAL_BACKOFF_MILLIS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
EXPONENTIAL_BACKOFF_RETRIES - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
EXPONENTIAL_BACKOFF_RETRIES - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
Extension - Class in org.apache.nutch.plugin
An Extension is a kind of listener descriptor that will be installed on a concrete ExtensionPoint that acts as a kind of Publisher.
Extension(PluginDescriptor, String, String, String, Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.Extension
 
ExtensionPoint - Class in org.apache.nutch.plugin
The ExtensionPoint provides meta information about an extension point.
ExtensionPoint(String, String, String) - Constructor for class org.apache.nutch.plugin.ExtensionPoint
Constructor
ExtParser - Class in org.apache.nutch.parse.ext
A wrapper that invokes an external command to do the real parsing job.
ExtParser() - Constructor for class org.apache.nutch.parse.ext.ExtParser
 
extractText(InputStream, String, List<Outlink>) - Method in class org.apache.nutch.parse.zip.ZipTextExtractor
 
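
The elapsedTime(long, long) entry above computes the elapsed time between two millisecond timestamps. A minimal sketch of such a calculation follows. The class name ElapsedTimeSketch and the HH:mm:ss output format are assumptions for illustration; the actual return format of TimingUtil.elapsedTime is not stated in this index.

```java
// Hypothetical illustration class; not part of Nutch.
public class ElapsedTimeSketch {

    /**
     * Formats the time between two millisecond timestamps as HH:mm:ss.
     * The output format is an assumption made for this sketch.
     */
    public static String elapsed(long startMillis, long endMillis) {
        // Treat a negative interval as zero rather than failing.
        long totalSeconds = Math.max(0L, endMillis - startMillis) / 1000L;
        long hours = totalSeconds / 3600L;
        long minutes = (totalSeconds % 3600L) / 60L;
        long seconds = totalSeconds % 60L;
        return String.format("%02d:%02d:%02d", hours, minutes, seconds);
    }
}
```

For example, a start of 0 and an end of 3,723,000 ms (one hour, two minutes, three seconds) formats as 01:02:03 in this sketch.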

F

FAILED - org.apache.nutch.service.model.response.JobInfo.State
 
FAILED - Static variable in class org.apache.nutch.parse.ParseStatus
General failure.
FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Content was not retrieved.
FAILED_EXCEPTION - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing failed.
FAILED_INVALID_FORMAT - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing failed.
FAILED_MISSING_CONTENT - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing failed.
FAILED_MISSING_PARTS - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing failed.
FAILED_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing failed.
failures - Variable in class org.apache.nutch.hostdb.HostDatum
 
FastURLFilter - Class in org.apache.nutch.urlfilter.fast
Filters URLs based on a file of regular expressions using host/domains matching first.
FastURLFilter() - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter
 
FastURLFilter.DenyAllRule - Class in org.apache.nutch.urlfilter.fast
Rule for DenyPath .* or DenyPath .?
FastURLFilter.DenyPathQueryRule - Class in org.apache.nutch.urlfilter.fast
 
FastURLFilter.DenyPathRule - Class in org.apache.nutch.urlfilter.fast
 
FastURLFilter.Rule - Class in org.apache.nutch.urlfilter.fast
 
Feed - Interface in org.apache.nutch.metadata
A collection of Feed property names extracted by the ROME library.
FEED - Static variable in interface org.apache.nutch.metadata.Feed
 
FEED_AUTHOR - Static variable in interface org.apache.nutch.metadata.Feed
 
FEED_PUBLISHED - Static variable in interface org.apache.nutch.metadata.Feed
 
FEED_TAGS - Static variable in interface org.apache.nutch.metadata.Feed
 
FEED_UPDATED - Static variable in interface org.apache.nutch.metadata.Feed
 
FeedIndexingFilter - Class in org.apache.nutch.indexer.feed
 
FeedIndexingFilter() - Constructor for class org.apache.nutch.indexer.feed.FeedIndexingFilter
 
FeedParser - Class in org.apache.nutch.parse.feed
 
FeedParser() - Constructor for class org.apache.nutch.parse.feed.FeedParser
 
fetch(Path, int) - Method in class org.apache.nutch.fetcher.Fetcher
 
FETCH - org.apache.nutch.service.JobManager.JobType
 
FETCH_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
 
FETCH_EVENT_CONTENTLANG - Static variable in interface org.apache.nutch.metadata.Nutch
Content-language key in the Pub/Sub event metadata for the content-language of the parsed page
FETCH_EVENT_CONTENTTYPE - Static variable in interface org.apache.nutch.metadata.Nutch
Content-type key in the Pub/Sub event metadata for the content-type of the parsed page
FETCH_EVENT_FETCHTIME - Static variable in interface org.apache.nutch.metadata.Nutch
Fetch time key in the Pub/Sub event metadata for the fetch time of the parsed page
FETCH_EVENT_SCORE - Static variable in interface org.apache.nutch.metadata.Nutch
Score key in the Pub/Sub event metadata for the score of the parsed page
FETCH_EVENT_TITLE - Static variable in interface org.apache.nutch.metadata.Nutch
Title key in the Pub/Sub event metadata for the title of the parsed page
FETCH_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
FETCH_TIME - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold the time when the page has been fetched
FETCH_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
fetchDb(int, int) - Method in class org.apache.nutch.service.resources.DbResource
 
fetched - Variable in class org.apache.nutch.hostdb.HostDatum
 
fetched - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
FETCHED - org.apache.nutch.util.domain.DomainStatistics.MyCounter
 
Fetcher - Class in org.apache.nutch.fetcher
A queue-based fetcher.
Fetcher() - Constructor for class org.apache.nutch.fetcher.Fetcher
 
Fetcher(Configuration) - Constructor for class org.apache.nutch.fetcher.Fetcher
 
Fetcher.FetcherRun - Class in org.apache.nutch.fetcher
 
Fetcher.InputFormat - Class in org.apache.nutch.fetcher
 
FetcherOutputFormat - Class in org.apache.nutch.fetcher
Splits FetcherOutput entries into multiple map files.
FetcherOutputFormat() - Constructor for class org.apache.nutch.fetcher.FetcherOutputFormat
 
fetchErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
FetcherRun() - Constructor for class org.apache.nutch.fetcher.Fetcher.FetcherRun
 
FetcherThread - Class in org.apache.nutch.fetcher
This class picks items from queues and fetches the pages.
FetcherThread(Configuration, AtomicInteger, FetchItemQueues, QueueFeeder, AtomicInteger, AtomicLong, Mapper.Context, AtomicInteger, String, boolean, boolean, AtomicInteger, AtomicLong) - Constructor for class org.apache.nutch.fetcher.FetcherThread
 
FetcherThreadEvent - Class in org.apache.nutch.fetcher
This class is used to capture the various events occurring at fetch time.
FetcherThreadEvent(FetcherThreadEvent.PublishEventType, String) - Constructor for class org.apache.nutch.fetcher.FetcherThreadEvent
Constructor to create an event to be published
FetcherThreadEvent.PublishEventType - Enum in org.apache.nutch.fetcher
Type of event to specify start, end or reporting of a fetch item.
FetcherThreadPublisher - Class in org.apache.nutch.fetcher
This class handles the publishing of the events to the queue implementation.
FetcherThreadPublisher(Configuration) - Constructor for class org.apache.nutch.fetcher.FetcherThreadPublisher
Configure all registered publishers
FetchItem - Class in org.apache.nutch.fetcher
This class describes the item to be fetched.
FetchItem(Text, URL, CrawlDatum, String) - Constructor for class org.apache.nutch.fetcher.FetchItem
 
FetchItem(Text, URL, CrawlDatum, String, int) - Constructor for class org.apache.nutch.fetcher.FetchItem
 
FetchItemQueue - Class in org.apache.nutch.fetcher
This class handles FetchItems which come from the same host ID (be it a proto/hostname or proto/IP pair).
FetchItemQueue(Configuration, int, long, long) - Constructor for class org.apache.nutch.fetcher.FetchItemQueue
 
FetchItemQueues - Class in org.apache.nutch.fetcher
A collection of queues that keeps track of the total number of items, and provides items eligible for fetching from any queue.
FetchItemQueues(Configuration) - Constructor for class org.apache.nutch.fetcher.FetchItemQueues
 
FetchNode - Class in org.apache.nutch.fetcher
 
FetchNode() - Constructor for class org.apache.nutch.fetcher.FetchNode
 
FetchNodeDb - Class in org.apache.nutch.fetcher
 
FetchNodeDb() - Constructor for class org.apache.nutch.fetcher.FetchNodeDb
 
FetchNodeDbInfo - Class in org.apache.nutch.service.model.response
 
FetchNodeDbInfo() - Constructor for class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
FetchOverdueCrawlDatumProcessor - Class in org.apache.nutch.hostdb
Simple custom crawl datum processor that counts the number of records that are overdue for fetching, e.g.
FetchOverdueCrawlDatumProcessor(Configuration) - Constructor for class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
FetchSchedule - Interface in org.apache.nutch.crawl
This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
FetchScheduleFactory - Class in org.apache.nutch.crawl
Creates and caches a FetchSchedule implementation.
FG() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG
 
FGMapper() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
 
FGReducer() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG.FGReducer
 
FIELD - Static variable in class org.creativecommons.nutch.CCIndexingFilter
The name of the document field we use.
fieldName - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
Doc field name
FieldReplacer - Class in org.apache.nutch.indexer.replace
POJO to store a field name, its match pattern and its replacement string.
FieldReplacer(String, String, String, Integer) - Constructor for class org.apache.nutch.indexer.replace.FieldReplacer
Field replacer with the input and output field the same.
FieldReplacer(String, String, String, String, Integer) - Constructor for class org.apache.nutch.indexer.replace.FieldReplacer
Create a FieldReplacer for a field.
File - Class in org.apache.nutch.protocol.file
This class is a protocol plugin used for file: scheme.
File() - Constructor for class org.apache.nutch.protocol.file.File
 
FileDumper - Class in org.apache.nutch.tools
The file dumper tool enables one to reverse generate the raw content from Nutch segment data directories.
FileDumper() - Constructor for class org.apache.nutch.tools.FileDumper
 
FileError - Exception in org.apache.nutch.protocol.file
Thrown for File error codes.
FileError(int) - Constructor for exception org.apache.nutch.protocol.file.FileError
 
FileException - Exception in org.apache.nutch.protocol.file
 
FileException() - Constructor for exception org.apache.nutch.protocol.file.FileException
 
FileException(String) - Constructor for exception org.apache.nutch.protocol.file.FileException
 
FileException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
 
FileException(Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
 
fileLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
FileResponse - Class in org.apache.nutch.protocol.file
FileResponse.java mimics file replies as HTTP responses.
FileResponse(URL, CrawlDatum, File, Configuration) - Constructor for class org.apache.nutch.protocol.file.FileResponse
Default public constructor
filter - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
filter() - Method in class org.apache.nutch.parse.ParseResult
Remove all results where status is not successful (as determined by ParseStatus.isSuccess()).
filter(String) - Method in class org.apache.nutch.collection.Subcollection
Simple "indexOf" currentFilter for matching patterns.
filter(String) - Method in interface org.apache.nutch.net.URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
filter(String) - Method in class org.apache.nutch.net.URLFilters
Run all defined filters.
filter(String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
 
filter(String) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
 
filter(String) - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
 
filter(String) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
filter(String) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
filter(String) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
filter(String) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
 
filter(String, String) - Method in interface org.apache.nutch.net.URLExemptionFilter
Checks if toUrl is exempted when the ignore external is enabled
filter(String, String) - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
 
filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in interface org.apache.nutch.segment.SegmentMergeFilter
The filtering method which gets all information being merged for a given key (URL).
filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in class org.apache.nutch.segment.SegmentMergeFilters
Iterates over all SegmentMergeFilter extensions and if any of them returns false, it will return false as well.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
The AnchorIndexingFilter filter object which supports boolean configuration settings for the deduplication of anchors.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
The ArbitraryIndexingFilter filter object uses reflection to instantiate the configured class and invoke the configured method.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
The BasicIndexingFilter filter object which supports few configuration settings for adding basic searchable fields.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in interface org.apache.nutch.indexer.IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.IndexingFilters
Run all defined filters.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
The StaticFieldIndexer filter object which adds fields as per configuration setting.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
Takes the metatags that you have listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object.
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
 
filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.creativecommons.nutch.CCIndexingFilter
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
Scan the HTML document looking at possible indications of content language.
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
Scan the HTML document looking at possible rel-tags
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in interface org.apache.nutch.parse.HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.HtmlParseFilters
Run all defined filters.
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.js.JSParseFilter
Scan the JavaScript fragments of an HTML page looking for possible Outlinks.
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
 
filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.creativecommons.nutch.CCParseFilter
Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.
filterNormalize(String) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
filterNormalize(String, String, String, boolean, boolean, String, URLFilters, URLExemptionFilters, URLNormalizers) - Static method in class org.apache.nutch.parse.ParseOutputFormat
 
filterNormalize(String, String, String, boolean, boolean, String, URLFilters, URLExemptionFilters, URLNormalizers, String) - Static method in class org.apache.nutch.parse.ParseOutputFormat
 
filterParse(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
filters - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
filterUrl(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
finalize() - Method in class org.apache.nutch.plugin.Plugin
 
finalize() - Method in class org.apache.nutch.plugin.PluginRepository
Deprecated. 
finalize() - Method in class org.apache.nutch.protocol.ftp.Ftp
 
finalize(HostDatum) - Method in interface org.apache.nutch.hostdb.CrawlDatumProcessor
Process the final host datum instance and store the aggregated custom counts in the HostDatum.
finalize(HostDatum) - Method in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
find(String, int) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
Get index of first occurrence of any separator characters.
findAuthentication(Metadata) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
 
findWorker(String) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
Find the Job Worker Thread.
FINISHED - org.apache.nutch.service.model.response.JobInfo.State
 
finishFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
finishFetchItem(FetchItem, boolean) - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
finishFetchItem(FetchItem, boolean) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
Used by AdaptiveFetchSchedule to maintain custom fetch interval
fixHttpHeaders(String, int) - Static method in class org.apache.nutch.tools.WARCUtils
Modify verbatim HTTP response headers: fix, remove or replace headers Content-Length, Content-Encoding and Transfer-Encoding which may confuse WARC readers.
flattenHashMap(HashMap<String, Integer>) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
 
followRedirects - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
followRedirects - Variable in class org.apache.nutch.parse.ParserChecker
 
FORBID_ALL_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
force - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
forceAsContentType - Variable in class org.apache.nutch.parse.ParserChecker
 
forceRefetch(Text, CrawlDatum, boolean) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
forceRefetch(Text, CrawlDatum, boolean) - Method in interface org.apache.nutch.crawl.FetchSchedule
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
FORMAT - Static variable in interface org.apache.nutch.metadata.DublinCore
Typically, Format may include the media-type or dimensions of the resource.
FORMAT - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
 
FORMAT - Static variable in class org.apache.nutch.tools.WARCUtils
 
forName(String) - Method in class org.apache.nutch.util.MimeUtil
A facade interface to Tika's underlying MimeTypes.forName(String) method.
FreeGenerator - Class in org.apache.nutch.tools
This tool generates fetchlists (segments to be fetched) from plain text files containing one URL per line.
FreeGenerator() - Constructor for class org.apache.nutch.tools.FreeGenerator
 
FreeGenerator.FG - Class in org.apache.nutch.tools
 
FreeGenerator.FG.FGMapper - Class in org.apache.nutch.tools
 
FreeGenerator.FG.FGReducer - Class in org.apache.nutch.tools
 
fromHexString(String) - Static method in class org.apache.nutch.util.StringUtil
Convert a String containing consecutive (no inside whitespace) hexadecimal digits into a corresponding byte array.
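The conversion described above can be illustrated with a small standalone sketch. It does not call or reproduce the Nutch implementation: the class name HexSketch, the method name fromHex, and the choice to return null for odd-length or non-hex input are all assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical illustration; not the StringUtil source.
final class HexSketch {

    /**
     * Converts consecutive hexadecimal digits (no whitespace) into bytes.
     * Returning null on odd-length or non-hex input is an assumption
     * of this sketch, not documented behaviour.
     */
    static byte[] fromHex(String text) {
        if (text == null || text.length() % 2 != 0) {
            return null;
        }
        byte[] out = new byte[text.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Each output byte is built from two hex digits.
            int hi = Character.digit(text.charAt(2 * i), 16);
            int lo = Character.digit(text.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                return null; // a character was not a hex digit
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [10, -1, 16]
        System.out.println(Arrays.toString(fromHex("0aff10")));
    }
}
```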
FSUtils - Class in org.apache.nutch.util
Utility methods for common filesystem operations.
FSUtils() - Constructor for class org.apache.nutch.util.FSUtils
 
Ftp - Class in org.apache.nutch.protocol.ftp
This class is a protocol plugin used for ftp: scheme.
Ftp() - Constructor for class org.apache.nutch.protocol.ftp.Ftp
 
FtpError - Exception in org.apache.nutch.protocol.ftp
Thrown for Ftp error codes.
FtpError(int) - Constructor for exception org.apache.nutch.protocol.ftp.FtpError
 
FtpException - Exception in org.apache.nutch.protocol.ftp
Superclass for important exceptions thrown during FTP talk, that must be handled with care.
FtpException() - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
 
FtpException(String) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
 
FtpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
 
FtpException(Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
 
FtpExceptionBadSystResponse - Exception in org.apache.nutch.protocol.ftp
Exception indicating bad reply of SYST command.
FtpExceptionCanNotHaveDataConnection - Exception in org.apache.nutch.protocol.ftp
Exception indicating failure of opening data connection.
FtpExceptionControlClosedByForcedDataClose - Exception in org.apache.nutch.protocol.ftp
Exception indicating control channel is closed by server end, due to forced closure of data channel at client (our) end.
FtpExceptionUnknownForcedDataClose - Exception in org.apache.nutch.protocol.ftp
Exception indicating unrecognizable reply from server after forced closure of data channel by client (our) side.
FtpResponse - Class in org.apache.nutch.protocol.ftp
FtpResponse.java mimics ftp replies as http response.
FtpResponse(URL, CrawlDatum, Ftp, Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpResponse
 
FtpRobotRulesParser - Class in org.apache.nutch.protocol.ftp
This class is used for parsing robots for urls belonging to FTP protocol.
FtpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
 

G

generate(Path, Path, int, long, long) - Method in class org.apache.nutch.crawl.Generator
 
generate(Path, Path, int, long, long, boolean, boolean) - Method in class org.apache.nutch.crawl.Generator
generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String) - Method in class org.apache.nutch.crawl.Generator
This signature should be used when no hostdb is available.
generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String, String) - Method in class org.apache.nutch.crawl.Generator
Generate fetchlists in one or more segments.
GENERATE - org.apache.nutch.service.JobManager.JobType
 
GENERATE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
 
GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
GENERATE_UPDATE_CRAWLDB - Static variable in class org.apache.nutch.crawl.Generator
 
generated - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
generateJson() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
generateSegmentName() - Static method in class org.apache.nutch.crawl.Generator
 
generateSegmentName() - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
Generates a random name for the segments.
generateWARC(String, List<Path>, boolean, boolean, boolean) - Method in class org.apache.nutch.tools.warc.WARCExporter
 
generator - Static variable in class org.apache.nutch.tools.WARCUtils
 
Generator - Class in org.apache.nutch.crawl
Generates a subset of a crawl db to fetch.
Generator() - Constructor for class org.apache.nutch.crawl.Generator
 
Generator(Configuration) - Constructor for class org.apache.nutch.crawl.Generator
 
GENERATOR_COUNT_MODE - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_COUNT_VALUE_DOMAIN - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_COUNT_VALUE_HOST - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_CUR_TIME - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_DELAY - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_EXPR - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_FETCH_DELAY_EXPR - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_FILTER - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_HOSTDB - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_MAX_COUNT - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_MAX_COUNT_EXPR - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_MAX_NUM_SEGMENTS - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_MIN_INTERVAL - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_MIN_SCORE - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_NORMALISE - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_RESTRICT_STATUS - Static variable in class org.apache.nutch.crawl.Generator
 
GENERATOR_TOP_N - Static variable in class org.apache.nutch.crawl.Generator
 
Generator.CrawlDbUpdater - Class in org.apache.nutch.crawl
Update the CrawlDB so that the next generate won't include the same URLs.
Generator.CrawlDbUpdater.CrawlDbUpdateMapper - Class in org.apache.nutch.crawl
 
Generator.CrawlDbUpdater.CrawlDbUpdateReducer - Class in org.apache.nutch.crawl
 
Generator.DecreasingFloatComparator - Class in org.apache.nutch.crawl
 
Generator.HashComparator - Class in org.apache.nutch.crawl
Sort fetch lists by hash of URL.
Generator.PartitionReducer - Class in org.apache.nutch.crawl
 
Generator.Selector - Class in org.apache.nutch.crawl
Selects entries due for fetch.
Generator.SelectorEntry - Class in org.apache.nutch.crawl
 
Generator.SelectorInverseMapper - Class in org.apache.nutch.crawl
 
Generator.SelectorMapper - Class in org.apache.nutch.crawl
Select and invert subset due for fetch.
Generator.SelectorReducer - Class in org.apache.nutch.crawl
Collect until limit is reached.
generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
generatorSortValue(Text, CrawlDatum, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.ScoringFilters
Calculate a sort value for Generate.
GENERIC - org.apache.nutch.util.domain.TopLevelDomain.Type
 
GenericWritableConfigurable - Class in org.apache.nutch.util
A generic Writable wrapper that can inject Configuration to Configurables
GenericWritableConfigurable() - Constructor for class org.apache.nutch.util.GenericWritableConfigurable
 
GeoIPDocumentCreator - Class in org.apache.nutch.indexer.geoip
Simple utility class which enables efficient, structured NutchDocument building based on input from GeoIPIndexingFilter, where configuration is also read.
GeoIPDocumentCreator() - Constructor for class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
 
GeoIPIndexingFilter - Class in org.apache.nutch.indexer.geoip
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
GeoIPIndexingFilter() - Constructor for class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
Default constructor for this plugin
get(String) - Method in class org.apache.nutch.metadata.Metadata
Get the value associated to a metadata name.
get(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
 
get(String) - Method in class org.apache.nutch.parse.ParseResult
Retrieve a single parse output.
get(String) - Static method in class org.apache.nutch.segment.SegmentPart
Create SegmentPart from a full path of a location inside any segment part.
get(String) - Method in interface org.apache.nutch.service.ConfManager
 
get(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
Returns the configuration associated with the given confId
get(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
Return the DomainSuffix object for the extension; if the extension is a top-level domain, the returned object will be an instance of TopLevelDomain
get(String, String) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
get(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
 
get(String, String) - Method in interface org.apache.nutch.service.JobManager
 
get(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
get(Configuration) - Static method in class org.apache.nutch.indexer.IndexWriters
 
get(Configuration) - Static method in class org.apache.nutch.plugin.PluginRepository
Get a cached instance of the PluginRepository
get(Configuration) - Static method in class org.apache.nutch.util.ObjectCache
 
get(Path, Text, Writer, Map<String, List<Writable>>) - Method in class org.apache.nutch.segment.SegmentReader
 
get(Text) - Method in class org.apache.nutch.parse.ParseResult
Retrieve a single parse output.
get(FileSplit) - Static method in class org.apache.nutch.segment.SegmentPart
Create SegmentPart from a FileSplit.
getAccept() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getAcceptCharset() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getAcceptedIssuers() - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
 
getAcceptedIssuers() - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
 
getAcceptedIssuers() - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
 
getAcceptedIssuers() - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
 
getAcceptedIssuers() - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
 
getAcceptLanguage() - Method in class org.apache.nutch.protocol.http.api.HttpBase
Value of "Accept-Language" request header sent by Nutch.
getAdditionalPostHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getAgentString(String, String, String, String, String) - Static method in class org.apache.nutch.tools.WARCUtils
 
getAll() - Method in class org.apache.nutch.collection.CollectionManager
Returns all collections
getAllJobs() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
Get all jobs (currently running and completed)
getAnchor() - Method in class org.apache.nutch.crawl.Inlink
 
getAnchor() - Method in class org.apache.nutch.parse.Outlink
 
getAnchor() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
getAnchors() - Method in class org.apache.nutch.crawl.Inlinks
Get all anchor texts.
getAnchors(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
 
getArgs() - Method in class org.apache.nutch.parse.ParseStatus
 
getArgs() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
getArgs() - Method in class org.apache.nutch.service.model.request.DbQuery
 
getArgs() - Method in class org.apache.nutch.service.model.request.JobConfig
 
getArgs() - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
getArgs() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getAsMap(String) - Method in interface org.apache.nutch.service.ConfManager
 
getAsMap(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
 
getAttribute(String) - Method in class org.apache.nutch.plugin.Extension
Returns an attribute value that is set up in the manifest file and defined by the extension point XML schema.
getAuthentication(String, Configuration) - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
This method is responsible for providing Basic authentication information.
getBase(Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
If Node contains a BASE tag then its HREF is returned.
getBase(Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
If Node contains a BASE tag then its HREF is returned.
getBaseHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
getBaseUrl() - Method in class org.apache.nutch.protocol.Content
The base url for relative links contained in the content.
getBasicPattern() - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
Provides a pattern which can be used by an outside resource to determine if this class can provide credentials based on simple header information.
getBlackListString() - Method in class org.apache.nutch.collection.Subcollection
Returns blacklist String
getBody() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
getBoolean(String, boolean) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
getBoost() - Method in class org.apache.nutch.util.domain.DomainSuffix
 
getBufferSize() - Method in class org.apache.nutch.protocol.ftp.Ftp
 
getCachedClass(PluginDescriptor, String) - Method in class org.apache.nutch.plugin.PluginRepository
 
getCacheKey(URL) - Static method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
Compose unique key to store and access robot rules in cache for given URL
getCharset(Metadata) - Static method in class org.apache.nutch.segment.SegmentReader
Try to get HTML encoding from parse metadata.
getChildren() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
getClassLoader() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns a cached classloader for a plugin.
getClazz() - Method in class org.apache.nutch.exchange.ExchangeConfig
 
getClazz() - Method in class org.apache.nutch.plugin.Extension
Returns the full class name of the extension point implementation
getClient(URL) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
Distribute hosts over clients by host name
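One common way to distribute hosts over a fixed pool by host name is to hash the host and take the result modulo the pool size. The sketch below shows that idea only; it is an assumption about the general technique, not the OkHttp plugin's code. ClientPoolSketch, clientFor, and clientForHost are hypothetical names, and plain strings stand in for real client objects.

```java
import java.net.URI;
import java.net.URL;
import java.util.Locale;

// Hypothetical illustration of host-based client selection.
final class ClientPoolSketch {
    // Stand-ins for HTTP client instances.
    private final String[] clients;

    ClientPoolSketch(int size) {
        clients = new String[size];
        for (int i = 0; i < size; i++) {
            clients[i] = "client-" + i;
        }
    }

    /** Picks a client from the host name, so one host always maps to one client. */
    String clientForHost(String host) {
        String key = host.toLowerCase(Locale.ROOT);
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(key.hashCode(), clients.length);
        return clients[index];
    }

    String clientFor(URL url) {
        return clientForHost(url.getHost());
    }

    public static void main(String[] args) throws Exception {
        ClientPoolSketch pool = new ClientPoolSketch(4);
        URL a = new URI("http://example.org/a").toURL();
        URL b = new URI("http://EXAMPLE.org/b").toURL();
        // Both URLs share a host, so they resolve to the same client.
        System.out.println(pool.clientFor(a).equals(pool.clientFor(b)));
    }
}
```

Keeping every request for a host on the same client lets that client reuse its connections to the host.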
getCode() - Method in interface org.apache.nutch.net.protocols.Response
Get the response code.
getCode() - Method in class org.apache.nutch.protocol.file.FileResponse
Get the response code.
getCode() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
Get the response code.
getCode() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
 
getCode() - Method in class org.apache.nutch.protocol.http.HttpResponse
 
getCode() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
 
getCode() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
getCode() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
getCode() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
getCode() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
 
getCode(int) - Method in exception org.apache.nutch.protocol.file.FileError
 
getCode(int) - Method in exception org.apache.nutch.protocol.ftp.FtpError
 
getCollectionManager(Configuration) - Static method in class org.apache.nutch.collection.CollectionManager
 
getCommand() - Method in class org.apache.nutch.util.CommandRunner
 
getCommonCrawlFormat(String, String, Content, Metadata, Configuration, CommonCrawlConfig) - Static method in class org.apache.nutch.tools.CommonCrawlFormatFactory
Deprecated. 
getCommonCrawlFormat(String, Configuration, CommonCrawlConfig) - Static method in class org.apache.nutch.tools.CommonCrawlFormatFactory
 
getConf() - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
 
getConf() - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
 
getConf() - Method in class org.apache.nutch.crawl.Generator.Selector
 
getConf() - Method in class org.apache.nutch.crawl.Signature
 
getConf() - Method in class org.apache.nutch.crawl.URLPartitioner
 
getConf() - Method in class org.apache.nutch.exchange.jexl.JexlExchange
 
getConf() - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
Get the Configuration object
getConf() - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
Get the Configuration object
getConf() - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
Get the Configuration object
getConf() - Method in class org.apache.nutch.indexer.CleaningJob
 
getConf() - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
 
getConf() - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
 
getConf() - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
Get the Configuration object
getConf() - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
 
getConf() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
getConf() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
 
getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagParser
 
getConf() - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
 
getConf() - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
 
getConf() - Method in class org.apache.nutch.parse.ext.ExtParser
 
getConf() - Method in class org.apache.nutch.parse.feed.FeedParser
 
getConf() - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
 
getConf() - Method in class org.apache.nutch.parse.html.HtmlParser
 
getConf() - Method in class org.apache.nutch.parse.js.JSParseFilter
 
getConf() - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
 
getConf() - Method in class org.apache.nutch.parse.tika.TikaParser
 
getConf() - Method in class org.apache.nutch.parse.zip.ZipParser
 
getConf() - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
 
getConf() - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
getConf() - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
 
getConf() - Method in class org.apache.nutch.protocol.file.File
Get the Configuration object
getConf() - Method in class org.apache.nutch.protocol.ftp.Ftp
Get the Configuration object
getConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
 
getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
 
getConf() - Method in class org.apache.nutch.protocol.RobotRulesParser
Get the Configuration object
getConf() - Method in class org.apache.nutch.publisher.NutchPublishers
 
getConf() - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
 
getConf() - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
getConf() - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
 
getConf() - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
 
getConf() - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
getConf() - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
 
getConf() - Method in class org.apache.nutch.util.GenericWritableConfigurable
 
getConf() - Method in class org.creativecommons.nutch.CCIndexingFilter
 
getConf() - Method in class org.creativecommons.nutch.CCParseFilter
 
getConfId() - Method in class org.apache.nutch.service.model.request.DbQuery
 
getConfId() - Method in class org.apache.nutch.service.model.request.JobConfig
 
getConfId() - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
getConfId() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getConfig(String) - Method in class org.apache.nutch.service.resources.ConfigResource
Get configuration properties
getConfigId() - Method in class org.apache.nutch.service.model.request.NutchConfig
 
getConfigs() - Method in class org.apache.nutch.service.resources.ConfigResource
Returns a list of all configurations created.
getConfiguration() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
getConfManager() - Method in class org.apache.nutch.service.NutchServer
 
getConnectionFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
getContent() - Method in interface org.apache.nutch.net.protocols.Response
Get the full content of the response.
getContent() - Method in class org.apache.nutch.protocol.Content
The binary content retrieved.
getContent() - Method in class org.apache.nutch.protocol.file.FileResponse
 
getContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
 
getContent() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
 
getContent() - Method in class org.apache.nutch.protocol.http.HttpResponse
 
getContent() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
 
getContent() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
getContent() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
getContent() - Method in class org.apache.nutch.protocol.ProtocolOutput
 
getContent() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
 
getContentMeta() - Method in class org.apache.nutch.parse.ParseData
The original Metadata retrieved from content
getContentType() - Method in exception org.apache.nutch.parse.ParserNotFound
 
getContentType() - Method in class org.apache.nutch.protocol.Content
The media type of the retrieved content.
getContentType() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
getCookie() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
getCookie(URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
If per-host cookies are configured, this method will look up the cookie for the given URL.
getCookiePolicy() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getCookies() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
getCountryName() - Method in class org.apache.nutch.util.domain.TopLevelDomain
Returns the country name if TLD is Country Code TLD
getCrawlId() - Method in class org.apache.nutch.service.model.request.DbQuery
 
getCrawlId() - Method in class org.apache.nutch.service.model.request.JobConfig
 
getCrawlId() - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
getCrawlId() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getCredentials() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
Gets the credentials generated by the HttpAuthentication object.
getCredentials() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
Gets the Basic credentials generated by this HttpBasicAuthentication object
getCurrentKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
 
getCurrentNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
Get the node currently being processed.
getCurrentNode() - Method in class org.apache.nutch.util.NodeWalker
Return the current node.
getCurrentValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
 
getCustomRequestHeaders() - Method in class org.apache.nutch.protocol.okhttp.OkHttp
 
getData() - Method in interface org.apache.nutch.parse.Parse
Other data extracted from the page.
getData() - Method in class org.apache.nutch.parse.ParseImpl
 
getDatum() - Method in class org.apache.nutch.fetcher.FetchItem
 
getDependencies() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an array of plugin ids.
getDescriptor() - Method in class org.apache.nutch.plugin.Extension
Get the plugin descriptor.
getDescriptor() - Method in class org.apache.nutch.plugin.Plugin
Returns the plugin descriptor
getDnsFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
getDocumentMeta() - Method in class org.apache.nutch.indexer.NutchDocument
 
getDom(InputStream) - Static method in class org.apache.nutch.util.DomUtil
Returns the parsed DOM tree, or null if an error occurs
getDomain() - Method in class org.apache.nutch.util.domain.DomainSuffix
 
getDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
Returns the domain name of the url.
getDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
Get the domain name of the url.
getDomainSuffix(String) - Static method in class org.apache.nutch.util.URLUtil
Returns the DomainSuffix corresponding to the last public part of the hostname
getDomainSuffix(URL) - Static method in class org.apache.nutch.util.URLUtil
Returns the DomainSuffix corresponding to the last public part of the hostname
getDriverForPage(String, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
 
getDriverForPage(String, Configuration) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
getDumpPaths() - Method in class org.apache.nutch.service.model.response.ServiceInfo
 
getDuplicate(CrawlDatum, CrawlDatum) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
getElement(DocumentFragment, String) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
Finds the specified element and returns its value
getEmptyParse(Configuration) - Method in class org.apache.nutch.parse.ParseStatus
Creates an empty Parse instance containing the status
getEmptyParseResult(String, Configuration) - Method in class org.apache.nutch.parse.ParseStatus
Creates an empty ParseResult for a given URL
getEventData() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Get event data
getEventType() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Get type of this event object
getExemptions() - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
 
getExitValue() - Method in class org.apache.nutch.util.CommandRunner
 
getExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an array of exported libs as URLs
getExtensionInstance() - Method in class org.apache.nutch.plugin.Extension
Return an instance of the extension implementation.
getExtensionPoint(String) - Method in class org.apache.nutch.plugin.PluginRepository
Returns an extension point identified by an extension point id.
getExtensions() - Method in class org.apache.nutch.plugin.ExtensionPoint
Returns an array of extensions that listen to this extension point
getExtensions() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an array of extensions.
getExtensions(String) - Method in class org.apache.nutch.parse.ParserFactory
Finds the best-suited parse plugin for a given contentType.
getExtenstionPoints() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an array of extension points.
getFetched() - Method in class org.apache.nutch.hostdb.HostDatum
 
getFetchInterval() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getFetchItem() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
getFetchItem() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
getFetchItemQueue(String) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
getFetchNodeDb() - Method in class org.apache.nutch.fetcher.FetchNodeDb
 
getFetchNodeDb() - Method in class org.apache.nutch.service.NutchServer
 
getFetchSchedule(Configuration) - Static method in class org.apache.nutch.crawl.FetchScheduleFactory
Return the FetchSchedule implementation specified within the given Configuration, or DefaultFetchSchedule by default.
getFetchTime() - Method in class org.apache.nutch.crawl.CrawlDatum
Get the fetch time.
getFetchTime() - Method in class org.apache.nutch.fetcher.FetchNode
 
getField(String) - Method in class org.apache.nutch.indexer.NutchDocument
 
getFieldName() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
 
getFieldNames() - Method in class org.apache.nutch.indexer.NutchDocument
 
getFieldValue(String) - Method in class org.apache.nutch.indexer.NutchDocument
 
getFilters() - Method in class org.apache.nutch.net.URLFilters
 
getFromUrl() - Method in class org.apache.nutch.crawl.Inlink
 
getGeneralTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
getGone() - Method in class org.apache.nutch.hostdb.HostDatum
 
getHeader(String) - Method in interface org.apache.nutch.net.protocols.Response
Get the value of a named header.
getHeader(String) - Method in class org.apache.nutch.protocol.file.FileResponse
Returns the value of a named header.
getHeader(String) - Method in class org.apache.nutch.protocol.ftp.FtpResponse
Returns the value of a named header.
getHeader(String) - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
 
getHeader(String) - Method in class org.apache.nutch.protocol.http.HttpResponse
 
getHeader(String) - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
 
getHeader(String) - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
getHeader(String) - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
getHeader(String) - Method in class org.apache.nutch.protocol.selenium.HttpResponse
 
getHeaders() - Method in interface org.apache.nutch.net.protocols.Response
Get all the headers.
getHeaders() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
 
getHeaders() - Method in class org.apache.nutch.protocol.http.HttpResponse
 
getHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
 
getHeaders() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
getHeaders() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
getHeaders() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
 
getHeaders() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
getHomepageUrl() - Method in class org.apache.nutch.hostdb.HostDatum
 
getHost(String) - Static method in class org.apache.nutch.util.URLUtil
Returns the lowercased hostname for the URL, or null if the URL is not well-formed.
getHost(URL) - Static method in class org.apache.nutch.util.URLUtil
Returns the lowercased hostname for the URL.
getHostname(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
 
getHostName(String) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
Strip a URL, leaving only the host name.
getHostSegments(String) - Static method in class org.apache.nutch.util.URLUtil
Partitions the hostname of the URL by "."
getHostSegments(URL) - Static method in class org.apache.nutch.util.URLUtil
Partitions the hostname of the URL by "."
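As a rough illustration of partitioning a hostname by ".", the following is a minimal standalone sketch. It does not call URLUtil and is not the Nutch implementation: the class name HostSegmentsSketch and the helper hostSegments are hypothetical, and lowercasing the host and returning an empty array for a missing host are assumptions of the sketch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;

public class HostSegmentsSketch {

    // Hypothetical helper (not the Nutch method): split the host name of a URL at each ".".
    public static String[] hostSegments(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            // Assumption of this sketch: no host yields an empty array.
            return new String[0];
        }
        // Assumption of this sketch: the host is lowercased before splitting.
        return host.toLowerCase(Locale.ROOT).split("\\.");
    }

    public static void main(String[] args) {
        // Print the segments of an example host name.
        System.out.println(Arrays.toString(hostSegments("http://www.Example.org/index.html")));
    }
}
```

The sketch treats every host as a dotted name; how the real method handles IP addresses or malformed URLs should be checked in the URLUtil documentation.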
getHTMLContent(WebDriver, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
 
getHtmlPage(String) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
 
getHtmlPage(String, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
Obtains the HTML BODY using the selected Selenium WebDriver. A number of configuration properties within nutch-site.xml determine whether to take screenshots of the rendered pages and persist them as timestamped .png files into HDFS.
getHtmlPage(String, Configuration) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
Obtains the HTML using the selected Selenium WebDriver. A number of configuration properties within nutch-site.xml determine whether to take screenshots of the rendered pages and persist them as timestamped .png files into HDFS.
getHttpEquivTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
getId() - Method in class org.apache.nutch.collection.Subcollection
 
getId() - Method in class org.apache.nutch.exchange.ExchangeConfig
 
getId() - Method in class org.apache.nutch.plugin.Extension
Return the unique id of the extension.
getId() - Method in class org.apache.nutch.plugin.ExtensionPoint
Returns the unique id of the extension point.
getId() - Method in class org.apache.nutch.service.model.request.SeedList
 
getId() - Method in class org.apache.nutch.service.model.request.SeedUrl
 
getId() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getID(String) - Static method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
Returns a normalised doc ID based on the URL of a document
getImported() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getInfo() - Method in class org.apache.nutch.service.impl.JobWorker
 
getInfo(String) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
 
getInfo(String, String) - Method in class org.apache.nutch.service.resources.JobResource
Get job info
getInlinks(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
 
getInLinks() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getInLinks() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Gets the set of inlinks.
getInlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
 
getInProgressSize() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
getInstance() - Static method in class org.apache.nutch.fetcher.FetchNodeDb
 
getInstance() - Static method in class org.apache.nutch.plugin.URLStreamHandlerFactory
Get the singleton instance of this class.
getInstance() - Static method in class org.apache.nutch.service.NutchServer
 
getInstance() - Static method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyAllRule
 
getInstance() - Static method in class org.apache.nutch.util.domain.DomainSuffixes
Singleton instance, lazy instantiation.
getInstance(Element) - Static method in class org.apache.nutch.exchange.ExchangeConfig
 
getInt(String, int) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
getIPAddress(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
 
getJobClassName() - Method in class org.apache.nutch.service.model.request.JobConfig
 
getJobFailureLogMessage(String, Job) - Static method in class org.apache.nutch.util.NutchJob
Method to return job failure log message.
getJobHistory() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
Get the Job history
getJobManager() - Method in class org.apache.nutch.service.NutchServer
 
getJobRunning() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
Get the list of currently running jobs
getJobs() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
getJobs(String) - Method in class org.apache.nutch.service.resources.JobResource
Get job history for a given job regardless of the job's state
getJsonArray() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getJsonData() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getJsonData() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Get a string representation of the JSON structure of the URL content.
getJsonData() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
getJsonData(String, Content, Metadata) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getJsonData(String, Content, Metadata) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content.
getJsonData(String, Content, Metadata, ParseData) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getJsonData(String, Content, Metadata, ParseData) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Returns a string representation of the JSON structure of the URL content.
getJsonData(String, Content, Metadata, ParseData) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
getKey() - Method in class org.apache.nutch.collection.Subcollection
 
getKey() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getKeyPrefix() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getL2Norm() - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
getLastCheck() - Method in class org.apache.nutch.hostdb.HostDatum
 
getLastModified() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
getLinks() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
getLinkType() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
getLoginFormId() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getLoginPostData() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getLoginUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getLong(String, long) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
getMajorCode() - Method in class org.apache.nutch.parse.ParseStatus
 
getMaxContent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getMaxDuration() - Method in class org.apache.nutch.protocol.http.api.HttpBase
The time limit to download the entire content, in seconds.
getMaxInterval(Text, float) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
Returns the max_interval for this URL, which might depend on the host.
getMessage() - Method in class org.apache.nutch.parse.ParseStatus
 
getMessage() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
getMeta(String) - Method in class org.apache.nutch.metadata.MetaWrapper
Get metadata value for a given key.
getMeta(String) - Method in class org.apache.nutch.parse.ParseData
Get a metadata single value.
getMetadata() - Method in class org.apache.nutch.metadata.MetaWrapper
Get all metadata.
getMetadata() - Method in class org.apache.nutch.parse.Outlink
 
getMetadata() - Method in class org.apache.nutch.protocol.Content
Other protocol-specific data.
getMetadata() - Method in class org.apache.nutch.scoring.webgraph.Node
 
getMetaData() - Method in class org.apache.nutch.crawl.CrawlDatum
Get CrawlDatum metadata
getMetaData() - Method in class org.apache.nutch.hostdb.HostDatum
Get Host metadata.
getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.html.HTMLMetaProcessor
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.tika.HTMLMetaProcessor
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
getMetaValues(String) - Method in class org.apache.nutch.metadata.MetaWrapper
Get multiple metadata values for a given key.
getMethod() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getMimeType(File) - Method in class org.apache.nutch.util.MimeUtil
Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
getMimeType(String) - Method in class org.apache.nutch.util.MimeUtil
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
getMinInterval(Text, float) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
Returns the min_interval for this URL, which might depend on the host.
getMinorCode() - Method in class org.apache.nutch.parse.ParseStatus
 
getModifiedTime() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getMsg() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getName() - Method in class org.apache.nutch.collection.Subcollection
 
getName() - Method in class org.apache.nutch.plugin.ExtensionPoint
Returns the name of the extension point.
getName() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns the name of the plugin.
getName() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
getName() - Method in class org.apache.nutch.service.model.request.SeedList
 
getNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
Get the current value of noCache.
getNode() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
getNodeValue(Node) - Static method in class org.apache.nutch.parse.headings.HeadingsParseFilter
Returns the text value of the specified Node and child nodes
getNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
Get the current value of noFollow.
getNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
Get the current value of noIndex.
getNormalizedName(String) - Static method in class org.apache.nutch.metadata.SpellCheckedMetadata
Get the normalized name of metadata attribute name.
getNotExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an array of libraries as URLs that are not exported by the plugin.
getNotModified() - Method in class org.apache.nutch.hostdb.HostDatum
 
getNumInlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
 
getNumOfOutlinks() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
getNumOutlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
 
getObject(String) - Method in class org.apache.nutch.util.ObjectCache
 
getOrderedPlugins(Class<?>, String, String) - Method in class org.apache.nutch.plugin.PluginRepository
Get ordered list of plugins.
getOutlinks() - Method in class org.apache.nutch.fetcher.FetchNode
 
getOutlinks() - Method in class org.apache.nutch.parse.ParseData
Get the outlinks of the page.
getOutlinks(String, String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
Extracts Outlinks from the given plain text and adds the anchor to the extracted Outlinks.
getOutlinks(String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
Extracts Outlink from given plain text.
getOutlinks(URL, ArrayList<Outlink>, List<Link>) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
 
getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
getOutlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
 
getOutputCommitter(TaskAttemptContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
 
getOutputDir() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getPage(String) - Static method in class org.apache.nutch.util.URLUtil
Returns the page for the url.
getParameters() - Method in class org.apache.nutch.exchange.ExchangeConfig
 
getParams() - Method in class org.apache.nutch.service.model.request.NutchConfig
 
getParse(Content) - Method in class org.apache.nutch.parse.ext.ExtParser
 
getParse(Content) - Method in class org.apache.nutch.parse.feed.FeedParser
Parses the given feed, then extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
getParse(Content) - Method in class org.apache.nutch.parse.html.HtmlParser
 
getParse(Content) - Method in class org.apache.nutch.parse.js.JSParseFilter
Parse a JavaScript file and extract outlinks
getParse(Content) - Method in interface org.apache.nutch.parse.Parser
This method parses the given content and returns a map of <key, parse> pairs.
getParse(Content) - Method in class org.apache.nutch.parse.tika.TikaParser
 
getParse(Content) - Method in class org.apache.nutch.parse.zip.ZipParser
 
getParseMeta() - Method in class org.apache.nutch.parse.ParseData
Other content properties.
getParserById(String) - Method in class org.apache.nutch.parse.ParserFactory
Function returns a Parser instance with the specified extId, representing its extension ID.
getParsers(String, String) - Method in class org.apache.nutch.parse.ParserFactory
Function returns an array of Parsers for a given content type.
getPartition(FloatWritable, Writable, int) - Method in class org.apache.nutch.crawl.Generator.Selector
Partition by host / domain or IP.
getPartition(Text, Writable, int) - Method in class org.apache.nutch.crawl.URLPartitioner
Hash by host or domain name or IP address.
getPassAllFilter() - Static method in class org.apache.nutch.util.HadoopFSUtil
Get a path filter which allows all paths.
getPassDirectoriesFilter(FileSystem) - Static method in class org.apache.nutch.util.HadoopFSUtil
Get a path filter which allows all directories.
getPath() - Method in class org.apache.nutch.service.model.request.ReaderConfig
 
getPaths(FileStatus[]) - Static method in class org.apache.nutch.util.HadoopFSUtil
Turns an array of FileStatus into an array of Paths.
getPattern() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
 
getPluginClass() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns the fully qualified name of the class which implements the abstract Plugin class.
getPluginDescriptor(String) - Method in class org.apache.nutch.plugin.PluginRepository
Returns the descriptor of one plugin identified by a plugin id.
getPluginDescriptors() - Method in class org.apache.nutch.plugin.PluginRepository
Returns all registered plugin descriptors.
getPluginFolder(String) - Method in class org.apache.nutch.plugin.PluginManifestParser
Return the named plugin folder.
getPluginId() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns the unique identifier of the plug-in or null.
getPluginInstance(PluginDescriptor) - Method in class org.apache.nutch.plugin.PluginRepository
Returns an instance of a plugin.
getPluginPath() - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns the directory path of the plugin.
getPort() - Method in class org.apache.nutch.service.NutchServer
 
getPos() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Returns the current position in the file.
getProgress() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Returns the percentage of progress in processing the file.
getProgress() - Method in class org.apache.nutch.util.NutchTool
Get relative progress of the tool.
getProperty(String, String) - Method in class org.apache.nutch.service.resources.ConfigResource
Get property
getProtocol(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
Returns the appropriate Protocol implementation for a url.
getProtocol(String) - Static method in class org.apache.nutch.util.URLUtil
 
getProtocol(URL) - Method in class org.apache.nutch.protocol.ProtocolFactory
Returns the appropriate Protocol implementation for a url.
getProtocol(URL) - Static method in class org.apache.nutch.util.URLUtil
 
getProtocolById(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
 
getProtocolOutput(String, CrawlDatum, boolean) - Method in class org.apache.nutch.util.AbstractChecker
 
getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getProtocolOutput(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
Get the ProtocolOutput for a given url and crawldatum
getProviderName() - Method in class org.apache.nutch.plugin.PluginDescriptor
 
getProxyHost() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getProxyPort() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getQueueCount() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
getQueueCountMaxExceptions() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
getQueueID() - Method in class org.apache.nutch.fetcher.FetchItem
 
getQueueSize() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
getReaders(Path, Configuration) - Static method in class org.apache.nutch.util.SegmentReaderUtil
 
getRealm() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
Gets the realm used by the HttpAuthentication object during creation.
getRealm() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
Gets the realm attribute of the HttpBasicAuthentication object.
getReason() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
 
getRecordReader(InputSplit, Job, Mapper.Context) - Method in class org.apache.nutch.segment.ContentAsTextInputFormat
 
getRecordReader(InputSplit, Job, Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
Get the RecordReader for reading the arc file.
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.indexer.IndexerOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
 
getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentReader.TextOutputFormat
 
getRedirPerm() - Method in class org.apache.nutch.hostdb.HostDatum
 
getRedirTemp() - Method in class org.apache.nutch.hostdb.HostDatum
 
getRefresh() - Method in class org.apache.nutch.parse.HTMLMetaTags
Get the current value of refresh.
getRefreshHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
getRefreshTime() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
getRemovedFormFields() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
getReplacement() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
 
getReprUrl() - Method in class org.apache.nutch.fetcher.FetcherThread
 
getRequestAccept() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestAcceptEncoding() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestAcceptLanguage() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestContactEmail() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestContactName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestHostAddress() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestHostName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestRobots() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestSoftware() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getRequestUserAgent() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResource(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
 
getResourceAsStream(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
 
getResources(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
 
getResourceString(String, Locale) - Method in class org.apache.nutch.plugin.PluginDescriptor
Returns an I18N'd resource string.
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.htmlunit.Http
 
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.Http
 
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.httpclient.Http
Fetches the url with a configured HTTP client and gets the response.
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.interactiveselenium.Http
 
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
 
getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.selenium.Http
 
getResponseAddress() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseContent() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseContentEncoding() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseContentType() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseDate() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseHostName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseServer() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResponseStatus() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getResult() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getRetriesSinceFetch() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getReversedHost(String) - Static method in class org.apache.nutch.util.TableUtil
Given a reversed URL, returns the reversed host, e.g. "com.foo.bar:http:8983/to/index.html?a=b" -> "com.foo.bar"
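The example in the description above can be reproduced with a small standalone sketch. It assumes, based only on that example, that the reversed host is everything before the first ":" of the reversed URL. The class ReversedHostSketch and both helpers are hypothetical and are not the TableUtil code. The unreverseHost helper is an extra illustration that is not part of the entry above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReversedHostSketch {

    // Hypothetical helper: take the part of a reversed URL before the first ":".
    public static String reversedHost(String reversedUrl) {
        int colon = reversedUrl.indexOf(':');
        // Assumption of this sketch: a string without ":" is returned unchanged.
        return colon < 0 ? reversedUrl : reversedUrl.substring(0, colon);
    }

    // Hypothetical helper: turn "com.foo.bar" back into "bar.foo.com".
    public static String unreverseHost(String reversedHost) {
        List<String> parts = Arrays.asList(reversedHost.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        String reversed = reversedHost("com.foo.bar:http:8983/to/index.html?a=b");
        System.out.println(reversed);                // com.foo.bar
        System.out.println(unreverseHost(reversed)); // bar.foo.com
    }
}
```

Keeping the host in reversed form places URLs from the same domain next to each other when keys are sorted, which is the usual reason for this kind of key layout.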
getReverseKey() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getReverseKeyValue() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.file.File
No robots parsing is done for file protocol.
getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.ftp.Ftp
Get the robots rules for a given url
getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getRobotRules(Text, CrawlDatum, List<Content>) - Method in interface org.apache.nutch.protocol.Protocol
Retrieve robot rules applicable for this URL.
getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
For hosts whose robots rules are not yet cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules and caches the rules object to avoid re-work in future.
getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
Get the rules from robots.txt which apply to the given URL.
getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.RobotRulesParser
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).
getRobotRulesSet(Protocol, Text, List<Content>) - Method in class org.apache.nutch.protocol.RobotRulesParser
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).
getRootNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
Get the root node of the DOM being created.
getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Returns the name of the file of rules to use for a particular implementation.
getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
Rules specified as a config property will override rules specified as a config file.
getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
Gets reader for regex rules
getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
Rules specified as a config property will override rules specified as a config file.
getRunningJobs() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
getSchema() - Method in class org.apache.nutch.plugin.ExtensionPoint
Returns a path to the XML schema of an extension point.
getScopedRules() - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
 
getScore() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getScore() - Method in class org.apache.nutch.hostdb.HostDatum
 
getScore() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
getSeedFilePath() - Method in class org.apache.nutch.service.model.request.SeedList
 
getSeedList() - Method in class org.apache.nutch.service.model.request.SeedUrl
 
getSeedList(String) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
 
getSeedList(String) - Method in interface org.apache.nutch.service.SeedManager
 
getSeedLists() - Method in class org.apache.nutch.service.resources.SeedResource
Gets the list of seedFiles already created
getSeedManager() - Method in class org.apache.nutch.service.NutchServer
 
getSeeds() - Method in class org.apache.nutch.service.impl.SeedManagerImpl
 
getSeeds() - Method in interface org.apache.nutch.service.SeedManager
 
getSeedUrls() - Method in class org.apache.nutch.service.model.request.SeedList
 
getSeedUrlsCount() - Method in class org.apache.nutch.service.model.request.SeedList
 
getServerStatus() - Method in class org.apache.nutch.service.resources.AdminResource
Get the status of the Nutch Server
getSignature() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getSignature(Configuration) - Static method in class org.apache.nutch.crawl.SignatureFactory
Return the Signature implementation for a given Configuration, or MD5Signature by default.
getSimpleDateFormat() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getSplits(JobContext) - Method in class org.apache.nutch.fetcher.Fetcher.InputFormat
Don't split inputs to keep things polite - a single fetch list must be processed in one fetcher task.
getStartDate() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
getStarted() - Method in class org.apache.nutch.service.NutchServer
 
getState() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getStats(Path, SegmentReader.SegmentReaderStats) - Method in class org.apache.nutch.segment.SegmentReader
 
getStatus() - Method in class org.apache.nutch.crawl.CrawlDatum
 
getStatus() - Method in class org.apache.nutch.fetcher.FetchNode
 
getStatus() - Method in class org.apache.nutch.parse.ParseData
Get the status of parsing the page.
getStatus() - Method in class org.apache.nutch.protocol.ProtocolOutput
 
getStatus() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
getStatus() - Method in class org.apache.nutch.util.domain.DomainSuffix
 
getStatus() - Method in class org.apache.nutch.util.NutchTool
Returns current status of the running tool
getStatusByName(String) - Static method in class org.apache.nutch.crawl.CrawlDatum
 
getStatusName(byte) - Static method in class org.apache.nutch.crawl.CrawlDatum
 
getStrings(String) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
getStrings(String, String...) - Method in class org.apache.nutch.indexer.IndexWriterParams
 
getSubColection(String) - Method in class org.apache.nutch.collection.CollectionManager
Get the named subcollection
getSubCollections(String) - Method in class org.apache.nutch.collection.CollectionManager
Return names of collections url is part of
getSystemName() - Method in class org.apache.nutch.protocol.ftp.Client
Fetches the system type name from the server and returns the string.
getTargetPoint() - Method in class org.apache.nutch.plugin.Extension
Get target point
getText() - Method in interface org.apache.nutch.parse.Parse
The textual content of the page.
getText() - Method in class org.apache.nutch.parse.ParseImpl
 
getText() - Method in class org.apache.nutch.parse.ParseText
 
getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
This is a convenience method, equivalent to getText(sb, node, false).
getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
This is a convenience method, equivalent to getText(sb, node, false).
getText(StringBuffer, Node, boolean) - Method in class org.apache.nutch.parse.html.DOMContentUtils
This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.
getThrownError() - Method in class org.apache.nutch.util.CommandRunner
 
getTimeout() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getTimeout() - Method in class org.apache.nutch.util.CommandRunner
 
getTimestamp() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Get timestamp of current event.
getTimestamp() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
getTimestamp() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getTitle() - Method in class org.apache.nutch.fetcher.FetchNode
 
getTitle() - Method in class org.apache.nutch.parse.ParseData
Get the title of the page.
getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
getTlsPreferredCipherSuites() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getTlsPreferredProtocols() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getToFieldName() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
 
getTokenStream() - Method in class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
Get the tokenStream created by the Tokenizer
getTopLevelDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
Returns the top level domain name of the url.
getTopLevelDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
Returns the top level domain name of the url.
getTotalSize() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
getToUrl() - Method in class org.apache.nutch.parse.Outlink
 
getType() - Method in class org.apache.nutch.service.model.request.DbQuery
 
getType() - Method in class org.apache.nutch.service.model.request.JobConfig
 
getType() - Method in class org.apache.nutch.service.model.response.JobInfo
 
getType() - Method in class org.apache.nutch.util.domain.TopLevelDomain
 
getTypes() - Method in class org.apache.nutch.crawl.NutchWritable
 
getUnfetched() - Method in class org.apache.nutch.hostdb.HostDatum
 
getUniqueFile(TaskAttemptContext, String) - Method in class org.apache.nutch.parse.ParseOutputFormat
 
getUrl() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Get URL of this event
getUrl() - Method in class org.apache.nutch.fetcher.FetchItem
 
getUrl() - Method in class org.apache.nutch.fetcher.FetchNode
 
getUrl() - Method in interface org.apache.nutch.net.protocols.Response
Get the URL used to retrieve this response.
getUrl() - Method in exception org.apache.nutch.parse.ParserNotFound
 
getUrl() - Method in class org.apache.nutch.protocol.Content
The url fetched.
getUrl() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
 
getUrl() - Method in class org.apache.nutch.protocol.http.HttpResponse
 
getUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
 
getUrl() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
getUrl() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
getUrl() - Method in exception org.apache.nutch.protocol.ProtocolNotFound
 
getUrl() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
 
getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
getUrl() - Method in class org.apache.nutch.service.model.request.SeedUrl
 
getUrl() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
getUrl() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
getURL2() - Method in class org.apache.nutch.fetcher.FetchItem
 
getUrlMD5(String) - Static method in class org.apache.nutch.util.DumpFileUtil
 
getUseHttp11() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getUserAgent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
getUUID(Configuration) - Static method in class org.apache.nutch.util.NutchConfiguration
Retrieve a Nutch UUID of this configuration object, or null if the configuration was created elsewhere.
getValues() - Method in class org.apache.nutch.indexer.NutchField
 
getValues(String) - Method in class org.apache.nutch.metadata.Metadata
Get the values associated to a metadata name.
getValues(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
 
getVersion() - Method in class org.apache.nutch.parse.ParseData
 
getVersion() - Method in class org.apache.nutch.parse.ParseStatus
 
getVersion() - Method in class org.apache.nutch.plugin.PluginDescriptor
 
getWaitForExit() - Method in class org.apache.nutch.util.CommandRunner
 
getWARCInfoContent(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
 
getWarcSize() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
getWeight() - Method in class org.apache.nutch.indexer.NutchDocument
 
getWeight() - Method in class org.apache.nutch.indexer.NutchField
 
getWhiteList() - Method in class org.apache.nutch.collection.Subcollection
Returns whitelist
getWhiteListString() - Method in class org.apache.nutch.collection.Subcollection
Returns whitelist String
getWriter() - Method in class org.apache.nutch.parse.html.DOMBuilder
Return null since there is no Writer for this class.
gone - Variable in class org.apache.nutch.hostdb.HostDatum
 
GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Resource is gone.
guessEncoding(Content, String) - Method in class org.apache.nutch.util.EncodingDetector
Guess the encoding with the previously specified list of clues.
GZIPUtils - Class in org.apache.nutch.util
A collection of utility methods for working on GZIPed data.
GZIPUtils() - Constructor for class org.apache.nutch.util.GZIPUtils
 

H

HadoopFSUtil - Class in org.apache.nutch.util
 
HadoopFSUtil() - Constructor for class org.apache.nutch.util.HadoopFSUtil
 
hasDbStatus(CrawlDatum) - Static method in class org.apache.nutch.crawl.CrawlDatum
 
hasFetchStatus(CrawlDatum) - Static method in class org.apache.nutch.crawl.CrawlDatum
 
hashCode() - Method in class org.apache.nutch.crawl.CrawlDatum
 
hashCode() - Method in class org.apache.nutch.crawl.Inlink
 
hashCode() - Method in class org.apache.nutch.parse.Outlink
 
hashCode() - Method in class org.apache.nutch.plugin.PluginClassLoader
 
hashCode() - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
 
hashCode() - Method in class org.apache.nutch.service.model.request.SeedList
 
hashCode() - Method in class org.apache.nutch.service.model.request.SeedUrl
 
HashComparator() - Constructor for class org.apache.nutch.crawl.Generator.HashComparator
 
hasHomepageUrl() - Method in class org.apache.nutch.hostdb.HostDatum
 
hasHostDomainRules - Variable in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Whether there are host- or domain-specific rules.
hasMetaData() - Method in class org.apache.nutch.hostdb.HostDatum
 
hasNext() - Method in class org.apache.nutch.util.NodeWalker
 
hasObject(String) - Method in class org.apache.nutch.util.ObjectCache
 
head(String, int) - Method in class org.apache.nutch.service.impl.LinkReader
 
head(String, int) - Method in class org.apache.nutch.service.impl.NodeReader
 
head(String, int) - Method in class org.apache.nutch.service.impl.SequenceReader
 
head(String, int) - Method in interface org.apache.nutch.service.NutchReader
 
HeadingsParseFilter - Class in org.apache.nutch.parse.headings
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
HeadingsParseFilter() - Constructor for class org.apache.nutch.parse.headings.HeadingsParseFilter
 
homepageUrl - Variable in class org.apache.nutch.hostdb.HostDatum
 
host - Variable in class org.apache.nutch.hostdb.ResolverThread
 
host - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
HOST - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
hostDatum - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
HostDatum - Class in org.apache.nutch.hostdb
 
HostDatum() - Constructor for class org.apache.nutch.hostdb.HostDatum
 
HostDatum(float) - Constructor for class org.apache.nutch.hostdb.HostDatum
 
HostDatum(float, Date) - Constructor for class org.apache.nutch.hostdb.HostDatum
 
HostDatum(float, Date, String) - Constructor for class org.apache.nutch.hostdb.HostDatum
 
HOSTDB_CHECK_FAILED - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_CHECK_KNOWN - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_CHECK_NEW - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_CRAWLDATUM_PROCESSORS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_DUMP_HEADER - Static variable in class org.apache.nutch.hostdb.ReadHostDb
 
HOSTDB_DUMP_HOMEPAGES - Static variable in class org.apache.nutch.hostdb.ReadHostDb
 
HOSTDB_DUMP_HOSTNAMES - Static variable in class org.apache.nutch.hostdb.ReadHostDb
 
HOSTDB_FILTER_EXPRESSION - Static variable in class org.apache.nutch.hostdb.ReadHostDb
 
HOSTDB_FORCE_CHECK - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_NUM_RESOLVER_THREADS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_NUMERIC_FIELDS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_PERCENTILES - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_PURGE_FAILED_HOSTS_THRESHOLD - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_RECHECK_INTERVAL - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_STRING_FIELDS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_URL_FILTERING - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTDB_URL_NORMALIZING - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
HOSTNAME - Static variable in class org.apache.nutch.tools.WARCUtils
 
hostOrDomain() - Method in class org.apache.nutch.urlfilter.api.RegexRule
Return the host or domain this rule belongs to.
hostProtocolMapping - Variable in class org.apache.nutch.protocol.ProtocolFactory
 
HOSTS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
HOSTS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
hostText - Variable in class org.apache.nutch.hostdb.ResolverThread
 
HostURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.host
URL normalizer for mapping hosts to their desired form.
HostURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
 
HTMLLanguageParser - Class in org.apache.nutch.analysis.lang
 
HTMLLanguageParser() - Constructor for class org.apache.nutch.analysis.lang.HTMLLanguageParser
 
HTMLMetaProcessor - Class in org.apache.nutch.parse.html
Class for parsing META Directives from DOM trees.
HTMLMetaProcessor - Class in org.apache.nutch.parse.tika
Class for parsing META Directives from DOM trees.
HTMLMetaProcessor() - Constructor for class org.apache.nutch.parse.html.HTMLMetaProcessor
 
HTMLMetaProcessor() - Constructor for class org.apache.nutch.parse.tika.HTMLMetaProcessor
 
HTMLMetaTags - Class in org.apache.nutch.parse
This class holds the information about HTML "meta" tags extracted from a page.
HTMLMetaTags() - Constructor for class org.apache.nutch.parse.HTMLMetaTags
 
HtmlParseFilter - Interface in org.apache.nutch.parse
Extension point for DOM-based HTML parsers.
HTMLPARSEFILTER_ORDER - Static variable in class org.apache.nutch.parse.HtmlParseFilters
 
HtmlParseFilters - Class in org.apache.nutch.parse
Creates and caches HtmlParseFilter implementing plugins.
HtmlParseFilters(Configuration) - Constructor for class org.apache.nutch.parse.HtmlParseFilters
 
HtmlParser - Class in org.apache.nutch.parse.html
 
HtmlParser() - Constructor for class org.apache.nutch.parse.html.HtmlParser
 
HtmlUnitWebDriver - Class in org.apache.nutch.protocol.htmlunit
 
HtmlUnitWebDriver() - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
 
HtmlUnitWebWindowListener - Class in org.apache.nutch.protocol.htmlunit
 
HtmlUnitWebWindowListener() - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
 
HtmlUnitWebWindowListener(int) - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
 
Http - Class in org.apache.nutch.protocol.htmlunit
 
Http - Class in org.apache.nutch.protocol.http
 
Http - Class in org.apache.nutch.protocol.httpclient
This class is a protocol plugin that configures an HTTP client for the Basic, Digest and NTLM authentication schemes, for web servers as well as proxy servers.
Http - Class in org.apache.nutch.protocol.interactiveselenium
 
Http - Class in org.apache.nutch.protocol.selenium
 
Http() - Constructor for class org.apache.nutch.protocol.htmlunit.Http
Default constructor.
Http() - Constructor for class org.apache.nutch.protocol.http.Http
Public default constructor.
Http() - Constructor for class org.apache.nutch.protocol.httpclient.Http
Constructs this plugin.
Http() - Constructor for class org.apache.nutch.protocol.interactiveselenium.Http
 
Http() - Constructor for class org.apache.nutch.protocol.selenium.Http
 
HTTP - org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
 
HTTP - org.apache.nutch.protocol.http.HttpResponse.Scheme
 
HTTP - org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
 
HTTP - org.apache.nutch.protocol.selenium.HttpResponse.Scheme
 
HTTP_HEADER_FROM - Static variable in class org.apache.nutch.tools.WARCUtils
 
HTTP_HEADER_USER_AGENT - Static variable in class org.apache.nutch.tools.WARCUtils
 
HTTP_LOG_SUPPRESSION - Static variable in class org.apache.nutch.net.protocols.ProtocolLogUtil
 
HttpAuthentication - Interface in org.apache.nutch.protocol.httpclient
The base level of services required for Http Authentication
HttpAuthenticationException - Exception in org.apache.nutch.protocol.httpclient
Can be used to identify problems during creation of Authentication objects.
HttpAuthenticationException() - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
Constructs a new exception with null as its detail message.
HttpAuthenticationException(String) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
Constructs a new exception with the specified detail message.
HttpAuthenticationException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
Constructs a new exception with the specified message and cause.
HttpAuthenticationException(Throwable) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
Constructs a new exception with the specified cause and the detail message taken from the given cause if it is not null.
HttpAuthenticationFactory - Class in org.apache.nutch.protocol.httpclient
Provides the Http protocol implementation with the ability to authenticate when prompted.
HttpAuthenticationFactory(Configuration) - Constructor for class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
 
HttpBase - Class in org.apache.nutch.protocol.http.api
 
HttpBase() - Constructor for class org.apache.nutch.protocol.http.api.HttpBase
Creates a new instance of HttpBase
HttpBase(Logger) - Constructor for class org.apache.nutch.protocol.http.api.HttpBase
Creates a new instance of HttpBase
HttpBasicAuthentication - Class in org.apache.nutch.protocol.httpclient
Implementation of RFC 2617 Basic Authentication.
HttpBasicAuthentication(String, Configuration) - Constructor for class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
Construct an HttpBasicAuthentication for the given challenge parameters.
HttpDateFormat - Class in org.apache.nutch.net.protocols
Parse and format HTTP dates in HTTP headers, e.g., used to fill the "If-Modified-Since" request header field.
HttpDateFormat() - Constructor for class org.apache.nutch.net.protocols.HttpDateFormat
 
HttpException - Exception in org.apache.nutch.protocol.http.api
 
HttpException() - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
 
HttpException(String) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
 
HttpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
 
HttpException(Throwable) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
 
HttpFormAuthConfigurer - Class in org.apache.nutch.protocol.httpclient
 
HttpFormAuthConfigurer() - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
HttpFormAuthentication - Class in org.apache.nutch.protocol.httpclient
 
HttpFormAuthentication(String, String, Map<String, String>, Map<String, String>, Set<String>) - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
HttpFormAuthentication(HttpFormAuthConfigurer, HttpClient, Http) - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
HttpHeaders - Interface in org.apache.nutch.metadata
A collection of HTTP header names.
HttpResponse - Class in org.apache.nutch.protocol.htmlunit
An HTTP response.
HttpResponse - Class in org.apache.nutch.protocol.http
An HTTP response.
HttpResponse - Class in org.apache.nutch.protocol.httpclient
An HTTP response.
HttpResponse - Class in org.apache.nutch.protocol.interactiveselenium
 
HttpResponse - Class in org.apache.nutch.protocol.selenium
 
HttpResponse(HttpBase, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.htmlunit.HttpResponse
Default public constructor.
HttpResponse(HttpBase, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.http.HttpResponse
Default public constructor.
HttpResponse(Http, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.interactiveselenium.HttpResponse
 
HttpResponse(Http, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.selenium.HttpResponse
 
HttpResponse.Scheme - Enum in org.apache.nutch.protocol.htmlunit
 
HttpResponse.Scheme - Enum in org.apache.nutch.protocol.http
 
HttpResponse.Scheme - Enum in org.apache.nutch.protocol.interactiveselenium
 
HttpResponse.Scheme - Enum in org.apache.nutch.protocol.selenium
 
HttpRobotRulesParser - Class in org.apache.nutch.protocol.http.api
This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol.
HttpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
 
HTTPS - org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
 
HTTPS - org.apache.nutch.protocol.http.HttpResponse.Scheme
 
HTTPS - org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
 
HTTPS - org.apache.nutch.protocol.selenium.HttpResponse.Scheme
 
HttpWebClient - Class in org.apache.nutch.protocol.selenium
 
HttpWebClient() - Constructor for class org.apache.nutch.protocol.selenium.HttpWebClient
 

I

IDENTIFIER - Static variable in interface org.apache.nutch.metadata.DublinCore
Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
IDLE - org.apache.nutch.service.model.response.JobInfo.State
 
IF_MODIFIED_SINCE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
ignorableWhitespace(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of ignorable whitespace in element content.
IGNORE_EXTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
 
IGNORE_INTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
 
in - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
IN_USE - org.apache.nutch.util.domain.DomainSuffix.Status
 
INC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
incConnectionFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
incDnsFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
incrementExceptionCounter() - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
index(Path, Path, List<Path>, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
 
index(Path, Path, List<Path>, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
 
index(Path, Path, List<Path>, boolean, boolean, String) - Method in class org.apache.nutch.indexer.IndexingJob
 
index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
 
index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
 
index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
 
INDEX - org.apache.nutch.service.JobManager.JobType
 
INDEX - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
INDEX - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
INDEXER_BINARY_AS_BASE64 - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_DELETE - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_DELETE_ROBOTS_NOINDEX - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_DELETE_SKIPPED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_NO_COMMIT - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_PARAMS - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
INDEXER_SKIP_NOTMODIFIED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
IndexerMapper() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
 
IndexerMapReduce - Class in org.apache.nutch.indexer
This class is typically invoked from within IndexingJob and handles all MapReduce functionality required when undertaking indexing.
IndexerMapReduce() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce
 
IndexerMapReduce.IndexerMapper - Class in org.apache.nutch.indexer
 
IndexerMapReduce.IndexerReducer - Class in org.apache.nutch.indexer
 
IndexerOutputFormat - Class in org.apache.nutch.indexer
 
IndexerOutputFormat() - Constructor for class org.apache.nutch.indexer.IndexerOutputFormat
 
IndexerReducer() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
 
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Dampen the boost value by scorePower.
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
This method calculates an indexed document score/boost.
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.ScoringFilters
 
indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
 
IndexingException - Exception in org.apache.nutch.indexer
 
IndexingException() - Constructor for exception org.apache.nutch.indexer.IndexingException
 
IndexingException(String) - Constructor for exception org.apache.nutch.indexer.IndexingException
 
IndexingException(String, Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
 
IndexingException(Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
 
IndexingFilter - Interface in org.apache.nutch.indexer
Extension point for indexing.
INDEXINGFILTER_ORDER - Static variable in class org.apache.nutch.indexer.IndexingFilters
 
IndexingFilters - Class in org.apache.nutch.indexer
Creates and caches IndexingFilter implementing plugins.
IndexingFilters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingFilters
 
IndexingFiltersChecker - Class in org.apache.nutch.indexer
Reads and parses a URL and runs the indexers on it.
IndexingFiltersChecker() - Constructor for class org.apache.nutch.indexer.IndexingFiltersChecker
 
IndexingJob - Class in org.apache.nutch.indexer
Generic indexer which relies on the plugins implementing IndexWriter
IndexingJob() - Constructor for class org.apache.nutch.indexer.IndexingJob
 
IndexingJob(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingJob
 
IndexWriter - Interface in org.apache.nutch.indexer
 
IndexWriterConfig - Class in org.apache.nutch.indexer
 
IndexWriterParams - Class in org.apache.nutch.indexer
 
IndexWriterParams(Map<? extends String, ? extends String>) - Constructor for class org.apache.nutch.indexer.IndexWriterParams
Fill IndexWriterParams from map.
indexWriters(NutchDocument) - Method in class org.apache.nutch.exchange.Exchanges
Returns all the indexers to which the document must be sent.
IndexWriters - Class in org.apache.nutch.indexer
Creates and caches IndexWriter implementing plugins.
IndexWriters.IndexWriterWrapper - Class in org.apache.nutch.indexer
 
IndexWriterWrapper() - Constructor for class org.apache.nutch.indexer.IndexWriters.IndexWriterWrapper
 
inflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
Returns an inflated copy of the input array.
inflateBestEffort(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
Returns an inflated copy of the input array.
inflateBestEffort(byte[], int) - Static method in class org.apache.nutch.util.DeflateUtils
Returns an inflated copy of the input array, truncated to sizeLimit bytes, if necessary.
INFRASTRUCTURE - org.apache.nutch.util.domain.DomainSuffix.Status
 
INFRASTRUCTURE - org.apache.nutch.util.domain.TopLevelDomain.Type
 
init() - Method in class org.apache.nutch.collection.CollectionManager
 
init(Path) - Method in class org.apache.nutch.crawl.LinkDbReader
 
initialize(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.tools.arc.ArcRecordReader
 
initialize(Element) - Method in class org.apache.nutch.collection.Subcollection
Initialize Subcollection from a DOM element
initializeSchedule(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
Initialize fetch schedule related data.
initializeSchedule(Text, CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
Initialize fetch schedule related data.
initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
initialScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
Set an initial score for newly discovered pages.
initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
Calculate a new initial score, used when adding newly discovered pages.
initMRJob(Path, Path, Collection<Path>, Job, boolean) - Static method in class org.apache.nutch.indexer.IndexerMapReduce
 
inject(Path, Path) - Method in class org.apache.nutch.crawl.Injector
 
inject(Path, Path, boolean, boolean) - Method in class org.apache.nutch.crawl.Injector
 
inject(Path, Path, boolean, boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.Injector
 
INJECT - org.apache.nutch.service.JobManager.JobType
 
injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
 
injectedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
Set an initial score for newly injected pages.
injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
Calculate a new initial score, used when injecting new pages.
InjectMapper() - Constructor for class org.apache.nutch.crawl.Injector.InjectMapper
 
Injector - Class in org.apache.nutch.crawl
Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb.
Injector() - Constructor for class org.apache.nutch.crawl.Injector
 
Injector(Configuration) - Constructor for class org.apache.nutch.crawl.Injector
 
Injector.InjectMapper - Class in org.apache.nutch.crawl
InjectMapper reads the CrawlDb that seeds are injected into, as well as the plain-text seed files, and parses each seed line into the URL and metadata.
Injector.InjectReducer - Class in org.apache.nutch.crawl
Combine multiple new entries for a url.
InjectReducer() - Constructor for class org.apache.nutch.crawl.Injector.InjectReducer
 
Inlink - Class in org.apache.nutch.crawl
An incoming link to a page.
Inlink() - Constructor for class org.apache.nutch.crawl.Inlink
 
Inlink(String, String) - Constructor for class org.apache.nutch.crawl.Inlink
 
INLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
 
INLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
 
inLinks - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
Inlinks - Class in org.apache.nutch.crawl
A list of Inlinks.
Inlinks() - Constructor for class org.apache.nutch.crawl.Inlinks
 
InputCompatMapper() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatMapper
 
InputCompatReducer() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatReducer
 
InputFormat() - Constructor for class org.apache.nutch.fetcher.Fetcher.InputFormat
 
install(Job, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
 
install(Job, Path) - Static method in class org.apache.nutch.crawl.LinkDb
 
InteractiveSeleniumHandler - Interface in org.apache.nutch.protocol.interactiveselenium.handlers
 
INTERNAL - org.apache.nutch.net.protocols.Response.TruncatedContentReason
implementation internal reason
invert(Path, Path[], boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
 
invert(Path, Path, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
 
Inverter() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
 
INVERTLINKS - org.apache.nutch.service.JobManager.JobType
 
InvertMapper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertMapper
 
InvertReducer() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
 
IP - Static variable in class org.apache.nutch.tools.WARCUtils
 
IP_ADDRESS - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold the IP address the request is sent to if store.ip.address is true
IPFilterRules - Class in org.apache.nutch.protocol.okhttp
Optionally limit or block connections to IP address ranges (localhost/loopback or site-local addresses, subnet ranges given in CIDR notation, or single IP addresses).
IPFilterRules(Configuration) - Constructor for class org.apache.nutch.protocol.okhttp.IPFilterRules
 
isAllowListed(URL) - Method in class org.apache.nutch.protocol.RobotRulesParser
Check whether a URL belongs to an allowlisted host.
isAnySuccess() - Method in class org.apache.nutch.parse.ParseResult
A convenience method which returns true if at least one of the parses is successful.
isCanonical() - Method in interface org.apache.nutch.parse.Parse
Indicates if the parse is coming from a url or a sub-url
isCanonical() - Method in class org.apache.nutch.parse.ParseImpl
 
isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
 
isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
 
isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
 
isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
 
isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
 
isCompressed() - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
isCookieEnabled() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isDomainSuffix(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
Return whether the extension is a registered domain entry
isEligibleForCheck(HostDatum) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
Determines whether a record is eligible for recheck.
isEmpty() - Method in class org.apache.nutch.hostdb.HostDatum
 
isEmpty() - Method in class org.apache.nutch.parse.ParseResult
Checks whether the result is empty.
isEmpty() - Method in class org.apache.nutch.protocol.okhttp.IPFilterRules
 
isEmpty(String) - Static method in class org.apache.nutch.util.StringUtil
Checks if a string is empty (i.e., is null or has zero length).
isExempted(String, String) - Method in class org.apache.nutch.net.URLExemptionFilters
Run all defined filters.
isForce() - Method in class org.apache.nutch.service.model.request.NutchConfig
 
isHalted() - Method in class org.apache.nutch.fetcher.FetcherThread
 
isHomePageOf(URL, String) - Static method in class org.apache.nutch.util.URLUtil
Test whether a URL is the home page or root page of a host.
isIfModifiedSinceEnabled() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isIgnoreCase() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
isIndexable(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
Check if the segment is indexable.
isLoginRedirect() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
isMagic(byte[]) - Static method in class org.apache.nutch.tools.arc.ArcRecordReader
Returns true if the byte array passed matches the gzip header magic number.
isModeAccept() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
isModelCreated - Static variable in class org.apache.nutch.scoring.similarity.cosine.Model
 
isMultiValued(String) - Method in class org.apache.nutch.metadata.Metadata
Returns true if named value is multivalued.
isParsed(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
Check the segment to see if it has been parsed before.
isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
 
isPermanentFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
isRedirect() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
isRedirect() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
isRemoteVerificationEnabled() - Method in class org.apache.nutch.protocol.ftp.Client
Return whether or not verification of the remote host participating in data connections is enabled.
isRunning() - Method in class org.apache.nutch.service.NutchServer
 
isSameDomainName(String, String) - Static method in class org.apache.nutch.util.URLUtil
Returns whether the given urls have the same domain name.
isSameDomainName(URL, URL) - Static method in class org.apache.nutch.util.URLUtil
Returns whether the given urls have the same domain name.
isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
 
isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
 
isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
 
isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
 
isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
 
isStoreHttpHeaders() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isStoreHttpRequest() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isStoreIPAddress() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isStorePartialAsTruncated() - Method in class org.apache.nutch.protocol.http.api.HttpBase
Whether to save partial fetches as truncated content.
isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
 
isSuccess() - Method in class org.apache.nutch.parse.ParseResult
A convenience method which returns true only if all parses are successful.
isSuccess() - Method in class org.apache.nutch.parse.ParseStatus
 
isSuccess() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
isTlsCheckCertificates() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
isTransientFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
isTruncated(Content) - Static method in class org.apache.nutch.parse.ParseSegment
Checks if the page's content is truncated.
isValid() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
Does this FieldReplacer have a valid fieldname and pattern?
isWhiteSpace(char) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
Returns whether the specified ch conforms to the XML 1.0 definition of whitespace.
isWhiteSpace(char[], int, int) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
Tell if the string is whitespace.
isWhiteSpace(String) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
Tell if the string is whitespace.
isWhiteSpace(StringBuffer) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
Tell if the string is whitespace.
iterator() - Method in class org.apache.nutch.crawl.Inlinks
 
iterator() - Method in class org.apache.nutch.indexer.NutchDocument
Iterate over all fields.
iterator() - Method in class org.apache.nutch.parse.ParseResult
Iterate over all entries in the <url, Parse> map.
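
Two of the predicates indexed above, isMagic(byte[]) and isWhiteSpace(char), describe checks that are easy to show in isolation. The following is a minimal sketch of those checks, not the Nutch source: the class name PredicateSketch and both method names are hypothetical, the gzip check assumes only the two standard magic bytes 0x1f 0x8b are compared, and the whitespace check follows the four characters that XML 1.0 defines as white space.

```java
public class PredicateSketch {

    /**
     * Hypothetical stand-in for a gzip header check: true if the array
     * starts with the two gzip magic bytes 0x1f 0x8b.
     */
    public static boolean looksLikeGzip(byte[] header) {
        return header != null
            && header.length >= 2
            && (header[0] & 0xff) == 0x1f   // mask to compare as unsigned
            && (header[1] & 0xff) == 0x8b;
    }

    /**
     * Hypothetical stand-in for an XML 1.0 whitespace check: space, tab,
     * carriage return, or line feed.
     */
    public static boolean isXmlWhitespace(char ch) {
        return ch == 0x20 || ch == 0x09 || ch == 0x0D || ch == 0x0A;
    }

    public static void main(String[] args) {
        byte[] gzipHeader = {(byte) 0x1f, (byte) 0x8b, 8};
        System.out.println(looksLikeGzip(gzipHeader));
        System.out.println(isXmlWhitespace('\t'));
        System.out.println(isXmlWhitespace('\u00A0'));
    }
}
```

Note that a non-breaking space (U+00A0) is not white space under this definition, which matters when the checked text comes from HTML.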

J

JexlExchange - Class in org.apache.nutch.exchange.jexl
 
JexlExchange() - Constructor for class org.apache.nutch.exchange.jexl.JexlExchange
 
JexlIndexingFilter - Class in org.apache.nutch.indexer.jexl
An IndexingFilter that allows filtering of documents based on a JEXL expression.
JexlIndexingFilter() - Constructor for class org.apache.nutch.indexer.jexl.JexlIndexingFilter
 
JexlUtil - Class in org.apache.nutch.util
Utility methods for handling JEXL expressions
JexlUtil() - Constructor for class org.apache.nutch.util.JexlUtil
 
JobConfig - Class in org.apache.nutch.service.model.request
Job-specific configuration.
JobConfig() - Constructor for class org.apache.nutch.service.model.request.JobConfig
 
JobFactory - Class in org.apache.nutch.service.impl
 
JobFactory() - Constructor for class org.apache.nutch.service.impl.JobFactory
 
JobInfo - Class in org.apache.nutch.service.model.response
This is the response object containing Job information
JobInfo(String, JobConfig, JobInfo.State, String) - Constructor for class org.apache.nutch.service.model.response.JobInfo
 
JobInfo.State - Enum in org.apache.nutch.service.model.response
 
jobManager - Variable in class org.apache.nutch.service.resources.AbstractResource
 
JobManager - Interface in org.apache.nutch.service
 
JobManager.JobType - Enum in org.apache.nutch.service
 
JobManagerImpl - Class in org.apache.nutch.service.impl
 
JobManagerImpl(JobFactory, ConfManager, NutchServerPoolExecutor) - Constructor for class org.apache.nutch.service.impl.JobManagerImpl
 
JobResource - Class in org.apache.nutch.service.resources
 
JobResource() - Constructor for class org.apache.nutch.service.resources.JobResource
 
JobWorker - Class in org.apache.nutch.service.impl
 
JobWorker(JobConfig, Configuration, NutchTool) - Constructor for class org.apache.nutch.service.impl.JobWorker
Initializes the JobWorker thread with the job configuration provided by the user.
jsonArray - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
JsonIndenter() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
 
JSParseFilter - Class in org.apache.nutch.parse.js
This class is a heuristic link extractor for JavaScript files and code snippets.
JSParseFilter() - Constructor for class org.apache.nutch.parse.js.JSParseFilter
 

K

KafkaConstants - Interface in org.apache.nutch.indexwriter.kafka
 
KafkaIndexWriter - Class in org.apache.nutch.indexwriter.kafka
Sends Nutch documents to a configured Kafka Cluster
KafkaIndexWriter() - Constructor for class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
keepClientCnxOpen - Variable in class org.apache.nutch.util.AbstractChecker
 
KEY_SERIALIZER - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
KEY_STORE_PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
KEY_STORE_PATH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
KEY_STORE_TYPE - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
keyPrefix - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
KILLED - org.apache.nutch.service.model.response.JobInfo.State
 
KILLING - org.apache.nutch.service.model.response.JobInfo.State
 
killJob() - Method in class org.apache.nutch.service.impl.JobWorker
 
killJob() - Method in class org.apache.nutch.util.NutchTool
Kill the job immediately.

L

LANGUAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
A language of the intellectual content of the resource.
LanguageIndexingFilter - Class in org.apache.nutch.analysis.lang
An IndexingFilter that adds a lang (language) field to the document.
LanguageIndexingFilter() - Constructor for class org.apache.nutch.analysis.lang.LanguageIndexingFilter
Constructs a new Language Indexing Filter.
LAST_MODIFIED - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
lastCheck - Variable in class org.apache.nutch.hostdb.HostDatum
 
leftPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
Returns a copy of s (left padded) with leading spaces so that its length is length.
LENGTH - org.apache.nutch.net.protocols.Response.TruncatedContentReason
fetch exceeded configured http.content.limit
LICENSE_LOCATION - Static variable in interface org.apache.nutch.metadata.CreativeCommons
 
LICENSE_URL - Static variable in interface org.apache.nutch.metadata.CreativeCommons
 
LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
 
LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
 
LinkAnalysisScoringFilter - Class in org.apache.nutch.scoring.link
 
LinkAnalysisScoringFilter() - Constructor for class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
LinkDatum - Class in org.apache.nutch.scoring.webgraph
A class for holding link information including the url, anchor text, a score, the timestamp of the link and a link type.
LinkDatum() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
Default constructor, no url, timestamp, score, or link type.
LinkDatum(String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
Creates a LinkDatum with a given url.
LinkDatum(String, String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
Creates a LinkDatum with a url and an anchor text.
LinkDatum(String, String, long) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
 
LinkDb - Class in org.apache.nutch.crawl
Maintains an inverted link map, listing incoming links for each url.
LinkDb() - Constructor for class org.apache.nutch.crawl.LinkDb
 
LinkDb(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDb
 
LinkDb.LinkDbMapper - Class in org.apache.nutch.crawl
 
LinkDBDumpMapper() - Constructor for class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
 
LinkDbFilter - Class in org.apache.nutch.crawl
This class provides a way to separate the URL normalization and filtering steps from the rest of LinkDb manipulation code.
LinkDbFilter() - Constructor for class org.apache.nutch.crawl.LinkDbFilter
 
LinkDbMapper() - Constructor for class org.apache.nutch.crawl.LinkDb.LinkDbMapper
 
LinkDbMerger - Class in org.apache.nutch.crawl
This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.
LinkDbMerger() - Constructor for class org.apache.nutch.crawl.LinkDbMerger
 
LinkDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDbMerger
 
LinkDbMerger.LinkDbMergeReducer - Class in org.apache.nutch.crawl
 
LinkDbMergeReducer() - Constructor for class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
 
LinkDbReader - Class in org.apache.nutch.crawl
Read utility for the LinkDb.
LinkDbReader() - Constructor for class org.apache.nutch.crawl.LinkDbReader
 
LinkDbReader(Configuration, Path) - Constructor for class org.apache.nutch.crawl.LinkDbReader
 
LinkDbReader.LinkDBDumpMapper - Class in org.apache.nutch.crawl
 
LinkDumper - Class in org.apache.nutch.scoring.webgraph
The LinkDumper tool creates a database of node to inlink information that can be read using the nested Reader class.
LinkDumper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper
 
LinkDumper.Inverter - Class in org.apache.nutch.scoring.webgraph
Inverts outlinks from the WebGraph to inlinks and attaches node information.
LinkDumper.Inverter.InvertMapper - Class in org.apache.nutch.scoring.webgraph
Wraps all values in ObjectWritables.
LinkDumper.Inverter.InvertReducer - Class in org.apache.nutch.scoring.webgraph
Inverts outlinks to inlinks while attaching node information to the outlink.
LinkDumper.LinkNode - Class in org.apache.nutch.scoring.webgraph
Bean class which holds url to node information.
LinkDumper.LinkNodes - Class in org.apache.nutch.scoring.webgraph
Writable class which holds an array of LinkNode objects.
LinkDumper.Merger - Class in org.apache.nutch.scoring.webgraph
Merges LinkNode objects into a single array value per url.
LinkDumper.Reader - Class in org.apache.nutch.scoring.webgraph
Reader class which will print out the url and all of its inlinks to system out.
LinkNode() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
LinkNode(String, Node) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
LinkNodes() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
LinkNodes(LinkDumper.LinkNode[]) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
LinkParams(String, String, int) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
 
LinkRank - Class in org.apache.nutch.scoring.webgraph
 
LinkRank() - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
Default constructor.
LinkRank(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
Configurable constructor.
linkRead() - Method in class org.apache.nutch.service.resources.ReaderResouce
Get Link Reader response schema
linkRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
Read link object
LinkReader - Class in org.apache.nutch.service.impl
 
LinkReader() - Constructor for class org.apache.nutch.service.impl.LinkReader
 
LINKS_INLINKS_HOST - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
LINKS_ONLY_HOSTS - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
LINKS_OUTLINKS_HOST - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
LinksIndexingFilter - Class in org.apache.nutch.indexer.links
An IndexingFilter that adds outlinks and inlinks field(s) to the document.
LinksIndexingFilter() - Constructor for class org.apache.nutch.indexer.links.LinksIndexingFilter
 
list() - Method in interface org.apache.nutch.service.ConfManager
 
list() - Method in class org.apache.nutch.service.impl.ConfManagerImpl
 
list(String, JobInfo.State) - Method in class org.apache.nutch.service.impl.JobManagerImpl
 
list(String, JobInfo.State) - Method in interface org.apache.nutch.service.JobManager
 
list(List<Path>, Writer) - Method in class org.apache.nutch.segment.SegmentReader
 
listDumpPaths(String) - Method in class org.apache.nutch.service.resources.ServicesResource
 
loadClass(String, boolean) - Method in class org.apache.nutch.plugin.PluginClassLoader
 
LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
lock(Configuration, Path, boolean) - Static method in class org.apache.nutch.crawl.CrawlDb
 
LOCK_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
 
LOCK_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
 
LOCK_NAME - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
 
LOCK_NAME - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
 
LOCK_NAME - Static variable in class org.apache.nutch.util.SitemapProcessor
 
LockUtil - Class in org.apache.nutch.util
Utility methods for handling application-level locking.
LockUtil() - Constructor for class org.apache.nutch.util.LockUtil
 
LOG - Static variable in class org.apache.nutch.crawl.Generator
 
LOG - Static variable in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
LOG - Static variable in class org.apache.nutch.plugin.PluginManifestParser
 
LOG - Static variable in class org.apache.nutch.plugin.PluginRepository
 
LOG - Static variable in class org.apache.nutch.plugin.URLStreamHandlerFactory
 
LOG - Static variable in class org.apache.nutch.protocol.file.File
 
LOG - Static variable in class org.apache.nutch.protocol.ftp.Ftp
 
LOG - Static variable in class org.apache.nutch.protocol.htmlunit.Http
 
LOG - Static variable in class org.apache.nutch.protocol.http.Http
 
LOG - Static variable in class org.apache.nutch.protocol.httpclient.Http
 
LOG - Static variable in class org.apache.nutch.protocol.interactiveselenium.Http
 
LOG - Static variable in class org.apache.nutch.protocol.okhttp.IPFilterRules
 
LOG - Static variable in class org.apache.nutch.protocol.okhttp.OkHttp
 
LOG - Static variable in class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
LOG - Static variable in class org.apache.nutch.protocol.selenium.Http
 
LOG - Static variable in interface org.apache.nutch.service.NutchReader
 
LOG - Static variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
LOG - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
logConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
logDateFormat - Static variable in class org.apache.nutch.util.TimingUtil
Formats dates for logging
logDateMillis(long) - Static method in class org.apache.nutch.util.TimingUtil
Convert epoch milliseconds (System.currentTimeMillis()) into date string (local time zone) used for logging
login() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
login(String, String) - Method in class org.apache.nutch.protocol.ftp.Client
Login to the FTP server using the provided username and password.
logout() - Method in class org.apache.nutch.protocol.ftp.Client
Logout of the FTP server by sending the QUIT command.
logShort(Throwable) - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
Return true if exception is configured to be logged as short message without stack trace, usually done for frequent exceptions with obvious reasons (e.g., UnknownHostException), configurable by http.log.exceptions.suppress.stack
longestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
Returns the longest prefix of input that is matched, or null if no match exists.
longestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
Returns the longest suffix of input that is matched, or null if no match exists.
longestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
Returns the longest substring of input that is matched by a pattern in the trie, or null if no match exists.
LuceneAnalyzerUtil - Class in org.apache.nutch.scoring.similarity.util
Creates a custom analyzer based on user provided inputs
LuceneAnalyzerUtil(LuceneAnalyzerUtil.StemFilterType, boolean) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
Creates an analyzer instance based on Lucene default stopword set if the param useStopFilter is set to true
LuceneAnalyzerUtil(LuceneAnalyzerUtil.StemFilterType, List<String>, boolean) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
Creates an analyzer instance based on user provided stop words.
LuceneAnalyzerUtil.StemFilterType - Enum in org.apache.nutch.scoring.similarity.util
 
LuceneTokenizer - Class in org.apache.nutch.scoring.similarity.util
 
LuceneTokenizer(String, LuceneTokenizer.TokenizerType, boolean, LuceneAnalyzerUtil.StemFilterType) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
Creates a tokenizer based on param values
LuceneTokenizer(String, LuceneTokenizer.TokenizerType, List<String>, boolean, LuceneAnalyzerUtil.StemFilterType) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
Creates a tokenizer based on param values
LuceneTokenizer(String, LuceneTokenizer.TokenizerType, LuceneAnalyzerUtil.StemFilterType, int, int) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
Creates a tokenizer for the ngram model based on param values
LuceneTokenizer.TokenizerType - Enum in org.apache.nutch.scoring.similarity.util
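
The leftPad(String, int) and longestMatch(String) entries in this section describe behaviour that a short example can make concrete. The sketch below is an independent illustration, not the Nutch implementation: the class StringSketch and its method names are hypothetical, leftPad is assumed to return strings that are already long enough unchanged, and the prefix match uses a plain linear scan over a list rather than the trie that TrieStringMatcher is named after.

```java
import java.util.Arrays;
import java.util.List;

public class StringSketch {

    /**
     * Pads s with leading spaces until it is `length` characters long.
     * Assumption: strings already at least that long are returned as-is.
     */
    public static String leftPad(String s, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < length; i++) {
            sb.append(' ');
        }
        return sb.append(s).toString();
    }

    /**
     * Returns the longest entry of `prefixes` that `input` starts with,
     * or null if none matches. Linear scan, for illustration only.
     */
    public static String longestPrefixMatch(List<String> prefixes, String input) {
        String best = null;
        for (String p : prefixes) {
            if (input.startsWith(p) && (best == null || p.length() > best.length())) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> prefixes = Arrays.asList("ab", "abc", "x");
        System.out.println("[" + leftPad("ab", 5) + "]");
        System.out.println(longestPrefixMatch(prefixes, "abcd"));
        System.out.println(longestPrefixMatch(prefixes, "zzz"));
    }
}
```

A trie gives the same answer in time proportional to the input length rather than to the number of patterns, which is the usual reason to prefer it once the pattern set is large.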
 

M

m_currentNode - Variable in class org.apache.nutch.parse.html.DOMBuilder
Current node
m_doc - Variable in class org.apache.nutch.parse.html.DOMBuilder
Root document
m_docFrag - Variable in class org.apache.nutch.parse.html.DOMBuilder
First node of document fragment or null if not a DocumentFragment
m_elemStack - Variable in class org.apache.nutch.parse.html.DOMBuilder
Vector of element nodes
m_inCData - Variable in class org.apache.nutch.parse.html.DOMBuilder
Flag indicating that we are processing a CData section
main(String[]) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDb
 
main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
Run the tool.
main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbReader
 
main(String[]) - Static method in class org.apache.nutch.crawl.DeduplicationJob
 
main(String[]) - Static method in class org.apache.nutch.crawl.Generator
Generate a fetchlist from the crawldb.
main(String[]) - Static method in class org.apache.nutch.crawl.Injector
 
main(String[]) - Static method in class org.apache.nutch.crawl.LinkDb
 
main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbMerger
Run the job
main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbReader
 
main(String[]) - Static method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
main(String[]) - Static method in class org.apache.nutch.crawl.TextProfileSignature
 
main(String[]) - Static method in class org.apache.nutch.fetcher.Fetcher
Run the fetcher.
main(String[]) - Static method in class org.apache.nutch.hostdb.ReadHostDb
 
main(String[]) - Static method in class org.apache.nutch.hostdb.UpdateHostDb
 
main(String[]) - Static method in class org.apache.nutch.indexer.CleaningJob
 
main(String[]) - Static method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
Main method for invoking this tool
main(String[]) - Static method in class org.apache.nutch.indexer.IndexingFiltersChecker
 
main(String[]) - Static method in class org.apache.nutch.indexer.IndexingJob
 
main(String[]) - Static method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
main(String[]) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
main(String[]) - Static method in class org.apache.nutch.net.URLFilterChecker
 
main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
Spits out patterns and substitutions that are in the configuration file.
main(String[]) - Static method in class org.apache.nutch.net.URLNormalizerChecker
 
main(String[]) - Static method in class org.apache.nutch.parse.feed.FeedParser
Runs a command line version of this Parser.
main(String[]) - Static method in class org.apache.nutch.parse.html.HtmlParser
 
main(String[]) - Static method in class org.apache.nutch.parse.js.JSParseFilter
Main method which can be run from command line with the plugin option.
main(String[]) - Static method in class org.apache.nutch.parse.ParseData
 
main(String[]) - Static method in class org.apache.nutch.parse.ParserChecker
 
main(String[]) - Static method in class org.apache.nutch.parse.ParseSegment
 
main(String[]) - Static method in class org.apache.nutch.parse.ParseText
 
main(String[]) - Static method in class org.apache.nutch.parse.zip.ZipParser
 
main(String[]) - Static method in class org.apache.nutch.plugin.PluginRepository
Loads all necessary dependencies for a selected plugin, and then runs one of the classes' main() method.
main(String[]) - Static method in class org.apache.nutch.protocol.Content
 
main(String[]) - Static method in class org.apache.nutch.protocol.file.File
Quick way for running this class.
main(String[]) - Static method in class org.apache.nutch.protocol.ftp.Ftp
For debugging.
main(String[]) - Static method in class org.apache.nutch.protocol.htmlunit.Http
 
main(String[]) - Static method in class org.apache.nutch.protocol.http.Http
 
main(String[]) - Static method in class org.apache.nutch.protocol.httpclient.Http
Main method.
main(String[]) - Static method in class org.apache.nutch.protocol.interactiveselenium.Http
 
main(String[]) - Static method in class org.apache.nutch.protocol.okhttp.OkHttp
 
main(String[]) - Static method in class org.apache.nutch.protocol.RobotRulesParser
 
main(String[]) - Static method in class org.apache.nutch.protocol.selenium.Http
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkRank
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeDumper
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeReader
Runs the NodeReader tool.
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
 
main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.WebGraph
 
main(String[]) - Static method in class org.apache.nutch.segment.SegmentMerger
 
main(String[]) - Static method in class org.apache.nutch.segment.SegmentReader
 
main(String[]) - Static method in class org.apache.nutch.service.NutchServer
 
main(String[]) - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
 
main(String[]) - Static method in class org.apache.nutch.tools.CommonCrawlDataDumper
Main method for invoking this tool
main(String[]) - Static method in class org.apache.nutch.tools.DmozParser
Command-line access.
main(String[]) - Static method in class org.apache.nutch.tools.FileDumper
Main method for invoking this tool
main(String[]) - Static method in class org.apache.nutch.tools.FreeGenerator
 
main(String[]) - Static method in class org.apache.nutch.tools.ResolveUrls
Runs the resolve urls tool.
main(String[]) - Static method in class org.apache.nutch.tools.ShowProperties
 
main(String[]) - Static method in class org.apache.nutch.tools.warc.WARCExporter
 
main(String[]) - Static method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
main(String[]) - Static method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
 
main(String[]) - Static method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
main(String[]) - Static method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
main(String[]) - Static method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
main(String[]) - Static method in class org.apache.nutch.util.CommandRunner
 
main(String[]) - Static method in class org.apache.nutch.util.CrawlCompletionStats
 
main(String[]) - Static method in class org.apache.nutch.util.domain.DomainStatistics
 
main(String[]) - Static method in class org.apache.nutch.util.EncodingDetector
 
main(String[]) - Static method in class org.apache.nutch.util.PrefixStringMatcher
 
main(String[]) - Static method in class org.apache.nutch.util.ProtocolStatusStatistics
 
main(String[]) - Static method in class org.apache.nutch.util.SitemapProcessor
 
main(String[]) - Static method in class org.apache.nutch.util.StringUtil
 
main(String[]) - Static method in class org.apache.nutch.util.SuffixStringMatcher
 
main(String[]) - Static method in class org.apache.nutch.util.URLUtil
For testing
main(HttpBase, String[]) - Static method in class org.apache.nutch.protocol.http.api.HttpBase
 
main(RegexURLFilterBase, String[]) - Static method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Filter the standard input using a RegexURLFilterBase.
majorCodes - Static variable in class org.apache.nutch.parse.ParseStatus
 
makeClient(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
Generates a RestHighLevelClient with the hosts given
makeClient(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
Generates a RestHighLevelClient with the hosts given
map(FloatWritable, Generator.SelectorEntry, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorInverseMapper
 
map(Text, BytesWritable, Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
Runs the Map job to translate an arc record into output for Nutch segments.
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertMapper
 
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterMapper
 
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
 
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
Mapper ingesting records from the HostDB, CrawlDB and plaintext host scores file.
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
 
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
 
map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCMapper
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorMapper
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbFilter
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateMapper
 
map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
 
map(Text, Inlinks, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbFilter
 
map(Text, Inlinks, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
 
map(Text, MetaWrapper, Mapper.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
 
map(Text, ParseData, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDb.LinkDbMapper
 
map(Text, Node, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
 
map(Text, Node, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
 
map(WritableComparable<?>, Text, Mapper.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
 
map(WritableComparable<?>, Writable, Mapper.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatMapper
 
map(WritableComparable<?>, Content, Mapper.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
 
mask(String) - Static method in class org.apache.nutch.util.StringUtil
Mask sensitive strings - passwords, etc.
mask(String, char) - Static method in class org.apache.nutch.util.StringUtil
Mask sensitive strings - passwords, etc.
mask(String, Pattern, char) - Static method in class org.apache.nutch.util.StringUtil
Mask sensitive strings - passwords, etc.
match(String) - Method in class org.apache.nutch.urlfilter.api.RegexRule
Checks if a url matches this rule.
match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyAllRule
 
match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathQueryRule
 
match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathRule
 
match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
 
match(NutchDocument) - Method in interface org.apache.nutch.exchange.Exchange
Determines if the document must go to the related index writers.
match(NutchDocument) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
Determines if the document must go to the related index writers.
matchChar(TrieStringMatcher.TrieNode, String, int) - Method in class org.apache.nutch.util.TrieStringMatcher
Get the next TrieStringMatcher.TrieNode visited, given that you are at node, and that the next character in the input is the idx'th character of s.
matches(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
Returns true if the given String is matched by a prefix in the trie
matches(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
Returns true if the given String is matched by a suffix in the trie
matches(String) - Method in class org.apache.nutch.util.TrieStringMatcher
Returns true if the given String is matched by a pattern in the trie
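The trie-based matching described in the matches(String) entries above can be illustrated with a minimal sketch. The class PrefixTrieSketch and its members are hypothetical names chosen for this example; this is not the Nutch TrieStringMatcher code, only an assumption about how a prefix trie of this kind typically works.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of prefix matching with a character trie.
// Not the Nutch implementation; names are illustrative only.
class PrefixTrieSketch {
    private static final class TrieNode {
        final Map<Character, TrieNode> children = new HashMap<>();
        boolean terminal; // true if a stored prefix ends at this node
    }

    private final TrieNode root = new TrieNode();

    PrefixTrieSketch(String... prefixes) {
        for (String p : prefixes) {
            TrieNode node = root;
            for (int i = 0; i < p.length(); i++) {
                node = node.children.computeIfAbsent(p.charAt(i), c -> new TrieNode());
            }
            node.terminal = true;
        }
    }

    // Returns true if any stored prefix is a prefix of the input.
    boolean matches(String input) {
        TrieNode node = root;
        if (node.terminal) {
            return true; // the empty prefix matches everything
        }
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                return false; // no stored prefix continues with this character
            }
            if (node.terminal) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixTrieSketch m = new PrefixTrieSketch("http://", "ftp://");
        System.out.println(m.matches("http://example.org/"));
        System.out.println(m.matches("mailto:someone@example.org"));
    }
}
```

A suffix matcher could presumably reuse the same structure by inserting and looking up the strings in reverse.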
MAX_BULK_DOCS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
MAX_BULK_DOCS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
MAX_BULK_LENGTH - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
MAX_BULK_LENGTH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
MAX_DEPTH_KEY - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
MAX_DEPTH_KEY_W - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
MAX_DOC_COUNT - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
MAX_DOCS_BATCH - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
 
MAX_WARC_FILE_SIZE - Static variable in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
maxContent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The length limit for downloaded content, in bytes.
maxCrawlDelay - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Skip page if Crawl-Delay longer than this value.
maxDuration - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The time limit to download the entire content, in seconds.
maxInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
 
maxNumRedirects - Variable in class org.apache.nutch.protocol.RobotRulesParser
 
MD5Signature - Class in org.apache.nutch.crawl
Default implementation of a page signature.
MD5Signature() - Constructor for class org.apache.nutch.crawl.MD5Signature
 
merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDbMerger
 
merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDbMerger
 
merge(Path, Path[], boolean, boolean, long) - Method in class org.apache.nutch.segment.SegmentMerger
 
Merger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger.Merger
 
Merger() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
 
metadata - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
metadata - Variable in class org.apache.nutch.metadata.Metadata
A map of all metadata attributes.
metadata - Variable in class org.apache.nutch.parse.ParserChecker
 
metadata - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
metaData - Variable in class org.apache.nutch.hostdb.HostDatum
 
Metadata - Class in org.apache.nutch.metadata
A multi-valued metadata container.
Metadata() - Constructor for class org.apache.nutch.metadata.Metadata
Constructs a new, empty metadata.
METADATA_CONTENT - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
 
METADATA_DATUM - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
 
METADATA_PARSED - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
 
MetadataIndexer - Class in org.apache.nutch.indexer.metadata
Indexer which can be configured to extract metadata from the crawldb, parse metadata or content metadata.
MetadataIndexer() - Constructor for class org.apache.nutch.indexer.metadata.MetadataIndexer
 
MetadataScoringFilter - Class in org.apache.nutch.scoring.metadata
MetadataScoringFilter() - Constructor for class org.apache.nutch.scoring.metadata.MetadataScoringFilter
 
metadataSource - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
Metadata source field name
metadataToJson(Metadata) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
Adds keys/values of a Nutch metadata container to a JsonObject.
MetaTagsParser - Class in org.apache.nutch.parse.metatags
Parse HTML meta tags (keywords, description) and store them in the parse metadata so that they can be indexed with the index-metadata plugin with the prefix 'metatag.'.
MetaTagsParser() - Constructor for class org.apache.nutch.parse.metatags.MetaTagsParser
 
MetaWrapper - Class in org.apache.nutch.metadata
This is a simple decorator that adds metadata to any Writable that can be serialized by NutchWritable.
MetaWrapper() - Constructor for class org.apache.nutch.metadata.MetaWrapper
 
MetaWrapper(Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
 
MetaWrapper(Metadata, Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
 
MimeAdaptiveFetchSchedule - Class in org.apache.nutch.crawl
Extension of AdaptiveFetchSchedule that allows for more flexible configuration of DEC and INC factors for various MIME types.
MimeAdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
MIMEFILTER_REGEX_FILE - Static variable in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
 
MimeTypeIndexingFilter - Class in org.apache.nutch.indexer.filter
An IndexingFilter that allows filtering of documents based on the MIME Type detected by Tika
MimeTypeIndexingFilter() - Constructor for class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
 
MimeUtil - Class in org.apache.nutch.util
This is a facade class to insulate Nutch from its underlying Mime Type substrate library, Apache Tika.
MimeUtil(Configuration) - Constructor for class org.apache.nutch.util.MimeUtil
 
MIN_CONFIDENCE_KEY - Static variable in class org.apache.nutch.util.EncodingDetector
 
MissingDependencyException - Exception in org.apache.nutch.plugin
MissingDependencyException will be thrown if a plugin dependency cannot be found.
MissingDependencyException(String) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
 
MissingDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
 
Model - Class in org.apache.nutch.scoring.similarity.cosine
This class creates a model used to store Document vector representation of the corpus.
Model() - Constructor for class org.apache.nutch.scoring.similarity.cosine.Model
 
MODIFIED - Static variable in interface org.apache.nutch.metadata.DublinCore
Date on which the resource was changed.
modifyWebClient(WebClient) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
 
MoreIndexingFilter - Class in org.apache.nutch.indexer.more
Add (or reset) a few metaData properties as respective fields (if they are available), so that they can be accurately used within the search index.
MoreIndexingFilter() - Constructor for class org.apache.nutch.indexer.more.MoreIndexingFilter
 
MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Resource has moved permanently.

N

NaiveBayesParseFilter - Class in org.apache.nutch.parsefilter.naivebayes
HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file with positive and negative example texts; see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
NaiveBayesParseFilter() - Constructor for class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
names() - Method in class org.apache.nutch.metadata.Metadata
Returns an array of the names contained in the metadata.
next(Text, BytesWritable) - Method in class org.apache.nutch.tools.arc.ArcRecordReader
Returns true if the next record in the split is read into the key and value pair.
nextKeyValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
 
nextNode() - Method in class org.apache.nutch.util.NodeWalker
Returns the next Node on the stack and pushes all of its children onto the stack, allowing us to walk the node tree without the use of recursion.
NO_THRESHOLD - Static variable in class org.apache.nutch.util.EncodingDetector
 
Node - Class in org.apache.nutch.scoring.webgraph
A class which holds the number of inlinks and outlinks for a given url along with an inlink score from a link analysis program and any metadata.
Node() - Constructor for class org.apache.nutch.scoring.webgraph.Node
 
NODE_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
 
nodeChar - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
 
NodeDumper - Class in org.apache.nutch.scoring.webgraph
A tool that dumps out the top URLs by number of inlinks, number of outlinks, or by score, to a text file.
NodeDumper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper
 
NodeDumper.Dumper - Class in org.apache.nutch.scoring.webgraph
Outputs the hosts or domains with an associated value.
NodeDumper.Dumper.DumperMapper - Class in org.apache.nutch.scoring.webgraph
Outputs the host or domain as key for this record and numInlinks, numOutlinks or score as the value.
NodeDumper.Dumper.DumperReducer - Class in org.apache.nutch.scoring.webgraph
Outputs either the sum or the top value for this record.
NodeDumper.Sorter - Class in org.apache.nutch.scoring.webgraph
Outputs the top urls sorted in descending order.
NodeDumper.Sorter.SorterMapper - Class in org.apache.nutch.scoring.webgraph
Outputs the URL with the appropriate number of inlinks or outlinks, or with the score.
NodeDumper.Sorter.SorterReducer - Class in org.apache.nutch.scoring.webgraph
Flips and collects the url and numeric sort value.
nodeRead() - Method in class org.apache.nutch.service.resources.ReaderResouce
Get schema of the Node object
nodeRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
Read Node object as stored in the Nutch Webgraph
NodeReader - Class in org.apache.nutch.scoring.webgraph
Reads and prints to system out information for a single node from the NodeDb in the WebGraph.
NodeReader - Class in org.apache.nutch.service.impl
 
NodeReader() - Constructor for class org.apache.nutch.scoring.webgraph.NodeReader
 
NodeReader() - Constructor for class org.apache.nutch.service.impl.NodeReader
 
NodeReader(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.NodeReader
 
NodeWalker - Class in org.apache.nutch.util
A utility class that allows the walking of any DOM tree using a stack instead of recursion.
NodeWalker(Node) - Constructor for class org.apache.nutch.util.NodeWalker
Starts the Node tree from the root node.
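The stack-based traversal described for NodeWalker (pop a node, push its children, repeat) can be sketched with the standard org.w3c.dom API. The class StackWalkSketch and its methods are hypothetical and are not the Nutch NodeWalker code; the XML input is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch of walking a DOM tree with an explicit stack
// instead of recursion. Not the Nutch implementation.
class StackWalkSketch {
    // Collects element names in document order.
    static List<String> elementNames(Node root) {
        List<String> names = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node current = stack.pop();
            if (current.getNodeType() == Node.ELEMENT_NODE) {
                names.add(current.getNodeName());
            }
            // Push children in reverse so the first child is popped first.
            NodeList children = current.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return names;
    }

    // Parses an XML string and returns its root element.
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(parse("<a><b><c/></b><d/></a>")));
    }
}
```

Because the stack lives on the heap, a walk of this shape is not limited by the thread's call-stack depth on deeply nested documents.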
NONE - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
 
NORM_HOST_IDN - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
NORM_HOST_TRIM_TRAILING_DOT - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
normalize - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Attempts to normalize the input URL string
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
 
normalize(String, String) - Method in interface org.apache.nutch.net.URLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
 
normalize(String, String) - Method in class org.apache.nutch.net.URLNormalizers
Normalize
normalizeEscapedFragment(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Returns a normalized input URL.
normalizeHashedFragment(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Returns a normalized input URL.
normalizers - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
normalizers - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
 
normalizers - Variable in class org.apache.nutch.parse.ParserChecker
 
NOT_FETCHED - org.apache.nutch.util.domain.DomainStatistics.MyCounter
 
NOT_IN_USE - org.apache.nutch.util.domain.DomainSuffix.Status
 
NOT_TRUNCATED - org.apache.nutch.net.protocols.Response.TruncatedContentReason
 
NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Not fetching.
NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Resource was not found.
notModified - Variable in class org.apache.nutch.hostdb.HostDatum
 
NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Unchanged since the last fetch.
NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing was not performed.
now - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
numericFields - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
numericFieldWritables - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
numFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
numJobs - Variable in class org.apache.nutch.util.NutchTool
 
numOverDue - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
numRecords() - Method in class org.apache.nutch.hostdb.HostDatum
 
numResolverThreads - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
Nutch - Interface in org.apache.nutch.metadata
A collection of Nutch internal metadata constants.
NutchConfig - Class in org.apache.nutch.service.model.request
 
NutchConfig() - Constructor for class org.apache.nutch.service.model.request.NutchConfig
 
NutchConfiguration - Class in org.apache.nutch.util
Utility to create Hadoop Configurations that include Nutch-specific resources.
NutchDocument - Class in org.apache.nutch.indexer
A NutchDocument is the unit of indexing.
NutchDocument() - Constructor for class org.apache.nutch.indexer.NutchDocument
 
nutchFetchIntervalMDName - Static variable in class org.apache.nutch.crawl.Injector
metadata key reserved for setting a custom fetchInterval for a specific URL
NutchField - Class in org.apache.nutch.indexer
This class represents a multi-valued field with a weight.
NutchField() - Constructor for class org.apache.nutch.indexer.NutchField
 
NutchField(Object) - Constructor for class org.apache.nutch.indexer.NutchField
 
NutchField(Object, float) - Constructor for class org.apache.nutch.indexer.NutchField
 
nutchFixedFetchIntervalMDName - Static variable in class org.apache.nutch.crawl.Injector
metadata key reserved for setting a fixed custom fetchInterval for a specific URL
NutchIndexAction - Class in org.apache.nutch.indexer
A NutchIndexAction is the new unit of indexing holding the document and action information.
NutchIndexAction() - Constructor for class org.apache.nutch.indexer.NutchIndexAction
 
NutchIndexAction(NutchDocument, byte) - Constructor for class org.apache.nutch.indexer.NutchIndexAction
 
NutchJob - Class in org.apache.nutch.util
A Job for Nutch jobs.
NutchJob(Configuration, String) - Constructor for class org.apache.nutch.util.NutchJob
NutchPublisher - Interface in org.apache.nutch.publisher
All publisher subscriber model implementations should implement this interface.
NutchPublishers - Class in org.apache.nutch.publisher
 
NutchPublishers(Configuration) - Constructor for class org.apache.nutch.publisher.NutchPublishers
 
NutchReader - Interface in org.apache.nutch.service
 
nutchScoreMDName - Static variable in class org.apache.nutch.crawl.Injector
metadata key reserved for setting a custom score for a specific URL
NutchServer - Class in org.apache.nutch.service
 
NutchServerInfo - Class in org.apache.nutch.service.model.response
 
NutchServerInfo() - Constructor for class org.apache.nutch.service.model.response.NutchServerInfo
 
NutchServerPoolExecutor - Class in org.apache.nutch.service.impl
 
NutchServerPoolExecutor(int, int, long, TimeUnit, BlockingQueue<Runnable>) - Constructor for class org.apache.nutch.service.impl.NutchServerPoolExecutor
 
NutchTool - Class in org.apache.nutch.util
 
NutchTool() - Constructor for class org.apache.nutch.util.NutchTool
 
NutchTool(Configuration) - Constructor for class org.apache.nutch.util.NutchTool
 
NutchWritable - Class in org.apache.nutch.crawl
 
NutchWritable() - Constructor for class org.apache.nutch.crawl.NutchWritable
 
NutchWritable(Writable) - Constructor for class org.apache.nutch.crawl.NutchWritable
 

O

ObjectCache - Class in org.apache.nutch.util
 
ObjectInputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
 
OkHttp - Class in org.apache.nutch.protocol.okhttp
 
OkHttp() - Constructor for class org.apache.nutch.protocol.okhttp.OkHttp
 
OkHttpResponse - Class in org.apache.nutch.protocol.okhttp
 
OkHttpResponse(OkHttp, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.okhttp.OkHttpResponse
 
OkHttpResponse.TruncatedContent - Class in org.apache.nutch.protocol.okhttp
Container to store whether and why content has been truncated
OLD_OUTLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
 
open() - Method in class org.apache.nutch.exchange.Exchanges
Opens each configured exchange.
open(Map<String, String>) - Method in interface org.apache.nutch.exchange.Exchange
Initializes the internal variables.
open(Map<String, String>) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
Initializes the internal variables.
open(Configuration, String) - Method in interface org.apache.nutch.indexer.IndexWriter
Deprecated.
open(Configuration, String) - Method in class org.apache.nutch.indexer.IndexWriters
Initializes the internal variables of index writers.
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
open(Configuration, String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
open(IndexWriterParams) - Method in interface org.apache.nutch.indexer.IndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
Initializes the internal variables from a given index writer configuration.
open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
Initializes the internal variables from a given index writer configuration.
openChannel() - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
Opens a new channel into the opened connection.
openReaders() - Method in class org.apache.nutch.crawl.LinkDbReader
 
OpenSearch1xConstants - Interface in org.apache.nutch.indexwriter.opensearch1x
 
OpenSearch1xIndexWriter - Class in org.apache.nutch.indexwriter.opensearch1x
Sends NutchDocuments to a configured OpenSearch index.
OpenSearch1xIndexWriter() - Constructor for class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
OPERATOR - Static variable in class org.apache.nutch.tools.WARCUtils
 
OPICScoringFilter - Class in org.apache.nutch.scoring.opic
This plugin implements a variant of an Online Page Importance Computation (OPIC) score, described in this paper: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.
OPICScoringFilter() - Constructor for class org.apache.nutch.scoring.opic.OPICScoringFilter
 
OPTIONS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
OPTIONS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
org.apache.nutch.analysis.lang - package org.apache.nutch.analysis.lang
Text document language identifier.
org.apache.nutch.collection - package org.apache.nutch.collection
Subcollection is a subset of an index.
org.apache.nutch.crawl - package org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
org.apache.nutch.exchange - package org.apache.nutch.exchange
Control code for exchange component, which acts in indexing job and decides to which index writer a document should be routed, based on plugins behavior.
org.apache.nutch.exchange.jexl - package org.apache.nutch.exchange.jexl
Plugin of Exchange component based on JEXL expressions.
org.apache.nutch.fetcher - package org.apache.nutch.fetcher
The Nutch multi-threaded fetching module
org.apache.nutch.hostdb - package org.apache.nutch.hostdb
 
org.apache.nutch.indexer - package org.apache.nutch.indexer
Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.anchor - package org.apache.nutch.indexer.anchor
An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.arbitrary - package org.apache.nutch.indexer.arbitrary
Indexing filter to add arbitrary document data to the index from the output of a user-specified class.
org.apache.nutch.indexer.basic - package org.apache.nutch.indexer.basic
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed - package org.apache.nutch.indexer.feed
Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.filter - package org.apache.nutch.indexer.filter
 
org.apache.nutch.indexer.geoip - package org.apache.nutch.indexer.geoip
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
org.apache.nutch.indexer.jexl - package org.apache.nutch.indexer.jexl
This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata
org.apache.nutch.indexer.links - package org.apache.nutch.indexer.links
 
org.apache.nutch.indexer.metadata - package org.apache.nutch.indexer.metadata
Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more - package org.apache.nutch.indexer.more
A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.replace - package org.apache.nutch.indexer.replace
Indexing filter to allow pattern replacements on metadata.
org.apache.nutch.indexer.staticfield - package org.apache.nutch.indexer.staticfield
A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection - package org.apache.nutch.indexer.subcollection
Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld - package org.apache.nutch.indexer.tld
Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta - package org.apache.nutch.indexer.urlmeta
URL Meta Tag Indexing Plugin
org.apache.nutch.indexwriter.cloudsearch - package org.apache.nutch.indexwriter.cloudsearch
 
org.apache.nutch.indexwriter.csv - package org.apache.nutch.indexwriter.csv
Index writer plugin to write a plain CSV file.
org.apache.nutch.indexwriter.dummy - package org.apache.nutch.indexwriter.dummy
Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic - package org.apache.nutch.indexwriter.elastic
Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.kafka - package org.apache.nutch.indexwriter.kafka
Index writer plugin to produce JSON messages to Kafka.
org.apache.nutch.indexwriter.opensearch1x - package org.apache.nutch.indexwriter.opensearch1x
Index writer plugin for OpenSearch.
org.apache.nutch.indexwriter.rabbit - package org.apache.nutch.indexwriter.rabbit
 
org.apache.nutch.indexwriter.solr - package org.apache.nutch.indexwriter.solr
Index writer plugin for Apache Solr.
org.apache.nutch.metadata - package org.apache.nutch.metadata
A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.microformats.reltag - package org.apache.nutch.microformats.reltag
A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.net - package org.apache.nutch.net
Web-related interfaces: URL filters and normalizers.
org.apache.nutch.net.protocols - package org.apache.nutch.net.protocols
Helper classes related to the Protocol interface, see also org.apache.nutch.protocol.
org.apache.nutch.net.urlnormalizer.ajax - package org.apache.nutch.net.urlnormalizer.ajax
 
org.apache.nutch.net.urlnormalizer.basic - package org.apache.nutch.net.urlnormalizer.basic
URL normalizer performing basic normalizations: remove default ports (e.g., port 80 for http:// URLs); remove needless slashes and dot segments in the path component; remove anchors; use percent-encoding (only) where needed. For example, https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol. Optional and configurable normalizations are: convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation (see property urlnormalizer.basic.host.idn); remove a trailing dot from host names (see property urlnormalizer.basic.host.trim-trailing-dot).
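A subset of the basic normalizations listed above (default port removal, dot-segment removal, anchor removal) can be sketched with java.net.URI alone. The class BasicNormalizeSketch is a hypothetical name; this is not the BasicURLNormalizer code, and it deliberately leaves out percent-encoding and IDN handling.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch covering only part of the basic normalizations:
// default ports, "." and ".." path segments, and the fragment (anchor).
// Not the Nutch implementation.
class BasicNormalizeSketch {
    static String normalize(String url) {
        try {
            // URI.normalize() resolves "." and ".." segments in the path.
            URI u = new URI(url).normalize();
            String scheme = u.getScheme();
            int port = u.getPort();
            boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
            if (defaultPort) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }
            // Rebuild the URI without the fragment. Assumption: the path and
            // query contain no percent-encoded reserved characters, since
            // getPath()/getQuery() return decoded values.
            return new URI(scheme, u.getUserInfo(), u.getHost(), port,
                u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            normalize("http://www.example.org:80/a/../b/./page.html?x=1#anchor"));
    }
}
```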
org.apache.nutch.net.urlnormalizer.host - package org.apache.nutch.net.urlnormalizer.host
URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass - package org.apache.nutch.net.urlnormalizer.pass
URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.protocol - package org.apache.nutch.net.urlnormalizer.protocol
URL normalizer to normalize the protocol for all URLs of a given host or domain.
org.apache.nutch.net.urlnormalizer.querystring - package org.apache.nutch.net.urlnormalizer.querystring
URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
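The idea of sorting query elements so that permuted URLs collapse to one form can be sketched as follows. The class QuerySortSketch is a hypothetical name and not the QuerystringURLNormalizer code; the sketch assumes the URL carries no fragment and that parameters are separated by "&".

```java
import java.util.Arrays;

// Hypothetical sketch: sort the "&"-separated query elements so that
// URLs differing only in parameter order become identical.
// Not the Nutch implementation; assumes the URL has no fragment.
class QuerySortSketch {
    static String sortQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return url; // no query part, nothing to sort
        }
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params); // lexicographic order of "name=value" elements
        return url.substring(0, q + 1) + String.join("&", params);
    }

    public static void main(String[] args) {
        System.out.println(sortQuery("http://example.org/p?b=2&a=1"));
        System.out.println(sortQuery("http://example.org/p?a=1&b=2"));
    }
}
```

Both calls in main yield the same string, which is what lets later deduplication treat the two URLs as one.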
org.apache.nutch.net.urlnormalizer.regex - package org.apache.nutch.net.urlnormalizer.regex
URL normalizer with configurable rules based on regular expressions (Pattern).
org.apache.nutch.net.urlnormalizer.slash - package org.apache.nutch.net.urlnormalizer.slash
 
org.apache.nutch.parse - package org.apache.nutch.parse
The Parse interface and related classes.
org.apache.nutch.parse.ext - package org.apache.nutch.parse.ext
Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed - package org.apache.nutch.parse.feed
Parse RSS feeds.
org.apache.nutch.parse.headings - package org.apache.nutch.parse.headings
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.html - package org.apache.nutch.parse.html
An HTML document parsing plugin.
org.apache.nutch.parse.js - package org.apache.nutch.parse.js
Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags - package org.apache.nutch.parse.metatags
Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.tika - package org.apache.nutch.parse.tika
Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip - package org.apache.nutch.parse.zip
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.parsefilter.debug - package org.apache.nutch.parsefilter.debug
Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
org.apache.nutch.parsefilter.naivebayes - package org.apache.nutch.parsefilter.naivebayes
HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file with positive and negative example texts, see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex - package org.apache.nutch.parsefilter.regex
RegexParseFilter.
org.apache.nutch.plugin - package org.apache.nutch.plugin
The Nutch Plugin System.
org.apache.nutch.protocol - package org.apache.nutch.protocol
Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file - package org.apache.nutch.protocol.file
Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp - package org.apache.nutch.protocol.ftp
Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit - package org.apache.nutch.protocol.htmlunit
Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
org.apache.nutch.protocol.http - package org.apache.nutch.protocol.http
Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api - package org.apache.nutch.protocol.http.api
Common API used by HTTP plugins (http, httpclient, etc.)
org.apache.nutch.protocol.httpclient - package org.apache.nutch.protocol.httpclient
Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.interactiveselenium - package org.apache.nutch.protocol.interactiveselenium
Protocol plugin which supports retrieving documents using and interacting with Selenium.
org.apache.nutch.protocol.interactiveselenium.handlers - package org.apache.nutch.protocol.interactiveselenium.handlers
Handler implementations to interact with Selenium for org.apache.nutch.protocol.interactiveselenium.
org.apache.nutch.protocol.okhttp - package org.apache.nutch.protocol.okhttp
Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2.
org.apache.nutch.protocol.selenium - package org.apache.nutch.protocol.selenium
Protocol plugin which supports retrieving documents via Selenium.
org.apache.nutch.publisher - package org.apache.nutch.publisher
 
org.apache.nutch.publisher.rabbitmq - package org.apache.nutch.publisher.rabbitmq
Publisher package to implement queues
org.apache.nutch.rabbitmq - package org.apache.nutch.rabbitmq
 
org.apache.nutch.scoring - package org.apache.nutch.scoring
The ScoringFilter interface.
org.apache.nutch.scoring.depth - package org.apache.nutch.scoring.depth
Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link - package org.apache.nutch.scoring.link
Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.metadata - package org.apache.nutch.scoring.metadata
Metadata Scoring Plugin
org.apache.nutch.scoring.opic - package org.apache.nutch.scoring.opic
Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.orphan - package org.apache.nutch.scoring.orphan
Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time).
org.apache.nutch.scoring.similarity - package org.apache.nutch.scoring.similarity
 
org.apache.nutch.scoring.similarity.cosine - package org.apache.nutch.scoring.similarity.cosine
Implements the cosine similarity metric for scoring relevant documents
org.apache.nutch.scoring.similarity.util - package org.apache.nutch.scoring.similarity.util
Utility package for Lucene functions.
org.apache.nutch.scoring.tld - package org.apache.nutch.scoring.tld
Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta - package org.apache.nutch.scoring.urlmeta
URL Meta Tag Scoring Plugin
org.apache.nutch.scoring.webgraph - package org.apache.nutch.scoring.webgraph
Scoring implementation based on link analysis (LinkRank), see WebGraph.
org.apache.nutch.segment - package org.apache.nutch.segment
A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.service - package org.apache.nutch.service
 
org.apache.nutch.service.impl - package org.apache.nutch.service.impl
 
org.apache.nutch.service.model.request - package org.apache.nutch.service.model.request
 
org.apache.nutch.service.model.response - package org.apache.nutch.service.model.response
 
org.apache.nutch.service.resources - package org.apache.nutch.service.resources
 
org.apache.nutch.tools - package org.apache.nutch.tools
Miscellaneous tools.
org.apache.nutch.tools.arc - package org.apache.nutch.tools.arc
Tools to read the Arc file format.
org.apache.nutch.tools.warc - package org.apache.nutch.tools.warc
Tools to import / export between Nutch segments and WARC archives.
org.apache.nutch.urlfilter.api - package org.apache.nutch.urlfilter.api
Generic URL filter library, abstracting away from regular expression implementations.
org.apache.nutch.urlfilter.automaton - package org.apache.nutch.urlfilter.automaton
URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain - package org.apache.nutch.urlfilter.domain
URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domaindenylist - package org.apache.nutch.urlfilter.domaindenylist
URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.fast - package org.apache.nutch.urlfilter.fast
URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL.
org.apache.nutch.urlfilter.ignoreexempt - package org.apache.nutch.urlfilter.ignoreexempt
URL filter plugin which identifies exemptions to external URLs when external URLs are set to ignore.
org.apache.nutch.urlfilter.prefix - package org.apache.nutch.urlfilter.prefix
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex - package org.apache.nutch.urlfilter.regex
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix - package org.apache.nutch.urlfilter.suffix
URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator - package org.apache.nutch.urlfilter.validator
URL filter plugin that validates given urls.
org.apache.nutch.util - package org.apache.nutch.util
Miscellaneous utility classes.
org.apache.nutch.util.domain - package org.apache.nutch.util.domain
Classes for domain name analysis.
org.creativecommons.nutch - package org.creativecommons.nutch
Sample plugins that parse and index Creative Commons metadata.
ORIGINAL_CHAR_ENCODING - Static variable in interface org.apache.nutch.metadata.Nutch
 
ORPHAN_KEY_WRITABLE - Static variable in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
 
orphanedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
 
orphanedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
This method may change the score or status of CrawlDatum during CrawlDb update, when the URL is neither fetched nor has any inlinks.
orphanedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
Calculate orphaned page score during CrawlDb.update().
OrphanScoringFilter - Class in org.apache.nutch.scoring.orphan
Orphan scoring filter that determines whether a page has become orphaned, e.g.
OrphanScoringFilter() - Constructor for class org.apache.nutch.scoring.orphan.OrphanScoringFilter
 
Outlink - Class in org.apache.nutch.parse
An outgoing link from a page.
Outlink() - Constructor for class org.apache.nutch.parse.Outlink
 
Outlink(String, String) - Constructor for class org.apache.nutch.parse.Outlink
 
OUTLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
 
OUTLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
 
OutlinkDb() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
Default constructor.
OutlinkDb(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
Configurable constructor.
OutlinkDbMapper() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
 
OutlinkDbReducer() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
 
OutlinkExtractor - Class in org.apache.nutch.parse
Extractor to extract Outlinks / URLs from plain text using Regular Expressions.
OutlinkExtractor() - Constructor for class org.apache.nutch.parse.OutlinkExtractor
 
overDueTime - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 
overDueTimeLimit - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
 

P

parse(InputStream) - Method in class org.apache.nutch.collection.CollectionManager
 
parse(String) - Static method in class org.apache.nutch.segment.SegmentPart
Create SegmentPart from a String in format "segmentName/partName".
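The "segmentName/partName" format can be split at the separator. A minimal sketch, assuming the split happens at the first "/" and that a missing separator is an error; the class SegmentPartSketch and its exception type are hypothetical and may differ from the actual SegmentPart class.

```java
// Hypothetical illustration of parsing "segmentName/partName".
class SegmentPartSketch {
    final String segmentName;
    final String partName;

    SegmentPartSketch(String segmentName, String partName) {
        this.segmentName = segmentName;
        this.partName = partName;
    }

    static SegmentPartSketch parse(String s) {
        // Assumption: the first '/' separates segment name and part name.
        int idx = s.indexOf('/');
        if (idx == -1) {
            throw new IllegalArgumentException("Invalid segment part: '" + s + "'");
        }
        return new SegmentPartSketch(s.substring(0, idx), s.substring(idx + 1));
    }

    public static void main(String[] args) {
        SegmentPartSketch p = parse("20240101000000/crawl_fetch");
        System.out.println(p.segmentName + " -> " + p.partName);
    }
}
```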
parse(Path) - Method in class org.apache.nutch.parse.ParseSegment
 
parse(Content) - Method in class org.apache.nutch.parse.ParseUtil
Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned.
Parse - Interface in org.apache.nutch.parse
The result of parsing a page's raw content.
PARSE - org.apache.nutch.service.JobManager.JobType
 
PARSE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
 
PARSE_FORMAT - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
Use a less restrictive format for parsing: accept single-digit day-of-month and any timezone
parseArgs(String[], int) - Method in class org.apache.nutch.util.AbstractChecker
 
parseByExtensionId(String, Content) - Method in class org.apache.nutch.parse.ParseUtil
Method parses a Content object using the Parser specified by the parameter extId, i.e., the Parser's extension ID.
parseCharacterEncoding(String) - Static method in class org.apache.nutch.util.EncodingDetector
Parse the character encoding from the specified content type header.
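Extracting the encoding from a content type header amounts to locating the charset parameter. The sketch below uses a regular expression for this; the class name CharsetParamSketch is hypothetical, and the handling of quotes and trailing parameters is an assumption rather than the behavior of EncodingDetector.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the EncodingDetector code.
class CharsetParamSketch {

    // Matches charset=VALUE or charset="VALUE", case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset parameter value, or null if there is none. */
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseCharacterEncoding("text/html; charset=UTF-8"));
    }
}
```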
parsed - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
ParseData - Class in org.apache.nutch.parse
Data extracted from a page's content.
ParseData() - Constructor for class org.apache.nutch.parse.ParseData
 
ParseData(ParseStatus, String, Outlink[], Metadata) - Constructor for class org.apache.nutch.parse.ParseData
 
ParseData(ParseStatus, String, Outlink[], Metadata, Metadata) - Constructor for class org.apache.nutch.parse.ParseData
 
parseDmozFile(File, int, boolean, int, Pattern) - Method in class org.apache.nutch.tools.DmozParser
Iterate through all the items in this structured DMOZ file.
parseErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
ParseException - Exception in org.apache.nutch.parse
 
ParseException() - Constructor for exception org.apache.nutch.parse.ParseException
 
ParseException(String) - Constructor for exception org.apache.nutch.parse.ParseException
 
ParseException(String, Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
 
ParseException(Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
 
parseExpression(String) - Static method in class org.apache.nutch.util.JexlUtil
Parses the given expression to a JEXL expression.
ParseImpl - Class in org.apache.nutch.parse
The result of parsing a page's raw content.
ParseImpl() - Constructor for class org.apache.nutch.parse.ParseImpl
 
ParseImpl(String, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
 
ParseImpl(Parse) - Constructor for class org.apache.nutch.parse.ParseImpl
 
ParseImpl(ParseText, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
 
ParseImpl(ParseText, ParseData, boolean) - Constructor for class org.apache.nutch.parse.ParseImpl
 
parseList(List<String>, String) - Method in class org.apache.nutch.collection.Subcollection
Create a list of patterns from a chunk of text; patterns are separated by newlines.
ParseOutputFormat - Class in org.apache.nutch.parse
 
ParseOutputFormat() - Constructor for class org.apache.nutch.parse.ParseOutputFormat
 
parsePluginFolder(String[]) - Method in class org.apache.nutch.plugin.PluginManifestParser
Returns a list of all found plugin descriptors.
Parser - Interface in org.apache.nutch.parse
A parser for content generated by a Protocol implementation.
ParserChecker - Class in org.apache.nutch.parse
Parser checker, useful for testing parser.
ParserChecker() - Constructor for class org.apache.nutch.parse.ParserChecker
 
ParseResult - Class in org.apache.nutch.parse
A utility class that stores result of a parse.
ParseResult(String) - Constructor for class org.apache.nutch.parse.ParseResult
Create a container for parse results.
ParserFactory - Class in org.apache.nutch.parse
Creates and caches Parser plugins.
ParserFactory(Configuration) - Constructor for class org.apache.nutch.parse.ParserFactory
 
ParserNotFound - Exception in org.apache.nutch.parse
 
ParserNotFound(String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
 
ParserNotFound(String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
 
ParserNotFound(String, String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
 
parseRules(String, byte[], String, String) - Method in class org.apache.nutch.protocol.RobotRulesParser
Deprecated.
parseRules(String, byte[], String, Collection<String>) - Method in class org.apache.nutch.protocol.RobotRulesParser
Parses the robots content using the SimpleRobotRulesParser from crawler-commons
ParseSegment - Class in org.apache.nutch.parse
 
ParseSegment() - Constructor for class org.apache.nutch.parse.ParseSegment
 
ParseSegment(Configuration) - Constructor for class org.apache.nutch.parse.ParseSegment
 
ParseSegment.ParseSegmentMapper - Class in org.apache.nutch.parse
 
ParseSegment.ParseSegmentReducer - Class in org.apache.nutch.parse
 
ParseSegmentMapper() - Constructor for class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
 
ParseSegmentReducer() - Constructor for class org.apache.nutch.parse.ParseSegment.ParseSegmentReducer
 
ParseStatus - Class in org.apache.nutch.parse
 
ParseStatus() - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseStatus(int) - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseStatus(int, int) - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseStatus(int, int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
Simplified constructor for passing just a text message.
ParseStatus(int, int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseStatus(int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
Simplified constructor for passing just a text message.
ParseStatus(int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseStatus(Throwable) - Constructor for class org.apache.nutch.parse.ParseStatus
 
ParseText - Class in org.apache.nutch.parse
 
ParseText() - Constructor for class org.apache.nutch.parse.ParseText
 
ParseText(String) - Constructor for class org.apache.nutch.parse.ParseText
 
ParseUtil - Class in org.apache.nutch.parse
A utility class with methods for common parsing tasks, such as iterating through a preferred list of Parsers to obtain Parse objects.
ParseUtil(Configuration) - Constructor for class org.apache.nutch.parse.ParseUtil
Overloaded constructor
partialAsTruncated - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Whether to save partial fetches as truncated content.
PARTITION_MODE_DOMAIN - Static variable in class org.apache.nutch.crawl.URLPartitioner
 
PARTITION_MODE_HOST - Static variable in class org.apache.nutch.crawl.URLPartitioner
 
PARTITION_MODE_IP - Static variable in class org.apache.nutch.crawl.URLPartitioner
 
PARTITION_MODE_KEY - Static variable in class org.apache.nutch.crawl.URLPartitioner
 
PartitionReducer() - Constructor for class org.apache.nutch.crawl.Generator.PartitionReducer
 
partName - Variable in class org.apache.nutch.segment.SegmentPart
Name of the segment part (ie.
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
passScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process.
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.ScoringFilters
 
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
 
passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
Takes the metadata, specified in your "scoring.db.md" property, from the datum object and injects it into the content.
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in interface org.apache.nutch.scoring.ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.ScoringFilters
 
passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content.
PassURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.pass
This URLNormalizer doesn't change urls.
PassURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
 
PASSWORD - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
PASSWORD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
PATH - Static variable in interface org.apache.nutch.indexwriter.dummy.DummyConstants
 
pattern - Variable in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
 
percentiles - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.Fetcher
 
Pluggable - Interface in org.apache.nutch.plugin
Defines the capability of a class to be plugged into Nutch.
Plugin - Class in org.apache.nutch.plugin
A nutch-plugin is a container for a set of custom logic that provides extensions to the nutch core functionality or another plugin that provides an API for extending.
Plugin(PluginDescriptor, Configuration) - Constructor for class org.apache.nutch.plugin.Plugin
Overloaded constructor
PluginClassLoader - Class in org.apache.nutch.plugin
The PluginClassLoader is a child-first classloader that only contains classes of the runtime libraries set up in the plugin manifest file and exported libraries of plugins that are required plugins.
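"Child-first" means the loader looks in its own URLs before delegating to the parent, the reverse of the default JDK delegation order. A minimal sketch of that technique on top of URLClassLoader follows; the class ChildFirstClassLoaderSketch and the helper tryLoad are hypothetical and are not taken from the PluginClassLoader source.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration of child-first class loading.
class ChildFirstClassLoaderSketch extends URLClassLoader {

    ChildFirstClassLoaderSketch(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Child first: search this loader's own URLs.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Fall back to the usual parent delegation.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    /** Convenience helper for the demo: returns null if the class is not found. */
    static Class<?> tryLoad(ClassLoader loader, String name) {
        try {
            return loader.loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        ClassLoader cl = new ChildFirstClassLoaderSketch(
            new URL[0], ClassLoader.getSystemClassLoader());
        // With no URLs of its own, the loader falls back to the parent.
        System.out.println(tryLoad(cl, "java.lang.String"));
    }
}
```

The benefit of this ordering for a plugin system is that a plugin can ship its own version of a library without being overridden by a different version visible to the parent loader.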
PluginClassLoader(URL[], ClassLoader) - Constructor for class org.apache.nutch.plugin.PluginClassLoader
Overloaded constructor
PluginDescriptor - Class in org.apache.nutch.plugin
The PluginDescriptor provides access to all meta information of a nutch-plugin, as well as to the internationalizable resources and the plugin's own classloader.
PluginDescriptor(String, String, String, String, String, String, Configuration) - Constructor for class org.apache.nutch.plugin.PluginDescriptor
Overloaded constructor
PluginManifestParser - Class in org.apache.nutch.plugin
The PluginManifestParser provides a mechanism for parsing Nutch plugin manifest files (plugin.xml) contained in a String of plugin directories.
PluginManifestParser(Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.PluginManifestParser
 
PluginRepository - Class in org.apache.nutch.plugin
The plugin repository is a registry of all plugins.
PluginRepository(Configuration) - Constructor for class org.apache.nutch.plugin.PluginRepository
 
PluginRuntimeException - Exception in org.apache.nutch.plugin
PluginRuntimeException is thrown when an exception occurs in the plugin management.
PluginRuntimeException(String) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
 
PluginRuntimeException(Throwable) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
 
PORT - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
PORT - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
PORT - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
PORTERSTEM_FILTER - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
 
pos - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
PrefixStringMatcher - Class in org.apache.nutch.util
A class for efficiently matching Strings against a set of prefixes.
PrefixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
Creates a new PrefixStringMatcher which will match Strings with any prefix in the supplied array.
PrefixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
Creates a new PrefixStringMatcher which will match Strings with any prefix in the supplied Collection.
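One common way to match a string against a set of prefixes efficiently is a character trie, where a lookup costs at most one step per input character regardless of how many prefixes are stored. The sketch below illustrates that technique; the class PrefixMatcherSketch and its matches method are hypothetical and are not claimed to mirror the PrefixStringMatcher API or internals.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of trie-based prefix matching.
class PrefixMatcherSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored prefix ends at this node
    }

    private final Node root = new Node();

    PrefixMatcherSketch(Collection<String> prefixes) {
        for (String prefix : prefixes) {
            Node n = root;
            for (int i = 0; i < prefix.length(); i++) {
                n = n.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
            }
            n.terminal = true;
        }
    }

    /** Returns true if the input starts with any stored prefix. */
    boolean matches(String input) {
        Node n = root;
        if (n.terminal) {
            return true; // the empty prefix matches everything
        }
        for (int i = 0; i < input.length(); i++) {
            n = n.children.get(input.charAt(i));
            if (n == null) {
                return false;
            }
            if (n.terminal) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixMatcherSketch m =
            new PrefixMatcherSketch(List.of("http://a.example/", "https://b.example/"));
        System.out.println(m.matches("http://a.example/page"));
    }
}
```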
PrefixURLFilter - Class in org.apache.nutch.urlfilter.prefix
Filters URLs based on a file of URL prefixes.
PrefixURLFilter() - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
PrefixURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
PrintCommandListener - Class in org.apache.nutch.protocol.ftp
This is a support class for logging all ftp command/reply traffic.
PrintCommandListener(Logger) - Constructor for class org.apache.nutch.protocol.ftp.PrintCommandListener
 
PROBLEMATIC_HEADERS - Static variable in class org.apache.nutch.tools.WARCUtils
 
process(String, StringBuilder) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
process(String, StringBuilder) - Method in class org.apache.nutch.crawl.LinkDbReader
 
process(String, StringBuilder) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
 
process(String, StringBuilder) - Method in class org.apache.nutch.net.URLFilterChecker
 
process(String, StringBuilder) - Method in class org.apache.nutch.net.URLNormalizerChecker
 
process(String, StringBuilder) - Method in class org.apache.nutch.parse.ParserChecker
 
process(String, StringBuilder) - Method in class org.apache.nutch.util.AbstractChecker
 
processDeflateEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
 
processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
 
processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
 
processDriver(WebDriver) - Method in interface org.apache.nutch.protocol.interactiveselenium.handlers.InteractiveSeleniumHandler
 
processDumpJob(String, String, String) - Method in class org.apache.nutch.crawl.LinkDbReader
 
processDumpJob(String, String, Configuration, String, String, String, Integer, String, Float) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
processGzipEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
processingInstruction(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of a processing instruction.
processSingle(String) - Method in class org.apache.nutch.util.AbstractChecker
 
processStatJob(String, Configuration, boolean) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
processStdin() - Method in class org.apache.nutch.util.AbstractChecker
 
processTCP(int) - Method in class org.apache.nutch.util.AbstractChecker
 
processTopNJob(String, long, float, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
PROPOSED - org.apache.nutch.util.domain.DomainSuffix.Status
 
PROTO_NOT_FOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
This protocol was not found.
PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
Protocol - Interface in org.apache.nutch.protocol
A retriever of url content.
PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
 
PROTOCOL_STATUS_CODE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
protocolCommandSent(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
 
ProtocolException - Exception in org.apache.nutch.net.protocols
Deprecated.
Use org.apache.nutch.protocol.ProtocolException instead.
ProtocolException - Exception in org.apache.nutch.protocol
 
ProtocolException() - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
Deprecated.
 
ProtocolException() - Constructor for exception org.apache.nutch.protocol.ProtocolException
 
ProtocolException(String) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
Deprecated.
 
ProtocolException(String) - Constructor for exception org.apache.nutch.protocol.ProtocolException
 
ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
Deprecated.
 
ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
 
ProtocolException(Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
Deprecated.
 
ProtocolException(Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
 
ProtocolFactory - Class in org.apache.nutch.protocol
Creates and caches Protocol plugins.
ProtocolFactory(Configuration) - Constructor for class org.apache.nutch.protocol.ProtocolFactory
 
ProtocolLogUtil - Class in org.apache.nutch.net.protocols
 
ProtocolLogUtil() - Constructor for class org.apache.nutch.net.protocols.ProtocolLogUtil
 
ProtocolNotFound - Exception in org.apache.nutch.protocol
 
ProtocolNotFound(String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
 
ProtocolNotFound(String, String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
 
ProtocolOutput - Class in org.apache.nutch.protocol
Simple aggregate to pass from protocol plugins both content and protocol status.
ProtocolOutput(Content) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
 
ProtocolOutput(Content, ProtocolStatus) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
 
protocolReplyReceived(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
 
ProtocolStatus - Class in org.apache.nutch.protocol
 
ProtocolStatus() - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int, Object) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int, Object, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int, String[]) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(int, String[], long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatus(Throwable) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
 
ProtocolStatusStatistics - Class in org.apache.nutch.util
Extracts protocol status code information from the crawl database.
ProtocolStatusStatistics() - Constructor for class org.apache.nutch.util.ProtocolStatusStatistics
 
ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner - Class in org.apache.nutch.util
 
ProtocolStatusStatisticsCombiner() - Constructor for class org.apache.nutch.util.ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner
 
ProtocolURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.protocol
URL normalizer to normalize the protocol for all URLs of a given host or domain, e.g.
ProtocolURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
 
proxyException - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The proxy exception list.
proxyHost - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The proxy hostname.
proxyPort - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The proxy port.
proxyType - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The proxy type.
PSEUDO_DOMAIN - org.apache.nutch.util.domain.DomainSuffix.Status
 
publish(Object, Configuration) - Method in interface org.apache.nutch.publisher.NutchPublisher
This method publishes the event.
publish(Object, Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
 
publish(Object, Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
 
publish(String, String, RabbitMQMessage) - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
Publishes a new message over an exchange.
publish(FetcherThreadEvent, Configuration) - Method in class org.apache.nutch.fetcher.FetcherThreadPublisher
Publish event to all registered publishers
PUBLISHER - Static variable in interface org.apache.nutch.metadata.DublinCore
An entity responsible for making the resource available.
purgeFailedHostsThreshold - Variable in class org.apache.nutch.hostdb.ResolverThread
 
purgeFailedHostsThreshold - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
put(String, FetchNode) - Method in class org.apache.nutch.fetcher.FetchNodeDb
 
put(String, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
Store a result of parsing.
put(Text, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
Store a result of parsing.
putAllMetaData(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
Add all metadata from other CrawlDatum to this CrawlDatum.
putAllMetaData(HostDatum) - Method in class org.apache.nutch.hostdb.HostDatum
Add all metadata from other HostDatum to this HostDatum.

Q

query(Map<String, String>, Configuration, String, String) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
QuerystringURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.querystring
URL normalizer plugin for normalizing query strings by sorting query string parameters.
QuerystringURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
 
queue - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
QUEUE_MODE_DOMAIN - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
 
QUEUE_MODE_HOST - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
 
QUEUE_MODE_IP - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
 
QueueFeeder - Class in org.apache.nutch.fetcher
This class feeds the queues with input items, and re-fills them as items are consumed by FetcherThread-s.
QueueFeeder(Mapper.Context, FetchItemQueues, int) - Constructor for class org.apache.nutch.fetcher.QueueFeeder
 

R

RabbitIndexWriter - Class in org.apache.nutch.indexwriter.rabbit
 
RabbitIndexWriter() - Constructor for class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
RabbitMQClient - Class in org.apache.nutch.rabbitmq
Client for RabbitMQ
RabbitMQClient(String) - Constructor for class org.apache.nutch.rabbitmq.RabbitMQClient
Builds a new instance of RabbitMQClient
RabbitMQClient(String, int, String, String, String) - Constructor for class org.apache.nutch.rabbitmq.RabbitMQClient
Builds a new instance of RabbitMQClient
RabbitMQMessage - Class in org.apache.nutch.rabbitmq
 
RabbitMQMessage() - Constructor for class org.apache.nutch.rabbitmq.RabbitMQMessage
 
RabbitMQPublisherImpl - Class in org.apache.nutch.publisher.rabbitmq
 
RabbitMQPublisherImpl() - Constructor for class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
 
read(DataInput) - Static method in class org.apache.nutch.crawl.CrawlDatum
 
read(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
 
read(DataInput) - Static method in class org.apache.nutch.parse.Outlink
 
read(DataInput) - Static method in class org.apache.nutch.parse.ParseData
 
read(DataInput) - Static method in class org.apache.nutch.parse.ParseImpl
 
read(DataInput) - Static method in class org.apache.nutch.parse.ParseStatus
 
read(DataInput) - Static method in class org.apache.nutch.parse.ParseText
 
read(DataInput) - Static method in class org.apache.nutch.protocol.Content
 
read(DataInput) - Static method in class org.apache.nutch.protocol.ProtocolStatus
 
read(String) - Method in class org.apache.nutch.service.impl.LinkReader
 
read(String) - Method in class org.apache.nutch.service.impl.NodeReader
 
read(String) - Method in class org.apache.nutch.service.impl.SequenceReader
 
read(String) - Method in interface org.apache.nutch.service.NutchReader
 
readConfiguration(Reader) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
readdb(DbQuery) - Method in class org.apache.nutch.service.resources.DbResource
 
READDB - org.apache.nutch.service.JobManager.JobType
 
Reader() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
 
ReaderConfig - Class in org.apache.nutch.service.model.request
 
ReaderConfig() - Constructor for class org.apache.nutch.service.model.request.ReaderConfig
 
ReaderResouce - Class in org.apache.nutch.service.resources
The Reader endpoint enables a user to read sequence files, nodes and links from the Nutch webgraph.
ReaderResouce() - Constructor for class org.apache.nutch.service.resources.ReaderResouce
 
readFields(DataInput) - Method in class org.apache.nutch.crawl.CrawlDatum
 
readFields(DataInput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
 
readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlink
 
readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlinks
 
readFields(DataInput) - Method in class org.apache.nutch.hostdb.HostDatum
 
readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchDocument
 
readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchField
 
readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchIndexAction
 
readFields(DataInput) - Method in class org.apache.nutch.metadata.Metadata
 
readFields(DataInput) - Method in class org.apache.nutch.metadata.MetaWrapper
 
readFields(DataInput) - Method in class org.apache.nutch.parse.Outlink
 
readFields(DataInput) - Method in class org.apache.nutch.parse.ParseData
 
readFields(DataInput) - Method in class org.apache.nutch.parse.ParseImpl
 
readFields(DataInput) - Method in class org.apache.nutch.parse.ParseStatus
 
readFields(DataInput) - Method in class org.apache.nutch.parse.ParseText
 
readFields(DataInput) - Method in class org.apache.nutch.protocol.Content
 
readFields(DataInput) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Node
 
readFields(DataInput) - Method in class org.apache.nutch.util.GenericWritableConfigurable
 
readHostDb() - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
 
ReadHostDb - Class in org.apache.nutch.hostdb
 
ReadHostDb() - Constructor for class org.apache.nutch.hostdb.ReadHostDb
 
readingCrawlDb - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
readUrl(String, String, Configuration, StringBuilder) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
recheckInterval - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Too many redirects.
redirectIsQueuedRecently(Text) - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
redirPerm - Variable in class org.apache.nutch.hostdb.HostDatum
 
redirTemp - Variable in class org.apache.nutch.hostdb.HostDatum
 
reduce(K, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
reduce(ByteWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
 
reduce(FloatWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
 
reduce(FloatWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
 
reduce(FloatWritable, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
 
reduce(Text, Iterable<FloatWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
 
reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.CrawlCompletionStats.CrawlCompletionStatsCombiner
 
reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
 
reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner
 
reduce(Text, Iterable<ObjectWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
 
reduce(Text, Iterable<ObjectWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
 
reduce(Text, Iterable<Writable>, Reducer.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentReducer
 
reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
 
reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReducer
 
reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
 
reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
 
reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
Merge the input records of one URL according to the rules listed in the method's documentation.
reduce(Text, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.PartitionReducer
 
reduce(Text, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGReducer
 
reduce(Text, Iterable<Inlinks>, Reducer.Context) - Method in class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
 
reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
 
reduce(Text, Iterable<MetaWrapper>, Reducer.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
 
reduce(Text, Iterable<LinkDumper.LinkNode>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
Aggregate all LinkNode objects for a given url.
regex() - Method in class org.apache.nutch.urlfilter.api.RegexRule
Return this rule's regex.
regexEscape(String) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
Escapes any character that needs escaping so it can be used in a regexp.
regexNormalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
This function does the replacements by iterating through all the regex patterns.
RegexParseFilter - Class in org.apache.nutch.parsefilter.regex
RegexParseFilter.
RegexParseFilter() - Constructor for class org.apache.nutch.parsefilter.regex.RegexParseFilter
 
RegexRule - Class in org.apache.nutch.urlfilter.api
A generic regular expression rule.
RegexRule(boolean, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
Constructs a new regular expression rule.
RegexRule(boolean, String, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
Constructs a new regular expression rule.
RegexURLFilter - Class in org.apache.nutch.urlfilter.regex
Filters URLs based on a file of regular expressions using the Java Regex implementation.
RegexURLFilter() - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
RegexURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
RegexURLFilterBase - Class in org.apache.nutch.urlfilter.api
Generic URLFilter based on regular expressions.
RegexURLFilterBase() - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Constructs a new empty RegexURLFilterBase
RegexURLFilterBase(File) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Constructs a new RegexURLFilter and initializes it with a file of rules.
RegexURLFilterBase(Reader) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
RegexURLFilterBase(String) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
Constructs a new RegexURLFilter and initializes it with a list of rules.
RegexURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.regex
Allows users to do regex substitutions on all/any URLs that are encountered, which is useful for stripping session IDs from URLs.
RegexURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
The default constructor, which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in the method getNormalizer().
RegexURLNormalizer(Configuration) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
 
RegexURLNormalizer(Configuration, String) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
Constructor which can be passed the configuration file name, so it doesn't look in other configuration files for it.
REGION - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
 
registerPluginRepository(PluginRepository) - Method in class org.apache.nutch.plugin.URLStreamHandlerFactory
Use this method to register a new PluginRepository once it has been created.
REJECTED - org.apache.nutch.util.domain.DomainSuffix.Status
 
REL_TAG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
 
RELATION - Static variable in interface org.apache.nutch.metadata.DublinCore
A reference to a related resource.
reloadRules() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
RelTagIndexingFilter - Class in org.apache.nutch.microformats.reltag
An IndexingFilter that adds tag field(s) to the document.
RelTagIndexingFilter() - Constructor for class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
 
RelTagParser - Class in org.apache.nutch.microformats.reltag
Adds microformat rel-tags of document if found.
RelTagParser() - Constructor for class org.apache.nutch.microformats.reltag.RelTagParser
 
remove(String) - Method in class org.apache.nutch.metadata.Metadata
Remove a metadata entry and all its associated values.
remove(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
 
removeField(String) - Method in class org.apache.nutch.indexer.NutchDocument
 
removeLockFile(Configuration, Path) - Static method in class org.apache.nutch.util.LockUtil
Remove lock file.
removeLockFile(FileSystem, Path) - Static method in class org.apache.nutch.util.LockUtil
Remove lock file.
replace(String) - Method in class org.apache.nutch.indexer.replace.FieldReplacer
Return the replacement value for a field value.
replace(FileSystem, Path, Path, boolean) - Static method in class org.apache.nutch.util.FSUtils
Replaces the current path with the new path and, if set, removes the old path.
replacefirstoccuranceof(String, String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
 
replaceHost(String, String, String) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
 
ReplaceIndexer - Class in org.apache.nutch.indexer.replace
Do pattern replacements on selected field contents prior to indexing.
ReplaceIndexer() - Constructor for class org.apache.nutch.indexer.replace.ReplaceIndexer
 
REPORT - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
 
REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
reprUrl - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
REQUEST - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold the HTTP request if store.http.request is true
reset() - Method in class org.apache.nutch.indexer.NutchField
 
reset() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets all boolean values to false.
resetFailures() - Method in class org.apache.nutch.hostdb.HostDatum
 
resetStatistics() - Method in class org.apache.nutch.hostdb.HostDatum
 
resolveEncodingAlias(String) - Static method in class org.apache.nutch.util.EncodingDetector
 
resolverThread - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
ResolverThread - Class in org.apache.nutch.hostdb
Simple runnable that performs DNS lookup for a single host.
ResolverThread(String, HostDatum, Reducer.Context, int) - Constructor for class org.apache.nutch.hostdb.ResolverThread
Overloaded constructor.
resolveURL(URL, String) - Static method in class org.apache.nutch.util.URLUtil
Resolve relative URL-s and fix a java.net.URL error in handling of URLs with pure query targets.
resolveUrls() - Method in class org.apache.nutch.tools.ResolveUrls
Creates a thread pool for resolving urls.
ResolveUrls - Class in org.apache.nutch.tools
A simple tool that will spin up multiple threads to resolve URLs to IP addresses.
ResolveUrls(String) - Constructor for class org.apache.nutch.tools.ResolveUrls
Create a new ResolveUrls with a file from the local file system.
ResolveUrls(String, int) - Constructor for class org.apache.nutch.tools.ResolveUrls
Create a new ResolveUrls with a urls file and a number of threads for the Thread pool.
Response - Interface in org.apache.nutch.net.protocols
A response interface.
RESPONSE_HEADERS - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold the HTTP response header if store.http.headers is true
RESPONSE_TIME - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
 
Response.TruncatedContentReason - Enum in org.apache.nutch.net.protocols
 
responseTime - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Record response time in CrawlDatum's metadata; see property http.store.responsetime.
results - Variable in class org.apache.nutch.util.NutchTool
 
retrieveFile(String, OutputStream, int) - Method in class org.apache.nutch.protocol.ftp.Client
Retrieve file for path
retrieveList(String, List<FTPFile>, int, FTPFileEntryParser) - Method in class org.apache.nutch.protocol.ftp.Client
Retrieve list reply for path
retrieveNgrams(Configuration) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
Retrieves mingram and maxgram from configuration
RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Temporary failure.
reverseHost(String) - Static method in class org.apache.nutch.util.TableUtil
 
reverseKey - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
reverseKeyValue - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
reverseUrl(String) - Static method in class org.apache.nutch.tools.CommonCrawlDataDumper
 
reverseUrl(String) - Static method in class org.apache.nutch.util.TableUtil
Reverses a url's domain.
reverseUrl(URL) - Static method in class org.apache.nutch.util.TableUtil
Reverses a url's domain.
rightPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
Returns a copy of s (right padded) with trailing spaces so that its length is length.
RIGHTS - Static variable in interface org.apache.nutch.metadata.DublinCore
Information about rights held in and over the resource.
RobotRulesParser - Class in org.apache.nutch.protocol
This class uses crawler-commons for handling the parsing of robots.txt files.
RobotRulesParser() - Constructor for class org.apache.nutch.protocol.RobotRulesParser
 
RobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.RobotRulesParser
 
ROBOTS - Static variable in class org.apache.nutch.tools.WARCUtils
 
ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Access denied by robots.txt rules.
ROBOTS_METATAG - Static variable in interface org.apache.nutch.metadata.Nutch
Name to store the robots metatag in ParseData's metadata.
root - Variable in class org.apache.nutch.util.TrieStringMatcher
 
Rule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
 
run() - Method in class org.apache.nutch.fetcher.FetcherThread
 
run() - Method in class org.apache.nutch.fetcher.QueueFeeder
 
run() - Method in class org.apache.nutch.hostdb.ResolverThread
 
run() - Method in class org.apache.nutch.service.impl.JobWorker
 
run() - Method in class org.apache.nutch.service.impl.ServiceWorker
 
run() - Method in class org.apache.nutch.util.AbstractChecker
 
run(String[]) - Method in class org.apache.nutch.crawl.CrawlDb
 
run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbMerger
 
run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbReader
 
run(String[]) - Method in class org.apache.nutch.crawl.DeduplicationJob
 
run(String[]) - Method in class org.apache.nutch.crawl.Generator
 
run(String[]) - Method in class org.apache.nutch.crawl.Injector
 
run(String[]) - Method in class org.apache.nutch.crawl.LinkDb
 
run(String[]) - Method in class org.apache.nutch.crawl.LinkDbMerger
 
run(String[]) - Method in class org.apache.nutch.crawl.LinkDbReader
 
run(String[]) - Method in class org.apache.nutch.fetcher.Fetcher
 
run(String[]) - Method in class org.apache.nutch.hostdb.ReadHostDb
 
run(String[]) - Method in class org.apache.nutch.hostdb.UpdateHostDb
 
run(String[]) - Method in class org.apache.nutch.indexer.CleaningJob
 
run(String[]) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
 
run(String[]) - Method in class org.apache.nutch.indexer.IndexingJob
 
run(String[]) - Method in class org.apache.nutch.net.URLFilterChecker
 
run(String[]) - Method in class org.apache.nutch.net.URLNormalizerChecker
 
run(String[]) - Method in class org.apache.nutch.parse.ParserChecker
 
run(String[]) - Method in class org.apache.nutch.parse.ParseSegment
 
run(String[]) - Method in class org.apache.nutch.protocol.RobotRulesParser
 
run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
Runs the LinkDumper tool.
run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
Runs the LinkRank tool.
run(String[]) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
Runs the node dumper tool.
run(String[]) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
Runs the ScoreUpdater tool.
run(String[]) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
Parses command line arguments and runs the WebGraph jobs.
run(String[]) - Method in class org.apache.nutch.segment.SegmentMerger
Run this tool
run(String[]) - Method in class org.apache.nutch.segment.SegmentReader
 
run(String[]) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
 
run(String[]) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
 
run(String[]) - Method in class org.apache.nutch.tools.FreeGenerator
 
run(String[]) - Method in class org.apache.nutch.tools.ShowProperties
 
run(String[]) - Method in class org.apache.nutch.tools.warc.WARCExporter
 
run(String[]) - Method in class org.apache.nutch.util.CrawlCompletionStats
 
run(String[]) - Method in class org.apache.nutch.util.domain.DomainStatistics
 
run(String[]) - Method in class org.apache.nutch.util.ProtocolStatusStatistics
 
run(String[]) - Method in class org.apache.nutch.util.SitemapProcessor
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.CrawlDb
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.DeduplicationJob
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.Generator
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.Injector
Used by the Nutch REST service
run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.LinkDb
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.fetcher.Fetcher
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.indexer.IndexingJob
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.parse.ParseSegment
 
run(Map<String, Object>, String) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
Used by the REST service
run(Map<String, Object>, String) - Method in class org.apache.nutch.util.NutchTool
Runs the tool, using a map of arguments.
run(Mapper.Context) - Method in class org.apache.nutch.fetcher.Fetcher.FetcherRun
 
RUNNING - org.apache.nutch.service.model.response.JobInfo.State
 

S

save() - Method in class org.apache.nutch.collection.CollectionManager
Save collections into file
saveDom(OutputStream, DocumentFragment) - Static method in class org.apache.nutch.util.DomUtil
Save dom into OutputStream
saveDom(OutputStream, Element) - Static method in class org.apache.nutch.util.DomUtil
Save dom into OutputStream
SCHEDULE_DEC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
SCHEDULE_INC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
SCHEDULE_MIME_FILE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
SCHEME - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
SCHEME - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
SCOPE_CRAWLDB - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used when updating the CrawlDb with new URLs.
SCOPE_DEFAULT - Static variable in class org.apache.nutch.net.URLNormalizers
Default scope.
SCOPE_FETCHER - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used by Fetcher when processing redirect URLs.
SCOPE_GENERATE_HOST_COUNT - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used by Generator.
SCOPE_INDEXER - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used when indexing URLs.
SCOPE_INJECT - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used by Injector.
SCOPE_LINKDB - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used when updating the LinkDb with new URLs.
SCOPE_OUTLINK - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used when constructing new Outlink instances.
SCOPE_PARTITION - Static variable in class org.apache.nutch.net.URLNormalizers
Scope used by URLPartitioner.
score - Variable in class org.apache.nutch.hostdb.HostDatum
 
SCORE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
ScoreUpdater - Class in org.apache.nutch.scoring.webgraph
Updates the score from the WebGraph node database into the crawl database.
ScoreUpdater() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater
 
ScoreUpdater.ScoreUpdaterMapper - Class in org.apache.nutch.scoring.webgraph
Changes input into ObjectWritables.
ScoreUpdater.ScoreUpdaterReducer - Class in org.apache.nutch.scoring.webgraph
Creates new CrawlDatum objects with the updated score from the NodeDb or with a cleared score.
ScoreUpdaterMapper() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterMapper
 
ScoreUpdaterReducer() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
 
ScoringFilter - Interface in org.apache.nutch.scoring
A contract defining behavior of scoring plugins.
ScoringFilterException - Exception in org.apache.nutch.scoring
Specialized exception for errors during scoring.
ScoringFilterException() - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
 
ScoringFilterException(String) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
 
ScoringFilterException(String, Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
 
ScoringFilterException(Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
 
ScoringFilters - Class in org.apache.nutch.scoring
Creates and caches ScoringFilter implementing plugins.
ScoringFilters(Configuration) - Constructor for class org.apache.nutch.scoring.ScoringFilters
 
sdf - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SECONDS_PER_DAY - Static variable in interface org.apache.nutch.crawl.FetchSchedule
 
secondsToDaysHMS(long) - Static method in class org.apache.nutch.util.TimingUtil
Show time in seconds as days, hours, minutes and seconds (d days, hh:mm:ss)
secondsToHMS(long) - Static method in class org.apache.nutch.util.TimingUtil
Show time in seconds as hours, minutes and seconds (hh:mm:ss)
SeedList - Class in org.apache.nutch.service.model.request
 
SeedList() - Constructor for class org.apache.nutch.service.model.request.SeedList
 
SeedManager - Interface in org.apache.nutch.service
 
SeedManagerImpl - Class in org.apache.nutch.service.impl
 
SeedManagerImpl() - Constructor for class org.apache.nutch.service.impl.SeedManagerImpl
 
SeedResource - Class in org.apache.nutch.service.resources
 
SeedResource() - Constructor for class org.apache.nutch.service.resources.SeedResource
 
SeedUrl - Class in org.apache.nutch.service.model.request
 
SeedUrl() - Constructor for class org.apache.nutch.service.model.request.SeedUrl
 
SeedUrl(String) - Constructor for class org.apache.nutch.service.model.request.SeedUrl
 
SEGMENT_NAME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
SegmentChecker - Class in org.apache.nutch.segment
Checks whether a segment is valid, or has a certain status (generated, fetched, parsed), or can be used safely for a certain processing step (e.g., indexing).
SegmentChecker() - Constructor for class org.apache.nutch.segment.SegmentChecker
 
SegmentMergeFilter - Interface in org.apache.nutch.segment
Interface used to filter segments during segment merge.
SegmentMergeFilters - Class in org.apache.nutch.segment
This class wraps all SegmentMergeFilter extensions in a single object so it is easier to operate on them.
SegmentMergeFilters(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMergeFilters
 
SegmentMerger - Class in org.apache.nutch.segment
This tool takes several segments and merges their data together.
SegmentMerger() - Constructor for class org.apache.nutch.segment.SegmentMerger
 
SegmentMerger(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMerger
 
SegmentMerger.ObjectInputFormat - Class in org.apache.nutch.segment
Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata.
SegmentMerger.SegmentMergerMapper - Class in org.apache.nutch.segment
 
SegmentMerger.SegmentMergerReducer - Class in org.apache.nutch.segment
NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information).
SegmentMerger.SegmentOutputFormat - Class in org.apache.nutch.segment
 
SegmentMergerMapper() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
 
SegmentMergerReducer() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
 
segmentName - Variable in class org.apache.nutch.segment.SegmentPart
Name of the segment (just the last path component).
SegmentOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
 
SegmentPart - Class in org.apache.nutch.segment
Utility class for handling information about segment parts.
SegmentPart() - Constructor for class org.apache.nutch.segment.SegmentPart
 
SegmentPart(String, String) - Constructor for class org.apache.nutch.segment.SegmentPart
 
SegmentReader - Class in org.apache.nutch.segment
Dump the content of a segment.
SegmentReader() - Constructor for class org.apache.nutch.segment.SegmentReader
 
SegmentReader.InputCompatMapper - Class in org.apache.nutch.segment
 
SegmentReader.InputCompatReducer - Class in org.apache.nutch.segment
 
SegmentReader.SegmentReaderStats - Class in org.apache.nutch.segment
 
SegmentReader.TextOutputFormat - Class in org.apache.nutch.segment
Implements a text output format
SegmentReaderStats() - Constructor for class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
SegmentReaderUtil - Class in org.apache.nutch.util
 
SegmentReaderUtil() - Constructor for class org.apache.nutch.util.SegmentReaderUtil
 
segnum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
 
Selector() - Constructor for class org.apache.nutch.crawl.Generator.Selector
 
SelectorEntry() - Constructor for class org.apache.nutch.crawl.Generator.SelectorEntry
 
SelectorInverseMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorInverseMapper
 
SelectorMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorMapper
 
SelectorReducer() - Constructor for class org.apache.nutch.crawl.Generator.SelectorReducer
 
sendNoOp() - Method in class org.apache.nutch.protocol.ftp.Client
Sends a NOOP command to the FTP server.
Separator(String) - Constructor for class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
sepStr - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
seqRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
Read a sequence file
SequenceReader - Class in org.apache.nutch.service.impl
Enables reading a sequence file and provides methods offering different ways to read the file.
SequenceReader() - Constructor for class org.apache.nutch.service.impl.SequenceReader
 
serialize(Writable, JsonGenerator, SerializerProvider) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
 
server - Variable in class org.apache.nutch.service.resources.AbstractResource
 
SERVER_TYPE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
SERVER_URLS - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
ServiceConfig - Class in org.apache.nutch.service.model.request
 
ServiceConfig() - Constructor for class org.apache.nutch.service.model.request.ServiceConfig
 
ServiceInfo - Class in org.apache.nutch.service.model.response
 
ServiceInfo() - Constructor for class org.apache.nutch.service.model.response.ServiceInfo
 
ServicesResource - Class in org.apache.nutch.service.resources
The services resource defines an endpoint to enable the user to carry out Nutch jobs like dump, commoncrawldump, etc.
ServicesResource() - Constructor for class org.apache.nutch.service.resources.ServicesResource
 
ServiceWorker - Class in org.apache.nutch.service.impl
 
ServiceWorker(ServiceConfig, NutchTool) - Constructor for class org.apache.nutch.service.impl.ServiceWorker
 
set(String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
set(String, String) - Method in class org.apache.nutch.metadata.Metadata
Set metadata name/value.
set(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
 
set(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
Copy the contents of another instance into this instance.
setAdditionalPostHeaders(Map<String, String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setAll(Properties) - Method in class org.apache.nutch.metadata.Metadata
Copy all key-value pairs from properties.
setAnchor(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
setArgs(String[]) - Method in class org.apache.nutch.parse.ParseStatus
 
setArgs(String[]) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.request.JobConfig
 
setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setArgs(Map<String, String>) - Method in class org.apache.nutch.service.model.request.DbQuery
 
setBaseHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets the baseHref.
setBlackList(String) - Method in class org.apache.nutch.collection.Subcollection
Set contents of blacklist from String
setBody(byte[]) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
setCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noCache to false.
setChildNodes(Outlink[]) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
setChildren(List<FetchNodeDbInfo.ChildNode>) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
setClazz(String) - Method in class org.apache.nutch.plugin.Extension
Sets the class that implements the concrete extension; only used until model creation at system start-up.
setCode(int) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
setCommand(String) - Method in class org.apache.nutch.util.CommandRunner
 
setCompressed(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
 
setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.Generator.Selector
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.Signature
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.TextProfileSignature
 
setConf(Configuration) - Method in class org.apache.nutch.crawl.URLPartitioner
 
setConf(Configuration) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.indexer.CleaningJob
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
Sets the Configuration object used to configure this IndexingFilter.
setConf(Configuration) - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
Handles conf assignment and pulls the value assignment from the "urlmeta.tags" property.
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
 
setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
 
setConf(Configuration) - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
 
setConf(Configuration) - Method in class org.apache.nutch.parse.ext.ExtParser
 
setConf(Configuration) - Method in class org.apache.nutch.parse.feed.FeedParser
Sets the Configuration object for this Parser.
setConf(Configuration) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
 
setConf(Configuration) - Method in class org.apache.nutch.parse.html.DOMContentUtils
 
setConf(Configuration) - Method in class org.apache.nutch.parse.html.HtmlParser
 
setConf(Configuration) - Method in class org.apache.nutch.parse.js.JSParseFilter
 
setConf(Configuration) - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
 
setConf(Configuration) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
 
setConf(Configuration) - Method in class org.apache.nutch.parse.tika.TikaParser
 
setConf(Configuration) - Method in class org.apache.nutch.parse.zip.ZipParser
 
setConf(Configuration) - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
 
setConf(Configuration) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
setConf(Configuration) - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.file.File
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.protocol.ftp.Ftp
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.protocol.htmlunit.Http
Set the Configuration object.
setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.http.Http
Set the Configuration object.
setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.Http
Reads the configuration from the Nutch configuration files and sets the configuration.
setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.interactiveselenium.Http
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
 
setConf(Configuration) - Method in class org.apache.nutch.protocol.RobotRulesParser
Set the Configuration object
setConf(Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
 
setConf(Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
Handles conf assignment and pulls the value assignment from the "scoring.db.md", "scoring.content.md" and "scoring.parse.md" properties.
setConf(Configuration) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
 
setConf(Configuration) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
 
setConf(Configuration) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
Handles conf assignment and pulls the value assignment from the "urlmeta.tags" property.
setConf(Configuration) - Method in class org.apache.nutch.segment.SegmentMerger
 
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
 
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
Sets the configuration.
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
Sets the configuration.
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
 
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
setConf(Configuration) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
 
setConf(Configuration) - Method in class org.apache.nutch.util.GenericWritableConfigurable
 
setConf(Configuration) - Method in class org.apache.nutch.util.NutchTool
 
setConf(Configuration) - Method in class org.creativecommons.nutch.CCIndexingFilter
 
setConf(Configuration) - Method in class org.creativecommons.nutch.CCParseFilter
 
setConfId(String) - Method in class org.apache.nutch.service.model.request.DbQuery
 
setConfId(String) - Method in class org.apache.nutch.service.model.request.JobConfig
 
setConfId(String) - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
setConfId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setConfig(Configuration) - Method in interface org.apache.nutch.publisher.NutchPublisher
Use implementation-specific configurations.
setConfig(Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
 
setConfig(Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
 
setConfigId(String) - Method in class org.apache.nutch.service.model.request.NutchConfig
 
setConfiguration(Set<String>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
setConnectionFailures(Long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setContent(byte[]) - Method in class org.apache.nutch.protocol.Content
 
setContent(Content) - Method in class org.apache.nutch.protocol.ProtocolOutput
 
setContentType(String) - Method in class org.apache.nutch.protocol.Content
 
setContentType(String) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
setCookie(Text) - Method in class org.apache.nutch.fetcher.FetchItemQueue
 
setCookiePolicy(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setCookies(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
setCrawlId(String) - Method in class org.apache.nutch.service.model.request.DbQuery
 
setCrawlId(String) - Method in class org.apache.nutch.service.model.request.JobConfig
 
setCrawlId(String) - Method in class org.apache.nutch.service.model.request.ServiceConfig
 
setCrawlId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setDataTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Client
Sets the timeout in milliseconds to use for data connection.
setDescriptor(PluginDescriptor) - Method in class org.apache.nutch.plugin.Extension
Sets the plugin descriptor and is only used until model creation at system start up.
setDnsFailures(Long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setDocumentLocator(Locator) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive an object for locating the origin of SAX document events.
setDumpPaths(List<String>) - Method in class org.apache.nutch.service.model.response.ServiceInfo
 
setEventData(Map<String, Object>) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Set metadata for this event.
setEventType(FetcherThreadEvent.PublishEventType) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Set event type of this object
setFetched(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setFetchInterval(float) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setFetchInterval(int) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page.
setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
 
setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.DefaultFetchSchedule
 
setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in interface org.apache.nutch.crawl.FetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page.
setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
 
setFetchTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
setFetchTime(long) - Method in class org.apache.nutch.fetcher.FetchNode
 
setFileType(int) - Method in class org.apache.nutch.protocol.ftp.Client
Sets the file type to be transferred.
setFilterFromPath(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
setFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noFollow to false.
setFollowTalk(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
Set followTalk.
setForce(boolean) - Method in class org.apache.nutch.service.model.request.NutchConfig
 
setFromConf(IndexWriterParams, String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
setFromConf(IndexWriterParams, String, boolean) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
setGone(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setHalted(boolean) - Method in class org.apache.nutch.fetcher.FetcherThread
 
setHeaders(String) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
setHeaders(Map<String, Object>) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
 
setHomepageUrl(String) - Method in class org.apache.nutch.hostdb.HostDatum
 
setId(Long) - Method in class org.apache.nutch.service.model.request.SeedList
 
setId(Long) - Method in class org.apache.nutch.service.model.request.SeedUrl
 
setId(String) - Method in class org.apache.nutch.plugin.Extension
Sets the unique extension Id and is only used until model creation at system start up.
setId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setIDAttribute(String, Element) - Method in class org.apache.nutch.parse.html.DOMBuilder
Set an ID string to node association in the ID table.
setIgnoreCase(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
setIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noIndex to false.
setIndexedConf(Configuration, int) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
Set the Configuration object for a specific set of values in the config
setInfo(JobInfo) - Method in class org.apache.nutch.service.impl.JobWorker
 
setInLinks(List<String>) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
setInLinks(List<String>) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
Sets inlinks of this document.
setInlinkScore(float) - Method in class org.apache.nutch.scoring.webgraph.Node
 
setInputStream(InputStream) - Method in class org.apache.nutch.util.CommandRunner
 
setJobClassName(String) - Method in class org.apache.nutch.service.model.request.JobConfig
 
setJobs(Collection<JobInfo>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
setJsonArray(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setKeepConnection(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
Whether to keep ftp connection.
setKeyPrefix(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setLastCheck() - Method in class org.apache.nutch.hostdb.HostDatum
 
setLastCheck(Date) - Method in class org.apache.nutch.hostdb.HostDatum
 
setLastModified(long) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
setLinks(LinkDumper.LinkNode[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
setLinkType(byte) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
setLoginFormId(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setLoginPostData(Map<String, String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setLoginRedirect(boolean) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setLoginUrl(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setMajorCode(byte) - Method in class org.apache.nutch.parse.ParseStatus
 
setMaxContentLength(int) - Method in class org.apache.nutch.protocol.file.File
Set the length after which content is truncated.
setMaxContentLength(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
Set the length after which content is truncated.
setMessage(String) - Method in class org.apache.nutch.parse.ParseStatus
 
setMessage(String) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
setMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
Set metadata.
setMetadata(MapWritable) - Method in class org.apache.nutch.parse.Outlink
 
setMetadata(Metadata) - Method in class org.apache.nutch.protocol.Content
Other protocol-specific data.
setMetadata(Metadata) - Method in class org.apache.nutch.scoring.webgraph.Node
 
setMetaData(MapWritable) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setMetaData(MapWritable) - Method in class org.apache.nutch.hostdb.HostDatum
 
setMinorCode(short) - Method in class org.apache.nutch.parse.ParseStatus
 
setModeAccept(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
setModifiedTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setMsg(String) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setName(String) - Method in class org.apache.nutch.service.model.request.SeedList
 
setNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noCache to true.
setNode(Node) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
setNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noFollow to true.
setNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets noIndex to true.
setNotModified(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setNumInlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
 
setNumOfOutlinks(int) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
setNumOutlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
 
setObject(String, Object) - Method in class org.apache.nutch.util.ObjectCache
 
setOutlinks(Outlink[]) - Method in class org.apache.nutch.fetcher.FetchNode
 
setOutlinks(Outlink[]) - Method in class org.apache.nutch.parse.ParseData
 
setOutputDir(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
This method specifies how to schedule refetching of pages marked as GONE.
setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
This method specifies how to schedule refetching of pages marked as GONE.
setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
setParams(Map<String, String>) - Method in class org.apache.nutch.service.model.request.NutchConfig
 
setParseMeta(Metadata) - Method in class org.apache.nutch.parse.ParseData
 
setPath(String) - Method in class org.apache.nutch.service.model.request.ReaderConfig
 
setPoolSize(int) - Static method in class org.apache.nutch.util.MimeUtil
 
setPort(int) - Static method in class org.apache.nutch.service.NutchServer
 
setProperty(String, String, String) - Method in interface org.apache.nutch.service.ConfManager
 
setProperty(String, String, String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
Sets the given property in the configuration associated with the confId
setReason(Response.TruncatedContentReason) - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
 
setRedirect(boolean) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
 
setRedirPerm(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setRedirTemp(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setRefresh(boolean) - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets refresh to the supplied value.
setRefreshHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets the refreshHref.
setRefreshTime(int) - Method in class org.apache.nutch.parse.HTMLMetaTags
Sets the refreshTime.
setRemoteVerificationEnabled(boolean) - Method in class org.apache.nutch.protocol.ftp.Client
Enable or disable verification that the remote host taking part in a data connection is the same as the host to which the control connection is attached.
setRemovedFormFields(Set<String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
 
setResult(Map<String, Object>) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setRetriesSinceFetch(int) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setReverseKey(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setReverseKeyValue(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setRunningJobs(Collection<JobInfo>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
setScore(float) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setScore(float) - Method in class org.apache.nutch.hostdb.HostDatum
 
setScore(float) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
setSeedFilePath(String) - Method in class org.apache.nutch.service.model.request.SeedList
 
setSeedList(String, SeedList) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
 
setSeedList(String, SeedList) - Method in interface org.apache.nutch.service.SeedManager
 
setSeedList(SeedList) - Method in class org.apache.nutch.service.model.request.SeedUrl
 
setSeedUrls(Collection<SeedUrl>) - Method in class org.apache.nutch.service.model.request.SeedList
 
setSignature(byte[]) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setSimpleDateFormat(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setStartDate(Date) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
 
setState(JobInfo.State) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setStatus(int) - Method in class org.apache.nutch.crawl.CrawlDatum
 
setStatus(int) - Method in class org.apache.nutch.fetcher.FetchNode
 
setStatus(int) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
setStatus(ProtocolStatus) - Method in class org.apache.nutch.protocol.ProtocolOutput
 
setStdErrorStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
 
setStdOutputStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
 
setTermFreqVector(HashMap<String, Integer>) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
setTimeLimit(long) - Method in class org.apache.nutch.fetcher.QueueFeeder
 
setTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
Set the timeout.
setTimeout(int) - Method in class org.apache.nutch.util.CommandRunner
 
setTimestamp(long) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
setTimestamp(Long) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Set timestamp for this event
setTitle(String) - Method in class org.apache.nutch.fetcher.FetchNode
 
setType(String) - Method in class org.apache.nutch.service.model.request.DbQuery
 
setType(JobManager.JobType) - Method in class org.apache.nutch.service.model.request.JobConfig
 
setType(JobManager.JobType) - Method in class org.apache.nutch.service.model.response.JobInfo
 
setUnfetched(long) - Method in class org.apache.nutch.hostdb.HostDatum
 
setup(Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
Configures the job mapper.
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
Configures the OutlinkDb job mapper.
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbFilter
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.fetcher.Fetcher.FetcherRun
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbFilter
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDb.LinkDbMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
Configures the mapper, sets the flag for type of content and the topN number if any.
setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
 
setup(Mapper.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
Configures the reducer, sets the flag for type of content and the topN number if any.
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
Configures the thread pool and prestarts all resolver threads.
setup(Reducer.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
 
setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
Configures the OutlinkDb job reducer.
setup(Reducer.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
 
setUrl(String) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
Set URL of this event (fetched page)
setUrl(String) - Method in class org.apache.nutch.parse.Outlink
 
setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
setUrl(String) - Method in class org.apache.nutch.service.model.request.SeedUrl
 
setUrl(String) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
 
setUrl(Text) - Method in class org.apache.nutch.fetcher.FetchNode
 
setURLScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
 
setURLScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
 
setVectorEntry(int, long) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
setWaitForExit(boolean) - Method in class org.apache.nutch.util.CommandRunner
 
setWarcSize(long) - Method in class org.apache.nutch.tools.CommonCrawlConfig
 
setWeight(float) - Method in class org.apache.nutch.indexer.NutchDocument
 
setWeight(float) - Method in class org.apache.nutch.indexer.NutchField
 
setWhiteList(String) - Method in class org.apache.nutch.collection.Subcollection
Set contents of whitelist from String
setWhiteList(ArrayList<String>) - Method in class org.apache.nutch.collection.Subcollection
 
shortestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
Returns the shortest prefix of input that is matched, or null if no match exists.
shortestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
Returns the shortest suffix of input that is matched, or null if no match exists.
shortestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
Returns the shortest substring of input that is matched by a pattern in the trie, or null if no match exists.
shouldCheck(HostDatum) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
Determines whether a record should be checked.
shouldFetch(Text, CrawlDatum, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
This method provides information whether the page is suitable for selection in the current fetchlist.
shouldFetch(Text, CrawlDatum, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
This method provides information whether the page is suitable for selection in the current fetchlist.
shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
 
shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
 
shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
 
shouldProcessURL(String) - Method in interface org.apache.nutch.protocol.interactiveselenium.handlers.InteractiveSeleniumHandler
 
ShowProperties - Class in org.apache.nutch.tools
Tool to list properties and their values set by the current Nutch configuration
ShowProperties() - Constructor for class org.apache.nutch.tools.ShowProperties
 
shutDown() - Method in class org.apache.nutch.plugin.Plugin
Shutdown the plugin.
Signature - Class in org.apache.nutch.crawl
 
Signature() - Constructor for class org.apache.nutch.crawl.Signature
 
SIGNATURE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
SignatureComparator - Class in org.apache.nutch.crawl
 
SignatureComparator() - Constructor for class org.apache.nutch.crawl.SignatureComparator
 
SignatureFactory - Class in org.apache.nutch.crawl
Factory class that instantiates a Signature implementation according to the current Configuration.
SimilarityModel - Interface in org.apache.nutch.scoring.similarity
 
SimilarityScoringFilter - Class in org.apache.nutch.scoring.similarity
 
SimilarityScoringFilter() - Constructor for class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
 
simpleDateFormat - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
sitemap(Path, Path, Path, boolean, boolean, boolean, int) - Method in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_OVERWRITE_EXISTING - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_REDIR_MAX - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_SIZE_MAX - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_STRICT_PARSING - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_URL_FILTERING - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SITEMAP_URL_NORMALIZING - Static variable in class org.apache.nutch.util.SitemapProcessor
 
SitemapProcessor - Class in org.apache.nutch.util
Performs sitemap processing by fetching sitemap links, parsing the content and merging the URLs from sitemaps (with the metadata) into the CrawlDb.
SitemapProcessor() - Constructor for class org.apache.nutch.util.SitemapProcessor
 
size() - Method in class org.apache.nutch.crawl.Inlinks
 
size() - Method in class org.apache.nutch.metadata.Metadata
Returns the number of metadata names in this metadata.
size() - Method in class org.apache.nutch.parse.ParseResult
Return the number of parse outputs (both successful and failed)
skip(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
Skips over one Inlink in the input.
skip(DataInput) - Static method in class org.apache.nutch.parse.Outlink
Skips over one Outlink in the input.
SKIP_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseSegment
 
skipChildren() - Method in class org.apache.nutch.util.NodeWalker
Skips over and removes from the node stack the children of the last node.
skippedEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of a skipped entity.
SlashURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.slash
 
SlashURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
 
slice(String, int, int) - Method in class org.apache.nutch.service.impl.LinkReader
 
slice(String, int, int) - Method in class org.apache.nutch.service.impl.NodeReader
 
slice(String, int, int) - Method in class org.apache.nutch.service.impl.SequenceReader
 
slice(String, int, int) - Method in interface org.apache.nutch.service.NutchReader
 
SOFTWARE - Static variable in class org.apache.nutch.tools.WARCUtils
 
SolrConstants - Interface in org.apache.nutch.indexwriter.solr
 
SolrIndexWriter - Class in org.apache.nutch.indexwriter.solr
 
SolrIndexWriter() - Constructor for class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
SolrUtils - Class in org.apache.nutch.indexwriter.solr
 
SolrUtils() - Constructor for class org.apache.nutch.indexwriter.solr.SolrUtils
 
Sorter() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
 
SorterMapper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
 
SorterReducer() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
 
SOURCE - Static variable in interface org.apache.nutch.metadata.DublinCore
A reference to a resource from which the present resource is derived.
SpellCheckedMetadata - Class in org.apache.nutch.metadata
A decorator to Metadata that adds spellchecking capabilities to property names.
SpellCheckedMetadata() - Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
 
splitEnd - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
splitLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
splitStart - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
 
SPONSORED - org.apache.nutch.util.domain.DomainSuffix.Status
 
STANDARD - org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
 
start - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
 
start(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
 
START - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
 
startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
startCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the start of a CDATA section.
startDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of the beginning of a document.
startDTD(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the start of DTD declarations, if any.
startElement(String, String, String, Attributes) - Method in class org.apache.nutch.parse.html.DOMBuilder
Receive notification of the beginning of an element.
startEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Report the beginning of an entity.
startObject(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
startPrefixMapping(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
Begin the scope of a prefix-URI Namespace mapping.
startServer() - Static method in class org.apache.nutch.service.NutchServer
 
startUp() - Method in class org.apache.nutch.plugin.Plugin
Invoked when the plugin starts up.
STARTUP - org.apache.nutch.util.domain.DomainSuffix.Status
 
STAT_PROGRESS - Static variable in interface org.apache.nutch.metadata.Nutch
For progress of job.
StaticFieldIndexer - Class in org.apache.nutch.indexer.staticfield
A simple plugin called at indexing that adds fields with static data.
StaticFieldIndexer() - Constructor for class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
 
statNames - Static variable in class org.apache.nutch.crawl.CrawlDatum
 
status - Variable in class org.apache.nutch.util.NutchTool
 
STATUS_BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_DB_DUPLICATE - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was marked as being a duplicate of another page
STATUS_DB_FETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was successfully fetched.
STATUS_DB_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page no longer exists.
STATUS_DB_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
Maximum value of DB-related status.
STATUS_DB_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was successfully fetched and found not modified.
STATUS_DB_ORPHAN - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was marked as orphan, e.g.
STATUS_DB_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page permanently redirects to other page.
STATUS_DB_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page temporarily redirects to other page.
STATUS_DB_UNFETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was not fetched yet.
STATUS_FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_FAILURE - Static variable in class org.apache.nutch.parse.ParseStatus
 
STATUS_FETCH_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching unsuccessful - page is gone.
STATUS_FETCH_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
Maximum value of fetch-related status.
STATUS_FETCH_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching successful - page is not modified.
STATUS_FETCH_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching permanently redirected to other page.
STATUS_FETCH_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching temporarily redirected to other page.
STATUS_FETCH_RETRY - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching unsuccessful, needs to be retried (transient errors).
STATUS_FETCH_SUCCESS - Static variable in class org.apache.nutch.crawl.CrawlDatum
Fetching was successful.
STATUS_GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_INJECTED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page was newly injected.
STATUS_LINKED - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page discovered through a link.
STATUS_MODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
Page is known to have been modified since our last visit.
STATUS_NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_NOTMODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
Page is known to remain unmodified since our last visit.
STATUS_NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
 
STATUS_PARSE_META - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page got metadata from a parser
STATUS_REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_SIGNATURE - Static variable in class org.apache.nutch.crawl.CrawlDatum
Page signature.
STATUS_SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
 
STATUS_SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
STATUS_UNKNOWN - Static variable in interface org.apache.nutch.crawl.FetchSchedule
It is unknown whether page was changed since our last visit.
STATUS_WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
 
StatusUpdateReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
 
stdin - Variable in class org.apache.nutch.util.AbstractChecker
 
stop() - Method in class org.apache.nutch.service.NutchServer
 
stop(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
 
stop(String, String) - Method in interface org.apache.nutch.service.JobManager
 
stop(String, String) - Method in class org.apache.nutch.service.resources.JobResource
Stop Job
stopJob() - Method in class org.apache.nutch.service.impl.JobWorker
Stops the executing job.
stopJob() - Method in class org.apache.nutch.util.NutchTool
Stop the job with the possibility to resume.
STOPPING - org.apache.nutch.service.model.response.JobInfo.State
 
stopServer(boolean) - Method in class org.apache.nutch.service.resources.AdminResource
Stop the Nutch server
storeHttpHeaders - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Record the HTTP response header in the metadata, see property store.http.headers.
storeHttpRequest - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Record the HTTP request in the metadata, see property store.http.request.
storeIPAddress - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Record the IP address of the responding server, see property store.ip.address.
stringFields - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
stringFieldWritables - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
 
StringUtil - Class in org.apache.nutch.util
A collection of String processing utility methods.
StringUtil() - Constructor for class org.apache.nutch.util.StringUtil
 
stripNonCharCodepoints(String) - Static method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
 
Subcollection - Class in org.apache.nutch.collection
SubCollection represents a subset of the index; you can define URL patterns that indicate whether a particular page (URL) is part of the SubCollection.
Subcollection(String, String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
public Constructor
Subcollection(String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
public Constructor
Subcollection(Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
 
SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
 
SubcollectionIndexingFilter() - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
 
SubcollectionIndexingFilter(Configuration) - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
 
SUBJECT - Static variable in interface org.apache.nutch.metadata.DublinCore
The topic of the content of the resource.
SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
Parsing succeeded.
SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Content was retrieved without errors.
SUCCESS_REDIRECT - Static variable in class org.apache.nutch.parse.ParseStatus
Parsed content contains a directive to redirect to another URL.
SuffixStringMatcher - Class in org.apache.nutch.util
A class for efficiently matching Strings against a set of suffixes.
SuffixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied array.
SuffixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied Collection.
SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
Filters URLs based on a file of URL suffixes.
SuffixURLFilter() - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
SuffixURLFilter(Reader) - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
 
SYSTEM_PROTOCOLS - Static variable in class org.apache.nutch.plugin.URLStreamHandlerFactory
Protocols covered by standard JVM URL handlers.

T

TAB_CHARACTER - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
 
TableUtil - Class in org.apache.nutch.util
 
TableUtil() - Constructor for class org.apache.nutch.util.TableUtil
 
TAG_BLACKLIST - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_COLLECTION - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_COLLECTIONS - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_ID - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_KEY - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_NAME - Static variable in class org.apache.nutch.collection.Subcollection
 
TAG_WHITELIST - Static variable in class org.apache.nutch.collection.Subcollection
 
tcpPort - Variable in class org.apache.nutch.util.AbstractChecker
 
TEMP_MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Resource has moved temporarily.
TEMPLATE - Static variable in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
termFreqVector - Variable in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
terminal - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
 
termVector - Variable in class org.apache.nutch.scoring.similarity.cosine.DocVector
 
TEXT_PLAIN_CONTENT_TYPE - Static variable in class org.apache.nutch.parse.feed.FeedParser
 
TextMD5Signature - Class in org.apache.nutch.crawl
Implementation of a page signature.
TextMD5Signature() - Constructor for class org.apache.nutch.crawl.TextMD5Signature
 
TextOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentReader.TextOutputFormat
 
TextProfileSignature - Class in org.apache.nutch.crawl
An implementation of a page signature.
TextProfileSignature() - Constructor for class org.apache.nutch.crawl.TextProfileSignature
 
throwBadRequestException(String) - Method in class org.apache.nutch.service.resources.AbstractResource
 
TikaParser - Class in org.apache.nutch.parse.tika
Wrapper for Tika parsers.
TikaParser() - Constructor for class org.apache.nutch.parse.tika.TikaParser
 
TIME - org.apache.nutch.net.protocols.Response.TruncatedContentReason
fetch exceeded configured http.time.limit
timelimitExceeded() - Method in class org.apache.nutch.fetcher.FetchItemQueues
 
timeout - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The network timeout in milliseconds.
TimingUtil - Class in org.apache.nutch.util
 
TimingUtil() - Constructor for class org.apache.nutch.util.TimingUtil
 
TITLE - Static variable in interface org.apache.nutch.metadata.DublinCore
A name given to the resource.
TLDIndexingFilter - Class in org.apache.nutch.indexer.tld
Adds the top-level domain extensions to the index
TLDIndexingFilter() - Constructor for class org.apache.nutch.indexer.tld.TLDIndexingFilter
 
TLDScoringFilter - Class in org.apache.nutch.scoring.tld
Scoring filter to boost top-level domains (TLDs).
TLDScoringFilter() - Constructor for class org.apache.nutch.scoring.tld.TLDScoringFilter
 
tlsCheckCertificate - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Whether to check TLS/SSL certificates
tlsPreferredCipherSuites - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Which TLS/SSL cipher suites to support
tlsPreferredProtocols - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Which TLS/SSL protocols to support
toASCII(String) - Static method in class org.apache.nutch.util.URLUtil
 
toByteArray(HttpHeaders) - Static method in class org.apache.nutch.tools.WARCUtils
 
toContent() - Method in class org.apache.nutch.protocol.file.FileResponse
 
toContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
 
toDate(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
toHexString(byte[]) - Static method in class org.apache.nutch.util.StringUtil
Convenience call for StringUtil.toHexString(byte[], String, int), where sep = null; lineLen = Integer.MAX_VALUE.
toHexString(byte[], String, int) - Static method in class org.apache.nutch.util.StringUtil
Get a text representation of a byte[] as hexadecimal String, where each pair of hexadecimal digits corresponds to consecutive bytes in the array.
toLong(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
TOPIC - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
TopLevelDomain - Class in org.apache.nutch.util.domain
(From Wikipedia) A top-level domain (TLD) is the last part of an Internet domain name; that is, the letters which follow the final dot of any domain name.
TopLevelDomain(String, DomainSuffix.Status, float, String) - Constructor for class org.apache.nutch.util.domain.TopLevelDomain
 
TopLevelDomain(String, TopLevelDomain.Type, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.TopLevelDomain
 
TopLevelDomain.Type - Enum in org.apache.nutch.util.domain
 
toString() - Method in class org.apache.nutch.crawl.CrawlDatum
 
toString() - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
 
toString() - Method in class org.apache.nutch.crawl.Inlink
 
toString() - Method in class org.apache.nutch.crawl.Inlinks
 
toString() - Method in class org.apache.nutch.hostdb.HostDatum
 
toString() - Method in class org.apache.nutch.indexer.IndexWriterConfig
 
toString() - Method in class org.apache.nutch.indexer.NutchDocument
 
toString() - Method in class org.apache.nutch.indexer.NutchField
 
toString() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
 
toString() - Method in class org.apache.nutch.metadata.Metadata
 
toString() - Method in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
 
toString() - Method in class org.apache.nutch.parse.HTMLMetaTags
 
toString() - Method in class org.apache.nutch.parse.Outlink
 
toString() - Method in class org.apache.nutch.parse.ParseData
 
toString() - Method in class org.apache.nutch.parse.ParseStatus
 
toString() - Method in class org.apache.nutch.parse.ParseText
 
toString() - Method in class org.apache.nutch.plugin.Extension
 
toString() - Method in class org.apache.nutch.protocol.Content
 
toString() - Method in class org.apache.nutch.protocol.okhttp.CIDR
 
toString() - Method in class org.apache.nutch.protocol.ProtocolStatus
 
toString() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
toString() - Method in class org.apache.nutch.scoring.webgraph.Node
 
toString() - Method in class org.apache.nutch.segment.SegmentPart
Return a String representation of this class, in the form "segmentName/partName".
toString() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
 
toString() - Method in class org.apache.nutch.util.domain.DomainSuffix
 
toString(long) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
toString(CharSequence) - Static method in class org.apache.nutch.util.TableUtil
Converts the given Utf8 instance to a String and cleans out any offending "�" from the String.
toString(String) - Method in class org.apache.nutch.protocol.Content
 
toString(String, String) - Method in class org.apache.nutch.metadata.Metadata
 
toString(Charset) - Method in class org.apache.nutch.protocol.Content
 
toString(Calendar) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
toString(Date) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
Get the HTTP format of the specified date.
toUNICODE(String) - Static method in class org.apache.nutch.util.URLUtil
 
toZonedDateTime(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
 
train() - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
Train - Class in org.apache.nutch.parsefilter.naivebayes
 
Train() - Constructor for class org.apache.nutch.parsefilter.naivebayes.Train
 
TRAINFILE_MODELFILTER - Static variable in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
 
TRANSFER_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
TrieStringMatcher - Class in org.apache.nutch.util
TrieStringMatcher is a base class for simple tree-based string matching.
TrieStringMatcher() - Constructor for class org.apache.nutch.util.TrieStringMatcher
 
TrieStringMatcher.TrieNode - Class in org.apache.nutch.util
Node class for the character tree.
TRUNCATED_CONTENT - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold boolean whether content has been truncated, e.g., because it exceeds http.content.limit
TRUNCATED_CONTENT_REASON - Static variable in interface org.apache.nutch.net.protocols.Response
Key to hold reason why content has been truncated, see Response.TruncatedContentReason
TruncatedContent() - Constructor for class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
 
TRUST_STORE_PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
TRUST_STORE_PATH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
TRUST_STORE_TYPE - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
TYPE - Static variable in interface org.apache.nutch.metadata.DublinCore
The nature or genre of the content of the resource.

U

unescape(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
Unescape some exotic characters in the fragment part
unfetched - Variable in class org.apache.nutch.hostdb.HostDatum
 
unflattenToHashmap(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Classify
 
unreverseHost(String) - Static method in class org.apache.nutch.util.TableUtil
 
unreverseUrl(String) - Static method in class org.apache.nutch.util.TableUtil
 
UNSPECIFIED - org.apache.nutch.net.protocols.Response.TruncatedContentReason
unknown reason
UNSPONSORED - org.apache.nutch.util.domain.DomainSuffix.Status
 
unzip(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
Returns a gunzipped copy of the input array.
unzipBestEffort(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
Returns a gunzipped copy of the input array.
unzipBestEffort(byte[], int) - Static method in class org.apache.nutch.util.GZIPUtils
Returns a gunzipped copy of the input array, truncated to sizeLimit bytes, if necessary.
update(Path, Path) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
Updates the inlink score from the web graph node database into the crawl database.
update(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDb
 
update(Path, Path[], boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDb
 
update(NutchDocument) - Method in interface org.apache.nutch.indexer.IndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexer.IndexWriters
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
update(NutchDocument) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
UPDATE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
 
UPDATEDB - org.apache.nutch.service.JobManager.JobType
 
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
 
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
 
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
Increase the score by a sum of inlinked scores.
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
Used for orphan control.
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in interface org.apache.nutch.scoring.ScoringFilter
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.ScoringFilters
Calculate updated page score during CrawlDb.update().
updateHashMap(HashMap<String, Integer>, String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
 
UpdateHostDb - Class in org.apache.nutch.hostdb
Tool to create a HostDB from the CrawlDB.
UpdateHostDb() - Constructor for class org.apache.nutch.hostdb.UpdateHostDb
 
UpdateHostDbMapper - Class in org.apache.nutch.hostdb
Mapper ingesting HostDB and CrawlDB entries.
UpdateHostDbMapper() - Constructor for class org.apache.nutch.hostdb.UpdateHostDbMapper
 
UpdateHostDbReducer - Class in org.apache.nutch.hostdb
 
UpdateHostDbReducer() - Constructor for class org.apache.nutch.hostdb.UpdateHostDbReducer
 
updateProperty(String, String, String) - Method in class org.apache.nutch.service.resources.ConfigResource
Adds/Updates a particular property value in the configuration
url - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
 
url - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
URL_FILTER_NORMALIZE_ALL - Static variable in class org.apache.nutch.crawl.Injector
property to pass value of command-line option -filterNormalizeAll to mapper
URL_FILTERING - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
 
URL_FILTERING - Static variable in class org.apache.nutch.crawl.LinkDbFilter
 
URL_FILTERING - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
URL_FILTERING - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
 
URL_NORMALIZING - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
 
URL_NORMALIZING - Static variable in class org.apache.nutch.crawl.LinkDbFilter
 
URL_NORMALIZING - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
 
URL_NORMALIZING - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
 
URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
 
URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
 
URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.LinkDbFilter
 
URL_VERSION - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
 
URL_VERSION - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator
 
URLExemptionFilter - Interface in org.apache.nutch.net
Interface used to allow exemptions to external domain resources by overriding db.ignore.external.links.
URLExemptionFilters - Class in org.apache.nutch.net
Creates and caches URLExemptionFilter implementing plugins.
URLExemptionFilters(Configuration) - Constructor for class org.apache.nutch.net.URLExemptionFilters
 
URLFilter - Interface in org.apache.nutch.net
Interface used to limit which URLs enter Nutch.
URLFILTER_AUTOMATON_FILE - Static variable in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
URLFILTER_AUTOMATON_RULES - Static variable in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
 
URLFILTER_FAST_FILE - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
URLFILTER_FAST_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
URLFILTER_FAST_PATH_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
URLFILTER_FAST_QUERY_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
 
URLFILTER_ORDER - Static variable in class org.apache.nutch.net.URLFilters
 
URLFILTER_REGEX_FILE - Static variable in class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
URLFILTER_REGEX_RULES - Static variable in class org.apache.nutch.urlfilter.regex.RegexURLFilter
 
URLFilterChecker - Class in org.apache.nutch.net
Checks one given filter or all filters.
URLFilterChecker() - Constructor for class org.apache.nutch.net.URLFilterChecker
 
URLFilterException - Exception in org.apache.nutch.net
 
URLFilterException() - Constructor for exception org.apache.nutch.net.URLFilterException
 
URLFilterException(String) - Constructor for exception org.apache.nutch.net.URLFilterException
 
URLFilterException(String, Throwable) - Constructor for exception org.apache.nutch.net.URLFilterException
 
URLFilterException(Throwable) - Constructor for exception org.apache.nutch.net.URLFilterException
 
URLFilters - Class in org.apache.nutch.net
Creates and caches plugins implementing URLFilter.
URLFilters(Configuration) - Constructor for class org.apache.nutch.net.URLFilters
 
urlKey - Static variable in class org.apache.nutch.crawl.DeduplicationJob
 
URLMetaIndexingFilter - Class in org.apache.nutch.indexer.urlmeta
This is part of the URL Meta plugin.
URLMetaIndexingFilter() - Constructor for class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
 
URLMetaScoringFilter - Class in org.apache.nutch.scoring.urlmeta
 
URLMetaScoringFilter() - Constructor for class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
 
URLNormalizer - Interface in org.apache.nutch.net
Interface used to convert URLs to normal form and optionally perform substitutions
URLNormalizerChecker - Class in org.apache.nutch.net
Checks one given normalizer or all normalizers.
URLNormalizerChecker() - Constructor for class org.apache.nutch.net.URLNormalizerChecker
 
URLNormalizers - Class in org.apache.nutch.net
This class uses a "chained filter" pattern to run defined normalizers.
URLNormalizers(Configuration, String) - Constructor for class org.apache.nutch.net.URLNormalizers
 
URLPartitioner - Class in org.apache.nutch.crawl
Partition URLs by host, domain name or IP, depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain' or 'byIP'.
URLPartitioner() - Constructor for class org.apache.nutch.crawl.URLPartitioner
 
URLStreamHandlerFactory - Class in org.apache.nutch.plugin
This URLStreamHandlerFactory knows about all the plugins in use and thus can create the correct URLStreamHandler even if it comes from a plugin classpath.
URLUtil - Class in org.apache.nutch.util
Utility class for URL analysis
URLUtil() - Constructor for class org.apache.nutch.util.URLUtil
 
UrlValidator - Class in org.apache.nutch.urlfilter.validator
Validates URLs.
UrlValidator() - Constructor for class org.apache.nutch.urlfilter.validator.UrlValidator
 
usage - Variable in class org.apache.nutch.util.AbstractChecker
 
usage() - Method in class org.apache.nutch.crawl.Injector
 
usage() - Static method in class org.apache.nutch.util.SitemapProcessor
 
USE_AUTH - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
USE_AUTH - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
useHttp11 - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Do we use HTTP/1.1?
useHttp2 - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Whether to use HTTP/2
useProxy - Variable in class org.apache.nutch.protocol.http.api.HttpBase
Indicates if a proxy is used
useProxy(String) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
useProxy(URI) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
useProxy(URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
 
USER - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
 
USER - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
 
USER_AGENT - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
userAgent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
The Nutch 'User-Agent' request header
USERNAME - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
UTF_8 - Static variable in class org.apache.nutch.crawl.DeduplicationJob
 
UUID_KEY - Static variable in class org.apache.nutch.util.NutchConfiguration
 

V

VAL_RESULT - Static variable in interface org.apache.nutch.metadata.Nutch
Name of the key used in the Result Map sent back by the REST endpoint
VALUE_SERIALIZER - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
 
valueOf(String) - Static method in enum org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.net.protocols.Response.TruncatedContentReason
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.protocol.http.HttpResponse.Scheme
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.protocol.selenium.HttpResponse.Scheme
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.service.JobManager.JobType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.service.model.response.JobInfo.State
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.util.domain.DomainStatistics.MyCounter
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.util.domain.DomainSuffix.Status
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.apache.nutch.util.domain.TopLevelDomain.Type
Returns the enum constant of this type with the specified name.
values() - Static method in enum org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.net.protocols.Response.TruncatedContentReason
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.protocol.http.HttpResponse.Scheme
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.protocol.selenium.HttpResponse.Scheme
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.service.JobManager.JobType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.service.model.response.JobInfo.State
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.util.domain.DomainStatistics.MyCounter
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.util.domain.DomainSuffix.Status
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.apache.nutch.util.domain.TopLevelDomain.Type
Returns an array containing the constants of this enum type, in the order they are declared.
VERSION - Static variable in class org.apache.nutch.indexer.NutchDocument
 

W

walk(Node, URL, Metadata, Configuration) - Static method in class org.creativecommons.nutch.CCParseFilter.Walker
Scan the document adding attributes to metadata.
WARCExporter - Class in org.apache.nutch.tools.warc
MapReduce job to export Nutch segments as WARC files.
WARCExporter() - Constructor for class org.apache.nutch.tools.warc.WARCExporter
 
WARCExporter(Configuration) - Constructor for class org.apache.nutch.tools.warc.WARCExporter
 
WARCExporter.WARCMapReduce - Class in org.apache.nutch.tools.warc
 
WARCExporter.WARCMapReduce.WARCMapper - Class in org.apache.nutch.tools.warc
 
WARCExporter.WARCMapReduce.WARCReducer - Class in org.apache.nutch.tools.warc
 
WARCMapper() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCMapper
 
WARCMapReduce() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce
 
WARCReducer() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
 
WARCUtils - Class in org.apache.nutch.tools
 
WARCUtils() - Constructor for class org.apache.nutch.tools.WARCUtils
 
WebGraph - Class in org.apache.nutch.scoring.webgraph
Creates three databases: one for inlinks, one for outlinks, and a node database that holds the number of inlinks and outlinks for a URL and the current score for the URL.
WebGraph() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph
 
WebGraph.OutlinkDb - Class in org.apache.nutch.scoring.webgraph
The OutlinkDb creates a database of all outlinks.
WebGraph.OutlinkDb.OutlinkDbMapper - Class in org.apache.nutch.scoring.webgraph
Passes through existing LinkDatum objects from an existing OutlinkDb and maps out new LinkDatum objects from the ParseData of new crawls.
WebGraph.OutlinkDb.OutlinkDbReducer - Class in org.apache.nutch.scoring.webgraph
 
webWindowClosed(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
 
webWindowContentChanged(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
 
webWindowOpened(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
 
WEIGHT_FIELD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
 
whitespacePattern - Static variable in class org.apache.nutch.parse.headings.HeadingsParseFilter
Pattern used to strip surplus whitespace
WORK_TYPE - Static variable in interface org.apache.nutch.metadata.CreativeCommons
 
WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
Deprecated.
WRITABLE_CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
 
WRITABLE_FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
WRITABLE_GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
WRITABLE_PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
WRITABLE_REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
 
WritableSerializer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
 
write(DataOutput) - Method in class org.apache.nutch.crawl.CrawlDatum
 
write(DataOutput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
 
write(DataOutput) - Method in class org.apache.nutch.crawl.Inlink
 
write(DataOutput) - Method in class org.apache.nutch.crawl.Inlinks
 
write(DataOutput) - Method in class org.apache.nutch.hostdb.HostDatum
 
write(DataOutput) - Method in class org.apache.nutch.indexer.NutchDocument
 
write(DataOutput) - Method in class org.apache.nutch.indexer.NutchField
 
write(DataOutput) - Method in class org.apache.nutch.indexer.NutchIndexAction
 
write(DataOutput) - Method in class org.apache.nutch.metadata.Metadata
 
write(DataOutput) - Method in class org.apache.nutch.metadata.MetaWrapper
 
write(DataOutput) - Method in class org.apache.nutch.parse.Outlink
 
write(DataOutput) - Method in class org.apache.nutch.parse.ParseData
 
write(DataOutput) - Method in class org.apache.nutch.parse.ParseImpl
 
write(DataOutput) - Method in class org.apache.nutch.parse.ParseStatus
 
write(DataOutput) - Method in class org.apache.nutch.parse.ParseText
 
write(DataOutput) - Method in class org.apache.nutch.protocol.Content
 
write(DataOutput) - Method in class org.apache.nutch.protocol.ProtocolStatus
 
write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
 
write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
 
write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
 
write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.Node
 
write(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
 
write(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
 
write(NutchDocument) - Method in interface org.apache.nutch.indexer.IndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexer.IndexWriters
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
 
write(NutchDocument) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
 
writeArrayValue(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
writeKeyNull(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
writeKeyValue(String, String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
 
writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
 
writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
 
writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
 
writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
writeObjectEntrySeparator(JsonGenerator) - Method in class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
 
writeObjectFieldValueSeparator(JsonGenerator) - Method in class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
 
writeOutAsDuplicate(CrawlDatum, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
 
writeRequest(URI) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
writeResponse() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
 
WWW_AUTHENTICATE - Static variable in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
The HTTP Authentication (WWW-Authenticate) header which is returned by a webserver requiring authentication.

X

X_HIDE_HEADER - Static variable in class org.apache.nutch.tools.WARCUtils
 
X_POINT_ID - Static variable in interface org.apache.nutch.exchange.Exchange
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.indexer.IndexingFilter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.indexer.IndexWriter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.net.URLExemptionFilter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.net.URLFilter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.net.URLNormalizer
 
X_POINT_ID - Static variable in interface org.apache.nutch.parse.HtmlParseFilter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.parse.Parser
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.protocol.Protocol
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.publisher.NutchPublisher
 
X_POINT_ID - Static variable in interface org.apache.nutch.scoring.ScoringFilter
The name of the extension point.
X_POINT_ID - Static variable in interface org.apache.nutch.segment.SegmentMergeFilter
The name of the extension point.
XMLCharacterRecognizer - Class in org.apache.nutch.parse.html
Class used to verify whether the specified ch conforms to the XML 1.0 definition of whitespace.
XMLCharacterRecognizer() - Constructor for class org.apache.nutch.parse.html.XMLCharacterRecognizer
 

Y

YES_VAL - Static variable in class org.apache.nutch.util.TableUtil
 

Z

zip(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
Returns a gzipped copy of the input array.
ZipParser - Class in org.apache.nutch.parse.zip
ZipParser class based on MSPowerPointParser class by Stephan Strittmatter.
ZipParser() - Constructor for class org.apache.nutch.parse.zip.ZipParser
Creates a new instance of ZipParser
ZipTextExtractor - Class in org.apache.nutch.parse.zip
 
ZipTextExtractor(Configuration) - Constructor for class org.apache.nutch.parse.zip.ZipTextExtractor
Creates a new instance of ZipTextExtractor

_

__openPassiveDataConnection(int, String) - Method in class org.apache.nutch.protocol.ftp.Client
Open a passive data connection socket
_compare(byte[], int, int, byte[], int, int) - Static method in class org.apache.nutch.crawl.SignatureComparator
 
_compare(Object, Object) - Static method in class org.apache.nutch.crawl.SignatureComparator
 