A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
All Classes All Packages
A
- abort(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
- abort(String, String) - Method in interface org.apache.nutch.service.JobManager
- abort(String, String) - Method in class org.apache.nutch.service.resources.JobResource
- AbstractChecker - Class in org.apache.nutch.util
-
Scaffolding class for the various Checker implementations.
- AbstractChecker() - Constructor for class org.apache.nutch.util.AbstractChecker
- AbstractCommonCrawlFormat - Class in org.apache.nutch.tools
-
Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
- AbstractCommonCrawlFormat(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.AbstractCommonCrawlFormat
- AbstractFetchSchedule - Class in org.apache.nutch.crawl
-
This class provides common methods for implementations of FetchSchedule.
- AbstractFetchSchedule() - Constructor for class org.apache.nutch.crawl.AbstractFetchSchedule
- AbstractFetchSchedule(Configuration) - Constructor for class org.apache.nutch.crawl.AbstractFetchSchedule
- AbstractResource - Class in org.apache.nutch.service.resources
- AbstractResource() - Constructor for class org.apache.nutch.service.resources.AbstractResource
- AbstractScoringFilter - Class in org.apache.nutch.scoring
- AbstractScoringFilter() - Constructor for class org.apache.nutch.scoring.AbstractScoringFilter
- accept - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The "Accept" request header value.
- accept() - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Return if this rule is used for filtering-in or out.
- accept(InetAddress) - Method in class org.apache.nutch.protocol.okhttp.IPFilterRules
- acceptCharset - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The "Accept-Charset" request header value.
- acceptLanguage - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The "Accept-Language" request header value.
- ACCESS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Access denied - authorization required, but missing/incorrect.
- action - Variable in class org.apache.nutch.indexer.NutchIndexAction
- AdaptiveFetchSchedule - Class in org.apache.nutch.crawl
-
This class implements an adaptive re-fetch algorithm.
- AdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.AdaptiveFetchSchedule
- add(Object) - Method in class org.apache.nutch.indexer.NutchField
- add(String, Object) - Method in class org.apache.nutch.indexer.NutchDocument
- add(String, String) - Method in class org.apache.nutch.metadata.Metadata
-
Add a metadata name/value mapping.
- add(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
- add(Inlink) - Method in class org.apache.nutch.crawl.Inlinks
- add(Inlinks) - Method in class org.apache.nutch.crawl.Inlinks
- add(NutchDocument, String, String) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
- ADD - Static variable in class org.apache.nutch.indexer.NutchIndexAction
- addAll(Metadata) - Method in class org.apache.nutch.metadata.Metadata
-
Add all name/value mappings (merge two metadata mappings).
- addAttribute(String, String) - Method in class org.apache.nutch.plugin.Extension
-
Adds an attribute; only used until model creation at plugin system start-up.
- addClue(String, String) - Method in class org.apache.nutch.util.EncodingDetector
- addClue(String, String, int) - Method in class org.apache.nutch.util.EncodingDetector
- addDependency(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Adds a dependency
- addEventData(String, Object) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Add new data to the eventData object.
- addExportedLibRelative(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Adds an exported library with a relative path to the plugin directory.
- addExtension(Extension) - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Install a corresponding extension to this extension point.
- addExtension(Extension) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Adds an extension.
- addExtensionPoint(ExtensionPoint) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Adds an extension point.
- addFetchItem(Text, CrawlDatum) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- addFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueue
- addFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- addHeader(String, Object) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- addIfNotNull(NutchDocument, String, Object) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
-
Add field to document but only if value isn't null
- addInProgressFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueue
- addMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Add metadata.
- addNotExportedLibRelative(String) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Adds a non-exported library with a relative path to the plugin directory.
- addOutlinksToEventData(Collection<Outlink>) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Given a collection of outlinks, this method adds them to the outlink metadata.
- addPatternBackward(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Adds any necessary nodes to the trie so that the given String can be decoded in reverse and the first character is represented by a terminal node.
- addPatternForward(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Adds any necessary nodes to the trie so that the given String can be decoded and the last character is represented by a terminal node.
- addRobotsContent(List<Content>, URL, Response) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
Append Content of robots.txt to robotsTxtContent
- addUrlFeatures(NutchDocument, String) - Method in class org.creativecommons.nutch.CCIndexingFilter
-
Add the features represented by a license URL.
- AdminResource - Class in org.apache.nutch.service.resources
- AdminResource() - Constructor for class org.apache.nutch.service.resources.AdminResource
- afterExecute(Runnable, Throwable) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
- agentNames - Variable in class org.apache.nutch.protocol.RobotRulesParser
- AJAX_URL_PART - Static variable in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
- AjaxURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.ajax
-
URLNormalizer capable of dealing with AJAX URLs.
- AjaxURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Default constructor.
- allowForbidden - Variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
- allowList - Variable in class org.apache.nutch.protocol.RobotRulesParser
-
set of host names or IPs to be explicitly excluded from robots.txt checking
- analyze(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
Runs the complete link analysis job.
- AnchorIndexingFilter - Class in org.apache.nutch.indexer.anchor
-
Indexing filter that offers an option to either index all inbound anchor text for a document or deduplicate anchors.
- AnchorIndexingFilter() - Constructor for class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
- ANY - org.apache.nutch.service.model.response.JobInfo.State
- append(Node) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Append a node to the current container.
- ArbitraryIndexingFilter - Class in org.apache.nutch.indexer.arbitrary
-
Adds arbitrary searchable fields to a document from the class and method the user identifies in the config.
- ArbitraryIndexingFilter() - Constructor for class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
- ArcInputFormat - Class in org.apache.nutch.tools.arc
-
An input format that reads arc files.
- ArcInputFormat() - Constructor for class org.apache.nutch.tools.arc.ArcInputFormat
- ArcRecordReader - Class in org.apache.nutch.tools.arc
-
The ArcRecordReader class provides a record reader which reads records from arc files.
- ArcRecordReader(Configuration, FileSplit) - Constructor for class org.apache.nutch.tools.arc.ArcRecordReader
-
Constructor that sets the configuration and file split.
- ArcSegmentCreator - Class in org.apache.nutch.tools.arc
-
The ArcSegmentCreator is a replacement for fetcher that will take arc files as input and produce a nutch segment as output.
- ArcSegmentCreator() - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator
- ArcSegmentCreator(Configuration) - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Constructor that sets the job configuration.
- ArcSegmentCreator.ArcSegmentCreatorMapper - Class in org.apache.nutch.tools.arc
- ArcSegmentCreatorMapper() - Constructor for class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
- areAvailableExchanges() - Method in class org.apache.nutch.exchange.Exchanges
- ARG_CRAWLDB - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify the location of crawldb for the REST endpoints
- ARG_HOSTDB - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify the location of hostdb for the REST endpoints
- ARG_LINKDB - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify the location of linkdb for the REST endpoints
- ARG_SEEDDIR - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify location of the seed url dir for the REST endpoints
- ARG_SEEDNAME - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify name of a seed list for the REST endpoints
- ARG_SEGMENTDIR - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify the location of a directory of segments for the REST endpoints.
- ARG_SEGMENTS - Static variable in interface org.apache.nutch.metadata.Nutch
-
Argument key to specify the location of individual segment or list of segments for the REST endpoints.
- args - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- attrName - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
- AUTH_HEADER_NAME - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- AUTH_HEADER_VALUE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- autoDetectClues(Content, boolean) - Method in class org.apache.nutch.util.EncodingDetector
- AutomatonURLFilter - Class in org.apache.nutch.urlfilter.automaton
-
RegexURLFilterBase implementation based on the dk.brics.automaton Finite-State Automata for Java.
- AutomatonURLFilter() - Constructor for class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- AutomatonURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- autoResolveContentType(String, String, byte[]) - Method in class org.apache.nutch.util.MimeUtil
-
A facade interface for trying all the possible mime type resolution strategies available within Tika.
B
- BasicIndexingFilter - Class in org.apache.nutch.indexer.basic
-
Adds basic searchable fields to a document.
- BasicIndexingFilter() - Constructor for class org.apache.nutch.indexer.basic.BasicIndexingFilter
- BasicURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.basic
-
Converts URLs to a normal form: remove dot segments in path: /./ or /../ remove default ports, e.g.
- BasicURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
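The two normalization steps named in the BasicURLNormalizer summary (dot-segment removal and default-port removal) can be illustrated with a minimal sketch built only on `java.net.URI`. This is not the Nutch implementation; the class name `UrlNormalizeSketch` and its `normalize` method are hypothetical, and only the default ports of `http` (80) and `https` (443) are assumed.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration, not the BasicURLNormalizer code.
final class UrlNormalizeSketch {

    static String normalize(String url) {
        try {
            // URI.normalize() removes "." segments and resolves ".." segments in the path.
            URI uri = new URI(url).normalize();
            String scheme = uri.getScheme();
            int port = uri.getPort();
            // Assumption: only the http and https default ports are handled here.
            boolean defaultPort = ("http".equalsIgnoreCase(scheme) && port == 80)
                    || ("https".equalsIgnoreCase(scheme) && port == 443);
            if (defaultPort) {
                // Rebuild the URI without a port (-1 means "no port").
                // The decoded components are passed because this constructor quotes
                // illegal characters itself; escaped paths may be re-encoded differently.
                uri = new URI(scheme, uri.getUserInfo(), uri.getHost(), -1,
                        uri.getPath(), uri.getQuery(), uri.getFragment());
            }
            return uri.toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com:80/a/./b/../c"));
    }
}
```

A production normalizer has to handle many more cases (percent-encoding, host case, empty paths); the sketch only covers the two steps listed in the summary.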
- BATCH_DUMP - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
- beforeExecute(Thread, Runnable) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
- bind(String, String, String, String, String, String) - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
-
Creates a relationship between an exchange and a queue.
- BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Deprecated.
- BlockedException - Exception in org.apache.nutch.protocol.http.api
- BlockedException(String) - Constructor for exception org.apache.nutch.protocol.http.api.BlockedException
- booleanValue() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
- buffer - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- BUFFER_SIZE - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
- BULK_CLOSE_TIMEOUT - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- BULK_CLOSE_TIMEOUT - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- bulkProcessorListener() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
-
Generates a default BulkProcessor.Listener
- bulkProcessorListener() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
-
Generates a default BulkProcessor.Listener
- bytes - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
C
- CACHE - Static variable in class org.apache.nutch.protocol.RobotRulesParser
- CACHING_FORBIDDEN_ALL - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show either original forbidden content or summaries.
- CACHING_FORBIDDEN_CONTENT - Static variable in interface org.apache.nutch.metadata.Nutch
-
Don't show original forbidden content, but show summaries.
- CACHING_FORBIDDEN_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Sites may request that search engines don't provide access to cached documents.
- CACHING_FORBIDDEN_NONE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Show both original forbidden content and summaries (default).
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.MD5Signature
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.Signature
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextMD5Signature
- calculate(Content, Parse) - Method in class org.apache.nutch.crawl.TextProfileSignature
- calculateLastFetchTime(CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method returns the last fetch time of the CrawlDatum
- calculateLastFetchTime(CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Calculates last fetch time of the given CrawlDatum.
- canStop(boolean) - Method in class org.apache.nutch.service.NutchServer
- CaseInsensitiveMetadata - Class in org.apache.nutch.metadata
-
A decorator to Metadata that adds case-insensitive lookup of keys.
- CaseInsensitiveMetadata() - Constructor for class org.apache.nutch.metadata.CaseInsensitiveMetadata
-
Constructs a new, empty metadata.
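Case-insensitive key lookup, as described for CaseInsensitiveMetadata, can be sketched with a `TreeMap` ordered by `String.CASE_INSENSITIVE_ORDER`. This is an assumption-laden illustration rather than the class's actual design: `CaseInsensitiveLookupSketch`, `set`, and `get` are hypothetical names, and single-valued entries are assumed for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the CaseInsensitiveMetadata code.
final class CaseInsensitiveLookupSketch {

    // The comparator treats "Content-Type" and "content-type" as the same key.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        CaseInsensitiveLookupSketch metadata = new CaseInsensitiveLookupSketch();
        metadata.set("Content-Type", "text/html");
        System.out.println(metadata.get("content-type"));
    }
}
```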
- CCIndexingFilter - Class in org.creativecommons.nutch
-
Adds basic searchable fields to a document.
- CCIndexingFilter() - Constructor for class org.creativecommons.nutch.CCIndexingFilter
- CCParseFilter - Class in org.creativecommons.nutch
-
Adds metadata identifying the Creative Commons license used, if any.
- CCParseFilter() - Constructor for class org.creativecommons.nutch.CCParseFilter
- CCParseFilter.Walker - Class in org.creativecommons.nutch
-
Walks DOM tree, looking for RDF in comments and licenses in anchors.
- cdata(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of cdata.
- CHAR_ENCODING_FOR_CONVERSION - Static variable in interface org.apache.nutch.metadata.Nutch
- characters(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of character data.
- charactersRaw(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
If available, when the disable-output-escaping attribute is used, output raw text without escaping.
- chars - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- CHARSET_UTF8 - Static variable in class org.apache.nutch.parse.feed.FeedParser
- checkAndReplace(String, String) - Method in class org.apache.nutch.indexer.replace.FieldReplacer
-
Return a replacement value for a field.
- checkAny - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
- checkClientTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
- checkExceptionThreshold(String) - Method in class org.apache.nutch.fetcher.FetchItemQueues
-
Increment the exception counter of a queue in case of an exception e.g.
- checkExceptionThreshold(String, int, long) - Method in class org.apache.nutch.fetcher.FetchItemQueues
-
Increment the exception counter of a queue in case of an exception e.g.
- checkFailed - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- checkKnown - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- checkNew - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- checkOutputSpecs(JobContext) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
- checkOutputSpecs(JobContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
- checkQueueMode(String) - Static method in class org.apache.nutch.fetcher.FetchItemQueues
-
Check whether queue mode is valid, fall-back to default mode if not.
- checkRobotsTxt - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- checkRobotsTxt - Variable in class org.apache.nutch.parse.ParserChecker
- checkSegmentDir(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
-
Check the segment to see if it is valid based on the sub directories.
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
- checkServerTrusted(X509Certificate[], String) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
- checkTimelimit() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- childLen - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
- children - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
- childrenList - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
- chooseRepr(String, String, boolean) - Static method in class org.apache.nutch.util.URLUtil
-
Given two urls, a src and a destination of a redirect, it returns the representative url.
- CIDR - Class in org.apache.nutch.protocol.okhttp
-
Parse a CIDR block notation and test whether an IP address is contained in the subnet range defined by the CIDR.
- CIDR(String) - Constructor for class org.apache.nutch.protocol.okhttp.CIDR
- CIDR(InetAddress, int) - Constructor for class org.apache.nutch.protocol.okhttp.CIDR
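The containment test described for the CIDR class can be sketched as a prefix comparison over the raw address bytes. The following is a minimal illustration under stated assumptions, not the Nutch implementation: `CidrSketch` and its `matches` helper are hypothetical names, and inputs are assumed to be literal IP addresses so that no DNS lookup takes place.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration, not the org.apache.nutch.protocol.okhttp.CIDR code.
final class CidrSketch {

    private final byte[] network;
    private final int prefixLength;

    CidrSketch(InetAddress address, int prefixLength) {
        this.network = address.getAddress();
        if (prefixLength < 0 || prefixLength > network.length * 8) {
            throw new IllegalArgumentException("Invalid prefix length: " + prefixLength);
        }
        this.prefixLength = prefixLength;
    }

    boolean contains(InetAddress address) {
        byte[] candidate = address.getAddress();
        if (candidate.length != network.length) {
            return false; // IPv4 block vs. IPv6 address, or the reverse
        }
        int fullBytes = prefixLength / 8;
        int remainingBits = prefixLength % 8;
        // Compare the bytes that are fully covered by the prefix.
        for (int i = 0; i < fullBytes; i++) {
            if (candidate[i] != network[i]) {
                return false;
            }
        }
        if (remainingBits == 0) {
            return true;
        }
        // Compare only the leading bits of the partially covered byte.
        int mask = (0xFF << (8 - remainingBits)) & 0xFF;
        return (candidate[fullBytes] & mask) == (network[fullBytes] & mask);
    }

    // Convenience wrapper for "a.b.c.d/n" notation and a literal IP address.
    static boolean matches(String block, String ip) {
        String[] parts = block.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected address/prefix: " + block);
        }
        try {
            CidrSketch cidr = new CidrSketch(
                    InetAddress.getByName(parts[0]), Integer.parseInt(parts[1]));
            return cidr.contains(InetAddress.getByName(ip));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not an IP address", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("192.168.1.0/24", "192.168.1.42"));
    }
}
```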
- CircularDependencyException - Exception in org.apache.nutch.plugin
-
CircularDependencyException will be thrown if a circular dependency is detected.
- CircularDependencyException(String) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
- CircularDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.CircularDependencyException
- CLASS - org.apache.nutch.service.JobManager.JobType
- CLASSIC - org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
- classify(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Classify
- classify(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- Classify - Class in org.apache.nutch.parsefilter.naivebayes
- Classify() - Constructor for class org.apache.nutch.parsefilter.naivebayes.Classify
- cleanField(String) - Static method in class org.apache.nutch.util.StringUtil
-
Simple character substitution which cleans/removes all � (U+FFFD replacement character) chars from a given String.
- CleaningJob - Class in org.apache.nutch.indexer
-
The class scans CrawlDB looking for entries with status DB_GONE (404) or DB_DUPLICATE and sends delete requests to indexers for those documents.
- CleaningJob() - Constructor for class org.apache.nutch.indexer.CleaningJob
- CleaningJob.DBFilter - Class in org.apache.nutch.indexer
- CleaningJob.DeleterReducer - Class in org.apache.nutch.indexer
- cleanMimeType(String) - Static method in class org.apache.nutch.util.MimeUtil
-
Cleans a MimeType name by removing out the actual MimeType, from a string of the form:
- cleanup(Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
- cleanup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
- cleanup(Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
-
Shut down all running threads and wait for completion.
- cleanupAfterFailure(Path, FileSystem) - Static method in class org.apache.nutch.util.NutchJob
-
Clean up the file system in case of a job failure.
- cleanupAfterFailure(Path, Path, FileSystem) - Static method in class org.apache.nutch.util.NutchJob
-
Clean up the file system in case of a job failure.
- cleanUpDriver(WebDriver) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
- cleanUpDriver(WebDriver) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- clear() - Method in class org.apache.nutch.crawl.Inlinks
- clear() - Method in class org.apache.nutch.metadata.Metadata
-
Remove all mappings from metadata.
- clearClues() - Method in class org.apache.nutch.util.EncodingDetector
-
Clears all clues.
- Client - Class in org.apache.nutch.protocol.ftp
-
Client.java encapsulates functionalities necessary for nutch to get dir list and retrieve file from an FTP server.
- Client() - Constructor for class org.apache.nutch.protocol.ftp.Client
-
Public default constructor
- CLIENT_TRANSFER_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- clone() - Method in class org.apache.nutch.crawl.CrawlDatum
- clone() - Method in class org.apache.nutch.hostdb.HostDatum
- clone() - Method in class org.apache.nutch.indexer.NutchDocument
- clone() - Method in class org.apache.nutch.indexer.NutchField
- close() - Method in class org.apache.nutch.crawl.CrawlDbReader
- close() - Method in class org.apache.nutch.crawl.LinkDbReader
- close() - Method in interface org.apache.nutch.indexer.IndexWriter
- close() - Method in class org.apache.nutch.indexer.IndexWriters
- close() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- close() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- close() - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
-
Closes the channel and the connection with the server.
- close() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- close() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Closes the record reader resources.
- close() - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
- close() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Optional method that could be implemented if the actual format needs some close procedure.
- close() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- close(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
- close(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
- closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- closeArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- closeObject(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- closeObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- closeReaders(MapFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of MapFile readers.
- closeReaders(SequenceFile.Reader[]) - Static method in class org.apache.nutch.util.FSUtils
-
Closes a group of SequenceFile readers.
- CloudSearchConstants - Interface in org.apache.nutch.indexwriter.cloudsearch
- CloudSearchIndexWriter - Class in org.apache.nutch.indexwriter.cloudsearch
-
Writes documents to CloudSearch.
- CloudSearchIndexWriter() - Constructor for class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- CloudSearchUtils - Class in org.apache.nutch.indexwriter.cloudsearch
- CloudSearchUtils() - Constructor for class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
- COLLECTION - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- CollectionManager - Class in org.apache.nutch.collection
- CollectionManager() - Constructor for class org.apache.nutch.collection.CollectionManager
-
Used for testing
- CollectionManager(Configuration) - Constructor for class org.apache.nutch.collection.CollectionManager
- COLONSP - Static variable in class org.apache.nutch.tools.WARCUtils
- CommandRunner - Class in org.apache.nutch.util
- CommandRunner() - Constructor for class org.apache.nutch.util.CommandRunner
- comment(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report an XML comment anywhere in the document.
- commit() - Method in interface org.apache.nutch.indexer.IndexWriter
- commit() - Method in class org.apache.nutch.indexer.IndexWriters
- commit() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
-
(nothing to commit)
- commit() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- commit() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- COMMIT_SIZE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- CommonCrawlConfig - Class in org.apache.nutch.tools
- CommonCrawlConfig() - Constructor for class org.apache.nutch.tools.CommonCrawlConfig
-
Default constructor
- CommonCrawlConfig(InputStream) - Constructor for class org.apache.nutch.tools.CommonCrawlConfig
- CommonCrawlDataDumper - Class in org.apache.nutch.tools
-
The Common Crawl Data Dumper tool enables one to reverse generate the raw content from Nutch segment data directories into a common crawling data format, consumed by many applications.
- CommonCrawlDataDumper() - Constructor for class org.apache.nutch.tools.CommonCrawlDataDumper
-
Constructor
- CommonCrawlDataDumper(CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlDataDumper
-
Configurable constructor
- commoncrawlDump(ServiceConfig) - Method in class org.apache.nutch.service.resources.ServicesResource
- CommonCrawlFormat - Interface in org.apache.nutch.tools
-
Interface for all CommonCrawl formatters.
- CommonCrawlFormatFactory - Class in org.apache.nutch.tools
-
Factory class that creates new CommonCrawlFormat objects (a.k.a.
- CommonCrawlFormatFactory() - Constructor for class org.apache.nutch.tools.CommonCrawlFormatFactory
- CommonCrawlFormatJackson - Class in org.apache.nutch.tools
-
This class provides methods to map crawled data on JSON using Jackson Streaming APIs.
- CommonCrawlFormatJackson(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJackson
- CommonCrawlFormatJackson(Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJackson
- CommonCrawlFormatJettinson - Class in org.apache.nutch.tools
-
This class provides methods to map crawled data on JSON using Jettison APIs.
- CommonCrawlFormatJettinson(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatJettinson
- CommonCrawlFormatSimple - Class in org.apache.nutch.tools
-
This class provides methods to map crawled data on JSON using a StringBuilder object.
- CommonCrawlFormatSimple(String, Content, Metadata, Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatSimple
- CommonCrawlFormatWARC - Class in org.apache.nutch.tools
- CommonCrawlFormatWARC(String, Content, Metadata, Configuration, CommonCrawlConfig, ParseData) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatWARC
- CommonCrawlFormatWARC(Configuration, CommonCrawlConfig) - Constructor for class org.apache.nutch.tools.CommonCrawlFormatWARC
- Comparator() - Constructor for class org.apache.nutch.crawl.CrawlDatum.Comparator
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.CrawlDatum.Comparator
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
-
Compares two FloatWritables in decreasing order.
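A byte-level comparison in decreasing order, in the spirit of the `compare(byte[], int, int, byte[], int, int)` signature above, might look like the following sketch. It assumes each value is serialized as a 4-byte big-endian IEEE 754 float (the layout `DataOutput.writeFloat` produces); `DecreasingFloatSketch` and its `encode` helper are hypothetical and do not depend on Hadoop.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration, not the Generator.DecreasingFloatComparator code.
final class DecreasingFloatSketch {

    // Reads one float from each byte range and orders the larger value first.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        float f1 = ByteBuffer.wrap(b1, s1, l1).getFloat(); // big-endian by default
        float f2 = ByteBuffer.wrap(b2, s2, l2).getFloat();
        return Float.compare(f2, f1); // operands swapped: decreasing order
    }

    // Helper for the demo: serializes a float as 4 big-endian bytes.
    static byte[] encode(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    public static void main(String[] args) {
        byte[] high = encode(2.0f);
        byte[] low = encode(1.0f);
        // A negative result means the first argument sorts before the second.
        System.out.println(compare(high, 0, 4, low, 0, 4));
    }
}
```

Comparing on the serialized bytes avoids deserializing full objects during a sort, which is the usual motivation for this method shape.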
- compare(byte[], int, int, byte[], int, int) - Method in class org.apache.nutch.crawl.Generator.HashComparator
- compare(Object, Object) - Method in class org.apache.nutch.crawl.SignatureComparator
- compare(WritableComparable, WritableComparable) - Method in class org.apache.nutch.crawl.Generator.HashComparator
- compareOrder - Variable in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- compareTo(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sort two CrawlDatum objects by decreasing score.
- compareTo(TrieStringMatcher.TrieNode) - Method in class org.apache.nutch.util.TrieStringMatcher.TrieNode
- computeCosineSimilarity(DocVector) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
- conf - Variable in class org.apache.nutch.crawl.Signature
- conf - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- conf - Variable in class org.apache.nutch.plugin.Plugin
- conf - Variable in class org.apache.nutch.protocol.RobotRulesParser
- conf - Static variable in interface org.apache.nutch.service.NutchReader
- conf - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- conf - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- configManager - Variable in class org.apache.nutch.service.resources.AbstractResource
- ConfigResource - Class in org.apache.nutch.service.resources
- ConfigResource() - Constructor for class org.apache.nutch.service.resources.ConfigResource
- ConfManager - Interface in org.apache.nutch.service
- ConfManagerImpl - Class in org.apache.nutch.service.impl
- ConfManagerImpl() - Constructor for class org.apache.nutch.service.impl.ConfManagerImpl
- CONFORMS_TO - Static variable in class org.apache.nutch.tools.WARCUtils
- connectionFailures - Variable in class org.apache.nutch.hostdb.HostDatum
- contains(InetAddress) - Method in class org.apache.nutch.protocol.okhttp.CIDR
- containsWord(String, ArrayList<String>) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- content - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- Content - Class in org.apache.nutch.protocol
- Content() - Constructor for class org.apache.nutch.protocol.Content
- Content(String, String, byte[], String, Metadata, Configuration) - Constructor for class org.apache.nutch.protocol.Content
- Content(String, String, byte[], String, Metadata, MimeUtil) - Constructor for class org.apache.nutch.protocol.Content
- CONTENT_DISPOSITION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_LANGUAGE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_LENGTH - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_MD5 - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- CONTENT_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
- CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- ContentAsTextInputFormat - Class in org.apache.nutch.segment
-
An input format that takes Nutch Content objects and converts them to text while converting newline endings to spaces.
- ContentAsTextInputFormat() - Constructor for class org.apache.nutch.segment.ContentAsTextInputFormat
- context - Variable in class org.apache.nutch.hostdb.ResolverThread
- CONTRIBUTOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making contributions to the content of the resource.
- COOKIE - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
- CosineSimilarity - Class in org.apache.nutch.scoring.similarity.cosine
- CosineSimilarity() - Constructor for class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
- count(String) - Method in class org.apache.nutch.service.impl.LinkReader
- count(String) - Method in class org.apache.nutch.service.impl.NodeReader
- count(String) - Method in class org.apache.nutch.service.impl.SequenceReader
- count(String) - Method in interface org.apache.nutch.service.NutchReader
- count(CrawlDatum) - Method in interface org.apache.nutch.hostdb.CrawlDatumProcessor
-
Process a single crawl datum instance to aggregate custom counts.
- count(CrawlDatum) - Method in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- COUNTRY - org.apache.nutch.util.domain.TopLevelDomain.Type
- COVERAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The extent or scope of the content of the resource.
- CRAWL_ID_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Used by Nutch REST service
- CrawlCompletionStats - Class in org.apache.nutch.util
-
Extracts some simple crawl completion stats from the crawldb. Stats will be sorted by host/domain and will be of the form: "1 www.spitzer.caltech.edu FETCHED", "50 www.spitzer.caltech.edu UNFETCHED".
- CrawlCompletionStats() - Constructor for class org.apache.nutch.util.CrawlCompletionStats
- CrawlCompletionStats.CrawlCompletionStatsCombiner - Class in org.apache.nutch.util
- CrawlCompletionStatsCombiner() - Constructor for class org.apache.nutch.util.CrawlCompletionStats.CrawlCompletionStatsCombiner
- crawlDatum - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- CrawlDatum - Class in org.apache.nutch.crawl
- CrawlDatum() - Constructor for class org.apache.nutch.crawl.CrawlDatum
- CrawlDatum(int, int) - Constructor for class org.apache.nutch.crawl.CrawlDatum
- CrawlDatum(int, int, float) - Constructor for class org.apache.nutch.crawl.CrawlDatum
- CrawlDatum.Comparator - Class in org.apache.nutch.crawl
-
A Comparator optimized for CrawlDatum.
- CrawlDatumCsvOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
- CrawlDatumJsonOutputFormat() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat
- CrawlDatumProcessor - Interface in org.apache.nutch.hostdb
-
These are instantiated once for each host.
- crawlDatumProcessors - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- crawlDb - Variable in class org.apache.nutch.crawl.CrawlDbReader
- CrawlDb - Class in org.apache.nutch.crawl
-
This class takes the output of the fetcher and updates the crawldb accordingly.
- CrawlDb() - Constructor for class org.apache.nutch.crawl.CrawlDb
- CrawlDb(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDb
- CRAWLDB_ADDITIONS_ALLOWED - Static variable in class org.apache.nutch.crawl.CrawlDb
- CRAWLDB_PURGE_404 - Static variable in class org.apache.nutch.crawl.CrawlDb
- CRAWLDB_PURGE_ORPHANS - Static variable in class org.apache.nutch.crawl.CrawlDb
- CrawlDbDumpMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
- CrawlDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization and filtering steps from the rest of CrawlDb manipulation code.
- CrawlDbFilter() - Constructor for class org.apache.nutch.crawl.CrawlDbFilter
- CrawlDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
- CrawlDbMerger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
- CrawlDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.CrawlDbMerger
- CrawlDbMerger.Merger - Class in org.apache.nutch.crawl
- CrawlDbReader - Class in org.apache.nutch.crawl
-
Read utility for the CrawlDB.
- CrawlDbReader() - Constructor for class org.apache.nutch.crawl.CrawlDbReader
- CrawlDbReader.CrawlDatumCsvOutputFormat - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDatumJsonOutputFormat - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDbDumpMapper - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDbStatMapper - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDbStatReducer - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDbTopNMapper - Class in org.apache.nutch.crawl
- CrawlDbReader.CrawlDbTopNReducer - Class in org.apache.nutch.crawl
- CrawlDbReader.JsonIndenter - Class in org.apache.nutch.crawl
- CrawlDbReducer - Class in org.apache.nutch.crawl
-
Merge new page entries with existing entries.
- CrawlDbReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReducer
- CrawlDbStatMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
- CrawlDbStatReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
- CrawlDbTopNMapper() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
- CrawlDbTopNReducer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
- CrawlDbUpdateMapper() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateMapper
- CrawlDbUpdater() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater
- CrawlDbUpdateReducer() - Constructor for class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
- create() - Static method in class org.apache.nutch.util.NutchConfiguration
-
Create a Configuration for Nutch.
- create(boolean, Properties) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Create a Configuration from supplied properties.
- create(Text, CrawlDatum, String) - Static method in class org.apache.nutch.fetcher.FetchItem
-
Create an item.
- create(Text, CrawlDatum, String, int) - Static method in class org.apache.nutch.fetcher.FetchItem
-
Create an item.
- create(JobConfig) - Method in class org.apache.nutch.service.impl.JobManagerImpl
- create(JobConfig) - Method in interface org.apache.nutch.service.JobManager
-
Creates specified job
- create(JobConfig) - Method in class org.apache.nutch.service.resources.JobResource
-
Create a new job
- create(NutchConfig) - Method in interface org.apache.nutch.service.ConfManager
- create(NutchConfig) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
-
Creates a new configuration based on the values provided.
- createChromeRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createChromeWebDriver(String, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createComponents(String) - Method in class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
- createConfig(NutchConfig) - Method in class org.apache.nutch.service.resources.ConfigResource
-
Create new configuration.
- createDefaultRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createDocFromCityDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromCityService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromConnectionDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromCountryService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromDomainDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromInsightsService(String, NutchDocument, WebServiceClient) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocFromIspDb(String, NutchDocument, DatabaseReader) - Static method in class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- createDocVector(String, int, int) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
-
Used to create a DocVector from given String text.
- createFileName(String, String, String) - Static method in class org.apache.nutch.util.DumpFileUtil
- createFileNameFromUrl(String, String, String, String, String, boolean) - Static method in class org.apache.nutch.util.DumpFileUtil
- createFirefoxRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createFirefoxWebDriver(String, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createJob(Configuration, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
- createKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the Text object for the key.
- createLockFile(Configuration, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
-
Create a lock file.
- createLockFile(FileSystem, Path, boolean) - Static method in class org.apache.nutch.util.LockUtil
-
Create a lock file.
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
- createMergeJob(Configuration, Path, boolean, boolean) - Static method in class org.apache.nutch.crawl.LinkDbMerger
- createModel(Configuration) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
- createParseResult(String, Parse) - Static method in class org.apache.nutch.parse.ParseResult
-
Convenience method for obtaining ParseResult from a single Parse output.
- createRandomRemoteWebDriver(URL, boolean) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- createRecordReader(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
- createRecordReader(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Creates a new RegexRule.
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- createRule(boolean, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
- createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Creates a new RegexRule.
- createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- createRule(boolean, String, String) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
- createSeedFile(SeedList) - Method in class org.apache.nutch.service.resources.SeedResource
-
Method creates seed list file and returns temporary directory path
- createSegments(Path, Path) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Creates the arc files to segments job.
- createSocket(String, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
- createSocket(String, int, InetAddress, int) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
- createSocket(String, int, InetAddress, int, HttpConnectionParams) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Attempts to get a new socket connection to the given host within the given time limit.
- createSocket(Socket, String, int, boolean) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
- createSubCollection(String, String) - Method in class org.apache.nutch.collection.CollectionManager
-
Create a new subcollection.
- createToolByClassName(String, Configuration) - Method in class org.apache.nutch.service.impl.JobFactory
- createToolByType(JobManager.JobType, Configuration) - Method in class org.apache.nutch.service.impl.JobFactory
- createTwoLevelsDirectory(String, String) - Static method in class org.apache.nutch.util.DumpFileUtil
- createTwoLevelsDirectory(String, String, boolean) - Static method in class org.apache.nutch.util.DumpFileUtil
- createURLStreamHandler(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Invoked whenever a URL needs to be instantiated.
- createURLStreamHandler(String) - Method in class org.apache.nutch.plugin.URLStreamHandlerFactory
- createValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Creates a new instance of the BytesWritable object for the value.
- createWebGraph(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Creates the three different WebGraph databases, Outlinks, Inlinks, and Node.
- CreativeCommons - Interface in org.apache.nutch.metadata
-
A collection of Creative Commons property names.
- CREATOR - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity primarily responsible for making the content of the resource.
- CRLF - Static variable in class org.apache.nutch.tools.WARCUtils
- CSV_CHARSET - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_ESCAPECHARACTER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_FIELD_SEPARATOR - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_FIELDS - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_MAXFIELDLENGTH - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_MAXFIELDVALUES - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_OUTPATH - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_QUOTECHARACTER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_VALUESEPARATOR - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSV_WITHHEADER - Static variable in interface org.apache.nutch.indexwriter.csv.CSVConstants
- CSVConstants - Interface in org.apache.nutch.indexwriter.csv
- CSVIndexWriter - Class in org.apache.nutch.indexwriter.csv
-
Write Nutch documents to a CSV file (comma separated values), i.e., dump index as CSV or tab-separated plain text table.
- CSVIndexWriter() - Constructor for class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- CSVIndexWriter.Separator - Class in org.apache.nutch.indexwriter.csv
-
Represents separators (also quote and escape characters) as char(s) and byte(s) in the output encoding for efficiency.
- csvout - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
- CURRENT_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
- CURRENT_NAME - Static variable in class org.apache.nutch.util.SitemapProcessor
- currentJob - Variable in class org.apache.nutch.util.NutchTool
- currentJobNum - Variable in class org.apache.nutch.util.NutchTool
D
- DATE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A date associated with an event in the life cycle of the resource.
- dateFormatStr - Static variable in class org.apache.nutch.indexer.feed.FeedIndexingFilter
- datum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
- datum - Variable in class org.apache.nutch.hostdb.ResolverThread
- DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE - Static variable in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- DBFilter() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DBFilter
- DBFilter() - Constructor for class org.apache.nutch.indexer.CleaningJob.DBFilter
- DbQuery - Class in org.apache.nutch.service.model.request
- DbQuery() - Constructor for class org.apache.nutch.service.model.request.DbQuery
- DbResource - Class in org.apache.nutch.service.resources
- DbResource() - Constructor for class org.apache.nutch.service.resources.DbResource
- DebugParseFilter - Class in org.apache.nutch.parsefilter.debug
-
Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
- DebugParseFilter() - Constructor for class org.apache.nutch.parsefilter.debug.DebugParseFilter
- DEC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
- DecreasingFloatComparator() - Constructor for class org.apache.nutch.crawl.Generator.DecreasingFloatComparator
- DEDUP - org.apache.nutch.service.JobManager.JobType
- DEDUPLICATION_COMPARE_ORDER - Static variable in class org.apache.nutch.crawl.DeduplicationJob
- DEDUPLICATION_GROUP_MODE - Static variable in class org.apache.nutch.crawl.DeduplicationJob
- DeduplicationJob - Class in org.apache.nutch.crawl
-
Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed).
- DeduplicationJob() - Constructor for class org.apache.nutch.crawl.DeduplicationJob
- DeduplicationJob.DBFilter - Class in org.apache.nutch.crawl
- DeduplicationJob.DedupReducer<K extends Writable> - Class in org.apache.nutch.crawl
- DeduplicationJob.StatusUpdateReducer - Class in org.apache.nutch.crawl
-
Combine multiple new entries for a url.
- DedupReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- DefalultMultiInteractionHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
-
This is a placeholder/example of a technique or use case where we do multiple interactions with the web driver and need the data from each such interaction at the end.
- DefalultMultiInteractionHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
- DEFAULT - Static variable in class org.apache.nutch.service.resources.ConfigResource
- DEFAULT_BOOST - Static variable in class org.apache.nutch.util.domain.DomainSuffix
- DEFAULT_FILE_NAME - Static variable in class org.apache.nutch.collection.CollectionManager
- DEFAULT_ID - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
- DEFAULT_MAX_DEPTH - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
- DEFAULT_PLUGIN - Static variable in class org.apache.nutch.parse.ParserFactory
-
Wildcard for default plugins.
- DEFAULT_STATUS - Static variable in class org.apache.nutch.util.domain.DomainSuffix
- DefaultClickAllAjaxLinksHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
- DefaultClickAllAjaxLinksHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
- DefaultFetchSchedule - Class in org.apache.nutch.crawl
-
This class implements the default re-fetch schedule.
- DefaultFetchSchedule() - Constructor for class org.apache.nutch.crawl.DefaultFetchSchedule
- DefaultHandler - Class in org.apache.nutch.protocol.interactiveselenium.handlers
- DefaultHandler() - Constructor for class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
- defaultInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
- defaultProtocolImplMapping - Variable in class org.apache.nutch.protocol.ProtocolFactory
- DEFER_VISIT_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code.
- deferVisits503 - Variable in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
- deflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns a deflated copy of the input array.
- DeflateUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on deflated data.
- DeflateUtils() - Constructor for class org.apache.nutch.util.DeflateUtils
- delete(String) - Method in interface org.apache.nutch.indexer.IndexWriter
- delete(String) - Method in class org.apache.nutch.indexer.IndexWriters
- delete(String) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
-
(deletion of documents is not supported)
- delete(String) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- delete(String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- delete(String) - Method in interface org.apache.nutch.service.ConfManager
- delete(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
- delete(String, boolean) - Method in class org.apache.nutch.indexer.CleaningJob
- DELETE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
- DELETE - Static variable in interface org.apache.nutch.indexwriter.dummy.DummyConstants
- deleteConfig(String) - Method in class org.apache.nutch.service.resources.ConfigResource
-
Removes the configuration from the list of known configurations.
- DELETED - org.apache.nutch.util.domain.DomainSuffix.Status
- DeleterReducer() - Constructor for class org.apache.nutch.indexer.CleaningJob.DeleterReducer
- deleteSeedList(String) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
- deleteSeedList(String) - Method in interface org.apache.nutch.service.SeedManager
- deleteSubCollection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Delete named subcollection
- DenyPathQueryRule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathQueryRule
- DenyPathRule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathRule
- DEPRECATED - org.apache.nutch.util.domain.DomainSuffix.Status
- DEPTH_KEY - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
- DEPTH_KEY_W - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
- DepthScoringFilter - Class in org.apache.nutch.scoring.depth
-
This scoring filter limits the number of hops from the initial seed urls.
- DepthScoringFilter() - Constructor for class org.apache.nutch.scoring.depth.DepthScoringFilter
- describe() - Method in interface org.apache.nutch.indexer.IndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexer.IndexWriters
-
Lists the active IndexWriters and their configuration.
- describe() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- describe() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- describe() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
Returns Map with the specific parameters the IndexWriter instance can take.
- DESCRIPTION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An account of the content of the resource.
- DICTFILE_MODELFILTER - Static variable in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseData
- DIR_NAME - Static variable in class org.apache.nutch.parse.ParseText
- DIR_NAME - Static variable in class org.apache.nutch.protocol.Content
- disconnect() - Method in class org.apache.nutch.protocol.ftp.Client
-
Closes the connection to the FTP server and restores connection parameters to the default values.
- DISCONNECT - org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
network disconnect or timeout during fetch
- displayFileTypes(Map<String, Integer>, Map<String, Integer>) - Static method in class org.apache.nutch.util.DumpFileUtil
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
-
Takes the metadata keys listed in your "scoring.parse.md" property and looks for them inside the parseData object.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Distribute score value from the current page to all its outlinked pages.
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.ScoringFilters
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
- distributeScoreToOutlinks(Text, ParseData, Collection<Map.Entry<Text, CrawlDatum>>, CrawlDatum, int) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
- DmozParser - Class in org.apache.nutch.tools
-
Utility that converts DMOZ RDF into a flat file of URLs to be injected.
- DmozParser() - Constructor for class org.apache.nutch.tools.DmozParser
- dnsFailures - Variable in class org.apache.nutch.hostdb.HostDatum
- doc - Variable in class org.apache.nutch.indexer.NutchIndexAction
- docToMetadata(NutchDocument) - Static method in class org.apache.nutch.tools.WARCUtils
- DocVector - Class in org.apache.nutch.scoring.similarity.cosine
- DocVector() - Constructor for class org.apache.nutch.scoring.similarity.cosine.DocVector
- docVectors - Static variable in class org.apache.nutch.scoring.similarity.cosine.Model
- doIndex - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- DomainDenylistURLFilter - Class in org.apache.nutch.urlfilter.domaindenylist
-
Filters URLs based on a file containing domain suffixes, domain names, and hostnames.
- DomainDenylistURLFilter() - Constructor for class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
- DomainStatistics - Class in org.apache.nutch.util.domain
-
Extracts some very basic statistics about domains from the crawldb
- DomainStatistics() - Constructor for class org.apache.nutch.util.domain.DomainStatistics
- DomainStatistics.DomainStatisticsCombiner - Class in org.apache.nutch.util.domain
- DomainStatistics.MyCounter - Enum in org.apache.nutch.util.domain
- DomainStatisticsCombiner() - Constructor for class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
- DomainSuffix - Class in org.apache.nutch.util.domain
-
This class represents the last part of the host name, which is operated by authorities, not individuals.
- DomainSuffix(String) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
- DomainSuffix(String, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.DomainSuffix
- DomainSuffix.Status - Enum in org.apache.nutch.util.domain
-
Enumeration of the status of the tld.
- DomainSuffixes - Class in org.apache.nutch.util.domain
-
Storage class for DomainSuffix objects. Note: this class is a singleton.
- DomainURLFilter - Class in org.apache.nutch.urlfilter.domain
-
Filters URLs based on a file containing domain suffixes, domain names, and hostnames.
- DomainURLFilter() - Constructor for class org.apache.nutch.urlfilter.domain.DomainURLFilter
- DOMBuilder - Class in org.apache.nutch.parse.html
-
This class takes SAX events (in addition to some extra events that SAX doesn't handle yet) and adds the result to a document or document fragment.
- DOMBuilder(Document) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document, DocumentFragment) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMBuilder(Document, Node) - Constructor for class org.apache.nutch.parse.html.DOMBuilder
-
DOMBuilder instance constructor...
- DOMContentUtils - Class in org.apache.nutch.parse.html
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils - Class in org.apache.nutch.parse.tika
-
A collection of methods for extracting content from DOM trees.
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils
- DOMContentUtils(Configuration) - Constructor for class org.apache.nutch.parse.tika.DOMContentUtils
- DOMContentUtils.LinkParams - Class in org.apache.nutch.parse.html
- DomUtil - Class in org.apache.nutch.util
- DomUtil() - Constructor for class org.apache.nutch.util.DomUtil
- dotProduct(DocVector) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
- DublinCore - Interface in org.apache.nutch.metadata
-
A collection of Dublin Core metadata names.
- DummyConstants - Interface in org.apache.nutch.indexwriter.dummy
- DummyIndexWriter - Class in org.apache.nutch.indexwriter.dummy
-
DummyIndexWriter.
- DummyIndexWriter() - Constructor for class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- DummySSLProtocolSocketFactory - Class in org.apache.nutch.protocol.httpclient
- DummySSLProtocolSocketFactory() - Constructor for class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
-
Constructor for DummySSLProtocolSocketFactory.
- DummyX509TrustManager - Class in org.apache.nutch.protocol.htmlunit
- DummyX509TrustManager - Class in org.apache.nutch.protocol.http
- DummyX509TrustManager - Class in org.apache.nutch.protocol.httpclient
- DummyX509TrustManager - Class in org.apache.nutch.protocol.interactiveselenium
- DummyX509TrustManager - Class in org.apache.nutch.protocol.selenium
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.http.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- DummyX509TrustManager(KeyStore) - Constructor for class org.apache.nutch.protocol.selenium.DummyX509TrustManager
-
Constructor for DummyX509TrustManager.
- dump() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- dump() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- dump(File, File, File, boolean, String[], boolean, String, boolean) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
-
Dumps the reverse-engineered CBOR content from the provided segment directories. A parent directory can be passed if it contains more than one segment; otherwise a single segment can be passed as an argument.
- dump(File, File, String[], boolean, boolean, boolean) - Method in class org.apache.nutch.tools.FileDumper
-
Dumps the reverse-engineered raw content from the provided segment directories. A parent directory can be passed if it contains more than one segment; otherwise a single segment can be passed as an argument.
- dump(Path, Path) - Method in class org.apache.nutch.segment.SegmentReader
- DUMP_DIR - Static variable in class org.apache.nutch.scoring.webgraph.LinkDumper
- Dumper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper
- DumperMapper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
- DumperReducer() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
- DumpFileUtil - Class in org.apache.nutch.util
- DumpFileUtil() - Constructor for class org.apache.nutch.util.DumpFileUtil
- dumpLinks(Path) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the inverter and merger jobs of the LinkDumper tool to create the url to inlink node database.
- dumpNodes(Path, NodeDumper.DumpType, long, Path, boolean, NodeDumper.NameType, NodeDumper.AggrType, boolean) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the process to dump the top urls out to a text file.
- dumpText - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- dumpText - Variable in class org.apache.nutch.parse.ParserChecker
- dumpUrl(Path, String) - Method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Prints the content of the Node represented by the url to system out.
E
- elapsedTime(long, long) - Static method in class org.apache.nutch.util.TimingUtil
-
Calculate the elapsed time between two times specified in milliseconds.
- ElasticConstants - Interface in org.apache.nutch.indexwriter.elastic
- ElasticIndexWriter - Class in org.apache.nutch.indexwriter.elastic
-
Sends NutchDocuments to a configured Elasticsearch index.
- ElasticIndexWriter() - Constructor for class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- elName - Variable in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
- EMPTY_RESULT - org.apache.nutch.util.domain.DomainStatistics.MyCounter
- EMPTY_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
- emptyMetaDataWritableSerialized - Static variable in class org.apache.nutch.hostdb.HostDatum
- emptyQueue() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- emptyQueues() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- enableCookieHeader - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Controls whether or not to set Cookie HTTP header based on CrawlDatum metadata
- enableIfModifiedsinceHeader - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Configuration directive for If-Modified-Since HTTP header
- encoding - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
-
encoding of CSV file
- EncodingDetector - Class in org.apache.nutch.util
-
A simple class for detecting character encodings.
- EncodingDetector(Configuration) - Constructor for class org.apache.nutch.util.EncodingDetector
- end - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- END - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
- endCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the end of a CDATA section.
- endDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the end of a document.
- endDTD() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the end of DTD declarations.
- endElement(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the end of an element.
- endEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the end of an entity.
- ENDPOINT - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
- endPrefixMapping(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
End the scope of a prefix-URI mapping.
- ENGLISHMINIMALSTEM_FILTER - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
- entityReference(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of an entityReference.
- EQUAL_CHARACTER - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
- equals(Object) - Method in class org.apache.nutch.crawl.CrawlDatum
- equals(Object) - Method in class org.apache.nutch.crawl.Inlink
- equals(Object) - Method in class org.apache.nutch.metadata.Metadata
- equals(Object) - Method in class org.apache.nutch.parse.Outlink
- equals(Object) - Method in class org.apache.nutch.parse.ParseData
- equals(Object) - Method in class org.apache.nutch.parse.ParseStatus
- equals(Object) - Method in class org.apache.nutch.parse.ParseText
- equals(Object) - Method in class org.apache.nutch.plugin.PluginClassLoader
- equals(Object) - Method in class org.apache.nutch.protocol.Content
- equals(Object) - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
- equals(Object) - Method in class org.apache.nutch.protocol.ProtocolStatus
- equals(Object) - Method in class org.apache.nutch.service.model.request.SeedList
- equals(Object) - Method in class org.apache.nutch.service.model.request.SeedUrl
- escape(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Escape some exotic characters in the fragment part
- ESCAPED_URL_PART - Static variable in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
- evaluate() - Method in class org.apache.nutch.util.CommandRunner
- EXCEPTION - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Unspecified exception occurred.
- Exchange - Interface in org.apache.nutch.exchange
- ExchangeConfig - Class in org.apache.nutch.exchange
- Exchanges - Class in org.apache.nutch.exchange
- Exchanges(Configuration) - Constructor for class org.apache.nutch.exchange.Exchanges
- exec() - Method in class org.apache.nutch.util.CommandRunner
-
Execute the command
- execute(JexlScript, String) - Method in class org.apache.nutch.crawl.CrawlDatum
- executor - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- ExemptionUrlFilter - Class in org.apache.nutch.urlfilter.ignoreexempt
-
This implementation of URLExemptionFilter uses regex configuration to check if a URL is eligible for exemption from 'db.ignore.external'.
- ExemptionUrlFilter() - Constructor for class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- EXPONENTIAL_BACKOFF_MILLIS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- EXPONENTIAL_BACKOFF_MILLIS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- EXPONENTIAL_BACKOFF_RETRIES - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- EXPONENTIAL_BACKOFF_RETRIES - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- Extension - Class in org.apache.nutch.plugin
-
An Extension is a kind of listener descriptor that will be installed on a concrete ExtensionPoint that acts as a kind of Publisher.
- Extension(PluginDescriptor, String, String, String, Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.Extension
- ExtensionPoint - Class in org.apache.nutch.plugin
-
The ExtensionPoint provides meta information of an extension point.
- ExtensionPoint(String, String, String) - Constructor for class org.apache.nutch.plugin.ExtensionPoint
-
Constructor
- ExtParser - Class in org.apache.nutch.parse.ext
-
A wrapper that invokes an external command to do the real parsing job.
- ExtParser() - Constructor for class org.apache.nutch.parse.ext.ExtParser
- extractText(InputStream, String, List<Outlink>) - Method in class org.apache.nutch.parse.zip.ZipTextExtractor
F
- FAILED - org.apache.nutch.service.model.response.JobInfo.State
- FAILED - Static variable in class org.apache.nutch.parse.ParseStatus
-
General failure.
- FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was not retrieved.
- FAILED_EXCEPTION - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_INVALID_FORMAT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_CONTENT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_MISSING_PARTS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- FAILED_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing failed.
- failures - Variable in class org.apache.nutch.hostdb.HostDatum
- FastURLFilter - Class in org.apache.nutch.urlfilter.fast
-
Filters URLs based on a file of regular expressions using host/domains matching first.
- FastURLFilter() - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter
- FastURLFilter.DenyAllRule - Class in org.apache.nutch.urlfilter.fast
-
Rule for DenyPath .* or DenyPath .?
- FastURLFilter.DenyPathQueryRule - Class in org.apache.nutch.urlfilter.fast
- FastURLFilter.DenyPathRule - Class in org.apache.nutch.urlfilter.fast
- FastURLFilter.Rule - Class in org.apache.nutch.urlfilter.fast
- Feed - Interface in org.apache.nutch.metadata
-
A collection of Feed property names extracted by the ROME library.
- FEED - Static variable in interface org.apache.nutch.metadata.Feed
- FEED_AUTHOR - Static variable in interface org.apache.nutch.metadata.Feed
- FEED_PUBLISHED - Static variable in interface org.apache.nutch.metadata.Feed
- FEED_TAGS - Static variable in interface org.apache.nutch.metadata.Feed
- FEED_UPDATED - Static variable in interface org.apache.nutch.metadata.Feed
- FeedIndexingFilter - Class in org.apache.nutch.indexer.feed
- FeedIndexingFilter() - Constructor for class org.apache.nutch.indexer.feed.FeedIndexingFilter
- FeedParser - Class in org.apache.nutch.parse.feed
- FeedParser() - Constructor for class org.apache.nutch.parse.feed.FeedParser
- fetch(Path, int) - Method in class org.apache.nutch.fetcher.Fetcher
- FETCH - org.apache.nutch.service.JobManager.JobType
- FETCH_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
- FETCH_EVENT_CONTENTLANG - Static variable in interface org.apache.nutch.metadata.Nutch
-
Content-language key in the Pub/Sub event metadata for the content-language of the parsed page
- FETCH_EVENT_CONTENTTYPE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Content-type key in the Pub/Sub event metadata for the content-type of the parsed page
- FETCH_EVENT_FETCHTIME - Static variable in interface org.apache.nutch.metadata.Nutch
-
Fetch time key in the Pub/Sub event metadata for the fetch time of the parsed page
- FETCH_EVENT_SCORE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Score key in the Pub/Sub event metadata for the score of the parsed page
- FETCH_EVENT_TITLE - Static variable in interface org.apache.nutch.metadata.Nutch
-
Title key in the Pub/Sub event metadata for the title of the parsed page
- FETCH_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- FETCH_TIME - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold the time when the page has been fetched
- FETCH_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- fetchDb(int, int) - Method in class org.apache.nutch.service.resources.DbResource
- fetched - Variable in class org.apache.nutch.hostdb.HostDatum
- fetched - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- FETCHED - org.apache.nutch.util.domain.DomainStatistics.MyCounter
- Fetcher - Class in org.apache.nutch.fetcher
-
A queue-based fetcher.
- Fetcher() - Constructor for class org.apache.nutch.fetcher.Fetcher
- Fetcher(Configuration) - Constructor for class org.apache.nutch.fetcher.Fetcher
- Fetcher.FetcherRun - Class in org.apache.nutch.fetcher
- Fetcher.InputFormat - Class in org.apache.nutch.fetcher
- FetcherOutputFormat - Class in org.apache.nutch.fetcher
-
Splits FetcherOutput entries into multiple map files.
- FetcherOutputFormat() - Constructor for class org.apache.nutch.fetcher.FetcherOutputFormat
- fetchErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- FetcherRun() - Constructor for class org.apache.nutch.fetcher.Fetcher.FetcherRun
- FetcherThread - Class in org.apache.nutch.fetcher
-
This class picks items from queues and fetches the pages.
- FetcherThread(Configuration, AtomicInteger, FetchItemQueues, QueueFeeder, AtomicInteger, AtomicLong, Mapper.Context, AtomicInteger, String, boolean, boolean, AtomicInteger, AtomicLong) - Constructor for class org.apache.nutch.fetcher.FetcherThread
- FetcherThreadEvent - Class in org.apache.nutch.fetcher
-
This class is used to capture the various events occurring at fetch time.
- FetcherThreadEvent(FetcherThreadEvent.PublishEventType, String) - Constructor for class org.apache.nutch.fetcher.FetcherThreadEvent
-
Constructor to create an event to be published
- FetcherThreadEvent.PublishEventType - Enum in org.apache.nutch.fetcher
-
Type of event to specify start, end or reporting of a fetch item.
- FetcherThreadPublisher - Class in org.apache.nutch.fetcher
-
This class handles the publishing of the events to the queue implementation.
- FetcherThreadPublisher(Configuration) - Constructor for class org.apache.nutch.fetcher.FetcherThreadPublisher
-
Configure all registered publishers
- FetchItem - Class in org.apache.nutch.fetcher
-
This class describes the item to be fetched.
- FetchItem(Text, URL, CrawlDatum, String) - Constructor for class org.apache.nutch.fetcher.FetchItem
- FetchItem(Text, URL, CrawlDatum, String, int) - Constructor for class org.apache.nutch.fetcher.FetchItem
- FetchItemQueue - Class in org.apache.nutch.fetcher
-
This class handles FetchItems which come from the same host ID (be it a proto/hostname or proto/IP pair).
- FetchItemQueue(Configuration, int, long, long) - Constructor for class org.apache.nutch.fetcher.FetchItemQueue
- FetchItemQueues - Class in org.apache.nutch.fetcher
-
A collection of queues that keeps track of the total number of items, and provides items eligible for fetching from any queue.
- FetchItemQueues(Configuration) - Constructor for class org.apache.nutch.fetcher.FetchItemQueues
- FetchNode - Class in org.apache.nutch.fetcher
- FetchNode() - Constructor for class org.apache.nutch.fetcher.FetchNode
- FetchNodeDb - Class in org.apache.nutch.fetcher
- FetchNodeDb() - Constructor for class org.apache.nutch.fetcher.FetchNodeDb
- FetchNodeDbInfo - Class in org.apache.nutch.service.model.response
- FetchNodeDbInfo() - Constructor for class org.apache.nutch.service.model.response.FetchNodeDbInfo
- FetchOverdueCrawlDatumProcessor - Class in org.apache.nutch.hostdb
-
Simple custom crawl datum processor that counts the number of records that are overdue for fetching, e.g.
- FetchOverdueCrawlDatumProcessor(Configuration) - Constructor for class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- FetchSchedule - Interface in org.apache.nutch.crawl
-
This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
- FetchScheduleFactory - Class in org.apache.nutch.crawl
-
Creates and caches a FetchSchedule implementation.
- FG() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG
- FGMapper() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
- FGReducer() - Constructor for class org.apache.nutch.tools.FreeGenerator.FG.FGReducer
- FIELD - Static variable in class org.creativecommons.nutch.CCIndexingFilter
-
The name of the document field we use.
- fieldName - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Doc field name
- FieldReplacer - Class in org.apache.nutch.indexer.replace
-
POJO to store a field name, its match pattern and its replacement string.
- FieldReplacer(String, String, String, Integer) - Constructor for class org.apache.nutch.indexer.replace.FieldReplacer
-
Field replacer with the input and output field the same.
- FieldReplacer(String, String, String, String, Integer) - Constructor for class org.apache.nutch.indexer.replace.FieldReplacer
-
Create a FieldReplacer for a field.
- File - Class in org.apache.nutch.protocol.file
-
This class is a protocol plugin used for file: scheme.
- File() - Constructor for class org.apache.nutch.protocol.file.File
- FileDumper - Class in org.apache.nutch.tools
-
The file dumper tool enables one to reverse generate the raw content from Nutch segment data directories.
- FileDumper() - Constructor for class org.apache.nutch.tools.FileDumper
- FileError - Exception in org.apache.nutch.protocol.file
-
Thrown for File error codes.
- FileError(int) - Constructor for exception org.apache.nutch.protocol.file.FileError
- FileException - Exception in org.apache.nutch.protocol.file
- FileException() - Constructor for exception org.apache.nutch.protocol.file.FileException
- FileException(String) - Constructor for exception org.apache.nutch.protocol.file.FileException
- FileException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
- FileException(Throwable) - Constructor for exception org.apache.nutch.protocol.file.FileException
- fileLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- FileResponse - Class in org.apache.nutch.protocol.file
-
FileResponse.java mimics file replies as HTTP responses.
- FileResponse(URL, CrawlDatum, File, Configuration) - Constructor for class org.apache.nutch.protocol.file.FileResponse
-
Default public constructor
- filter - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- filter() - Method in class org.apache.nutch.parse.ParseResult
-
Remove all results where status is not successful (as determined by ParseStatus.isSuccess()).
- filter(String) - Method in class org.apache.nutch.collection.Subcollection
-
Simple "indexOf" currentFilter for matching patterns.
- filter(String) - Method in interface org.apache.nutch.net.URLFilter
-
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
- filter(String) - Method in class org.apache.nutch.net.URLFilters
-
Run all defined filters.
- filter(String) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
- filter(String) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
- filter(String) - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
- filter(String) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
- filter(String) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- filter(String) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- filter(String) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
- filter(String, String) - Method in interface org.apache.nutch.net.URLExemptionFilter
-
Checks if toUrl is exempted when the ignore external is enabled
- filter(String, String) - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in interface org.apache.nutch.segment.SegmentMergeFilter
-
The filtering method which gets all information being merged for a given key (URL).
- filter(Text, CrawlDatum, CrawlDatum, CrawlDatum, Content, ParseData, ParseText, Collection<CrawlDatum>) - Method in class org.apache.nutch.segment.SegmentMergeFilters
-
Iterates over all SegmentMergeFilter extensions and if any of them returns false, it will return false as well.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
The AnchorIndexingFilter filter object which supports boolean configuration settings for the deduplication of anchors.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
-
The ArbitraryIndexingFilter filter object uses reflection to instantiate the configured class and invoke the configured method.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
The BasicIndexingFilter filter object which supports a few configuration settings for adding basic searchable fields.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in interface org.apache.nutch.indexer.IndexingFilter
-
Adds fields or otherwise modifies the document that will be indexed for a parse.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.IndexingFilters
-
Run all defined filters.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
The StaticFieldIndexer filter object which adds fields as per configuration setting.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Takes the metatags that you have listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object.
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
- filter(NutchDocument, Parse, Text, CrawlDatum, Inlinks) - Method in class org.creativecommons.nutch.CCIndexingFilter
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
-
Scan the HTML document looking at possible indications of content language
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
-
Scan the HTML document looking at possible rel-tags
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in interface org.apache.nutch.parse.HtmlParseFilter
-
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.HtmlParseFilters
-
Run all defined filters.
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
Scan the JavaScript fragments of an HTML page looking for possible Outlinks
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
- filter(Content, ParseResult, HTMLMetaTags, DocumentFragment) - Method in class org.creativecommons.nutch.CCParseFilter
-
Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.
- filterNormalize(String) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
-
Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
- filterNormalize(String, String, String, boolean, boolean, String, URLFilters, URLExemptionFilters, URLNormalizers) - Static method in class org.apache.nutch.parse.ParseOutputFormat
- filterNormalize(String, String, String, boolean, boolean, String, URLFilters, URLExemptionFilters, URLNormalizers, String) - Static method in class org.apache.nutch.parse.ParseOutputFormat
- filterParse(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- filters - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- filterUrl(String) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- finalize() - Method in class org.apache.nutch.plugin.Plugin
- finalize() - Method in class org.apache.nutch.plugin.PluginRepository
-
Deprecated.
- finalize() - Method in class org.apache.nutch.protocol.ftp.Ftp
- finalize(HostDatum) - Method in interface org.apache.nutch.hostdb.CrawlDatumProcessor
-
Process the final host datum instance and store the aggregated custom counts in the HostDatum.
- finalize(HostDatum) - Method in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- find(String, int) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
-
Get index of first occurrence of any separator characters.
- findAuthentication(Metadata) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
- findWorker(String) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
-
Find the Job Worker Thread.
- FINISHED - org.apache.nutch.service.model.response.JobInfo.State
- finishFetchItem(FetchItem) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- finishFetchItem(FetchItem, boolean) - Method in class org.apache.nutch.fetcher.FetchItemQueue
- finishFetchItem(FetchItem, boolean) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
-
Used by AdaptiveFetchSchedule to maintain custom fetch interval
- fixHttpHeaders(String, int) - Static method in class org.apache.nutch.tools.WARCUtils
-
Modify verbatim HTTP response headers: fix, remove or replace headers Content-Length, Content-Encoding and Transfer-Encoding which may confuse WARC readers.
- flattenHashMap(HashMap<String, Integer>) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
- followRedirects - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- followRedirects - Variable in class org.apache.nutch.parse.ParserChecker
- FORBID_ALL_RULES - Static variable in class org.apache.nutch.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
- force - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- forceAsContentType - Variable in class org.apache.nutch.parse.ParserChecker
- forceRefetch(Text, CrawlDatum, boolean) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
- forceRefetch(Text, CrawlDatum, boolean) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
- FORMAT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Typically, Format may include the media-type or dimensions of the resource.
- FORMAT - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
- FORMAT - Static variable in class org.apache.nutch.tools.WARCUtils
- forName(String) - Method in class org.apache.nutch.util.MimeUtil
-
A facade interface to Tika's underlying MimeTypes.forName(String) method.
- FreeGenerator - Class in org.apache.nutch.tools
-
This tool generates fetchlists (segments to be fetched) from plain text files containing one URL per line.
- FreeGenerator() - Constructor for class org.apache.nutch.tools.FreeGenerator
- FreeGenerator.FG - Class in org.apache.nutch.tools
- FreeGenerator.FG.FGMapper - Class in org.apache.nutch.tools
- FreeGenerator.FG.FGReducer - Class in org.apache.nutch.tools
- fromHexString(String) - Static method in class org.apache.nutch.util.StringUtil
-
Convert a String containing consecutive (no inside whitespace) hexadecimal digits into a corresponding byte array.
- FSUtils - Class in org.apache.nutch.util
-
Utility methods for common filesystem operations.
- FSUtils() - Constructor for class org.apache.nutch.util.FSUtils
- Ftp - Class in org.apache.nutch.protocol.ftp
-
This class is a protocol plugin used for ftp: scheme.
- Ftp() - Constructor for class org.apache.nutch.protocol.ftp.Ftp
- FtpError - Exception in org.apache.nutch.protocol.ftp
-
Thrown for Ftp error codes.
- FtpError(int) - Constructor for exception org.apache.nutch.protocol.ftp.FtpError
- FtpException - Exception in org.apache.nutch.protocol.ftp
-
Superclass for important exceptions thrown during FTP talk, that must be handled with care.
- FtpException() - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
- FtpException(String) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
- FtpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
- FtpException(Throwable) - Constructor for exception org.apache.nutch.protocol.ftp.FtpException
- FtpExceptionBadSystResponse - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating bad reply of SYST command.
- FtpExceptionCanNotHaveDataConnection - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating failure of opening data connection.
- FtpExceptionControlClosedByForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating control channel is closed by server end, due to forced closure of data channel at client (our) end.
- FtpExceptionUnknownForcedDataClose - Exception in org.apache.nutch.protocol.ftp
-
Exception indicating unrecognizable reply from server after forced closure of data channel by client (our) side.
- FtpResponse - Class in org.apache.nutch.protocol.ftp
-
FtpResponse.java mimics ftp replies as http response.
- FtpResponse(URL, CrawlDatum, Ftp, Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpResponse
- FtpRobotRulesParser - Class in org.apache.nutch.protocol.ftp
-
This class is used for parsing robots for urls belonging to FTP protocol.
- FtpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
G
- generate(Path, Path, int, long, long) - Method in class org.apache.nutch.crawl.Generator
- generate(Path, Path, int, long, long, boolean, boolean) - Method in class org.apache.nutch.crawl.Generator
-
Deprecated. Since 1.19, use Generator.generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String, String) or Generator.generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String) in the instance that no hostdb is available
- generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String) - Method in class org.apache.nutch.crawl.Generator
-
This signature should be used in the instance that no hostdb is available.
- generate(Path, Path, int, long, long, boolean, boolean, boolean, int, String, String) - Method in class org.apache.nutch.crawl.Generator
-
Generate fetchlists in one or more segments.
- GENERATE - org.apache.nutch.service.JobManager.JobType
- GENERATE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
- GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- GENERATE_UPDATE_CRAWLDB - Static variable in class org.apache.nutch.crawl.Generator
- generated - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- generateJson() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- generateJson() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- generateSegmentName() - Static method in class org.apache.nutch.crawl.Generator
- generateSegmentName() - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
-
Generates a random name for the segments.
- generateWARC(String, List<Path>, boolean, boolean, boolean) - Method in class org.apache.nutch.tools.warc.WARCExporter
- generator - Static variable in class org.apache.nutch.tools.WARCUtils
- Generator - Class in org.apache.nutch.crawl
-
Generates a subset of a crawl db to fetch.
- Generator() - Constructor for class org.apache.nutch.crawl.Generator
- Generator(Configuration) - Constructor for class org.apache.nutch.crawl.Generator
- GENERATOR_COUNT_MODE - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_COUNT_VALUE_DOMAIN - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_COUNT_VALUE_HOST - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_CUR_TIME - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_DELAY - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_EXPR - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_FETCH_DELAY_EXPR - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_FILTER - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_HOSTDB - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_MAX_COUNT - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_MAX_COUNT_EXPR - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_MAX_NUM_SEGMENTS - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_MIN_INTERVAL - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_MIN_SCORE - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_NORMALISE - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_RESTRICT_STATUS - Static variable in class org.apache.nutch.crawl.Generator
- GENERATOR_TOP_N - Static variable in class org.apache.nutch.crawl.Generator
- Generator.CrawlDbUpdater - Class in org.apache.nutch.crawl
-
Update the CrawlDB so that the next generate won't include the same URLs.
- Generator.CrawlDbUpdater.CrawlDbUpdateMapper - Class in org.apache.nutch.crawl
- Generator.CrawlDbUpdater.CrawlDbUpdateReducer - Class in org.apache.nutch.crawl
- Generator.DecreasingFloatComparator - Class in org.apache.nutch.crawl
- Generator.HashComparator - Class in org.apache.nutch.crawl
-
Sort fetch lists by hash of URL.
- Generator.PartitionReducer - Class in org.apache.nutch.crawl
- Generator.Selector - Class in org.apache.nutch.crawl
-
Selects entries due for fetch.
- Generator.SelectorEntry - Class in org.apache.nutch.crawl
- Generator.SelectorInverseMapper - Class in org.apache.nutch.crawl
- Generator.SelectorMapper - Class in org.apache.nutch.crawl
-
Select and invert subset due for fetch.
- Generator.SelectorReducer - Class in org.apache.nutch.crawl
-
Collect until limit is reached.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
- generatorSortValue(Text, CrawlDatum, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
- generatorSortValue(Text, CrawlDatum, float) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a sort value for Generate.
- GENERIC - org.apache.nutch.util.domain.TopLevelDomain.Type
- GenericWritableConfigurable - Class in org.apache.nutch.util
-
A generic Writable wrapper that can inject Configuration to Configurables
- GenericWritableConfigurable() - Constructor for class org.apache.nutch.util.GenericWritableConfigurable
- GeoIPDocumentCreator - Class in org.apache.nutch.indexer.geoip
-
Simple utility class which enables efficient, structured NutchDocument building based on input from GeoIPIndexingFilter, where configuration is also read.
- GeoIPDocumentCreator() - Constructor for class org.apache.nutch.indexer.geoip.GeoIPDocumentCreator
- GeoIPIndexingFilter - Class in org.apache.nutch.indexer.geoip
-
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
- GeoIPIndexingFilter() - Constructor for class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
-
Default constructor for this plugin
- get(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the value associated to a metadata name.
- get(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
- get(String) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a full path of a location inside any segment part.
- get(String) - Method in interface org.apache.nutch.service.ConfManager
- get(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
-
Returns the configuration associated with the given confId
- get(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Return the DomainSuffix object for the extension; if the extension is a top-level domain, the returned object will be an instance of TopLevelDomain
- get(String, String) - Method in class org.apache.nutch.indexer.IndexWriterParams
- get(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
- get(String, String) - Method in interface org.apache.nutch.service.JobManager
- get(String, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
- get(Configuration) - Static method in class org.apache.nutch.indexer.IndexWriters
- get(Configuration) - Static method in class org.apache.nutch.plugin.PluginRepository
-
Get a cached instance of the PluginRepository
- get(Configuration) - Static method in class org.apache.nutch.util.ObjectCache
- get(Path, Text, Writer, Map<String, List<Writable>>) - Method in class org.apache.nutch.segment.SegmentReader
- get(Text) - Method in class org.apache.nutch.parse.ParseResult
-
Retrieve a single parse output.
- get(FileSplit) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a FileSplit.
- getAccept() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getAcceptCharset() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
- getAcceptedIssuers() - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
- getAcceptLanguage() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
Value of "Accept-Language" request header sent by Nutch.
- getAdditionalPostHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getAgentString(String, String, String, String, String) - Static method in class org.apache.nutch.tools.WARCUtils
- getAll() - Method in class org.apache.nutch.collection.CollectionManager
-
Returns all collections
- getAllJobs() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
-
Get all jobs (currently running and completed)
- getAnchor() - Method in class org.apache.nutch.crawl.Inlink
- getAnchor() - Method in class org.apache.nutch.parse.Outlink
- getAnchor() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- getAnchors() - Method in class org.apache.nutch.crawl.Inlinks
-
Get all anchor texts.
- getAnchors(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
- getArgs() - Method in class org.apache.nutch.parse.ParseStatus
- getArgs() - Method in class org.apache.nutch.protocol.ProtocolStatus
- getArgs() - Method in class org.apache.nutch.service.model.request.DbQuery
- getArgs() - Method in class org.apache.nutch.service.model.request.JobConfig
- getArgs() - Method in class org.apache.nutch.service.model.request.ServiceConfig
- getArgs() - Method in class org.apache.nutch.service.model.response.JobInfo
- getAsMap(String) - Method in interface org.apache.nutch.service.ConfManager
- getAsMap(String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
- getAttribute(String) - Method in class org.apache.nutch.plugin.Extension
-
Returns an attribute value that is set up in the manifest file and is defined by the extension point XML schema.
- getAuthentication(String, Configuration) - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
This method is responsible for providing Basic authentication information.
- getBase(Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
If Node contains a BASE tag then its HREF is returned.
- getBase(Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
If Node contains a BASE tag then its HREF is returned.
- getBaseHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
- getBaseUrl() - Method in class org.apache.nutch.protocol.Content
-
The base url for relative links contained in the content.
- getBasicPattern() - Static method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Provides a pattern which can be used by an outside resource to determine if this class can provide credentials based on simple header information.
- getBlackListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns blacklist String
- getBody() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- getBoolean(String, boolean) - Method in class org.apache.nutch.indexer.IndexWriterParams
- getBoost() - Method in class org.apache.nutch.util.domain.DomainSuffix
- getBufferSize() - Method in class org.apache.nutch.protocol.ftp.Ftp
- getCachedClass(PluginDescriptor, String) - Method in class org.apache.nutch.plugin.PluginRepository
- getCacheKey(URL) - Static method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
Compose unique key to store and access robot rules in cache for given URL
- getCharset(Metadata) - Static method in class org.apache.nutch.segment.SegmentReader
-
Try to get HTML encoding from parse metadata.
- getChildren() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- getClassLoader() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns a cached classloader for a plugin.
- getClazz() - Method in class org.apache.nutch.exchange.ExchangeConfig
- getClazz() - Method in class org.apache.nutch.plugin.Extension
-
Returns the full class name of the extension point implementation
- getClient(URL) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
-
Distribute hosts over clients by host name
- getCode() - Method in interface org.apache.nutch.net.protocols.Response
-
Get the response code.
- getCode() - Method in class org.apache.nutch.protocol.file.FileResponse
-
Get the response code.
- getCode() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Get the response code.
- getCode() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
- getCode() - Method in class org.apache.nutch.protocol.http.HttpResponse
- getCode() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
- getCode() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- getCode() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- getCode() - Method in class org.apache.nutch.protocol.ProtocolStatus
- getCode() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
- getCode(int) - Method in exception org.apache.nutch.protocol.file.FileError
- getCode(int) - Method in exception org.apache.nutch.protocol.ftp.FtpError
- getCollectionManager(Configuration) - Static method in class org.apache.nutch.collection.CollectionManager
- getCommand() - Method in class org.apache.nutch.util.CommandRunner
- getCommonCrawlFormat(String, String, Content, Metadata, Configuration, CommonCrawlConfig) - Static method in class org.apache.nutch.tools.CommonCrawlFormatFactory
-
Deprecated.
- getCommonCrawlFormat(String, Configuration, CommonCrawlConfig) - Static method in class org.apache.nutch.tools.CommonCrawlFormatFactory
- getConf() - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
- getConf() - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
- getConf() - Method in class org.apache.nutch.crawl.Generator.Selector
- getConf() - Method in class org.apache.nutch.crawl.Signature
- getConf() - Method in class org.apache.nutch.crawl.URLPartitioner
- getConf() - Method in class org.apache.nutch.exchange.jexl.JexlExchange
- getConf() - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.indexer.CleaningJob
- getConf() - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
- getConf() - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
- getConf() - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
- getConf() - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
- getConf() - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- getConf() - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
- getConf() - Method in class org.apache.nutch.microformats.reltag.RelTagParser
- getConf() - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
- getConf() - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
- getConf() - Method in class org.apache.nutch.parse.ext.ExtParser
- getConf() - Method in class org.apache.nutch.parse.feed.FeedParser
- getConf() - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
- getConf() - Method in class org.apache.nutch.parse.html.HtmlParser
- getConf() - Method in class org.apache.nutch.parse.js.JSParseFilter
- getConf() - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
- getConf() - Method in class org.apache.nutch.parse.tika.TikaParser
- getConf() - Method in class org.apache.nutch.parse.zip.ZipParser
- getConf() - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
- getConf() - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- getConf() - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
- getConf() - Method in class org.apache.nutch.protocol.file.File
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
- getConf() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
- getConf() - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Get the Configuration object
- getConf() - Method in class org.apache.nutch.publisher.NutchPublishers
- getConf() - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
- getConf() - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- getConf() - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
- getConf() - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
- getConf() - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
- getConf() - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
- getConf() - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
- getConf() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
- getConf() - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- getConf() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- getConf() - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
- getConf() - Method in class org.apache.nutch.util.GenericWritableConfigurable
- getConf() - Method in class org.creativecommons.nutch.CCIndexingFilter
- getConf() - Method in class org.creativecommons.nutch.CCParseFilter
- getConfId() - Method in class org.apache.nutch.service.model.request.DbQuery
- getConfId() - Method in class org.apache.nutch.service.model.request.JobConfig
- getConfId() - Method in class org.apache.nutch.service.model.request.ServiceConfig
- getConfId() - Method in class org.apache.nutch.service.model.response.JobInfo
- getConfig(String) - Method in class org.apache.nutch.service.resources.ConfigResource
-
Get configuration properties
- getConfigId() - Method in class org.apache.nutch.service.model.request.NutchConfig
- getConfigs() - Method in class org.apache.nutch.service.resources.ConfigResource
-
Returns a list of all configurations created.
- getConfiguration() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- getConfManager() - Method in class org.apache.nutch.service.NutchServer
- getConnectionFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- getContent() - Method in interface org.apache.nutch.net.protocols.Response
-
Get the full content of the response.
- getContent() - Method in class org.apache.nutch.protocol.Content
-
The binary content retrieved.
- getContent() - Method in class org.apache.nutch.protocol.file.FileResponse
- getContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
- getContent() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
- getContent() - Method in class org.apache.nutch.protocol.http.HttpResponse
- getContent() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
- getContent() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- getContent() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- getContent() - Method in class org.apache.nutch.protocol.ProtocolOutput
- getContent() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
- getContentMeta() - Method in class org.apache.nutch.parse.ParseData
-
The original Metadata retrieved from content
- getContentType() - Method in exception org.apache.nutch.parse.ParserNotFound
- getContentType() - Method in class org.apache.nutch.protocol.Content
-
The media type of the retrieved content.
- getContentType() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- getCookie() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- getCookie(URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
If per-host cookies are configured, this method will look up the cookie for the given URL.
- getCookiePolicy() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getCookies() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- getCountryName() - Method in class org.apache.nutch.util.domain.TopLevelDomain
-
Returns the country name if TLD is Country Code TLD
- getCrawlId() - Method in class org.apache.nutch.service.model.request.DbQuery
- getCrawlId() - Method in class org.apache.nutch.service.model.request.JobConfig
- getCrawlId() - Method in class org.apache.nutch.service.model.request.ServiceConfig
- getCrawlId() - Method in class org.apache.nutch.service.model.response.JobInfo
- getCredentials() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the credentials generated by the HttpAuthentication object.
- getCredentials() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the Basic credentials generated by this HttpBasicAuthentication object
- getCurrentKey() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
- getCurrentNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the node currently being processed.
- getCurrentNode() - Method in class org.apache.nutch.util.NodeWalker
-
Return the current node.
- getCurrentValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
- getCustomRequestHeaders() - Method in class org.apache.nutch.protocol.okhttp.OkHttp
- getData() - Method in interface org.apache.nutch.parse.Parse
-
Other data extracted from the page.
- getData() - Method in class org.apache.nutch.parse.ParseImpl
- getDatum() - Method in class org.apache.nutch.fetcher.FetchItem
- getDependencies() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of plugin ids.
- getDescriptor() - Method in class org.apache.nutch.plugin.Extension
-
Get the plugin descriptor.
- getDescriptor() - Method in class org.apache.nutch.plugin.Plugin
-
Returns the plugin descriptor
- getDnsFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- getDocumentMeta() - Method in class org.apache.nutch.indexer.NutchDocument
- getDom(InputStream) - Static method in class org.apache.nutch.util.DomUtil
-
Returns the parsed DOM tree, or null if an error occurs
- getDomain() - Method in class org.apache.nutch.util.domain.DomainSuffix
- getDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the domain name of the url.
- getDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Get the domain name of the url.
- getDomainSuffix(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname
- getDomainSuffix(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the DomainSuffix corresponding to the last public part of the hostname
- getDriverForPage(String, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
- getDriverForPage(String, Configuration) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- getDumpPaths() - Method in class org.apache.nutch.service.model.response.ServiceInfo
- getDuplicate(CrawlDatum, CrawlDatum) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- getElement(DocumentFragment, String) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Finds the specified element and returns its value
- getEmptyParse(Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
Creates an empty Parse instance containing the status
- getEmptyParseResult(String, Configuration) - Method in class org.apache.nutch.parse.ParseStatus
-
Creates an empty ParseResult for a given URL
- getEventData() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Get event data
- getEventType() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Get type of this event object
- getExemptions() - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- getExitValue() - Method in class org.apache.nutch.util.CommandRunner
- getExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of exported libs as URLs
- getExtensionInstance() - Method in class org.apache.nutch.plugin.Extension
-
Return an instance of the extension implementation.
- getExtensionPoint(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an extension point identified by an extension point id.
- getExtensions() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns an array of extensions that listen to this extension point
- getExtensions() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extensions.
- getExtensions(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Finds the best-suited parse plugin for a given contentType.
- getExtenstionPoints() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of extension points.
- getFetched() - Method in class org.apache.nutch.hostdb.HostDatum
- getFetchInterval() - Method in class org.apache.nutch.crawl.CrawlDatum
- getFetchItem() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- getFetchItem() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- getFetchItemQueue(String) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- getFetchNodeDb() - Method in class org.apache.nutch.fetcher.FetchNodeDb
- getFetchNodeDb() - Method in class org.apache.nutch.service.NutchServer
- getFetchSchedule(Configuration) - Static method in class org.apache.nutch.crawl.FetchScheduleFactory
-
Return the FetchSchedule implementation specified within the given Configuration, or DefaultFetchSchedule by default.
- getFetchTime() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Get the fetch time.
- getFetchTime() - Method in class org.apache.nutch.fetcher.FetchNode
- getField(String) - Method in class org.apache.nutch.indexer.NutchDocument
- getFieldName() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
- getFieldNames() - Method in class org.apache.nutch.indexer.NutchDocument
- getFieldValue(String) - Method in class org.apache.nutch.indexer.NutchDocument
- getFilters() - Method in class org.apache.nutch.net.URLFilters
- getFromUrl() - Method in class org.apache.nutch.crawl.Inlink
- getGeneralTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
- getGone() - Method in class org.apache.nutch.hostdb.HostDatum
- getHeader(String) - Method in interface org.apache.nutch.net.protocols.Response
-
Get the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.file.FileResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.ftp.FtpResponse
-
Returns the value of a named header.
- getHeader(String) - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
- getHeader(String) - Method in class org.apache.nutch.protocol.http.HttpResponse
- getHeader(String) - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
- getHeader(String) - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- getHeader(String) - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- getHeader(String) - Method in class org.apache.nutch.protocol.selenium.HttpResponse
- getHeaders() - Method in interface org.apache.nutch.net.protocols.Response
-
Get all the headers.
- getHeaders() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
- getHeaders() - Method in class org.apache.nutch.protocol.http.HttpResponse
- getHeaders() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
- getHeaders() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- getHeaders() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- getHeaders() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
- getHeaders() - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- getHomepageUrl() - Method in class org.apache.nutch.hostdb.HostDatum
- getHost(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the lowercased hostname for the URL or null if the URL is not well-formed.
- getHost(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the lowercased hostname for the URL.
- getHostname(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
- getHostName(String) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
Strip a URL, leaving only the host name.
- getHostSegments(String) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by "."
- getHostSegments(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Partitions the hostname of the url by "."
- getHTMLContent(WebDriver, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
- getHtmlPage(String) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
- getHtmlPage(String, Configuration) - Static method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
-
Function for obtaining the HTML BODY using the selected selenium webdriver. There are a number of configuration properties within nutch-site.xml which determine whether to take screenshots of the rendered pages and persist them as timestamped .png files into HDFS.
- getHtmlPage(String, Configuration) - Static method in class org.apache.nutch.protocol.selenium.HttpWebClient
-
Function for obtaining the HTML using the selected selenium webdriver. There are a number of configuration properties within nutch-site.xml which determine whether to take screenshots of the rendered pages and persist them as timestamped .png files into HDFS.
- getHttpEquivTags() - Method in class org.apache.nutch.parse.HTMLMetaTags
- getId() - Method in class org.apache.nutch.collection.Subcollection
- getId() - Method in class org.apache.nutch.exchange.ExchangeConfig
- getId() - Method in class org.apache.nutch.plugin.Extension
-
Return the unique id of the extension.
- getId() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the unique id of the extension point.
- getId() - Method in class org.apache.nutch.service.model.request.SeedList
- getId() - Method in class org.apache.nutch.service.model.request.SeedUrl
- getId() - Method in class org.apache.nutch.service.model.response.JobInfo
- getID(String) - Static method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
-
Returns a normalised doc ID based on the URL of a document
- getImported() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getInfo() - Method in class org.apache.nutch.service.impl.JobWorker
- getInfo(String) - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
- getInfo(String, String) - Method in class org.apache.nutch.service.resources.JobResource
-
Get job info
- getInlinks(Text) - Method in class org.apache.nutch.crawl.LinkDbReader
- getInLinks() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getInLinks() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Gets the set of inlinks
- getInlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
- getInProgressSize() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- getInstance() - Static method in class org.apache.nutch.fetcher.FetchNodeDb
- getInstance() - Static method in class org.apache.nutch.plugin.URLStreamHandlerFactory
-
Get the singleton instance of this class.
- getInstance() - Static method in class org.apache.nutch.service.NutchServer
- getInstance() - Static method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyAllRule
- getInstance() - Static method in class org.apache.nutch.util.domain.DomainSuffixes
-
Singleton instance, lazy instantiation
- getInstance(Element) - Static method in class org.apache.nutch.exchange.ExchangeConfig
- getInt(String, int) - Method in class org.apache.nutch.indexer.IndexWriterParams
- getIPAddress(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
- getJobClassName() - Method in class org.apache.nutch.service.model.request.JobConfig
- getJobFailureLogMessage(String, Job) - Static method in class org.apache.nutch.util.NutchJob
-
Method to return job failure log message.
- getJobHistory() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
-
Get the Job history
- getJobManager() - Method in class org.apache.nutch.service.NutchServer
- getJobRunning() - Method in class org.apache.nutch.service.impl.NutchServerPoolExecutor
-
Get the list of currently running jobs
- getJobs() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- getJobs(String) - Method in class org.apache.nutch.service.resources.JobResource
-
Get job history for a given job regardless of the job's state
- getJsonArray() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getJsonData() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getJsonData() - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Get a string representation of the JSON structure of the URL content.
- getJsonData() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- getJsonData(String, Content, Metadata) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getJsonData(String, Content, Metadata) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Returns a string representation of the JSON structure of the URL content.
- getJsonData(String, Content, Metadata, ParseData) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getJsonData(String, Content, Metadata, ParseData) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Returns a string representation of the JSON structure of the URL content.
- getJsonData(String, Content, Metadata, ParseData) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- getKey() - Method in class org.apache.nutch.collection.Subcollection
- getKey() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getKeyPrefix() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getL2Norm() - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
- getLastCheck() - Method in class org.apache.nutch.hostdb.HostDatum
- getLastModified() - Method in class org.apache.nutch.protocol.ProtocolStatus
- getLinks() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- getLinkType() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- getLoginFormId() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getLoginPostData() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getLoginUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getLong(String, long) - Method in class org.apache.nutch.indexer.IndexWriterParams
- getMajorCode() - Method in class org.apache.nutch.parse.ParseStatus
- getMaxContent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getMaxDuration() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
The time limit to download the entire content, in seconds.
- getMaxInterval(Text, float) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
Returns the max_interval for this URL, which might depend on the host.
- getMessage() - Method in class org.apache.nutch.parse.ParseStatus
- getMessage() - Method in class org.apache.nutch.protocol.ProtocolStatus
- getMeta(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get metadata value for a given key.
- getMeta(String) - Method in class org.apache.nutch.parse.ParseData
-
Get a metadata single value.
- getMetadata() - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get all metadata.
- getMetadata() - Method in class org.apache.nutch.parse.Outlink
- getMetadata() - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- getMetadata() - Method in class org.apache.nutch.scoring.webgraph.Node
- getMetaData() - Method in class org.apache.nutch.crawl.CrawlDatum
-
Get CrawlDatum metadata
- getMetaData() - Method in class org.apache.nutch.hostdb.HostDatum
-
Get Host metadata.
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.html.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaTags(HTMLMetaTags, Node, URL) - Static method in class org.apache.nutch.parse.tika.HTMLMetaProcessor
-
Sets the indicators in robotsMeta to appropriate values, based on any META tags found under the given node.
- getMetaValues(String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Get multiple metadata values for a given key.
- getMethod() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getMimeType(File) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
- getMimeType(String) - Method in class org.apache.nutch.util.MimeUtil
-
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
- getMinInterval(Text, float) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
-
Returns the min_interval for this URL, which might depend on the host.
- getMinorCode() - Method in class org.apache.nutch.parse.ParseStatus
- getModifiedTime() - Method in class org.apache.nutch.crawl.CrawlDatum
- getMsg() - Method in class org.apache.nutch.service.model.response.JobInfo
- getName() - Method in class org.apache.nutch.collection.Subcollection
- getName() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns the name of the extension point.
- getName() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the name of the plugin.
- getName() - Method in class org.apache.nutch.protocol.ProtocolStatus
- getName() - Method in class org.apache.nutch.service.model.request.SeedList
- getNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Get the current value of noCache.
- getNode() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- getNodeValue(Node) - Static method in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Returns the text value of the specified Node and child nodes
- getNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Get the current value of noFollow.
- getNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Get the current value of noIndex.
- getNormalizedName(String) - Static method in class org.apache.nutch.metadata.SpellCheckedMetadata
-
Get the normalized form of a metadata attribute name.
- getNotExportedLibUrls() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an array of libraries as URLs that are not exported by the plugin.
- getNotModified() - Method in class org.apache.nutch.hostdb.HostDatum
- getNumInlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
- getNumOfOutlinks() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- getNumOutlinks() - Method in class org.apache.nutch.scoring.webgraph.Node
- getObject(String) - Method in class org.apache.nutch.util.ObjectCache
- getOrderedPlugins(Class<?>, String, String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Get ordered list of plugins.
- getOutlinks() - Method in class org.apache.nutch.fetcher.FetchNode
- getOutlinks() - Method in class org.apache.nutch.parse.ParseData
-
Get the outlinks of the page.
- getOutlinks(String, String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text and adds anchor to the extracted Outlinks
- getOutlinks(String, Configuration) - Static method in class org.apache.nutch.parse.OutlinkExtractor
-
Extracts Outlink from given plain text.
- getOutlinks(URL, ArrayList<Outlink>, List<Link>) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
- getOutlinks(URL, ArrayList<Outlink>, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
- getOutlinkScore() - Method in class org.apache.nutch.scoring.webgraph.Node
- getOutputCommitter(TaskAttemptContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
- getOutputDir() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getPage(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the page for the url.
- getParameters() - Method in class org.apache.nutch.exchange.ExchangeConfig
- getParams() - Method in class org.apache.nutch.service.model.request.NutchConfig
- getParse(Content) - Method in class org.apache.nutch.parse.ext.ExtParser
- getParse(Content) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Parses the given feed, then extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
- getParse(Content) - Method in class org.apache.nutch.parse.html.HtmlParser
- getParse(Content) - Method in class org.apache.nutch.parse.js.JSParseFilter
-
Parse a JavaScript file and extract outlinks
- getParse(Content) - Method in interface org.apache.nutch.parse.Parser
-
This method parses the given content and returns a map of <key, parse> pairs.
- getParse(Content) - Method in class org.apache.nutch.parse.tika.TikaParser
- getParse(Content) - Method in class org.apache.nutch.parse.zip.ZipParser
- getParseMeta() - Method in class org.apache.nutch.parse.ParseData
-
Other content properties.
- getParserById(String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns a Parser instance with the specified extId, representing its extension ID.
- getParsers(String, String) - Method in class org.apache.nutch.parse.ParserFactory
-
Function returns an array of Parsers for a given content type.
- getPartition(FloatWritable, Writable, int) - Method in class org.apache.nutch.crawl.Generator.Selector
-
Partition by host / domain or IP.
- getPartition(Text, Writable, int) - Method in class org.apache.nutch.crawl.URLPartitioner
-
Hash by host or domain name or IP address.
- getPassAllFilter() - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Get a path filter which allows all paths.
- getPassDirectoriesFilter(FileSystem) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Get a path filter which allows all directories.
- getPath() - Method in class org.apache.nutch.service.model.request.ReaderConfig
- getPaths(FileStatus[]) - Static method in class org.apache.nutch.util.HadoopFSUtil
-
Turns an array of FileStatus into an array of Paths.
- getPattern() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
- getPluginClass() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the fully qualified name of the class which implements the abstract Plugin class.
- getPluginDescriptor(String) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns the descriptor of one plugin identified by a plugin id.
- getPluginDescriptors() - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns all registered plugin descriptors.
- getPluginFolder(String) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Return the named plugin folder.
- getPluginId() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the unique identifier of the plug-in or null.
- getPluginInstance(PluginDescriptor) - Method in class org.apache.nutch.plugin.PluginRepository
-
Returns an instance of a plugin.
- getPluginPath() - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns the directory path of the plugin.
- getPort() - Method in class org.apache.nutch.service.NutchServer
- getPos() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the current position in the file.
- getProgress() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns the percentage of progress in processing the file.
- getProgress() - Method in class org.apache.nutch.util.NutchTool
-
Get relative progress of the tool.
- getProperty(String, String) - Method in class org.apache.nutch.service.resources.ConfigResource
-
Get property
- getProtocol(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
-
Returns the appropriate Protocol implementation for a url.
- getProtocol(String) - Static method in class org.apache.nutch.util.URLUtil
- getProtocol(URL) - Method in class org.apache.nutch.protocol.ProtocolFactory
-
Returns the appropriate Protocol implementation for a url.
- getProtocol(URL) - Static method in class org.apache.nutch.util.URLUtil
- getProtocolById(String) - Method in class org.apache.nutch.protocol.ProtocolFactory
- getProtocolOutput(String, CrawlDatum, boolean) - Method in class org.apache.nutch.util.AbstractChecker
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.file.File
-
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
- getProtocolOutput(Text, CrawlDatum) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getProtocolOutput(Text, CrawlDatum) - Method in interface org.apache.nutch.protocol.Protocol
-
Get the ProtocolOutput for a given url and crawldatum
- getProviderName() - Method in class org.apache.nutch.plugin.PluginDescriptor
- getProxyHost() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getProxyPort() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getQueueCount() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- getQueueCountMaxExceptions() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- getQueueID() - Method in class org.apache.nutch.fetcher.FetchItem
- getQueueSize() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- getReaders(Path, Configuration) - Static method in class org.apache.nutch.util.SegmentReaderUtil
- getRealm() - Method in interface org.apache.nutch.protocol.httpclient.HttpAuthentication
-
Gets the realm used by the HttpAuthentication object during creation.
- getRealm() - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Gets the realm attribute of the HttpBasicAuthentication object.
- getReason() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
- getRecordReader(InputSplit, Job, Mapper.Context) - Method in class org.apache.nutch.segment.ContentAsTextInputFormat
- getRecordReader(InputSplit, Job, Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcInputFormat
-
Get the RecordReader for reading the arc file.
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.fetcher.FetcherOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.indexer.IndexerOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.parse.ParseOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
- getRecordWriter(TaskAttemptContext) - Method in class org.apache.nutch.segment.SegmentReader.TextOutputFormat
- getRedirPerm() - Method in class org.apache.nutch.hostdb.HostDatum
- getRedirTemp() - Method in class org.apache.nutch.hostdb.HostDatum
- getRefresh() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Get the current value of refresh.
- getRefreshHref() - Method in class org.apache.nutch.parse.HTMLMetaTags
- getRefreshTime() - Method in class org.apache.nutch.parse.HTMLMetaTags
- getRemovedFormFields() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- getReplacement() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
- getReprUrl() - Method in class org.apache.nutch.fetcher.FetcherThread
- getRequestAccept() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestAcceptEncoding() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestAcceptLanguage() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestContactEmail() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestContactName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestHostAddress() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestHostName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestRobots() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestSoftware() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getRequestUserAgent() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResource(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
- getResourceAsStream(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
- getResources(String) - Method in class org.apache.nutch.plugin.PluginClassLoader
- getResourceString(String, Locale) - Method in class org.apache.nutch.plugin.PluginDescriptor
-
Returns an I18N'd resource string.
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.htmlunit.Http
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.http.Http
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Fetches the url with a configured HTTP client and gets the response.
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.interactiveselenium.Http
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
- getResponse(URL, CrawlDatum, boolean) - Method in class org.apache.nutch.protocol.selenium.Http
- getResponseAddress() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseContent() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseContentEncoding() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseContentType() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseDate() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseHostName() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseServer() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResponseStatus() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getResult() - Method in class org.apache.nutch.service.model.response.JobInfo
- getRetriesSinceFetch() - Method in class org.apache.nutch.crawl.CrawlDatum
- getReversedHost(String) - Static method in class org.apache.nutch.util.TableUtil
-
Given a reversed url, returns the reversed host, e.g. "com.foo.bar:http:8983/to/index.html?a=b" -> "com.foo.bar"
- getReverseKey() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getReverseKeyValue() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.file.File
-
No robots parsing is done for file protocol.
- getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Get the robots rules for a given url
- getRobotRules(Text, CrawlDatum, List<Content>) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getRobotRules(Text, CrawlDatum, List<Content>) - Method in interface org.apache.nutch.protocol.Protocol
-
Retrieve robot rules applicable for this URL.
- getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.ftp.FtpRobotRulesParser
-
For hosts whose robots rules have not yet been cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules and caches the rules object to avoid re-work in future.
- getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
-
Get the rules from robots.txt which apply to the given url.
- getRobotRulesSet(Protocol, URL, List<Content>) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).
- getRobotRulesSet(Protocol, Text, List<Content>) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).
- getRootNode() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Get the root node of the DOM being created.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Returns the name of the file of rules to use for a particular implementation.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
-
Rules specified as a config property will override rules specified as a config file.
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
-
Gets reader for regex rules
- getRulesReader(Configuration) - Method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
-
Rules specified as a config property will override rules specified as a config file.
- getRunningJobs() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- getSchema() - Method in class org.apache.nutch.plugin.ExtensionPoint
-
Returns a path to the xml schema of an extension point.
- getScopedRules() - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
- getScore() - Method in class org.apache.nutch.crawl.CrawlDatum
- getScore() - Method in class org.apache.nutch.hostdb.HostDatum
- getScore() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- getSeedFilePath() - Method in class org.apache.nutch.service.model.request.SeedList
- getSeedList() - Method in class org.apache.nutch.service.model.request.SeedUrl
- getSeedList(String) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
- getSeedList(String) - Method in interface org.apache.nutch.service.SeedManager
- getSeedLists() - Method in class org.apache.nutch.service.resources.SeedResource
-
Gets the list of seedFiles already created
- getSeedManager() - Method in class org.apache.nutch.service.NutchServer
- getSeeds() - Method in class org.apache.nutch.service.impl.SeedManagerImpl
- getSeeds() - Method in interface org.apache.nutch.service.SeedManager
- getSeedUrls() - Method in class org.apache.nutch.service.model.request.SeedList
- getSeedUrlsCount() - Method in class org.apache.nutch.service.model.request.SeedList
- getServerStatus() - Method in class org.apache.nutch.service.resources.AdminResource
-
Get the status of the Nutch Server
- getSignature() - Method in class org.apache.nutch.crawl.CrawlDatum
- getSignature(Configuration) - Static method in class org.apache.nutch.crawl.SignatureFactory
- getSimpleDateFormat() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getSplits(JobContext) - Method in class org.apache.nutch.fetcher.Fetcher.InputFormat
-
Don't split inputs to keep things polite - a single fetch list must be processed in one fetcher task.
- getStartDate() - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- getStarted() - Method in class org.apache.nutch.service.NutchServer
- getState() - Method in class org.apache.nutch.service.model.response.JobInfo
- getStats(Path, SegmentReader.SegmentReaderStats) - Method in class org.apache.nutch.segment.SegmentReader
- getStatus() - Method in class org.apache.nutch.crawl.CrawlDatum
- getStatus() - Method in class org.apache.nutch.fetcher.FetchNode
- getStatus() - Method in class org.apache.nutch.parse.ParseData
-
Get the status of parsing the page.
- getStatus() - Method in class org.apache.nutch.protocol.ProtocolOutput
- getStatus() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- getStatus() - Method in class org.apache.nutch.util.domain.DomainSuffix
- getStatus() - Method in class org.apache.nutch.util.NutchTool
-
Returns current status of the running tool
- getStatusByName(String) - Static method in class org.apache.nutch.crawl.CrawlDatum
- getStatusName(byte) - Static method in class org.apache.nutch.crawl.CrawlDatum
- getStrings(String) - Method in class org.apache.nutch.indexer.IndexWriterParams
- getStrings(String, String...) - Method in class org.apache.nutch.indexer.IndexWriterParams
- getSubColection(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Get the named subcollection
- getSubCollections(String) - Method in class org.apache.nutch.collection.CollectionManager
-
Returns the names of the collections a URL is part of
- getSystemName() - Method in class org.apache.nutch.protocol.ftp.Client
-
Fetches the system type name from the server and returns the string.
- getTargetPoint() - Method in class org.apache.nutch.plugin.Extension
-
Get target point
- getText() - Method in interface org.apache.nutch.parse.Parse
-
The textual content of the page.
- getText() - Method in class org.apache.nutch.parse.ParseImpl
- getText() - Method in class org.apache.nutch.parse.ParseText
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This is a convenience method, equivalent to getText(sb, node, false).
- getText(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This is a convenience method, equivalent to getText(sb, node, false).
- getText(StringBuffer, Node, boolean) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer.
- getThrownError() - Method in class org.apache.nutch.util.CommandRunner
- getTimeout() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getTimeout() - Method in class org.apache.nutch.util.CommandRunner
- getTimestamp() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Get timestamp of current event.
- getTimestamp() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- getTimestamp() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getTitle() - Method in class org.apache.nutch.fetcher.FetchNode
- getTitle() - Method in class org.apache.nutch.parse.ParseData
-
Get the title of the page.
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.html.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTitle(StringBuffer, Node) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
-
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
- getTlsPreferredCipherSuites() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getTlsPreferredProtocols() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getToFieldName() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
- getTokenStream() - Method in class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
-
Get the tokenStream created by Tokenizer
- getTopLevelDomainName(String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getTopLevelDomainName(URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns the top level domain name of the url.
- getTotalSize() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- getToUrl() - Method in class org.apache.nutch.parse.Outlink
- getType() - Method in class org.apache.nutch.service.model.request.DbQuery
- getType() - Method in class org.apache.nutch.service.model.request.JobConfig
- getType() - Method in class org.apache.nutch.service.model.response.JobInfo
- getType() - Method in class org.apache.nutch.util.domain.TopLevelDomain
- getTypes() - Method in class org.apache.nutch.crawl.NutchWritable
- getUnfetched() - Method in class org.apache.nutch.hostdb.HostDatum
- getUniqueFile(TaskAttemptContext, String) - Method in class org.apache.nutch.parse.ParseOutputFormat
- getUrl() - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Get URL of this event
- getUrl() - Method in class org.apache.nutch.fetcher.FetchItem
- getUrl() - Method in class org.apache.nutch.fetcher.FetchNode
- getUrl() - Method in interface org.apache.nutch.net.protocols.Response
-
Get the URL used to retrieve this response.
- getUrl() - Method in exception org.apache.nutch.parse.ParserNotFound
- getUrl() - Method in class org.apache.nutch.protocol.Content
-
The url fetched.
- getUrl() - Method in class org.apache.nutch.protocol.htmlunit.HttpResponse
- getUrl() - Method in class org.apache.nutch.protocol.http.HttpResponse
- getUrl() - Method in class org.apache.nutch.protocol.httpclient.HttpResponse
- getUrl() - Method in class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- getUrl() - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- getUrl() - Method in exception org.apache.nutch.protocol.ProtocolNotFound
- getUrl() - Method in class org.apache.nutch.protocol.selenium.HttpResponse
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- getUrl() - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- getUrl() - Method in class org.apache.nutch.service.model.request.SeedUrl
- getUrl() - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- getUrl() - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- getURL2() - Method in class org.apache.nutch.fetcher.FetchItem
- getUrlMD5(String) - Static method in class org.apache.nutch.util.DumpFileUtil
- getUseHttp11() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getUserAgent() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- getUUID(Configuration) - Static method in class org.apache.nutch.util.NutchConfiguration
-
Retrieve a Nutch UUID of this configuration object, or null if the configuration was created elsewhere.
- getValues() - Method in class org.apache.nutch.indexer.NutchField
- getValues(String) - Method in class org.apache.nutch.metadata.Metadata
-
Get the values associated to a metadata name.
- getValues(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
- getVersion() - Method in class org.apache.nutch.parse.ParseData
- getVersion() - Method in class org.apache.nutch.parse.ParseStatus
- getVersion() - Method in class org.apache.nutch.plugin.PluginDescriptor
- getWaitForExit() - Method in class org.apache.nutch.util.CommandRunner
- getWARCInfoContent(Configuration) - Static method in class org.apache.nutch.tools.WARCUtils
- getWarcSize() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- getWeight() - Method in class org.apache.nutch.indexer.NutchDocument
- getWeight() - Method in class org.apache.nutch.indexer.NutchField
- getWhiteList() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist
- getWhiteListString() - Method in class org.apache.nutch.collection.Subcollection
-
Returns whitelist String
- getWriter() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Return null since there is no Writer for this class.
- gone - Variable in class org.apache.nutch.hostdb.HostDatum
- GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource is gone.
- guessEncoding(Content, String) - Method in class org.apache.nutch.util.EncodingDetector
-
Guess the encoding with the previously specified list of clues.
- GZIPUtils - Class in org.apache.nutch.util
-
A collection of utility methods for working on GZIPed data.
- GZIPUtils() - Constructor for class org.apache.nutch.util.GZIPUtils
H
- HadoopFSUtil - Class in org.apache.nutch.util
- HadoopFSUtil() - Constructor for class org.apache.nutch.util.HadoopFSUtil
- hasDbStatus(CrawlDatum) - Static method in class org.apache.nutch.crawl.CrawlDatum
- hasFetchStatus(CrawlDatum) - Static method in class org.apache.nutch.crawl.CrawlDatum
- hashCode() - Method in class org.apache.nutch.crawl.CrawlDatum
- hashCode() - Method in class org.apache.nutch.crawl.Inlink
- hashCode() - Method in class org.apache.nutch.parse.Outlink
- hashCode() - Method in class org.apache.nutch.plugin.PluginClassLoader
- hashCode() - Method in class org.apache.nutch.protocol.httpclient.DummySSLProtocolSocketFactory
- hashCode() - Method in class org.apache.nutch.service.model.request.SeedList
- hashCode() - Method in class org.apache.nutch.service.model.request.SeedUrl
- HashComparator() - Constructor for class org.apache.nutch.crawl.Generator.HashComparator
- hasHomepageUrl() - Method in class org.apache.nutch.hostdb.HostDatum
- hasHostDomainRules - Variable in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Whether there are host- or domain-specific rules.
- hasMetaData() - Method in class org.apache.nutch.hostdb.HostDatum
- hasNext() - Method in class org.apache.nutch.util.NodeWalker
- hasObject(String) - Method in class org.apache.nutch.util.ObjectCache
- head(String, int) - Method in class org.apache.nutch.service.impl.LinkReader
- head(String, int) - Method in class org.apache.nutch.service.impl.NodeReader
- head(String, int) - Method in class org.apache.nutch.service.impl.SequenceReader
- head(String, int) - Method in interface org.apache.nutch.service.NutchReader
- HeadingsParseFilter - Class in org.apache.nutch.parse.headings
-
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
- HeadingsParseFilter() - Constructor for class org.apache.nutch.parse.headings.HeadingsParseFilter
- homepageUrl - Variable in class org.apache.nutch.hostdb.HostDatum
- host - Variable in class org.apache.nutch.hostdb.ResolverThread
- host - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- HOST - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- hostDatum - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- HostDatum - Class in org.apache.nutch.hostdb
- HostDatum() - Constructor for class org.apache.nutch.hostdb.HostDatum
- HostDatum(float) - Constructor for class org.apache.nutch.hostdb.HostDatum
- HostDatum(float, Date) - Constructor for class org.apache.nutch.hostdb.HostDatum
- HostDatum(float, Date, String) - Constructor for class org.apache.nutch.hostdb.HostDatum
- HOSTDB_CHECK_FAILED - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_CHECK_KNOWN - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_CHECK_NEW - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_CRAWLDATUM_PROCESSORS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_DUMP_HEADER - Static variable in class org.apache.nutch.hostdb.ReadHostDb
- HOSTDB_DUMP_HOMEPAGES - Static variable in class org.apache.nutch.hostdb.ReadHostDb
- HOSTDB_DUMP_HOSTNAMES - Static variable in class org.apache.nutch.hostdb.ReadHostDb
- HOSTDB_FILTER_EXPRESSION - Static variable in class org.apache.nutch.hostdb.ReadHostDb
- HOSTDB_FORCE_CHECK - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_NUM_RESOLVER_THREADS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_NUMERIC_FIELDS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_PERCENTILES - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_PURGE_FAILED_HOSTS_THRESHOLD - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_RECHECK_INTERVAL - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_STRING_FIELDS - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_URL_FILTERING - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTDB_URL_NORMALIZING - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- HOSTNAME - Static variable in class org.apache.nutch.tools.WARCUtils
- hostOrDomain() - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Returns the host or domain this rule belongs to.
- hostProtocolMapping - Variable in class org.apache.nutch.protocol.ProtocolFactory
- HOSTS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- HOSTS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- hostText - Variable in class org.apache.nutch.hostdb.ResolverThread
- HostURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.host
-
URL normalizer for mapping hosts to their desired form.
- HostURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
- HTMLLanguageParser - Class in org.apache.nutch.analysis.lang
- HTMLLanguageParser() - Constructor for class org.apache.nutch.analysis.lang.HTMLLanguageParser
- HTMLMetaProcessor - Class in org.apache.nutch.parse.html
-
Class for parsing META Directives from DOM trees.
- HTMLMetaProcessor - Class in org.apache.nutch.parse.tika
-
Class for parsing META Directives from DOM trees.
- HTMLMetaProcessor() - Constructor for class org.apache.nutch.parse.html.HTMLMetaProcessor
- HTMLMetaProcessor() - Constructor for class org.apache.nutch.parse.tika.HTMLMetaProcessor
- HTMLMetaTags - Class in org.apache.nutch.parse
-
This class holds the information about HTML "meta" tags extracted from a page.
- HTMLMetaTags() - Constructor for class org.apache.nutch.parse.HTMLMetaTags
- HtmlParseFilter - Interface in org.apache.nutch.parse
-
Extension point for DOM-based HTML parsers.
- HTMLPARSEFILTER_ORDER - Static variable in class org.apache.nutch.parse.HtmlParseFilters
- HtmlParseFilters - Class in org.apache.nutch.parse
-
Creates and caches HtmlParseFilter implementing plugins.
- HtmlParseFilters(Configuration) - Constructor for class org.apache.nutch.parse.HtmlParseFilters
- HtmlParser - Class in org.apache.nutch.parse.html
- HtmlParser() - Constructor for class org.apache.nutch.parse.html.HtmlParser
- HtmlUnitWebDriver - Class in org.apache.nutch.protocol.htmlunit
- HtmlUnitWebDriver() - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
- HtmlUnitWebWindowListener - Class in org.apache.nutch.protocol.htmlunit
- HtmlUnitWebWindowListener() - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
- HtmlUnitWebWindowListener(int) - Constructor for class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
- Http - Class in org.apache.nutch.protocol.htmlunit
- Http - Class in org.apache.nutch.protocol.http
- Http - Class in org.apache.nutch.protocol.httpclient
-
This class is a protocol plugin that configures an HTTP client for Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
- Http - Class in org.apache.nutch.protocol.interactiveselenium
- Http - Class in org.apache.nutch.protocol.selenium
- Http() - Constructor for class org.apache.nutch.protocol.htmlunit.Http
-
Default constructor.
- Http() - Constructor for class org.apache.nutch.protocol.http.Http
-
Public default constructor.
- Http() - Constructor for class org.apache.nutch.protocol.httpclient.Http
-
Constructs this plugin.
- Http() - Constructor for class org.apache.nutch.protocol.interactiveselenium.Http
- Http() - Constructor for class org.apache.nutch.protocol.selenium.Http
- HTTP - org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
- HTTP - org.apache.nutch.protocol.http.HttpResponse.Scheme
- HTTP - org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
- HTTP - org.apache.nutch.protocol.selenium.HttpResponse.Scheme
- HTTP_HEADER_FROM - Static variable in class org.apache.nutch.tools.WARCUtils
- HTTP_HEADER_USER_AGENT - Static variable in class org.apache.nutch.tools.WARCUtils
- HTTP_LOG_SUPPRESSION - Static variable in class org.apache.nutch.net.protocols.ProtocolLogUtil
- HttpAuthentication - Interface in org.apache.nutch.protocol.httpclient
-
The base level of services required for Http Authentication
- HttpAuthenticationException - Exception in org.apache.nutch.protocol.httpclient
-
Can be used to identify problems during creation of Authentication objects.
- HttpAuthenticationException() - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
-
Constructs a new exception with null as its detail message.
- HttpAuthenticationException(String) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
-
Constructs a new exception with the specified detail message.
- HttpAuthenticationException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
-
Constructs a new exception with the specified message and cause.
- HttpAuthenticationException(Throwable) - Constructor for exception org.apache.nutch.protocol.httpclient.HttpAuthenticationException
-
Constructs a new exception with the specified cause and the detail message from the given cause if it is not null.
- HttpAuthenticationFactory - Class in org.apache.nutch.protocol.httpclient
-
Provides the Http protocol implementation with the ability to authenticate when prompted.
- HttpAuthenticationFactory(Configuration) - Constructor for class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
- HttpBase - Class in org.apache.nutch.protocol.http.api
- HttpBase() - Constructor for class org.apache.nutch.protocol.http.api.HttpBase
-
Creates a new instance of HttpBase
- HttpBase(Logger) - Constructor for class org.apache.nutch.protocol.http.api.HttpBase
-
Creates a new instance of HttpBase
- HttpBasicAuthentication - Class in org.apache.nutch.protocol.httpclient
-
Implementation of RFC 2617 Basic Authentication.
- HttpBasicAuthentication(String, Configuration) - Constructor for class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
-
Construct an HttpBasicAuthentication for the given challenge parameters.
- HttpDateFormat - Class in org.apache.nutch.net.protocols
-
Parse and format HTTP dates in HTTP headers, e.g., used to fill the "If-Modified-Since" request header field.
- HttpDateFormat() - Constructor for class org.apache.nutch.net.protocols.HttpDateFormat
- HttpException - Exception in org.apache.nutch.protocol.http.api
- HttpException() - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
- HttpException(String) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
- HttpException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
- HttpException(Throwable) - Constructor for exception org.apache.nutch.protocol.http.api.HttpException
- HttpFormAuthConfigurer - Class in org.apache.nutch.protocol.httpclient
- HttpFormAuthConfigurer() - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- HttpFormAuthentication - Class in org.apache.nutch.protocol.httpclient
- HttpFormAuthentication(String, String, Map<String, String>, Map<String, String>, Set<String>) - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- HttpFormAuthentication(HttpFormAuthConfigurer, HttpClient, Http) - Constructor for class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- HttpHeaders - Interface in org.apache.nutch.metadata
-
A collection of HTTP header names.
- HttpResponse - Class in org.apache.nutch.protocol.htmlunit
-
An HTTP response.
- HttpResponse - Class in org.apache.nutch.protocol.http
-
An HTTP response.
- HttpResponse - Class in org.apache.nutch.protocol.httpclient
-
An HTTP response.
- HttpResponse - Class in org.apache.nutch.protocol.interactiveselenium
- HttpResponse - Class in org.apache.nutch.protocol.selenium
- HttpResponse(HttpBase, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.htmlunit.HttpResponse
-
Default public constructor.
- HttpResponse(HttpBase, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.http.HttpResponse
-
Default public constructor.
- HttpResponse(Http, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.interactiveselenium.HttpResponse
- HttpResponse(Http, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.selenium.HttpResponse
- HttpResponse.Scheme - Enum in org.apache.nutch.protocol.htmlunit
- HttpResponse.Scheme - Enum in org.apache.nutch.protocol.http
- HttpResponse.Scheme - Enum in org.apache.nutch.protocol.interactiveselenium
- HttpResponse.Scheme - Enum in org.apache.nutch.protocol.selenium
- HttpRobotRulesParser - Class in org.apache.nutch.protocol.http.api
-
This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol.
- HttpRobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
- HTTPS - org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
- HTTPS - org.apache.nutch.protocol.http.HttpResponse.Scheme
- HTTPS - org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
- HTTPS - org.apache.nutch.protocol.selenium.HttpResponse.Scheme
- HttpWebClient - Class in org.apache.nutch.protocol.selenium
- HttpWebClient() - Constructor for class org.apache.nutch.protocol.selenium.HttpWebClient
I
- IDENTIFIER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system.
- IDLE - org.apache.nutch.service.model.response.JobInfo.State
- IF_MODIFIED_SINCE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- ignorableWhitespace(char[], int, int) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of ignorable whitespace in element content.
- IGNORE_EXTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
- IGNORE_INTERNAL_LINKS - Static variable in class org.apache.nutch.crawl.LinkDb
- in - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- IN_USE - org.apache.nutch.util.domain.DomainSuffix.Status
- INC_RATE - Variable in class org.apache.nutch.crawl.AdaptiveFetchSchedule
- incConnectionFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- incDnsFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- incrementExceptionCounter() - Method in class org.apache.nutch.fetcher.FetchItemQueue
- index(Path, Path, List<Path>, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
- index(Path, Path, List<Path>, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
- index(Path, Path, List<Path>, boolean, boolean, String) - Method in class org.apache.nutch.indexer.IndexingJob
- index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
- index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
- index(Path, Path, List<Path>, boolean, boolean, String, boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.indexer.IndexingJob
- INDEX - org.apache.nutch.service.JobManager.JobType
- INDEX - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- INDEX - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- INDEXER_BINARY_AS_BASE64 - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_DELETE - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_DELETE_ROBOTS_NOINDEX - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_DELETE_SKIPPED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_NO_COMMIT - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_PARAMS - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- INDEXER_SKIP_NOTMODIFIED - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- IndexerMapper() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
- IndexerMapReduce - Class in org.apache.nutch.indexer
-
This class is typically invoked from within IndexingJob and handles all MapReduce functionality required when undertaking indexing.
- IndexerMapReduce() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce
- IndexerMapReduce.IndexerMapper - Class in org.apache.nutch.indexer
- IndexerMapReduce.IndexerReducer - Class in org.apache.nutch.indexer
- IndexerOutputFormat - Class in org.apache.nutch.indexer
- IndexerOutputFormat() - Constructor for class org.apache.nutch.indexer.IndexerOutputFormat
- IndexerReducer() - Constructor for class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Dampen the boost value by scorePower.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method calculates an indexed document score/boost.
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.ScoringFilters
- indexerScore(Text, NutchDocument, CrawlDatum, CrawlDatum, Parse, Inlinks, float) - Method in class org.apache.nutch.scoring.tld.TLDScoringFilter
- IndexingException - Exception in org.apache.nutch.indexer
- IndexingException() - Constructor for exception org.apache.nutch.indexer.IndexingException
- IndexingException(String) - Constructor for exception org.apache.nutch.indexer.IndexingException
- IndexingException(String, Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
- IndexingException(Throwable) - Constructor for exception org.apache.nutch.indexer.IndexingException
- IndexingFilter - Interface in org.apache.nutch.indexer
-
Extension point for indexing.
- INDEXINGFILTER_ORDER - Static variable in class org.apache.nutch.indexer.IndexingFilters
- IndexingFilters - Class in org.apache.nutch.indexer
-
Creates and caches IndexingFilter implementing plugins.
- IndexingFilters(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingFilters
- IndexingFiltersChecker - Class in org.apache.nutch.indexer
-
Reads and parses a URL and runs the indexers on it.
- IndexingFiltersChecker() - Constructor for class org.apache.nutch.indexer.IndexingFiltersChecker
- IndexingJob - Class in org.apache.nutch.indexer
-
Generic indexer which relies on the plugins implementing IndexWriter
- IndexingJob() - Constructor for class org.apache.nutch.indexer.IndexingJob
- IndexingJob(Configuration) - Constructor for class org.apache.nutch.indexer.IndexingJob
- IndexWriter - Interface in org.apache.nutch.indexer
- IndexWriterConfig - Class in org.apache.nutch.indexer
- IndexWriterParams - Class in org.apache.nutch.indexer
- IndexWriterParams(Map<? extends String, ? extends String>) - Constructor for class org.apache.nutch.indexer.IndexWriterParams
-
Fill IndexWriterParams from map.
- indexWriters(NutchDocument) - Method in class org.apache.nutch.exchange.Exchanges
-
Returns all the indexers where the document must be sent to.
- IndexWriters - Class in org.apache.nutch.indexer
-
Creates and caches IndexWriter implementing plugins.
- IndexWriters.IndexWriterWrapper - Class in org.apache.nutch.indexer
- IndexWriterWrapper() - Constructor for class org.apache.nutch.indexer.IndexWriters.IndexWriterWrapper
- inflate(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[]) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array.
- inflateBestEffort(byte[], int) - Static method in class org.apache.nutch.util.DeflateUtils
-
Returns an inflated copy of the input array, truncated to sizeLimit bytes, if necessary.
- INFRASTRUCTURE - org.apache.nutch.util.domain.DomainSuffix.Status
- INFRASTRUCTURE - org.apache.nutch.util.domain.TopLevelDomain.Type
- init() - Method in class org.apache.nutch.collection.CollectionManager
- init(Path) - Method in class org.apache.nutch.crawl.LinkDbReader
- initialize(InputSplit, TaskAttemptContext) - Method in class org.apache.nutch.tools.arc.ArcRecordReader
- initialize(Element) - Method in class org.apache.nutch.collection.Subcollection
-
Initialize Subcollection from a DOM element
- initializeSchedule(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Initialize fetch schedule related data.
- initializeSchedule(Text, CrawlDatum) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Initialize fetch schedule related data.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
- initialScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly discovered pages.
- initialScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when adding newly discovered pages.
- initMRJob(Path, Path, Collection<Path>, Job, boolean) - Static method in class org.apache.nutch.indexer.IndexerMapReduce
- inject(Path, Path) - Method in class org.apache.nutch.crawl.Injector
- inject(Path, Path, boolean, boolean) - Method in class org.apache.nutch.crawl.Injector
- inject(Path, Path, boolean, boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.Injector
- INJECT - org.apache.nutch.service.JobManager.JobType
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
- injectedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Set an initial score for newly injected pages.
- injectedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate a new initial score, used when injecting new pages.
- InjectMapper() - Constructor for class org.apache.nutch.crawl.Injector.InjectMapper
- Injector - Class in org.apache.nutch.crawl
-
Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb.
- Injector() - Constructor for class org.apache.nutch.crawl.Injector
- Injector(Configuration) - Constructor for class org.apache.nutch.crawl.Injector
- Injector.InjectMapper - Class in org.apache.nutch.crawl
-
InjectMapper reads the CrawlDb that seeds are injected into, as well as the plain-text seed files, and parses each seed line into the URL and metadata.
- Injector.InjectReducer - Class in org.apache.nutch.crawl
-
Combine multiple new entries for a url.
- InjectReducer() - Constructor for class org.apache.nutch.crawl.Injector.InjectReducer
- Inlink - Class in org.apache.nutch.crawl
-
An incoming link to a page.
- Inlink() - Constructor for class org.apache.nutch.crawl.Inlink
- Inlink(String, String) - Constructor for class org.apache.nutch.crawl.Inlink
- INLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
- INLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
- inLinks - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- Inlinks - Class in org.apache.nutch.crawl
-
A list of Inlinks.
- Inlinks() - Constructor for class org.apache.nutch.crawl.Inlinks
- InputCompatMapper() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatMapper
- InputCompatReducer() - Constructor for class org.apache.nutch.segment.SegmentReader.InputCompatReducer
- InputFormat() - Constructor for class org.apache.nutch.fetcher.Fetcher.InputFormat
- install(Job, Path) - Static method in class org.apache.nutch.crawl.CrawlDb
- install(Job, Path) - Static method in class org.apache.nutch.crawl.LinkDb
- InteractiveSeleniumHandler - Interface in org.apache.nutch.protocol.interactiveselenium.handlers
- INTERNAL - org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
implementation internal reason
- invert(Path, Path[], boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
- invert(Path, Path, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDb
- Inverter() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter
- INVERTLINKS - org.apache.nutch.service.JobManager.JobType
- InvertMapper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertMapper
- InvertReducer() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
- IP - Static variable in class org.apache.nutch.tools.WARCUtils
- IP_ADDRESS - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold the IP address the request is sent to if store.ip.address is true
- IPFilterRules - Class in org.apache.nutch.protocol.okhttp
-
Optionally limit or block connections to IP address ranges (localhost/loopback or site-local addresses, subnet ranges given in CIDR notation, or single IP addresses).
- IPFilterRules(Configuration) - Constructor for class org.apache.nutch.protocol.okhttp.IPFilterRules
- isAllowListed(URL) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Check whether a URL belongs to an allowlisted host.
- isAnySuccess() - Method in class org.apache.nutch.parse.ParseResult
-
A convenience method which returns true if at least one of the parses is successful.
- isCanonical() - Method in interface org.apache.nutch.parse.Parse
-
Indicates if the parse is coming from a url or a sub-url
- isCanonical() - Method in class org.apache.nutch.parse.ParseImpl
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
- isClientTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
- isCompressed() - Method in class org.apache.nutch.tools.CommonCrawlConfig
- isCookieEnabled() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isDomainSuffix(String) - Method in class org.apache.nutch.util.domain.DomainSuffixes
-
Return whether the extension is a registered domain entry
- isEligibleForCheck(HostDatum) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
-
Determines whether a record is eligible for recheck.
- isEmpty() - Method in class org.apache.nutch.hostdb.HostDatum
- isEmpty() - Method in class org.apache.nutch.parse.ParseResult
-
Checks whether the result is empty.
- isEmpty() - Method in class org.apache.nutch.protocol.okhttp.IPFilterRules
- isEmpty(String) - Static method in class org.apache.nutch.util.StringUtil
-
Checks if a string is empty (i.e., is null or empty).
- isExempted(String, String) - Method in class org.apache.nutch.net.URLExemptionFilters
-
Run all defined filters.
- isForce() - Method in class org.apache.nutch.service.model.request.NutchConfig
- isHalted() - Method in class org.apache.nutch.fetcher.FetcherThread
- isHomePageOf(URL, String) - Static method in class org.apache.nutch.util.URLUtil
-
Test whether a URL is the home page or root page of a host.
- isIfModifiedSinceEnabled() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isIgnoreCase() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- isIndexable(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
-
Check if the segment is indexable.
- isLoginRedirect() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- isMagic(byte[]) - Static method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns true if the byte array passed matches the gzip header magic number.
- isModeAccept() - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- isModelCreated - Static variable in class org.apache.nutch.scoring.similarity.cosine.Model
- isMultiValued(String) - Method in class org.apache.nutch.metadata.Metadata
-
Returns true if named value is multivalued.
- isParsed(Path, FileSystem) - Static method in class org.apache.nutch.segment.SegmentChecker
-
Check the segment to see if it has been parsed before.
- isParsing(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
- isPermanentFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
- isRedirect() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- isRedirect() - Method in class org.apache.nutch.protocol.ProtocolStatus
- isRemoteVerificationEnabled() - Method in class org.apache.nutch.protocol.ftp.Client
-
Return whether or not verification of the remote host participating in data connections is enabled.
- isRunning() - Method in class org.apache.nutch.service.NutchServer
- isSameDomainName(String, String) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isSameDomainName(URL, URL) - Static method in class org.apache.nutch.util.URLUtil
-
Returns whether the given urls have the same domain name.
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.htmlunit.DummyX509TrustManager
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.http.DummyX509TrustManager
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.httpclient.DummyX509TrustManager
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.interactiveselenium.DummyX509TrustManager
- isServerTrusted(X509Certificate[]) - Method in class org.apache.nutch.protocol.selenium.DummyX509TrustManager
- isStoreHttpHeaders() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isStoreHttpRequest() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isStoreIPAddress() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isStorePartialAsTruncated() - Method in class org.apache.nutch.protocol.http.api.HttpBase
-
Whether to save partial fetches as truncated content, cf.
- isStoringContent(Configuration) - Static method in class org.apache.nutch.fetcher.Fetcher
- isSuccess() - Method in class org.apache.nutch.parse.ParseResult
-
A convenience method which returns true only if all parses are successful.
- isSuccess() - Method in class org.apache.nutch.parse.ParseStatus
- isSuccess() - Method in class org.apache.nutch.protocol.ProtocolStatus
- isTlsCheckCertificates() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- isTransientFailure() - Method in class org.apache.nutch.protocol.ProtocolStatus
- isTruncated(Content) - Static method in class org.apache.nutch.parse.ParseSegment
-
Checks if the page's content is truncated.
- isValid() - Method in class org.apache.nutch.indexer.replace.FieldReplacer
-
Does this FieldReplacer have a valid fieldname and pattern?
- isWhiteSpace(char) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Returns whether the specified ch conforms to the XML 1.0 definition of whitespace.
- isWhiteSpace(char[], int, int) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(String) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- isWhiteSpace(StringBuffer) - Static method in class org.apache.nutch.parse.html.XMLCharacterRecognizer
-
Tell if the string is whitespace.
- iterator() - Method in class org.apache.nutch.crawl.Inlinks
- iterator() - Method in class org.apache.nutch.indexer.NutchDocument
-
Iterate over all fields.
- iterator() - Method in class org.apache.nutch.parse.ParseResult
-
Iterate over all entries in the <url, Parse> map.
J
- JexlExchange - Class in org.apache.nutch.exchange.jexl
- JexlExchange() - Constructor for class org.apache.nutch.exchange.jexl.JexlExchange
- JexlIndexingFilter - Class in org.apache.nutch.indexer.jexl
-
An IndexingFilter that allows filtering of documents based on a JEXL expression.
- JexlIndexingFilter() - Constructor for class org.apache.nutch.indexer.jexl.JexlIndexingFilter
- JexlUtil - Class in org.apache.nutch.util
-
Utility methods for handling JEXL expressions
- JexlUtil() - Constructor for class org.apache.nutch.util.JexlUtil
- JobConfig - Class in org.apache.nutch.service.model.request
-
Job-specific configuration.
- JobConfig() - Constructor for class org.apache.nutch.service.model.request.JobConfig
- JobFactory - Class in org.apache.nutch.service.impl
- JobFactory() - Constructor for class org.apache.nutch.service.impl.JobFactory
- JobInfo - Class in org.apache.nutch.service.model.response
-
This is the response object containing Job information
- JobInfo(String, JobConfig, JobInfo.State, String) - Constructor for class org.apache.nutch.service.model.response.JobInfo
- JobInfo.State - Enum in org.apache.nutch.service.model.response
- jobManager - Variable in class org.apache.nutch.service.resources.AbstractResource
- JobManager - Interface in org.apache.nutch.service
- JobManager.JobType - Enum in org.apache.nutch.service
- JobManagerImpl - Class in org.apache.nutch.service.impl
- JobManagerImpl(JobFactory, ConfManager, NutchServerPoolExecutor) - Constructor for class org.apache.nutch.service.impl.JobManagerImpl
- JobResource - Class in org.apache.nutch.service.resources
- JobResource() - Constructor for class org.apache.nutch.service.resources.JobResource
- JobWorker - Class in org.apache.nutch.service.impl
- JobWorker(JobConfig, Configuration, NutchTool) - Constructor for class org.apache.nutch.service.impl.JobWorker
-
To initialize JobWorker thread with the Job Configurations provided by user.
- jsonArray - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- JsonIndenter() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
- JSParseFilter - Class in org.apache.nutch.parse.js
-
This class is a heuristic link extractor for JavaScript files and code snippets.
- JSParseFilter() - Constructor for class org.apache.nutch.parse.js.JSParseFilter
K
- KafkaConstants - Interface in org.apache.nutch.indexwriter.kafka
- KafkaIndexWriter - Class in org.apache.nutch.indexwriter.kafka
-
Sends Nutch documents to a configured Kafka Cluster
- KafkaIndexWriter() - Constructor for class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- keepClientCnxOpen - Variable in class org.apache.nutch.util.AbstractChecker
- KEY_SERIALIZER - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- KEY_STORE_PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- KEY_STORE_PATH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- KEY_STORE_TYPE - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- keyPrefix - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- KILLED - org.apache.nutch.service.model.response.JobInfo.State
- KILLING - org.apache.nutch.service.model.response.JobInfo.State
- killJob() - Method in class org.apache.nutch.service.impl.JobWorker
- killJob() - Method in class org.apache.nutch.util.NutchTool
-
Kill the job immediately.
L
- LANGUAGE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A language of the intellectual content of the resource.
- LanguageIndexingFilter - Class in org.apache.nutch.analysis.lang
-
An IndexingFilter that adds a lang (language) field to the document.
- LanguageIndexingFilter() - Constructor for class org.apache.nutch.analysis.lang.LanguageIndexingFilter
-
Constructs a new Language Indexing Filter.
- LAST_MODIFIED - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- lastCheck - Variable in class org.apache.nutch.hostdb.HostDatum
- leftPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s (left padded) with leading spaces so that its length is length.
- LENGTH - org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
fetch exceeded configured http.content.limit
- LICENSE_LOCATION - Static variable in interface org.apache.nutch.metadata.CreativeCommons
- LICENSE_URL - Static variable in interface org.apache.nutch.metadata.CreativeCommons
- LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
- LineRecordWriter(DataOutputStream) - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
- LinkAnalysisScoringFilter - Class in org.apache.nutch.scoring.link
- LinkAnalysisScoringFilter() - Constructor for class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- LinkDatum - Class in org.apache.nutch.scoring.webgraph
-
A class for holding link information including the url, anchor text, a score, the timestamp of the link and a link type.
- LinkDatum() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Default constructor, no url, timestamp, score, or link type.
- LinkDatum(String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a given url.
- LinkDatum(String, String) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
-
Creates a LinkDatum with a url and an anchor text.
- LinkDatum(String, String, long) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDatum
- LinkDb - Class in org.apache.nutch.crawl
-
Maintains an inverted link map, listing incoming links for each url.
- LinkDb() - Constructor for class org.apache.nutch.crawl.LinkDb
- LinkDb(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDb
- LinkDb.LinkDbMapper - Class in org.apache.nutch.crawl
- LinkDBDumpMapper() - Constructor for class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
- LinkDbFilter - Class in org.apache.nutch.crawl
-
This class provides a way to separate the URL normalization and filtering steps from the rest of LinkDb manipulation code.
- LinkDbFilter() - Constructor for class org.apache.nutch.crawl.LinkDbFilter
- LinkDbMapper() - Constructor for class org.apache.nutch.crawl.LinkDb.LinkDbMapper
- LinkDbMerger - Class in org.apache.nutch.crawl
-
This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.
- LinkDbMerger() - Constructor for class org.apache.nutch.crawl.LinkDbMerger
- LinkDbMerger(Configuration) - Constructor for class org.apache.nutch.crawl.LinkDbMerger
- LinkDbMerger.LinkDbMergeReducer - Class in org.apache.nutch.crawl
- LinkDbMergeReducer() - Constructor for class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
- LinkDbReader - Class in org.apache.nutch.crawl
-
Read utility for the LinkDb.
- LinkDbReader() - Constructor for class org.apache.nutch.crawl.LinkDbReader
- LinkDbReader(Configuration, Path) - Constructor for class org.apache.nutch.crawl.LinkDbReader
- LinkDbReader.LinkDBDumpMapper - Class in org.apache.nutch.crawl
- LinkDumper - Class in org.apache.nutch.scoring.webgraph
-
The LinkDumper tool creates a database of node to inlink information that can be read using the nested Reader class.
- LinkDumper() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper
- LinkDumper.Inverter - Class in org.apache.nutch.scoring.webgraph
-
Inverts outlinks from the WebGraph to inlinks and attaches node information.
- LinkDumper.Inverter.InvertMapper - Class in org.apache.nutch.scoring.webgraph
-
Wraps all values in ObjectWritables.
- LinkDumper.Inverter.InvertReducer - Class in org.apache.nutch.scoring.webgraph
-
Inverts outlinks to inlinks while attaching node information to the outlink.
- LinkDumper.LinkNode - Class in org.apache.nutch.scoring.webgraph
-
Bean class which holds url to node information.
- LinkDumper.LinkNodes - Class in org.apache.nutch.scoring.webgraph
-
Writable class which holds an array of LinkNode objects.
- LinkDumper.Merger - Class in org.apache.nutch.scoring.webgraph
-
Merges LinkNode objects into a single array value per url.
- LinkDumper.Reader - Class in org.apache.nutch.scoring.webgraph
-
Reader class which will print out the url and all of its inlinks to system out.
- LinkNode() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- LinkNode(String, Node) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- LinkNodes() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- LinkNodes(LinkDumper.LinkNode[]) - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- LinkParams(String, String, int) - Constructor for class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
- LinkRank - Class in org.apache.nutch.scoring.webgraph
- LinkRank() - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Default constructor.
- LinkRank(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.LinkRank
-
Configurable constructor.
- linkRead() - Method in class org.apache.nutch.service.resources.ReaderResouce
-
Get Link Reader response schema
- linkRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
-
Read link object
- LinkReader - Class in org.apache.nutch.service.impl
- LinkReader() - Constructor for class org.apache.nutch.service.impl.LinkReader
- LINKS_INLINKS_HOST - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
- LINKS_ONLY_HOSTS - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
- LINKS_OUTLINKS_HOST - Static variable in class org.apache.nutch.indexer.links.LinksIndexingFilter
- LinksIndexingFilter - Class in org.apache.nutch.indexer.links
- LinksIndexingFilter() - Constructor for class org.apache.nutch.indexer.links.LinksIndexingFilter
- list() - Method in interface org.apache.nutch.service.ConfManager
- list() - Method in class org.apache.nutch.service.impl.ConfManagerImpl
- list(String, JobInfo.State) - Method in class org.apache.nutch.service.impl.JobManagerImpl
- list(String, JobInfo.State) - Method in interface org.apache.nutch.service.JobManager
- list(List<Path>, Writer) - Method in class org.apache.nutch.segment.SegmentReader
- listDumpPaths(String) - Method in class org.apache.nutch.service.resources.ServicesResource
- loadClass(String, boolean) - Method in class org.apache.nutch.plugin.PluginClassLoader
- LOCATION - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- lock(Configuration, Path, boolean) - Static method in class org.apache.nutch.crawl.CrawlDb
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.CrawlDb
- LOCK_NAME - Static variable in class org.apache.nutch.crawl.LinkDb
- LOCK_NAME - Static variable in class org.apache.nutch.hostdb.UpdateHostDb
- LOCK_NAME - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
- LOCK_NAME - Static variable in class org.apache.nutch.util.SitemapProcessor
- LockUtil - Class in org.apache.nutch.util
-
Utility methods for handling application-level locking.
- LockUtil() - Constructor for class org.apache.nutch.util.LockUtil
- LOG - Static variable in class org.apache.nutch.crawl.Generator
- LOG - Static variable in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- LOG - Static variable in class org.apache.nutch.plugin.PluginManifestParser
- LOG - Static variable in class org.apache.nutch.plugin.PluginRepository
- LOG - Static variable in class org.apache.nutch.plugin.URLStreamHandlerFactory
- LOG - Static variable in class org.apache.nutch.protocol.file.File
- LOG - Static variable in class org.apache.nutch.protocol.ftp.Ftp
- LOG - Static variable in class org.apache.nutch.protocol.htmlunit.Http
- LOG - Static variable in class org.apache.nutch.protocol.http.Http
- LOG - Static variable in class org.apache.nutch.protocol.httpclient.Http
- LOG - Static variable in class org.apache.nutch.protocol.interactiveselenium.Http
- LOG - Static variable in class org.apache.nutch.protocol.okhttp.IPFilterRules
- LOG - Static variable in class org.apache.nutch.protocol.okhttp.OkHttp
- LOG - Static variable in class org.apache.nutch.protocol.okhttp.OkHttpResponse
- LOG - Static variable in class org.apache.nutch.protocol.selenium.Http
- LOG - Static variable in interface org.apache.nutch.service.NutchReader
- LOG - Static variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- LOG - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
- logConf() - Method in class org.apache.nutch.protocol.http.api.HttpBase
- logDateFormat - Static variable in class org.apache.nutch.util.TimingUtil
-
Formats dates for logging
- logDateMillis(long) - Static method in class org.apache.nutch.util.TimingUtil
-
Convert epoch milliseconds (System.currentTimeMillis()) into date string (local time zone) used for logging
- login() - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- login(String, String) - Method in class org.apache.nutch.protocol.ftp.Client
-
Login to the FTP server using the provided username and password.
- logout() - Method in class org.apache.nutch.protocol.ftp.Client
-
Logout of the FTP server by sending the QUIT command.
- logShort(Throwable) - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
-
Return true if exception is configured to be logged as short message without stack trace, usually done for frequent exceptions with obvious reasons (e.g., UnknownHostException), configurable by http.log.exceptions.suppress.stack
- longestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the longest prefix of input that is matched, or null if no match exists.
- longestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the longest suffix of input that is matched, or null if no match exists.
- longestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the longest substring of input that is matched by a pattern in the trie, or null if no match exists.
- LuceneAnalyzerUtil - Class in org.apache.nutch.scoring.similarity.util
-
Creates a custom analyzer based on user provided inputs
- LuceneAnalyzerUtil(LuceneAnalyzerUtil.StemFilterType, boolean) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
-
Creates an analyzer instance based on Lucene default stopword set if the param useStopFilter is set to true
- LuceneAnalyzerUtil(LuceneAnalyzerUtil.StemFilterType, List<String>, boolean) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil
-
Creates an analyzer instance based on user provided stop words.
- LuceneAnalyzerUtil.StemFilterType - Enum in org.apache.nutch.scoring.similarity.util
- LuceneTokenizer - Class in org.apache.nutch.scoring.similarity.util
- LuceneTokenizer(String, LuceneTokenizer.TokenizerType, boolean, LuceneAnalyzerUtil.StemFilterType) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
-
Creates a tokenizer based on param values
- LuceneTokenizer(String, LuceneTokenizer.TokenizerType, List<String>, boolean, LuceneAnalyzerUtil.StemFilterType) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
-
Creates a tokenizer based on param values
- LuceneTokenizer(String, LuceneTokenizer.TokenizerType, LuceneAnalyzerUtil.StemFilterType, int, int) - Constructor for class org.apache.nutch.scoring.similarity.util.LuceneTokenizer
-
Creates a tokenizer for the ngram model based on param values
- LuceneTokenizer.TokenizerType - Enum in org.apache.nutch.scoring.similarity.util
M
- m_currentNode - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Current node
- m_doc - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Root document
- m_docFrag - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
First node of document fragment or null if not a DocumentFragment
- m_elemStack - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Vector of element nodes
- m_inCData - Variable in class org.apache.nutch.parse.html.DOMBuilder
-
Flag indicating that we are processing a CData section
- main(String[]) - Static method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDb
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbMerger
-
Run the tool.
- main(String[]) - Static method in class org.apache.nutch.crawl.CrawlDbReader
- main(String[]) - Static method in class org.apache.nutch.crawl.DeduplicationJob
- main(String[]) - Static method in class org.apache.nutch.crawl.Generator
-
Generate a fetchlist from the crawldb.
- main(String[]) - Static method in class org.apache.nutch.crawl.Injector
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDb
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbMerger
-
Run the job
- main(String[]) - Static method in class org.apache.nutch.crawl.LinkDbReader
- main(String[]) - Static method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- main(String[]) - Static method in class org.apache.nutch.crawl.TextProfileSignature
- main(String[]) - Static method in class org.apache.nutch.fetcher.Fetcher
-
Run the fetcher.
- main(String[]) - Static method in class org.apache.nutch.hostdb.ReadHostDb
- main(String[]) - Static method in class org.apache.nutch.hostdb.UpdateHostDb
- main(String[]) - Static method in class org.apache.nutch.indexer.CleaningJob
- main(String[]) - Static method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
-
Main method for invoking this tool
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingFiltersChecker
- main(String[]) - Static method in class org.apache.nutch.indexer.IndexingJob
- main(String[]) - Static method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- main(String[]) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- main(String[]) - Static method in class org.apache.nutch.net.URLFilterChecker
- main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- main(String[]) - Static method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Spits out patterns and substitutions that are in the configuration file.
- main(String[]) - Static method in class org.apache.nutch.net.URLNormalizerChecker
- main(String[]) - Static method in class org.apache.nutch.parse.feed.FeedParser
-
Runs a command line version of this Parser.
- main(String[]) - Static method in class org.apache.nutch.parse.html.HtmlParser
- main(String[]) - Static method in class org.apache.nutch.parse.js.JSParseFilter
-
Main method which can be run from command line with the plugin option.
- main(String[]) - Static method in class org.apache.nutch.parse.ParseData
- main(String[]) - Static method in class org.apache.nutch.parse.ParserChecker
- main(String[]) - Static method in class org.apache.nutch.parse.ParseSegment
- main(String[]) - Static method in class org.apache.nutch.parse.ParseText
- main(String[]) - Static method in class org.apache.nutch.parse.zip.ZipParser
- main(String[]) - Static method in class org.apache.nutch.plugin.PluginRepository
-
Loads all necessary dependencies for a selected plugin, and then runs one of the classes' main() method.
- main(String[]) - Static method in class org.apache.nutch.protocol.Content
- main(String[]) - Static method in class org.apache.nutch.protocol.file.File
-
Quick way for running this class.
- main(String[]) - Static method in class org.apache.nutch.protocol.ftp.Ftp
-
For debugging.
- main(String[]) - Static method in class org.apache.nutch.protocol.htmlunit.Http
- main(String[]) - Static method in class org.apache.nutch.protocol.http.Http
- main(String[]) - Static method in class org.apache.nutch.protocol.httpclient.Http
-
Main method.
- main(String[]) - Static method in class org.apache.nutch.protocol.interactiveselenium.Http
- main(String[]) - Static method in class org.apache.nutch.protocol.okhttp.OkHttp
- main(String[]) - Static method in class org.apache.nutch.protocol.RobotRulesParser
- main(String[]) - Static method in class org.apache.nutch.protocol.selenium.Http
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.LinkRank
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeDumper
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.NodeReader
-
Runs the NodeReader tool.
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
- main(String[]) - Static method in class org.apache.nutch.scoring.webgraph.WebGraph
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentMerger
- main(String[]) - Static method in class org.apache.nutch.segment.SegmentReader
- main(String[]) - Static method in class org.apache.nutch.service.NutchServer
- main(String[]) - Static method in class org.apache.nutch.tools.arc.ArcSegmentCreator
- main(String[]) - Static method in class org.apache.nutch.tools.CommonCrawlDataDumper
-
Main method for invoking this tool
- main(String[]) - Static method in class org.apache.nutch.tools.DmozParser
-
Command-line access.
- main(String[]) - Static method in class org.apache.nutch.tools.FileDumper
-
Main method for invoking this tool
- main(String[]) - Static method in class org.apache.nutch.tools.FreeGenerator
- main(String[]) - Static method in class org.apache.nutch.tools.ResolveUrls
-
Runs the resolve urls tool.
- main(String[]) - Static method in class org.apache.nutch.tools.ShowProperties
- main(String[]) - Static method in class org.apache.nutch.tools.warc.WARCExporter
- main(String[]) - Static method in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- main(String[]) - Static method in class org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter
- main(String[]) - Static method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- main(String[]) - Static method in class org.apache.nutch.urlfilter.regex.RegexURLFilter
- main(String[]) - Static method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- main(String[]) - Static method in class org.apache.nutch.util.CommandRunner
- main(String[]) - Static method in class org.apache.nutch.util.CrawlCompletionStats
- main(String[]) - Static method in class org.apache.nutch.util.domain.DomainStatistics
- main(String[]) - Static method in class org.apache.nutch.util.EncodingDetector
- main(String[]) - Static method in class org.apache.nutch.util.PrefixStringMatcher
- main(String[]) - Static method in class org.apache.nutch.util.ProtocolStatusStatistics
- main(String[]) - Static method in class org.apache.nutch.util.SitemapProcessor
- main(String[]) - Static method in class org.apache.nutch.util.StringUtil
- main(String[]) - Static method in class org.apache.nutch.util.SuffixStringMatcher
- main(String[]) - Static method in class org.apache.nutch.util.URLUtil
-
For testing
- main(HttpBase, String[]) - Static method in class org.apache.nutch.protocol.http.api.HttpBase
- main(RegexURLFilterBase, String[]) - Static method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Filter the standard input using a RegexURLFilterBase.
- majorCodes - Static variable in class org.apache.nutch.parse.ParseStatus
- makeClient(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
-
Generates a RestHighLevelClient with the hosts given
- makeClient(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
-
Generates a RestHighLevelClient with the hosts given
- map(FloatWritable, Generator.SelectorEntry, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorInverseMapper
- map(Text, BytesWritable, Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
-
Runs the Map job to translate an arc record into output for Nutch segments.
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertMapper
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterMapper
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
-
Mapper ingesting records from the HostDB, CrawlDB and plaintext host scores file.
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
- map(Text, Writable, Mapper.Context) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCMapper
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DBFilter
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorMapper
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbFilter
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateMapper
- map(Text, CrawlDatum, Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
- map(Text, Inlinks, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbFilter
- map(Text, Inlinks, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
- map(Text, MetaWrapper, Mapper.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
- map(Text, ParseData, Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDb.LinkDbMapper
- map(Text, Node, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
- map(Text, Node, Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
- map(WritableComparable<?>, Text, Mapper.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
- map(WritableComparable<?>, Writable, Mapper.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatMapper
- map(WritableComparable<?>, Content, Mapper.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
- mask(String) - Static method in class org.apache.nutch.util.StringUtil
-
Mask sensitive strings - passwords, etc.
- mask(String, char) - Static method in class org.apache.nutch.util.StringUtil
-
Mask sensitive strings - passwords, etc.
- mask(String, Pattern, char) - Static method in class org.apache.nutch.util.StringUtil
-
Mask sensitive strings - passwords, etc.
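The three mask overloads above differ only in how much of the masking is configurable. A minimal sketch of pattern-based masking follows; it is not the Nutch implementation. The class name `MaskSketch`, the method `maskGroup`, and the choice to mask only capture group 1 are assumptions made for illustration, and the actual `StringUtil.mask` may behave differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of org.apache.nutch.util.StringUtil.
final class MaskSketch {

  /**
   * Replaces every character of capture group 1 of each match with the mask
   * character. Assumes the pattern declares at least one capture group.
   */
  static String maskGroup(String input, Pattern pattern, char mask) {
    Matcher m = pattern.matcher(input);
    StringBuilder out = new StringBuilder();
    int pos = 0;
    while (m.find()) {
      if (m.start(1) < 0) {
        continue; // group 1 did not participate in this match
      }
      // Copy the unmasked text up to the group, then emit mask characters.
      out.append(input, pos, m.start(1));
      for (int i = m.start(1); i < m.end(1); i++) {
        out.append(mask);
      }
      pos = m.end(1);
    }
    out.append(input, pos, input.length());
    return out.toString();
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("password=([^&\\s]+)");
    // prints "user=bob&password=******&x=1"
    System.out.println(maskGroup("user=bob&password=secret&x=1", p, '*'));
  }
}
```

Masking character by character keeps the length of the secret visible, which is convenient for debugging logs; replacing the group with a fixed-length string would hide the length as well.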
- match(String) - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Checks if a url matches this rule.
- match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyAllRule
- match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathQueryRule
- match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.DenyPathRule
- match(URL) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
- match(NutchDocument) - Method in interface org.apache.nutch.exchange.Exchange
-
Determines if the document must go to the related index writers.
- match(NutchDocument) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
-
Determines if the document must go to the related index writers.
- matchChar(TrieStringMatcher.TrieNode, String, int) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Get the next TrieStringMatcher.TrieNode visited, given that you are at node, and that the next character in the input is the idx'th character of s.
- matches(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns true if the given String is matched by a prefix in the trie
- matches(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns true if the given String is matched by a suffix in the trie
- matches(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns true if the given String is matched by a pattern in the trie
- MAX_BULK_DOCS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- MAX_BULK_DOCS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- MAX_BULK_LENGTH - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- MAX_BULK_LENGTH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- MAX_DEPTH_KEY - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
- MAX_DEPTH_KEY_W - Static variable in class org.apache.nutch.scoring.depth.DepthScoringFilter
- MAX_DOC_COUNT - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- MAX_DOCS_BATCH - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
- MAX_WARC_FILE_SIZE - Static variable in class org.apache.nutch.tools.CommonCrawlFormatWARC
- maxContent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The length limit for downloaded content, in bytes.
- maxCrawlDelay - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Skip page if Crawl-Delay longer than this value.
- maxDuration - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The time limit to download the entire content, in seconds.
- maxInterval - Variable in class org.apache.nutch.crawl.AbstractFetchSchedule
- maxNumRedirects - Variable in class org.apache.nutch.protocol.RobotRulesParser
- MD5Signature - Class in org.apache.nutch.crawl
-
Default implementation of a page signature.
- MD5Signature() - Constructor for class org.apache.nutch.crawl.MD5Signature
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDbMerger
- merge(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.LinkDbMerger
- merge(Path, Path[], boolean, boolean, long) - Method in class org.apache.nutch.segment.SegmentMerger
- Merger() - Constructor for class org.apache.nutch.crawl.CrawlDbMerger.Merger
- Merger() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
- metadata - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- metadata - Variable in class org.apache.nutch.metadata.Metadata
-
A map of all metadata attributes.
- metadata - Variable in class org.apache.nutch.parse.ParserChecker
- metadata - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- metaData - Variable in class org.apache.nutch.hostdb.HostDatum
- Metadata - Class in org.apache.nutch.metadata
-
A multi-valued metadata container.
- Metadata() - Constructor for class org.apache.nutch.metadata.Metadata
-
Constructs a new, empty metadata.
- METADATA_CONTENT - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
- METADATA_DATUM - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
- METADATA_PARSED - Static variable in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
- MetadataIndexer - Class in org.apache.nutch.indexer.metadata
-
Indexer which can be configured to extract metadata from the crawldb, parse metadata or content metadata.
- MetadataIndexer() - Constructor for class org.apache.nutch.indexer.metadata.MetadataIndexer
- MetadataScoringFilter - Class in org.apache.nutch.scoring.metadata
-
For documentation: org.apache.nutch.scoring.metadata
- MetadataScoringFilter() - Constructor for class org.apache.nutch.scoring.metadata.MetadataScoringFilter
- metadataSource - Static variable in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
Metadata source field name
- metadataToJson(Metadata) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
-
Adds keys/values of a Nutch metadata container to a JsonObject.
- MetaTagsParser - Class in org.apache.nutch.parse.metatags
-
Parse HTML meta tags (keywords, description) and store them in the parse metadata so that they can be indexed with the index-metadata plugin with the prefix 'metatag.'.
- MetaTagsParser() - Constructor for class org.apache.nutch.parse.metatags.MetaTagsParser
- MetaWrapper - Class in org.apache.nutch.metadata
-
This is a simple decorator that adds metadata to any Writable-s that can be serialized by NutchWritable.
- MetaWrapper() - Constructor for class org.apache.nutch.metadata.MetaWrapper
- MetaWrapper(Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
- MetaWrapper(Metadata, Writable, Configuration) - Constructor for class org.apache.nutch.metadata.MetaWrapper
- MimeAdaptiveFetchSchedule - Class in org.apache.nutch.crawl
-
Extension of AdaptiveFetchSchedule that allows for more flexible configuration of DEC and INC factors for various MIME-types.
- MimeAdaptiveFetchSchedule() - Constructor for class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- MIMEFILTER_REGEX_FILE - Static variable in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
- MimeTypeIndexingFilter - Class in org.apache.nutch.indexer.filter
-
An IndexingFilter that allows filtering of documents based on the MIME Type detected by Tika
- MimeTypeIndexingFilter() - Constructor for class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
- MimeUtil - Class in org.apache.nutch.util
-
This is a facade class to insulate Nutch from its underlying Mime Type substrate library, Apache Tika.
- MimeUtil(Configuration) - Constructor for class org.apache.nutch.util.MimeUtil
- MIN_CONFIDENCE_KEY - Static variable in class org.apache.nutch.util.EncodingDetector
- MissingDependencyException - Exception in org.apache.nutch.plugin
-
MissingDependencyException will be thrown if a plugin dependency cannot be found.
- MissingDependencyException(String) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
- MissingDependencyException(Throwable) - Constructor for exception org.apache.nutch.plugin.MissingDependencyException
- Model - Class in org.apache.nutch.scoring.similarity.cosine
-
This class creates a model used to store Document vector representation of the corpus.
- Model() - Constructor for class org.apache.nutch.scoring.similarity.cosine.Model
- MODIFIED - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Date on which the resource was changed.
- modifyWebClient(WebClient) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebDriver
- MoreIndexingFilter - Class in org.apache.nutch.indexer.more
-
Add (or reset) a few metaData properties as respective fields (if they are available), so that they can be accurately used within the search index.
- MoreIndexingFilter() - Constructor for class org.apache.nutch.indexer.more.MoreIndexingFilter
- MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource has moved permanently.
N
- NaiveBayesParseFilter - Class in org.apache.nutch.parsefilter.naivebayes
-
HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text, using a training file of positive and negative example texts (see the description of parsefilter.naivebayes.trainfile). A link found irrelevant gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
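The "second chance" rule described above can be sketched as follows. This is an illustration only, not the plugin's code: the naive Bayes classifier is replaced by a plain boolean (`pageRelevant`), and the class name `SecondChanceSketch`, the method `filterOutlinks`, and the case-insensitive substring match are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; the real classifier decision is stubbed as a boolean.
final class SecondChanceSketch {

  /**
   * Keeps every outlink of a relevant page. For an irrelevant page, keeps only
   * the outlinks that contain at least one word from the word list.
   */
  static List<String> filterOutlinks(boolean pageRelevant, List<String> outlinks,
      List<String> wordList) {
    if (pageRelevant) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String link : outlinks) {
      String lower = link.toLowerCase(Locale.ROOT);
      for (String word : wordList) {
        if (lower.contains(word.toLowerCase(Locale.ROOT))) {
          kept.add(link); // second chance: the link mentions a listed word
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = List.of("https://example.org/nutch-tutorial",
        "https://example.org/cooking");
    // prints "[https://example.org/nutch-tutorial]"
    System.out.println(filterOutlinks(false, links, List.of("nutch", "crawler")));
  }
}
```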
- NaiveBayesParseFilter() - Constructor for class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- names() - Method in class org.apache.nutch.metadata.Metadata
-
Returns an array of the names contained in the metadata.
- next(Text, BytesWritable) - Method in class org.apache.nutch.tools.arc.ArcRecordReader
-
Returns true if the next record in the split is read into the key and value pair.
- nextKeyValue() - Method in class org.apache.nutch.tools.arc.ArcRecordReader
- nextNode() - Method in class org.apache.nutch.util.NodeWalker
-
Returns the next Node on the stack and pushes all of its children onto the stack, allowing us to walk the node tree without the use of recursion.
- NO_THRESHOLD - Static variable in class org.apache.nutch.util.EncodingDetector
- Node - Class in org.apache.nutch.scoring.webgraph
-
A class which holds the number of inlinks and outlinks for a given url along with an inlink score from a link analysis program and any metadata.
- Node() - Constructor for class org.apache.nutch.scoring.webgraph.Node
- NODE_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
- nodeChar - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
- NodeDumper - Class in org.apache.nutch.scoring.webgraph
-
A tool that dumps out the top urls by number of inlinks, number of outlinks, or by score, to a text file.
- NodeDumper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper
- NodeDumper.Dumper - Class in org.apache.nutch.scoring.webgraph
-
Outputs the hosts or domains with an associated value.
- NodeDumper.Dumper.DumperMapper - Class in org.apache.nutch.scoring.webgraph
-
Outputs the host or domain as key for this record and numInlinks, numOutlinks or score as the value.
- NodeDumper.Dumper.DumperReducer - Class in org.apache.nutch.scoring.webgraph
-
Outputs either the sum or the top value for this record.
- NodeDumper.Sorter - Class in org.apache.nutch.scoring.webgraph
-
Outputs the top urls sorted in descending order.
- NodeDumper.Sorter.SorterMapper - Class in org.apache.nutch.scoring.webgraph
-
Outputs the url with the appropriate number of inlinks, number of outlinks, or score.
- NodeDumper.Sorter.SorterReducer - Class in org.apache.nutch.scoring.webgraph
-
Flips and collects the url and numeric sort value.
- nodeRead() - Method in class org.apache.nutch.service.resources.ReaderResouce
-
Get schema of the Node object
- nodeRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
-
Read Node object as stored in the Nutch Webgraph
- NodeReader - Class in org.apache.nutch.scoring.webgraph
-
Reads and prints to system out information for a single node from the NodeDb in the WebGraph.
- NodeReader - Class in org.apache.nutch.service.impl
- NodeReader() - Constructor for class org.apache.nutch.scoring.webgraph.NodeReader
- NodeReader() - Constructor for class org.apache.nutch.service.impl.NodeReader
- NodeReader(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.NodeReader
- NodeWalker - Class in org.apache.nutch.util
-
A utility class that allows the walking of any DOM tree using a stack instead of recursion.
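A stack-based walk of the kind this entry describes can be sketched with the JDK's own `org.w3c.dom` types. The sketch below does not use `NodeWalker` itself; the class `StackWalkSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration of walking a DOM tree with an explicit stack.
final class StackWalkSketch {

  /** Returns the names of all element nodes in document (pre-order) order. */
  static List<String> elementNames(Node root) {
    List<String> names = new ArrayList<>();
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node current = stack.pop();
      if (current.getNodeType() == Node.ELEMENT_NODE) {
        names.add(current.getNodeName());
      }
      // Push children in reverse so that the first child is popped first.
      NodeList children = current.getChildNodes();
      for (int i = children.getLength() - 1; i >= 0; i--) {
        stack.push(children.item(i));
      }
    }
    return names;
  }

  /** Parses a small XML string into a DOM document. */
  static Document parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<a><b><c/></b><d/></a>");
    // prints "[a, b, c, d]"
    System.out.println(elementNames(doc.getDocumentElement()));
  }
}
```

An explicit stack avoids deep call chains on heavily nested documents, at the cost of managing the traversal order by hand.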
- NodeWalker(Node) - Constructor for class org.apache.nutch.util.NodeWalker
-
Starts the Node tree from the root node.
- NONE - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
- NORM_HOST_IDN - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- NORM_HOST_TRIM_TRAILING_DOT - Static variable in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- normalize - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Attempts to normalize the input URL string
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
- normalize(String, String) - Method in interface org.apache.nutch.net.URLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
- normalize(String, String) - Method in class org.apache.nutch.net.URLNormalizers
-
Normalize
- normalizeEscapedFragment(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Returns a normalized input URL.
- normalizeHashedFragment(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Returns a normalized input URL.
- normalizers - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- normalizers - Variable in class org.apache.nutch.indexer.IndexingFiltersChecker
- normalizers - Variable in class org.apache.nutch.parse.ParserChecker
- NOT_FETCHED - org.apache.nutch.util.domain.DomainStatistics.MyCounter
- NOT_IN_USE - org.apache.nutch.util.domain.DomainSuffix.Status
- NOT_TRUNCATED - org.apache.nutch.net.protocols.Response.TruncatedContentReason
- NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Not fetching.
- NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource was not found.
- notModified - Variable in class org.apache.nutch.hostdb.HostDatum
- NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Unchanged since the last fetch.
- NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing was not performed.
- now - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- numericFields - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- numericFieldWritables - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- numFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- numJobs - Variable in class org.apache.nutch.util.NutchTool
- numOverDue - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- numRecords() - Method in class org.apache.nutch.hostdb.HostDatum
- numResolverThreads - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- Nutch - Interface in org.apache.nutch.metadata
-
A collection of Nutch internal metadata constants.
- NutchConfig - Class in org.apache.nutch.service.model.request
- NutchConfig() - Constructor for class org.apache.nutch.service.model.request.NutchConfig
- NutchConfiguration - Class in org.apache.nutch.util
-
Utility to create Hadoop Configurations that include Nutch-specific resources.
- NutchDocument - Class in org.apache.nutch.indexer
-
A NutchDocument is the unit of indexing.
- NutchDocument() - Constructor for class org.apache.nutch.indexer.NutchDocument
- nutchFetchIntervalMDName - Static variable in class org.apache.nutch.crawl.Injector
-
metadata key reserved for setting a custom fetchInterval for a specific URL
- NutchField - Class in org.apache.nutch.indexer
-
This class represents a multi-valued field with a weight.
- NutchField() - Constructor for class org.apache.nutch.indexer.NutchField
- NutchField(Object) - Constructor for class org.apache.nutch.indexer.NutchField
- NutchField(Object, float) - Constructor for class org.apache.nutch.indexer.NutchField
- nutchFixedFetchIntervalMDName - Static variable in class org.apache.nutch.crawl.Injector
-
metadata key reserved for setting a fixed custom fetchInterval for a specific URL
- NutchIndexAction - Class in org.apache.nutch.indexer
-
A NutchIndexAction is the new unit of indexing holding the document and action information.
- NutchIndexAction() - Constructor for class org.apache.nutch.indexer.NutchIndexAction
- NutchIndexAction(NutchDocument, byte) - Constructor for class org.apache.nutch.indexer.NutchIndexAction
- NutchJob - Class in org.apache.nutch.util
-
A Job for Nutch jobs.
- NutchJob(Configuration, String) - Constructor for class org.apache.nutch.util.NutchJob
-
Deprecated. Use Job.getInstance(Configuration) or Job.getInstance(Configuration, String) instead.
- NutchPublisher - Interface in org.apache.nutch.publisher
-
All publisher subscriber model implementations should implement this interface.
- NutchPublishers - Class in org.apache.nutch.publisher
- NutchPublishers(Configuration) - Constructor for class org.apache.nutch.publisher.NutchPublishers
- NutchReader - Interface in org.apache.nutch.service
- nutchScoreMDName - Static variable in class org.apache.nutch.crawl.Injector
-
metadata key reserved for setting a custom score for a specific URL
- NutchServer - Class in org.apache.nutch.service
- NutchServerInfo - Class in org.apache.nutch.service.model.response
- NutchServerInfo() - Constructor for class org.apache.nutch.service.model.response.NutchServerInfo
- NutchServerPoolExecutor - Class in org.apache.nutch.service.impl
- NutchServerPoolExecutor(int, int, long, TimeUnit, BlockingQueue<Runnable>) - Constructor for class org.apache.nutch.service.impl.NutchServerPoolExecutor
- NutchTool - Class in org.apache.nutch.util
- NutchTool() - Constructor for class org.apache.nutch.util.NutchTool
- NutchTool(Configuration) - Constructor for class org.apache.nutch.util.NutchTool
- NutchWritable - Class in org.apache.nutch.crawl
- NutchWritable() - Constructor for class org.apache.nutch.crawl.NutchWritable
- NutchWritable(Writable) - Constructor for class org.apache.nutch.crawl.NutchWritable
O
- ObjectCache - Class in org.apache.nutch.util
- ObjectInputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.ObjectInputFormat
- OkHttp - Class in org.apache.nutch.protocol.okhttp
- OkHttp() - Constructor for class org.apache.nutch.protocol.okhttp.OkHttp
- OkHttpResponse - Class in org.apache.nutch.protocol.okhttp
- OkHttpResponse(OkHttp, URL, CrawlDatum) - Constructor for class org.apache.nutch.protocol.okhttp.OkHttpResponse
- OkHttpResponse.TruncatedContent - Class in org.apache.nutch.protocol.okhttp
-
Container to store whether and why content has been truncated
- OLD_OUTLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
- open() - Method in class org.apache.nutch.exchange.Exchanges
-
Opens each configured exchange.
- open(Map<String, String>) - Method in interface org.apache.nutch.exchange.Exchange
-
Initializes the internal variables.
- open(Map<String, String>) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
-
Initializes the internal variables.
- open(Configuration, String) - Method in interface org.apache.nutch.indexer.IndexWriter
-
Deprecated. Use IndexWriter.open(IndexWriterParams) instead.
- open(Configuration, String) - Method in class org.apache.nutch.indexer.IndexWriters
-
Initializes the internal variables of index writers.
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- open(Configuration, String) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- open(IndexWriterParams) - Method in interface org.apache.nutch.indexer.IndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- open(IndexWriterParams) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
-
Initializes the internal variables from a given index writer configuration.
- openChannel() - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
-
Opens a new channel into the opened connection.
- openReaders() - Method in class org.apache.nutch.crawl.LinkDbReader
- OpenSearch1xConstants - Interface in org.apache.nutch.indexwriter.opensearch1x
- OpenSearch1xIndexWriter - Class in org.apache.nutch.indexwriter.opensearch1x
-
Sends NutchDocuments to a configured OpenSearch index.
- OpenSearch1xIndexWriter() - Constructor for class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- OPERATOR - Static variable in class org.apache.nutch.tools.WARCUtils
- OPICScoringFilter - Class in org.apache.nutch.scoring.opic
-
This plugin implements a variant of an Online Page Importance Computation (OPIC) score, described in this paper: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.
- OPICScoringFilter() - Constructor for class org.apache.nutch.scoring.opic.OPICScoringFilter
- OPTIONS - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- OPTIONS - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- org.apache.nutch.analysis.lang - package org.apache.nutch.analysis.lang
-
Text document language identifier.
- org.apache.nutch.collection - package org.apache.nutch.collection
-
Subcollection is a subset of an index.
- org.apache.nutch.crawl - package org.apache.nutch.crawl
-
Crawl control code and tools to run the crawler.
- org.apache.nutch.exchange - package org.apache.nutch.exchange
-
Control code for exchange component, which acts in indexing job and decides to which index writer a document should be routed, based on plugins behavior.
- org.apache.nutch.exchange.jexl - package org.apache.nutch.exchange.jexl
-
Plugin of Exchange component based on JEXL expressions.
- org.apache.nutch.fetcher - package org.apache.nutch.fetcher
-
The Nutch multi-threaded fetching module
- org.apache.nutch.hostdb - package org.apache.nutch.hostdb
- org.apache.nutch.indexer - package org.apache.nutch.indexer
-
Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
- org.apache.nutch.indexer.anchor - package org.apache.nutch.indexer.anchor
-
An indexing plugin for inbound anchor text.
- org.apache.nutch.indexer.arbitrary - package org.apache.nutch.indexer.arbitrary
-
Indexing filter to add document arbitrary data to the index from the output of a user-specified class.
- org.apache.nutch.indexer.basic - package org.apache.nutch.indexer.basic
-
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
- org.apache.nutch.indexer.feed - package org.apache.nutch.indexer.feed
-
Indexing filter to index meta data from RSS feeds.
- org.apache.nutch.indexer.filter - package org.apache.nutch.indexer.filter
- org.apache.nutch.indexer.geoip - package org.apache.nutch.indexer.geoip
-
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
- org.apache.nutch.indexer.jexl - package org.apache.nutch.indexer.jexl
-
This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata
- org.apache.nutch.indexer.links - package org.apache.nutch.indexer.links
- org.apache.nutch.indexer.metadata - package org.apache.nutch.indexer.metadata
-
Indexing filter to add document metadata to the index.
- org.apache.nutch.indexer.more - package org.apache.nutch.indexer.more
-
A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
- org.apache.nutch.indexer.replace - package org.apache.nutch.indexer.replace
-
Indexing filter to allow pattern replacements on metadata.
- org.apache.nutch.indexer.staticfield - package org.apache.nutch.indexer.staticfield
-
A simple plugin called at indexing that adds fields with static data.
- org.apache.nutch.indexer.subcollection - package org.apache.nutch.indexer.subcollection
-
Indexing filter to assign documents to subcollections.
- org.apache.nutch.indexer.tld - package org.apache.nutch.indexer.tld
-
Top Level Domain Indexing plugin.
- org.apache.nutch.indexer.urlmeta - package org.apache.nutch.indexer.urlmeta
-
URL Meta Tag Indexing Plugin
- org.apache.nutch.indexwriter.cloudsearch - package org.apache.nutch.indexwriter.cloudsearch
- org.apache.nutch.indexwriter.csv - package org.apache.nutch.indexwriter.csv
-
Index writer plugin to write a plain CSV file.
- org.apache.nutch.indexwriter.dummy - package org.apache.nutch.indexwriter.dummy
-
Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete".
- org.apache.nutch.indexwriter.elastic - package org.apache.nutch.indexwriter.elastic
-
Index writer plugin for Elasticsearch.
- org.apache.nutch.indexwriter.kafka - package org.apache.nutch.indexwriter.kafka
-
Index writer plugin to produce JSON messages to Kafka.
- org.apache.nutch.indexwriter.opensearch1x - package org.apache.nutch.indexwriter.opensearch1x
-
Index writer plugin for OpenSearch.
- org.apache.nutch.indexwriter.rabbit - package org.apache.nutch.indexwriter.rabbit
- org.apache.nutch.indexwriter.solr - package org.apache.nutch.indexwriter.solr
-
Index writer plugin for Apache Solr.
- org.apache.nutch.metadata - package org.apache.nutch.metadata
-
A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
- org.apache.nutch.microformats.reltag - package org.apache.nutch.microformats.reltag
-
A microformats Rel-Tag Parser/Indexer/Querier plugin.
- org.apache.nutch.net - package org.apache.nutch.net
-
Web-related interfaces: URL filters and normalizers.
- org.apache.nutch.net.protocols - package org.apache.nutch.net.protocols
-
Helper classes related to the Protocol interface, see also org.apache.nutch.protocol.
- org.apache.nutch.net.urlnormalizer.ajax - package org.apache.nutch.net.urlnormalizer.ajax
- org.apache.nutch.net.urlnormalizer.basic - package org.apache.nutch.net.urlnormalizer.basic
-
URL normalizer performing basic normalizations: remove default ports, e.g., port 80 for http:// URLs; remove needless slashes and dot segments in the path component; remove anchors; use percent-encoding (only) where needed.
E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
Optional and configurable normalizations are: convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn; remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot
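The path-related part of the normalization described above (dropping needless slashes and dot segments) can be illustrated with a minimal standalone sketch. It is not the plugin's implementation; the class name PathNormalizationSketch and the method normalizePath are hypothetical, and the sketch deliberately leaves ports, anchors, and percent-encoding untouched.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the code of the Nutch basic URL normalizer.
public class PathNormalizationSketch {

    /** Removes empty segments, "." segments and resolves ".." segments in a URL path. */
    static String normalizePath(String path) {
        Deque<String> segments = new ArrayDeque<>();
        for (String segment : path.split("/")) {
            if (segment.isEmpty() || segment.equals(".")) {
                continue; // needless slash or current-directory segment
            }
            if (segment.equals("..")) {
                if (!segments.isEmpty()) {
                    segments.removeLast(); // step up one level
                }
                continue;
            }
            segments.addLast(segment);
        }
        StringBuilder result = new StringBuilder();
        for (String segment : segments) {
            result.append('/').append(segment);
        }
        if (result.length() == 0) {
            return "/";
        }
        if (path.endsWith("/")) {
            result.append('/'); // keep a trailing slash if the input had one
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Path taken from the example URL above; percent-encoding is not touched here.
        System.out.println(normalizePath("/a/../b//./select%2Dlang.php"));
        // prints "/b/select%2Dlang.php"
    }
}
```

The sketch covers only one of the listed steps; rewriting %2D to "-" and encoding "ñ" as %C3%B1 would need a separate percent-encoding pass.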
- org.apache.nutch.net.urlnormalizer.host - package org.apache.nutch.net.urlnormalizer.host
-
URL normalizer renaming hosts to a canonical form listed in the configuration file.
- org.apache.nutch.net.urlnormalizer.pass - package org.apache.nutch.net.urlnormalizer.pass
-
URL normalizer dummy which does not change URLs.
- org.apache.nutch.net.urlnormalizer.protocol - package org.apache.nutch.net.urlnormalizer.protocol
-
URL normalizer to normalize the protocol for all URLs of a given host or domain.
- org.apache.nutch.net.urlnormalizer.querystring - package org.apache.nutch.net.urlnormalizer.querystring
-
URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
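The idea of sorting query elements so that permuted URLs collapse to one form can be sketched as follows. This is an assumption-laden illustration, not the plugin's code: the class QuerySortSketch and the method sortQuery are hypothetical, and parameters are assumed to be separated by "&".

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code of the Nutch querystring normalizer.
public class QuerySortSketch {

    /** Sorts the "&"-separated elements of a query string lexicographically. */
    static String sortQuery(String query) {
        String[] elements = query.split("&");
        Arrays.sort(elements);
        return String.join("&", elements);
    }

    public static void main(String[] args) {
        // Both permutations map to the same normalized query.
        System.out.println(sortQuery("b=2&a=1&c=3")); // prints "a=1&b=2&c=3"
        System.out.println(sortQuery("c=3&b=2&a=1")); // prints "a=1&b=2&c=3"
    }
}
```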
- org.apache.nutch.net.urlnormalizer.regex - package org.apache.nutch.net.urlnormalizer.regex
-
URL normalizer with configurable rules based on regular expressions (Pattern).
- org.apache.nutch.net.urlnormalizer.slash - package org.apache.nutch.net.urlnormalizer.slash
- org.apache.nutch.parse - package org.apache.nutch.parse
-
The Parse interface and related classes.
- org.apache.nutch.parse.ext - package org.apache.nutch.parse.ext
-
Parse wrapper to run external command to do the parsing.
- org.apache.nutch.parse.feed - package org.apache.nutch.parse.feed
-
Parse RSS feeds.
- org.apache.nutch.parse.headings - package org.apache.nutch.parse.headings
-
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
- org.apache.nutch.parse.html - package org.apache.nutch.parse.html
-
An HTML document parsing plugin.
- org.apache.nutch.parse.js - package org.apache.nutch.parse.js
-
Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
- org.apache.nutch.parse.metatags - package org.apache.nutch.parse.metatags
-
Parse filter to extract meta tags: keywords, description, etc.
- org.apache.nutch.parse.tika - package org.apache.nutch.parse.tika
-
Parse various document formats with help of Apache Tika.
- org.apache.nutch.parse.zip - package org.apache.nutch.parse.zip
-
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
- org.apache.nutch.parsefilter.debug - package org.apache.nutch.parsefilter.debug
-
Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
- org.apache.nutch.parsefilter.naivebayes - package org.apache.nutch.parsefilter.naivebayes
-
HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file in which you can give positive and negative example texts, see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
- org.apache.nutch.parsefilter.regex - package org.apache.nutch.parsefilter.regex
-
RegexParseFilter.
- org.apache.nutch.plugin - package org.apache.nutch.plugin
-
The Nutch Plugin System.
- org.apache.nutch.protocol - package org.apache.nutch.protocol
-
Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
- org.apache.nutch.protocol.file - package org.apache.nutch.protocol.file
-
Protocol plugin which supports retrieving local file resources.
- org.apache.nutch.protocol.ftp - package org.apache.nutch.protocol.ftp
-
Protocol plugin which supports retrieving documents via the ftp protocol.
- org.apache.nutch.protocol.htmlunit - package org.apache.nutch.protocol.htmlunit
-
Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
- org.apache.nutch.protocol.http - package org.apache.nutch.protocol.http
-
Protocol plugin which supports retrieving documents via the http protocol.
- org.apache.nutch.protocol.http.api - package org.apache.nutch.protocol.http.api
-
Common API used by HTTP plugins (http, httpclient, etc.)
- org.apache.nutch.protocol.httpclient - package org.apache.nutch.protocol.httpclient
-
Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
- org.apache.nutch.protocol.interactiveselenium - package org.apache.nutch.protocol.interactiveselenium
-
Protocol plugin which supports retrieving documents using and interacting with Selenium.
- org.apache.nutch.protocol.interactiveselenium.handlers - package org.apache.nutch.protocol.interactiveselenium.handlers
-
Handler implementations to interact with Selenium for org.apache.nutch.protocol.interactiveselenium.
- org.apache.nutch.protocol.okhttp - package org.apache.nutch.protocol.okhttp
-
Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2.
- org.apache.nutch.protocol.selenium - package org.apache.nutch.protocol.selenium
-
Protocol plugin which supports retrieving documents via Selenium.
- org.apache.nutch.publisher - package org.apache.nutch.publisher
- org.apache.nutch.publisher.rabbitmq - package org.apache.nutch.publisher.rabbitmq
-
Publisher package to implement queues
- org.apache.nutch.rabbitmq - package org.apache.nutch.rabbitmq
- org.apache.nutch.scoring - package org.apache.nutch.scoring
-
The ScoringFilter interface.
- org.apache.nutch.scoring.depth - package org.apache.nutch.scoring.depth
-
Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
- org.apache.nutch.scoring.link - package org.apache.nutch.scoring.link
-
Scoring filter used in conjunction with WebGraph.
- org.apache.nutch.scoring.metadata - package org.apache.nutch.scoring.metadata
-
Metadata Scoring Plugin
- org.apache.nutch.scoring.opic - package org.apache.nutch.scoring.opic
-
Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
- org.apache.nutch.scoring.orphan - package org.apache.nutch.scoring.orphan
-
Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time).
- org.apache.nutch.scoring.similarity - package org.apache.nutch.scoring.similarity
- org.apache.nutch.scoring.similarity.cosine - package org.apache.nutch.scoring.similarity.cosine
-
Implements the cosine similarity metric for scoring relevant documents
- org.apache.nutch.scoring.similarity.util - package org.apache.nutch.scoring.similarity.util
-
Utility package for Lucene functions.
- org.apache.nutch.scoring.tld - package org.apache.nutch.scoring.tld
-
Top Level Domain Scoring plugin.
- org.apache.nutch.scoring.urlmeta - package org.apache.nutch.scoring.urlmeta
-
URL Meta Tag Scoring Plugin
- org.apache.nutch.scoring.webgraph - package org.apache.nutch.scoring.webgraph
- org.apache.nutch.segment - package org.apache.nutch.segment
-
A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
- org.apache.nutch.service - package org.apache.nutch.service
- org.apache.nutch.service.impl - package org.apache.nutch.service.impl
- org.apache.nutch.service.model.request - package org.apache.nutch.service.model.request
- org.apache.nutch.service.model.response - package org.apache.nutch.service.model.response
- org.apache.nutch.service.resources - package org.apache.nutch.service.resources
- org.apache.nutch.tools - package org.apache.nutch.tools
-
Miscellaneous tools.
- org.apache.nutch.tools.arc - package org.apache.nutch.tools.arc
-
Tools to read the Arc file format.
- org.apache.nutch.tools.warc - package org.apache.nutch.tools.warc
-
Tools to import / export between Nutch segments and WARC archives.
- org.apache.nutch.urlfilter.api - package org.apache.nutch.urlfilter.api
-
Generic URL filter library, abstracting away from regular expression implementations.
- org.apache.nutch.urlfilter.automaton - package org.apache.nutch.urlfilter.automaton
-
URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
- org.apache.nutch.urlfilter.domain - package org.apache.nutch.urlfilter.domain
-
URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
- org.apache.nutch.urlfilter.domaindenylist - package org.apache.nutch.urlfilter.domaindenylist
-
URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
- org.apache.nutch.urlfilter.fast - package org.apache.nutch.urlfilter.fast
-
URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL.
- org.apache.nutch.urlfilter.ignoreexempt - package org.apache.nutch.urlfilter.ignoreexempt
-
URL filter plugin which identifies exemptions to external URLs when external URLs are set to ignore.
- org.apache.nutch.urlfilter.prefix - package org.apache.nutch.urlfilter.prefix
-
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
- org.apache.nutch.urlfilter.regex - package org.apache.nutch.urlfilter.regex
-
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
- org.apache.nutch.urlfilter.suffix - package org.apache.nutch.urlfilter.suffix
-
URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
- org.apache.nutch.urlfilter.validator - package org.apache.nutch.urlfilter.validator
-
URL filter plugin that validates given urls.
- org.apache.nutch.util - package org.apache.nutch.util
-
Miscellaneous utility classes.
- org.apache.nutch.util.domain - package org.apache.nutch.util.domain
-
Classes for domain name analysis.
- org.creativecommons.nutch - package org.creativecommons.nutch
-
Sample plugins that parse and index Creative Commons metadata.
- ORIGINAL_CHAR_ENCODING - Static variable in interface org.apache.nutch.metadata.Nutch
- ORPHAN_KEY_WRITABLE - Static variable in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
- orphanedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
- orphanedScore(Text, CrawlDatum) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method may change the score or status of CrawlDatum during CrawlDb update, when the URL is neither fetched nor has any inlinks.
- orphanedScore(Text, CrawlDatum) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate orphaned page score during CrawlDb.update().
- OrphanScoringFilter - Class in org.apache.nutch.scoring.orphan
-
Orphan scoring filter that determines whether a page has become orphaned, e.g.
- OrphanScoringFilter() - Constructor for class org.apache.nutch.scoring.orphan.OrphanScoringFilter
- Outlink - Class in org.apache.nutch.parse
-
An outgoing link from a page.
- Outlink() - Constructor for class org.apache.nutch.parse.Outlink
- Outlink(String, String) - Constructor for class org.apache.nutch.parse.Outlink
- OUTLINK - Static variable in class org.apache.nutch.scoring.webgraph.LinkDatum
- OUTLINK_DIR - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph
- OutlinkDb() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Default constructor.
- OutlinkDb(Configuration) - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
-
Configurable constructor.
- OutlinkDbMapper() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
- OutlinkDbReducer() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
- OutlinkExtractor - Class in org.apache.nutch.parse
-
Extractor to extract Outlinks / URLs from plain text using Regular Expressions.
- OutlinkExtractor() - Constructor for class org.apache.nutch.parse.OutlinkExtractor
- overDueTime - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
- overDueTimeLimit - Variable in class org.apache.nutch.hostdb.FetchOverdueCrawlDatumProcessor
P
- parse(InputStream) - Method in class org.apache.nutch.collection.CollectionManager
- parse(String) - Static method in class org.apache.nutch.segment.SegmentPart
-
Create SegmentPart from a String in format "segmentName/partName".
- parse(Path) - Method in class org.apache.nutch.parse.ParseSegment
- parse(Content) - Method in class org.apache.nutch.parse.ParseUtil
- Parse - Interface in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- PARSE - org.apache.nutch.service.JobManager.JobType
- PARSE_DIR_NAME - Static variable in class org.apache.nutch.crawl.CrawlDatum
- PARSE_FORMAT - Static variable in class org.apache.nutch.net.protocols.HttpDateFormat
-
Use a less restrictive format for parsing: accept single-digit day-of-month and any timezone
- parseArgs(String[], int) - Method in class org.apache.nutch.util.AbstractChecker
- parseByExtensionId(String, Content) - Method in class org.apache.nutch.parse.ParseUtil
- parseCharacterEncoding(String) - Static method in class org.apache.nutch.util.EncodingDetector
-
Parse the character encoding from the specified content type header.
- parsed - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- ParseData - Class in org.apache.nutch.parse
-
Data extracted from a page's content.
- ParseData() - Constructor for class org.apache.nutch.parse.ParseData
- ParseData(ParseStatus, String, Outlink[], Metadata) - Constructor for class org.apache.nutch.parse.ParseData
- ParseData(ParseStatus, String, Outlink[], Metadata, Metadata) - Constructor for class org.apache.nutch.parse.ParseData
- parseDmozFile(File, int, boolean, int, Pattern) - Method in class org.apache.nutch.tools.DmozParser
-
Iterate through all the items in this structured DMOZ file.
- parseErrors - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- ParseException - Exception in org.apache.nutch.parse
- ParseException() - Constructor for exception org.apache.nutch.parse.ParseException
- ParseException(String) - Constructor for exception org.apache.nutch.parse.ParseException
- ParseException(String, Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
- ParseException(Throwable) - Constructor for exception org.apache.nutch.parse.ParseException
- parseExpression(String) - Static method in class org.apache.nutch.util.JexlUtil
-
Parses the given expression to a JEXL expression.
- ParseImpl - Class in org.apache.nutch.parse
-
The result of parsing a page's raw content.
- ParseImpl() - Constructor for class org.apache.nutch.parse.ParseImpl
- ParseImpl(String, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
- ParseImpl(Parse) - Constructor for class org.apache.nutch.parse.ParseImpl
- ParseImpl(ParseText, ParseData) - Constructor for class org.apache.nutch.parse.ParseImpl
- ParseImpl(ParseText, ParseData, boolean) - Constructor for class org.apache.nutch.parse.ParseImpl
- parseList(List<String>, String) - Method in class org.apache.nutch.collection.Subcollection
-
Create a list of patterns from a chunk of text, patterns are separated with a newline
- ParseOutputFormat - Class in org.apache.nutch.parse
- ParseOutputFormat() - Constructor for class org.apache.nutch.parse.ParseOutputFormat
- parsePluginFolder(String[]) - Method in class org.apache.nutch.plugin.PluginManifestParser
-
Returns a list of all found plugin descriptors.
- Parser - Interface in org.apache.nutch.parse
-
A parser for content generated by a Protocol implementation.
- ParserChecker - Class in org.apache.nutch.parse
-
Parser checker, useful for testing parser.
- ParserChecker() - Constructor for class org.apache.nutch.parse.ParserChecker
- ParseResult - Class in org.apache.nutch.parse
-
A utility class that stores result of a parse.
- ParseResult(String) - Constructor for class org.apache.nutch.parse.ParseResult
-
Create a container for parse results.
- ParserFactory - Class in org.apache.nutch.parse
-
Creates and caches Parser plugins.
- ParserFactory(Configuration) - Constructor for class org.apache.nutch.parse.ParserFactory
- ParserNotFound - Exception in org.apache.nutch.parse
- ParserNotFound(String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
- ParserNotFound(String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
- ParserNotFound(String, String, String) - Constructor for exception org.apache.nutch.parse.ParserNotFound
- parseRules(String, byte[], String, String) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Deprecated.
- parseRules(String, byte[], String, Collection<String>) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Parses the robots content using the SimpleRobotRulesParser from crawler-commons
- ParseSegment - Class in org.apache.nutch.parse
- ParseSegment() - Constructor for class org.apache.nutch.parse.ParseSegment
- ParseSegment(Configuration) - Constructor for class org.apache.nutch.parse.ParseSegment
- ParseSegment.ParseSegmentMapper - Class in org.apache.nutch.parse
- ParseSegment.ParseSegmentReducer - Class in org.apache.nutch.parse
- ParseSegmentMapper() - Constructor for class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
- ParseSegmentReducer() - Constructor for class org.apache.nutch.parse.ParseSegment.ParseSegmentReducer
- ParseStatus - Class in org.apache.nutch.parse
- ParseStatus() - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseStatus(int) - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseStatus(int, int) - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseStatus(int, int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(int, int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseStatus(int, String) - Constructor for class org.apache.nutch.parse.ParseStatus
-
Simplified constructor for passing just a text message.
- ParseStatus(int, String[]) - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseStatus(Throwable) - Constructor for class org.apache.nutch.parse.ParseStatus
- ParseText - Class in org.apache.nutch.parse
- ParseText() - Constructor for class org.apache.nutch.parse.ParseText
- ParseText(String) - Constructor for class org.apache.nutch.parse.ParseText
- ParseUtil - Class in org.apache.nutch.parse
- ParseUtil(Configuration) - Constructor for class org.apache.nutch.parse.ParseUtil
-
Overloaded constructor
- partialAsTruncated - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Whether to save partial fetches as truncated content.
- PARTITION_MODE_DOMAIN - Static variable in class org.apache.nutch.crawl.URLPartitioner
- PARTITION_MODE_HOST - Static variable in class org.apache.nutch.crawl.URLPartitioner
- PARTITION_MODE_IP - Static variable in class org.apache.nutch.crawl.URLPartitioner
- PARTITION_MODE_KEY - Static variable in class org.apache.nutch.crawl.URLPartitioner
- PartitionReducer() - Constructor for class org.apache.nutch.crawl.Generator.PartitionReducer
- partName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment part (ie.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
-
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
- passScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
Currently a part of score distribution is performed using only data coming from the parsing process.
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.ScoringFilters
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
- passScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
-
Takes the metadata, specified in your "scoring.db.md" property, from the datum object and injects it into the content.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.ScoringFilters
- passScoreBeforeParsing(Text, CrawlDatum, Content) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content.
- PassURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.pass
-
This URLNormalizer doesn't change urls.
- PassURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
- PASSWORD - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- PASSWORD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- PATH - Static variable in interface org.apache.nutch.indexwriter.dummy.DummyConstants
- pattern - Variable in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
- percentiles - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- PERM_REFRESH_TIME - Static variable in class org.apache.nutch.fetcher.Fetcher
- Pluggable - Interface in org.apache.nutch.plugin
-
Defines the capability of a class to be plugged into Nutch.
- Plugin - Class in org.apache.nutch.plugin
-
A nutch-plugin is a container for a set of custom logic that provides extensions to the Nutch core functionality or to another plugin that provides an API for extending.
- Plugin(PluginDescriptor, Configuration) - Constructor for class org.apache.nutch.plugin.Plugin
-
Overloaded constructor
- PluginClassLoader - Class in org.apache.nutch.plugin
-
The PluginClassLoader is a child-first classloader that only contains classes of the runtime libraries set up in the plugin manifest file and exported libraries of plugins that are required.
- PluginClassLoader(URL[], ClassLoader) - Constructor for class org.apache.nutch.plugin.PluginClassLoader
-
Overloaded constructor
- PluginDescriptor - Class in org.apache.nutch.plugin
-
The PluginDescriptor provides access to all meta information of a nutch-plugin, as well as to the internationalizable resources and the plugin's own classloader.
- PluginDescriptor(String, String, String, String, String, String, Configuration) - Constructor for class org.apache.nutch.plugin.PluginDescriptor
-
Overloaded constructor
- PluginManifestParser - Class in org.apache.nutch.plugin
-
The PluginManifestParser provides a mechanism for parsing Nutch plugin manifest files (plugin.xml) contained in a String of plugin directories.
- PluginManifestParser(Configuration, PluginRepository) - Constructor for class org.apache.nutch.plugin.PluginManifestParser
- PluginRepository - Class in org.apache.nutch.plugin
-
The plugin repository is a registry of all plugins.
- PluginRepository(Configuration) - Constructor for class org.apache.nutch.plugin.PluginRepository
- PluginRuntimeException - Exception in org.apache.nutch.plugin
-
PluginRuntimeException is thrown when an exception occurs in the plugin management.
- PluginRuntimeException(String) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
- PluginRuntimeException(Throwable) - Constructor for exception org.apache.nutch.plugin.PluginRuntimeException
- PORT - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- PORT - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- PORT - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- PORTERSTEM_FILTER - org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
- pos - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- PrefixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of prefixes.
- PrefixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings with any prefix in the supplied array.
- PrefixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.PrefixStringMatcher
-
Creates a new PrefixStringMatcher which will match Strings with any prefix in the supplied Collection.
- PrefixURLFilter - Class in org.apache.nutch.urlfilter.prefix
-
Filters URLs based on a file of URL prefixes.
- PrefixURLFilter() - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- PrefixURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
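The prefix-matching entries above (PrefixStringMatcher, PrefixURLFilter) describe matching strings against a set of prefixes. One common way to do this efficiently is a character trie, sketched below. This is a standalone illustration under that assumption, not the Nutch implementation; the class PrefixTrieSketch and its members are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code of org.apache.nutch.util.PrefixStringMatcher.
public class PrefixTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if one of the prefixes ends at this node
    }

    private final Node root = new Node();

    /** Builds a trie holding every supplied prefix. */
    PrefixTrieSketch(List<String> prefixes) {
        for (String prefix : prefixes) {
            Node node = root;
            for (char c : prefix.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
    }

    /** Returns true if the input starts with any stored prefix. */
    boolean matches(String input) {
        Node node = root;
        if (node.terminal) {
            return true; // the empty prefix matches everything
        }
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                return false; // no stored prefix continues with this character
            }
            if (node.terminal) {
                return true; // a complete prefix has been read
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixTrieSketch matcher =
            new PrefixTrieSketch(List.of("https://www.example.org/docs/", "ftp://"));
        System.out.println(matcher.matches("https://www.example.org/docs/index.html")); // prints "true"
        System.out.println(matcher.matches("https://www.example.org/private/"));        // prints "false"
    }
}
```

A lookup walks at most as many nodes as the input has characters, regardless of how many prefixes are stored, which is the reason to prefer a trie over testing each prefix in turn.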
- PrintCommandListener - Class in org.apache.nutch.protocol.ftp
-
This is a support class for logging all ftp command/reply traffic.
- PrintCommandListener(Logger) - Constructor for class org.apache.nutch.protocol.ftp.PrintCommandListener
- PROBLEMATIC_HEADERS - Static variable in class org.apache.nutch.tools.WARCUtils
- process(String, StringBuilder) - Method in class org.apache.nutch.crawl.CrawlDbReader
- process(String, StringBuilder) - Method in class org.apache.nutch.crawl.LinkDbReader
- process(String, StringBuilder) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
- process(String, StringBuilder) - Method in class org.apache.nutch.net.URLFilterChecker
- process(String, StringBuilder) - Method in class org.apache.nutch.net.URLNormalizerChecker
- process(String, StringBuilder) - Method in class org.apache.nutch.parse.ParserChecker
- process(String, StringBuilder) - Method in class org.apache.nutch.util.AbstractChecker
- processDeflateEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
- processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
- processDriver(WebDriver) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
- processDriver(WebDriver) - Method in interface org.apache.nutch.protocol.interactiveselenium.handlers.InteractiveSeleniumHandler
- processDumpJob(String, String, String) - Method in class org.apache.nutch.crawl.LinkDbReader
- processDumpJob(String, String, Configuration, String, String, String, Integer, String, Float) - Method in class org.apache.nutch.crawl.CrawlDbReader
- processGzipEncoded(byte[], URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- processingInstruction(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a processing instruction.
- processSingle(String) - Method in class org.apache.nutch.util.AbstractChecker
- processStatJob(String, Configuration, boolean) - Method in class org.apache.nutch.crawl.CrawlDbReader
- processStdin() - Method in class org.apache.nutch.util.AbstractChecker
- processTCP(int) - Method in class org.apache.nutch.util.AbstractChecker
- processTopNJob(String, long, float, String, Configuration) - Method in class org.apache.nutch.crawl.CrawlDbReader
- PROPOSED - org.apache.nutch.util.domain.DomainSuffix.Status
- PROTO_NOT_FOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
This protocol was not found.
- PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- Protocol - Interface in org.apache.nutch.protocol
-
A retriever of url content.
- PROTOCOL_REDIR - Static variable in class org.apache.nutch.fetcher.Fetcher
- PROTOCOL_STATUS_CODE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- protocolCommandSent(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
- ProtocolException - Exception in org.apache.nutch.net.protocols
-
Deprecated. Use ProtocolException instead.
- ProtocolException - Exception in org.apache.nutch.protocol
- ProtocolException() - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException() - Constructor for exception org.apache.nutch.protocol.ProtocolException
- ProtocolException(String) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String) - Constructor for exception org.apache.nutch.protocol.ProtocolException
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(String, Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.net.protocols.ProtocolException
-
Deprecated.
- ProtocolException(Throwable) - Constructor for exception org.apache.nutch.protocol.ProtocolException
- ProtocolFactory - Class in org.apache.nutch.protocol
-
Creates and caches Protocol plugins.
- ProtocolFactory(Configuration) - Constructor for class org.apache.nutch.protocol.ProtocolFactory
- ProtocolLogUtil - Class in org.apache.nutch.net.protocols
- ProtocolLogUtil() - Constructor for class org.apache.nutch.net.protocols.ProtocolLogUtil
- ProtocolNotFound - Exception in org.apache.nutch.protocol
- ProtocolNotFound(String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
- ProtocolNotFound(String, String) - Constructor for exception org.apache.nutch.protocol.ProtocolNotFound
- ProtocolOutput - Class in org.apache.nutch.protocol
-
Simple aggregate to pass from protocol plugins both content and protocol status.
- ProtocolOutput(Content) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
- ProtocolOutput(Content, ProtocolStatus) - Constructor for class org.apache.nutch.protocol.ProtocolOutput
- protocolReplyReceived(ProtocolCommandEvent) - Method in class org.apache.nutch.protocol.ftp.PrintCommandListener
- ProtocolStatus - Class in org.apache.nutch.protocol
- ProtocolStatus() - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int, Object) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int, Object, long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int, String[]) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(int, String[], long) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatus(Throwable) - Constructor for class org.apache.nutch.protocol.ProtocolStatus
- ProtocolStatusStatistics - Class in org.apache.nutch.util
-
Extracts protocol status code information from the crawl database.
- ProtocolStatusStatistics() - Constructor for class org.apache.nutch.util.ProtocolStatusStatistics
- ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner - Class in org.apache.nutch.util
- ProtocolStatusStatisticsCombiner() - Constructor for class org.apache.nutch.util.ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner
- ProtocolURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.protocol
-
URL normalizer to normalize the protocol for all URLs of a given host or domain, e.g.
- ProtocolURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
- proxyException - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy exception list.
- proxyHost - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy hostname.
- proxyPort - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy port.
- proxyType - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The proxy type.
- PSEUDO_DOMAIN - org.apache.nutch.util.domain.DomainSuffix.Status
- publish(Object, Configuration) - Method in interface org.apache.nutch.publisher.NutchPublisher
-
This method publishes the event.
- publish(Object, Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
- publish(Object, Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
- publish(String, String, RabbitMQMessage) - Method in class org.apache.nutch.rabbitmq.RabbitMQClient
-
Publishes a new message over an exchange.
- publish(FetcherThreadEvent, Configuration) - Method in class org.apache.nutch.fetcher.FetcherThreadPublisher
-
Publish event to all registered publishers
- PUBLISHER - Static variable in interface org.apache.nutch.metadata.DublinCore
-
An entity responsible for making the resource available.
- purgeFailedHostsThreshold - Variable in class org.apache.nutch.hostdb.ResolverThread
- purgeFailedHostsThreshold - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- put(String, FetchNode) - Method in class org.apache.nutch.fetcher.FetchNodeDb
- put(String, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- put(Text, ParseText, ParseData) - Method in class org.apache.nutch.parse.ParseResult
-
Store a result of parsing.
- putAllMetaData(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Add all metadata from other CrawlDatum to this CrawlDatum.
- putAllMetaData(HostDatum) - Method in class org.apache.nutch.hostdb.HostDatum
-
Add all metadata from other HostDatum to this HostDatum.
Q
- query(Map<String, String>, Configuration, String, String) - Method in class org.apache.nutch.crawl.CrawlDbReader
- QuerystringURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.querystring
-
URL normalizer plugin for normalizing query strings by sorting query string parameters.
- QuerystringURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
- queue - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- QUEUE_MODE_DOMAIN - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
- QUEUE_MODE_HOST - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
- QUEUE_MODE_IP - Static variable in class org.apache.nutch.fetcher.FetchItemQueues
- QueueFeeder - Class in org.apache.nutch.fetcher
-
This class feeds the queues with input items, and re-fills them as items are consumed by FetcherThread-s.
- QueueFeeder(Mapper.Context, FetchItemQueues, int) - Constructor for class org.apache.nutch.fetcher.QueueFeeder
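The QuerystringURLNormalizer entry above describes normalizing a URL by sorting its query string parameters. The following is a minimal sketch of that idea only, not the plugin's actual code: the class QuerySortSketch and the method sortQuery are hypothetical names, and the sketch assumes parameters are separated by '&' and that any fragment follows the query.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Nutch.
final class QuerySortSketch {

  // Sorts the '&'-separated parameters of the query string, if there is one.
  static String sortQuery(String url) {
    int q = url.indexOf('?');
    if (q < 0 || q == url.length() - 1) {
      return url; // no query string, or an empty one
    }
    String base = url.substring(0, q);
    String query = url.substring(q + 1);

    // Keep a trailing fragment ("#...") out of the sort (assumption: it follows the query).
    String fragment = "";
    int hash = query.indexOf('#');
    if (hash >= 0) {
      fragment = query.substring(hash);
      query = query.substring(0, hash);
    }

    String[] params = query.split("&");
    Arrays.sort(params); // natural String ordering
    return base + "?" + String.join("&", params) + fragment;
  }

  public static void main(String[] args) {
    // prints "http://example.com/page?a=1&b=2&c=3"
    System.out.println(sortQuery("http://example.com/page?b=2&a=1&c=3"));
  }
}
```

Sorting makes URLs that differ only in parameter order compare equal, which is the usual motivation for this kind of normalization.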
R
- RabbitIndexWriter - Class in org.apache.nutch.indexwriter.rabbit
- RabbitIndexWriter() - Constructor for class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- RabbitMQClient - Class in org.apache.nutch.rabbitmq
-
Client for RabbitMQ
- RabbitMQClient(String) - Constructor for class org.apache.nutch.rabbitmq.RabbitMQClient
-
Builds a new instance of RabbitMQClient
- RabbitMQClient(String, int, String, String, String) - Constructor for class org.apache.nutch.rabbitmq.RabbitMQClient
-
Builds a new instance of RabbitMQClient
- RabbitMQMessage - Class in org.apache.nutch.rabbitmq
- RabbitMQMessage() - Constructor for class org.apache.nutch.rabbitmq.RabbitMQMessage
- RabbitMQPublisherImpl - Class in org.apache.nutch.publisher.rabbitmq
- RabbitMQPublisherImpl() - Constructor for class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
- read(DataInput) - Static method in class org.apache.nutch.crawl.CrawlDatum
- read(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
- read(DataInput) - Static method in class org.apache.nutch.parse.Outlink
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseData
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseImpl
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseStatus
- read(DataInput) - Static method in class org.apache.nutch.parse.ParseText
- read(DataInput) - Static method in class org.apache.nutch.protocol.Content
- read(DataInput) - Static method in class org.apache.nutch.protocol.ProtocolStatus
- read(String) - Method in class org.apache.nutch.service.impl.LinkReader
- read(String) - Method in class org.apache.nutch.service.impl.NodeReader
- read(String) - Method in class org.apache.nutch.service.impl.SequenceReader
- read(String) - Method in interface org.apache.nutch.service.NutchReader
- readConfiguration(Reader) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- readdb(DbQuery) - Method in class org.apache.nutch.service.resources.DbResource
- READDB - org.apache.nutch.service.JobManager.JobType
- Reader() - Constructor for class org.apache.nutch.scoring.webgraph.LinkDumper.Reader
- ReaderConfig - Class in org.apache.nutch.service.model.request
- ReaderConfig() - Constructor for class org.apache.nutch.service.model.request.ReaderConfig
- ReaderResouce - Class in org.apache.nutch.service.resources
-
The Reader endpoint enables a user to read sequence files, nodes and links from the Nutch webgraph.
- ReaderResouce() - Constructor for class org.apache.nutch.service.resources.ReaderResouce
- readFields(DataInput) - Method in class org.apache.nutch.crawl.CrawlDatum
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlink
- readFields(DataInput) - Method in class org.apache.nutch.crawl.Inlinks
- readFields(DataInput) - Method in class org.apache.nutch.hostdb.HostDatum
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchDocument
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchField
- readFields(DataInput) - Method in class org.apache.nutch.indexer.NutchIndexAction
- readFields(DataInput) - Method in class org.apache.nutch.metadata.Metadata
- readFields(DataInput) - Method in class org.apache.nutch.metadata.MetaWrapper
- readFields(DataInput) - Method in class org.apache.nutch.parse.Outlink
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseData
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseImpl
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseStatus
- readFields(DataInput) - Method in class org.apache.nutch.parse.ParseText
- readFields(DataInput) - Method in class org.apache.nutch.protocol.Content
- readFields(DataInput) - Method in class org.apache.nutch.protocol.ProtocolStatus
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- readFields(DataInput) - Method in class org.apache.nutch.scoring.webgraph.Node
- readFields(DataInput) - Method in class org.apache.nutch.util.GenericWritableConfigurable
- readHostDb() - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
- ReadHostDb - Class in org.apache.nutch.hostdb
- ReadHostDb() - Constructor for class org.apache.nutch.hostdb.ReadHostDb
- readingCrawlDb - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- readUrl(String, String, Configuration, StringBuilder) - Method in class org.apache.nutch.crawl.CrawlDbReader
- recheckInterval - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Too many redirects.
- redirectIsQueuedRecently(Text) - Method in class org.apache.nutch.fetcher.FetchItemQueues
- redirPerm - Variable in class org.apache.nutch.hostdb.HostDatum
- redirTemp - Variable in class org.apache.nutch.hostdb.HostDatum
- reduce(K, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- reduce(ByteWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
- reduce(FloatWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
- reduce(FloatWritable, Iterable<Text>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
- reduce(FloatWritable, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
- reduce(Text, Iterable<FloatWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
- reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.CrawlCompletionStats.CrawlCompletionStatsCombiner
- reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.domain.DomainStatistics.DomainStatisticsCombiner
- reduce(Text, Iterable<LongWritable>, Reducer.Context) - Method in class org.apache.nutch.util.ProtocolStatusStatistics.ProtocolStatusStatisticsCombiner
- reduce(Text, Iterable<ObjectWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
- reduce(Text, Iterable<ObjectWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
- reduce(Text, Iterable<Writable>, Reducer.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentReducer
- reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
- reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReducer
- reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
- reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
- reduce(Text, Iterable<CrawlDatum>, Reducer.Context) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
-
Merge the input records of one URL as per rules below :
- reduce(Text, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.PartitionReducer
- reduce(Text, Iterable<Generator.SelectorEntry>, Reducer.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGReducer
- reduce(Text, Iterable<Inlinks>, Reducer.Context) - Method in class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
- reduce(Text, Iterable<NutchWritable>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
- reduce(Text, Iterable<MetaWrapper>, Reducer.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
- reduce(Text, Iterable<LinkDumper.LinkNode>, Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Merger
-
Aggregate all LinkNode objects for a given url.
- regex() - Method in class org.apache.nutch.urlfilter.api.RegexRule
-
Return this rule's regex.
- regexEscape(String) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Escapes any character that needs escaping so it can be used in a regexp.
- regexNormalize(String, String) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
This function does the replacements by iterating through all the regex patterns.
- RegexParseFilter - Class in org.apache.nutch.parsefilter.regex
-
RegexParseFilter.
- RegexParseFilter() - Constructor for class org.apache.nutch.parsefilter.regex.RegexParseFilter
- RegexRule - Class in org.apache.nutch.urlfilter.api
-
A generic regular expression rule.
- RegexRule(boolean, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
-
Constructs a new regular expression rule.
- RegexRule(boolean, String, String) - Constructor for class org.apache.nutch.urlfilter.api.RegexRule
-
Constructs a new regular expression rule.
- RegexURLFilter - Class in org.apache.nutch.urlfilter.regex
-
Filters URLs based on a file of regular expressions using the Java Regex implementation.
- RegexURLFilter() - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
- RegexURLFilter(String) - Constructor for class org.apache.nutch.urlfilter.regex.RegexURLFilter
- RegexURLFilterBase - Class in org.apache.nutch.urlfilter.api
-
Generic URLFilter based on regular expressions.
- RegexURLFilterBase() - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new empty RegexURLFilterBase
- RegexURLFilterBase(File) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a file of rules.
- RegexURLFilterBase(Reader) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- RegexURLFilterBase(String) - Constructor for class org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
Constructs a new RegexURLFilter and initializes it with a list of rules.
- RegexURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.regex
-
Allows users to do regex substitutions on all/any URLs that are encountered, which is useful for stripping session IDs from URLs.
- RegexURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
The default constructor, which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in the method getNormalizer().
- RegexURLNormalizer(Configuration) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
- RegexURLNormalizer(Configuration, String) - Constructor for class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
-
Constructor which can be passed the configuration file name, so it doesn't look in other configuration files for it.
- REGION - Static variable in interface org.apache.nutch.indexwriter.cloudsearch.CloudSearchConstants
- registerPluginRepository(PluginRepository) - Method in class org.apache.nutch.plugin.URLStreamHandlerFactory
-
Use this method once a new PluginRepository was created to register it.
- REJECTED - org.apache.nutch.util.domain.DomainSuffix.Status
- REL_TAG - Static variable in class org.apache.nutch.microformats.reltag.RelTagParser
- RELATION - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a related resource.
- reloadRules() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
- RelTagIndexingFilter - Class in org.apache.nutch.microformats.reltag
-
An IndexingFilter that adds tag field(s) to the document.
- RelTagIndexingFilter() - Constructor for class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
- RelTagParser - Class in org.apache.nutch.microformats.reltag
-
Adds microformat rel-tags of document if found.
- RelTagParser() - Constructor for class org.apache.nutch.microformats.reltag.RelTagParser
- remove(String) - Method in class org.apache.nutch.metadata.Metadata
-
Remove a metadata and all its associated values.
- remove(String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
- removeField(String) - Method in class org.apache.nutch.indexer.NutchDocument
- removeLockFile(Configuration, Path) - Static method in class org.apache.nutch.util.LockUtil
-
Remove lock file.
- removeLockFile(FileSystem, Path) - Static method in class org.apache.nutch.util.LockUtil
-
Remove lock file.
- replace(String) - Method in class org.apache.nutch.indexer.replace.FieldReplacer
-
Return the replacement value for a field value.
- replace(FileSystem, Path, Path, boolean) - Static method in class org.apache.nutch.util.FSUtils
-
Replaces the current path with the new path and if set removes the old path.
- replacefirstoccuranceof(String, String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
- replaceHost(String, String, String) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
- ReplaceIndexer - Class in org.apache.nutch.indexer.replace
-
Do pattern replacements on selected field contents prior to indexing.
- ReplaceIndexer() - Constructor for class org.apache.nutch.indexer.replace.ReplaceIndexer
- REPORT - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
- REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- reprUrl - Variable in class org.apache.nutch.hostdb.UpdateHostDbMapper
- REQUEST - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold the HTTP request if store.http.request is true
- reset() - Method in class org.apache.nutch.indexer.NutchField
- reset() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets all boolean values to false.
- resetFailures() - Method in class org.apache.nutch.hostdb.HostDatum
- resetStatistics() - Method in class org.apache.nutch.hostdb.HostDatum
- resolveEncodingAlias(String) - Static method in class org.apache.nutch.util.EncodingDetector
- resolverThread - Variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- ResolverThread - Class in org.apache.nutch.hostdb
-
Simple runnable that performs DNS lookup for a single host.
- ResolverThread(String, HostDatum, Reducer.Context, int) - Constructor for class org.apache.nutch.hostdb.ResolverThread
-
Overloaded constructor.
- resolveURL(URL, String) - Static method in class org.apache.nutch.util.URLUtil
-
Resolve relative URL-s and fix a java.net.URL error in handling of URLs with pure query targets.
- resolveUrls() - Method in class org.apache.nutch.tools.ResolveUrls
-
Creates a thread pool for resolving urls.
- ResolveUrls - Class in org.apache.nutch.tools
-
A simple tool that will spin up multiple threads to resolve urls to ip addresses.
- ResolveUrls(String) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a file from the local file system.
- ResolveUrls(String, int) - Constructor for class org.apache.nutch.tools.ResolveUrls
-
Create a new ResolveUrls with a urls file and a number of threads for the Thread pool.
- Response - Interface in org.apache.nutch.net.protocols
-
A response interface.
- RESPONSE_HEADERS - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold the HTTP response header if store.http.headers is true
- RESPONSE_TIME - Static variable in class org.apache.nutch.protocol.http.api.HttpBase
- Response.TruncatedContentReason - Enum in org.apache.nutch.net.protocols
- responseTime - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Record response time in CrawlDatum's meta data, see property http.store.responsetime.
- results - Variable in class org.apache.nutch.util.NutchTool
- retrieveFile(String, OutputStream, int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Retrieve file for path
- retrieveList(String, List<FTPFile>, int, FTPFileEntryParser) - Method in class org.apache.nutch.protocol.ftp.Client
-
Retrieve list reply for path
- retrieveNgrams(Configuration) - Static method in class org.apache.nutch.scoring.similarity.cosine.Model
-
Retrieves mingram and maxgram from configuration
- RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Temporary failure.
- reverseHost(String) - Static method in class org.apache.nutch.util.TableUtil
- reverseKey - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- reverseKeyValue - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- reverseUrl(String) - Static method in class org.apache.nutch.tools.CommonCrawlDataDumper
- reverseUrl(String) - Static method in class org.apache.nutch.util.TableUtil
-
Reverses a url's domain.
- reverseUrl(URL) - Static method in class org.apache.nutch.util.TableUtil
-
Reverses a url's domain.
- rightPad(String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Returns a copy of s (right padded) with trailing spaces so that its length is length.
- RIGHTS - Static variable in interface org.apache.nutch.metadata.DublinCore
-
Information about rights held in and over the resource.
- RobotRulesParser - Class in org.apache.nutch.protocol
-
This class uses crawler-commons for handling the parsing of robots.txt files.
- RobotRulesParser() - Constructor for class org.apache.nutch.protocol.RobotRulesParser
- RobotRulesParser(Configuration) - Constructor for class org.apache.nutch.protocol.RobotRulesParser
- ROBOTS - Static variable in class org.apache.nutch.tools.WARCUtils
- ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Access denied by robots.txt rules.
- ROBOTS_METATAG - Static variable in interface org.apache.nutch.metadata.Nutch
-
Name to store the robots metatag in ParseData's metadata.
- root - Variable in class org.apache.nutch.util.TrieStringMatcher
- Rule(String) - Constructor for class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
- run() - Method in class org.apache.nutch.fetcher.FetcherThread
- run() - Method in class org.apache.nutch.fetcher.QueueFeeder
- run() - Method in class org.apache.nutch.hostdb.ResolverThread
- run() - Method in class org.apache.nutch.service.impl.JobWorker
- run() - Method in class org.apache.nutch.service.impl.ServiceWorker
- run() - Method in class org.apache.nutch.util.AbstractChecker
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDb
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbMerger
- run(String[]) - Method in class org.apache.nutch.crawl.CrawlDbReader
- run(String[]) - Method in class org.apache.nutch.crawl.DeduplicationJob
- run(String[]) - Method in class org.apache.nutch.crawl.Generator
- run(String[]) - Method in class org.apache.nutch.crawl.Injector
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDb
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbMerger
- run(String[]) - Method in class org.apache.nutch.crawl.LinkDbReader
- run(String[]) - Method in class org.apache.nutch.fetcher.Fetcher
- run(String[]) - Method in class org.apache.nutch.hostdb.ReadHostDb
- run(String[]) - Method in class org.apache.nutch.hostdb.UpdateHostDb
- run(String[]) - Method in class org.apache.nutch.indexer.CleaningJob
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingFiltersChecker
- run(String[]) - Method in class org.apache.nutch.indexer.IndexingJob
- run(String[]) - Method in class org.apache.nutch.net.URLFilterChecker
- run(String[]) - Method in class org.apache.nutch.net.URLNormalizerChecker
- run(String[]) - Method in class org.apache.nutch.parse.ParserChecker
- run(String[]) - Method in class org.apache.nutch.parse.ParseSegment
- run(String[]) - Method in class org.apache.nutch.protocol.RobotRulesParser
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper
-
Runs the LinkDumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.LinkRank
-
Runs the LinkRank tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper
-
Runs the node dumper tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Runs the ScoreUpdater tool.
- run(String[]) - Method in class org.apache.nutch.scoring.webgraph.WebGraph
-
Parses command line arguments and runs the WebGraph jobs.
- run(String[]) - Method in class org.apache.nutch.segment.SegmentMerger
-
Run this tool
- run(String[]) - Method in class org.apache.nutch.segment.SegmentReader
- run(String[]) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator
- run(String[]) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
- run(String[]) - Method in class org.apache.nutch.tools.FreeGenerator
- run(String[]) - Method in class org.apache.nutch.tools.ShowProperties
- run(String[]) - Method in class org.apache.nutch.tools.warc.WARCExporter
- run(String[]) - Method in class org.apache.nutch.util.CrawlCompletionStats
- run(String[]) - Method in class org.apache.nutch.util.domain.DomainStatistics
- run(String[]) - Method in class org.apache.nutch.util.ProtocolStatusStatistics
- run(String[]) - Method in class org.apache.nutch.util.SitemapProcessor
- run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.CrawlDb
- run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.DeduplicationJob
- run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.Generator
- run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.Injector
-
Used by the Nutch REST service
- run(Map<String, Object>, String) - Method in class org.apache.nutch.crawl.LinkDb
- run(Map<String, Object>, String) - Method in class org.apache.nutch.fetcher.Fetcher
- run(Map<String, Object>, String) - Method in class org.apache.nutch.indexer.IndexingJob
- run(Map<String, Object>, String) - Method in class org.apache.nutch.parse.ParseSegment
- run(Map<String, Object>, String) - Method in class org.apache.nutch.tools.CommonCrawlDataDumper
-
Used by the REST service
- run(Map<String, Object>, String) - Method in class org.apache.nutch.util.NutchTool
-
Runs the tool, using a map of arguments.
- run(Mapper.Context) - Method in class org.apache.nutch.fetcher.Fetcher.FetcherRun
- RUNNING - org.apache.nutch.service.model.response.JobInfo.State
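The rightPad(String, int) entry above describes returning a copy of a string padded with trailing spaces up to a given length. A minimal sketch of that behavior follows. It is an illustration, not the StringUtil source: the class name RightPadSketch is hypothetical, and returning the input unchanged when it is already at least as long as the target length is an assumption.

```java
// Hypothetical illustration, not part of Nutch.
final class RightPadSketch {

  // Appends spaces to s until it is `length` characters long.
  // Assumption: a string that is already long enough is returned unchanged.
  static String rightPad(String s, int length) {
    StringBuilder sb = new StringBuilder(s);
    while (sb.length() < length) {
      sb.append(' ');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // prints "[abc   ]"
    System.out.println("[" + rightPad("abc", 6) + "]");
  }
}
```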
S
- save() - Method in class org.apache.nutch.collection.CollectionManager
-
Save collections into file
- saveDom(OutputStream, DocumentFragment) - Static method in class org.apache.nutch.util.DomUtil
-
Save dom into OutputStream
- saveDom(OutputStream, Element) - Static method in class org.apache.nutch.util.DomUtil
-
Save dom into OutputStream
- SCHEDULE_DEC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- SCHEDULE_INC_RATE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- SCHEDULE_MIME_FILE - Static variable in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- SCHEME - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- SCHEME - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- SCOPE_CRAWLDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the CrawlDb with new URLs.
- SCOPE_DEFAULT - Static variable in class org.apache.nutch.net.URLNormalizers
-
Default scope.
- SCOPE_FETCHER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by Fetcher when processing redirect URLs.
- SCOPE_GENERATE_HOST_COUNT - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by Generator.
- SCOPE_INDEXER - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when indexing URLs.
- SCOPE_INJECT - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by Injector.
- SCOPE_LINKDB - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when updating the LinkDb with new URLs.
- SCOPE_OUTLINK - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used when constructing new Outlink instances.
- SCOPE_PARTITION - Static variable in class org.apache.nutch.net.URLNormalizers
-
Scope used by URLPartitioner.
- score - Variable in class org.apache.nutch.hostdb.HostDatum
- SCORE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- ScoreUpdater - Class in org.apache.nutch.scoring.webgraph
-
Updates the score from the WebGraph node database into the crawl database.
- ScoreUpdater() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater
- ScoreUpdater.ScoreUpdaterMapper - Class in org.apache.nutch.scoring.webgraph
-
Changes input into ObjectWritables.
- ScoreUpdater.ScoreUpdaterReducer - Class in org.apache.nutch.scoring.webgraph
-
Creates new CrawlDatum objects with the updated score from the NodeDb or with a cleared score.
- ScoreUpdaterMapper() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterMapper
- ScoreUpdaterReducer() - Constructor for class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
- ScoringFilter - Interface in org.apache.nutch.scoring
-
A contract defining behavior of scoring plugins.
- ScoringFilterException - Exception in org.apache.nutch.scoring
-
Specialized exception for errors during scoring.
- ScoringFilterException() - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
- ScoringFilterException(String) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
- ScoringFilterException(String, Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
- ScoringFilterException(Throwable) - Constructor for exception org.apache.nutch.scoring.ScoringFilterException
- ScoringFilters - Class in org.apache.nutch.scoring
-
Creates and caches ScoringFilter implementing plugins.
- ScoringFilters(Configuration) - Constructor for class org.apache.nutch.scoring.ScoringFilters
- sdf - Static variable in class org.apache.nutch.util.SitemapProcessor
- SECONDS_PER_DAY - Static variable in interface org.apache.nutch.crawl.FetchSchedule
- secondsToDaysHMS(long) - Static method in class org.apache.nutch.util.TimingUtil
-
Show time in seconds as days, hours, minutes and seconds (d days, hh:mm:ss)
- secondsToHMS(long) - Static method in class org.apache.nutch.util.TimingUtil
-
Show time in seconds as hours, minutes and seconds (hh:mm:ss)
- SeedList - Class in org.apache.nutch.service.model.request
- SeedList() - Constructor for class org.apache.nutch.service.model.request.SeedList
- SeedManager - Interface in org.apache.nutch.service
- SeedManagerImpl - Class in org.apache.nutch.service.impl
- SeedManagerImpl() - Constructor for class org.apache.nutch.service.impl.SeedManagerImpl
- SeedResource - Class in org.apache.nutch.service.resources
- SeedResource() - Constructor for class org.apache.nutch.service.resources.SeedResource
- SeedUrl - Class in org.apache.nutch.service.model.request
- SeedUrl() - Constructor for class org.apache.nutch.service.model.request.SeedUrl
- SeedUrl(String) - Constructor for class org.apache.nutch.service.model.request.SeedUrl
- SEGMENT_NAME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- SegmentChecker - Class in org.apache.nutch.segment
-
Checks whether a segment is valid, or has a certain status (generated, fetched, parsed), or can be used safely for a certain processing step (e.g., indexing).
- SegmentChecker() - Constructor for class org.apache.nutch.segment.SegmentChecker
- SegmentMergeFilter - Interface in org.apache.nutch.segment
-
Interface used to filter segments during segment merge.
- SegmentMergeFilters - Class in org.apache.nutch.segment
-
This class wraps all SegmentMergeFilter extensions in a single object so it is easier to operate on them.
- SegmentMergeFilters(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMergeFilters
- SegmentMerger - Class in org.apache.nutch.segment
-
This tool takes several segments and merges their data together.
- SegmentMerger() - Constructor for class org.apache.nutch.segment.SegmentMerger
- SegmentMerger(Configuration) - Constructor for class org.apache.nutch.segment.SegmentMerger
- SegmentMerger.ObjectInputFormat - Class in org.apache.nutch.segment
-
Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata.
- SegmentMerger.SegmentMergerMapper - Class in org.apache.nutch.segment
- SegmentMerger.SegmentMergerReducer - Class in org.apache.nutch.segment
-
NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information).
- SegmentMerger.SegmentOutputFormat - Class in org.apache.nutch.segment
- SegmentMergerMapper() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
- SegmentMergerReducer() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
- segmentName - Variable in class org.apache.nutch.segment.SegmentPart
-
Name of the segment (just the last path component).
- SegmentOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentMerger.SegmentOutputFormat
- SegmentPart - Class in org.apache.nutch.segment
-
Utility class for handling information about segment parts.
- SegmentPart() - Constructor for class org.apache.nutch.segment.SegmentPart
- SegmentPart(String, String) - Constructor for class org.apache.nutch.segment.SegmentPart
- SegmentReader - Class in org.apache.nutch.segment
-
Dump the content of a segment.
- SegmentReader() - Constructor for class org.apache.nutch.segment.SegmentReader
- SegmentReader.InputCompatMapper - Class in org.apache.nutch.segment
- SegmentReader.InputCompatReducer - Class in org.apache.nutch.segment
- SegmentReader.SegmentReaderStats - Class in org.apache.nutch.segment
- SegmentReader.TextOutputFormat - Class in org.apache.nutch.segment
-
Implements a text output format
- SegmentReaderStats() - Constructor for class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- SegmentReaderUtil - Class in org.apache.nutch.util
- SegmentReaderUtil() - Constructor for class org.apache.nutch.util.SegmentReaderUtil
- segnum - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
- Selector() - Constructor for class org.apache.nutch.crawl.Generator.Selector
- SelectorEntry() - Constructor for class org.apache.nutch.crawl.Generator.SelectorEntry
- SelectorInverseMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorInverseMapper
- SelectorMapper() - Constructor for class org.apache.nutch.crawl.Generator.SelectorMapper
- SelectorReducer() - Constructor for class org.apache.nutch.crawl.Generator.SelectorReducer
- sendNoOp() - Method in class org.apache.nutch.protocol.ftp.Client
-
Sends a NOOP command to the FTP server.
- Separator(String) - Constructor for class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- sepStr - Variable in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- seqRead(ReaderConfig, int, int, int, boolean) - Method in class org.apache.nutch.service.resources.ReaderResouce
-
Read a sequence file
- SequenceReader - Class in org.apache.nutch.service.impl
-
Enables reading a sequence file; its methods provide different ways to read the file.
- SequenceReader() - Constructor for class org.apache.nutch.service.impl.SequenceReader
- serialize(Writable, JsonGenerator, SerializerProvider) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
- server - Variable in class org.apache.nutch.service.resources.AbstractResource
- SERVER_TYPE - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- SERVER_URLS - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- ServiceConfig - Class in org.apache.nutch.service.model.request
- ServiceConfig() - Constructor for class org.apache.nutch.service.model.request.ServiceConfig
- ServiceInfo - Class in org.apache.nutch.service.model.response
- ServiceInfo() - Constructor for class org.apache.nutch.service.model.response.ServiceInfo
- ServicesResource - Class in org.apache.nutch.service.resources
-
The services resource defines an endpoint to enable the user to carry out Nutch jobs like dump, commoncrawldump, etc.
- ServicesResource() - Constructor for class org.apache.nutch.service.resources.ServicesResource
- ServiceWorker - Class in org.apache.nutch.service.impl
- ServiceWorker(ServiceConfig, NutchTool) - Constructor for class org.apache.nutch.service.impl.ServiceWorker
- set(String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- set(String, String) - Method in class org.apache.nutch.metadata.Metadata
-
Set metadata name/value.
- set(String, String) - Method in class org.apache.nutch.metadata.SpellCheckedMetadata
- set(CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Copy the contents of another instance into this instance.
- setAdditionalPostHeaders(Map<String, String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setAll(Properties) - Method in class org.apache.nutch.metadata.Metadata
-
Copy all key-value pairs from properties.
- setAnchor(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- setArgs(String[]) - Method in class org.apache.nutch.parse.ParseStatus
- setArgs(String[]) - Method in class org.apache.nutch.protocol.ProtocolStatus
- setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.request.JobConfig
- setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.request.ServiceConfig
- setArgs(Map<String, Object>) - Method in class org.apache.nutch.service.model.response.JobInfo
- setArgs(Map<String, String>) - Method in class org.apache.nutch.service.model.request.DbQuery
- setBaseHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the baseHref.
- setBlackList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of blacklist from String
- setBody(byte[]) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- setCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noCache to false.
- setChildNodes(Outlink[]) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- setChildren(List<FetchNodeDbInfo.ChildNode>) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- setClazz(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the Class that implements the concrete extension and is only used until model creation at system start up.
- setCode(int) - Method in class org.apache.nutch.protocol.ProtocolStatus
- setCommand(String) - Method in class org.apache.nutch.util.CommandRunner
- setCompressed(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.HTMLLanguageParser
- setConf(Configuration) - Method in class org.apache.nutch.analysis.lang.LanguageIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
- setConf(Configuration) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
- setConf(Configuration) - Method in class org.apache.nutch.crawl.Generator.Selector
- setConf(Configuration) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- setConf(Configuration) - Method in class org.apache.nutch.crawl.Signature
- setConf(Configuration) - Method in class org.apache.nutch.crawl.TextProfileSignature
- setConf(Configuration) - Method in class org.apache.nutch.crawl.URLPartitioner
- setConf(Configuration) - Method in class org.apache.nutch.exchange.jexl.JexlExchange
- setConf(Configuration) - Method in class org.apache.nutch.indexer.anchor.AnchorIndexingFilter
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.basic.BasicIndexingFilter
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.CleaningJob
- setConf(Configuration) - Method in class org.apache.nutch.indexer.feed.FeedIndexingFilter
-
Sets the Configuration object used to configure this IndexingFilter.
- setConf(Configuration) - Method in class org.apache.nutch.indexer.filter.MimeTypeIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.jexl.JexlIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.links.LinksIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.metadata.MetadataIndexer
- setConf(Configuration) - Method in class org.apache.nutch.indexer.more.MoreIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.replace.ReplaceIndexer
- setConf(Configuration) - Method in class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.tld.TLDIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
-
Handles conf assignment and pulls the value assignment from the "urlmeta.tags" property.
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagIndexingFilter
- setConf(Configuration) - Method in class org.apache.nutch.microformats.reltag.RelTagParser
- setConf(Configuration) - Method in class org.apache.nutch.net.protocols.ProtocolLogUtil
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.host.HostURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.querystring.QuerystringURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
- setConf(Configuration) - Method in class org.apache.nutch.parse.ext.ExtParser
- setConf(Configuration) - Method in class org.apache.nutch.parse.feed.FeedParser
-
Sets the Configuration object for this Parser.
- setConf(Configuration) - Method in class org.apache.nutch.parse.headings.HeadingsParseFilter
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.DOMContentUtils
- setConf(Configuration) - Method in class org.apache.nutch.parse.html.HtmlParser
- setConf(Configuration) - Method in class org.apache.nutch.parse.js.JSParseFilter
- setConf(Configuration) - Method in class org.apache.nutch.parse.metatags.MetaTagsParser
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.DOMContentUtils
- setConf(Configuration) - Method in class org.apache.nutch.parse.tika.TikaParser
- setConf(Configuration) - Method in class org.apache.nutch.parse.zip.ZipParser
- setConf(Configuration) - Method in class org.apache.nutch.parsefilter.debug.DebugParseFilter
- setConf(Configuration) - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- setConf(Configuration) - Method in class org.apache.nutch.parsefilter.regex.RegexParseFilter
- setConf(Configuration) - Method in class org.apache.nutch.protocol.file.File
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.protocol.htmlunit.Http
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.api.HttpRobotRulesParser
- setConf(Configuration) - Method in class org.apache.nutch.protocol.http.Http
-
Set the Configuration object.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.Http
-
Reads the configuration from the Nutch configuration files and sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
- setConf(Configuration) - Method in class org.apache.nutch.protocol.httpclient.HttpBasicAuthentication
- setConf(Configuration) - Method in class org.apache.nutch.protocol.interactiveselenium.Http
- setConf(Configuration) - Method in class org.apache.nutch.protocol.okhttp.OkHttp
- setConf(Configuration) - Method in class org.apache.nutch.protocol.RobotRulesParser
-
Set the Configuration object
- setConf(Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
- setConf(Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
- setConf(Configuration) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.metadata.MetadataScoringFilter
-
Handles conf assignment and pulls the value assignment from the "scoring.db.md", "scoring.content.md" and "scoring.parse.md" properties.
- setConf(Configuration) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
- setConf(Configuration) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
- setConf(Configuration) - Method in class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
- setConf(Configuration) - Method in class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
-
Handles conf assignment and pulls the value assignment from the "urlmeta.tags" property.
- setConf(Configuration) - Method in class org.apache.nutch.segment.SegmentMerger
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.api.RegexURLFilterBase
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domain.DomainURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.domaindenylist.DomainDenylistURLFilter
-
Sets the configuration.
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.prefix.PrefixURLFilter
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- setConf(Configuration) - Method in class org.apache.nutch.urlfilter.validator.UrlValidator
- setConf(Configuration) - Method in class org.apache.nutch.util.GenericWritableConfigurable
- setConf(Configuration) - Method in class org.apache.nutch.util.NutchTool
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCIndexingFilter
- setConf(Configuration) - Method in class org.creativecommons.nutch.CCParseFilter
- setConfId(String) - Method in class org.apache.nutch.service.model.request.DbQuery
- setConfId(String) - Method in class org.apache.nutch.service.model.request.JobConfig
- setConfId(String) - Method in class org.apache.nutch.service.model.request.ServiceConfig
- setConfId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
- setConfig(Configuration) - Method in interface org.apache.nutch.publisher.NutchPublisher
-
Use implementation specific configurations
- setConfig(Configuration) - Method in class org.apache.nutch.publisher.NutchPublishers
- setConfig(Configuration) - Method in class org.apache.nutch.publisher.rabbitmq.RabbitMQPublisherImpl
- setConfigId(String) - Method in class org.apache.nutch.service.model.request.NutchConfig
- setConfiguration(Set<String>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- setConnectionFailures(Long) - Method in class org.apache.nutch.hostdb.HostDatum
- setContent(byte[]) - Method in class org.apache.nutch.protocol.Content
- setContent(Content) - Method in class org.apache.nutch.protocol.ProtocolOutput
- setContentType(String) - Method in class org.apache.nutch.protocol.Content
- setContentType(String) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- setCookie(Text) - Method in class org.apache.nutch.fetcher.FetchItemQueue
- setCookiePolicy(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setCookies(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- setCrawlId(String) - Method in class org.apache.nutch.service.model.request.DbQuery
- setCrawlId(String) - Method in class org.apache.nutch.service.model.request.JobConfig
- setCrawlId(String) - Method in class org.apache.nutch.service.model.request.ServiceConfig
- setCrawlId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
- setDataTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the timeout in milliseconds to use for data connection.
- setDescriptor(PluginDescriptor) - Method in class org.apache.nutch.plugin.Extension
-
Sets the plugin descriptor and is only used until model creation at system start up.
- setDnsFailures(Long) - Method in class org.apache.nutch.hostdb.HostDatum
- setDocumentLocator(Locator) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive an object for locating the origin of SAX document events.
- setDumpPaths(List<String>) - Method in class org.apache.nutch.service.model.response.ServiceInfo
- setEventData(Map<String, Object>) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Set metadata to this event
- setEventType(FetcherThreadEvent.PublishEventType) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Set event type of this object
- setFetched(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setFetchInterval(float) - Method in class org.apache.nutch.crawl.CrawlDatum
- setFetchInterval(int) - Method in class org.apache.nutch.crawl.CrawlDatum
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.AdaptiveFetchSchedule
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.DefaultFetchSchedule
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
Sets the fetchInterval and fetchTime on a successfully fetched page.
- setFetchSchedule(Text, CrawlDatum, long, long, long, long, int) - Method in class org.apache.nutch.crawl.MimeAdaptiveFetchSchedule
- setFetchTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
-
Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
- setFetchTime(long) - Method in class org.apache.nutch.fetcher.FetchNode
- setFileType(int) - Method in class org.apache.nutch.protocol.ftp.Client
-
Sets the file type to be transferred.
- setFilterFromPath(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- setFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noFollow to false.
- setFollowTalk(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set followTalk i.e.
- setForce(boolean) - Method in class org.apache.nutch.service.model.request.NutchConfig
- setFromConf(IndexWriterParams, String) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- setFromConf(IndexWriterParams, String, boolean) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- setGone(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setHalted(boolean) - Method in class org.apache.nutch.fetcher.FetcherThread
- setHeaders(String) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- setHeaders(Map<String, Object>) - Method in class org.apache.nutch.rabbitmq.RabbitMQMessage
- setHomepageUrl(String) - Method in class org.apache.nutch.hostdb.HostDatum
- setId(Long) - Method in class org.apache.nutch.service.model.request.SeedList
- setId(Long) - Method in class org.apache.nutch.service.model.request.SeedUrl
- setId(String) - Method in class org.apache.nutch.plugin.Extension
-
Sets the unique extension Id and is only used until model creation at system start up.
- setId(String) - Method in class org.apache.nutch.service.model.response.JobInfo
- setIDAttribute(String, Element) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Set an ID string to node association in the ID table.
- setIgnoreCase(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- setIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noIndex to false.
- setIndexedConf(Configuration, int) - Method in class org.apache.nutch.indexer.arbitrary.ArbitraryIndexingFilter
-
Set the Configuration object for a specific set of values in the config
- setInfo(JobInfo) - Method in class org.apache.nutch.service.impl.JobWorker
- setInLinks(List<String>) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- setInLinks(List<String>) - Method in interface org.apache.nutch.tools.CommonCrawlFormat
-
Sets the inlinks of this document.
- setInlinkScore(float) - Method in class org.apache.nutch.scoring.webgraph.Node
- setInputStream(InputStream) - Method in class org.apache.nutch.util.CommandRunner
- setJobClassName(String) - Method in class org.apache.nutch.service.model.request.JobConfig
- setJobs(Collection<JobInfo>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- setJsonArray(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setKeepConnection(boolean) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Whether to keep ftp connection.
- setKeyPrefix(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setLastCheck() - Method in class org.apache.nutch.hostdb.HostDatum
- setLastCheck(Date) - Method in class org.apache.nutch.hostdb.HostDatum
- setLastModified(long) - Method in class org.apache.nutch.protocol.ProtocolStatus
- setLinks(LinkDumper.LinkNode[]) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- setLinkType(byte) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- setLoginFormId(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setLoginPostData(Map<String, String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setLoginRedirect(boolean) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setLoginUrl(String) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setMajorCode(byte) - Method in class org.apache.nutch.parse.ParseStatus
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.file.File
-
Set the length at which content is truncated.
- setMaxContentLength(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the length at which content is truncated.
- setMessage(String) - Method in class org.apache.nutch.parse.ParseStatus
- setMessage(String) - Method in class org.apache.nutch.protocol.ProtocolStatus
- setMeta(String, String) - Method in class org.apache.nutch.metadata.MetaWrapper
-
Set metadata.
- setMetadata(MapWritable) - Method in class org.apache.nutch.parse.Outlink
- setMetadata(Metadata) - Method in class org.apache.nutch.protocol.Content
-
Other protocol-specific data.
- setMetadata(Metadata) - Method in class org.apache.nutch.scoring.webgraph.Node
- setMetaData(MapWritable) - Method in class org.apache.nutch.crawl.CrawlDatum
- setMetaData(MapWritable) - Method in class org.apache.nutch.hostdb.HostDatum
- setMinorCode(short) - Method in class org.apache.nutch.parse.ParseStatus
- setModeAccept(boolean) - Method in class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- setModifiedTime(long) - Method in class org.apache.nutch.crawl.CrawlDatum
- setMsg(String) - Method in class org.apache.nutch.service.model.response.JobInfo
- setName(String) - Method in class org.apache.nutch.service.model.request.SeedList
- setNoCache() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noCache to true.
- setNode(Node) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- setNoFollow() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noFollow to true.
- setNoIndex() - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets noIndex to true.
- setNotModified(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setNumInlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
- setNumOfOutlinks(int) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- setNumOutlinks(int) - Method in class org.apache.nutch.scoring.webgraph.Node
- setObject(String, Object) - Method in class org.apache.nutch.util.ObjectCache
- setOutlinks(Outlink[]) - Method in class org.apache.nutch.fetcher.FetchNode
- setOutlinks(Outlink[]) - Method in class org.apache.nutch.parse.ParseData
- setOutputDir(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method specifies how to schedule refetching of pages marked as GONE.
- setPageGoneSchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method specifies how to schedule refetching of pages marked as GONE.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
- setPageRetrySchedule(Text, CrawlDatum, long, long, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
- setParams(Map<String, String>) - Method in class org.apache.nutch.service.model.request.NutchConfig
- setParseMeta(Metadata) - Method in class org.apache.nutch.parse.ParseData
- setPath(String) - Method in class org.apache.nutch.service.model.request.ReaderConfig
- setPoolSize(int) - Static method in class org.apache.nutch.util.MimeUtil
- setPort(int) - Static method in class org.apache.nutch.service.NutchServer
- setProperty(String, String, String) - Method in interface org.apache.nutch.service.ConfManager
- setProperty(String, String, String) - Method in class org.apache.nutch.service.impl.ConfManagerImpl
-
Sets the given property in the configuration associated with the confId
- setReason(Response.TruncatedContentReason) - Method in class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
- setRedirect(boolean) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthentication
- setRedirPerm(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setRedirTemp(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setRefresh(boolean) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets refresh to the supplied value.
- setRefreshHref(URL) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshHref.
- setRefreshTime(int) - Method in class org.apache.nutch.parse.HTMLMetaTags
-
Sets the refreshTime.
- setRemoteVerificationEnabled(boolean) - Method in class org.apache.nutch.protocol.ftp.Client
-
Enable or disable verification that the remote host taking part in a data connection is the same as the host to which the control connection is attached.
- setRemovedFormFields(Set<String>) - Method in class org.apache.nutch.protocol.httpclient.HttpFormAuthConfigurer
- setResult(Map<String, Object>) - Method in class org.apache.nutch.service.model.response.JobInfo
- setRetriesSinceFetch(int) - Method in class org.apache.nutch.crawl.CrawlDatum
- setReverseKey(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setReverseKeyValue(String) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setRunningJobs(Collection<JobInfo>) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- setScore(float) - Method in class org.apache.nutch.crawl.CrawlDatum
- setScore(float) - Method in class org.apache.nutch.hostdb.HostDatum
- setScore(float) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- setSeedFilePath(String) - Method in class org.apache.nutch.service.model.request.SeedList
- setSeedList(String, SeedList) - Method in class org.apache.nutch.service.impl.SeedManagerImpl
- setSeedList(String, SeedList) - Method in interface org.apache.nutch.service.SeedManager
- setSeedList(SeedList) - Method in class org.apache.nutch.service.model.request.SeedUrl
- setSeedUrls(Collection<SeedUrl>) - Method in class org.apache.nutch.service.model.request.SeedList
- setSignature(byte[]) - Method in class org.apache.nutch.crawl.CrawlDatum
- setSimpleDateFormat(boolean) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setStartDate(Date) - Method in class org.apache.nutch.service.model.response.NutchServerInfo
- setState(JobInfo.State) - Method in class org.apache.nutch.service.model.response.JobInfo
- setStatus(int) - Method in class org.apache.nutch.crawl.CrawlDatum
- setStatus(int) - Method in class org.apache.nutch.fetcher.FetchNode
- setStatus(int) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- setStatus(ProtocolStatus) - Method in class org.apache.nutch.protocol.ProtocolOutput
- setStdErrorStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
- setStdOutputStream(OutputStream) - Method in class org.apache.nutch.util.CommandRunner
- setTermFreqVector(HashMap<String, Integer>) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
- setTimeLimit(long) - Method in class org.apache.nutch.fetcher.QueueFeeder
- setTimeout(int) - Method in class org.apache.nutch.protocol.ftp.Ftp
-
Set the timeout.
- setTimeout(int) - Method in class org.apache.nutch.util.CommandRunner
- setTimestamp(long) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- setTimestamp(Long) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Set timestamp for this event
- setTitle(String) - Method in class org.apache.nutch.fetcher.FetchNode
- setType(String) - Method in class org.apache.nutch.service.model.request.DbQuery
- setType(JobManager.JobType) - Method in class org.apache.nutch.service.model.request.JobConfig
- setType(JobManager.JobType) - Method in class org.apache.nutch.service.model.response.JobInfo
- setUnfetched(long) - Method in class org.apache.nutch.hostdb.HostDatum
- setup(Mapper.Context) - Method in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
-
Configures the job mapper.
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.Injector.InjectMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbMapper
-
Configures the OutlinkDb job mapper.
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DBFilter
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbFilter
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbDumpMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.fetcher.Fetcher.FetcherRun
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbFilter
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDbReader.LinkDBDumpMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.crawl.LinkDb.LinkDbMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
-
Configures the mapper, sets the flag for type of content and the topN number if any.
- setup(Mapper.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.tools.FreeGenerator.FG.FGMapper
- setup(Mapper.Context) - Method in class org.apache.nutch.parse.ParseSegment.ParseSegmentMapper
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.indexer.CleaningJob.DeleterReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbTopNReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
-
Configures the reducer, sets the flag for type of content and the topN number if any.
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.SelectorReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.NodeDumper.Dumper.DumperReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater.ScoreUpdaterReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.Inverter.InvertReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbMerger.Merger
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Generator.CrawlDbUpdater.CrawlDbUpdateReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.Injector.InjectReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.LinkDbMerger.LinkDbMergeReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.segment.SegmentReader.InputCompatReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDbStatReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
-
Configures the thread pool and prestarts all resolver threads.
- setup(Reducer.Context) - Method in class org.apache.nutch.indexer.IndexerMapReduce.IndexerReducer
- setup(Reducer.Context) - Method in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb.OutlinkDbReducer
-
Configures the OutlinkDb job reducer.
- setup(Reducer.Context) - Method in class org.apache.nutch.segment.SegmentMerger.SegmentMergerReducer
- setUrl(String) - Method in class org.apache.nutch.fetcher.FetcherThreadEvent
-
Set URL of this event (fetched page)
- setUrl(String) - Method in class org.apache.nutch.parse.Outlink
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- setUrl(String) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- setUrl(String) - Method in class org.apache.nutch.service.model.request.SeedUrl
- setUrl(String) - Method in class org.apache.nutch.service.model.response.FetchNodeDbInfo
- setUrl(Text) - Method in class org.apache.nutch.fetcher.FetchNode
- setURLScoreAfterParsing(Text, Content, Parse) - Method in class org.apache.nutch.scoring.similarity.cosine.CosineSimilarity
- setURLScoreAfterParsing(Text, Content, Parse) - Method in interface org.apache.nutch.scoring.similarity.SimilarityModel
- setVectorEntry(int, long) - Method in class org.apache.nutch.scoring.similarity.cosine.DocVector
- setWaitForExit(boolean) - Method in class org.apache.nutch.util.CommandRunner
- setWarcSize(long) - Method in class org.apache.nutch.tools.CommonCrawlConfig
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchDocument
- setWeight(float) - Method in class org.apache.nutch.indexer.NutchField
- setWhiteList(String) - Method in class org.apache.nutch.collection.Subcollection
-
Set contents of whitelist from String
- setWhiteList(ArrayList<String>) - Method in class org.apache.nutch.collection.Subcollection
- shortestMatch(String) - Method in class org.apache.nutch.util.PrefixStringMatcher
-
Returns the shortest prefix of input that is matched, or null if no match exists.
- shortestMatch(String) - Method in class org.apache.nutch.util.SuffixStringMatcher
-
Returns the shortest suffix of input that is matched, or null if no match exists.
- shortestMatch(String) - Method in class org.apache.nutch.util.TrieStringMatcher
-
Returns the shortest substring of input that is matched by a pattern in the trie, or null if no match exists.
- shouldCheck(HostDatum) - Method in class org.apache.nutch.hostdb.UpdateHostDbReducer
-
Determines whether a record should be checked.
- shouldFetch(Text, CrawlDatum, long) - Method in class org.apache.nutch.crawl.AbstractFetchSchedule
-
This method indicates whether the page is suitable for selection in the current fetchlist.
- shouldFetch(Text, CrawlDatum, long) - Method in interface org.apache.nutch.crawl.FetchSchedule
-
This method indicates whether the page is suitable for selection in the current fetchlist.
- shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefalultMultiInteractionHandler
- shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultClickAllAjaxLinksHandler
- shouldProcessURL(String) - Method in class org.apache.nutch.protocol.interactiveselenium.handlers.DefaultHandler
- shouldProcessURL(String) - Method in interface org.apache.nutch.protocol.interactiveselenium.handlers.InteractiveSeleniumHandler
- ShowProperties - Class in org.apache.nutch.tools
-
Tool to list properties and their values set by the current Nutch configuration
- ShowProperties() - Constructor for class org.apache.nutch.tools.ShowProperties
- shutDown() - Method in class org.apache.nutch.plugin.Plugin
-
Shutdown the plugin.
- Signature - Class in org.apache.nutch.crawl
- Signature() - Constructor for class org.apache.nutch.crawl.Signature
- SIGNATURE_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- SignatureComparator - Class in org.apache.nutch.crawl
- SignatureComparator() - Constructor for class org.apache.nutch.crawl.SignatureComparator
- SignatureFactory - Class in org.apache.nutch.crawl
-
Factory class that instantiates a Signature implementation according to the current Configuration.
- SimilarityModel - Interface in org.apache.nutch.scoring.similarity
- SimilarityScoringFilter - Class in org.apache.nutch.scoring.similarity
- SimilarityScoringFilter() - Constructor for class org.apache.nutch.scoring.similarity.SimilarityScoringFilter
- simpleDateFormat - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- sitemap(Path, Path, Path, boolean, boolean, boolean, int) - Method in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_OVERWRITE_EXISTING - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_REDIR_MAX - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_SIZE_MAX - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_STRICT_PARSING - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_URL_FILTERING - Static variable in class org.apache.nutch.util.SitemapProcessor
- SITEMAP_URL_NORMALIZING - Static variable in class org.apache.nutch.util.SitemapProcessor
- SitemapProcessor - Class in org.apache.nutch.util
-
Performs sitemap processing by fetching sitemap links, parsing the content and merging the URLs from sitemaps (with the metadata) into the CrawlDb.
- SitemapProcessor() - Constructor for class org.apache.nutch.util.SitemapProcessor
- size() - Method in class org.apache.nutch.crawl.Inlinks
- size() - Method in class org.apache.nutch.metadata.Metadata
-
Returns the number of metadata names in this metadata.
- size() - Method in class org.apache.nutch.parse.ParseResult
-
Return the number of parse outputs (both successful and failed)
- skip(DataInput) - Static method in class org.apache.nutch.crawl.Inlink
-
Skips over one Inlink in the input.
- skip(DataInput) - Static method in class org.apache.nutch.parse.Outlink
-
Skips over one Outlink in the input.
- SKIP_TRUNCATED - Static variable in class org.apache.nutch.parse.ParseSegment
- skipChildren() - Method in class org.apache.nutch.util.NodeWalker
-
Skips over and removes from the node stack the children of the last node.
- skippedEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of a skipped entity.
- SlashURLNormalizer - Class in org.apache.nutch.net.urlnormalizer.slash
- SlashURLNormalizer() - Constructor for class org.apache.nutch.net.urlnormalizer.slash.SlashURLNormalizer
- slice(String, int, int) - Method in class org.apache.nutch.service.impl.LinkReader
- slice(String, int, int) - Method in class org.apache.nutch.service.impl.NodeReader
- slice(String, int, int) - Method in class org.apache.nutch.service.impl.SequenceReader
- slice(String, int, int) - Method in interface org.apache.nutch.service.NutchReader
- SOFTWARE - Static variable in class org.apache.nutch.tools.WARCUtils
- SolrConstants - Interface in org.apache.nutch.indexwriter.solr
- SolrIndexWriter - Class in org.apache.nutch.indexwriter.solr
- SolrIndexWriter() - Constructor for class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- SolrUtils - Class in org.apache.nutch.indexwriter.solr
- SolrUtils() - Constructor for class org.apache.nutch.indexwriter.solr.SolrUtils
- Sorter() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter
- SorterMapper() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterMapper
- SorterReducer() - Constructor for class org.apache.nutch.scoring.webgraph.NodeDumper.Sorter.SorterReducer
- SOURCE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A reference to a resource from which the present resource is derived.
- SpellCheckedMetadata - Class in org.apache.nutch.metadata
-
A decorator to Metadata that adds spellchecking capabilities to property names.
- SpellCheckedMetadata() - Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
- splitEnd - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- splitLen - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- splitStart - Variable in class org.apache.nutch.tools.arc.ArcRecordReader
- SPONSORED - org.apache.nutch.util.domain.DomainSuffix.Status
- STANDARD - org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
- start - Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
- start(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
- START - org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
- startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- startArray(String, boolean, boolean) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- startCDATA() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of a CDATA section.
- startDocument() - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of a document.
- startDTD(String, String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the start of DTD declarations, if any.
- startElement(String, String, String, Attributes) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Receive notification of the beginning of an element.
- startEntity(String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Report the beginning of an entity.
- startObject(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- startObject(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- startPrefixMapping(String, String) - Method in class org.apache.nutch.parse.html.DOMBuilder
-
Begin the scope of a prefix-URI Namespace mapping.
- startServer() - Static method in class org.apache.nutch.service.NutchServer
- startUp() - Method in class org.apache.nutch.plugin.Plugin
-
Invoked when the plugin starts up.
- STARTUP - org.apache.nutch.util.domain.DomainSuffix.Status
- STAT_PROGRESS - Static variable in interface org.apache.nutch.metadata.Nutch
-
For progress of job.
- StaticFieldIndexer - Class in org.apache.nutch.indexer.staticfield
-
A simple plugin called at indexing that adds fields with static data.
- StaticFieldIndexer() - Constructor for class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
- statNames - Static variable in class org.apache.nutch.crawl.CrawlDatum
- status - Variable in class org.apache.nutch.util.NutchTool
- STATUS_BLOCKED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_DB_DUPLICATE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was marked as being a duplicate of another page
- STATUS_DB_FETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched.
- STATUS_DB_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page no longer exists.
- STATUS_DB_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of DB-related status.
- STATUS_DB_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was successfully fetched and found not modified.
- STATUS_DB_ORPHAN - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was marked as orphan.
- STATUS_DB_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page permanently redirects to other page.
- STATUS_DB_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page temporarily redirects to other page.
- STATUS_DB_UNFETCHED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was not fetched yet.
- STATUS_FAILED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_FAILURE - Static variable in class org.apache.nutch.parse.ParseStatus
- STATUS_FETCH_GONE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful - page is gone.
- STATUS_FETCH_MAX - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Maximum value of fetch-related status.
- STATUS_FETCH_NOTMODIFIED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching successful - page is not modified.
- STATUS_FETCH_REDIR_PERM - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching permanently redirected to other page.
- STATUS_FETCH_REDIR_TEMP - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching temporarily redirected to other page.
- STATUS_FETCH_RETRY - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching unsuccessful, needs to be retried (transient errors).
- STATUS_FETCH_SUCCESS - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Fetching was successful.
- STATUS_GONE - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_INJECTED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page was newly injected.
- STATUS_LINKED - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page discovered through a link.
- STATUS_MODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to have been modified since our last visit.
- STATUS_NOTFETCHING - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_NOTFOUND - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_NOTMODIFIED - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
Page is known to remain unmodified since our last visit.
- STATUS_NOTMODIFIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_NOTPARSED - Static variable in class org.apache.nutch.parse.ParseStatus
- STATUS_PARSE_META - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page got metadata from a parser
- STATUS_REDIR_EXCEEDED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_RETRY - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_ROBOTS_DENIED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_SIGNATURE - Static variable in class org.apache.nutch.crawl.CrawlDatum
-
Page signature.
- STATUS_SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
- STATUS_SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- STATUS_UNKNOWN - Static variable in interface org.apache.nutch.crawl.FetchSchedule
-
It is unknown whether page was changed since our last visit.
- STATUS_WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
- StatusUpdateReducer() - Constructor for class org.apache.nutch.crawl.DeduplicationJob.StatusUpdateReducer
- stdin - Variable in class org.apache.nutch.util.AbstractChecker
- stop() - Method in class org.apache.nutch.service.NutchServer
- stop(String, String) - Method in class org.apache.nutch.service.impl.JobManagerImpl
- stop(String, String) - Method in interface org.apache.nutch.service.JobManager
- stop(String, String) - Method in class org.apache.nutch.service.resources.JobResource
-
Stop Job
- stopJob() - Method in class org.apache.nutch.service.impl.JobWorker
-
To stop the executing job
- stopJob() - Method in class org.apache.nutch.util.NutchTool
-
Stop the job with the possibility to resume.
- STOPPING - org.apache.nutch.service.model.response.JobInfo.State
- stopServer(boolean) - Method in class org.apache.nutch.service.resources.AdminResource
-
Stop the Nutch server
- storeHttpHeaders - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Record the HTTP response header in the metadata, see property store.http.headers.
- storeHttpRequest - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Record the HTTP request in the metadata, see property store.http.request.
- storeIPAddress - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Record the IP address of the responding server, see property store.ip.address.
- stringFields - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- stringFieldWritables - Static variable in class org.apache.nutch.hostdb.UpdateHostDbReducer
- StringUtil - Class in org.apache.nutch.util
-
A collection of String processing utility methods.
- StringUtil() - Constructor for class org.apache.nutch.util.StringUtil
- stripNonCharCodepoints(String) - Static method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchUtils
- Subcollection - Class in org.apache.nutch.collection
-
SubCollection represents a subset of the index; you can define URL patterns that indicate that a particular page (URL) is part of the SubCollection.
- Subcollection(String, String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
public Constructor
- Subcollection(String, String, Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
-
public Constructor
- Subcollection(Configuration) - Constructor for class org.apache.nutch.collection.Subcollection
- SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
- SubcollectionIndexingFilter() - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
- SubcollectionIndexingFilter(Configuration) - Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
- SUBJECT - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The topic of the content of the resource.
- SUCCESS - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsing succeeded.
- SUCCESS - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Content was retrieved without errors.
- SUCCESS_REDIRECT - Static variable in class org.apache.nutch.parse.ParseStatus
-
Parsed content contains a directive to redirect to another URL.
- SuffixStringMatcher - Class in org.apache.nutch.util
-
A class for efficiently matching Strings against a set of suffixes.
- SuffixStringMatcher(String[]) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied array.
- SuffixStringMatcher(Collection<String>) - Constructor for class org.apache.nutch.util.SuffixStringMatcher
-
Creates a new SuffixStringMatcher which will match Strings with any suffix in the supplied Collection.
- SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
-
Filters URLs based on a file of URL suffixes.
- SuffixURLFilter() - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- SuffixURLFilter(Reader) - Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
- SYSTEM_PROTOCOLS - Static variable in class org.apache.nutch.plugin.URLStreamHandlerFactory
-
Protocols covered by standard JVM URL handlers.
T
- TAB_CHARACTER - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
- TableUtil - Class in org.apache.nutch.util
- TableUtil() - Constructor for class org.apache.nutch.util.TableUtil
- TAG_BLACKLIST - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_COLLECTION - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_COLLECTIONS - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_ID - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_KEY - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_NAME - Static variable in class org.apache.nutch.collection.Subcollection
- TAG_WHITELIST - Static variable in class org.apache.nutch.collection.Subcollection
- tcpPort - Variable in class org.apache.nutch.util.AbstractChecker
- TEMP_MOVED - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Resource has moved temporarily.
- TEMPLATE - Static variable in class org.apache.nutch.tools.CommonCrawlFormatWARC
- termFreqVector - Variable in class org.apache.nutch.scoring.similarity.cosine.DocVector
- terminal - Variable in class org.apache.nutch.util.TrieStringMatcher.TrieNode
- termVector - Variable in class org.apache.nutch.scoring.similarity.cosine.DocVector
- TEXT_PLAIN_CONTENT_TYPE - Static variable in class org.apache.nutch.parse.feed.FeedParser
- TextMD5Signature - Class in org.apache.nutch.crawl
-
Implementation of a page signature.
- TextMD5Signature() - Constructor for class org.apache.nutch.crawl.TextMD5Signature
- TextOutputFormat() - Constructor for class org.apache.nutch.segment.SegmentReader.TextOutputFormat
- TextProfileSignature - Class in org.apache.nutch.crawl
-
An implementation of a page signature.
- TextProfileSignature() - Constructor for class org.apache.nutch.crawl.TextProfileSignature
- throwBadRequestException(String) - Method in class org.apache.nutch.service.resources.AbstractResource
- TikaParser - Class in org.apache.nutch.parse.tika
-
Wrapper for Tika parsers.
- TikaParser() - Constructor for class org.apache.nutch.parse.tika.TikaParser
- TIME - org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
fetch exceeded configured http.time.limit
- timelimitExceeded() - Method in class org.apache.nutch.fetcher.FetchItemQueues
- timeout - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The network timeout in milliseconds.
- TimingUtil - Class in org.apache.nutch.util
- TimingUtil() - Constructor for class org.apache.nutch.util.TimingUtil
- TITLE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
A name given to the resource.
- TLDIndexingFilter - Class in org.apache.nutch.indexer.tld
-
Adds the top-level domain extensions to the index
- TLDIndexingFilter() - Constructor for class org.apache.nutch.indexer.tld.TLDIndexingFilter
- TLDScoringFilter - Class in org.apache.nutch.scoring.tld
-
Scoring filter to boost top-level domains (TLDs).
- TLDScoringFilter() - Constructor for class org.apache.nutch.scoring.tld.TLDScoringFilter
- tlsCheckCertificate - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Whether to check TLS/SSL certificates
- tlsPreferredCipherSuites - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Which TLS/SSL cipher suites to support
- tlsPreferredProtocols - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Which TLS/SSL protocols to support
- toASCII(String) - Static method in class org.apache.nutch.util.URLUtil
- toByteArray(HttpHeaders) - Static method in class org.apache.nutch.tools.WARCUtils
- toContent() - Method in class org.apache.nutch.protocol.file.FileResponse
- toContent() - Method in class org.apache.nutch.protocol.ftp.FtpResponse
- toDate(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- toHexString(byte[]) - Static method in class org.apache.nutch.util.StringUtil
-
Convenience call for StringUtil.toHexString(byte[], String, int), where sep = null; lineLen = Integer.MAX_VALUE.
- toHexString(byte[], String, int) - Static method in class org.apache.nutch.util.StringUtil
-
Get a text representation of a byte[] as hexadecimal String, where each pair of hexadecimal digits corresponds to consecutive bytes in the array.
- toLong(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- TOPIC - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- TopLevelDomain - Class in org.apache.nutch.util.domain
-
(From Wikipedia) A top-level domain (TLD) is the last part of an Internet domain name; that is, the letters which follow the final dot of any domain name.
- TopLevelDomain(String, DomainSuffix.Status, float, String) - Constructor for class org.apache.nutch.util.domain.TopLevelDomain
- TopLevelDomain(String, TopLevelDomain.Type, DomainSuffix.Status, float) - Constructor for class org.apache.nutch.util.domain.TopLevelDomain
- TopLevelDomain.Type - Enum in org.apache.nutch.util.domain
- toString() - Method in class org.apache.nutch.crawl.CrawlDatum
- toString() - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
- toString() - Method in class org.apache.nutch.crawl.Inlink
- toString() - Method in class org.apache.nutch.crawl.Inlinks
- toString() - Method in class org.apache.nutch.hostdb.HostDatum
- toString() - Method in class org.apache.nutch.indexer.IndexWriterConfig
- toString() - Method in class org.apache.nutch.indexer.NutchDocument
- toString() - Method in class org.apache.nutch.indexer.NutchField
- toString() - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter.Separator
- toString() - Method in class org.apache.nutch.metadata.Metadata
- toString() - Method in class org.apache.nutch.parse.html.DOMContentUtils.LinkParams
- toString() - Method in class org.apache.nutch.parse.HTMLMetaTags
- toString() - Method in class org.apache.nutch.parse.Outlink
- toString() - Method in class org.apache.nutch.parse.ParseData
- toString() - Method in class org.apache.nutch.parse.ParseStatus
- toString() - Method in class org.apache.nutch.parse.ParseText
- toString() - Method in class org.apache.nutch.plugin.Extension
- toString() - Method in class org.apache.nutch.protocol.Content
- toString() - Method in class org.apache.nutch.protocol.okhttp.CIDR
- toString() - Method in class org.apache.nutch.protocol.ProtocolStatus
- toString() - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- toString() - Method in class org.apache.nutch.scoring.webgraph.Node
- toString() - Method in class org.apache.nutch.segment.SegmentPart
-
Return a String representation of this class, in the form "segmentName/partName".
- toString() - Method in class org.apache.nutch.urlfilter.fast.FastURLFilter.Rule
- toString() - Method in class org.apache.nutch.util.domain.DomainSuffix
- toString(long) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- toString(CharSequence) - Static method in class org.apache.nutch.util.TableUtil
-
Converts the given Utf8 instance to a String and cleans out any offending "�" from the String.
- toString(String) - Method in class org.apache.nutch.protocol.Content
- toString(String, String) - Method in class org.apache.nutch.metadata.Metadata
- toString(Charset) - Method in class org.apache.nutch.protocol.Content
- toString(Calendar) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- toString(Date) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
-
Get the HTTP format of the specified date.
- toUNICODE(String) - Static method in class org.apache.nutch.util.URLUtil
- toZonedDateTime(String) - Static method in class org.apache.nutch.net.protocols.HttpDateFormat
- train() - Method in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- Train - Class in org.apache.nutch.parsefilter.naivebayes
- Train() - Constructor for class org.apache.nutch.parsefilter.naivebayes.Train
- TRAINFILE_MODELFILTER - Static variable in class org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
- TRANSFER_ENCODING - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- TrieStringMatcher - Class in org.apache.nutch.util
-
TrieStringMatcher is a base class for simple tree-based string matching.
- TrieStringMatcher() - Constructor for class org.apache.nutch.util.TrieStringMatcher
- TrieStringMatcher.TrieNode - Class in org.apache.nutch.util
-
Node class for the character tree.
- TRUNCATED_CONTENT - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold a boolean indicating whether content has been truncated, e.g., because it exceeds http.content.limit
- TRUNCATED_CONTENT_REASON - Static variable in interface org.apache.nutch.net.protocols.Response
-
Key to hold the reason why content has been truncated; see Response.TruncatedContentReason
- TruncatedContent() - Constructor for class org.apache.nutch.protocol.okhttp.OkHttpResponse.TruncatedContent
- TRUST_STORE_PASSWORD - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- TRUST_STORE_PATH - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- TRUST_STORE_TYPE - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- TYPE - Static variable in interface org.apache.nutch.metadata.DublinCore
-
The nature or genre of the content of the resource.
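As a rough illustration of the `toHexString` entries above (two hexadecimal digits per byte, in array order), here is a minimal standalone sketch. It does not call Nutch. The class name `HexSketch`, the lower-case digits, and the reading of a `null` separator as "no separator" are assumptions made for illustration, and the `lineLen` parameter is left out entirely.

```java
final class HexSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /**
     * Renders each byte as two hex digits, in array order.
     * Assumption: a null separator means "no separator between bytes".
     */
    static String toHex(byte[] data, String sep) {
        if (data == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (int i = 0; i < data.length; i++) {
            if (i > 0 && sep != null) {
                sb.append(sep);
            }
            int b = data[i] & 0xff;          // treat the byte as unsigned
            sb.append(DIGITS[b >>> 4]);      // high nibble
            sb.append(DIGITS[b & 0x0f]);     // low nibble
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x1f, (byte) 0xff};
        System.out.println(toHex(sample, null)); // prints 001fff
        System.out.println(toHex(sample, ":"));  // prints 00:1f:ff
    }
}
```

Masking with `0xff` matters because Java bytes are signed; without it, values above `0x7f` would index the digit table with a negative number.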
U
- unescape(String) - Method in class org.apache.nutch.net.urlnormalizer.ajax.AjaxURLNormalizer
-
Unescape some exotic characters in the fragment part
- unfetched - Variable in class org.apache.nutch.hostdb.HostDatum
- unflattenToHashmap(String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Classify
- unreverseHost(String) - Static method in class org.apache.nutch.util.TableUtil
- unreverseUrl(String) - Static method in class org.apache.nutch.util.TableUtil
- UNSPECIFIED - org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
unknown reason
- UNSPONSORED - org.apache.nutch.util.domain.DomainSuffix.Status
- unzip(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
-
Returns a gunzipped copy of the input array.
- unzipBestEffort(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
-
Returns a gunzipped copy of the input array.
- unzipBestEffort(byte[], int) - Static method in class org.apache.nutch.util.GZIPUtils
-
Returns a gunzipped copy of the input array, truncated to sizeLimit bytes, if necessary.
- update(Path, Path) - Method in class org.apache.nutch.scoring.webgraph.ScoreUpdater
-
Updates the inlink score from the web graph node database into the crawl database.
- update(Path, Path[], boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDb
- update(Path, Path[], boolean, boolean, boolean, boolean) - Method in class org.apache.nutch.crawl.CrawlDb
- update(NutchDocument) - Method in interface org.apache.nutch.indexer.IndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexer.IndexWriters
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- update(NutchDocument) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- UPDATE - Static variable in class org.apache.nutch.indexer.NutchIndexAction
- UPDATEDB - org.apache.nutch.service.JobManager.JobType
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.AbstractScoringFilter
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.depth.DepthScoringFilter
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.opic.OPICScoringFilter
-
Increase the score by a sum of inlinked scores.
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.orphan.OrphanScoringFilter
-
Used for orphan control.
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in interface org.apache.nutch.scoring.ScoringFilter
-
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
- updateDbScore(Text, CrawlDatum, CrawlDatum, List<CrawlDatum>) - Method in class org.apache.nutch.scoring.ScoringFilters
-
Calculate updated page score during CrawlDb.update().
- updateHashMap(HashMap<String, Integer>, String) - Static method in class org.apache.nutch.parsefilter.naivebayes.Train
- UpdateHostDb - Class in org.apache.nutch.hostdb
-
Tool to create a HostDB from the CrawlDB.
- UpdateHostDb() - Constructor for class org.apache.nutch.hostdb.UpdateHostDb
- UpdateHostDbMapper - Class in org.apache.nutch.hostdb
-
Mapper ingesting HostDB and CrawlDB entries.
- UpdateHostDbMapper() - Constructor for class org.apache.nutch.hostdb.UpdateHostDbMapper
- UpdateHostDbReducer - Class in org.apache.nutch.hostdb
- UpdateHostDbReducer() - Constructor for class org.apache.nutch.hostdb.UpdateHostDbReducer
- updateProperty(String, String, String) - Method in class org.apache.nutch.service.resources.ConfigResource
-
Adds/Updates a particular property value in the configuration
- url - Variable in class org.apache.nutch.crawl.Generator.SelectorEntry
- url - Variable in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- URL_FILTER_NORMALIZE_ALL - Static variable in class org.apache.nutch.crawl.Injector
-
property to pass value of command-line option -filterNormalizeAll to mapper
- URL_FILTERING - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
- URL_FILTERING - Static variable in class org.apache.nutch.crawl.LinkDbFilter
- URL_FILTERING - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- URL_FILTERING - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
- URL_NORMALIZING - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
- URL_NORMALIZING - Static variable in class org.apache.nutch.crawl.LinkDbFilter
- URL_NORMALIZING - Static variable in class org.apache.nutch.indexer.IndexerMapReduce
- URL_NORMALIZING - Static variable in class org.apache.nutch.scoring.webgraph.WebGraph.OutlinkDb
- URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.CrawlDbFilter
- URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.Injector.InjectMapper
- URL_NORMALIZING_SCOPE - Static variable in class org.apache.nutch.crawl.LinkDbFilter
- URL_VERSION - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator.ArcSegmentCreatorMapper
- URL_VERSION - Static variable in class org.apache.nutch.tools.arc.ArcSegmentCreator
- URLExemptionFilter - Interface in org.apache.nutch.net
-
Interface used to allow exemptions to external domain resources by overriding db.ignore.external.links.
- URLExemptionFilters - Class in org.apache.nutch.net
-
Creates and caches plugins implementing URLExemptionFilter.
- URLExemptionFilters(Configuration) - Constructor for class org.apache.nutch.net.URLExemptionFilters
- URLFilter - Interface in org.apache.nutch.net
-
Interface used to limit which URLs enter Nutch.
- URLFILTER_AUTOMATON_FILE - Static variable in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- URLFILTER_AUTOMATON_RULES - Static variable in class org.apache.nutch.urlfilter.automaton.AutomatonURLFilter
- URLFILTER_FAST_FILE - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
- URLFILTER_FAST_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
- URLFILTER_FAST_PATH_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
- URLFILTER_FAST_QUERY_MAX_LENGTH - Static variable in class org.apache.nutch.urlfilter.fast.FastURLFilter
- URLFILTER_ORDER - Static variable in class org.apache.nutch.net.URLFilters
- URLFILTER_REGEX_FILE - Static variable in class org.apache.nutch.urlfilter.regex.RegexURLFilter
- URLFILTER_REGEX_RULES - Static variable in class org.apache.nutch.urlfilter.regex.RegexURLFilter
- URLFilterChecker - Class in org.apache.nutch.net
-
Checks one given filter or all filters.
- URLFilterChecker() - Constructor for class org.apache.nutch.net.URLFilterChecker
- URLFilterException - Exception in org.apache.nutch.net
- URLFilterException() - Constructor for exception org.apache.nutch.net.URLFilterException
- URLFilterException(String) - Constructor for exception org.apache.nutch.net.URLFilterException
- URLFilterException(String, Throwable) - Constructor for exception org.apache.nutch.net.URLFilterException
- URLFilterException(Throwable) - Constructor for exception org.apache.nutch.net.URLFilterException
- URLFilters - Class in org.apache.nutch.net
-
Creates and caches plugins implementing URLFilter.
- URLFilters(Configuration) - Constructor for class org.apache.nutch.net.URLFilters
- urlKey - Static variable in class org.apache.nutch.crawl.DeduplicationJob
- URLMetaIndexingFilter - Class in org.apache.nutch.indexer.urlmeta
-
This is part of the URL Meta plugin.
- URLMetaIndexingFilter() - Constructor for class org.apache.nutch.indexer.urlmeta.URLMetaIndexingFilter
- URLMetaScoringFilter - Class in org.apache.nutch.scoring.urlmeta
-
For documentation, see org.apache.nutch.scoring.urlmeta
- URLMetaScoringFilter() - Constructor for class org.apache.nutch.scoring.urlmeta.URLMetaScoringFilter
- URLNormalizer - Interface in org.apache.nutch.net
-
Interface used to convert URLs to normal form and optionally perform substitutions
- URLNormalizerChecker - Class in org.apache.nutch.net
-
Checks one given normalizer or all normalizers.
- URLNormalizerChecker() - Constructor for class org.apache.nutch.net.URLNormalizerChecker
- URLNormalizers - Class in org.apache.nutch.net
-
This class uses a "chained filter" pattern to run defined normalizers.
- URLNormalizers(Configuration, String) - Constructor for class org.apache.nutch.net.URLNormalizers
- URLPartitioner - Class in org.apache.nutch.crawl
-
Partition urls by host, domain name or IP depending on the value of the parameter 'partition.url.mode' which can be 'byHost', 'byDomain' or 'byIP'
- URLPartitioner() - Constructor for class org.apache.nutch.crawl.URLPartitioner
- URLStreamHandlerFactory - Class in org.apache.nutch.plugin
-
This URLStreamHandlerFactory knows about all the plugins in use and thus can create the correct URLStreamHandler even if it comes from a plugin classpath.
- URLUtil - Class in org.apache.nutch.util
-
Utility class for URL analysis
- URLUtil() - Constructor for class org.apache.nutch.util.URLUtil
- UrlValidator - Class in org.apache.nutch.urlfilter.validator
-
Validates URLs.
- UrlValidator() - Constructor for class org.apache.nutch.urlfilter.validator.UrlValidator
- usage - Variable in class org.apache.nutch.util.AbstractChecker
- usage() - Method in class org.apache.nutch.crawl.Injector
- usage() - Static method in class org.apache.nutch.util.SitemapProcessor
- USE_AUTH - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- USE_AUTH - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- useHttp11 - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Do we use HTTP/1.1?
- useHttp2 - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Whether to use HTTP/2
- useProxy - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
Indicates if a proxy is used
- useProxy(String) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- useProxy(URI) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- useProxy(URL) - Method in class org.apache.nutch.protocol.http.api.HttpBase
- USER - Static variable in interface org.apache.nutch.indexwriter.elastic.ElasticConstants
- USER - Static variable in interface org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xConstants
- USER_AGENT - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- userAgent - Variable in class org.apache.nutch.protocol.http.api.HttpBase
-
The Nutch 'User-Agent' request header
- USERNAME - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- UTF_8 - Static variable in class org.apache.nutch.crawl.DeduplicationJob
- UUID_KEY - Static variable in class org.apache.nutch.util.NutchConfiguration
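The `URLPartitioner` entry above describes partitioning URLs by host, domain name or IP. The following is a minimal sketch of the by-host idea only, assuming a plain hash-modulo scheme; it is not Nutch's implementation. The class name `HostPartitionSketch`, the method `partitionByHost`, and the fallback of hashing the whole string when no host can be parsed are all hypothetical.

```java
import java.net.URI;
import java.util.Locale;

final class HostPartitionSketch {

    /**
     * Maps a URL to a partition in [0, numPartitions) based on its host,
     * so that all URLs of one host land in the same partition.
     */
    static int partitionByHost(String url, int numPartitions) {
        String key = url; // assumption: fall back to the whole string
        try {
            String host = URI.create(url).getHost();
            if (host != null) {
                key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
            }
        } catch (IllegalArgumentException e) {
            // not a syntactically valid URI: keep the fallback key
        }
        // clear the sign bit so the remainder is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partitionByHost("https://example.org/a", 8);
        int b = partitionByHost("https://EXAMPLE.org/b?x=1", 8);
        System.out.println(a == b); // prints true
    }
}
```

Grouping by host in this way keeps every URL of a host within a single partition, which is the property a per-host partitioning mode is meant to provide. Domain and IP modes would need a different key and are not shown.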
V
- VAL_RESULT - Static variable in interface org.apache.nutch.metadata.Nutch
-
Name of the key used in the Result Map sent back by the REST endpoint
- VALUE_SERIALIZER - Static variable in interface org.apache.nutch.indexwriter.kafka.KafkaConstants
- valueOf(String) - Static method in enum org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.protocol.http.HttpResponse.Scheme
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.protocol.selenium.HttpResponse.Scheme
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.service.JobManager.JobType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.service.model.response.JobInfo.State
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.util.domain.DomainStatistics.MyCounter
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.util.domain.DomainSuffix.Status
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.apache.nutch.util.domain.TopLevelDomain.Type
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum org.apache.nutch.fetcher.FetcherThreadEvent.PublishEventType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.net.protocols.Response.TruncatedContentReason
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.protocol.htmlunit.HttpResponse.Scheme
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.protocol.http.HttpResponse.Scheme
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.protocol.interactiveselenium.HttpResponse.Scheme
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.protocol.selenium.HttpResponse.Scheme
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil.StemFilterType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.scoring.similarity.util.LuceneTokenizer.TokenizerType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.service.JobManager.JobType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.service.model.response.JobInfo.State
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.util.domain.DomainStatistics.MyCounter
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.util.domain.DomainSuffix.Status
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.apache.nutch.util.domain.TopLevelDomain.Type
-
Returns an array containing the constants of this enum type, in the order they are declared.
- VERSION - Static variable in class org.apache.nutch.indexer.NutchDocument
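The `valueOf(String)` and `values()` entries above are the standard methods the Java compiler generates for every enum. A small sketch of how they behave, using a hypothetical enum `Mode` that is not part of Nutch:

```java
import java.util.Arrays;

final class EnumSketch {

    /** Hypothetical enum, used only to demonstrate the generated methods. */
    enum Mode { BY_HOST, BY_DOMAIN, BY_IP }

    /**
     * valueOf requires an exact, case-sensitive match and throws
     * IllegalArgumentException otherwise (NullPointerException for null),
     * so callers often wrap it with a fallback like this one.
     */
    static Mode parseOrDefault(String name, Mode fallback) {
        try {
            return Mode.valueOf(name);
        } catch (IllegalArgumentException | NullPointerException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // values() returns the constants in declaration order
        System.out.println(Arrays.toString(Mode.values())); // prints [BY_HOST, BY_DOMAIN, BY_IP]
        System.out.println(parseOrDefault("by_ip", Mode.BY_HOST)); // prints BY_HOST
    }
}
```

`values()` returns a fresh array on each call, so modifying the returned array does not affect the enum itself.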
W
- walk(Node, URL, Metadata, Configuration) - Static method in class org.creativecommons.nutch.CCParseFilter.Walker
-
Scan the document adding attributes to metadata.
- WARCExporter - Class in org.apache.nutch.tools.warc
-
MapReduce job to export Nutch segments as WARC files.
- WARCExporter() - Constructor for class org.apache.nutch.tools.warc.WARCExporter
- WARCExporter(Configuration) - Constructor for class org.apache.nutch.tools.warc.WARCExporter
- WARCExporter.WARCMapReduce - Class in org.apache.nutch.tools.warc
- WARCExporter.WARCMapReduce.WARCMapper - Class in org.apache.nutch.tools.warc
- WARCExporter.WARCMapReduce.WARCReducer - Class in org.apache.nutch.tools.warc
- WARCMapper() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCMapper
- WARCMapReduce() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce
- WARCReducer() - Constructor for class org.apache.nutch.tools.warc.WARCExporter.WARCMapReduce.WARCReducer
- WARCUtils - Class in org.apache.nutch.tools
- WARCUtils() - Constructor for class org.apache.nutch.tools.WARCUtils
- WebGraph - Class in org.apache.nutch.scoring.webgraph
-
Creates three databases, one for inlinks, one for outlinks, and a node database that holds the number of in and outlinks to a url and the current score for the url.
- WebGraph() - Constructor for class org.apache.nutch.scoring.webgraph.WebGraph
- WebGraph.OutlinkDb - Class in org.apache.nutch.scoring.webgraph
-
The OutlinkDb creates a database of all outlinks.
- WebGraph.OutlinkDb.OutlinkDbMapper - Class in org.apache.nutch.scoring.webgraph
-
Passes through existing LinkDatum objects from an existing OutlinkDb and maps out new LinkDatum objects from the ParseData of new crawls.
- WebGraph.OutlinkDb.OutlinkDbReducer - Class in org.apache.nutch.scoring.webgraph
- webWindowClosed(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
- webWindowContentChanged(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
- webWindowOpened(WebWindowEvent) - Method in class org.apache.nutch.protocol.htmlunit.HtmlUnitWebWindowListener
- WEIGHT_FIELD - Static variable in interface org.apache.nutch.indexwriter.solr.SolrConstants
- whitespacePattern - Static variable in class org.apache.nutch.parse.headings.HeadingsParseFilter
-
Pattern used to strip surplus whitespace
- WORK_TYPE - Static variable in interface org.apache.nutch.metadata.CreativeCommons
- WOULDBLOCK - Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
Deprecated.
- WRITABLE_CONTENT_TYPE - Static variable in interface org.apache.nutch.metadata.HttpHeaders
- WRITABLE_FIXED_INTERVAL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- WRITABLE_GENERATE_TIME_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- WRITABLE_PROTO_STATUS_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- WRITABLE_REPR_URL_KEY - Static variable in interface org.apache.nutch.metadata.Nutch
- WritableSerializer() - Constructor for class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
- write(DataOutput) - Method in class org.apache.nutch.crawl.CrawlDatum
- write(DataOutput) - Method in class org.apache.nutch.crawl.Generator.SelectorEntry
- write(DataOutput) - Method in class org.apache.nutch.crawl.Inlink
- write(DataOutput) - Method in class org.apache.nutch.crawl.Inlinks
- write(DataOutput) - Method in class org.apache.nutch.hostdb.HostDatum
- write(DataOutput) - Method in class org.apache.nutch.indexer.NutchDocument
- write(DataOutput) - Method in class org.apache.nutch.indexer.NutchField
- write(DataOutput) - Method in class org.apache.nutch.indexer.NutchIndexAction
- write(DataOutput) - Method in class org.apache.nutch.metadata.Metadata
- write(DataOutput) - Method in class org.apache.nutch.metadata.MetaWrapper
- write(DataOutput) - Method in class org.apache.nutch.parse.Outlink
- write(DataOutput) - Method in class org.apache.nutch.parse.ParseData
- write(DataOutput) - Method in class org.apache.nutch.parse.ParseImpl
- write(DataOutput) - Method in class org.apache.nutch.parse.ParseStatus
- write(DataOutput) - Method in class org.apache.nutch.parse.ParseText
- write(DataOutput) - Method in class org.apache.nutch.protocol.Content
- write(DataOutput) - Method in class org.apache.nutch.protocol.ProtocolStatus
- write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDatum
- write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNode
- write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.LinkDumper.LinkNodes
- write(DataOutput) - Method in class org.apache.nutch.scoring.webgraph.Node
- write(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
- write(Text, CrawlDatum) - Method in class org.apache.nutch.crawl.CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
- write(NutchDocument) - Method in interface org.apache.nutch.indexer.IndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexer.IndexWriters
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.csv.CSVIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.dummy.DummyIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.kafka.KafkaIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.opensearch1x.OpenSearch1xIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter
- write(NutchDocument) - Method in class org.apache.nutch.indexwriter.solr.SolrIndexWriter
- writeArrayValue(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- writeArrayValue(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- writeKeyNull(String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- writeKeyNull(String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- writeKeyValue(String, String) - Method in class org.apache.nutch.tools.AbstractCommonCrawlFormat
- writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJackson
- writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatJettinson
- writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatSimple
- writeKeyValue(String, String) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- writeObjectEntrySeparator(JsonGenerator) - Method in class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
- writeObjectFieldValueSeparator(JsonGenerator) - Method in class org.apache.nutch.crawl.CrawlDbReader.JsonIndenter
- writeOutAsDuplicate(CrawlDatum, Reducer.Context) - Method in class org.apache.nutch.crawl.DeduplicationJob.DedupReducer
- writeRequest(URI) - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- writeResponse() - Method in class org.apache.nutch.tools.CommonCrawlFormatWARC
- WWW_AUTHENTICATE - Static variable in class org.apache.nutch.protocol.httpclient.HttpAuthenticationFactory
-
The HTTP Authentication (WWW-Authenticate) header which is returned by a webserver requiring authentication.
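The `whitespacePattern` entry above is described as a pattern used to strip surplus whitespace. One common way to do that with `java.util.regex` is sketched below. The regular expression `\s+`, the class name `WhitespaceSketch`, and the method `collapse` are assumptions for illustration; the index does not state which pattern `HeadingsParseFilter` actually uses.

```java
import java.util.regex.Pattern;

final class WhitespaceSketch {

    // Assumed pattern: one or more whitespace characters of any kind.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Collapses each whitespace run to a single space and trims both ends. */
    static String collapse(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapse("  Apache \n\t Nutch  ")); // prints Apache Nutch
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every string that is cleaned.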
X
- X_HIDE_HEADER - Static variable in class org.apache.nutch.tools.WARCUtils
- X_POINT_ID - Static variable in interface org.apache.nutch.exchange.Exchange
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.indexer.IndexingFilter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.indexer.IndexWriter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.net.URLExemptionFilter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.net.URLFilter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.net.URLNormalizer
- X_POINT_ID - Static variable in interface org.apache.nutch.parse.HtmlParseFilter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.parse.Parser
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.protocol.Protocol
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.publisher.NutchPublisher
- X_POINT_ID - Static variable in interface org.apache.nutch.scoring.ScoringFilter
-
The name of the extension point.
- X_POINT_ID - Static variable in interface org.apache.nutch.segment.SegmentMergeFilter
-
The name of the extension point.
- XMLCharacterRecognizer - Class in org.apache.nutch.parse.html
-
Class used to verify whether the specified ch conforms to the XML 1.0 definition of whitespace.
- XMLCharacterRecognizer() - Constructor for class org.apache.nutch.parse.html.XMLCharacterRecognizer
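The `XMLCharacterRecognizer` entry above refers to the XML 1.0 definition of whitespace, which covers exactly four characters: space (#x20), tab (#x9), carriage return (#xD) and line feed (#xA). A minimal standalone check could look like the sketch below; the class and method names are hypothetical and the code does not call Nutch.

```java
final class XmlWhitespaceSketch {

    /** True if ch is one of the four XML 1.0 whitespace characters. */
    static boolean isXmlWhitespace(char ch) {
        return ch == 0x20 || ch == 0x09 || ch == 0x0D || ch == 0x0A;
    }

    /** True if every character is XML whitespace (vacuously true for an empty sequence). */
    static boolean isAllXmlWhitespace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isXmlWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllXmlWhitespace("\r\n \t")); // prints true
        System.out.println(isXmlWhitespace('\u00A0'));      // prints false
    }
}
```

This set is narrower than `Character.isWhitespace`, which also accepts characters such as form feed that XML 1.0 does not treat as whitespace.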
Y
Z
- zip(byte[]) - Static method in class org.apache.nutch.util.GZIPUtils
-
Returns a gzipped copy of the input array.
- ZipParser - Class in org.apache.nutch.parse.zip
-
ZipParser class based on MSPowerPointParser class by Stephan Strittmatter.
- ZipParser() - Constructor for class org.apache.nutch.parse.zip.ZipParser
-
Creates a new instance of ZipParser
- ZipTextExtractor - Class in org.apache.nutch.parse.zip
- ZipTextExtractor(Configuration) - Constructor for class org.apache.nutch.parse.zip.ZipTextExtractor
-
Creates a new instance of ZipTextExtractor
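The `zip(byte[])` entry above, together with the `unzip` and `unzipBestEffort` entries under U, describes gzip round trips on byte arrays. The sketch below shows one way to do this with `java.util.zip` alone. It is not the `GZIPUtils` code: the class name `GzipSketch`, the method names, and the choice to return whatever was decoded when the stream is corrupt are assumptions about what "best effort" could mean.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

    /** Returns a gzipped copy of the input array. */
    static byte[] gzip(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
        return out.toByteArray();
    }

    /** Gunzips at most sizeLimit bytes; on a corrupt stream, returns what was decoded so far. */
    static byte[] gunzipBestEffort(byte[] in, int sizeLimit) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(in))) {
            byte[] buf = new byte[4096];
            int written = 0;
            while (written < sizeLimit) {
                int n = gz.read(buf, 0, Math.min(buf.length, sizeLimit - written));
                if (n < 0) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
                written += n;
            }
        } catch (IOException e) {
            // best effort: keep the bytes decoded before the failure
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println(new String(gunzipBestEffort(packed, 5), StandardCharsets.UTF_8)); // prints hello
    }
}
```

Capping the read loop at `sizeLimit` bounds memory use even when a small compressed input would expand to a very large output.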
_
- __openPassiveDataConnection(int, String) - Method in class org.apache.nutch.protocol.ftp.Client
-
Open a passive data connection socket
- _compare(byte[], int, int, byte[], int, int) - Static method in class org.apache.nutch.crawl.SignatureComparator
- _compare(Object, Object) - Static method in class org.apache.nutch.crawl.SignatureComparator