Uses of Class
org.apache.nutch.crawl.CrawlDatum
-
Packages that use CrawlDatum
Package Description
org.apache.nutch.analysis.lang Text document language identifier.
org.apache.nutch.crawl Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher The Nutch multi-threaded fetching module.
org.apache.nutch.hostdb
org.apache.nutch.indexer Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.arbitrary Indexing filter to add arbitrary document data to the index from the output of a user-specified class.
org.apache.nutch.indexer.basic A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.filter
org.apache.nutch.indexer.geoip This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
org.apache.nutch.indexer.jexl This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata.
org.apache.nutch.indexer.links
org.apache.nutch.indexer.metadata Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.replace Indexing filter to allow pattern replacements on metadata.
org.apache.nutch.indexer.staticfield A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta URL Meta Tag Indexing Plugin
org.apache.nutch.microformats.reltag A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.protocol Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
org.apache.nutch.protocol.http Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api Common API used by HTTP plugins (http, httpclient, etc.)
org.apache.nutch.protocol.httpclient Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.interactiveselenium Protocol plugin which supports retrieving documents using and interacting with Selenium.
org.apache.nutch.protocol.okhttp Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP 1.1 and/or http/2.
org.apache.nutch.protocol.selenium Protocol plugin which supports retrieving documents via Selenium.
org.apache.nutch.scoring The ScoringFilter interface.
org.apache.nutch.scoring.depth Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.metadata Metadata Scoring Plugin
org.apache.nutch.scoring.opic Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.orphan Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time).
org.apache.nutch.scoring.similarity
org.apache.nutch.scoring.similarity.cosine Implements the cosine similarity metric for scoring relevant documents.
org.apache.nutch.scoring.tld Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta URL Meta Tag Scoring Plugin
org.apache.nutch.segment A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.util Miscellaneous utility classes.
org.creativecommons.nutch Sample plugins that parse and index Creative Commons metadata.
-
Uses of CrawlDatum in org.apache.nutch.analysis.lang
Methods in org.apache.nutch.analysis.lang with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
LanguageIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.crawl
Fields in org.apache.nutch.crawl declared as CrawlDatum Modifier and Type Field Description CrawlDatum
Generator.SelectorEntry. datum
Methods in org.apache.nutch.crawl that return CrawlDatum Modifier and Type Method Description CrawlDatum
AbstractFetchSchedule. forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
CrawlDatum
FetchSchedule. forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
CrawlDatum
CrawlDbReader. get(String crawlDb, String url, Configuration config)
protected CrawlDatum
DeduplicationJob.DedupReducer. getDuplicate(CrawlDatum existingDoc, CrawlDatum newDoc)
CrawlDatum
AbstractFetchSchedule. initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data.
CrawlDatum
FetchSchedule. initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data.
static CrawlDatum
CrawlDatum. read(DataInput in)
CrawlDatum
AbstractFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page.
CrawlDatum
AdaptiveFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
DefaultFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
FetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page.
CrawlDatum
MimeAdaptiveFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
AbstractFetchSchedule. setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE.
CrawlDatum
FetchSchedule. setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE.
CrawlDatum
AbstractFetchSchedule. setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be retried due to transient errors.
CrawlDatum
FetchSchedule. setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be retried due to transient errors.
Methods in org.apache.nutch.crawl that return types with arguments of type CrawlDatum Modifier and Type Method Description RecordWriter<Text,CrawlDatum>
CrawlDbReader.CrawlDatumCsvOutputFormat. getRecordWriter(TaskAttemptContext context)
RecordWriter<Text,CrawlDatum>
CrawlDbReader.CrawlDatumJsonOutputFormat. getRecordWriter(TaskAttemptContext context)
Methods in org.apache.nutch.crawl with parameters of type CrawlDatum Modifier and Type Method Description long
AbstractFetchSchedule. calculateLastFetchTime(CrawlDatum datum)
This method returns the last fetch time of the CrawlDatum.
long
FetchSchedule. calculateLastFetchTime(CrawlDatum datum)
Calculates last fetch time of the given CrawlDatum.
int
CrawlDatum. compareTo(CrawlDatum that)
Sort two CrawlDatum objects by decreasing score.
CrawlDatum
AbstractFetchSchedule. forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
CrawlDatum
FetchSchedule. forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
protected CrawlDatum
DeduplicationJob.DedupReducer. getDuplicate(CrawlDatum existingDoc, CrawlDatum newDoc)
static boolean
CrawlDatum. hasDbStatus(CrawlDatum datum)
static boolean
CrawlDatum. hasFetchStatus(CrawlDatum datum)
CrawlDatum
AbstractFetchSchedule. initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data.
CrawlDatum
FetchSchedule. initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data.
void
CrawlDbFilter. map(Text key, CrawlDatum value, Mapper.Context context)
void
CrawlDbReader.CrawlDbDumpMapper. map(Text key, CrawlDatum value, Mapper.Context context)
void
CrawlDbReader.CrawlDbStatMapper. map(Text key, CrawlDatum value, Mapper.Context context)
void
CrawlDbReader.CrawlDbTopNMapper. map(Text key, CrawlDatum value, Mapper.Context context)
void
DeduplicationJob.DBFilter. map(Text key, CrawlDatum value, Mapper.Context context)
void
Generator.CrawlDbUpdater.CrawlDbUpdateMapper. map(Text key, CrawlDatum value, Mapper.Context context)
void
Generator.SelectorMapper. map(Text key, CrawlDatum value, Mapper.Context context)
void
CrawlDatum. putAllMetaData(CrawlDatum other)
Add all metadata from other CrawlDatum to this CrawlDatum.
void
CrawlDatum. set(CrawlDatum that)
Copy the contents of another instance into this instance.
CrawlDatum
AbstractFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page.
CrawlDatum
AdaptiveFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
DefaultFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
FetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page.
CrawlDatum
MimeAdaptiveFetchSchedule. setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
CrawlDatum
AbstractFetchSchedule. setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE.
CrawlDatum
FetchSchedule. setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE.
CrawlDatum
AbstractFetchSchedule. setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be retried due to transient errors.
CrawlDatum
FetchSchedule. setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be retried due to transient errors.
boolean
AbstractFetchSchedule. shouldFetch(Text url, CrawlDatum datum, long curTime)
This method indicates whether the page is suitable for selection in the current fetchlist.
boolean
FetchSchedule. shouldFetch(Text url, CrawlDatum datum, long curTime)
This method indicates whether the page is suitable for selection in the current fetchlist.
void
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter. write(Text key, CrawlDatum value)
void
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter. write(Text key, CrawlDatum value)
protected void
DeduplicationJob.DedupReducer. writeOutAsDuplicate(CrawlDatum datum, Reducer.Context context)
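The compareTo entry in the table above is described as sorting two CrawlDatum objects by decreasing score. The following is a minimal sketch of that ordering only. The nested Datum class is a hypothetical stand-in that models nothing but a URL and a score, and any further tie-breaking the real class may perform is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ScoreOrderSketch {
  // Hypothetical stand-in for CrawlDatum; only the score is modeled here.
  static final class Datum implements Comparable<Datum> {
    final String url;
    final float score;

    Datum(String url, float score) {
      this.url = url;
      this.score = score;
    }

    // Comparing "that" against "this" places higher scores first.
    @Override
    public int compareTo(Datum that) {
      return Float.compare(that.score, this.score);
    }
  }

  // Returns the URLs of the given entries ordered by decreasing score.
  static List<String> sortedUrls(List<Datum> data) {
    List<Datum> copy = new ArrayList<>(data);
    Collections.sort(copy);
    List<String> urls = new ArrayList<>();
    for (Datum d : copy) {
      urls.add(d.url);
    }
    return urls;
  }

  public static void main(String[] args) {
    List<Datum> data = Arrays.asList(
        new Datum("http://example.org/a", 0.5f),
        new Datum("http://example.org/b", 2.0f),
        new Datum("http://example.org/c", 1.0f));
    System.out.println(sortedUrls(data));
  }
}
```

Because the natural ordering is already descending, a plain sort yields the highest-scoring entries first without a custom comparator.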
Method parameters in org.apache.nutch.crawl with type arguments of type CrawlDatum Modifier and Type Method Description void
CrawlDbMerger.Merger. reduce(Text key, Iterable<CrawlDatum> values, Reducer.Context context)
void
CrawlDbReducer. reduce(Text key, Iterable<CrawlDatum> values, Reducer.Context context)
void
DeduplicationJob.DedupReducer. reduce(K key, Iterable<CrawlDatum> values, Reducer.Context context)
void
DeduplicationJob.StatusUpdateReducer. reduce(Text key, Iterable<CrawlDatum> values, Reducer.Context context)
void
Generator.CrawlDbUpdater.CrawlDbUpdateReducer. reduce(Text key, Iterable<CrawlDatum> values, Reducer.Context context)
void
Injector.InjectReducer. reduce(Text key, Iterable<CrawlDatum> values, Reducer.Context context)
Merge the input records of one URL as per rules below : -
Uses of CrawlDatum in org.apache.nutch.fetcher
Methods in org.apache.nutch.fetcher that return CrawlDatum Modifier and Type Method Description CrawlDatum
FetchItem. getDatum()
Methods in org.apache.nutch.fetcher with parameters of type CrawlDatum Modifier and Type Method Description org.apache.nutch.fetcher.FetchItemQueues.QueuingStatus
FetchItemQueues. addFetchItem(Text url, CrawlDatum datum)
static FetchItem
FetchItem. create(Text url, CrawlDatum datum, String queueMode)
Create an item.
static FetchItem
FetchItem. create(Text url, CrawlDatum datum, String queueMode, int outlinkDepth)
Create an item.
Constructors in org.apache.nutch.fetcher with parameters of type CrawlDatum Constructor Description FetchItem(Text url, URL u, CrawlDatum datum, String queueID)
FetchItem(Text url, URL u, CrawlDatum datum, String queueID, int outlinkDepth)
-
Uses of CrawlDatum in org.apache.nutch.hostdb
Fields in org.apache.nutch.hostdb declared as CrawlDatum Modifier and Type Field Description protected CrawlDatum
UpdateHostDbMapper. crawlDatum
Methods in org.apache.nutch.hostdb with parameters of type CrawlDatum Modifier and Type Method Description void
CrawlDatumProcessor. count(CrawlDatum crawlDatum)
Process a single crawl datum instance to aggregate custom counts.
void
FetchOverdueCrawlDatumProcessor. count(CrawlDatum crawlDatum)
-
Uses of CrawlDatum in org.apache.nutch.indexer
Methods in org.apache.nutch.indexer with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
IndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a parse.
NutchDocument
IndexingFilters. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
Run all defined filters.
void
CleaningJob.DBFilter. map(Text key, CrawlDatum value, Mapper.Context context)
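The two filter entries above describe a single filter that adds fields to or modifies a document, and a wrapper that runs all defined filters. A rough sketch of such a chain follows. The DocFilter interface and the map-based document are hypothetical stand-ins, not the Nutch types, and treating a null result as "drop the document" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FilterChainSketch {
  // Hypothetical stand-in for an indexing filter: receives the document built
  // so far plus crawl metadata and returns the document, or null to drop it.
  interface DocFilter {
    Map<String, String> filter(Map<String, String> doc, String url,
        Map<String, String> datumMeta);
  }

  // Runs every filter in order and stops early once a filter returns null.
  static Map<String, String> runAll(List<DocFilter> filters,
      Map<String, String> doc, String url, Map<String, String> datumMeta) {
    for (DocFilter f : filters) {
      doc = f.filter(doc, url, datumMeta);
      if (doc == null) {
        return null;
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    List<DocFilter> filters = new ArrayList<>();
    // First filter records the URL as a field.
    filters.add((d, u, m) -> { d.put("url", u); return d; });
    // Second filter copies a metadata value, with a fallback.
    filters.add((d, u, m) -> { d.put("lang", m.getOrDefault("lang", "unknown")); return d; });

    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("lang", "en");
    System.out.println(runAll(filters, new LinkedHashMap<>(),
        "http://example.org/", datumMeta));
  }
}
```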
-
Uses of CrawlDatum in org.apache.nutch.indexer.anchor
Methods in org.apache.nutch.indexer.anchor with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
AnchorIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
The AnchorIndexingFilter filter object which supports boolean configuration settings for the deduplication of anchors.
-
Uses of CrawlDatum in org.apache.nutch.indexer.arbitrary
Methods in org.apache.nutch.indexer.arbitrary with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
ArbitraryIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
The ArbitraryIndexingFilter filter object uses reflection to instantiate the configured class and invoke the configured method.
-
Uses of CrawlDatum in org.apache.nutch.indexer.basic
Methods in org.apache.nutch.indexer.basic with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
BasicIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
The BasicIndexingFilter filter object which supports a few configuration settings for adding basic searchable fields.
-
Uses of CrawlDatum in org.apache.nutch.indexer.feed
Methods in org.apache.nutch.indexer.feed with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
FeedIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index.
-
Uses of CrawlDatum in org.apache.nutch.indexer.filter
Methods in org.apache.nutch.indexer.filter with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
MimeTypeIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.geoip
Methods in org.apache.nutch.indexer.geoip with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
GeoIPIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.jexl
Methods in org.apache.nutch.indexer.jexl with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
JexlIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.links
Methods in org.apache.nutch.indexer.links with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
LinksIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.metadata
Methods in org.apache.nutch.indexer.metadata with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
MetadataIndexer. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.more
Methods in org.apache.nutch.indexer.more with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
MoreIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.replace
Methods in org.apache.nutch.indexer.replace with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
ReplaceIndexer. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.staticfield
Methods in org.apache.nutch.indexer.staticfield with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
StaticFieldIndexer. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
The StaticFieldIndexer filter object which adds fields as per configuration setting.
-
Uses of CrawlDatum in org.apache.nutch.indexer.subcollection
Methods in org.apache.nutch.indexer.subcollection with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
SubcollectionIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.tld
Methods in org.apache.nutch.indexer.tld with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
TLDIndexingFilter. filter(NutchDocument doc, Parse parse, Text urlText, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.indexer.urlmeta
Methods in org.apache.nutch.indexer.urlmeta with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
URLMetaIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object.
-
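The URLMetaIndexingFilter description above mentions looking up the tags named in "urlmeta.tags" inside the CrawlDatum. A small sketch of that lookup is given here. It uses plain maps as hypothetical stand-ins for the datum metadata and the document, and it assumes the property value is a comma-separated list, which the description above does not state.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class UrlMetaSketch {
  // Copies each configured tag that is present in the datum metadata into the
  // document. Assumption: tagsProperty is a comma-separated list of tag names.
  static Map<String, String> copyTags(String tagsProperty,
      Map<String, String> datumMeta, Map<String, String> doc) {
    for (String raw : tagsProperty.split(",")) {
      String tag = raw.trim();
      String value = datumMeta.get(tag);
      if (!tag.isEmpty() && value != null) {
        doc.put(tag, value);
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    Map<String, String> datumMeta = new HashMap<>();
    datumMeta.put("category", "news");
    datumMeta.put("unlisted", "ignored");
    // Only "category" is both configured and present, so only it is copied.
    System.out.println(copyTags("category, region", datumMeta, new LinkedHashMap<>()));
  }
}
```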
Uses of CrawlDatum in org.apache.nutch.microformats.reltag
Methods in org.apache.nutch.microformats.reltag with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
RelTagIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-
Uses of CrawlDatum in org.apache.nutch.protocol
Methods in org.apache.nutch.protocol with parameters of type CrawlDatum Modifier and Type Method Description ProtocolOutput
Protocol. getProtocolOutput(Text url, CrawlDatum datum)
Get the ProtocolOutput for a given url and crawldatum.
crawlercommons.robots.BaseRobotRules
Protocol. getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Retrieve robot rules applicable for this URL.
-
Uses of CrawlDatum in org.apache.nutch.protocol.file
Methods in org.apache.nutch.protocol.file with parameters of type CrawlDatum Modifier and Type Method Description ProtocolOutput
File. getProtocolOutput(Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
crawlercommons.robots.BaseRobotRules
File. getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
No robots parsing is done for file protocol.
Constructors in org.apache.nutch.protocol.file with parameters of type CrawlDatum Constructor Description FileResponse(URL url, CrawlDatum datum, File file, Configuration conf)
Default public constructor -
Uses of CrawlDatum in org.apache.nutch.protocol.ftp
Methods in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum Modifier and Type Method Description ProtocolOutput
Ftp. getProtocolOutput(Text url, CrawlDatum datum)
Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
crawlercommons.robots.BaseRobotRules
Ftp. getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Get the robots rules for a given url.
Constructors in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum Constructor Description FtpResponse(URL url, CrawlDatum datum, Ftp ftp, Configuration conf)
-
Uses of CrawlDatum in org.apache.nutch.protocol.htmlunit
Methods in org.apache.nutch.protocol.htmlunit with parameters of type CrawlDatum Modifier and Type Method Description protected Response
Http. getResponse(URL url, CrawlDatum datum, boolean redirect)
Constructors in org.apache.nutch.protocol.htmlunit with parameters of type CrawlDatum Constructor Description HttpResponse(HttpBase http, URL url, CrawlDatum datum)
Default public constructor. -
Uses of CrawlDatum in org.apache.nutch.protocol.http
Methods in org.apache.nutch.protocol.http with parameters of type CrawlDatum Modifier and Type Method Description protected Response
Http. getResponse(URL url, CrawlDatum datum, boolean redirect)
Constructors in org.apache.nutch.protocol.http with parameters of type CrawlDatum Constructor Description HttpResponse(HttpBase http, URL url, CrawlDatum datum)
Default public constructor. -
Uses of CrawlDatum in org.apache.nutch.protocol.http.api
Methods in org.apache.nutch.protocol.http.api with parameters of type CrawlDatum Modifier and Type Method Description ProtocolOutput
HttpBase. getProtocolOutput(Text url, CrawlDatum datum)
protected abstract Response
HttpBase. getResponse(URL url, CrawlDatum datum, boolean followRedirects)
crawlercommons.robots.BaseRobotRules
HttpBase. getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
-
Uses of CrawlDatum in org.apache.nutch.protocol.httpclient
Methods in org.apache.nutch.protocol.httpclient with parameters of type CrawlDatum Modifier and Type Method Description protected Response
Http. getResponse(URL url, CrawlDatum datum, boolean redirect)
Fetches the url with a configured HTTP client and gets the response.
-
Uses of CrawlDatum in org.apache.nutch.protocol.interactiveselenium
Methods in org.apache.nutch.protocol.interactiveselenium with parameters of type CrawlDatum Modifier and Type Method Description protected Response
Http. getResponse(URL url, CrawlDatum datum, boolean redirect)
Constructors in org.apache.nutch.protocol.interactiveselenium with parameters of type CrawlDatum Constructor Description HttpResponse(Http http, URL url, CrawlDatum datum)
-
Uses of CrawlDatum in org.apache.nutch.protocol.okhttp
Methods in org.apache.nutch.protocol.okhttp with parameters of type CrawlDatum Modifier and Type Method Description protected Response
OkHttp. getResponse(URL url, CrawlDatum datum, boolean redirect)
Constructors in org.apache.nutch.protocol.okhttp with parameters of type CrawlDatum Constructor Description OkHttpResponse(OkHttp okhttp, URL url, CrawlDatum datum)
-
Uses of CrawlDatum in org.apache.nutch.protocol.selenium
Methods in org.apache.nutch.protocol.selenium with parameters of type CrawlDatum Modifier and Type Method Description protected Response
Http. getResponse(URL url, CrawlDatum datum, boolean redirect)
Constructors in org.apache.nutch.protocol.selenium with parameters of type CrawlDatum Constructor Description HttpResponse(Http http, URL url, CrawlDatum datum)
-
Uses of CrawlDatum in org.apache.nutch.scoring
Methods in org.apache.nutch.scoring that return CrawlDatum Modifier and Type Method Description CrawlDatum
AbstractScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
ScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Distribute score value from the current page to all its outlinked pages.
CrawlDatum
ScoringFilters. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Methods in org.apache.nutch.scoring with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
AbstractScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
ScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Distribute score value from the current page to all its outlinked pages.
CrawlDatum
ScoringFilters. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
float
AbstractScoringFilter. generatorSortValue(Text url, CrawlDatum datum, float initSort)
float
ScoringFilter. generatorSortValue(Text url, CrawlDatum datum, float initSort)
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
float
ScoringFilters. generatorSortValue(Text url, CrawlDatum datum, float initSort)
Calculate a sort value for Generate.
float
AbstractScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
float
ScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
This method calculates an indexed document score/boost.
float
ScoringFilters. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
void
AbstractScoringFilter. initialScore(Text url, CrawlDatum datum)
void
ScoringFilter. initialScore(Text url, CrawlDatum datum)
Set an initial score for newly discovered pages.
void
ScoringFilters. initialScore(Text url, CrawlDatum datum)
Calculate a new initial score, used when adding newly discovered pages.
void
AbstractScoringFilter. injectedScore(Text url, CrawlDatum datum)
void
ScoringFilter. injectedScore(Text url, CrawlDatum datum)
Set an initial score for newly injected pages.
void
ScoringFilters. injectedScore(Text url, CrawlDatum datum)
Calculate a new initial score, used when injecting new pages.
default void
ScoringFilter. orphanedScore(Text url, CrawlDatum datum)
This method may change the score or status of CrawlDatum during CrawlDb update, when the URL is neither fetched nor has any inlinks.
void
ScoringFilters. orphanedScore(Text url, CrawlDatum datum)
Calculate orphaned page score during CrawlDb.update().
void
AbstractScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
void
ScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
void
ScoringFilters. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
void
AbstractScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
void
ScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
void
ScoringFilters. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
Calculate updated page score during CrawlDb.update().
Method parameters in org.apache.nutch.scoring with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
AbstractScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
ScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Distribute score value from the current page to all its outlinked pages.
CrawlDatum
ScoringFilters. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
void
AbstractScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
void
ScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
void
ScoringFilters. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
Calculate updated page score during CrawlDb.update(). -
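distributeScoreToOutlinks is described above as distributing the score value of the current page to all its outlinked pages. One possible policy, sketched here purely as an illustration, is an equal split across the total outlink count. The method name, the map-based result and the equal-share rule are assumptions of this sketch and are not taken from any particular ScoringFilter implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScoreDistributionSketch {
  // Gives each target an equal share of the page score. allCount is the total
  // number of outlinks on the page and may be larger than targets.size().
  static Map<String, Float> distribute(float pageScore, List<String> targets,
      int allCount) {
    Map<String, Float> shares = new LinkedHashMap<>();
    if (allCount <= 0) {
      return shares; // nothing to distribute to
    }
    float share = pageScore / allCount;
    for (String target : targets) {
      shares.put(target, share);
    }
    return shares;
  }

  public static void main(String[] args) {
    // Two retained targets out of four outlinks: each receives a quarter.
    System.out.println(distribute(1.0f,
        Arrays.asList("http://example.org/a", "http://example.org/b"), 4));
  }
}
```

Dividing by allCount rather than by the number of retained targets means that score assigned to filtered-out links is simply not passed on; a different policy could redistribute it.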
Uses of CrawlDatum in org.apache.nutch.scoring.depth
Methods in org.apache.nutch.scoring.depth that return CrawlDatum Modifier and Type Method Description CrawlDatum
DepthScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Methods in org.apache.nutch.scoring.depth with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
DepthScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
float
DepthScoringFilter. generatorSortValue(Text url, CrawlDatum datum, float initSort)
float
DepthScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
void
DepthScoringFilter. initialScore(Text url, CrawlDatum datum)
void
DepthScoringFilter. injectedScore(Text url, CrawlDatum datum)
void
DepthScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
void
DepthScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
Method parameters in org.apache.nutch.scoring.depth with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
DepthScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
void
DepthScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
-
Uses of CrawlDatum in org.apache.nutch.scoring.link
Methods in org.apache.nutch.scoring.link with parameters of type CrawlDatum Modifier and Type Method Description float
LinkAnalysisScoringFilter. generatorSortValue(Text url, CrawlDatum datum, float initSort)
float
LinkAnalysisScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
void
LinkAnalysisScoringFilter. initialScore(Text url, CrawlDatum datum)
void
LinkAnalysisScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
-
Uses of CrawlDatum in org.apache.nutch.scoring.metadata
Methods in org.apache.nutch.scoring.metadata that return CrawlDatum Modifier and Type Method Description CrawlDatum
MetadataScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metadata listed in your "scoring.parse.md" property and looks for it inside the parseData object.
Methods in org.apache.nutch.scoring.metadata with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
MetadataScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metadata listed in your "scoring.parse.md" property and looks for it inside the parseData object.
void
MetadataScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Takes the metadata specified in your "scoring.db.md" property from the datum object and injects it into the content.
Method parameters in org.apache.nutch.scoring.metadata with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
MetadataScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metadata listed in your "scoring.parse.md" property and looks for it inside the parseData object.
-
Uses of CrawlDatum in org.apache.nutch.scoring.opic
Methods in org.apache.nutch.scoring.opic that return CrawlDatum Modifier and Type Method Description CrawlDatum
OPICScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
Methods in org.apache.nutch.scoring.opic with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
OPICScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
float
OPICScoringFilter. generatorSortValue(Text url, CrawlDatum datum, float initSort)
Use getScore().
float
OPICScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
Dampen the boost value by scorePower.
void
OPICScoringFilter. initialScore(Text url, CrawlDatum datum)
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
void
OPICScoringFilter. injectedScore(Text url, CrawlDatum datum)
void
OPICScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
void
OPICScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
Increase the score by a sum of inlinked scores.
Method parameters in org.apache.nutch.scoring.opic with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
OPICScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
void
OPICScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
Increase the score by a sum of inlinked scores.
-
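The OPIC descriptions above amount to a few lines of arithmetic. The following is a minimal stand-alone sketch of that arithmetic, not the Nutch implementation: the class `OpicSketch` and its method names are hypothetical, scores are plain `float` values rather than `CrawlDatum` objects, and the exact form of the scorePower dampening (raising the stored score to `scorePower` and scaling the initial boost) is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
class OpicSketch {

    // Divide a page's score evenly among its outlinks.
    // Returns 0 when there are no outlinks, to avoid dividing by zero.
    static float sharePerOutlink(float pageScore, int outlinkCount) {
        if (outlinkCount <= 0) {
            return 0.0f;
        }
        return pageScore / outlinkCount;
    }

    // Increase the current score by the sum of the inlinked contributions.
    static float updatedScore(float currentScore, List<Float> inlinkedScores) {
        float sum = 0.0f;
        for (float s : inlinkedScores) {
            sum += s;
        }
        return currentScore + sum;
    }

    // Assumed dampening form: score^scorePower multiplied by the initial boost.
    static float dampenedBoost(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        float share = sharePerOutlink(1.0f, 4);
        float updated = updatedScore(0.0f, Arrays.asList(share, share));
        System.out.println("share per outlink: " + share);
        System.out.println("updated score: " + updated);
        System.out.println("dampened boost: " + dampenedBoost(4.0f, 0.5f, 2.0f));
    }
}
```

A new page starts at 0.0f in this sketch, so its score comes entirely from the shares passed along by pages that link to it.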
Uses of CrawlDatum in org.apache.nutch.scoring.orphan
Methods in org.apache.nutch.scoring.orphan with parameters of type CrawlDatum Modifier and Type Method Description void
OrphanScoringFilter. orphanedScore(Text url, CrawlDatum datum)
void
OrphanScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinks)
Used for orphan control.
Method parameters in org.apache.nutch.scoring.orphan with type arguments of type CrawlDatum Modifier and Type Method Description void
OrphanScoringFilter. updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinks)
Used for orphan control.
-
Uses of CrawlDatum in org.apache.nutch.scoring.similarity
Methods in org.apache.nutch.scoring.similarity that return CrawlDatum Modifier and Type Method Description CrawlDatum
SimilarityModel. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
SimilarityScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Methods in org.apache.nutch.scoring.similarity with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
SimilarityModel. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
SimilarityScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Method parameters in org.apache.nutch.scoring.similarity with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
SimilarityModel. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
CrawlDatum
SimilarityScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
-
Uses of CrawlDatum in org.apache.nutch.scoring.similarity.cosine
Methods in org.apache.nutch.scoring.similarity.cosine that return CrawlDatum Modifier and Type Method Description CrawlDatum
CosineSimilarity. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Methods in org.apache.nutch.scoring.similarity.cosine with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
CosineSimilarity. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Method parameters in org.apache.nutch.scoring.similarity.cosine with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
CosineSimilarity. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
-
Uses of CrawlDatum in org.apache.nutch.scoring.tld
Methods in org.apache.nutch.scoring.tld with parameters of type CrawlDatum Modifier and Type Method Description float
TLDScoringFilter. indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
-
Uses of CrawlDatum in org.apache.nutch.scoring.urlmeta
Methods in org.apache.nutch.scoring.urlmeta that return CrawlDatum Modifier and Type Method Description CrawlDatum
URLMetaScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
Methods in org.apache.nutch.scoring.urlmeta with parameters of type CrawlDatum Modifier and Type Method Description CrawlDatum
URLMetaScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
void
URLMetaScoringFilter. passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Takes the metadata specified in your "urlmeta.tags" property from the datum object and injects it into the content.
Method parameters in org.apache.nutch.scoring.urlmeta with type arguments of type CrawlDatum Modifier and Type Method Description CrawlDatum
URLMetaScoringFilter. distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object.
-
Uses of CrawlDatum in org.apache.nutch.segment
Methods in org.apache.nutch.segment with parameters of type CrawlDatum Modifier and Type Method Description boolean
SegmentMergeFilter. filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL).
boolean
SegmentMergeFilters. filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
Iterates over all SegmentMergeFilter extensions; if any of them returns false, it returns false as well.
Method parameters in org.apache.nutch.segment with type arguments of type CrawlDatum Modifier and Type Method Description boolean
SegmentMergeFilter. filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL).
boolean
SegmentMergeFilters. filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
Iterates over all SegmentMergeFilter extensions; if any of them returns false, it returns false as well.
-
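The "any filter returning false rejects the entry" behaviour described for SegmentMergeFilters can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `MergeFilterSketch` reduces the real `filter(...)` signature to a single `String` key, and stopping at the first rejecting filter is an assumption, since the description only states the final result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simplified filter. The real SegmentMergeFilter.filter also
// receives the CrawlDatum, Content, ParseData, ParseText and linked data.
interface MergeFilterSketch {
    boolean filter(String key);
}

// Hypothetical chain that applies every registered filter to a key.
class MergeFilterChainSketch {
    private final List<MergeFilterSketch> filters;

    MergeFilterChainSketch(List<MergeFilterSketch> filters) {
        this.filters = filters;
    }

    // Returns false as soon as one filter rejects the key; true otherwise.
    boolean filter(String key) {
        for (MergeFilterSketch f : filters) {
            if (!f.filter(key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MergeFilterSketch httpOnly = key -> key.startsWith("http");
        MergeFilterSketch noTmp = key -> !key.endsWith(".tmp");
        MergeFilterChainSketch chain =
            new MergeFilterChainSketch(Arrays.asList(httpOnly, noTmp));
        System.out.println(chain.filter("http://example.org/page.html"));
        System.out.println(chain.filter("http://example.org/cache.tmp"));
    }
}
```

With an empty filter list the chain accepts every key, which matches the "false only if some extension returns false" wording.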
Uses of CrawlDatum in org.apache.nutch.util
Methods in org.apache.nutch.util with parameters of type CrawlDatum Modifier and Type Method Description protected ProtocolOutput
AbstractChecker. getProtocolOutput(String url, CrawlDatum datum, boolean checkRobotsTxt)
-
Uses of CrawlDatum in org.creativecommons.nutch
Methods in org.creativecommons.nutch with parameters of type CrawlDatum Modifier and Type Method Description NutchDocument
CCIndexingFilter. filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
-