Uses of Class
org.apache.nutch.protocol.Content
-
Packages that use Content
| Package | Description |
|---|---|
| org.apache.nutch.analysis.lang | Text document language identifier. |
| org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
| org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
| org.apache.nutch.parse | The Parse interface and related classes. |
| org.apache.nutch.parse.ext | Parse wrapper to run external command to do the parsing. |
| org.apache.nutch.parse.feed | Parse RSS feeds. |
| org.apache.nutch.parse.headings | Parse filter to extract headings (h1, h2, etc.) from DOM parse tree. |
| org.apache.nutch.parse.html | An HTML document parsing plugin. |
| org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. |
| org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. |
| org.apache.nutch.parse.tika | Parse various document formats with help of Apache Tika. |
| org.apache.nutch.parse.zip | Parse ZIP files: embedded files are recursively passed to appropriate parsers. |
| org.apache.nutch.parsefilter.debug | Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML). |
| org.apache.nutch.parsefilter.naivebayes | HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the parse text's relevancy (using a training file where you can give positive and negative example texts; see the description of parsefilter.naivebayes.trainfile). A link found irrelevant gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist. |
| org.apache.nutch.parsefilter.regex | RegexParseFilter. |
| org.apache.nutch.protocol | Classes related to the Protocol interface, see also org.apache.nutch.net.protocols. |
| org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
| org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
| org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient, etc.) |
| org.apache.nutch.scoring | The ScoringFilter interface. |
| org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). |
| org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. |
| org.apache.nutch.scoring.metadata | Metadata Scoring Plugin |
| org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
| org.apache.nutch.scoring.similarity | |
| org.apache.nutch.scoring.similarity.cosine | Implements the cosine similarity metric for scoring relevant documents |
| org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring Plugin |
| org.apache.nutch.segment | A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links. |
| org.apache.nutch.tools | Miscellaneous tools. |
| org.apache.nutch.util | Miscellaneous utility classes. |
| org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
-
Uses of Content in org.apache.nutch.analysis.lang
Methods in org.apache.nutch.analysis.lang with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | HTMLLanguageParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Scan the HTML document looking at possible indications of content language. |
Uses of Content in org.apache.nutch.crawl
Methods in org.apache.nutch.crawl with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| byte[] | MD5Signature.calculate(Content content, Parse parse) | |
| abstract byte[] | Signature.calculate(Content content, Parse parse) | |
| byte[] | TextMD5Signature.calculate(Content content, Parse parse) | |
| byte[] | TextProfileSignature.calculate(Content content, Parse parse) | |
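The calculate methods above each turn a fetched page into a byte[] signature. The listing does not describe how, so the following is only a minimal sketch of an MD5-based signature, assuming the hash is taken over the raw content bytes with the URL as a fallback when no bytes are available. Md5SignatureSketch, its parameters and the fallback rule are hypothetical stand-ins, not the Nutch classes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical stand-in for a content-based signature; not the Nutch MD5Signature class. */
final class Md5SignatureSketch {

    /**
     * Hashes the raw page bytes. Falling back to the URL when no bytes
     * were fetched is an assumption made for this sketch.
     */
    static byte[] calculate(byte[] rawContent, String url) {
        byte[] data = (rawContent != null) ? rawContent : url.getBytes(StandardCharsets.UTF_8);
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Renders a digest as lowercase hexadecimal for display or comparison. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(calculate(page, "http://example.org/")));
    }
}
```

Two pages with identical bytes get identical signatures under this scheme, which is what makes such a value usable for duplicate detection.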
-
Uses of Content in org.apache.nutch.microformats.reltag
Methods in org.apache.nutch.microformats.reltag with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | RelTagParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Scan the HTML document looking at possible rel-tags |
Uses of Content in org.apache.nutch.parse
Methods in org.apache.nutch.parse with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | HtmlParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page. |
| ParseResult | HtmlParseFilters.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Run all defined filters. |
| ParseResult | Parser.getParse(Content c) | This method parses the given content and returns a map of <key, parse> pairs. |
| static boolean | ParseSegment.isTruncated(Content content) | Checks if the page's content is truncated. |
| void | ParseSegment.ParseSegmentMapper.map(WritableComparable<?> key, Content content, Mapper.Context context) | |
| ParseResult | ParseUtil.parse(Content content) | |
| ParseResult | ParseUtil.parseByExtensionId(String extId, Content content) | |
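ParseSegment.isTruncated is described only as checking whether the page's content is truncated. One plausible way to perform such a check, shown here as a sketch and not as the Nutch implementation, is to compare the length declared in the Content-Length header with the number of bytes actually stored. TruncationCheckSketch and both of its parameters are hypothetical.

```java
/** Hypothetical truncation check; not the Nutch ParseSegment implementation. */
final class TruncationCheckSketch {

    /**
     * @param contentLengthHeader value of the Content-Length header, or null if absent
     * @param actualLength        number of bytes that were actually stored
     * @return true only when the header declares more bytes than were stored
     */
    static boolean isTruncated(String contentLengthHeader, int actualLength) {
        if (contentLengthHeader == null) {
            return false; // nothing to compare against, so assume the content is complete
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false; // an unparsable header gives no evidence of truncation
        }
        return actualLength < declared;
    }

    public static void main(String[] args) {
        System.out.println(isTruncated("1000", 512)); // fewer bytes stored than declared
        System.out.println(isTruncated("512", 512));  // lengths agree
    }
}
```

Treating a missing or malformed header as "not truncated" is a design choice of this sketch: it avoids discarding pages for which the server simply sent no usable length.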
-
Uses of Content in org.apache.nutch.parse.ext
Methods in org.apache.nutch.parse.ext with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | ExtParser.getParse(Content content) | |
-
Uses of Content in org.apache.nutch.parse.feed
Methods in org.apache.nutch.parse.feed with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | FeedParser.getParse(Content content) | Parses the given feed, then extracts and parses all linked items within the feed, using the underlying ROME feed parsing library. |
Uses of Content in org.apache.nutch.parse.headings
Methods in org.apache.nutch.parse.headings with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | HeadingsParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | |
-
Uses of Content in org.apache.nutch.parse.html
Methods in org.apache.nutch.parse.html with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | HtmlParser.getParse(Content content) | |
-
Uses of Content in org.apache.nutch.parse.js
Methods in org.apache.nutch.parse.js with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | JSParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Scan the JavaScript fragments of an HTML page looking for possible Outlinks |
| ParseResult | JSParseFilter.getParse(Content c) | Parse a JavaScript file and extract outlinks |
Uses of Content in org.apache.nutch.parse.metatags
Methods in org.apache.nutch.parse.metatags with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | MetaTagsParser.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | |
-
Uses of Content in org.apache.nutch.parse.tika
Methods in org.apache.nutch.parse.tika with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | TikaParser.getParse(Content content) | |
-
Uses of Content in org.apache.nutch.parse.zip
Methods in org.apache.nutch.parse.zip with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | ZipParser.getParse(Content content) | |
-
Uses of Content in org.apache.nutch.parsefilter.debug
Methods in org.apache.nutch.parsefilter.debug with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | DebugParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | |
-
Uses of Content in org.apache.nutch.parsefilter.naivebayes
Methods in org.apache.nutch.parsefilter.naivebayes with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | NaiveBayesParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | |
-
Uses of Content in org.apache.nutch.parsefilter.regex
Methods in org.apache.nutch.parsefilter.regex with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | RegexParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | |
-
Uses of Content in org.apache.nutch.protocol
Methods in org.apache.nutch.protocol that return Content
| Modifier and Type | Method | Description |
|---|---|---|
| Content | ProtocolOutput.getContent() | |
| static Content | Content.read(DataInput in) | |
Methods in org.apache.nutch.protocol with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | ProtocolOutput.setContent(Content content) | |
Method parameters in org.apache.nutch.protocol with type arguments of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| crawlercommons.robots.BaseRobotRules | Protocol.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) | Retrieve robot rules applicable for this URL. |
| abstract crawlercommons.robots.BaseRobotRules | RobotRulesParser.getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent) | Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s). |
| crawlercommons.robots.BaseRobotRules | RobotRulesParser.getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent) | Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s). |
Constructors in org.apache.nutch.protocol with parameters of type Content
| Constructor | Description |
|---|---|
| ProtocolOutput(Content content) | |
| ProtocolOutput(Content content, ProtocolStatus status) | |
-
Uses of Content in org.apache.nutch.protocol.file
Methods in org.apache.nutch.protocol.file that return Content
| Modifier and Type | Method | Description |
|---|---|---|
| Content | FileResponse.toContent() | |
Method parameters in org.apache.nutch.protocol.file with type arguments of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| crawlercommons.robots.BaseRobotRules | File.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) | No robots parsing is done for file protocol. |
Uses of Content in org.apache.nutch.protocol.ftp
Methods in org.apache.nutch.protocol.ftp that return Content
| Modifier and Type | Method | Description |
|---|---|---|
| Content | FtpResponse.toContent() | |
Method parameters in org.apache.nutch.protocol.ftp with type arguments of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| crawlercommons.robots.BaseRobotRules | Ftp.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) | Get the robots rules for a given url |
| crawlercommons.robots.BaseRobotRules | FtpRobotRulesParser.getRobotRulesSet(Protocol ftp, URL url, List<Content> robotsTxtContent) | For hosts whose robots rules are not yet cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules and caches the rules object to avoid re-work in future. |
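The description of FtpRobotRulesParser.getRobotRulesSet outlines a fetch-once, cache-per-host pattern. A minimal sketch of that pattern follows, assuming a cache keyed on scheme and host and a caller-supplied function that fetches and parses the robots file. RobotRulesCacheSketch, the key format and the fetchAndParse function are hypothetical and are not part of Nutch or crawler-commons.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-host cache of parsed robots rules; R stands for the rules type. */
final class RobotRulesCacheSketch<R> {

    private final Map<String, R> cache = new ConcurrentHashMap<>();
    private final Function<String, R> fetchAndParse;

    /** @param fetchAndParse assumed helper that fetches and parses the robots file for a cache key */
    RobotRulesCacheSketch(Function<String, R> fetchAndParse) {
        this.fetchAndParse = fetchAndParse;
    }

    /** Returns cached rules for the URL's host, fetching them only on the first request. */
    R rulesFor(URI url) {
        if (url.getScheme() == null || url.getHost() == null) {
            throw new IllegalArgumentException("URL needs a scheme and a host: " + url);
        }
        // Assumed key format: lowercase "scheme://host", so all paths on a host share one entry.
        String key = (url.getScheme() + "://" + url.getHost()).toLowerCase(Locale.ROOT);
        return cache.computeIfAbsent(key, fetchAndParse);
    }

    public static void main(String[] args) {
        RobotRulesCacheSketch<String> cache = new RobotRulesCacheSketch<String>(key -> {
            System.out.println("fetching robots file for " + key);
            return "rules for " + key;
        });
        System.out.println(cache.rulesFor(URI.create("ftp://example.org/pub/a.txt")));
        System.out.println(cache.rulesFor(URI.create("ftp://example.org/pub/b.txt")));
    }
}
```

In the usage above the second lookup hits the cache, so the fetch function runs once for both URLs.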
Uses of Content in org.apache.nutch.protocol.http.api
Method parameters in org.apache.nutch.protocol.http.api with type arguments of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| protected void | HttpRobotRulesParser.addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse) | Append Content of robots.txt to robotsTxtContent |
| crawlercommons.robots.BaseRobotRules | HttpBase.getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) | |
| crawlercommons.robots.BaseRobotRules | HttpRobotRulesParser.getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent) | Get the rules from robots.txt which applies for the given url. |
Uses of Content in org.apache.nutch.scoring
Methods in org.apache.nutch.scoring with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | AbstractScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | |
| void | ScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | Currently a part of score distribution is performed using only data coming from the parsing process. |
| void | ScoringFilters.passScoreAfterParsing(Text url, Content content, Parse parse) | |
| void | AbstractScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | |
| void | ScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
| void | ScoringFilters.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | |
-
Uses of Content in org.apache.nutch.scoring.depth
Methods in org.apache.nutch.scoring.depth with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | DepthScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | |
| void | DepthScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | |
-
Uses of Content in org.apache.nutch.scoring.link
Methods in org.apache.nutch.scoring.link with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | LinkAnalysisScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | |
| void | LinkAnalysisScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | |
-
Uses of Content in org.apache.nutch.scoring.metadata
Methods in org.apache.nutch.scoring.metadata with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | MetadataScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | Takes the metadata, which was lumped inside the content, and replicates it within your parse data. |
| void | MetadataScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | Takes the metadata, specified in your "scoring.db.md" property, from the datum object and injects it into the content. |
Uses of Content in org.apache.nutch.scoring.opic
Methods in org.apache.nutch.scoring.opic with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | OPICScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData. |
| void | OPICScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. |
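The two OPICScoringFilter methods describe a hand-off: the score is written into the content metadata before parsing and copied from there into the parse data afterwards. The sketch below illustrates that flow with plain string maps standing in for the metadata objects. ScorePassingSketch, the map-based metadata and the SCORE_KEY value are hypothetical; only the order of the two steps is taken from the descriptions above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of carrying a score through parsing via metadata. */
final class ScorePassingSketch {

    /** Placeholder key for this sketch; not the value of Fetcher.SCORE_KEY. */
    static final String SCORE_KEY = "sketch.score";

    /** Before parsing: store the datum's float score as a string in the content metadata. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** After parsing: copy the stored score from the content metadata into the parse metadata. */
    static void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta);
    }
}
```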
Uses of Content in org.apache.nutch.scoring.similarity
Methods in org.apache.nutch.scoring.similarity with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | SimilarityScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | |
| float | SimilarityModel.setURLScoreAfterParsing(Text url, Content content, Parse parse) | |
-
Uses of Content in org.apache.nutch.scoring.similarity.cosine
Methods in org.apache.nutch.scoring.similarity.cosine with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| float | CosineSimilarity.setURLScoreAfterParsing(Text url, Content content, Parse parse) | |
-
Uses of Content in org.apache.nutch.scoring.urlmeta
Methods in org.apache.nutch.scoring.urlmeta with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | URLMetaScoringFilter.passScoreAfterParsing(Text url, Content content, Parse parse) | Takes the metadata, which was lumped inside the content, and replicates it within your parse data. |
| void | URLMetaScoringFilter.passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) | Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content. |
Uses of Content in org.apache.nutch.segment
Methods in org.apache.nutch.segment with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| boolean | SegmentMergeFilter.filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) | The filtering method which gets all information being merged for a given key (URL). |
| boolean | SegmentMergeFilters.filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) | Iterates over all SegmentMergeFilter extensions and if any of them returns false, it will return false as well. |
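SegmentMergeFilters.filter is described as iterating over all SegmentMergeFilter extensions and returning false as soon as any of them does. The following sketch shows that all-must-accept chain in isolation, with java.util.function.Predicate standing in for the filter extensions. MergeFilterChainSketch and its generic record type are hypothetical simplifications; the real method takes the eight arguments listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical filter chain: a record is kept only if every filter accepts it. */
final class MergeFilterChainSketch<T> {

    private final List<Predicate<T>> filters;

    MergeFilterChainSketch(List<Predicate<T>> filters) {
        this.filters = new ArrayList<>(filters); // defensive copy
    }

    /** Returns false as soon as one filter rejects the record, true otherwise. */
    boolean filter(T record) {
        for (Predicate<T> f : filters) {
            if (!f.test(record)) {
                return false; // short-circuit: later filters are not consulted
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(url -> url.startsWith("http"));
        filters.add(url -> !url.endsWith(".tmp"));
        MergeFilterChainSketch<String> chain = new MergeFilterChainSketch<String>(filters);
        System.out.println(chain.filter("http://example.org/index.html"));
        System.out.println(chain.filter("http://example.org/scratch.tmp"));
    }
}
```

An empty chain accepts everything, which matches the usual reading of "false if any filter returns false".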
Uses of Content in org.apache.nutch.tools
Fields in org.apache.nutch.tools declared as Content
| Modifier and Type | Field | Description |
|---|---|---|
| protected Content | AbstractCommonCrawlFormat.content | |
Methods in org.apache.nutch.tools with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| static CommonCrawlFormat | CommonCrawlFormatFactory.getCommonCrawlFormat(String formatType, String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) | Deprecated. |
| String | AbstractCommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata) | |
| String | AbstractCommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata, ParseData parseData) | |
| String | CommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata) | Returns a string representation of the JSON structure of the URL content. |
| String | CommonCrawlFormat.getJsonData(String url, Content content, Metadata metadata, ParseData parseData) | Returns a string representation of the JSON structure of the URL content. |
| String | CommonCrawlFormatWARC.getJsonData(String url, Content content, Metadata metadata, ParseData parseData) | |
Constructors in org.apache.nutch.tools with parameters of type Content
| Constructor | Description |
|---|---|
| AbstractCommonCrawlFormat(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) | |
| CommonCrawlFormatJackson(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) | |
| CommonCrawlFormatJettinson(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) | |
| CommonCrawlFormatSimple(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config) | |
| CommonCrawlFormatWARC(String url, Content content, Metadata metadata, Configuration nutchConf, CommonCrawlConfig config, ParseData parseData) | |
-
Uses of Content in org.apache.nutch.util
Methods in org.apache.nutch.util with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| void | EncodingDetector.autoDetectClues(Content content, boolean filter) | |
| String | EncodingDetector.guessEncoding(Content content, String defaultValue) | Guess the encoding with the previously specified list of clues. |
Uses of Content in org.creativecommons.nutch
Methods in org.creativecommons.nutch with parameters of type Content
| Modifier and Type | Method | Description |
|---|---|---|
| ParseResult | CCParseFilter.filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page. |
-