Package org.apache.nutch.protocol.file
Class File
- java.lang.Object
  - org.apache.nutch.protocol.file.File

- All Implemented Interfaces:
- Configurable, Pluggable, Protocol
 
public class File extends Object implements Protocol

This class is a protocol plugin used for the file: scheme. It creates a FileResponse object and gets the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in nutch-default.xml under the "file properties" section.
- Author:
- John Xing
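The effect of file.crawl.parent can be pictured with a small JDK-only sketch. It assumes the setting decides whether the crawler may leave the seed directory and climb into parent directories; that reading, along with the class name ParentCrawlSketch, the method mayFollow, and the sample paths, is an illustration and not Nutch code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration; not part of Nutch. */
final class ParentCrawlSketch {

    /**
     * Decides whether a candidate path may be followed from a seed directory.
     * crawlParent stands in for file.crawl.parent (assumed meaning: when true,
     * the crawl is not confined to the seed directory).
     */
    static boolean mayFollow(Path seedDir, Path candidate, boolean crawlParent) {
        if (crawlParent) {
            return true; // no restriction: parent directories may be visited too
        }
        Path seed = seedDir.toAbsolutePath().normalize();
        Path target = candidate.toAbsolutePath().normalize();
        // Path.startsWith compares whole name elements, so /data/docs2 is not under /data/docs
        return target.startsWith(seed);
    }

    public static void main(String[] args) {
        Path seed = Paths.get("/data/docs");
        System.out.println(mayFollow(seed, Paths.get("/data/docs/a/b.txt"), false));
        System.out.println(mayFollow(seed, Paths.get("/data/docs/.."), false));
        System.out.println(mayFollow(seed, Paths.get("/data/docs/.."), true));
    }
}
```

Normalizing both paths before the comparison means a ".." segment cannot slip past the check.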
 
Field Summary
- protected static org.slf4j.Logger LOG
- Fields inherited from interface org.apache.nutch.protocol.Protocol: X_POINT_ID
 
Constructor Summary
- File()
Method Summary
- Configuration getConf(): Get the Configuration object.
- ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum): Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
- crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent): No robots parsing is done for the file protocol.
- static void main(String[] args): Quick way for running this class.
- void setConf(Configuration conf): Set the Configuration object.
- void setMaxContentLength(int maxContentLength): Set the length after which content is truncated.
 
Method Detail

setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
- setConf in interface Configurable
 
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
- getConf in interface Configurable
 
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Set the length after which content is truncated.
- Parameters:
- maxContentLength - maximum content length in bytes
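As a rough picture of what truncation at a byte limit means, here is a minimal JDK-only sketch. The class ContentLimitSketch and the method truncate are hypothetical, and treating a negative limit as "no truncation" is an assumption, not something this page states.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper; not part of Nutch. */
final class ContentLimitSketch {

    /**
     * Returns content unchanged when it fits within maxContentLength bytes,
     * otherwise a copy cut to exactly maxContentLength bytes.
     * Assumption: a negative limit disables truncation.
     */
    static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content;
        }
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] content = "hello, nutch".getBytes(StandardCharsets.US_ASCII);
        System.out.println(truncate(content, 5).length);  // limit smaller than content
        System.out.println(truncate(content, -1).length); // negative limit: left as-is
    }
}
```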
 
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
- Specified by:
- getProtocolOutput in interface Protocol
- Parameters:
- url - Text containing the URL
- datum - the CrawlDatum object corresponding to the URL
- Returns:
- ProtocolOutput object for the content of the file indicated by url
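To make "content of the file indicated by url" concrete, the sketch below resolves a file: URL to a local path and reads its bytes using only the JDK. It is not how FileResponse is implemented; FileUrlSketch, fetch, and writeTemp are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration using only the JDK; not Nutch's FileResponse. */
final class FileUrlSketch {

    /** Resolves a file: URL to a local path and returns the file's bytes. */
    static byte[] fetch(String fileUrl) {
        Path path = Paths.get(URI.create(fileUrl));
        try {
            return Files.readAllBytes(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temporary file and returns its file: URL. */
    static String writeTemp(String text) {
        try {
            Path tmp = Files.createTempFile("sketch", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return tmp.toUri().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = writeTemp("sample content");
        System.out.println(url + " -> " + fetch(url).length + " bytes");
    }
}
```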
 
main
public static void main(String[] args) throws Exception
Quick way for running this class. Useful for debugging.
- Parameters:
- args - run with no args to print help
- Throws:
- Exception - if there is a fatal error running this class with the given input
 
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
No robots parsing is done for the file protocol, so this returns a set of empty rules that allows every URL.
- Specified by:
- getRobotRules in interface Protocol
- Parameters:
- url - URL to check
- datum - page datum
- robotsTxtContent - container to store responses when fetching the robots.txt file, for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
- robot rules (specific to this URL or default), never null
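The idea of an empty rule set that allows every URL can be sketched with a tiny stand-in type. The Rules interface and EMPTY_RULES constant below are hypothetical and only mimic the behavior described; they are not the crawler-commons BaseRobotRules API.

```java
/** Hypothetical illustration; not the crawler-commons API. */
final class AllowAllRulesSketch {

    /** Minimal stand-in for a robots rule set. */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** An empty rule set disallows nothing, so every URL is allowed. */
    static final Rules EMPTY_RULES = url -> true;

    public static void main(String[] args) {
        System.out.println(EMPTY_RULES.isAllowed("file:///var/data/report.txt"));
    }
}
```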
 
 