Package org.apache.nutch.protocol.file
Class File
- java.lang.Object
  - org.apache.nutch.protocol.file.File
- All Implemented Interfaces:
Configurable, Pluggable, Protocol
public class File extends Object implements Protocol
This class is a protocol plugin for the file: scheme. It creates a FileResponse object and gets the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in nutch-default.xml under the "file properties" section.
- Author:
John Xing
-
-
Field Summary
Fields
Modifier and Type Field
    Description
protected static org.slf4j.Logger LOG
-
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor
    Description
File()
-
Method Summary
Modifier and Type Method
    Description
Configuration getConf()
    Get the Configuration object.
ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
    Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
    No robots parsing is done for the file protocol.
static void main(String[] args)
    Quick way to run this class.
void setConf(Configuration conf)
    Set the Configuration object.
void setMaxContentLength(int maxContentLength)
    Set the length after which content is truncated.
-
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
getConf in interface Configurable
-
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Set the length after which content is truncated.
- Parameters:
maxContentLength - maximum content length, in bytes
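The truncation described above can be illustrated with a minimal, self-contained sketch. It is not Nutch's implementation: the class `ContentLimitSketch` and its `truncate` helper are hypothetical, and the treatment of a negative limit as "no truncation" is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of org.apache.nutch.protocol.file.
final class ContentLimitSketch {

    /**
     * Returns at most maxContentLength bytes of content.
     * Assumption: a negative limit means "do not truncate".
     */
    static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // already within the limit, or no limit set
        }
        // Keep only the first maxContentLength bytes.
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] body = "hello, nutch".getBytes(StandardCharsets.UTF_8);
        // Limit of 5 bytes keeps only the leading bytes of the content.
        System.out.println(new String(truncate(body, 5), StandardCharsets.UTF_8));
        // Negative limit leaves the content untouched in this sketch.
        System.out.println(truncate(body, -1).length);
    }
}
```

Truncating at a byte count can split a multi-byte character, so a caller that decodes the result as text may need to tolerate a damaged final character.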
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
- Specified by:
getProtocolOutput in interface Protocol
- Parameters:
url - Text containing the URL
datum - the CrawlDatum object corresponding to the URL
- Returns:
ProtocolOutput object for the content of the file indicated by url
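As a rough illustration of the idea behind this method (resolve a file: URL, read the file, and report the outcome), the following sketch uses only the JDK. Everything in it is an assumption made for the example: `FileFetchSketch`, `Result`, and the HTTP-like status codes are hypothetical and do not reproduce FileResponse or ProtocolOutput.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not the Nutch FileResponse/ProtocolOutput classes.
final class FileFetchSketch {
    // Assumed status codes, loosely modelled on HTTP for readability.
    static final int OK = 200;
    static final int NOT_FOUND = 404;
    static final int ERROR = 500;

    /** Hypothetical result holder: a status code plus the bytes that were read. */
    static final class Result {
        final int code;
        final byte[] content;

        Result(int code, byte[] content) {
            this.code = code;
            this.content = content;
        }
    }

    /** Reads the file named by a file: URL and maps the outcome to a status code. */
    static Result fetch(String url) {
        try {
            URI uri = URI.create(url);
            if (!"file".equalsIgnoreCase(uri.getScheme())) {
                return new Result(ERROR, new byte[0]); // only file: URLs are handled
            }
            Path path = Paths.get(uri);
            return new Result(OK, Files.readAllBytes(path));
        } catch (NoSuchFileException e) {
            return new Result(NOT_FOUND, new byte[0]);
        } catch (IOException | RuntimeException e) {
            // Covers unreadable files, directories, and malformed URLs.
            return new Result(ERROR, new byte[0]);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        Result r = fetch(tmp.toUri().toString());
        System.out.println(r.code + " " + r.content.length);
        Files.delete(tmp);
    }
}
```

Returning a status code instead of throwing keeps the caller's control flow simple, which is one plausible reason a protocol plugin would wrap the outcome in a response object.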
-
main
public static void main(String[] args) throws Exception
Quick way to run this class. Useful for debugging.
- Parameters:
args - run with no args to print help
- Throws:
Exception - if there is a fatal error running this class with the given input
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
No robots parsing is done for the file protocol, so this returns a set of empty rules that allows every URL.
- Specified by:
getRobotRules in interface Protocol
- Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
robot rules (specific to this URL or default), never null
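The contract described above (an empty rule set that allows every URL, never null, with an optional list that may be null) can be sketched without any crawler-commons dependency. The `Rules` interface and the class `AllowAllRobotsSketch` are hypothetical stand-ins introduced for this example; they are not the BaseRobotRules API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the crawler-commons or Nutch API.
final class AllowAllRobotsSketch {

    /** Hypothetical stand-in for a robot-rules type. */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** A shared, empty rule set: with no rules to match, every URL is allowed. */
    static final Rules EMPTY_RULES = new Rules() {
        @Override
        public boolean isAllowed(String url) {
            return true;
        }
    };

    /**
     * Returns the allow-all rules. No robots.txt is fetched for local files,
     * so nothing is appended to robotsTxtContent, and a null list is accepted.
     */
    static Rules getRobotRules(String url, List<String> robotsTxtContent) {
        return EMPTY_RULES;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<String>();
        Rules rules = getRobotRules("file:///tmp/example.txt", sink);
        System.out.println(rules.isAllowed("file:///tmp/example.txt"));
        System.out.println(sink.isEmpty());
    }
}
```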
-
-