Package org.apache.nutch.protocol.ftp
Class Ftp
java.lang.Object
    org.apache.nutch.protocol.ftp.Ftp

All Implemented Interfaces:
Configurable, Pluggable, Protocol
public class Ftp extends Object implements Protocol
This class is a protocol plugin used for the ftp: scheme. It creates a FtpResponse object and gets the content of the url from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details see the "FTP properties" section in nutch-default.xml.
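A minimal sketch of overriding these properties programmatically on a Hadoop Configuration before handing it to the plugin; the values shown are illustrative, not the defaults from nutch-default.xml:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  Configuration conf = NutchConfiguration.create();
  conf.set("ftp.username", "anonymous");             // login name
  conf.set("ftp.password", "anonymous@example.com"); // password sent to the server
  conf.setInt("ftp.content.limit", 65536);           // truncate content after 64 kB
  conf.setInt("ftp.timeout", 60000);                 // client timeout in milliseconds
  conf.setInt("ftp.server.timeout", 100000);         // assumed server idle timeout in milliseconds
  conf.setBoolean("ftp.keep.connection", false);     // reconnect for every url
  conf.setBoolean("ftp.follow.talk", false);         // do not log the FTP dialogue

  Ftp ftp = new Ftp();
  ftp.setConf(conf);                                 // the plugin reads the properties here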
-
-
Field Summary
Fields

Modifier and Type                     Field
protected static org.slf4j.Logger     LOG
-
Fields inherited from interface org.apache.nutch.protocol.Protocol
X_POINT_ID
-
-
Constructor Summary
Constructors

Constructor
Ftp()
-
Method Summary
All Methods   Static Methods   Instance Methods   Concrete Methods

Modifier and Type    Method and Description

protected void       finalize()

int                  getBufferSize()

Configuration        getConf()
                     Get the Configuration object

ProtocolOutput       getProtocolOutput(Text url, CrawlDatum datum)
                     Creates a FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received

crawlercommons.robots.BaseRobotRules
                     getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
                     Get the robots rules for a given url

static void          main(String[] args)
                     For debugging.

void                 setConf(Configuration conf)
                     Set the Configuration object

void                 setFollowTalk(boolean followTalk)
                     Set followTalk, i.e. whether to log the dialogue between our client and the remote server.

void                 setKeepConnection(boolean keepConnection)
                     Whether to keep the ftp connection.

void                 setMaxContentLength(int length)
                     Set the length after which content is truncated.

void                 setTimeout(int to)
                     Set the timeout.
-
-
-
Method Detail
-
setTimeout
public void setTimeout(int to)
Set the timeout.
Parameters:
to - a maximum timeout in milliseconds
-
setMaxContentLength
public void setMaxContentLength(int length)
Set the length after which content is truncated.
Parameters:
length - max content length in bytes
-
setFollowTalk
public void setFollowTalk(boolean followTalk)
Set followTalk, i.e. whether to log the dialogue between our client and the remote server. Useful for debugging.
Parameters:
followTalk - if true will follow, false by default
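For example, when diagnosing a failing fetch the dialogue log can be enabled on a configured instance; a sketch (the equivalent configuration property is ftp.follow.talk):

  Ftp ftp = new Ftp();
  ftp.setConf(conf);        // conf as in the class description above
  ftp.setFollowTalk(true);  // log the client/server FTP dialogue
  ftp.setTimeout(30000);    // optionally fail faster while debugging (30 s)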
-
setKeepConnection
public void setKeepConnection(boolean keepConnection)
Whether to keep the ftp connection. Useful if crawling the same host again and again. When set to true, it avoids the connection, login and dir list parser setup for subsequent urls. If it is set to true, however, you must make sure (roughly) that:
(1) ftp.timeout is less than ftp.server.timeout
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
Otherwise there will be too many "delete client because idled too long" messages in thread logs.
Parameters:
keepConnection - if true we will keep the connection, false by default
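For example, with fetcher.threads.fetch = 10 and fetcher.server.delay = 5 seconds the product is 50 seconds, so an ftp.timeout of 60000 ms stays above that lower bound and below an ftp.server.timeout of 100000 ms. A sketch of checking the constraints before enabling the option; the defaults read here and the seconds-to-milliseconds conversion are assumptions, not taken from this page:

  int ftpTimeout    = conf.getInt("ftp.timeout", 60000);           // ms (assumed default)
  int serverTimeout = conf.getInt("ftp.server.timeout", 100000);   // ms (assumed default)
  int fetchThreads  = conf.getInt("fetcher.threads.fetch", 10);
  float serverDelay = conf.getFloat("fetcher.server.delay", 5.0f); // assumed to be in seconds

  boolean safeToKeep = ftpTimeout < serverTimeout
      && ftpTimeout > (int) (fetchThreads * serverDelay * 1000);   // seconds -> ms

  ftp.setKeepConnection(safeToKeep);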
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates a FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
Specified by:
getProtocolOutput in interface Protocol
Parameters:
url - Text containing the ftp url
datum - The CrawlDatum object corresponding to the url
Returns:
ProtocolOutput object for the url
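A minimal, self-contained sketch of fetching a single ftp url through this method; the class name FtpFetchSketch and the url are placeholders, and error handling is omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.protocol.ProtocolOutput;
  import org.apache.nutch.protocol.ftp.Ftp;
  import org.apache.nutch.util.NutchConfiguration;

  public class FtpFetchSketch {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      Ftp ftp = new Ftp();
      ftp.setConf(conf);

      // Placeholder url; any reachable ftp: url will do.
      Text url = new Text("ftp://ftp.example.org/pub/README");
      ProtocolOutput output = ftp.getProtocolOutput(url, new CrawlDatum());

      Content content = output.getContent();
      System.out.println("status: " + output.getStatus());
      System.out.println("type:   " + content.getContentType());
      System.out.println("bytes:  " + content.getContent().length);
    }
  }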
-
main
public static void main(String[] args) throws Exception
For debugging.
Parameters:
args - run with no args for help
Throws:
Exception - if there is an error running this program
-
setConf
public void setConf(Configuration conf)
Set the Configuration object
Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object
Specified by:
getConf in interface Configurable
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Get the robots rules for a given url
Specified by:
getRobotRules in interface Protocol
Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
Returns:
robot rules (specific for this URL or default), never null
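A short sketch of obtaining the robots rules and keeping the raw robots.txt responses for inspection; apart from the parameters documented above, the names and the url are illustrative, and ftp is assumed to be an already configured Ftp instance:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.protocol.Content;
  import crawlercommons.robots.BaseRobotRules;

  List<Content> robotsResponses = new ArrayList<>();   // pass null to skip archiving
  Text url = new Text("ftp://ftp.example.org/pub/data.csv");
  BaseRobotRules rules = ftp.getRobotRules(url, new CrawlDatum(), robotsResponses);

  if (rules.isAllowed(url.toString())) {
    // the url may be fetched with getProtocolOutput(url, datum)
  }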
-
getBufferSize
public int getBufferSize()
-
-