Class Ftp

  • All Implemented Interfaces:
    Configurable, Pluggable, Protocol

    public class Ftp
    extends Object
    implements Protocol
    This class is a protocol plugin for the ftp: scheme. It creates an FtpResponse object and gets the content of the URL from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details, see the "FTP properties" section in nutch-default.xml.
    • Field Detail

      • LOG

        protected static final org.slf4j.Logger LOG
    • Constructor Detail

      • Ftp

        public Ftp()
    • Method Detail

      • setTimeout

        public void setTimeout​(int to)
        Set the timeout.
        Parameters:
        to - a maximum timeout in milliseconds
      • setMaxContentLength

        public void setMaxContentLength​(int length)
        Set the length after which content is truncated.
        Parameters:
        length - max content length in bytes
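        The truncation described above can be illustrated with a minimal, self-contained sketch. It does not use the Ftp class itself: the class ContentLimitSketch and the method truncate are hypothetical names introduced only for this example, and treating a negative limit as "no truncation" is an assumption about how the limit is interpreted, not something stated on this page.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating a max-content-length cutoff. */
public class ContentLimitSketch {

    /**
     * Returns the content cut to at most maxContentLength bytes.
     * Assumption: a negative limit means "do not truncate".
     */
    public static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // nothing to cut
        }
        // Keep only the first maxContentLength bytes.
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] body = new byte[10];
        System.out.println(truncate(body, 4).length);  // prints 4
        System.out.println(truncate(body, -1).length); // prints 10
    }
}
```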
      • setFollowTalk

        public void setFollowTalk​(boolean followTalk)
        Set followTalk, i.e. whether to log the dialogue between our client and the remote server. Useful for debugging.
        Parameters:
        followTalk - if true, the dialogue is logged; false by default
      • setKeepConnection

        public void setKeepConnection​(boolean keepConnection)
        Whether to keep the FTP connection open. Useful when crawling the same host repeatedly. When set to true, it avoids connection, login and dir list parser setup for subsequent URLs. If it is set to true, however, you must make sure (roughly) that: (1) ftp.timeout is less than ftp.server.timeout, and (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay). Otherwise there will be too many "delete client because idled too long" messages in thread logs.
        Parameters:
        keepConnection - if true, the connection is kept; false by default
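        The two rough conditions above can be checked before enabling keepConnection. The following is a sketch only: the class KeepConnectionCheck and the method isConsistent are hypothetical, the values are passed in by hand rather than read from a Nutch configuration, and it assumes that the timeouts are in milliseconds while fetcher.server.delay is in seconds.

```java
/** Hypothetical sanity check for the keepConnection timeout conditions. */
public class KeepConnectionCheck {

    /**
     * Returns true when both rough conditions hold:
     * (1) ftp.timeout < ftp.server.timeout
     * (2) ftp.timeout > fetcher.threads.fetch * fetcher.server.delay
     * Assumption: timeouts are in milliseconds, the delay is in seconds.
     */
    public static boolean isConsistent(long ftpTimeoutMs, long ftpServerTimeoutMs,
                                       int fetcherThreads, double fetcherServerDelaySec) {
        // Convert the product to milliseconds so both sides use the same unit.
        double busyWindowMs = fetcherThreads * fetcherServerDelaySec * 1000.0;
        boolean belowServerTimeout = ftpTimeoutMs < ftpServerTimeoutMs; // condition (1)
        boolean aboveBusyWindow = ftpTimeoutMs > busyWindowMs;          // condition (2)
        return belowServerTimeout && aboveBusyWindow;
    }

    public static void main(String[] args) {
        // Illustrative values only, not recommended settings.
        System.out.println(isConsistent(60_000, 100_000, 10, 1.0));  // prints true
        System.out.println(isConsistent(60_000, 100_000, 100, 1.0)); // prints false
    }
}
```

        In the second call, 100 threads with a 1 second delay give a 100,000 ms window, which exceeds the 60,000 ms timeout, so condition (2) fails.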
      • finalize

        protected void finalize()
        Overrides:
        finalize in class Object
      • main

        public static void main​(String[] args)
                         throws Exception
        For debugging.
        Parameters:
        args - run with no args for help
        Throws:
        Exception - if there is an error running this program
      • getRobotRules

        public crawlercommons.robots.BaseRobotRules getRobotRules​(Text url,
                                                                  CrawlDatum datum,
                                                                  List<Content> robotsTxtContent)
        Get the robots rules for a given URL.
        Specified by:
        getRobotRules in interface Protocol
        Parameters:
        url - URL to check
        datum - page datum
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
      • getBufferSize

        public int getBufferSize()