Class FtpRobotRulesParser

  • All Implemented Interfaces:
    Configurable, Tool

    public class FtpRobotRulesParser
    extends RobotRulesParser
    This class is used for parsing robots.txt files for URLs belonging to the FTP protocol. It extends the generic RobotRulesParser class and contains the FTP protocol specific implementation for obtaining the robots.txt file.
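
    The sketch below is only an illustration of constructing this parser directly from a Hadoop Configuration; in a running Nutch crawl the parser is normally driven by the FTP protocol plugin rather than instantiated by hand, and the plain new Configuration() used here stands in for a fully populated Nutch configuration.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.protocol.ftp.FtpRobotRulesParser;

        public class FtpRobotRulesParserConstruction {
          public static void main(String[] args) {
            // Illustrative only: a plain Hadoop Configuration; a real Nutch
            // setup would carry the usual Nutch properties (agent name,
            // robots handling options, etc.).
            Configuration conf = new Configuration();

            // The FTP-specific parser; caching of parsed rules is inherited
            // from the generic RobotRulesParser superclass.
            FtpRobotRulesParser parser = new FtpRobotRulesParser(conf);
            System.out.println("Created " + parser.getClass().getSimpleName());
          }
        }
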
    • Constructor Detail

      • FtpRobotRulesParser

        public FtpRobotRulesParser​(Configuration conf)
    • Method Detail

      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol ftp,
                                                                     URL url,
                                                                     List<Content> robotsTxtContent)
        For hosts whose robots rules have not yet been cached, this method sends an FTP request to the host corresponding to the passed URL, fetches the robots.txt file, parses the rules, and caches the resulting rules object to avoid repeating the work in the future.
        Specified by:
        getRobotRulesSet in class RobotRulesParser
        Parameters:
        ftp - the Protocol object
        url - the URL for which the robots rules are to be obtained
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
        Returns:
        a BaseRobotRules object containing the parsed rules
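
        A minimal sketch of calling getRobotRulesSet directly is shown below. It assumes that the FTP protocol plugin (org.apache.nutch.protocol.ftp.Ftp) can be instantiated and configured via setConf; in practice that object is usually obtained through Nutch's ProtocolFactory, and the URL used here is purely illustrative.

            import java.net.URL;
            import java.util.LinkedList;
            import java.util.List;

            import org.apache.hadoop.conf.Configuration;
            import org.apache.nutch.protocol.Content;
            import org.apache.nutch.protocol.ftp.Ftp;
            import org.apache.nutch.protocol.ftp.FtpRobotRulesParser;

            import crawlercommons.robots.BaseRobotRules;

            public class FtpRobotRulesExample {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Assumed here: the Ftp protocol plugin is created and
                // configured directly; normally Nutch's ProtocolFactory
                // takes care of this.
                Ftp ftp = new Ftp();
                ftp.setConf(conf);

                FtpRobotRulesParser parser = new FtpRobotRulesParser(conf);

                URL url = new URL("ftp://example.org/pub/data/file.txt");

                // Optional container capturing the raw robots.txt response(s)
                // (or redirects / error pages); pass null if not needed.
                List<Content> robotsTxtContent = new LinkedList<>();

                BaseRobotRules rules =
                    parser.getRobotRulesSet(ftp, url, robotsTxtContent);

                // BaseRobotRules (crawler-commons) answers allow/deny
                // questions and exposes an optional crawl delay.
                System.out.println("allowed: " + rules.isAllowed(url.toString()));
                System.out.println("crawl delay: " + rules.getCrawlDelay());
              }
            }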