Package org.apache.nutch.protocol.ftp
Class FtpRobotRulesParser
- java.lang.Object
  - org.apache.nutch.protocol.RobotRulesParser
    - org.apache.nutch.protocol.ftp.FtpRobotRulesParser
- All Implemented Interfaces:
Configurable, Tool
public class FtpRobotRulesParser extends RobotRulesParser
This class is used for parsing robots files for URLs belonging to the FTP protocol. It extends the generic RobotRulesParser
class and contains the FTP-protocol-specific implementation for obtaining the robots file.
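A minimal usage sketch, assuming the FTP protocol plugin class org.apache.nutch.protocol.ftp.Ftp and a default configuration obtained from NutchConfiguration.create(); the FTP URL is hypothetical:

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.ftp.Ftp;
import org.apache.nutch.protocol.ftp.FtpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class FtpRobotsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();        // default Nutch configuration (assumption)

    Ftp ftp = new Ftp();                                      // FTP protocol plugin class (assumption)
    ftp.setConf(conf);                                        // Protocol extends Configurable

    FtpRobotRulesParser parser = new FtpRobotRulesParser(conf);

    URL url = new URL("ftp://ftp.example.org/pub/data.txt");  // hypothetical URL
    // Passing null as robotsTxtContent: the fetched robots response is not archived.
    BaseRobotRules rules = parser.getRobotRulesSet(ftp, url, null);

    System.out.println("Allowed: " + rules.isAllowed(url.toString()));
  }
}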
-
-
Field Summary
-
Fields inherited from class org.apache.nutch.protocol.RobotRulesParser
agentNames, allowList, CACHE, conf, DEFER_VISIT_RULES, EMPTY_RULES, FORBID_ALL_RULES, maxNumRedirects
-
-
Constructor Summary
Constructors
FtpRobotRulesParser(Configuration conf)
-
Method Summary
crawlercommons.robots.BaseRobotRules
getRobotRulesSet(Protocol ftp, URL url, List<Content> robotsTxtContent)
For hosts whose robots rules are not yet cached, this method sends an FTP request to the host of the passed URL, gets the robots file, parses the rules, and caches the rules object to avoid re-work in the future.
Methods inherited from class org.apache.nutch.protocol.RobotRulesParser
getConf, getRobotRulesSet, isAllowListed, main, parseRules, parseRules, run, setConf
-
-
-
-
Constructor Detail
-
FtpRobotRulesParser
public FtpRobotRulesParser(Configuration conf)
-
-
Method Detail
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol ftp, URL url, List<Content> robotsTxtContent)
For hosts whose robots rules are not yet cached, this method sends an FTP request to the host of the passed URL, gets the robots file, parses the rules, and caches the rules object to avoid re-work in the future.
- Specified by:
getRobotRulesSet in class RobotRulesParser
- Parameters:
ftp - the Protocol object
url - the URL
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
robotRules - a BaseRobotRules object for the rules
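As an illustration of the robotsTxtContent parameter, a hedged sketch that collects the fetched response(s) instead of passing null; the Ftp instance, configuration, and host are assumptions as in the earlier example:

import java.net.URL;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.ftp.Ftp;
import org.apache.nutch.protocol.ftp.FtpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class FtpRobotsArchiveExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();     // default Nutch configuration (assumption)
    Ftp ftp = new Ftp();                                   // assumed FTP protocol plugin class
    ftp.setConf(conf);

    FtpRobotRulesParser parser = new FtpRobotRulesParser(conf);
    URL url = new URL("ftp://ftp.example.org/pub/");       // hypothetical host

    // Collect the robots.txt response(s); the list may hold redirects or an error page instead of an actual robots.txt.
    List<Content> robotsTxtContent = new LinkedList<>();
    BaseRobotRules rules = parser.getRobotRulesSet(ftp, url, robotsTxtContent);

    for (Content c : robotsTxtContent) {
      System.out.println("Captured response from: " + c.getUrl());
    }
    System.out.println("Crawl delay: " + rules.getCrawlDelay());
  }
}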
-
-