Class RobotRulesParser

    • Field Detail

      • CACHE

        protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
      • EMPTY_RULES

        public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      • FORBID_ALL_RULES

        public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      • DEFER_VISIT_RULES

        public static final crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
        A BaseRobotRules object appropriate for use when fetching the robots.txt file failed with a 503 "Service Unavailable" (or other 5xx) status code. The crawler should suspend crawling for a limited time; see the property http.robots.503.defer.visits.
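        The three constants above correspond to different outcomes of the robots.txt fetch. The following is a minimal sketch of that correspondence, not the actual implementation. The class RobotsStatusSketch, the Outcome enum, and the deferOn5xx flag (standing in for the http.robots.503.defer.visits property, assumed here to be a boolean) are hypothetical, as is the fallback to allow-all for any status not described above.

```java
final class RobotsStatusSketch {

    /** Hypothetical labels; the last three mirror the constants documented above. */
    enum Outcome { PARSE_CONTENT, EMPTY_RULES, FORBID_ALL_RULES, DEFER_VISIT_RULES }

    /**
     * Chooses an outcome from the HTTP status of the robots.txt fetch.
     * deferOn5xx models the http.robots.503.defer.visits property (assumption).
     */
    static Outcome outcomeFor(int httpStatus, boolean deferOn5xx) {
        if (httpStatus == 200) {
            return Outcome.PARSE_CONTENT;      // robots.txt was fetched: parse it
        }
        if (httpStatus == 403) {
            return Outcome.FORBID_ALL_RULES;   // forbidden: all requests disallowed
        }
        if (httpStatus >= 500 && httpStatus <= 599 && deferOn5xx) {
            return Outcome.DEFER_VISIT_RULES;  // server error: suspend crawling for a while
        }
        // Missing file (e.g. 404) or any other status: allow all (assumption).
        return Outcome.EMPTY_RULES;
    }

    public static void main(String[] args) {
        System.out.println(outcomeFor(404, true));
        System.out.println(outcomeFor(403, true));
        System.out.println(outcomeFor(503, true));
    }
}
```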
      • agentNames

        protected Set<String> agentNames
      • maxNumRedirects

        protected int maxNumRedirects
      • allowList

        protected Set<String> allowList
        Set of host names or IPs to be explicitly excluded from robots.txt checking.
    • Constructor Detail

      • RobotRulesParser

        public RobotRulesParser()
      • RobotRulesParser

        public RobotRulesParser(Configuration conf)
    • Method Detail

      • isAllowListed

        public boolean isAllowListed(URL url)
        Check whether a URL belongs to an allowlisted host.
        Parameters:
        url - a URL to check against rules
        Returns:
        true if always allowed (robots.txt rules are ignored), false otherwise
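        A host-based allowlist check of this kind can be sketched as follows. This is an illustration only: the class AllowListSketch and its static helpers are hypothetical, and the exact, case-insensitive host match is an assumption (the real parser may match hosts differently).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

final class AllowListSketch {

    /** Returns true if the URL's host is in the allowlist (exact match on the lower-cased host; an assumption). */
    static boolean isAllowListed(Set<String> allowList, URL url) {
        String host = url.getHost();
        return host != null && allowList.contains(host.toLowerCase(Locale.ROOT));
    }

    /** Convenience overload that parses the URL string first. */
    static boolean isAllowListed(Set<String> allowList, String url) {
        try {
            return isAllowListed(allowList, URI.create(url).toURL());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        // Allowlist entries are assumed to be lower-case host names or IPs.
        Set<String> allowList = Set.of("localhost", "192.168.0.10");
        System.out.println(isAllowListed(allowList, "http://localhost:8080/index.html"));
        System.out.println(isAllowListed(allowList, "https://example.org/"));
    }
}
```

        For an allowlisted host the caller would skip fetching and evaluating robots.txt altogether.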
      • parseRules

        public crawlercommons.robots.BaseRobotRules parseRules(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               String robotName)
        Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons.
        Parameters:
        url - The robots.txt URL
        content - Contents of the robots file in a byte array
        contentType - The content type of the robots file
        robotName - A string containing all the robot agent names used by the parser for matching
        Returns:
        a BaseRobotRules object
      • parseRules

        public crawlercommons.robots.BaseRobotRules parseRules(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               Collection<String> robotNames)
        Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons.
        Parameters:
        url - The robots.txt URL
        content - Contents of the robots file in a byte array
        contentType - The content type of the robots file
        robotNames - A collection containing all the robot agent names used by the parser for matching
        Returns:
        a BaseRobotRules object
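        The two parseRules overloads differ only in how the agent names are passed: as a single string or as a collection. The sketch below shows one plausible way to turn the single-string form into a collection. The class AgentNamesSketch and the method toAgentNames are hypothetical, and both the comma separator and the lower-casing are assumptions that this page does not confirm.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class AgentNamesSketch {

    /**
     * Splits a comma-separated agent-name string into trimmed, lower-cased
     * names, dropping empty entries and duplicates while keeping the order.
     */
    static Set<String> toAgentNames(String robotName) {
        Set<String> names = new LinkedHashSet<>();
        if (robotName == null) {
            return names;
        }
        for (String part : robotName.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(toAgentNames("MyCrawler, Nutch"));
    }
}
```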
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                                     Text url,
                                                                     List<Content> robotsTxtContent)
        Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
        Parameters:
        protocol - Protocol
        url - URL to check
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
      • getRobotRulesSet

        public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                                              URL url,
                                                                              List<Content> robotsTxtContent)
        Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
        Parameters:
        protocol - Protocol
        url - URL to check
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
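        The handling of the robotsTxtContent container described above (every response is appended, redirects and error pages included, and a null list disables capture) can be sketched as follows. This is not the actual implementation: the class RobotsFetchSketch, the Response type (a stand-in for Content), the fetcher function, and the demo URLs are all hypothetical, and the use of maxNumRedirects as an upper bound on followed redirects is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class RobotsFetchSketch {

    /** Hypothetical response type standing in for a fetched Content object. */
    static final class Response {
        final String url;
        final int status;
        final String location; // redirect target, or null

        Response(String url, int status, String location) {
            this.url = url;
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Fetches url and follows at most maxNumRedirects redirects. Each response
     * is appended to captured unless captured is null. Returns the last response.
     */
    static Response fetchRobotsTxt(String url, int maxNumRedirects,
            Function<String, Response> fetcher, List<Response> captured) {
        Response response = fetcher.apply(url);
        if (captured != null) {
            captured.add(response);
        }
        int redirects = 0;
        while (response.status >= 300 && response.status < 400
                && response.location != null && redirects < maxNumRedirects) {
            redirects++;
            response = fetcher.apply(response.location);
            if (captured != null) {
                captured.add(response);
            }
        }
        return response;
    }

    /** Fake fetcher for the demo: one redirect from http to https, 404 for anything else. */
    static Function<String, Response> demoFetcher() {
        Map<String, Response> pages = Map.of(
            "http://example.com/robots.txt",
            new Response("http://example.com/robots.txt", 301, "https://example.com/robots.txt"),
            "https://example.com/robots.txt",
            new Response("https://example.com/robots.txt", 200, null));
        return url -> pages.getOrDefault(url, new Response(url, 404, null));
    }

    public static void main(String[] args) {
        List<Response> captured = new ArrayList<>();
        Response last = fetchRobotsTxt("http://example.com/robots.txt", 5, demoFetcher(), captured);
        System.out.println(last.status + " after " + captured.size() + " responses");
    }
}
```

        Passing null as the last argument runs the same fetch without storing any responses.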
      • run

        public int run(String[] args)
        Specified by:
        run in interface Tool