Class RobotRulesParser

    • Field Detail

      • CACHE

        protected static final Hashtable<String,​crawlercommons.robots.BaseRobotRules> CACHE
      • EMPTY_RULES

        public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      • FORBID_ALL_RULES

        public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      • DEFER_VISIT_RULES

        public static final crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file could not be fetched because of a 503 "Service Unavailable" (or other 5xx) response. The crawler should suspend crawling of the affected host for a limited time; see the property http.robots.503.defer.visits and the usage sketch at the end of this field list.
      • agentNames

        protected Set<String> agentNames
      • maxNumRedirects

        protected int maxNumRedirects
      • allowList

        protected Set<String> allowList
        Set of host names or IPs explicitly excluded from robots.txt checking; robots.txt rules are ignored for these hosts.
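        The three rule sets above (EMPTY_RULES, FORBID_ALL_RULES, DEFER_VISIT_RULES) are ordinary crawler-commons BaseRobotRules objects and can be queried like any parsed rule set. A minimal sketch (URL and handling are illustrative only; isAllowed and getCrawlDelay are crawler-commons accessors, not specific to this class):

          crawlercommons.robots.BaseRobotRules rules = RobotRulesParser.EMPTY_RULES;

          // EMPTY_RULES allows every request:
          boolean allowed = rules.isAllowed("https://www.example.org/page.html");         // true here
          // FORBID_ALL_RULES denies every request:
          RobotRulesParser.FORBID_ALL_RULES.isAllowed("https://www.example.org/page.html"); // false
          long crawlDelay = rules.getCrawlDelay();  // crawl delay, if one was specified

          // One way a fetcher may recognize the "defer visits" case (sketch only):
          if (rules == RobotRulesParser.DEFER_VISIT_RULES) {
            // suspend crawling of this host for a while (see http.robots.503.defer.visits)
          }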
    • Constructor Detail

      • RobotRulesParser

        public RobotRulesParser()
      • RobotRulesParser

        public RobotRulesParser​(Configuration conf)
    • Method Detail

      • isAllowListed

        public boolean isAllowListed​(URL url)
        Check whether a URL belongs to an allowlisted host.
        Parameters:
        url - a URL to check against rules
        Returns:
        true if always allowed (robots.txt rules are ignored), false otherwise
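        A minimal usage sketch. RobotRulesParser is abstract, so the anonymous subclass below only satisfies the abstract getRobotRulesSet method; in a real crawl the concrete implementations of the protocol plugins are used, and the allowlist is read from the passed Configuration. Imports of java.net.URL, java.util.List, the Hadoop Configuration, and the Nutch Protocol, Content, and NutchConfiguration classes are assumed; exception handling is omitted:

          Configuration conf = NutchConfiguration.create();
          RobotRulesParser parser = new RobotRulesParser(conf) {
            @Override
            public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                URL url, List<Content> robotsTxtContent) {
              return EMPTY_RULES; // placeholder implementation for this sketch
            }
          };

          URL url = new URL("https://www.example.org/index.html");
          if (parser.isAllowListed(url)) {
            // the host is allowlisted: robots.txt rules are ignored for it
          }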
      • parseRules

        @Deprecated
        public crawlercommons.robots.BaseRobotRules parseRules​(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               String robotName)
        Deprecated.
        Parses the robots content using the SimpleRobotRulesParser from crawler-commons
        Parameters:
        url - The robots.txt URL
        content - Contents of the robots file in a byte array
        contentType - The content type of the robots file
        robotName - A string containing all the robot agent names used by the parser for matching
        Returns:
        BaseRobotRules object
      • parseRules

        public crawlercommons.robots.BaseRobotRules parseRules​(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               Collection<String> robotNames)
        Parses the robots content using the SimpleRobotRulesParser from crawler-commons
        Parameters:
        url - The robots.txt URL
        content - Contents of the robots file in a byte array
        contentType - The content type of the robots file
        robotNames - A collection containing all the robot agent names used by the parser for matching
        Returns:
        BaseRobotRules object
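        Continuing the sketch shown under isAllowListed, the raw bytes of a fetched robots.txt file can be parsed directly (file name, URL, and agent names are illustrative; exception handling is omitted):

          byte[] robotsTxt = Files.readAllBytes(Paths.get("robots.txt"));
          crawlercommons.robots.BaseRobotRules rules = parser.parseRules(
              "https://www.example.org/robots.txt",   // robots.txt URL
              robotsTxt,                              // raw content of the robots.txt file
              "text/plain",                           // content type
              Arrays.asList("mybot", "nutch"));       // agent names used for matching

          boolean allowed = rules.isAllowed("https://www.example.org/private/page.html");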
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                     Text url,
                                                                     List<Content> robotsTxtContent)
        Fetch the robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
        Parameters:
        protocol - Protocol
        url - URL to check
        robotsTxtContent - container to store responses when fetching the robots.txt file, for debugging or archival purposes. Instead of a robots.txt file, it may contain redirects or an error page (404, etc.). Response Content is appended to the passed list; if null is passed, nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
      • getRobotRulesSet

        public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                              URL url,
                                                                              List<Content> robotsTxtContent)
        Fetch the robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
        Parameters:
        protocol - Protocol
        url - URL to check
        robotsTxtContent - container to store responses when fetching the robots.txt file, for debugging or archival purposes. Instead of a robots.txt file, it may contain redirects or an error page (404, etc.). Response Content is appended to the passed list; if null is passed, nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
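        A sketch of typical usage, assuming parser is a concrete implementation (in practice the protocol plugins supply one) and protocol is the Protocol instance responsible for the URL; both names are placeholders, not part of this API:

          // Pass a list here to archive the fetched robots.txt responses; null stores nothing.
          List<Content> robotsTxtContent = null;
          URL url = new URL("https://www.example.org/some/page.html");
          crawlercommons.robots.BaseRobotRules rules =
              parser.getRobotRulesSet(protocol, url, robotsTxtContent);

          if (rules.isAllowed(url.toString())) {
            // fetch the page, honouring rules.getCrawlDelay() if one is set
          }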
      • run

        public int run​(String[] args)
        Specified by:
        run in interface Tool
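        Because the class implements the Hadoop Tool interface, a concrete subclass can be invoked through ToolRunner. A minimal sketch; MyRobotRulesParser is a placeholder for a concrete implementation, and the command-line arguments are passed through unchanged:

          public static void main(String[] args) throws Exception {
            // MyRobotRulesParser is a placeholder, not a class of this API.
            int res = org.apache.hadoop.util.ToolRunner.run(
                org.apache.nutch.util.NutchConfiguration.create(),
                new MyRobotRulesParser(), args);
            System.exit(res);
          }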