Class HttpRobotRulesParser

  • All Implemented Interfaces:
    Configurable, Tool

    public class HttpRobotRulesParser
    extends RobotRulesParser
    This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP protocol-specific implementation for obtaining the robots.txt file.
    • Field Detail

      • allowForbidden

        protected boolean allowForbidden
      • deferVisits503

        protected boolean deferVisits503
    • Constructor Detail

      • HttpRobotRulesParser

        public HttpRobotRulesParser​(Configuration conf)
    • Method Detail

      • getCacheKey

        protected static String getCacheKey​(URL url)
        Compose a unique key to store and access robot rules in the cache for the given URL
        Parameters:
        url - the URL to generate a unique key for
        Returns:
        the unique cache key
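        A minimal sketch of how such a key could be composed from the URL's protocol, host, and port (falling back to the protocol's default port when the URL does not specify one); the helper below is illustrative only and not the actual Nutch implementation:

          import java.net.URL;

          public class CacheKeySketch {

            // Illustrative only: build a "protocol:host:port" key for robots.txt caching.
            static String cacheKey(URL url) {
              String protocol = url.getProtocol().toLowerCase();
              String host = url.getHost().toLowerCase();
              int port = url.getPort();
              if (port == -1) {
                port = url.getDefaultPort(); // e.g. 80 for http, 443 for https
              }
              return protocol + ":" + host + ":" + port;
            }

            public static void main(String[] args) throws Exception {
              System.out.println(cacheKey(new URL("https://example.com/a/page.html")));
              // prints: https:example.com:443
            }
          }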
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol http,
                                                                     URL url,
                                                                     List<Content> robotsTxtContent)
        Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it.

        Following RFC 9309, section 2.3.1.2 (Redirects), up to five consecutive HTTP redirects are followed when fetching the robots.txt file. The maximum number of redirects followed is configurable via the property http.robots.redirect.max.

        Specified by:
        getRobotRulesSet in class RobotRulesParser
        Parameters:
        http - The Protocol object
        url - the URL for which robots.txt rules are looked up
        robotsTxtContent - container to store responses when fetching the robots.txt file, for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
        Returns:
        a BaseRobotRules object containing the rules which apply to the given URL
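        A hedged usage sketch of this method: the Configuration and the HTTP Protocol plugin instance are assumed to be provided by the surrounding Nutch setup, and the package names in the imports are assumptions based on common Nutch and crawler-commons layouts:

          import java.net.URL;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.nutch.protocol.Content;
          import org.apache.nutch.protocol.Protocol;
          import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;

          import crawlercommons.robots.BaseRobotRules;

          public class RobotRulesLookupSketch {

            // `conf` and `httpProtocol` are assumed to be supplied by the caller.
            static boolean isFetchAllowed(Configuration conf, Protocol httpProtocol,
                String pageUrl) throws Exception {
              HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

              // Collect the raw robots.txt responses (including redirects or error
              // pages) for debugging or archiving; pass null if they are not needed.
              List<Content> robotsTxtContent = new ArrayList<>();

              URL url = new URL(pageUrl);
              BaseRobotRules rules = parser.getRobotRulesSet(httpProtocol, url,
                  robotsTxtContent);

              return rules.isAllowed(url.toString());
            }
          }

        Because the rules are cached per host, protocol, and port, repeated calls for URLs on the same combination do not trigger another robots.txt fetch.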
      • addRobotsContent

        protected void addRobotsContent​(List<Content> robotsTxtContent,
                                        URL robotsUrl,
                                        Response robotsResponse)
        Append the robots.txt response Content to robotsTxtContent
        Parameters:
        robotsTxtContent - container to store robots.txt response content
        robotsUrl - robots.txt URL
        robotsResponse - response object to be stored
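        For illustration, a consumer-side sketch of the robotsTxtContent list this method fills; the Content accessors getUrl() and getContent() are assumed from Nutch's Content class:

          import java.nio.charset.StandardCharsets;
          import java.util.List;

          import org.apache.nutch.protocol.Content;

          public class RobotsArchiveSketch {

            // Print every captured robots.txt response (or redirect/error page)
            // that was appended to the list passed into getRobotRulesSet().
            static void dump(List<Content> robotsTxtContent) {
              for (Content c : robotsTxtContent) {
                System.out.println("== " + c.getUrl() + " ==");
                System.out.println(new String(c.getContent(), StandardCharsets.UTF_8));
              }
            }
          }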