Class HttpRobotRulesParser

java.lang.Object
  org.apache.nutch.protocol.RobotRulesParser
    org.apache.nutch.protocol.http.api.HttpRobotRulesParser

All Implemented Interfaces:
  Configurable, Tool

public class HttpRobotRulesParser extends RobotRulesParser

This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-specific implementation for obtaining the robots.txt file.
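
A minimal construction sketch, assuming a Nutch configuration created via NutchConfiguration; the surrounding setup is illustrative only, not a prescribed usage pattern:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
    import org.apache.nutch.util.NutchConfiguration;

    // The parser is driven by the Hadoop Configuration it is constructed
    // with (robots.txt caching, redirect handling, error handling).
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser robotRulesParser = new HttpRobotRulesParser(conf);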

Field Summary

Fields
  protected boolean allowForbidden
  protected boolean deferVisits503

Fields inherited from class org.apache.nutch.protocol.RobotRulesParser
  agentNames, allowList, CACHE, conf, DEFER_VISIT_RULES, EMPTY_RULES, FORBID_ALL_RULES, maxNumRedirects

Constructor Summary

Constructors
  HttpRobotRulesParser(Configuration conf)

Method Summary

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)
  Append Content of robots.txt to robotsTxtContent

protected static String getCacheKey(URL url)
  Compose unique key to store and access robot rules in cache for given URL

crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)
  Get the rules from robots.txt which apply for the given url

void setConf(Configuration conf)
  Set the Configuration object

Methods inherited from class org.apache.nutch.protocol.RobotRulesParser
  getConf, getRobotRulesSet, isAllowListed, main, parseRules, parseRules, run

Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(Configuration conf)

Method Detail

setConf

public void setConf(Configuration conf)

Description copied from class: RobotRulesParser
Set the Configuration object

Specified by:
  setConf in interface Configurable
Overrides:
  setConf in class RobotRulesParser
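
The protected fields allowForbidden and deferVisits503 listed in the field summary are presumably populated from the configuration when setConf is applied. The property names in the sketch below are assumptions based on Nutch's conventional robots.txt settings and should be verified against nutch-default.xml for the version in use:

    // Assumed property names; confirm them in nutch-default.xml for your version.
    Configuration conf = NutchConfiguration.create();
    conf.setBoolean("http.robots.403.allow", true);         // presumably backs allowForbidden
    conf.setBoolean("http.robots.503.defer.visits", true);  // presumably backs deferVisits503

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    parser.setConf(conf);  // may also be called directly, as when used as a Configurable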

getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL

Parameters:
  url - the URL to generate a unique key for
Returns:
  the unique cache key
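
getCacheKey is protected and mainly of interest to subclasses. The snippet below only illustrates the kind of key described above, i.e. a combination of protocol, host, and port; the exact format produced by the method is an internal detail and may differ:

    // Illustration of a protocol/host/port cache key; not the actual implementation.
    URL url = new URL("https://example.org:8443/some/page.html");
    String key = url.getProtocol() + ":" + url.getHost()
        + (url.getPort() == -1 ? "" : ":" + url.getPort());
    // key -> "https:example.org:8443"; URLs sharing this key share the cached rules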

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)

Get the rules from robots.txt which apply for the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch <protocol>://<host>:<port>/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing.

Following RFC 9309, section 2.3.1.2 (Redirects), up to five consecutive HTTP redirects are followed when fetching the robots.txt file. The maximum number of redirects followed is configurable via the property http.robots.redirect.max.

Specified by:
  getRobotRulesSet in class RobotRulesParser
Parameters:
  http - the Protocol object
  url - URL
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
Returns:
  robotRules - a BaseRobotRules object for the rules
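
A usage sketch, assuming the Protocol instance is obtained via Nutch's ProtocolFactory with the protocol-http plugin enabled; the page URL is hypothetical and exception handling is collapsed into the throws clause:

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.protocol.Protocol;
    import org.apache.nutch.protocol.ProtocolFactory;
    import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
    import org.apache.nutch.util.NutchConfiguration;

    import crawlercommons.robots.BaseRobotRules;

    public class RobotRulesCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        String pageUrl = "https://example.org/index.html";  // hypothetical URL

        // Resolve the Protocol plugin responsible for this URL.
        Protocol http = new ProtocolFactory(conf).getProtocol(pageUrl);

        // Optional container that receives the fetched robots.txt responses;
        // pass null if they are not needed.
        List<Content> robotsTxtContent = new ArrayList<>();

        HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
        BaseRobotRules rules =
            parser.getRobotRulesSet(http, new URL(pageUrl), robotsTxtContent);

        System.out.println("allowed:     " + rules.isAllowed(pageUrl));
        System.out.println("crawl delay: " + rules.getCrawlDelay());
      }
    }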

addRobotsContent

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)

Append the Content of robots.txt to robotsTxtContent

Parameters:
  robotsTxtContent - container to store robots.txt response content
  robotsUrl - robots.txt URL
  robotsResponse - response object to be stored
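
This helper is invoked internally while getRobotRulesSet resolves robots.txt, whenever a non-null robotsTxtContent list is passed. A small illustrative loop over the captured entries (the Content accessors are standard Nutch API; the reporting itself is only an assumed use):

    // After getRobotRulesSet(http, url, robotsTxtContent) has returned:
    for (Content c : robotsTxtContent) {
      // Each entry is one response fetched while resolving robots.txt,
      // possibly a redirect or an error page rather than the rules file itself.
      System.out.println(c.getUrl() + " (" + c.getContentType() + "), "
          + c.getContent().length + " bytes");
    }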