Package org.apache.nutch.protocol
Class RobotRulesParser
- java.lang.Object
  - org.apache.nutch.protocol.RobotRulesParser
- All Implemented Interfaces:
  Configurable, Tool
- Direct Known Subclasses:
  FtpRobotRulesParser, HttpRobotRulesParser
public abstract class RobotRulesParser extends Object implements Tool

This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions as described in SimpleRobotRulesParser. Protocol-specific implementations have to implement the abstract method getRobotRulesSet(Protocol, URL, List<Content>).
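For orientation, the sketch below shows how the emitted rules are typically consulted before fetching a page. It is a minimal, hypothetical helper: the concrete parser (e.g. HttpRobotRulesParser) and the Protocol instance are assumed to come from Nutch's configuration and plugin machinery, and the crawl-delay policy at the end is purely illustrative.

import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Hypothetical helper, not part of Nutch; illustrates the call sequence only. */
public class RobotRulesUsageSketch {
  public static boolean mayFetch(RobotRulesParser parser, Protocol protocol, String url) {
    // Implementations resolve and cache one rule set per host (see the CACHE field),
    // so this call is cheap after the first URL of a host.
    BaseRobotRules rules = parser.getRobotRulesSet(protocol, new Text(url), null);
    if (!rules.isAllowed(url)) {
      return false;                      // disallowed for the configured agent name(s)
    }
    long delay = rules.getCrawlDelay();  // Crawl-delay requested by the host, if any
    return delay <= 5000;                // illustrative policy: skip hosts asking for very long delays
  }
}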
-
-
Field Summary
Fields
- protected Set<String> agentNames
- protected Set<String> allowList
  Set of host names or IPs to be explicitly excluded from robots.txt checking.
- protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
- protected Configuration conf
- static crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code.
- static crawlercommons.robots.BaseRobotRules EMPTY_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
- static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
- protected int maxNumRedirects
-
Constructor Summary
Constructors
- RobotRulesParser()
- RobotRulesParser(Configuration conf)
-
Method Summary
All Methods Static Methods Instance Methods Abstract Methods Concrete Methods Deprecated Methods Modifier and Type Method Description Configuration
getConf()
Get theConfiguration
objectabstract crawlercommons.robots.BaseRobotRules
getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or it's protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).crawlercommons.robots.BaseRobotRules
getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)
Fetch robots.txt (or it's protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).boolean
isAllowListed(URL url)
Check whether a URL belongs to a allowlisted host.static void
main(String[] args)
crawlercommons.robots.BaseRobotRules
parseRules(String url, byte[] content, String contentType, String robotName)
Deprecated.crawlercommons.robots.BaseRobotRules
parseRules(String url, byte[] content, String contentType, Collection<String> robotNames)
Parses the robots content using theSimpleRobotRulesParser
from crawler-commonsint
run(String[] args)
void
setConf(Configuration conf)
Set theConfiguration
object
-
-
-
Field Detail
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
-
DEFER_VISIT_RULES
public static final crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code. The crawler should suspend crawling for a certain (but not too long) time, see property http.robots.503.defer.visits.
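When the returned rules derive from DEFER_VISIT_RULES, callers can detect this via the isDeferVisits() accessor on crawler-commons' BaseRobotRules; the brief sketch below is illustrative and assumes that accessor reflects the defer-visit flag carried by this constant.

import crawlercommons.robots.BaseRobotRules;

/** Illustrative only: skip a host for now if its robots.txt fetch failed with a 5xx status. */
public class DeferVisitSketch {
  public static boolean shouldDeferHost(BaseRobotRules rules) {
    // Expected to be true for rules derived from RobotRulesParser.DEFER_VISIT_RULES;
    // see the property http.robots.503.defer.visits mentioned above.
    return rules.isDeferVisits();
  }
}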
-
conf
protected Configuration conf
-
maxNumRedirects
protected int maxNumRedirects
-
-
Constructor Detail
-
RobotRulesParser
public RobotRulesParser()
-
RobotRulesParser
public RobotRulesParser(Configuration conf)
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
  setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
  getConf in interface Configurable
-
isAllowListed
public boolean isAllowListed(URL url)
Check whether a URL belongs to an allowlisted host.
- Parameters:
  url - a URL to check against rules
- Returns:
  true if always allowed (robots.txt rules are ignored), false otherwise
-
parseRules
@Deprecated
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Deprecated.
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
- Parameters:
  url - The robots.txt URL
  content - Contents of the robots file in a byte array
  contentType - The content type of the robots file
  robotName - A string containing all the robots agent names used by the parser for matching
- Returns:
  BaseRobotRules object
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, Collection<String> robotNames)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
- Parameters:
  url - The robots.txt URL
  content - Contents of the robots file in a byte array
  contentType - The content type of the robots file
  robotNames - A collection containing all the robots agent names used by the parser for matching
- Returns:
  BaseRobotRules object
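As a usage sketch, robots.txt bytes obtained by other means can be handed to this method directly. RobotRulesParser is abstract, so the snippet instantiates an anonymous subclass whose overridden method is unused here; the NutchConfiguration utility and the http.agent.name property are assumptions drawn from a standard Nutch setup, not from this page.

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;
import crawlercommons.robots.BaseRobotRules;

public class ParseRulesSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "mybot");  // the parser reads its agent name(s) from the configuration

    // RobotRulesParser is abstract; this anonymous subclass only exists so the
    // concrete parseRules() method can be called, the override is unused here.
    RobotRulesParser parser = new RobotRulesParser(conf) {
      @Override
      public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
          List<Content> robotsTxtContent) {
        return EMPTY_RULES;
      }
    };

    byte[] robotsTxt = "User-agent: *\nDisallow: /private/\n"
        .getBytes(StandardCharsets.UTF_8);
    BaseRobotRules rules = parser.parseRules("http://www.example.com/robots.txt",
        robotsTxt, "text/plain", Arrays.asList("mybot"));

    System.out.println(rules.isAllowed("http://www.example.com/private/x.html")); // false
    System.out.println(rules.isAllowed("http://www.example.com/index.html"));     // true
  }
}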
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- Parameters:
  protocol - Protocol
  url - URL to check
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
  robot rules (specific for this URL or default), never null
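If the raw responses are needed (e.g. for archiving), pass a list and inspect it afterwards. The short sketch below again assumes a concrete parser and Protocol instance obtained elsewhere, and uses Content accessors (getUrl(), getContentType(), getContent()) that are assumed here rather than documented on this page.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Illustrative only: fetch rules and report what was actually downloaded. */
public class RobotsCaptureSketch {
  public static BaseRobotRules fetchAndLog(RobotRulesParser parser, Protocol protocol, String url) {
    List<Content> responses = new ArrayList<>();
    BaseRobotRules rules = parser.getRobotRulesSet(protocol, new Text(url), responses);
    // The list now holds the robots.txt response plus any redirects or error pages.
    for (Content c : responses) {
      System.out.println(c.getUrl() + " (" + c.getContentType() + ", "
          + c.getContent().length + " bytes)");
    }
    return rules;
  }
}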
-
getRobotRulesSet
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- Parameters:
  protocol - Protocol
  url - URL to check
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
  robot rules (specific for this URL or default), never null
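Schematically, a protocol-specific implementation fetches /robots.txt for the URL's host, hands the bytes to parseRules(String, byte[], String, Collection), and caches the result; real implementations (HttpRobotRulesParser, FtpRobotRulesParser) additionally map a 403 response to FORBID_ALL_RULES and 5xx responses to DEFER_VISIT_RULES. The sketch below is not Nutch's actual code: the fetch helper and the cache-key format are placeholders.

import java.net.URL;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Schematic subclass; see FtpRobotRulesParser and HttpRobotRulesParser for the real ones. */
public class MyRobotRulesParser extends RobotRulesParser {

  public MyRobotRulesParser(Configuration conf) {
    super(conf);
  }

  @Override
  public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
      List<Content> robotsTxtContent) {
    String cacheKey = url.getProtocol() + ":" + url.getHost();  // one rule set per host (illustrative key)
    BaseRobotRules rules = CACHE.get(cacheKey);
    if (rules != null) {
      return rules;
    }
    byte[] robotsTxt = fetchRobotsTxt(url);  // placeholder for the protocol-specific fetch
    if (robotsTxt == null) {
      rules = EMPTY_RULES;                   // missing robots.txt: everything allowed
    } else {
      rules = parseRules(url.toString(), robotsTxt, "text/plain", agentNames);
    }
    CACHE.put(cacheKey, rules);
    return rules;
  }

  /** Hypothetical helper standing in for the real HTTP/FTP robots.txt fetch. */
  private byte[] fetchRobotsTxt(URL url) {
    return null;
  }
}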
-
-