Class HttpRobotRulesParser

java.lang.Object
  org.apache.nutch.protocol.RobotRulesParser
    org.apache.nutch.protocol.http.api.HttpRobotRulesParser

All Implemented Interfaces:
  Configurable, Tool

public class HttpRobotRulesParser extends RobotRulesParser

This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-specific implementation for obtaining the robots.txt file.
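
A minimal construction sketch, assuming a Nutch configuration created via NutchConfiguration; the surrounding setup is illustrative only, not a prescribed usage pattern:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
    import org.apache.nutch.util.NutchConfiguration;

    // The parser is driven by the Hadoop Configuration it is constructed
    // with (robots.txt caching, redirect handling, error handling).
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser robotRulesParser = new HttpRobotRulesParser(conf);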

Field Summary

Fields
  protected boolean allowForbidden
  protected boolean deferVisits503

Fields inherited from class org.apache.nutch.protocol.RobotRulesParser
  agentNames, allowList, CACHE, conf, DEFER_VISIT_RULES, EMPTY_RULES, FORBID_ALL_RULES, maxNumRedirects

Constructor Summary

Constructors
  HttpRobotRulesParser(Configuration conf)

Method Summary

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)
  Append Content of robots.txt to robotsTxtContent

protected static String getCacheKey(URL url)
  Compose unique key to store and access robot rules in cache for given URL

crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)
  Get the rules from robots.txt which apply for the given url

void setConf(Configuration conf)
  Set the Configuration object

Methods inherited from class org.apache.nutch.protocol.RobotRulesParser
  getConf, getRobotRulesSet, isAllowListed, main, parseRules, parseRules, run

Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(Configuration conf)

Method Detail

setConf

public void setConf(Configuration conf)

Description copied from class: RobotRulesParser
Set the Configuration object

Specified by:
  setConf in interface Configurable
Overrides:
  setConf in class RobotRulesParser
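
The protected fields allowForbidden and deferVisits503 listed in the field summary are presumably populated from the configuration when setConf is applied. The property names in the sketch below are assumptions based on Nutch's conventional robots.txt settings and should be verified against nutch-default.xml for the version in use:

    // Assumed property names; confirm them in nutch-default.xml for your version.
    Configuration conf = NutchConfiguration.create();
    conf.setBoolean("http.robots.403.allow", true);         // presumably backs allowForbidden
    conf.setBoolean("http.robots.503.defer.visits", true);  // presumably backs deferVisits503

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    parser.setConf(conf);  // may also be called directly, as when used as a Configurable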

getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL

Parameters:
  url - the URL to generate a unique key for
Returns:
  the unique cache key
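
getCacheKey is protected and mainly of interest to subclasses. The snippet below only illustrates the kind of key described above, i.e. a combination of protocol, host, and port; the exact format produced by the method is an internal detail and may differ:

    // Illustration of a protocol/host/port cache key; not the actual implementation.
    URL url = new URL("https://example.org:8443/some/page.html");
    String key = url.getProtocol() + ":" + url.getHost()
        + (url.getPort() == -1 ? "" : ":" + url.getPort());
    // key -> "https:example.org:8443"; URLs sharing this key share the cached rules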

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)

Get the rules from robots.txt which apply for the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch <protocol>://<host>:<port>/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing.

Following RFC 9309, section 2.3.1.2 (Redirects), up to five consecutive HTTP redirects are followed when fetching the robots.txt file. The maximum number of redirects followed is configurable via the property http.robots.redirect.max.

Specified by:
  getRobotRulesSet in class RobotRulesParser
Parameters:
  http - the Protocol object
  url - URL
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). The response Content is appended to the passed list. If null is passed, nothing is stored.
Returns:
  robotRules - a BaseRobotRules object for the rules
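
A usage sketch, assuming the Protocol instance is obtained via Nutch's ProtocolFactory with the protocol-http plugin enabled; the page URL is hypothetical and exception handling is collapsed into the throws clause:

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.protocol.Protocol;
    import org.apache.nutch.protocol.ProtocolFactory;
    import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
    import org.apache.nutch.util.NutchConfiguration;

    import crawlercommons.robots.BaseRobotRules;

    public class RobotRulesCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        String pageUrl = "https://example.org/index.html";  // hypothetical URL

        // Resolve the Protocol plugin responsible for this URL.
        Protocol http = new ProtocolFactory(conf).getProtocol(pageUrl);

        // Optional container that receives the fetched robots.txt responses;
        // pass null if they are not needed.
        List<Content> robotsTxtContent = new ArrayList<>();

        HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
        BaseRobotRules rules =
            parser.getRobotRulesSet(http, new URL(pageUrl), robotsTxtContent);

        System.out.println("allowed:     " + rules.isAllowed(pageUrl));
        System.out.println("crawl delay: " + rules.getCrawlDelay());
      }
    }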

addRobotsContent

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)

Append the Content of robots.txt to robotsTxtContent

Parameters:
  robotsTxtContent - container to store robots.txt response content
  robotsUrl - robots.txt URL
  robotsResponse - response object to be stored
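
This helper is invoked internally while getRobotRulesSet resolves robots.txt, whenever a non-null robotsTxtContent list is passed. A small illustrative loop over the captured entries (the Content accessors are standard Nutch API; the reporting itself is only an assumed use):

    // After getRobotRulesSet(http, url, robotsTxtContent) has returned:
    for (Content c : robotsTxtContent) {
      // Each entry is one response fetched while resolving robots.txt,
      // possibly a redirect or an error page rather than the rules file itself.
      System.out.println(c.getUrl() + " (" + c.getContentType() + "), "
          + c.getContent().length + " bytes");
    }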