Package org.apache.nutch.protocol
Class RobotRulesParser
- java.lang.Object
  - org.apache.nutch.protocol.RobotRulesParser
- All Implemented Interfaces:
  Configurable, Tool
- Direct Known Subclasses:
  FtpRobotRulesParser, HttpRobotRulesParser
public abstract class RobotRulesParser extends Object implements Tool

This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions as described in SimpleRobotRulesParser. Protocol-specific implementations have to implement the abstract method getRobotRulesSet(Protocol, URL, List<Content>).
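For orientation, the sketch below shows how the emitted rules are typically consulted before fetching a page. It is a minimal, hypothetical helper: the concrete parser (e.g. HttpRobotRulesParser) and the Protocol instance are assumed to come from Nutch's configuration and plugin machinery, and the crawl-delay policy at the end is purely illustrative.

import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Hypothetical helper, not part of Nutch; illustrates the call sequence only. */
public class RobotRulesUsageSketch {
  public static boolean mayFetch(RobotRulesParser parser, Protocol protocol, String url) {
    // Implementations resolve and cache one rule set per host (see the CACHE field),
    // so this call is cheap after the first URL of a host.
    BaseRobotRules rules = parser.getRobotRulesSet(protocol, new Text(url), null);
    if (!rules.isAllowed(url)) {
      return false;                      // disallowed for the configured agent name(s)
    }
    long delay = rules.getCrawlDelay();  // Crawl-delay requested by the host, if any
    return delay <= 5000;                // illustrative policy: skip hosts asking for very long delays
  }
}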
-
-
Field Summary
Fields
- protected Set<String> agentNames
- protected Set<String> allowList
  Set of host names or IPs to be explicitly excluded from robots.txt checking.
- protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
- protected Configuration conf
- static crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code.
- static crawlercommons.robots.BaseRobotRules EMPTY_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
- static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
  A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
- protected int maxNumRedirects
-
Constructor Summary
Constructors
- RobotRulesParser()
- RobotRulesParser(Configuration conf)
-
Method Summary
All Methods Static Methods Instance Methods Abstract Methods Concrete Methods Deprecated Methods Modifier and Type Method Description Configuration
getConf()
Get theConfiguration
objectabstract crawlercommons.robots.BaseRobotRules
getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or it's protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).crawlercommons.robots.BaseRobotRules
getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)
Fetch robots.txt (or it's protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s).boolean
isAllowListed(URL url)
Check whether a URL belongs to a allowlisted host.static void
main(String[] args)
crawlercommons.robots.BaseRobotRules
parseRules(String url, byte[] content, String contentType, String robotName)
Deprecated.crawlercommons.robots.BaseRobotRules
parseRules(String url, byte[] content, String contentType, Collection<String> robotNames)
Parses the robots content using theSimpleRobotRulesParser
from crawler-commonsint
run(String[] args)
void
setConf(Configuration conf)
Set theConfiguration
object
-
-
-
Field Detail
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
-
DEFER_VISIT_RULES
public static final crawlercommons.robots.BaseRobotRules DEFER_VISIT_RULES
A BaseRobotRules object appropriate for use when the robots.txt file failed to fetch with a 503 "Service Unavailable" (or other 5xx) status code. The crawler should suspend crawling for a certain (but not too long) time, see property http.robots.503.defer.visits.
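When the returned rules derive from DEFER_VISIT_RULES, callers can detect this via the isDeferVisits() accessor on crawler-commons' BaseRobotRules; the brief sketch below is illustrative and assumes that accessor reflects the defer-visit flag carried by this constant.

import crawlercommons.robots.BaseRobotRules;

/** Illustrative only: skip a host for now if its robots.txt fetch failed with a 5xx status. */
public class DeferVisitSketch {
  public static boolean shouldDeferHost(BaseRobotRules rules) {
    // Expected to be true for rules derived from RobotRulesParser.DEFER_VISIT_RULES;
    // see the property http.robots.503.defer.visits mentioned above.
    return rules.isDeferVisits();
  }
}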
-
conf
protected Configuration conf
-
maxNumRedirects
protected int maxNumRedirects
-
-
Constructor Detail
-
RobotRulesParser
public RobotRulesParser()
-
RobotRulesParser
public RobotRulesParser(Configuration conf)
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
Set the Configuration object.
- Specified by:
  setConf in interface Configurable
-
getConf
public Configuration getConf()
Get the Configuration object.
- Specified by:
  getConf in interface Configurable
-
isAllowListed
public boolean isAllowListed(URL url)
Check whether a URL belongs to an allowlisted host.
- Parameters:
  url - a URL to check against rules
- Returns:
  true if always allowed (robots.txt rules are ignored), false otherwise
-
parseRules
@Deprecated
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Deprecated.
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
- Parameters:
  url - The robots.txt URL
  content - Contents of the robots file in a byte array
  contentType - The content type of the robots file
  robotName - A string containing all the robots agent names used by the parser for matching
- Returns:
  BaseRobotRules object
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, Collection<String> robotNames)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
- Parameters:
  url - The robots.txt URL
  content - Contents of the robots file in a byte array
  contentType - The content type of the robots file
  robotNames - A collection containing all the robots agent names used by the parser for matching
- Returns:
  BaseRobotRules object
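As a usage sketch, robots.txt bytes obtained by other means can be handed to this method directly. RobotRulesParser is abstract, so the snippet instantiates an anonymous subclass whose overridden method is unused here; the NutchConfiguration utility and the http.agent.name property are assumptions drawn from a standard Nutch setup, not from this page.

import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;
import crawlercommons.robots.BaseRobotRules;

public class ParseRulesSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "mybot");  // the parser reads its agent name(s) from the configuration

    // RobotRulesParser is abstract; this anonymous subclass only exists so the
    // concrete parseRules() method can be called, the override is unused here.
    RobotRulesParser parser = new RobotRulesParser(conf) {
      @Override
      public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
          List<Content> robotsTxtContent) {
        return EMPTY_RULES;
      }
    };

    byte[] robotsTxt = "User-agent: *\nDisallow: /private/\n"
        .getBytes(StandardCharsets.UTF_8);
    BaseRobotRules rules = parser.parseRules("http://www.example.com/robots.txt",
        robotsTxt, "text/plain", Arrays.asList("mybot"));

    System.out.println(rules.isAllowed("http://www.example.com/private/x.html")); // false
    System.out.println(rules.isAllowed("http://www.example.com/index.html"));     // true
  }
}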
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- Parameters:
  protocol - Protocol
  url - URL to check
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
  robot rules (specific for this URL or default), never null
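If the raw responses are needed (e.g. for archiving), pass a list and inspect it afterwards. The short sketch below again assumes a concrete parser and Protocol instance obtained elsewhere, and uses Content accessors (getUrl(), getContentType(), getContent()) that are assumed here rather than documented on this page.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Illustrative only: fetch rules and report what was actually downloaded. */
public class RobotsCaptureSketch {
  public static BaseRobotRules fetchAndLog(RobotRulesParser parser, Protocol protocol, String url) {
    List<Content> responses = new ArrayList<>();
    BaseRobotRules rules = parser.getRobotRulesSet(protocol, new Text(url), responses);
    // The list now holds the robots.txt response plus any redirects or error pages.
    for (Content c : responses) {
      System.out.println(c.getUrl() + " (" + c.getContentType() + ", "
          + c.getContent().length + " bytes)");
    }
    return rules;
  }
}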
-
getRobotRulesSet
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
- Parameters:
  protocol - Protocol
  url - URL to check
  robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
- Returns:
  robot rules (specific for this URL or default), never null
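Schematically, a protocol-specific implementation fetches /robots.txt for the URL's host, hands the bytes to parseRules(String, byte[], String, Collection), and caches the result; real implementations (HttpRobotRulesParser, FtpRobotRulesParser) additionally map a 403 response to FORBID_ALL_RULES and 5xx responses to DEFER_VISIT_RULES. The sketch below is not Nutch's actual code: the fetch helper and the cache-key format are placeholders.

import java.net.URL;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.RobotRulesParser;
import crawlercommons.robots.BaseRobotRules;

/** Schematic subclass; see FtpRobotRulesParser and HttpRobotRulesParser for the real ones. */
public class MyRobotRulesParser extends RobotRulesParser {

  public MyRobotRulesParser(Configuration conf) {
    super(conf);
  }

  @Override
  public BaseRobotRules getRobotRulesSet(Protocol protocol, URL url,
      List<Content> robotsTxtContent) {
    String cacheKey = url.getProtocol() + ":" + url.getHost();  // one rule set per host (illustrative key)
    BaseRobotRules rules = CACHE.get(cacheKey);
    if (rules != null) {
      return rules;
    }
    byte[] robotsTxt = fetchRobotsTxt(url);  // placeholder for the protocol-specific fetch
    if (robotsTxt == null) {
      rules = EMPTY_RULES;                   // missing robots.txt: everything allowed
    } else {
      rules = parseRules(url.toString(), robotsTxt, "text/plain", agentNames);
    }
    CACHE.put(cacheKey, rules);
    return rules;
  }

  /** Hypothetical helper standing in for the real HTTP/FTP robots.txt fetch. */
  private byte[] fetchRobotsTxt(URL url) {
    return null;
  }
}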
-
-