Class File

  • All Implemented Interfaces:
    Configurable, Pluggable, Protocol

    public class File
    extends Object
    implements Protocol
    This class is a protocol plugin for the file: scheme. It creates a FileResponse object and reads the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.
    Author:
    John Xing
    • Field Detail

      • LOG

        protected static final org.slf4j.Logger LOG
    • Constructor Detail

      • File

        public File()
    • Method Detail

      • setMaxContentLength

        public void setMaxContentLength(int maxContentLength)
        Set the length after which content is truncated.
        Parameters:
        maxContentLength - maximum content length in bytes
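
The truncation described above can be illustrated with a minimal, self-contained sketch. It is not the plugin's actual code: the class ContentLimitSketch and the helper truncate are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the Nutch File plugin.
class ContentLimitSketch {

    // Returns at most maxContentLength bytes of content.
    // Assumption: a negative limit means the content is never truncated.
    public static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // within the limit, or no limit configured
        }
        return Arrays.copyOf(content, maxContentLength); // keep the leading bytes
    }

    public static void main(String[] args) {
        byte[] content = "hello, file protocol".getBytes(StandardCharsets.US_ASCII);
        System.out.println(truncate(content, 5).length);  // truncated to the limit
        System.out.println(truncate(content, -1).length); // left at full length
    }
}
```

In the plugin itself the limit would normally come from the file.content.limit property rather than being passed in directly, as this sketch does.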
      • main

        public static void main(String[] args)
                         throws Exception
        A quick way to run this class; useful for debugging.
        Parameters:
        args - run with no args to print help
        Throws:
        Exception - if there is a fatal error running this class with the given input
      • getRobotRules

        public crawlercommons.robots.BaseRobotRules getRobotRules(Text url,
                                                                  CrawlDatum datum,
                                                                  List<Content> robotsTxtContent)
        No robots.txt parsing is done for the file protocol, so this method returns an empty rule set that allows every URL.
        Specified by:
        getRobotRules in interface Protocol
        Parameters:
        url - URL to check
        datum - page datum
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
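
The "allow everything" contract of getRobotRules can be sketched without any Nutch or crawler-commons dependency. Everything in this example is a hypothetical stand-in: the nested Rules interface plays the role of crawlercommons.robots.BaseRobotRules, and plain String and byte[] replace Text and Content. Leaving the robotsTxtContent list untouched is an assumption here, based on the description that no robots.txt is fetched or parsed for this protocol.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the plugin's actual implementation.
class AllowAllRobotsSketch {

    // Stand-in for a robot-rules type such as BaseRobotRules.
    interface Rules {
        boolean isAllowed(String url);
    }

    // One shared, empty rule set: with no rules, every URL is permitted.
    static final Rules ALLOW_ALL = url -> true;

    // Never returns null and never fetches or parses robots.txt.
    // Assumption: since nothing is fetched, nothing is appended to robotsTxtContent.
    public static Rules getRobotRules(String url, List<byte[]> robotsTxtContent) {
        return ALLOW_ALL;
    }

    public static void main(String[] args) {
        List<byte[]> responses = new ArrayList<>();
        Rules rules = getRobotRules("file:///tmp/example.txt", responses);
        System.out.println(rules.isAllowed("file:///tmp/example.txt"));
        System.out.println(responses.size());
    }
}
```

Returning a shared constant avoids allocating a new rule object per URL, which suits a rule set that has no state.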