Class File

  • All Implemented Interfaces:
    Configurable, Pluggable, Protocol

    public class File
    extends Object
    implements Protocol
    This class is a protocol plugin for the file: URL scheme. It creates a FileResponse object and reads the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.
    Author:
    John Xing
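
    The effect of a content limit such as file.content.limit (also settable through setMaxContentLength) can be illustrated with a minimal sketch. It is not the plugin's actual code: the class ContentLimitSketch and the method truncate are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the Nutch code base.
final class ContentLimitSketch {

    /**
     * Returns at most maxContentLength bytes of the given content.
     * Assumption for this sketch: a negative limit disables truncation.
     */
    public static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // nothing to cut off
        }
        // Keep only the leading maxContentLength bytes.
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] body = "hello, file protocol".getBytes(StandardCharsets.US_ASCII);
        System.out.println(truncate(body, 5).length);  // limit applied
        System.out.println(truncate(body, -1).length); // limit disabled
    }
}
```

    Copying the leading bytes keeps the original array untouched, so the caller can still inspect the full content if it needs to.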
    • Field Detail

      • LOG

        protected static final org.slf4j.Logger LOG
    • Constructor Detail

      • File

        public File()
    • Method Detail

      • setMaxContentLength

        public void setMaxContentLength(int maxContentLength)
        Set the length after which content is truncated.
        Parameters:
        maxContentLength - maximum content length in bytes
      • main

        public static void main(String[] args)
                         throws Exception
        A quick way to run this class from the command line; useful for debugging.
        Parameters:
        args - run with no args to print help
        Throws:
        Exception - if there is a fatal error running this class with the given input
      • getRobotRules

        public crawlercommons.robots.BaseRobotRules getRobotRules(Text url,
                                                                  CrawlDatum datum,
                                                                  List<Content> robotsTxtContent)
        No robots.txt parsing is done for the file protocol, so this method returns an empty rule set that allows every URL.
        Specified by:
        getRobotRules in interface Protocol
        Parameters:
        url - URL to check
        datum - page datum
        robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
        Returns:
        robot rules (specific for this URL or default), never null
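
        The "empty rules allow every URL" behaviour described for getRobotRules can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the real method returns a crawlercommons.robots.BaseRobotRules, whereas the RobotRules interface, the AllowAllRulesSketch class, and the simplified parameter types below are hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.util.List;

// Hypothetical illustration only; the types here are stand-ins, not Nutch or crawler-commons classes.
final class AllowAllRulesSketch {

    /** Hypothetical stand-in for a robots rule set. */
    public interface RobotRules {
        boolean isAllowed(String url);
    }

    /** A single shared rule set with no restrictions: every URL is allowed. */
    public static final RobotRules EMPTY_RULES = url -> true;

    /**
     * Follows the documented contract: nothing is fetched or parsed,
     * the passed list (which may be null) is left untouched, and the
     * result is never null.
     */
    public static RobotRules getRobotRules(String url, List<byte[]> robotsTxtContent) {
        return EMPTY_RULES;
    }

    public static void main(String[] args) {
        RobotRules rules = getRobotRules("file:///tmp/example.txt", null);
        System.out.println(rules.isAllowed("file:///tmp/example.txt"));
    }
}
```

        Returning one shared, stateless rule set avoids allocating a new object per URL and makes the "never null" guarantee trivial to uphold.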