Class FeedParser

  • All Implemented Interfaces:
    Configurable, Parser, Pluggable

    public class FeedParser
    extends Object
    implements Parser
    Since:
    NUTCH-444

    An RSS/Atom feed parser that parses all links and content referenced in the feed.

    Author:
    dogacan, mattmann
    • Constructor Detail

      • FeedParser

        public FeedParser()
    • Method Detail

      • getParse

        public ParseResult getParse​(Content content)
        Parses the given feed, then extracts and parses every item linked within it, using the underlying ROME feed parsing library.
        Specified by:
        getParse in interface Parser
        Parameters:
        content - A Content object representing the feed that is being parsed by this Parser.
        Returns:
        A ParseResult containing the parsed items found in the feed file handled by this Parser.
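        The extraction step described above can be illustrated with a minimal sketch. It is not the FeedParser implementation and uses neither ROME nor any Nutch class: it assumes a plain RSS 2.0 document and relies only on the JDK's built-in XML parser. The class FeedLinkSketch and the method extractItemLinks are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration only: not FeedParser, ROME, or a Nutch class.
class FeedLinkSketch {

    /** Returns the trimmed text of the first <link> child of every <item> in an RSS 2.0 document. */
    static List<String> extractItemLinks(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so an untrusted feed cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> links = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList linkNodes = item.getElementsByTagName("link");
                // Items without a <link> element are skipped.
                if (linkNodes.getLength() > 0) {
                    links.add(linkNodes.item(0).getTextContent().trim());
                }
            }
            return links;
        } catch (Exception e) {
            // Wrap parser errors so callers do not have to handle checked exceptions.
            throw new IllegalArgumentException("Not a well-formed feed", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Demo</title>"
                + "<item><title>A</title><link>http://example.com/a</link></item>"
                + "<item><title>B</title><link>http://example.com/b</link></item>"
                + "</channel></rss>";
        System.out.println(extractItemLinks(rss));
    }
}
```

        The channel-level link is ignored on purpose: only links that belong to individual items are collected, which mirrors the "linked items" wording above.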
      • setConf

        public void setConf​(Configuration conf)
        Sets the Configuration object for this Parser. This Parser expects the following configuration properties to be set:
        • URLNormalizers - properties in the configuration object that set up the default URL normalizers.
        • URLFilters - properties in the configuration object that set up the default URL filters.
        Specified by:
        setConf in interface Configurable
        Parameters:
        conf - The Hadoop Configuration object to use to configure this Parser.
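        To show why both a normalizer and a filter are configured, the following sketch runs each extracted link through a normalize-then-filter pipeline. It is an assumption-laden illustration, not Nutch code: OutlinkPipelineSketch, basicNormalize and isHttp are hypothetical names, and the normalizer and filter here are simple JDK-only stand-ins for whatever the URLNormalizers and URLFilters properties would configure.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for illustration only: not a Nutch class.
class OutlinkPipelineSketch {
    private final UnaryOperator<String> normalizer;
    private final Predicate<String> filter;

    OutlinkPipelineSketch(UnaryOperator<String> normalizer, Predicate<String> filter) {
        this.normalizer = normalizer;
        this.filter = filter;
    }

    /** Normalizes each URL, then keeps only those the filter accepts. */
    List<String> process(List<String> rawUrls) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawUrls) {
            String normalized;
            try {
                normalized = normalizer.apply(raw);
            } catch (IllegalArgumentException e) {
                // A URL that cannot be normalized is dropped.
                continue;
            }
            if (filter.test(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    /** Stand-in normalizer: trims whitespace and resolves "." and ".." path segments. */
    static String basicNormalize(String url) {
        return URI.create(url.trim()).normalize().toString();
    }

    /** Stand-in filter: accepts only http and https URLs. */
    static boolean isHttp(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }

    public static void main(String[] args) {
        OutlinkPipelineSketch pipeline =
                new OutlinkPipelineSketch(OutlinkPipelineSketch::basicNormalize, OutlinkPipelineSketch::isHttp);
        System.out.println(pipeline.process(
                List.of("http://example.com/a/../b", "ftp://example.com/file")));
    }
}
```

        Normalizing before filtering means the filter sees one canonical form of each URL, so a rule does not have to anticipate every spelling of the same address.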
      • main

        public static void main​(String[] args)
                         throws Exception
        Runs a command line version of this Parser.
        Parameters:
        args - A single argument (expected at args[0]) giving the path of a feed file on the local filesystem.
        Throws:
        Exception - If any error occurs.