Package org.apache.nutch.parse
Class ParserFactory
java.lang.Object
    org.apache.nutch.parse.ParserFactory
Field Summary
Modifier and Type    Field             Description
static String        DEFAULT_PLUGIN    Wildcard for default plugins.
Constructor Summary
Constructor                          Description
ParserFactory(Configuration conf)
Method Summary
Modifier and Type            Method                                        Description
protected List<Extension>    getExtensions(String contentType)             Finds the best-suited parse plugin for a given contentType.
Parser                       getParserById(String id)                      Function returns a Parser instance with the specified extId, representing its extension ID.
Parser[]                     getParsers(String contentType, String url)    Function returns an array of Parsers for a given content type.
Field Detail
DEFAULT_PLUGIN
public static final String DEFAULT_PLUGIN
Wildcard for default plugins.
See Also:
Constant Field Values
Constructor Detail
ParserFactory
public ParserFactory(Configuration conf)
Method Detail
getParsers
public Parser[] getParsers(String contentType, String url) throws ParserNotFound
Returns an array of Parsers for a given content type. The function consults the ParserFactory's internal list of parse plugins to determine the list of pluginIds, then gets the appropriate extension points to instantiate as Parsers.
Parameters:
contentType - The contentType to return the Array of Parsers for.
url - The URL of the content, which may allow the type to be determined from the file suffix.
Returns:
An Array of Parsers for the given contentType. If plugins were mapped to a contentType via the parse-plugins.xml file but never enabled via the plugin.includes Nutch conf, those plugins are skipped and won't be part of this array. For example, if the ordered list of parsing plugins for text/plain was [parse-text, parse-html, parse-rtf], and only parse-html and parse-rtf were enabled via plugin.includes, then this ordered Array would consist of two Parsers, [parse-html, parse-rtf].
Throws:
ParserNotFound - if there is a runtime error locating a parser for the given content type and url
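The skip rule described under Returns can be illustrated with a small standalone sketch. It does not use the Nutch API: the class EnabledPluginFilter and the method filterEnabled are hypothetical names introduced here, and plugin IDs are modeled as plain strings rather than Parser instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in, not part of Nutch: models only the skip rule
// described for getParsers, using plugin IDs instead of Parser objects.
class EnabledPluginFilter {

    // Keeps the order given by the mapping (as in parse-plugins.xml) and
    // drops every plugin ID that is not in the enabled set (plugin.includes).
    static List<String> filterEnabled(List<String> orderedPluginIds,
                                      Set<String> enabledPluginIds) {
        List<String> result = new ArrayList<>();
        for (String id : orderedPluginIds) {
            if (enabledPluginIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordered mapping for text/plain, taken from the example in the text.
        List<String> ordered = Arrays.asList("parse-text", "parse-html", "parse-rtf");
        // Only these two plugins are assumed to be enabled.
        Set<String> enabled = new LinkedHashSet<>(Arrays.asList("parse-html", "parse-rtf"));
        System.out.println(filterEnabled(ordered, enabled)); // prints [parse-html, parse-rtf]
    }
}
```

The result preserves the mapping order, so the first remaining entry plays the role of the preferred parser for that content type.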
getParserById
public Parser getParserById(String id) throws ParserNotFound
Returns a Parser instance with the specified extId, representing its extension ID. If the Parser instance isn't found, the function throws a ParserNotFound exception. If the function finds the Parser in the internal PARSER_CACHE, it returns the already instantiated Parser. Otherwise, it instantiates the Parser itself and caches it in the internal PARSER_CACHE.
Parameters:
id - The string extension ID (e.g., "org.apache.nutch.parse.rss.RSSParser", "org.apache.nutch.parse.rtf.RTFParseFactory") of the Parser implementation to return.
Returns:
A Parser implementation specified by the parameter id.
Throws:
ParserNotFound - If the Parser is not found (i.e., not registered with the extension point), or if there is a PluginRuntimeException instantiating the Parser.
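The lookup-then-cache behavior described for getParserById can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the Nutch implementation: ParserCacheSketch, its nested NotFound exception, and the Supplier-based registry are all hypothetical, and parsers are modeled as plain Objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in, not part of Nutch: shows a cache that returns an
// already created instance when present and creates and stores it otherwise.
class ParserCacheSketch {

    // Plays the role of ParserNotFound in this sketch.
    static class NotFound extends Exception {
        NotFound(String message) {
            super(message);
        }
    }

    // Assumed registry of known extension IDs and how to instantiate them.
    private final Map<String, Supplier<Object>> registry;
    // Plays the role of the internal PARSER_CACHE.
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    ParserCacheSketch(Map<String, Supplier<Object>> registry) {
        this.registry = registry;
    }

    Object getById(String id) throws NotFound {
        Object cached = cache.get(id);
        if (cached != null) {
            return cached; // already instantiated, reuse it
        }
        Supplier<Object> factory = registry.get(id);
        if (factory == null) {
            throw new NotFound("No parser registered for id: " + id);
        }
        Object created = factory.get();
        // If another thread stored an instance first, keep that one.
        Object previous = cache.putIfAbsent(id, created);
        return previous != null ? previous : created;
    }

    public static void main(String[] args) throws NotFound {
        Map<String, Supplier<Object>> registry = new HashMap<>();
        registry.put("org.apache.nutch.parse.rss.RSSParser", Object::new);
        ParserCacheSketch parsers = new ParserCacheSketch(registry);
        Object first = parsers.getById("org.apache.nutch.parse.rss.RSSParser");
        Object second = parsers.getById("org.apache.nutch.parse.rss.RSSParser");
        System.out.println(first == second); // prints true
    }
}
```

Requesting an ID that is absent from the registry raises NotFound in this sketch, mirroring the documented ParserNotFound case for unregistered parsers.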