Package org.apache.nutch.analysis.lang
Class HTMLLanguageParser
- java.lang.Object
-
- org.apache.nutch.analysis.lang.HTMLLanguageParser
-
- All Implemented Interfaces:
Configurable, HtmlParseFilter, Pluggable
public class HTMLLanguageParser extends Object implements HtmlParseFilter
-
-
Field Summary
-
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
-
Constructor Summary
Constructor    Description
HTMLLanguageParser()
-
Method Summary
Modifier and Type    Method    Description
ParseResult    filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Scan the HTML document looking at possible indications of content language
Configuration    getConf()
void    setConf(Configuration conf)
-
-
-
Method Detail
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Scan the HTML document looking at possible indications of content language
- 1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
- 2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
- 3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
- Specified by:
filter in interface HtmlParseFilter
- Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed in the filtering process
- Returns:
a filtered ParseResult
- See Also:
Parser.getParse(Content)
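The three indications listed for filter can be illustrated with a small, self-contained sketch. It is not the Nutch implementation: the class LanguageHintSketch, its Hints holder, and the detect and parse helpers are hypothetical names, and the precedence used here (html lang first, then dc.language, then http-equiv) is an assumption based only on the numbering in the description. The sketch walks a plain org.w3c.dom tree built from well-formed XHTML rather than the DocumentFragment that Nutch passes to the filter.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: not the Nutch implementation of HTMLLanguageParser. */
public class LanguageHintSketch {

    /** First non-empty value seen for each of the three documented hints. */
    static final class Hints {
        String htmlLang;   // 1. <html lang="...">
        String dcLanguage; // 2. <meta name="dc.language" content="...">
        String httpEquiv;  // 3. <meta http-equiv="content-language" content="...">
    }

    /** Returns the first available hint; the 1-2-3 precedence is an assumption. */
    static Optional<String> detect(Node root) {
        Hints hints = new Hints();
        walk(root, hints);
        if (hints.htmlLang != null) return Optional.of(hints.htmlLang);
        if (hints.dcLanguage != null) return Optional.of(hints.dcLanguage);
        return Optional.ofNullable(hints.httpEquiv);
    }

    /** Depth-first walk that records each hint the first time it appears. */
    private static void walk(Node node, Hints hints) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getNodeName();
            if ("html".equalsIgnoreCase(tag) && hints.htmlLang == null) {
                hints.htmlLang = emptyToNull(el.getAttribute("lang"));
            } else if ("meta".equalsIgnoreCase(tag)) {
                String content = emptyToNull(el.getAttribute("content"));
                if (hints.dcLanguage == null
                        && "dc.language".equalsIgnoreCase(el.getAttribute("name"))) {
                    hints.dcLanguage = content;
                }
                if (hints.httpEquiv == null
                        && "content-language".equalsIgnoreCase(el.getAttribute("http-equiv"))) {
                    hints.httpEquiv = content;
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), hints);
        }
    }

    /** Element.getAttribute returns "" for a missing attribute; treat that as absent. */
    private static String emptyToNull(String s) {
        return (s == null || s.trim().isEmpty()) ? null : s.trim();
    }

    /** Parses well-formed XHTML into a DOM tree (a stand-in for Nutch's own HTML parsing). */
    static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String withLang = "<html lang=\"en\"><head>"
                + "<meta http-equiv=\"content-language\" content=\"de\"/></head><body/></html>";
        String withDc = "<html><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/></head><body/></html>";
        String withNone = "<html><head/><body/></html>";

        System.out.println(detect(parse(withLang)).orElse("unknown")); // prints "en"
        System.out.println(detect(parse(withDc)).orElse("unknown"));   // prints "fr"
        System.out.println(detect(parse(withNone)).orElse("unknown")); // prints "unknown"
    }
}
```

Collecting all three hints in one pass and choosing among them afterwards keeps the precedence rule in a single place, so a different ordering only requires changing detect.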
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
-