Package org.apache.nutch.parse.html
Class DOMContentUtils
- java.lang.Object
-
- org.apache.nutch.parse.html.DOMContentUtils
-
public class DOMContentUtils extends Object
A collection of methods for extracting content from DOM trees. This class holds utility methods for pulling content out of DOM nodes, such as getOutlinks and getText.
-
-
Nested Class Summary
Modifier and Type | Class | Description
static class | DOMContentUtils.LinkParams |
-
Constructor Summary
Constructor | Description
DOMContentUtils(Configuration conf) |
-
Method Summary
Modifier and Type | Method | Description
String | getBase(Node node) | If the node contains a BASE tag, its HREF is returned.
void | getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node) |
void | getText(StringBuffer sb, Node node) | A convenience method, equivalent to getText(sb, node, false).
boolean | getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) | Takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer.
boolean | getTitle(StringBuffer sb, Node node) | Takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
void | setConf(Configuration conf) |
-
-
-
Constructor Detail
-
DOMContentUtils
public DOMContentUtils(Configuration conf)
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
-
getText
public boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal is aborted and the StringBuffer will not contain any text encountered after a nested anchor is found.
Parameters:
sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
node - a DOM Node to check for content text
abortOnNestedAnchors - true to abort if nested anchors are encountered, false otherwise
Returns:
true if nested anchors were found
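The abort-on-nested-anchor behaviour described above can be illustrated with a small standalone sketch. It uses only the JDK's DOM classes and is not the Nutch implementation: the class NestedAnchorTextSketch and its methods parse and collectText are hypothetical names, and the whitespace handling (trim each text node, join with single spaces) is an assumption made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is not the Nutch implementation.
final class NestedAnchorTextSketch {

  // Hypothetical helper: parse a well-formed XHTML fragment into a DOM tree.
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse: " + xml, e);
    }
  }

  // Appends text found beneath node to sb. Returns true only when the
  // traversal was aborted because an anchor was found inside another anchor.
  public static boolean collectText(StringBuffer sb, Node node,
                                    boolean abortOnNestedAnchors) {
    return walk(sb, node, abortOnNestedAnchors, 0);
  }

  private static boolean walk(StringBuffer sb, Node node,
                              boolean abortOnNestedAnchors, int anchorDepth) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (abortOnNestedAnchors && anchorDepth > 1) {
        return true; // nested anchor: stop before appending any more text
      }
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' '); // assumed separator between text runs
        }
        sb.append(text);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (walk(sb, c, abortOnNestedAnchors, anchorDepth)) {
        return true; // propagate the abort up the recursion
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Node root = parse("<a href='x'>outer <a href='y'>inner</a> tail</a>")
        .getDocumentElement();
    StringBuffer sb = new StringBuffer();
    boolean nested = collectText(sb, root, true);
    System.out.println(nested + " [" + sb + "]");
  }
}
```

In this sketch, text seen before the nested anchor stays in the buffer, while the nested anchor's own text and everything after it are dropped.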
-
getText
public void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
Parameters:
sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
node - a DOM Node to check for content text
-
getTitle
public boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
Parameters:
sb - a StringBuffer used to store content text found beneath the DOM node, if any exists
node - a DOM Node to check for content text
Returns:
true if a title node was found, false otherwise
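As a rough illustration of the "first title node" contract, the following standalone sketch performs a depth-first search with the JDK's DOM classes. It is not the Nutch code: FirstTitleSketch, parse, and appendFirstTitle are hypothetical names, and trimming the title text is an assumption of the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is not the Nutch implementation.
final class FirstTitleSketch {

  // Hypothetical helper: parse a well-formed XHTML document into a DOM tree.
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse: " + xml, e);
    }
  }

  // Appends the text of the first title element at or beneath node
  // (in document order) to sb. Returns true if a title element was found.
  public static boolean appendFirstTitle(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "title".equalsIgnoreCase(node.getNodeName())) {
      sb.append(node.getTextContent().trim());
      return true;
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (appendFirstTitle(sb, c)) {
        return true; // stop at the first match; later titles are ignored
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Document doc = parse("<html><head><title> First </title></head>"
        + "<body><title>Second</title><p>text</p></body></html>");
    StringBuffer sb = new StringBuffer();
    boolean found = appendFirstTitle(sb, doc);
    System.out.println(found + " [" + sb + "]");
  }
}
```

When no title element exists, the sketch leaves the buffer untouched and returns false, mirroring the documented return value.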
-
getBase
public String getBase(Node node)
If the node contains a BASE tag, its HREF is returned.
Parameters:
node - a DOM Node to check for a BASE tag
Returns:
HREF if one exists
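A BASE lookup of this kind can be sketched with the JDK's DOM classes as shown below. This is an illustration under stated assumptions, not the Nutch code: BaseHrefSketch, parse, and findBaseHref are hypothetical names, and returning null when no BASE HREF exists is an assumption, since the description above only says the HREF is returned "if one exists".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is not the Nutch implementation.
final class BaseHrefSketch {

  // Hypothetical helper: parse a well-formed XHTML document into a DOM tree.
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse: " + xml, e);
    }
  }

  // Returns the href of the first base element at or beneath node,
  // or null (an assumption of this sketch) when there is none.
  public static String findBaseHref(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "base".equalsIgnoreCase(node.getNodeName())) {
      // Compare attribute names case-insensitively, so HREF and href both match.
      NamedNodeMap attrs = node.getAttributes();
      for (int i = 0; i < attrs.getLength(); i++) {
        Node attr = attrs.item(i);
        if ("href".equalsIgnoreCase(attr.getNodeName())) {
          return attr.getNodeValue();
        }
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      String href = findBaseHref(c);
      if (href != null) {
        return href;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Document doc = parse("<html><head><base HREF='http://example.com/docs/'/>"
        + "</head><body/></html>");
    System.out.println(findBaseHref(doc));
  }
}
```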
-
getOutlinks
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
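The anchor collection and relative-URL resolution described above can be sketched as follows. This is a simplified, standalone illustration and not the Nutch implementation: OutlinkSketch, Link, parse, and collectLinks are hypothetical names, Link is a stand-in for Nutch's Outlink, java.net.URI is used for resolution instead of the URL parameter in the real signature, and only the first discard rule (anchors with no inner structure) is implemented.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; this is not the Nutch implementation.
final class OutlinkSketch {

  // Hypothetical stand-in for Nutch's Outlink: a target URL plus anchor text.
  public static final class Link {
    public final String toUrl;
    public final String anchor;

    public Link(String toUrl, String anchor) {
      this.toUrl = toUrl;
      this.anchor = anchor;
    }
  }

  // Hypothetical helper: parse a well-formed XHTML fragment into a DOM tree.
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse: " + xml, e);
    }
  }

  // Adds one Link per usable anchor at or beneath node, resolving each
  // href against base. Anchors with no child nodes are discarded.
  public static void collectLinks(URI base, List<Link> out, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      Element a = (Element) node;
      if (a.hasAttribute("href") && a.hasChildNodes()) {
        String target = base.resolve(a.getAttribute("href")).toString();
        out.add(new Link(target, a.getTextContent().trim()));
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(base, out, c);
    }
  }

  public static void main(String[] args) {
    Document doc = parse("<div><a href='b.html'>B</a><a href='empty.html'/>"
        + "<a href='../up.html'><b>Up</b></a></div>");
    List<Link> links = new ArrayList<>();
    collectLinks(URI.create("http://example.com/dir/page.html"), links, doc);
    for (Link link : links) {
      System.out.println(link.toUrl + " [" + link.anchor + "]");
    }
  }
}
```

With the sample input, the empty anchor pointing at empty.html is skipped and the two remaining hrefs are resolved against the base.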
-
-