Class DOMContentUtils


  • public class DOMContentUtils
    extends Object
    A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
    • Constructor Detail

      • DOMContentUtils

        public DOMContentUtils(Configuration conf)
    • Method Detail

      • getText

        public void getText(StringBuffer sb,
                            Node node)
        This is a convenience method, equivalent to getText(sb, node, false).
        Parameters:
        sb - a StringBuffer used to store any content text found beneath the DOM node
        node - a DOM Node to check for content text
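        As a rough illustration of what collecting content text beneath a DOM node involves, the following JDK-only sketch walks a tree recursively and appends each non-empty text node to a StringBuffer. It is not the Nutch implementation: the class TextWalkSketch, the helper parse, the method collectText, and the single-space separator are all assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of Nutch.
class TextWalkSketch {

    /** Parses well-formed markup into a DOM tree (helper assumed for this sketch). */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Appends every non-empty text node beneath node, separated by single spaces. */
    static void collectText(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        // Depth-first walk over all children, in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(sb, child);
        }
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        collectText(sb, parse("<html><body><p>Hello</p><p>DOM world</p></body></html>"));
        System.out.println(sb); // prints "Hello DOM world"
    }
}
```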
      • getTitle

        public boolean getTitle(StringBuffer sb,
                                Node node)
        Takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
        Parameters:
        sb - a StringBuffer used to store any content text found beneath the DOM node
        node - a DOM Node to check for content text
        Returns:
        true if a title node was found, false otherwise
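        The contract described above (append the text of the first title node, report whether one was found) can be sketched with JDK classes alone. This is an illustrative approximation, not the Nutch code: TitleSketch, parse, and appendTitle are hypothetical names, and the case-insensitive tag match and whitespace trimming are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of Nutch.
class TitleSketch {

    /** Parses well-formed markup into a DOM tree (helper assumed for this sketch). */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Appends the text of the first title element in document order; returns true if found. */
    static boolean appendTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            sb.append(node.getTextContent().trim());
            return true;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (appendTitle(sb, child)) {
                return true; // stop at the first match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb, parse(
                "<html><head><title>First</title></head>"
                + "<body><title>Second</title></body></html>"));
        System.out.println(found + " " + sb); // prints "true First"
    }
}
```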
      • getBase

        public String getBase(Node node)
        If the Node contains a BASE tag, its HREF is returned.
        Parameters:
        node - a DOM Node to check for a BASE tag
        Returns:
        HREF if one exists
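        A BASE lookup of this kind might look like the JDK-only sketch below. It is an approximation for illustration, not the Nutch code: BaseSketch, parse, and findBaseHref are hypothetical names, and returning null when no BASE tag (or no href attribute) is present is an assumption, since the description above does not say what is returned in that case.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of Nutch.
class BaseSketch {

    /** Parses well-formed markup into a DOM tree (helper assumed for this sketch). */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Returns the href of the first base element that has one, or null (assumed) if none does. */
    static String findBaseHref(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "base".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                return href;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String href = findBaseHref(child);
            if (href != null) {
                return href;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node doc = parse("<html><head><base href=\"http://example.com/docs/\"/></head></html>");
        System.out.println(findBaseHref(doc)); // prints "http://example.com/docs/"
    }
}
```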
      • getOutlinks

        public void getOutlinks(URL base,
                                ArrayList<Outlink> outlinks,
                                Node node)
        This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.

        Links without inner structure (tags, text, etc.) are discarded, as are links that contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).

        Parameters:
        base - the canonical URL
        outlinks - the ArrayList of Outlinks associated with the base URL
        node - a Node under which to discover anchors
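        The anchor discovery described above can be approximated with JDK classes as follows. This is a simplified sketch, not the Nutch implementation: OutlinkSketch, its nested Link class (a stand-in for Outlink), parse, and collectLinks are hypothetical, java.net.URI is used in place of URL for resolution, and only the first discard rule (anchors with no inner structure) is implemented. The nested-link rule is omitted.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of Nutch.
class OutlinkSketch {

    /** Minimal stand-in for an outlink record: resolved target plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Parses well-formed markup into a DOM tree (helper assumed for this sketch). */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    /** Adds a Link for each anchor beneath node that has an href and some inner structure. */
    static void collectLinks(URI base, List<Link> links, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            // Anchors without any child nodes (no tags, no text) are skipped.
            if (!href.isEmpty() && node.hasChildNodes()) {
                try {
                    String target = base.resolve(href).toString();
                    links.add(new Link(target, node.getTextContent().trim()));
                } catch (IllegalArgumentException e) {
                    // Malformed href: skipped in this sketch.
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectLinks(base, links, child);
        }
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        collectLinks(URI.create("http://example.com/dir/page.html"), links, parse(
                "<body><a href=\"other.html\">Other</a>"
                + "<a href=\"/top\">Top</a><a href=\"empty.html\"/></body>"));
        for (Link link : links) {
            System.out.println(link.toUrl + " [" + link.anchor + "]");
        }
        // prints:
        // http://example.com/dir/other.html [Other]
        // http://example.com/top [Top]
    }
}
```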
      • getOutlinks

        public void getOutlinks(URL base,
                                ArrayList<Outlink> outlinks,
                                List<org.apache.tika.sax.Link> tikaExtractedOutlinks)