Class DmozParser

  • public class DmozParser
    extends Object
    Utility that converts DMOZ RDF into a flat file of URLs to be injected.
    • Constructor Detail

      • DmozParser

        public DmozParser()
    • Method Detail

      • parseDmozFile

        public void parseDmozFile(File dmozFile,
                                  int subsetDenom,
                                  boolean includeAdult,
                                  int skew,
                                  Pattern topicPattern)
                           throws IOException,
                                  SAXException,
                                  ParserConfigurationException
        Iterate through all the items in this structured DMOZ file. Add each URL to the web db.
        Parameters:
        dmozFile - the input DMOZ file
        subsetDenom - denominator of the subset filter; each URL is emitted with a chance of 1/subsetDenom
        includeAdult - whether to include adult content
        skew - skew factor for the subset denominator filter
        topicPattern - a Pattern that is matched against the "r:id" element
        Throws:
        IOException - if there is a fatal error reading the input DMOZ file
        SAXException - can be thrown if there is an error configuring the internal SAXParser or XMLReader
        ParserConfigurationException - can be thrown if there is an error configuring the internal SAXParserFactory
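        The filtering parameters above can be illustrated with a small, self-contained sketch. It is not the DmozParser implementation: the class SubsetFilterSketch, its methods hashOf and accept, the MD5-based hash, and the way skew is added to the hash are all assumptions made for illustration. It shows one way a URL could be kept for roughly 1 in subsetDenom cases and only when its topic id matches topicPattern.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the documented API.
class SubsetFilterSketch {

    // Derive a non-negative number from the URL's MD5 digest, shifted by skew.
    // (Assumption: the real tool may hash and apply skew differently.)
    static int hashOf(String url, int skew) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            int h = 0;
            for (byte b : digest) {
                h = 31 * h + (b & 0xff);
            }
            // Mask the sign bit so the modulo in accept() is never negative.
            return (h + skew) & Integer.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keep a URL only if its topic matches (when a pattern is given)
    // and its hash falls into the 1/subsetDenom bucket.
    static boolean accept(String url, String topicId, int subsetDenom,
                          int skew, Pattern topicPattern) {
        if (topicPattern != null && !topicPattern.matcher(topicId).matches()) {
            return false;
        }
        return hashOf(url, skew) % subsetDenom == 0;
    }

    public static void main(String[] args) {
        Pattern arts = Pattern.compile("Top/Arts(/.*)?");
        String[] urls = {"http://example.com/a", "http://example.com/b", "http://example.com/c"};
        for (String url : urls) {
            System.out.println(url + " kept: " + accept(url, "Top/Arts/Music", 2, 0, arts));
        }
    }
}
```

        Because the decision depends only on the URL, the skew, and the denominator, repeated runs over the same input select the same subset, and changing skew selects a different one. A subsetDenom of 1 keeps every URL that passes the topic check.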
      • main

        public static void main(String[] argv)
                         throws Exception
        Command-line access. User may add URLs via a flat text file or the structured DMOZ file. By default, we ignore Adult material (as categorized by DMOZ).
        Parameters:
        argv - input arguments for this tool. If no arguments are provided, the tool prints its help text.
        Throws:
        Exception - if there is a fatal error
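
        To make the overall conversion concrete (structured DMOZ RDF in, flat list of URLs out), here is a minimal SAX-based sketch. It is not the DmozParser source: the class RdfUrlExtractorSketch, its SAMPLE data, the element and attribute names (ExternalPage, about, topic), and the "Top/Adult" prefix check are assumptions about the DMOZ RDF layout, used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of the documented API.
class RdfUrlExtractorSketch {

    // Tiny made-up sample in the assumed DMOZ-like layout.
    static final String SAMPLE =
        "<RDF xmlns:r='http://www.w3.org/TR/RDF/' xmlns:d='http://purl.org/dc/elements/1.0/'>"
        + "<Topic r:id='Top/Arts'/>"
        + "<ExternalPage about='http://example.com/art'>"
        + "<d:Title>Art</d:Title><topic>Top/Arts</topic></ExternalPage>"
        + "<ExternalPage about='http://example.org/x'>"
        + "<topic>Top/Adult/Misc</topic></ExternalPage>"
        + "</RDF>";

    // Collect the 'about' URL of each page, optionally skipping adult topics.
    static List<String> extractUrls(String rdf, boolean includeAdult) {
        List<String> urls = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(rdf)), new DefaultHandler() {
                private String currentUrl;
                private final StringBuilder text = new StringBuilder();

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    text.setLength(0); // start collecting text for this element
                    if ("ExternalPage".equals(qName)) {
                        currentUrl = atts.getValue("about");
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    // The page's topic decides whether its URL is emitted.
                    if ("topic".equals(qName) && currentUrl != null) {
                        String topic = text.toString().trim();
                        if (includeAdult || !topic.startsWith("Top/Adult")) {
                            urls.add(currentUrl);
                        }
                        currentUrl = null;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse RDF input", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print one URL per line, i.e. a flat list suitable for injection.
        for (String url : extractUrls(SAMPLE, false)) {
            System.out.println(url);
        }
    }
}
```

        A streaming SAX handler suits this kind of input because a full DMOZ dump is far too large to load as an in-memory document tree; the handler keeps only the current page's URL and text buffer.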