Class DmozParser


  • public class DmozParser
    extends Object
    Utility that converts DMOZ RDF into a flat file of URLs to be injected.
    • Constructor Detail

      • DmozParser

        public DmozParser()
    • Method Detail

      • parseDmozFile

        public void parseDmozFile(File dmozFile,
                                  int subsetDenom,
                                  boolean includeAdult,
                                  int skew,
                                  Pattern topicPattern)
                           throws IOException,
                                  SAXException,
                                  ParserConfigurationException
        Iterate through all the items in this structured DMOZ file. Add each URL to the web db.
        Parameters:
        dmozFile - the input DMOZ File
        subsetDenom - denominator of the subset filter
        includeAdult - whether to include adult content
        skew - skew factor for the subset denominator filter. URLs are emitted only with a chance of 1/denominator
        topicPattern - a Pattern which will be matched against the "r:id" element
        Throws:
        IOException - if there is a fatal error reading the input DMOZ file
        SAXException - can be thrown if there is an error configuring the internal SAXParser or XMLReader
        ParserConfigurationException - can be thrown if there is an error configuring the internal SAXParserFactory
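        The interaction of subsetDenom, skew and topicPattern can be illustrated with a small standalone sketch. It is not the parser's actual implementation: the class DmozFilterSketch, its two helper methods, the use of String.hashCode() as the hash, and the sample topic ids and URLs are all assumptions made for illustration. The sketch only shows one plausible way to keep roughly one URL in subsetDenom, let skew select a different subset, and restrict output to topics whose id matches a Pattern.

```java
import java.util.List;
import java.util.regex.Pattern;

public class DmozFilterSketch {

    // Hypothetical helper: keeps a URL when its skewed hash falls into
    // bucket 0 out of subsetDenom buckets, i.e. roughly 1 URL in subsetDenom.
    // String.hashCode() is an assumption; the real tool may hash differently.
    public static boolean inSubset(String url, int subsetDenom, int skew) {
        // XOR with skew so that different skew values pick different subsets.
        int hash = url.hashCode() ^ skew;
        // floorMod never returns a negative value for a positive divisor.
        return Math.floorMod(hash, subsetDenom) == 0;
    }

    // Hypothetical helper: true when the whole topic id matches the pattern.
    public static boolean topicMatches(String topicId, Pattern topicPattern) {
        return topicPattern.matcher(topicId).matches();
    }

    public static void main(String[] args) {
        // Assumed pattern: select everything under Top/Arts.
        Pattern arts = Pattern.compile("^Top/Arts.*");

        // Made-up (topic id, URL) pairs standing in for parsed DMOZ entries.
        List<String[]> pages = List.of(
            new String[] {"Top/Arts/Music", "http://example.org/music"},
            new String[] {"Top/Arts/Film", "http://example.org/film"},
            new String[] {"Top/Science/Physics", "http://example.org/physics"});

        for (String[] page : pages) {
            boolean keep = topicMatches(page[0], arts)
                && inSubset(page[1], 2, 0);
            System.out.println(page[1] + " -> " + (keep ? "emit" : "skip"));
        }
    }
}
```

        With subsetDenom set to 1 every URL passes the subset filter, so only the topic pattern decides what is emitted. Because the filter is a function of the URL's hash rather than a random draw, repeated runs over the same input select the same subset.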
      • main

        public static void main(String[] argv)
                         throws Exception
        Command-line access. Users may add URLs via a flat text file or the structured DMOZ file. By default, adult material (as categorized by DMOZ) is ignored.
        Parameters:
        argv - input arguments for this tool. If fewer than one argument is provided, the tool prints help.
        Throws:
        Exception - if there is a fatal error