Package org.apache.nutch.tools
Class DmozParser
java.lang.Object
    org.apache.nutch.tools.DmozParser
Constructor Summary
Constructor    Description
DmozParser()
Method Summary
static void main(String[] argv)
    Command-line access.
void parseDmozFile(File dmozFile, int subsetDenom, boolean includeAdult, int skew, Pattern topicPattern)
    Iterate through all the items in this structured DMOZ file.
Method Detail
parseDmozFile
public void parseDmozFile(File dmozFile, int subsetDenom, boolean includeAdult, int skew, Pattern topicPattern) throws IOException, SAXException, ParserConfigurationException
Iterate through all the items in this structured DMOZ file. Add each URL to the web db.
Parameters:
dmozFile - the input DMOZ File
subsetDenom - subset denominator filter
includeAdult - whether to include adult content
skew - skew factor for the subset denominator filter. Items are emitted only with a chance of 1/denominator
topicPattern - a Pattern which will match against the "r:id" element
Throws:
IOException - if there is a fatal error reading the input DMOZ file
SAXException - can be thrown if there is an error configuring the internal SAXParser or XMLReader
ParserConfigurationException - can be thrown if there is an error configuring the internal SAXParserFactory
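The subsetDenom and skew parameters describe a filter that emits each item with a chance of 1/denominator. The sketch below shows one way such a filter could work. It is an illustration only: the class SubsetFilterSketch, the helpers keep and subset, and the choice of String.hashCode() as the hash are all assumptions made for this example, not DmozParser's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SubsetFilterSketch {

    // Hypothetical helper (not part of DmozParser): keep a URL only when its
    // skewed hash lands in bucket 0 out of subsetDenom buckets. The hash
    // function here is an assumption chosen to keep the sketch self-contained.
    static boolean keep(String url, int subsetDenom, int skew) {
        int skewedHash = url.hashCode() ^ skew;
        // floorMod avoids negative remainders for negative hash values.
        return Math.floorMod(skewedHash, subsetDenom) == 0;
    }

    // Apply the filter to a list of URLs and return the ones that are kept.
    static List<String> subset(List<String> urls, int subsetDenom, int skew) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (keep(url, subsetDenom, skew)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            urls.add("http://example.org/page" + i);
        }
        // A denominator of 1 keeps every URL.
        System.out.println("denom=1:  " + subset(urls, 1, 0).size());
        // Larger denominators keep roughly 1/denominator of the URLs;
        // changing the skew selects a different subset of similar size.
        System.out.println("denom=10: " + subset(urls, 10, 0).size());
        System.out.println("denom=10, skew=7: " + subset(urls, 10, 7).size());
    }
}
```

Because the decision depends only on the URL, the denominator, and the skew, the same inputs always yield the same subset, while a different skew value picks a different slice of the data.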
-
main
public static void main(String[] argv) throws Exception
Command-line access. User may add URLs via a flat text file or the structured DMOZ file. By default, we ignore Adult material (as categorized by DMOZ).
Parameters:
argv - input arguments for this tool. If fewer than one argument is provided, the tool will print help.
Throws:
Exception - if there is a fatal error