Class NodeDumper
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.scoring.webgraph.NodeDumper
- All Implemented Interfaces: Configurable, Tool
public class NodeDumper extends Configured implements Tool
A tool that dumps the top urls, ranked by number of inlinks, number of outlinks, or score, to a text file. One of its major uses is to check the top-scoring urls produced by a link analysis program such as LinkRank. To dump by number of inlinks or outlinks, the WebGraph program must have been run first. To dump by link analysis score, a program such as LinkRank, which updates the NodeDb of the WebGraph, must have been run first.
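The ranking described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation, which runs as a Hadoop job over the NodeDb; the class name `TopUrlsSketch`, the method `topN`, and the sample urls are hypothetical and exist only for this illustration. It assumes the per-url values (inlinks, outlinks, or scores) are already available in an in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Apache Nutch.
class TopUrlsSketch {

    /** Returns at most topN entries, ordered by value from highest to lowest. */
    static List<Map.Entry<String, Float>> topN(Map<String, Float> values, long topN) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(values.entrySet());
        // Sort in descending order of the associated value.
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        // Keep only the first topN entries (or all of them if there are fewer).
        int limit = (int) Math.min(topN, (long) entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        // Sample values standing in for scores read from a NodeDb.
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://a.example/", 0.5f);
        scores.put("http://b.example/", 2.0f);
        scores.put("http://c.example/", 1.25f);

        // Print one "url<TAB>value" line per top entry.
        for (Map.Entry<String, Float> e : topN(scores, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the actual tool the same idea is distributed across map and reduce tasks, so the sketch is useful only for reasoning about what the output should contain.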
Nested Class Summary
- static class NodeDumper.Dumper: Outputs the hosts or domains with an associated value.
- static class NodeDumper.Sorter: Outputs the top urls sorted in descending order.
Constructor Summary
- NodeDumper()
Method Summary
- void dumpNodes(Path webGraphDb, org.apache.nutch.scoring.webgraph.NodeDumper.DumpType type, long topN, Path output, boolean asEff, org.apache.nutch.scoring.webgraph.NodeDumper.NameType nameType, org.apache.nutch.scoring.webgraph.NodeDumper.AggrType aggrType, boolean asSequenceFile): Runs the process to dump the top urls out to a text file.
- static void main(String[] args)
- int run(String[] args): Runs the node dumper tool.
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Method Detail
-
dumpNodes
public void dumpNodes(Path webGraphDb, org.apache.nutch.scoring.webgraph.NodeDumper.DumpType type, long topN, Path output, boolean asEff, org.apache.nutch.scoring.webgraph.NodeDumper.NameType nameType, org.apache.nutch.scoring.webgraph.NodeDumper.AggrType aggrType, boolean asSequenceFile) throws Exception
Runs the process to dump the top urls out to a text file.
- Parameters:
  - webGraphDb - the WebGraph from which to pull values
  - type - the node property type to dump, one of NodeDumper.DumpType.INLINKS, NodeDumper.DumpType.OUTLINKS or NodeDumper.DumpType.SCORES
  - topN - maximum number of top links to dump
  - output - a Path to write output to
  - asEff - if true, use an equals sign as the separator, as expected by Solr's ExternalFileField; false otherwise
  - nameType - either NodeDumper.NameType.HOST or NodeDumper.NameType.DOMAIN
  - aggrType - the aggregation type, either NodeDumper.AggrType.MAX or NodeDumper.AggrType.SUM
  - asSequenceFile - if true, output is written using SequenceFileOutputFormat; otherwise the default TextOutputFormat is used
- Throws:
  - Exception - if an error occurs while dumping the top values
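To make the nameType, aggrType, and asEff parameters more concrete, the following is a rough sketch of what grouping by host with MAX or SUM aggregation and choosing a separator could look like. It is not the Nutch code: the class `HostAggregationSketch`, the enum `Aggr`, and both methods are hypothetical names introduced for this example, host extraction via `java.net.URI` is an assumption, and the tab used when asEff is false assumes the default key/value separator of Hadoop's TextOutputFormat.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical illustration only: not part of Apache Nutch.
class HostAggregationSketch {

    /** Stand-in for the aggregation type (MAX or SUM). */
    enum Aggr { MAX, SUM }

    /** Groups per-url values by host and combines them with the chosen aggregation. */
    static Map<String, Float> aggregateByHost(Map<String, Float> urlValues, Aggr aggr) {
        BinaryOperator<Float> combine = (aggr == Aggr.MAX) ? Float::max : Float::sum;
        Map<String, Float> byHost = new TreeMap<>();
        for (Map.Entry<String, Float> e : urlValues.entrySet()) {
            String host = URI.create(e.getKey()).getHost();
            if (host == null) {
                continue; // skip urls without a host component
            }
            byHost.merge(host, e.getValue(), combine);
        }
        return byHost;
    }

    /** Formats one output line; '=' when asEff is true, a tab otherwise (assumed default). */
    static String formatLine(String name, float value, boolean asEff) {
        return name + (asEff ? "=" : "\t") + value;
    }

    public static void main(String[] args) {
        Map<String, Float> values = new LinkedHashMap<>();
        values.put("http://a.example/x", 1.0f);
        values.put("http://a.example/y", 2.0f);
        values.put("http://b.example/", 5.0f);

        // One "host=value" line per host, using SUM aggregation.
        for (Map.Entry<String, Float> e : aggregateByHost(values, Aggr.SUM).entrySet()) {
            System.out.println(formatLine(e.getKey(), e.getValue(), true));
        }
    }
}
```

Grouping by domain would follow the same pattern with a different key function; the sketch leaves that out because deriving a registered domain from a host requires a public-suffix lookup.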