Class NodeDumper

  • All Implemented Interfaces:
    Configurable, Tool

    public class NodeDumper
    extends Configured
    implements Tool
    A tool that dumps the top urls, ranked by number of inlinks, number of outlinks, or score, to a text file. A major use of this tool is to check the top-scoring urls produced by a link analysis program such as LinkRank. To dump by number of inlinks or outlinks, the WebGraph program must have been run first. To dump by link analysis score, a program such as LinkRank, which updates the NodeDb of the WebGraph, must have been run first.
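    The idea of a "top N by score" dump can be illustrated with a minimal, self-contained sketch. This is not the tool's implementation, which runs as a Hadoop job over the NodeDb; the class TopNSketch and its topN method are hypothetical names used only for illustration, and the scores are assumed to fit in an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: ranks url -> score pairs held in memory.
final class TopNSketch {

    /** Returns at most topN entries, ordered from highest to lowest score. */
    static List<Map.Entry<String, Float>> topN(Map<String, Float> scores, long topN) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        // Sort by value, largest first.
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        // Clamp the limit to the range [0, entries.size()].
        int limit = (int) Math.max(0, Math.min(topN, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }
}
```

    If topN exceeds the number of available entries, the sketch simply returns all of them.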
    • Constructor Detail

      • NodeDumper

        public NodeDumper()
    • Method Detail

      • dumpNodes

        public void dumpNodes(Path webGraphDb,
                              org.apache.nutch.scoring.webgraph.NodeDumper.DumpType type,
                              long topN,
                              Path output,
                              boolean asEff,
                              org.apache.nutch.scoring.webgraph.NodeDumper.NameType nameType,
                              org.apache.nutch.scoring.webgraph.NodeDumper.AggrType aggrType,
                              boolean asSequenceFile)
                       throws Exception
        Runs the process to dump the top urls out to a text file.
        Parameters:
        webGraphDb - The WebGraph from which to pull values.
        type - the node property type to dump, one of NodeDumper.DumpType.INLINKS, NodeDumper.DumpType.OUTLINKS or NodeDumper.DumpType.SCORES
        topN - maximum number of top urls to dump
        output - a Path to write output to
        asEff - if true, use an equals sign as the separator, as expected by Solr's ExternalFileField; if false, keep the default separator
        nameType - the name type to aggregate by, either NodeDumper.NameType.HOST or NodeDumper.NameType.DOMAIN
        aggrType - the aggregation type, either NodeDumper.AggrType.MAX or NodeDumper.AggrType.SUM
        asSequenceFile - if true, output is written using SequenceFileOutputFormat; otherwise the default TextOutputFormat is used
        Throws:
        Exception - If an error occurs while dumping the top values.
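        As a rough illustration of what the nameType, aggrType and asEff parameters describe, the following sketch groups url scores by host, combines them with MAX or SUM, and formats one output line. It is an assumption-laden stand-in, not the code of this class: AggregateSketch, its Aggr enum, byHost and formatLine are hypothetical names, host extraction uses java.net.URI rather than Nutch's own utilities, and the tab used when asEff is false reflects the default separator of Hadoop's TextOutputFormat.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical illustration only; not part of the Nutch API.
final class AggregateSketch {

    /** Stand-in for the MAX and SUM aggregation types. */
    enum Aggr { MAX, SUM }

    /** Groups url scores by host and combines each group's values. */
    static Map<String, Float> byHost(Map<String, Float> urlScores, Aggr aggr) {
        BinaryOperator<Float> combine = (aggr == Aggr.MAX) ? Float::max : Float::sum;
        Map<String, Float> result = new TreeMap<>();
        for (Map.Entry<String, Float> e : urlScores.entrySet()) {
            String host = URI.create(e.getKey()).getHost();
            if (host == null) {
                continue; // skip urls without a parseable host
            }
            result.merge(host, e.getValue(), combine);
        }
        return result;
    }

    /** Formats one line: "name=value" when asEff is true, otherwise tab-separated. */
    static String formatLine(String name, float value, boolean asEff) {
        return name + (asEff ? "=" : "\t") + value;
    }
}
```

        With SUM, two urls on the same host scoring 0.5 and 1.5 would yield a single entry of 2.0 for that host; with MAX the entry would be 1.5.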