Class WebGraph
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.scoring.webgraph.WebGraph

All Implemented Interfaces:
Configurable, Tool

public class WebGraph extends Configured implements Tool
Creates three databases: one for inlinks, one for outlinks, and a node database that holds the number of inlinks and outlinks for a URL along with the URL's current score. The score is set by an analysis program such as LinkRank.

The WebGraph is an updatable database. Outlinks are stored by their fetch time, or by the current system time if no fetch time is available. Only the most recent version of the outlinks for a given URL is stored. As more crawls are executed and the WebGraph is updated, newer outlinks replace older outlinks. This allows the WebGraph to adapt to changes in the link structure of the web.

The Inlink database is created from the Outlink database and is regenerated when the WebGraph is updated. The Node database is created from both the Inlink and Outlink databases. Because the Node database is overwritten when the WebGraph is updated, and because it holds the current scores for URLs, it is recommended that a crawl cycle (one or more full crawls) complete fully before the WebGraph is updated and an analysis such as LinkRank is run, so that scores in the Node database are updated in a stable fashion.
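
The relationship between the three databases can be sketched in plain Java. This is a minimal in-memory model under stated assumptions, not the MapReduce jobs that WebGraph actually runs: the class WebGraphSketch, the NodeCounts record, and the methods invert and buildNodes are hypothetical names introduced for illustration, and scores are left out because they are set by a separate analysis program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the Outlink -> Inlink -> Node derivation.
public class WebGraphSketch {

    /** Hypothetical stand-in for a Node entry: link counts only, no score. */
    public static final class NodeCounts {
        public int numInlinks;
        public int numOutlinks;
    }

    /** Builds the inlink view (target -> sources) from the outlink view (source -> targets). */
    public static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            for (String target : entry.getValue()) {
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return inlinks;
    }

    /** Builds per-URL counts from both views, as the Node database draws on both. */
    public static Map<String, NodeCounts> buildNodes(Map<String, List<String>> outlinks,
                                                     Map<String, List<String>> inlinks) {
        Map<String, NodeCounts> nodes = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            NodeCounts counts = nodes.computeIfAbsent(entry.getKey(), k -> new NodeCounts());
            counts.numOutlinks = entry.getValue().size();
        }
        for (Map.Entry<String, List<String>> entry : inlinks.entrySet()) {
            NodeCounts counts = nodes.computeIfAbsent(entry.getKey(), k -> new NodeCounts());
            counts.numInlinks = entry.getValue().size();
        }
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new TreeMap<>();
        outlinks.put("http://a.example/", Arrays.asList("http://b.example/", "http://c.example/"));
        outlinks.put("http://b.example/", Arrays.asList("http://c.example/"));

        Map<String, List<String>> inlinks = invert(outlinks);
        Map<String, NodeCounts> nodes = buildNodes(outlinks, inlinks);

        for (Map.Entry<String, NodeCounts> entry : nodes.entrySet()) {
            System.out.println(entry.getKey()
                + " in=" + entry.getValue().numInlinks
                + " out=" + entry.getValue().numOutlinks);
        }
    }
}
```

Because the inlink and node views are derived entirely from the outlinks, regenerating them after each update keeps all three consistent, which is the property the description above relies on.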

Nested Class Summary
static class WebGraph.OutlinkDb
    The OutlinkDb creates a database of all outlinks.

Field Summary
static String INLINK_DIR
static String LOCK_NAME
static String NODE_DIR
static String OLD_OUTLINK_DIR
static String OUTLINK_DIR

Constructor Summary
WebGraph()

Method Summary
void createWebGraph(Path webGraphDb, Path[] segments, boolean normalize, boolean filter)
    Creates the three different WebGraph databases: Outlinks, Inlinks, and Node.
static void main(String[] args)
int run(String[] args)
    Parses command line arguments and runs the WebGraph jobs.

Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

Field Detail

LOCK_NAME
public static final String LOCK_NAME
See Also: Constant Field Values

INLINK_DIR
public static final String INLINK_DIR
See Also: Constant Field Values

OUTLINK_DIR
public static final String OUTLINK_DIR
See Also: Constant Field Values

OLD_OUTLINK_DIR
public static final String OLD_OUTLINK_DIR
See Also: Constant Field Values

NODE_DIR
public static final String NODE_DIR
See Also: Constant Field Values

Method Detail

createWebGraph
public void createWebGraph(Path webGraphDb, Path[] segments, boolean normalize, boolean filter) throws IOException, InterruptedException, ClassNotFoundException
Creates the three different WebGraph databases: Outlinks, Inlinks, and Node. If a WebGraph already exists, it is updated; otherwise, a new WebGraph database is created.
Parameters:
webGraphDb - The WebGraph to create or update.
segments - The array of segments used to update the WebGraph. Newer segments and fetch times will overwrite older segments.
normalize - Whether to use URLNormalizers on URLs in the segment.
filter - Whether to use URLFilters on URLs in the segment.
Throws:
IOException - If an error occurs while processing the WebGraph.
InterruptedException - If the Job is interrupted during execution.
ClassNotFoundException - If classes required to run the Job cannot be located.
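
The rule that newer segments and fetch times overwrite older ones can be illustrated with a small stand-alone sketch. It is a simplified model, not the code that createWebGraph runs: the class OutlinkUpdateSketch, the LinkSet record, and the methods effectiveTime and update are hypothetical, and treating a fetch time of zero or less as "not available" is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "most recent outlinks win" when segments are merged.
public class OutlinkUpdateSketch {

    /** Hypothetical record: the outlinks seen for one URL, stamped with a time. */
    public static final class LinkSet {
        public final long timestamp;
        public final List<String> targets;

        public LinkSet(long timestamp, List<String> targets) {
            this.timestamp = timestamp;
            this.targets = targets;
        }
    }

    /** Uses the fetch time when known; assumes fetchTime <= 0 means "not available". */
    public static long effectiveTime(long fetchTime, long now) {
        return fetchTime > 0 ? fetchTime : now;
    }

    /** Stores the candidate unless a strictly newer entry is already present for the URL. */
    public static void update(Map<String, LinkSet> db, String url, LinkSet candidate) {
        db.merge(url, candidate,
            (stored, incoming) -> incoming.timestamp >= stored.timestamp ? incoming : stored);
    }

    public static void main(String[] args) {
        Map<String, LinkSet> db = new HashMap<>();
        long now = System.currentTimeMillis();

        // An older crawl followed by a newer crawl of the same URL.
        update(db, "http://a.example/",
            new LinkSet(effectiveTime(1000L, now), Arrays.asList("http://old.example/")));
        update(db, "http://a.example/",
            new LinkSet(effectiveTime(2000L, now), Arrays.asList("http://new.example/")));

        System.out.println(db.get("http://a.example/").targets);
    }
}
```

In this model the order in which segments are processed does not matter, since the timestamp alone decides which set of outlinks survives.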