Package org.apache.nutch.tools.arc
Class ArcSegmentCreator
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.tools.arc.ArcSegmentCreator

All Implemented Interfaces:
Configurable, Tool
public class ArcSegmentCreator extends Configured implements Tool
ArcSegmentCreator is a replacement for the fetcher: it takes arc files as input and produces a Nutch segment as output. Arc files are tars of compressed gzips, produced by both the Internet Archive project and the Grub distributed crawler project.
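The input format can be illustrated with a small, self-contained sketch. It assumes that "compressed gzips" means a sequence of independently gzipped records stored back to back in one file; the class name `GzipMembersSketch` and its helper methods are hypothetical and are not part of Nutch. The sketch builds two gzip members in memory, joins them, and reads them back as one stream using only `java.util.zip`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not the reader that ArcSegmentCreator uses.
final class GzipMembersSketch {

    // Compresses one record into its own gzip member.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Stores two gzip members back to back, as assumed for an arc file.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    // Decompresses every member in order; GZIPInputStream continues
    // into the next member when more data follows a trailer.
    static String gunzipAll(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = concat(gzip("record one\n"), gzip("record two\n"));
        System.out.print(gunzipAll(file));
    }
}
```

A real arc reader would also need to parse the per-record header lines, which this sketch leaves out.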
Nested Class Summary
Nested Classes
Modifier and Type    Class                                        Description
static class         ArcSegmentCreator.ArcSegmentCreatorMapper
-
Field Summary
Fields
Modifier and Type    Field          Description
static String        URL_VERSION
-
Constructor Summary
Constructors
Constructor                              Description
ArcSegmentCreator()
ArcSegmentCreator(Configuration conf)    Constructor that sets the job configuration.
-
Method Summary
Modifier and Type    Method                                                Description
void                 close()
void                 createSegments(Path arcFiles, Path segmentsOutDir)    Creates the job that converts arc files to segments.
static String        generateSegmentName()                                 Generates a random name for the segments.
static void          main(String[] args)
int                  run(String[] args)
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
URL_VERSION
public static final String URL_VERSION
See Also:
Constant Field Values
-
-
Constructor Detail
-
ArcSegmentCreator
public ArcSegmentCreator()
-
ArcSegmentCreator
public ArcSegmentCreator(Configuration conf)
Constructor that sets the job configuration.
Parameters:
conf - a populated Configuration
-
-
Method Detail
-
generateSegmentName
public static String generateSegmentName()
Generates a random name for the segments.
Returns:
The generated segment name.
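This page does not say how the name is built. As a rough illustration, the sketch below assumes the common Nutch convention of naming segment directories with a `yyyyMMddHHmmss` timestamp. That is an assumption, not the method's actual code, and the class `SegmentNameSketch` and the method `segmentNameFor` are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration; not the implementation of generateSegmentName().
final class SegmentNameSketch {

    // Assumed naming pattern: four-digit year down to seconds, no separators.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Formats the given time as a segment directory name.
    static String segmentNameFor(LocalDateTime time) {
        return FORMAT.format(time);
    }

    public static void main(String[] args) {
        System.out.println(segmentNameFor(LocalDateTime.now()));
    }
}
```

Passing the time in as a parameter keeps the sketch deterministic and easy to check; a name derived from the current time is only unique if two names are never requested within the same second.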
-
close
public void close()
-
createSegments
public void createSegments(Path arcFiles, Path segmentsOutDir) throws IOException, InterruptedException, ClassNotFoundException
Creates the job that converts arc files to segments.
Parameters:
arcFiles - The path to the directory holding the arc files
segmentsOutDir - The output directory for writing the segments
Throws:
IOException - if an IO error occurs while running the job
InterruptedException - if this Job is interrupted
ClassNotFoundException - if there is an error locating a class during runtime