Package org.apache.nutch.tools.arc
Class ArcRecordReader
java.lang.Object
  org.apache.hadoop.mapreduce.RecordReader<Text,BytesWritable>
    org.apache.nutch.tools.arc.ArcRecordReader

All Implemented Interfaces:
Closeable, AutoCloseable
public class ArcRecordReader extends RecordReader<Text,BytesWritable>
The ArcRecordReader class provides a record reader that reads records from arc files. Arc files are essentially tars of gzips: each record in an arc file is an individually gzip-compressed block, and multiple records are concatenated to form a complete arc file. Arc files are used by the Internet Archive and grub projects. For more information on the arc file format, see the ArcFileFormat reference listed under See Also.
See Also:
ArcFileFormat, archive.org, grub.org
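The "concatenated gzip records" layout described above can be illustrated with the JDK alone. The following is a minimal sketch, not Nutch code: the class `ConcatenatedGzipSketch` and its methods are hypothetical names, and the record contents are made up. It compresses each record into its own gzip member, appends the members back to back, and then decompresses the whole buffer with `java.util.zip.GZIPInputStream`, which continues across member boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it is not part of Nutch or Hadoop.
final class ConcatenatedGzipSketch {

  // Compress a single record into its own standalone gzip member.
  static byte[] gzipMember(String record) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(record.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed at this point, so the member is complete.
    return buffer.toByteArray();
  }

  // Append members back to back, the way records sit in an arc-like file.
  static byte[] concatenate(byte[]... members) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] member : members) {
      out.write(member, 0, member.length);
    }
    return out.toByteArray();
  }

  // Decompress all members in sequence into one string.
  static String gunzipAll(byte[] data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] chunk = new byte[4096];
      int read;
      while ((read = gzip.read(chunk)) != -1) {
        out.write(chunk, 0, read);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

Because every member starts with its own gzip header, a reader can in principle begin at any member boundary, which is what makes such files suitable for split-based processing.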
-
-
Field Summary
Fields (Modifier and Type, Field)
protected Configuration conf
protected long fileLen
protected FSDataInputStream in
protected long pos
protected long splitEnd
protected long splitLen
protected long splitStart
-
Constructor Summary
Constructors (Constructor, Description)
ArcRecordReader(Configuration conf, FileSplit split)
Constructor that sets the configuration and file split.
-
Method Summary
Methods (Modifier and Type, Method, Description)
void close()
Closes the record reader resources.
Text createKey()
Creates a new instance of the Text object for the key.
BytesWritable createValue()
Creates a new instance of the BytesWritable object for the value.
Text getCurrentKey()
BytesWritable getCurrentValue()
long getPos()
Returns the current position in the file.
float getProgress()
Returns the percentage of progress in processing the file.
void initialize(InputSplit split, TaskAttemptContext context)
static boolean isMagic(byte[] input)
Returns true if the byte array passed matches the gzip header magic number.
boolean next(Text key, BytesWritable value)
Returns true if the next record in the split is read into the key and value pair.
boolean nextKeyValue()
-
-
-
Field Detail
-
conf
protected Configuration conf
-
splitStart
protected long splitStart
-
pos
protected long pos
-
splitEnd
protected long splitEnd
-
splitLen
protected long splitLen
-
fileLen
protected long fileLen
-
in
protected FSDataInputStream in
-
-
Constructor Detail
-
ArcRecordReader
public ArcRecordReader(Configuration conf, FileSplit split) throws IOException
Constructor that sets the configuration and file split.
Parameters:
conf - The job configuration.
split - The file split to read from.
Throws:
IOException - If an IO error occurs while initializing file split.
-
-
Method Detail
-
isMagic
public static boolean isMagic(byte[] input)
Returns true if the byte array passed matches the gzip header magic number.
Parameters:
input - The byte array to check.
Returns:
True if the byte array matches the gzip header magic number.
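For illustration, a gzip magic-number check can be sketched as follows. This is not the Nutch implementation: `GzipMagicSketch` and `startsWithGzipMagic` are hypothetical names, and the sketch assumes a prefix match is wanted (the page above does not say whether `isMagic` requires the array to be exactly the length of the magic number). The two signature bytes, 0x1f 0x8b, come from the gzip format specification (RFC 1952).

```java
// Hypothetical standalone helper; not the Nutch implementation of isMagic.
final class GzipMagicSketch {

  // Two-byte gzip member signature defined by RFC 1952.
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

  // Returns true if the array begins with the gzip signature.
  // Assumption: a prefix match is sufficient; null or short input is rejected.
  static boolean startsWithGzipMagic(byte[] input) {
    if (input == null || input.length < GZIP_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < GZIP_MAGIC.length; i++) {
      if (input[i] != GZIP_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }
}
```

The explicit `(byte)` casts matter because Java bytes are signed, so `0x8b` does not fit in a `byte` literal without one.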
-
close
public void close() throws IOException
Closes the record reader resources.
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Specified by:
close in class RecordReader<Text,BytesWritable>
Throws:
IOException
-
createKey
public Text createKey()
Creates a new instance of the Text object for the key.
Returns:
Text
-
createValue
public BytesWritable createValue()
Creates a new instance of the BytesWritable object for the value.
Returns:
BytesWritable
-
getPos
public long getPos() throws IOException
Returns the current position in the file.
Returns:
The current position in the file, as a long.
Throws:
IOException - If there is a fatal I/O error reading the position within the FSDataInputStream.
-
getProgress
public float getProgress() throws IOException
Returns the progress in processing the file as a fraction, represented as a float from 0 to 1, where 1 means 100% completed.
Specified by:
getProgress in class RecordReader<Text,BytesWritable>
Returns:
The progress as a float from 0 to 1.
Throws:
IOException
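One plausible way to derive a 0-to-1 progress value from the position and split fields listed on this page is sketched below. The formula is an assumption made for illustration, not a statement of how ArcRecordReader computes it; `SplitProgressSketch` and `progress` are hypothetical names, and the parameters mirror the `pos`, `splitStart`, and `splitLen` fields.

```java
// Hypothetical helper; the formula is an assumed one, not taken from Nutch.
final class SplitProgressSketch {

  // Fraction of the split consumed so far, clamped to the range [0, 1].
  static float progress(long pos, long splitStart, long splitLen) {
    if (splitLen <= 0) {
      // An empty split has nothing to report progress on.
      return 0.0f;
    }
    float fraction = (pos - splitStart) / (float) splitLen;
    return Math.max(0.0f, Math.min(1.0f, fraction));
  }
}
```

Clamping keeps the result within bounds when a reader finishes a record that extends past the nominal end of its split.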
-
getCurrentValue
public BytesWritable getCurrentValue()
Specified by:
getCurrentValue in class RecordReader<Text,BytesWritable>
-
getCurrentKey
public Text getCurrentKey()
Specified by:
getCurrentKey in class RecordReader<Text,BytesWritable>
-
nextKeyValue
public boolean nextKeyValue()
Specified by:
nextKeyValue in class RecordReader<Text,BytesWritable>
-
initialize
public void initialize(InputSplit split, TaskAttemptContext context)
Specified by:
initialize in class RecordReader<Text,BytesWritable>
-
next
public boolean next(Text key, BytesWritable value) throws IOException
Returns true if the next record in the split is read into the key and value pair. The key will be the arc record header and the value will be the raw content bytes of the arc record.
Parameters:
key - The record key.
value - The record value.
Returns:
True if the next record is read.
Throws:
IOException - If an error occurs while reading the record value.
-
-