Package org.apache.nutch.hostdb
Class UpdateHostDbMapper
- java.lang.Object
-
- org.apache.hadoop.mapreduce.Mapper<Text,Writable,Text,NutchWritable>
-
- org.apache.nutch.hostdb.UpdateHostDbMapper
-
public class UpdateHostDbMapper extends Mapper<Text,Writable,Text,NutchWritable>
Mapper ingesting HostDB and CrawlDB entries. It can also read host score information from a plain-text key/value file generated by the WebGraph NodeDumper tool.
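As an illustration of the plain-text key/value input mentioned above, the following minimal sketch splits one line into a host key and a score value. It assumes a tab-separated layout (the default separator of Hadoop's KeyValueTextInputFormat) and a score that parses as a float; neither detail is stated on this page. The class name `HostScoreLine` and the method `split` are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class HostScoreLine {

  /**
   * Splits a line at the first tab. Without a tab, the whole line becomes
   * the key and the value is empty (assumed layout, not confirmed here).
   */
  public static Map.Entry<String, String> split(String line) {
    int tab = line.indexOf('\t');
    if (tab < 0) {
      return new SimpleEntry<String, String>(line, "");
    }
    return new SimpleEntry<String, String>(
        line.substring(0, tab), line.substring(tab + 1));
  }

  public static void main(String[] args) {
    // Hypothetical line from a host score dump: hostname, tab, score
    Map.Entry<String, String> entry = split("example.org\t0.75");
    float score = Float.parseFloat(entry.getValue());
    System.out.println(entry.getKey() + " -> " + score);
  }
}
```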
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
Mapper.Context
-
-
Field Summary
Fields
Modifier and Type          Field
protected String[]         args
protected String           buffer
protected CrawlDatum       crawlDatum
protected boolean          filter
protected URLFilters       filters
protected Text             host
protected HostDatum        hostDatum
protected boolean          normalize
protected URLNormalizers   normalizers
protected boolean          readingCrawlDb
protected String           reprUrl
-
Constructor Summary
Constructors
UpdateHostDbMapper()
-
Method Summary
protected String filterNormalize(String hostName)
Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
void map(Text key, Writable value, Mapper.Context context)
Mapper ingesting records from the HostDB, CrawlDB and plaintext host scores file.
void setup(Mapper.Context context)
-
-
-
Field Detail
-
host
protected Text host
-
hostDatum
protected HostDatum hostDatum
-
crawlDatum
protected CrawlDatum crawlDatum
-
reprUrl
protected String reprUrl
-
buffer
protected String buffer
-
args
protected String[] args
-
filter
protected boolean filter
-
normalize
protected boolean normalize
-
readingCrawlDb
protected boolean readingCrawlDb
-
filters
protected URLFilters filters
-
normalizers
protected URLNormalizers normalizers
-
-
Method Detail
-
setup
public void setup(Mapper.Context context)
-
filterNormalize
protected String filterNormalize(String hostName)
Filters and/or normalizes the input hostname by applying the configured URL filters and normalizers to the URL "http://hostname/".
Parameters:
hostName - the input hostname
Returns:
the normalized hostname, or null if the URL is excluded by the URL filters or could not be normalized
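The behaviour described for filterNormalize can be sketched roughly as follows. This is not the Nutch implementation: the two `UnaryOperator<String>` fields are hypothetical stand-ins for the configured URLNormalizers and URLFilters, and the order (normalize first, then filter) is an assumption. Only the wrapping of the hostname into "http://hostname/" and the null result for rejected or unparsable URLs come from the description above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class FilterNormalizeSketch {

  // Hypothetical stand-ins: each returns the (possibly rewritten) URL,
  // or null when the URL is rejected. A null field means "step disabled".
  private final UnaryOperator<String> normalizer;
  private final UnaryOperator<String> filter;

  public FilterNormalizeSketch(UnaryOperator<String> normalizer,
                               UnaryOperator<String> filter) {
    this.normalizer = normalizer;
    this.filter = filter;
  }

  /** Returns the resulting hostname, or null if rejected or unparsable. */
  public String filterNormalize(String hostName) {
    // Wrap the bare hostname so URL-based filters and normalizers can run
    String url = "http://" + hostName + "/";
    if (normalizer != null) {
      url = normalizer.apply(url);
      if (url == null) {
        return null;
      }
    }
    if (filter != null) {
      url = filter.apply(url);
      if (url == null) {
        return null;
      }
    }
    try {
      // Convert the URL back into a hostname
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    FilterNormalizeSketch sketch = new FilterNormalizeSketch(
        u -> u.toLowerCase(Locale.ROOT),          // toy normalizer
        u -> u.contains("blocked") ? null : u);   // toy filter
    System.out.println(sketch.filterNormalize("Example.ORG"));
    System.out.println(sketch.filterNormalize("blocked.example"));
  }
}
```

With the toy rules in `main`, the first call yields the lowercased hostname and the second yields null because the filter rejects the URL.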
-
map
public void map(Text key, Writable value, Mapper.Context context) throws IOException, InterruptedException
Mapper ingesting records from the HostDB, CrawlDB and plaintext host scores file. Statistics and scores are passed on.
Overrides:
map in class Mapper<Text,Writable,Text,NutchWritable>
Parameters:
key - record Text key
value - associated Writable object
context - Mapper.Context for writing custom counters and output
Throws:
IOException
InterruptedException
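One way to picture a mapper that accepts three kinds of input is a dispatch on the value type, with every output keyed by hostname. The sketch below uses plain Java stand-ins rather than Hadoop or Nutch types: `CrawlEntry` and `HostEntry` are hypothetical placeholders for CrawlDatum and HostDatum, a `String` value stands in for a plaintext score, and the assumption that CrawlDB keys are URLs while HostDB and score keys are already hostnames is not stated on this page.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class MapDispatchSketch {

  /** Hypothetical stand-in for a CrawlDB record. */
  public static final class CrawlEntry {
    public final int status;
    public CrawlEntry(int status) { this.status = status; }
  }

  /** Hypothetical stand-in for a HostDB record. */
  public static final class HostEntry {
    public final float score;
    public HostEntry(float score) { this.score = score; }
  }

  /** Returns the (host, payload) pair to emit, or null to skip the record. */
  public static Map.Entry<String, Object> map(String key, Object value) {
    if (value instanceof CrawlEntry) {
      // Assumed: the key is a URL, so reduce it to its hostname
      try {
        String host = URI.create(key).getHost();
        return host == null ? null : new SimpleEntry<String, Object>(host, value);
      } catch (IllegalArgumentException e) {
        return null;
      }
    }
    if (value instanceof HostEntry) {
      // Assumed: the key is already a hostname; pass the record on
      return new SimpleEntry<String, Object>(key, value);
    }
    if (value instanceof String) {
      // Assumed: plaintext score keyed by hostname
      try {
        return new SimpleEntry<String, Object>(key, Float.valueOf((String) value));
      } catch (NumberFormatException e) {
        return null;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(map("http://example.org/page", new CrawlEntry(2)).getKey());
    System.out.println(map("example.org", "0.5").getValue());
  }
}
```

Keying all three inputs by hostname is what would let a downstream reducer see every record for one host together.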
-
-