Package org.apache.nutch.indexer.replace
Class ReplaceIndexer
- java.lang.Object
  - org.apache.nutch.indexer.replace.ReplaceIndexer
- All Implemented Interfaces: Configurable, IndexingFilter, Pluggable
public class ReplaceIndexer extends Object implements IndexingFilter
Do pattern replacements on selected field contents prior to indexing. To use this plugin, add index-replace to your plugin.includes. Example:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http)|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|replace)|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>
And then add the index.replace.regexp property to conf/nutch-site.xml. This contains a list of replacement instructions per field name, one per line, e.g. fieldname=/regexp/replacement/[flags]
<property>
  <name>index.replace.regexp</name>
  <value>
    hostmatch=.*\\.com
    title=/search/replace/2
  </value>
</property>
hostmatch= and urlmatch= lines indicate the match pattern for a host or url. The field replacements that follow such a line apply only to pages from the matching host or url. Replacements run in the order specified. Field names may appear multiple times if multiple replacements are needed. The property format is defined in greater detail in conf/nutch-default.xml.
- Author:
- Peter Ciuffetti
- See Also:
- NUTCH-2058
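The fieldname=/regexp/replacement/[flags] instruction format from the class description can be illustrated with a minimal sketch. This is not the plugin's actual code: ReplaceInstruction is a hypothetical helper. It assumes the first character after the equals sign is the delimiter, and that flags is an integer passed to java.util.regex.Pattern.compile (the value 2 in the example above equals Pattern.CASE_INSENSITIVE). conf/nutch-default.xml remains the authoritative description.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): one parsed replacement instruction. */
final class ReplaceInstruction {
    final String fieldName;
    final Pattern pattern;
    final String replacement;

    private ReplaceInstruction(String fieldName, Pattern pattern, String replacement) {
        this.fieldName = fieldName;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    /** Parses a line such as "title=/search/replace/2". */
    static ReplaceInstruction parse(String line) {
        int eq = line.indexOf('=');
        if (eq <= 0 || eq + 1 >= line.length()) {
            throw new IllegalArgumentException("Expected fieldname=/regexp/replacement/[flags]: " + line);
        }
        String field = line.substring(0, eq).trim();
        String spec = line.substring(eq + 1).trim();
        if (spec.isEmpty()) {
            throw new IllegalArgumentException("Empty instruction for field: " + field);
        }

        // Assumption: the first character of the spec is the delimiter.
        String delimiter = Pattern.quote(String.valueOf(spec.charAt(0)));
        String[] parts = spec.substring(1).split(delimiter, -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement part: " + line);
        }

        // Assumption: flags is an integer accepted by Pattern.compile; absent means 0.
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        return new ReplaceInstruction(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    /** Replaces every match of the pattern in the given field value. */
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplaceInstruction instruction = ReplaceInstruction.parse("title=/search/replace/2");
        System.out.println(instruction.fieldName + ": " + instruction.apply("Search results"));
    }
}
```

With the flag value 2 the match ignores case, so a title starting with "Search" is rewritten just like one starting with "search". Leaving the flags part empty falls back to a case-sensitive match in this sketch.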
Field Summary
-
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
Constructor Summary
Constructor: ReplaceIndexer()
-
Method Summary
Modifier and Type, Method, Description:
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) - Adds fields or otherwise modifies the document that will be indexed for a parse.
Configuration getConf()
void setConf(Configuration conf)
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by: filter in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws:
IndexingException - if an error occurs during filtering
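The return contract of filter (a modified document, or null to discard it) can be sketched with plain Java stand-ins. Nothing here is a Nutch class: a Map plays the role of NutchDocument, DocFilter is a hypothetical single-method interface, and the chaining loop in runFilters is an assumption about how a caller might honour the null return, not a copy of Nutch's own code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins (not Nutch classes) illustrating the null-return contract. */
final class FilterChainSketch {

    /** Stand-in for an indexing filter: returns the document, or null to discard it. */
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc);
    }

    /** Runs filters in order; stops and returns null as soon as one discards the document. */
    static Map<String, String> runFilters(List<DocFilter> filters, Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // document should not be indexed
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> filters = new ArrayList<>();
        // Modifies a field in place and passes the document on.
        filters.add(d -> { d.put("title", d.get("title").trim()); return d; });
        // Discards documents that have no "url" field.
        filters.add(d -> d.containsKey("url") ? d : null);

        Map<String, String> doc = new HashMap<>();
        doc.put("title", "  Example page  ");
        doc.put("url", "http://example.com/");
        System.out.println(runFilters(filters, doc));
    }
}
```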