Class ReplaceIndexer

  • All Implemented Interfaces:
    Configurable, IndexingFilter, Pluggable

    public class ReplaceIndexer
    extends Object
    implements IndexingFilter
    Performs pattern replacements on selected field contents before indexing. To use this plugin, add index-replace to the plugin.includes property. Example:
       <property>
        <name>plugin.includes</name>
        <value>protocol-(http)|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|replace)|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
       </property>
     
    Then add the index.replace.regexp property to conf/nutch-site.xml. It contains a list of replacement instructions, one per line, each keyed by field name, e.g.
       fieldname=/regexp/replacement/[flags]
     
       <property>
        <name>index.replace.regexp</name>
        <value>
          hostmatch=.*\\.com
          title=/search/replace/2
        </value>
       </property>
     
    hostmatch= and urlmatch= lines set the match pattern for a host or URL. The field replacements that follow such a line apply only to pages from the matching host or URL. Replacements run in the order specified. A field name may appear multiple times if multiple replacements are needed. The property format is defined in greater detail in conf/nutch-default.xml.
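    The fieldname=/regexp/replacement/[flags] format described above can be illustrated with a minimal, standalone sketch. It is not the plugin's actual code: the class ReplaceRuleSketch and its methods are hypothetical, and it assumes that the first character after the equals sign is the delimiter and that flags is an integer passed straight to java.util.regex.Pattern.compile (2 is Pattern.CASE_INSENSITIVE). hostmatch= and urlmatch= scoping is not covered here.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of one replacement rule; not part of Nutch. */
public class ReplaceRuleSketch {
    final String field;
    final Pattern pattern;
    final String replacement;

    ReplaceRuleSketch(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    /** Parses a line such as "title=/search/replace/2". */
    static ReplaceRuleSketch parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq < 1 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Expected fieldname=/regexp/replacement/[flags]: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        String spec = trimmed.substring(eq + 1);

        // Assumption: the first character after '=' is the delimiter.
        String delimiter = Pattern.quote(spec.substring(0, 1));
        // A limit of -1 keeps a trailing empty flags element.
        String[] parts = spec.substring(1).split(delimiter, -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement in: " + line);
        }

        // Assumption: flags is an integer bit mask of java.util.regex.Pattern constants.
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        return new ReplaceRuleSketch(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    /** Replaces every match of the pattern in the given field value. */
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplaceRuleSketch rule = parse("title=/search/replace/2");
        // Flag 2 makes the match case-insensitive.
        System.out.println(rule.field + ": " + rule.apply("Search the SEARCH index"));
    }
}
```

    With the flag value 2 from the example property, both "Search" and "SEARCH" match the pattern search. A real deployment should rely on the format as documented in conf/nutch-default.xml rather than on this sketch.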
    Author:
    Peter Ciuffetti
    See Also:
    NUTCH-2058
    • Constructor Detail

      • ReplaceIndexer

        public ReplaceIndexer()
    • Method Detail

      • filter

        public NutchDocument filter(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Description copied from interface: IndexingFilter
        Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - document instance for collecting fields
        parse - parse data instance
        url - page url
        datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
        inlinks - page inlinks
        Returns:
        modified (or a new) document instance, or null (meaning the document should be discarded)
        Throws:
        IndexingException - if an error occurs during filtering