Package org.creativecommons.nutch
Class CCIndexingFilter
- java.lang.Object
  - org.creativecommons.nutch.CCIndexingFilter
All Implemented Interfaces: Configurable, IndexingFilter, Pluggable
public class CCIndexingFilter extends Object implements IndexingFilter
Adds basic searchable fields to a document.
Field Summary
Fields
- static String FIELD: The name of the document field we use.
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
- X_POINT_ID
Constructor Summary
Constructors
- CCIndexingFilter()
Method Summary
Methods
- void addUrlFeatures(NutchDocument doc, String urlString): Add the features represented by a license URL.
- NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks): Adds fields or otherwise modifies the document that will be indexed for a parse.
- Configuration getConf()
- void setConf(Configuration conf)
Field Detail
FIELD
public static String FIELD
The name of the document field we use.
Method Detail
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by: filter in interface IndexingFilter
- Parameters:
  - doc - document instance for collecting fields
  - parse - parse data instance
  - url - page URL
  - datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
  - inlinks - page inlinks
- Returns: modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws: IndexingException - if an error occurs during filtering
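The return contract described for filter (return the document to keep it, return null to discard it) can be sketched without Nutch on the classpath. The types below are hypothetical stand-ins rather than the Nutch API: a document is modeled as a map of field names to values, and SketchFilter, runFilters, ADD_HOST and DROP_TMP are names invented for this illustration. How Nutch itself chains its indexing filters is not shown on this page, so the loop is only an assumption about one reasonable way to honor the contract.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT the Nutch API: a document is a map of field
// names to values, and a filter receives the document plus the page URL.
final class FilterContractSketch {

    interface SketchFilter {
        // Returns the (possibly modified) document, or null to discard it.
        Map<String, List<String>> filter(Map<String, List<String>> doc, String url);
    }

    // Applies the filters in order and stops as soon as one returns null.
    static Map<String, List<String>> runFilters(List<SketchFilter> filters,
                                                Map<String, List<String>> doc,
                                                String url) {
        for (SketchFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // document was discarded; skip remaining filters
            }
        }
        return doc;
    }

    // Example filter: adds a "host" field taken from the URL.
    static final SketchFilter ADD_HOST = (doc, url) -> {
        doc.computeIfAbsent("host", k -> new ArrayList<>())
           .add(URI.create(url).getHost());
        return doc;
    };

    // Example filter: discards documents whose URL ends in ".tmp".
    static final SketchFilter DROP_TMP = (doc, url) -> url.endsWith(".tmp") ? null : doc;
}
```

A caller would treat a null result from runFilters as "do not index this page" and index any non-null result.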
-
addUrlFeatures
public void addUrlFeatures(NutchDocument doc, String urlString)
Add the features represented by a license URL. URLs are of the form "http://creativecommons.org/licenses/xx-xx/xx/xx", where "xx" names a license feature.
- Parameters:
  - doc - a NutchDocument to augment
  - urlString - the URL to extract features from
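To make the URL form above concrete, here is a minimal, self-contained sketch of pulling the "xx" feature names out of a license URL. It assumes the path is split on both "/" and "-" and that the leading "licenses" segment is skipped; the class and method names (LicenseUrlFeaturesSketch, extractFeatures) are hypothetical, and the actual addUrlFeatures adds features to a NutchDocument rather than returning a list, so its details may differ.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of CCIndexingFilter: returns the feature
// names found in a Creative Commons license URL.
final class LicenseUrlFeaturesSketch {

    static List<String> extractFeatures(String urlString) {
        List<String> features = new ArrayList<>();
        String path;
        try {
            path = URI.create(urlString).getPath();
        } catch (IllegalArgumentException e) {
            return features; // malformed URL: no features (assumed behavior)
        }
        if (path == null) {
            return features; // opaque URI without a path
        }
        // Assumption: features are separated by slashes and dashes.
        StringTokenizer names = new StringTokenizer(path, "/-");
        if (names.hasMoreTokens()) {
            names.nextToken(); // skip the leading "licenses" segment
        }
        while (names.hasMoreTokens()) {
            features.add(names.nextToken());
        }
        return features;
    }
}
```

Under these assumptions, "http://creativecommons.org/licenses/by-nc/2.0/" yields the features by, nc and 2.0, each of which would then be added to the document as a value of the indexed field.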
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable