Class LuceneTokenizer

java.lang.Object
  org.apache.nutch.scoring.similarity.util.LuceneTokenizer

public class LuceneTokenizer extends Object
Nested Class Summary

Modifier and Type    Class and Description
static class         LuceneTokenizer.TokenizerType
Constructor Summary

LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, boolean useStopFilter, LuceneAnalyzerUtil.StemFilterType stemFilterType)
  Creates a tokenizer based on param values.

LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, List<String> stopWords, boolean addToDefault, LuceneAnalyzerUtil.StemFilterType stemFilterType)
  Creates a tokenizer based on param values.

LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, LuceneAnalyzerUtil.StemFilterType stemFilterType, int mingram, int maxgram)
  Creates a tokenizer for the ngram model based on param values.
Method Summary

org.apache.lucene.analysis.TokenStream  getTokenStream()
  Gets the TokenStream created by the tokenizer.
-
Constructor Detail
-
LuceneTokenizer
public LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, boolean useStopFilter, LuceneAnalyzerUtil.StemFilterType stemFilterType)
Creates a tokenizer based on param values.

Parameters:
content - the text to tokenize
tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
useStopFilter - if set to true, the token stream will be filtered using the default Lucene stop set
stemFilterType - the preferred LuceneAnalyzerUtil.StemFilterType to use; one of LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER, LuceneAnalyzerUtil.StemFilterType.ENGLISHMINIMALSTEM_FILTER, or LuceneAnalyzerUtil.StemFilterType.NONE
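A minimal usage sketch of this constructor (the sample text and the printing loop are illustrative; consuming a Lucene TokenStream always follows the reset/incrementToken/close contract):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class TokenizerExample {
    public static void main(String[] args) throws Exception {
        // DEFAULT tokenizer, default Lucene stop set, Porter stemming
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "The quick brown foxes jumped over the lazy dogs",
                LuceneTokenizer.TokenizerType.DEFAULT,
                true, // filter tokens against the default Lucene stop set
                LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER);

        TokenStream stream = tokenizer.getTokenStream();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString());
        }
        stream.close();
    }
}
```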
-
LuceneTokenizer
public LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, List<String> stopWords, boolean addToDefault, LuceneAnalyzerUtil.StemFilterType stemFilterType)
Creates a tokenizer based on param values.

Parameters:
content - the text to tokenize
tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
stopWords - a set of user-defined stop words
addToDefault - if set to true, the user-provided stop words will be added to the default Lucene stop set; if false, only the user-provided words will be used as the stop set
stemFilterType - the preferred LuceneAnalyzerUtil.StemFilterType to use; one of LuceneAnalyzerUtil.StemFilterType.PORTERSTEM_FILTER, LuceneAnalyzerUtil.StemFilterType.ENGLISHMINIMALSTEM_FILTER, or LuceneAnalyzerUtil.StemFilterType.NONE
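A hedged sketch of the custom-stop-word variant (the sample words are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class CustomStopWordsExample {
    public static void main(String[] args) throws Exception {
        List<String> stopWords = Arrays.asList("quick", "brown");
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "The quick brown fox",
                LuceneTokenizer.TokenizerType.CLASSIC,
                stopWords,
                true, // addToDefault: also keep the default Lucene stop set
                LuceneAnalyzerUtil.StemFilterType.NONE);
        TokenStream stream = tokenizer.getTokenStream();
        // consume the stream as usual: reset(), incrementToken(), close()
    }
}
```

Passing false for addToDefault would replace, rather than extend, the default stop set.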
-
LuceneTokenizer
public LuceneTokenizer(String content, LuceneTokenizer.TokenizerType tokenizer, LuceneAnalyzerUtil.StemFilterType stemFilterType, int mingram, int maxgram)
Creates a tokenizer for the ngram model based on param values.

Parameters:
content - the text to tokenize
tokenizer - the type of tokenizer to use, CLASSIC or DEFAULT
stemFilterType - the type of stemming to perform
mingram - value of mingram for tokenizing
maxgram - value of maxgram for tokenizing
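A sketch of the ngram constructor (the mingram/maxgram values are illustrative; the doc does not state whether grams are word- or character-based, so the comment hedges on that):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.nutch.scoring.similarity.util.LuceneAnalyzerUtil;
import org.apache.nutch.scoring.similarity.util.LuceneTokenizer;

public class NgramExample {
    public static void main(String[] args) throws Exception {
        // Emit grams of size 2 to 3 (word or character grams,
        // depending on the underlying filter), no stemming
        LuceneTokenizer tokenizer = new LuceneTokenizer(
                "tokenize this text",
                LuceneTokenizer.TokenizerType.DEFAULT,
                LuceneAnalyzerUtil.StemFilterType.NONE,
                2, 3);
        TokenStream stream = tokenizer.getTokenStream();
        // consume the stream as usual: reset(), incrementToken(), close()
    }
}
```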
-
-