Package org.apache.nutch.collection
Class Subcollection
- java.lang.Object
-
- org.apache.hadoop.conf.Configured
-
- org.apache.nutch.collection.Subcollection
-
- All Implemented Interfaces:
Configurable, URLFilter, Pluggable
public class Subcollection extends Configured implements URLFilter
A Subcollection represents a subset of the index. You define URL patterns that indicate whether a particular page (URL) is part of the Subcollection.
-
-
Field Summary
Fields (Modifier and Type, Field)
static String TAG_BLACKLIST
static String TAG_COLLECTION
static String TAG_COLLECTIONS
static String TAG_ID
static String TAG_KEY
static String TAG_NAME
static String TAG_WHITELIST
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
-
Constructor Summary
Constructors (Constructor, Description)
Subcollection(String id, String name, String key, Configuration conf)
Public constructor
Subcollection(String id, String name, Configuration conf)
Public constructor
Subcollection(Configuration conf)
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description)
String filter(String urlString)
Simple "indexOf" filter for matching patterns.
String getBlackListString()
Returns blacklist String
String getId()
String getKey()
String getName()
List<String> getWhiteList()
Returns whitelist
String getWhiteListString()
Returns whitelist String
void initialize(Element collection)
Initialize Subcollection from a DOM element
protected void parseList(List<String> list, String text)
Create a list of patterns from a chunk of text; patterns are separated by newlines
void setBlackList(String list)
Set contents of blacklist from String
void setWhiteList(String list)
Set contents of whitelist from String
void setWhiteList(ArrayList<String> whiteList)
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
-
-
-
-
Field Detail
-
TAG_COLLECTIONS
public static final String TAG_COLLECTIONS
- See Also:
- Constant Field Values
-
TAG_COLLECTION
public static final String TAG_COLLECTION
- See Also:
- Constant Field Values
-
TAG_WHITELIST
public static final String TAG_WHITELIST
- See Also:
- Constant Field Values
-
TAG_BLACKLIST
public static final String TAG_BLACKLIST
- See Also:
- Constant Field Values
-
TAG_NAME
public static final String TAG_NAME
- See Also:
- Constant Field Values
-
TAG_KEY
public static final String TAG_KEY
- See Also:
- Constant Field Values
-
TAG_ID
public static final String TAG_ID
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
Subcollection
public Subcollection(String id, String name, Configuration conf)
Public constructor
- Parameters:
id - Id of Subcollection
name - Name of Subcollection
conf - A populated Configuration
-
Subcollection
public Subcollection(String id, String name, String key, Configuration conf)
Public constructor
- Parameters:
id - Id of Subcollection
name - Name of Subcollection
key - Subcollection key
conf - A populated Configuration
-
Subcollection
public Subcollection(Configuration conf)
-
-
Method Detail
-
getName
public String getName()
- Returns:
- Returns the name
-
getKey
public String getKey()
- Returns:
- Returns the key
-
getId
public String getId()
- Returns:
- Returns the id
-
getWhiteListString
public String getWhiteListString()
Returns whitelist String
- Returns:
- Whitelist String
-
getBlackListString
public String getBlackListString()
Returns blacklist String
- Returns:
- Blacklist String
-
setWhiteList
public void setWhiteList(ArrayList<String> whiteList)
- Parameters:
whiteList
- The whiteList to set.
-
filter
public String filter(String urlString)
Simple "indexOf" filter for matching patterns. The rules are evaluated in this order:
1. If a blacklist pattern matches, the URL is rejected.
2. If a whitelist pattern matches, the URL is allowed.
3. Otherwise, the URL is rejected.
- Specified by:
filter in interface URLFilter
- Parameters:
urlString - the URL string the filter is applied on
- Returns:
- the original URL string if the URL is accepted by the filter, or null if the URL is rejected
- See Also:
URLFilter.filter(java.lang.String)
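The evaluation order described for filter can be illustrated with a minimal, standalone sketch. It does not use the Nutch classes: SubcollectionFilterSketch and its constructor are hypothetical names chosen for illustration, and the real implementation may differ in details such as how empty patterns are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the Nutch Subcollection class.
final class SubcollectionFilterSketch {
    private final List<String> whiteList = new ArrayList<>();
    private final List<String> blackList = new ArrayList<>();

    SubcollectionFilterSketch(List<String> whitePatterns, List<String> blackPatterns) {
        whiteList.addAll(whitePatterns);
        blackList.addAll(blackPatterns);
    }

    // Returns the URL unchanged if accepted, or null if rejected.
    String filter(String urlString) {
        // Rule 1: any blacklist pattern found in the URL rejects it.
        for (String pattern : blackList) {
            if (urlString.indexOf(pattern) != -1) {
                return null;
            }
        }
        // Rule 2: any whitelist pattern found in the URL accepts it.
        for (String pattern : whiteList) {
            if (urlString.indexOf(pattern) != -1) {
                return urlString;
            }
        }
        // Rule 3: no pattern matched, so the URL is rejected.
        return null;
    }
}
```

Because the blacklist is checked first, a URL that matches both lists is rejected in this sketch.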
-
initialize
public void initialize(Element collection)
Initialize Subcollection from a DOM element
- Parameters:
collection - A DOM Element for use in creating the Subcollection
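As a rough illustration of what reading values from a DOM Element involves, the sketch below parses a small XML fragment with the standard JDK parser and extracts child element text. The class SubcollectionDomSketch, its helper methods, and the element names used in the usage note ("subcollection", "name", "whitelist", "blacklist") are assumptions made for this example; the actual tag names are the values of the TAG_* constants listed under Constant Field Values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
final class SubcollectionDomSketch {

    // Parses an XML string and returns its root element.
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first descendant with the given tag,
    // or null if no such element exists.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return null;
        }
        return nodes.item(0).getTextContent().trim();
    }
}
```

For example, with a fragment such as `<subcollection><name>docs</name><whitelist>http://example.org/docs/</whitelist></subcollection>` (assumed element names), `childText(root, "whitelist")` yields the pattern text that could then be passed to setWhiteList(String).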
-
parseList
protected void parseList(List<String> list, String text)
Create a list of patterns from a chunk of text; patterns are separated by newlines
- Parameters:
list - An initialized List into which the String patterns are inserted.
text - A chunk of text (hopefully) containing patterns.
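The newline-separated pattern format can be sketched as follows. This is a standalone approximation, not the Nutch source: PatternListSketch is a hypothetical name, and clearing the list first, trimming each line, and skipping blank lines are assumptions about reasonable behavior.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
final class PatternListSketch {

    // Fills the given list with one pattern per non-blank line of text.
    static void parseList(List<String> list, String text) {
        list.clear(); // assumption: earlier patterns are replaced, not appended to
        if (text == null) {
            return;
        }
        // \R matches any line break sequence (\n, \r\n, \r).
        for (String line : text.split("\\R")) {
            String pattern = line.trim();
            if (!pattern.isEmpty()) {
                list.add(pattern);
            }
        }
    }
}
```

Skipping blank lines matters with substring matching, since an empty pattern would be found in every URL.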
-
setBlackList
public void setBlackList(String list)
Set contents of blacklist from String
- Parameters:
list - the blacklist contents
-
setWhiteList
public void setWhiteList(String list)
Set contents of whitelist from String
- Parameters:
list - the whitelist contents
-
-