Class RegexURLNormalizer
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
- All Implemented Interfaces:
Configurable, URLNormalizer
public class RegexURLNormalizer extends Configured implements URLNormalizer
Allows users to apply regex substitutions to any or all URLs that are encountered, which is useful for stripping session IDs from URLs.
This class uses the urlnormalizer.regex.file property. It should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
This class also supports different rules depending on the scope. See the Javadoc in URLNormalizers for more details.
- Author:
- Luke Baker, Andrzej Bialecki
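To make the role of the rules file more concrete, here is a minimal sketch of reading pattern/substitution pairs from XML. It is not the Nutch implementation. It assumes the file consists of <regex> elements that each hold a <pattern> and a <substitution>; the class name RuleFileSketch, the sample rules, and the parsing code are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative sketch only; not the Nutch implementation. */
final class RuleFileSketch {

  /** One pattern/substitution pair read from the rules file. */
  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(Pattern pattern, String substitution) {
      this.pattern = pattern;
      this.substitution = substitution;
    }
  }

  // Assumed file layout: <regex> elements, each with a <pattern> and a <substitution>.
  static final String SAMPLE_RULES =
      "<regex-normalize>"
          + "<regex><pattern>;jsessionid=[0-9A-Za-z]+</pattern>"
          + "<substitution></substitution></regex>"
          + "<regex><pattern>&amp;amp;</pattern>"
          + "<substitution>&amp;</substitution></regex>"
          + "</regex-normalize>";

  /** Parses the XML text into an ordered list of compiled rules. */
  static List<Rule> parse(String xml) {
    try {
      Element root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
      NodeList regexes = root.getElementsByTagName("regex");
      List<Rule> rules = new ArrayList<>();
      for (int i = 0; i < regexes.getLength(); i++) {
        Element regex = (Element) regexes.item(i);
        String pattern =
            regex.getElementsByTagName("pattern").item(0).getTextContent();
        String substitution =
            regex.getElementsByTagName("substitution").item(0).getTextContent();
        // Pattern.compile throws PatternSyntaxException on an invalid pattern.
        rules.add(new Rule(Pattern.compile(pattern), substitution));
      }
      return rules;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse rules", e);
    }
  }

  public static void main(String[] args) {
    // Print each pattern and its substitution, one rule per line.
    for (Rule rule : parse(SAMPLE_RULES)) {
      System.out.println(rule.pattern.pattern() + " -> '" + rule.substitution + "'");
    }
  }
}
```

Compiling the patterns once at load time means a malformed pattern surfaces when the file is read rather than when the first URL is processed.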
-
-
Field Summary
-
Fields inherited from interface org.apache.nutch.net.URLNormalizer
X_POINT_ID
-
-
Constructor Summary
Constructor, Description
RegexURLNormalizer()
The default constructor which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in method: getNormalizer()
RegexURLNormalizer(Configuration conf)
RegexURLNormalizer(Configuration conf, String filename)
Constructor which can be passed the configuration file name, so it doesn't look in other configuration files for it.
-
Method Summary
Modifier and Type, Method, Description
HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>>
getScopedRules()
static void
main(String[] args)
Spits out patterns and substitutions that are in the configuration file.
String
normalize(String urlString, String scope)
String
regexNormalize(String urlString, String scope)
This function does the replacements by iterating through all the regex patterns.
void
setConf(Configuration conf)
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
-
-
-
-
Constructor Detail
-
RegexURLNormalizer
public RegexURLNormalizer()
The default constructor which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in method: getNormalizer()
-
RegexURLNormalizer
public RegexURLNormalizer(Configuration conf)
-
RegexURLNormalizer
public RegexURLNormalizer(Configuration conf, String filename) throws IOException, PatternSyntaxException
Constructor which can be passed the configuration file name, so it doesn't look in other configuration files for it.
- Parameters:
conf
- A populated Configuration
filename
- A specific configuration file
- Throws:
IOException
- If there is an error locating the specified input file.
PatternSyntaxException
- If there is an error whilst interpreting rule patterns.
-
-
Method Detail
-
getScopedRules
public HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
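The return type suggests rules are grouped by scope name. The following is a minimal sketch of such a lookup, not the Nutch implementation. The class ScopedRulesSketch, the scope names "default" and "fetcher", and the fallback to default rules when a scope has none of its own are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Nutch implementation. */
final class ScopedRulesSketch {

  // Assumed name of the scope used when no scope-specific rules exist.
  static final String DEFAULT_SCOPE = "default";

  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  private final Map<String, List<Rule>> scopedRules = new HashMap<>();

  void addRules(String scope, List<Rule> rules) {
    scopedRules.put(scope, rules);
  }

  /** Returns the rules for a scope, falling back to the default scope (assumed behavior). */
  List<Rule> rulesFor(String scope) {
    List<Rule> rules = scopedRules.get(scope);
    if (rules == null) {
      rules = scopedRules.getOrDefault(DEFAULT_SCOPE, Collections.emptyList());
    }
    return rules;
  }

  /** Applies the rules selected for the given scope, in order. */
  String normalize(String urlString, String scope) {
    String result = urlString;
    for (Rule rule : rulesFor(scope)) {
      result = rule.pattern.matcher(result).replaceAll(rule.substitution);
    }
    return result;
  }

  /** Builds a small demo instance with hypothetical rules. */
  static ScopedRulesSketch demo() {
    ScopedRulesSketch sketch = new ScopedRulesSketch();
    // Default scope: drop the fragment.
    sketch.addRules(DEFAULT_SCOPE, Arrays.asList(new Rule("#.*$", "")));
    // Hypothetical "fetcher" scope: drop a session ID instead.
    sketch.addRules("fetcher", Arrays.asList(new Rule(";jsessionid=\\w+", "")));
    return sketch;
  }

  public static void main(String[] args) {
    ScopedRulesSketch sketch = demo();
    String url = "http://example.com/a;jsessionid=x1#top";
    System.out.println(sketch.normalize(url, "fetcher"));
    System.out.println(sketch.normalize(url, "inject"));
  }
}
```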
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
- Overrides:
setConf
in class Configured
-
regexNormalize
public String regexNormalize(String urlString, String scope)
This function does the replacements by iterating through all the regex patterns. It accepts a URL string as input and returns the altered string.
- Parameters:
urlString
- A URL string to process
scope
- The identifier for a specific scoped rule
- Returns:
- The altered string
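The iteration described above can be sketched with java.util.regex alone. This is a hedged illustration, not the Nutch implementation: the class RegexNormalizeSketch, its Rule type, and the two demo rules are hypothetical, and the sketch assumes each rule is applied to the output of the previous one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Nutch implementation. */
final class RegexNormalizeSketch {

  static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  // Hypothetical rules: remove a PHP session ID, then trim a dangling '?' or '&'.
  static final List<Rule> DEMO_RULES = Arrays.asList(
      new Rule("(\\?|&)PHPSESSID=[0-9a-zA-Z]+", "$1"),
      new Rule("[?&]$", ""));

  /** Applies every rule in list order; each rule sees the previous rule's output. */
  static String regexNormalize(String urlString, List<Rule> rules) {
    String result = urlString;
    for (Rule rule : rules) {
      result = rule.pattern.matcher(result).replaceAll(rule.substitution);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(
        regexNormalize("http://example.com/page?PHPSESSID=abc123", DEMO_RULES));
  }
}
```

Because each rule runs on the result of the one before it, rule order matters: the second demo rule only has something to trim after the first has removed the session ID.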
-
normalize
public String normalize(String urlString, String scope) throws MalformedURLException
- Specified by:
normalize
in interface URLNormalizer
- Throws:
MalformedURLException
-
main
public static void main(String[] args) throws PatternSyntaxException, IOException
Spits out patterns and substitutions that are in the configuration file.
- Parameters:
args
- Accepts one argument, which is a scope
- Throws:
IOException
- Can be thrown by normalize(String, String)
PatternSyntaxException
- If there is an error with the provided scope rule pattern.
-
-