Class RegexURLNormalizer

  • All Implemented Interfaces:
    Configurable, URLNormalizer

    public class RegexURLNormalizer
    extends Configured
    implements URLNormalizer
    Allows users to apply regular-expression substitutions to any URL that is encountered, which is useful for tasks such as stripping session IDs from URLs.

    This class uses the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
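
    The pattern/substitution idea described above can be sketched in plain Java. This is a minimal illustration built only on java.util.regex, not the Nutch implementation: the class name RegexRuleSketch, its methods, the three example rules, and the example.com URLs are all hypothetical, and a real deployment would load its rules from the configured file rather than hard-code them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is NOT the Nutch implementation. */
final class RegexRuleSketch {

    /** One pattern/substitution pair, as a rules file might declare it. */
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            // Pattern.compile throws PatternSyntaxException on a malformed regex.
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Appends a rule; rules are applied in the order they were added. */
    RegexRuleSketch addRule(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
        return this;
    }

    /** Applies every rule in turn, feeding each result into the next rule. */
    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    /** Hypothetical rule set that strips two common session-ID forms. */
    static RegexRuleSketch sessionIdRules() {
        return new RegexRuleSketch()
            // Drop a ";jsessionid=..." path parameter.
            .addRule("(?i);jsessionid=[^?#]*", "")
            // Drop a "PHPSESSID=..." query parameter but keep its leading "?" or "&".
            .addRule("(?i)([?&])PHPSESSID=[^&#]*&?", "$1")
            // Remove a "?" or "&" left dangling at the end of the URL.
            .addRule("[?&]$", "");
    }
}
```

    Calling RegexRuleSketch.sessionIdRules().normalize(url) runs the rules in sequence, so a later rule can tidy up what an earlier one leaves behind, as the final rule does for a trailing separator.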

    This class also supports different rules depending on the scope. Please see the javadoc in URLNormalizers for more details.

    Author:
    Luke Baker, Andrzej Bialecki
    • Constructor Detail

      • RegexURLNormalizer

        public RegexURLNormalizer()
        The default constructor, which is called from UrlNormalizerFactory.getNormalizer() via normalizerClass.newInstance().
      • RegexURLNormalizer

        public RegexURLNormalizer(Configuration conf)
      • RegexURLNormalizer

        public RegexURLNormalizer(Configuration conf,
                                  String filename)
                           throws IOException,
                                  PatternSyntaxException
        Constructor that is passed the configuration file name directly, so that it does not look the name up in other configuration files.
        Parameters:
        conf - A populated Configuration
        filename - A specific configuration file
        Throws:
        IOException - if there is an error locating the specified input file
        PatternSyntaxException - if there is an error while interpreting rule patterns