Class EncodingDetector


  • public class EncodingDetector
    extends Object
    A simple class for detecting character encodings.

    Broadly, it covers two distinct functions:

    1. Auto-detecting a set of "clues" from input text.
    2. Taking a set of clues and making a "best guess" as to the "real" encoding.

    A caller often has extra information about the likely encoding, for example from the HTTP header or HTML meta tags. Such hints are often wrong but still useful as clues, and the kinds of clues available differ from caller to caller. A typical calling sequence is therefore:

    • Run step (1) to generate a set of auto-detected clues;
    • Combine these clues with the caller-dependent "extra clues" available;
    • Run step (2) to guess what the most probable answer is.
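
    The calling sequence above can be illustrated with a small self-contained model. This is a sketch only: ClueWorkflowSketch, its Clue type, the confidence threshold, and the "highest confidence wins" rule are all assumptions made for illustration and are not taken from EncodingDetector, whose selection logic is not described on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for the clue workflow (hypothetical; not the Nutch class). */
final class ClueWorkflowSketch {

    /** One encoding hint: charset name, where it came from, and a confidence score. */
    static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();

    /** Records a clue; null or empty values are ignored. */
    void addClue(String value, String source, int confidence) {
        if (value == null || value.isEmpty()) {
            return;
        }
        clues.add(new Clue(value, source, confidence));
    }

    /**
     * Assumed rule: the clue with the highest confidence at or above
     * minConfidence wins; if none qualifies, defaultValue is returned as-is.
     */
    String guess(int minConfidence, String defaultValue) {
        Clue best = null;
        for (Clue c : clues) {
            if (c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        return best == null ? defaultValue : best.value;
    }

    public static void main(String[] args) {
        ClueWorkflowSketch detector = new ClueWorkflowSketch();
        // Step 1: pretend auto-detection produced a weak clue.
        detector.addClue("windows-1252", "detect", 40);
        // Step 2: the caller adds its own clue, e.g. from an HTTP header.
        detector.addClue("UTF-8", "header", 80);
        // Step 3: pick the most probable answer, with a fallback.
        System.out.println(detector.guess(50, "ISO-8859-1"));
    }
}
```

    In this model the fallback value is returned unchanged when no clue qualifies, mirroring the note on guessEncoding that defaultValue is not normalized.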
    • Constructor Detail

      • EncodingDetector

        public EncodingDetector(Configuration conf)
    • Method Detail

      • autoDetectClues

        public void autoDetectClues(Content content,
                                    boolean filter)
      • addClue

        public void addClue(String value,
                            String source,
                            int confidence)
      • addClue

        public void addClue(String value,
                            String source)
      • guessEncoding

        public String guessEncoding(Content content,
                                    String defaultValue)
        Guesses the encoding from the previously specified list of clues.
        Parameters:
        content - Content instance
        defaultValue - Default encoding to return if no encoding can be detected with enough confidence. Note that this value is returned as given; it is not normalized with resolveEncodingAlias(String).
        Returns:
        Guessed encoding or defaultValue
      • clearClues

        public void clearClues()
        Clears all clues.
      • resolveEncodingAlias

        public static String resolveEncodingAlias(String encoding)
      • parseCharacterEncoding

        public static String parseCharacterEncoding(String contentType)
        Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned.
        This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").
        Parameters:
        contentType - a content type header
        Returns:
        a trimmed string representation of the 'charset=' value, or null if it is not available
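
        The documented behaviour (null input or no explicit charset gives null, otherwise the trimmed 'charset=' value) can be sketched in a standalone form. This is an illustrative re-implementation, not the Nutch or Catalina source: the class name CharsetParamSketch, the case-sensitive match on "charset=", and the stripping of surrounding double quotes are assumptions.

```java
/** Standalone sketch of charset extraction (hypothetical; not the library code). */
final class CharsetParamSketch {

    /** Returns the trimmed charset value, or null if none is declared. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        String marker = "charset=";
        // Assumption: case-sensitive search for the parameter name.
        int start = contentType.indexOf(marker);
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + marker.length());
        // Drop any further parameters after the charset value.
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Assumption: strip one pair of surrounding double quotes.
        if (value.length() > 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1).trim();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseCharset("text/html; charset=UTF-8"));
        System.out.println(parseCharset("text/html"));
    }
}
```

        A value extracted this way is a natural candidate for a caller-supplied clue, for example one passed to addClue with a source label such as "header".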