Package org.apache.nutch.util
Class EncodingDetector
- java.lang.Object
  - org.apache.nutch.util.EncodingDetector
public class EncodingDetector extends Object
A simple class for detecting character encodings. Broadly, this encompasses two distinct functions:
- (1) Auto-detecting a set of "clues" from input text.
- (2) Taking a set of clues and making a "best guess" as to the "real" encoding.
A caller will often have some extra information about what the encoding might be (e.g. from the HTTP header or HTML meta-tags, often wrong but still potentially useful clues). The types of clues may differ from caller to caller. Thus a typical calling sequence is:
- Run step (1) to generate a set of auto-detected clues;
- Combine these clues with the caller-dependent "extra clues" available;
- Run step (2) to guess what the most probable answer is.
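The calling sequence above can be illustrated with a small, self-contained toy model. This is only a sketch: the class `ClueFlowSketch`, the 0-100 confidence scale, and the "highest confidence above a minimum wins" rule are hypothetical and are not taken from Nutch. In the real class, the three steps presumably correspond to `autoDetectClues`, `addClue`, and `guessEncoding`.

```java
import java.util.ArrayList;
import java.util.List;

public class ClueFlowSketch {

    /** A single hint: charset name, where it came from, and a trust score (scale is an assumption). */
    static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();
    private final int minConfidence;

    public ClueFlowSketch(int minConfidence) {
        this.minConfidence = minConfidence;
    }

    /** Records a clue; null or empty charset names are ignored. */
    public void addClue(String value, String source, int confidence) {
        if (value == null || value.isEmpty()) {
            return;
        }
        clues.add(new Clue(value, source, confidence));
    }

    /** Returns the most confident clue at or above the minimum, else the default. */
    public String guess(String defaultValue) {
        Clue best = null;
        for (Clue c : clues) {
            if (c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        return best == null ? defaultValue : best.value;
    }

    public static void main(String[] args) {
        ClueFlowSketch sketch = new ClueFlowSketch(50);
        // Step 1: clues a detector might derive from the raw bytes (values are made up).
        sketch.addClue("windows-1252", "detect", 40);
        sketch.addClue("utf-8", "detect", 80);
        // Step 2: a caller-supplied extra clue, e.g. from an HTTP header.
        sketch.addClue("iso-8859-1", "header", 60);
        // Step 3: make the best guess, falling back to a default.
        System.out.println(sketch.guess("windows-1252")); // prints utf-8
    }
}
```

Keeping the source of each clue alongside its value lets a caller weigh, say, a header-supplied charset differently from an auto-detected one, even though this toy rule only looks at the confidence.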
-
-
Field Summary
- static String MIN_CONFIDENCE_KEY
- static int NO_THRESHOLD
-
Constructor Summary
- EncodingDetector(Configuration conf)
-
Method Summary
- void addClue(String value, String source)
- void addClue(String value, String source, int confidence)
- void autoDetectClues(Content content, boolean filter)
- void clearClues()
  Clears all clues.
- String guessEncoding(Content content, String defaultValue)
  Guess the encoding with the previously specified list of clues.
- static void main(String[] args)
- static String parseCharacterEncoding(String contentType)
  Parse the character encoding from the specified content type header.
- static String resolveEncodingAlias(String encoding)
-
-
-
Field Detail
-
NO_THRESHOLD
public static final int NO_THRESHOLD
- See Also:
- Constant Field Values
-
MIN_CONFIDENCE_KEY
public static final String MIN_CONFIDENCE_KEY
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
EncodingDetector
public EncodingDetector(Configuration conf)
-
-
Method Detail
-
autoDetectClues
public void autoDetectClues(Content content, boolean filter)
-
guessEncoding
public String guessEncoding(Content content, String defaultValue)
Guess the encoding with the previously specified list of clues.
- Parameters:
content - Content instance
defaultValue - Default encoding to return if no encoding can be detected with enough confidence. Note that this will not be normalized with resolveEncodingAlias(java.lang.String)
- Returns:
- Guessed encoding or defaultValue
-
clearClues
public void clearClues()
Clears all clues.
-
parseCharacterEncoding
public static String parseCharacterEncoding(String contentType)
Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned.
This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").
- Parameters:
contentType - a content type header
- Returns:
- a trimmed string representation of the 'charset=' value, null if this is not available
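The documented behavior (null input or no explicit charset yields null, otherwise the trimmed 'charset=' value) can be sketched with the standard library alone. The class `CharsetParamSketch` and method `parseCharset` below are hypothetical names, and this is not the Nutch or Catalina source. As a simplification, it matches the literal lowercase `charset=` only and strips a single pair of surrounding double quotes.

```java
public class CharsetParamSketch {

    /** Returns the trimmed charset= value of a Content-Type header, or null if absent. */
    public static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = contentType.indexOf("charset=");
        if (start < 0) {
            return null; // no explicit character encoding
        }
        String encoding = contentType.substring(start + "charset=".length());
        // Cut off any further parameters after the charset value.
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);
        }
        encoding = encoding.trim();
        // Strip one pair of surrounding double quotes, e.g. charset="utf-8".
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1).trim();
        }
        return encoding;
    }

    public static void main(String[] args) {
        System.out.println(parseCharset("text/html; charset=UTF-8"));          // prints UTF-8
        System.out.println(parseCharset("text/html; charset=\"utf-8\"; x=1")); // prints utf-8
        System.out.println(parseCharset("text/html"));                         // prints null
    }
}
```

A value obtained this way is only a clue: as the class description notes, header-declared encodings are often wrong, so it would typically be fed into the clue list and not trusted outright.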
-
main
public static void main(String[] args) throws IOException
- Throws:
IOException
-
-