com.upokecenter.text.Encodings

public final class Encodings extends Object

Contains methods for converting text from one character encoding to another. This class also contains convenience methods for converting strings and other character inputs to sequences of bytes and vice versa.

The WHATWG Encoding Standard defines algorithms for the most common character encodings used on Web pages and recommends the UTF-8 encoding for new specifications and Web pages. Calling the GetEncoding(name) method returns one of the character encodings with the given name under the Encoding Standard.

Now let's define some terms.

Encoding Terms

There are several kinds of character encodings:

Getting an Encoding

The Encoding Standard includes UTF-8, UTF-16, and many legacy encodings, and gives each one of them a name. The GetEncoding(name) method takes a name string and returns an ICharacterEncoding object that implements that encoding, or null if the name is unrecognized.

However, the Encoding Standard is designed to include only encodings commonly used on Web pages, not in other protocols such as email. For email, the Encodings class includes an alternate method, GetEncoding(name, forEmail). Setting forEmail to true applies rules modified from the Encoding Standard to better suit encoding and decoding text from email messages.

Classes for Character Encodings

This Encodings class provides access to common character encodings through classes as described below:

Custom Encodings

Classes that implement the ICharacterEncoding interface can provide additional character encodings not included in the Encoding Standard. Some examples of these include the following:

(Note that this library doesn't implement either encoding.)

Fields

Methods

Field Details

UTF8

public static final ICharacterEncoding UTF8

Character encoding object for the UTF-8 character encoding, which represents each code point in the universal coded character set using 1 to 4 bytes.
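As a quick illustration of the 1-to-4-byte range, the following sketch measures the UTF-8 length of a few sample code points. It uses only the JDK's own UTF-8 support rather than this library, and the class and method names (Utf8LengthDemo, utf8Length) are made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
  // Returns the number of bytes UTF-8 uses for a single code point.
  // Uses the JDK's UTF-8 encoder, not the Encodings.UTF8 object.
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // One sample from each length class: A, e-acute, euro sign, an emoji.
    int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
    for (int cp : samples) {
      System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
    }
  }
}
```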

Method Details

DecodeToString

public static String DecodeToString(ICharacterEncoding encoding, IByteReader input)

Reads bytes from a data source and converts the bytes from a given encoding to a text string.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, InputStream input)

Decodes data read from a data stream into a text string in the given character encoding.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, byte[] bytes)

Reads a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).
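The replacement behavior described above can be seen with a JDK-only analog. This sketch does not call this library: it relies on the fact that the JDK's String constructor also substitutes U+FFFD for malformed UTF-8 input. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementDecodeDemo {
  // Decodes bytes as UTF-8; the JDK replaces malformed input with U+FFFD,
  // which mirrors the error handling described for DecodeToString.
  static String decodeUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // 0xFF can never appear in well-formed UTF-8.
    byte[] data = {0x41, (byte) 0xFF, 0x42};
    String text = decodeUtf8(data);
    for (int i = 0; i < text.length(); i++) {
      System.out.printf("U+%04X%n", (int) text.charAt(i));
    }
  }
}
```

Only a single invalid byte is used here because decoders can differ in how many replacement characters they emit for longer malformed sequences.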

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)

Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes, offset, length). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)

Reads Unicode characters from a character input and writes them to a byte array encoded in the given character encoding. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoding). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)

Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoder. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder, boolean htmlFallback)

Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoder and fallback strategy.
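This page does not spell out what the fallback strategy does. The sketch below assumes that htmlFallback selects the Encoding Standard's "html" error mode, in which an unencodable code point is written as a decimal character reference such as &#8364;, and that passing false gives the 0x3f replacement used by the other overloads. It is built on the JDK's Charset classes, not on ICharacterEncoder, and it assumes an ASCII-compatible target encoding. All names in it are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class FallbackEncodeDemo {
  // Encodes str code point by code point. ASSUMPTION: htmlFallback == true
  // means "write &#N; for unencodable code points"; false means "write '?'".
  static byte[] encode(String str, Charset charset, boolean htmlFallback) {
    CharsetEncoder probe = charset.newEncoder();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < str.length(); ) {
      int cp = str.codePointAt(i);
      i += Character.charCount(cp);
      String piece = new String(Character.toChars(cp));
      byte[] bytes;
      if (probe.canEncode(piece)) {
        bytes = piece.getBytes(charset);
      } else if (htmlFallback) {
        // Decimal numeric character reference, written as ASCII bytes.
        bytes = ("&#" + cp + ";").getBytes(StandardCharsets.US_ASCII);
      } else {
        bytes = new byte[] {0x3f};
      }
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    Charset latin1 = StandardCharsets.ISO_8859_1;
    // The euro sign (U+20AC) has no ISO-8859-1 representation.
    System.out.println(new String(encode("a\u20ac", latin1, true), latin1));
    System.out.println(new String(encode("a\u20ac", latin1, false), latin1));
  }
}
```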

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToBytes

public static byte[] EncodeToBytes(String str, ICharacterEncoding enc)

Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToBytes

public static byte[] EncodeToBytes(String str, ICharacterEncoding enc, boolean htmlFallback)

Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding and using the given encoder fallback strategy. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)

Reads Unicode characters from a character input and writes them to a byte writer, encoded in the given character encoding. When writing to the byte writer, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)

Reads Unicode characters from a character input and writes them to a byte writer, encoded using the given character encoder. When writing to the byte writer, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

EncodeToWriter

public static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)

Converts a text string to bytes and writes the bytes to an output byte writer. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output) throws IOException

Reads Unicode characters from a character input and writes them to an output data stream, encoded in the given character encoding. When writing to the data stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output) throws IOException

Reads Unicode characters from a character input and writes them to an output data stream, encoded using the given character encoder. When writing to the data stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

EncodeToWriter

public static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output) throws IOException

Converts a text string to bytes and writes the bytes to an output data stream. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

Throws:

GetDecoderInput

public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)

Converts a character encoding into a character input stream, given a streamable source of bytes. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(stream). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

GetDecoderInput

public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)

Converts a character encoding into a character input stream, given a data stream. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(input). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

GetDecoderInputSkipBom

public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)

Converts a character encoding into a character input stream, given a streamable source of bytes. But if the input stream starts with a UTF-8 or UTF-16 byte order mark, the input is decoded as UTF-8 or UTF-16, as the case may be, rather than the given character encoding.

This method implements the "decode" algorithm specified in the Encoding standard.
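The byte-order-mark check described above can be sketched as follows. This is a simplified, JDK-only illustration that works on a whole byte array rather than a stream and does not use this library's types. The names BomSniffDemo and decodeSkipBom are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffDemo {
  // Decodes data with the fallback charset unless it starts with a
  // UTF-8 or UTF-16 byte order mark, in which case the BOM decides
  // the charset and is skipped.
  static String decodeSkipBom(byte[] data, Charset fallback) {
    Charset charset = fallback;
    int skip = 0;
    if (data.length >= 3 && (data[0] & 0xff) == 0xEF
        && (data[1] & 0xff) == 0xBB && (data[2] & 0xff) == 0xBF) {
      charset = StandardCharsets.UTF_8;
      skip = 3;
    } else if (data.length >= 2 && (data[0] & 0xff) == 0xFE
        && (data[1] & 0xff) == 0xFF) {
      charset = StandardCharsets.UTF_16BE;
      skip = 2;
    } else if (data.length >= 2 && (data[0] & 0xff) == 0xFF
        && (data[1] & 0xff) == 0xFE) {
      charset = StandardCharsets.UTF_16LE;
      skip = 2;
    }
    return new String(data, skip, data.length - skip, charset);
  }

  public static void main(String[] args) {
    byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
    // The caller asks for ISO-8859-1, but the UTF-8 BOM takes priority.
    System.out.println(decodeSkipBom(withBom, StandardCharsets.ISO_8859_1));
  }
}
```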

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(stream). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

GetDecoderInputSkipBom

public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)

Converts a character encoding into a character input stream, given a readable data stream. But if the input stream starts with a UTF-8 or UTF-16 byte order mark, the input is decoded as UTF-8 or UTF-16, as the case may be, rather than the given character encoding.

This method implements the "decode" algorithm specified in the Encoding standard.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(input). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

GetEncoding

public static ICharacterEncoding GetEncoding(String name)

Returns a character encoding from the given name.

Parameters:

Returns:

GetEncoding

public static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)

Returns a character encoding from the given name.

Parameters:

Returns:

GetEncoding

public static ICharacterEncoding GetEncoding(String name, boolean forEmail)

Returns a character encoding from the given name.

Parameters:

Returns:

InputToString

public static String InputToString(ICharacterInput reader)

Reads Unicode characters from a character input and converts them to a text string.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: reader.InputToString(). If the object's class already has an InputToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

ResolveAlias

public static String ResolveAlias(String name)

Resolves a character encoding's name to a standard form. This involves changing aliases of a character encoding to a standardized name.

In several Internet specifications, this name is known as a "charset" parameter. In HTML and HTTP, for example, the "charset" parameter indicates the encoding used to represent text in the HTML page, text file, etc.
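To make the idea of alias resolution concrete, here is a minimal sketch of a label lookup in the style of the Encoding Standard: strip ASCII whitespace, lowercase ASCII letters, then look the label up in a table. The table is a tiny excerpt of the standard's labels, and the class AliasDemo is hypothetical. The exact strings that ResolveAlias returns (including letter case and the result for unknown names) are not stated on this page, so the outputs here follow the Encoding Standard's spelling and use null for unknown labels purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class AliasDemo {
  // A small excerpt of Encoding Standard labels; the real table is far larger.
  private static final Map<String, String> LABELS = new HashMap<>();
  static {
    LABELS.put("utf-8", "UTF-8");
    LABELS.put("utf8", "UTF-8");
    LABELS.put("unicode-1-1-utf-8", "UTF-8");
    LABELS.put("windows-1252", "windows-1252");
    LABELS.put("latin1", "windows-1252");
    LABELS.put("iso-8859-1", "windows-1252");
    LABELS.put("us-ascii", "windows-1252");
    LABELS.put("ascii", "windows-1252");
  }

  private static boolean isAsciiWhitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
  }

  // Returns the standardized name, or null if the label is not in the table
  // (an illustrative choice, not necessarily what ResolveAlias does).
  static String resolve(String label) {
    if (label == null) {
      return null;
    }
    int start = 0;
    int end = label.length();
    while (start < end && isAsciiWhitespace(label.charAt(start))) {
      start++;
    }
    while (end > start && isAsciiWhitespace(label.charAt(end - 1))) {
      end--;
    }
    StringBuilder sb = new StringBuilder();
    for (int i = start; i < end; i++) {
      char c = label.charAt(i);
      // ASCII-only lowercasing, so non-ASCII letters never match by accident.
      sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
    }
    return LABELS.get(sb.toString());
  }

  public static void main(String[] args) {
    System.out.println(resolve(" Latin1 "));
    System.out.println(resolve("UTF8"));
    System.out.println(resolve("no-such-charset"));
  }
}
```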

Parameters:

Returns:

ResolveAliasForEmail

public static String ResolveAliasForEmail(String name)

Resolves a character encoding's name to a canonical form, using rules more suitable for email.

Parameters:

Returns:

StringToBytes

public static byte[] StringToBytes(ICharacterEncoding encoding, String str)

Converts a text string to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

StringToBytes

public static byte[] StringToBytes(ICharacterEncoder encoder, String str)

Converts a text string to a byte array using the given character encoder. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoder and can be called as follows: encoder.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

StringToInput

public static ICharacterInput StringToInput(String str)

Converts a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).
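Reading a string code point by code point, with unpaired surrogates replaced, can be sketched in plain Java as shown below. This is not the ICharacterInput implementation; it only mirrors the replacement rule described above, and the names CodePointReaderDemo and readCodePoints are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointReaderDemo {
  // Returns the code points of str, substituting U+FFFD for any
  // surrogate code unit that is not part of a valid surrogate pair.
  static List<Integer> readCodePoints(String str) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < str.length(); ) {
      // codePointAt combines a valid pair and returns a lone surrogate as-is.
      int cp = str.codePointAt(i);
      i += Character.charCount(cp);
      if (cp >= 0xD800 && cp <= 0xDFFF) {
        cp = 0xFFFD;
      }
      result.add(cp);
    }
    return result;
  }

  public static void main(String[] args) {
    // "A", a valid surrogate pair (U+1F600), a lone high surrogate, "B".
    for (int cp : readCodePoints("A\uD83D\uDE00\uD800B")) {
      System.out.printf("U+%04X%n", cp);
    }
  }
}
```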

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:

StringToInput

public static ICharacterInput StringToInput(String str, int offset, int length)

Converts a portion of a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(offset, length). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

Returns:

Throws:
