# com.upokecenter.text.Encodings

public final class Encodings extends Object

Contains methods for converting text from one character encoding to another. This class also contains convenience methods for converting strings and other character inputs to sequences of bytes and vice versa.

The WHATWG Encoding Standard defines algorithms for the most common character encodings used on Web pages and recommends the UTF-8 encoding for new specifications and Web pages. Calling the GetEncoding(name) method returns one of the character encodings with the specified name under the Encoding Standard.

Now let's define some terms.

Encoding Terms

A code point is a number that identifies a single text character, such as a letter, digit, or symbol. (A collection of such characters is also called an abstract character repertoire.)
A coded character set is a set of code points which are each assigned to a single text character. As used here, coded character sets don't define how code points are laid out in memory.
A character encoding is a mapping from a sequence of code points, in one or more specific coded character sets, to a sequence of bytes and vice versa. (For brevity, the rest of this documentation may use the term encoding instead. RFC 6365 uses the analogous term charset instead; in this documentation, however, charset is used only to refer to the names that identify a character encoding.)
ASCII is a 128-code-point coded character set that includes the English letters and digits, common punctuation and symbols, and control characters. As used here, its code points match the code points within the Basic Latin block (0-127 or U+0000 to U+007F) of the Unicode Standard.
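The difference between a code point and its in-memory representation can be seen with the Java standard library alone. The following is a minimal sketch, not part of this library; the class name CodePointDemo and its methods are made up for illustration.

```java
final class CodePointDemo {
    /** Returns the Unicode code points of a string, one int per code point. */
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    /** True if the code point lies in the Basic Latin block (U+0000 to U+007F). */
    static boolean isAscii(int codePoint) {
        return codePoint >= 0 && codePoint <= 0x7F;
    }

    public static void main(String[] args) {
        // "A", "e with acute accent", and an emoji outside the Basic Multilingual Plane.
        // The emoji occupies two Java chars (a surrogate pair) but is one code point.
        String text = "A\u00E9\uD83D\uDE00";
        for (int cp : codePoints(text)) {
            System.out.printf("U+%04X ascii=%b%n", cp, isAscii(cp));
        }
    }
}
```

Note that the string above has four Java char values but only three code points, which is why the rest of this documentation speaks of code points rather than chars.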

There are several kinds of character encodings:

Single-byte encodings define a coded character set that assigns one code point to one byte. Thus, they can have a maximum of 256 code points. For example:
(a) ISO 8859 encodings and windows-1252.
(b) ASCII is usually used as a single-byte encoding where each code point fits in the lower 7 bits of an eight-bit byte (in that case, the encoding is often called US-ASCII). In the Encoding Standard, all single-byte encodings use the ASCII characters as the first 128 code points of their coded character sets.
Multi-byte encodings include code points from one or more coded character sets and assign some or all code points to several bytes. For example:
(a) UTF-16LE and UTF-16BE are two encodings defined in the Unicode Standard. They use 2 bytes for the most common code points, and 4 bytes for supplementary code points.
(b) UTF-8 is another encoding defined in the Unicode Standard. It uses 1 byte for ASCII and 2 to 4 bytes for the other Unicode code points.
(c) Most legacy East Asian encodings, such as Shift_JIS, GBK, and Big5, use 1 byte for ASCII (or a slightly modified version) and, usually, 2 or more bytes for national standard coded character sets. In many of these encodings, notably Shift_JIS, characters whose code points use one byte are traditionally displayed at half the width of characters whose code points use two bytes.
Escape-based encodings are combinations of single- and/or multi-byte encodings, and use escape sequences and/or shift codes to change which encoding to use for the bytes that follow. For example:
(a) ISO-2022-JP supports several escape sequences that shift into different encodings, including a Katakana, a Kanji, and an ASCII encoding (with ASCII as the default).
(b) UTF-7 (not included in the Encoding Standard) is an encoding that uses the Unicode Standard's coded character set, which is encoded using a limited subset of ASCII. The plus symbol (U+002B) is used to shift into a UTF-16BE multi-byte encoding (converted to a modified version of base-64) to encode other Unicode code points.
The Encoding Standard also defines a replacement encoding, which causes a decoding error and is used to alias a few problematic or unsupported encoding names, such as hz-gb-2312.
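To make the byte counts above concrete, here is a small sketch that measures encoded lengths with the charsets built into the Java standard library (java.nio.charset), not with this library. The class name EncodedLengthDemo is an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengthDemo {
    /** Number of bytes the string occupies in the given charset. */
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, an ASCII letter
            "\u00E9",       // U+00E9, within ISO-8859-1
            "\u20AC",       // U+20AC, the euro sign
            "\uD83D\uDE00"  // U+1F600, a supplementary code point
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d UTF-16BE=%d UTF-16LE=%d%n",
                s.codePointAt(0),
                encodedLength(s, StandardCharsets.UTF_8),
                encodedLength(s, StandardCharsets.UTF_16BE),
                encodedLength(s, StandardCharsets.UTF_16LE));
        }
    }
}
```

UTF_16BE and UTF_16LE are used rather than UTF_16 because the latter writes a byte order mark when encoding, which would inflate the counts.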

Getting an Encoding

The Encoding Standard includes UTF-8, UTF-16, and many legacy encodings, and gives each one of them a name. The GetEncoding(name) method takes a name string and returns an ICharacterEncoding object that implements that encoding, or null if the name is unrecognized.

However, the Encoding Standard is designed to include only encodings commonly used on Web pages, not in other protocols such as email. For email, the Encodings class includes an alternate method, GetEncoding(name, forEmail). Setting forEmail to true uses rules modified from the Encoding Standard to better suit encoding and decoding text from email messages.
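Name lookup under the Encoding Standard ignores leading and trailing ASCII whitespace and ASCII letter case, and maps many labels to one encoding name. The sketch below illustrates that idea only; it is not this library's implementation, the class LabelLookupSketch is hypothetical, and the table holds just a handful of the labels the Encoding Standard defines.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelLookupSketch {
    // A tiny excerpt of the Encoding Standard's label table (label -> encoding name).
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
        LABELS.put("unicode-1-1-utf-8", "UTF-8");
        LABELS.put("windows-1252", "windows-1252");
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("latin1", "windows-1252");
        LABELS.put("us-ascii", "windows-1252");
    }

    private static boolean isAsciiWhitespace(char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    /** Strips ASCII whitespace from both ends and lowercases ASCII letters only. */
    static String normalize(String label) {
        int start = 0;
        int end = label.length();
        while (start < end && isAsciiWhitespace(label.charAt(start))) start++;
        while (end > start && isAsciiWhitespace(label.charAt(end - 1))) end--;
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            char c = label.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    /** Returns the encoding name for a label, or null if the label is not in the table. */
    static String resolve(String label) {
        return label == null ? null : LABELS.get(normalize(label));
    }

    public static void main(String[] args) {
        System.out.println(resolve("  UTF8 "));
        System.out.println(resolve("Latin1"));
        System.out.println(resolve("no-such-encoding"));
    }
}
```

Under the Encoding Standard, labels such as latin1 and us-ascii resolve to windows-1252 rather than to a separate encoding, which the excerpt above reflects. Returning null for an unknown label is a choice made for this sketch.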

Classes for Character Encodings

This Encodings class provides access to common character encodings through classes as described below:

An encoder class is a class that converts a sequence of code points in the universal character set (otherwise known under the name Unicode) to a sequence of bytes. An encoder class implements the ICharacterEncoder interface.
A decoder class is a class that converts a sequence of bytes to a sequence of Unicode code points. A decoder class implements the ICharacterDecoder interface.
An encoding class allows access to both an encoder class and a decoder class and implements the ICharacterEncoding interface. The encoder and decoder classes should implement the same character encoding.

Custom Encodings

Classes that implement the ICharacterEncoding interface can provide additional character encodings not included in the Encoding Standard. Some examples of these include the following:

A modified version of UTF-8 used in Java's serialization formats.
A modified version of UTF-7 used in the IMAP email protocol.

(Note that this library doesn't implement either encoding.)
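For context on the first example, the Java standard library itself produces this modified UTF-8 through DataOutputStream.writeUTF. The sketch below compares it with standard UTF-8; it uses only java.io and java.nio.charset, and the class name ModifiedUtf8Demo is an assumption for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    /** Encodes with DataOutputStream.writeUTF and drops its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Encodes with the standard UTF-8 charset. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+0000 takes two bytes (C0 80) in modified UTF-8 but one byte in UTF-8.
        System.out.println(Arrays.toString(modifiedUtf8("\0")));
        System.out.println(Arrays.toString(standardUtf8("\0")));
        // A supplementary code point is written as two 3-byte surrogates
        // in modified UTF-8, versus one 4-byte sequence in UTF-8.
        System.out.println(modifiedUtf8("\uD83D\uDE00").length);
        System.out.println(standardUtf8("\uD83D\uDE00").length);
    }
}
```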

Fields

static final ICharacterEncoding UTF8
Character encoding object for the UTF-8 character encoding, which represents each code point in the universal coded character set using 1 to 4 bytes.

Methods

static String DecodeToString(ICharacterEncoding enc, byte[] bytes)
Reads a byte array from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)
Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding encoding, IByteReader input)
Reads bytes from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding enc, InputStream input)
Decodes data read from a data stream into a text string in the specified character encoding.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)
Reads Unicode characters from a character input and writes them to a byte array encoded using a given character encoding.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder, boolean htmlFallback)
Reads Unicode characters from a character input and writes them to a byte array encoded using the specified character encoder and fallback strategy.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)
Reads Unicode characters from a character input and writes them to a byte array encoded using the specified character encoder.
static byte[] EncodeToBytes(String str, ICharacterEncoding enc)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding.
static byte[] EncodeToBytes(String str, ICharacterEncoding enc, boolean htmlFallback)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding and using the specified encoder fallback strategy.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)
Reads Unicode characters from a character input and writes them, encoded using the specified character encoder, to a byte writer.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output)
Reads Unicode characters from a character input and writes them, encoded using the specified character encoder, to an output data stream.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)
Reads Unicode characters from a character input and writes them, encoded using the specified character encoding, to a byte writer.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output)
Reads Unicode characters from a character input and writes them, encoded using the specified character encoding, to an output data stream.
static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)
Converts a text string to bytes and writes the bytes to an output byte writer.
static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output)
Converts a text string to bytes and writes the bytes to an output data stream.
static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes.
static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a data stream.
static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes.
static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a readable data stream.
static ICharacterEncoding GetEncoding(String name)
Returns a character encoding from the specified name.
static ICharacterEncoding GetEncoding(String name, boolean forEmail)
Returns a character encoding from the specified name.
static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)
Returns a character encoding from the specified name.
static String InputToString(ICharacterInput reader)
Reads Unicode characters from a character input and converts them to a text string.
static String ResolveAlias(String name)
Resolves a character encoding’s name to a standard form.
static String ResolveAliasForEmail(String name)
Resolves a character encoding’s name to a canonical form, using rules more suitable for email.
static byte[] StringToBytes(ICharacterEncoder encoder, String str)
Converts a text string to a byte array using the specified character encoder.
static byte[] StringToBytes(ICharacterEncoding encoding, String str)
Converts a text string to a byte array encoded in a given character encoding.
static ICharacterInput StringToInput(String str)
Converts a text string to a character input.
static ICharacterInput StringToInput(String str, int offset, int length)
Converts a portion of a text string to a character input.

Field Details

UTF8

public static final ICharacterEncoding UTF8

Character encoding object for the UTF-8 character encoding, which represents each code point in the universal coded character set using 1 to 4 bytes.

Method Details

DecodeToString

public static String DecodeToString(ICharacterEncoding encoding, IByteReader input)

Reads bytes from a data source and converts the bytes from a given encoding to a text string.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - An object that implements a given character encoding. Any bytes that can’t be decoded are converted to the replacement character (U+FFFD).
input - An object that implements a byte stream.

Returns:

The converted string.

Throws:

NullPointerException - The parameter encoding or input is null.
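The replacement behavior described above (undecodable bytes become U+FFFD) can be observed with the Java standard library, whose String constructor follows the same convention for UTF-8. This is an analogy for illustration, not a call into this library; the class name ReplacementDecodeDemo is made up.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementDecodeDemo {
    /** Decodes UTF-8, replacing malformed input with U+FFFD (standard library behavior). */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8, so it decodes to U+FFFD.
        byte[] data = {0x41, (byte) 0xFF, 0x42};
        String text = decodeUtf8(data);
        for (int cp : text.codePoints().toArray()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```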

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, InputStream input)

Decodes data read from a data stream into a text string in the specified character encoding.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

enc - An object implementing a character encoding (gives access to an encoder and a decoder).
input - A readable byte stream.

Returns:

A string consisting of the decoded text.

Throws:

NullPointerException - The parameter enc or input is null.

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, byte[] bytes)

Reads a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

enc - An object implementing a character encoding (gives access to an encoder and a decoder).
bytes - A byte array.

Returns:

A string consisting of the decoded text.

Throws:

NullPointerException - The parameter enc or bytes is null.

DecodeToString

public static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)

Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes, offset, length). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

enc - An object implementing a character encoding (gives access to an encoder and a decoder).
bytes - A byte array containing the desired portion to read.
offset - An index starting at 0 showing where the desired portion of bytes begins.
length - The length, in bytes, of the desired portion of bytes (but not more than bytes's length).

Returns:

A string consisting of the decoded text.

Throws:

NullPointerException - The parameter enc or bytes is null.
IllegalArgumentException - Either offset or length is less than 0 or greater than bytes's length, or bytes's length minus offset is less than length.
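The argument conditions listed above can be written out as a small check. This is a sketch of the documented conditions only, not the library's source; the class RangeCheckSketch and its method names are hypothetical.

```java
final class RangeCheckSketch {
    /**
     * Mirrors the documented conditions: offset and length must each lie in
     * [0, arrayLength], and the portion must fit in what remains after offset.
     */
    static boolean isValidRange(int arrayLength, int offset, int length) {
        if (offset < 0 || offset > arrayLength) return false;
        if (length < 0 || length > arrayLength) return false;
        return arrayLength - offset >= length;
    }

    /** Throws the exception types named in the documentation above. */
    static void checkRange(byte[] bytes, int offset, int length) {
        if (bytes == null) {
            throw new NullPointerException("bytes");
        }
        if (!isValidRange(bytes.length, offset, length)) {
            throw new IllegalArgumentException(
                "offset=" + offset + ", length=" + length
                + ", array length=" + bytes.length);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4];
        checkRange(data, 1, 3); // fits exactly: bytes 1, 2, and 3
        try {
            checkRange(data, 2, 3); // only 2 bytes remain after offset 2
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Comparing arrayLength - offset against length, rather than offset + length against arrayLength, avoids integer overflow when both arguments are large.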

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)

Reads Unicode characters from a character input and writes them to a byte array encoded using the specified character encoder. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoding). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoding - An object that implements a given character encoding.

Returns:

A byte array containing the encoded text.

Throws:

NullPointerException - The parameter encoding is null.
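The question-mark fallback described above matches what the Java standard library does in String.getBytes for charsets such as US-ASCII, which makes it easy to see the effect without this library. The sketch below is an analogy under that assumption; QuestionMarkFallbackDemo is a made-up name.

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkFallbackDemo {
    /** Encodes to US-ASCII; unmappable characters become 0x3F in the standard library. */
    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // U+00E9 has no US-ASCII representation, so its byte is 0x3F ('?').
        for (byte b : encodeAscii("caf\u00E9")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```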

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)

Reads Unicode characters from a character input and writes them to a byte array encoded using a given character encoding. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.

Returns:

A byte array.

Throws:

NullPointerException - The parameter encoder or input is null.

EncodeToBytes

public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder, boolean htmlFallback)

Reads Unicode characters from a character input and writes them to a byte array encoded using the specified character encoder and fallback strategy.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoder - A character encoder that takes Unicode characters and writes them into bytes.
htmlFallback - If true, when the encoder encounters invalid characters that can’t be mapped into bytes, writes the HTML decimal escape for the invalid characters instead. If false, writes a question mark byte (0x3f) upon encountering invalid characters.

Returns:

A byte array containing the encoded characters.

Throws:

NullPointerException - The parameter encoder or input is null.
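The two fallback strategies selected by htmlFallback can be sketched with the Java standard library. This is an illustration of the documented behavior, not the library's code; the class HtmlFallbackSketch is hypothetical, and the sketch assumes a well-formed input string and a stateless, ASCII-compatible target charset such as US-ASCII or ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class HtmlFallbackSketch {
    /**
     * Encodes code point by code point. Unencodable code points become an HTML
     * decimal escape such as "&#8364;" when htmlFallback is true, else 0x3F.
     */
    static byte[] encode(String s, Charset cs, boolean htmlFallback) {
        CharsetEncoder probe = cs.newEncoder();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            String piece = new String(Character.toChars(cp));
            i += piece.length();
            byte[] b;
            if (probe.canEncode(piece)) {
                b = piece.getBytes(cs);
            } else if (htmlFallback) {
                // The escape itself is pure ASCII, so it is safe in ASCII-compatible charsets.
                b = ("&#" + cp + ";").getBytes(StandardCharsets.US_ASCII);
            } else {
                b = new byte[] {0x3F};
            }
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9 \u20AC"; // U+00E9 and U+20AC are not in US-ASCII
        Charset ascii = StandardCharsets.US_ASCII;
        System.out.println(new String(encode(text, ascii, true), ascii));
        System.out.println(new String(encode(text, ascii, false), ascii));
    }
}
```

The HTML fallback keeps the original character recoverable by an HTML parser, whereas the question-mark fallback loses it, which is why the former suits HTML output specifically.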

EncodeToBytes

public static byte[] EncodeToBytes(String str, ICharacterEncoding enc)

Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string to encode to a byte array.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).

Returns:

A byte array containing the encoded text string.

Throws:

NullPointerException - The parameter str or enc is null.
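The surrogate handling described above (unpaired surrogates become U+FFFD before encoding) can be sketched as a standalone string pass. The class SurrogateRepairSketch is hypothetical and only illustrates the rule; it does not show how this library reads strings internally.

```java
final class SurrogateRepairSketch {
    /** Returns a copy of s in which every unpaired surrogate is replaced by U+FFFD. */
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A well-formed pair: keep both halves.
                sb.append(c).append(s.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // A lone high or low surrogate.
                sb.append('\uFFFD');
                i++;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String broken = "a\uD800b"; // a high surrogate with no low surrogate after it
        for (int cp : replaceUnpairedSurrogates(broken).codePoints().toArray()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```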

EncodeToBytes

public static byte[] EncodeToBytes(String str, ICharacterEncoding enc, boolean htmlFallback)

Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding and using the specified encoder fallback strategy. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string to encode to a byte array.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
htmlFallback - If true, when the encoder encounters invalid characters that can’t be mapped into bytes, writes the HTML decimal escape for the invalid characters instead. If false, writes a question mark byte (0x3f) upon encountering invalid characters.

Returns:

A byte array containing the encoded text string.

Throws:

NullPointerException - The parameter str or enc is null.

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)

Reads Unicode characters from a character input and writes them, encoded using the specified character encoding, to a byte writer.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoding - An object that implements a character encoding.
writer - A byte writer to write the encoded bytes to.

Throws:

NullPointerException - The parameter encoding is null.

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)

Reads Unicode characters from a character input and writes them, encoded using the specified character encoder, to a byte writer. When writing to the byte writer, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.
writer - A byte writer to write the encoded bytes to.

Throws:

NullPointerException - The parameter encoder or input is null.

EncodeToWriter

public static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)

Converts a text string to bytes and writes the bytes to an output byte writer. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string to encode.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
writer - A byte writer where the encoded bytes will be written to.

Throws:

NullPointerException - The parameter str or enc is null.

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output) throws IOException

Reads Unicode characters from a character input and writes them, encoded using the specified character encoding, to an output data stream.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoding - An object that implements a character encoding.
output - A writable data stream.

Throws:

NullPointerException - The parameter encoding is null.
IOException

EncodeToWriter

public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output) throws IOException

Reads Unicode characters from a character input and writes them, encoded using the specified character encoder, to an output data stream.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.
output - A writable data stream.

Throws:

NullPointerException - The parameter encoder or input is null.
IOException

EncodeToWriter

public static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output) throws IOException

Converts a text string to bytes and writes the bytes to an output data stream. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string to encode.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
output - A writable data stream.

Throws:

NullPointerException - The parameter str or enc is null.
IOException

GetDecoderInput

public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)

Converts a character encoding into a character input stream, given a streamable source of bytes. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(stream). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - Encoding that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
stream - Byte stream to convert into Unicode characters.

Returns:

An ICharacterInput object.

Throws:

NullPointerException - The parameter encoding is null.
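The error handling described for the encoding parameter (a decoder result of -2 becomes a replacement character) can be sketched with a toy decoder. Everything in this example is hypothetical: AsciiDecoder is a stand-in rather than this library's ICharacterDecoder, and using -1 to signal end of input is an assumption made for the sketch.

```java
final class DecodeErrorSketch {
    static final int END_OF_INPUT = -1; // assumed end-of-input marker for this sketch
    static final int DECODE_ERROR = -2; // the error value named in the documentation

    /** Toy decoder: bytes below 0x80 are ASCII code points, all others are errors. */
    static final class AsciiDecoder {
        private final byte[] data;
        private int pos;

        AsciiDecoder(byte[] data) {
            this.data = data;
        }

        int readChar() {
            if (pos >= data.length) return END_OF_INPUT;
            int b = data[pos++] & 0xFF;
            return b < 0x80 ? b : DECODE_ERROR;
        }
    }

    /** Drains the decoder, substituting U+FFFD wherever it reports a decode error. */
    static String readAll(AsciiDecoder decoder) {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int c = decoder.readChar();
            if (c == END_OF_INPUT) break;
            sb.appendCodePoint(c == DECODE_ERROR ? 0xFFFD : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x48, (byte) 0x80, 0x69}; // 'H', an invalid byte, 'i'
        for (int cp : readAll(new AsciiDecoder(data)).codePoints().toArray()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Keeping the error signal in the decoder and the replacement policy in the wrapper lets a different caller choose another policy, such as stopping at the first error.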

GetDecoderInput

public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)

Converts a character encoding into a character input stream, given a data stream. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(input). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
input - Byte stream to convert into Unicode characters.

Returns:

An ICharacterInput object.

Throws:

NullPointerException - The parameter encoding is null.

GetDecoderInputSkipBom

public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)

Converts a character encoding into a character input stream, given a streamable source of bytes. But if the input stream starts with a UTF-8 or UTF-16 byte order mark, the input is decoded as UTF-8 or UTF-16, as the case may be, rather than the specified character encoding.

This method implements the "decode" algorithm specified in the Encoding standard.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(stream). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
stream - Byte stream to convert into Unicode characters.

Returns:

An ICharacterInput object.
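
The byte-order mark check can be sketched as follows. This is an illustration built only on the JDK's java.nio.charset classes, not this library's implementation; BomSniffSketch and its decode method are hypothetical names, and the byte sequences tested (EF BB BF for UTF-8, FE FF for UTF-16BE, FF FE for UTF-16LE) are those used by the Encoding Standard's "decode" algorithm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffSketch {
  // Decodes data in the fallback charset unless it starts with a
  // UTF-8 or UTF-16 byte-order mark, which then decides the charset.
  static String decode(byte[] data, Charset fallback) {
    Charset cs = fallback;
    int skip = 0;
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
      cs = StandardCharsets.UTF_8;
      skip = 3;
    } else if (startsWith(data, 0xFE, 0xFF)) {
      cs = StandardCharsets.UTF_16BE;
      skip = 2;
    } else if (startsWith(data, 0xFF, 0xFE)) {
      cs = StandardCharsets.UTF_16LE;
      skip = 2;
    }
    // The byte-order mark itself is not part of the decoded text.
    return new String(data, skip, data.length - skip, cs);
  }

  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xff) != prefix[i]) {
        return false;
      }
    }
    return true;
  }
}
```

The fallback charset plays the role of the encoding argument: it is used only when no byte-order mark is found.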

GetDecoderInputSkipBom

public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)

Converts a character encoding into a character input stream, given a readable data stream. However, if the input stream starts with a UTF-8 or UTF-16 byte-order mark, the input is decoded as UTF-8 or UTF-16, respectively, rather than in the specified character encoding.

This method implements the "decode" algorithm specified in the Encoding Standard.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(input). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
input - Byte stream to convert into Unicode characters.

Returns:

An ICharacterInput object.

GetEncoding

public static ICharacterEncoding GetEncoding(String name)

Returns a character encoding from the specified name.

Parameters:

name - A string naming a character encoding. See the ResolveAlias method. Can be null.

Returns:

An object implementing a character encoding (gives access to an encoder and a decoder).

GetEncoding

public static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)

Returns a character encoding from the specified name.

Parameters:

name - A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail - If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method.
allowReplacement - Has no effect.

Returns:

An object that enables encoding and decoding text in the specified character encoding. Returns null if the name is null or empty, or if it names an unrecognized or unsupported encoding.

GetEncoding

public static ICharacterEncoding GetEncoding(String name, boolean forEmail)

Returns a character encoding from the specified name.

Parameters:

name - A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail - If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method. If the resolved encoding is “GB18030” or “GBK” (in any combination of case), uses either an encoding intended to conform to the 2022 version of GB18030 if ‘forEmail’ is true, or the definition of the encoding in the WHATWG Encoding Standard (as of July 7, 2023) if ‘forEmail’ is false.

Returns:

An object that enables encoding and decoding text in the specified character encoding. Returns null if the name is null or empty, or if it names an unrecognized or unsupported encoding.

InputToString

public static String InputToString(ICharacterInput reader)

Reads Unicode characters from a character input and converts them to a text string.

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: reader.InputToString(). If the object's class already has an InputToString method with the same parameters, that method takes precedence over this extension method.

Parameters:

reader - A character input whose characters will be converted to a text string.

Returns:

A text string containing the characters read.

Throws:

NullPointerException - The parameter reader is null.

ResolveAlias

public static String ResolveAlias(String name)

Resolves a character encoding's name to a standard form. This involves changing aliases of a character encoding to a standardized name.

In several Internet specifications, this name is known as a "charset" parameter. In HTML and HTTP, for example, the "charset" parameter indicates the encoding used to represent text in the HTML page, text file, or other document.

Parameters:

name - A string that names a given character encoding. Can be null. Any leading and trailing whitespace (U+0009, U+000A, U+000C, U+000D, U+0020) is removed before resolving the encoding’s name, and encoding names are matched using a basic case-insensitive comparison. (Two strings are equal in such a comparison if they match after converting the basic uppercase letters A to Z (U+0041 to U+005A) in both strings to basic lowercase letters.) The Encoding Standard supports only the following encodings (and defines aliases for most of them):

* UTF-8 - UTF-8 (8-bit encoding of the universal coded character set, the encoding recommended by the Encoding Standard for new data formats)
* UTF-16LE - UTF-16 little-endian (16-bit UCS)
* UTF-16BE - UTF-16 big-endian (16-bit UCS)
* The special-purpose encoding x-user-defined
* The special-purpose encoding replacement
* 28 legacy single-byte encodings:
    * windows-1252 : Western Europe (Note: The Encoding Standard aliases the names US-ASCII and ISO-8859-1 to windows-1252, which uses a different coded character set from either; it differs from ISO-8859-1 by assigning different characters to some bytes from 0x80 to 0x9F. The Encoding Standard does this for compatibility with existing Web pages.)
    * ISO-8859-2, windows-1250 : Central Europe
    * ISO-8859-10 : Northern Europe
    * ISO-8859-4, windows-1257 : Baltic
    * ISO-8859-13 : Estonian
    * ISO-8859-14 : Celtic
    * ISO-8859-16 : Romanian
    * ISO-8859-5, IBM-866, KOI8-R, windows-1251, x-mac-cyrillic : Cyrillic
    * KOI8-U : Ukrainian
    * ISO-8859-7, windows-1253 : Greek
    * ISO-8859-6, windows-1256 : Arabic
    * ISO-8859-8, ISO-8859-8-I, windows-1255 : Hebrew
    * ISO-8859-3 : Latin 3
    * ISO-8859-15, windows-1254 : Turkish
    * windows-874 : Thai
    * windows-1258 : Vietnamese
    * macintosh : Mac Roman
* Three legacy Japanese encodings: Shift_JIS, EUC-JP, ISO-2022-JP
* Two legacy simplified Chinese encodings: GBK and gb18030
* Big5 : legacy traditional Chinese encoding
* EUC-KR : legacy Korean encoding

The UTF-8, UTF-16LE, and UTF-16BE encodings don’t encode a byte-order mark at the start of the text (doing so is not recommended for UTF-8, while in UTF-16LE and UTF-16BE, the byte-order mark character U+FEFF is treated as an ordinary character, unlike in the UTF-16 encoding form). The Encoding Standard aliases UTF-16 to UTF-16LE “to deal with deployed content”.

Returns:

A standardized name for the encoding. Returns the empty string if name is null or empty, or if the encoding name is unsupported.
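
The name-matching steps described for the name parameter (trim whitespace, compare with basic case folding, map an alias to a standardized name) can be sketched as below. AliasSketch is a hypothetical class, its alias table is a tiny illustrative subset rather than the library's actual table, and the whitespace set assumed is the Encoding Standard's ASCII whitespace (tab, line feed, form feed, carriage return, space).

```java
import java.util.HashMap;
import java.util.Map;

final class AliasSketch {
  // Illustrative subset only; the real alias list is much longer.
  private static final Map<String, String> ALIASES = new HashMap<>();

  static {
    ALIASES.put("utf-8", "UTF-8");
    ALIASES.put("utf8", "UTF-8");
    ALIASES.put("utf-16", "UTF-16LE");
    ALIASES.put("utf-16le", "UTF-16LE");
    ALIASES.put("us-ascii", "windows-1252");
    ALIASES.put("iso-8859-1", "windows-1252");
    ALIASES.put("windows-1252", "windows-1252");
  }

  private static boolean isAsciiWhitespace(char c) {
    return c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
  }

  // Converts only the basic letters A to Z; other characters are untouched.
  private static String toBasicLowerCase(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 0x20) : c);
    }
    return sb.toString();
  }

  // Returns the standardized name, or the empty string if the
  // name is null, empty, or not in the table.
  static String resolve(String name) {
    if (name == null) {
      return "";
    }
    int start = 0;
    int end = name.length();
    while (start < end && isAsciiWhitespace(name.charAt(start))) {
      start++;
    }
    while (end > start && isAsciiWhitespace(name.charAt(end - 1))) {
      end--;
    }
    String result = ALIASES.get(toBasicLowerCase(name.substring(start, end)));
    return result == null ? "" : result;
  }
}
```

Folding only A to Z, rather than calling a locale-sensitive lowercase routine, keeps the comparison independent of the default locale.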

ResolveAliasForEmail

public static String ResolveAliasForEmail(String name)

Resolves a character encoding’s name to a canonical form, using rules more suitable for email.

Parameters:

name - A string naming a character encoding. Can be null. Any leading and trailing whitespace (U+0009, U+000A, U+000C, U+000D, U+0020) is removed before resolving the encoding’s name, and encoding names are matched using a basic case-insensitive comparison. (Two strings are equal in such a comparison if they match after converting the basic uppercase letters A to Z (U+0041 to U+005A) in both strings to basic lowercase letters.) This method uses a modified version of the rules in the Encoding Standard to better conform, in some cases, to email standards like MIME. Encoding names and aliases not registered with the Internet Assigned Numbers Authority (IANA) are not supported, with the exception of ascii, utf8, cp1252, and names 10 characters or longer starting with iso-8859-. Also, the following additional encodings are supported. Note that this method can return the case combination GB18030, the combination registered with IANA, rather than gb18030.

* US-ASCII - ASCII single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The coded character set’s code points match those in the Unicode Standard’s Basic Latin block (0-127 or U+0000 to U+007F). The name ascii is treated as an alias to US-ASCII even though it is not registered with IANA as a charset name and RFC 2046 (part of MIME) reserves the name “ASCII”. A future version of this method may stop supporting the alias ascii.
* ISO-8859-1 - Latin-1 single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The coded character set’s code points match those in the Unicode Standard’s Basic Latin and Latin-1 Supplement blocks (0-255 or U+0000 to U+00FF).
* UTF-16 - UTF-16 without a fixed byte order, rather than an alias to UTF-16LE as specified in the Encoding Standard. The byte order is little-endian if the byte stream starts with 0xff 0xfe; otherwise, big-endian. A leading 0xff 0xfe or 0xfe 0xff in the byte stream is skipped.
* UTF-7 - UTF-7 (7-bit universal coded character set). The name unicode-1-1-utf-7 is not supported and is not treated as an alias to UTF-7, even though it uses the same character encoding scheme as UTF-7, because RFC 1642, which defined that earlier form of UTF-7, is linked to a different Unicode version with an incompatible character repertoire (notably, the Hangul syllables have different code point assignments in Unicode 1.1 and earlier than in Unicode 2.0 and later).
* ISO-2022-JP-2 - similar to “ISO-2022-JP”, except that the decoder supports additional coded character sets.

Returns:

A standardized name for the encoding. Returns the empty string if name is null or empty, or if the encoding name is unsupported.
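
The byte-order rule given above for the UTF-16 encoding in email can be sketched as follows. This is an illustration using the JDK's built-in charsets, not this library's decoder; Utf16OrderSketch is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16OrderSketch {
  // Little-endian only if the stream starts with 0xff 0xfe;
  // otherwise big-endian. A leading 0xff 0xfe or 0xfe 0xff is skipped.
  static String decodeUtf16(byte[] data) {
    Charset cs = StandardCharsets.UTF_16BE;
    int skip = 0;
    if (data.length >= 2) {
      int b0 = data[0] & 0xff;
      int b1 = data[1] & 0xff;
      if (b0 == 0xff && b1 == 0xfe) {
        cs = StandardCharsets.UTF_16LE;
        skip = 2;
      } else if (b0 == 0xfe && b1 == 0xff) {
        skip = 2;
      }
    }
    return new String(data, skip, data.length - skip, cs);
  }
}
```

A stream with no byte-order mark at all falls through to the big-endian default, which is the point where this rule differs from the Encoding Standard's aliasing of UTF-16 to UTF-16LE.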

StringToBytes

public static byte[] StringToBytes(ICharacterEncoding encoding, String str)

Converts a text string to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoding - An object that implements a character encoding.
str - A string to be encoded into a byte array.

Returns:

A byte array containing the string encoded in the specified character encoding.

Throws:

NullPointerException - The parameter encoding is null.
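
The two substitutions described above (U+FFFD for unpaired surrogates while reading, then 0x3f for characters the encoding can't represent while writing) can be sketched as below. The sketch uses plain ASCII as a stand-in target encoding and does not call this library; EncodeSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;

final class EncodeSketch {
  // Encodes str as ASCII bytes, applying both substitutions.
  static byte[] toAsciiBytes(String str) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int i = 0;
    while (i < str.length()) {
      // codePointAt returns the surrogate itself when it is unpaired.
      int cp = str.codePointAt(i);
      i += Character.charCount(cp);
      if (cp >= 0xD800 && cp <= 0xDFFF) {
        cp = 0xFFFD; // reading step: unpaired surrogate becomes U+FFFD
      }
      // writing step: anything outside ASCII becomes '?' (0x3f)
      out.write(cp < 0x80 ? cp : 0x3F);
    }
    return out.toByteArray();
  }
}
```

Because ASCII can't represent U+FFFD either, an unpaired surrogate ends up as a single 0x3f byte in this sketch; with a target encoding that can represent U+FFFD, the replacement character would be encoded instead.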

StringToBytes

public static byte[] StringToBytes(ICharacterEncoder encoder, String str)

Converts a text string to a byte array using the specified character encoder. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).

In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoder and can be called as follows: encoder.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.

Parameters:

encoder - An object that implements a character encoder.
str - A text string to encode into a byte array.

Returns:

A byte array containing the string encoded with the specified character encoder.

Throws:

NullPointerException - The parameter encoder or str is null.

StringToInput

public static ICharacterInput StringToInput(String str)

Converts a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string to convert to a character input.

Returns:

An ICharacterInput object.

Throws:

NullPointerException - The parameter str is null.
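
Reading a string code point by code point, with unpaired surrogates replaced, can be sketched as follows. CodePointReaderSketch is a hypothetical stand-in for a character input, not this library's class, and its convention of returning -1 at the end of the string is an assumption of the sketch.

```java
final class CodePointReaderSketch {
  private final String str;
  private int index;

  CodePointReaderSketch(String str) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    this.str = str;
  }

  // Returns the next code point, or -1 at the end of the string
  // (the -1 convention is an assumption of this sketch).
  int readChar() {
    if (index >= str.length()) {
      return -1;
    }
    char c = str.charAt(index++);
    if (Character.isHighSurrogate(c) && index < str.length()
        && Character.isLowSurrogate(str.charAt(index))) {
      // A valid surrogate pair forms one supplementary code point.
      return Character.toCodePoint(c, str.charAt(index++));
    }
    // Any other surrogate is unpaired and becomes U+FFFD.
    return Character.isSurrogate(c) ? 0xFFFD : c;
  }
}
```

A surrogate pair is consumed as one code point, so the number of values read can be smaller than the string's length in code units.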

StringToInput

public static ICharacterInput StringToInput(String str, int offset, int length)

Converts a portion of a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).

In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(offset, length). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.

Parameters:

str - A text string, a portion of which is converted to a character input.
offset - An index starting at 0 showing where the desired portion of str begins.
length - The length, in code units, of the desired portion of str (but not more than str's length).

Returns:

An ICharacterInput object.

Throws:

NullPointerException - The parameter str is null.
IllegalArgumentException - Either offset or length is less than 0 or greater than str's length, or str's length minus offset is less than length.
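
The argument checks listed under Throws can be sketched as follows. PortionSketch is a hypothetical helper that mirrors the documented conditions and returns the selected portion as a string; it is not this library's code.

```java
final class PortionSketch {
  // Validates offset and length against str, then returns the portion.
  static String portion(String str, int offset, int length) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    if (offset < 0 || offset > str.length()) {
      throw new IllegalArgumentException("offset out of range: " + offset);
    }
    if (length < 0 || length > str.length()) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    // Written as a subtraction so that offset + length can't overflow.
    if (str.length() - offset < length) {
      throw new IllegalArgumentException("portion extends past the end of str");
    }
    return str.substring(offset, offset + length);
  }
}
```

Note that an offset equal to the string's length is accepted as long as length is 0, which selects an empty portion.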
