com.upokecenter.text.Encodings
public final class Encodings extends Object
Contains methods for converting text from one character encoding to another. This class also contains convenience methods for converting strings and other character inputs to sequences of bytes and vice versa.
The WHATWG Encoding Standard defines algorithms for the most common character encodings used on Web pages and recommends the UTF-8 encoding for new specifications and Web pages. Calling the GetEncoding(name) method returns one of the character encodings with the given name under the Encoding Standard.
Now let's define some terms.
Encoding Terms
- A code point is a number that identifies a single text character, such as a letter, digit, or symbol. (A collection of such characters is also called an abstract character repertoire.)
- A coded character set is a set of code points which are each assigned to a single text character. As used here, coded character sets don't define how code points are laid out in memory.
- A character encoding is a mapping from a sequence of code points, in one or more specific coded character sets, to a sequence of bytes and vice versa. (For brevity, the rest of this documentation may use the term encoding instead. RFC 6365 uses the analogous term charset instead; in this documentation, however, charset is used only to refer to the names that identify a character encoding.)
- ASCII is a 128-code-point coded character set that includes the English letters and digits, common punctuation and symbols, and control characters. As used here, its code points match the code points within the Basic Latin block (0-127 or U+0000 to U+007F) of the Unicode Standard.
There are several kinds of character encodings:
- Single-byte encodings define a coded character set that assigns one code point to one byte. Thus, they can have a maximum of 256 code points. For example:
  - (a) ISO 8859 encodings and windows-1252.
  - (b) ASCII is usually used as a single-byte encoding where each code point fits in the lower 7 bits of an eight-bit byte (in that case, the encoding is often called US-ASCII). In the Encoding Standard, all single-byte encodings use the ASCII characters as the first 128 code points of their coded character sets.
- Multi-byte encodings include code points from one or more coded character sets and assign some or all code points to several bytes. For example:
  - (a) UTF-16LE and UTF-16BE are two encodings defined in the Unicode Standard. They use 2 bytes for the most common code points, and 4 bytes for supplementary code points.
  - (b) UTF-8 is another encoding defined in the Unicode Standard. It uses 1 byte for ASCII and 2 to 4 bytes for the other Unicode code points.
  - (c) Most legacy East Asian encodings, such as Shift_JIS, GBK, and Big5, use 1 byte for ASCII (or a slightly modified version) and, usually, 2 or more bytes for national standard coded character sets. In many of these encodings, notably Shift_JIS, characters whose code points use one byte traditionally take half the space of characters whose code points use two bytes.
- Escape-based encodings are combinations of single- and/or multi-byte encodings, and use escape sequences and/or shift codes to change which encoding to use for the bytes that follow. For example:
  - (a) ISO-2022-JP supports several escape sequences that shift into different encodings, including a Katakana, a Kanji, and an ASCII encoding (with ASCII as the default).
  - (b) UTF-7 (not included in the Encoding Standard) is an encoding that uses the Unicode Standard's coded character set, which is encoded using a limited subset of ASCII. The plus symbol (U+002B) is used to shift into a UTF-16BE multi-byte encoding (converted to a modified version of base-64) to encode other Unicode code points.
- The Encoding Standard also defines a replacement encoding, which causes a decoding error and is used to alias a few problematic or unsupported encoding names, such as hz-gb-2312.
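The byte counts quoted above for UTF-8 and UTF-16 can be checked with a short sketch. It uses the JDK's own java.nio.charset support rather than this library, and the class and method names (EncodedLengths, utf8Length, utf16Length) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengths {
    // Number of bytes the JDK's UTF-8 encoder uses for one code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the JDK's UTF-16BE encoder uses for one code point
    // (UTF-16BE is written without a byte order mark).
    static int utf16Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // An ASCII letter, a Latin-1 letter, the euro sign, and a supplementary code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d byte(s), UTF-16BE %d byte(s)%n",
                cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

Only the supplementary code point (U+1F600) needs 4 bytes in both encodings; the others need 1 to 3 bytes in UTF-8 and 2 bytes in UTF-16BE.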
Getting an Encoding
The Encoding Standard includes UTF-8, UTF-16, and many legacy encodings, and gives each one of them a name. The GetEncoding(name) method takes a name string and returns an ICharacterEncoding object that implements that encoding, or null if the name is unrecognized.
However, the Encoding Standard is designed to include only encodings commonly used on Web pages, not in other protocols such as email. For email, the Encodings class includes an alternate function GetEncoding(name, forEmail). Setting forEmail to true will use rules modified from the Encoding Standard to better suit encoding and decoding text from email messages.
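As a rough illustration of name lookup, the sketch below resolves a label the way the Encoding Standard describes: strip leading and trailing ASCII whitespace, compare case-insensitively, and return null when nothing matches. It is not this library's implementation; the class LabelLookup and its table are hypothetical, and the table holds only a handful of the standard's labels.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelLookup {
    // A tiny excerpt of the Encoding Standard's label table (label -> encoding name).
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
        LABELS.put("unicode-1-1-utf-8", "UTF-8");
        LABELS.put("windows-1252", "windows-1252");
        LABELS.put("latin1", "windows-1252");
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("us-ascii", "windows-1252");
        LABELS.put("shift_jis", "Shift_JIS");
        LABELS.put("sjis", "Shift_JIS");
    }

    // ASCII whitespace as defined by the standard: TAB, LF, FF, CR, and SPACE.
    private static boolean isAsciiWhitespace(char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Returns the encoding name for a label, or null if the label is unrecognized.
    static String resolve(String label) {
        if (label == null) {
            return null;
        }
        int start = 0;
        int end = label.length();
        while (start < end && isAsciiWhitespace(label.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(label.charAt(end - 1))) {
            end--;
        }
        StringBuilder key = new StringBuilder();
        for (int i = start; i < end; i++) {
            char c = label.charAt(i);
            // ASCII-only lowercasing, so non-ASCII look-alikes never match.
            key.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return LABELS.get(key.toString());
    }
}
```

Note that under the Encoding Standard the labels latin1, iso-8859-1, and us-ascii all resolve to windows-1252, which is one reason email may need different rules.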
Classes for Character Encodings
This Encodings class provides access to common character encodings through classes as described below:
- An encoder class is a class that converts a sequence of Unicode code points to a sequence of bytes. An encoder class implements the ICharacterEncoder interface.
- A decoder class is a class that converts a sequence of bytes to a sequence of code points in the universal character set (otherwise known under the name Unicode). A decoder class implements the ICharacterDecoder interface.
- An encoding class allows access to both an encoder class and a decoder class and implements the ICharacterEncoding interface. The encoder and decoder classes should implement the same character encoding.
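To make the direction of each role concrete, here is a toy sketch for a Latin-1 style mapping in which code points U+0000 to U+00FF map to the byte with the same value. The ToyEncoder and ToyDecoder interfaces are simplified, hypothetical stand-ins; they are not the library's ICharacterEncoder and ICharacterDecoder interfaces, whose actual methods are not shown in this section.

```java
import java.io.ByteArrayOutputStream;

final class ToyLatin1 {
    // Hypothetical stand-in for an encoder: code points in, bytes out.
    interface ToyEncoder {
        boolean encode(int codePoint, ByteArrayOutputStream out);
    }

    // Hypothetical stand-in for a decoder: bytes in, code points out.
    interface ToyDecoder {
        int decode(byte b);
    }

    static final ToyEncoder ENCODER = (cp, out) -> {
        if (cp < 0 || cp > 0xFF) {
            return false; // not representable in this one-byte mapping
        }
        out.write(cp);
        return true;
    };

    static final ToyDecoder DECODER = b -> b & 0xFF;

    // Encodes a string, writing '?' (0x3f) for code points the encoder rejects.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        s.codePoints().forEach(cp -> {
            if (!ENCODER.encode(cp, out)) {
                out.write('?');
            }
        });
        return out.toByteArray();
    }

    // Decodes bytes back into a string, one code point per byte.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.appendCodePoint(DECODER.decode(b));
        }
        return sb.toString();
    }
}
```

An encoding object in this picture would simply hand out one encoder and one decoder for the same mapping.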
Custom Encodings
Classes that implement the ICharacterEncoding interface can provide additional character encodings not included in the Encoding Standard. Some examples of these include the following:
- A modified version of UTF-8 used in Java's serialization formats.
- A modified version of UTF-7 used in the IMAP email protocol.
(Note that this library doesn't implement either encoding.)
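For a feel of how the first of these differs from standard UTF-8, the sketch below compares the JDK's DataOutputStream.writeUTF (which writes Java's modified UTF-8 after a 2-byte length prefix) with StandardCharsets.UTF_8. It uses only JDK classes, not this library, and the class name ModifiedUtf8Demo is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, without its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by the standard UTF-8 encoding.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";          // U+0000
        String emoji = "\uD83D\uDE00";  // U+1F600, a supplementary code point
        System.out.println(modifiedUtf8(nul).length + " vs " + standardUtf8(nul).length);
        System.out.println(modifiedUtf8(emoji).length + " vs " + standardUtf8(emoji).length);
    }
}
```

Modified UTF-8 writes U+0000 as the two bytes C0 80 and writes each surrogate of a supplementary code point as its own 3-byte sequence, so the byte streams are not interchangeable with standard UTF-8.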
Fields
static final ICharacterEncoding UTF8
Character encoding object for the UTF-8 character encoding, which represents each code point in the universal coded character set using 1 to 4 bytes.
Methods
static String DecodeToString(ICharacterEncoding enc, byte[] bytes)
Reads a byte array from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)
Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding encoding, IByteReader input)
Reads bytes from a data source and converts the bytes from a given encoding to a text string.
static String DecodeToString(ICharacterEncoding enc, InputStream input)
Decodes data read from a data stream into a text string in the given character encoding.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)
Reads Unicode characters from a character input and writes them to a byte array encoded using a given character encoder.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder, boolean htmlFallback)
Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoder and fallback strategy.
static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)
Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoding.
static byte[] EncodeToBytes(String str, ICharacterEncoding enc)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding.
static byte[] EncodeToBytes(String str, ICharacterEncoding enc, boolean htmlFallback)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding and using the given encoder fallback strategy.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)
Reads Unicode characters from a character input and writes them to a byte writer, encoded using a given character encoder.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output)
Reads Unicode characters from a character input and writes them to an output data stream, encoded using a given character encoder.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)
Reads Unicode characters from a character input and writes them to a byte writer, encoded using the given character encoding.
static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output)
Reads Unicode characters from a character input and writes them to an output data stream, encoded using the given character encoding.
static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)
Converts a text string to bytes and writes the bytes to an output byte writer.
static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output)
Converts a text string to bytes and writes the bytes to an output data stream.
static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes.
static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a data stream.
static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes.
static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a readable data stream.
static ICharacterEncoding GetEncoding(String name)
Returns a character encoding from the given name.
static ICharacterEncoding GetEncoding(String name, boolean forEmail)
Returns a character encoding from the given name.
static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)
Returns a character encoding from the given name.
static String InputToString(ICharacterInput reader)
Reads Unicode characters from a character input and converts them to a text string.
static String ResolveAlias(String name)
Resolves a character encoding's name to a standard form.
static String ResolveAliasForEmail(String name)
Resolves a character encoding's name to a canonical form, using rules more suitable for email.
static byte[] StringToBytes(ICharacterEncoder encoder, String str)
Converts a text string to a byte array using the given character encoder.
static byte[] StringToBytes(ICharacterEncoding encoding, String str)
Converts a text string to a byte array encoded in a given character encoding.
static ICharacterInput StringToInput(String str)
Converts a text string to a character input.
static ICharacterInput StringToInput(String str, int offset, int length)
Converts a portion of a text string to a character input.
Field Details
UTF8
public static final ICharacterEncoding UTF8
Character encoding object for the UTF-8 character encoding, which represents each code point in the universal coded character set using 1 to 4 bytes.
Method Details
DecodeToString
public static String DecodeToString(ICharacterEncoding encoding, IByteReader input)
Reads bytes from a data source and converts the bytes from a given encoding to a text string.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.
Parameters:
- encoding - An object that implements a given character encoding. Any bytes that can't be decoded are converted to the replacement character (U+FFFD).
- input - An object that implements a byte stream.
Returns:
- The converted string.
Throws:
- NullPointerException - The parameter encoding or input is null.
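The replacement behavior described above (undecodable bytes become U+FFFD) can be observed with the JDK's own UTF-8 decoder. This is only an analogy using JDK classes, not a call into this library; the class ReplacementOnDecode and its method are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReplacementOnDecode {
    // Reads the whole stream, then decodes it as UTF-8. The String constructor
    // replaces malformed input with the replacement character U+FFFD.
    static String decodeUtf8(InputStream input) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, the bytes 41 FF 42 decode to "A", U+FFFD, "B", because FF can never appear in well-formed UTF-8.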
DecodeToString
public static String DecodeToString(ICharacterEncoding enc, InputStream input)
Decodes data read from a data stream into a text string in the given character encoding.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.
Parameters:
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- input - A readable byte stream.
Returns:
- A string consisting of the decoded text.
Throws:
- NullPointerException - The parameter enc or input is null.
DecodeToString
public static String DecodeToString(ICharacterEncoding enc, byte[] bytes)
Reads a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.
Parameters:
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- bytes - A byte array.
Returns:
- A string consisting of the decoded text.
Throws:
- NullPointerException - The parameter enc or bytes is null.
DecodeToString
public static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)
Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string. Errors in decoding are handled by replacing erroneous bytes with the replacement character (U+FFFD).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: enc.DecodeToString(bytes, offset, length). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.
Parameters:
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- bytes - A byte array containing the desired portion to read.
- offset - An index starting at 0 showing where the desired portion of bytes begins.
- length - The length, in bytes, of the desired portion of bytes (but not more than bytes's length).
Returns:
- A string consisting of the decoded text.
Throws:
- NullPointerException - The parameter enc or bytes is null.
- IllegalArgumentException - Either offset or length is less than 0 or greater than bytes's length, or bytes's length minus offset is less than length.
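The argument conditions listed under Throws can be written out as a small check. The sketch below mirrors those conditions and then decodes the selected portion as UTF-8 with the JDK; it is an illustration under those assumptions, not the library's code, and RangeCheck, checkRange, and decodePortion are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class RangeCheck {
    // Mirrors the documented conditions on offset and length.
    static void checkRange(byte[] bytes, int offset, int length) {
        if (bytes == null) {
            throw new NullPointerException("bytes");
        }
        if (offset < 0 || offset > bytes.length) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (length < 0 || length > bytes.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        if (bytes.length - offset < length) {
            throw new IllegalArgumentException("offset + length exceeds array length");
        }
    }

    // Decodes only the selected portion of the array (as UTF-8 in this sketch).
    static String decodePortion(byte[] bytes, int offset, int length) {
        checkRange(bytes, offset, length);
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

Comparing bytes.length - offset with length, rather than offset + length with bytes.length, avoids integer overflow for very large arguments.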
EncodeToBytes
public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)
Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoder. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoding). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoding - An object that implements a given character encoding.
Returns:
- A byte array containing the encoded text.
Throws:
- NullPointerException - The parameter encoding is null.
EncodeToBytes
public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)
Reads Unicode characters from a character input and writes them to a byte array encoded using a given character encoding. When writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoder - An object that implements a character encoder.
Returns:
- A byte array.
Throws:
- NullPointerException - The parameter encoder or input is null.
EncodeToBytes
public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder, boolean htmlFallback)
Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoder and fallback strategy.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoder - A character encoder that takes Unicode characters and writes them into bytes.
- htmlFallback - If true, when the encoder encounters invalid characters that can't be mapped into bytes, writes the HTML decimal escape for the invalid characters instead. If false, writes a question mark byte (0x3f) upon encountering invalid characters.
Returns:
- A byte array containing the encoded characters.
Throws:
- NullPointerException - The parameter encoder or input is null.
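The two fallback strategies selected by htmlFallback can be sketched with JDK classes. The example below is a stand-in that assumes an ASCII-compatible target charset and uses java.nio.charset.CharsetEncoder.canEncode to find unencodable code points; it does not use this library's encoder, and HtmlFallbackDemo is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class HtmlFallbackDemo {
    // Encodes text code point by code point. Unencodable code points become
    // either an HTML decimal escape such as "&#8364;" or a question mark (0x3f).
    // Assumes the target charset is ASCII-compatible, so the escape itself
    // can be written as ASCII bytes.
    static byte[] encode(String text, Charset charset, boolean htmlFallback) {
        CharsetEncoder probe = charset.newEncoder();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            byte[] bytes;
            if (probe.canEncode(ch)) {
                bytes = ch.getBytes(charset);
            } else if (htmlFallback) {
                bytes = ("&#" + cp + ";").getBytes(StandardCharsets.US_ASCII);
            } else {
                bytes = new byte[] {0x3F};
            }
            out.write(bytes, 0, bytes.length);
        });
        return out.toByteArray();
    }
}
```

With US-ASCII as the target, the text "a", U+20AC, "b" becomes a&#8364;b when htmlFallback is true and a?b when it is false.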
EncodeToBytes
public static byte[] EncodeToBytes(String str, ICharacterEncoding enc)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
- str - A text string to encode to a byte array.
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
Returns:
- A byte array containing the encoded text string.
Throws:
- NullPointerException - The parameter str or enc is null.
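The surrogate handling mentioned above (unpaired surrogates become U+FFFD when the string is read) might look like the following sketch. It is an independent illustration of the idea, not the library's code, and SurrogateRepair is a hypothetical name.

```java
final class SurrogateRepair {
    // Returns a copy of s in which every unpaired surrogate code unit is
    // replaced with U+FFFD; well-formed surrogate pairs are kept as they are.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A valid pair: copy both code units and skip the second one.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // A lone high or low surrogate.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Doing this before encoding means the encoder only ever sees valid Unicode scalar values.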
EncodeToBytes
public static byte[] EncodeToBytes(String str, ICharacterEncoding enc, boolean htmlFallback)
Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding and using the given encoder fallback strategy. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToBytes(enc, htmlFallback). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
- str - A text string to encode to a byte array.
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- htmlFallback - If true, when the encoder encounters invalid characters that can't be mapped into bytes, writes the HTML decimal escape for the invalid characters instead. If false, writes a question mark byte (0x3f) upon encountering invalid characters.
Returns:
- A byte array containing the encoded text string.
Throws:
- NullPointerException - The parameter str or enc is null.
EncodeToWriter
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)
Reads Unicode characters from a character input and writes them to a byte writer, encoded using the given character encoding. When writing to the byte writer, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoding - An object that implements a character encoding.
- writer - A byte writer to write the encoded bytes to.
Throws:
- NullPointerException - The parameter encoding is null.
EncodeToWriter
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)
Reads Unicode characters from a character input and writes them to a byte writer, encoded using a given character encoder. When writing to the byte writer, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoder - An object that implements a character encoder.
- writer - A byte writer to write the encoded bytes to.
Throws:
- NullPointerException - The parameter encoder or input is null.
EncodeToWriter
public static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)
Converts a text string to bytes and writes the bytes to an output byte writer. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- str - A text string to encode.
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- writer - A byte writer where the encoded bytes will be written to.
Throws:
- NullPointerException - The parameter str or enc is null.
EncodeToWriter
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output) throws IOException
Reads Unicode characters from a character input and writes them to an output data stream, encoded using the given character encoding. When writing to the data stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoding - An object that implements a character encoding.
- output - A writable data stream.
Throws:
- NullPointerException - The parameter encoding is null.
- IOException
EncodeToWriter
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output) throws IOException
Reads Unicode characters from a character input and writes them to an output data stream, encoded using a given character encoder. When writing to the data stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- input - An object that implements a stream of universal code points.
- encoder - An object that implements a character encoder.
- output - A writable data stream.
Throws:
- NullPointerException - The parameter encoder or input is null.
- IOException
EncodeToWriter
public static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output) throws IOException
Converts a text string to bytes and writes the bytes to an output data stream. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte stream, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.EncodeToWriter(enc, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
Parameters:
- str - A text string to encode.
- enc - An object implementing a character encoding (gives access to an encoder and a decoder).
- output - A writable data stream.
Throws:
- NullPointerException - The parameter str or enc is null.
- IOException
GetDecoderInput
public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(stream). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.
Parameters:
- encoding - Encoding that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
- stream - Byte stream to convert into Unicode characters.
Returns:
- An ICharacterInput object.
Throws:
- NullPointerException - The parameter encoding is null.
GetDecoderInput
public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a data stream. The input stream doesn't check the first few bytes for a byte-order mark indicating a Unicode encoding such as UTF-8 before using the character encoding's decoder.
In the.NET implementation, this method is
implemented as an extension method to any object implementing
ICharacterEncoding and can be called as follows:
encoding.GetDecoderInput(input)
. If the object's class already has a
GetDecoderInput method with the same parameters, that method takes
precedence over this extension method.
In the.NET implementation,
this method is implemented as an extension method to any object implementing
ICharacterEncoding and can be called as follows:
encoding.GetDecoderInput(input)
. If the object's class already has a
GetDecoderInput
method with the same parameters, that method takes
precedence over this extension method.
Parameters:
encoding
- Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.input
- Byte stream to convert into Unicode characters.
Returns:
- An ICharacterInput object.
Throws:
NullPointerException
- The parameter encoding is null.
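The error handling described for the encoding parameter can be pictured with a small stand-alone sketch. It does not use this library: the SimpleDecoder interface and the fromResults helper are hypothetical, and the use of -1 to signal the end of input is an assumption made for the example. Only the -2 error code and its replacement with U+FFFD come from the text above.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplacingInputSketch {
  /**
   * Hypothetical decoder interface (not part of the library): returns a code
   * point, -2 on a decode error, or -1 at the end of input (assumed here).
   */
  interface SimpleDecoder {
    int readChar();
  }

  /** Reads one result and reports a decode error (-2) as U+FFFD instead. */
  static int readWithReplacement(SimpleDecoder decoder) {
    int c = decoder.readChar();
    return c == -2 ? 0xFFFD : c;
  }

  /** Drains the decoder into a list of code points, replacing decode errors. */
  static List<Integer> readAll(SimpleDecoder decoder) {
    List<Integer> out = new ArrayList<>();
    int c;
    while ((c = readWithReplacement(decoder)) >= 0) {
      out.add(c);
    }
    return out;
  }

  /** Toy decoder that replays a fixed sequence of results, for demonstration. */
  static SimpleDecoder fromResults(int... results) {
    int[] pos = {0};
    return () -> pos[0] < results.length ? results[pos[0]++] : -1;
  }
}
```

For instance, a decoder that yields 0x41, then an error, then 0x42 would be read as the code points U+0041, U+FFFD, U+0042.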
GetDecoderInputSkipBom
public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)
Converts a character encoding into a character input stream, given a streamable source of bytes. But if the input stream starts with a UTF-8 or UTF-16 byte order mark, the input is decoded as UTF-8 or UTF-16, as the case may be, rather than the given character encoding.
This method implements the "decode" algorithm specified in the Encoding Standard.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(input). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.
Parameters:
encoding
- Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
stream
- Byte stream to convert into Unicode characters.
Returns:
- An ICharacterInput object.
GetDecoderInputSkipBom
public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)
Converts a character encoding into a character input stream, given a readable data stream. But if the input stream starts with a UTF-8 or UTF-16 byte order mark, the input is decoded as UTF-8 or UTF-16, as the case may be, rather than the given character encoding.
This method implements the "decode" algorithm specified in the Encoding Standard.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(input). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.
Parameters:
encoding
- Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
input
- Byte stream to convert into Unicode characters.
Returns:
- An ICharacterInput object.
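The byte-order-mark check that both GetDecoderInputSkipBom overloads describe can be sketched with JDK classes alone. This is an illustration of the idea, not the library's code: the class and method names are made up for the example, and it works on a whole byte array rather than a stream. The byte sequences are the standard marks: EF BB BF for UTF-8, FE FF for UTF-16 big-endian, and FF FE for UTF-16 little-endian.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffSketch {
  /** Returns true if b starts with the given unsigned byte values. */
  static boolean startsWith(byte[] b, int... prefix) {
    if (b.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((b[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Decodes bytes with the encoding named by a leading byte-order mark, if
   * any, skipping the mark; otherwise decodes with the fallback encoding.
   */
  static String decode(byte[] bytes, Charset fallback) {
    if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
      return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
    }
    if (startsWith(bytes, 0xFE, 0xFF)) {
      return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }
    if (startsWith(bytes, 0xFF, 0xFE)) {
      return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
    }
    return new String(bytes, fallback);
  }
}
```

With this sketch, input that begins with EF BB BF is decoded as UTF-8 even when the caller passes a different fallback, which mirrors the behavior described above.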
GetEncoding
public static ICharacterEncoding GetEncoding(String name)
Returns a character encoding from the given name.
Parameters:
name
- A string naming a character encoding. See the ResolveAlias method. Can be null.
Returns:
- An object implementing a character encoding (gives access to an encoder and a decoder).
GetEncoding
public static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)
Returns a character encoding from the given name.
Parameters:
name
- A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail
- If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method.
allowReplacement
- Has no effect.
Returns:
- An object that enables encoding and decoding text in the given character encoding. Returns null if the name is null or empty, or if it names an unrecognized or unsupported encoding.
GetEncoding
public static ICharacterEncoding GetEncoding(String name, boolean forEmail)
Returns a character encoding from the given name.
Parameters:
name
- A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail
- If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method. If the resolved encoding is "GB18030" or "GBK" (in any combination of case), uses either an encoding intended to conform to the 2022 version of GB18030 if 'forEmail' is true, or the definition of the encoding in the WHATWG Encoding Standard (as of July 7, 2023) if 'forEmail' is false.
Returns:
- An object that enables encoding and decoding text in the given character encoding. Returns null if the name is null or empty, or if it names an unrecognized or unsupported encoding.
InputToString
public static String InputToString(ICharacterInput reader)
Reads Unicode characters from a character input and converts them to a text string.
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: reader.InputToString(). If the object's class already has an InputToString method with the same parameters, that method takes precedence over this extension method.
Parameters:
reader
- A character input whose characters will be converted to a text string.
Returns:
- A text string containing the characters read.
Throws:
NullPointerException
- The parameter reader is null.
ResolveAlias
public static String ResolveAlias(String name)
Resolves a character encoding's name to a standard form. This involves changing aliases of a character encoding to a standardized name.
In several Internet specifications, this name is known as a "charset" parameter. In HTML and HTTP, for example, the "charset" parameter indicates the encoding used to represent text in the HTML page, text file, etc.
Parameters:
name
- A string that names a given character encoding. Can be null. Any leading and trailing whitespace (U+0009, U+000C, U+000D, U+000A, U+0020) is removed before resolving the encoding's name, and encoding names are matched using a basic case-insensitive comparison. (Two strings are equal in such a comparison if they match after converting the basic upper-case letters A to Z (U+0041 to U+005A) in both strings to basic lower-case letters.) The Encoding Standard supports only the following encodings (and defines aliases for most of them).
  - UTF-8: UTF-8 (8-bit encoding of the universal coded character set, the encoding recommended by the Encoding Standard for new data formats)
  - UTF-16LE: UTF-16 little-endian (16-bit UCS)
  - UTF-16BE: UTF-16 big-endian (16-bit UCS)
  - The special-purpose encoding x-user-defined
  - The special-purpose encoding replacement
  - 28 legacy single-byte encodings:
    - windows-1252: Western Europe (Note: The Encoding Standard aliases the names US-ASCII and ISO-8859-1 to windows-1252, which uses a different coded character set from either; it differs from ISO-8859-1 by assigning different characters to some bytes from 0x80 to 0x9F. The Encoding Standard does this for compatibility with existing Web pages.)
    - ISO-8859-2, windows-1250: Central Europe
    - ISO-8859-10: Northern Europe
    - ISO-8859-4, windows-1257: Baltic
    - ISO-8859-13: Estonian
    - ISO-8859-14: Celtic
    - ISO-8859-16: Romanian
    - ISO-8859-5, IBM-866, KOI8-R, windows-1251, x-mac-cyrillic: Cyrillic
    - KOI8-U: Ukrainian
    - ISO-8859-7, windows-1253: Greek
    - ISO-8859-6, windows-1256: Arabic
    - ISO-8859-8, ISO-8859-8-I, windows-1255: Hebrew
    - ISO-8859-3: Latin 3
    - ISO-8859-15, windows-1254: Turkish
    - windows-874: Thai
    - windows-1258: Vietnamese
    - macintosh: Mac Roman
  - Three legacy Japanese encodings: Shift_JIS, EUC-JP, ISO-2022-JP
  - Two legacy simplified Chinese encodings: GBK and gb18030
  - Big5: legacy traditional Chinese encoding
  - EUC-KR: legacy Korean encoding
The UTF-8, UTF-16LE, and UTF-16BE encodings don't encode a byte-order mark at the start of the text (doing so is not recommended for UTF-8, while in UTF-16LE and UTF-16BE, the byte-order mark character U+FEFF is treated as an ordinary character, unlike in the UTF-16 encoding form). The Encoding Standard aliases UTF-16 to UTF-16LE "to deal with deployed content".
Returns:
- A standardized name for the encoding. Returns the empty string if name is null or empty, or if the encoding name is unsupported.
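The name-matching rules described for ResolveAlias (trim ASCII whitespace, compare after lower-casing only A to Z, return the empty string on failure) can be sketched as follows. This is not the library's implementation: the class name is made up, and the alias table is a tiny illustrative subset built only from aliases mentioned in the text, whereas the Encoding Standard defines many more. The whitespace set used is the Encoding Standard's ASCII whitespace (tab, line feed, form feed, carriage return, space), and the spelling of the returned names is assumed to follow the list above.

```java
import java.util.HashMap;
import java.util.Map;

final class LabelResolutionSketch {
  // Illustrative subset only; keys are already in basic lower case.
  private static final Map<String, String> ALIASES = new HashMap<>();

  static {
    ALIASES.put("utf-8", "UTF-8");
    ALIASES.put("windows-1252", "windows-1252");
    ALIASES.put("us-ascii", "windows-1252");
    ALIASES.put("iso-8859-1", "windows-1252");
    ALIASES.put("utf-16le", "UTF-16LE");
    ALIASES.put("utf-16", "UTF-16LE");
  }

  /** ASCII whitespace: tab, line feed, form feed, carriage return, space. */
  static boolean isAsciiWhitespace(char c) {
    return c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
  }

  /** Removes leading and trailing ASCII whitespace only. */
  static String trimAsciiWhitespace(String s) {
    int start = 0;
    int end = s.length();
    while (start < end && isAsciiWhitespace(s.charAt(start))) {
      start++;
    }
    while (end > start && isAsciiWhitespace(s.charAt(end - 1))) {
      end--;
    }
    return s.substring(start, end);
  }

  /** Lower-cases only the basic letters A to Z; other characters are kept. */
  static String toAsciiLowerCase(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 0x20) : c);
    }
    return sb.toString();
  }

  /** Returns the standardized name, or the empty string if unresolved. */
  static String resolve(String name) {
    if (name == null) {
      return "";
    }
    String found = ALIASES.get(toAsciiLowerCase(trimAsciiWhitespace(name)));
    return found == null ? "" : found;
  }
}
```

Lower-casing only A to Z, rather than calling a locale-aware method, keeps the comparison independent of the default locale, so a label containing a non-ASCII letter never matches an ASCII alias by accident.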
ResolveAliasForEmail
public static String ResolveAliasForEmail(String name)
Resolves a character encoding's name to a canonical form, using rules more suitable for email.
Parameters:
name
- A string naming a character encoding. Can be null. Any leading and trailing whitespace (U+0009, U+000C, U+000D, U+000A, U+0020) is removed before resolving the encoding's name, and encoding names are matched using a basic case-insensitive comparison. (Two strings are equal in such a comparison if they match after converting the basic upper-case letters A to Z (U+0041 to U+005A) in both strings to basic lower-case letters.) Uses a modified version of the rules in the Encoding Standard to better conform, in some cases, to email standards like MIME. Encoding names and aliases not registered with the Internet Assigned Numbers Authority (IANA) are not supported, with the exception of
ascii, utf8, cp1252, and names 10 characters or longer starting with iso-8859-. Also, the following additional encodings are supported. Note that the case combination GB18030, the combination registered with IANA, rather than gb18030, can be returned by this method.
  - US-ASCII: ASCII single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The coded character set's code points match those in the Unicode Standard's Basic Latin block (0-127 or U+0000 to U+007F). The name ascii is treated as an alias to US-ASCII even though it is not registered with IANA as a charset name and RFC 2046 (part of MIME) reserves the name "ASCII". A future version of this method may stop supporting the alias ascii.
  - ISO-8859-1: Latin-1 single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The coded character set's code points match those in the Unicode Standard's Basic Latin and Latin-1 Supplement blocks (0-255 or U+0000 to U+00FF).
  - UTF-16: UTF-16 without a fixed byte order, rather than an alias to UTF-16LE as specified in the Encoding Standard. The byte order is little endian if the byte stream starts with 0xff 0xfe; otherwise, big endian. A leading 0xff 0xfe or 0xfe 0xff in the byte stream is skipped.
  - UTF-7: UTF-7 (7-bit universal coded character set). The name unicode-1-1-utf-7 is not supported and is not treated as an alias to UTF-7, even though it uses the same character encoding scheme as UTF-7, because RFC 1642, which defined the former UTF-7, is linked to a different Unicode version with an incompatible character repertoire (notably, the Hangul syllables have different code point assignments in Unicode 1.1 and earlier than in Unicode 2.0 and later).
  - ISO-2022-JP-2: similar to "ISO-2022-JP", except that the decoder supports additional character sets.
Returns:
- A standardized name for the encoding. Returns the empty string if name is null or empty, or if the encoding name is unsupported.
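The byte-order rule given above for the email form of UTF-16 (little endian if the stream starts with 0xff 0xfe, otherwise big endian, with a leading mark of either kind skipped) can be sketched with JDK classes. This is only an illustration under those stated rules, not the library's decoder; the class name is made up, and the sketch takes a complete byte array instead of a stream.

```java
import java.nio.charset.StandardCharsets;

final class Utf16ByteOrderSketch {
  /** Decodes UTF-16 bytes, choosing the byte order from a leading mark. */
  static String decode(byte[] b) {
    boolean hasTwo = b.length >= 2;
    // 0xff 0xfe selects little endian; anything else means big endian.
    boolean littleEndian = hasTwo && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE;
    boolean bigEndianMark = hasTwo && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    // A leading mark of either kind is skipped rather than decoded.
    int skip = (littleEndian || bigEndianMark) ? 2 : 0;
    return new String(
        b,
        skip,
        b.length - skip,
        littleEndian ? StandardCharsets.UTF_16LE : StandardCharsets.UTF_16BE);
  }
}
```

Input without any mark is treated as big endian here, which differs from the Encoding Standard's choice of aliasing UTF-16 to UTF-16LE.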
StringToBytes
public static byte[] StringToBytes(ICharacterEncoding encoding, String str)
Converts a text string to a byte array encoded in a given character encoding. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
encoding
- An object that implements a character encoding.
str
- A string to be encoded into a byte array.
Returns:
- A byte array containing the string encoded in the given text encoding.
Throws:
NullPointerException
- The parameter encoding is null.
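The two replacement steps described for StringToBytes can be imitated with the JDK as a rough sketch. It does not call this library: the class and method names are made up, and it relies on the documented behavior of String.getBytes(Charset), which replaces unmappable characters with the charset's default replacement bytes (the single byte 0x3f for ISO-8859-1). Unpaired surrogates are replaced with U+FFFD first, because the JDK on its own would not do that.

```java
import java.nio.charset.Charset;

final class EncodeWithReplacementSketch {
  /** Replaces each unpaired surrogate code unit with U+FFFD. */
  static String replaceUnpairedSurrogates(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    // codePoints() combines valid surrogate pairs and passes unpaired
    // surrogates through as values in the range 0xD800 to 0xDFFF.
    s.codePoints()
        .forEach(cp -> sb.appendCodePoint(cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp));
    return sb.toString();
  }

  /** Encodes after surrogate cleanup; unencodable characters become '?'. */
  static byte[] encode(String s, Charset charset) {
    return replaceUnpairedSurrogates(s).getBytes(charset);
  }
}
```

For example, encoding the string "a", U+20AC, "b" as ISO-8859-1 gives the bytes 0x61 0x3f 0x62, and encoding "a" followed by the lone surrogate U+D800 as UTF-8 gives 0x61 0xef 0xbf 0xbd.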
StringToBytes
public static byte[] StringToBytes(ICharacterEncoder encoder, String str)
Converts a text string to a byte array using the given character encoder. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD), and when writing to the byte array, any characters that can't be encoded are replaced with the byte 0x3f (the question mark character).
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoder and can be called as follows: encoder.StringToBytes(str). If the object's class already has a StringToBytes method with the same parameters, that method takes precedence over this extension method.
Parameters:
encoder
- An object that implements a character encoder.
str
- A text string to encode into a byte array.
Returns:
- A byte array.
Throws:
NullPointerException
- The parameter encoder or str is null.
StringToInput
public static ICharacterInput StringToInput(String str)
Converts a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.
Parameters:
str
- The parameter str is a text string.
Returns:
- An ICharacterInput object.
Throws:
NullPointerException
- The parameter str is null.
StringToInput
public static ICharacterInput StringToInput(String str, int offset, int length)
Converts a portion of a text string to a character input. The resulting input can then be used to encode the text to bytes, or to read the string code point by code point, among other things. When reading the string, any unpaired surrogate characters are replaced with the replacement character (U+FFFD).
In the .NET implementation, this method is implemented as an extension method to any string object and can be called as follows: str.StringToInput(offset, length). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.
Parameters:
str
- The parameter str is a text string.
offset
- An index starting at 0 showing where the desired portion of str begins.
length
- The length, in code units, of the desired portion of str (but not more than str's length).
Returns:
- An ICharacterInput object.
Throws:
NullPointerException
- The parameter str is null.
IllegalArgumentException
- Either offset or length is less than 0 or greater than str's length, or str's length minus offset is less than length.
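The argument checks listed under Throws, together with reading the selected portion code point by code point, can be sketched in plain Java. This is a stand-alone illustration and not the library's code: the class and method names are made up, and the result is returned as an int array rather than an ICharacterInput.

```java
final class StringSliceSketch {
  /** Applies the documented checks on str, offset, and length. */
  static void checkSlice(String str, int offset, int length) {
    if (str == null) {
      throw new NullPointerException("str");
    }
    if (offset < 0 || offset > str.length()) {
      throw new IllegalArgumentException("offset out of range: " + offset);
    }
    if (length < 0 || length > str.length()) {
      throw new IllegalArgumentException("length out of range: " + length);
    }
    // Written as a subtraction so that offset + length cannot overflow.
    if (str.length() - offset < length) {
      throw new IllegalArgumentException("offset + length exceeds the string's length");
    }
  }

  /** Returns the code points of the slice; unpaired surrogates become U+FFFD. */
  static int[] codePoints(String str, int offset, int length) {
    checkSlice(str, offset, length);
    return str.substring(offset, offset + length)
        .codePoints()
        .map(cp -> cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp)
        .toArray();
  }
}
```

Because length is counted in code units, a slice can end in the middle of a surrogate pair; in that case the leftover high surrogate is unpaired within the slice and is read as U+FFFD.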