## PeterO.DataUtilities
public static class DataUtilities
Contains methods useful for reading and writing text strings. It is designed to have no dependencies other than the basic runtime class library. Many of these methods work with text encoded in UTF-8, an encoding form of the Unicode Standard which uses one byte to encode the most basic characters and two to four bytes to encode other characters. For example, the `GetUtf8Bytes` method converts a text string to an array of bytes in UTF-8.

In C# and Java, text strings are represented as sequences of 16-bit values called `char`s. These sequences are well-formed under UTF-16, a 16-bit encoding form of Unicode, except if they contain unpaired surrogate code points. (A surrogate code point is used to encode supplementary characters, those with code points U+10000 or higher, in UTF-16. A surrogate pair is a high surrogate, U+D800 to U+DBFF, followed by a low surrogate, U+DC00 to U+DFFF. An unpaired surrogate code point is a surrogate not appearing in a surrogate pair.) Many of the methods in this class allow setting the behavior to follow when unpaired surrogate code points are found in text strings, such as throwing an error or treating the unpaired surrogate as a replacement character (U+FFFD).
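The surrogate arithmetic described above can be sketched with the base class library alone. The class and method names below are hypothetical and are not part of `DataUtilities`; this is an illustration of the definitions, not the library's implementation.

```csharp
using System;

public static class SurrogateBasics {
  // Combines a high surrogate (U+D800..U+DBFF) and a low surrogate
  // (U+DC00..U+DFFF) into the supplementary code point they encode.
  public static int Combine(char high, char low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
  }

  // Returns true if the char at index i is a surrogate that is not part
  // of a high-then-low surrogate pair.
  public static bool IsUnpairedAt(string s, int i) {
    char c = s[i];
    if (char.IsHighSurrogate(c)) {
      return i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]);
    }
    if (char.IsLowSurrogate(c)) {
      return i == 0 || !char.IsHighSurrogate(s[i - 1]);
    }
    return false;
  }
}
```

For instance, `SurrogateBasics.Combine('\uD83D', '\uDE00')` yields 0x1F600, and the string `"\uD83D\uDE00"` has a `Length` of 2 even though it holds a single code point.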
### Member Summary

- [CodePointAt(string, int)](#CodePointAt_string_int) - Gets the Unicode code point at the given index of the string.
- [CodePointAt(string, int, int)](#CodePointAt_string_int_int) - Gets the Unicode code point at the given index of the string.
- [CodePointBefore(string, int)](#CodePointBefore_string_int) - Gets the Unicode code point just before the given index of the string.
- [CodePointBefore(string, int, int)](#CodePointBefore_string_int_int) - Gets the Unicode code point just before the given index of the string.
- [CodePointCompare(string, string)](#CodePointCompare_string_string) - Compares two strings in Unicode code point order.
- [CodePointLength(string)](#CodePointLength_string) - Finds the number of Unicode code points in the given text string.
- [GetUtf8Bytes(string, bool)](#GetUtf8Bytes_string_bool) - Encodes a string in UTF-8 as a byte array.
- [GetUtf8Bytes(string, bool, bool)](#GetUtf8Bytes_string_bool_bool) - Encodes a string in UTF-8 as a byte array.
- [GetUtf8Length(string, bool)](#GetUtf8Length_string_bool) - Calculates the number of bytes needed to encode a string in UTF-8.
- [GetUtf8String(byte[], bool)](#GetUtf8String_byte_bool) - Generates a text string from a UTF-8 byte array.
- [GetUtf8String(byte[], int, int, bool)](#GetUtf8String_byte_int_int_bool) - Generates a text string from a portion of a UTF-8 byte array.
- [ReadUtf8(System.IO.Stream, int, System.Text.StringBuilder, bool)](#ReadUtf8_System_IO_Stream_int_System_Text_StringBuilder_bool) - Reads a string in UTF-8 encoding from a data stream.
- [ReadUtf8FromBytes(byte[], int, int, System.Text.StringBuilder, bool)](#ReadUtf8FromBytes_byte_int_int_System_Text_StringBuilder_bool) - Reads a string in UTF-8 encoding from a byte array.
- [ReadUtf8ToString(System.IO.Stream)](#ReadUtf8ToString_System_IO_Stream) - Reads a string in UTF-8 encoding from a data stream in full and returns that string.
- [ReadUtf8ToString(System.IO.Stream, int, bool)](#ReadUtf8ToString_System_IO_Stream_int_bool) - Reads a string in UTF-8 encoding from a data stream and returns that string.
- [ToLowerCaseAscii(string)](#ToLowerCaseAscii_string) - Returns a string with the basic uppercase letters A to Z (U+0041 to U+005A) converted to the corresponding basic lowercase letters.
- [ToUpperCaseAscii(string)](#ToUpperCaseAscii_string) - Returns a string with the basic lowercase letters a to z (U+0061 to U+007A) converted to the corresponding basic uppercase letters.
- [WriteUtf8(string, int, int, System.IO.Stream, bool)](#WriteUtf8_string_int_int_System_IO_Stream_bool) - Writes a portion of a string in UTF-8 encoding to a data stream.
- [WriteUtf8(string, int, int, System.IO.Stream, bool, bool)](#WriteUtf8_string_int_int_System_IO_Stream_bool_bool) - Writes a portion of a string in UTF-8 encoding to a data stream.
- [WriteUtf8(string, System.IO.Stream, bool)](#WriteUtf8_string_System_IO_Stream_bool) - Writes a string in UTF-8 encoding to a data stream.

public static int CodePointAt( string str, int index);
Gets the Unicode code point at the given index of the string.
Parameters:
- str: The parameter str is a text string.
- index: Index of the current position into the string.
Return Value:
The Unicode code point at the given position. Returns -1 if index is less than 0, or is greater than or equal to the string’s length. Returns the replacement character (U+FFFD) if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
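A minimal usage sketch of the return values described above, assuming the assembly that provides `PeterO.DataUtilities` is referenced. The class name `CodePointAtExample` is hypothetical.

```csharp
using System;
using PeterO;

public static class CodePointAtExample {
  public static int[] Sample() {
    // 'A', then U+1F600 as a surrogate pair, then an unpaired high surrogate
    string s = "A\uD83D\uDE00\uD800";
    return new int[] {
      DataUtilities.CodePointAt(s, 0), // the code point of 'A'
      DataUtilities.CodePointAt(s, 1), // the supplementary code point (two chars)
      DataUtilities.CodePointAt(s, 3), // unpaired surrogate: U+FFFD
      DataUtilities.CodePointAt(s, 4), // index equals the length: -1
    };
  }
}
```

Going by the documented behavior, `Sample()` should return 0x41, 0x1F600, 0xFFFD and -1 in that order.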
public static int CodePointAt( string str, int index, int surrogateBehavior);
Gets the Unicode code point at the given index of the string.
The following example shows how to iterate a text string code point by code point, terminating the loop when an unpaired surrogate is found.

    for (var i = 0; i < str.Length; ++i) {
      int codePoint = DataUtilities.CodePointAt(str, i, 2);
      if (codePoint < 0) {
        break; /* Unpaired surrogate */
      }
      Console.WriteLine("codePoint:" + codePoint);
      if (codePoint >= 0x10000) {
        i++; /* Supplementary code point */
      }
    }

Parameters:
- str: The parameter str is a text string.
- index: Index of the current position into the string.
- surrogateBehavior: Specifies what kind of value to return if the code point at the given index is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
Return Value:
The Unicode code point at the given position. Returns -1 if index is less than 0, or is greater than or equal to the string’s length. Returns a value as specified under surrogateBehavior if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
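The three `surrogateBehavior` modes can be compared side by side on a string holding a single unpaired surrogate. This is a sketch that assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class SurrogateBehaviorExample {
  public static int[] Sample() {
    string s = "\uDC00"; // a lone low surrogate
    return new int[] {
      DataUtilities.CodePointAt(s, 0, 0), // 0: replacement character U+FFFD
      DataUtilities.CodePointAt(s, 0, 1), // 1: the surrogate's own value
      DataUtilities.CodePointAt(s, 0, 2), // any other value: -1
    };
  }
}
```

Mode 1 suits callers that want to pass unpaired surrogates through unchanged, while any other nonzero value lets a caller detect them without confusing them with a genuine U+FFFD in the text.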
public static int CodePointBefore( string str, int index);
Gets the Unicode code point just before the given index of the string.
Parameters:
- str: The parameter str is a text string.
- index: Index of the current position into the string.
Return Value:
The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than or equal to the string’s length. Returns the replacement character (U+FFFD) if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
public static int CodePointBefore( string str, int index, int surrogateBehavior);
Gets the Unicode code point just before the given index of the string.
Parameters:
- str: The parameter str is a text string.
- index: Index of the current position into the string.
- surrogateBehavior: Specifies what kind of value to return if the previous code point is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
Return Value:
The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than or equal to the string’s length. Returns a value as specified under surrogateBehavior if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
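A short sketch of both `CodePointBefore` overloads, using only indices strictly inside the string. It assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class CodePointBeforeExample {
  public static int[] Sample() {
    // 'A', then U+1F600 as a surrogate pair, then 'B'
    string s = "A\uD83D\uDE00B";
    return new int[] {
      DataUtilities.CodePointBefore(s, 1), // the code point of 'A'
      DataUtilities.CodePointBefore(s, 3), // the pair ending just before index 3
      DataUtilities.CodePointBefore(s, 0), // nothing precedes index 0: -1
      // Unpaired high surrogate before index 1, returned as-is (mode 1)
      DataUtilities.CodePointBefore("\uD800x", 1, 1),
    };
  }
}
```

When walking backward, a return value of 0x10000 or greater means the index should step back by two `char`s rather than one.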
public static int CodePointCompare( string strA, string strB);
Compares two strings in Unicode code point order. Unpaired surrogate code points are treated as individual code points.
Parameters:
- strA: The first string. Can be null.
- strB: The second string. Can be null.

Return Value:

A value indicating which string is “less” or “greater”:

- 0: Both strings are equal or both are null.
- Less than 0: strA is null and strB isn’t; or the first code point that’s different is less in strA than in strB; or strB starts with strA and is longer than strA.
- Greater than 0: strB is null and strA isn’t; or the first code point that’s different is greater in strA than in strB; or strA starts with strB and is longer than strB.

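Code point order differs from UTF-16 code unit order when a supplementary character is compared with a character in the range U+E000 to U+FFFF. The sketch below contrasts `CodePointCompare` with the runtime's `string.CompareOrdinal`. It assumes the assembly providing `PeterO.DataUtilities` is referenced, and the class name is hypothetical.

```csharp
using System;
using PeterO;

public static class CodePointCompareExample {
  // U+FF5E (one char) versus U+1F600 (a surrogate pair starting with 0xD83D)
  private const string Bmp = "\uFF5E";
  private const string Supplementary = "\uD83D\uDE00";

  // Sign of the comparison in code point order
  public static int ByCodePoint() {
    return Math.Sign(DataUtilities.CodePointCompare(Bmp, Supplementary));
  }

  // Sign of the comparison in UTF-16 code unit order
  public static int ByCodeUnit() {
    return Math.Sign(string.CompareOrdinal(Bmp, Supplementary));
  }
}
```

Here `ByCodePoint()` is negative because U+FF5E is less than U+1F600, while `ByCodeUnit()` is positive because the code unit 0xFF5E is greater than 0xD83D.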
public static int CodePointLength( string str);
Finds the number of Unicode code points in the given text string. Unpaired surrogate code points increase this number by 1. This is not necessarily the length of the string in `char`s.
Parameters:
- str: The parameter str is a text string.
Return Value:
The number of Unicode code points in the given string.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
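The difference between the code point count and the `char` count can be seen in a small sketch. It assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class CodePointLengthExample {
  // Returns { length in chars, length in code points } for the given string.
  public static int[] Lengths(string s) {
    return new int[] { s.Length, DataUtilities.CodePointLength(s) };
  }
}
```

For `"a\uD83D\uDE00"` this gives 3 `char`s but 2 code points, and for the lone surrogate `"\uD800"` it gives 1 and 1.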
public static byte[] GetUtf8Bytes( string str, bool replace);
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use `Encoding.UTF8.GetBytes` in .NET, or the `getBytes()` method in Java to do this. For instance, `getBytes()` encodes text strings in a default (so not fixed) character encoding, which can be undesirable.
Parameters:
- str: The parameter str is a text string.
- replace: If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
Return Value:
The string encoded in UTF-8.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
- System.ArgumentException: The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
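A minimal sketch of both settings of `replace`, assuming the assembly providing `PeterO.DataUtilities` is referenced. The class and helper names are hypothetical. `BitConverter.ToString` is used only to show the bytes as hexadecimal text.

```csharp
using System;
using PeterO;

public static class GetUtf8BytesExample {
  // Encodes with replace = true and renders the bytes as hex, e.g. "68-C3-A9".
  public static string EncodeHex(string s) {
    return BitConverter.ToString(DataUtilities.GetUtf8Bytes(s, true));
  }

  // Returns true if encoding with replace = false is rejected.
  public static bool IsRejected(string s) {
    try {
      DataUtilities.GetUtf8Bytes(s, false);
      return false;
    } catch (ArgumentException) {
      return true;
    }
  }
}
```

`EncodeHex("h\u00E9")` gives `68-C3-A9` (one byte for "h", two for U+00E9), `EncodeHex("\uD800")` gives `EF-BF-BD` (the UTF-8 form of U+FFFD), and `IsRejected("\uD800")` is true.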
public static byte[] GetUtf8Bytes( string str, bool replace, bool lenientLineBreaks);
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use `Encoding.UTF8.GetBytes` in .NET, or the `getBytes()` method in Java to do this. For instance, `getBytes()` encodes text strings in a default (so not fixed) character encoding, which can be undesirable.
Parameters:
- str: The parameter str is a text string.
- replace: If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- lenientLineBreaks: If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
Return Value:
The string encoded in UTF-8.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
- System.ArgumentException: The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
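The effect of `lenientLineBreaks` can be sketched by encoding a string with mixed line endings and decoding it again. This assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class LenientLineBreaksExample {
  // Encodes s with the given lenientLineBreaks setting, then decodes the
  // bytes back to a string so the line endings are easy to inspect.
  public static string RoundTrip(string s, bool lenientLineBreaks) {
    byte[] bytes = DataUtilities.GetUtf8Bytes(s, true, lenientLineBreaks);
    return DataUtilities.GetUtf8String(bytes, true);
  }
}
```

With the input `"a\nb\rc\r\nd"`, passing true should yield `"a\r\nb\r\nc\r\nd"` (the bare LF and bare CR each become CR-LF, and the existing CR-LF is kept), while passing false returns the input unchanged.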
public static long GetUtf8Length( string str, bool replace);
Calculates the number of bytes needed to encode a string in UTF-8.
Parameters:
- str: The parameter str is a text string.
- replace: If true, treats unpaired surrogate code points as having 3 UTF-8 bytes (the UTF-8 length of the replacement character U+FFFD).
Return Value:
The number of bytes needed to encode the given string in UTF-8, or -1 if the string contains an unpaired surrogate code point and replace is false.
Exceptions:
- System.ArgumentNullException: The parameter str is null.
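As a worked example of the byte counts, the sketch below measures a string mixing one-, two- and four-byte characters. It assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class GetUtf8LengthExample {
  public static long[] Sample() {
    return new long[] {
      // "h" is 1 byte, U+00E9 is 2 bytes, U+1F600 is 4 bytes
      DataUtilities.GetUtf8Length("h\u00E9\uD83D\uDE00", false),
      // An unpaired surrogate counts as U+FFFD (3 bytes) when replace is true
      DataUtilities.GetUtf8Length("\uD800", true),
      // ... and makes the result -1 when replace is false
      DataUtilities.GetUtf8Length("\uD800", false),
    };
  }
}
```

`Sample()` should return 7, 3 and -1. Note that the return type is `long`, not `int`.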
public static string GetUtf8String( byte[] bytes, bool replace);
Generates a text string from a UTF-8 byte array.
Parameters:
- bytes: A byte array containing text encoded in UTF-8.
- replace: If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
Return Value:
A string represented by the UTF-8 byte array.
Exceptions:
- System.ArgumentNullException: The parameter bytes is null.
- System.ArgumentException: The string is not valid UTF-8 and replace is false.
public static string GetUtf8String( byte[] bytes, int offset, int bytesCount, bool replace);
Generates a text string from a portion of a UTF-8 byte array.
Parameters:
- bytes: A byte array containing text encoded in UTF-8.
- offset: Offset into the byte array to start reading.
- bytesCount: Length, in bytes, of the UTF-8 text string.
- replace: If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
Return Value:
A string represented by the UTF-8 byte array.
Exceptions:
- System.ArgumentNullException: The parameter bytes is null.
- System.ArgumentException: The portion of the byte array is not valid UTF-8 and replace is false.
- System.ArgumentException: The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of bytes.
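Both `GetUtf8String` overloads can be sketched on one small buffer. This assumes the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class GetUtf8StringExample {
  // UTF-8 for "h", U+00E9 (two bytes) and "!"
  private static readonly byte[] Data = { 0x68, 0xC3, 0xA9, 0x21 };

  public static string Whole() {
    return DataUtilities.GetUtf8String(Data, true);
  }

  // Decodes only the two bytes that encode U+00E9.
  public static string Middle() {
    return DataUtilities.GetUtf8String(Data, 1, 2, true);
  }

  // 0xFF never appears in valid UTF-8; with replace = true it becomes U+FFFD.
  public static string WithInvalidByte() {
    return DataUtilities.GetUtf8String(new byte[] { 0x61, 0xFF }, true);
  }
}
```

`Whole()` gives `"h\u00E9!"`, `Middle()` gives `"\u00E9"`, and `WithInvalidByte()` gives `"a\uFFFD"`. Passing false for `replace` on the invalid input raises `System.ArgumentException` instead.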
public static int ReadUtf8( System.IO.Stream stream, int bytesCount, System.Text.StringBuilder builder, bool replace);
Reads a string in UTF-8 encoding from a data stream.
Parameters:
- stream: A readable data stream.
- bytesCount: The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
- builder: A string builder object where the resulting string will be stored.
- replace: If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
Return Value:
0 if the entire string was read without errors, -1 if the string is not valid UTF-8 and replace is false, or -2 if the end of the stream was reached before the last character was read completely (which is only the case if bytesCount is 0 or greater).
Exceptions:
- System.IO.IOException: An I/O error occurred.
- System.ArgumentNullException: The parameter stream is null or builder is null.
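Because `ReadUtf8` reports problems through its return value rather than an exception, callers need to check the status code. A minimal sketch using a `MemoryStream` follows, assuming the assembly providing `PeterO.DataUtilities` is referenced. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using PeterO;

public static class ReadUtf8Example {
  // Reads all of data as UTF-8. The status is 0 on success, -1 for invalid
  // UTF-8 (when replace is false), or -2 for a premature end of stream.
  public static string ReadAll(byte[] data, bool replace, out int status) {
    var builder = new StringBuilder();
    using (var stream = new MemoryStream(data)) {
      // A negative bytesCount means "read until the end of the stream"
      status = DataUtilities.ReadUtf8(stream, -1, builder, replace);
    }
    return builder.ToString();
  }
}
```

Reading the bytes `68 C3 A9` this way should produce `"h\u00E9"` with status 0, while the single invalid byte `FF` with `replace` set to false should produce status -1.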
public static int ReadUtf8FromBytes( byte[] data, int offset, int bytesCount, System.Text.StringBuilder builder, bool replace);
Reads a string in UTF-8 encoding from a byte array.
Parameters:
- data: A byte array containing a UTF-8 text string.
- offset: Offset into the byte array to start reading.
- bytesCount: Length, in bytes, of the UTF-8 text string.
- builder: A string builder object where the resulting string will be stored.
- replace: If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
Return Value:
0 if the entire string was read without errors, or -1 if the string is not valid UTF-8 and replace is false.
Exceptions:
- System.ArgumentNullException: The parameter data is null or builder is null.
- System.ArgumentException: The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of data.
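One way `ReadUtf8FromBytes` might be used is to decode a string embedded in a larger buffer. The sketch below assumes a hypothetical record layout in which the first byte holds the string's length in bytes. That layout is an illustration only and is not defined by this library. It also assumes the assembly providing `PeterO.DataUtilities` is referenced.

```csharp
using System;
using System.Text;
using PeterO;

public static class ReadUtf8FromBytesExample {
  // Decodes a record whose first byte is the UTF-8 length of the string
  // that follows (hypothetical layout). Returns null on invalid UTF-8.
  public static string DecodeRecord(byte[] record) {
    var builder = new StringBuilder();
    int status = DataUtilities.ReadUtf8FromBytes(
      record, 1, record[0], builder, false);
    return status == 0 ? builder.ToString() : null;
  }
}
```

For the record `02 C3 A9 00`, `DecodeRecord` reads the two bytes after the length prefix and returns `"\u00E9"`. The trailing byte is ignored.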
public static string ReadUtf8ToString( System.IO.Stream stream);
Reads a string in UTF-8 encoding from a data stream in full and returns that string. Replaces invalid encoding with the replacement character (U+FFFD).
Parameters:
- stream: A readable data stream.
Return Value:
The string read.
Exceptions:
- System.IO.IOException: An I/O error occurred.
- System.ArgumentNullException: The parameter stream is null.
public static string ReadUtf8ToString( System.IO.Stream stream, int bytesCount, bool replace);
Reads a string in UTF-8 encoding from a data stream and returns that string.
Parameters:
- stream: A readable data stream.
- bytesCount: The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
- replace: If true, replaces invalid encoding with the replacement character (U+FFFD). If false, throws an error if invalid UTF-8 is seen.
Return Value:
The string read.
Exceptions:
- System.IO.IOException: An I/O error occurred; or, the string is not valid UTF-8 and replace is false.
- System.ArgumentNullException: The parameter stream is null.
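The two `ReadUtf8ToString` overloads differ in how they treat invalid input: the one-argument form always substitutes U+FFFD, while the three-argument form can be asked to fail instead. A sketch follows, assuming the assembly providing `PeterO.DataUtilities` is referenced. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using PeterO;

public static class ReadUtf8ToStringExample {
  // Reads the whole stream, replacing invalid UTF-8 with U+FFFD.
  public static string Lenient(byte[] data) {
    using (var stream = new MemoryStream(data)) {
      return DataUtilities.ReadUtf8ToString(stream);
    }
  }

  // Reads bytesCount bytes, throwing IOException on invalid UTF-8.
  public static string Strict(byte[] data, int bytesCount) {
    using (var stream = new MemoryStream(data)) {
      return DataUtilities.ReadUtf8ToString(stream, bytesCount, false);
    }
  }
}
```

With the bytes `68 69 21`, `Lenient` returns `"hi!"` and `Strict(data, 2)` returns `"hi"`. With the bytes `61 FF`, `Lenient` returns `"a\uFFFD"`, whereas `Strict(data, -1)` throws `System.IO.IOException`.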
public static string ToLowerCaseAscii( string str);
Returns a string with the basic uppercase letters A to Z (U+0041 to U+005A) converted to the corresponding basic lowercase letters. Other characters remain unchanged.
Parameters:
- str: The parameter str is a text string.
Return Value:
The converted string, or null if str is null.
public static string ToUpperCaseAscii( string str);
Returns a string with the basic lowercase letters a to z (U+0061 to U+007A) converted to the corresponding basic uppercase letters. Other characters remain unchanged.
Parameters:
- str: The parameter str is a text string.
Return Value:
The converted string, or null if str is null.
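The ASCII-only case conversions leave every character outside A to Z and a to z untouched, which makes them independent of culture settings. A brief sketch, assuming the assembly providing `PeterO.DataUtilities` is referenced. The class name is hypothetical.

```csharp
using System;
using PeterO;

public static class AsciiCaseExample {
  public static string[] Sample() {
    return new string[] {
      // U+00C9 (E with acute) is not a basic letter, so it stays uppercase
      DataUtilities.ToLowerCaseAscii("Content-TYPE \u00C9"),
      // U+00E9 (e with acute) likewise stays lowercase
      DataUtilities.ToUpperCaseAscii("caf\u00E9"),
      // null input gives null output rather than an exception
      DataUtilities.ToLowerCaseAscii(null),
    };
  }
}
```

`Sample()` should return `"content-type \u00C9"`, `"CAF\u00E9"` and null.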
public static int WriteUtf8( string str, int offset, int length, System.IO.Stream stream, bool replace);
Writes a portion of a string in UTF-8 encoding to a data stream.
Parameters:
- str: A string to write.
- offset: The index, starting at 0, where the string portion to write begins.
- length: The length of the string portion to write.
- stream: A writable data stream.
- replace: If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
Return Value:
0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
Exceptions:
- System.ArgumentNullException: The parameter str is null or stream is null.
- System.IO.IOException: An I/O error occurred.
- System.ArgumentException: Either offset or length is less than 0 or greater than str’s length, or str’s length minus offset is less than length.
public static int WriteUtf8( string str, int offset, int length, System.IO.Stream stream, bool replace, bool lenientLineBreaks);
Writes a portion of a string in UTF-8 encoding to a data stream.
Parameters:
- str: A string to write.
- offset: The index, starting at 0, where the string portion to write begins.
- length: The length of the string portion to write.
- stream: A writable data stream.
- replace: If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- lenientLineBreaks: If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
Return Value:
0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
Exceptions:
- System.ArgumentNullException: The parameter str is null or stream is null.
- System.ArgumentException: The parameter offset is less than 0, length is less than 0, or offset plus length is greater than the string’s length.
- System.IO.IOException: An I/O error occurred.
public static int WriteUtf8( string str, System.IO.Stream stream, bool replace);
Writes a string in UTF-8 encoding to a data stream.
Parameters:
- str: A string to write.
- stream: A writable data stream.
- replace: If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
Return Value:
0 if the entire string was written; or -1 if the string contains an unpaired surrogate code point and replace is false.
Exceptions:
- System.ArgumentNullException: The parameter str is null or stream is null.
- System.IO.IOException: An I/O error occurred.
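The `WriteUtf8` overloads can be sketched by writing into a `MemoryStream` and inspecting the result. This assumes the assembly providing `PeterO.DataUtilities` is referenced. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using PeterO;

public static class WriteUtf8Example {
  // Writes the whole string with replace = true and returns the bytes as hex.
  public static string WholeHex(string s) {
    using (var stream = new MemoryStream()) {
      DataUtilities.WriteUtf8(s, stream, true);
      return BitConverter.ToString(stream.ToArray());
    }
  }

  // Writes only length chars starting at offset and returns the bytes as hex.
  public static string PortionHex(string s, int offset, int length) {
    using (var stream = new MemoryStream()) {
      DataUtilities.WriteUtf8(s, offset, length, stream, true);
      return BitConverter.ToString(stream.ToArray());
    }
  }

  // Returns the status code when unpaired surrogates are not replaced.
  public static int StrictStatus(string s) {
    using (var stream = new MemoryStream()) {
      return DataUtilities.WriteUtf8(s, stream, false);
    }
  }
}
```

`WholeHex("h\u00E9")` gives `68-C3-A9`, `PortionHex("hello", 1, 3)` gives `65-6C-6C` (the bytes of "ell"), and `StrictStatus("\uD800")` gives -1. Unlike `GetUtf8Bytes`, these methods signal an unpaired surrogate through the return value rather than an exception, so the status should be checked after each call.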