## PeterO.DataUtilities

public static class DataUtilities

Contains methods useful for reading and writing text strings. It is designed to have no dependencies other than the basic runtime class library. Many of these methods work with text encoded in UTF-8, an encoding form of the Unicode Standard which uses one byte to encode the most basic characters and two to four bytes to encode other characters. For example, the GetUtf8Bytes method converts a text string to an array of bytes in UTF-8.

In C# and Java, text strings are represented as sequences of 16-bit char values. These sequences are well-formed under UTF-16, a 16-bit encoding form of Unicode, unless they contain unpaired surrogate code points. (Surrogate code points are used in pairs to encode supplementary characters, those with code points U+10000 or higher, in UTF-16. A surrogate pair is a high surrogate, U+D800 to U+DBFF, followed by a low surrogate, U+DC00 to U+DFFF. An unpaired surrogate code point is a surrogate not appearing in a surrogate pair.) Many of the methods in this class let the caller choose what happens when unpaired surrogate code points are found in text strings, such as throwing an error or treating the unpaired surrogate as a replacement character (U+FFFD).
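The definition of an unpaired surrogate above can be illustrated with a short standalone sketch. It uses only the .NET base class library; the class and method names (SurrogateSketch, IndexOfUnpairedSurrogate) are hypothetical and are not part of DataUtilities.

```csharp
using System;

public static class SurrogateSketch
{
    // Returns the index of the first unpaired surrogate in str,
    // or -1 if every surrogate belongs to a well-formed pair.
    public static int IndexOfUnpairedSurrogate(string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        for (int i = 0; i < str.Length; ++i)
        {
            char c = str[i];
            if (char.IsHighSurrogate(c))
            {
                bool paired = i + 1 < str.Length &&
                    char.IsLowSurrogate(str[i + 1]);
                if (!paired)
                {
                    return i; // High surrogate with no low surrogate after it
                }
                ++i; // Skip the low half of the well-formed pair
            }
            else if (char.IsLowSurrogate(c))
            {
                return i; // Low surrogate with no high surrogate before it
            }
        }
        return -1;
    }
}
```

For instance, the string "a\uD83Dx" has a high surrogate at index 1 that is not followed by a low surrogate, so the sketch returns 1 for it.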

### CodePointAt

public static int CodePointAt(
    string str,
    int index);

Gets the Unicode code point at the given index of the string.

Parameters:

Return Value:

The Unicode code point at the given position. Returns -1 if index is less than 0, or is greater than or equal to the string’s length. Returns the replacement character (U+FFFD) if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
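The return values described above can be sketched with the base class library alone. This is an illustration of the documented behavior, not the library's implementation; the class name CodePointAtSketch is hypothetical, and the sketch assumes that index 0 is a valid position and that a null string raises ArgumentNullException.

```csharp
using System;

public static class CodePointAtSketch
{
    public static int CodePointAt(string str, int index)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str)); // Assumed behavior
        }
        if (index < 0 || index >= str.Length)
        {
            return -1; // Position is outside the string
        }
        char c = str[index];
        if (char.IsHighSurrogate(c) && index + 1 < str.Length &&
            char.IsLowSurrogate(str[index + 1]))
        {
            // Well-formed surrogate pair: result is 0x10000 or greater
            return char.ConvertToUtf32(c, str[index + 1]);
        }
        if (char.IsSurrogate(c))
        {
            return 0xFFFD; // Unpaired surrogate: replacement character
        }
        return c; // Ordinary code point in a single UTF-16 code unit
    }
}
```

A caller walking a string with such a method advances by two code units whenever the result is 0x10000 or greater, and by one otherwise.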

Exceptions:

### CodePointAt

public static int CodePointAt(
    string str,
    int index,
    int surrogateBehavior);

Gets the Unicode code point at the given index of the string.

The following example shows how to iterate a text string code point by code point, terminating the loop when an unpaired surrogate is found.

    for (var i = 0; i < str.Length; ++i) {
      int codePoint = DataUtilities.CodePointAt(str, i, 2);
      if (codePoint < 0) {
        break; // Unpaired surrogate
      }
      Console.WriteLine("codePoint:" + codePoint);
      if (codePoint >= 0x10000) {
        i++; // Supplementary code point
      }
    }

Parameters:

Return Value:

The Unicode code point at the given position. Returns -1 if index is less than 0, or is greater than or equal to the string’s length. Returns a value as specified under surrogateBehavior if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.

Exceptions:

### CodePointBefore

public static int CodePointBefore(
    string str,
    int index);

Gets the Unicode code point just before the given index of the string.

Parameters:

Return Value:

The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string’s length. Returns the replacement character (U+FFFD) if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.

Exceptions:

### CodePointBefore

public static int CodePointBefore(
    string str,
    int index,
    int surrogateBehavior);

Gets the Unicode code point just before the given index of the string.

Parameters:

Return Value:

The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string’s length. Returns a value as specified under surrogateBehavior if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.

Exceptions:

### CodePointCompare

public static int CodePointCompare(
    string strA,
    string strB);

Compares two strings in Unicode code point order. Unpaired surrogate code points are treated as individual code points.

Parameters:

Return Value:

A value indicating which string is “less” or “greater”. 0: Both strings are equal or both are null. Less than 0: strA is null and strB isn’t; or the first code point that’s different is less in strA than in strB; or strB starts with strA and is longer than strA. Greater than 0: strB is null and strA isn’t; or the first code point that’s different is greater in strA than in strB; or strA starts with strB and is longer than strB.
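Code point order can differ from the order of UTF-16 code units, because surrogates (U+D800 to U+DFFF) sort below code units such as U+FF5E even though the supplementary characters they encode are numerically greater. The following standalone sketch shows one way such a comparison could work; the class name CodePointCompareSketch and the helper CodePointOrSurrogate are hypothetical, and this is not the library's implementation.

```csharp
using System;

public static class CodePointCompareSketch
{
    public static int CodePointCompare(string strA, string strB)
    {
        if (strA == null)
        {
            return strB == null ? 0 : -1;
        }
        if (strB == null)
        {
            return 1;
        }
        int i = 0;
        int j = 0;
        while (i < strA.Length && j < strB.Length)
        {
            int ca = CodePointOrSurrogate(strA, i);
            int cb = CodePointOrSurrogate(strB, j);
            if (ca != cb)
            {
                return ca < cb ? -1 : 1;
            }
            // Supplementary code points occupy two UTF-16 code units
            i += ca >= 0x10000 ? 2 : 1;
            j += cb >= 0x10000 ? 2 : 1;
        }
        if (i < strA.Length)
        {
            return 1; // strA starts with strB and is longer
        }
        return j < strB.Length ? -1 : 0;
    }

    // Reads the code point at index i; an unpaired surrogate is
    // returned as its own code point value.
    private static int CodePointOrSurrogate(string s, int i)
    {
        char c = s[i];
        if (char.IsHighSurrogate(c) && i + 1 < s.Length &&
            char.IsLowSurrogate(s[i + 1]))
        {
            return char.ConvertToUtf32(c, s[i + 1]);
        }
        return c;
    }
}
```

With this sketch, "\uFF5E" compares as less than "\uD800\uDC00" (U+10000), whereas string.CompareOrdinal puts the same two strings in the opposite order.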

### CodePointLength

public static int CodePointLength(
    string str);

Finds the number of Unicode code points in the given text string. Unpaired surrogate code points increase this number by 1. This is not necessarily the length of the string in char values.

Parameters:

Return Value:

The number of Unicode code points in the given string.
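As a rough illustration of how the code point count differs from the string's length in char values, here is a standalone sketch that uses only the base class library. The class name CodePointLengthSketch is hypothetical, and the null check is an assumption about the behavior.

```csharp
using System;

public static class CodePointLengthSketch
{
    public static int CodePointLength(string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str)); // Assumed behavior
        }
        int count = 0;
        for (int i = 0; i < str.Length; ++i)
        {
            if (char.IsHighSurrogate(str[i]) && i + 1 < str.Length &&
                char.IsLowSurrogate(str[i + 1]))
            {
                ++i; // A surrogate pair is one code point in two code units
            }
            ++count; // Unpaired surrogates also count as one code point
        }
        return count;
    }
}
```

The string "a\uD83D\uDE00" has a Length of 3 but contains 2 code points under this sketch.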

Exceptions:

### GetUtf8Bytes

public static byte[] GetUtf8Bytes(
    string str,
    bool replace);

Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.

REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java, to do this. For instance, getBytes() encodes text strings in a default (so not fixed) character encoding, which can be undesirable.

Parameters:

Return Value:

The string encoded in UTF-8.

Exceptions:

### GetUtf8Bytes

public static byte[] GetUtf8Bytes(
    string str,
    bool replace,
    bool lenientLineBreaks);

Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.

REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java, to do this. For instance, getBytes() encodes text strings in a default (so not fixed) character encoding, which can be undesirable.

Parameters:

Return Value:

The string encoded in UTF-8.

Exceptions:

### GetUtf8Length

public static long GetUtf8Length(
    string str,
    bool replace);

Calculates the number of bytes needed to encode a string in UTF-8.

Parameters:

Return Value:

The number of bytes needed to encode the given string in UTF-8, or -1 if the string contains an unpaired surrogate code point and replace is false.
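The byte count follows directly from the UTF-8 encoding form: one byte below U+0080, two below U+0800, three for the rest of the Basic Multilingual Plane (including U+FFFD), and four for supplementary code points. A minimal standalone sketch of that arithmetic, with a hypothetical class name and an assumed null check, might look like this:

```csharp
using System;

public static class Utf8LengthSketch
{
    public static long GetUtf8Length(string str, bool replace)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str)); // Assumed behavior
        }
        long size = 0;
        for (int i = 0; i < str.Length; ++i)
        {
            char c = str[i];
            if (c < 0x80)
            {
                size += 1;
            }
            else if (c < 0x800)
            {
                size += 2;
            }
            else if (char.IsHighSurrogate(c) && i + 1 < str.Length &&
                char.IsLowSurrogate(str[i + 1]))
            {
                size += 4; // Supplementary code point
                ++i;
            }
            else if (char.IsSurrogate(c))
            {
                if (!replace)
                {
                    return -1; // Unpaired surrogate and no replacement
                }
                size += 3; // U+FFFD encodes as EF BF BD
            }
            else
            {
                size += 3;
            }
        }
        return size;
    }
}
```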

Exceptions:

### GetUtf8String

public static string GetUtf8String(
    byte[] bytes,
    bool replace);

Generates a text string from a UTF-8 byte array.

Parameters:

Return Value:

A string represented by the UTF-8 byte array.

Exceptions:

### GetUtf8String

public static string GetUtf8String(
    byte[] bytes,
    int offset,
    int bytesCount,
    bool replace);

Generates a text string from a portion of a UTF-8 byte array.

Parameters:

Return Value:

A string represented by the UTF-8 byte array.

Exceptions:

### ReadUtf8

public static int ReadUtf8(
    System.IO.Stream stream,
    int bytesCount,
    System.Text.StringBuilder builder,
    bool replace);

Reads a string in UTF-8 encoding from a data stream.

Parameters:

Return Value:

0 if the entire string was read without errors, -1 if the string is not valid UTF-8 and replace is false, or -2 if the end of the stream was reached before the last character was read completely (which is only the case if bytesCount is 0 or greater).

Exceptions:

### ReadUtf8FromBytes

public static int ReadUtf8FromBytes(
    byte[] data,
    int offset,
    int bytesCount,
    System.Text.StringBuilder builder,
    bool replace);

Reads a string in UTF-8 encoding from a byte array.

Parameters:

Return Value:

0 if the entire string was read without errors, or -1 if the string is not valid UTF-8 and replace is false.
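The two return codes can be mimicked with the base class library's System.Text.UTF8Encoding, whose second constructor argument selects between throwing on invalid bytes and substituting U+FFFD. The sketch below is only an approximation under that assumption: the class name is hypothetical, the number of replacement characters emitted for a given invalid sequence may differ from the library's, and unlike a streaming decoder it appends nothing to the builder when decoding fails.

```csharp
using System;
using System.Text;

public static class Utf8DecodeSketch
{
    public static int ReadUtf8FromBytes(
        byte[] data,
        int offset,
        int bytesCount,
        StringBuilder builder,
        bool replace)
    {
        if (data == null)
        {
            throw new ArgumentNullException(nameof(data));
        }
        if (builder == null)
        {
            throw new ArgumentNullException(nameof(builder));
        }
        // First argument: do not emit a byte-order mark.
        // Second argument: throw on invalid bytes unless replacing.
        var encoding = new UTF8Encoding(false, !replace);
        try
        {
            builder.Append(encoding.GetString(data, offset, bytesCount));
            return 0;
        }
        catch (DecoderFallbackException)
        {
            return -1; // Invalid UTF-8 and replace is false
        }
    }
}
```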

Exceptions:

### ReadUtf8ToString

public static string ReadUtf8ToString(
    System.IO.Stream stream);

Reads a string in UTF-8 encoding from a data stream in full and returns that string. Replaces invalid encoding with the replacement character (U+FFFD).

Parameters:

Return Value:

The string read.

Exceptions:

### ReadUtf8ToString

public static string ReadUtf8ToString(
    System.IO.Stream stream,
    int bytesCount,
    bool replace);

Reads a string in UTF-8 encoding from a data stream and returns that string.

Parameters:

Return Value:

The string read.

Exceptions:

### ToLowerCaseAscii

public static string ToLowerCaseAscii(
    string str);

Returns a string with the basic upper-case letters A to Z (U+0041 to U+005A) converted to the corresponding basic lower-case letters. Other characters remain unchanged.

Parameters:

Return Value:

The converted string, or null if str is null.
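The conversion affects only the 26 basic letters, so letters outside that range, such as U+00C9, pass through unchanged. A minimal standalone sketch of this behavior follows; the class name AsciiCaseSketch is hypothetical and this is not the library's implementation.

```csharp
using System.Text;

public static class AsciiCaseSketch
{
    public static string ToLowerCaseAscii(string str)
    {
        if (str == null)
        {
            return null;
        }
        var builder = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // Only U+0041 to U+005A are shifted to U+0061 to U+007A
            builder.Append(c >= 'A' && c <= 'Z' ? (char)(c + 0x20) : c);
        }
        return builder.ToString();
    }
}
```

Unlike culture-sensitive lowercasing, this kind of conversion gives the same result regardless of the current locale, which is useful for case-insensitive protocol keywords.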

### ToUpperCaseAscii

public static string ToUpperCaseAscii(
    string str);

Returns a string with the basic lower-case letters a to z (U+0061 to U+007A) converted to the corresponding basic upper-case letters. Other characters remain unchanged.

Parameters:

Return Value:

The converted string, or null if str is null.

### WriteUtf8

public static int WriteUtf8(
    string str,
    int offset,
    int length,
    System.IO.Stream stream,
    bool replace);

Writes a portion of a string in UTF-8 encoding to a data stream.

Parameters:

Return Value:

0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
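To make the byte layout concrete, the sketch below writes a single Unicode code point to a stream in UTF-8. It is a standalone illustration of the encoding form, not the library's implementation; the names Utf8WriteSketch and WriteCodePoint are hypothetical.

```csharp
using System;
using System.IO;

public static class Utf8WriteSketch
{
    // Writes one code point (not a surrogate) to the stream as UTF-8.
    public static void WriteCodePoint(Stream stream, int cp)
    {
        if (stream == null)
        {
            throw new ArgumentNullException(nameof(stream));
        }
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            throw new ArgumentOutOfRangeException(nameof(cp));
        }
        if (cp < 0x80)
        {
            stream.WriteByte((byte)cp);
        }
        else if (cp < 0x800)
        {
            stream.WriteByte((byte)(0xC0 | (cp >> 6)));
            stream.WriteByte((byte)(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            stream.WriteByte((byte)(0xE0 | (cp >> 12)));
            stream.WriteByte((byte)(0x80 | ((cp >> 6) & 0x3F)));
            stream.WriteByte((byte)(0x80 | (cp & 0x3F)));
        }
        else
        {
            stream.WriteByte((byte)(0xF0 | (cp >> 18)));
            stream.WriteByte((byte)(0x80 | ((cp >> 12) & 0x3F)));
            stream.WriteByte((byte)(0x80 | ((cp >> 6) & 0x3F)));
            stream.WriteByte((byte)(0x80 | (cp & 0x3F)));
        }
    }
}
```

Writing U+20AC with this sketch produces the bytes E2 82 AC, and U+1F600 produces F0 9F 98 80. A string writer would call such a routine once per code point, substituting U+FFFD or stopping with an error code when it meets an unpaired surrogate.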

Exceptions:

### WriteUtf8

public static int WriteUtf8(
    string str,
    int offset,
    int length,
    System.IO.Stream stream,
    bool replace,
    bool lenientLineBreaks);

Writes a portion of a string in UTF-8 encoding to a data stream.

Parameters:

Return Value:

0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.

Exceptions:

### WriteUtf8

public static int WriteUtf8(
    string str,
    System.IO.Stream stream,
    bool replace);

Writes a string in UTF-8 encoding to a data stream.

Parameters:

Return Value:

0 if the entire string was written; or -1 if the string contains an unpaired surrogate code point and replace is false.

Exceptions:
