
Encoding and Decoding

by Alex Peck on September 10th, 2009

Encoding is the process of transforming a sequence of characters into a sequence of bytes; decoding is the reverse of this process. It is important to use the correct encoding, and particular attention should be paid when performing low-level string operations or working with programs designed for regions that use non-Latin alphabets.

ASCII, ISO and Unicode

The venerable ASCII encoding standard is the basis for modern character encoding. ASCII encodes non-printable control characters, as well as the basic Latin alphabet, digits, punctuation marks, and a few miscellaneous symbols. These values are encoded in 7 bits (0-127). ASCII does not standardise the use of values 128 to 255 in an 8-bit byte. Different regions invented their own standards for that range, which makes it difficult to exchange text between regions when each side may be using a different standard.
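To make the 7-bit limit concrete, here is a minimal sketch using .NET's Encoding.ASCII. The class and method names (AsciiLimitDemo, ToAscii) are illustrative and not part of the original post. With the default settings, a character outside the 0-127 range is replaced by a question mark (byte value 63).

```csharp
using System;
using System.Text;

public static class AsciiLimitDemo
{
    // Encodes text as ASCII and returns the resulting byte values.
    public static byte[] ToAscii(string text)
    {
        return Encoding.ASCII.GetBytes(text);
    }

    public static void Main()
    {
        // "\u00E9" is 'é', which lies outside the 0-127 range and cannot be represented.
        byte[] bytes = ToAscii("caf\u00E9");
        Console.WriteLine(string.Join(" ", bytes)); // 99 97 102 63
    }
}
```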

ISO went some way toward solving this by defining standardised code pages consisting of both the standard ASCII values (0-127) and language-specific values (128-255). For example, ISO 8859-1 is intended for Western European languages. Clearly, ISO 8859 encodings are still not truly interoperable, because the same byte value can mean different characters in different parts of the standard, but at least you know which characters are represented between 128 and 255 once you know which part is in use.
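As a small illustration of the split between the shared range and the language-specific range, the following sketch decodes two bytes as ISO 8859-1. The names Latin1Demo and DecodeLatin1 are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // Decodes raw bytes using the ISO 8859-1 (Latin-1) code page.
    public static string DecodeLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        // 0x41 is in the shared ASCII range; 0xE9 is in the language-specific range.
        byte[] bytes = { 0x41, 0xE9 };
        string text = DecodeLatin1(bytes);
        Console.WriteLine((int)text[0]); // 65  (U+0041, 'A')
        Console.WriteLine((int)text[1]); // 233 (U+00E9, 'é')
    }
}
```

In ISO 8859-5 (Cyrillic), the same byte 0xE9 stands for щ instead, which is why the code page has to be known before the bytes can be read correctly.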

Unicode was introduced almost twenty years ago to provide a universal text encoding scheme, capable of encoding the characters used in every living language. Although Unicode was initially restricted to 16 bits, it has since grown beyond that limit, and there are now multiple Unicode encoding schemes.

System.Text Encodings

  • UTF32Encoding is a UTF-32 encoding representing each code point as a 32-bit integer.
  • UnicodeEncoding is a UTF-16 encoding representing each code point as a sequence of one or two 16-bit integers.
  • UTF8Encoding is a UTF-8 encoding representing each code point as a sequence of one to four bytes.
  • UTF7Encoding is a UTF-7 encoding representing Unicode characters as sequences of 7-bit ASCII characters.
  • ASCIIEncoding corresponds to the Windows code page 20127. ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
  • ANSI/ISO encodings can be accessed via the Encoding.GetEncoding method. This page gives a list of the supported encodings.
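The size differences between the Unicode encodings listed above can be checked with GetByteCount. This is only a sketch; ByteCountDemo and ByteCounts are names made up for this example.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // Returns the encoded size of text in UTF-8, UTF-16 and UTF-32, in that order.
    public static int[] ByteCounts(string text)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text),
            Encoding.UTF32.GetByteCount(text)
        };
    }

    public static void Main()
    {
        // "A" is ASCII, "\u00E9" is 'é', "\U0001F600" is a code point above U+FFFF.
        foreach (string s in new[] { "A", "\u00E9", "\U0001F600" })
        {
            Console.WriteLine(string.Join(", ", ByteCounts(s)));
        }
        // Prints:
        // 1, 2, 4
        // 2, 2, 4
        // 4, 4, 4
    }
}
```

The last row shows the one-or-two rule for UTF-16: a code point above U+FFFF needs two 16-bit integers (a surrogate pair), so it takes four bytes in all three encodings.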

I was struck by the inconsistent naming of these classes, and that they don’t follow Microsoft’s naming conventions for capitalisation.

A simple example

Using Encoding.GetEncoding we can retrieve a Western European code page, then get the encoded bytes using the GetBytes method. This is only slightly more typing than using one of the standard encoding classes described above.

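// Requires: using System.Text;
// Look up a code page by name, then encode a string with it.
byte[] westernEuroBytes = Encoding.GetEncoding("Windows-1252").GetBytes("foo bar");
// Encoding.Unicode is the built-in UTF-16 (little-endian) encoding.
byte[] utf16Bytes = Encoding.Unicode.GetBytes("foo bar");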
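Decoding is the mirror image: GetString turns the bytes back into a string, provided the same encoding is used. A minimal round-trip sketch follows (RoundTripDemo and RoundTrip are illustrative names, not from the original post).

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Encodes text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip(Encoding.Unicode, "foo bar"));      // foo bar
        Console.WriteLine(Encoding.Unicode.GetBytes("foo bar").Length); // 14
    }
}
```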

Enumerating supported code pages

You can enumerate the available encodings as follows. This essentially reproduces the table found here.

foreach (EncodingInfo ei in Encoding.GetEncodings())
{
    Console.WriteLine("{0}, {1}, {2}", ei.CodePage, ei.Name, ei.DisplayName);
}

Encoding & File I/O

StreamReader and StreamWriter default to UTF-8. If you use any other encoding, you should declare it explicitly both when writing the file and when reading it back, as follows.

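// Requires: using System.IO; using System.Text;
// Write the file as UTF-16 (false = overwrite rather than append).
using (StreamWriter sw = new StreamWriter("filename.txt", false, Encoding.Unicode))
{
    sw.WriteLine("foo bar");
}

// Read it back with the same encoding.
using (StreamReader sr = new StreamReader("filename.txt", Encoding.Unicode))
{
    string line = sr.ReadLine();
}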

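If you encode the file using ASCII, you can also decode it using UTF-8, because UTF-8 encodes the first 128 code points as the same single bytes as ASCII. UTF-7 decodes most ASCII text correctly, but it treats '+' as a shift character, so it is not a safe substitute. UTF-16 and UTF-32 use 16-bit and 32-bit code units rather than single bytes, and are therefore incompatible.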
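The compatibility claim can be checked in memory without touching a file. The sketch below (AsciiCompatibilityDemo and DecodeAsciiBytes are hypothetical names) encodes a string as ASCII and then decodes the same bytes with a different encoding.

```csharp
using System;
using System.Text;

public static class AsciiCompatibilityDemo
{
    // Encodes text as ASCII, then decodes those bytes with the given encoding.
    public static string DecodeAsciiBytes(Encoding decoder, string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return decoder.GetString(asciiBytes);
    }

    public static void Main()
    {
        // UTF-8 reads ASCII bytes unchanged.
        Console.WriteLine(DecodeAsciiBytes(Encoding.UTF8, "foo bar") == "foo bar");    // True
        // UTF-16 pairs the bytes up into 16-bit units, so the text is garbled.
        Console.WriteLine(DecodeAsciiBytes(Encoding.Unicode, "foo bar") == "foo bar"); // False
    }
}
```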
