<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>out &#62;&#62; m_Conscientia; &#187; Encoding</title>
	<atom:link href="http://blog.hypercomplex.co.uk/index.php/tag/encoding/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.hypercomplex.co.uk</link>
	<description>a multidimensional braindump</description>
	<lastBuildDate>Sat, 27 Aug 2011 19:16:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Encoding and Decoding</title>
		<link>http://blog.hypercomplex.co.uk/index.php/2009/09/encoding-and-decoding/</link>
		<comments>http://blog.hypercomplex.co.uk/index.php/2009/09/encoding-and-decoding/#comments</comments>
		<pubDate>Thu, 10 Sep 2009 17:52:34 +0000</pubDate>
		<dc:creator>Alex Peck</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[.NET]]></category>
		<category><![CDATA[Application Development Foundation]]></category>
		<category><![CDATA[Decoding]]></category>
		<category><![CDATA[Encoding]]></category>

		<guid isPermaLink="false">http://blog.hypercomplex.co.uk/?p=294</guid>
		<description><![CDATA[Encoding is the process of transforming a sequence of characters into a sequence of bytes; decoding is the reverse of this process. It is important to use the correct encoding, and particular attention should be paid when performing low-level string operations or working with programs designed for regions which use non-Latin alphabets. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx">Encoding</a> is the process of transforming a sequence of characters into a sequence of bytes; decoding is the reverse of this process. It is important to use the correct encoding, and particular attention should be paid when performing low-level string operations or working with programs designed for regions which use non-<a href="http://en.wikipedia.org/wiki/Basic_modern_Latin_alphabet">Latin alphabets</a>.</p>
<p><strong>ASCII, ISO and Unicode</strong></p>
<p>The venerable <a href="http://en.wikipedia.org/wiki/ASCII">ASCII</a> encoding standard is the basis for modern character encoding. ASCII encodes non-printable control characters, as well as the modern Latin alphabet, digits, punctuation marks, and a few miscellaneous symbols. These values are encoded in 7 bits (0-127). ASCII does not standardise the use of values 128 to 255 in an 8-bit byte; different regions invented their own conventions, which makes it difficult to exchange text between regions.</p>
<p>The ISO 8859 family of standards (often loosely called "ANSI" code pages on Windows) went some way toward solving this by defining standardised code pages consisting of both the standard ASCII values (0-127) and language-specific values (128-255). For example, <a href="http://en.wikipedia.org/wiki/ISO/IEC_8859-1">ISO 8859-1</a> is intended for Western European languages. Clearly, ISO 8859 encodings are still not truly interoperable, because a byte in the range 128-255 means something different in each code page, but at least you know which characters are represented once you know which code page was used.</p>
<p><a href="http://en.wikipedia.org/wiki/Unicode">Unicode</a> was introduced almost twenty years ago to provide a world text encoding scheme, capable of encoding the characters used in every living language. Although Unicode was initially restricted to 16 bits, it now defines multiple encoding schemes.</p>
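<p>As a rough illustration of the 0-127 versus 128-255 split, the sketch below encodes the same string with ASCII and with ISO 8859-1. The sample string and the <code>Hex</code> helper are my own additions for illustration, not part of the framework.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="csharp" style="font-family:monospace;">using System;
using System.Text;

static class CodePageDemo
{
    // Hypothetical helper: formats bytes as space-separated hex.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        string text = "caf\u00E9"; // U+00E9 (e with acute accent) lies outside ASCII

        // ASCII has no mapping for U+00E9, so the encoder substitutes '?' (0x3F).
        Console.WriteLine(Hex(Encoding.ASCII.GetBytes(text)));  // 63 61 66 3F

        // ISO 8859-1 maps U+00E9 to the single byte 0xE9.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Hex(latin1.GetBytes(text)));          // 63 61 66 E9
    }
}</pre></td></tr></table></div>

<p>The first three bytes are identical in both encodings; only the character above 127 differs, and ASCII silently loses it.</p>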
<p><strong>System.Text Encodings</strong></p>
<ul>
<li><a href="http://msdn.microsoft.com/en-us/library/system.text.utf32encoding.aspx">UTF32Encoding</a> is a UTF-32 encoding representing each code point as a 32-bit integer.</li>
<li><a href="http://msdn.microsoft.com/en-us/library/system.text.unicodeencoding.aspx">UnicodeEncoding</a> is a UTF-16 encoding representing each code point as a sequence of one to two 16-bit integers.</li>
<li><a href="http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx">UTF8Encoding</a> is a UTF-8 encoding representing each code point as a sequence of one to four bytes.</li>
<li><a href="http://msdn.microsoft.com/en-us/library/system.text.utf7encoding.aspx">UTF7Encoding</a> is a UTF-7 encoding representing Unicode characters as sequences of 7-bit ASCII characters.</li>
<li><a href="http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.aspx">ASCIIEncoding</a> corresponds to the Windows code page 20127. ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.</li>
<li>ANSI/ISO encodings can be accessed via the <a href="http://msdn.microsoft.com/en-us/library/system.text.encoding.getencoding.aspx">Encoding.GetEncoding</a> method. <a href="http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx">This</a> page gives a list of the supported encodings.</li>
</ul>
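<p>To make the size differences in the list above concrete, here is a small sketch that prints how many bytes each Unicode encoding needs for four sample characters. The samples (U+0041, U+00E9, U+20AC and U+1F600) are my own choice; <code>GetByteCount</code> does not include any byte order mark.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="csharp" style="font-family:monospace;">using System;
using System.Text;

static class ByteCountDemo
{
    static void Main()
    {
        // Escapes keep this source file pure ASCII.
        // The last sample is outside the 16-bit range, so it is a surrogate pair in UTF-16.
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };

        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8 = {1}, UTF-16 = {2}, UTF-32 = {3}",
                char.ConvertToUtf32(s, 0),          // code point of the sample
                Encoding.UTF8.GetByteCount(s),
                Encoding.Unicode.GetByteCount(s),
                Encoding.UTF32.GetByteCount(s));
        }
    }
}</pre></td></tr></table></div>

<p>UTF-8 needs 1, 2, 3 and 4 bytes respectively, UTF-16 needs 2, 2, 2 and 4, and UTF-32 always needs 4.</p>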
<p>I was struck by the inconsistent naming of these classes, and that they don&#8217;t follow Microsoft&#8217;s naming conventions for <a href="http://msdn.microsoft.com/en-us/library/ms229043.aspx">capitalisation</a>.</p>
<p><strong>A simple example</strong></p>
<p>Using <a href="http://msdn.microsoft.com/en-us/library/system.text.encoding.getencoding.aspx">Encoding.GetEncoding</a> we can retrieve a Western European codepage, then get some encoded bytes using the GetBytes method. This is only slightly more typing than using one of the standard encoding classes described above.</p>

<div class="wp_codebox"><table><tr id="p2944"><td class="code" id="p294code4"><pre class="csharp" style="font-family:monospace;"><span style="color: #6666cc; font-weight: bold;">byte</span><span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span> westernEuroBytes <span style="color: #008000;">=</span> Encoding<span style="color: #008000;">.</span><span style="color: #0000FF;">GetEncoding</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;Windows-1252&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">.</span><span style="color: #0000FF;">GetBytes</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;foo bar&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
<span style="color: #6666cc; font-weight: bold;">byte</span><span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span> utf16Bytes <span style="color: #008000;">=</span> Encoding<span style="color: #008000;">.</span><span style="color: #0000FF;">Unicode</span><span style="color: #008000;">.</span><span style="color: #0000FF;">GetBytes</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;foo bar&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span></pre></td></tr></table></div>
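<p>Decoding is the mirror image: pass the bytes to <code>GetString</code> on the same encoding. The sketch below round-trips the string and compares sizes. It uses ISO 8859-1 rather than Windows-1252 only because, as far as I know, newer .NET runtimes need an extra encoding provider registered before Windows code pages are available; on the full .NET Framework either name should work.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="csharp" style="font-family:monospace;">using System;
using System.Text;

static class RoundTripDemo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        byte[] latin1Bytes = latin1.GetBytes("foo bar");
        byte[] utf16Bytes = Encoding.Unicode.GetBytes("foo bar");

        Console.WriteLine(latin1Bytes.Length); // 7: one byte per character
        Console.WriteLine(utf16Bytes.Length);  // 14: two bytes per character

        // Decoding with the encoding that produced the bytes recovers the text.
        Console.WriteLine(latin1.GetString(latin1Bytes) == "foo bar");          // True
        Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes) == "foo bar"); // True
    }
}</pre></td></tr></table></div>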

<p><strong>Enumerating supported code pages</strong></p>
<p>You can enumerate the available encodings as follows. This essentially reproduces the table found <a href="http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx">here</a>.</p>

<div class="wp_codebox"><table><tr id="p2945"><td class="code" id="p294code5"><pre class="csharp" style="font-family:monospace;"><span style="color: #0600FF; font-weight: bold;">foreach</span> <span style="color: #008000;">&#40;</span>EncodingInfo ei <span style="color: #0600FF; font-weight: bold;">in</span> Encoding<span style="color: #008000;">.</span><span style="color: #0000FF;">GetEncodings</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;{0}, {1}, {2}&quot;</span>, ei<span style="color: #008000;">.</span><span style="color: #0000FF;">CodePage</span>, ei<span style="color: #008000;">.</span><span style="color: #0000FF;">Name</span>, ei<span style="color: #008000;">.</span><span style="color: #0000FF;">DisplayName</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

<p><strong>Encoding &#038; File I/O</strong></p>
<p>The StreamReader and StreamWriter classes default to UTF-8. If you use any other encoding, you should specify it explicitly when reading the file back, as follows.</p>

<div class="wp_codebox"><table><tr id="p2946"><td class="code" id="p294code6"><pre class="csharp" style="font-family:monospace;"><span style="color: #0600FF; font-weight: bold;">using</span> <span style="color: #008000;">&#40;</span>StreamWriter sw <span style="color: #008000;">=</span> <span style="color: #008000;">new</span> StreamWriter<span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;filename.txt&quot;</span>, <span style="color: #0600FF; font-weight: bold;">false</span>, Encoding<span style="color: #008000;">.</span><span style="color: #0000FF;">Unicode</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    sw<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;foo bar&quot;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
<span style="color: #008000;">&#125;</span>
&nbsp;
<span style="color: #0600FF; font-weight: bold;">using</span> <span style="color: #008000;">&#40;</span>StreamReader sr <span style="color: #008000;">=</span> <span style="color: #008000;">new</span> StreamReader<span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;filename.txt&quot;</span>, Encoding<span style="color: #008000;">.</span><span style="color: #0000FF;">Unicode</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
    <span style="color: #6666cc; font-weight: bold;">string</span> line <span style="color: #008000;">=</span> sr<span style="color: #008000;">.</span><span style="color: #0000FF;">ReadLine</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
<span style="color: #008000;">&#125;</span></pre></td></tr></table></div>

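<p>If you encode the file using ASCII, you can also decode it using UTF-8, because UTF-8 encodes the first 128 code points as the same single bytes that ASCII uses. UTF-7 works too, provided the text contains no + character, which UTF-7 treats as the start of an escape sequence. UTF-16 and UTF-32 use 2-byte and 4-byte code units respectively, and are therefore incompatible.</p>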
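<p>The compatibility claim can be checked in memory, without touching a file. This is a minimal sketch using the same sample string as the earlier examples.</p>

<div class="wp_codebox"><table><tr><td class="code"><pre class="csharp" style="font-family:monospace;">using System;
using System.Text;

static class AsciiCompatDemo
{
    static void Main()
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes("foo bar");

        // UTF-8 encodes U+0000 to U+007F as the same single bytes as ASCII.
        Console.WriteLine(Encoding.UTF8.GetString(asciiBytes) == "foo bar");    // True

        // UTF-16 consumes the bytes two at a time, fusing neighbouring characters.
        Console.WriteLine(Encoding.Unicode.GetString(asciiBytes) == "foo bar"); // False
    }
}</pre></td></tr></table></div>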
]]></content:encoded>
			<wfw:commentRss>http://blog.hypercomplex.co.uk/index.php/2009/09/encoding-and-decoding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

