Computer Science - Unicode

Updated: 2018-12-11

1 byte = 8 bits = 2 hex digits

Unicode vs ASCII vs ISO-8859-1

  • ASCII: 7 bits, 128 code points

    • Codes below 32 are control characters, traditionally called unprintable
  • ISO-8859-1 (Latin-1): 8 bits, 256 code points
  • UNICODE:

    • Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
    • Truth: code points range from U+0000 to U+10FFFF

      • 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points
      • Code points U+D800 to U+DFFF are reserved for the high and low surrogates that UTF-16 uses to encode code points greater than U+FFFF
      • The U+ means "Unicode" and the digits that follow are hexadecimal
      • Hello: U+0048 U+0065 U+006C U+006C U+006F (these are code points, not how the text is stored in memory)
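
As a quick check of the numbers above, here is a minimal Python sketch. It assumes nothing beyond the built-in ord() function; the variable names are only illustrative.

```python
# Code points are plain integers; ord() maps a one-character string to its code point.
text = "Hello"
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# Size of the code space: U+0000..U+10FFFF inclusive.
total = 0x10FFFF + 1
# Surrogate range: U+D800..U+DFFF inclusive.
surrogates = 0xDFFF - 0xD800 + 1
print(total, surrogates, total - surrogates)  # 1114112 2048 1112064
```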

Unicode vs UTF-8/UTF-16/UTF-32

  • Unicode: the code space (1,114,112 code points)
  • UTF-8/UTF-16/UTF-32: encodings, i.e. methods for turning code points into bytes

UTF-8

UTF-8 uses the following rules:

If the code point is < 128, it’s represented by the corresponding byte value. If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

Hello: 48 65 6C 6C 6F(the same as ASCII)

  • Variable-width encoding(one to four bytes/8-bit unit)

    • one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding
    • 1 byte(7 bits): U+0000 -> U+007F 0xxxxxxx
    • 2 bytes(11 bits): U+0080 -> U+07FF 110xxxxx 10xxxxxx
    • 3 bytes(16 bits): U+0800 -> U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
    • 4 bytes(21 bits): U+10000 -> U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits could reach 0x1FFFFF, but Unicode stops at U+10FFFF)
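
The bit layouts above can be sketched as a small hand-written encoder. This is an illustration only, not how any real library implements it; utf8_encode is a hypothetical helper name, and it rejects surrogates and anything above U+10FFFF, the top of the Unicode code space.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point to UTF-8 by hand (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x00A5).hex(" "))  # c2 a5
print(utf8_encode(0x4F60).hex(" "))  # e4 bd a0
```

The results agree with Python's built-in str.encode("utf-8") for the same characters.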

UTF-16 (successor to the fixed-width UCS-2)

  • Variable-width encoding (one or two 16-bit units)

    • one 16-bit unit(2 bytes, direct mapping): U+0000 to U+FFFF(excluding U+D800 to U+DFFF)
    • two 16-bit units(4 bytes): U+010000 to U+10FFFF, also called supplementary characters

      • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
      • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
      • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
  • The first 128 code points have the same values as ASCII, but each still takes 2 bytes, so UTF-16 text is not byte-compatible with ASCII
  • The byte order (big-endian or little-endian) must be known to decode the 16-bit units
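
The surrogate arithmetic above can be sketched as follows. The helper name to_surrogates is made up for illustration, and U+1F600 is just an example of a supplementary character.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates (illustrative helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    v = cp - 0x10000               # 20-bit value in 0..0xFFFFF
    high = 0xD800 + (v >> 10)      # top ten bits
    low = 0xDC00 + (v & 0x3FF)     # low ten bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Writing the two units in big-endian order gives the same four bytes as chr(0x1F600).encode("utf-16-be").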

UTF-32(UCS-4)

  • UTF-32 – a 32-bit, fixed-width encoding
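
Fixed width means every code point takes exactly four bytes, whichever character it is. A small sketch, assuming Python's built-in utf-32-be codec (explicit byte order, so no BOM is written); the sample string is arbitrary.

```python
# One ASCII letter, one BMP character, one supplementary character.
text = "A\u4f60\U0001F600"
encoded = text.encode("utf-32-be")
print(len(text), len(encoded))  # 3 12

# Each 4-byte unit is simply the code point as a big-endian integer.
units = [int.from_bytes(encoded[i:i + 4], "big") for i in range(0, len(encoded), 4)]
print([f"U+{u:04X}" for u in units])  # ['U+0041', 'U+4F60', 'U+1F600']
```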

Read More

  • UTF-8 Everywhere Manifesto: http://www.utf8everywhere.org/
  • Python Unicode HowTo

Java: UTF-16

The character encoding should be set in the Content-Type HTTP header, but it can also be set in the HTML itself with the <meta charset> attribute. Always include the character encoding! If no charset is declared, the browser will guess the encoding.

In header:

Content-Type: text/plain; charset="UTF-8"
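
To show how a client might use this header, here is a minimal sketch that pulls the charset parameter out with Python's standard email.message.Message class and uses it to decode a body. The body bytes are a made-up example, not from a real response.

```python
from email.message import Message

# Build a header container and parse the charset parameter from it.
msg = Message()
msg["Content-Type"] = 'text/plain; charset="UTF-8"'
charset = msg.get_content_charset()  # parameter value, unquoted and lower-cased
print(charset)  # utf-8

# Hypothetical response body: the UTF-8 bytes for "¥123".
body = b"\xc2\xa5123"
print(body.decode(charset))  # ¥123
```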

In HTML5, these are equivalent:

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

In order for all browsers to recognize a <meta charset> declaration, it must be

  • Within the <head> element,
  • Before any elements that contain text, such as the <title> element, AND
  • Within the first 512 bytes of your document, including DOCTYPE and whitespace

In Node.js http.ServerResponse (http://nodejs.org/api/http.html#http_response_write_chunk_encoding):

```javascript
response.write(chunk, [encoding]);
```

where encoding is 'utf-8' by default

Python:

bytes.decode() turns bytes into str; str.encode() turns str into bytes

```python
>>> a = "\u00a5123"
>>> a
'¥123'
>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'
>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
```

Byte Order Mark (BOM): http://en.wikipedia.org/wiki/Byte_order_mark

  • big-endian: FE FF (hexadecimal), 254 255 (decimal)
  • little-endian: FF FE (hexadecimal), 255 254 (decimal)

Example:

```python
>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'
```

Use utf-16 with BOM

```python
>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'
```

Or omit the BOM and rely on the default; without a BOM, utf-16 falls back to the platform's native byte order (little-endian on most machines)

```python
>>> b'`O}Y'.decode("utf-16")
'你好'
```

Use utf-16-le and skip BOM

```python
>>> b'`O}Y'.decode("utf-16-le")
'你好'
```

Decoding the same bytes as utf-16-be swaps the bytes of every unit and produces the wrong characters

```python
>>> b'`O}Y'.decode("utf-16-be")
'恏絙'
```

A Java byte is signed, so its range is -128 to 127; the examples below use Byte.toUnsignedInt to print each byte as a value in 0..255

UTF-16 example:

```java
import java.nio.charset.Charset;

String s = "你好";
byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or ...
// byte[] b1 = s.getBytes("UTF-16");
for (byte b : b1) {
    // Convert the signed byte to 0..255 before formatting it as hex
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// fe ff 4f 60 59 7d
```

UTF-16 uses 2 bytes for each of these Chinese characters (both are below U+FFFF)

  • fe ff: BOM
  • 4f 60: 你
  • 59 7d: 好

UTF-8 example:

```java
byte[] b2 = s.getBytes(Charset.forName("UTF-8"));
for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// e4 bd a0 e5 a5 bd
```

UTF-8 uses 3 bytes for each of these Chinese characters

  • e4 bd a0: 你
  • e5 a5 bd: 好