Unicode

Last Updated: 2022-04-03

Unicode

  • Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
  • Truth: code points range from 0x0 to 0x10FFFF (2^20 + 2^16 = 1,114,112 code points)
    • 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points
    • Code points U+D800 to U+DFFF are reserved for the high and low surrogates, which UTF-16 uses to encode code point values greater than U+FFFF
    • The "U+" prefix means "Unicode", and the digits that follow are hexadecimal
    • Hello: U+0048 U+0065 U+006C U+006C U+006F (these are code points, not how the text is stored in memory)
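
The arithmetic and the "Hello" example above can be checked with a short Python sketch. The names CODE_SPACE, SURROGATES, and code_points are made up for illustration; only the built-in ord is a real Python function.

```python
# Size of the Unicode code space and of the surrogate block (illustrative names).
CODE_SPACE = 0x10FFFF + 1          # code points 0x0 through 0x10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800 through U+DFFF


def code_points(text):
    """Return the U+XXXX notation for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(CODE_SPACE, SURROGATES, CODE_SPACE - SURROGATES)  # 1114112 2048 1112064
print(code_points("Hello"))
```

The list printed for "Hello" holds abstract code points; how they become bytes depends on the encoding chosen later.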

Unicode vs ASCII vs ISO-8859-1

Character set          Range              Code Points
ASCII                  7 bits             128
ISO-8859-1 (Latin-1)   8 bits             256
Unicode                0x0 -> 0x10FFFF    1,114,112

Note: ASCII codes below 32 are control characters and are called unprintable (code 127, DEL, is also a control character)
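
As a rough illustration of how the three ranges nest, the sketch below asks Python's built-in codecs whether a character fits in each one. The helper name fits and the sample characters are assumptions chosen for this example.

```python
def fits(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A (U+0041), e with acute accent (U+00E9), euro sign (U+20AC)
for ch in ["A", "\u00e9", "\u20ac"]:
    print(f"U+{ord(ch):04X}", fits(ch, "ascii"), fits(ch, "latin-1"), fits(ch, "utf-8"))

# ASCII codes that Python reports as non-printable: 0..31 plus 127 (DEL).
unprintable = [c for c in range(128) if not chr(c).isprintable()]
print(unprintable == list(range(32)) + [127])  # True
```

U+0041 fits everywhere, U+00E9 needs at least Latin-1, and U+20AC lies outside both 7-bit and 8-bit ranges.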

Unicode vs UTF-8/UTF-16/UTF-32

  • Unicode: the code space(1,114,112 code points)
  • UTF-8/UTF-16/UTF-32: the encoding method

Encoding   Variable or Fixed Length   Length
UTF-8      Variable                   one to four 8-bit units
UTF-16     Variable                   one or two 16-bit units
UTF-32     Fixed                      one 32-bit unit
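
To make the difference between code space and encoding concrete, this sketch counts how many code units one character needs in each encoding. The function name unit_counts is hypothetical; the codec names "utf-16-le" and "utf-32-le" are used so that Python does not prepend a byte order mark to the result.

```python
def unit_counts(ch):
    """Return the number of code units ch needs in UTF-8, UTF-16, and UTF-32."""
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    )


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
# U+0041 (1, 1, 1)
# U+00E9 (2, 1, 1)
# U+20AC (3, 1, 1)
# U+1F600 (4, 2, 1)
```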

UTF-8

UTF-8 uses the following rules:

  • If the code point is < 128, it’s represented by the corresponding byte value.
  • If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

Example: Hello => 48 65 6C 6C 6F (hexadecimal bytes, the same as in ASCII)

  • Variable-width encoding(one to four bytes/8-bit unit)
    • one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding
    • 1 byte(7 bits): U+0000 -> U+007F, 0xxxxxxx
    • 2 bytes(11 bits): U+0080 -> U+07FF, 110xxxxx 10xxxxxx
    • 3 bytes(16 bits): U+0800 -> U+FFFF, 1110xxxx 10xxxxxx 10xxxxxx
    • 4 bytes(21 bits): U+10000 -> U+10FFFF, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (the pattern has room for values up to 0x1FFFFF, but Unicode stops at U+10FFFF)
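
The bit patterns listed above can be turned into a small hand-written encoder. This is a minimal sketch for illustration, not how any particular library implements UTF-8; the function name utf8_encode is made up, and the sketch assumes the input is capped at U+10FFFF (the top of the Unicode code space) and excludes the surrogate range.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes using the 1- to 4-byte bit patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x48).hex())     # 48
print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x1F600).hex())  # f09f9880
```

Comparing the result against Python's built-in `chr(cp).encode("utf-8")` is a quick way to sanity-check the sketch.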

UTF-16 (successor to the fixed-width UCS-2)

  • Variable-width encoding(one or two 16-bit units)
    • one 16-bit unit(2 bytes, direct mapping): U+0000 to U+FFFF(excluding U+D800 to U+DFFF)
    • two 16-bit units(4 bytes): U+010000 to U+10FFFF, also called supplementary characters
      • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
      • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
      • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
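  • The first 128 Unicode code points are the ASCII characters, but UTF-16 stores each of them as a 16-bit unit, so UTF-16 text is not byte-compatible with ASCII
  • Byte order matters: the 16-bit units are stored either big-endian (UTF-16BE) or little-endian (UTF-16LE), often signaled by a byte order mark (U+FEFF) at the start of the data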
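
The surrogate arithmetic described above, together with the byte-order question, can be sketched as follows. The names utf16_units and units_to_bytes are hypothetical helpers written for this note; the results are compared against Python's built-in "utf-16-be" and "utf-16-le" codecs.

```python
def utf16_units(cp):
    """Return the list of 16-bit code units for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]                  # direct mapping, one unit
    v = cp - 0x10000                 # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (v >> 10)        # top ten bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # low ten bits -> low surrogate
    return [high, low]


def units_to_bytes(units, byteorder):
    """Serialize 16-bit units as bytes; byteorder is 'big' or 'little'."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)


units = utf16_units(0x1F600)
print([f"{u:04X}" for u in units])          # ['D83D', 'DE00']
print(units_to_bytes(units, "big").hex())   # d83dde00
print(units_to_bytes(units, "little").hex())  # 3dd800de
```

The same two code units produce different byte sequences depending on byte order, which is why a reader needs either a declared byte order or a byte order mark.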

UTF-32 (UCS-4)

  • UTF-32 – a 32-bit, fixed-width encoding
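
One practical consequence of a fixed width is that the n-th code point sits at a predictable byte offset. The sketch below assumes little-endian UTF-32 without a byte order mark; code_point_at is a made-up helper name.

```python
def code_point_at(data, index):
    """Read the code point at position index from little-endian UTF-32 bytes."""
    return int.from_bytes(data[4 * index : 4 * index + 4], "little")


data = "Hi\U0001F600".encode("utf-32-le")
print(len(data))                    # 12: three code points, four bytes each
print(hex(code_point_at(data, 2)))  # 0x1f600
```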

Usage

UTF-8 Everywhere Manifesto

http://www.utf8everywhere.org