Computer Science - Unicode

Updated: 2019-01-27

Unicode

  • Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
  • Truth: 0x0 -> 0x10FFFF(2^20 + 2^16 = 1,114,112)

    • 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points
    • code points U+D800 to U+DFFF reserved for high and low surrogates used to encode code point values greater than U+FFFF
    • The U+ means "Unicode" and the numbers are hexadecimal
    • Hello: U+0048 U+0065 U+006C U+006C U+006F (This is code point, not how it is stored in memory)

Unicode vs ASCII vs ISO-8859-1

Range Code Points
ASCII 7 bits 128
ISO-8859-1(latin-1) 8 bits 256
UNICODE 0x0 -> 0x10FFFF 1,114,112

Note: ASCII codes below 32 were called unprintable

Unicode vs UTF-8/UTF-16/UTF-32

  • Unicode: the code space(1,114,112 code points)
  • UTF-8/UTF-16/UTF-32: the encoding method
Variable or Fixed Length Length
UTF-8 Variable one to four 8-bit units
UTF-16 Variable one or two 16-bit units
UTF-32 Fixed one 32-bit unit

UTF-8

UTF-8 uses the following rules:

  • If the code point is < 128, it’s represented by the corresponding byte value.
  • If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

Example: Hello => 48 65 6C 6C 6F(the same as ASCII)

  • Variable-width encoding(one to four bytes/8-bit unit)

    • one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding
    • 1 byte(7 bits): U+0000 -> U+007F, 0xxxxxxx
    • 2 bytes(11 bits): U+0080 -> U+07FF, 110xxxxx 10xxxxxx
    • 3 bytes(16 bits): U+0800 -> U+FFFF, 1110xxxx 10xxxxxx 10xxxxxx
    • 4 bytes(21 bits): U+10000 -> U+1FFFFF, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-16(UCS-2)

  • Variable-width encoding(one or two 16-bit unit)

    • one 16-bit unit(2 bytes, direct mapping): U+0000 to U+FFFF(excluding U+D800 to U+DFFF)
    • two 16-bit units(4 bytes): U+010000 to U+10FFFF, also called supplementary characters
    • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
    • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
    • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
  • The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters
  • Need to figure out high-endian or low-endian

UTF-32(UCS-4)

  • UTF-32 – a 32-bit, fixed-width encoding

Usage

UTF-8 Everywhere Manifesto

http://www.utf8everywhere.org

In Java

Java's internal encoding is UTF-16, however the default encoding is UTF-8.

In HTTP/HTML

Setting the character encoding should be done in the Content-Type http header, but can also be set with the <meta charset> attribute

Always Include the Character Encoding! If charset is not set in HTML, browser will guess the encoding

In header:

Content-Type: text/plain; charset="UTF-8"

In html5, these are equivalent:

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

In order for all browsers to recognize a <meta charset> declaration, it must be

  • Within the <head> element,
  • Before any elements that contain text, such as the <title> element, AND
  • Within the first 512 bytes of your document, including DOCTYPE and whitespace

In Node.js

http.ServerResponse

response.write(chunk, [encoding]);

where encoding is 'utf-8' by default

In Python

Python Unicode HowTo

  • bytes.decode()
  • str.encode()

Examples

>>> a="\u00a5123"
>>> a
'¥123'

>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'

>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'

Byte Order Mark(BOM)

  • big-endian: FE FF(hexadecimal) 254 255(decimal)
  • little-endian: FF FE(hexadecimal) 255 254(decimal)

Example:

>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'

Use utf-16 with BOM

>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'

Or use default(LE)

>>> b'`O}Y'.decode("utf-16")
'你好'

Use utf-16-le and skip BOM

>>> b'`O}Y'.decode("utf-16-le")
'你好'

Use utf-16-be will generate something wrong...

>>> b'`O}Y'.decode("utf-16-be")
'恏絙'

Java byte is signed, thus you have a range between -128 and 127

UTF-16 example:

String s = "你好";

byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or ...
// byte[] b1 = s.getBytes("UTF-8");

for (byte b : b1) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// fe ff 4f 60 59 7d

UTF-16 uses 2 bytes for each Chinese character

  • fe ff: BOM
  • 4f 60: 你
  • 59 7d: 好

UTF-8 example:

byte[] b2 = s.getBytes(Charset.forName("UTF-8"));

for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
//e4 bd a0 e5 a5 bd

UTF-8 uses 3 bytes for each Chinese character

  • e4 bd a0: 你
  • e5 a5 bd: 好