
Unicode

  • Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
  • Truth: code points run from 0x0 to 0x10FFFF (2^20 + 2^16 = 1,114,112)
    • 1,114,112 code points = 1,112,064 Unicode scalar values + 2,048 surrogate code points
    • code points U+D800 to U+DFFF are reserved for the high and low surrogates that UTF-16 uses to encode code points above U+FFFF
    • The "U+" prefix marks a Unicode code point; the digits are hexadecimal
    • Hello: U+0048 U+0065 U+006C U+006C U+006F (these are code points, not how the string is stored in memory)
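Those code points can be checked in Python, where ord() maps a character to its code point and chr() maps back:

```python
# ord() gives a character's code point; chr() is the inverse.
for ch in "Hello":
    print(f"U+{ord(ch):04X}", end=" ")
print()
# U+0048 U+0065 U+006C U+006C U+006F
```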

Unicode vs ASCII vs ISO-8859-1

                      Range             Code Points
ASCII                 7 bits            128
ISO-8859-1 (Latin-1)  8 bits            256
Unicode               0x0 -> 0x10FFFF   1,114,112

Note: ASCII codes below 32 are control characters (non-printable)

Unicode vs UTF-8/UTF-16/UTF-32

  • Unicode: the code space(1,114,112 code points)
  • UTF-8/UTF-16/UTF-32: the encoding method
         Variable or Fixed Length   Length
UTF-8    Variable                   one to four 8-bit units
UTF-16   Variable                   one or two 16-bit units
UTF-32   Fixed                      one 32-bit unit
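The size trade-offs in the table can be observed in Python; note that Python's utf-16 and utf-32 codecs prepend a BOM:

```python
s = "你好"  # two BMP code points (U+4F60, U+597D)

print(len(s.encode("utf-8")))     # 3 bytes per character here: 6
print(len(s.encode("utf-16")))    # 2-byte BOM + 2 bytes each: 6
print(len(s.encode("utf-32")))    # 4-byte BOM + 4 bytes each: 12
print(len(s.encode("utf-32-le"))) # no BOM: 8
```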

UTF-8

UTF-8 uses the following rules:

  • If the code point is < 128, it’s represented by the corresponding byte value.
  • If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

Example: Hello => 48 65 6C 6C 6F(the same as ASCII)

  • Variable-width encoding (one to four 8-bit units)
    • one byte for any ASCII character; ASCII characters have the same code values in both UTF-8 and ASCII
    • 1 byte (7 bits): U+0000 -> U+007F, 0xxxxxxx
    • 2 bytes (11 bits): U+0080 -> U+07FF, 110xxxxx 10xxxxxx
    • 3 bytes (16 bits): U+0800 -> U+FFFF, 1110xxxx 10xxxxxx 10xxxxxx
    • 4 bytes (21 bits): U+10000 -> U+10FFFF (the bit pattern could reach U+1FFFFF, but Unicode ends at U+10FFFF), 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
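The bit patterns above can be verified in Python; "€" is U+20AC, which lands in the 3-byte range U+0800 -> U+FFFF:

```python
# U+20AC = 0b0010000010101100; its 16 bits fill the x's of
# 1110xxxx 10xxxxxx 10xxxxxx.
b = "€".encode("utf-8")
print([f"{x:08b}" for x in b])
# ['11100010', '10000010', '10101100']  i.e. e2 82 ac
```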

UTF-16 (successor to the fixed-width UCS-2, which covered only the BMP)

  • Variable-width encoding (one or two 16-bit units)
    • one 16-bit unit(2 bytes, direct mapping): U+0000 to U+FFFF(excluding U+D800 to U+DFFF)
    • two 16-bit units(4 bytes): U+010000 to U+10FFFF, also called supplementary characters
      • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
      • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
      • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
  • The first 128 code points are the ASCII characters, but UTF-16 still stores each of them in one 16-bit unit
  • Need to figure out the byte order: big-endian or little-endian
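The surrogate arithmetic above, sketched in Python for U+1F600 (an emoji outside the BMP):

```python
cp = 0x1F600                # 😀, greater than U+FFFF
v = cp - 0x10000            # 20-bit value 0x0F600
high = 0xD800 + (v >> 10)   # top 10 bits -> 0xD83D (high surrogate)
low = 0xDC00 + (v & 0x3FF)  # low 10 bits -> 0xDE00 (low surrogate)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Python's codec agrees:
print(chr(cp).encode("utf-16-be").hex())  # d83dde00
```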

UTF-32 (UCS-4)

  • UTF-32 is a fixed-width encoding: every code point is stored in exactly one 32-bit unit

Usage

UTF-8 Everywhere Manifesto

http://www.utf8everywhere.org

By Language

Java

Java strings use UTF-16 internally; the default charset (used when converting between bytes and strings, e.g. in file I/O) is now UTF-8.

UTF-8 is the default charset of the standard Java APIs since JDK 18. (In JDK 17 and earlier, the default charset is determined when the Java runtime starts, and it depends on the OS.)

Java's byte type is signed, so its range is -128 to 127 (hence Byte.toUnsignedInt in the examples below).

UTF-16 example:

String s = "你好";

byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or ...
// byte[] b1 = s.getBytes("UTF-16");

for (byte b : b1) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// fe ff 4f 60 59 7d

UTF-16 uses 2 bytes for each of these Chinese characters (both are in the BMP), plus a 2-byte BOM

  • fe ff: BOM
  • 4f 60: 你
  • 59 7d: 好

UTF-8 example:

byte[] b2 = s.getBytes(Charset.forName("UTF-8"));

for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
//e4 bd a0 e5 a5 bd

UTF-8 uses 3 bytes for each of these Chinese characters (code points in the U+0800 -> U+FFFF range)

  • e4 bd a0: 你
  • e5 a5 bd: 好

JavaScript (Node.js)

response.write(chunk[, encoding][, callback]);

where encoding defaults to 'utf8' for string chunks.

Python

Python Unicode HowTo

  • bytes.decode()
  • str.encode()

Examples:

>>> a="\u00a5123"
>>> a
'¥123'

>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'

>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'

Byte Order Mark (BOM)

These are the UTF-16 BOMs (the UTF-8 BOM is EF BB BF):

  • big-endian: FE FF (hexadecimal), 254 255 (decimal)
  • little-endian: FF FE (hexadecimal), 255 254 (decimal)

Example:

>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'

Use utf-16 with BOM

>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'

Or rely on the decoder's default: with no BOM, the utf-16 codec falls back to the platform's native byte order (little-endian here)

>>> b'`O}Y'.decode("utf-16")
'你好'

Use utf-16-le and skip BOM

>>> b'`O}Y'.decode("utf-16-le")
'你好'

Using utf-16-be on these little-endian bytes decodes with the wrong byte order and produces the wrong characters:

>>> b'`O}Y'.decode("utf-16-be")
'恏絙'

Go

rune = Unicode code point. The rune type is an alias for int32.

In Go, a string is a sequence of bytes, not of runes.

HTTP/HTML

Setting the character encoding should be done in the Content-Type HTTP header, but it can also be set with a <meta charset> declaration in the HTML

Always include the character encoding! If no charset is declared, the browser has to guess the encoding

In header:

Content-Type: text/plain; charset="UTF-8"

In HTML5, these are equivalent:

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

In order for all browsers to recognize a <meta charset> declaration, it must be

  • Within the <head> element,
  • Before any elements that contain text, such as the <title> element, AND
  • Within the first 1024 bytes of your document, including DOCTYPE and whitespace (the HTML spec requires 1024 bytes; older guidance said 512)