Unicode
Last Updated: 2022-04-03
- Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
- Truth: `0x0` -> `0x10FFFF` (2^20 + 2^16 = 1,114,112)
  - 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points
  - Code points `U+D800` to `U+DFFF` are reserved for the high and low surrogates used to encode code point values greater than `U+FFFF`
- The `U+` means "Unicode" and the numbers are hexadecimal
  - Hello: `U+0048 U+0065 U+006C U+006C U+006F` (these are code points, not how the text is stored in memory)
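As a quick check of the figures above, the following sketch prints the code points of `Hello` and recomputes the size of the code space. It relies only on Python's built-in `ord`; the variable names are illustrative.

```python
# Code points are abstract numbers; ord() returns the one for each character.
text = "Hello"
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(" ".join(code_points))

# Size of the code space, and what remains once the surrogate
# range U+D800..U+DFFF is set aside.
total = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
print(total, surrogates, total - surrogates)
```

None of this says anything about bytes yet: the same five code points can be stored in several different ways, depending on the encoding.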
Unicode vs ASCII vs ISO-8859-1
| | Range | Code Points |
|---|---|---|
| ASCII | 7 bits | 128 |
| ISO-8859-1 (Latin-1) | 8 bits | 256 |
| Unicode | 0x0 -> 0x10FFFF | 1,114,112 |
Note: ASCII codes below 32 are control characters, often called unprintable
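One way to see the three ranges in practice is to try encoding characters that fall inside and outside each one. This is a minimal sketch using Python's built-in `ascii` and `latin-1` codecs; `try_encode` is a hypothetical helper written for this example.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# U+0041 fits in ASCII, U+00E9 only in Latin-1, U+20AC in neither.
for ch in ["A", "\u00e9", "\u20ac"]:
    print(f"U+{ord(ch):04X}", try_encode(ch, "ascii"), try_encode(ch, "latin-1"))
```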
Unicode vs UTF-8/UTF-16/UTF-32
- Unicode: the code space (1,114,112 code points)
- UTF-8/UTF-16/UTF-32: the encoding method
| | Variable or Fixed Length | Length |
|---|---|---|
| UTF-8 | Variable | one to four 8-bit units |
| UTF-16 | Variable | one or two 16-bit units |
| UTF-32 | Fixed | one 32-bit unit |
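The table can be checked empirically by counting code units per character. The sketch below uses the `-be` codec variants so that Python does not prepend a byte order mark; `code_units` is a hypothetical helper name.

```python
def code_units(ch):
    """Return how many code units ch needs in UTF-8, UTF-16, and UTF-32."""
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    )

# One sample from each UTF-8 length class.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```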
UTF-8
UTF-8 uses the following rules:
- If the code point is < 128, it’s represented by the corresponding byte value.
- If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
Example: Hello => `48 65 6C 6C 6F` (the same as ASCII)
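Both rules can be observed with Python's built-in `utf-8` codec. This is a small illustration, not a specification: the first part shows the ASCII-compatible case, the second a code point >= 128.

```python
text = "Hello"
utf8 = text.encode("utf-8")
hex_bytes = " ".join(f"{b:02X}" for b in utf8)
print(hex_bytes)

# For pure ASCII text the UTF-8 bytes and the ASCII bytes are identical.
same_as_ascii = utf8 == text.encode("ascii")
print(same_as_ascii)

# A code point >= 128 becomes several bytes, each in the range 128..255.
multi = "\u00e9".encode("utf-8")
print([f"{b:02X}" for b in multi], all(128 <= b <= 255 for b in multi))
```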
- Variable-width encoding (one to four bytes, i.e. 8-bit units)
- One byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding
  - 1 byte (7 bits): `U+0000` -> `U+007F`, `0xxxxxxx`
  - 2 bytes (11 bits): `U+0080` -> `U+07FF`, `110xxxxx 10xxxxxx`
  - 3 bytes (16 bits): `U+0800` -> `U+FFFF`, `1110xxxx 10xxxxxx 10xxxxxx`
  - 4 bytes (21 bits): `U+10000` -> `U+10FFFF`, `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` (21 bits could reach `0x1FFFFF`, but Unicode ends at `U+10FFFF`)
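The bit patterns above can be turned into a small hand-written encoder. This is a sketch for illustration only (`utf8_encode` is a hypothetical name, and real code should use the built-in codec). It assumes surrogate code points should be rejected, as standard UTF-8 requires.

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 bytes using the bit patterns above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# Compare against Python's own codec for the euro sign, U+20AC.
print(utf8_encode(0x20AC).hex(), "\u20ac".encode("utf-8").hex())
```

The leading byte tells a decoder how many bytes follow, and every continuation byte starts with `10`, which is what lets a reader resynchronize in the middle of a stream.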
UTF-16 (supersedes UCS-2)
- Variable-width encoding (one or two 16-bit units)
  - one 16-bit unit (2 bytes, direct mapping): `U+0000` to `U+FFFF` (excluding `U+D800` to `U+DFFF`)
  - two 16-bit units (4 bytes): `U+010000` to `U+10FFFF`, also called supplementary characters
    - `0x010000` is subtracted from the code point, leaving a 20-bit number in the range `0..0x0FFFFF`.
    - The top ten bits (a number in the range `0..0x03FF`) are added to `0xD800` to give the first 16-bit code unit or high surrogate, which will be in the range `0xD800..0xDBFF`.
    - The low ten bits (also in the range `0..0x03FF`) are added to `0xDC00` to give the second 16-bit code unit or low surrogate, which will be in the range `0xDC00..0xDFFF`.
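- The first 128 Unicode code points are the ASCII characters, but UTF-16 stores each of them in a 16-bit unit, so UTF-16 text is not byte-compatible with ASCII
- The byte order must be known: big-endian (UTF-16BE) or little-endian (UTF-16LE), often signaled by a leading byte order mark (`U+FEFF`)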
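The surrogate arithmetic described above can be sketched in a few lines. `to_surrogates` is a hypothetical helper written for this example. The byte-order comparison at the end uses Python's built-in `utf-16-be` and `utf-16-le` codecs.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000               # 20-bit number, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)      # top ten bits
    low = 0xDC00 + (offset & 0x3FF)     # low ten bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# The same two code units, serialized in each byte order.
big = "\U0001F600".encode("utf-16-be")
little = "\U0001F600".encode("utf-16-le")
print(big.hex(" "), "|", little.hex(" "))
```

The two byte strings contain the same code units with the bytes of each unit swapped, which is why a reader has to know the byte order before decoding.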
UTF-32 (UCS-4)
- UTF-32 – a 32-bit, fixed-width encoding
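Because every code point occupies exactly one 32-bit unit, UTF-32 data can be indexed directly. A minimal sketch follows, using the `utf-32-be` codec so that no byte order mark is prepended. The sample string is arbitrary.

```python
text = "A\u20ac\U0001F600"          # one ASCII, one BMP, one supplementary character
data = text.encode("utf-32-be")

# Each 4-byte slice is simply the code point as a big-endian integer.
units = [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]
print(len(data), [f"U+{u:04X}" for u in units])
```

The trade-off is size: the ASCII character `A` takes four bytes here, compared with one byte in UTF-8.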