Unicode
- Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
- Truth: code points range from `0x0` to `0x10FFFF` (2^20 + 2^16 = 1,114,112). Those 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points.
- Code points `U+D800` to `U+DFFF` are reserved for the high and low surrogates used to encode code point values greater than `U+FFFF`.
- The `U+` prefix means "Unicode" and the digits are hexadecimal. Example: `Hello` is `U+0048 U+0065 U+006C U+006C U+006F` (these are code points, not how the string is stored in memory).
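A small Python sketch of the points above (illustrative only): it prints the code points of `Hello` with `ord()` and re-derives the code-point counts.

```python
# Code points of "Hello" (same values as listed above).
for ch in "Hello":
    print(f"U+{ord(ch):04X}", end=" ")   # U+0048 U+0065 U+006C U+006C U+006F
print()

# Code-space arithmetic: 0x0..0x10FFFF is 2**20 + 2**16 code points.
total = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1         # U+D800..U+DFFF
print(total, total - surrogates)         # 1114112 1112064
```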
Unicode vs ASCII vs ISO-8859-1
| Encoding | Range | Code Points |
|---|---|---|
| ASCII | 7 bits | 128 |
| ISO-8859-1 (Latin-1) | 8 bits | 256 |
| Unicode | 0x0 -> 0x10FFFF | 1,114,112 |
Note: ASCII codes below 32 are called unprintable (control characters).
Unicode vs UTF-8/UTF-16/UTF-32
- Unicode: the code space (1,114,112 code points)
- UTF-8/UTF-16/UTF-32: the encoding methods
| Encoding | Variable or Fixed Length | Length |
|---|---|---|
| UTF-8 | Variable | one to four 8-bit units |
| UTF-16 | Variable | one or two 16-bit units |
| UTF-32 | Fixed | one 32-bit unit |
UTF-8
UTF-8 uses the following rules:
- If the code point is < 128, it’s represented by the corresponding byte value.
- If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
Example: `Hello` => `48 65 6C 6C 6F` (the same bytes as ASCII)
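A quick check (assumes Python 3.8+ for `bytes.hex()` with a separator): the UTF-8 bytes of an ASCII-only string are exactly its ASCII bytes.

```python
print("Hello".encode("utf-8").hex(" "))   # 48 65 6c 6c 6f
```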
- Variable-width encoding (one to four 8-bit units)
- One byte for any ASCII character; all ASCII characters have the same code values in UTF-8 and ASCII encoding
- 1 byte (7 bits): `U+0000` -> `U+007F`, `0xxxxxxx`
- 2 bytes (11 bits): `U+0080` -> `U+07FF`, `110xxxxx 10xxxxxx`
- 3 bytes (16 bits): `U+0800` -> `U+FFFF`, `1110xxxx 10xxxxxx 10xxxxxx`
- 4 bytes (21 bits): `U+10000` -> `U+10FFFF`, `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` (the bit pattern could reach `U+1FFFFF`, but Unicode ends at `U+10FFFF`)
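As a sketch of these bit patterns, the helper below (a hypothetical function, not from any library) packs a single code point into UTF-8 by hand and compares the result with Python's built-in encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the bit patterns listed above."""
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x4F60).hex(" "))      # e4 bd a0  (你, in the 3-byte range)
print("你".encode("utf-8").hex(" "))     # e4 bd a0  (built-in encoder agrees)
```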
UTF-16 (UCS-2)
- Variable-width encoding (one or two 16-bit units)
- One 16-bit unit (2 bytes, direct mapping): `U+0000` to `U+FFFF` (excluding `U+D800` to `U+DFFF`)
- Two 16-bit units (4 bytes): `U+010000` to `U+10FFFF`, also called supplementary characters (see the worked example after this list)
  - `0x010000` is subtracted from the code point, leaving a 20-bit number in the range `0..0x0FFFFF`.
  - The top ten bits (a number in the range `0..0x03FF`) are added to `0xD800` to give the first 16-bit code unit or high surrogate, which will be in the range `0xD800..0xDBFF`.
  - The low ten bits (also in the range `0..0x03FF`) are added to `0xDC00` to give the second 16-bit code unit or low surrogate, which will be in the range `0xDC00..0xDFFF`.
- The first 128 characters of Unicode are the ASCII characters (but in UTF-16 each takes a 16-bit unit, not a single byte)
- Need to figure out big-endian or little-endian byte order, hence the BOM
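A minimal Python sketch of the surrogate-pair computation above, using U+1F600 (😀) as the supplementary character (variable names are just for illustration):

```python
cp = 0x1F600                  # a code point above U+FFFF
v = cp - 0x10000              # 20-bit value in 0..0xFFFFF
high = 0xD800 + (v >> 10)     # top 10 bits -> high surrogate
low  = 0xDC00 + (v & 0x3FF)   # low 10 bits -> low surrogate
print(hex(high), hex(low))    # 0xd83d 0xde00

# Cross-check against the built-in big-endian UTF-16 encoder (no BOM):
print("\U0001F600".encode("utf-16-be").hex(" "))   # d8 3d de 00
```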
UTF-32 (UCS-4)
- UTF-32: a 32-bit, fixed-width encoding; every code point is stored as exactly one 32-bit unit
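For comparison, a quick Python check (big-endian, no BOM) showing that UTF-32 spends exactly four bytes per code point:

```python
print("A你\U0001F600".encode("utf-32-be").hex(" "))
# 00 00 00 41 00 00 4f 60 00 01 f6 00
```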
Usage
UTF-8 Everywhere Manifesto
By Language
Java
Java's internal string representation is UTF-16 (each `char` is one UTF-16 code unit); the default charset, however, is UTF-8.
UTF-8 is the default charset of the standard Java APIs since JDK 18. (In JDK 17 and earlier, the default charset is determined when the Java runtime starts, and it depends on the OS.)
Java's `byte` is signed, so its range is -128 to 127 (hence `Byte.toUnsignedInt` in the examples below).
UTF-16 example:
```java
// needs: import java.nio.charset.Charset;
String s = "你好";
byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or: byte[] b1 = s.getBytes("UTF-16");  // this overload throws UnsupportedEncodingException
for (byte b : b1) {
    // Byte.toUnsignedInt avoids sign extension of Java's signed byte
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// output: fe ff 4f 60 59 7d
```

UTF-16 uses 2 bytes for each Chinese character, plus a 2-byte BOM:

- `fe ff`: BOM (big-endian)
- `4f 60`: 你
- `59 7d`: 好
UTF-8 example:
```java
byte[] b2 = s.getBytes(Charset.forName("UTF-8"));
for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// output: e4 bd a0 e5 a5 bd
```

UTF-8 uses 3 bytes for each Chinese character (and no BOM here, unlike the UTF-16 output above):

- `e4 bd a0`: 你
- `e5 a5 bd`: 好
JavaScript
In Node.js, `response.write(chunk[, encoding])` takes an optional encoding argument, which is `'utf-8'` by default.
Python
Python Unicode HowTo
- `bytes.decode()`: bytes -> str
- `str.encode()`: str -> bytes
Examples:
```python
>>> a = "\u00a5123"
>>> a
'¥123'
>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'
>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
```
- big-endian BOM: `FE FF` (hexadecimal), 254 255 (decimal)
- little-endian BOM: `FF FE` (hexadecimal), 255 254 (decimal)
Example:
>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'
Use utf-16
with BOM
>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'
Or use default(LE
)
>>> b'`O}Y'.decode("utf-16")
'你好'
Use utf-16-le
and skip BOM
>>> b'`O}Y'.decode("utf-16-le")
'你好'
Use utf-16-be
will generate something wrong...
>>> b'`O}Y'.decode("utf-16-be")
'恏絙'
Go
Rune = Unicode code point. Go's `rune` type is an alias for `int32`.
In Go, a string is a sequence of bytes, not of runes.
HTTP/HTML
Setting the character encoding should be done in the `Content-Type` HTTP header, but it can also be set with the `<meta charset>` attribute.
Always include the character encoding! If no charset is set in the HTML, the browser will guess the encoding.
In the header:

```
Content-Type: text/plain; charset="UTF-8"
```
In HTML5, these are equivalent:

```html
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
```
In order for all browsers to recognize a `<meta charset>` declaration, it must be:

- within the `<head>` element,
- before any elements that contain text, such as the `<title>` element, AND
- within the first 512 bytes of your document, including the `DOCTYPE` and whitespace.