Polyglot CheatSheet - Unicode
Java
Java strings are UTF-16 internally, but the default charset of the standard Java APIs has been UTF-8 since JDK 18. (In JDK 17 and earlier, the default charset is determined when the Java runtime starts, and it depends on the OS.)
A Java byte is signed, so its values range from -128 to 127.
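To see why the Java examples in this section call Byte.toUnsignedInt before printing, here is a minimal sketch in Python (used only for illustration) that reads the same byte once as unsigned and once as signed. The variable names are hypothetical.

```python
# The first UTF-8 byte of "你" is 0xE4.
raw = "你".encode("utf-8")[:1]

unsigned = int.from_bytes(raw, "big", signed=False)  # value as an unsigned byte
signed = int.from_bytes(raw, "big", signed=True)     # value as a signed (Java-style) byte

print(unsigned, signed)    # 228 -28
print(hex(signed & 0xFF))  # 0xe4, the same masking that Byte.toUnsignedInt applies
```

Without the unsigned conversion, Integer.toHexString would sign-extend the negative value and print ffffffe4 instead of e4.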
UTF-16 example:
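import java.nio.charset.Charset;

String s = "你好";
byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or, by charset name (throws UnsupportedEncodingException):
// byte[] b1 = s.getBytes("UTF-16");
for (byte b : b1) {
    // toUnsignedInt avoids sign extension for bytes above 0x7f
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// fe ff 4f 60 59 7d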
UTF-16 uses 2 bytes for each of these Chinese characters:
- fe ff: BOM
- 4f 60: 你
- 59 7d: 好
UTF-8 example:
byte[] b2 = s.getBytes(Charset.forName("UTF-8"));
for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// e4 bd a0 e5 a5 bd
UTF-8 uses 3 bytes for each of these Chinese characters:
- e4 bd a0: 你
- e5 a5 bd: 好
JavaScript
response.write(chunk, [encoding]);
where encoding is 'utf-8' by default (Node.js http.ServerResponse).
Python
Python Unicode HowTo
bytes.decode()
str.encode()
Examples:
>>> a="\u00a5123"
>>> a
'¥123'
>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'
>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
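The three encode() results above differ in how many bytes each character takes. A minimal sketch that makes the count explicit, assuming a hypothetical helper name utf8_lengths:

```python
def utf8_lengths(text):
    """Return (character, UTF-8 byte count) pairs for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

# ASCII takes 1 byte, the yen sign 2 bytes, and 你 3 bytes.
print(utf8_lengths("H\u00a5你"))  # [('H', 1), ('¥', 2), ('你', 3)]
```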
UTF-16 byte order mark (BOM):
- big-endian: FE FF (hexadecimal), 254 255 (decimal)
- little-endian: FF FE (hexadecimal), 255 254 (decimal)
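Both byte sequences are available as constants in Python's standard codecs module. A short sketch that prints them in decimal and checks that each is an encoding of the same code point, U+FEFF (the variable name bom_char is illustrative):

```python
import codecs

# The codecs module exposes both UTF-16 byte order marks as bytes constants.
print(list(codecs.BOM_UTF16_BE))  # [254, 255]
print(list(codecs.BOM_UTF16_LE))  # [255, 254]

# Each BOM is U+FEFF encoded in the corresponding byte order.
bom_char = "\ufeff"
print(bom_char.encode("utf-16-be") == codecs.BOM_UTF16_BE)  # True
print(bom_char.encode("utf-16-le") == codecs.BOM_UTF16_LE)  # True
```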
Example:
>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'
Use utf-16 with BOM:
>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'
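Or omit the BOM and rely on the default byte order (LE here; without a BOM, Python falls back to the platform's native order):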
>>> b'`O}Y'.decode("utf-16")
'你好'
Use utf-16-le and skip the BOM:
>>> b'`O}Y'.decode("utf-16-le")
'你好'
Using utf-16-be on the same bytes produces the wrong characters:
>>> b'`O}Y'.decode("utf-16-be")
'恏絙'
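To avoid guessing the byte order, one option is to check for a BOM explicitly and only fall back to a fixed order when none is present. The following is a minimal sketch under that assumption; decode_utf16 and its fallback parameter are hypothetical names, not part of the standard library.

```python
import codecs

def decode_utf16(data, fallback="utf-16-le"):
    """Decode UTF-16 bytes, honoring a BOM if present, else using `fallback`."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(fallback)

print(decode_utf16(b"\xff\xfe`O}Y"))  # 你好 (little-endian BOM)
print(decode_utf16(b"\xfe\xffO`Y}"))  # 你好 (big-endian BOM)
print(decode_utf16(b"`O}Y"))          # 你好 (no BOM, fallback to little-endian)
```

Making the fallback an explicit argument keeps the result independent of the machine's native byte order.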
HTTP/HTML
The character encoding should be set in the Content-Type HTTP header, but it can also be set with the charset attribute of a <meta> element.
Always include the character encoding! If no charset is declared, the browser will guess the encoding.
In header:
Content-Type: text/plain; charset="UTF-8"
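On the receiving side, the charset parameter can be read back out of the header value. A minimal sketch using the standard library's email.message.Message, which parses MIME-style headers; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() strips quotes and lowercases the value.
    return msg.get_content_charset(default)

print(charset_from_content_type('text/plain; charset="UTF-8"'))  # utf-8
print(charset_from_content_type("text/html; charset=utf-8"))     # utf-8
print(charset_from_content_type("text/html"))                    # None
```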
In HTML5, these are equivalent:
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
In order for all browsers to recognize a <meta charset> declaration, it must be:
- Within the <head> element,
- Before any elements that contain text, such as the <title> element, AND
- Within the first 512 bytes of your document, including DOCTYPE and whitespace
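The byte-limit rule above can be checked mechanically. The sketch below is a simplification, not a full HTML parser: it only looks for the short <meta charset=...> form with a regular expression and tests whether the tag closes within the limit. The names META_CHARSET and meta_charset_in_prefix are hypothetical.

```python
import re

# Matches the short <meta charset=...> form in raw bytes (not the http-equiv form).
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def meta_charset_in_prefix(html_bytes, limit=512):
    """Return the declared charset if the <meta charset> tag ends within `limit` bytes."""
    match = META_CHARSET.search(html_bytes)
    if match is None:
        return None
    end_of_tag = html_bytes.find(b">", match.end())
    if end_of_tag == -1 or end_of_tag >= limit:
        return None
    return match.group(1).decode("ascii").lower()

good = b'<!DOCTYPE html><html><head><meta charset="utf-8" /><title>x</title></head></html>'
late = b"<!DOCTYPE html><html><head>" + b" " * 600 + b'<meta charset="utf-8" /></head></html>'

print(meta_charset_in_prefix(good))  # utf-8
print(meta_charset_in_prefix(late))  # None
```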