Polyglot CheatSheet - Unicode

Last Updated: 2022-04-03

Java

Java strings are UTF-16 internally, whereas the default charset used by the standard APIs for encoding and decoding is UTF-8.

UTF-8 is the default charset of the standard Java APIs since JDK 18. (In JDK 17 and earlier, the default charset is determined when the Java runtime starts, and it depends on the OS.)

A Java byte is signed, so its range is -128 to 127. Use Byte.toUnsignedInt(b) to get the 0 to 255 value, for example when printing hex.

UTF-16 example:

String s = "你好";

byte[] b1 = s.getBytes(Charset.forName("UTF-16"));
// or ...
// byte[] b1 = s.getBytes("UTF-16"); // this overload throws UnsupportedEncodingException

for (byte b : b1) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// fe ff 4f 60 59 7d

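UTF-16 uses 2 bytes for each of these Chinese characters (characters outside the Basic Multilingual Plane take 4 bytes). Java's "UTF-16" charset writes big-endian with a leading BOM: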

  • fe ff: BOM
  • 4f 60: 你
  • 59 7d: 好

UTF-8 example:

byte[] b2 = s.getBytes(Charset.forName("UTF-8"));

for (byte b : b2) {
    System.out.print(Integer.toHexString(Byte.toUnsignedInt(b)) + " ");
}
System.out.println();
// e4 bd a0 e5 a5 bd

UTF-8 uses 3 bytes for each of these Chinese characters (characters outside the Basic Multilingual Plane take 4 bytes), and no BOM is written:

  • e4 bd a0: 你
  • e5 a5 bd: 好

JavaScript

response.write(chunk, [encoding]);

where encoding is 'utf-8' by default (Node.js http.ServerResponse).

Python

Python Unicode HowTo

  • bytes.decode()
  • str.encode()

Examples:

>>> a="\u00a5123"
>>> a
'¥123'

>>> "\u00a5".encode("utf-8")
b'\xc2\xa5'

>>> "Hello".encode("utf-8")
b'Hello'
>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
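The outputs above suggest that the number of UTF-8 bytes depends on the character. A minimal sketch of that, where `utf8_lengths` and `sample` are hypothetical names chosen for illustration:

```python
def utf8_lengths(text: str) -> list:
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


sample = "A¥你😀"
lengths = utf8_lengths(sample)
print(lengths)
# [('A', 1), ('¥', 2), ('你', 3), ('😀', 4)]

# Round trip: decoding the encoded bytes gives back the original string.
round_trip = sample.encode("utf-8").decode("utf-8")
print(round_trip == sample)
# True
```

ASCII takes 1 byte, ¥ (U+00A5) takes 2, 你 takes 3, and an emoji outside the Basic Multilingual Plane takes 4.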

Byte Order Mark (BOM)

  • big-endian: FE FF (hexadecimal), 254 255 (decimal)
  • little-endian: FF FE (hexadecimal), 255 254 (decimal)
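The two byte sequences listed above are available as constants in Python's standard `codecs` module. The helper below is a rough sketch (`detect_utf16_bom` is a hypothetical name) that only looks at the first two bytes; it does not try to rule out a UTF-32 little-endian BOM, which also starts with FF FE.

```python
import codecs


def detect_utf16_bom(data: bytes):
    """Return 'utf-16-be', 'utf-16-le', or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    return None  # no BOM: the byte order must be known some other way


print(detect_utf16_bom(b"\xfe\xff\x4f\x60\x59\x7d"))  # utf-16-be
print(detect_utf16_bom(b"\xff\xfe\x60\x4f\x7d\x59"))  # utf-16-le
print(detect_utf16_bom(b"\x60\x4f\x7d\x59"))          # None
```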

Example:

>>> "你好".encode("utf-16")
b'\xff\xfe`O}Y'

Decode with utf-16 when the bytes start with a BOM (the BOM selects the byte order and is stripped):

>>> b'\xff\xfe`O}Y'.decode("utf-16")
'你好'

Or omit the BOM and rely on the default, which is the platform's native byte order (little-endian on most machines):

>>> b'`O}Y'.decode("utf-16")
'你好'

Use utf-16-le when the bytes are little-endian and carry no BOM:

>>> b'`O}Y'.decode("utf-16-le")
'你好'

Decoding the same little-endian bytes with utf-16-be produces the wrong characters:

>>> b'`O}Y'.decode("utf-16-be")
'恏絙'
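One more detail that follows from the examples above: the explicit-endian codecs neither write nor strip a BOM. A small sketch, with variable names chosen only for illustration:

```python
import codecs

text = "你好"

le_bytes = text.encode("utf-16-le")  # explicit byte order, no BOM written
be_bytes = text.encode("utf-16-be")
print(le_bytes)  # b'`O}Y'
print(be_bytes)  # b'O`Y}'

# Prepend a little-endian BOM by hand.
with_bom = codecs.BOM_UTF16_LE + le_bytes

# "utf-16" reads the BOM to pick the byte order, then drops it.
decoded_generic = with_bom.decode("utf-16")
print(repr(decoded_generic))  # '你好'

# "utf-16-le" treats the BOM as an ordinary character (U+FEFF) and keeps it.
decoded_explicit = with_bom.decode("utf-16-le")
print(repr(decoded_explicit))  # '\ufeff你好'
```

So when the bytes may start with a BOM, decoding with utf-16 avoids a stray U+FEFF at the start of the string.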

HTTP/HTML

The character encoding should be set in the Content-Type HTTP header, but it can also be set with the charset attribute of a <meta> element.

Always include the character encoding! If no charset is declared, the browser will guess the encoding.

In header:

Content-Type: text/plain; charset="UTF-8"
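On the receiving side, the charset parameter can be read back from such a header value. The sketch below uses the standard library's `email.message.Message` to do the parsing; `charset_from_content_type` and its `default` fallback are hypothetical names and choices, not part of any HTTP library.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Return the lower-cased charset parameter, or `default` if none is given."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() strips quotes and lower-cases the value;
    # it returns None when the header has no charset parameter.
    return msg.get_content_charset() or default


print(charset_from_content_type('text/plain; charset="UTF-8"'))     # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1"))   # iso-8859-1
print(charset_from_content_type("text/html"))                       # utf-8 (the assumed fallback)
```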

In HTML5, these are equivalent:

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

In order for all browsers to recognize a <meta charset> declaration, it must be

  • Within the <head> element,
  • Before any elements that contain text, such as the <title> element, AND
  • Within the first 1024 bytes of your document, including DOCTYPE and whitespace (the limit in the current HTML Standard; early HTML5 drafts said 512)
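The byte-limit rule can be checked mechanically. The following is a rough heuristic sketch, not an HTML parser: `META_CHARSET_RE` and `meta_charset_within` are hypothetical names, the regular expression only covers the two <meta> forms shown above, and the byte limit is passed in by the caller.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form with "charset=..." in content.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def meta_charset_within(html: bytes, limit: int):
    """Return the declared charset if the <meta> tag closes within the first
    `limit` bytes of the document, otherwise None."""
    match = META_CHARSET_RE.search(html)
    if match is None:
        return None
    tag_end = html.find(b">", match.end())  # index of the tag's closing '>'
    if tag_end == -1 or tag_end >= limit:
        return None
    return match.group(1).decode("ascii").lower()


doc = b'<!DOCTYPE html><html><head><meta charset="utf-8" /><title>t</title></head></html>'
print(meta_charset_within(doc, limit=len(doc)))  # utf-8
print(meta_charset_within(doc, limit=40))        # None: the tag ends after byte 40
```

Leading whitespace counts toward the limit, so a document padded with blank lines can push an otherwise valid declaration out of range.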