# Unicode

Updated: 2022-04-03

## Unicode

• Myth: Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters.
• Truth: code points range from 0x0 to 0x10FFFF (2^20 + 2^16 = 1,114,112 values)
• 1,114,112 code points = 1,112,064 valid code points + 2,048 surrogate code points
• Code points U+D800 to U+DFFF are reserved for the high and low surrogates that UTF-16 uses to encode code point values greater than U+FFFF
• The U+ prefix means "Unicode" and the digits that follow are hexadecimal
• Hello: U+0048 U+0065 U+006C U+006C U+006F (these are code points, not how the text is stored in memory)
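The numbers above can be checked with a few lines of Python. This is a minimal sketch; the helper name `code_points` and the constants are illustrative and not part of any library.

```python
# Hypothetical helper: list the U+XXXX notation for each code point in a string.
def code_points(text):
    return [f"U+{ord(ch):04X}" for ch in text]

# Size of the code space and of the surrogate block, from the ranges above.
TOTAL = 0x10FFFF + 1               # code points 0x0 .. 0x10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800 .. U+DFFF

print(code_points("Hello"))
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
print(TOTAL, SURROGATES, TOTAL - SURROGATES)
# 1114112 2048 1112064
```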

## Unicode vs ASCII vs ISO-8859-1

| Encoding | Range | Code points |
| --- | --- | --- |
| ASCII | 7 bits | 128 |
| ISO-8859-1 (Latin-1) | 8 bits | 256 |
| Unicode | 0x0 -> 0x10FFFF | 1,114,112 |

Note: ASCII codes below 32 are control characters, traditionally called unprintable.
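One way to see the difference in coverage is to try encoding the same text with each character set. The sketch below uses Python's built-in `ascii` and `latin-1` codecs; the helper name `fits` is an assumption made for illustration.

```python
# Hypothetical helper: can every character in `text` be represented in `encoding`?
def fits(text, encoding):
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for sample in ["Hello", "café", "€"]:
    print(sample, fits(sample, "ascii"), fits(sample, "latin-1"))
# Hello True True
# café False True
# € False False
```

The euro sign (U+20AC) lies outside the 256 values of ISO-8859-1, so only a Unicode encoding can represent it.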

## Unicode vs UTF-8/UTF-16/UTF-32

• Unicode: the code space (1,114,112 code points)
• UTF-8/UTF-16/UTF-32: the encoding method

| Encoding | Variable or fixed length | Length |
| --- | --- | --- |
| UTF-8 | Variable | one to four 8-bit units |
| UTF-16 | Variable | one or two 16-bit units |
| UTF-32 | Fixed | one 32-bit unit |
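To make the table concrete, the following sketch encodes a few strings with Python's built-in codecs and reports the byte counts. The `-le` codec variants are used so that no byte order mark is added to the output; the helper name `encoded_sizes` is illustrative.

```python
# Hypothetical helper: byte length of `text` in each of the three encodings.
def encoded_sizes(text):
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for sample in ["Hello", "é", "€", "😀"]:
    print(repr(sample), encoded_sizes(sample))
# 'Hello' {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# 'é' {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# '€' {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# '😀' {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```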

### UTF-8

UTF-8 uses the following rules:

• If the code point is < 128, it’s represented by the corresponding byte value.
• If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

Example: Hello => 48 65 6C 6C 6F in hexadecimal (the same bytes as ASCII)

• Variable-width encoding (one to four bytes, i.e. 8-bit units)
• One byte for any ASCII character; ASCII characters have the same code values in both UTF-8 and ASCII
• 1 byte (7 bits): U+0000 -> U+007F, 0xxxxxxx
• 2 bytes (11 bits): U+0080 -> U+07FF, 110xxxxx 10xxxxxx
• 3 bytes (16 bits): U+0800 -> U+FFFF, 1110xxxx 10xxxxxx 10xxxxxx
• 4 bytes (21 bits): U+10000 -> U+10FFFF, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (the 21-bit pattern could reach U+1FFFFF, but Unicode ends at U+10FFFF)
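The bit layouts above translate fairly directly into code. The following is a minimal sketch of an encoder for a single code point, written for illustration only; in practice `str.encode("utf-8")` does this. The function name `utf8_encode` is an assumption.

```python
# Hypothetical helper: encode one code point to UTF-8 using the bit patterns above.
def utf8_encode(cp):
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = b"".join(utf8_encode(ord(ch)) for ch in "Hello")
print(" ".join(f"{b:02X}" for b in encoded))  # 48 65 6C 6C 6F

# Cross-check against the built-in codec for one sample per byte length.
for ch in "Hé€😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```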

### UTF-16

• Variable-width encoding (one or two 16-bit units); its fixed-width predecessor, UCS-2, covers only U+0000 to U+FFFF
• one 16-bit unit(2 bytes, direct mapping): U+0000 to U+FFFF(excluding U+D800 to U+DFFF)
• two 16-bit units(4 bytes): U+010000 to U+10FFFF, also called supplementary characters
• 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
• The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
• The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
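• The first 128 code points (U+0000 to U+007F) have the same values as ASCII, but each one occupies a 16-bit unit, so UTF-16 is not byte-compatible with ASCII
• Byte order matters: the 16-bit units can be stored big-endian (UTF-16BE) or little-endian (UTF-16LE), and a byte order mark (U+FEFF) at the start of the data can signal which one is in use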
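The surrogate arithmetic and the byte-order question can both be sketched in a few lines. The function names `utf16_units` and `utf16_bytes` are illustrative assumptions; the built-in `utf-16-be` and `utf-16-le` codecs are used only as a cross-check.

```python
import codecs

# Hypothetical helper: split a code point into one or two 16-bit code units.
def utf16_units(cp):
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]                   # direct mapping, one unit
    v = cp - 0x10000                  # 20-bit number, 0..0xFFFFF
    high = 0xD800 + (v >> 10)         # top ten bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)        # low ten bits -> low surrogate
    return [high, low]

# Hypothetical helper: serialize the units in the requested byte order.
def utf16_bytes(cp, big_endian=True):
    order = "big" if big_endian else "little"
    return b"".join(u.to_bytes(2, order) for u in utf16_units(cp))

print([f"{u:04X}" for u in utf16_units(0x1F600)])  # ['D83D', 'DE00']
print(utf16_bytes(0x41, big_endian=True))          # b'\x00A'
print(utf16_bytes(0x41, big_endian=False))         # b'A\x00'

# The generic "utf-16" codec prepends a byte order mark so a reader can tell the order.
print("A".encode("utf-16").startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```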

### UTF-32 (UCS-4)

• UTF-32 – a 32-bit, fixed-width encoding
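Because every code point occupies exactly one 32-bit unit, the n-th code point starts at byte offset 4 × n. A minimal sketch, assuming little-endian data without a byte order mark (the helper name `utf32_code_points` is illustrative):

```python
import struct

# Hypothetical helper: read little-endian UTF-32 data back into code point values.
def utf32_code_points(data):
    count = len(data) // 4                    # one 4-byte unit per code point
    return list(struct.unpack(f"<{count}I", data))

data = "Hi😀".encode("utf-32-le")
print(len(data))                                          # 12
print([f"U+{cp:04X}" for cp in utf32_code_points(data)])  # ['U+0048', 'U+0069', 'U+1F600']
```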

## Usage

### UTF-8 Everywhere Manifesto

http://www.utf8everywhere.org