Programming Languages - Bits and Bytes
1 byte = 8-bit integer, in the range 0 to 255
Bit vs Byte
- bit: a single `0` or `1`. 2 different values. The most basic unit of computing.
- byte: 1 byte = 8 bits = 2 hex digits = 256 different values. A.k.a. "octet". Still "naked" `0`s and `1`s, which can be interpreted in different ways.
- characters:
  - historically, 1 byte (8 bits) is used to encode a single character: ASCII uses 7 bits, giving 128 code points, more than enough for English characters (both lowercase and uppercase). The one extra bit can be used as a parity bit.
  - now one character may need more than one byte to store, depending on the encoding.
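As a rough illustration of the last point, the sketch below counts how many bytes a few characters occupy. The sample characters are arbitrary picks, and UTF-8 is assumed as the encoding; other encodings would give different lengths.

```python
# How many bytes does one character take in UTF-8?
# Escapes are used so the source stays plain ASCII:
# U+00E9 (e with acute), U+4E2D (a CJK character), U+1F600 (an emoji).
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# An ASCII character is stored as its 7-bit code point in a single byte.
ascii_code = "a".encode("ascii")[0]
```

Only the first sample fits in one byte; the others need two, three, and four bytes respectively under UTF-8.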
Multiple-byte Units
| Abbrev. | Unit | Bytes | Abbrev. | Unit | Bytes |
| --- | --- | --- | --- | --- | --- |
| kB | kilobyte | 1000 | KiB | kibibyte | 1024 |
| MB | megabyte | 1000^2 | MiB | mebibyte | 1024^2 |
| GB | gigabyte | 1000^3 | GiB | gibibyte | 1024^3 |
| TB | terabyte | 1000^4 | TiB | tebibyte | 1024^4 |
| PB | petabyte | 1000^5 | PiB | pebibyte | 1024^5 |
| EB | exabyte | 1000^6 | EiB | exbibyte | 1024^6 |
| ZB | zettabyte | 1000^7 | ZiB | zebibyte | 1024^7 |
| YB | yottabyte | 1000^8 | YiB | yobibyte | 1024^8 |
Note whether there is an `i` in the unit abbreviation: with an `i` the unit is binary (powers of 1024), otherwise it is decimal (powers of 1000).
For example:
- terabyte (TB): 10^12, or 1000^4, or 1,000,000,000,000 bytes
- tebibyte (TiB): 2^40, or 1024^4, or 1,099,511,627,776 bytes; roughly 1 TiB ≈ 1.1 TB
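The two figures above can be checked with integer arithmetic. This is only a small sketch; the names `TB`, `TiB`, and `ratio` are chosen here for illustration.

```python
TB = 1000 ** 4    # terabyte: decimal unit
TiB = 1024 ** 4   # tebibyte: binary unit, the same number as 2 ** 40

ratio = TiB / TB  # how much larger a TiB is than a TB
print(TB, TiB, ratio)
```

The ratio comes out just under 1.1, which is where the "1 TiB is roughly 1.1 TB" rule of thumb comes from.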
Real world examples
- 3.5 inch Floppy Disk: 1,440 KiB = 1.47 MB = 1.41 MiB
- CD: up to 700 MB
- DVD: 4.7 GB = 4.38 GiB for a single-layered, single-sided disc
- Blu-ray: 25 GB for single-layer
- The Complete Works of William Shakespeare would occupy about 5,600,000 bytes when written in plain text without formatting.
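The floppy and DVD figures in the list can be reproduced by converting between decimal and binary units. A minimal sketch, with variable names made up for this example:

```python
# 3.5 inch floppy: 1,440 KiB expressed in bytes, MB, and MiB
floppy_bytes = 1440 * 1024
floppy_mb = floppy_bytes / 1000 ** 2    # decimal megabytes
floppy_mib = floppy_bytes / 1024 ** 2   # binary mebibytes

# Single-layer DVD: 4.7 GB expressed in GiB
dvd_bytes = 47 * 10 ** 8                # 4.7 * 1000**3, kept as an integer
dvd_gib = dvd_bytes / 1024 ** 3

print(round(floppy_mb, 2), round(floppy_mib, 2), round(dvd_gib, 2))
```

The same byte count gives a smaller number in the binary unit, which is why a "4.7 GB" disc shows up as about 4.38 GiB.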
Bit Manipulation
AND (&)
| AND | 0 | 1 |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
OR (|)
| OR | 0 | 1 |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 1 | 1 |
XOR (^)
| XOR | 0 | 1 |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 1 | 0 |
NOT (~)
| x | NOT x |
| --- | --- |
| 0 | 1 |
| 1 | 0 |
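The single-bit truth tables can be generated rather than memorized. The sketch below uses Python's `&`, `|`, and `^`; for NOT it masks the result to one bit, because Python integers are unbounded and `~0` is `-1` rather than `1`.

```python
# Each row: (a, b, a AND b, a OR b, a XOR b)
rows = [(a, b, a & b, a | b, a ^ b) for a in (0, 1) for b in (0, 1)]

print("a b | AND OR XOR")
for a, b, and_, or_, xor_ in rows:
    print(f"{a} {b} |  {and_}   {or_}   {xor_}")

# One-bit NOT: flip the bits with ~, then keep only the lowest bit.
not_table = {a: ~a & 1 for a in (0, 1)}
print(not_table)
```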
Basic Operations
- Bitwise AND: `&`
- Bitwise exclusive OR: `^`
- Bitwise inclusive OR: `|`
- Unary bitwise complement: `~`
- Signed left shift: `<<`
- Signed right shift: `>>`
- Unsigned right shift: `>>>`
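Here is a short Python sketch of these operators. The operand values are arbitrary. Python has no `>>>` operator, so the last line emulates a 32-bit unsigned right shift by masking first; the 32-bit width is an assumption made to mirror a Java `int`.

```python
val, bitmask = 0x2222, 0x000F

and_result = val & bitmask      # 0x0002: keeps only the low 4 bits
or_result = val | bitmask       # 0x222F: sets the low 4 bits
xor_result = val ^ bitmask      # 0x222D: flips the low 4 bits
complement = ~256               # -257: ~x equals -x - 1
shifted_left = 1 << 4           # 16
shifted_right = -16 >> 2        # -4: the sign is preserved

# Emulated unsigned right shift on a 32-bit value (assumed width).
unsigned_right = (-16 & 0xFFFFFFFF) >> 2   # 0x3FFFFFFC

print(and_result, hex(or_result), hex(xor_result), complement)
print(shifted_left, shifted_right, hex(unsigned_right))
```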
By Language
Java
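int bitmask = 0x000F;
int val = 0x2222;
// Keep only the low 4 bits of val: 0x2222 & 0x000F == 0x0002
System.out.println(val & bitmask); // prints 2
// ~ flips every bit; in two's complement, ~x == -x - 1
System.out.println(~256); // prints -257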
Python
- `bytes` is an immutable array of bytes (PyBytes in the CPython 3 C API)
- `bytearray` is a mutable array of bytes (PyByteArray)
- `memoryview` is a bytes view on another object (PyMemoryView)
- `bytes` literal: `b'...'`
- `str` objects hold character data; `bytes` objects hold raw bytes
Indexing a `bytes` object returns an integer:
>>> a = b'asdf'
>>> a
b'asdf'
>>> a[0]
97
while indexing a `str` returns a one-character string:
>>> b = 'asdf'
>>> b
'asdf'
>>> b[0]
'a'
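- Assigning an object that is not an integer to an element of a `bytearray` raises a TypeError. Ordering comparisons (`<`, `>`) between an element and a non-integer such as a `str` also raise a TypeError, while `==` simply returns False.
- Assigning an integer outside the range 0 to 255 to an element raises a ValueError.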
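A quick way to see both exceptions is to try a few bad assignments on a `bytearray`. The values below are arbitrary examples.

```python
buf = bytearray(b"asdf")
buf[0] = 65                      # fine: 65 is within 0..255 (ASCII "A")

errors = []
for bad_value in ("a", 256, -1):
    try:
        buf[1] = bad_value       # "a" is not an int; 256 and -1 are out of range
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)

print(buf, errors)
```

The failed assignments leave the buffer unchanged, so only the first byte differs from the original.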
A string must come with an encoding when it is converted to `bytes` or `bytearray`:
bytearray:
>>> a = bytearray("123", 'utf-8')
>>> a[0]
49
>>> a[1]
50
bytes:
>>> a = bytes('abc', 'utf-8')
>>> a
b'abc'
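The same conversion can be written with `str.encode`, and reversed with `bytes.decode`. The sketch below also shows what happens when the encoding is left out; the variable names are illustrative only.

```python
encoded = "abc".encode("utf-8")     # equivalent to bytes("abc", "utf-8")
decoded = encoded.decode("utf-8")   # back to str

try:
    bytes("abc")                    # str given without an encoding
    missing_encoding_error = None
except TypeError as exc:
    missing_encoding_error = type(exc).__name__

print(encoded, decoded, missing_encoding_error)
```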
Endianness
Endianness: the order in which the bytes of a multi-byte value are stored in memory or transmitted
Big vs Little
- big-endian: the most significant byte first
- little-endian: the least significant byte first
Other names:
- Network byte order: big-endian
Example: the 32-bit value `0A 0B 0C 0D`
- Big-endian: stored as `0A 0B 0C 0D` in memory
- Little-endian: stored as `0D 0C 0B 0A` in memory
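The two layouts can be reproduced with `int.to_bytes`. A minimal sketch, using the same example value; `as_hex` is a helper defined here just for printing.

```python
value = 0x0A0B0C0D

big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

def as_hex(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

print("big:   ", as_hex(big))
print("little:", as_hex(little))

# Reading the bytes back with the matching order recovers the value.
round_trip = int.from_bytes(little, byteorder="little")
```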
Why Little-endian
In little-endian order, a 32-bit memory location with content `4A 00 00 00` can be read at the same address as either 8-bit (value = `4A`), 16-bit (`004A`), 24-bit (`00004A`), or 32-bit (`0000004A`), all of which retain the same numeric value.
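This property can be checked by reading the first 1, 2, 3, and 4 bytes of the same buffer. The sketch uses `int.from_bytes` as a stand-in for a hardware load of each width, and reads the same prefixes as big-endian for contrast.

```python
memory = bytes([0x4A, 0x00, 0x00, 0x00])
widths = (1, 2, 3, 4)   # bytes read, i.e. 8, 16, 24, and 32 bits

# Little-endian: every width yields the same value.
little_reads = {w: int.from_bytes(memory[:w], "little") for w in widths}

# Big-endian: the value depends on how many bytes are read.
big_reads = {w: int.from_bytes(memory[:w], "big") for w in widths}

print({w: hex(v) for w, v in little_reads.items()})
print({w: hex(v) for w, v in big_reads.items()})
```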
Systems
- big-endian: Java, IPv6 (network byte order), IBM z/Architecture mainframes
- little-endian: Intel x86 processors
In Action
In Python:
>>> import struct
>>> struct.pack("<I", 1)
b'\x01\x00\x00\x00'
>>> struct.pack(">I", 1)
b'\x00\x00\x00\x01'
where `<` means little-endian and `>` means big-endian. `I` stands for a 32-bit unsigned integer, so it takes 4 bytes. In little-endian, the byte `\x01` is stored first, while in big-endian it is stored last.
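Going the other way, `struct.unpack` needs the same byte-order prefix that was used to pack the data. The sketch below also uses `!`, the `struct` prefix for network byte order, and shows what a mismatched read produces; the variable names are illustrative.

```python
import struct

raw = b"\x01\x00\x00\x00"                   # the value 1, packed little-endian

(little_value,) = struct.unpack("<I", raw)  # matching order
(misread,) = struct.unpack(">I", raw)       # wrong order: read as 0x01000000

packed_network = struct.pack("!I", 1)       # "!" = network order (big-endian)

print(little_value, misread, packed_network)
```

A mismatched prefix does not raise an error; it silently yields a different number, which is why protocols and file formats state their byte order explicitly.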
Check system byte order:
>>> import sys
>>> sys.byteorder
'little'