Programming Languages - Bits and Bytes

1 byte = 8-bit integer, in the range 0 to 255

Bit vs Byte

  • bit: a single 0 or 1. 2 different values. The most basic unit of computing
  • byte: 1 byte = 8 bits = 2 hex digits = 256 different values. A.k.a. "octet". Still "naked" 0s and 1s, which can be interpreted in different ways.
  • characters:
    • historically, 1 byte (8 bits) was used to encode a single character: ASCII uses 7 bits, giving 128 code points, more than enough for English letters (both lowercase and uppercase), digits, and punctuation. The extra bit can be used as a parity bit.
    • now one character may need more than one byte of storage, depending on the encoding.
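As a small illustration of the last point, the sketch below encodes a few sample characters and counts the resulting bytes. The sample characters and the variable names are chosen here for illustration only; they are not from the original notes.

```python
# Sample characters (illustrative); escapes keep the source ASCII-only.
samples = {
    "a": "a",               # U+0061, plain ASCII
    "e-acute": "\u00e9",    # U+00E9
    "CJK zhong": "\u4e2d",  # U+4E2D
    "emoji": "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
}

# Number of bytes each character needs in two common encodings.
utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf16_sizes = {name: len(ch.encode("utf-16-le")) for name, ch in samples.items()}

# Every ASCII byte fits in 7 bits, so its value is below 128.
ascii_is_7bit = all(b < 128 for b in "Hello, World".encode("ascii"))

for name in samples:
    print(f"{name}: {utf8_sizes[name]} byte(s) in UTF-8, "
          f"{utf16_sizes[name]} byte(s) in UTF-16")
```

In UTF-8 the four samples take 1, 2, 3, and 4 bytes respectively, while UTF-16 uses 2 bytes for the first three and 4 for the emoji.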

Multiple-byte Units

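| Abbrev. | Unit      | Bytes | Abbrev. | Unit     | Bytes |
| ------- | --------- | ----- | ------- | -------- | ----- |
| kB      | kilobyte  | 1000  | KiB     | kibibyte | 1024  |
| MB      | megabyte  | 1000² | MiB     | mebibyte | 1024² |
| GB      | gigabyte  | 1000³ | GiB     | gibibyte | 1024³ |
| TB      | terabyte  | 1000⁴ | TiB     | tebibyte | 1024⁴ |
| PB      | petabyte  | 1000⁵ | PiB     | pebibyte | 1024⁵ |
| EB      | exabyte   | 1000⁶ | EiB     | exbibyte | 1024⁶ |
| ZB      | zettabyte | 1000⁷ | ZiB     | zebibyte | 1024⁷ |
| YB      | yottabyte | 1000⁸ | YiB     | yobibyte | 1024⁸ |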

Check whether there is an i in the unit abbreviation: with an i the unit is binary (powers of 1024), otherwise it is decimal (powers of 1000).

For example:

  • terabyte (TB): 10¹², or 1000⁴, or 1,000,000,000,000 bytes
  • tebibyte (TiB): 2⁴⁰, or 1024⁴, or 1,099,511,627,776 bytes, so roughly 1 TiB ≈ 1.1 TB
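The arithmetic behind these two figures can be checked directly. This is only a quick sketch; the constant names `TB` and `TIB` are chosen here for illustration.

```python
# Decimal and binary definitions of "tera".
TB = 1000 ** 4    # terabyte
TIB = 1024 ** 4   # tebibyte

# The alternative ways of writing each value agree.
tb_matches_power_of_ten = TB == 10 ** 12
tib_matches_power_of_two = TIB == 2 ** 40

# How much larger a tebibyte is than a terabyte.
ratio = TIB / TB

print(f"1 TB  = {TB:,} bytes")
print(f"1 TiB = {TIB:,} bytes")
print(f"1 TiB is about {ratio:.2f} TB")
```

The gap between the binary and decimal units grows with each prefix: it is 2.4% at the kilo level and about 10% at the tera level.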

Real world examples

  • 3.5 inch Floppy Disk: 1,440 KiB = 1.47 MB = 1.41 MiB
  • CD: up to 700 MB
  • DVD: 4.7 GB = 4.38 GiB for a single-layered, single-sided disc
  • Blu-ray: 25 GB for single-layer
  • The Complete Works of William Shakespeare would occupy about 5,600,000 bytes when written in plain text without formatting.
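The floppy and DVD conversions above can be reproduced with a few divisions. This is a minimal sketch; the constant and variable names are illustrative assumptions, not part of the original notes.

```python
# Unit sizes in bytes.
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3
MB, GB = 1000 ** 2, 1000 ** 3

# 3.5 inch floppy: 1,440 KiB expressed in decimal and binary megabytes.
floppy_bytes = 1440 * KIB
floppy_mb = floppy_bytes / MB
floppy_mib = floppy_bytes / MIB

# Single-layer DVD: 4.7 GB expressed in GiB.
dvd_bytes = 4_700_000_000
dvd_gib = dvd_bytes / GIB

print(f"Floppy: {floppy_bytes:,} bytes = {floppy_mb:.2f} MB = {floppy_mib:.2f} MiB")
print(f"DVD: {dvd_bytes / GB:.1f} GB = {dvd_gib:.2f} GiB")
```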

Bit Manipulation

AND(&)

| AND | 0   | 1   |
| --- | --- | --- |
| 0   | 0   | 0   |
| 1   | 0   | 1   |

OR(|)

| OR  | 0   | 1   |
| --- | --- | --- |
| 0   | 0   | 1   |
| 1   | 1   | 1   |

XOR(^)

| XOR | 0   | 1   |
| --- | --- | --- |
| 0   | 0   | 1   |
| 1   | 1   | 0   |

NOT(~)

| NOT |     |
| --- | --- |
| 0   | 1   |
| 1   | 0   |
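One way to double-check truth tables like these is to generate them from the operators themselves. The sketch below does that in Python for single-bit inputs; the names `OPS`, `tables`, and `not_table` are illustrative.

```python
import operator

# Map each table name to the corresponding bitwise operator.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}

# tables[name][a][b] is the result of applying the operator to bits a and b.
tables = {
    name: [[op(a, b) for b in (0, 1)] for a in (0, 1)]
    for name, op in OPS.items()
}

# ~x on a Python int yields -x - 1, so mask with 1 to keep a single bit.
not_table = [~x & 1 for x in (0, 1)]

for name, rows in tables.items():
    print(name, rows)
print("NOT", not_table)
```

OR yields 1 when at least one input is 1, while XOR yields 1 only when the two inputs differ.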

Basic Operations

  • Bitwise And: &
  • Bitwise exclusive OR: ^
  • Bitwise inclusive OR: |
  • Unary bitwise complement: ~
  • Signed left shift: <<
  • Signed right shift: >>
  • Unsigned right shift: >>> (available in Java and JavaScript, but not in C or Python)
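The operators in this list can be tried out on small values. The sketch below uses Python, which has no `>>>` operator, so the unsigned shift is emulated with a hypothetical helper `urshift32` that assumes a 32-bit width.

```python
x = 0b1100  # 12
y = 0b1010  # 10

and_result = x & y   # 0b1000
xor_result = x ^ y   # 0b0110
or_result = x | y    # 0b1110
complement = ~x      # -x - 1 in two's complement
left = x << 2        # each left shift by one doubles the value
right = -16 >> 2     # signed (arithmetic) shift keeps the sign


def urshift32(value, n):
    """Emulate a 32-bit unsigned right shift (hypothetical helper).

    The value is first reduced to its 32-bit two's-complement bit
    pattern, then shifted, so zeros fill in from the left.
    """
    return (value & 0xFFFFFFFF) >> n


unsigned_right = urshift32(-16, 2)

print(and_result, xor_result, or_result, complement, left, right, unsigned_right)
```

The last value matches what `-16 >>> 2` gives for a 32-bit `int` in Java.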

By Language

Java

int bitmask = 0x000F;  // keeps only the lowest 4 bits
int val = 0x2222;

System.out.println(val & bitmask);
// 2 (0x2222 & 0x000F == 0x0002)
System.out.println(~256);
// -257 (~x == -x - 1 in two's complement)

Python

  • bytes is an immutable array of bytes (PyBytes in CPython)
  • bytearray is a mutable array of bytes (PyByteArray in CPython)
  • memoryview is a view onto the bytes of another object, without copying them (PyMemoryView in CPython)
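The difference between the three types shows up as soon as something is modified. A minimal sketch, with variable names chosen for illustration:

```python
data = bytearray(b"asdf")  # mutable
view = memoryview(data)    # shares data's buffer, no copy

# Writing through the view changes the underlying bytearray.
view[0] = ord("A")

# bytes() takes an immutable snapshot of the current contents.
frozen = bytes(data)

# Item assignment on bytes is rejected.
try:
    frozen[0] = 0
    immutable_error = None
except TypeError as exc:
    immutable_error = type(exc).__name__

print(data, frozen, immutable_error)
```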

bytes literal: b'...'

  • str objects: hold character data
  • bytes objects: hold raw bytes

Indexing a bytes object returns an integer:

>>> a = b'asdf'
>>> a
b'asdf'
>>> a[0]
97

while indexing a str returns a one-character string:

>>> b = 'asdf'
>>> b
'asdf'
>>> b[0]
'a'

Element assignment only applies to bytearray, since bytes is immutable:

  • Assigning an object that is not an integer to an element raises a TypeError exception.
  • Assigning a value outside the range 0 to 255 to an element raises a ValueError exception.
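The two exceptions can be observed with a bytearray, which allows element assignment. The sketch below records the exception type for a few invalid values; the list name `errors` and the chosen values are illustrative.

```python
buf = bytearray(b"asdf")

# A valid assignment: an integer in the range 0 to 255.
buf[0] = 65  # ord("A")

# Invalid assignments: a str, and integers outside 0..255.
errors = []
for bad in ("a", 256, -1):
    try:
        buf[0] = bad
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)

print(buf, errors)
```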

When constructing bytes or a bytearray from a str, an encoding must be given.

bytearray:

>>> a = bytearray("123", 'utf-8')
>>> a[0]
49
>>> a[1]
50

bytes:

>>> a = bytes('abc', 'utf-8')
>>> a
b'abc'

Endianness

Endianness: the order in which the bytes of a multi-byte value are stored or transmitted

Big vs Little

  • big-endian: the most significant byte first
  • little-endian: the least significant byte first

Other names:

  • Network byte order: big-endian

Example, for the 32-bit value 0x0A0B0C0D:

  • Big-endian: stored as 0A 0B 0C 0D in memory (lowest address first)
  • Little-endian: stored as 0D 0C 0B 0A in memory (lowest address first)
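The same layout can be reproduced with `int.to_bytes`. A minimal sketch; the helper `as_hex` is a hypothetical name introduced here for formatting.

```python
value = 0x0A0B0C0D

# Serialize the same 32-bit value in both byte orders.
big = value.to_bytes(4, "big")
little = value.to_bytes(4, "little")


def as_hex(data):
    """Format bytes as space-separated uppercase hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


big_hex = as_hex(big)
little_hex = as_hex(little)

print("big-endian:   ", big_hex)
print("little-endian:", little_hex)
```

One sequence is simply the byte-wise reverse of the other; the bits inside each byte are not reversed.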

Why Little-endian

With little-endian, a 32-bit memory location with content 4A 00 00 00 can be read at the same address as an 8-bit (4A), 16-bit (004A), 24-bit (00004A), or 32-bit (0000004A) value, and every read gives the same number, because the least significant byte sits at the lowest address.
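This property can be simulated by interpreting growing prefixes of the same four bytes. The sketch below compares little-endian and big-endian interpretations; the names `memory`, `little_reads`, and `big_reads` are illustrative.

```python
# Four bytes as they would sit in memory, lowest address first.
memory = bytes([0x4A, 0x00, 0x00, 0x00])

# Read 1, 2, 3, and 4 bytes starting at the same address.
little_reads = {
    width: int.from_bytes(memory[:width], "little") for width in (1, 2, 3, 4)
}
big_reads = {
    width: int.from_bytes(memory[:width], "big") for width in (1, 2, 3, 4)
}

print({w: hex(v) for w, v in little_reads.items()})
print({w: hex(v) for w, v in big_reads.items()})
```

Interpreted as little-endian, every width yields 0x4A. Interpreted as big-endian, the value changes with the width.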

Systems

  • big-endian: Java (class files and data streams), IPv6 (network byte order), IBM z/Architecture mainframes
  • little-endian: Intel x86 processors

In Action

In Python:

>>> import struct
>>> struct.pack("<I", 1)
b'\x01\x00\x00\x00'
>>> struct.pack(">I", 1)
b'\x00\x00\x00\x01'

where < means little-endian and > means big-endian. I stands for a 32-bit unsigned integer, so the result takes 4 bytes. In little-endian, the byte \x01 is stored first, while in big-endian it is stored last.
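Byte order matters on the reading side as well. The sketch below unpacks the same four bytes with both format prefixes to show what happens when the reader assumes the wrong order; variable names are illustrative.

```python
import struct

# Pack the integer 1 as a little-endian 32-bit unsigned integer.
packed = struct.pack("<I", 1)

# Unpacking with the matching byte order recovers the value.
# struct.unpack always returns a tuple, hence the [0].
correct = struct.unpack("<I", packed)[0]

# Unpacking with the opposite byte order gives a different number.
swapped = struct.unpack(">I", packed)[0]

print(packed, correct, swapped, hex(swapped))
```

The mismatched read gives 0x01000000, which is why file formats and network protocols need to specify their byte order explicitly.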

Check system byte order:

>>> import sys
>>> sys.byteorder
'little'
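The result of `sys.byteorder` can be cross-checked against `struct`, whose `=` prefix selects the native byte order with standard sizes. This is a small sketch with illustrative variable names; the output depends on the machine it runs on.

```python
import struct
import sys

# Pack using the machine's native byte order.
native = struct.pack("=I", 1)

# Pack using the byte order reported by sys.byteorder.
expected = (1).to_bytes(4, sys.byteorder)

# On a little-endian machine the first byte is 1; on big-endian, the last.
first_byte_is_one = native[0] == 1

print(sys.byteorder, native, native == expected)
```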