Linux - Isolation
The Linux kernel uses a combination of several distinct features to achieve process isolation. There is no single "isolation API"; instead, tools like Docker or systemd combine these features to create the illusion of a separate environment (a container).
1. Namespaces (The "View")
Namespaces are the most critical component. They control what a process can see. If you put a process in a new namespace, it essentially "hallucinates" that it is the only process on the machine, or that it has its own dedicated network.
The seven namespace types covered here are listed below (kernels since 5.6 also provide a time namespace):
- PID (Process ID): The first process in a new PID namespace sees itself as PID 1 (the root of its own process tree), even if it is PID 12345 on the host. It cannot see or signal processes outside its namespace.
- MNT (Mount): The process has its own view of the file system mount points. Unmounting /home inside the namespace doesn't affect the host.
- NET (Network): The process gets its own IP address, routing table, and firewall rules (iptables). This is how containers can have their own eth0 interface.
- UTS (UNIX Time-Sharing): Allows the process to have its own hostname and domain name.
- IPC (Inter-Process Communication): Prevents the process from accessing shared memory segments or message queues of other processes.
- USER: Allows a process to be "root" inside the namespace but a regular, unprivileged user outside. This is a major security feature and the basis of rootless containers.
- CGROUP: Gives the process its own view of the control group hierarchy (see below), with its own cgroup appearing as the root, so the host's layout stays hidden.
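To make the list above concrete, here is a minimal Python sketch that reports which namespaces a process belongs to. It relies on the kernel exposing one symbolic link per namespace type under /proc/<pid>/ns; the helper names `namespace_ids` and `shared_namespaces` are made up for this illustration.

```python
import os


def namespace_ids(ns_dir="/proc/self/ns"):
    """Map namespace type (e.g. 'pid') to its identifier (e.g. 'pid:[4026531836]').

    Each entry under /proc/<pid>/ns is a symbolic link whose target names the
    namespace; two processes share a namespace when the targets are equal.
    """
    ids = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return ids  # not Linux, /proc not mounted, or directory not readable
    for name in names:
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not a readable link
    return ids


def shared_namespaces(a, b):
    """Return the namespace types on which two namespace_ids() results agree."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:18} {ident}")
```

Running this on the host and again inside a container should typically print different identifiers for most namespace types, which is one way to see that the two processes have different "views".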
2. Control Groups (The "Resources")
While Namespaces handle visibility, cgroups (Control Groups) handle resource usage. They ensure one isolated process cannot crash the server by eating all the RAM or CPU.
- Limits: You can set a hard limit (e.g., "This group gets max 512MB RAM").
- Prioritization: You can give critical processes more CPU "shares" than background tasks.
- Accounting: The kernel tracks how much of each resource a group has used (cloud providers use this for billing).
- Freezing: You can "pause" an entire group of processes instantly and resume them later.
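The operations above map onto plain files in the cgroup file system. The sketch below assumes cgroup v2 (the unified hierarchy, usually mounted at /sys/fs/cgroup) and uses its `memory.max`, `memory.current`, `cgroup.procs`, and `cgroup.freeze` files; the function names are hypothetical helpers, not part of any library.

```python
import os


def parse_limit(text):
    """cgroup v2 limit files hold a byte count or the word 'max' (no limit)."""
    text = text.strip()
    return None if text == "max" else int(text)


def set_memory_limit(cgroup_dir, limit_bytes):
    """Limits: write a hard memory limit into the group's memory.max file."""
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(str(limit_bytes))


def add_process(cgroup_dir, pid):
    """Move a process into the group by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


def memory_usage(cgroup_dir):
    """Accounting: return (current_bytes, limit_bytes_or_None) for the group."""
    with open(os.path.join(cgroup_dir, "memory.current")) as f:
        current = int(f.read())
    with open(os.path.join(cgroup_dir, "memory.max")) as f:
        limit = parse_limit(f.read())
    return current, limit


def set_frozen(cgroup_dir, frozen):
    """Freezing: write 1 to pause every process in the group, 0 to resume."""
    with open(os.path.join(cgroup_dir, "cgroup.freeze"), "w") as f:
        f.write("1" if frozen else "0")
```

For example, `set_memory_limit("/sys/fs/cgroup/demo", 512 * 1024 * 1024)` would apply the 512MB limit from the list, where `/sys/fs/cgroup/demo` is a hypothetical group directory. Writing to a real group normally requires root or a delegated subtree, and the memory controller has to be enabled for that group.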
3. Capabilities & Seccomp (The "Permissions")
Even an isolated process still shares the host's kernel, so what if it tries to attack it? These features limit the actions a process can take.
- Capabilities: Traditionally, "root" had all powers. Capabilities break "root" into roughly 40 separate pieces (e.g., CAP_NET_ADMIN to change IP addresses, CAP_SYS_TIME to change the clock). You can give a process "root" status but remove the capability to change the system clock or load kernel modules.
- Seccomp (Secure Computing): This acts as a firewall for system calls. A process typically needs to make calls to the kernel (like "open file" or "spawn process"). Seccomp lets you allow only the specific calls a program needs (e.g., "read" and "write") and block everything else (like "reboot"). Depending on how the filter is written, a blocked call either fails with an error or kills the process.
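Both settings can be inspected from user space. As a rough sketch, the code below parses the `CapEff` (effective capability bitmask, in hex) and `Seccomp` (0, 1, or 2) fields that the kernel reports in /proc/<pid>/status. Only a handful of capability bits are listed, and `parse_status` is a name invented for this example.

```python
# Bit positions as defined in <linux/capability.h>; only a few of the
# roughly 40 capabilities are listed to keep the example short.
CAP_BITS = {
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
    25: "CAP_SYS_TIME",
}
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def parse_status(text):
    """Return (capability names, seccomp mode) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    mask = int(fields.get("CapEff", "0"), 16)
    caps = [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]
    mode = SECCOMP_MODES.get(fields.get("Seccomp"), "unknown")
    return caps, mode


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(parse_status(f.read()))
    except OSError:
        print("no /proc/self/status on this system")
```

A process running as full root on the host would usually show every listed capability, while a process in a default Docker container would typically show a shorter list and a seccomp mode of "filter".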
4. File System Isolation (Chroot / Pivot_root)
This is the oldest form of isolation.
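- Chroot (Change Root): It changes the root directory (/) for a process to a specific folder (e.g., /var/jail). Path lookups by the process then resolve inside that folder. On its own, chroot is not a security boundary: a process that still holds root privileges can break out.
- Pivot_root: A more modern and secure alternative used by containers. It swaps the root mount of the process's mount namespace for a new one, after which the old root can be unmounted, making the host file system inaccessible.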
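The effect of chroot on path lookups can be sketched in a few lines. `host_path` below is a simplified model written for this illustration (it ignores symbolic links and treats relative paths as relative to the jail's root), while `enter_jail` shows the real call through Python's `os.chroot`, which needs the CAP_SYS_CHROOT capability (in practice, root).

```python
import os
import posixpath


def host_path(jail, path):
    """Return the host path that `path` names for a process chrooted in `jail`.

    Simplified model: symlinks are ignored. As in a real chroot, ".." at the
    jail's root stays at the root instead of climbing out.
    """
    inner = posixpath.normpath(posixpath.join("/", path))
    rel = inner.lstrip("/")
    return posixpath.join(jail, rel) if rel else jail


def enter_jail(jail):
    """Confine the calling process to `jail` (requires CAP_SYS_CHROOT)."""
    os.chroot(jail)
    os.chdir("/")  # otherwise the working directory still points outside the jail


if __name__ == "__main__":
    print(host_path("/var/jail", "/etc/passwd"))
    print(host_path("/var/jail", "/../../etc/passwd"))
```

Both calls in the usage example print `/var/jail/etc/passwd`: the attempt to climb out with `..` ends up at the jail's root.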
Summary: How it fits together
When you start a Docker container, the runtime asks Linux to do roughly this:
- Clone: Creates a new process with new Namespaces (so it can't see neighbors).
- Cgroups: Assigns that process to a Control Group (so it can't eat all RAM).
- Pivot_root: Switches the file system to a container image (so it sees distinct files).
- Capabilities/Seccomp: Drops dangerous privileges (so it can't reboot the host).
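The first step is usually expressed as a bitmask of CLONE_NEW* flags passed to the clone() or unshare() system call. As a small illustration, the sketch below builds that mask from short namespace names; the flag values come from <linux/sched.h>, and the `clone_flags` helper and its short names are assumptions made for this example.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def clone_flags(namespaces):
    """OR together the CLONE_NEW* flag for each requested namespace type."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


if __name__ == "__main__":
    # A typical container asks for at least these five new namespaces.
    print(hex(clone_flags(["pid", "mnt", "net", "uts", "ipc"])))
```

On Python 3.12 or later, a mask like this could be handed to `os.unshare`, although doing so for anything other than a user namespace normally requires root.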