Pods

A Pod is the smallest deployable unit in Kubernetes. Although containers are often thought of as the core unit, Kubernetes schedules and manages Pods, not individual containers.

The best way to think of a Pod is as a "Logical Host." It acts like a virtual machine where several processes (containers) run together.

What makes a Pod a "Pod"? (The Shared Resources)

The magic of a Pod is that the containers inside it are "wrapped" together, mostly by sharing Linux namespaces and, for storage, by sharing volumes. Here is what they share:

  • Network Namespace (Shared IP): All containers in a Pod share the same IP address and port space.
    • If Container A is listening on port 8080, Container B can reach it by calling localhost:8080.
    • Side effect: Two containers in the same Pod cannot bind the same port on the same address and protocol (e.g., they can't both listen on :80).
  • UTS Namespace (Shared Hostname): All containers in the Pod see the same hostname.
  • IPC Namespace (Inter-Process Communication): Containers can communicate via shared memory or semaphores.
  • Storage (Volumes): If you define a Volume in a Pod, all containers in that Pod can mount it and see the same files.
  • PID Namespace (Optional): By default, each container usually has its own process list, but setting shareProcessNamespace: true in the Pod spec lets Container A see (and signal) processes in Container B.

What they do NOT share: Containers still have separate root filesystems (except for shared volumes) and separate resource limits (CPU/memory).
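
The shared port space can be tried out without a cluster, because two sockets in the same network namespace behave like two containers in one Pod. The sketch below is a plain-Python stand-in, not Kubernetes code; the function names and the use of an OS-assigned port are assumptions made for the illustration.

```python
import errno
import socket


def listen_on_localhost(port=0):
    """Bind a TCP listener on 127.0.0.1; port 0 asks the OS for a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def demo_shared_port_space():
    """Return (bytes received over localhost, errno of the conflicting bind)."""
    # "Container A" starts listening.
    container_a = listen_on_localhost()
    port = container_a.getsockname()[1]
    try:
        # "Container B" reaches A through localhost:<port>.
        with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
            conn, _ = container_a.accept()
            with conn:
                conn.sendall(b"hello from A")
            received = b""
            while chunk := client.recv(1024):
                received += chunk

        # A second listener on the same address and port is rejected,
        # just as it would be for a second container in the same Pod.
        try:
            listen_on_localhost(port).close()
            conflict = None
        except OSError as exc:
            conflict = exc.errno
    finally:
        container_a.close()
    return received, conflict
```

Calling demo_shared_port_space() returns the payload that travelled over localhost together with the error code of the second bind, which is errno.EADDRINUSE while the first listener is still open.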

Is it a K8s, CRI, or containerd concept?

Think of it this way: Kubernetes invented the "Pod" abstraction, and the CRI was built to make sure other runtimes (like containerd or CRI-O) could support that abstraction.

Kubernetes: Pod

The "Pod" was invented by Kubernetes. It is a high-level API object. Kubernetes decided that "single container per unit" was too limiting (you couldn't easily have "sidecars" like log-shippers or proxies).

CRI (Container Runtime Interface): PodSandbox

When the Kubelet wants to start a Pod, it calls the container runtime through the CRI, a gRPC API that sits between the two. In effect the Kubelet says: "Hey runtime, I need a 'Pod Sandbox' with these network settings (RunPodSandbox), and then I need you to put these containers inside it (CreateContainer, StartContainer)."

Note that there's no "Pod" at the CRI level, only PodSandbox.
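
The order of the calls is easier to see in a toy model. The sketch below is an in-memory stand-in, not a real CRI client: FakeRuntime, start_pod, and the generated IDs are hypothetical names, and only the recorded call names (RunPodSandbox, CreateContainer, StartContainer) correspond to actual CRI RPCs. A real Kubelet does more around these calls, such as pulling images.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRuntime:
    """In-memory stand-in for a CRI runtime; it records calls and runs nothing."""
    sandboxes: dict = field(default_factory=dict)
    containers: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)

    def run_pod_sandbox(self, pod_name):
        sandbox_id = f"sandbox-{len(self.sandboxes)}"
        self.sandboxes[sandbox_id] = {"pod": pod_name, "state": "READY"}
        self.calls.append("RunPodSandbox")
        return sandbox_id

    def create_container(self, sandbox_id, name, image):
        # A container can only be created inside an existing sandbox.
        if sandbox_id not in self.sandboxes:
            raise KeyError(f"unknown sandbox {sandbox_id}")
        container_id = f"container-{len(self.containers)}"
        self.containers[container_id] = {
            "sandbox": sandbox_id,
            "name": name,
            "image": image,
            "state": "CREATED",
        }
        self.calls.append("CreateContainer")
        return container_id

    def start_container(self, container_id):
        self.containers[container_id]["state"] = "RUNNING"
        self.calls.append("StartContainer")


def start_pod(runtime, pod_name, containers):
    """Kubelet-side view: one sandbox first, then each container inside it."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    for name, image in containers:
        container_id = runtime.create_container(sandbox_id, name, image)
        runtime.start_container(container_id)
    return sandbox_id
```

For a two-container Pod, the recorded sequence is one RunPodSandbox followed by a CreateContainer/StartContainer pair per container, and every container record points at the same sandbox ID.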

containerd level: The "Pod" is an Illusion

Does containerd Recognize Pods? Yes and No.

  • "No" (Core containerd): at its lowest level, containerd is a container manager. It knows how to start a container, manage an image, and handle a filesystem. It doesn't inherently care about the Kubernetes concept of a "Pod." To core containerd, a container is just a container.
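  • "Yes" (The CRI Plugin): However, containerd includes a CRI (Container Runtime Interface) plugin. This plugin is the "translator" that talks to the Kubernetes Kubelet: it accepts PodSandbox requests and maps each sandbox and its containers onto ordinary containerd containers.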

How is it implemented? (The "Pause" Container)

If you look at a running Pod at the node level (for example with nerdctl --namespace k8s.io ps on a containerd node, or docker ps on a Docker-based node), you will often see a container you didn't create: the "Pause" container (also called the Infra container).

This is the secret to the Pod:

  1. The runtime starts the Pause container first.
  2. The Pause container "acquires" all the namespaces (the IP, the hostname, etc.).
  3. The Pause container then goes to "sleep" (it does nothing but hold those namespaces open).
  4. Your actual containers are then started and told: "Don't create your own network/IPC; just join the namespaces owned by the Pause container."

If your application containers crash and restart, the Pause container stays alive, ensuring the Pod's IP address remains the same.
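
One way to check this sharing on a Linux node is to compare the namespace links under /proc, since two processes are in the same namespace when their /proc/<pid>/ns/<name> links resolve to the same identifier. The sketch below assumes a Linux /proc filesystem and enough permission to read the target processes; the helper names are hypothetical.

```python
import os

NAMESPACES = ("net", "ipc", "uts", "pid", "mnt")


def namespace_ids(pid):
    """Map namespace name -> identifier string such as 'net:[4026531840]'."""
    return {ns: os.readlink(f"/proc/{pid}/ns/{ns}") for ns in NAMESPACES}


def shared_namespaces(pid_a, pid_b):
    """Return the sorted names of the namespaces both processes live in."""
    ids_a = namespace_ids(pid_a)
    ids_b = namespace_ids(pid_b)
    return sorted(ns for ns in NAMESPACES if ids_a[ns] == ids_b[ns])
```

Given the host PID of the pause process and the host PID of an application container in the same Pod, shared_namespaces(pause_pid, app_pid) would typically list net, ipc, and uts, while mnt (and, by default, pid) would be missing from the result.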

When the Kubelet tells containerd to "Create a Sandbox," it doesn't explicitly say "run the pause image." It says "prepare the environment." However, the industry-standard way to implement a PodSandbox on Linux is by running a pause container.

The "pause" container is primarily a Kubernetes concept that has become a de facto standard for any runtime that implements the CRI (Container Runtime Interface).

How is it implemented in gVisor?

When you move from a standard runtime (like runc) to a sandboxed runtime like gVisor (runsc), the way a Pod is implemented changes fundamentally.

In a standard runtime, the Pod is a "soft" boundary created by grouping Linux namespaces. In gVisor, the Pod is a "hard" boundary defined by a dedicated guest kernel instance.

In gVisor, the implementation of a Pod revolves around a component called the Sentry.

The "One Sentry Per Pod" Rule

In a standard environment, all containers in a Pod talk directly to the host Linux kernel (via syscalls). In gVisor, one "Sentry" process is started for the entire Pod.

  • The Sentry is essentially a user-space guest kernel (written in Go).
  • All containers within that Pod run inside that specific Sentry instance.
  • The containers don't see the host kernel; they only see the Sentry.

How Sharing Works (Internalized Namespaces)

Because all containers in a gVisor Pod live inside the same Sentry, "sharing" resources is managed internally by gVisor rather than by the Linux host kernel:

  • Shared Network (Netstack): gVisor has its own integrated network stack (called Netstack). The Sentry manages a single network interface for the Pod. When Container A binds to a port, the Sentry records that in its internal table. When Container B tries to connect to localhost, the Sentry handles that traffic entirely within its own memory—the packets never even hit the host's bridge or iptables.
  • Shared Memory/IPC: The Sentry manages the virtual memory for all containers in the Pod. It allows them to communicate via IPC (Inter-Process Communication) because they are all being "vetted" by the same guest kernel.
  • The "Pause" Container in gVisor: Kubernetes/CRI still starts a "Pause" container to satisfy the API. In gVisor it is the first container of the Pod, so creating it is what causes runsc to start the sandbox and its Sentry. Once the Sentry is up, it acts as the "foundation" for all other containers in that Pod, which are started inside the same sandbox.
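
The idea of a per-Pod port table can be pictured with a toy model. This is purely conceptual and is not gVisor code: ToySentry, its methods, and the handler callbacks are hypothetical, and the real Netstack is a full network stack rather than a dictionary.

```python
class ToySentry:
    """Conceptual model: one instance per Pod, owning that Pod's loopback ports."""

    def __init__(self):
        self._listeners = {}  # port -> (container name, handler callable)

    def bind(self, container, port, handler):
        # The port table is shared by every container in this Pod.
        if port in self._listeners:
            owner = self._listeners[port][0]
            raise OSError(f"port {port} already bound by {owner}")
        self._listeners[port] = (container, handler)

    def connect_localhost(self, port, payload):
        # Loopback traffic is resolved inside the sandbox; nothing leaves it.
        if port not in self._listeners:
            raise ConnectionRefusedError(f"nothing listening on port {port}")
        _, handler = self._listeners[port]
        return handler(payload)
```

In this model, two containers registered with the same ToySentry collide on a port, while a second ToySentry (a different Pod) can bind the same port number without conflict.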

The Security Boundary

This architecture changes the "Pod" from a logical grouping into a much stronger isolation boundary:

  • In runc: Containers talk to the host kernel directly, so a container that escapes its namespaces is already on the host kernel.
  • In gVisor: If a container "escapes," it only escapes into the Sentry, a restricted user-space process. It still hasn't reached the host kernel, and an attacker would need a second exploit to get there. Since the Sentry is unique to that Pod, the "blast radius" of a compromised Sentry is largely confined to that single Pod.

Comparison: Standard Pod vs. gVisor Pod

Feature        | Standard Pod (runc)                             | gVisor Pod (runsc)
---------------+-------------------------------------------------+--------------------------------------------
Kernel         | Shared Host Linux Kernel                        | Dedicated Guest Kernel (Sentry)
Isolation      | Namespaces & Cgroups                            | Syscall Interception (Sandbox)
Networking     | Host Kernel's Network Stack                     | Sentry's Internal Netstack
Implementation | Multiple containers joining 1 set of namespaces | Multiple containers running inside 1 Sentry
Performance    | High (Direct syscalls)                          | Lower (Syscall overhead)
Security       | Good                                            | Excellent (Defense in depth)

How are containers attached to the pod?

The Execution Flow

For a namespace-based runtime such as runc, the process of "attaching" a container to the Pod looks roughly like this:

  • Step A: The runtime fork()s a new process for your app.
  • Step B: Before that process starts your application code (like python or nginx), it calls setns() for the Network, IPC, and UTS namespaces of the pause container.
  • Step C: Now that the process is "inside" the correct namespaces, it calls execve() to transform into your application.
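
Steps A to C can be sketched in a few lines. This is a minimal illustration, assuming Linux, a libc that exports setns(), and enough privilege (normally root) to join another process's namespaces; join_namespaces and run_in_pod are hypothetical helper names, and a real runtime does considerably more (cgroups, mount namespace, root filesystem, seccomp, and so on).

```python
import ctypes
import os
import sys

# Namespace type flags from <sched.h>.
CLONE_FLAGS = {"net": 0x40000000, "ipc": 0x08000000, "uts": 0x04000000}


def join_namespaces(target_pid, names):
    """Step B: call setns() on each namespace of the target (pause) process."""
    libc = ctypes.CDLL(None, use_errno=True)
    for name in names:
        fd = os.open(f"/proc/{target_pid}/ns/{name}", os.O_RDONLY)
        try:
            if libc.setns(fd, CLONE_FLAGS[name]) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
        finally:
            os.close(fd)


def run_in_pod(target_pid, argv, names=("net", "ipc", "uts")):
    """Fork, join the target's namespaces, exec argv; return the exit code."""
    pid = os.fork()                              # Step A: new process
    if pid == 0:
        try:
            join_namespaces(target_pid, names)   # Step B: enter namespaces
            os.execvp(argv[0], argv)             # Step C: become the app
        finally:
            os._exit(127)                        # reached only on failure
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

With sufficient privilege, run_in_pod(pause_pid, ["hostname"]) would run the command with the Pod's hostname and network. Passing names=() skips step B entirely, which is handy for trying the fork/exec part as an unprivileged user; a return value of 127 means joining a namespace or the exec itself failed.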