gVisor - KVM

When gVisor uses the KVM platform, the Sentry acts as its own virtual machine monitor (VMM) on top of the host's KVM, much like a "Type-2 Hypervisor." However, unlike QEMU or VMware, gVisor doesn't boot a whole guest operating system. It uses the CPU's hardware virtualization, through KVM, to isolate the application's memory and to intercept its system calls.

Step-by-step flow

Step 1: The Handshake (/dev/kvm)

When runsc starts with --platform=kvm, the first thing the Sentry process does is open the /dev/kvm device on your host.

  • It issues the KVM_CREATE_VM ioctl on that device to create a new "Virtual Machine" object in the Linux kernel.
  • It issues the KVM_CREATE_VCPU ioctl on the resulting VM file descriptor to create virtual CPUs.
  • Crucially: To the host kernel, the Sentry is just a process using KVM. To the Sentry, these vCPUs are the "engines" it will use to run your application code.
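
The handshake above can be sketched with raw ioctls. This is a minimal illustration, not gVisor's code (the Sentry is written in Go). The request numbers assume the x86-64 ioctl encoding from <linux/kvm.h>, and the helper name kvm_handshake is hypothetical. It returns None when /dev/kvm is missing or inaccessible.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte that <linux/kvm.h> reserves for KVM


def _io(nr: int) -> int:
    """Encode an argument-less KVM ioctl (x86-64 layout: direction=0, size=0)."""
    return (KVMIO << 8) | nr


KVM_GET_API_VERSION = _io(0x00)
KVM_CREATE_VM = _io(0x01)
KVM_CREATE_VCPU = _io(0x41)


def kvm_handshake(path: str = "/dev/kvm"):
    """Open KVM, create one VM and one vCPU, then close everything again.

    Returns the KVM API version on success, or None if KVM is unavailable.
    """
    fds = []
    try:
        kvm_fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
        fds.append(kvm_fd)
        api_version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)     # arg 0: default machine type
        fds.append(vm_fd)
        vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)  # arg 0: vCPU id
        fds.append(vcpu_fd)
        return api_version
    except OSError:
        return None
    finally:
        for fd in reversed(fds):
            os.close(fd)
```

Note that each step returns a new file descriptor: the VM is created from the /dev/kvm descriptor, and the vCPU from the VM descriptor.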

Step 2: Mapping the "World" (Memory)

The Sentry creates a memory map for the sandbox.

  • It allocates chunks of its own memory (Host Virtual Address).
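  • It tells KVM (via the KVM_SET_USER_MEMORY_REGION ioctl): "Treat this chunk of my memory as the Guest Physical Memory for the sandbox."
  • KVM backs that mapping with the CPU's second-level address translation: EPT (Extended Page Tables) on Intel, NPT on AMD. This hardware feature confines the guest to the memory the Sentry has registered, so the application gets its own private memory space that the Sentry fully controls.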
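
As a rough sketch of the "treat this chunk of my memory as guest physical memory" step, the snippet below allocates anonymous memory and packs the argument that the KVM_SET_USER_MEMORY_REGION ioctl expects. The struct layout and request number follow <linux/kvm.h> on x86-64. The helper name describe_region and the guest address 0x1000 are made up for illustration, and the ioctl itself is not issued.

```python
import ctypes
import mmap
import struct

KVMIO = 0xAE
_IOC_WRITE = 1  # data flows from userspace to the kernel

# struct kvm_userspace_memory_region:
#   u32 slot; u32 flags; u64 guest_phys_addr; u64 memory_size; u64 userspace_addr;
REGION_FORMAT = "=IIQQQ"
REGION_SIZE = struct.calcsize(REGION_FORMAT)

# _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region)
KVM_SET_USER_MEMORY_REGION = (
    (_IOC_WRITE << 30) | (REGION_SIZE << 16) | (KVMIO << 8) | 0x46
)


def describe_region(slot: int, guest_phys_addr: int, backing: mmap.mmap) -> bytes:
    """Pack the ioctl argument mapping `backing` at `guest_phys_addr`."""
    view = ctypes.c_char.from_buffer(backing)  # temporary view, used only for its address
    host_addr = ctypes.addressof(view)         # the Host Virtual Address
    return struct.pack(
        REGION_FORMAT, slot, 0, guest_phys_addr, len(backing), host_addr
    )


# Four pages of the Sentry's own (here: this script's) memory.
backing = mmap.mmap(-1, 4 * mmap.PAGESIZE)
region = describe_region(slot=0, guest_phys_addr=0x1000, backing=backing)
```

A real VMM would pass the packed bytes to the ioctl on the VM file descriptor. From then on, guest accesses to physical addresses starting at 0x1000 land in the backing mapping.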

Step 3: Setting the "Trap" (Syscall Redirection)

In a normal VM, a system call inside the VM goes to the Guest Linux Kernel. gVisor doesn't boot a separate guest Linux kernel; the Sentry plays that role.

  • The Sentry programs the vCPU's LSTAR register (the IA32_LSTAR model-specific register).
  • On x86-64, LSTAR tells the CPU: "When code executes the SYSCALL instruction, jump to this memory address in ring 0."
  • The Sentry points this at a small piece of gVisor entry code that hands control back to the Sentry. In the simplified flow described here, it does so by triggering a VM-EXIT.
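
To make the role of LSTAR concrete, here is a toy model of what the SYSCALL instruction does to a CPU's registers. It is a simplification of the x86-64 behavior: segment loading via IA32_STAR and flag masking via IA32_FMASK are left out. The ToyVCPU class and the entry address are invented for the example.

```python
IA32_LSTAR = 0xC0000082  # MSR that holds the 64-bit SYSCALL entry point
SYSCALL_INSN_LEN = 2     # the SYSCALL opcode (0F 05) is two bytes long


class ToyVCPU:
    """Just enough CPU state to show how SYSCALL is redirected."""

    def __init__(self, rip: int):
        self.msrs = {}
        self.regs = {"rip": rip, "rcx": 0, "r11": 0, "rflags": 0x202}
        self.cpl = 3  # current privilege level: 3 = user code

    def wrmsr(self, msr: int, value: int) -> None:
        self.msrs[msr] = value

    def syscall(self) -> None:
        # Hardware saves the return address and flags, then jumps to LSTAR.
        self.regs["rcx"] = self.regs["rip"] + SYSCALL_INSN_LEN
        self.regs["r11"] = self.regs["rflags"]
        self.regs["rip"] = self.msrs[IA32_LSTAR]
        self.cpl = 0  # the entry code runs in ring 0


ENTRY_STUB = 0xFFFF800000001000  # hypothetical address of the entry code

vcpu = ToyVCPU(rip=0x400000)
vcpu.wrmsr(IA32_LSTAR, ENTRY_STUB)  # "setting the trap"
vcpu.syscall()                      # the application executes SYSCALL
```

After the call, the instruction pointer sits at the entry code rather than in any Linux kernel, and RCX remembers where the application should resume.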

Step 4: Entering "Guest Mode" (KVM_RUN)

The Sentry is ready to run your app (e.g., Python).

  1. The Sentry loads the Python interpreter's code into the "Guest" memory.
  2. The Sentry issues the KVM_RUN ioctl on the vCPU's file descriptor.
  3. The Context Switch: The physical CPU switches into "Guest Mode" (VMX non-root operation on Intel CPUs). The CPU is now executing the Python code at full hardware speed.

Step 5: The System Call (The "Trap" Springs)

Your Python app tries to do something, like write("hello").

  1. The app executes the SYSCALL instruction.
  2. Because of Step 3, the CPU jumps to the address stored in LSTAR, and the gVisor code there forces an exit from Guest Mode.
  3. The CPU performs a VM-EXIT.

Step 6: The Return to the Sentry

The KVM_RUN call in the Sentry (from Step 4) finally returns.

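  • The Sentry process resumes in Host mode.
  • KVM reports an exit reason in the kvm_run structure it shares with the Sentry. KVM has no dedicated "system call" exit reason, so the Sentry recognizes the syscall from the kind of exit its own entry code triggered and from the saved vCPU state.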
  • The Sentry looks at the vCPU's registers (RAX, RDI, etc.) to see which syscall it was (e.g., write).
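
The register inspection can be illustrated by decoding a raw register dump. The field order below matches struct kvm_regs on x86-64, which is what the KVM_GET_REGS ioctl returns, and the argument registers follow the x86-64 Linux syscall convention. The decode_syscall helper and the short syscall table are illustrative only.

```python
import struct

# Field order of struct kvm_regs on x86-64 (18 unsigned 64-bit values).
KVM_REGS_FIELDS = (
    "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rsp", "rbp",
    "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
    "rip", "rflags",
)
KVM_REGS_FORMAT = "=18Q"

# _IOR(KVMIO, 0x81, struct kvm_regs)
KVM_GET_REGS = (2 << 30) | (struct.calcsize(KVM_REGS_FORMAT) << 16) | (0xAE << 8) | 0x81

# A few x86-64 Linux syscall numbers (partial table).
SYSCALL_NAMES = {0: "read", 1: "write", 2: "open", 3: "close"}


def decode_syscall(raw_regs: bytes):
    """Return (syscall name, six argument values) from a kvm_regs dump."""
    regs = dict(zip(KVM_REGS_FIELDS, struct.unpack(KVM_REGS_FORMAT, raw_regs)))
    name = SYSCALL_NAMES.get(regs["rax"], "unknown(%d)" % regs["rax"])
    # x86-64 convention: number in RAX, arguments in RDI, RSI, RDX, R10, R8, R9.
    args = tuple(regs[r] for r in ("rdi", "rsi", "rdx", "r10", "r8", "r9"))
    return name, args


# Synthetic dump for write(1, 0x7000, 5); a real one would come from KVM_GET_REGS.
sample = dict.fromkeys(KVM_REGS_FIELDS, 0)
sample.update(rax=1, rdi=1, rsi=0x7000, rdx=5)
raw = struct.pack(KVM_REGS_FORMAT, *(sample[f] for f in KVM_REGS_FIELDS))
decoded = decode_syscall(raw)
```

Here rsi is only a guest address. To read the actual bytes, the Sentry still has to translate it through the application's memory mappings.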

Step 7: Emulation (The Work)

Now the Sentry (the "Guest Kernel" written in Go) does its job.

  • It checks if the app is allowed to write to that file.
  • It interacts with the Gofer if it needs to touch the host disk.
  • It updates its internal "Virtual File Descriptor" state.
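
A toy version of this bookkeeping might look like the following. All class and method names are invented. The real Sentry implements a far richer file system layer, and the Gofer interaction is not modeled here.

```python
import errno


class ToyFile:
    """An in-memory stand-in for a file the sandbox has opened."""

    def __init__(self, writable: bool):
        self.writable = writable
        self.data = bytearray()


class ToySentry:
    """Keeps a per-sandbox file descriptor table, as a kernel would."""

    def __init__(self):
        self.fd_table = {}
        self.next_fd = 3  # 0-2 are conventionally stdin/stdout/stderr

    def install(self, file: ToyFile) -> int:
        fd = self.next_fd
        self.fd_table[fd] = file
        self.next_fd += 1
        return fd

    def sys_write(self, fd: int, payload: bytes) -> int:
        """Emulate write(2): return bytes written or a negative errno."""
        file = self.fd_table.get(fd)
        if file is None or not file.writable:
            return -errno.EBADF  # unknown fd, or not open for writing
        file.data += payload     # update the Sentry's own state
        return len(payload)


sentry = ToySentry()
log_fd = sentry.install(ToyFile(writable=True))
ro_fd = sentry.install(ToyFile(writable=False))
```

The point of the sketch is that the permission check and the state update both happen in the Sentry's own data structures, not in the host kernel's file table.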

Step 8: Re-entry

Once the work is done:

  1. The Sentry puts the return value (e.g., the number of bytes written) into the vCPU's RAX register.
  2. It issues KVM_RUN again.
  3. The CPU jumps back into Guest Mode, and the Python app continues from the instruction right after SYSCALL, never knowing it was momentarily paused and inspected.
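
Steps 4 through 8 form a loop, which can be simulated without KVM by letting a Python generator stand in for the guest. Each yield plays the role of a SYSCALL that exits to the Sentry, and send() plays the role of re-entering the guest with a result in RAX. Everything here is a toy model; only the syscall numbers (1 for write, 60 for exit, on x86-64 Linux) and the ENOSYS value are real.

```python
def guest_program():
    """Toy guest: each `yield` stands for a SYSCALL that exits to the Sentry."""
    written = yield {"rax": 1, "rdi": 1, "rsi": b"hello"}  # write(1, "hello")
    yield {"rax": 60, "rdi": 0 if written == 5 else 1}     # exit(status)


def run_sandbox(guest):
    """Toy Sentry loop: run the guest, handle each exit, re-enter."""
    output = bytearray()
    try:
        regs = next(guest)              # first "KVM_RUN": run until the guest exits
        while True:
            if regs["rax"] == 1:        # write: emulate it in the Sentry
                output += regs["rsi"]
                result = len(regs["rsi"])
            elif regs["rax"] == 60:     # exit: stop running the guest
                return regs["rdi"], bytes(output)
            else:
                result = -38            # -ENOSYS for anything unimplemented
            regs = guest.send(result)   # re-entry: result becomes the guest's RAX
    except StopIteration:
        return None, bytes(output)      # guest ended without calling exit


status, captured = run_sandbox(guest_program())
```

The guest never sees the loop. From its point of view, write simply returned 5.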

Why use KVM instead of Ptrace?

| Feature | Ptrace Platform | KVM Platform |
|---|---|---|
| Switching | Software-based (Signals/Context switches) | Hardware-based (CPU VM-EXIT/ENTRY) |
| Speed | Slow (4 context switches per syscall) | Fast (Bypasses much of host kernel) |
| Isolation | Standard Namespaces | Hardware Memory Isolation (EPT) |
| Compatibility | Works anywhere, including inside other VMs | Requires hardware virtualization support (nested virtualization when run inside a VM) |

How is it different from the normal VM on KVM?

The fundamental difference is in what is inside the virtual machine.

In a Normal VM (for example, a QEMU/KVM guest), KVM is used together with a VMM to present a complete virtual computer so a full OS can boot. In gVisor, KVM is used as a secure sandbox for an application.

The "Guest" Content

In a Normal VM, the Guest Kernel is a separate OS image booted inside the VM. In gVisor, no such image exists: the Sentry plays the role of the Guest Kernel, and it is the same program that acts as the VMM on the host.

  • Normal VM: You boot a Guest Kernel (e.g., a full Linux kernel) plus a whole user-space (Systemd, Bash, etc.). The application runs on top of that guest kernel.
  • gVisor: No separate guest OS is booted inside the VM. The workload is just your application (e.g., Python). The Sentry acts as its kernel and manages the app inside the KVM sandbox. According to gVisor's documentation, the KVM platform lets the Sentry act as both guest OS and VMM.

The Syscall Target

This is the most critical technical difference.

  • Normal VM: When an app makes a syscall, it stays inside the VM. The Guest Kernel handles it. KVM (and the host) are never even aware a syscall happened.
  • gVisor: A syscall made by the app never reaches a guest Linux kernel. It is redirected to the Sentry, which inspects it and decides what to do. In the flow described above, that redirection is a VM-EXIT back to the Sentry on the host.
    • In a normal VM, a VM-EXIT is a relatively expensive event that guests and hypervisors try to minimize. In gVisor, trapping to the Sentry is the standard way every syscall is handled.

Hardware Emulation

  • Normal VM: KVM works with a tool like QEMU to provide "Virtual Hardware." The VM thinks it has a PCI bus, an Intel network card (actually emulated), a BIOS/UEFI, and a Disk Controller.
  • gVisor: There is zero hardware emulation. There is no virtual NIC, no virtual disk, and no BIOS. The Sentry simply maps memory segments into the KVM sandbox. If the app wants to "write to disk," it doesn't talk to a virtual disk controller; it makes a syscall that the Sentry intercepts.

Memory Management

  • Normal VM: The VM is usually allocated a fixed "chunk" of RAM (e.g., 4GB). The host kernel sees one big block of memory used by the VM process.
  • gVisor: Because it's a sandbox for a process, memory management is more dynamic. The Sentry can map and unmap memory for the application more like a traditional OS manages a process, though it still uses KVM's EPT (Extended Page Tables) to enforce the hardware boundary.

Booting Speed

  • Normal VM: Slow. You have to wait for the virtual BIOS to initialize, the bootloader to run, the kernel to decompress, and systemd to start services. This takes seconds.
  • gVisor: Near-instant. The Sentry sets up the KVM structures and immediately jumps the CPU to the application's first instruction. It "feels" like starting a container, not a VM.

Comparison

| Feature | Normal VM (QEMU/KVM) | gVisor (runsc --platform=kvm) |
|---|---|---|
| Guest OS | Full Linux/Windows Kernel | None (Sentry handles syscalls) |
| Syscall Handling | Internal (handled inside the VM) | External (triggers a VM-EXIT) |
| Hardware | Virtual NIC, PCI, BIOS, Disks | None (Only CPU and RAM) |
| Goal | Run a separate OS | Sandbox a single process |
| Isolation | Hardware-level | Hardware-level |
| Analogy | A Whole House (with its own plumbing) | A Glass Isolation Room in an existing hospital |

Why does gVisor do this?

By using KVM in this "weird" way, gVisor gets the security of a VM (the hardware enforces the boundary) but keeps the agility of a container (no extra kernel to manage, fast startup, and lower memory overhead). It uses KVM as a CPU-level filter rather than a machine emulator.

Which parts of KVM are used by gVisor? Which parts are not?

gVisor uses KVM as a "Thin Sandbox." It takes the isolation power of KVM (the hardware-enforced memory walls) but throws away the complexity of KVM (the hardware emulation). This allows gVisor to be much faster and lighter than a traditional VM while maintaining the same level of hardware-backed security.

While a normal VMM (Virtual Machine Monitor) like QEMU or Firecracker tries to use KVM to build a "Virtual Computer," gVisor uses KVM only to build a "Virtual CPU Jail."

What gVisor USES from KVM (The Core)

gVisor uses the parts of the KVM API that handle the CPU and RAM. These are the "raw" virtualization features:

  • VCPU Creation: It calls KVM_CREATE_VCPU. It needs KVM to manage the hardware registers (RAX, RIP, etc.) of the application thread.
  • Memory Mapping (EPT): It uses KVM to set up Extended Page Tables. This is the hardware feature that prevents the application from seeing any memory that the Sentry hasn't explicitly given it.
  • The Run Loop: It uses KVM_RUN to tell the physical CPU: "Go into Guest Mode and execute this code until something happens."

What gVisor SKIPS (The "Normal VM" stuff)

In a normal VM, KVM provides a massive amount of infrastructure to simulate a motherboard. gVisor ignores almost all of it:

  • Interrupt Controllers (APIC/PIC): Normal VMs need complex logic to handle hardware interrupts (like a mouse click or a disk finishing a task). gVisor doesn't have virtual hardware, so it doesn't need virtual interrupts.
  • I/O Port Emulation (IN / OUT): In a normal VM, the Guest OS talks to hardware using I/O ports. KVM intercepts these. gVisor applications don't use I/O ports; they use SYSCALL.
  • BIOS/UEFI: gVisor doesn't "boot." There is no virtual firmware.
  • Device Bus (PCI/USB): There are no virtual buses in gVisor.

The Technical "Twist": Syscalls vs. I/O

This is the most important technical distinction between gVisor and QEMU:

  • Normal VMM (QEMU): Configures KVM so that the Guest OS can handle its own system calls. KVM only exits to QEMU when the Guest OS tries to talk to Hardware (like a Virtual Disk).
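  • gVisor VMM (runsc): Configures the vCPU so that every System Call is routed to the Sentry.
    • This does not rely on a dedicated "trap syscalls" switch in the KVM API. As described in Step 3, the redirection comes from the guest's own CPU state: LSTAR points at gVisor entry code, so as soon as the app hits the SYSCALL instruction, control passes to that code and from there back to the Sentry.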
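
The contrast shows up in how each kind of VMM reacts to the exit reason KVM leaves in the shared kvm_run structure. The exit-reason constants and the field offset below come from <linux/kvm.h>. The two handler tables are a loose illustration. In particular, treating a HLT exit as "the entry code is handing control to the Sentry" is an assumption of this sketch, not a statement about gVisor's actual implementation.

```python
import struct

# Exit reasons from <linux/kvm.h>.
KVM_EXIT_IO = 2
KVM_EXIT_HLT = 5
KVM_EXIT_MMIO = 6
KVM_EXIT_SHUTDOWN = 8

# struct kvm_run starts with two u8 fields and six padding bytes,
# so the u32 exit_reason sits at byte offset 8.
EXIT_REASON_OFFSET = 8


def exit_reason(kvm_run: bytes) -> int:
    """Read exit_reason out of a (copy of the) mmap'ed kvm_run area."""
    return struct.unpack_from("=I", kvm_run, EXIT_REASON_OFFSET)[0]


# What a device-emulating VMM typically does with each exit.
DEVICE_VMM = {
    KVM_EXIT_IO: "emulate port I/O on a virtual device",
    KVM_EXIT_MMIO: "emulate memory-mapped device register",
    KVM_EXIT_HLT: "idle this vCPU until an interrupt arrives",
    KVM_EXIT_SHUTDOWN: "reset or power off the VM",
}

# What a syscall-trapping VMM might do (assumption: entry code exits via HLT).
SYSCALL_VMM = {
    KVM_EXIT_HLT: "hand the trapped syscall to the Sentry",
    KVM_EXIT_SHUTDOWN: "tear down the sandbox",
}


def handle(table: dict, kvm_run: bytes) -> str:
    return table.get(exit_reason(kvm_run), "unexpected exit: treat as a fault")


# Synthetic kvm_run prefixes for demonstration.
halted = bytes(8) + struct.pack("=I", KVM_EXIT_HLT)
port_io = bytes(8) + struct.pack("=I", KVM_EXIT_IO)
```

With no virtual devices, the syscall-trapping table has nothing to do for I/O or MMIO exits, which is why they fall through to the fault case.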

Comparison: Role of KVM

| Function | Normal VM (QEMU/Firecracker) | gVisor (runsc) |
|---|---|---|
| KVM Usage | Full (CPU + RAM + Interrupts + I/O) | Minimal (CPU + RAM only) |
| Virtual Hardware | Yes (NIC, Disk, GPU, etc.) | None |
| Guest OS | Full Kernel (Linux/Windows) | None (Only the App) |
| VM-EXIT Trigger | Hardware I/O / Faults | System Calls |
| The "VMM" | QEMU (Large, C) | Sentry (Slim, Go) |

Is gVisor a VMM?

Yes. In the KVM model, runsc is the VMM.

In the Linux kernel's eyes, runsc is doing exactly what QEMU does: it is a user-space process that opened /dev/kvm and is calling ioctl commands to run code in a virtual context.

However, because runsc doesn't have to simulate a hard drive, a network card, or a power button, its "VMM" code is much smaller than QEMU's and exposes a correspondingly smaller attack surface. It uses KVM as a system call interceptor rather than a hardware emulator.