gVisor - KVM
When gVisor uses the KVM platform, it turns the Sentry into a "Type-2 Hypervisor." However, unlike QEMU or VMware, gVisor doesn't boot a whole operating system. It uses KVM's hardware acceleration purely to trap system calls and isolate memory.
Step-by-step flow
Step 1: The Handshake (/dev/kvm)
When runsc starts with --platform=kvm, the first thing the Sentry process does is open the /dev/kvm device on your host.
- It calls `KVM_CREATE_VM` to create a new "Virtual Machine" container in the Linux kernel.
- It calls `KVM_CREATE_VCPU` to create virtual CPUs.
- Crucially: To the host kernel, the Sentry is just a process using KVM. To the Sentry, these vCPUs are the "engines" it will use to run your application code.
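The handshake above can be sketched in a few lines of Python. This is a minimal illustration of the raw KVM ioctl sequence, not gVisor's actual Go code; the helper names `kvm_io` and `open_sandbox_vm` are made up for this example. The ioctl numbers are the ones defined in the Linux KVM headers.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte shared by all KVM requests


def kvm_io(nr):
    """Equivalent of the C macro _IO(KVMIO, nr): no direction bits, no size."""
    return (KVMIO << 8) | nr


KVM_GET_API_VERSION = kvm_io(0x00)
KVM_CREATE_VM = kvm_io(0x01)
KVM_CREATE_VCPU = kvm_io(0x41)


def open_sandbox_vm(path="/dev/kvm"):
    """Return the KVM, VM and vCPU file descriptors, or None if KVM is unusable."""
    try:
        kvm_fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None  # no KVM device, or no permission to open it
    fds = [kvm_fd]
    try:
        api_version = fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
        vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)     # 0 = default machine type
        fds.append(vm_fd)
        vcpu_fd = fcntl.ioctl(vm_fd, KVM_CREATE_VCPU, 0)  # 0 = id of the first vCPU
    except OSError:
        for fd in fds:
            os.close(fd)
        return None
    return {"api_version": api_version, "kvm_fd": kvm_fd,
            "vm_fd": vm_fd, "vcpu_fd": vcpu_fd}
```

Note that each step returns a new file descriptor: the VM is an fd obtained from `/dev/kvm`, and each vCPU is an fd obtained from the VM. On a host without KVM the function simply returns `None`.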
Step 2: Mapping the "World" (Memory)
The Sentry creates a memory map for the sandbox.
- It allocates chunks of its own memory (Host Virtual Address).
- It tells KVM: "Treat this chunk of my memory as the Guest Physical Memory for the sandbox."
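- KVM backs this mapping with the CPU's second-level address translation: EPT (Extended Page Tables) on Intel, or the equivalent NPT on AMD. This hardware feature gives the application its own private memory space, and the application can reach only the memory the Sentry has explicitly registered.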
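As a rough sketch of what "treat this chunk of my memory as guest physical memory" looks like at the API level: user space fills in a `struct kvm_userspace_memory_region` and passes it to the `KVM_SET_USER_MEMORY_REGION` ioctl on the VM file descriptor. The helper names below are hypothetical, and the example only builds the request; it does not talk to KVM.

```python
import ctypes
import mmap
import struct

KVMIO = 0xAE
# struct kvm_userspace_memory_region:
#   u32 slot, u32 flags, u64 guest_phys_addr, u64 memory_size, u64 userspace_addr
REGION_FORMAT = "=IIQQQ"
REGION_SIZE = struct.calcsize(REGION_FORMAT)


def iow(nr, size):
    """Equivalent of _IOW(KVMIO, nr, <type of `size` bytes>) on x86-64 Linux."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (KVMIO << 8) | nr


KVM_SET_USER_MEMORY_REGION = iow(0x46, REGION_SIZE)


def allocate_guest_memory(size):
    """Allocate anonymous, page-aligned memory and return (mapping, host address)."""
    mem = mmap.mmap(-1, size)
    host_addr = ctypes.addressof(ctypes.c_char.from_buffer(mem))
    return mem, host_addr


def describe_region(slot, guest_phys_addr, size, host_addr):
    """Pack the request: guest physical range -> host virtual address."""
    flags = 0
    return struct.pack(REGION_FORMAT, slot, flags, guest_phys_addr, size, host_addr)
```

With a real VM file descriptor, the packed bytes would be handed to `fcntl.ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region)`. The mapping object must stay alive for as long as the guest uses the memory.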
Step 3: Setting the "Trap" (Syscall Redirection)
In a normal VM, a system call inside the VM goes to the Guest Linux Kernel. gVisor doesn't have a Guest Kernel.
- The Sentry programs the vCPU's `LSTAR` register.
- In x86 architecture, the `LSTAR` register tells the CPU: "When someone hits the `SYSCALL` instruction, jump to this memory address."
- The Sentry points this to a tiny piece of gVisor code that triggers a VM-EXIT.
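For illustration, this is roughly how a user-space VMM could ask KVM to write a guest's `LSTAR` register, using the generic `KVM_SET_MSRS` ioctl. Whether gVisor sets the register through exactly this call is an assumption of the sketch, and `build_lstar_request` and the stub address are hypothetical. On real hardware `SYSCALL` also depends on `EFER.SCE` and the `STAR`/`FMASK` registers, which are left out here.

```python
import struct

KVMIO = 0xAE
MSR_LSTAR = 0xC0000082  # IA32_LSTAR: target RIP for SYSCALL in 64-bit mode

MSRS_HEADER = "=II"  # struct kvm_msrs: u32 nmsrs, u32 pad (entries follow)
MSR_ENTRY = "=IIQ"   # struct kvm_msr_entry: u32 index, u32 reserved, u64 data

# _IOW(KVMIO, 0x89, struct kvm_msrs); the size field covers only the header.
KVM_SET_MSRS = (1 << 30) | (struct.calcsize(MSRS_HEADER) << 16) | (KVMIO << 8) | 0x89


def build_lstar_request(entry_stub_addr):
    """Pack a one-entry kvm_msrs buffer that points LSTAR at the entry stub."""
    header = struct.pack(MSRS_HEADER, 1, 0)
    entry = struct.pack(MSR_ENTRY, MSR_LSTAR, 0, entry_stub_addr)
    return header + entry
```

With a real vCPU file descriptor, the buffer would be passed to `fcntl.ioctl(vcpu_fd, KVM_SET_MSRS, request)`.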
Step 4: Entering "Guest Mode" (KVM_RUN)
The Sentry is ready to run your app (e.g., Python).
- The Sentry loads the Python code into the "Guest" memory.
- The Sentry issues the `KVM_RUN` ioctl on the vCPU's file descriptor.
- The Context Switch: The physical CPU switches into "Guest Mode" (VMX non-root operation on Intel). The CPU is now executing the Python code at full hardware speed.
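A minimal sketch of one trip into guest mode, assuming `kvm_fd` and `vcpu_fd` come from a setup like Step 1 and that the vCPU's registers and memory are already configured. `KVM_RUN` blocks until the guest exits; the reason is then read from the `kvm_run` structure that KVM shares with user space through `mmap`. The function names are hypothetical and only a subset of exit reasons is listed.

```python
import fcntl
import mmap
import struct

KVMIO = 0xAE
KVM_GET_VCPU_MMAP_SIZE = (KVMIO << 8) | 0x04  # issued on the /dev/kvm fd
KVM_RUN = (KVMIO << 8) | 0x80                 # issued on a vCPU fd

# In struct kvm_run, the u32 exit_reason follows 8 bytes of flags and padding.
EXIT_REASON_OFFSET = 8
EXIT_REASONS = {0: "UNKNOWN", 1: "EXCEPTION", 2: "IO", 3: "HYPERCALL", 4: "DEBUG",
                5: "HLT", 6: "MMIO", 8: "SHUTDOWN", 9: "FAIL_ENTRY",
                17: "INTERNAL_ERROR"}


def exit_reason(run_area):
    """Decode the exit reason from a kvm_run buffer (mmap or bytes-like)."""
    (code,) = struct.unpack_from("=I", run_area, EXIT_REASON_OFFSET)
    return EXIT_REASONS.get(code, f"OTHER({code})")


def run_vcpu_once(kvm_fd, vcpu_fd):
    """Enter guest mode once and report why the CPU came back."""
    size = fcntl.ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0)
    run_area = mmap.mmap(vcpu_fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        fcntl.ioctl(vcpu_fd, KVM_RUN, 0)  # blocks while the guest executes
        return exit_reason(run_area)
    finally:
        run_area.close()
```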
Step 5: The System Call (The "Trap" Springs)
Your Python app tries to do something, like write("hello").
- The app executes the `SYSCALL` instruction.
- Because of Step 3, the CPU realizes: "I am in Guest Mode, and a syscall happened. I must exit."
- The CPU performs a VM-EXIT.
Step 6: The Return to the Sentry
The KVM_RUN call in the Sentry (from Step 4) finally returns.
- The Sentry process "wakes up" in the Host mode.
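- KVM reports an exit reason in the `kvm_run` structure it shares with the Sentry. KVM has no exit reason that literally means "system call"; combined with the entry stub from Step 3, the Sentry interprets the exit as: "The guest exited because it tried to make a system call."
- The Sentry looks at the vCPU's registers (RAX, RDI, etc.) to see which syscall it was (e.g., `write`).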
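The register inspection can be sketched as follows. On x86-64 Linux the syscall number is in RAX and the arguments are in RDI, RSI, RDX, R10, R8 and R9; the buffer layout matches `struct kvm_regs` as returned by the `KVM_GET_REGS` ioctl. The function names and the short syscall table are illustrative only.

```python
import struct

# Field order of struct kvm_regs on x86-64 (18 unsigned 64-bit values).
REG_NAMES = ("rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rsp", "rbp",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
             "rip", "rflags")
REGS_FORMAT = "=18Q"

# _IOR(0xAE, 0x81, struct kvm_regs)
KVM_GET_REGS = (2 << 30) | (struct.calcsize(REGS_FORMAT) << 16) | (0xAE << 8) | 0x81

# A few x86-64 Linux syscall numbers, enough for the example.
SYSCALL_NAMES = {0: "read", 1: "write", 2: "open", 3: "close", 60: "exit"}


def parse_regs(raw):
    """Turn a raw kvm_regs buffer into a {register name: value} dict."""
    return dict(zip(REG_NAMES, struct.unpack(REGS_FORMAT, raw)))


def decode_syscall(regs):
    """Return (syscall name, six raw arguments) following the x86-64 ABI."""
    nr = regs["rax"]
    args = tuple(regs[r] for r in ("rdi", "rsi", "rdx", "r10", "r8", "r9"))
    return SYSCALL_NAMES.get(nr, f"syscall_{nr}"), args
```

For a guest that executed `write(1, buf, 5)`, `decode_syscall` returns `"write"` with the file descriptor, buffer address and length as the first three arguments.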
Step 7: Emulation (The Work)
Now the Sentry (the "Guest Kernel" written in Go) does its job.
- It checks if the app is allowed to write to that file.
- It interacts with the Gofer if it needs to touch the host disk.
- It updates its internal "Virtual File Descriptor" state.
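A toy model of this emulation step, written in Python for brevity (the real Sentry is Go and far more involved). `ToySentry` and its file-descriptor table are hypothetical; the Gofer is left out entirely, so "writing" only appends to an in-memory buffer. Errors are returned as negative errno values, which is the Linux convention for the value placed in RAX.

```python
import errno


class ToySentry:
    """Keeps a virtual file-descriptor table and emulates write()."""

    def __init__(self):
        # fd -> state; fd 1 is writable, fd 3 was opened read-only.
        self.fds = {
            1: {"writable": True, "data": bytearray()},
            3: {"writable": False, "data": bytearray()},
        }

    def sys_write(self, fd, payload):
        entry = self.fds.get(fd)
        if entry is None or not entry["writable"]:
            # Linux reports EBADF for an unknown fd and for an fd
            # that is not open for writing.
            return -errno.EBADF
        entry["data"] += payload  # update the Sentry's internal state
        return len(payload)       # number of bytes written
```

The application never touches a host file descriptor here: the permission check and the state update both happen inside the Sentry's own data structures.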
Step 8: Re-entry
Once the work is done:
- The Sentry puts the "success" code (e.g., the number of bytes written) into the vCPU's RAX register.
- It calls `KVM_RUN` again.
- The CPU jumps back into Guest Mode, and the Python app continues from the very next instruction, never knowing it was momentarily paused and inspected.
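Steps 4 through 8 form a loop, which can be simulated without KVM at all. In the sketch below a Python generator stands in for the guest: each `yield` models a VM-EXIT carrying the register state, and the value sent back models the result the Sentry wrote into RAX before re-entering. Everything here (`toy_guest`, `sentry_loop`, passing the payload directly in `rsi` instead of a pointer) is a simplification for illustration.

```python
import errno

SYS_WRITE = 1
SYS_EXIT = 60


def toy_guest():
    """Stand-in for application code running in guest mode."""
    # write(1, "hello", 5); the result arrives as the value of the yield.
    written = yield {"rax": SYS_WRITE, "rdi": 1, "rsi": b"hello", "rdx": 5}
    # exit(0) if the write succeeded, exit(1) otherwise.
    yield {"rax": SYS_EXIT, "rdi": 0 if written == 5 else 1}


def sentry_loop(guest):
    """Run the guest until it exits; return (exit_status, bytes_written)."""
    output = bytearray()
    rax = None  # nothing to hand back on the very first entry
    while True:
        try:
            regs = guest.send(rax)  # models KVM_RUN: resume until the next exit
        except StopIteration:
            return None, bytes(output)
        if regs["rax"] == SYS_EXIT:
            return regs["rdi"], bytes(output)
        if regs["rax"] == SYS_WRITE:
            payload = regs["rsi"][: regs["rdx"]]
            output += payload
            rax = len(payload)      # success: bytes written goes into RAX
        else:
            rax = -errno.ENOSYS     # unknown syscall


if __name__ == "__main__":
    print(sentry_loop(toy_guest()))
```

The guest code never sees the loop: from its point of view, `write` simply returned 5.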
Why use KVM instead of Ptrace?
| Feature | Ptrace Platform | KVM Platform |
|---|---|---|
| Switching | Software-based (Signals/Context switches) | Hardware-based (CPU VM-EXIT/ENTRY) |
| Speed | Slow (4 context switches per syscall) | Fast (Bypasses much of host kernel) |
| Isolation | Standard Namespaces | Hardware Memory Isolation (EPT) |
| Compatibility | Works inside other VMs (Nested) | Requires Hardware Virtualization support |
How is it different from the normal VM on KVM?
The fundamental difference is in what is inside the virtual machine.
In a Normal VM (like QEMU/KVM, AWS EC2, or VMware), the hypervisor simulates a complete physical computer so a full OS can boot. In gVisor, KVM is used as a secure sandbox for a single application process.
The "Guest" Content
In a Normal VM, the Guest Kernel is inside the sandbox. In gVisor, the Guest Kernel (the Sentry) is outside the sandbox.
- Normal VM: You boot a Guest Kernel (e.g., a full Linux kernel) plus a whole user-space (Systemd, Bash, etc.). The application runs on top of that guest kernel.
- gVisor: There is no Guest Kernel inside the VM. The "Guest" is just your application (e.g., Python). The Sentry (which lives on the host) acts as the kernel, but it "reaches into" the KVM sandbox to manage the app.
The Syscall Target
This is the most critical technical difference.
- Normal VM: When an app makes a syscall, it stays inside the VM. The Guest Kernel handles it. KVM (and the host) are never even aware a syscall happened.
- gVisor: Every single syscall made by the app causes an immediate VM-EXIT. The app "breaks out" of the hardware sandbox, and the Sentry catches it on the host, inspects it, and decides what to do.
- In a normal VM, a VM-EXIT is a comparatively expensive event that hypervisors try to keep infrequent. In gVisor, it is the standard way every syscall is handled.
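The difference in exit behaviour can be made concrete with a small counting model. It is deliberately simplified: the trace format and `count_vm_exits` are invented for this example, and real VMs also exit for timers, interrupts, and other events that are ignored here. Each trace entry is a syscall name plus a flag saying whether a guest kernel would need to touch a virtual device to satisfy it.

```python
def count_vm_exits(syscalls, model):
    """Count VM-EXITs caused by a list of (name, needs_device) syscalls."""
    if model == "normal_vm":
        # The guest kernel handles the syscall itself; only the resulting
        # device access leaves the VM.
        return sum(1 for _name, needs_device in syscalls if needs_device)
    if model == "gvisor":
        # Every syscall leaves guest mode so the Sentry can handle it.
        return len(syscalls)
    raise ValueError(f"unknown model: {model}")


trace = [("getpid", False), ("write", True), ("read", False)]
normal_exits = count_vm_exits(trace, "normal_vm")
gvisor_exits = count_vm_exits(trace, "gvisor")
```

For this three-syscall trace the normal VM exits once and the gVisor model exits three times, which is why the cost of a single exit matters so much more for gVisor.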
Hardware Emulation
- Normal VM: KVM works with a tool like QEMU to provide "Virtual Hardware." The VM thinks it has a PCI bus, a real Intel network card, a BIOS/UEFI, and a Disk Controller.
- gVisor: There is zero hardware emulation. There is no virtual NIC, no virtual disk, and no BIOS. The Sentry simply maps memory segments into the KVM sandbox. If the app wants to "write to disk," it doesn't talk to a virtual disk controller; it makes a syscall that the Sentry intercepts.
Memory Management
- Normal VM: The VM is usually allocated a fixed "chunk" of RAM (e.g., 4GB). The host kernel sees one big block of memory used by the VM process.
- gVisor: Because it's a sandbox for a process, memory management is more dynamic. The Sentry can map and unmap memory for the application more like a traditional OS manages a process, though it still uses KVM's EPT (Extended Page Tables) to enforce the hardware boundary.
Booting Speed
- Normal VM: Slow. You have to wait for the virtual BIOS to initialize, the bootloader to run, the kernel to decompress, and systemd to start services. This takes seconds.
- gVisor: Near-instant. The Sentry sets up the KVM structures and immediately jumps the CPU to the application's first instruction. It "feels" like starting a container, not a VM.
Comparison
| Feature | Normal VM (QEMU/KVM) | gVisor (runsc --platform=kvm) |
|---|---|---|
| Guest OS | Full Linux/Windows Kernel | None (Sentry handles syscalls) |
| Syscall Handling | Internal (handled inside the VM) | External (triggers a VM-EXIT) |
| Hardware | Virtual NIC, PCI, BIOS, Disks | None (Only CPU and RAM) |
| Goal | Run a separate OS | Sandbox a single process |
| Isolation | Hardware-level | Hardware-level |
| Analogy | A Whole House (with its own plumbing). | A Glass Isolation Room in an existing hospital. |
Why does gVisor do this?
By using KVM in this "weird" way, gVisor gets the security of a VM (the hardware enforces the boundary) but keeps the agility of a container (no extra kernel to manage, fast startup, and lower memory overhead). It uses KVM as a CPU-level filter rather than a machine emulator.
Which parts of KVM are used by gVisor? Which parts are not?
gVisor uses KVM as a "Thin Sandbox." It takes the isolation power of KVM (the hardware-enforced memory walls) but throws away the complexity of KVM (the hardware emulation). This allows gVisor to be much faster and lighter than a traditional VM while maintaining the same level of hardware-backed security.
While a normal VMM (Virtual Machine Monitor) like QEMU or Firecracker tries to use KVM to build a "Virtual Computer," gVisor uses KVM only to build a "Virtual CPU Jail."
What gVisor USES from KVM (The Core)
gVisor uses the parts of the KVM API that handle the CPU and RAM. These are the "raw" virtualization features:
- VCPU Creation: It calls `KVM_CREATE_VCPU`. It needs KVM to manage the hardware registers (RAX, RIP, etc.) of the application thread.
- Memory Mapping (EPT): It uses KVM to set up Extended Page Tables. This is the hardware feature that prevents the application from seeing any memory that the Sentry hasn't explicitly given it.
- The Run Loop: It uses `KVM_RUN` to tell the physical CPU: "Go into Guest Mode and execute this code until something happens."
What gVisor SKIPS (The "Normal VM" stuff)
In a normal VM, KVM provides a massive amount of infrastructure to simulate a motherboard. gVisor ignores almost all of it:
- Interrupt Controllers (APIC/PIC): Normal VMs need complex logic to handle hardware interrupts (like a mouse click or a disk finishing a task). gVisor doesn't have virtual hardware, so it doesn't need virtual interrupts.
- I/O Port Emulation (`IN`/`OUT`): In a normal VM, the Guest OS talks to hardware using I/O ports. KVM intercepts these. gVisor applications don't use I/O ports; they use `SYSCALL`.
- BIOS/UEFI: gVisor doesn't "boot." There is no virtual firmware.
- Device Bus (PCI/USB): There are no virtual buses in gVisor.
The Technical "Twist": Syscalls vs. I/O
This is the most important technical distinction between gVisor and QEMU:
- Normal VMM (QEMU): Configures KVM so that the Guest OS can handle its own system calls. KVM only exits to QEMU when the Guest OS tries to talk to Hardware (like a Virtual Disk).
- gVisor VMM (`runsc`): Configures the vCPU so that every System Call ends in a VM-EXIT.
  - As described in Step 3, the `SYSCALL` entry point (`LSTAR`) leads to a small stub that forces the exit, so as soon as the app hits the `SYSCALL` instruction, control is handed back to the Sentry.
Comparison: Role of KVM
| Function | Normal VM (QEMU/Firecracker) | gVisor (runsc) |
|---|---|---|
| KVM Usage | Full (CPU + RAM + Interrupts + I/O) | Minimal (CPU + RAM only) |
| Virtual Hardware | Yes (NIC, Disk, GPU, etc.) | None |
| Guest OS | Full Kernel (Linux/Windows) | None (Only the App) |
| VM-EXIT Trigger | Hardware I/O / Faults | System Calls |
| The "VMM" | QEMU (Large, C) | Sentry (Slim, Go) |
Is gVisor a VMM?
Yes. In the KVM model, runsc is the VMM.
In the Linux kernel's eyes, runsc is doing exactly what QEMU does: it is a user-space process that opened /dev/kvm and is calling ioctl commands to run code in a virtual context.
However, because runsc doesn't have to simulate a hard drive, a network card, or a power button, its "VMM" code is much smaller than QEMU's, which leaves a smaller attack surface. It uses KVM as a system call interceptor rather than a hardware emulator.