
Linux - KVM

How does KVM work?

KVM (Kernel-based Virtual Machine) is a unique virtualization technology because it doesn't try to build a hypervisor from scratch. Instead, it turns the Linux kernel itself into a hypervisor.

When you load the KVM module, the Linux kernel stops being just an Operating System and starts acting as a "Type 1" (bare-metal) hypervisor.

1. The Core Architecture: KVM + QEMU

KVM rarely works alone. It is almost always paired with QEMU (Quick Emulator). They split the work of running a Virtual Machine (VM):

  • KVM (The Engine): Runs inside the Linux kernel. It handles the "heavy lifting"—managing the CPU and memory. It uses the physical processor's hardware extensions to run guest code at near-native speed.
  • QEMU (The Hardware Store): Runs in user-space (like a normal app). It emulates the "peripherals" that a computer needs but that the CPU doesn't provide, such as the motherboard, BIOS, disk controllers, and USB ports.

2. How the CPU Works (Hardware Extensions)

In the past, virtualization was slow because the hypervisor had to "translate" every instruction the guest OS tried to run. Modern KVM uses hardware-assisted virtualization (Intel VT-x or AMD-V).

The CPU essentially gains two "planes" of existence:

  • Root Mode: Where the Host Linux Kernel (the hypervisor) lives. It has full control over the hardware.
  • Non-Root Mode: Where the Guest VM lives. The guest thinks it's running on bare metal, but it is actually restricted.

The KVM Loop:

  1. VM Entry: QEMU tells KVM to start the VM. The kernel executes a VMLAUNCH/VMRESUME instruction, switching the CPU into "Non-Root" mode.
  2. Native Execution: The guest OS runs instructions directly on the physical CPU at full speed.
  3. VM Exit: If the guest tries to do something sensitive (like talking to a hard drive or changing hardware settings), the CPU triggers a VMEXIT.
  4. Handling: The CPU switches back to "Root" mode. KVM looks at why the guest stopped, handles the request (often by asking QEMU to emulate the hardware response), and then performs another VM entry to let the guest continue.

3. Memory and I/O (VirtIO)

  • Memory: KVM uses hardware nested paging: EPT (Extended Page Tables) on Intel, NPT (Nested Page Tables) on AMD. This allows the hardware to map a "Guest Physical Address" directly to a "Host Physical Address" without the kernel having to intervene in every memory access, which drastically improves performance.
  • I/O (VirtIO): Traditional hardware emulation (pretending to be a specific real-world network card) is slow. KVM uses VirtIO, a "paravirtualized" standard. The guest OS knows it is a VM and uses a special driver to send data to the host via a shared memory ring, bypassing the slow "VM Exit" loop for every single data packet.

4. The "VM as a Process" Model

One of KVM's greatest strengths is that every VM is just a standard Linux process. Because a VM is "just a process" to the host:

  • Scheduling: The standard Linux scheduler decides which VM gets CPU time, just like it does for Chrome or Spotify.
  • Security: You can use standard Linux tools like top, kill, and SELinux to monitor and secure your VMs.
  • Memory Management: KVM can use Linux features like KSM (Kernel Same-page Merging), which finds identical memory pages across different VMs and merges them to save RAM.

Summary Comparison

Feature         | Traditional Emulation      | KVM Virtualization
Speed           | Slow (Software-translated) | Near-Native (Hardware-executed)
CPU Requirement | Any                        | Must support Intel VT-x or AMD-V
Drivers         | Generic (Realtek, IDE)     | Optimized (VirtIO)
Management      | Proprietary                | Standard Linux tools (virsh, htop)

In short: KVM works by letting the CPU do the hard work of running code, while the Linux kernel manages the resources and QEMU provides the "fake" hardware for the guest to talk to.

Is KVM a Loadable Kernel Module?

KVM (Kernel-based Virtual Machine) is a set of loadable kernel modules in Linux.

KVM consists of two main types of modules that you can see if you run lsmod | grep kvm on a Linux machine:

  • kvm.ko: The core, processor-agnostic module that provides the main virtualization infrastructure.
  • Processor-Specific Modules: These handle the hardware-specific virtualization instructions:
    • kvm-intel.ko: For Intel processors (using VT-x).
    • kvm-amd.ko: For AMD processors (using AMD-V).

How to check if kvm is enabled?

Use kvm-ok to check whether KVM is enabled:

$ kvm-ok

If kvm-ok is not available, you can get it by installing cpu-checker:

$ sudo apt install cpu-checker

Alternatively, check without kvm-ok:

$ ls /dev/kvm
$ lsmod | grep kvm

Check if CPU virtualization is enabled:

# Intel
# vmx=Virtual Machine Extensions
$ cat /proc/cpuinfo | grep vmx

# AMD
# svm=Secure Virtual Machine
$ cat /proc/cpuinfo | grep svm
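
You can also query the same CPUID bits from C. Below is a minimal sketch, assuming GCC or Clang's <cpuid.h> (which defines bit_VMX for leaf 1 and bit_SVM for extended leaf 0x80000001):

// Detect VT-x / AMD-V via CPUID (x86 only; build with gcc or clang).
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    __get_cpuid(1, &eax, &ebx, &ecx, &edx);           // standard leaf 1
    printf("Intel VT-x (vmx): %s\n", (ecx & bit_VMX) ? "yes" : "no");

    __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);  // AMD extended leaf
    printf("AMD-V (svm):      %s\n", (ecx & bit_SVM) ? "yes" : "no");
    return 0;
}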

How to enable KVM?

If KVM is not running, make sure the virtualization features are enabled in the BIOS:

  • SVM (Secure Virtual Machine) by AMD
  • Virtualization Technology by Intel
  • IOMMU: Input–output memory management unit

KVM under the hood

To understand KVM (Kernel-based Virtual Machine) under the hood, you have to stop thinking of it as a separate "emulator" and start seeing it as a Kernel Extension that adds a new "mode" to the CPU.

1. The Entry Point: /dev/kvm

KVM follows the Unix philosophy that "everything is a file." It exposes itself to the system as a character device: /dev/kvm.

When a userspace program (like QEMU, Firecracker, or Google’s Cloud VMM) wants to start a VM, it doesn't use a specialized start_vm syscall. Instead, it uses the open() syscall on /dev/kvm to get a file descriptor, and then it lives almost entirely inside ioctl() calls.

Devices are exposed to the VM through KVM.

Some devices may be paravirtualized:

  • Networking (virtio-net)
  • Disk (virtio-scsi)

Some devices may be emulated:

  • Disk (NVMe)
  • Debugging (Serial Port)
  • vTPM (Trusted Platform Module)

Some devices may be passed through (a passthrough device gives the guest “full” access to a physical device):

  • GPU, TPU

2. The Hierarchy of ioctl Calls

The "Magic" of KVM happens through three levels of ioctl descriptors:

  1. System Level (kvm_fd): You call ioctl(kvm_fd, KVM_CREATE_VM). This tells the kernel to allocate a new VM instance. It returns a VM File Descriptor.
  2. VM Level (vm_fd): You call ioctl(vm_fd, KVM_CREATE_VCPU). This creates a virtual CPU. It returns a VCPU File Descriptor.
  3. VCPU Level (vcpu_fd): This is where the actual execution happens. You call ioctl(vcpu_fd, KVM_RUN). This is the trigger that tells the physical CPU to switch into "Guest Mode."
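
In C, the three levels look roughly like the sketch below (error handling omitted; assumes <linux/kvm.h> from the kernel headers). Note that because no memory or registers have been set up yet, the final KVM_RUN would exit immediately:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void) {
    int kvm_fd  = open("/dev/kvm", O_RDWR);          // 1. system level
    int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);   // 2. VM level
    int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);  // 3. VCPU level
    ioctl(vcpu_fd, KVM_RUN, 0);                      // 4. enter guest mode
    return 0;
}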

Read more: ioctl.

3. Memory Management: mmap is King

KVM does not have its own memory manager. It relies on the standard Linux memory management subsystem.

  • Allocation: The userspace VMM (e.g., QEMU) allocates memory for the VM using a standard mmap() of anonymous memory (RAM).
  • Registration: The VMM then tells KVM about this memory using the KVM_SET_USER_MEMORY_REGION ioctl.
  • Shadow Paging / EPT: KVM uses Extended Page Tables (EPT) (on Intel) or Nested Page Tables (NPT) (on AMD). This allows the hardware to map the Guest's "Physical" memory addresses directly to the Host's "Physical" addresses.

Because it's just standard Linux memory, features like KSM (Kernel Samepage Merging) can scan the RAM and de-duplicate identical pages across multiple VMs, saving massive amounts of memory.
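
A minimal sketch of the allocate-then-register pattern, assuming a vm_fd obtained from KVM_CREATE_VM (the slot number and guest physical address 0 are illustrative):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void register_ram(int vm_fd, size_t size) {
    // Allocation: plain anonymous mmap in the VMM's own address space.
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Registration: guest-physical address 0 now maps to this buffer.
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,
        .memory_size     = size,
        .userspace_addr  = (unsigned long)ram,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}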

4. The "KVM Run Loop" (How code actually executes)

When the VMM calls ioctl(vcpu_fd, KVM_RUN), the CPU transitions from Host Mode to Guest Mode.

  1. Execution: The guest OS runs directly on the hardware at native speed.
  2. The Trap (VM-Exit): The guest stays in Guest Mode until it does something it isn't allowed to do—like accessing a hardware port or a specific memory-mapped I/O region.
  3. Handling: The CPU hardware triggers a VM-Exit. The ioctl(KVM_RUN) call finally returns in userspace.
  4. Emulation: The VMM (QEMU) looks at the "Exit Reason." If the guest tried to write to a disk, QEMU performs the write to a .qcow2 file on the host's Linux filesystem.
  5. Re-entry: QEMU calls ioctl(vcpu_fd, KVM_RUN) again, and the loop continues.
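
In C, the VMM side of this loop looks roughly like the sketch below, assuming a vcpu_fd and a struct kvm_run that was mmap'ed from it (its size comes from the KVM_GET_VCPU_MMAP_SIZE ioctl); only two exit reasons are handled:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void run_loop(int vcpu_fd, struct kvm_run *run) {
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);        // blocks until a VM-Exit
        switch (run->exit_reason) {
        case KVM_EXIT_IO:                  // guest touched an I/O port
            // emulate the device here, then loop to re-enter the guest
            break;
        case KVM_EXIT_HLT:                 // guest executed HLT
            return;
        default:
            printf("unhandled exit reason: %u\n", run->exit_reason);
            return;
        }
    }
}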

5. Key Linux Features KVM Depends On

Beyond ioctl and mmap, KVM leverages these specific kernel features:

A. The Linux Scheduler (CFS/EEVDF)

Since each VCPU is just a standard Linux thread, the kernel scheduler handles them like any other process. If you have 4 VCPUs, the Linux kernel sees 4 threads and decides which physical cores they should run on. This is why KVM performance is so high—it benefits from 30+ years of scheduler optimization.

B. VirtIO (Paravirtualization)

Instead of emulating an ancient, slow IDE disk drive, KVM uses VirtIO. This is a shared-memory transport.

  • The Guest and Host share a "ring buffer" in RAM (using vring).
  • The Guest puts data in the ring and sends an interrupt.
  • The Host reads it directly.
  • This avoids the expensive "VM-Exit" cycle for every byte of data.
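
The ring idea can be illustrated with a toy single-producer/single-consumer queue. This is only conceptually similar to a vring; real virtio adds descriptor tables, available/used rings, and memory barriers:

#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256                      // power of two

struct ring {
    volatile uint32_t head;                // advanced by the guest
    volatile uint32_t tail;                // advanced by the host
    void *items[RING_SIZE];
};

// Guest side: publish a buffer without leaving Guest Mode.
static int ring_push(struct ring *r, void *item) {
    if (r->head - r->tail == RING_SIZE)
        return -1;                         // ring is full
    r->items[r->head % RING_SIZE] = item;
    r->head++;                             // real code needs a barrier here
    return 0;
}

// Host side: drain buffers directly from the shared memory.
static void *ring_pop(struct ring *r) {
    if (r->tail == r->head)
        return NULL;                       // ring is empty
    void *item = r->items[r->tail % RING_SIZE];
    r->tail++;
    return item;
}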

C. Eventfd and Irqfd

To handle interrupts efficiently, KVM uses eventfd.

  • eventfd is a syscall that creates a file descriptor for event notification.
  • KVM uses this to allow userspace to "signal" an interrupt into the guest without having to do a full context switch.
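
A sketch of the irqfd pattern, assuming a vm_fd whose in-kernel interrupt controller was created with KVM_CREATE_IRQCHIP, and an illustrative GSI (guest interrupt line) number:

#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void wire_guest_irq(int vm_fd, unsigned int gsi) {
    int efd = eventfd(0, 0);               // plain event-notification fd

    struct kvm_irqfd irqfd = {
        .fd  = efd,
        .gsi = gsi,                        // guest interrupt line
    };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);       // kernel now watches efd

    uint64_t one = 1;
    write(efd, &one, sizeof(one));         // injects the interrupt without
                                           // a round trip through userspace
}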

D. Cgroups and Namespaces

Because a VM is just a process, you can put a KVM VM inside a Cgroup.

  • Want to limit a VM to exactly 10% of the CPU? Use Cgroups.
  • Want to hide the VM from other processes? Use Namespaces.
  • This is how "MicroVMs" like AWS Firecracker or Google’s gVisor/Cloud VMM provide such high security—they wrap a KVM process in standard Linux container isolation.

Summary Table: The KVM "Syscall Stack"

Action          | Syscall Used                       | Result
Initialize KVM  | open("/dev/kvm")                   | Obtain KVM Handle
Create VM       | ioctl(KVM_CREATE_VM)               | Obtain VM Handle
Allocate VM RAM | mmap()                             | Userspace memory buffer
Map RAM to VM   | ioctl(KVM_SET_USER_MEMORY_REGION)  | Hardware-assisted memory mapping
Create VCPU     | ioctl(KVM_CREATE_VCPU)             | Obtain VCPU Handle
Execute Guest   | ioctl(KVM_RUN)                     | CPU enters "Guest Mode"
Handle I/O      | read() / write()                   | VMM performs disk/network actions

Conclusion: KVM is successful because it is lazy. It doesn't reinvent scheduling, memory management, or hardware drivers. It simply provides the "bridge" (via ioctl) to let the Linux kernel manage those things while the CPU hardware runs the guest code.

Is KVM always backed by hardware?

The short answer is yes. KVM (Kernel-based Virtual Machine) is fundamentally designed to be a hardware-assisted virtualization technology.

Without the specific "Virtualization Extensions" in your physical CPU, KVM cannot function.

The Hardware Requirement

KVM requires one of two specific hardware features built into the silicon of your processor:

  • Intel VT-x (Virtualization Technology)
  • AMD-V (AMD Virtualization)

These aren't just "software features"; they are physical circuits and extra CPU instructions (like VMLAUNCH, VMRESUME, and VMCALL) that allow the processor to switch between "Host" and "Guest" modes at the hardware level.

What happens if you don't have the hardware?

If you try to use KVM on a CPU that doesn't support it (or where it is disabled in the BIOS):

  1. The Kernel Module fails: The kvm_intel or kvm_amd kernel modules will refuse to load.
  2. The Device is missing: The file /dev/kvm will not exist.
  3. The App fails: If you run runsc --platform=kvm, it will immediately crash with an error saying it cannot open /dev/kvm.

The "Software" Alternative: Emulation

People often get confused because tools like QEMU can run VMs without hardware. But there is a massive difference:

  • KVM (Hardware-Assisted): The CPU runs the guest code directly. If the guest wants to add 1 + 1, the physical CPU adds 1 + 1. This is near-native speed.
  • QEMU TCG (Software Emulation): If you don't have KVM, QEMU uses a "Tiny Code Generator." It reads the guest's 1 + 1 instruction, translates it into several host instructions, and executes them. This is 10x to 100x slower than KVM.

gVisor does not support software emulation. It only supports KVM (Hardware) or Systrap/Ptrace (Software-based syscall trapping).

The "Edge Case": Nested Virtualization

You might wonder: "I'm running a VM in Google Cloud, and I can run gVisor with KVM inside it. Where is the hardware?"

This is called Nested Virtualization.

  1. Level 0: The physical Intel/AMD CPU in the data center.
  2. Level 1: The Cloud Provider's hypervisor. It "fakes" the VT-x instructions and passes them down to the real hardware.
  3. Level 2: Your VM, which "sees" the fake VT-x and uses it to run KVM.

Even in this case, there is real hardware at the bottom of the chain. If the physical CPU at Level 0 didn't have VT-x, the whole tower would collapse.

The flow of using /dev/kvm

When a program like QEMU, Firecracker, or Android Emulator wants to create a Virtual Machine, it follows a specific flow involving /dev/kvm.

The Opening: Gaining Access

The process begins with a standard file operation.

  • The System Call: The Virtual Machine Monitor (VMM), like QEMU, calls fd = open("/dev/kvm", O_RDWR);.
  • The Permission Check: The kernel checks if the user has permission. This is why you often have to add your user to the kvm group.
  • The Result: The kernel returns a File Descriptor (FD). At this stage, "opening" the device simply means the program has a "handshake" with the KVM kernel module.

The Configuration: The ioctl Dance

Unlike a text file where you use read() and write(), you interact with /dev/kvm using ioctl() (Input/Output Control). This is how you send complex commands to a device driver.

The flow moves in layers:

A. Create the VM

The program says: "I have the KVM FD, now give me an actual Virtual Machine."

  • Command: vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
  • What happens: The kernel allocates memory structures to keep track of a new VM. It returns a new File Descriptor (vm_fd) specifically for that VM.

B. Setup Memory

A VM needs RAM. The VMM (QEMU) allocates a chunk of its own memory (standard RAM) and tells the KVM device to map it.

  • Command: ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  • What happens: The device driver tells the CPU: "When the guest tries to access memory address X, look at the host's physical memory address Y."

C. Create the Virtual CPU (vCPU)

Now the VM needs a processor to run code.

  • Command: vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
  • What happens: The kernel creates a virtual processor. It returns yet another File Descriptor (vcpu_fd) used to control that specific CPU core.

The Execution: Putting the CPU into "Guest Mode"

This is the most critical part of the "open" device's lifecycle.

  1. The Run Command: The program calls ioctl(vcpu_fd, KVM_RUN, 0);.
  2. The Context Switch: The Linux kernel performs a "VM Entry" (via VMLAUNCH/VMRESUME). It literally hands control of the physical CPU core over to the Guest OS.
  3. Hardware Speed: The code inside the VM now runs at native hardware speed. The "device" (the KVM driver) is essentially standing back and letting the hardware do the work.

The Exit: Why the Device "Wakes Up"

The VM will eventually do something it isn't allowed to do directly (like talk to a hard drive or change power settings).

  1. The Trap: The hardware triggers a "VMEXIT." The CPU stops the guest and jumps back into the Linux Kernel.
  2. The Return: The ioctl(KVM_RUN) call, which had been "hanging" while the VM was running, finally returns to the VMM (QEMU).
  3. The Handling: QEMU looks at why the VM stopped. If the VM tried to send data to a "virtual" disk, QEMU performs that task in regular Linux user-space and then calls KVM_RUN again to resume the VM.

Closing: Tearing it down

When you shut down the VM:

  1. The program calls close(vcpu_fd), then close(vm_fd), and finally close(kvm_fd).
  2. Reference Counting: As each FD is closed, the Linux kernel decrements a counter.
  3. Cleanup: When the final count hits zero, the KVM driver tells the CPU to release all virtualization resources, clears the memory mappings, and frees up the RAM.

Summary of the Flow

  1. Open /dev/kvm: Get a handle to the virtualization system.
  2. ioctl (Create VM): Create the "container" for the VM.
  3. ioctl (Create vCPU): Create the "engine" for the VM.
  4. ioctl (KVM_RUN): The "open" device switches the hardware into a special state to run the guest.
  5. Close: Release the hardware features back to the host OS.

In the context of Linux, /dev/kvm being "open" means a process has successfully reserved the right to use the CPU's virtualization hardware and has active memory structures in the kernel managing that state.

VMEXIT

Who defines VMEXIT?

VMEXIT is defined by the CPU hardware architects (Intel and AMD), not by the KVM software.

KVM is the software that handles the VMEXIT, but the event itself is a hardwired feature of the physical processor.

Intel and AMD created virtualization extensions (Intel VT-x and AMD-V, also known as SVM) to allow a CPU to switch between "Host mode" and "Guest mode."

  • Intel's Definition: In Intel’s manuals, it is called "VM Exit." It occurs when the processor transitions from "non-root operation" (the Guest) to "root operation" (the Host/KVM).
  • AMD's Definition: In AMD’s manuals, it is called "#VMEXIT."

When is VMEXIT triggered?

The CPU designers decided exactly which actions trigger a VMEXIT. For example, the hardware is hardcoded to trigger a VMEXIT if the guest tries to:

  • Execute the HLT (halt) instruction.
  • Access hardware ports directly (IN/OUT instructions).
  • Access memory that hasn't been assigned to it.
  • Receive an external interrupt (like a mouse click or a timer tick).

The Data Structure (VMCS / VMCB)

When a VMEXIT happens, the CPU needs to tell the software (KVM) why it stopped. The hardware defines a specific data structure in RAM for this:

  • Intel: Uses the VMCS (Virtual Machine Control Structure).
  • AMD: Uses the VMCB (Virtual Machine Control Block).

The CPU hardware automatically writes an "Exit Reason" code into these structures. For example:

  • Reason 0x02: The guest caused a Triple Fault.
  • Reason 0x0A: The guest tried to execute the CPUID instruction.

Does VMEXIT exit to userspace or kernel?

A VMEXIT always exits first to the host kernel.

Because the CPU's "Guest Mode" is a hardware state, only the kernel (running in the most privileged Ring 0) has the authority to receive and decode the initial exit signal from the processor. Once the kernel takes control, it decides whether to handle the event itself or pass it further up to userspace (e.g., QEMU).

How does KVM handle the exit?

In the KVM source code, there is a large "switch" statement (often found in arch/x86/kvm/vmx/vmx.c). It looks roughly like this:

switch (exit_reason) {
    case EXIT_REASON_HLT:
        return handle_halt(vcpu);
    case EXIT_REASON_IO_INSTRUCTION:
        return handle_io(vcpu);
    case EXIT_REASON_CPUID:
        return handle_cpuid(vcpu);
    // ... many more cases ...
}

How do gVisor and QEMU use KVM differently?

To understand the difference, you first have to look at what sits inside the virtual machine. Both use KVM, but they use it to build two completely different "worlds."

TL;DR:

  • QEMU: the guest kernel (Linux kernel) runs inside the VM.
  • gVisor: the guest kernel (Sentry) runs outside the VM.

QEMU: The "Whole House" Approach

QEMU uses KVM to create a Full Virtual Machine. It wants to trick an entire Operating System into thinking it is running on real hardware.

  • What’s inside the VM: A Bootloader, a Guest Kernel (e.g., Windows or a different Linux), and multiple Applications.
  • What KVM handles: Hardware events.
  • The VMEXIT Trigger: In QEMU, a VMEXIT happens when the Guest Kernel tries to talk to hardware.
    • Example: The Guest Kernel tries to send data to a disk or a network card. The CPU triggers a VMEXIT, KVM catches it, and QEMU pretends to be a SATA controller or an Intel Ethernet card to handle the request.
  • Handling System Calls: When an application inside QEMU makes a system call (write()), it does not cause a VMEXIT. The Guest Kernel handles it entirely inside the VM.

gVisor: The "Glass Box" Approach

gVisor uses KVM to create a Sandboxed Process. It doesn't want to run a whole OS; it only wants to run one specific application (like a Python script or a Web Server) securely.

  • What’s inside the VM: Just the Application. There is no Guest Kernel inside the VM.
  • What KVM handles: The System Call boundary.
  • The VMEXIT Trigger: In gVisor, a VMEXIT happens every time the Application tries to talk to the Kernel.
    • Example: The application tries to call open(). gVisor has configured the vCPU so that the SYSCALL instruction itself triggers a VMEXIT. KVM catches it, and the gVisor Sentry (acting as a "guest kernel" living outside the VM) handles the request.
  • Handling System Calls: This is the primary reason gVisor uses KVM. It uses the hardware to "trap" the application whenever it tries to do anything outside its own memory.

Key Differences at a Glance

Feature              | QEMU + KVM                             | gVisor + KVM
What is Virtualized? | The Hardware (CPU, RAM, NIC, Disk)     | The Linux Kernel API (Syscalls)
Guest Kernel?        | Yes. A full kernel runs inside.        | No. gVisor is the kernel (running outside).
SYSCALL Instruction  | Handled by the Guest Kernel (No Exit). | Triggers a VMEXIT to the Sentry.
Primary Goal         | Compatibility (Run any OS).            | Security (Isolate a single process).
Memory Footprint     | Large (Kernel + OS overhead).          | Small (Just the app + Sentry).

How do they control when to trigger a VMEXIT?

QEMU: When an app in QEMU calls write(), the CPU looks at LSTAR, sees the Guest Linux Kernel's address, and jumps there. The CPU stays in "Guest Mode." No VMEXIT occurs.

gVisor (The "Missing Kernel" Trick): gVisor sets the LSTAR register (the syscall jump target) to an address that is not mapped in the Guest's memory. gVisor tells the hardware: "If there is a Page Fault (the CPU tries to access memory that isn't there), trigger a VMEXIT."

The Result:

  • The Application calls SYSCALL.
  • The CPU tries to jump to the address in LSTAR.
  • The CPU realizes that address doesn't exist (a Page Fault).
  • The CPU looks at the "Control Panel," sees that Page Faults require a VMEXIT, and stops the VM.
  • The gVisor Sentry wakes up, looks at the registers, and says: "Ah, I see you were trying to make a syscall. I'll handle that for you."

How is it implemented in gVisor?

In the gVisor source code (specifically in the pkg/sentry/platform/kvm directory), gVisor defines how the virtual CPU (vCPU) should behave.

  • LSTAR Setup: gVisor code sets the LSTAR register (the register that defines where the CPU jumps when a SYSCALL happens). It points this register to a specific "dead" address or a memory page that has restricted permissions.
  • Code Location: Look at pkg/sentry/platform/kvm/address_space.go and bluepill.go. These files contain the logic that determines how memory is mapped for the guest.

The Command Level: The ioctl System Call

Once gVisor has decided on its "trap" strategy, it must tell the Linux Kernel. It does this by calling an ioctl on the File Descriptor it got when it opened /dev/kvm.

There are two specific calls used for this:

  1. KVM_SET_MSRS: gVisor uses this to write the "invalid" jump address into the guest's LSTAR register.
  2. KVM_SET_GUEST_DEBUG: This is a crucial one. KVM provides a "Guest Debug" API that allows a program (gVisor) to tell the kernel: "I want you to intercept specific exceptions (like Page Faults) and hand control back to me."
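
A sketch of the KVM_SET_MSRS step, using a made-up helper and a deliberately unmapped, illustrative address (the exact address gVisor uses is an internal detail):

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MSR_LSTAR 0xC0000082U              // 64-bit SYSCALL entry-point MSR

void point_lstar_at_nothing(int vcpu_fd) {
    struct {
        struct kvm_msrs      header;
        struct kvm_msr_entry entry;
    } msrs = {
        .header = { .nmsrs = 1 },
        .entry  = {
            .index = MSR_LSTAR,
            .data  = 0xffffdead00000000ULL, // illustrative unmapped address
        },
    };
    // Any SYSCALL now faults on instruction fetch -> Page Fault -> VMEXIT.
    ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}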

The Kernel Level: The KVM Module

Inside the Linux kernel source (in arch/x86/kvm/vmx/vmx.c), KVM receives that ioctl. It translates gVisor's request into a hardware-readable format.

KVM updates a bitmask in the VMCS (Virtual Machine Control Structure) known as the Exception Bitmap.

The Hardware Level: The VMCS Exception Bitmap

This is the "Where" at the physical level. The Exception Bitmap is a 32-bit field in the CPU's memory:

  • Each bit corresponds to a different type of processor exception (e.g., Bit 14 is Page Faults, Bit 3 is Breakpoints).
  • If Bit 14 is set to 1: The hardware is hardwired to trigger a VMEXIT every time a Page Fault occurs inside the Guest.
  • If Bit 14 is set to 0: The hardware tries to let the Guest's own IDT (Interrupt Descriptor Table) handle the fault.
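
The bit arithmetic itself is simple; this standalone snippet only shows the value that "Bit 14 set" corresponds to (the real write happens inside KVM via a VMWRITE to the VMCS):

#include <stdio.h>
#include <stdint.h>

#define PF_VECTOR 14                       // x86 exception vector for #PF

int main(void) {
    uint32_t exception_bitmap = 0;
    exception_bitmap |= UINT32_C(1) << PF_VECTOR;            // trap page faults
    printf("exception bitmap: 0x%08x\n", exception_bitmap);  // 0x00004000
    return 0;
}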

The Full Chain of Command

  1. gVisor (User Space): "I want to catch syscalls. I'll point the syscall entry point to a non-existent memory address."
  2. KVM ioctl (The Bridge): gVisor calls KVM_SET_GUEST_DEBUG with the KVM_GUESTDBG_ENABLE flag set.
  3. KVM Module (Kernel Space): Receives the call and says, "Understood. I will now modify the hardware VMCS for this vCPU."
  4. VMCS (Hardware Control): KVM writes a 1 to Bit 14 of the Exception Bitmap in the VMCS.
  5. The CPU (Physical Hardware): Now, every time the application executes a SYSCALL, it jumps to that bad address, triggers a Page Fault, sees the 1 in the Exception Bitmap, and instantly performs a VMEXIT.

KVM/VFIO

To understand KVM/VFIO, it is best to look at them as a duo that allows a Virtual Machine (VM) to act like a physical computer by giving it direct control over hardware.

In the world of Linux virtualization, KVM is the engine, and VFIO is the bridge that connects the VM directly to physical hardware (like a GPU).

The Problem: While KVM is great at virtualizing the CPU and RAM, it's not great at virtualizing complex hardware like high-end Graphics Cards (GPUs). Normally, the VM has to use a "virtual" slow driver, which is why most VMs have terrible 3D performance.

What is VFIO? (The Passthrough)

VFIO (Virtual Function I/O) is a framework that allows you to take a physical PCIe device (GPU, Network Card, NVMe drive) and "pass it through" to a VM.

  • The Mechanism: Instead of the Linux host OS using the device, VFIO "hides" the device from the host and hands the keys directly to the VM.
  • The Result: The VM sees the actual hardware. If you pass through an NVIDIA RTX 4090, the Windows VM sees a real RTX 4090, installs the real NVIDIA drivers, and performs at 95–99% of native speed.

Key Components (The "How it Works")

To make KVM/VFIO work, several technologies must cooperate:

IOMMU (The Security Guard)

This is the most critical requirement. IOMMU (Intel VT-d or AMD-Vi) is a hardware feature on your motherboard/CPU.

  • In a normal system, hardware devices can access any part of system RAM. This is dangerous in a VM.
  • IOMMU restricts a device so it can only access the memory assigned to its specific VM.
  • IOMMU Groups: Your motherboard groups devices together. To pass through a GPU, it usually needs to be in its own "isolated" group so you don't accidentally pass through your USB controller or SATA ports along with it.
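
IOMMU groups are visible in sysfs. A small C sketch that lists them (/sys/kernel/iommu_groups is the standard path on Linux; the directory is missing or empty when the IOMMU is off):

#include <stdio.h>
#include <dirent.h>

int main(void) {
    DIR *d = opendir("/sys/kernel/iommu_groups");
    if (!d) {
        perror("IOMMU appears disabled or unsupported");
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')           // skip "." and ".."
            printf("IOMMU group %s\n", e->d_name);
    closedir(d);
    return 0;
}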

vfio-pci

This is a specific driver. Before the VM starts, you "bind" your hardware to vfio-pci. This tells the Linux host: "Don't touch this GPU; I'm saving it for a VM."

QEMU

While KVM handles the CPU, QEMU is the software that emulates the rest of the computer (motherboard, USB slots, etc.) and coordinates with VFIO to plug the physical hardware into the virtual slots.

Why do people use KVM/VFIO?

The most common use case is "Gaming on Linux" via a Windows Guest (often called a "Battlestation in a Box").

Besides that, KVM/VFIO is actually a foundational technology in professional data centers, AI research, and cloud computing. KVM/VFIO is used to handle GPU/TPU/NPU resources for heavy computational tasks.

AI and Machine Learning (The Primary Enterprise Use Case)

The explosion of Large Language Models (LLMs) and Deep Learning relies heavily on KVM/VFIO.

  • CUDA Access: To train a model using NVIDIA’s CUDA toolkit inside a virtualized environment, the VM needs direct access to the GPU hardware. VFIO provides the "bare-metal" performance required for Tensor Cores to operate at full speed.
  • Isolation in Shared Servers: A company might have a massive server with 8x H100 GPUs. Using KVM/VFIO, they can carve that server into 8 separate VMs, giving each data scientist one dedicated physical GPU that is hardware-isolated from the others.
  • TPU Passthrough: Google’s TPUs (Tensor Processing Units), when used in PCIe form factors or within Google Cloud’s infrastructure, utilize similar passthrough technologies to ensure the VM can talk directly to the TPU silicon without software overhead.

Cloud Infrastructure (IaaS)

If you go to AWS (EC2), Google Cloud (GCP), or Azure and rent a "GPU Instance" (like a p4d.24xlarge), you are almost certainly using a system built on KVM and VFIO (or a proprietary equivalent like Nitro).

  • Multi-Tenancy: Cloud providers use VFIO to ensure that User A cannot see the data inside User B’s GPU memory.
  • SR-IOV (The Professional Evolution): In enterprise setups, they often use SR-IOV (Single Root I/O Virtualization) alongside VFIO. This allows a single physical GPU to "split" itself into multiple virtual PCIe devices, each of which is passed through to a different VM via VFIO.

Professional Media Production & Rendering

High-end studios use KVM/VFIO to centralize their hardware.

  • Render Farms: Passing through GPUs to VMs for Blender, OctaneRender, or V-Ray allows studios to spin up rendering nodes dynamically.
  • Remote Workstations: Instead of giving every editor a $10,000 workstation, a company puts several high-end GPUs in a server room. Editors connect via a thin client to a VM that has a GPU passed through via VFIO, allowing them to use DaVinci Resolve or Adobe Premiere with full hardware acceleration.

Scientific Research and Simulation

In High-Performance Computing (HPC), researchers use VMs to create reproducible environments.

  • GPU Acceleration: Tasks like molecular modeling, weather simulation, and fluid dynamics require massive parallel processing. VFIO allows these researchers to use Linux-based clusters where every node is a VM with direct access to physical accelerators.

Hardware Development and CI/CD

  • Driver Development: Engineers writing drivers for new GPUs or TPUs use VFIO to pass the hardware to a VM. If the driver causes a "Kernel Panic" (system crash), it only crashes the VM, not the host machine, saving hours of reboot time.
  • Automated Testing: Companies use KVM/VFIO in their CI/CD pipelines to automatically test if their software (like a game engine or AI framework) works correctly on specific physical hardware.

Passing through "Other" Accelerators

VFIO isn't limited to GPUs. In the enterprise, it is used for:

  • FPGAs: Field Programmable Gate Arrays used for high-frequency trading or custom signal processing.
  • NVMe Drives: For "Bare Metal" storage performance inside a VM.
  • Network Cards (NICs): For high-speed 100Gbps networking using DPDK (Data Plane Development Kit), where the VM handles network packets directly to reduce latency.

The Hardware Requirements

You cannot run KVM/VFIO on just any computer. You generally need:

  1. Two GPUs: Usually one for the Linux Host (can be integrated Intel/AMD graphics) and one dedicated "Guest" GPU for the VM. (Note: Single-GPU passthrough is possible but very difficult).
  2. CPU/Motherboard Support: Both must support IOMMU (VT-d or AMD-Vi).
  3. Two Monitors (or one monitor with two inputs): Since the guest GPU outputs a real video signal, you need a way to see it (or use software like Looking Glass to pipe the VM's frame buffer back into a window on Linux).