
Linux - Protection Rings

In the context of Linux and computer architecture, Protection Rings are a mechanism built into the CPU hardware to protect data and functionality from faults (crashes) and malicious behavior.

Think of them as levels of "security clearance" for software.

The Core Concept: Hierarchy of Privilege

x86 CPUs (Intel and AMD) support four rings, numbered 0 to 3. The lower the number, the more privilege the software has.

  • Ring 0 (Kernel Mode): The most privileged. Software here can execute every CPU instruction and reach all hardware (RAM, Disk, Network Card).
  • Rings 1 and 2: Originally intended for device drivers or OS services, but Linux does not use them. Most modern operating systems (Windows, macOS, Linux) skip these to ensure portability across different CPU architectures that might only have two levels.
  • Ring 3 (User Mode): The least privileged. This is where your applications (Chrome, Spotify, Terminal) run. They are isolated from the hardware and each other.

Why do we need Rings?

The primary goal is Stability and Security.

  • Stability: If a program in Ring 3 crashes (like a buggy video game), it only kills that process. If every program had Ring 0 access, a single bug in a calculator app could overwrite the memory being used by the disk controller, causing the entire computer to freeze or corrupt your files.
  • Security: Malware running in Ring 3 cannot "spy" on the memory of another program or access your webcam directly without asking the Kernel (Ring 0) for permission.

How Linux Uses Rings

Ring 0: The Kernel

In Linux, the Kernel (including its device drivers and loadable modules) is the only thing that lives in Ring 0. It is the "God Mode" of the system. It manages memory, schedules which programs get to use the CPU, and drives the hardware.

Ring 3: User Space

Everything else lives here. When you run a command like ls or open a browser, the CPU runs that code in Ring 3. If a Ring 3 program tries to execute a "privileged instruction" (such as HLT, which halts the CPU), the CPU raises a General Protection Fault; if it tries to read or write a memory address belonging to the Kernel, the MMU raises a Page Fault. In both cases the Linux kernel sends the program a SIGSEGV signal, which kills it by default.

How does a program "get work done"? (System Calls)

Since Ring 3 programs cannot touch hardware, how does a browser save a file to the disk? It uses a System Call (syscall).

  1. The Request: The Ring 3 application executes a special CPU instruction (like SYSCALL on x86_64).
  2. The Switch: The CPU "traps" the request, pauses the application, and switches the privilege level from Ring 3 to Ring 0.
  3. The Execution: The Linux Kernel takes over, verifies if the application has permission to write that file, and performs the physical disk operation.
  4. The Return: Once finished, the Kernel switches the CPU back to Ring 3 and hands control back to the application.

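This round trip is a privilege transition, often called a mode switch. It is not a full Context Switch, which is what happens when the Kernel suspends one process and resumes another.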
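
As a minimal sketch of that round trip, the Python snippet below pushes a few bytes through a pipe. Python's os.pipe, os.write, and os.read are thin wrappers around the corresponding kernel system calls, so each call crosses from Ring 3 into Ring 0 and back. The helper name write_through_kernel is made up for this illustration.

```python
import os


def write_through_kernel(payload: bytes) -> bytes:
    """Send bytes through a pipe; every os.* call below is a system call."""
    read_fd, write_fd = os.pipe()      # the kernel creates the pipe buffer
    try:
        os.write(write_fd, payload)    # Ring 3 -> Ring 0 -> Ring 3
        return os.read(read_fd, len(payload))  # a second round trip
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(write_through_kernel(b"hello from user space"))
```

Running the script under a tracer such as strace should show the matching write and read calls as they enter the kernel.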

How does the CPU keep track of the rings?

To understand how the CPU keeps track of rings, we have to look at the x86 architecture (the foundation for most Linux servers and desktops).

The CPU doesn't have a single "Ring Variable" inside it. Instead, it tracks privilege through specific bits inside Segment Registers, primarily the CS (Code Segment) Register.

The Key Register: CS (Code Segment)

In x86 architecture, the CS register is a 16-bit register that tells the CPU which segment of memory the current code is running in. However, the CPU doesn't just use it for a memory address; it uses the bottom two bits to store the Current Privilege Level (CPL).

The Breakdown of the CS Register

A "Segment Selector" (the value inside the CS register) looks like this:

Bit Range | Name  | Description
15 - 3    | Index | Points to an entry in the GDT (Global Descriptor Table).
2         | TI    | Table Indicator (GDT vs LDT).
1 - 0     | CPL   | Current Privilege Level. This is the "Ring."
  • If the last two bits are 00, the CPU is in Ring 0 (Kernel Mode).
  • If the last two bits are 11, the CPU is in Ring 3 (User Mode).
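
The bit layout above can be sketched as a small decoder. The two sample values, 0x10 and 0x33, are the code-segment selectors that 64-bit Linux is generally documented to use for kernel and user code; treat them as illustrative assumptions rather than values read from a live system. The names Selector and decode_selector are invented for this sketch.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # bits 15-3: entry number in the descriptor table
    ti: int     # bit 2: table indicator
    cpl: int    # bits 1-0: privilege level (the "ring")


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(index=value >> 3, ti=(value >> 2) & 1, cpl=value & 0b11)


if __name__ == "__main__":
    for name, value in (("kernel CS", 0x10), ("user CS", 0x33)):
        print(name, hex(value), decode_selector(value))
```

With these inputs, 0x10 decodes to index 2 with privilege level 0, and 0x33 decodes to index 6 with privilege level 3.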

How it works: The "Check"

Whenever code loads a segment selector or transfers control to another segment, the CPU performs a hardware-level check. It compares the CPL (where you are) against the DPL (Descriptor Privilege Level) of the segment you are trying to use.

  1. The GDT (Global Descriptor Table): When Linux boots, the kernel creates a table in RAM called the GDT. This table has "entries" (descriptors) that define segments of memory.
  2. The DPL: Each entry in that table has a "Descriptor Privilege Level" (DPL) assigned to it. For example, the Kernel memory segment is marked with DPL 0; the User memory segment is marked with DPL 3.
  3. The Logic: If the CPL in your CS register is 3 (User), and you try to jump to a code segment or access a data segment with a DPL of 0, the CPU hardware raises a General Protection Fault and stops the instruction before it even happens.
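
The three steps above can be modelled in a few lines. This is a toy model, not how the hardware is implemented: the table contents, the index numbers, and the function name load_data_segment are all hypothetical, and the exception class merely stands in for the fault the CPU would raise.

```python
class GeneralProtectionFault(Exception):
    """Stands in for the #GP exception the hardware would raise."""


# Hypothetical two-entry table: descriptor index -> (description, DPL)
GDT = {
    1: ("kernel data", 0),
    2: ("user data", 3),
}


def load_data_segment(cpl: int, index: int) -> str:
    """Return the segment name if code running at `cpl` may use it."""
    name, dpl = GDT[index]
    if cpl > dpl:  # a higher number means less privilege
        raise GeneralProtectionFault(
            f"CPL {cpl} may not load {name} (DPL {dpl})"
        )
    return name


if __name__ == "__main__":
    print(load_data_segment(cpl=0, index=1))  # kernel reaching kernel data
    print(load_data_segment(cpl=3, index=2))  # user reaching user data
    try:
        load_data_segment(cpl=3, index=1)     # user reaching kernel data
    except GeneralProtectionFault as fault:
        print("fault:", fault)
```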

How do the bits change? (The Transition)

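A program cannot simply change the bits in the CS register itself. The x86 instruction set has no valid MOV into CS, so a User Mode program that tried MOV CS, AX would get an invalid-opcode exception, and a far jump to a more privileged code segment would fail the DPL check. Either way, the kernel terminates the program with a signal.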

The bits only change through "Gates"—controlled hardware transitions:

The Old Way: Interrupt 0x80

In older 32-bit Linux, a program would trigger a software interrupt (INT 0x80).

  1. The CPU looks up a specific "Gate" in the IDT (Interrupt Descriptor Table).
  2. That gate contains the new CS value (pre-configured by the kernel) which has the CPL bits set to 00.
  3. The hardware switches the CS register to that value and jumps to the kernel's code.
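
A rough model of the gate lookup is sketched below. The IDT contents are hypothetical, chosen so that vector 0x80 is open to Ring 3 while an exception vector is not. Taking the new privilege level from the low two bits of the stored selector follows the description above and is a simplification: real hardware derives it from the descriptor of the target code segment.

```python
from dataclasses import dataclass


class GeneralProtectionFault(Exception):
    """Stands in for the #GP exception the hardware would raise."""


@dataclass(frozen=True)
class Gate:
    dpl: int        # least-privileged ring allowed to invoke this gate with INT n
    target_cs: int  # code-segment selector the kernel stored in the gate


# Hypothetical IDT with two entries
IDT = {
    0x80: Gate(dpl=3, target_cs=0x10),  # system-call gate, open to Ring 3
    0x0E: Gate(dpl=0, target_cs=0x10),  # page-fault gate, not callable by INT from Ring 3
}


def software_interrupt(cpl: int, vector: int) -> int:
    """Model INT n: check the gate's DPL, then return the new privilege level."""
    gate = IDT[vector]
    if cpl > gate.dpl:
        raise GeneralProtectionFault(
            f"CPL {cpl} may not use gate {vector:#x} (DPL {gate.dpl})"
        )
    return gate.target_cs & 0b11  # simplified: ring encoded in the new CS value


if __name__ == "__main__":
    print("INT 0x80 from Ring 3 lands in ring", software_interrupt(3, 0x80))
```

The gate's own DPL is what lets the kernel expose a single, deliberate entry point to user code while keeping every other vector closed to software interrupts.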

The Modern Way: SYSCALL

On modern 64-bit Linux, the process is faster.

  1. There are special registers called MSRs (Model Specific Registers), specifically IA32_STAR (which holds segment selectors) and IA32_LSTAR (which holds the entry address).
  2. When the kernel boots, it writes its kernel and user segment selectors into IA32_STAR and the address of its system call handler into IA32_LSTAR.
  3. When an app runs the SYSCALL instruction, the CPU automatically loads CS with the kernel selector taken from IA32_STAR (privilege bits 00) and jumps to the address in IA32_LSTAR. This is hard-wired into the CPU's circuitry.
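
To make the "loads the new privilege levels from the MSRs" step concrete, the sketch below derives the selectors from a companion MSR, IA32_STAR, which is not named in the original text but is where the Intel manuals place the selector bases for SYSCALL and SYSRET. The sample STAR value is an assumption modelled on how 64-bit Linux is generally described as programming it; the function names are invented here.

```python
from __future__ import annotations

IA32_STAR = 0xC0000081   # MSR number: selector bases for SYSCALL/SYSRET
IA32_LSTAR = 0xC0000082  # MSR number: 64-bit system-call entry address


def syscall_selectors(star: int) -> tuple[int, int]:
    """Return (CS, SS) loaded by SYSCALL: taken from STAR bits 47:32."""
    base = (star >> 32) & 0xFFFF
    cs = base & 0xFFFC            # low two bits forced to 00 -> Ring 0
    ss = (base + 8) & 0xFFFF
    return cs, ss


def sysret_selectors(star: int) -> tuple[int, int]:
    """Return (CS, SS) loaded by a 64-bit SYSRET: from STAR bits 63:48."""
    base = (star >> 48) & 0xFFFF
    cs = ((base + 16) | 3) & 0xFFFF  # low two bits forced to 11 -> Ring 3
    ss = ((base + 8) | 3) & 0xFFFF
    return cs, ss


if __name__ == "__main__":
    star = (0x23 << 48) | (0x10 << 32)  # assumed example value
    print([hex(s) for s in syscall_selectors(star)])
    print([hex(s) for s in sysret_selectors(star)])
```

With the assumed value, the kernel side comes out as CS 0x10 and SS 0x18, and the return path as CS 0x33 and SS 0x2b, so the privilege bits flip between 00 and 11 without any table lookup.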

Beyond Segments: The Page Table (U/S Bit)

While the CS register tracks the "Ring," modern Linux relies more heavily on the MMU (Memory Management Unit) and Page Tables for actual memory protection.

Every 4KB "page" of RAM has a set of flags in the page table:

  • U/S Bit (User/Supervisor): If this bit is 0, only Ring 0 can touch this page. If it is 1, Ring 3 can touch it.

Whenever the CPL is 3, the CPU checks this bit on every single memory access. If a User Mode program tries to touch a Supervisor page, the MMU triggers a Page Fault.
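
A minimal sketch of that check follows. The bit positions used (bit 0 for Present, bit 2 for U/S) match the x86 page-table entry layout; everything else, including the check_access name and the PageFault class, is illustrative, and write and execute permissions are left out on purpose.

```python
PRESENT = 1 << 0  # bit 0: the page is mapped
USER = 1 << 2     # bit 2: U/S, 1 means Ring 3 may touch the page


class PageFault(Exception):
    """Stands in for the #PF exception the MMU would raise."""


def check_access(pte_flags: int, cpl: int) -> None:
    """Raise PageFault if code at `cpl` may not touch this page."""
    if not pte_flags & PRESENT:
        raise PageFault("page not present")
    if cpl == 3 and not pte_flags & USER:
        raise PageFault("user-mode access to a supervisor page")


if __name__ == "__main__":
    check_access(PRESENT | USER, cpl=3)  # user page from Ring 3: allowed
    check_access(PRESENT, cpl=0)         # supervisor page from Ring 0: allowed
    try:
        check_access(PRESENT, cpl=3)     # supervisor page from Ring 3
    except PageFault as fault:
        print("fault:", fault)
```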

How it works in VMs?

If we only had those 2 bits in the CS register, a Virtual Machine (Guest) could not be distinguished from the Host OS—they would both show up as Ring 0, and the Guest could take over the physical hardware.

To solve this, modern CPUs (Intel VT-x and AMD-V) introduced a "fifth ring" (often called Ring -1) and a completely new operating mode for the processor.

The CPU doesn't just look at the 2 bits in CS anymore; it looks at the Execution Mode.

The Two Worlds: Root vs. Non-Root

Instead of just having Rings 0–3, the hardware adds a "toggle switch" on top of the entire CPU. These are called VMX (Virtual Machine Extensions) modes:

  1. VMX Root Operation (The Hypervisor World):

    • This is where the Host OS (the Hypervisor like KVM, VMware, or Xen) lives.
    • The Hypervisor has "Ring -1" authority.
    • It has full control over the physical hardware.
  2. VMX Non-Root Operation (The Guest World):

    • This is where the Guest VM (Windows or Linux running inside a window) lives.
    • Crucially: Inside this mode, the Guest still has its own Rings 0, 1, 2, and 3.
    • The Guest Kernel thinks it is in Ring 0 (CPL 00), but because it is in "Non-Root" mode, certain actions are "trapped" by the hardware.

The Brain of the VM: The VMCS (Virtual Machine Control Structure)

Because the 2-bit CS register is busy tracking the Guest's internal rings, the CPU needs a different place to store the "Host vs. Guest" context.

It uses a data structure in memory called the VMCS (Intel) or VMCB (AMD). This is a 4KB block of RAM that acts as a "Save File" for the VM. It tracks:

  • The Guest's registers (including its version of CS).
  • The Host's registers.
  • Exit Criteria: A list of things the Guest is NOT allowed to do.

How the CPU distinguishes them (VM-Exit)

When a Guest OS (running in Non-Root Ring 0) tries to perform a sensitive hardware operation—for example, talking to the physical hard drive or changing a CPU configuration register:

  1. The Hardware Check: The CPU checks the "Non-Root" flag.
  2. The Trap: The hardware sees the Guest is trying to do something privileged. Instead of executing the instruction, the CPU performs a VM-Exit.
  3. The Switch: The CPU hardware instantly:
    • Saves the Guest's current state (including that CS register) into the VMCS.
    • Flips the mode from "Non-Root" to "Root."
    • Jumps to the Hypervisor’s code (Host Kernel).
  4. Emulation: The Host OS looks at the VMCS to see what the Guest was trying to do, does it on the Guest's behalf (or lies to the Guest), and then performs a VM-Entry to flip back into the VM.
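
The exit-and-resume cycle can be pictured with a toy loop. Nothing here touches real virtualization hardware: the VMCS class, its fields, and the operation names are made up, and the loop only mimics the order of events (guest runs, a forbidden operation causes an exit, the host emulates it, the guest resumes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VMCS:
    exit_on: frozenset                 # hypothetical "exit criteria"
    guest_cs: int = 0x10               # the guest's own view of CS (its Ring 0)
    exit_reason: Optional[str] = None  # filled in on a VM-Exit


def run_guest(vmcs: VMCS, program: List[str]) -> List[str]:
    """Run guest operations, handing forbidden ones to the host."""
    log = []
    for op in program:
        if op in vmcs.exit_on:
            vmcs.exit_reason = op               # VM-Exit: record why we left
            log.append(f"host emulates {op}")   # hypervisor acts in Root mode
            vmcs.exit_reason = None             # VM-Entry: back to Non-Root
        else:
            log.append(f"guest runs {op}")      # no hypervisor involvement
    return log


if __name__ == "__main__":
    vmcs = VMCS(exit_on=frozenset({"disk_io", "write_cr3"}))
    for line in run_guest(vmcs, ["add", "disk_io", "add"]):
        print(line)
```

The point of the sketch is that ordinary guest instructions never involve the host; only the operations listed as exit criteria do.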

What about Memory? (Two-Dimensional Paging)

If the Guest is in Ring 0, it thinks it controls the Memory Management Unit (MMU). This is a huge security risk. To prevent a VM from seeing the Host's memory, the CPU uses Extended Page Tables (EPT) (Intel) or Nested Page Tables (NPT) (AMD).

The CPU now does a Double Lookup:

  1. Guest Virtual Address → Guest Physical Address (Managed by the Guest's Ring 0).
  2. Guest Physical Address → Host Physical Address (Managed by the CPU hardware + Hypervisor).

The Guest's Ring 0 bits (CS=00) only control the first step. The second step is hardware-enforced and invisible to the Guest.
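
The double lookup can be sketched with two dictionaries standing in for the two sets of page tables. Real page tables are multi-level trees walked by hardware; the flat page-number mappings, the translate function, and the exception names below are simplifications invented for this example.

```python
PAGE_SIZE = 4096


class GuestPageFault(Exception):
    """Stage 1 failed: handled by the guest kernel."""


class EptViolation(Exception):
    """Stage 2 failed: causes an exit to the hypervisor."""


def translate(gva: int, guest_pt: dict, ept: dict) -> int:
    """Guest virtual -> guest physical -> host physical address."""
    page, offset = divmod(gva, PAGE_SIZE)
    if page not in guest_pt:
        raise GuestPageFault(hex(gva))
    guest_phys_page = guest_pt[page]       # stage 1: the guest's own tables
    if guest_phys_page not in ept:
        raise EptViolation(hex(guest_phys_page * PAGE_SIZE))
    host_phys_page = ept[guest_phys_page]  # stage 2: the hypervisor's tables
    return host_phys_page * PAGE_SIZE + offset


if __name__ == "__main__":
    guest_pt = {0x10: 0x2}  # guest virtual page 0x10 -> guest physical page 0x2
    ept = {0x2: 0x7F}       # guest physical page 0x2 -> host physical page 0x7F
    print(hex(translate(0x10123, guest_pt, ept)))
```

Because the guest can only edit the first dictionary, it can never name a host page that the second one does not map for it.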

Summary: The New Hierarchy

Level   | Mode         | Ring (CS Bits) | Who?
Ring -1 | VMX Root     | Ring 0         | The Hypervisor (Host Linux / KVM)
Ring 0  | VMX Non-Root | Ring 0         | The Guest Kernel (thinks it’s in charge)
Ring 3  | VMX Non-Root | Ring 3         | The Guest App (Chrome running inside a VM)

In short: The CPU keeps track using a hidden internal mode bit (Root vs. Non-Root) that is separate from the CS register. This creates a "sandbox" where the Guest can have its own Ring 0 without actually having Ring 0 power over the physical machine.

If it is always 00 or 11, why not just use 1 bit?

Linux (and almost every other major OS) only uses Ring 0 (Kernel) and Ring 3 (Userspace). Rings 1 and 2 are almost entirely unused.

So why use 2 bits (2^2 = 4 levels) instead of 1 bit (2^1 = 2 levels)?

It wasn't Linux's choice

The 4-ring structure is "hard-wired" into the silicon of the x86 CPU architecture. When Intel designed the 80286 and 80386 (the grandfathers of modern chips), they decided on 4 rings.

Linux developers have to work with the "bits" the hardware provides. They can't change the CPU's instruction set, so they just use the 00 (Ring 0) and 11 (Ring 3) patterns and ignore the others.

Intel’s Original "Grand Vision" (That Failed)

In the 1980s, computer scientists thought that a highly secure OS would look like an onion with many layers. Intel designed the 4 rings for this specific purpose:

  • Ring 0: The core Kernel (Memory management, CPU scheduling).
  • Ring 1: Device Drivers (Graphics, Disk, Network).
  • Ring 2: Custom OS extensions or Databases.
  • Ring 3: Applications (Calculators, Word Processors).

The idea was: If your Printer Driver (Ring 1) crashed, it shouldn't be able to take down the Core Kernel (Ring 0).

Why Linux (and everyone else) ignored Rings 1 and 2

As the industry moved forward, developers realized two things:

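  1. Complexity and Performance: Every transition between rings carries a performance "tax." Managing memory permissions across four different layers would also have made the Kernel code more complex and slower, and x86 paging only distinguishes Supervisor (Rings 0 to 2) from User (Ring 3), so the middle rings add no page-level isolation.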
  2. Portability: Linux is designed to run on many different CPUs (ARM, MIPS, PowerPC, RISC-V). Many of those architectures only had two levels anyway. To keep the Linux code portable, developers stuck to the "Lowest Common Denominator": Kernel and User.

If Linux had been built specifically and only for the 80386, we might be using all 4 rings today.

Where the bits actually live

In the x86 architecture, the "Current Privilege Level" (CPL) is stored in the bottom 2 bits of the Code Segment (CS) register.

Since those 2 bits are physically there in the CPU's wiring, they must represent a value from 0 to 3. Even though Linux only uses 0 and 3, it cannot "delete" the physical wires that allow for 1 and 2.

The "Secret" Rings (Ring -1 and Ring -2)

Even though we only use two bits for the standard "On-CPU" rings, the industry actually ended up needing more levels later on, but they had to be hacked in "below" Ring 0:

  • Ring -1: Used by Hypervisors (KVM, VMware). The Hypervisor needs to be more powerful than the Linux Kernel so it can run multiple Kernels at once.
  • Ring -2: Used by the System Management Mode (SMM). This is a secret layer used by the motherboard firmware to handle power management and hardware bugs. Even the Kernel can't see what's happening here.

How does Ring -1 work?

The CPU has a hidden internal "bit" or state called VMX (Virtual Machine Extensions) Root Mode.

The "Inception" Logic:

  • When the CPU is in VMX Non-Root Mode, the standard 0–3 rings work normally. The Guest Linux Kernel thinks it's in Ring 0.
  • However, if the Guest Kernel tries to do something "illegal" (like touching real hardware), the CPU triggers a VM-Exit.
  • The CPU then flips into VMX Root Mode.

How does Ring -2 (System Management Mode) work?

Ring -2 is even deeper and more "invisible" than a hypervisor. It is used by the motherboard's firmware (BIOS/UEFI).

  • How it's represented: It is triggered by a specialized hardware signal called the SMI (System Management Interrupt).
  • The Hijack: When an SMI signal is sent (often by the motherboard hardware to handle a thermal emergency or a power button press), the CPU suspends everything:
    1. It saves the current state of the OS (even the Hypervisor!) into a secret, locked area of RAM called SMRAM.
    2. The CPU enters a special execution mode that ignores all standard ring protections.
    3. It runs the firmware code provided by the motherboard manufacturer.
  • Visibility: Linux has no idea this happened. The "clock" inside the OS might skip a few milliseconds, but the Kernel cannot see the code running in SMM, nor can it stop it.

Representation: It is a CPU state that is entered via an SMI and exited via a special instruction called RSM (Resume from System Management Mode).

Ring -3: The Management Engine (The "Hidden" Processor)

If you want to go even deeper, people often refer to Ring -3. This isn't even a "mode" on your CPU; it is a completely separate microprocessor embedded inside your chipset (the Intel Management Engine or AMD Platform Security Processor).

  • How it's represented: It is a separate processor core embedded in the chipset, running its own firmware (recent Intel ME versions run an OS based on Minix; the AMD PSP is an ARM core).
  • Power: It runs from standby power, independently of the main CPU. It can turn your computer on or off, access the network card, and read your RAM, regardless of which OS is installed or whether the main CPU is asleep.

CPL vs DPL vs RPL

CPL and DPL are the two "halves" of the x86 security check. If you think of CPL as the ID card you are carrying, DPL is the security clearance required to open a specific door.

The CPU constantly compares these two values to decide if an instruction is allowed to execute.

CPL: Current Privilege Level (The "Who")

  • Where it lives: The bottom 2 bits of the CS (Code Segment) register.
  • What it represents: The current "Ring" the CPU is executing in (0, 1, 2, or 3).
  • When it changes: Only on a controlled transfer to a different code segment, such as a system call, an interrupt, a task switch, or the return from one of these.

DPL: Descriptor Privilege Level (The "Requirement")

  • Where it lives: Inside the Segment Descriptors (stored in the GDT or LDT tables).
  • What it represents: The "Privilege required" to access that specific segment of memory or that specific gateway.
  • When it changes: Never (usually). It is hardcoded by the Kernel when it sets up the memory tables during boot.

How they interact (The Rules)

When the CPU tries to access a segment, it performs a mathematical comparison. In the world of Rings, lower numbers = higher power.

For Data Access:

If you are in Ring 3 (CPL=3) and you try to read a data segment that is marked as Ring 0 (DPL=0):

  • The Check: Is CPL <= DPL?
  • The Result: 3 <= 0 is False.
  • The Action: The CPU triggers a General Protection Fault, which Linux reports to the program as a Segmentation Fault (SIGSEGV).

For Code Access (Jumping to a new function):

If you try to JMP or CALL to a different code segment:

  • The Rule: Generally, your CPL must equal the DPL of the destination.
  • The Exception: If you want to move from a higher number (User) to a lower number (Kernel), you cannot just JMP. You must go through a Gate (like a System Call or Interrupt). The Gate has its own DPL that acts as a filter to make sure you are allowed to enter.

The "Third Wheel": RPL (Requested Privilege Level)

To make things slightly more confusing, there is a third value called RPL.

  • RPL lives in the bottom 2 bits of a Segment Selector (the pointer you use to pick a segment).
  • Why it exists: It prevents a "Privilege Escalation" attack.
  • The Scenario: Imagine a User program (Ring 3) asks the Kernel (Ring 0) to write data into a memory segment. If the User program passes a "pointer" to a Kernel memory segment, the Kernel (being Ring 0) might accidentally overwrite its own secret data.
  • The Solution: The Kernel sets the RPL of that pointer to 3 (User). The CPU then checks: Max(CPL, RPL) <= DPL. Because the RPL was 3, the access is denied, even though the Kernel (CPL 0) was the one performing the action.
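
The rule Max(CPL, RPL) <= DPL is small enough to express directly. The function name data_access_allowed is invented for this sketch; the four cases in the usage section correspond to the scenarios discussed above.

```python
def data_access_allowed(cpl: int, rpl: int, dpl: int) -> bool:
    """Apply the data-segment rule: the weaker of CPL and RPL must satisfy DPL."""
    for level in (cpl, rpl, dpl):
        if level not in (0, 1, 2, 3):
            raise ValueError("privilege levels range from 0 to 3")
    return max(cpl, rpl) <= dpl


if __name__ == "__main__":
    # User code reaching for a kernel segment: denied.
    print(data_access_allowed(cpl=3, rpl=3, dpl=0))
    # Kernel code using its own selector on a kernel segment: allowed.
    print(data_access_allowed(cpl=0, rpl=0, dpl=0))
    # Kernel code acting on a user-supplied selector (RPL 3) for a kernel segment: denied.
    print(data_access_allowed(cpl=0, rpl=3, dpl=0))
    # Kernel code acting on a user-supplied selector for a user segment: allowed.
    print(data_access_allowed(cpl=0, rpl=3, dpl=3))
```

The third case is the one RPL exists for: the kernel's own privilege would pass the check, but the selector carries the requester's ring and drags the effective privilege down to 3.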

Summary: The Key and the Lock

Level | Name                       | Analogy         | Location
CPL   | Current Privilege Level    | The Person      | CS Register
DPL   | Descriptor Privilege Level | The Door's Lock | Descriptor in the GDT/LDT (or gate in the IDT)
RPL   | Requested Privilege Level  | The Proxy       | Segment Selector

The Expertise Insight: The name "Segmentation Fault" comes from this segment-level check, but on modern Linux, which uses a flat memory model, most segfaults actually begin as Page Faults: your code (CPL 3) touched an address that is unmapped or marked Supervisor in the page tables. A genuine CPL vs DPL mismatch raises a General Protection Fault, which the kernel also reports as SIGSEGV. Either way, the hardware stops the access before the memory can be corrupted.