Linux - io_uring

io_uring is a high-performance asynchronous I/O interface for the Linux kernel. Introduced in 2019 (Linux kernel 5.1), it was designed to replace the older, limited Linux AIO (Asynchronous I/O) and provide a faster, more unified way to handle I/O operations for files, sockets, and more.

The Problem: Why did we need it?

Before io_uring, Linux had two main ways to handle I/O:

Synchronous I/O (read/write): Your program stops and waits for the disk or network to finish. This is slow for high-performance apps because the CPU sits idle.
Legacy Linux AIO: This was intended to be asynchronous, but it had major flaws:
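- It was only truly asynchronous for files opened with O_DIRECT (bypassing the OS page cache); buffered I/O quietly ran synchronously.
- Even with O_DIRECT, submission could still block, for example if file metadata wasn't ready.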
- The API was complex and difficult to use.

System Call Overhead: Every time you call read() or write(), the CPU must switch from "User Mode" to "Kernel Mode" and back. With modern high-speed NVMe drives, which complete a request in microseconds, the time spent on these mode switches can rival the time spent on the I/O itself.

The Solution: How `io_uring` Works

The "uring" in the name is short for "user ring": a pair of ring buffers, one for submissions and one for completions, that the kernel shares with user space.

io_uring uses two ring buffers shared between the User Space (your app) and the Kernel Space:

Submission Queue (SQ): The application writes I/O requests (e.g., "read this file") into this ring.
Completion Queue (CQ): The kernel writes the results of those requests (e.g., "here is your data" or "success") into this ring.

The Magic: Shared Memory

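Because these rings are in shared memory, the application can place a request in the SQ with ordinary memory writes, and it can queue many requests before telling the kernel about them. In the default mode, a single io_uring_enter() system call then submits the whole batch; in SQPOLL mode even that call can usually be skipped. The kernel picks up each request, processes it, and drops the result in the CQ. The application simply checks the CQ for results whenever it is ready.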

Key Features and Modes

Submission Polling (SQPOLL)

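In this mode (enabled with the IORING_SETUP_SQPOLL flag), the kernel creates a dedicated thread that "polls" the Submission Queue for new work. While that thread is awake, the application can submit thousands of I/O operations without performing a single system call, which removes most of the user/kernel transition overhead. The trade-off is that the poller burns CPU while it spins. After a configurable idle period (sq_thread_idle) it goes to sleep, and the application must then call io_uring_enter() with the IORING_ENTER_SQ_WAKEUP flag to wake it.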

Unified API

Unlike the old AIO, io_uring works for almost everything:

File I/O (buffered or direct)
Network sockets (accept, send, recv)
Timeouts
Pipe I/O
Even specialized operations like splice or tee.

Request Chaining

You can tell io_uring to link operations together. For example: "Read this file, and only if that succeeds, write the data to this socket." The kernel handles the logic, saving the round-trip time back to the application.

Fixed Files and Buffers

You can "register" files or memory buffers with the kernel in advance. This allows the kernel to map the data once, rather than re-mapping it for every single I/O operation, further increasing speed.

Why is it a "Game Changer"?

Performance: For I/O-heavy workloads it is often significantly faster than the other I/O interfaces on Linux. In some benchmarks on fast NVMe storage, it delivers millions of IOPS (Input/Output Operations Per Second) on a single CPU core.
Efficiency: By reducing system calls and context switches, it lowers CPU usage, which is critical for high-density servers and databases.
Ease of Use (via liburing): While the raw kernel interface is complex, a helper library called liburing makes it much easier for developers to use.

What does uring mean?

The name reflects the two core components of the design:

u (User): This signifies that the interface is based on memory shared between User-space and the Kernel. Because the memory is mapped into the user application's address space, the app can write requests directly to the buffer without performing a system call.
ring (Ringbuffer): This refers to the Circular Queue (or Ring Buffer) data structure used to manage the requests. There are two of them: the Submission Queue (SQ) and the Completion Queue (CQ).

Why a "Ring"?

In computer science, a ring buffer is an efficient way to handle a producer-consumer relationship:

The Submission Queue: The User is the producer (adds I/O tasks) and the Kernel is the consumer (picks them up to execute).
The Completion Queue: The Kernel is the producer (adds finished results) and the User is the consumer (reads the results).
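
To make the producer-consumer idea concrete, here is a toy single-producer, single-consumer ring in plain C11. It is only a model: the struct toy_ring type and its functions are invented for this illustration and do not match the kernel's actual memory layout. What it does share with io_uring is the general scheme of a power-of-two ring indexed by free-running head and tail counters, with release/acquire ordering so that a slot is published before the index that exposes it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Toy single-producer/single-consumer ring; not the kernel's layout. */
struct toy_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;                   /* consumer has not caught up */

    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;                   /* nothing to consume */

    *out = r->slots[head & RING_MASK];
    /* Release: the slot is read before it is handed back to the producer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because the counters are never wrapped explicitly, "full" is simply tail - head == RING_ENTRIES and "empty" is head == tail; the mask is applied only when indexing a slot.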

Why the "U" is the most important part:

In older Linux I/O models, you had to "trap" into the kernel (a system call) to tell the OS to do something. This process involves:

Saving CPU registers.
Switching from User Mode to Kernel Mode.
Copying data from user memory to kernel memory.

By using a User-mapped ring buffer, io_uring allows the user and kernel to communicate by simply reading and writing to shared memory locations. This shared-memory approach to command submission (sometimes loosely called "zero-copy", although it concerns the requests rather than the I/O data) is a large part of what makes it faster than previous methods.

Fun fact: While it technically stands for "User Ring," many developers also consider it the "Universal Ring" because, unlike previous Linux AIO attempts, it works for almost every type of I/O (files, sockets, pipes, etc.).

How does it work under the hood?

While you can use raw system calls, io_uring is designed such that you almost never call them directly. Instead, you use liburing, a helper library that handles the "heavy lifting" of memory barriers and ring management.

However, to understand how it works "under the hood," we must look at the three core system calls.

The Three Core System Calls

A. `io_uring_setup(unsigned entries, struct io_uring_params *params)`

This creates the rings.

What it does: It tells the kernel, "I want a submission queue of size X."
What it returns: A file descriptor (fd) representing the io_uring instance.
The Magic: After calling this, the application must call mmap() on this file descriptor. This maps the kernel's ring buffer memory directly into the application's memory space. This is why it's fast: once mapped, the app and kernel talk via memory, not via copies.
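
For readers curious what liburing does on their behalf, the following sketch performs the setup and mmap() steps with raw system calls. The struct raw_ring container and the function names are hypothetical; the offsets and size formulas follow the fields the kernel fills into struct io_uring_params. It assumes kernel headers that define __NR_io_uring_setup, and the call can fail (for example with -ENOSYS or -EPERM) where io_uring is unavailable or disabled.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the raw mappings of one io_uring instance. */
struct raw_ring {
    int fd;
    struct io_uring_params params;
    void *sq_ring, *cq_ring, *sqes;
    size_t sq_ring_size, cq_ring_size, sqes_size;
};

/* Unmap whatever was mapped and close the ring file descriptor. */
void raw_ring_teardown(struct raw_ring *r)
{
    if (r->sqes)    munmap(r->sqes, r->sqes_size);
    if (r->cq_ring) munmap(r->cq_ring, r->cq_ring_size);
    if (r->sq_ring) munmap(r->sq_ring, r->sq_ring_size);
    if (r->fd >= 0) close(r->fd);
}

/* Map one region of the ring fd; returns NULL on failure (errno is set). */
static void *map_region(int fd, size_t size, off_t offset)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    return p == MAP_FAILED ? NULL : p;
}

/* Create a ring and map its three regions. Returns 0 or a negative errno. */
int raw_ring_setup(struct raw_ring *r, unsigned entries)
{
    struct io_uring_params *p = &r->params;
    int err;

    memset(r, 0, sizeof(*r));
    r->fd = (int)syscall(__NR_io_uring_setup, entries, p);
    if (r->fd < 0)
        return -errno;

    /* The kernel reports where each field lives inside the mappings. */
    r->sq_ring_size = p->sq_off.array + p->sq_entries * sizeof(__u32);
    r->cq_ring_size = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
    r->sqes_size = p->sq_entries * sizeof(struct io_uring_sqe);

    if (!(r->sq_ring = map_region(r->fd, r->sq_ring_size, IORING_OFF_SQ_RING)) ||
        !(r->cq_ring = map_region(r->fd, r->cq_ring_size, IORING_OFF_CQ_RING)) ||
        !(r->sqes = map_region(r->fd, r->sqes_size, IORING_OFF_SQES))) {
        err = errno;
        raw_ring_teardown(r);
        return -err;
    }
    return 0;
}
```

After a successful call, the head and tail indices of the submission ring can be found at r->sq_ring plus params.sq_off.head and params.sq_off.tail; liburing hides exactly this pointer arithmetic.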

B. `io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig)`

This is the "shuttle" that hands requests to the kernel and, optionally, waits for results.

to_submit: Tells the kernel, "I just put N items in the ring; go look at them."
min_complete: Together with the IORING_ENTER_GETEVENTS flag, tells the kernel, "Don't return to me until at least M requests are finished." (This allows the app to sleep until work is done.)
sig: An optional signal mask to apply while waiting; pass NULL if you don't need one.
Optimization: If you are using SQPOLL (Submission Queue Polling) mode, you often don't need to call this to submit at all, because a kernel thread is already watching the memory.

C. `io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args)`

This is used for performance tuning.

It allows you to "pre-register" files or memory buffers.
Normally, for every I/O, the kernel has to pin and "map" your buffer's memory and look up and take a reference on the file behind your descriptor. Registering them once at startup removes that overhead from the "fast path" of your loop.

The Data Structures (The "Language")

There are two main structures the code interacts with:

struct io_uring_sqe (Submission Queue Entry):
- Contains the "Command": read, write, connect, etc.
- Contains the "Arguments": File descriptor, buffer address, length, and offset.
- Contains user_data: A 64-bit pointer or ID you provide so you can identify this request when it finishes.
struct io_uring_cqe (Completion Queue Entry):
- Contains the user_data you sent (so you know which request this is).
- Contains res: The result (e.g., number of bytes read, or -errno if it failed).

What the code looks like (using `liburing`)

This is a simplified example of reading a file asynchronously.

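#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buffer[4096];
    int ret;

    // 1. Set up the ring (depth of 32 entries)
    ret = io_uring_queue_init(32, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        return 1;
    }

    // 2. Get a "Submission Queue Entry" from the ring
    //    (cannot be NULL here: the ring is empty and holds 32 entries)
    sqe = io_uring_get_sqe(&ring);

    // 3. Prepare the "Read" command (no syscall yet!)
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);

    // Set user_data so we can identify this specific read later
    io_uring_sqe_set_data(sqe, (void *)1234);

    // 4. Submit to the kernel (calls io_uring_enter internally)
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // 5. Wait for the completion
    // This will block until the kernel puts a result in the CQ
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
    } else {
        // 6. Process the result: res is a byte count, or -errno on failure
        if (cqe->res >= 0) {
            printf("Read %d bytes\n", cqe->res);
        } else {
            fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
        }

        // 7. Mark the completion as "seen" so the kernel can reuse the space
        io_uring_cqe_seen(&ring, cqe);
    }

    // Clean up
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}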

The Workflow Summary

Initialization: io_uring_queue_init() sets up the shared memory rings.
Preparation: io_uring_get_sqe() grabs a slot in the ring; you fill it with your I/O request.
Submission: io_uring_submit() tells the kernel there is work to do.
Processing: The kernel performs the I/O in the background (asynchronously).
Completion: The kernel writes the result to the Completion Ring.
Reaping: Your app calls io_uring_wait_cqe() or io_uring_peek_cqe() to get the result and then "acknowledges" it with io_uring_cqe_seen().

Why is this better than `epoll` or `select`?

With epoll, the kernel tells you a socket is ready to be read. You then have to perform a second step (the read() syscall) to actually get the data.

With io_uring, you tell the kernel "Read this data and wake me up when you are done." By the time your app wakes up, the data is already sitting in your buffer. This saves an entire round-trip of communication between the App and the Kernel.

What happens if the buffer is full?

Because io_uring uses fixed-size buffers allocated at startup, handling "full" states is a critical part of its design. The behavior is different depending on which ring—Submission or Completion—is full.

If the Submission Queue (SQ) is Full

Scenario: Your application is producing I/O requests faster than the kernel (or the hardware) can pick them up.

The Application Must Wait: The application is responsible for managing the SQ. If you try to add a new "Submission Queue Entry" (SQE) and the ring is full, you simply cannot write to it without overwriting your own pending requests.
The Fix (io_uring_enter): Usually, the application calls the system call io_uring_enter() to submit what is already queued. In the default mode the kernel consumes the SQEs during that call, so their slots are free again as soon as it returns, even if the I/O itself is still in flight.
- If you also pass the IORING_ENTER_GETEVENTS flag with a non-zero min_complete, the application will "sleep" (block) until that many completions appear in the other ring.
- In SQPOLL mode, where the kernel thread drains the SQ on its own schedule, the IORING_ENTER_SQ_WAIT flag makes the call wait until space is available.
The liburing approach: If you use the standard library (liburing), the helper function io_uring_get_sqe() will simply return NULL if the ring is full, telling your app: "Stop, you need to submit what you have and wait for some space to clear."

If the Completion Queue (CQ) is Full

Scenario: The kernel has finished several I/O tasks, but your application hasn't "read" them from the CQ yet. This is more complex because the kernel must deliver the result to you.

Internal Overflow List: To prevent data loss, kernels that advertise the IORING_FEAT_NODROP feature (5.5 and later) maintain an internal overflow list. If the CQ ring is full, the kernel will store the completion results in its own private memory. Older kernels instead dropped the completion and only incremented an overflow counter.
Backpressure: While there are items in the kernel's internal overflow list, attempts to submit new work can fail with -EBUSY until the application reaps completions. This creates "backpressure," forcing the application to slow down.
Performance Hit: Moving items into and out of the internal overflow list is slower than using the shared ring buffer. You want to avoid this state for maximum performance.
Clearing the Overflow: Once your application has read from the CQ (moving the "head" pointer) and next enters the kernel to wait for completions, the kernel moves items from its internal list into the freed CQ slots.
The "Overflow" Flag: There is a flag (IORING_SQ_CQ_OVERFLOW) in the shared SQ ring flags that the kernel sets to let the application know: "Hey, the ring filled up, and I have extra results waiting for you in the background."

How to prevent this (The Design Standard)

To avoid these "Full" states, developers usually follow two rules of thumb:

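Sizing the Rings: When initializing io_uring, you define the size of the SQ. By default, the kernel creates a Completion Queue (CQ) that is twice as large as the SQ (e.g., if SQ is 128, CQ is 256); the IORING_SETUP_CQSIZE flag lets you request a larger one. This provides a "buffer" so that even if a full SQ's worth of submissions finishes at the same time, the CQ doesn't overflow immediately. Keep in mind that the number of requests in flight is not capped by the SQ size, because SQ slots are recycled as soon as requests are submitted.
The "One-in, One-out" Pattern: High-performance apps often try to maintain a steady state, submitting new requests only as old ones complete. For example, if you have 100 requests in flight, you wait for 10 completions before submitting 10 more.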

Where is io_uring used?

io_uring is rapidly becoming the standard for high-performance Linux software. It is used anywhere that "every microsecond counts" or where thousands of concurrent I/O operations occur.

Databases (The Biggest Adopters)

Databases are the most natural fit for io_uring because they perform massive amounts of random disk I/O and need to keep the CPU free for processing queries.

ScyllaDB: A NoSQL database built on the Seastar framework, which has long relied on asynchronous direct I/O through legacy Linux AIO and has since gained an io_uring backend as an alternative.
RocksDB: The high-performance storage engine (used by Meta and others) can use io_uring to issue batched reads in parallel, for example in its MultiGet path.
PostgreSQL: After years of development work, PostgreSQL 18 introduced an asynchronous I/O subsystem in which io_uring can be selected as the I/O method, initially aimed at reads and prefetching.
TiDB (TiKV): The distributed storage layer uses io_uring for faster disk operations.

Web Servers and Networking

Traditional servers use epoll, which tells you when a socket is ready. io_uring goes further by actually performing the send and recv operations asynchronously.

Nginx: Experimental patches use io_uring to handle file I/O more efficiently (e.g., serving static files). Check whether your build includes them, since stock releases have relied on epoll and thread pools.
Envoy Proxy: High-performance service mesh (used by Google/Lyft) has been implementing io_uring support to reduce CPU overhead in high-traffic environments.
HAProxy: Uses it to speed up internal operations and connection handling.

Programming Language Runtimes

Many "async" programming languages originally used a "thread pool" to fake asynchronous file I/O. They are now replacing those pools with io_uring.

Node.js: The underlying library, libuv, has added io_uring support for some file system operations on Linux. The goal is that Node.js apps benefit without the developer changing a line of code, although availability depends on the libuv and kernel versions in use.
Rust: The Rust ecosystem is a leader in io_uring adoption.
- Tokio-uring: A version of the popular tokio runtime built specifically for io_uring.
- Glommio: A specialized Rust runtime for "thread-per-core" architectures that relies entirely on io_uring.
Java (Netty): The Netty project (the foundation for most high-performance Java networking) has an io_uring transport that can be used in place of its epoll transport and aims to reduce system call overhead.

Storage and Virtualization

QEMU / KVM: When you run a Virtual Machine (VM) on Linux, the "virtual disk" of that VM is often just a file on the host. QEMU can use io_uring to pass I/O from the guest VM to the host storage with very little overhead, bringing VM disk performance closer to "bare metal."
Samba: The software that lets Linux talk to Windows file shares uses io_uring to speed up file transfers.
Ceph: Distributed storage systems use it to coordinate data writing across many disks and nodes simultaneously.

Specialized High-Performance Tools

Redis: While Redis is mostly in-memory, it uses disk for persistence (AOF/RDB). There have been proposals and experiments to use io_uring for these disk writes so that they do not block the main event loop; verify support in the specific version you run.
Vector: A high-performance log and metrics observability tool (built by Datadog) uses io_uring to ingest and write massive amounts of data with minimal CPU usage.

Comparison

Feature	Synchronous (`read/write`)	Legacy AIO	io_uring
Blocks the caller	Yes	Sometimes	No (unless you choose to wait)
Buffered I/O	Yes	No (async only with O_DIRECT)	Yes
Socket Support	Yes	No	Yes
Syscall Overhead	High (1 per op)	Medium	Very Low (batched; near zero with SQPOLL)
Complexity	Simple	Hard	Moderate (easier with liburing)

Conclusion

io_uring is a major step forward for Linux I/O. It is being adopted by high-performance software such as ScyllaDB, libuv (the library underneath Node.js), and the Rust tokio ecosystem. If you are building an application that needs to handle massive amounts of data or thousands of concurrent connections, io_uring is well worth evaluating.