Linux - epoll
To understand epoll, you first have to understand the problem it solved: The C10k Problem (handling 10,000 concurrent connections at once).
epoll is a Linux-specific API (a small family of system calls) for scalable I/O event notification. It is the "secret sauce" behind high-performance software like Nginx, Redis, Node.js, and HAProxy.
The "e" stands for Event.
The Analogy: The Waiter Problem
Imagine a busy restaurant with 1,000 tables. Only 5 of those tables are currently ready to order food. How does the waiter find them?
select/poll (The Old Way): The waiter walks to Table 1 and asks "Ready?", then Table 2, then Table 3... all the way to Table 1,000. Even if only one person wants water, the waiter must check every single table every time.
- Complexity: O(n) in the number of tables. The more tables you have, the slower the waiter gets.
epoll (The Modern Way): The waiter sits at a central station. Each table has an electronic ringer. When a customer is ready, they press the button. The ringer lights up a specific bulb at the waiter's station. The waiter looks at the board, sees bulbs 5, 22, and 801 are lit, and goes straight to those tables.
- Complexity: O(1) (effectively) with respect to the total number of tables. It doesn't matter if there are 10 tables or 1 million; the effort to find the "ready" ones depends only on how many bulbs are lit.
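The analogy can be sketched as a toy model. This is only a simulation of the waiter's work, not real kernel behavior, and the function names `scan_all_tables` and `read_ringer_board` are made up for illustration.

```python
# Toy model of the waiter analogy. It never talks to the kernel;
# it only counts how many tables the waiter has to visit.

def scan_all_tables(ready, total_tables):
    """select/poll style: ask every table, one by one."""
    checks = 0
    found = []
    for table in range(1, total_tables + 1):
        checks += 1                 # one "Ready?" question per table
        if table in ready:
            found.append(table)
    return found, checks


def read_ringer_board(ready):
    """epoll style: visit only the tables whose bulb is lit."""
    found = sorted(ready)
    return found, len(found)


ready_tables = {5, 22, 801}
print(scan_all_tables(ready_tables, 1000))   # ([5, 22, 801], 1000)
print(read_ringer_board(ready_tables))       # ([5, 22, 801], 3)
```

Both functions find the same three tables, but the scanning waiter asks 1,000 questions while the waiter with the ringer board does 3 units of work.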
Why epoll is better than select and poll
| Feature | select | poll | epoll |
|---|---|---|---|
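| Limit | Hard-coded (FD_SETSIZE, usually 1024) | No fixed limit (bounded by the process fd limit) | No fixed limit (bounded by the process fd limit and /proc/sys/fs/epoll/max_user_watches) |
| Performance | Drops as connections grow | Drops as connections grow | Stays roughly constant; cost depends on the number of ready fds |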
| Kernel Work | Scans all fds every time | Scans all fds every time | Only looks at the "Ready List" |
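| Data Copying | Copies the whole fd list into the kernel on every call | Copies the whole fd list into the kernel on every call | Interest list is registered once via epoll_ctl; only the ready events are copied out to userspace |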
Who waits for whom?
It is Userspace waiting for the Kernel.
The userspace program (like a web server) says: "Kernel, I have 10,000 open connections. I don't know which one will send data next, and I don't want to sit here in a loop checking them all. Wake me up when something happens." Then the userspace program calls the system call epoll_wait() to sleep.
Eventually, a packet arrives from the internet and hits the Network Interface Card (NIC).
- The NIC triggers a Hardware Interrupt.
- The Kernel stops what it’s doing to handle that packet.
- The Kernel looks at the packet and says: "Aha! This is for Socket #54."
- The Kernel then checks its "Interest List" and sees: "Process 'Nginx' asked me to tell it when Socket #54 had data."
The Kernel now moves the Userspace process from the "Sleep" state back to the "Ready" state.
- The epoll_wait() system call finally returns.
- The program "wakes up" and receives a list from the Kernel saying: "Socket #54 is the one that has data for you."
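The sleep-then-wake sequence above can be observed from userspace with a rough sketch like the following. It uses Python's `select.epoll` wrapper (Linux only) and a local socket pair instead of a real network connection; the delayed write from a timer thread is a stand-in for "a packet arrives later", and `wait_for_data` is a hypothetical helper name.

```python
import select
import socket
import threading
import time


def wait_for_data(delay=0.2):
    """Block in epoll until a delayed write arrives; return what was observed."""
    reader, writer = socket.socketpair()
    reader_fd = reader.fileno()
    ep = select.epoll()
    ep.register(reader_fd, select.EPOLLIN)

    # Stand-in for a packet arriving later: another thread writes to the
    # peer socket after `delay` seconds.
    timer = threading.Timer(delay, writer.send, args=(b"hello",))
    start = time.monotonic()
    timer.start()

    # The process sleeps inside epoll_wait() here (5 s safety timeout).
    events = ep.poll(5)
    waited = time.monotonic() - start
    data = reader.recv(1024)

    timer.join()
    ep.close()
    reader.close()
    writer.close()
    return reader_fd, events, waited, data


fd, events, waited, data = wait_for_data()
print(events == [(fd, select.EPOLLIN)], data)
print(f"slept for about {waited:.2f} s before the kernel woke us up")
```

The program does no work while it waits; it resumes only when the kernel reports that the registered socket has become readable.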
How epoll works (The 3 Syscalls)
Using epoll in a program involves three main steps:
- epoll_create1 (The Station): Creates an "epoll instance" (the waiter's station). It returns a file descriptor that represents the instance, including its interest list.
- epoll_ctl (The Ringers): This is where you add, modify, or remove file descriptors (sockets) from the "Interest List." You are essentially telling the kernel: "Monitor this socket and let me know if it has data to read."
  - It uses a Red-Black Tree structure inside the kernel to store these descriptors, making additions and deletions very fast (O(log n)).
- epoll_wait (The Wait): The program "sleeps" here. When a monitored socket becomes active, the kernel places that socket into a "Ready List." epoll_wait then wakes up and returns only the list of active sockets.
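As a minimal sketch of the three steps, the example below goes through Python's `select.epoll` wrapper rather than calling the C functions directly. The mapping in the comments (constructor to epoll_create1, `register`/`modify`/`unregister` to epoll_ctl, `poll` to epoll_wait) reflects how the wrapper is documented; the helper name `three_step_demo` and the use of a local socket pair are assumptions made for illustration.

```python
import select
import socket


def three_step_demo():
    """Walk through create, ctl (add/mod/del), and wait on one socket."""
    a, b = socket.socketpair()
    fd = a.fileno()

    ep = select.epoll()                  # step 1: epoll_create1
    ep.register(fd, select.EPOLLIN)      # step 2: epoll_ctl(EPOLL_CTL_ADD)

    idle = ep.poll(0)                    # step 3: epoll_wait, nothing ready yet
    b.send(b"ping")
    readable = ep.poll(1)                # now the ready list contains fd

    ep.modify(fd, select.EPOLLOUT)       # epoll_ctl(EPOLL_CTL_MOD)
    writable = ep.poll(1)                # an idle socket is writable

    ep.unregister(fd)                    # epoll_ctl(EPOLL_CTL_DEL)
    removed = ep.poll(0)                 # no longer monitored

    ep.close()
    a.close()
    b.close()
    return fd, idle, readable, writable, removed


fd, idle, readable, writable, removed = three_step_demo()
print(idle)                                   # []
print(readable == [(fd, select.EPOLLIN)])     # True
print(writable == [(fd, select.EPOLLOUT)])    # True
print(removed)                                # []
```

Each `poll` call returns a list of `(fd, event_mask)` pairs, which is the "ready list" handed back to userspace.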
The Two Modes: Level-Triggered vs. Edge-Triggered
This is a common interview question regarding epoll.
- Level-Triggered (LT) - Default:
  - The kernel tells you a socket is ready. If you don't read all the data, the kernel will keep telling you every time you call epoll_wait until the buffer is empty.
  - Safe, but slightly more overhead.
- Edge-Triggered (ET) - High Performance:
  - The kernel tells you a socket is ready only once when the state changes (e.g., when new data arrives). If you only read half the data, the kernel will not tell you again until new data arrives.
  - Fewer notifications, but you must use non-blocking I/O and loop until you get an EAGAIN error to ensure the buffer is truly empty.
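The difference between the two modes can be made visible with a small experiment, sketched here with Python's `select.epoll` and a local socket pair. The helper names `drain` and `partial_read_demo` are hypothetical. The demo sends 8 bytes once, reads only 1 byte per notification, and counts how often the kernel reports the socket as ready; `drain` then shows the read-until-EAGAIN loop, which surfaces in Python as `BlockingIOError`.

```python
import select
import socket


def drain(sock):
    """Read from a non-blocking socket until the kernel reports EAGAIN."""
    chunks = []
    while True:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:      # EAGAIN / EWOULDBLOCK: buffer is empty
            break
        if not chunk:                # peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def partial_read_demo(extra_flags, polls=3):
    """Send 8 bytes once, read 1 byte per notification, count notifications."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    ep = select.epoll()
    ep.register(reader.fileno(), select.EPOLLIN | extra_flags)
    writer.send(b"abcdefgh")

    notifications = 0
    for _ in range(polls):
        if ep.poll(0.05):            # short timeout so the ET case does not hang
            notifications += 1
            reader.recv(1)           # deliberately leave data in the buffer

    leftover = drain(reader)         # what a correct ET handler would have read
    ep.close()
    reader.close()
    writer.close()
    return notifications, leftover


lt_result = partial_read_demo(0)                 # level-triggered (default)
et_result = partial_read_demo(select.EPOLLET)    # edge-triggered
print("LT:", lt_result)
print("ET:", et_result)
```

In the level-triggered run every poll should report the socket again because unread bytes remain, while the edge-triggered run should report it only once. That is why an edge-triggered handler needs a loop like `drain` before going back to wait.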
Why is it so fast?
- Red-Black Tree: Storing the "interest list" in a tree means we can add/remove millions of connections efficiently.
- The Ready List: The kernel handles the work of monitoring the hardware. When a packet hits the network card, an interrupt triggers the kernel to move that specific socket descriptor into the "Ready List." The application doesn't have to search for it.
- Reduced Copying: Because the interest list is maintained inside the kernel, the application doesn't have to pass a massive list of 100,000 descriptors to the kernel every time it wants to check for updates. Each epoll_wait call only carries the ready events back to userspace.
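The "register once, wait many times" pattern behind these points can be sketched as follows. The example uses 100 local socket pairs as stand-ins for client connections; the helper name `run_rounds` and the chosen connection indices are arbitrary assumptions. Every reader is registered with the kernel a single time, and each later `poll` call passes no descriptor list at all yet still returns only the connections that became active.

```python
import select
import socket


def run_rounds(num_connections=100, rounds=((3, 7), (42,), (0, 50, 99))):
    """Register every connection once, then poll repeatedly."""
    pairs = [socket.socketpair() for _ in range(num_connections)]
    ep = select.epoll()
    fd_to_index = {}

    # Build the interest list once; the kernel keeps it between calls.
    for index, (reader, _writer) in enumerate(pairs):
        ep.register(reader.fileno(), select.EPOLLIN)
        fd_to_index[reader.fileno()] = index

    results = []
    for active in rounds:
        for index in active:
            pairs[index][1].send(b"x")        # these "clients" send data

        # No descriptor list is passed in; only ready fds come back.
        ready = sorted(fd_to_index[fd] for fd, _mask in ep.poll(1))
        for index in ready:
            pairs[index][0].recv(16)          # consume so LT does not repeat it
        results.append(ready)

    ep.close()
    for reader, writer in pairs:
        reader.close()
        writer.close()
    return results


print(run_rounds())   # [[3, 7], [42], [0, 50, 99]]
```

With select or poll, each of the three rounds would have to hand all 100 descriptors to the kernel again and then scan the result to find the active ones.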
Summary
epoll is a key reason a single Linux server can handle hundreds of thousands, or even millions, of concurrent connections. By moving from a "Scanning" model to an "Event" model, it removed the per-call cost of checking every connection, which was the main bottleneck behind the C10k problem.
Beyond epoll: io_uring
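While epoll is great, it only reports readiness: the program still needs a system call (epoll_wait) to learn which sockets are ready, plus a separate read or write system call for each of them. System calls are relatively expensive because each one requires a transition from "User Mode" to "Kernel Mode" and back.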
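io_uring (introduced in 2019, in Linux 5.1) allows userspace and the kernel to share a pair of ring buffers in memory. Userspace drops requests into the submission ring, and the kernel picks them up; the kernel drops the results into the completion ring, and userspace picks them up. Many requests can be submitted with a single system call, and with submission-queue polling enabled the kernel can pick them up with no system call at all.
In the best case, it is essentially "zero-syscall" I/O.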
Summary
- epoll does not make you rescan every descriptor after each wakeup to find out what happened (unlike select).
- epoll only wakes you up with a list of active events, so you don't waste time searching.
- Edge-Triggered mode allows you to minimize the number of wakeups per data event.
- It is the foundation of almost all modern, scalable Linux networking.