Linux - task_struct and mm_struct
In the Linux kernel, these two data structures are the "brain" and the "map" of a process. Everything you do on Linux—from running a simple ls to a complex web server—is managed through these structures.
task_struct (The Process Descriptor)
The task_struct is often called the Process Descriptor. It is a massive structure that holds, or points to, the information the kernel needs to manage a process (or a thread).
In Linux, a "task" is the basic unit of execution. Whether it’s a heavy process or a lightweight thread, the kernel represents it with a task_struct.
Key information stored in task_struct:
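- Process ID (PID): The unique identifier for the task. (Each thread has its own pid; the ID that user space calls the "PID" of a multi-threaded process is stored in the tgid field, which all of its threads share.)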
- State: Is the process currently running, sleeping (waiting for input), or a zombie?
- Family Tree: Pointers to its parent process, its children, and sibling processes.
- Credentials: User ID (UID) and Group ID (GID) that determine what files the process can open.
- Filesystem: A list of all open files and the current working directory.
- Scheduling: The priority of the process and how much CPU time it has used.
- Memory Pointer: A pointer to the mm_struct (see below).
Analogy: Think of task_struct as a Worker’s Personnel File. It contains their ID badge, their boss’s name, their current status (working/on break), and what keys they have to the building.
mm_struct (The Memory Descriptor)
The mm_struct represents the Virtual Address Space of a process. While task_struct describes who the process is, mm_struct describes where the process’s data lives in memory.
Key information stored in mm_struct:
- Page Tables (pgd): The "translator" that turns virtual memory addresses (what the program sees) into physical RAM addresses (where the data actually is).
- Memory Areas (VMAs): A list of vm_area_structs. These define "zones" in memory, such as:
  - The Code section (text).
  - The Data section (variables).
  - The Heap (dynamic memory like malloc).
  - The Stack (local variables and function calls).
- Start/End addresses: Pointers to exactly where the heap starts, where the stack ends, etc.
Analogy: Think of mm_struct as a Floor Plan of the worker’s office. It shows exactly where the desk is, where the files are stored, and which rooms they are allowed to enter.
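To make the "list of memory areas" idea concrete, here is a minimal userspace sketch of looking up which region an address falls in. The toy_vma and toy_mm types and the toy_find_vma function are hypothetical simplifications for illustration, not the kernel's definitions; the real kernel indexes its VMAs with a tree structure rather than a plain linked list.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for vm_area_struct and mm_struct. */
struct toy_vma {
    unsigned long start;     /* first address inside the region */
    unsigned long end;       /* first address past the region */
    const char *name;        /* "text", "heap", "stack", ... */
    struct toy_vma *next;    /* next region, sorted by address */
};

struct toy_mm {
    struct toy_vma *mmap;    /* head of the address-sorted region list */
};

/* Return the region containing addr, or NULL if addr is unmapped. */
static const struct toy_vma *toy_find_vma(const struct toy_mm *mm,
                                          unsigned long addr)
{
    for (const struct toy_vma *v = mm->mmap; v != NULL; v = v->next) {
        if (addr < v->start)
            return NULL;     /* sorted list: addr lies in a gap */
        if (addr < v->end)
            return v;
    }
    return NULL;
}
```

An address that falls in a gap between regions yields NULL, which is roughly the situation in which a real process would receive a segmentation fault.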
How they work together
The mm_struct is not "found" directly by the CPU; it is found through the task_struct (i.e., task_struct is a top-level object; mm_struct is not).
The task_struct contains a pointer (named mm) that points to the mm_struct.
    struct task_struct {
        /* ... */
        pid_t pid;
        struct mm_struct *mm;   /* Pointer to the memory map */
        /* ... */
    };
The Difference between Processes and Threads
Linux does not technically distinguish between a process and a thread at the kernel level.
- A Process is a task_struct that has its own unique memory space (mm_struct).
- A Thread is a task_struct that shares its mm_struct (and files) with another task_struct.
Crucial Insight: This is why threads are called "lightweight." They don't need a new memory map; they just share the "office floor plan" of the parent. If Thread A changes a variable in memory, Thread B sees it immediately because they are looking at the same mm_struct.
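The sharing described above can be sketched in userspace C. Everything here (toy_mm, toy_task, toy_spawn_thread, toy_spawn_process) is a hypothetical model, not kernel code; in real Linux the choice between the two behaviours is made by the CLONE_VM flag passed to clone().

```c
#include <stdlib.h>

/* Hypothetical miniature of an address space. */
struct toy_mm {
    int users;             /* how many tasks share this address space */
    int shared_variable;   /* stands in for "a variable in memory" */
};

struct toy_task {
    int pid;
    struct toy_mm *mm;
};

/* "Thread": the new task points at the parent's existing mm. */
static struct toy_task *toy_spawn_thread(struct toy_task *parent, int pid)
{
    struct toy_task *t = malloc(sizeof(*t));
    if (!t)
        return NULL;
    t->pid = pid;
    t->mm = parent->mm;
    t->mm->users++;
    return t;
}

/* "Process": the new task gets its own copy of the parent's mm. */
static struct toy_task *toy_spawn_process(struct toy_task *parent, int pid)
{
    struct toy_task *t = malloc(sizeof(*t));
    struct toy_mm *mm = malloc(sizeof(*mm));
    if (!t || !mm) {
        free(t);
        free(mm);
        return NULL;
    }
    *mm = *parent->mm;     /* snapshot of the parent's memory */
    mm->users = 1;
    t->pid = pid;
    t->mm = mm;
    return t;
}
```

After toy_spawn_thread, a write through either task's mm pointer is visible to both; after toy_spawn_process, the child only sees the values that existed at the moment of the copy.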
Special Case: Kernel Threads
You might notice some tasks in top or ps that have names in brackets, like [kworker] or [ksoftirqd]. These are Kernel Threads.
Kernel threads have a task_struct, but their mm pointer is NULL.
Why? Because they only operate in "Kernel Space" and do not have a private user-space memory map (they don't need a heap or a user stack).
How does kernel track task_structs?
The kernel doesn't just store task_struct objects in a single pile; it uses multiple overlapping data structures to track them depending on what it needs to do (e.g., find a process by ID, find the next process to run, or find all children of a parent).
1. The Global Task List (The "Master List")
Every process in the system (more precisely, the task_struct of each thread-group leader) is part of a circular doubly linked list.
- The Head: The list starts with the init_task (the "swapper" or "idle" process, PID 0).
- The Links: Each task_struct has a tasks field (of type struct list_head) containing pointers to the previous and next task in the list.
- Purpose: This allows kernel code to iterate through every process in the system, for example with the for_each_process() macro. Tools such as ps and top see the same set of processes through the /proc filesystem.
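The "links embedded in the structure" design can be re-created in a few lines of userspace C. This is a sketch under assumptions: toy_task, toy_container_of, and the toy_list_* helpers are hypothetical names that mimic the kernel's intrusive list idea, not its actual API.

```c
#include <stddef.h>

/* A list node that is embedded inside the object it links. */
struct list_head {
    struct list_head *next, *prev;
};

struct toy_task {
    int pid;
    struct list_head tasks;   /* links this task into the global list */
};

/* Recover the enclosing struct from a pointer to its embedded member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty circular list points at itself in both directions. */
static void toy_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry just before head, i.e. at the tail of the circle. */
static void toy_list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Walk the circle once, starting after init and stopping when we return
 * to it. Stores up to max PIDs in out and returns how many were stored. */
static int toy_collect_pids(struct toy_task *init, int *out, int max)
{
    int n = 0;
    for (struct list_head *p = init->tasks.next;
         p != &init->tasks && n < max; p = p->next) {
        struct toy_task *t = toy_container_of(p, struct toy_task, tasks);
        out[n++] = t->pid;
    }
    return n;
}
```

Because the list is circular, there is no NULL terminator: the walk ends when the cursor arrives back at the head it started from.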
2. The PID Hash Table (Finding by ID)
Scanning a linked list of 1,000+ processes just to find "PID 542" would be too slow (O(n) complexity). To find a specific process quickly, the kernel uses a Hash Table.
- Mechanism: The kernel hashes the PID to get an index in a table.
- The Structure: Since multiple PIDs might hash to the same value (a collision), each bucket in the hash table points to a linked list of task_struct objects.
- Speed: This allows the kernel to find any process by its ID almost instantly (O(1) complexity on average).
(Note: Modern kernels actually use a more complex structure called IDR (Integer ID Management) which uses radix trees, but the concept of a fast look-up table remains the same.)
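The classic hash-with-chaining scheme described above can be sketched as follows. The table size, the trivial hash function, and all toy_* names are assumptions made for illustration; this models the older hash-table concept, not the IDR structure used by modern kernels.

```c
#include <stddef.h>

#define TOY_PID_BUCKETS 16   /* hypothetical table size (a power of two) */

struct toy_task {
    int pid;
    struct toy_task *hash_next;   /* next task in the same bucket */
};

struct toy_pid_table {
    struct toy_task *bucket[TOY_PID_BUCKETS];
};

/* Map a PID to a bucket index by keeping its low bits. */
static unsigned toy_pid_hash(int pid)
{
    return (unsigned)pid & (TOY_PID_BUCKETS - 1);
}

/* Push the task onto the front of its bucket's chain. */
static void toy_pid_insert(struct toy_pid_table *t, struct toy_task *task)
{
    unsigned i = toy_pid_hash(task->pid);
    task->hash_next = t->bucket[i];
    t->bucket[i] = task;
}

/* Only the one chain selected by the hash is searched. */
static struct toy_task *toy_find_task_by_pid(const struct toy_pid_table *t,
                                             int pid)
{
    for (struct toy_task *p = t->bucket[toy_pid_hash(pid)]; p != NULL;
         p = p->hash_next) {
        if (p->pid == pid)
            return p;
    }
    return NULL;
}
```

With this hash, PIDs 542 and 558 land in the same bucket, so the lookup has to compare the stored pid on each chain entry rather than trusting the bucket index alone.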
3. The current Macro (The "Right Now" Tracker)
On a multi-core system, each CPU core is running exactly one task_struct at any given microsecond. The kernel needs a way to instantly answer the question: "Who am I right now?"
- Modern x86 Architecture: The kernel uses Per-CPU variables. A specific CPU register (or a fixed memory offset) stores a pointer to the task_struct currently occupying that core.
- The current Macro: When kernel code wants to see the current process's UID or open files, it simply references current->cred->uid or current->files. (Very old kernels kept the UID directly in task_struct as current->uid; credentials now live in a separate cred structure.)
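A rough userspace model of the per-CPU "current" idea is shown below. The array, the toy_this_cpu variable, and the toy_current macro are hypothetical stand-ins: a real kernel obtains the CPU-local slot from hardware state rather than from an ordinary global variable.

```c
#define TOY_NR_CPUS 4   /* hypothetical number of cores */

struct toy_task {
    int pid;
    unsigned uid;
};

/* One "currently running task" slot per CPU. */
static struct toy_task *toy_cpu_current[TOY_NR_CPUS];

/* Stand-in for "which CPU is this code executing on?" */
static int toy_this_cpu;

/* Expands to the task running on the CPU that evaluates it. */
#define toy_current (toy_cpu_current[toy_this_cpu])

/* Called by a pretend scheduler when it switches tasks on a CPU. */
static void toy_switch_to(int cpu, struct toy_task *next)
{
    toy_cpu_current[cpu] = next;
}
```

The same expression, toy_current->pid, gives a different answer depending on which CPU evaluates it, which is the property the real macro provides without any locking or searching.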
4. The Family Tree (Parent/Child Tracking)
Processes have strict "bloodlines." The kernel tracks these relationships using pointers within the task_struct:
- real_parent: Points to the task_struct that created this process.
- children: The head of a list containing all the "kids" this process has spawned.
- sibling: Links this process to other children of the same parent.
This hierarchy is critical. When a process dies, the kernel uses these pointers to find the parent to send a "Child Exit" signal (SIGCHLD).
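The parent/child bookkeeping can be sketched with plain pointers. This is a simplified model under assumptions: the kernel links children and siblings through embedded list_head nodes rather than bare pointers, and sigchld_pending is a hypothetical counter standing in for real signal delivery.

```c
#include <stddef.h>

struct toy_task {
    int pid;
    struct toy_task *real_parent;   /* the task that created us */
    struct toy_task *children;      /* first child, or NULL */
    struct toy_task *sibling;       /* next child of the same parent */
    int sigchld_pending;            /* "a child exited" notifications */
};

/* Record that parent created child, at the front of the child list. */
static void toy_add_child(struct toy_task *parent, struct toy_task *child)
{
    child->real_parent = parent;
    child->sibling = parent->children;
    parent->children = child;
}

/* Walk the sibling chain to count a parent's children. */
static int toy_count_children(const struct toy_task *parent)
{
    int n = 0;
    for (const struct toy_task *c = parent->children; c != NULL; c = c->sibling)
        n++;
    return n;
}

/* When a task exits, follow its parent pointer to notify the parent. */
static void toy_notify_parent_of_exit(struct toy_task *task)
{
    if (task->real_parent != NULL)
        task->real_parent->sigchld_pending++;
}
```

Finding the parent is a single pointer dereference, which is why the exit path does not need to search any global structure.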
5. The Scheduler's Runqueues (The "Waiting Room")
Just because a process exists doesn't mean it is running. Most processes are sleeping (waiting for a keypress or a network packet).
The Scheduler maintains its own tracking structures:
- Runqueue: Each CPU core has a "Runqueue" of tasks that are ready to run (TASK_RUNNING).
- Red-Black Tree: In the CFS (Completely Fair Scheduler), tasks are stored in a Red-Black Tree (a balanced search tree) ordered by how much weighted CPU time they have consumed (their "virtual runtime", or vruntime). The task that has had the least time is at the far left of the tree and gets picked next.
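The "pick whoever has run the least" policy can be illustrated with a much simpler structure than a red-black tree. The sketch below deliberately swaps the tree for a linear scan over a small array, so it shows the selection rule rather than the kernel's data structure; the toy_* names and the fixed capacity are assumptions.

```c
#include <stddef.h>

struct toy_task {
    int pid;
    unsigned long long vruntime;   /* CPU time consumed so far */
};

struct toy_runqueue {
    struct toy_task *ready[8];     /* hypothetical fixed capacity */
    size_t nr_ready;
};

/* Pick the ready task with the smallest vruntime: the task that would
 * be the leftmost node if the tasks were kept in a sorted tree. */
static struct toy_task *toy_pick_next(const struct toy_runqueue *rq)
{
    struct toy_task *best = NULL;
    for (size_t i = 0; i < rq->nr_ready; i++) {
        if (best == NULL || rq->ready[i]->vruntime < best->vruntime)
            best = rq->ready[i];
    }
    return best;
}

/* Charge a task for the time it just spent on the CPU. */
static void toy_account(struct toy_task *t, unsigned long long ran_ns)
{
    t->vruntime += ran_ns;
}
```

A balanced tree gives the same answer as this scan but keeps insertion and removal at O(log n), which matters when a runqueue holds many tasks.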
6. Wait Queues (The "Waiting List")
If a process is waiting for something specific (like data from a hard drive or a mutex lock), it is removed from the Runqueue and placed into a Wait Queue associated with that specific event.
- When the hard drive finishes reading the data, it triggers an interrupt.
- The kernel then looks at the Wait Queue for that disk and moves the associated task_struct back to the Runqueue.
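The hand-off between the two queues can be modelled as below. This is a minimal sketch with hypothetical names (toy_queue, toy_wait_on, toy_wake_up_all) and only two task states; the kernel's real wait queues, sleep states, and wake-up functions are considerably richer.

```c
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_SLEEPING };

struct toy_task {
    int pid;
    enum toy_state state;
    struct toy_task *next;   /* link within whichever queue holds the task */
};

struct toy_queue {
    struct toy_task *head;
};

static void toy_push(struct toy_queue *q, struct toy_task *t)
{
    t->next = q->head;
    q->head = t;
}

/* Unlink t from q if present; returns 1 if it was found. */
static int toy_remove(struct toy_queue *q, struct toy_task *t)
{
    for (struct toy_task **pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == t) {
            *pp = t->next;
            t->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* The task blocks on an event: leave the runqueue, join the wait queue. */
static void toy_wait_on(struct toy_queue *runq, struct toy_queue *waitq,
                        struct toy_task *t)
{
    if (toy_remove(runq, t)) {
        t->state = TOY_SLEEPING;
        toy_push(waitq, t);
    }
}

/* The event fired: move every waiter back to the runqueue.
 * Returns the number of tasks woken. */
static int toy_wake_up_all(struct toy_queue *runq, struct toy_queue *waitq)
{
    int woken = 0;
    while (waitq->head != NULL) {
        struct toy_task *t = waitq->head;
        waitq->head = t->next;
        t->state = TOY_RUNNING;
        toy_push(runq, t);
        woken++;
    }
    return woken;
}
```

In this model an interrupt handler for the disk would call toy_wake_up_all on that disk's queue; tasks sleeping on other events are untouched because they sit in different wait queues.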
Summary Table
| Feature | task_struct | mm_struct |
|---|---|---|
| Common Name | Process Descriptor | Memory Descriptor |
| Scope | One per thread/process | One per address space |
| Main Job | Identity, Scheduling, State | Memory Layout, Page Tables |
| Shared? | No (unique to every task) | Yes (shared by threads in a process) |
| Location | Defined in <linux/sched.h> | Defined in <linux/mm_types.h> |