Linux - task_struct and mm_struct

In the Linux kernel, these two data structures are the "brain" and the "map" of a process. Everything you run on Linux, from a simple ls to a complex web server, is managed through them.

`task_struct` (The Process Descriptor)

The task_struct is often called the Process Descriptor. It is a massive structure that holds, or points to, nearly everything the kernel needs to manage a process (or a thread).

In Linux, a "task" is the basic unit of execution. Whether it’s a heavy process or a lightweight thread, the kernel represents it with a task_struct.

Key information stored in task_struct:

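Process ID (PID): The identifier for the task. The kernel's pid field is unique per task (what user space calls a thread ID); the number user space calls the PID is the tgid field, which all threads of one process share.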
State: Is the process currently running, sleeping (waiting for input), or a zombie?
Family Tree: Pointers to its parent process, its children, and sibling processes.
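Credentials: User ID (UID) and Group ID (GID), held in a separate cred structure, that determine what the process is allowed to do, such as which files it can open.
Filesystem: Pointers to the table of open files (files) and to filesystem context such as the current working directory (fs).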
Scheduling: The priority of the process and how much CPU time it has used.
Memory Pointer: A pointer to the mm_struct (see below).

Analogy: Think of task_struct as a Worker’s Personnel File. It contains their ID badge, their boss’s name, their current status (working/on break), and what keys they have to the building.
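Several of these fields can be observed from user space without touching kernel code. The sketch below assumes a Linux system with /proc mounted and reads numeric fields from /proc/self/status, which the kernel fills in from the calling task's task_struct. The helper name read_status_field is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the first integer on the "Name:" line of /proc/self/status,
 * or -1 if the file cannot be opened or the field is missing.
 * Non-numeric fields (such as State) parse as 0. */
long read_status_field(const char *name)
{
    assert(name != NULL);

    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    size_t len = strlen(name);
    long value = -1;

    while (fgets(line, sizeof line, f)) {
        /* Match "Name:" only at the start of a line. */
        if (strncmp(line, name, len) == 0 && line[len] == ':') {
            value = strtol(line + len + 1, NULL, 10);
            break;
        }
    }
    fclose(f);
    return value;
}
```

Calling read_status_field("Pid"), read_status_field("PPid"), or read_status_field("Uid") returns the task's ID, its parent's ID, and its real UID respectively.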

`mm_struct` (The Memory Descriptor)

The mm_struct represents the Virtual Address Space of a process. While task_struct describes who the process is, mm_struct describes where the process’s data lives in memory.

Key information stored in mm_struct:

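Page Tables (pgd): A pointer to the top-level page table (the page global directory). The page tables are the "translator" that turns virtual memory addresses (what the program sees) into physical RAM addresses (where the data actually is).
Memory Areas (VMAs): A collection of vm_area_structs (older kernels kept them in a linked list plus a red-black tree; kernels since 6.1 use a maple tree). These define "zones" in memory, such as: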
- The Code section (text).
- The Data section (variables).
- The Heap (dynamic memory like malloc).
- The Stack (local variables and function calls).
Start/End addresses: Fields recording where the code, data, and heap regions begin and end (for example start_brk and brk for the heap) and where the stack starts (start_stack).

Analogy: Think of mm_struct as a Floor Plan of the worker’s office. It shows exactly where the desk is, where the files are stored, and which rooms they are allowed to enter.
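To make the "floor plan" idea concrete, here is a toy user-space sketch of looking up which memory area contains a given address. The types toy_vma and toy_mm and the function toy_vma_lookup are hypothetical stand-ins, not the kernel's layouts or API, and the sorted array is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for vm_area_struct: one contiguous region. */
struct toy_vma {
    unsigned long start;  /* first address inside the region */
    unsigned long end;    /* first address past the region */
    const char *name;     /* label for illustration only */
};

/* Toy stand-in for mm_struct: regions sorted by address, non-overlapping. */
struct toy_mm {
    const struct toy_vma *vmas;
    size_t count;
};

/* Return the region containing addr, or NULL if addr is unmapped. */
const struct toy_vma *toy_vma_lookup(const struct toy_mm *mm, unsigned long addr)
{
    assert(mm != NULL);
    size_t lo = 0, hi = mm->count;

    /* Binary search for the first region whose end lies above addr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (mm->vmas[mid].end > addr)
            hi = mid;
        else
            lo = mid + 1;
    }
    /* That region only counts if addr is not in the gap before it. */
    if (lo < mm->count && mm->vmas[lo].start <= addr)
        return &mm->vmas[lo];
    return NULL;
}
```

An address that falls in a gap between regions returns NULL here, which loosely corresponds to the case where a real access would fault.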

How they work together

The kernel does not locate an mm_struct on its own; it reaches it through a task_struct. In that sense task_struct is the top-level object and mm_struct hangs off it.

The task_struct contains a pointer (named mm) that points to the mm_struct.

struct task_struct {
    /* ... */
    pid_t pid;
    struct mm_struct *mm;  /* Pointer to the memory descriptor */
    /* ... */
};

The Difference between Processes and Threads

Linux does not technically distinguish between a process and a thread at the kernel level.

A Process is a task_struct that has its own unique memory space (mm_struct).
A Thread is a task_struct that shares its mm_struct (and files) with another task_struct.

Crucial Insight: This is why threads are called "lightweight." Creating one does not require building a new memory map; the new task simply points at the same "office floor plan" as the task that created it. If Thread A changes a variable in memory, Thread B sees the change because both tasks use the same mm_struct and therefore the same page tables.
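The sharing rule can be sketched with a toy model in user space. Everything here is hypothetical (toy_space, toy_task, toy_clone, and the share_vm flag, which plays the role of the CLONE_VM flag to clone); the real kernel does far more work when it duplicates an address space.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for mm_struct. */
struct toy_space {
    int users;       /* how many tasks point at this address space */
    int shared_var;  /* stands in for a variable in user memory */
};

/* Toy stand-in for task_struct. */
struct toy_task {
    int pid;
    struct toy_space *mm;
};

/* Create a task from parent. With share_vm set the child points at the
 * parent's address space (a thread); otherwise it gets a private copy
 * (a new process). Returns NULL on allocation failure. */
struct toy_task *toy_clone(struct toy_task *parent, int new_pid, int share_vm)
{
    assert(parent != NULL && parent->mm != NULL);

    struct toy_task *child = malloc(sizeof *child);
    if (!child)
        return NULL;
    child->pid = new_pid;

    if (share_vm) {
        child->mm = parent->mm;
        child->mm->users++;
    } else {
        child->mm = malloc(sizeof *child->mm);
        if (!child->mm) {
            free(child);
            return NULL;
        }
        *child->mm = *parent->mm;  /* copy contents, then own it alone */
        child->mm->users = 1;
    }
    return child;
}
```

A write through the thread's mm pointer is visible through the parent's pointer, while the process-style child keeps its own copy.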

Special Case: Kernel Threads

You might notice some tasks in top or ps that have names in brackets, like [kworker] or [ksoftirqd]. These are Kernel Threads.

Kernel threads have a task_struct, but their mm pointer is NULL.

Why? Because they only operate in "Kernel Space" and do not have a private user-space memory map (they don't need a heap or a user stack).

How does the kernel track task_structs?

The kernel doesn't just store task_struct objects in a single pile; it uses multiple overlapping data structures to track them depending on what it needs to do (e.g., find a process by ID, find the next process to run, or find all children of a parent).

1. The Global Task List (The "Master List")

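Every process in the system (more precisely, every thread group leader) is linked into a circular doubly linked list. The other threads of a process are reached through their leader.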

The Head: The list starts with the init_task (the "swapper" or "idle" process, PID 0).
The Links: Each task_struct has a tasks field (of type struct list_head) containing pointers to the previous and next task in the list.
Purpose: This allows the kernel to iterate through every process in the system, for example with the for_each_process() macro. Tools such as ps and top do not walk this list themselves; they read /proc, and the kernel enumerates tasks on their behalf.
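The embedded-link pattern behind the task list can be sketched in plain C. This is a simplified user-space imitation: list_head and container_of mirror the kernel's names, while toy_task, list_init, and toy_collect_pids are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Link node that gets embedded inside a larger structure. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_task {
    int pid;
    struct list_head tasks;  /* links this task into the global ring */
};

/* Make head an empty ring that points at itself. */
void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

/* Insert node just before head, i.e. at the end of the ring. */
void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the ring starting after init and stop on returning to it,
 * storing up to max PIDs in out. Returns the number stored. */
int toy_collect_pids(struct toy_task *init, int *out, int max)
{
    assert(init != NULL && out != NULL);
    int n = 0;
    for (struct list_head *p = init->tasks.next;
         p != &init->tasks && n < max; p = p->next) {
        struct toy_task *t = container_of(p, struct toy_task, tasks);
        out[n++] = t->pid;
    }
    return n;
}
```

Because the list is circular, the walk needs no NULL check: it ends when it arrives back at the starting task, which is itself skipped.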

2. The PID Hash Table (Finding by ID)

Scanning a linked list of 1,000+ processes just to find "PID 542" would be too slow ($O(n)$ complexity). To find a specific process quickly, the kernel uses a Hash Table.

Mechanism: The kernel hashes the PID to get an index in a table.
The Structure: Since multiple PIDs might hash to the same value (a collision), each bucket in the hash table points to a chain of PID entries, each of which leads back to its task_struct.
Speed: This allows the kernel to find any process by its ID almost instantly ($O(1)$ complexity on average).

(Note: Recent kernels, since around version 4.15, no longer use a dedicated PID hash table. Each PID namespace instead has an IDR, an integer-ID allocator built on a radix tree, that maps PID numbers to their entries. The concept of a fast look-up structure remains the same.)
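The hash-and-chain idea can be sketched as a small user-space table. The bucket count, the masking hash function, and all toy_ names are assumptions made for illustration; the kernel's real structures differ.

```c
#include <assert.h>
#include <stddef.h>

#define PID_BUCKETS 16  /* toy size; a power of two so masking works */

struct toy_task {
    int pid;
    struct toy_task *hash_next;  /* next task in the same bucket */
};

struct toy_pid_table {
    struct toy_task *bucket[PID_BUCKETS];
};

/* Map a PID to a bucket index. */
static unsigned pid_hash(int pid)
{
    return (unsigned)pid & (PID_BUCKETS - 1);
}

/* Push task onto the front of its bucket's chain. */
void toy_pid_insert(struct toy_pid_table *t, struct toy_task *task)
{
    assert(t != NULL && task != NULL);
    unsigned i = pid_hash(task->pid);
    task->hash_next = t->bucket[i];
    t->bucket[i] = task;
}

/* Walk only the one chain that could hold pid; NULL if absent. */
struct toy_task *toy_pid_find(const struct toy_pid_table *t, int pid)
{
    assert(t != NULL);
    for (struct toy_task *p = t->bucket[pid_hash(pid)]; p; p = p->hash_next)
        if (p->pid == pid)
            return p;
    return NULL;
}
```

With this hash, PIDs 542 and 558 land in the same bucket, so a lookup for either walks a two-entry chain instead of the whole task list.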

3. The `current` Macro (The "Right Now" Tracker)

On a multi-core system, each CPU is running exactly one task at any given moment. The kernel needs a way to instantly answer the question: "Who am I right now?"

Modern x86 Architecture: The kernel uses Per-CPU variables. Each CPU has its own copy of a variable holding a pointer to the task_struct currently running on it, read at a fixed offset from a per-CPU base register.
The current Macro: When kernel code wants to see the current process's credentials or open files, it simply references current->cred or current->files. (The UID lives inside the credentials, as current->cred->uid, rather than directly in task_struct.)
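As a rough user-space analogy, the per-CPU idea amounts to one "current task" slot per CPU, updated at every context switch. The array, the CPU count, and the toy_ functions below are hypothetical; the kernel reads its slot without passing a CPU number because each CPU addresses its own copy.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4  /* assumed CPU count for the sketch */

struct toy_task {
    int pid;
};

/* One slot per CPU: the task running there right now. */
static struct toy_task *toy_current_task[TOY_NR_CPUS];

/* Context switch: record that cpu now runs next. */
void toy_switch_to(int cpu, struct toy_task *next)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_current_task[cpu] = next;
}

/* Answer "who is running on this CPU?" in constant time. */
struct toy_task *toy_current(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    return toy_current_task[cpu];
}
```

The lookup never searches anything, which is the point: the answer is a single pointer read.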

4. The Family Tree (Parent/Child Tracking)

Processes have strict "bloodlines." The kernel tracks these relationships using pointers within the task_struct:

real_parent: Points to the task_struct that created this process.
children: The head of a list containing all the "kids" this process has spawned.
sibling: Links this process to other children of the same parent.

This hierarchy is critical. When a process dies, the kernel uses these pointers to find the parent to send a "Child Exit" signal (SIGCHLD).
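The three relationships can be modeled with plain pointers. This sketch keeps the field name real_parent but replaces the kernel's children and sibling list heads with hypothetical first_child and next_sibling pointers to stay short.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
    int pid;
    struct toy_task *real_parent;   /* task that created this one */
    struct toy_task *first_child;   /* head of this task's children */
    struct toy_task *next_sibling;  /* next child of the same parent */
};

/* Link child under parent, at the front of the parent's child list. */
void toy_add_child(struct toy_task *parent, struct toy_task *child)
{
    assert(parent != NULL && child != NULL);
    child->real_parent = parent;
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}

/* Count direct children by following the sibling links. */
int toy_count_children(const struct toy_task *parent)
{
    assert(parent != NULL);
    int n = 0;
    for (const struct toy_task *c = parent->first_child; c; c = c->next_sibling)
        n++;
    return n;
}

/* Number of real_parent hops from task up to the root of the tree. */
int toy_depth(const struct toy_task *task)
{
    assert(task != NULL);
    int hops = 0;
    while (task->real_parent) {
        task = task->real_parent;
        hops++;
    }
    return hops;
}
```

Finding the task to notify when a child exits is then a single pointer dereference: child->real_parent.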

5. The Scheduler's Runqueues (The "Waiting Room")

Just because a process exists doesn't mean it is running. Most processes are sleeping (waiting for a keypress or a network packet).

The Scheduler maintains its own tracking structures:

Runqueue: Each CPU core has a "Runqueue" of tasks that are ready to run (TASK_RUNNING).
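Red-Black Tree: In the CFS (Completely Fair Scheduler), runnable tasks are stored in a Red-Black Tree (a balanced search tree) ordered by virtual runtime (vruntime), which is the CPU time a task has consumed, weighted by its priority. The task with the smallest vruntime sits at the far left of the tree and gets picked next. (Kernels since 6.6 replace CFS with the EEVDF scheduler, which still uses a red-black tree but also considers a virtual deadline when picking.)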
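The selection rule can be sketched without the tree. The kernel's name for the weighted CPU time is vruntime; the sketch below assumes a plain array and a linear scan in place of the red-black tree, and uses hypothetical toy_ names. The baseline weight of 1024 matches the kernel's weight for a task at nice 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024  /* weight of a default-priority task */

struct toy_entity {
    int pid;
    unsigned weight;    /* higher weight means a larger CPU share */
    uint64_t vruntime;  /* weighted CPU time consumed, in ns */
};

/* Pick the runnable entity with the smallest vruntime (the "leftmost"
 * one). Ties go to the earliest array slot. NULL if the queue is empty. */
struct toy_entity *toy_pick_next(struct toy_entity *rq, size_t n)
{
    struct toy_entity *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

/* Charge delta_ns of real CPU time. Heavier tasks accumulate vruntime
 * more slowly, so they are picked more often. */
void toy_account(struct toy_entity *se, uint64_t delta_ns)
{
    assert(se != NULL && se->weight > 0);
    se->vruntime += delta_ns * TOY_NICE_0_WEIGHT / se->weight;
}
```

A task with double the baseline weight is charged half as much vruntime for the same real time, which is how priority turns into CPU share. The tree exists only to make "find the smallest" cheap when there are many tasks.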

6. Wait Queues (The "Waiting List")

If a process is waiting for something specific (like data from a hard drive or a mutex lock), it is marked as sleeping, removed from the Runqueue, and placed into a Wait Queue associated with that specific event.

When the hard drive finishes reading the data, it triggers an interrupt.
The kernel then looks at the Wait Queue for that event, marks the waiting tasks as runnable again, and puts them back on a Runqueue.
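The sleep and wake-up cycle can be sketched as a toy model. The state names, the singly linked queue, and the toy_ functions are all assumptions for illustration; the kernel's wait queues carry more information and can wake tasks selectively.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_RUNNABLE, TOY_SLEEPING };

struct toy_task {
    int pid;
    enum toy_state state;
    struct toy_task *wait_next;  /* next sleeper on the same queue */
};

/* One queue per event, e.g. "this disk read has completed". */
struct toy_wait_queue {
    struct toy_task *head;
};

/* Put the task to sleep on the event's queue. */
void toy_wait_on(struct toy_wait_queue *wq, struct toy_task *t)
{
    assert(wq != NULL && t != NULL);
    t->state = TOY_SLEEPING;
    t->wait_next = wq->head;
    wq->head = t;
}

/* The event happened: empty the queue and mark every sleeper runnable.
 * Returns how many tasks were woken. */
int toy_wake_up_all(struct toy_wait_queue *wq)
{
    assert(wq != NULL);
    int woken = 0;
    while (wq->head) {
        struct toy_task *t = wq->head;
        wq->head = t->wait_next;
        t->wait_next = NULL;
        t->state = TOY_RUNNABLE;
        woken++;
    }
    return woken;
}
```

Only tasks parked on the queue for that particular event are touched, so a completed disk read does not disturb tasks waiting on the network.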

Summary Table

Feature	`task_struct`	`mm_struct`
Common Name	Process Descriptor	Memory Descriptor
Scope	One per thread/process	One per address space
Main Job	Identity, Scheduling, State	Memory Layout, Page Tables
Shared?	No (unique to every task)	Yes (shared by threads in a process)
Location	Defined in `<linux/sched.h>`	Defined in `<linux/mm_types.h>`