Understanding OverlayFS in Linux and Containers
If you’ve ever wondered how you can start 50 different Docker containers in seconds without your hard drive exploding, the answer is OverlayFS.
OverlayFS is the "secret sauce" of the container world. It is a specialized filesystem that allows Linux to stack multiple directories on top of each other and present them as a single, unified view.
What is OverlayFS?
OverlayFS is a Union Filesystem. Unlike a traditional filesystem (like EXT4 or XFS) that manages how data is physically laid out on a disk, a Union Filesystem is a "logical" filesystem. It takes existing directories and "unions" them together.
Imagine you have two overhead projector transparencies:
- The bottom one has a map of a city.
- The top one has red lines showing bus routes.
When you look through both, you see a single image of a city with bus routes. OverlayFS does exactly this with folders on your Linux machine.
The Four Pillars of an Overlay
To create an OverlayFS mount, Linux uses four specific directories:
- LowerDir (The Base): This is the "read-only" layer. You can have multiple LowerDirs stacked together. In the container world, these are your Image Layers.
- UpperDir (The Changes): This is the "writable" layer. Any new files you create or existing files you modify are stored here.
- MergedDir (The View): This is where the magic happens. This directory doesn't actually "hold" files; it is a virtual view that combines the Lower and Upper directories. This is what the Running Container actually sees.
- WorkDir (The Scratchpad): An internal directory used by the Linux kernel to handle "atomic" operations (like moving files) before they are finalized in the UpperDir.
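You can wire these four pillars together by hand with a plain `mount` call. Here is a minimal sketch using throwaway directories under a temp folder; the mount step is guarded because it needs root and a non-overlay backing filesystem, but the directory layout alone already shows the structure:

```shell
# Create the four directories OverlayFS needs, on an ordinary backing fs.
base=$(mktemp -d)
mkdir -p "$base/lower" "$base/upper" "$base/work" "$base/merged"
echo "from the image" > "$base/lower/base.txt"

# The mount itself needs CAP_SYS_ADMIN, so only attempt it as root.
if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
      -o "lowerdir=$base/lower,upperdir=$base/upper,workdir=$base/work" \
      "$base/merged" 2>/dev/null; then
    cat "$base/merged/base.txt"        # the LowerDir file shows through
    echo "a change" > "$base/merged/new.txt"
    ls "$base/upper"                   # new.txt was physically written here
    umount "$base/merged"
else
    echo "skipping mount (needs root and a non-overlay backing filesystem)"
fi
```

If the mount succeeds, anything you write into `merged/` lands physically in `upper/`, while `lower/` never changes.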
How Containers Use OverlayFS
When you run a command like docker run ubuntu, a specific sequence of events occurs in the filesystem:
Step A: The Image is "Lower"
The Ubuntu image you downloaded is actually a set of tarballs. Containerd unpacks these into several LowerDirs. These layers are read-only; they never change, which is why 100 containers can share the same Ubuntu image safely.
Step B: The Container is "Upper"
The moment the container starts, the engine creates a brand new, empty UpperDir specifically for that container instance.
Step C: The Mount
The kernel mounts these together. The container’s process is "trapped" inside the MergedDir.
- If the app looks for `/bin/bash`, it sees it from the LowerDir.
- If the app creates `/test.txt`, it is physically written to the UpperDir.
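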
Copy-on-Write (CoW): The Killer Feature
One of the most elegant parts of OverlayFS is how it handles file modifications. This is known as Copy-on-Write.
Suppose your container wants to edit /etc/hosts, a file that exists in the read-only LowerDir. Since the LowerDir cannot be changed, OverlayFS performs a "magic trick":
- It copies the file from the LowerDir up to the UpperDir.
- It allows the container to modify that copy in the UpperDir.
- In the Merged view, the version in the UpperDir "hides" the version in the LowerDir.
The result? The original image remains pristine and untouched, while the container gets its own custom version of the file instantly.
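The lookup rule behind this can be mimicked in a few lines of plain shell, with no real mount involved. The `lookup` function below is a made-up toy for illustration: a name found in the upper directory wins, otherwise the lower directory answers.

```shell
# Toy model of the Merged view: upper shadows lower.
lookup() {  # usage: lookup <filename> <upperdir> <lowerdir>
    if [ -e "$2/$1" ]; then cat "$2/$1"; else cat "$3/$1"; fi
}

demo=$(mktemp -d)
mkdir -p "$demo/lower" "$demo/upper"
echo "original hosts" > "$demo/lower/hosts"

lookup hosts "$demo/upper" "$demo/lower"    # prints "original hosts"
echo "edited hosts" > "$demo/upper/hosts"   # the result of a "copy-up"
lookup hosts "$demo/upper" "$demo/lower"    # prints "edited hosts"
ls "$demo/lower"                            # the lower copy is untouched
```

Real OverlayFS does the copy-up for you automatically the moment the container opens the file for writing; the toy just shows the shadowing rule.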
What happens when you delete a container?
This is why containers are so "lightweight" and "ephemeral."
When you stop and delete a container, the container engine simply:
- Unmounts the OverlayFS.
- Deletes the UpperDir and WorkDir.
Just like that, all the changes made by that container are gone. The LowerDir (the image) stays on your disk, taking up no extra space, ready to be used by the next container.
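In shell terms, the teardown really is just two commands. A sketch against a hypothetical manual-mount layout (no real mount is needed to see the effect on disk):

```shell
base=$(mktemp -d)
mkdir -p "$base/lower" "$base/upper" "$base/work" "$base/merged"
echo "image file" > "$base/lower/base.txt"
echo "container change" > "$base/upper/scratch.txt"

umount "$base/merged" 2>/dev/null || true   # step 1: unmount (a no-op here)
rm -rf "$base/upper" "$base/work"           # step 2: the container's writes are gone
ls "$base/lower"                            # prints "base.txt": the image survives
```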
Why OverlayFS Won the Container War
Before OverlayFS was merged into the Linux Kernel (around version 3.18), container engines used other drivers like AUFS or DeviceMapper. OverlayFS eventually became the standard because:
- Performance: It is built into the kernel and operates at near-native speeds.
- Inode Efficiency: It is much lighter on system resources (Inodes) than its predecessors.
- Page Cache Sharing: Since multiple containers use the same LowerDir files, the Linux kernel is smart enough to load that file into RAM only once and share it across all containers, saving massive amounts of memory.
Overlay2
The transition from the original overlay driver to overlay2 was a major milestone in container history. Today, overlay2 is the default for Docker, and containerd and Podman default to their equivalent overlay snapshotter and driver.
The original version is now considered legacy and is almost never used in modern production environments.
The Core Difference: Multiple LowerDirs
The biggest technical difference between the two is how they handle layer stacking.
- Original `overlay`: Could only handle two layers at a time: one "upper" and one "lower."
  - The Problem: Since container images often have 10, 20, or 30 layers, the original driver had to create a complex "chain" of mounts. Each layer was a separate directory that had to be linked to the one below it. This was incredibly "noisy" for the Linux kernel to manage.
- `overlay2`: Introduced support for multiple LowerDirs (up to 128 layers) in a single mount command.
  - The Solution: Instead of a complex chain, `overlay2` tells the kernel: "Here is one writable folder, and here are 20 read-only folders. Merge them all at once." This is much faster and cleaner for the kernel to process.
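The difference shows up directly in the mount options. A sketch of how a multi-layer, overlay2-style option string is assembled; the layer paths here are invented for illustration:

```shell
# Three read-only layers, listed top-most first, joined by colons:
layers="/l/AAA:/l/BBB:/l/CCC"
opts="lowerdir=$layers,upperdir=/c1/diff,workdir=/c1/work"
echo "mount -t overlay overlay -o $opts /c1/merged"
```

One mount command, however many layers; the original driver would have needed a separate mount per layer instead.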
Inode Efficiency (The "Disk Space" Fix)
In Linux, every file and directory consumes an Inode (an entry in the filesystem table). There is a finite number of Inodes on your disk.
- Original `overlay`: Because of its "chaining" method, it had to create a massive number of hard links and subdirectories to track the relationships between layers. This often led to Inode Exhaustion: users would find that their disk reported 50% free space, but they couldn't create any new files because the `overlay` driver had used up all the available Inodes.
- `overlay2`: By natively supporting multiple layers in one mount, it significantly reduced the need for hard links. It is much "gentler" on the filesystem's metadata, making it far more stable for large-scale container deployments.
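You can watch inode consumption directly. `df -i` reports inode usage per filesystem; an IUse% of 100% means exhaustion, even while `df -h` still shows plenty of free space:

```shell
# Inode usage for the filesystem backing /var/lib (where layers live):
df -i /var/lib
```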
Performance
- Original
overlay: As the number of image layers increased, performance dropped. The kernel had to traverse a deep "tree" of mounts to find a file in a bottom layer. overlay2: Performance remains high regardless of how many layers you have (up to the 128-layer limit). Searching for a file in the 50th layer is nearly as fast as searching in the 1st layer.
How to verify on your system
You can see which driver you are using by running:
```shell
docker info | grep "Storage Driver"
# OR, for containerd/nerdctl:
nerdctl info | grep "Storage Driver"
```
On almost any modern machine, the output will be: `Storage Driver: overlay2`.
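You can also confirm that the kernel itself supports OverlayFS, independent of any container engine:

```shell
# "nodev overlay" appears here when the overlay module is built in or loaded:
grep overlay /proc/filesystems || echo "overlay not available in this kernel"
```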
Where are the layers stored?
The location depends on which container engine you are using, but they all follow the same pattern: they live inside the persistent state directory (/var/lib) because layers must survive a reboot.
If you use containerd (Standard for Kubernetes)
The layers are managed by the "snapshotter."
- Path: `/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/`
- Structure: Inside this folder, you will see numbered directories (1, 2, 3...).
- Inside each number:
  - `fs/`: This is the actual directory containing the files for that layer.
  - `work/`: The OverlayFS internal workspace.
If you use Docker
Docker calls its layers "overlay2" storage.
- Path: `/var/lib/docker/overlay2/`
- Structure: You will see many folders with long, random-looking hashes (e.g., `7b3a4f...`).
- Inside each hash folder:
  - `diff/`: This contains the actual file changes for that specific layer.
  - `link`: A short identifier for the layer (used to prevent long path string errors in the kernel).
  - `lower`: A text file listing all the layers that sit underneath this one.
  - `merged/`: (Only present if the container is running.) This is the "Merged View" we discussed. It is the actual mount point the container uses.
  - `work/`: The internal kernel workspace.
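Rather than decoding hashes by hand, you can ask Docker which directories a given container uses. This needs a running daemon, and `mycontainer` is a placeholder name, so the sketch is guarded:

```shell
if command -v docker >/dev/null 2>&1; then
    # GraphDriver.Data also exposes LowerDir, MergedDir, and WorkDir.
    docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' mycontainer 2>/dev/null \
        || echo "no container named mycontainer"
else
    echo "docker not installed"
fi
```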
If you use Podman
- Root mode: `/var/lib/containers/storage/overlay/`
- Rootless mode (most common): `~/.local/share/containers/storage/overlay/`
How to "Read" these folders
If you go into one of these directories, you can actually see the files.
For example, if you are in a Docker layer's diff folder:
- If that layer installed `python`, you will see a `usr/bin/python` file inside that `diff` folder.
- If that layer deleted a file from a lower layer, you will see a "Whiteout" file (a special character device file that tells OverlayFS to "hide" the file from the view).
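A whiteout is nothing exotic: it is simply a character device with device number 0:0. A sketch that creates one manually (the path is a throwaway, and `mknod` needs root, so it is guarded):

```shell
d=$(mktemp -d)
if [ "$(id -u)" -eq 0 ] && mknod "$d/deleted-file" c 0 0 2>/dev/null; then
    # %F = file type, %t:%T = major:minor device numbers (GNU stat).
    stat -c '%F %t:%T' "$d/deleted-file"
else
    echo "creating device nodes needs root (CAP_MKNOD)"
fi
```

Real whiteouts in a Docker layer can be found the same way: they are the `-type c` entries under a layer's `diff/` directory.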
A Look at the "l" directory (The "Shortcut" folder)
Inside /var/lib/docker/overlay2/, you will also see a directory named l (lowercase L).
- This folder contains symbolic links to all the actual layer folders.
- Why? Because the Linux `mount` command has a limit on how long its argument string can be. Since long hashes make for very long commands, Docker creates short "nicknames" (like `6YJ2...`) in the `l` folder and uses those to build the OverlayFS mount.
⚠️ Warning: Don't Touch!
While it is fascinating to explore these directories, never manually add, delete, or edit files inside these folders while the container engine is running.
- Corruption: You can corrupt the container's filesystem.
- Inconsistency: The container engine keeps a metadata database (`meta.db` for containerd) that tracks these files. If you change a file behind its back, the database and the disk will no longer match, leading to errors that are very hard to fix.
What are "backing filesystems"?
OverlayFS is not a "fully functional" standalone filesystem in the way that EXT4, XFS, or Btrfs are. It is what’s known as a Stacking Filesystem.
The "Manager" vs. The "Worker"
Think of a traditional filesystem (like EXT4) as a Worker. It knows exactly which physical sectors on the hard drive represent which files. It handles the low-level hardware talk.
Think of OverlayFS as a Manager. It doesn't own any space on the hard drive. It doesn't know how to talk to a disk. Instead, it "hires" other filesystems to do the physical work.
The "Backing Filesystems" are the actual EXT4, XFS, or Btrfs partitions where the data physically resides.
Why is it not "Fully Functional"?
If you were to take a brand new hard drive, you could not "format" it as OverlayFS.
- You cannot run `mkfs.overlay /dev/sdb`. (That command doesn't exist!)
- You must first format the drive as a standard filesystem (like XFS).
- Then, you create directories on that XFS drive.
- Finally, you tell the kernel: "Take these folders on XFS and overlay them."
OverlayFS lacks the following features of a "real" filesystem:
- Disk Layout: It has no concept of blocks, clusters, or cylinders.
- Journaling: It doesn't have a crash-recovery journal (it relies on the backing filesystem's journal).
- Hardware Support: It cannot talk to an SSD, NVMe, or HDD.
Requirements for a Backing Filesystem
Not every filesystem can be a "backing" filesystem for OverlayFS. To be a good "worker" for Overlay, the backing filesystem must support certain technical features, most notably `d_type`.
- `d_type` support: This allows the filesystem to quickly tell the kernel what type of file an entry is (directory, file, symbolic link) without having to look at the Inode.
  - XFS Example: For a long time, this was a common headache. If you formatted an XFS drive without `ftype=1`, OverlayFS would simply refuse to work, and Docker would fail to start.
- Recommended Backing FS: Today, EXT4 and XFS (properly formatted) are the gold standards for backing OverlayFS.
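You can check an existing XFS filesystem for this setting with `xfs_info`. The sketch is guarded, since the tool only exists where xfsprogs is installed and the path you pass must actually be XFS:

```shell
if command -v xfs_info >/dev/null 2>&1; then
    # Look for "ftype=1" in the naming line; ftype=0 will break OverlayFS.
    xfs_info / 2>/dev/null | grep ftype || echo "/ is not XFS"
else
    echo "xfs_info not installed (from the xfsprogs package)"
fi
```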
The "Upper" and "Lower" Relationship
The backing filesystem actually does two different jobs:
- The LowerDir (The Image): Usually sits on a backing filesystem as a set of compressed or unpacked read-only files.
- The UpperDir (The Changes): This must be on a filesystem that supports extended attributes (xattrs), such as the `trusted.overlay.*` labels. This is how OverlayFS keeps track of which files are "whiteouts" (deleted) or "opaque."
Why is this distinction important?
Because OverlayFS sits on top of another filesystem, it inherits the bugs and limits of that filesystem.
- If your backing filesystem (EXT4) runs out of Inodes, OverlayFS will fail, even if the "files" in the container look small.
- If your backing filesystem is slow at writing small files, your container will be slow at writing small files.
- NFS Caution: You generally cannot use NFS (Network File System) as a backing filesystem for the `UpperDir`, because NFS doesn't handle the special metadata and atomic "rename" operations OverlayFS requires.
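To see which filesystem actually backs a directory, `stat -f` (GNU coreutils) prints the filesystem type of whatever is mounted there:

```shell
# Filesystem type backing the layer store location:
stat -f -c %T /var/lib
```

On a typical host this prints something like ext2/ext3 or xfs; inside a container it often prints overlayfs, because the container's own root is an overlay.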
Check the mounts
When you run mount | grep overlay, you are peeking under the hood of your container engine. The output is famously long and messy, but it follows a very strict pattern.
A Typical Example (Docker)
If you have a single Docker container running, the output will look something like this:
```shell
overlay on /var/lib/docker/overlay2/abc123.../merged type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/XYZ:/var/lib/docker/overlay2/l/123,upperdir=/var/lib/docker/overlay2/abc123.../diff,workdir=/var/lib/docker/overlay2/abc123.../work)
```
Breaking Down the Parts
| Part of Output | Description |
|---|---|
| `overlay` | The Source. Since OverlayFS is virtual, it doesn't have a physical device (like `/dev/sda1`), so it just identifies as "overlay". |
| `on /var/.../merged` | The Mount Point. This is the MergedDir. This is the directory the container process is actually "inside" of. |
| `type overlay` | The Filesystem Type. |
| `(rw,relatime)` | Mount Options. `rw` means it is Read-Write (the container can make changes). |
The "Options" Section (The most important part)
The text inside the parentheses (...) tells you exactly how the layers are stacked:
- `lowerdir=...`: This is a colon-separated (`:`) list of all the read-only layers.
  - Note: In `overlay2`, the order matters. The first directory in the list is the top-most read-only layer, and the last directory is the bottom-most (the base OS).
  - Notice the `/l/` in the path? As we discussed earlier, these are short-name symlinks to keep the mount command from getting too long.
- `upperdir=...`: This points to the `diff` folder of that specific container. This is where every file you create while the container is running is physically stored on the host.
- `workdir=...`: Points to the internal "scratchpad" directory on the backing filesystem.
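Pulling the layer stack out of that wall of text is a one-liner. A sketch against a shortened example line (not real output from any machine):

```shell
line='overlay on /merged type overlay (rw,relatime,lowerdir=/l/XYZ:/l/123,upperdir=/c/diff,workdir=/c/work)'

# Extract the lowerdir= value and print one layer per line, top-most first:
echo "$line" | grep -o 'lowerdir=[^,]*' | cut -d= -f2 | tr ':' '\n'
# Output:
#   /l/XYZ
#   /l/123
```

Piping real `mount | grep overlay` output through the same filter gives you a readable, ordered layer list for every running container.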
How it looks in Kubernetes (containerd)
If you are running in Kubernetes using containerd, the mount point looks slightly different:
```shell
overlay on /run/containerd/io.containerd.runtime.v2.task/k8s.io/xyz.../rootfs type overlay (...)
```
- The Mount Point: Kubernetes/containerd usually mounts the root filesystem (`rootfs`) under `/run/containerd/...`.
- The Data: Even though the mount point lives in `/run` (typically a volatile tmpfs), the `lowerdir` and `upperdir` in the options will still point to `/var/lib/containerd/...` (persistent disk).
Why is the output so "ugly"?
If you have an image with 30 layers, the lowerdir section of the mount output will be a massive string of text, often 1,000+ characters long. It looks like a wall of gibberish, but it is actually a very precise map:
- UpperDir: The Kernel looks here first. If a file exists here, it "covers up" whatever is in the lower layers.
- Middle of the list: Contains your installed apps (e.g., `python`, `nginx`).
- Bottom of the list: The Kernel looks here last; this is where the base files live (e.g., `ld-linux.so`).
Pro-Tip
If you want to see this in a much more readable format, try using the findmnt command:
```shell
findmnt -t overlay
```
This will show you the same information but in a nice tree structure that is much easier on the eyes!