eBPF

What is a BPF Object?

In the context of Linux and eBPF (Extended Berkeley Packet Filter), a BPF Object refers to the compiled binary artifact (usually an ELF file) that contains your BPF code and data structures.

If you are writing BPF tools (using libbpf), the "BPF Object" (struct bpf_object) is the main handle you use in your user-space code to manage everything you are about to load into the kernel.

Think of the BPF Object as a Container or a Package.

What is inside a BPF Object?

When you compile your C code (e.g., my_tool.bpf.c) using clang, it creates an .o file. This file is the BPF Object. It contains three main things:

  • BPF Programs: The actual functions that will run in the kernel (e.g., "Trigger this code when sys_open is called"). A single object can contain multiple programs.
  • BPF Maps: The shared memory structures (Hash tables, Arrays) used to store data or share it between the kernel and user space.
  • Relocation Info & BTF: Metadata that tells libbpf how to adjust the code so it works on the specific version of Linux you are running (this is the "Compile Once, Run Everywhere" magic).

The Hierarchy

It helps to visualize the hierarchy libbpf uses:

  • BPF Object (The File)
    • Contains: BPF Map 1
    • Contains: BPF Map 2
    • Contains: BPF Program A (e.g., for kprobe/sys_execve)
    • Contains: BPF Program B (e.g., for tracepoint/syscalls/sys_enter_open)

How you use it (The Lifecycle)

In a typical BPF application (like one written in C or Go), the "Object" is the unit you manage during the setup phase:

  1. Open: You "open" the BPF Object. libbpf reads the ELF file and parses the sections but doesn't touch the kernel yet.
  2. Load: You "load" the BPF Object. libbpf creates the Maps in the kernel, verifies the bytecode, and loads the Programs into the kernel.
  3. Attach: You "attach" the specific Programs inside the Object to their hooks (events).
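
As a minimal sketch of that lifecycle using libbpf's C API (modern libbpf 1.x conventions; the object file name my_tool.bpf.o and the program name kprobe_clone are assumptions for illustration, and error handling is trimmed):

#include <bpf/libbpf.h>

int main(void) {
    // 1. Open: parse the ELF sections; no kernel interaction yet
    struct bpf_object *obj = bpf_object__open_file("my_tool.bpf.o", NULL);
    if (!obj)
        return 1;

    // 2. Load: create the maps, verify the bytecode, load the programs
    if (bpf_object__load(obj)) {
        bpf_object__close(obj);
        return 1;
    }

    // 3. Attach: hook one specific program inside the object to its event
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "kprobe_clone");
    if (prog)
        bpf_program__attach(prog);

    // ... run until done, then clean up ...
    bpf_object__close(obj);
    return 0;
}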

Summary

  • BPF Program: A single function (bytecode) that runs on an event.
  • BPF Map: A storage area for data.
  • BPF Object: The entire collection of programs and maps compiled together into a single file, which you load and manage as a group.

What is a Map?

In the context of BPF (eBPF), a Map is a shared data structure that allows the BPF program (running in the kernel) and your application (running in user space) to talk to each other.

Since BPF programs are highly restricted—they cannot access arbitrary memory and they exit immediately after their event finishes—they need a specific place to store data. That place is the Map.

Why do we need Maps?

BPF programs are "stateless." If you have a BPF program attached to a network packet, it wakes up, inspects one packet, and then vanishes. It doesn't remember what happened to the previous packet.

Maps solve two specific problems:

  1. Statefulness: They allow the BPF program to remember data between events (e.g., "I have seen 5 packets from this IP address so far").
  2. Communication: They allow the User Space application to read what the Kernel is seeing, or to send configuration down to the Kernel.

How it works (The Architecture)

      [ User Space App ]                   [ Kernel Space ]
      (Your Python/C/Go Tool)             (BPF Program)
              |                                  |
              | R/W                              | R/W
              v                                  v
      +--------------------------------------------------+
      |                    BPF MAP                       |
      |             (Key -> Value Store)                 |
      +--------------------------------------------------+
  1. The Kernel writes: A network packet arrives. The BPF program checks the Map: "Is this IP in the blocklist?" or it updates the Map: "Increment the counter for this IP."
  2. The User reads: Your tool running in the terminal reads the Map every second to display the current counters to you.

Common Types of Maps

While they are generally Key-Value stores, there are different "flavors" optimized for different jobs:

A. Hash Map (BPF_MAP_TYPE_HASH)

  • Structure: Standard Key-Value pair.
  • Use Case: You don't know the keys in advance.
  • Example: Counting how many bytes each specific Process ID (PID) has written.
    • Key: PID (1234)
    • Value: Bytes (500)

B. Array (BPF_MAP_TYPE_ARRAY)

  • Structure: A pre-allocated list (like a C array). Faster than a Hash Map but fixed size.
  • Use Case: Global settings or a small set of known keys.
  • Example: A simple on/off switch or global error counter.
    • Key: 0 (Global Error Count)
    • Value: 55

C. Ring Buffer (BPF_MAP_TYPE_RINGBUF)

  • Structure: A circular queue (First-In-First-Out).
  • Use Case: Sending "events" to user space efficiently.
  • Example: Every time a file is opened, the BPF program pushes the filename into the Ring Buffer. The User Space app sits in a loop pulling filenames out and printing them.

D. LPM Trie (BPF_MAP_TYPE_LPM_TRIE)

  • Structure: Longest Prefix Match.
  • Use Case: Networking/Firewalls.
  • Example: Matching IP subnets (like 192.168.1.0/24).
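
As a hedged sketch, here is how two of these flavors are typically declared in libbpf-style C (the map names and sizes are illustrative):

// A. Hash map: bytes written per PID (keys not known in advance)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);    // PID
    __type(value, __u64);  // bytes written
} bytes_by_pid SEC(".maps");

// C. Ring buffer: event queue to user space
// (max_entries is the buffer size in bytes; a power-of-2 multiple of the page size)
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");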

A Concrete Example

Imagine you want to count how often the function sys_open is called.

  1. Create Map: You declare an Array Map with 1 slot.
  2. BPF Program:
    • Triggers when sys_open starts.
    • Reads the value at Index 0.
    • Adds +1.
    • Updates the value at Index 0.
  3. User App:
    • Every 1 second, it reads Index 0 from the Map.
    • Prints: "Files opened so far: [Value]".
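
A minimal kernel-side sketch of that counter, assuming libbpf conventions (modern kernels expose openat rather than open, so the tracepoint name here is an assumption):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);  // a single slot at Index 0
    __type(key, __u32);
    __type(value, __u64);
} open_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int count_open(void *ctx) {
    __u32 key = 0;
    __u64 *value = bpf_map_lookup_elem(&open_count, &key);
    if (value)
        __sync_fetch_and_add(value, 1);  // steps 2-4: read, add, write back (atomically)
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

The user app then reads Index 0 once per second (for example via libbpf's user-space bpf_map_lookup_elem() wrapper) and prints the running total.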

Why not just send a "Message"?

You might wonder why we don't just send a message like a standard API.

  • Efficiency: Maps allow the kernel to keep working without waiting for the user-space app to "acknowledge" the data.
  • Persistence: If your user-space app crashes and restarts, the data in the Map stays in the kernel. When the app comes back online, it just grabs the File Descriptor again and continues where it left off.

What are Pinned Maps?

A Pinned Map is a BPF map that has been given a permanent location in the file system (specifically, the BPF virtual file system) so that it stays alive even after the program that created it exits.

The Problem: The "Disappearing" Map

By default, BPF Maps are tied to the process (the tool) that creates them.

  1. You run your BPF tool (./my_monitor).
  2. The tool tells the kernel to create a Map.
  3. The kernel gives the tool a File Descriptor (FD) (a handle to hold onto that map).
  4. The Issue: If you close the tool, or if it crashes, the Operating System sees that the File Descriptor is closed. Since no one is holding onto the map anymore, the kernel deletes the map and all the data inside it is lost.

The Solution: Pinning

Pinning is the act of "exporting" that map to a specific path on the disk, usually under /sys/fs/bpf/.

By doing this, you are effectively telling the kernel: "Don't delete this map when my program closes. Keep it alive because it is 'pinned' to this filename."

How it Works

Instead of the map existing only in the nebulous "memory space" of your application, it becomes a visible file entry.

  1. Creation: Your app creates a map (e.g., my_packet_counter).
  2. Pinning: Your app calls the bpf_obj_pin function to link that map to /sys/fs/bpf/my_packet_counter.
  3. Exit: Your app closes. The map stays in memory.
  4. Retrieval: Later, you restart your app (or run a completely different app). It calls bpf_obj_get on that path. The kernel recognizes the path and gives your new app access to the existing data.
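
The pinning and retrieval steps above map to two libbpf calls. A hedged user-space sketch, using the example path from this section:

#include <bpf/bpf.h>

// Step 2 (Pinning): link an existing map FD to a path in the BPF filesystem
int pin_counter(int map_fd) {
    return bpf_obj_pin(map_fd, "/sys/fs/bpf/my_packet_counter");
}

// Step 4 (Retrieval): a later run, or a completely different tool,
// recovers an FD for the same map
int reopen_counter(void) {
    return bpf_obj_get("/sys/fs/bpf/my_packet_counter");
}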

Key Use Cases

A. Persisting Data Across Restarts

Imagine a firewall tool that counts blocked packets.

  • Without Pinning: If you need to restart the userspace tool to update a setting, you lose all your historical counts.
  • With Pinning: You restart the tool, it reconnects to the pinned map, and your counters resume from where they left off (e.g., 500,001 instead of 0).

B. Sharing Data Between Different Programs

Pinned maps allow totally different applications to share data.

  • App A (Written in C): Updates a map with network traffic stats. Pins it to /sys/fs/bpf/traffic_stats.
  • App B (Written in Python): Opens /sys/fs/bpf/traffic_stats to read that data and display a graph on a website.
  • App C (iproute2/tc): Reads the same map to make routing decisions.

Important Nuances

  • It is not a "Real" File: Even though it looks like a file in /sys/fs/bpf, you cannot open it with a text editor or cat it. It is a "pseudo-file" that acts as a handle for the BPF object.
  • Reboots: Pinned maps survive application crashes and restarts, but they do not survive a system reboot, because the data lives only in kernel RAM.
  • Unpinning: To delete a pinned map, you simply delete the file (e.g., rm /sys/fs/bpf/my_packet_counter). Once the file is gone and no programs are using it, the kernel frees the memory.

Which eBPF library to use?

Authoring an eBPF program involves two distinct parts: the Kernel-side code (the logic that runs inside the Linux kernel) and the User-space code (the loader and controller that manages the kernel program).

The Languages Used

  • Kernel-side: C (restricted). The kernel is written in C, and the eBPF code is compiled into eBPF bytecode. It uses a restricted subset of C (no unbounded loops, no arbitrary memory access) so the Verifier can prove it’s safe.
  • Kernel-side: Rust. Recently possible via the Aya library; Rust is compiled directly to eBPF bytecode.
  • User-space: Go, Rust, Python, C, or C++. User-space can be anything that can talk to the bpf() system call. It handles loading the bytecode into the kernel and reading data from "Maps" (shared memory).

Option 1: libbpf (The Modern Standard)

This is the official library maintained by the Linux kernel community. It introduced CO-RE (Compile Once – Run Everywhere).

  • Language: C (Kernel) + C (User-space).
  • Pros:
    • CO-RE: The program adapts to different kernel versions automatically without needing to be recompiled on the target machine.
    • Performance: Lowest overhead; no heavy dependencies.
    • Official: It is the "source of truth" for eBPF features.
  • Cons:
    • Complexity: Requires a deep understanding of C and memory management.
    • Boilerplate: Lots of "setup" code required to load and attach programs.

Option 2: cilium/ebpf (The Go Way)

Maintained by the Cilium project, this is a pure Go implementation. It is the most popular choice for Cloud-Native and Kubernetes networking.

  • Language: C (Kernel) + Go (User-space).
  • Pros:
    • No Cgo: It doesn't require a C compiler on the host to run; it talks to the kernel directly via Go.
    • Ecosystem: Fits perfectly into the Kubernetes/DevOps toolchain.
    • Stability: Highly battle-tested in massive production environments (Cilium).
  • Cons:
    • Two-Language Context: You still have to write the kernel logic in C, then switch to Go for the controller.

Option 3: Aya (The Pure Rust Way)

Aya is a newer library that aims to provide a completely "C-free" eBPF experience.

  • Language: Rust (Kernel) + Rust (User-space).
  • Pros:
    • Unified Language: You write Rust for both sides.
    • Memory Safety: Rust’s compiler helps prevent common bugs before the eBPF verifier even sees the code.
    • No LLVM dependency: It doesn't require clang on the target system.
  • Cons:
    • Maturity: Younger than libbpf or cilium/ebpf.
    • Niche: If you don't know Rust, the learning curve is very steep.

Option 4: BCC (BPF Compiler Collection)

This was the "original" popular way to write eBPF. It is now mostly used for rapid prototyping and CLI tools.

  • Language: C (Kernel) + Python/Lua (User-space).
  • Pros:
    • Ease of Use: You can write a Python script that embeds C code as a string. It handles all the compilation for you.
    • Great for Scripts: Perfect for one-off debugging tools.
  • Cons:
    • Extremely Heavy: It requires the LLVM/Clang compiler to be installed on every single server where the script runs.
    • Slow Startup: It compiles the C code every time you run the script.
    • No CO-RE: It is fragile across different kernel versions.

Option 5: bpftrace (The One-Liners)

A high-level tracing language inspired by awk and DTrace.

  • Language: DSL (Domain Specific Language).
  • Pros:
    • Instant: You can write a powerful tracer in a single line of terminal command.
    • Safety: The language is designed so that it cannot crash the kernel.
  • Cons:
    • Limited: You cannot build complex networking logic or XDP load balancers with it. It is strictly for Observability.

Option 6: libbpf-rs (The Hybrid Rust Way)

libbpf-rs is a safe, idiomatic Rust wrapper around the official libbpf C library. It is the middle ground between the "official" C implementation and the "pure Rust" Aya approach.

  • Language: C (Kernel) + Rust (User-space).
  • Pros:
    • Best of Both Worlds: You get the performance and safety of Rust in user-space, while using the industry-standard, battle-tested libbpf for the kernel interaction.
    • Full CO-RE Support: Since it wraps libbpf, it has first-class support for "Compile Once – Run Everywhere."
    • Strong Tooling: The libbpf-cargo plugin allows you to automatically generate Rust "skeletons" (type-safe bindings) from your C eBPF code.
    • Stability: Because it relies on the same C library used by the Linux kernel developers, it is often more feature-complete than pure-Go or pure-Rust alternatives.
  • Cons:
    • Two-Language Context: You must still write your kernel-side logic in C. You cannot use Rust for the kernel part (unlike Aya).
    • Build Dependency: Requires a C compiler (clang) and the libbpf library headers to be available during the build process.
    • Complexity: Managing the boundary between C bytecode and Rust code can be slightly more complex than staying in a single language ecosystem.

Summary Recommendation

If you are building...

  • High-perf networking / K8s tools → cilium/ebpf (Go)
  • Official kernel/embedded tools → libbpf (C)
  • Reliable, type-safe system daemons → libbpf-rs (Rust + C)
  • Modern, "pure Rust" security tools → Aya (Rust)
  • Quick debugging / ad-hoc tracing → bpftrace
  • Legacy prototyping → BCC (Python)

Note on the Verifier: Regardless of which tool you use, the Linux Kernel Verifier is the final judge. Even if your Rust or Go code is perfect, if the eBPF bytecode performs an "unsafe" action (like a loop that could run forever), the kernel will refuse to load it.

Is the compiled kernel code calling bpf() or passed as a parameter?

The compiled kernel code is sent as a parameter.

  • The kernel code (C or Rust) you write for eBPF never actually calls the bpf() system call.
  • The user-space code (for example, a Go application built on cilium/ebpf) executes the bpf() system call.
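
As a hedged sketch of the raw mechanics these libraries wrap: the compiled instructions travel as data inside the bpf_attr union handed to the bpf() syscall (attribute setup heavily simplified):

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int load_prog(const struct bpf_insn *insns, unsigned int insn_cnt) {
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.prog_type = BPF_PROG_TYPE_KPROBE;
    attr.insns     = (unsigned long)insns;  // the compiled bytecode, passed as a parameter
    attr.insn_cnt  = insn_cnt;
    attr.license   = (unsigned long)"GPL";

    // User space makes the call; the kernel verifies and loads the bytecode
    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}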

Where is BPF used?

  • BPF-LSM: BPF-based Linux Security Module, which lets you attach eBPF programs to Linux Security Module hooks to enforce security policy.
  • seccomp-bpf: This is a kernel feature that allows a process to restrict its own system calls. With the addition of eBPF, you can create very expressive and dynamic filters to define which syscalls an application is allowed to use and with what arguments. This is a key tool for container runtimes to provide a strong isolation boundary.
  • eBPF-based observability tools: A huge part of security is being able to see what's happening on your system. Projects like Falco use eBPF to monitor system calls and other kernel events to detect anomalous behavior. These tools provide deep, in-kernel visibility with minimal performance overhead, which is a significant improvement over traditional auditing systems like auditd.
  • Network security with eBPF: Tools like Cilium use eBPF to implement networking and security policies for containerized workloads. By operating directly in the kernel's networking data path, they can perform highly efficient packet filtering and enforce network policies with a deep understanding of application context.

What is BTF?

In the world of BPF, BTF (BPF Type Format) is the metadata that describes the "DNA" of the Linux kernel's data structures.

If a BPF program is the "code," BTF is the "dictionary" that explains exactly what the kernel's memory looks like.

The Problem: Why did we need BTF?

Before BTF, BPF programs had a major flaw: They were fragile.

The Linux kernel is always changing. A core structure like task_struct (which represents a process) might have a field named prio (priority) at offset 16 in Kernel 5.4, but in Kernel 5.10, that field might have moved to offset 24.

  • The Old Way: You had to compile your BPF program on the specific machine where it was going to run, using that specific kernel's header files. This made distributing BPF tools (like Falco or Cilium) a nightmare.
  • The Result: If you tried to run a BPF program compiled for one kernel on another, it would read the wrong memory and either crash or report "garbage" data.

The Solution: BTF and CO-RE

BTF provides a compact way to describe every struct, union, enum, and function in the kernel. This metadata is now built into the Linux kernel itself (usually found at /sys/kernel/btf/vmlinux).

This enabled a concept called CO-RE (Compile Once – Run Everywhere).

How it works with BTF:

  1. Compile: You compile your BPF program once. Instead of hardcoding "offset 16," the compiler records a "Relocation Request" saying: "I need the field named 'prio' from 'task_struct'."
  2. Load: When you load the program, the Loader (libbpf) looks at the host's BTF metadata.
  3. Adjust: The loader sees that on this specific kernel, prio is at offset 24. It "rewrites" your BPF code on the fly to use the correct offset before handing it to the kernel.
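
A hedged sketch of what such a field read looks like in code, reusing the prio example (assumes a vmlinux.h generated from the host's BTF, e.g. with bpftool; the attach point is illustrative):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/finish_task_switch")
int read_prio(void *ctx) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    // Records a relocation ("field prio of task_struct") instead of a hardcoded
    // offset; libbpf resolves the real offset against the host's BTF at load time
    int prio = BPF_CORE_READ(task, prio);

    bpf_printk("prio=%d\n", prio);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";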

Key Benefits of BTF

A. Extreme Compression

BTF is a stripped-down version of DWARF (the standard debugging format). While DWARF can be hundreds of megabytes, BTF is usually only 1–2 MB. This is small enough to be stored directly in the kernel's memory at all times.

B. Introspection (Self-Awareness)

Because the kernel understands BTF, you can ask the kernel to "pretty-print" its own data.

  • Example: If you use bpftool, you can dump a kernel structure in a human-readable format because the kernel uses BTF to know which bytes represent which variables.

C. Type Safety

When you load a BPF program, the kernel's Verifier uses BTF to ensure you aren't doing anything illegal. If your code tries to treat a "Pointer to a File" as an "Integer," the Verifier will see the BTF type mismatch and reject the program.

Visualizing BTF

If you have a modern Linux system, you can actually see the BTF data.

Run this command to see the definition of the task_struct directly from your running kernel:

bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 20 "struct task_struct {"

This output isn't coming from a header file on your disk; it is being generated live from the BTF metadata inside your kernel RAM.

XDP (Express Data Path)

To understand XDP (Express Data Path), think of it as a "Fast Track" or a "VIP Lane" for network packets entering a Linux server.

Here is the breakdown of what it is, how it works, and its relationship with eBPF.

What is XDP?

XDP is a high-performance data path in the Linux kernel that allows for extremely fast packet processing.

In a "normal" Linux network stack, when a packet hits the Network Interface Card (NIC), the kernel creates a complex data structure called an sk_buff (socket buffer). This involves memory allocation, metadata parsing, and several layers of overhead before your code ever sees the packet.

XDP changes this by running code the very instant the packet arrives at the driver level, before the kernel even touches it or creates an sk_buff.

Is it tied to eBPF or a general kernel feature?

XDP is strictly tied to eBPF.

XDP is not a standalone "feature" like a firewall setting; it is a hook point within the network driver that was specifically designed to execute eBPF programs.

  • You cannot use XDP without writing an eBPF program.
  • eBPF provides the "brain" (the logic), and XDP provides the "location" (the earliest possible point in the software stack).

How does XDP work? (The Flow)

When a packet arrives at a NIC supported by XDP, the following happens:

  1. Packet Arrival: The NIC receives the raw bits.
  2. eBPF Execution: The driver immediately executes an eBPF program loaded into the XDP hook.
  3. The Decision: The eBPF program inspects the packet and must return one of five actions (a minimal program is sketched after this list):
    • XDP_DROP: Trash the packet immediately (Great for DDoS protection).
    • XDP_PASS: Send the packet up to the normal Linux network stack.
    • XDP_TX: Send the packet back out of the same network interface it came in (Great for load balancing).
    • XDP_REDIRECT: Send the packet to a different CPU or a different network interface.
    • XDP_ABORTED: Error state (drops the packet).
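
As a minimal, hedged sketch, here is an XDP program that returns two of these actions, dropping IPv4 ICMP (pings) and passing everything else:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_icmp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Bounds checks are mandatory: the Verifier rejects any unchecked access
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;  // only inspect IPv4

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;  // trash it before an sk_buff is ever created

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";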

Why is XDP so fast?

The speed comes from Efficiency through Omission:

  • Zero Memory Allocation: It processes the packet "in place" without creating expensive sk_buff structures.
  • No Context Switching: The code runs inside the kernel context, so there is no jumping between "User Space" and "Kernel Space."
  • Early Exit: If you want to drop a malicious packet, you do it before the CPU spends any time "thinking" about it.

Comparison:

  • Standard Linux Stack: Can handle ~1–2 million packets per second (Mpps) per core.
  • XDP: Can handle 20+ Mpps per core.

The Three Modes of XDP

Depending on your hardware and drivers, XDP can run in three different ways:

  1. Native XDP (Best Balance): The eBPF program is executed directly by the network driver. This is very fast and supported by most modern 10G/40G/100G NICs (Intel, Mellanox, etc.).
  2. Offloaded XDP (Fastest): The eBPF program is loaded onto the NIC hardware itself (the NPU). The packet never even reaches the host CPU.
  3. Generic XDP (Compatibility): If your driver doesn't support XDP, the kernel provides a "fake" hook higher up the stack. It’s slower than Native but allows you to test XDP code on any hardware.

Real-World Use Cases

  • DDoS Protection: Cloudflare and Facebook use XDP to drop "bad" traffic at the edge. Because it's so efficient, one server can drop massive amounts of garbage traffic without crashing.
  • Load Balancing: The Katran load balancer (used by Meta/Facebook) uses XDP to redirect traffic to backend servers with almost zero latency.
  • Monitoring: You can use XDP to sample traffic or calculate statistics without slowing down the actual flow of data.

Summary

XDP is a programmable entry point in the Linux network stack. It is the "Where" (the driver level), while eBPF is the "How" (the code that runs there). Together, they allow Linux to perform networking tasks that were previously only possible with specialized hardware or complex bypass technologies like DPDK.

What is a "Collection"?

The term "collection" in the context of the cilium/ebpf library refers to a set of eBPF programs and maps that are loaded from a single eBPF object file.

A Collection = A single .o file = Programs + Maps

How to identify the container in eBPF?

Get cgroup id (an inode number)

In modern Kubernetes and Linux setups (Cgroup v2), every container lives in its own dedicated cgroup directory. The kernel assigns a unique 64-bit integer ID to every cgroup.

  • eBPF Helper: bpf_get_current_cgroup_id()
  • How it works: When a syscall happens, you call this helper. It returns a unique ID for the cgroup the process belongs to.
  • Pros: Very fast; built-in helper.
  • Cons: Returns a number (e.g., 402), not a string like container_abc123. You need to map that number back to a container name in user-space.

Then find the cgroup folder based on the inode:

find /sys/fs/cgroup -inum <CGROUP_ID>

The Namespace Way: Mount Namespace ID

Containers use separate Namespaces to isolate filesystems. Every mount namespace has a unique Inode number.

  • How it works: You can access the task_struct (the kernel's process representation) to find the namespace ID.
  • The Logic:
    1. Get the current task: bpf_get_current_task()
    2. Navigate the pointers: task->nsproxy->mnt_ns->ns.inum
  • Pros: Highly reliable for distinguishing between the host and any container.
  • Cons: Slightly more complex code (requires navigating kernel structures).
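
A hedged sketch of that pointer walk using CO-RE relocations (assumes a vmlinux.h; the attach point is illustrative):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/sys_clone")
int mnt_ns_of_current(void *ctx) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    // task->nsproxy->mnt_ns->ns.inum, with every dereference performed safely
    unsigned int mnt_ns_inum = BPF_CORE_READ(task, nsproxy, mnt_ns, ns.inum);

    bpf_printk("mount ns inode: %u\n", mnt_ns_inum);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";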

The "Human Readable" Way: Parsing the Cgroup Path

If you want the actual Container ID string (the 64-character hash like a1b2c3...), you have to look at the Cgroup Path. The path to a container’s cgroup almost always includes the Container ID.

  • The Logic:
    1. Get the task_struct.
    2. Follow the pointers to the css_set and then the cgroup.
    3. Read the kn->name (the directory name).
  • Example Path: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podUID.slice/cri-container-a1b2c3....scope
  • Pros: You get the actual Container ID string directly in eBPF.
  • Cons: Reading strings in eBPF is "expensive" (performance-wise) and requires more code.
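
A hedged sketch of steps 1-3 (assumes a vmlinux.h and the cgroup v2 default hierarchy; the buffer size is arbitrary):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/sys_clone")
int cgroup_dir_of_current(void *ctx) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    // task->cgroups->dfl_cgrp->kn->name is the cgroup directory name
    const char *name = BPF_CORE_READ(task, cgroups, dfl_cgrp, kn, name);

    char buf[64];
    bpf_probe_read_kernel_str(buf, sizeof(buf), name);

    bpf_printk("cgroup dir: %s\n", buf);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";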

The Industry Standard: The "Map" Pattern

Since eBPF programs need to be extremely fast, most professional tools (like Cilium, Falco, or Tetragon) don't try to resolve "Container Names" inside the kernel. Instead, they use a BPF Map.

The Workflow:

  1. In User-space (the Go/C++ agent): The agent watches the Container Runtime (containerd). When a new container starts, it finds its cgroup ID and its Container Name.
  2. The Update: The agent writes this mapping into a BPF Hash Map: [Cgroup ID] -> [Container Metadata].
  3. In eBPF (the Kernel): When an event occurs, the eBPF program calls bpf_get_current_cgroup_id() and looks it up in the Map.

Code Example (C Snippet for eBPF)

If you are using libbpf, the kernel-side code to get the cgroup ID looks like this:

SEC("kprobe/sys_clone")
int kprobe_clone(void *ctx) {
    // 1. Get the unique Cgroup ID
    uint64_t cgroup_id = bpf_get_current_cgroup_id();

    // 2. Get the Process ID
    uint32_t pid = bpf_get_current_pid_tgid() >> 32;

    bpf_printk("Process %d started in Cgroup %llu\n", pid, cgroup_id);
    return 0;
}
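
And a hedged sketch of the "Map Pattern" described above, layered on the same snippet (the map name and metadata struct are illustrative; a user-space agent is assumed to populate the map):

struct container_info {
    char name[64];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u64);                    // cgroup ID
    __type(value, struct container_info);  // written by the user-space agent
} container_meta SEC(".maps");

SEC("kprobe/sys_clone")
int kprobe_clone_tagged(void *ctx) {
    __u64 cgroup_id = bpf_get_current_cgroup_id();

    // Fast in-kernel lookup: resolve the cgroup ID to container metadata
    struct container_info *info = bpf_map_lookup_elem(&container_meta, &cgroup_id);
    if (!info)
        return 0;  // not a container we are tracking

    bpf_printk("Event in container %s\n", info->name);
    return 0;
}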

Is seccomp-bpf eBPF?

No, seccomp-bpf is not eBPF. While they both use the Berkeley Packet Filter (BPF) infrastructure, seccomp-bpf relies on the older, more restricted classic BPF (cBPF).