eBPF maps

This document describes what eBPF maps are, how you create them (Creating a map), and how to interact with them (Interacting with maps). The different map types available are described here: Types of eBPF maps.

Using eBPF maps is a method to keep state between invocations of the eBPF program, and allows sharing data between eBPF kernel programs, and also between kernel and user-space applications.

Basically a key/value store with arbitrary structure (from man-page bpf(2)):

eBPF maps are a generic data structure for storage of different data types. Data types are generally treated as binary blobs, so a user just specifies the size of the key and the size of the value at map-creation time. In other words, a key/value for a given map can have an arbitrary structure.

The map handles are file descriptors, and multiple maps can be created and accessed by multiple programs (from man-page bpf(2)):

A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. Different eBPF programs can access the same maps in parallel. It’s up to the user process and eBPF program to decide what they store inside maps.

Creating a map

A map is created based on a request from userspace, via the bpf syscall (specifically bpf_cmd BPF_MAP_CREATE), which returns a new file descriptor that refers to the map. On error, -1 is returned and errno is set to EINVAL, EPERM, or ENOMEM. These are the struct bpf_attr setup arguments to use when creating a map via the syscall:

bpf(BPF_MAP_CREATE, &bpf_attr, sizeof(bpf_attr));

Notice how this kernel ABI is extensible, as more struct arguments can easily be added later as the sizeof(bpf_attr) is passed along to the syscall. This also implies that API users must clear/zero sizeof(bpf_attr), as compiler can size-align the struct differently, to avoid garbage data to be interpreted as parameters by future kernels.

The following configuration attributes are needed when creating the map:

union bpf_attr {
 struct { /* anonymous struct used by BPF_MAP_CREATE command */
        __u32   map_type;       /* one of enum bpf_map_type */
        __u32   key_size;       /* size of key in bytes */
        __u32   value_size;     /* size of value in bytes */
        __u32   max_entries;    /* max number of entries in a map */
        __u32   map_flags;      /* prealloc or not */
 };
}

Kernel sample/bpf ELF convention

For programs under samples/bpf/, defining a map have been integrated with ELF binary generated by LLVM. This is purely one example of a userspace convention and not part of the kernel ABI. It still invokes the bpf syscall.

Map definitions are done by defining a struct bpf_map_def with an elf section __attribute__ SEC("maps"), in the xxx_kern.c file. The maps file descriptor is available in the userspace xxx_user.c file, via global array variable map_fd[], and the array map index corresponds to the order the maps sections were defined in elf file of xxx_kern.c file. Behind the scenes it is the load_bpf_file() call (from samples/bpf/bpf_load) that takes care of parsing ELF file compiled by LLVM, pickup ‘maps’ section and creates maps via the bpf syscall.

struct bpf_map_def {
      unsigned int type;
      unsigned int key_size;
      unsigned int value_size;
      unsigned int max_entries;
      unsigned int map_flags;
};

struct bpf_map_def SEC("maps") my_map = {
      .type        = BPF_MAP_TYPE_XXX,
      .key_size    = sizeof(u32),
      .value_size  = sizeof(u64),
      .max_entries = 42,
      .map_flags   = 0
};

Qdisc Traffic Control convention

It is worth mentioning, that qdisc TC (Traffic Control), also use ELF files for defining the maps, but it uses another layout. See man-page tc-bpf(8) and tc bpf examples in iproute2.git tree.

Interacting with maps

Interacting with eBPF maps happens through some lookup/update/delete primitives.

When writing eBFP programs using load helpers and libraries from samples/bpf/ and tools/lib/bpf/. Common function name API have been created that hides the details of how kernel vs. userspace access these primitives (which is quite different).

The common function names (parameters and return values differs):

void bpf_map_lookup_elem(map, void *key. ...);
void bpf_map_update_elem(map, void *key, ..., __u64 flags);
void bpf_map_delete_elem(map, void *key);

The flags argument in bpf_map_update_elem() allows to define semantics on whether the element exists:

/* File: include/uapi/linux/bpf.h */
/* flags for BPF_MAP_UPDATE_ELEM command */
#define BPF_ANY       0 /* create new element or update existing */
#define BPF_NOEXIST   1 /* create new element only if it didn't exist */
#define BPF_EXIST     2 /* only update existing element */

Userspace

The userspace API map helpers are defined in tools/lib/bpf/bpf.h and looks like this:

/* Userspace helpers */
int bpf_map_lookup_elem(int fd, void *key, void *value);
int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags);
int bpf_map_delete_elem(int fd, void *key);
/* Only userspace: */
int bpf_map_get_next_key(int fd, void *key, void *next_key);

Interacting with an eBPF map from userspace, happens through the bpf syscall and a file descriptor. See how the map handle int fd is a file descriptor . On success, zero is returned, on failures -1 is returned and errno is set.

Wrappers for the bpf syscall is implemented in tools/lib/bpf/bpf.c, and ends up calling functions in kernel/bpf/syscall.c, like map_lookup_elem.

/* Corresponding syscall bpf commands from userspace */
enum bpf_cmd {
      [...]
      BPF_MAP_LOOKUP_ELEM,
      BPF_MAP_UPDATE_ELEM,
      BPF_MAP_DELETE_ELEM,
      BPF_MAP_GET_NEXT_KEY,
      [...]
};

Notice how void *key and void *value are passed as a void pointers. Given the memory seperation between kernel and userspace, this is a copy of the value. Kernel primitives like copy_from_user() and copy_to_user() are used, e.g. see map_lookup_elem, which also kmalloc+kfree memory for a short period.

From userspace, there is no function call to atomically increment or decrement the value ‘in-place’. The bpf_map_update_elem() call will overwrite the existing value, with a copy of the value supplied. Depending on the map type, the overwrite will happen in an atomic way, e.g. using locking mechanisms specific to the map type.

Kernel-side eBPF program

The API mapping for eBPF programs on the kernel-side is fairly hard to follow. It related to samples/bpf/bpf_helpers.h and maps into kernel/bpf/helpers.c via macros.

/* eBPF program helpers */
void *bpf_map_lookup_elem(void *map, void *key);
int bpf_map_update_elem(void *map, void *key, void *value, unsigned long long flags);
int bpf_map_delete_elem(void *map, void *key);

The eBPF-program running kernel-side interacts more directly with the map data structures. For example the call bpf_map_lookup_elem() returns a direct pointer to the ‘value’ memory-element inside the kernel (while userspace gets a copy). This allows the eBPF-program to atomically increment or decrement the value ‘in-place’, by using appropiate compiler primitives like __sync_fetch_and_add(), which is understood by LLVM when generating eBPF instructions.

Todo

  1. describe how verifier validate map access to be safe.
  2. describe int return codes of bpf_map_update_elem + bpf_map_delete_elem.

Export map to filesystem

When Interacting with maps from Userspace a file descriptor is needed. There are two methods for sharing this file descriptor.

  1. By passing it over Unix-domain sockets.
  2. Exporting the map to a special bpf filesystem.

Option 2, exporting or pinning the map through the filesystem is more convenient and easier than option 1. Thus, this document will focus on option 2.

Todo

Describe the API for bpf_obj_pin and bpf_obj_get. Usage examples available in XDP blacklist for bpf_obj_pin() and XDP blacklist cmdline tool show use of bpf_obj_get().

Todo

add link to Daniel’s TC example of using Unix-domain sockets.