Profiling (Almost) Everything You Can Think of With Bpftrace

Welcome back to the month of profiling! The first two weeks we took a look at using the async-profiler to generate flame graphs and bcc-tools to understand your hardware. This week we’ll be focusing on eBPF, a rapidly evolving technology that lets you extract a ton of useful information out of the Linux Kernel even if a tool wasn’t built specifically for it. Think of it as a tool that lets you create small programs that run safely in the Linux Kernel with almost no overhead for the purpose of observability.

One of the amazing things about this technology is that we can hook into almost anything that happens Linux kernel in a safe way. We can attach to a number of probes, which can be static tracepoints (stable) or arbitrary kernel functions (kprobes), user functions (uprobe), processor events (hardware), and more. This is a bit more of a roll-up-your-sleeves situation than bcc-tools as it sometimes requires you to read a little bit of kernel source, but the reward is the ability to answer pretty much any question you might have about the kernel’s behavior. In this post we’ll look at:

  • Installation
  • Basics of a bpftrace script
  • Finding things we can hook into
  • Determine what we have access to
  • Write a custom bpftrace to look at file access patterns.

Installation

You’ll need a somewhat recent Linux distro to install bpftrace, with a kernel version newer than 4.9 recommended. On Ubuntu 22 it can be installed from the bpftrace package:

root@ubuntu-vm:~# apt search bpftrace
Sorting... Done
Full Text Search... Done
bpftrace/jammy,now 0.14.0-1 arm64 [installed]
  high-level tracing language for Linux eBPF

For other distributions see the project README.

Basic of bpftrace Script

bpftrace scripts are fairly basic in nature - they’re not meant to solve general purpose problems. There’s a very simple language specification with built-in and user defined variables, functions. For variables that stick around for the length of the program, we use @globals. For local, throw away ones, $scratch variables. Integers and strings are supported, but not floats. Here’s what those looks like:

@global_name = 10
@thread_local_variable_name[tid] = "hello" // tid is a global thread id, kernel pid
$scratch_name = "test"
@arr["test"] = "value"

We listen for specific events by putting the probe name on a line followed by brackets with the action to be taken. We can define a simple one-liner that prints to the console during a system call:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { printf("Hello!\n"); }`

Having a script that attaches to events is so simple that there’s a whole document of useful one-liners. Here’s one that lets us see how long our reads are taking by program name:

$ bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...

[...]
@ns[snmp-pass]:
[0, 1]                 0 |                                                    |
[2, 4)                 0 |                                                    |
h[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)            27 |@@@@@@@@@                                           |
[512, 1k)            125 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[1k, 2k)              22 |@@@@@@@                                             |
[2k, 4k)               1 |                                                    |
[4k, 8k)              10 |@@@                                                 |
[8k, 16k)              1 |                                                    |
[16k, 32k)             3 |@                                                   |
[32k, 64k)           144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64k, 128k)            7 |@@                                                  |
[128k, 256k)          28 |@@@@@@@@@@                                          |
[256k, 512k)           2 |                                                    |
[512k, 1M)             3 |@                                                   |
[1M, 2M)               1 |                                                    |

The above listens for kprobe.vfs_read, stores the current time in a thread variable with @start[tid] = nsecs, listens for the return call with the same thread id kretprobe:vfs_read /@start[tid] /, then calculates the time spent with nsecs - @start[tid] and builds a power of two histogram with the hist() function, tracked in the @ns[comm] array. This is pretty incredible given how short the command is. Another interesting part here is the histogram function doesn’t overwrite previous variables like you might expect from other languages, it actually defines @ns[comm] as a histogram, and every time kretprobe:vfs_read is called it adds the value to the histogram.

If you find yourself using this often you can create a script. I’ve named it vfs_read.bt and set the executable bit with chmod 755 vfs_read.bt:

#!/usr/bin/env bpftrace

kprobe:vfs_read { 
    @start[tid] = nsecs; 
} 
kretprobe:vfs_read /@start[tid]/ { 
    @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); 
}

We’ve used the built in variable tid in the above program, there are several other built in variables you can access. Here’s some of the more useful ones I’ve found from the reference guide:

Variable Description
pid Process ID (kernel tgid)
tid Thread ID (kernel pid)
uid User ID
gid Group ID
nsecs Nanosecond timestamp. Alias of nsecs()
comm Process name
arg0, arg1, argN. Arguments to the traced function; assumed to be 64 bits wide
retval Return value from traced function
args The struct with all arguments of the traced function. Available in tracepoint, kfunc, and uprobe (with DWARF) probes.
$1, $2, $N, $#. Positional parameters for the bpftrace program

There’s a much more complete reference that’s worth spending some time reading. It contains a ton of useful information as well as Brendan Gregg’s cheat sheet.

Finding things we can hook into

The most stable option is to use one of the established tracepoints. Tracepoints have a static interface so bpfscripts should work across different kernel versions.

root@ubuntu-vm:~# bpftrace -l 'tracepoint:*read'
...
tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_exit_read
...

bpftrace accepts wildcards, and you can pipe the results to grep if you want more flexible searching.

Determine What We Have Access To

For tracepoints, using the -v option when listing tracepoints will show their arguments for use from the args builtin. For example:

root@ubuntu-vm:~# bpftrace -lv 'tracepoint:syscalls:sys_enter_read'
tracepoint:syscalls:sys_enter_read
    int __syscall_nr
    unsigned int fd
    char * buf
    size_t count

The above tracepoint is nice, but lacks the filename we’re reading from, it just has the file descriptor. Sometimes we may find that the tracepoint doesn’t have all the information we need. In that case, we can try the other probe types. The biggest hurdle I found when trying to learn how to write a bpftrace script was figuring out what the arguments are for non-traceponts. This requires a bit more work - we might have to read some kernel source. Getting a little familiar with the kernel headers is a good idea if you’re looking to do serious work with eBPF kernel probes. Let’s suppose we want to get more information than the above tracepoint can offer us.

Let’s look for something related to vfs_read, which I’ve noted in flame graphs previously when doing performance testing:

root@ubuntu-vm:~# bpftrace -lv '*vfs_read*'
kfunc:vfs_read
kfunc:vfs_readlink
kfunc:vfs_readv
kprobe:vfs_read
kprobe:vfs_readlink
kprobe:vfs_readv

Since this is a kprobe, we need to take a look at the definition for vfs_read rather than relying on the -v flag:

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);

The first argument is a file - exactly what I’m after. Looking at part of the Linux filesystem headers, the file structure has a path f_path that we can use, and the return value is the number of bytes ssize_t. The file struct looks like this:

struct file {
    ...
	struct path		f_path;
}

The path struct header has a dentry pointer that is useful:

struct path {
    ...
    struct dentry *dentry;
} 

Following the struct to its definition we can find something called d_name:


struct dentry {
    ...
    struct qstr d_name;
    ...
}

Jumping to qstr we have a *name pointer.

struct qstr {
...
    const unsigned char *name;
};

Now we know we can access the filename through the first argument (arg0) in vfs_read. I also want to see the output every second, so I’ve added the interval section, with s and 1 for one second:

#!/usr/bin/env bpftrace

#include <linux/fs.h>

kprobe:vfs_read { 
    @start[tid] = nsecs; 
    @file[tid] = arg0; // first argument to vfs_read, a file struct
} 
kretprobe:vfs_read /@start[tid]/ 
{ 
    if (retval >= 0) {
        $file = (struct file *)@file[tid]; // cast to the file struct
        $name = $file->f_path.dentry->d_name.name; // follow the pointers to the name
        @ns[comm, str($name)] = hist(nsecs - @start[tid]);  // process and filename histogram
    }
    delete(@start[tid]); 
}
interval:s:1 {
    print(@ns);
    clear(@ns);
}

Saving the above as vfssize.bt and running it shows the below output (truncated a bit):

root@ubuntu-vm:~# ./vfssize.bt
Attaching 3 probes...
...
@ns[sshd, ptmx]:
[1K, 2K)               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |

@ns[fio, test]:
[512, 1K)              1 |                                                    |
[1K, 2K)             634 |@@@@@@@@@@@@                                        |
[2K, 4K)             266 |@@@@@                                               |
[4K, 8K)              46 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             1 |                                                    |
[32K, 64K)            38 |                                                    |
[64K, 128K)          619 |@@@@@@@@@@@@                                        |
[128K, 256K)        2585 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)         347 |@@@@@@                                              |
[512K, 1M)            10 |                                                    |
[1M, 2M)               2 |                                                    |
[2M, 4M)               1 |                                                    |
...

Pretty fun!

Even if you aren’t planning on writing eBPF scripts, having a fundamental understanding of tracepoints and kprobes can be incredibly valuable. Next week I’ll show how to use flame graphs, bcc-tools and eBPF to optimize Apache Cassandra!

If you found this post helpful, please consider sharing to your network. I'm also available to help you be successful with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.