Manjusaka

Use dynamic tracing technology to trace network requests in the kernel.

This week, I helped a friend implement some interesting features using dynamic tracing tools like eBPF/SystemTap. This article serves as a summary.

Introduction#

In fact, some of the ideas from this week originally stemmed from a question a friend asked me one day:

Can we monitor which processes on the machine are sending ICMP requests? We need to obtain the PID, the ICMP packet's source address, destination address, and the process's startup command.

It's an interesting question. My immediate reaction when I first heard it was, "Why not just have the processes on the machine write a log whenever they send an ICMP packet?" Emmmm, let's use a meme to illustrate.

Chicken and Egg

Well, perhaps everyone knows what I'm trying to say. In such scenarios, we can only choose a bypass, non-intrusive approach.

When it comes to bypass tracing of packets, everyone's first thought is definitely to use tcpdump to capture packets. However, for today's problem, tcpdump can only capture the packets themselves; it cannot tell us the PID, the startup command, or any other process-level information.

So we may need to implement our requirements using other methods.

At the very beginning of our requirements, we had several possible approaches:

  1. Access /proc/net/tcp to obtain the specific socket's inode information and then reverse-lookup the associated PID (see the sketch after this list).

  2. Use eBPF + kprobe for monitoring.

  3. Use SystemTap + kprobe for monitoring.
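
For completeness, here is a rough sketch of what the first approach looks like in userspace. It is a minimal illustration under a few assumptions: we only care about TCP sockets, the column layout of /proc/net/tcp matches the usual proc(5) documentation, and the helper names are mine. As the next paragraph notes, this only covers sockets that appear in /proc/net/tcp, which is exactly why it falls short for ICMP.

import os

def tcp_inode_map():
    # /proc/net/tcp lists one socket per line; after splitting on whitespace,
    # column 1 is the hex local address:port, column 2 the remote one,
    # and column 9 is the socket inode.
    inodes = {}
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            cols = line.split()
            inodes[cols[9]] = (cols[1], cols[2])
    return inodes

def pid_of_inode(inode):
    # An open socket shows up as a "socket:[<inode>]" symlink under /proc/<pid>/fd.
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            for fd in os.listdir(fd_dir):
                if os.readlink(f"{fd_dir}/{fd}") == f"socket:[{inode}]":
                    return int(pid)
        except OSError:
            continue  # process exited or we lack permission
    return None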

The first method can only obtain information at the TCP layer, but ICMP is not a TCP protocol (sigh) (although both belong to L4).

So in the end, it seems we only have the option of using eBPF/SystemTap in conjunction with kprobe.

Basic Trace#

Kprobe#

Before we proceed with the actual coding, we first need to understand Kprobe.

Let me quote a section from the official documentation:

Kprobes enables you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit.
There are currently two types of probes: kprobes, and kretprobes (also called return probes). A kprobe can be inserted on virtually any instruction in the kernel. A return probe fires when a specified function returns.
In the typical case, Kprobes-based instrumentation is packaged as a kernel module. The module’s init function installs (“registers”) one or more probes, and the exit function unregisters them. A registration function such as register_kprobe() specifies where the probe is to be inserted and what handler is to be called when the probe is hit.

In simple terms, kprobe is a tracing mechanism provided by the kernel that triggers our callback function according to the rules we set when executing specific kernel functions. In the words of the official documentation, "You can trap at almost any kernel code address."

In our scenario today, whether using eBPF or SystemTap, we need to rely on Kprobe and choose appropriate hook points to complete our kernel call tracing.
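
To make the mechanism concrete, here is roughly the smallest BCC sketch of a kprobe I can think of (assuming BCC is installed): it hooks the clone() syscall handler and prints a line each time the probe fires. Nothing in it is specific to our problem yet; it only demonstrates the "hook a kernel function, run a callback" idea.

from bcc import BPF

# Minimal kprobe sketch: fire a callback whenever the clone() syscall handler
# runs and write a line to the kernel trace pipe.
prog = """
#include <uapi/linux/ptrace.h>

int hello(struct pt_regs *ctx) {
    bpf_trace_printk("kprobe hit\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()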

So, in our scenario today, what function should we add the corresponding hook to?

First, let's think about it. ICMP is a layer 4 packet, ultimately encapsulated in an IP packet for transmission. Let's take a look at the key kernel functions involved in sending IP packets, as shown in the figure below.

Key Kernel Functions in the IP Layer

Here, I choose to use ip_finish_output as our hook point.
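
Before committing to it, it is worth a quick sanity check that the symbol is actually visible on your kernel (on some configurations a function may be inlined or renamed); a simple way, assuming /proc/kallsyms is readable, is something like the following. If I remember correctly, BCC also ships a get_kprobe_functions() helper that serves the same purpose.

# Check that ip_finish_output shows up as a kernel symbol on this machine;
# if it does not, pick another function along the IP send path.
with open("/proc/kallsyms") as f:
    hits = [line.strip() for line in f if "ip_finish_output" in line.split()]
print(hits or "symbol not found, pick another hook point")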

Okay, with the hook point confirmed, before we start coding, let's briefly introduce ip_finish_output.

ip_finish_output#

First, let's take a look at this function.

static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	int ret;

	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
	switch (ret) {
	case NET_XMIT_SUCCESS:
		return __ip_finish_output(net, sk, skb);
	case NET_XMIT_CN:
		return __ip_finish_output(net, sk, skb) ? : ret;
	default:
		kfree_skb(skb);
		return ret;
	}
}

We won't delve into the specific details here (because there are just too many, Orz). When the kernel function ip_finish_output is invoked, it triggers the kprobe hook we set, and our designated hook function receives three parameters: net, sk, and skb (the same values ip_finish_output itself was called with).

Among these three parameters, we mainly focus on struct sk_buff *skb.

Those familiar with the Linux kernel's protocol stack will recognize sk_buff immediately: it is the core networking data structure in the kernel. By applying the right pointer offsets to it, we can easily locate where the data we want to send or receive sits in memory.

It might sound a bit abstract, so let's look at a diagram.

sk_buff

Taking the sending of a TCP packet as an example, we can see that the sk_buff goes through six stages:

a. Allocate a buffer based on some options in TCP, such as MSS.
b. Reserve enough space in the allocated memory buffer to accommodate all network layer headers (TCP/IP/Link, etc.) based on MAX_TCP_HEADER.
c. Fill in the TCP payload.
d. Fill in the TCP header.
e. Fill in the IP header.
f. Fill in the link header.

You can refer to the TCP segment structure for a more intuitive understanding.

TCP Segment Format

You can see that through some pointer operations in sk_buff, we can easily access the headers of different layers and the specific payload.

Okay, now let's officially start implementing the functionality we need.

eBPF + KProbe#

First, let me briefly introduce eBPF. BPF stands for Berkeley Packet Filter, which was originally designed to implement packet filtering in the kernel. The community has since extended it far beyond networking, which is where the "e" in eBPF (extended) comes from.

Essentially, eBPF maintains a small virtual machine inside the kernel that can load and run programs that satisfy its verification rules, making the kernel more programmable (I will try to write an introductory article on eBPF from beginner to advanced later).

Tip: tcpdump is built on BPF; you can see the classic BPF program it compiles for a filter with tcpdump -d icmp.

In this implementation, we use BCC to make writing the eBPF code easier.

Okay, let's look at the code.

from bcc import BPF
import ctypes

bpf_text = """
#include <linux/ptrace.h>
#include <linux/sched.h>        /* For TASK_COMM_LEN */
#include <linux/icmp.h>
#include <linux/ip.h>
#include <linux/netdevice.h>

struct probe_icmp_sample {
    u32 pid;
    u32 daddress;
    u32 saddress;
};

BPF_PERF_OUTPUT(probe_events);

static inline unsigned char *custom_skb_network_header(const struct sk_buff *skb)
{
	return skb->head + skb->network_header;
}

static inline struct iphdr *get_iphdr_in_icmp(const struct sk_buff *skb)
{
    return (struct iphdr *)custom_skb_network_header(skb);
}

int probe_icmp(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb){
    struct iphdr * ipdata=get_iphdr_in_icmp(skb);
    if (ipdata->protocol!=1){
        return 1;
    }
    u64 __pid_tgid = bpf_get_current_pid_tgid();
    u32 __pid = __pid_tgid;
    struct probe_icmp_sample __data = {0};
    __data.pid = __pid;
    u32 daddress;
    u32 saddress;
    bpf_probe_read(&daddress, sizeof(ipdata->daddr), &ipdata->daddr);
    bpf_probe_read(&saddress, sizeof(ipdata->saddr), &ipdata->saddr);
    __data.daddress=daddress;
    __data.saddress=saddress;
    probe_events.perf_submit(ctx, &__data, sizeof(__data));
    return 0;
}

"""


class IcmpSamples(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("daddress", ctypes.c_uint32),
        ("saddress", ctypes.c_uint32),
    ]


bpf = BPF(text=bpf_text)

filters = {}


def parse_ip_address(data):
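    # The kernel gives us saddr/daddr as a __be32 copied into a u32; on a
    # little-endian machine the low byte of that integer is the first octet of
    # the dotted-quad address, so we extract the octets and reverse at the end.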
    results = [0, 0, 0, 0]
    results[3] = data & 0xFF
    results[2] = (data >> 8) & 0xFF
    results[1] = (data >> 16) & 0xFF
    results[0] = (data >> 24) & 0xFF
    return ".".join([str(i) for i in results[::-1]])


def print_icmp_event(cpu, data, size):
    # event = b["probe_icmp_events"].event(data)
    event = ctypes.cast(data, ctypes.POINTER(IcmpSamples)).contents
    daddress = parse_ip_address(event.daddress)
    print(
        f"pid:{event.pid}, daddress:{daddress}, saddress:{parse_ip_address(event.saddress)}"
    )


bpf.attach_kprobe(event="ip_finish_output", fn_name="probe_icmp")

bpf["probe_events"].open_perf_buffer(print_icmp_event)
while 1:
    try:
        bpf.kprobe_poll()
    except KeyboardInterrupt:
        exit()

Okay, this code is technically mixed-language: part C and part Python. The Python part is familiar territory; BCC compiles and loads our C code, attaches it to the kprobe, and then keeps polling and printing the data we submit from the kernel.

Now let's focus on the C part of the code (which strictly speaking is not standard C, but a layer of DSL encapsulated by BCC).

First, let's look at our two helper functions.

static inline unsigned char *custom_skb_network_header(const struct sk_buff *skb)
{
	return skb->head + skb->network_header;
}

static inline struct iphdr *get_iphdr_in_icmp(const struct sk_buff *skb)
{
    return (struct iphdr *)custom_skb_network_header(skb);
}

As mentioned earlier, we can calculate the address of the IP header in memory based on the head and network_header in sk_buff, and then we cast it to a pointer of the iphdr structure.

Next, we need to look at the iphdr.

struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
	__u8	ihl:4,
		version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
	__u8	version:4,
  		ihl:4;
#else
#error	"Please fix <asm/byteorder.h>"
#endif
	__u8	tos;
	__be16	tot_len;
	__be16	id;
	__be16	frag_off;
	__u8	ttl;
	__u8	protocol;
	__sum16	check;
	__be32	saddr;
	__be32	daddr;
	/*The options start here. */
};

Those familiar with the IP packet structure will recognize this: saddr and daddr are the source and destination addresses, while protocol carries the protocol number of the encapsulated payload, where a value of 1 means ICMP.
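
As a userspace analogue of what the probe does with these fields, here is a tiny sketch that pulls protocol, saddr, and daddr out of a raw IPv4 header; the sample bytes are made up for illustration (the checksum is not valid), they just show the offsets involved.

import socket
import struct

# Hypothetical 20-byte IPv4 header (checksum is fake): protocol sits at byte 9,
# saddr at bytes 12-15, daddr at bytes 16-19 -- the same fields read in the probe.
raw_ip_header = bytes.fromhex("45000054abcd40004001f00dc0a80001c0a80002")
protocol = raw_ip_header[9]
saddr, daddr = struct.unpack("!4s4s", raw_ip_header[12:20])
print(protocol == socket.IPPROTO_ICMP)                    # True, 1 means ICMP
print(socket.inet_ntoa(saddr), socket.inet_ntoa(daddr))   # 192.168.0.1 192.168.0.2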

Now let's look at our trace function.

int probe_icmp(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb){
    struct iphdr * ipdata=get_iphdr_in_icmp(skb);
    if (ipdata->protocol!=1){
        return 1;
    }
    u64 __pid_tgid = bpf_get_current_pid_tgid();
    u32 __pid = __pid_tgid;
    struct probe_icmp_sample __data = {0};
    __data.pid = __pid;
    u32 daddress;
    u32 saddress;
    bpf_probe_read(&daddress, sizeof(ipdata->daddr), &ipdata->daddr);
    bpf_probe_read(&saddress, sizeof(ipdata->saddr), &ipdata->saddr);
    __data.daddress=daddress;
    __data.saddress=saddress;
    probe_events.perf_submit(ctx, &__data, sizeof(__data));
    return 0;
}

As mentioned earlier, when the kprobe is triggered, the three parameters of ip_finish_output will be passed to our trace function, allowing us to perform various operations based on the incoming data. Now let's introduce what the above code does:

  1. Convert sk_buff to the corresponding iphdr.
  2. Check if the current packet is of the ICMP protocol.
  3. Use the kernel BPF-provided helper bpf_get_current_pid_tgid to obtain the ID of the process currently executing ip_finish_output (the helper returns tgid << 32 | pid; taking the lower 32 bits as the code does gives the thread ID, which matches the userspace PID for single-threaded programs such as ping).
  4. Retrieve saddr and daddr. Note that we use bpf_probe_read, which is also a BPF-provided helper function. In principle, all data reading from the kernel in eBPF should be done using bpf_probe_read or bpf_probe_read_kernel for safety.
  5. Submit the data through perf.

In this way, we can identify which processes on the machine are sending ICMP requests.
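
A quick note on running it: BCC scripts need root privileges and the kernel headers for the running kernel installed. Save the script as, say, trace_icmp.py, run sudo python3 trace_icmp.py, and ping something from another terminal to generate traffic.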

Let's take a look at the results.

image

Okay, we have basically met our requirements, but there is a small issue left for everyone to think about: how can we obtain the command line of the process based on the PID?
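
As a hint for that exercise: once we have the PID in userspace, procfs already has what we need. A minimal sketch (the helper name is mine):

def cmdline_of_pid(pid):
    # /proc/<pid>/cmdline stores argv separated by NUL bytes; the process may
    # already be gone by the time we read it, so fail gracefully.
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read().replace(b"\0", b" ").decode().strip()
    except OSError:
        return ""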

SystemTap + kprobe#

The eBPF version works, but there is a catch: eBPF is only available on relatively recent kernels. Generally speaking, on x86_64, Linux 3.16 introduced support for eBPF, and kprobe support for eBPF arrived in Linux 4.1. In practice, we usually recommend pairing eBPF with kernel 4.9 or later.

Now the question arises: in reality, there are still plenty of traditional setups out there, such as CentOS 7 with Linux 3.10. What are they supposed to do?

Linux 3.10 lives matter! CentOS 7 lives matter!

In that case, we have no choice but to switch to another technology stack. Here we turn to SystemTap, which Red Hat developed and contributed to the community, and which works on older kernels.

%{
#include<linux/byteorder/generic.h>
#include<linux/if_ether.h>
#include<linux/skbuff.h>
#include<linux/ip.h>
#include<linux/in.h>
#include<linux/tcp.h>
#include <linux/sched.h>
#include <linux/list.h>
#include <linux/pid.h>
#include <linux/mm.h>
%}

function isicmp:long (data:long)
%{
    struct iphdr *ip;
    struct sk_buff *skb;
    int tmp = 0;

    skb = (struct sk_buff *) STAP_ARG_data;

    if (skb->protocol == htons(ETH_P_IP)){
            ip = (struct iphdr *) skb->data;
            tmp = (ip->protocol == 1);
    }
    STAP_RETVALUE = tmp;
%}

function task_execname_by_pid:string (pid:long) %{
    struct task_struct *task;

    task = pid_task(find_vpid(STAP_ARG_pid), PIDTYPE_PID);

//     proc_pid_cmdline(p, STAP_RETVALUE);
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", task->comm);
    
%}

function ipsource:long (data:long)
%{
    struct sk_buff *skb;
    struct iphdr *ip;
    __be32 src;

    skb = (struct sk_buff *) STAP_ARG_data;

    ip = (struct iphdr *) skb->data;
    src = (__be32) ip->saddr;

    STAP_RETVALUE = src;
%}

/* Return ip destination address */
function ipdst:long (data:long)
%{
    struct sk_buff *skb;
    struct iphdr *ip;
    __be32 dst;

    skb = (struct sk_buff *) STAP_ARG_data;

    ip = (struct iphdr *) skb->data;
    dst = (__be32) ip->daddr;

    STAP_RETVALUE = dst;
%}

function parseIp:string (data:long) %{ 
    sprintf(STAP_RETVALUE,"%d.%d.%d.%d",(int)STAP_ARG_data &0xFF,(int)(STAP_ARG_data>>8)&0xFF,(int)(STAP_ARG_data>>16)&0xFF,(int)(STAP_ARG_data>>24)&0xFF);
%}


probe kernel.function("ip_finish_output").call {
    if (isicmp($skb)) {
        pid_data = pid()
        /* IP */
        ipdst = ipdst($skb)
        ipsrc = ipsource($skb)
        printf("pid is:%d,source address is:%s, destination address is %s, command is: '%s'\n",pid_data,parseIp(ipsrc),parseIp(ipdst),task_execname_by_pid(pid_data))
    
    } else {
        next
    }
}

As you can see, our approach is still the same: we use ip_finish_output as the kprobe hook point, then we obtain the corresponding iphdr and perform operations.
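
One practical note if you want to run this yourself: because the script embeds raw C inside %{ %} blocks, SystemTap will only accept it in guru mode, so it needs to be started with something like stap -g trace_icmp.stp (the file name is arbitrary).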

Well, the basic functionality of our requirements is about done. You can further enhance it by obtaining the complete process command line, etc.

Further Thoughts and Experiments#

ICMP may feel a bit remote, since it's a protocol most of us rarely deal with directly. So let's change the requirement to something more relatable:

Monitor which processes on the machine are sending HTTP 1.1 requests.

As usual, let's first look at the key calls in the system.

TCP

Here, we choose tcp_sendmsg as our entry point.

int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
	int ret;

	lock_sock(sk);
	ret = tcp_sendmsg_locked(sk, msg, size);
	release_sock(sk);

	return ret;
}

Here, the sock structure contains some key metadata.

struct sock {
	/*
	 * Now struct inet_timewait_sock also uses sock_common, so please just
	 * don't add nothing before this first member (__sk_common) --acme
	 */
	struct sock_common	__sk_common;
    ...
}

struct sock_common {
	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
	 * address on 64bit arches : cf INET_MATCH()
	 */
	union {
		__addrpair	skc_addrpair;
		struct {
			__be32	skc_daddr;
			__be32	skc_rcv_saddr;
		};
	};
	union  {
		unsigned int	skc_hash;
		__u16		skc_u16hashes[2];
	};
	/* skc_dport && skc_num must be grouped as well */
	union {
		__portpair	skc_portpair;
		struct {
			__be16	skc_dport;
			__u16	skc_num;
		};
	};
    ...
}

You can see that we can obtain the connection's five-tuple (addresses and ports) from sock, and the payload being sent from msghdr.

For our requirement regarding HTTP, we only need to check whether the TCP packet we obtain contains HTTP/1.1 to roughly determine if the request is an HTTP 1.1 request (a rather brute-force approach, Hhhhh).
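
This also explains the magic string you will see in the Python code below: the hex signature is simply the ASCII bytes of HTTP/1.1.

# The signature used later in print_http_payload is just b"HTTP/1.1" in hex.
print(b"HTTP/1.1".hex())                  # 485454502f312e31
print(bytes.fromhex("485454502f312e31"))  # b'HTTP/1.1'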

Now, let's look at the code.

from bcc import BPF
import ctypes
import binascii

bpf_text = """
#include <linux/ptrace.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/socket.h>

struct ipv4_data_t {
    u32 pid;
    u64 ip;
    u32 saddr;
    u32 daddr;
    u16 lport;
    u16 dport;
    u64 state;
    u64 type;
    u8 data[300];
    u16 data_size;
};


BPF_PERF_OUTPUT(ipv4_events);

int trace_event(struct pt_regs *ctx,struct sock *sk, struct msghdr *msg, size_t size){
    if (sk == NULL)
        return 0;
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    

    // pull in details
    u16 family = sk->__sk_common.skc_family;
    u16 lport = sk->__sk_common.skc_num;
    u16 dport = sk->__sk_common.skc_dport;
    char state = sk->__sk_common.skc_state;

    if (family == AF_INET) {
        struct ipv4_data_t data4 = {};
        data4.pid = pid;
        data4.ip = 4;
        //data4.type = type;
        data4.saddr = sk->__sk_common.skc_rcv_saddr;
        data4.daddr = sk->__sk_common.skc_daddr;
        // lport is host order
        data4.lport = lport;
        data4.dport = ntohs(dport);
        data4.state = state;
        struct iov_iter temp_iov_iter=msg->msg_iter;
        struct iovec *temp_iov=temp_iov_iter.iov;
        bpf_probe_read_kernel(&data4.data_size, sizeof(data4.data_size), &temp_iov->iov_len); // iov_len is a size_t; copy just enough to fill data_size
        u8 * temp_ptr;
        bpf_probe_read_kernel(&temp_ptr, sizeof(temp_ptr), &temp_iov->iov_base);
        bpf_probe_read_kernel(&data4.data, sizeof(data4.data), temp_ptr);
        ipv4_events.perf_submit(ctx, &data4, sizeof(data4));
    }
    return 0;
}

"""

bpf = BPF(text=bpf_text)

filters = {}


def parse_ip_address(data):
    results = [0, 0, 0, 0]
    results[3] = data & 0xFF
    results[2] = (data >> 8) & 0xFF
    results[1] = (data >> 16) & 0xFF
    results[0] = (data >> 24) & 0xFF
    return ".".join([str(i) for i in results[::-1]])


def print_http_payload(cpu, data, size):
    # event = b["probe_icmp_events"].event(data)
    # event = ctypes.cast(data, ctypes.POINTER(IcmpSamples)).contents
    event= bpf["ipv4_events"].event(data)
    daddress = parse_ip_address(event.daddr)
    # data=list(event.data)
    # temp=binascii.hexlify(data) 
    body = bytearray(event.data).hex()
    if "48 54 54 50 2f 31 2e 31".replace(" ", "") in body:
        # if "68747470" in temp.decode():
        print(
            f"pid:{event.pid}, daddress:{daddress}, saddress:{parse_ip_address(event.saddr)}, {event.lport}, {event.dport}, {event.data_size}"
        )


bpf.attach_kprobe(event="tcp_sendmsg", fn_name="trace_event")

bpf["ipv4_events"].open_perf_buffer(print_http_payload)
while 1:
    try:
        bpf.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
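
One caveat worth spelling out: the probe copies only the first iovec segment of the message and at most 300 bytes of it, so requests whose "HTTP/1.1" marker lands in a later segment or beyond that window will slip past this filter. For a rough "who is sending HTTP" view, that is usually acceptable.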

Okay, let's take a look at the results.

Results

In fact, we can extend this further. For example, languages like Go leave fairly fixed, recognizable characteristics on their HTTPS connections, so we can use relatively simple methods to trace the source of such packets on the machine (you can refer to this article by Wuzhe, Why Do I Always Get 503 Service Unavailable When Accessing a Certain Website with Go?).

I also conducted a test, and you can refer to the code: https://github.com/Zheaoli/linux-traceing-script/blob/main/ebpf/go-https-tracing.py.

Conclusion#

Whether it is eBPF or SystemTap, dynamic tracing technologies make the Linux kernel more programmable. Compared to traditional approaches like recompiling the kernel, they are far more convenient and faster to iterate with. Frameworks like BCC and bpftrace lower the barrier to observing the kernel even further.

Often, requirements like these can be met more quickly with such bypass methods. That said, introducing dynamic tracing inevitably adds some risk to kernel stability and can cost a bit of performance, so we need to weigh the trade-offs for each specific scenario.

Alright, that's about it for this article. I hope to find time to write a beginner-to-advanced series on eBPF (flag++).
