Manjusaka

A Brief Discussion on Network Quality Monitoring in the Linux Kernel

This might be the last article of 2021 (by the lunar calendar), or it might be the first article of 2022; that depends entirely on when I finish writing it. This time, let's briefly talk about network quality monitoring in Linux.

Introduction#

This article both is and isn't a filler post; either way, it is still aimed at beginners. It has actually been sitting in my drafts for over a year, and the inspiration originally came from some of my work at Alibaba (which was arguably leading-edge, if somewhat niche, work within China, XD).

With the development of technology, everyone's requirements for service stability keep rising, and the precondition for guaranteeing service quality is adequate monitoring coverage (Alibaba's stability requirement is called "1-5-10": one minute to detect, five minutes to handle, and ten minutes to self-heal. Without sufficient monitoring coverage, such stability requirements are just empty talk). Among all of this, monitoring network quality is of the utmost importance.

Before discussing network quality monitoring, we need to clarify what exactly "network quality" covers:

  1. Abnormal situations on the network link
  2. The processing capacity of the server's network

After clarifying this scope, we can think about which indicators reflect a decline in network quality. (Note: this article mainly looks at monitoring TCP and protocols carried over TCP, and will not go beyond that.)

  1. Packet loss, without question
  2. Blocked send/receive queues
  3. Timeouts

Now let's look at the specific details.

  1. The RTO (retransmission timeout) introduced in RFC 793 [1] and the retransmission timer refined by RFC 6298 [2] measure how long packet delivery takes. A rough summary: the larger these two indicators are, the worse the network quality (see the short sketch after this list).
  2. SACK, defined in RFC 2018 [3]. An imprecise summary: the more SACKs there are, the more packet loss there is.
  3. If our connections are frequently reset (RST), that also indicates problems with network quality.
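
To make the RTO intuition in point 1 concrete, here is a minimal Python sketch of the retransmission-timer update described in RFC 6298 [2] (alpha = 1/8, beta = 1/4, K = 4). The RTT samples are made up purely for illustration, and I use Linux's 200 ms floor (TCP_RTO_MIN) instead of the RFC's 1 s lower bound so the effect is visible.

# Minimal sketch of the RFC 6298 retransmission-timer update.
# The constants come from the RFC; the RTT samples below are invented.
ALPHA, BETA, K = 1 / 8, 1 / 4, 4
G = 0.001        # assumed clock granularity: 1 ms
MIN_RTO = 0.2    # Linux's TCP_RTO_MIN (200 ms); RFC 6298 itself specifies 1 s

def update_rto(srtt, rttvar, r):
    """Feed one RTT measurement r (seconds), return (srtt, rttvar, rto)."""
    if srtt is None:                 # first measurement
        srtt, rttvar = r, r / 2
    else:                            # subsequent measurements
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
        srtt = (1 - ALPHA) * srtt + ALPHA * r
    rto = max(MIN_RTO, srtt + max(G, K * rttvar))
    return srtt, rttvar, rto

srtt = rttvar = None
for sample in (0.030, 0.032, 0.250, 0.031):   # a latency spike in the middle
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
    print(f"sample={sample * 1000:.0f}ms srtt={srtt * 1000:.1f}ms rto={rto * 1000:.0f}ms")

The latency spike in the third sample immediately pushes both SRTT and the RTO up, and the RTO stays elevated for a while afterwards, which is exactly why a growing RTO is a usable proxy for degrading network quality.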

Of course, in actual production processes, we can also use many other indicators to assist in measuring network quality, but since this article mainly introduces the ideas with a prototype focus, I won't elaborate too much.

Now that we know which indicators we want, let's look at how to obtain them.

Kernel Network Quality Monitoring#

Brutal Version#

Obtaining network metrics from the kernel essentially means reading the kernel's running state. Speaking of which, those with some understanding of Linux will probably first think of checking the /proc filesystem [4] to see whether the specific metrics are available there. Yep, that's a good idea, and indeed we can obtain some metrics this way (this is also the principle behind tools like netstat).

In /proc/net/tcp, we can read the metrics the kernel exposes, which currently include the following:

  1. Connection status
  2. Local port and address
  3. Remote port and address
  4. Receive queue length
  5. Send queue length
  6. Slow start threshold
  7. RTO value
  8. Inode ID of the socket to which the connection belongs
  9. UID
  10. Delayed-ACK soft clock

For a complete explanation, refer to proc_net_tcp.txt [5].
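
As a quick illustration of what this data looks like, here is a minimal Python sketch that parses a few of these columns out of /proc/net/tcp; the field layout (hex little-endian IPv4 address plus hex port, hex state, and tx_queue:rx_queue) follows proc_net_tcp.txt [5], and the sketch only covers IPv4 sockets in the current network namespace.

# Minimal sketch: dump local/remote endpoint, state and queue sizes
# for every IPv4 TCP socket visible in the current network namespace.
import socket
import struct

def parse_addr(hex_addr):
    """Turn a string like '0100007F:0050' into ('127.0.0.1', 80)."""
    ip_hex, port_hex = hex_addr.split(":")
    # the kernel prints the IPv4 address as little-endian hex
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

with open("/proc/net/tcp") as f:
    next(f)                                   # skip the header line
    for line in f:
        fields = line.split()
        local = parse_addr(fields[1])
        remote = parse_addr(fields[2])
        state = int(fields[3], 16)            # e.g. 0x01 ESTABLISHED, 0x0A LISTEN
        tx_queue, rx_queue = (int(x, 16) for x in fields[4].split(":"))
        print(local, remote, hex(state), tx_queue, rx_queue)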

This approach might be acceptable for a prototype, but its inherent drawbacks limit large-scale use in production.

  1. The kernel documentation clearly states that the /proc/net/tcp interface [5] is not recommended for use; in other words, there is no guarantee of future compatibility or maintenance.
  2. The metric information directly provided by the kernel is still too limited; indicators such as RTT and SRTT cannot be obtained, nor can specific events such as SACKs.
  3. The metrics exposed this way have problems with both timeliness and accuracy; in other words, this approach is only viable if we do not care much about accuracy.
  4. /proc/net/tcp [5] is bound to a network namespace, which means that in container scenarios we may need to traverse multiple network namespaces and repeatedly use nsenter to collect the corresponding metrics.

So, in this context, /proc/net/tcp [5] is not well suited to larger-scale usage scenarios, and we need something better.

Optimization 1.0#

In the previous section, we covered the drawbacks of reading data directly from the /proc filesystem [4]. One important point was:

The kernel documentation clearly states that the /proc/net/tcp interface [5] is not recommended for use; in other words, there is no guarantee of future compatibility or maintenance.

So what is the recommended approach? The answer is netlink + sock_diag.

To briefly introduce it: netlink [6] is a mechanism introduced in Linux 2.2 for communication between kernel space and user space, later documented in RFC 3549. The official description of netlink [6] is roughly as follows:

Netlink is used to transfer information between the kernel and user-space processes. It consists of a standard sockets-based interface for user space processes and an internal kernel API for kernel modules.
The internal kernel interface is not documented in this manual page. There is also an obsolete netlink interface via netlink character devices; this interface is not documented here and is provided only for backward compatibility.

In short, users can conveniently use netlink [6] to interact with different kernel modules.

In our scenario, we need to use sock_diag [7], which is officially described as:

The sock_diag netlink subsystem provides a mechanism for obtaining information about sockets of various address families from the kernel. This subsystem can be used to obtain information about individual sockets or request a list of sockets.

In simple terms, we can use sock_diag [7] to obtain the connection status and corresponding metrics of different sockets (all the metrics mentioned above, plus more detailed ones such as RTT). By the way, it is worth noting that netlink [6] can be configured to obtain metrics from all network namespaces.

Writing this against netlink [6] in pure C can be quite cumbersome. Fortunately, the community has several well-packaged libraries, such as vishvananda's netlink library [8] for Go. Here's a demo:

package main

import (
    "fmt"
    "syscall"

    "github.com/vishvananda/netlink"
)

func main() {
    // Ask the kernel, via the sock_diag netlink subsystem, for all IPv4 TCP
    // sockets together with their extended TCP_INFO metrics.
    results, err := netlink.SocketDiagTCPInfo(syscall.AF_INET)
    if err != nil {
        return
    }
    for _, item := range results {
        // TCPInfo may be nil if the kernel did not return tcp_info for a socket.
        if item.TCPInfo != nil {
            fmt.Printf("Source:%s, Dest:%s, RTT:%d\n", item.InetDiagMsg.ID.Source.String(), item.InetDiagMsg.ID.Destination.String(), item.TCPInfo.Rtt)
        }
    }
}

The running example looks like this:

[Figure: sample output of the netlink demo]

OK, now we can use the officially recommended Best Practice to obtain more comprehensive and detailed metrics without worrying about the Network namespace issue. However, we still have a more challenging problem regarding real-time performance.

If we choose periodic polling and a network fluctuation happens between two polls, we lose the context of that event. So how do we solve the timeliness problem?

Optimization 2.0#

If we want our code to be triggered directly when specific events such as retransmissions or connection resets occur, readers of my earlier posts might immediately think of using eBPF + kprobes to get real-time data by hooking key kernel functions such as tcp_reset and tcp_retransmit_skb. Sounds good!

However, there are still some minor issues.

  1. The overhead of kprobes can be relatively high on high-frequency paths.
  2. If we only need information like source_address, dest_address, source_port, dest_port, it is indeed a bit wasteful to go through kprobe to obtain the complete skb and then cast it.

So, do we have a better method? Yes!

Linux has a piece of basic infrastructure called Tracepoints [9] for exactly this kind of scenario: a set of predefined special events that we can register callbacks on, which is precisely the listen-and-react requirement we have here. In Linux 4.15 and 4.16, six TCP-related Tracepoints [9] were added:

  1. tcp:tcp_retransmit_skb
  2. tcp:tcp_send_reset
  3. tcp:tcp_receive_reset
  4. tcp:tcp_destroy_sock
  5. tcp:tcp_retransmit_synack
  6. tcp:tcp_probe

The meanings of these Tracepoints [9] can be inferred from their names.

When these Tracepoints [9] are triggered, they pass several parameters to the registered callback function. Here is the list:

tcp:tcp_retransmit_skb
    const void * skbaddr;
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_send_reset
    const void * skbaddr;
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_receive_reset
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_destroy_sock
    const void * skaddr;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_retransmit_synack
    const void * skaddr;
    const void * req;
    __u16 sport;
    __u16 dport;
    __u8 saddr[4];
    __u8 daddr[4];
    __u8 saddr_v6[16];
    __u8 daddr_v6[16];
tcp:tcp_probe
    __u8 saddr[sizeof(struct sockaddr_in6)];
    __u8 daddr[sizeof(struct sockaddr_in6)];
    __u16 sport;
    __u16 dport;
    __u32 mark;
    __u16 length;
    __u32 snd_nxt;
    __u32 snd_una;
    __u32 snd_cwnd;
    __u32 ssthresh;
    __u32 snd_wnd;
    __u32 srtt;
    __u32 rcv_wnd;
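
The listing above can be reproduced on your own kernel, since every tracepoint exposes its argument layout through tracefs. Here is a minimal sketch, assuming tracefs is mounted at /sys/kernel/debug/tracing (on newer distributions it may be /sys/kernel/tracing) and that you run it as root:

# Minimal sketch: print the argument layout of all TCP tracepoints
# straight from tracefs. Run as root; adjust the mount point if needed.
import glob

for fmt_path in sorted(glob.glob("/sys/kernel/debug/tracing/events/tcp/*/format")):
    print("=" * 20, fmt_path)
    with open(fmt_path) as f:
        print(f.read())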

Well, at this point, you might have an idea. Let's write some example code.

from bcc import BPF

bpf_text = """
// Ring buffer used to push events to user space (the size argument is in pages).
BPF_RINGBUF_OUTPUT(tcp_event, 65536);

enum tcp_event_type {
    retrans_event,
    recv_rst_event,
};

struct event_data_t {
    enum tcp_event_type type;
    u16 sport;
    u16 dport;
    u8 saddr[4];
    u8 daddr[4];
    u32 pid;
};

// Fires each time the kernel retransmits a TCP segment.
TRACEPOINT_PROBE(tcp, tcp_retransmit_skb)
{
    struct event_data_t event_data={};
    event_data.type = retrans_event;
    event_data.sport = args->sport;
    event_data.dport = args->dport;
    event_data.pid=bpf_get_current_pid_tgid()>>32;
    bpf_probe_read_kernel(&event_data.saddr,sizeof(event_data.saddr), args->saddr);
    bpf_probe_read_kernel(&event_data.daddr,sizeof(event_data.daddr), args->daddr);
    tcp_event.ringbuf_output(&event_data, sizeof(struct event_data_t), 0);
    return 0;
}

// Fires when a connection receives a RST from the peer.
TRACEPOINT_PROBE(tcp, tcp_receive_reset)
{
    struct event_data_t event_data={};
    event_data.type = recv_rst_event;
    event_data.sport = args->sport;
    event_data.dport = args->dport;
    event_data.pid=bpf_get_current_pid_tgid()>>32;
    bpf_probe_read_kernel(&event_data.saddr,sizeof(event_data.saddr), args->saddr);
    bpf_probe_read_kernel(&event_data.daddr,sizeof(event_data.daddr), args->daddr);
    tcp_event.ringbuf_output(&event_data, sizeof(struct event_data_t), 0);
    return 0;
}

"""

bpf = BPF(text=bpf_text)


# Callback invoked for every event pushed into the ring buffer.
def process_event_data(cpu, data, size):
    event = bpf["tcp_event"].event(data)
    event_type = "retransmit" if event.type == 0 else "recv_rst"
    print(
        "%s %d %d %s %s %d"
        % (
            event_type,
            event.sport,
            event.dport,
            ".".join([str(i) for i in event.saddr]),
            ".".join([str(i) for i in event.daddr]),
            event.pid,
        )
    )


bpf["tcp_event"].open_ring_buffer(process_event_data)


while True:
    bpf.ring_buffer_consume()

Here I used the tcp_receive_reset and tcp_retransmit_skb tracepoints to monitor the programs on my machine. To demonstrate the effect, I first wrote a Go program that accesses Google, and then used sudo iptables -I OUTPUT -p tcp -m string --algo kmp --hex-string "|c02bc02fc02cc030cca9cca8c009c013c00ac014009c009d002f0035c012000a130113021303|" -j REJECT --reject-with tcp-reset to inject a connection reset into that program (the trick is that Go's standard library sends a fixed, recognizable Client Hello when initiating HTTPS connections, so iptables can match that traffic and reset the connection).

The effect is as follows:

[Figure: sample output of the Tracepoint demo]

By this point you have probably realized that we can combine Tracepoints [9] and netlink [6] to meet our real-time needs.

Optimization 3.0#

In fact, up to this point, I have mostly discussed some prototypes and ideas. To meet production needs, there is still a lot of work to be done (this is also part of the work I did before), including but not limited to:

  1. Performance optimization in engineering to avoid impacting services
  2. Compatibility with container platforms like Kubernetes
  3. Integration with data monitoring platforms like Prometheus (see the sketch after this list)
  4. Possibly embedding into the CNI to obtain a more convenient monitoring path, etc.
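
As a tiny illustration of point 3 above, here is a sketch of how the events emitted by the BCC program earlier could be exposed as Prometheus counters via the prometheus_client library; the metric name, labels, and port are purely illustrative assumptions, not a finished design.

# Sketch: expose the retransmit / reset events as Prometheus counters.
# Metric name, labels and port are illustrative only.
from prometheus_client import Counter, start_http_server

TCP_EVENTS = Counter(
    "tcp_quality_events_total",
    "TCP retransmit/reset events observed by the tracepoint probes",
    ["type", "saddr", "daddr", "dport"],
)

def export_event(event_type, saddr, daddr, dport):
    """Call this from the ring buffer callback of the BCC program above."""
    TCP_EVENTS.labels(type=event_type, saddr=saddr, daddr=daddr, dport=str(dport)).inc()

start_http_server(9100)  # scrape endpoint; then run the ring buffer loop as before

In a real deployment you would also have to think hard about label cardinality (per-address labels explode quickly) and about enriching events with container or pod metadata, which is where the Kubernetes and CNI points above come in.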

In fact, the community has also done a lot of interesting work in this area, such as Cilium, which you can also follow if interested. I will also tidy up the code later and open source some of my previous implementation paths at the appropriate time.

Conclusion#

This article is about to conclude. Kernel network monitoring is ultimately a relatively niche field. I hope some of my experiences can help everyone. Well, I wish everyone a Happy New Year! May the Year of the Tiger bring you good luck! (The next article will be a summary of last year's work.)

Reference#

  1. RFC 793: https://datatracker.ietf.org/doc/html/rfc793
  2. RFC 6298: https://datatracker.ietf.org/doc/html/rfc6298
  3. RFC 2018: https://datatracker.ietf.org/doc/html/rfc2018
  4. The /proc Filesystem: https://www.kernel.org/doc/html/latest/filesystems/proc.html
  5. proc_net_tcp.txt: https://www.kernel.org/doc/Documentation/networking/proc_net_tcp.txt
  6. netlink: https://man7.org/linux/man-pages/man7/netlink.7.html
  7. sock_diag: https://man7.org/linux/man-pages/man7/sock_diag.7.html
  8. vishvananda/netlink: https://github.com/vishvananda/netlink
  9. Linux Tracepoint: https://www.kernel.org/doc/html/latest/trace/tracepoints.html