A brief discussion about Maglev, Google's soft load balancing practice

It's been a while since I last blogged, so here's a simple paper-reading note. This article is based on a paper Google published in 2016, Maglev: A Fast and Reliable Software Network Load Balancer, which describes the software load balancing system Google has run internally at scale since 2008. There are plenty of interesting details in it, and I'll cover as many as I can.

Background#

The concept of load balancing is familiar to everyone, so I won't elaborate on it again. Now consider Google's scenario. From the start, Google needed a load balancer to front some of its biggest services, such as Google Search and Gmail; given the enormous traffic these services receive, the LB had to deliver very high performance.

In this case, the traditional idea might be to directly use specialized hardware load balancing; problems that can be solved with money are not considered issues (laughs). However, this kind of solution has significant problems:

image

  1. The performance of a single hardware load balancer caps the total traffic the whole network can handle.
  2. HA is flawed: to keep the network from collapsing when a single device fails, you typically need 1:1 redundancy.
  3. Lack of flexibility and programmability; there is no hook for doing anything fancy.
  4. It's too expensive. So expensive that even Google can't afford it (runs away).

Under such circumstances, Google began to consider building its own SLB (Software Load Balancer) system. The benefits are obvious, such as convenient scaling, reducing the required redundancy for HA from 1:1 to N+1, and easy customization. The architecture evolved into the diagram below:

image

However, the challenges are also evident. First, each node needs enough performance that the cluster as a whole has sufficient throughput. At the same time, connection tracking is needed so that packets from the same connection are delivered to the same machine, and transparent failover may also be required.

These requirements combine to form what we are going to discuss today: Maglev. Google has been using this LB system on a large scale since 2008.

A Glimpse of Maglev#

Background Knowledge#

Before continuing to discuss Maglev, we need to understand how Google currently uses Maglev. Below is a simplified schematic diagram:

image

Here, we need to introduce a very important concept called VIP (Virtual IP Address). Those who have used Kubernetes are certainly familiar with this concept. A VIP is not a physical IP bound to a network card. It can be seen as an abstraction of a group of backend endpoints. When you access this VIP, you are actually accessing the backend endpoints. For a more understandable example, in Kubernetes, after creating a set of Pods, we usually create a Service to associate with the corresponding Pods to expose the services provided by the Pods. The Service usually has an IP, which is a VIP. When we access the Service's IP, it typically randomly selects one of the backend Pods to handle the request.

Now, back to Maglev. Let's walk through the whole path. Each Maglev is associated with one or more VIPs and announces them to a group of Routers. When a user types https://www.google.com in the browser and hits enter, the browser performs DNS resolution, which is handled by Google's DNS servers. The DNS server selects the VIP of the cluster nearest to the user and returns it, and the browser then establishes a connection to that VIP.

When the Router receives the corresponding packet, it forwards it to one of the nodes in the Maglev cluster serving that VIP; every node announces the VIP with equal weight, so traffic is spread evenly across the cluster. The Maglev node that receives the packet selects a backend, encapsulates the packet with GRE (Generic Routing Encapsulation), and transmits it to the corresponding backend endpoint.
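
To make the forwarding step a bit more concrete, here is a rough sketch of what the GRE encapsulation might look like, written with scapy. This is purely illustrative, not Google's code, and all the addresses are made up:

from scapy.all import IP, TCP, GRE, Raw

VIP = "203.0.113.10"       # the VIP announced by the Maglev cluster (made-up address)
BACKEND = "10.0.0.42"      # physical IP of the selected service endpoint (made up)
MAGLEV = "10.0.0.5"        # this Maglev machine's own address (made up)

# The packet as it arrives from the Router: user -> VIP.
inner = IP(src="198.51.100.7", dst=VIP) / TCP(sport=34567, dport=443) / Raw(b"...")

# Maglev leaves the inner packet untouched and wraps it in GRE, so the backend
# still sees the original user -> VIP packet after decapsulation.
outer = IP(src=MAGLEV, dst=BACKEND) / GRE() / inner

print(outer.summary())     # this outer packet is what actually leaves the NIC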

When the backend endpoint receives the packet, it processes the request. When the response data is ready, it performs encapsulation, using the VIP as the source address, the user's IP as the destination address, and the response data as the packet payload. At this point, the backend endpoint uses DSR (Direct Server Return) to return the packet directly, bypassing Maglev. This avoids putting additional burden on Maglev when the response is large. In fact, DSR has been widely used in L4 LB implementations, such as HAProxy, Envoy, etc. I might write a blog about it someday when I have time.

Maglev Configuration#

As mentioned earlier, Maglev receives VIP requests from Routers and forwards the corresponding traffic to the appropriate backend endpoints. Each Maglev consists of a Controller and a Forwarder, structured as shown below:

image

Both the Controller and the Forwarder learn which VIPs they should serve from a Configuration Object. The Configuration Object is in practice another system (roughly speaking, a registry), and they communicate with it via RPC.

On each Maglev machine, the Controller periodically health-checks the Forwarder. Depending on the result, it either announces or withdraws all of the machine's VIPs via BGP (all or nothing, which keeps the announced state consistent). This ensures that traffic coming from the Router only reaches healthy machines.
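
As a mental model (hypothetical names, not the real implementation), the Controller's job can be pictured roughly like this:

import time

# Simplified mental model of the Controller loop. forwarder, bgp_session and the
# method names are hypothetical; the point is the all-or-nothing announcement.
def controller_loop(forwarder, bgp_session, vips, interval_sec=1.0):
    while True:
        if forwarder.is_healthy():       # periodic health check of the local Forwarder
            bgp_session.announce(vips)   # announce ALL VIPs via BGP
        else:
            bgp_session.withdraw(vips)   # withdraw ALL VIPs so the Router stops sending traffic here
        time.sleep(interval_sec)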

The VIP traffic coming from the Router is handled by the Forwarder. In the Forwarder, each VIP is associated with one or more backend pools. Unless otherwise specified, a backend in Maglev is a service endpoint; a backend pool can contain either the physical IPs of service endpoints or other backend pools. Each backend pool is configured with one or more health checks according to its needs, and packets are only forwarded to healthy backends. Since the same endpoint may appear in multiple pools, the Forwarder deduplicates health checks by address to avoid extra overhead.

The Forwarder's Config Manager is responsible for pulling, parsing, and validating configurations from the Configuration Object. All configuration changes are committed atomically (either everything is applied or nothing is). Between a push and it taking effect, there is a very brief window in which different Maglev machines may hold different configurations; thanks to consistent hashing, though, most requests are still delivered successfully during that window.

Maglev Implementation#

Alright, after discussing so much, let's look at some practical details of the entire Maglev system.

Overview#

As we know (as mentioned earlier), the Forwarder is responsible for the actual traffic forwarding work. Let's illustrate its structure with a diagram:

image

The Forwarder directly receives packets from the NIC (Network Interface Card) and then forwards them to the backend. All operations during this process do not go through the kernel (in fact, going through the kernel incurs additional costs).

Packets retrieved from the NIC are first processed by the Steering Module. During processing, the Steering Module calculates a hash based on the five-tuple (protocol, destination address, destination port, source address, source port). It then forwards the packet to the corresponding Receiving Queue. Each Receiving Queue corresponds to a processing thread. The processing thread filters out packets whose destination VIP does not match the locally registered VIP. It then recalculates the five-tuple hash and looks up the corresponding value in the Connection Tracking Table.

The Connection Tracking Table stores the backend corresponding to the previous five-tuple hash. If the lookup is successful, it directly reuses the backend; if not, it selects a new backend for this packet and adds the key-value pair to the Connection Tracking Table. If no backend is available at this time, the packet will be discarded. After completing the lookup operation, as mentioned earlier, the packet will be rewritten and placed into the transmission queue. Finally, the muxing module will send the packets from the transmission queue directly through the NIC.
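
Putting the last two paragraphs together, the steering and per-thread processing logic looks roughly like the sketch below. The names (the pkt fields, conntrack, lookup_table) are hypothetical; the real Forwarder is written in C++ and works on raw packets in shared memory:

import hashlib

def five_tuple_hash(proto, dst_ip, dst_port, src_ip, src_port) -> int:
    key = f"{proto}|{dst_ip}|{dst_port}|{src_ip}|{src_port}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def steer(pkt, queues):
    # Steering module: the same five-tuple always lands in the same receiving queue.
    h = five_tuple_hash(*pkt["five_tuple"])
    queues[h % len(queues)].append(pkt)

def process(pkt, local_vips, conntrack, lookup_table, backends):
    # Processing thread: drop packets whose destination VIP isn't registered locally.
    if pkt["dst_ip"] not in local_vips:
        return None
    h = five_tuple_hash(*pkt["five_tuple"])
    backend = conntrack.get(h)
    if backend is None:
        if not backends:
            return None                   # no backend available: drop the packet
        # New connection: pick a backend (here via the Maglev lookup table,
        # see the consistent hashing section below) and remember the choice.
        backend = backends[lookup_table[h % len(lookup_table)]]
        conntrack[h] = backend
    return (pkt, backend)                 # handed to the muxing module for rewriting and sending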

Here’s a question: why doesn't the Steering Module just use a common strategy like round-robin? The processing threads run at different speeds, so plain round-robin can reorder packets within a connection, and compensating with weights adds new complexity because those speeds change dynamically. There is also connection tracking to think about: for a long-lived connection we want every packet to hit the same tracking state, which round-robin makes harder. That said, round-robin is kept as a fallback for special situations, such as when the selected receive queue fills up; this fallback is particularly useful when a flood of packets shares the same five-tuple.

Efficient Packet Processing#

As mentioned earlier, Maglev operates directly on packets, and given Google's enormous traffic, it needs excellent forwarding performance; otherwise its throughput cannot meet the demands of large-scale deployment. How does Google achieve this? Answer: by operating directly on the network card.

We all know that in Linux, copying packets from kernel space to user space is a very costly operation. So for scenarios with extreme performance demands, such as L4 load balancing, people tend to implement the data path inside the kernel to avoid copies across the kernel boundary; this is the idea behind tools like LVS. However, at even larger traffic volumes, just going from the network card into the kernel and through its stack of filters is also costly. As mentioned earlier, Maglev only relies on the five-tuple of each packet and doesn't care about sequence numbers or payload. So Google thought: I have a bold idea! Let's look at a diagram:

image

Google chose to program directly on the NIC (i.e., network card). The Forwarder and NIC share a memory space. This memory maintains a circular pool of packets. The steering module and muxing module in the Forwarder each maintain three pointers to handle these packets, described in detail below.

First, the steering module maintains three pointers:

  1. received, managing received packets.
  2. reserved, managing received but unprocessed packets.
  3. processed, managing processed packets.

The process is as follows: when the NIC receives a new packet, the received pointer advances past it. When a packet is dispatched to a thread for processing, the processed pointer advances past it. Since the structure is circular, the packets between processed and received are those that have been received but not yet processed; these are tracked by the reserved pointer.

Correspondingly, the muxing module also maintains three pointers:

  1. sent, managing packets that have been sent.
  2. ready, managing packets that are ready and waiting to be sent.
  3. recycled, managing recycled packets.

The corresponding process is as follows: when processing of a packet is complete, the ready pointer advances past it and the packet waits to be sent. When a packet has been sent, the sent pointer advances past it. In addition to ready and sent, the recycled pointer tracks packets whose slots have been returned to the pool.

We can see that no packet data is copied anywhere in this process, which removes the latency that copying would add. The catch is that updating these shared pointers for every single packet would itself be a noticeable overhead, so Google processes packets in batches, for example around 3000 small packets at a time, which is quite a clever trick.
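
The pointer scheme is easier to see with a toy model. Below is a minimal sketch of the receive half only (the send half mirrors it); the real thing lives in memory shared with the NIC and never allocates:

POOL_SIZE = 8
pool = [None] * POOL_SIZE   # circular pool of packet slots shared between NIC and Forwarder

received = 0    # advanced by the NIC as packets arrive
processed = 0   # advanced by the steering module as packets are handed to threads
# Slots in [processed, received) are "reserved": received but not yet processed.

def nic_receive(pkt):
    # Toy model: ignores the case where the pool is full.
    global received
    pool[received % POOL_SIZE] = pkt
    received += 1

def steer_batch(batch_size):
    # Take up to batch_size packets. Advancing `processed` once per batch rather
    # than once per packet is the cost that batching amortizes.
    global processed
    end = min(received, processed + batch_size)
    batch = [pool[i % POOL_SIZE] for i in range(processed, end)]
    processed = end
    return batch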

Additionally, some extra optimizations need to be made, such as ensuring that packet processing threads do not share data to avoid race conditions, and binding threads to specific CPU cores to ensure performance, etc.

Google's approach here is very efficient: the average processing time is only about 300 ns ($10^{-9}$ s) per packet. As mentioned, Google processes packets in batches, which has one downside: when packets arrive slowly, filling a batch can take much longer than usual and add latency. So Google added a 50 μs ($10^{-6}$ s) timer that flushes a partially filled batch, bounding the extra latency from batching at roughly 50 μs (it feels like Google is showing off here, saying, "Look, our performance is great, only hardware is our bottleneck" (runs away)).

Backend Selection#

As mentioned earlier, the Forwarder selects a backend for the packets. For common TCP scenarios, it is crucial to forward packets with the same five-tuple to the same backend node. Google maintains a connection tracking table in Maglev to solve this problem. When a packet arrives, Maglev calculates its five-tuple hash and checks if it exists in the table. If it does not exist, it selects a node as the backend and adds the record to the table. If it exists, it directly reuses it.

This seems fine, right? Google: No, there are still problems!

First, consider that the Routers in front of Maglev do not necessarily provide connection affinity, that is, they don't guarantee that packets from the same connection always reach the same Maglev machine, so different packets of one connection can land on different machines. And even when they do land on the same machine, that machine may have restarted in the meantime and lost its connection tracking table.

Another case: the connection tracking table has a memory limit. Under very high traffic or abnormal situations such as a SYN flood, once the table reaches its capacity some entries have to be evicted, and the tracking information for a live connection may be thrown away. How do we keep packets of that connection going to the same backend then?

Google's approach is to introduce consistent hashing.
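
The key property is that, with (nearly identical) lookup tables on every machine, backend selection becomes a pure function of the packet's hash rather than of local state, so a lost conntrack entry, or a packet landing on a different Maglev machine, usually still resolves to the same backend. Conceptually:

# Conceptual only: any Maglev machine with the same lookup table computes the
# same backend for the same five-tuple hash, even without a conntrack entry.
def select_backend(five_tuple_hash_value: int, lookup_table, backends):
    return backends[lookup_table[five_tuple_hash_value % len(lookup_table)]]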

Consistent Hashing: Maglev Hash#

The overall algorithm has many details, so I will only explain the general idea. For specific details, you can refer to the original text.

First, we need to determine M, the size of the lookup table. All keys are hashed into this lookup table, and each entry of the table is mapped to a node.

The calculation of the lookup table is divided into two steps:

  1. Calculate, for each node, a preference value for every position in the lookup table (this is what the paper calls the permutation).
  2. Based on the permutation, compute which node each lookup table entry maps to (this array is what the paper calls the final lookup table).

The permutation is an N×M matrix: each row corresponds to a node and each column to a position in the lookup table. To compute it, we pick two hash functions and derive two values per node, offset and skip, then fill that node's row as follows:

  1. offset ← h1(name[i]) mod M
  2. skip ← h2(name[i]) mod (M − 1) + 1
  3. permutation[i][j] ← (offset + j × skip) mod M

Where i is the index of the node in the Node Table, and j is the index in the lookup table.
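
Here is a small sketch of this first step. The paper doesn't prescribe particular hash functions, so h1 and h2 below are arbitrary choices from hashlib; note that the paper requires M to be prime so that every skip value is coprime with M and each row really is a permutation of 0..M-1:

import hashlib
from typing import List

def _h(algo: str, name: str) -> int:
    return int.from_bytes(hashlib.new(algo, name.encode()).digest()[:8], "big")

# Step 1: compute the permutation table. names is the list of node names.
def calculate_permutation(names: List[str], m: int) -> List[List[int]]:
    permutation: List[List[int]] = []
    for name in names:
        offset = _h("md5", name) % m            # h1
        skip = _h("sha1", name) % (m - 1) + 1   # h2
        permutation.append([(offset + j * skip) % m for j in range(m)])
    return permutation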

After calculating the permutation, we can compute the final lookup table, which is represented as a one-dimensional array:

image

Here’s a piece of code to calculate the lookup table based on the already computed permutation:

from typing import List

# Calculate the lookup_table based on the already computed permutation
def calculate_lookup_table(n: int, m: int, permutation: List[List[int]]) -> List[int]:
    # result is the final hash table recording distribution
    result: List[int] = [-1] * m
    # next is used to resolve conflicts; during traversal, if the entry we want to fill is already occupied,
    # we use next to find the next row. This process continues until an empty position is found.
    # Since each column contains every value from 0 to M-1, we will definitely traverse every row.
    # The computational complexity is O(M logM) ~ O(M^2)
    next: List[int] = [0] * n
    flag = 0
    while True:
        for i in range(n):
            x = permutation[i][next[i]]
            while True:
                # Found an empty position, exit the search
                if result[x] == -1:
                    break
                next[i] += 1
                x = permutation[i][next[i]]
            result[x] = i
            next[i] += 1
            flag += 1
            # The table is filled, exit the calculation
            if flag == m:
                return result

This loop always terminates, but in the worst case its complexity is high, potentially reaching O(M^2). The paper therefore suggests choosing M much larger than N ("To avoid this happening we always choose M such that M ≫ N.") to keep the average complexity at O(M log M).
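
Combining the calculate_permutation sketch above with calculate_lookup_table, here is a quick way to see the "consistent" part: remove one backend and, ideally, mostly just the entries that pointed at the removed backend get remapped:

M = 65537                                     # a prime, chosen so that M >> N
nodes = [f"backend-{i}" for i in range(10)]   # hypothetical backend names

table_before = calculate_lookup_table(len(nodes), M, calculate_permutation(nodes, M))
smaller = nodes[:-1]                          # drop the last backend
table_after = calculate_lookup_table(len(smaller), M, calculate_permutation(smaller, M))

moved = sum(1 for a, b in zip(table_before, table_after) if nodes[a] != smaller[b])
print(f"{moved / M:.1%} of the lookup table entries now point to a different backend")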

How does Google's self-developed consistent hashing algorithm perform in Maglev? The paper also conducted tests:

image

It can be seen that Maglev demonstrates better balance for different sizes of lookup tables.

To be honest, Maglev hashing looks to me essentially like a hash with virtual nodes. I honestly didn't expect Google to pass over more established schemes like Dynamo-style consistent hashing. Policy reasons, perhaps? (After all, Dynamo belongs to AWS) (runs away). By the way, Envoy also implements Maglev; see Evaluate other consistent hash LB algorithms for details. Envoy's version also introduces weights, which is quite nice; interested readers can check it out (runs away).

To be honest, there are still many details about Maglev Hash that haven't been covered, but I'm too lazy to write them. I'll wait until I publish a blog analyzing consistent hashing later. Flag++

Maglev Optimization#

We have covered the basic principles of Maglev. However, if it is to be used as a large-scale production LB, many optimizations need to be made regarding the details. Since this involves many aspects, I will only briefly introduce a few here, and I still recommend that everyone read the original text directly.

Handling Fragmented Packets#

Those familiar with networking know that when packets are transmitted over IP, the MTU size limit means a datagram may be fragmented, and the fragments do not all carry the full five-tuple. For example, if a datagram is split into two fragments, the first carries the L3 and L4 headers while the second carries only the L3 header. On top of that, nothing guarantees that all fragments even reach the same Maglev machine, so Maglev cannot simply run fragments through its normal processing path.

This is a significant problem because packet fragmentation is a very common scenario. So how should Maglev handle such situations? First, we need to determine how to ensure that all data can be successfully delivered:

  1. Ensure that different segments of a single data packet are processed by the same Maglev instance.
  2. Ensure that the backend selection results for different segments of the same data packet are consistent.

Okay, let's see how Google solves this problem.

First, each Maglev instance has a special backend pool that contains all instances in the Maglev cluster. When a fragment is received, Maglev first computes a hash over the three-tuple (source address, destination address, protocol) and uses it to select a Maglev instance from that pool to forward to. This ensures that all fragments of the same datagram end up on the same Maglev instance. The GRE recursion control field is used to make sure fragments are only redirected once, avoiding loops.

Now let's see how condition 2 is satisfied. Each Maglev instance maintains a special table that records the forwarding results of the first data segment after fragmentation. Using the previous example, when the second segment of a packet arrives, Maglev will check the table for the forwarding result of the first segment. If it exists, it forwards directly; if it does not exist, it caches this segment until the first segment arrives or the timeout threshold is reached.
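
A rough sketch of the two mechanisms combined is below. The names and data layout are hypothetical, and timeouts and table-size limits are omitted:

import hashlib

fragment_table = {}   # (src, dst, ip_id) -> backend chosen when the first fragment arrived
pending = []          # fragments that arrived before their first fragment

def three_tuple_hash(src_ip, dst_ip, proto) -> int:
    key = f"{src_ip}|{dst_ip}|{proto}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def route_fragment(frag, maglev_pool, my_index, select_backend):
    """Return ("redirect", maglev), ("forward", backend), or ("buffer", None)."""
    # Step 1: all fragments of a datagram share the 3-tuple, so they are all
    # redirected (via GRE) to the same instance in the special all-Maglev pool.
    owner = three_tuple_hash(frag["src"], frag["dst"], frag["proto"]) % len(maglev_pool)
    if owner != my_index:
        return ("redirect", maglev_pool[owner])

    key = (frag["src"], frag["dst"], frag["ip_id"])
    if frag["is_first"]:
        # The first fragment carries the L4 header: select a backend the normal
        # way and remember the decision for the remaining fragments.
        backend = select_backend(frag)
        fragment_table[key] = backend
        return ("forward", backend)
    if key in fragment_table:
        return ("forward", fragment_table[key])
    pending.append(frag)                  # wait for the first fragment (or a timeout)
    return ("buffer", None)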

Monitoring and Debugging#

In reality, most usage does not require debugging (just kidding). Google has designed auxiliary monitoring and debugging tools for this system to assist in daily development iterations.

On the monitoring side, there is both black-box and white-box monitoring. For black-box monitoring, agents distributed around the globe probe the VIPs to confirm their health from the outside. There is also a complete set of white-box monitoring: Google monitors per-server metrics as well as Maglev's own internal metrics.

Additionally, there are some debugging tools. For instance, Google developed a packet tracer similar to X-Trace. It can send information with specific headers and payloads. When Maglev receives such special packets, in addition to forwarding them as usual, it will also report some key information to a designated location.

This actually reflects one of the advantages of software load balancing over hardware load balancing; both debuggability and iterability are unmatched by hardware load balancing.

Conclusion#

I actually spent quite a bit of time reading this paper, and there are many details worth digging into, so I strongly recommend finding the original and reading it; it's very good. I'd also like to recommend an article by the Meituan technical team, which drew on Maglev to build their own high-performance L4 load balancer. See MGW——Meituan's High-Performance Layer 4 Load Balancer.

Alright, that's it for this article. This should be the most time-consuming article I've written. However, thinking about the few articles I still have to write makes my head hurt.

Gotta run!
