Continue the discussion on the primary process in the container.

Last week's article gave an overview of the init process in containers. With the guidance and collaboration of my mentor, Chuan (jschwinger23 on GitHub), we then dug into the implementations of the two init processes most widely used in containers, dumb-init and tini, and I wrote up this commentary on what we found.

Main Text#

Why do we need an init process, and what responsibilities should we expect it to undertake?#

Before continuing our discussion on dumb-init and tini, we need to review a question: Why do we need an init process? And what responsibilities should the init process we choose undertake?

In fact, there are a few main scenarios in the container context where we need an init process running in the foreground:

For scenarios involving graceful upgrades of binaries within the container, one mainstream approach is to fork a new process, exec a new binary file, with the new process handling new connections and the old process handling old connections. (Nginx adopts this solution.)
Situations where signals are not properly forwarded and processes are not correctly reaped.
In some scenarios like calico-node, for convenience in packaging, we run multiple binaries in the same container.

There isn't much to say about the first scenario; let's take a look at the testing for the second point.

First, we prepare the simplest Python file, demo1.py:

import time

time.sleep(10000)

Then, as usual, we wrap it with a bash script:

#!/bin/bash

python /root/demo1.py

Finally, we write the Dockerfile:

FROM python:3.9

ADD demo1.py /root/demo1.py
ADD demo1.sh /root/demo1.sh

ENTRYPOINT ["bash", "/root/demo1.sh"]

After building and starting execution, let's first take a look at the process structure.

Process Structure

No problem, now we use strace to trace the two processes 2049962 and 2050009, and then send a SIGTERM signal to the bash process 2049962.

Let's look at the results.

Trace Result of Process 2049962

Trace Result of Process 2050009

We can clearly see that when process 2049962 receives SIGTERM, it does not forward it to process 2050009. After we manually SIGKILL 2049962, 2050009 also exits immediately. Some may wonder why 2050009 exits after 2049962 does.

This is due to the characteristics of the PID namespace. Let's take a look at the relevant introduction in pid_namespaces:

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal.

When the init process in the current PID namespace exits, the kernel directly SIGKILLs the remaining processes in that PID namespace.

OK, when we combine this with the container scheduling framework, many pitfalls can occur in production. Here’s a previous complaint of mine:

We had a test service, Spring Cloud, where after going offline, the node could not be removed from the registry, and after much confusion, we found the issue...
Essentially, when the Pod is removed, Kubernetes (the kubelet, through the container runtime) sends SIGTERM to the container's ENTRYPOINT process and waits thirty seconds (the default graceful shutdown timeout). If the process has not exited by then, it is SIGKILLed directly.
The problem is that our Eureka version of the service starts via start.sh, ENTRYPOINT ["/home/admin/start.sh"], and the script is run by the container's default shell, /bin/sh, which forks and execs the service as a child process instead of replacing itself. As a result, my service process never receives SIGTERM and ends up being SIGKILLed.

Isn't that frustrating? Besides signals not being forwarded properly, another common problem is the appearance of zombie (Z) processes: child processes that have exited but are never reaped. The notorious zombie problem of early puppeteer is one example. Apart from bugs in the application itself, another possible cause is daemonization: an orphaned process is re-parented to a process that never reaps it once it exits.

OK, after reviewing the common issues above, let's review the responsibilities that the init process in a container should undertake:

Signal forwarding.
Reaping Z processes.
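The two responsibilities above can be sketched in a few lines of Python (chosen to match the demos in this article; neither dumb-init nor tini is written this way). This is only a minimal sketch: the names make_forwarder, reap_zombies and exit_code are made up for illustration, and a real init would also have to decide which signals to register and when to exit.

```python
import os
import signal


def make_forwarder(child_pid):
    """Build a handler that relays any signal it receives to child_pid."""
    def forward(signum, frame):
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # the child has already exited and been reaped
    return forward


def exit_code(status):
    """Map a wait status to a shell-style exit code (128 + signal number)."""
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)
    return os.WEXITSTATUS(status)


def reap_zombies():
    """Reap every child that has already exited; return {pid: exit code}."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: this process has no children at all
        if pid == 0:
            break  # children exist, but none of them has exited yet
        reaped[pid] = exit_code(status)
    return reaped
```

In use, the handler would be registered with something like signal.signal(signal.SIGTERM, make_forwarder(child_pid)), and reap_zombies() would be called whenever SIGCHLD arrives.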

Currently, in container scenarios, two main solutions are used as the init process within containers: dumb-init and tini. Both solutions handle orphan and Z processes reasonably well. However, the implementation of signal forwarding is quite complicated. So next...

Time for a discussion!

The Underwhelming dumb-init#

To some extent, dumb-init is a classic example of false advertising. The code implementation is very rough.

Let's take a look at the official promotion:

dumb-init runs as PID 1, acting like a simple init system. It launches a single process and then proxies all received signals to a session rooted at that child process.

Here, dumb-init claims to use Linux process sessions. A freshly created session contains a single process group, whose ID equals the session ID. So we would expect dumb-init to forward signals to every process in that process group. Sounds great, right?
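The relationship between a new session and its process group can be checked with a small Python sketch. The helper name ids_after_setsid is hypothetical; it forks a child, calls setsid() in it, and reports the child's PID, session ID and process group ID back through a pipe.

```python
import os


def ids_after_setsid():
    """Fork a child, call setsid() in it, and return its (pid, sid, pgid)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child
        os.close(read_fd)
        os.setsid()                   # become the leader of a new session
        report = f"{os.getpid()} {os.getsid(0)} {os.getpgrp()}"
        os.write(write_fd, report.encode())
        os._exit(0)
    os.close(write_fd)                # parent
    with os.fdopen(read_fd) as reader:
        child_pid, sid, pgid = map(int, reader.read().split())
    os.waitpid(pid, 0)                # reap the child
    return child_pid, sid, pgid
```

Right after setsid() the three values are identical. Nothing stops a descendant from later moving itself into another process group with setpgid(), at which point the session and the process group no longer cover the same set of processes.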

Let's test it out.

The test code is as follows, demo2.py:

import os
import time

pid = os.fork()
if pid == 0:
    cpid = os.fork()
time.sleep(1000)

The Dockerfile is as follows:

FROM python:3.9

RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init

ADD demo2.py /root/demo2.py

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]

CMD ["python", "/root/demo2.py"]

Build and run, let's first take a look at the process structure.

Process Structure of demo2

Then, as usual, we strace processes 2103908, 2103909, and 2103910, and send a SIGTERM signal to the dumb-init process.

strace 2103908

strace 2103909

strace 2103910

Hey? What happened, dumb-init? Why was 2103909 directly SIGKILLed without receiving SIGTERM?

Here we need to look at the key implementation of dumb-init:

void handle_signal(int signum) {
    DEBUG("Received signal %d.\n", signum);

    if (signal_temporary_ignores[signum] == 1) {
        DEBUG("Ignoring tty hand-off signal %d.\n", signum);
        signal_temporary_ignores[signum] = 0;
    } else if (signum == SIGCHLD) {
        int status, exit_status;
        pid_t killed_pid;
        while ((killed_pid = waitpid(-1, &status, WNOHANG)) > 0) {
            if (WIFEXITED(status)) {
                exit_status = WEXITSTATUS(status);
                DEBUG("A child with PID %d exited with exit status %d.\n", killed_pid, exit_status);
            } else {
                assert(WIFSIGNALED(status));
                exit_status = 128 + WTERMSIG(status);
                DEBUG("A child with PID %d was terminated by signal %d.\n", killed_pid, exit_status - 128);
            }

            if (killed_pid == child_pid) {
                forward_signal(SIGTERM);  // send SIGTERM to any remaining children
                DEBUG("Child exited with status %d. Goodbye.\n", exit_status);
                exit(exit_status);
            }
        }
    } else {
        forward_signal(signum);
        if (signum == SIGTSTP || signum == SIGTTOU || signum == SIGTTIN) {
            DEBUG("Suspending self due to TTY signal.\n");
            kill(getpid(), SIGSTOP);
        }
    }
}

This is the signal handling code of dumb-init. Upon receiving a signal, it forwards every signal except SIGCHLD (note that SIGKILL and SIGSTOP cannot be caught, so they never reach this handler). Let's take a look at the signal forwarding logic:

void forward_signal(int signum) {
    signum = translate_signal(signum);
    if (signum != 0) {
        kill(use_setsid ? -child_pid : child_pid, signum);
        DEBUG("Forwarded signal %d to children.\n", signum);
    } else {
        DEBUG("Not forwarding signal %d to children (ignored).\n", signum);
    }
}

By default, use_setsid is enabled, so dumb-init calls kill with -child_pid. In kill(2), a negative pid has the following meaning:

If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
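To see this behavior in isolation, here is a small Python sketch. The helpers spawn_group and signal_group are made up for illustration: the first starts a few sleeping processes that share one new process group, the second sends a signal to the whole group with a negative pid, just as the kill call above does.

```python
import os
import signal
import subprocess
import sys

# A throwaway child that just sleeps; SIGTERM keeps its default action.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def spawn_group(size):
    """Start `size` sleepers that all belong to one new process group."""
    # The first process creates the group (pgid == its own pid).
    leader = subprocess.Popen(SLEEPER, preexec_fn=lambda: os.setpgid(0, 0))
    members = [leader]
    for _ in range(size - 1):
        # The others join the leader's group before exec.
        members.append(
            subprocess.Popen(SLEEPER, preexec_fn=lambda: os.setpgid(0, leader.pid))
        )
    return members


def signal_group(pgid, signum):
    """Send signum to every process in group pgid via kill(-pgid, signum)."""
    if pgid <= 1:
        raise ValueError("refusing to signal pgid <= 1")
    os.kill(-pgid, signum)
```

A single signal_group(leader.pid, signal.SIGTERM) call terminates every member, but only as long as each process is still in that group.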

Forwarding straight to the process group seems fine, right? So why did we see the result above? Look again at what kill does for a process group: the kernel walks the group's members and signals them one by one, an O(N) traversal. That by itself is harmless. But, to not keep you in suspense, it exposes a race condition in dumb-init's implementation.

Because killing the process group is an O(N) traversal, some processes receive the signal before others. Suppose dumb-init's direct child receives SIGTERM first and exits. dumb-init then receives SIGCHLD, reaps the child with waitpid, sees that it is the process it directly manages, and exits itself. Since dumb-init is the init process of the current PID namespace, recall the PID namespace rule:

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal.

After dumb-init exits, the remaining processes will be directly SIGKILLed by the kernel. This leads to the situation we observed where the child process did not receive the forwarded signal!

So let's emphasize this: The claim that dumb-init can forward signals to all processes is completely false advertising!

Moreover, please note that dumb-init claims to manage processes within a session! But in reality, they only perform signal forwarding for a process group! This is completely false advertising! Fake News!

Additionally, as mentioned above, in scenarios like hot updating binaries, dumb-init simply exits as soon as its direct child exits. This is no different from not using an init process at all!

We can test this with the following code, demo3.py:

import os
import time

pid = os.fork()
time.sleep(1000)

Forking a process results in a total of two processes.

The Dockerfile is as follows:

FROM python:3.9

RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init

ADD demo3.py /root/demo3.py

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]

CMD ["python", "/root/demo3.py"]

Build and execute, let's first look at the process structure.

Process Structure of demo3

Then simulate the old process exiting by directly SIGKILLing 2134836, and let's check the strace result of 2134837.

strace 2134837

As expected, after dumb-init exits, 2134837 is SIGKILLed by the kernel.

That wraps up dumb-init's shortcomings! Now, let's discuss the implementation of tini.

A Friendly Discussion on Tini#

To be fair, while tini's implementation also has its pitfalls, it is much more refined than dumb-init's. Let's take a look at the code directly.

while (1) {
    /* Wait for one signal, and forward it */
    if (wait_and_forward_signal(&parent_sigset, child_pid)) {
        return 1;
    }

    /* Now, reap zombies */
    if (reap_zombies(child_pid, &child_exitcode)) {
        return 1;
    }

    if (child_exitcode != -1) {
        PRINT_TRACE("Exiting: child has exited");
        return child_exitcode;
    }
}

First, tini does not set a signal handler; it continuously loops through wait_and_forward_signal and reap_zombies.

int wait_and_forward_signal(sigset_t const* const parent_sigset_ptr, pid_t const child_pid) {
    siginfo_t sig;

    if (sigtimedwait(parent_sigset_ptr, &sig, &ts) == -1) {
        switch (errno) {
            case EAGAIN:
                break;
            case EINTR:
                break;
            default:
                PRINT_FATAL("Unexpected error in sigtimedwait: '%s'", strerror(errno));
                return 1;
        }
    } else {
        /* There is a signal to handle here */
        switch (sig.si_signo) {
            case SIGCHLD:
                /* Special-cased, as we don't forward SIGCHLD. Instead, we'll
                 * fallthrough to reaping processes.
                 */
                PRINT_DEBUG("Received SIGCHLD");
                break;
            default:
                PRINT_DEBUG("Passing signal: '%s'", strsignal(sig.si_signo));
                /* Forward anything else */
                if (kill(kill_process_group ? -child_pid : child_pid, sig.si_signo)) {
                    if (errno == ESRCH) {
                        PRINT_WARNING("Child was dead when forwarding signal");
                    } else {
                        PRINT_FATAL("Unexpected error when forwarding signal: '%s'", strerror(errno));
                        return 1;
                    }
                }
                break;
        }
    }

    return 0;
}

It uses sigtimedwait to receive signals and forwards every signal except SIGCHLD, which only triggers the reaping step.
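Python exposes the same system call as signal.sigtimedwait, so the "wait for one signal with a timeout" step can be sketched as follows. The helper name wait_one_signal is hypothetical, and the sketch assumes the signals in the set have already been blocked in the calling thread with signal.pthread_sigmask, otherwise their normal disposition would run instead.

```python
import signal


def wait_one_signal(sigset, timeout):
    """Wait up to `timeout` seconds for one pending signal from `sigset`.

    Returns the signal number, or None if the timeout expired. The signals
    must already be blocked in the calling thread.
    """
    info = signal.sigtimedwait(sigset, timeout)
    return None if info is None else info.si_signo
```

A loop built on this helper would skip forwarding when the result is signal.SIGCHLD, forward any other signal number to the child, and treat None as "nothing arrived, go and reap".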

int reap_zombies(const pid_t child_pid, int* const child_exitcode_ptr) {
    pid_t current_pid;
    int current_status;

    while (1) {
        current_pid = waitpid(-1, &current_status, WNOHANG);

        switch (current_pid) {

            case -1:
                if (errno == ECHILD) {
                    PRINT_TRACE("No child to wait");
                    break;
                }
                PRINT_FATAL("Error while waiting for pids: '%s'", strerror(errno));
                return 1;

            case 0:
                PRINT_TRACE("No child to reap");
                break;

            default:
                /* A child was reaped. Check whether it's the main one. If it is, then
                 * set the exit_code, which will cause us to exit once we've reaped everyone else.
                 */
                PRINT_DEBUG("Reaped child with pid: '%i'", current_pid);
                if (current_pid == child_pid) {
                    if (WIFEXITED(current_status)) {
                        /* Our process exited normally. */
                        PRINT_INFO("Main child exited normally (with status '%i')", WEXITSTATUS(current_status));
                        *child_exitcode_ptr = WEXITSTATUS(current_status);
                    } else if (WIFSIGNALED(current_status)) {
                        /* Our process was terminated. Emulate what sh / bash
                         * would do, which is to return 128 + signal number.
                         */
                        PRINT_INFO("Main child exited with signal (with signal '%s')", strsignal(WTERMSIG(current_status)));
                        *child_exitcode_ptr = 128 + WTERMSIG(current_status);
                    } else {
                        PRINT_FATAL("Main child exited for unknown reason");
                        return 1;
                    }

                    // Be safe, ensure the status code is indeed between 0 and 255.
                    *child_exitcode_ptr = *child_exitcode_ptr % (STATUS_MAX - STATUS_MIN + 1);

                    // If this exitcode was remapped, then set it to 0.
                    INT32_BITFIELD_CHECK_BOUNDS(expect_status, *child_exitcode_ptr);
                    if (INT32_BITFIELD_TEST(expect_status, *child_exitcode_ptr)) {
                        *child_exitcode_ptr = 0;
                    }
                } else if (warn_on_reap > 0) {
                    PRINT_WARNING("Reaped zombie process with pid=%i", current_pid);
                }

                // Check if other childs have been reaped.
                continue;
        }

        /* If we make it here, that's because we did not continue in the switch case. */
        break;
    }

    return 0;
}

Then, in the reap_zombies function, it calls waitpid in a loop to reap exited children. It leaves the loop when there are no child processes left (ECHILD), when no child has exited yet (waitpid returns 0 because of WNOHANG), or when another system error occurs.

Note the difference in implementation between tini and dumb-init: dumb-init exits right inside its loop as soon as it reaps its entry child process, while tini first finishes the reaping loop, collecting every child that has already exited, and only then decides whether to exit itself.

Now let's test it.

Using the demo2 example, we will test the grandchild process scenario.

FROM python:3.9

ADD demo2.py /root/demo2.py
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

ENTRYPOINT [ "/tini","-s", "-g", "--"]
CMD ["python", "/root/demo2.py"]

Then build and execute, the process structure is as follows:

Process Structure of demo2-tini

Then, as usual, we attach strace and send SIGTERM to see what happens:

strace 2160093

strace 2160094

strace 2160095

As expected, everything works. So is tini's implementation free of problems? Let's prepare another example, demo4.py:

import os
import time
import signal
pid = os.fork()
if pid == 0:
    signal.signal(15, lambda _, __: time.sleep(1))
    cpid = os.fork()
time.sleep(1000)

Here we use time.sleep(1) to simulate that the program needs to handle SIGTERM gracefully, and we prepare the Dockerfile:

FROM python:3.9

ADD demo4.py /root/demo4.py
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

ENTRYPOINT [ "/tini","-s", "-g", "--"]
CMD ["python", "/root/demo4.py"]

Then build and run it, and take a quick look at the process structure.

Process Structure of demo4

Then attach strace and send SIGTERM, the same routine as before:

strace 2173315

strace 2173316

strace 2173317

We find that processes 2173316 and 2173317 successfully received the SIGTERM signal, but while processing, they were SIGKILLed. So why is this?

In fact, there is a potential race condition here.

With tini, after 2173315 exits, 2173316 is orphaned and has to be re-parented.

According to the kernel's re-parenting process, 2173316 is re-parented to the tini process.

However, tini calls waitpid with the WNOHANG option, so if a child process has not yet exited when waitpid runs, the call immediately returns 0. tini therefore leaves the loop, sees that its main child has already exited, and exits itself. The kernel then SIGKILLs the processes that are still busy handling SIGTERM.
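The WNOHANG behavior is easy to reproduce in Python. In this minimal sketch (the function name demo_wnohang is made up), the child is kept alive with a pipe so that the first, non-blocking waitpid is guaranteed to run while the child is still running. It waits on the specific child PID rather than -1 only to avoid touching unrelated children.

```python
import os


def demo_wnohang():
    """Return (pid, early, late): waitpid results before and after exit."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child
        os.close(write_fd)
        os.read(read_fd, 1)               # blocks until the parent closes the pipe
        os._exit(0)
    os.close(read_fd)                     # parent
    early = os.waitpid(pid, os.WNOHANG)   # child still alive: returns (0, 0)
    os.close(write_fd)                    # let the child finish
    late = os.waitpid(pid, 0)             # blocking wait: returns (pid, 0)
    return pid, early, late
```

The early result is exactly the "return 0" case described above: a child exists, it simply has not exited yet, and a loop that treats this as "nothing left to do" gives up too soon.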

Isn't that frustrating? My mentor and I raised an issue about this: "tini Exits Too Early Leading to Graceful Termination Failure".

I also made a patch for it, which you can find under the title "use new threading to run waitpid" (still a PoC: no unit tests written yet, and the handling is a bit rough).

In fact, the idea is simple: we do not use the WNOHANG option in waitpid, making it a blocking call, and then use a new thread to handle waitpid.
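The idea can be illustrated with a Python sketch. This is not the actual patch (tini is written in C), only a model of the approach under that assumption: the hypothetical start_reaper helper starts a thread that blocks in waitpid without WNOHANG and only finishes once the kernel reports ECHILD.

```python
import os
import threading


def start_reaper(results):
    """Start a thread that blocks in waitpid() until no children remain.

    Every reaped child is recorded in `results` as {pid: raw wait status}.
    """
    def reap_forever():
        while True:
            try:
                pid, status = os.waitpid(-1, 0)  # blocking: no WNOHANG
            except ChildProcessError:
                return                           # ECHILD: every child is gone
            results[pid] = status

    thread = threading.Thread(target=reap_forever, name="reaper")
    thread.start()
    return thread
```

The main thread stays free to receive and forward signals, and the init process exits only after the reaper thread has finished.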

The test results of this patch are as follows:

Process Structure of demo5

strace 1808102

strace 1808104

strace 1808105

Well, as expected, the test has no issues.

Of course, attentive readers may notice that the original tini also cannot handle binary updates, for the same reason demo4 fails: it exits too early. You can test this as well.

In fact, my handling is quite rough and brute-force. All we really need is for tini's exit condition to become: it must wait until waitpid() == -1 && errno == ECHILD before exiting. There are quite a few possible ways to implement this, which we can think through together.
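One of the simplest variants, without threads, is sketched below in Python. The function name wait_until_no_children is hypothetical. It remembers the main child's exit code when that child is reaped, but keeps blocking in waitpid until the call fails with ECHILD (surfaced in Python as ChildProcessError).

```python
import os


def wait_until_no_children(main_pid):
    """Block until waitpid() fails with ECHILD; return main_pid's exit code.

    Returns None if main_pid was never reaped by this loop.
    """
    main_code = None
    while True:
        try:
            pid, status = os.waitpid(-1, 0)  # blocking wait, no WNOHANG
        except ChildProcessError:
            return main_code                 # ECHILD: nothing left to wait for
        if pid == main_pid:
            if os.WIFSIGNALED(status):
                main_code = 128 + os.WTERMSIG(status)
            else:
                main_code = os.WEXITSTATUS(status)
```

A real init would still need to keep forwarding signals while it waits here. In Python, installed signal handlers keep running because the interrupted waitpid is retried automatically; a C implementation would have to handle that explicitly.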

Finally, let's summarize the core of the issue:

Both dumb-init and tini, in their current implementations, make the same mistake: in the special scenario of containers, they do not wait for all descendant processes to exit before exiting themselves. The fix is quite simple: the exit condition must be waitpid() == -1 && errno == ECHILD.

Conclusion#

This article criticized dumb-init and tini. The implementation of dumb-init is indeed underwhelming, while tini's implementation is much more refined. However, tini still behaves unreliably, and the scenario we care about, updating a binary via fork under an init process, cannot be achieved with either dumb-init or tini. Additionally, both dumb-init and tini currently share a common limitation: they cannot handle child processes that escape the process group (for example, ten child processes each escaping to a separate process group).

Moreover, in the tests in this article, we used time.sleep(1) to simulate graceful shutdown behavior, and tini also fails to meet the requirements. So...

Ultimately, the fundamental point is that the application’s signals and process reaping should be self-determined. Any reliance on an init process to manage these basic behaviors is irresponsible in production. (If you really want an init process, use tini, but never use dumb-init.)

So, running bare, without any init process, is the safe way to go!

That's about it for this commentary. This article took nearly a week of my spare time from raising the issue to verifying the conclusion and patching the PoC (the first draft was completed after 4 AM). Finally, thanks to Chuan for staying up with me until after 3 AM a few times. Lastly, I hope you enjoy reading!