A little issue about execSync in Node.js

It's been a while since I wrote a filler post. Yesterday, I helped someone debug some surprising behavior of the execSync function in Node.js. It turned out to be quite interesting, so I decided to write up my notes as an article.

Background

First, the person provided a screenshot.

Issue Screenshot

The issue boils down to this: when the execSync function in Node.js is used to execute a command like ps -Af | grep -q -E -c "\\-\\-user-data-dir=\\.+App", Node.js occasionally throws an error. The stack trace looks roughly like this:

Uncaught Error: Command failed: ps -Af | grep -q -E -c "\-\-user-data-dir=\.+App"
    at checkExecSyncError (child_process.js:616:11)
    at Object.execSync (child_process.js:652:15) {
  status: 1,
  signal: null,
  output: [ null, <Buffer >, <Buffer > ],
  pid: 89073,
  stdout: <Buffer >,
  stderr: <Buffer >
}

However, running the same command in a terminal shows no error at all, which makes the issue somewhat perplexing.

Analysis

First, let's take a look at the description of execSync in the Node.js documentation.

The child_process.execSync() method is generally identical to child_process.exec() with the exception that the method will not return until the child process has fully closed. When a timeout has been encountered and killSignal is sent, the method won't return until the process has completely exited. If the child process intercepts and handles the SIGTERM signal and doesn't exit, the parent process will wait until the child process has exited.
If the process times out or has a non-zero exit code, this method will throw. The Error object will contain the entire result from child_process.spawnSync().
Never pass unsanitized user input to this function. Any input containing shell metacharacters may be used to trigger arbitrary command execution.

In essence, this function runs a command in a child process and blocks until the command exits or the timeout is hit. If the command times out or exits with a non-zero code, it throws. No problem there. Next, let's look at the error stack mentioned above and the implementation of execSync.
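To make the "non-zero exit code throws" rule concrete, here is a minimal sketch. The helper name exitStatusOf is made up for illustration, and the example assumes the default shell understands the exit builtin.

```javascript
const { execSync } = require('child_process');

// Hypothetical helper (not part of Node.js): run a command through execSync
// and report its exit status instead of letting the exception propagate.
function exitStatusOf(command) {
  try {
    // stdio: 'pipe' keeps the child's stderr from being echoed to ours.
    execSync(command, { stdio: 'pipe' });
    return 0; // execSync only returns normally when the exit status is 0
  } catch (err) {
    // For a non-zero exit, err.status carries the exit code. It is null
    // when the child was killed by a signal or could not be started.
    return err.status;
  }
}

console.log(exitStatusOf('exit 0')); // 0
console.log(exitStatusOf('exit 3')); // 3
```

Note that the thrown Error says nothing about why the status was non-zero. It only reports that it was.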

function execSync(command, options) {
  const opts = normalizeExecArgs(command, options, null);
  const inheritStderr = !opts.options.stdio;

  const ret = spawnSync(opts.file, opts.options);

  if (inheritStderr && ret.stderr)
    process.stderr.write(ret.stderr);

  const err = checkExecSyncError(ret, opts.args, command);

  if (err)
    throw err;

  return ret.stdout;
}

function checkExecSyncError(ret, args, cmd) {
  let err;
  if (ret.error) {
    err = ret.error;
  } else if (ret.status !== 0) {
    let msg = 'Command failed: ';
    msg += cmd || ArrayPrototypeJoin(args, ' ');
    if (ret.stderr && ret.stderr.length > 0)
      msg += `\n${ret.stderr.toString()}`;
    // eslint-disable-next-line no-restricted-syntax
    err = new Error(msg);
  }
  if (err) {
    ObjectAssign(err, ret);
  }
  return err;
}

We can see that after executing the command, execSync calls checkExecSyncError to check whether the child process's exit status is 0. If it is not 0, the command is considered to have failed and an exception is thrown.

Nothing looks wrong here, so does that mean the command itself failed when we ran it? Let's verify that.
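Since execSync is essentially spawnSync plus this status check, calling spawnSync directly shows the raw result before any exception is built. A small sketch, assuming a POSIX sh is on the PATH:

```javascript
const { spawnSync } = require('child_process');

// spawnSync reports the outcome as plain data. A non-zero exit status
// does not make it throw; the caller decides what the status means.
const ok = spawnSync('sh', ['-c', 'echo hello; exit 0'], { encoding: 'utf8' });
const bad = spawnSync('sh', ['-c', 'echo oops >&2; exit 1'], { encoding: 'utf8' });

console.log(ok.status, JSON.stringify(ok.stdout));   // 0 "hello\n"
console.log(bad.status, JSON.stringify(bad.stderr)); // 1 "oops\n"
```

The status, stdout, and stderr fields seen here are the same ones that end up on the Error object in the stack trace above.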

This issue also exists on Mac and other environments, but for convenience I reproduced it on Linux. When it comes to troubleshooting at the syscall level on Linux, there are not many tools as mature and convenient as strace. eBPF would also work, but to be honest, anything I wrote myself would not be as effective as strace.

So, let's run the command:

sudo strace -t -f -p $PID -o error_trace.txt

Tip: with strace, the -f option also traces child processes created by the traced process.

After running the command, I had the entire chain of syscalls. Let's start analyzing.

The whole file is too long (nearly 4K lines), so I picked out only the key parts:

...
894259 13:21:23 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f12d9465a50) = 896940
...
896940 13:21:23 execve("/bin/sh", ["/bin/sh", "-c", "ps -Af | grep -E -c \"\\-\\-user-da"...], 0x4aae230 /* 40 vars */ <unfinished ...>
...
896940 13:21:24 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 896942
896940 13:21:24 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=896942, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
896940 13:21:24 rt_sigreturn({mask=[]}) = 896942
896940 13:21:24 exit_group(1)           = ?
896940 13:21:24 +++ exited with 1 +++

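One clarification: in the trace, the new process is created with the clone syscall rather than fork. On Linux this is expected, because glibc's fork() wrapper is itself implemented on top of clone(2), and the flags seen here (CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD) are its typical signature. The differences between the two are extensive enough to warrant a lengthy article of their own (consider that a promise for a future post). For now, here is the summary from the clone(2) man page: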

These system calls create a new ("child") process, in a manner similar to fork(2). By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).

In short, clone provides semantics similar to fork, but gives developers finer control over how a process or thread is created.

Here we see that the main process 894259 created process 896940 via clone. Process 896940 then called execve to run /bin/sh -c with our command, ps -Af | grep -q -E -c "\\-\\-user-data-dir=\\.+App", as its argument (running the command through sh is the default behavior of execSync). We also see that 896940 did exit with code 1, consistent with our earlier analysis. In other words, something reported a failure status when the command ran. So where did it come from?

Let's analyze the command. If you're familiar with common shells, you might notice that our command actually uses the pipe operator |. To be precise, when this operator appears, the two commands on either side will be executed in separate processes and communicate through a pipe. In other words, we can quickly locate these two processes by searching the text.

...
896941 13:21:23 execve("/bin/ps", ["ps", "-Af"], 0x564c16f6ec38 /* 40 vars */) = 0
...
896942 13:21:23 execve("/bin/grep", ["grep", "-E", "-c", "\\-\\-user-data-dir=\\.*"], 0x564c16f6ecb0 /* 40 vars */ <unfinished ...>
...
896941 13:21:24 <... exit_group resumed>) = ?
896941 13:21:24 +++ exited with 0 +++
...
896942 13:21:24 exit_group(1)           = ?
896942 13:21:24 +++ exited with 1 +++
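A detail that matters when reading this trace: a POSIX shell reports the exit status of the last command in a pipeline, so whatever the right-hand side returns becomes the status of sh itself. A minimal sketch, assuming sh, true, and false are available (the helper name shStatus is made up for illustration):

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: exit status of a command line run through `sh -c`.
function shStatus(commandLine) {
  return spawnSync('sh', ['-c', commandLine]).status;
}

// Left side succeeds, right side fails: the pipeline reports failure.
console.log(shStatus('true | false')); // 1
// Left side fails, right side succeeds: the pipeline reports success.
console.log(shStatus('false | true')); // 0
```

This is why a successful ps on the left side cannot hide a non-zero status from the command on the right.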

We find that process 896942, which ran grep, exited with code 1. Why is that? After checking the official documentation for grep, I was almost speechless.

Normally, the exit status is 0 if selected lines are found and 1 otherwise. But the exit status is 2 if an error occurred, unless the -q or --quiet or --silent option is used and a selected line is found. Note, however, that POSIX only mandates, for programs such as grep, cmp, and diff, that the exit status in case of error be greater than 1; it is therefore advisable, for the sake of portability, to use logic that tests for this general condition instead of strict equality with 2.

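If grep does not match any lines, it exits with code 1; if it matches, it exits with 0. My first reaction was: doesn't 1 mean Operation not permitted? Strictly speaking, that is the errno value 1 (EPERM), which lives in a different namespace from exit statuses. For exit statuses, the only general convention is that 0 means success and non-zero means failure, and grep uses 1 for "no lines selected". Still, reporting a perfectly ordinary "no match" through the failure channel is a surprising choice!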
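One way to handle grep's three-way exit status explicitly is to call grep through spawnSync and interpret the status yourself. The following is only a sketch: the helper name grepMatches and the sample process line are made up for illustration, and it assumes a grep binary on the PATH.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: true for a match, false for no match, and an
// exception only when grep reports a real error or cannot be run.
function grepMatches(pattern, text) {
  // -e keeps a pattern that starts with "-" from being parsed as an option.
  const ret = spawnSync('grep', ['-q', '-E', '-e', pattern], {
    input: text,
    encoding: 'utf8',
  });
  if (ret.error) throw ret.error;     // e.g. grep is not installed
  if (ret.status === 0) return true;  // at least one line was selected
  if (ret.status === 1) return false; // nothing selected: not a failure
  // Per the documentation quoted above, errors use a status greater than 1.
  throw new Error(`grep failed (status ${ret.status}): ${ret.stderr}`);
}

// Made-up sample line standing in for one row of `ps -Af` output.
const sample = 'user 123 1 0 10:00 ? 00:00:01 app --user-data-dir=/tmp/App\n';
console.log(grepMatches('--user-data-dir=', sample)); // true
console.log(grepMatches('no-such-flag', sample));     // false
```

Appending || true to the shell command would also stop execSync from throwing, but it would swallow genuine grep errors as well, so checking the status explicitly is the more careful option.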

Conclusion

After reading through the entire article, we can summarize two reasons:

When Node.js wraps the POSIX process APIs, it applies the conventional "non-zero exit code means failure" rule on the user's behalf as a safety net, even though interpreting an exit code should, in theory, be application-specific behavior.
grep reports an ordinary "no match" result with exit code 1, which a generic non-zero check cannot tell apart from a real failure.

To be honest, I don't know how to evaluate which of these two aspects is more problematic. As mentioned earlier, handling the exit code of child processes should theoretically be an application-specific behavior, but Node.js has added a layer of encapsulation, which saves users from cognitive load but also introduces significant risks in non-standard scenarios.

One can only say that trade-offs must be made based on different scenarios.
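If you want to keep using execSync with a shell pipeline, one possible way to move the decision back to the application is a thin wrapper that takes the set of acceptable exit statuses. This is a sketch under assumptions: the name execSyncAllowing and its return shape are invented here, and the demo command assumes printf and grep are available.

```javascript
const { execSync } = require('child_process');

// Hypothetical wrapper: the caller, not the library, decides which exit
// statuses count as success. Any other failure is rethrown unchanged.
function execSyncAllowing(command, okStatuses = [0]) {
  try {
    const stdout = execSync(command, { encoding: 'utf8', stdio: 'pipe' });
    return { status: 0, stdout };
  } catch (err) {
    // err.status is null for timeouts, signals, and spawn failures,
    // so those are never treated as acceptable.
    if (typeof err.status === 'number' && okStatuses.includes(err.status)) {
      return { status: err.status, stdout: err.stdout };
    }
    throw err;
  }
}

// grep -c prints the match count; with no match it prints 0 and exits 1.
const res = execSyncAllowing('printf "a\\nb\\n" | grep -c z', [0, 1]);
console.log(res.status, res.stdout.trim()); // 1 0
```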

Alright, that's it for this article. Since it was a spontaneous decision, I won't bother listing the related references in the text. That's about it, mission accomplished.jpg