Jul 1, 2019
This is a story about how some bad API design on my part caused
some ugly race conditions that were very tricky to break down. I’m
writing this story as a word of warning to others! The code itself
was written in Haskell, but the lessons apply to anyone working
with Unix-style processes.
Introducing typed-process
I maintain both the process library in Haskell, which is the
standard way of launching child processes, and the typed-process
library, which explores some refinements to that API for more user
friendliness. The API has two main types: ProcessConfig defines
settings for launching a process (command name, environment
variables, etc.), and Process represents a running child process
that can be interacted with. With that, we have some basic API
usage that looks like this:
This isn’t quite working code, but it gets the idea across
pretty nicely.
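
As a rough sketch of what such usage might look like (an illustration, not a definitive listing), the program below uses cat as a stand-in child so that it can actually run. helperFunction here is a hypothetical placeholder, and closing the child’s stdin plays the role of the "quit" input.

```haskell
import System.IO (Handle, hClose, hFlush, hGetLine, hPutStrLn)
import System.Process.Typed

-- Hypothetical stand-in for the helperFunction mentioned in the text:
-- it reads one line of the child's output and echoes it.
helperFunction :: Handle -> IO ()
helperFunction h = hGetLine h >>= putStrLn

main :: IO ()
main = do
  -- cat is an assumption made only so the sketch runs.
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  process <- startProcess config
  hPutStrLn (getStdin process) "Hello"
  hFlush (getStdin process)
  helperFunction (getStdout process) -- if this throws, nothing below runs
  hClose (getStdin process)          -- plays the role of the "quit" input
  exitCode <- waitExitCode process
  print exitCode
```
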
Exception safety
There’s a problem with the code above: it’s not exception-safe.
Let’s say that the helperFunction call fails with a runtime
exception. The child process will never receive the "quit" input,
we’ll never wait for the child process
to end, and ultimately we’ll end up with a process that’s sitting
around, twiddling its thumbs, unable to ever exit. (You may think
this is a zombie process, but zombie has a specific and different
meaning in the Unix world.)
The Haskell ecosystem, like many others, has a method for
providing exception safety. We call it the bracket pattern. You
combine resource allocation and cleanup actions using the helper
function bracket, and you are guaranteed that when your block is
finished, the cleanup action is called, regardless of how the block
finishes.
To make this work, we need a stopProcess function.
This function is intelligent: if the process has already exited,
stopProcess doesn’t do anything. However, if the process is still
running, stopProcess sends it a SIGTERM signal, which for most
well-behaved programs will cause it to exit. (Unix processes can
actually handle SIGTERM and continue running, but for our cases
we’ll pretend like it’s a process death sentence.)
So let’s rewrite the code above with bracket:
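
A minimal sketch of that rewrite, under the same assumptions as before: cat stands in for the child, helperFunction is a hypothetical placeholder, and closing stdin plays the role of "quit". bracket comes from Control.Exception in base.

```haskell
import Control.Exception (bracket)
import System.IO (Handle, hClose, hFlush, hGetLine, hPutStrLn)
import System.Process.Typed

-- Hypothetical stand-in for the helperFunction mentioned in the text.
helperFunction :: Handle -> IO ()
helperFunction h = hGetLine h >>= putStrLn

main :: IO ()
main = do
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  -- Allocation: startProcess. Cleanup: stopProcess, run however the block ends.
  exitCode <- bracket (startProcess config) stopProcess $ \process -> do
    hPutStrLn (getStdin process) "Hello"
    hFlush (getStdin process)
    helperFunction (getStdout process) -- an exception here now triggers stopProcess
    hClose (getStdin process)          -- plays the role of the "quit" input
    waitExitCode process               -- on success, the child has already exited
  print exitCode
```
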
And just like that, we have exception safety, and avoid runaway
processes. Neato!
Let’s walk through the cases above. If any of the actions in the
block throw a runtime exception, bracket will trigger stopProcess,
resulting in a SIGTERM being sent to the child. If, on the other
hand, no exception occurs, we know that the child process has
already exited thanks to the waitExitCode call, and therefore
stopProcess will be a no-op. That’s exactly the behavior we want.
Following Haskell best practices, we can capture this bracket
call into a helper function called withProcess:
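
One way such a helper could be written, as a sketch rather than the library’s actual source. The name withProcessSketch is made up here to avoid clashing with the library’s own export, and the true executable is just a convenient child for the demo.

```haskell
import Control.Exception (bracket)
import System.Process.Typed
  (Process, ProcessConfig, proc, startProcess, stopProcess, waitExitCode)

-- Hypothetical name: start the process, run the block, always call stopProcess.
withProcessSketch
  :: ProcessConfig stdin stdout stderr
  -> (Process stdin stdout stderr -> IO a)
  -> IO a
withProcessSketch config = bracket (startProcess config) stopProcess

main :: IO ()
main = do
  -- The block waits for the child, so stopProcess ends up as a no-op.
  exitCode <- withProcessSketch (proc "true" []) waitExitCode
  print exitCode
```
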
And exception safety has been achieved!
Finally, one more addition. A common pattern in working with
child processes is checking that the exit code is a success, and
throwing an exception if it’s anything else. We have a helper
function withProcess_ that performs that exit code checking too.
This essentially looks like:
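
A sketch of that shape, assuming the cleanup simply runs stopProcess followed by checkExitCode (the real library code may differ in detail). withProcessSketch_ is a hypothetical name.

```haskell
import Control.Exception (bracket)
import System.Process.Typed
  ( Process, ProcessConfig, checkExitCode, proc
  , startProcess, stopProcess, waitExitCode )

-- Hypothetical name. However the block exits, stop the child and then
-- throw if its exit code is not a success.
withProcessSketch_
  :: ProcessConfig stdin stdout stderr
  -> (Process stdin stdout stderr -> IO a)
  -> IO a
withProcessSketch_ config =
  bracket (startProcess config) (\p -> stopProcess p >> checkExitCode p)

main :: IO ()
main = do
  -- Waiting inside the block keeps stopProcess a no-op on the success path.
  exitCode <- withProcessSketch_ (proc "true" []) waitExitCode
  print exitCode
```
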
Playing with cat
We’re going to perform a cardinal Unix sin: use the cat
executable when we’re not actually combining together two different
files. Please forgive me, it’s for a good reason.
Below is a fully runnable Haskell script. You can install Stack,
copy the code into Main.hs, and run stack Main.hs to run it. The
program does the following:
1. Defines a process config where:
   - The child’s standard input is a new pipe
   - The child’s standard output is a new pipe
   - The child command line is cat with no arguments
2. Launches the process using withProcess_
3. While the process is running, runs two Haskell threads
   concurrently:
   - Thread 1 sends the string Hello World!\n to the child over
     standard input and then closes the pipe
   - Thread 2 captures everything from the child’s standard output,
     until the pipe is closed
4. Prints the output captured from the child to the parent’s
   standard output stream (aka the terminal in the way I’m testing
   it)
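
The steps above can be sketched as follows. This is a reconstruction under assumptions: it needs the typed-process, async, and bytestring packages, it omits any Stack script header, and on newer typed-process releases withProcess_ compiles with a deprecation warning.

```haskell
import Control.Concurrent.Async (concurrently)
import qualified Data.ByteString as B
import System.IO (hClose, hPutStrLn, stdout)
import System.Process.Typed

main :: IO ()
main = do
  -- Step 1: stdin and stdout are fresh pipes, the command is cat.
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  -- Steps 2 and 3: launch the child and run both threads concurrently.
  (_, output) <- withProcess_ config $ \process -> concurrently
    (do hPutStrLn (getStdin process) "Hello World!"
        hClose (getStdin process))           -- thread 1: write, then close
    (B.hGetContents (getStdout process))     -- thread 2: read until EOF
  -- Step 4: print what the child sent back.
  B.hPut stdout output
```
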
When I run this on OS X, I fairly reliably get the expected
output:
However, when I run this on Linux, I will often get the
following instead:
Granted, not always, but often enough. So now we have a weird
exit failure and some non-determinism, in what appears to be a
really simple program. What gives?!?
ExitFailure (-15)
The first thing to identify is what this negative exit code is.
Haskell, like a few other ecosystems, uses a negative exit code to
indicate that the process exited due to a signal. In this case,
that means the child process (cat) died with signal number 15,
which is SIGTERM. That’s certainly interesting… where have we seen
a SIGTERM come up before? Right, in stopProcess. But it doesn’t
quite make sense that stopProcess would send the signal, since it
is only called once the standard output pipe from the child process
has been closed. And we know that cat exits at exactly the same
time as it closes its standard output pipe… right?
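
To make the convention concrete, here is a small sketch that interprets an ExitCode the way the paragraph describes. describeExit is a hypothetical helper, not part of any library.

```haskell
import System.Exit (ExitCode (..))

-- Hypothetical helper: a negative failure code means "killed by that signal".
describeExit :: ExitCode -> String
describeExit ExitSuccess = "exited successfully"
describeExit (ExitFailure n)
  | n < 0     = "killed by signal " ++ show (negate n)
  | otherwise = "exited with code " ++ show n

main :: IO ()
main = mapM_ (putStrLn . describeExit)
  [ExitSuccess, ExitFailure 1, ExitFailure (-15)]
```

For ExitFailure (-15) this reports signal 15, which is SIGTERM.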
Race condition!
Hopefully my scare italics above helped a bit. No, as it turns
out, the pipe’s closure and the child’s exit are not simultaneous.
In fact, our cat process will end up doing something like the
following:
1. read from stdin
2. If there was more data: write to stdout and return to step 1
3. If there was no more data, exit loop and continue with step 4
4. Close stdin
5. Close stdout
6. Exit with exit code 0 (indicating success)
The parent process, meanwhile, will repeatedly call read on the
read end of the child’s stdout pipe, and as soon as that read
indicates end of file (EOF), the block will exit, and withProcess_
will do two things:
1. Call stopProcess
2. Call checkExitCode to make sure the process exited successfully
There are multiple interleavings of events that can occur. The
success case looks like this:
1. Child closes stdout
2. Child exits with exit code 0
3. Parent receives EOF on read
4. Parent calls stopProcess, which is a no-op (child is already
   exited)
5. checkExitCode gets exit code 0 and is happy
However, it’s also possible with a different process timing to
get:
1. Child closes stdout
2. Parent receives EOF on read
3. Parent calls stopProcess, which sends a SIGTERM to the child
4. Child never has a chance to return exit code 0, it’s already
   dead
5. checkExitCode sees that the child exited due to a SIGTERM and
   throws an exception
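
The window between EOF and exit can be observed directly. The sketch below is an illustration under assumptions, not a reliable reproduction: it reads cat’s stdout to EOF and then immediately asks whether an exit code has been observed. Whether the first line prints Nothing or Just ExitSuccess depends on timing.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.IO (hClose)
import System.Process.Typed

main :: IO ()
main = do
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  bracket (startProcess config) stopProcess $ \process -> do
    hClose (getStdin process)               -- cat sees EOF on its stdin
    _ <- B.hGetContents (getStdout process) -- returns once cat closes stdout
    -- Nothing here means the child's exit has not been observed yet,
    -- which is exactly the window in which stopProcess would send SIGTERM.
    observed <- getExitCode process
    putStrLn ("Right after EOF: " ++ show observed)
    final <- waitExitCode process
    putStrLn ("After waiting:   " ++ show final)
```
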
This may seem like a corner case, but it’s already bitten me
twice: first in a test suite, and second as a major annoyance in
the new Stack release.
Who to blame?
Well, as usual, the person to blame is myself.

Usage of the Unix process API can be tricky to get right, but
it’s clearly documented and well executed. And I’d argue that my
usage of withProcess_ is the right kind of abstraction. No, the
problem is the implementation of withProcess_. Let’s step through
it again:
1. Launch a process
2. Run some block with the process
3. However that block exits (normal or exception), call stopProcess
   and then ensure there’s a success exit code
In our first usage above, we called waitExitCode in the block,
which guaranteed that in the success case stopProcess would always
end up as a no-op. Everything was fine. The problem was that I made
the assumption that cat’s pipes closing was the same as the child
process exiting. We know that’s not true. However, given that this
bug hit me twice, it’s fair to say I’ve created an API which
encourages misuse.
Instead, here’s what I think is the better implementation for
withProcess_:
1. Launch a process
2. Run some block with the process
3. If that block throws an exception, terminate the child process
   with stopProcess
4. If that block succeeds, wait for the process to exit and then
   check that its exit code is a success
With this tweak to behavior, the code calling cat above is safe,
and I can sleep better at night.
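
A sketch of the revised behavior, assuming it can be expressed with bracket plus an explicit wait on the success path. withProcessWaitSketch_ is a hypothetical name, and the library’s real implementation may be structured differently.

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.IO (hClose, hPutStrLn, stdout)
import System.Process.Typed
  ( Process, ProcessConfig, checkExitCode, createPipe, getStdin, getStdout
  , proc, setStdin, setStdout, startProcess, stopProcess, waitExitCode )

-- Hypothetical name. If the block throws, bracket runs stopProcess and the
-- child gets SIGTERM. If it succeeds, we wait for exit before checking.
withProcessWaitSketch_
  :: ProcessConfig stdin stdout stderr
  -> (Process stdin stdout stderr -> IO a)
  -> IO a
withProcessWaitSketch_ config inner =
  bracket (startProcess config) stopProcess $ \process -> do
    result <- inner process
    _ <- waitExitCode process -- let the child finish on its own
    checkExitCode process     -- throws unless the exit code is a success
    return result

main :: IO ()
main = do
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  (_, output) <- withProcessWaitSketch_ config $ \process -> concurrently
    (do hPutStrLn (getStdin process) "Hello World!"
        hClose (getStdin process))
    (B.hGetContents (getStdout process))
  B.hPut stdout output
```

Because the wait happens before any cleanup, stopProcess is a no-op on the success path regardless of how the pipe closure and the exit interleave.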
Deprecations
Rolling out a new set of behavior which silently (meaning: no
compile-time change) modifies behavior at runtime is dangerous.
People using withProcess_ may be relying on exactly its current
behavior. Therefore, instead of replacing the current withProcess_
behavior, the roll-out strategy is:
- Introduce a new function withProcessTerm_, which has the same
  behavior as withProcess_ today
- Introduce a new function withProcessWait_, which has the new
  behavior I just described above
- Deprecate withProcess_ with a message indicating that the caller
  should use one of the replacement functions instead
This will encourage users of typed-process to analyze their usages
of withProcess_, see if they are susceptible to the bug described
here, and choose the appropriate replacement.
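
As a rough illustration of making that choice, the sketch below uses the wait variant for a child that is expected to finish on its own, and a terminate variant for a child that should be stopped when the block ends. It assumes a typed-process release that ships the replacement functions. withProcessTerm (without the underscore, so no exit code check) is not mentioned above and is used here so that the deliberate SIGTERM does not surface as an exception. The true and sleep commands are arbitrary stand-ins.

```haskell
import System.Process.Typed

main :: IO ()
main = do
  -- Wait semantics: the child finishes by itself; we wait, then check its code.
  withProcessWait_ (proc "true" []) $ \_ ->
    putStrLn "block done; waiting for the child to exit"

  -- Terminate semantics: the child would outlive the block, so it is stopped
  -- with SIGTERM once the block returns.
  withProcessTerm (proc "sleep" ["60"]) $ \_ ->
    putStrLn "block done; terminating the child"
```
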
Further reading
If you’re interested in learning more about any of this, here
are some (hopefully) helpful links: