async exceptions, STM, and deadlocks

Functional Programming

Jun 20, 2016

async exceptions, STM, and deadlocks

For a change of pace, I wanted to cover a simple topic:

asynchronous exceptions, and how they interact with Software

Transactional Memory and MVars, as well as the GHC

runtime system's deadlock detection. As fate would have it, the

topics in question occurred twice in the past month, once in a

bug report against the yaml package, and once in a major refactoring of FP

Complete's distributed computation platform. I'll focus on the

first since it's an easier case to explain, but I'll try to explain

the second (and jucier) one too.

To get this started off nicely, I'll share an interesting piece

of code and its output. If you're playing along at home, try

guessing what the output of this program will be, and if your guess

is different from the actual output, try to predict why. Hopefully,

by the end of this post, it will be clear (although still perhaps

surprising).

And actual output:

Right 5
Left throw a normal exception
example1.hs: thread blocked indefinitely in an MVar operation

To clarify, that last line of output means that the program itself exited because a BlockedIndefinitelyOnMVar

exception was thrown to the main thread and was not caught. Also,

this is possibly the first time I've written an interesting code

example for a blog post that uses only the base package :)

Synchronous vs asynchronous exceptions

There are two* ways an exception can be thrown in Haskell:

Synchronous exceptions are generated by the current
thread. What's important about these is that we generally want to
be able to recover from them. For example, if you try to read from
a file, and the file doesn't exist, you may wish to use some
default value instead of having your program exit, or perhaps
prompt the user for a different file location.
Asynchronous exceptions are thrown by either a different
user thread, or by the runtime system itself. For example, in the
async package, race will kill the
longer-running thread with an asynchronous exception. Similarly,
the timeout function will kill an action which has run for too long. And epic foreshadowment the runtime system
will kill threads which appear to be deadlocked on
MVars or STM actions.

In contrast to synchronous exceptions, we almost never want to
recover from asynchronous exceptions. In fact, this is a common
mistake in Haskell code, and from what I've seen has been the
largest source of confusion and concern amongst users when it comes
to Haskell's runtime exception system.

Alright, got that? Catch the synchronous ones and recover, never recover from asynchronous exceptions**.

* You could argue that creating an exception in a pure value, via functions like error and undefined, is a third category. We're going to ignore that for now.

** You may be wondering "why make it possible to catch async

exceptions if we can't recover from them?" The answer is that we

still want to be able to perform cleanup actions, like closing file

handles. This is the topic of the aforementioned upcoming blog

post.

Catching all exceptions

Quite a few years ago, I wrote a School of Haskell article on how to catch all exceptions. I'm

not going to rehash that article here, since (1) the content is

still current, and (2) I'll likely be sharing a new blog post in a

few days with a newer approach. But the gist of it is that if we

want to catch all synchronous exceptions in an action without

catching asynchronous ones, we run that action in a separate

thread, and then grab the result from that thread using a mutable

variable. The SoH article uses the async package, which hides away

some of the details. However, the core idea is easy enough to

express as I did above:

Let's address some important points in the design of this

function, since it has to step quite carefully around asynchronous

exceptions:

We use forkIOWithUnmask to ensure that the newly
created thread has asynchronous exceptions masked. This means that
we will receive no async exceptions in the new thread, until we
wrap an exception with the restore function.
We restore the user's action, meaning async exceptions can be received there. But we wrap that restore action with a try, catching all synchronous and asynchronous exceptions.
Since outside of the try we still have all async
exceptions masked, there's nothing preventing the
putMVar call in the child thread from being called. This means that, in all cases of the action terminating, the putMVar will ensure that the var is filled up.
The takeMVar in the parent thread will block until
the child thread completes. And since we're guaranteed that the
var will always be filled once the action terminates, we seem (once again, epic foreshadowment) to be guaranteed that the takeMVar call will always wait for action to somehow terminate, and once it's terminated
will return either the exception is died with or the success
result.

Deadlock protection

Alright, I've foreshadowed enough already, I'm guessing

everyone's figured out where the problem is going to come. Sure

enough, if we look at the program I started this post with, we see

that tryAny mayThrow3 >>= print does not catch the exception generated by mayThrow3, but instead

itself dies from GHC's deadlock protection

(BlockedIndefinitelyInMVar). Simon Marlow explained the reason for this to me twice, once in his (really really good) book, and once when I raised issue #14 on async because I hadn't been smart enough to connect the dots from the book to my actual issue.

As long as I'm linking so much to Github, you may also like seeing

my pull request to work around issue #14.

Don't worry, I'm not going to make you read all of that. The

summary is pretty simple: GHC's run time system has very

straightforward deadlock detection. When it finds a group of

threads that are all blocked on blocking mutable variables

(MVars or STM variables), and finds that

no other threads have references to the variables it question, it

determines that all of the threads are deadlocked, and sends all of

them asynchronous exceptions (either

BlockedIndefinitelyOnMVar or BlockedIndefinitelyOnSTM).

So let's analyze tryAny mayThrow3 in excruciating detail. The series of events here is:

Create a new empty MVar to allow the child thread to pass back the result of mayThrow3 to the parent thread.
Fork a new child thread (yes, exceptions are masked, but that's not terribly important for us).
Back in the parent thread: block waiting for that empty MVar to get filled.
In the child thread, set up try to catch all
exceptions, and then run the user's action (with async exceptions
unmasked).
As it happens, I carefully crafted mayThrow3 to deadlock on an MVar. So now the child thread can't make progress, and is fully deadlocked.

OK, the stage is set: the main thread is blocked on an MVar for the result. As I proved above, we know for a fact that this MVar will ultimately be filled with the

result of the user action, once the user action completes or dies

with an exception. However, GHC doesn't know this, it just sees the main thread blocked on an MVar action, and

that only the main thread and child thread have references to that

variable.

In the child thread, we've created a new MVar

deadlock. This is a true and legitimate deadlock (if you don't

believe me, go back and look at the implementation of

mayThrow3). So GHC decides that both the main thread and the child thread are currently blocked on MVar

actions, and there is no other thread that can unblock either of

them. And GHC is correct. However, the next part is the unfortunate

part:

GHC kills both threads with an asynchronous exceptions

Why is this unfortunate? Because we know that if the child

thread receives the async exception, it will exit, the result

MVar will get filled, and the main thread can complete

successfully. That's exactly the behavior we want, but not the

behavior GHC is able to deliver to us.

The workaround

The workaround I'm about to demonstrate is not elegant, and it

leaves a bad taste in my mouth. If anyone has a better idea, let me

know. The idea is this: we know that there's no legitimate reason

for the takeMVar inside tryAny to be killed with a BlockedIndefinitelyOnMVar, even if GHC

doesn't know that. So we outsmart GHC by catching and ignoring that

specific exception. And just in case our logic is wrong and there's

a flawed implementation here, we only catch-and-ignore a fixed

number of times (arbitrary choice: 10) to avoid creating a real

deadlock.

I've added a commit for enclosed-exceptions that implements this behavior, but for the lazy, here's a simplified version of it:

Swallow the bit of vomit rising in the back of your throat, try

this out in the original program, and see if it solves the issue.

You should see the following output:

Right 5
Left throw a normal exception
Left thread blocked indefinitely in an MVar operation
done

Note that the "thread blocked" has actually been caught, and the final done! line is actually printed.

Problem with STM-based tryAny

tryAny was already implemented safely in terms of the async package, which uses STM in place of MVars as we did here. Mitchell Rosen came up with a really convincing demonstration for why we shouldn't use STM for this purpose. To quote, with the program:

We get the rather surprising output:

Control.Concurrent.STM.atomically was nested

The reason is that decode is calling a bunch of IO functions as FFI calls to the libyaml library, and is wrapped in unsafePerformIO to make the morally-pure YAML-decoding action not live in IO. Under the surface, it uses tryAny to avoid unexpected exceptions from breaking out of the unsafePerformIO dungeon it

lives in. That's all well and good, but now if you use

decode inside an atomically block, you have two layers of STM occurring, which is strictly forbidden.

Extra credit Figure out why I needed to use $! to make this fail.

Our distributed computing problem

This problem arose about two months ago, and is far more

complicated than the examples I've discussed until now. In fact,

I've so far been unsuccessful at reproducing this with a minimal

test case, which is definitely annoying. Nonetheless, I can explain

what basically happened:

Have a queue of work items to be performed, represented by a TChan
Have one thread (filler) grabbing work items from somewhere (like a socket) and putting them on the TChan
Have another thread (worker) grabbing work items and performing them
The filler has logic to detect when no more items are incoming, and exits
Run these two threads together using the race
function, which guarantees that once the filler exits, the worker
will be killed too

Unfortunately, this runs into the same problem I described

above, at least in some circumstances. The thread calling

race is blocked on an MVar under the

surface, which filler and worker both have access to. filler

completes, puts something in the MVar, and then disappears, losing its reference to the MVar and the TChan. Meanwhile, worker is still blocked waiting on the TChan, to which only it has access now, resulting in a deadlock condition.

In our application, we ended up with a situation where the worker thread was deadlocked on the TChan, and the thread calling race was blocked on the MVar. It's a slightly more complicated case than this though, since in the simple race case, the thread calling race doesn't actually get blocked on the MVar. Nonetheless, it's another interesting case where the deadlock prevention kills too many threads.

Fortunately for our application, there's a very simple - and

more robust - solution, which I recommend to anyone doing this kind

of concurrent programming. The problem was that our worker thread

had no logic to terminate itself, and instead just waited to be

killed by race. Instead, we switched over to a closeable TChan implementation, where the filler thread could explicitly closed out the TChan and then

the worker would kill itself instead of waiting around for an async

exception.

So to leave this blog post with some recommendations, and a

small amount of extra foreshadowment (fourth time getting to use

that word in one blog post!):

Explicit exit conditions are better than implicit ones
Avoid relying on async exceptions for control flow if you can
Relying on the deadlock detection to clean up your messes is a bad idea
And even if you don't want the deadlock detection to clean up your messes, it may try to do so for you badly anyway