Jun 20, 2016
For a change of pace, I wanted to cover a simple topic:
asynchronous exceptions, and how they interact with Software
Transactional Memory and MVar
s, as well as the GHC
runtime system's deadlock detection. As fate would have it, the
topics in question occurred twice in the past month, once in a
bug report against the yaml package, and once in a major refactoring of FP
Complete's distributed computation platform. I'll focus on the
first since it's an easier case to explain, but I'll try to explain
the second (and jucier) one too.
To get this started off nicely, I'll share an interesting piece
of code and its output. If you're playing along at home, try
guessing what the output of this program will be, and if your guess
is different from the actual output, try to predict why. Hopefully,
by the end of this post, it will be clear (although still perhaps
surprising).
And actual output:
To clarify, that last line of output means that the program itself exited because a BlockedIndefinitelyOnMVar
exception was thrown to the main thread and was not caught. Also,
this is possibly the first time I've written an interesting code
example for a blog post that uses only the base package :)
Synchronous vs asynchronous exceptions
There are two* ways an exception can be thrown in Haskell:
Synchronous exceptions are generated by the current
thread. What's important about these is that we generally want to
be able to recover from them. For example, if you try to read from
a file, and the file doesn't exist, you may wish to use some
default value instead of having your program exit, or perhaps
prompt the user for a different file location.
Asynchronous exceptions are thrown by either a different
user thread, or by the runtime system itself. For example, in the
async
package,race
will kill thelonger-running thread with an asynchronous exception. Similarly,
the
timeout
function will kill an action which has run for too long. And epic foreshadowment the runtime systemwill kill threads which appear to be deadlocked on
MVar
s orSTM
actions.In contrast to synchronous exceptions, we almost never want to
recover from asynchronous exceptions. In fact, this is a common
mistake in Haskell code, and from what I've seen has been the
largest source of confusion and concern amongst users when it comes
to Haskell's runtime exception system.
Alright, got that? Catch the synchronous ones and recover, never recover from asynchronous exceptions**.
* You could argue that creating an exception in a pure value, via functions like error
and undefined
, is a third category. We're going to ignore that for now.
** You may be wondering "why make it possible to catch async
exceptions if we can't recover from them?" The answer is that we
still want to be able to perform cleanup actions, like closing file
handles. This is the topic of the aforementioned upcoming blog
post.
Catching all exceptions
Quite a few years ago, I wrote a School of Haskell article on how to catch all exceptions. I'm
not going to rehash that article here, since (1) the content is
still current, and (2) I'll likely be sharing a new blog post in a
few days with a newer approach. But the gist of it is that if we
want to catch all synchronous exceptions in an action without
catching asynchronous ones, we run that action in a separate
thread, and then grab the result from that thread using a mutable
variable. The SoH article uses the async package, which hides away
some of the details. However, the core idea is easy enough to
express as I did above:
Let's address some important points in the design of this
function, since it has to step quite carefully around asynchronous
exceptions:
We use
forkIOWithUnmask
to ensure that the newlycreated thread has asynchronous exceptions masked. This means that
we will receive no async exceptions in the new thread, until we
wrap an exception with the
restore
function.We restore the user's
action
, meaning async exceptions can be received there. But we wrap thatrestore action
with atry
, catching all synchronous and asynchronous exceptions.Since outside of the
try
we still have all asyncexceptions masked, there's nothing preventing the
putMVar
call in the child thread from being called. This means that, in all cases of theaction
terminating, theputMVar
will ensure that thevar
is filled up.The
takeMVar
in the parent thread will block untilthe child thread completes. And since we're guaranteed that the
var
will always be filled once theaction
terminates, we seem (once again, epic foreshadowment) to be guaranteed that thetakeMVar
call will always wait foraction
to somehow terminate, and once it's terminatedwill return either the exception is died with or the success
result.
Deadlock protection
Alright, I've foreshadowed enough already, I'm guessing
everyone's figured out where the problem is going to come. Sure
enough, if we look at the program I started this post with, we see
that tryAny mayThrow3 >>= print
does not catch the exception generated by mayThrow3
, but instead
itself dies from GHC's deadlock protection
(BlockedIndefinitelyInMVar
). Simon Marlow explained the reason for this to me twice, once in his (really really good) book, and once when I raised issue #14 on async because I hadn't been smart enough to connect the dots from the book to my actual issue.
As long as I'm linking so much to Github, you may also like seeing
my pull request to work around issue #14.
Don't worry, I'm not going to make you read all of that. The
summary is pretty simple: GHC's run time system has very
straightforward deadlock detection. When it finds a group of
threads that are all blocked on blocking mutable variables
(MVar
s or STM
variables), and finds that
no other threads have references to the variables it question, it
determines that all of the threads are deadlocked, and sends all of
them asynchronous exceptions (either
BlockedIndefinitelyOnMVar
or BlockedIndefinitelyOnSTM
).
So let's analyze tryAny mayThrow3
in excruciating detail. The series of events here is:
Create a new empty
MVar
to allow the child thread to pass back the result ofmayThrow3
to the parent thread.Fork a new child thread (yes, exceptions are masked, but that's not terribly important for us).
Back in the parent thread: block waiting for that empty
MVar
to get filled.In the child thread, set up
try
to catch allexceptions, and then run the user's action (with async exceptions
unmasked).
As it happens, I carefully crafted
mayThrow3
to deadlock on anMVar
. So now the child thread can't make progress, and is fully deadlocked.
OK, the stage is set: the main thread is blocked on an MVar
for the result. As I proved above, we know for a fact that this MVar
will ultimately be filled with the
result of the user action, once the user action completes or dies
with an exception. However, GHC doesn't know this, it just sees the main thread blocked on an MVar
action, and
that only the main thread and child thread have references to that
variable.
In the child thread, we've created a new MVar
deadlock. This is a true and legitimate deadlock (if you don't
believe me, go back and look at the implementation of
mayThrow3
). So GHC decides that both the main thread and the child thread are currently blocked on MVar
actions, and there is no other thread that can unblock either of
them. And GHC is correct. However, the next part is the unfortunate
part:
GHC kills both threads with an asynchronous exceptions
Why is this unfortunate? Because we know that if the child
thread receives the async exception, it will exit, the result
MVar
will get filled, and the main thread can complete
successfully. That's exactly the behavior we want, but not the
behavior GHC is able to deliver to us.
The workaround
The workaround I'm about to demonstrate is not elegant, and it
leaves a bad taste in my mouth. If anyone has a better idea, let me
know. The idea is this: we know that there's no legitimate reason
for the takeMVar
inside tryAny
to be killed with a BlockedIndefinitelyOnMVar
, even if GHC
doesn't know that. So we outsmart GHC by catching and ignoring that
specific exception. And just in case our logic is wrong and there's
a flawed implementation here, we only catch-and-ignore a fixed
number of times (arbitrary choice: 10) to avoid creating a real
deadlock.
I've added a commit for enclosed-exceptions that implements this behavior, but for the lazy, here's a simplified version of it:
Swallow the bit of vomit rising in the back of your throat, try
this out in the original program, and see if it solves the issue.
You should see the following output:
Note that the "thread blocked" has actually been caught, and the final done!
line is actually printed.
Problem with STM-based tryAny
tryAny
was already implemented safely in terms of the async package, which uses STM
in place of MVar
s as we did here. Mitchell Rosen came up with a really convincing demonstration for why we shouldn't use STM for this purpose. To quote, with the program:
We get the rather surprising output:
The reason is that decode
is calling a bunch of IO
functions as FFI calls to the libyaml library, and is wrapped in unsafePerformIO
to make the morally-pure YAML-decoding action not live in IO
. Under the surface, it uses tryAny
to avoid unexpected exceptions from breaking out of the unsafePerformIO
dungeon it
lives in. That's all well and good, but now if you use
decode
inside an atomically
block, you have two layers of STM
occurring, which is strictly forbidden.
Extra credit Figure out why I needed to use $!
to make this fail.
Our distributed computing problem
This problem arose about two months ago, and is far more
complicated than the examples I've discussed until now. In fact,
I've so far been unsuccessful at reproducing this with a minimal
test case, which is definitely annoying. Nonetheless, I can explain
what basically happened:
Have a queue of work items to be performed, represented by a
TChan
Have one thread (filler) grabbing work items from somewhere (like a socket) and putting them on the
TChan
Have another thread (worker) grabbing work items and performing them
The filler has logic to detect when no more items are incoming, and exits
Run these two threads together using the
race
function, which guarantees that once the filler exits, the worker
will be killed too
Unfortunately, this runs into the same problem I described
above, at least in some circumstances. The thread calling
race
is blocked on an MVar
under the
surface, which filler and worker both have access to. filler
completes, puts something in the MVar
, and then disappears, losing its reference to the MVar
and the TChan
. Meanwhile, worker is still blocked waiting on the TChan
, to which only it has access now, resulting in a deadlock condition.
In our application, we ended up with a situation where the worker thread was deadlocked on the TChan
, and the thread calling race
was blocked on the MVar
. It's a slightly more complicated case than this though, since in the simple race
case, the thread calling race
doesn't actually get blocked on the MVar
. Nonetheless, it's another interesting case where the deadlock prevention kills too many threads.
Fortunately for our application, there's a very simple - and
more robust - solution, which I recommend to anyone doing this kind
of concurrent programming. The problem was that our worker thread
had no logic to terminate itself, and instead just waited to be
killed by race
. Instead, we switched over to a closeable TChan
implementation, where the filler thread could explicitly closed out the TChan
and then
the worker would kill itself instead of waiting around for an async
exception.
So to leave this blog post with some recommendations, and a
small amount of extra foreshadowment (fourth time getting to use
that word in one blog post!):
Explicit exit conditions are better than implicit ones
Avoid relying on async exceptions for control flow if you can
Relying on the deadlock detection to clean up your messes is a bad idea
And even if you don't want the deadlock detection to clean up your messes, it may try to do so for you badly anyway