stack: more binary package sharing

Sep 1, 2015

This blog post describes a new feature in stack. Until now, multiple projects using the same snapshot could share the binary builds of packages. However, two separate snapshots could not share the binary builds of their packages, even if they were substantially identical. That's now changing.

tl;dr: stack will now be able to install new snapshots much more quickly, with less disk space usage, than previously.

This has been a known shortcoming since stack was first released. It's not coincidental that this support is being added not long after a similar project completed for Cabal. Ryan Trinkle, Vishal's mentor on the project, described the work to me a few months back, and I decided to wait to see the outcome of the project before working on the feature in stack.

The improvements to Cabal here are superb, and I'm thrilled to see them happening. However, after reviewing and discussing with a few stack developers and users, I decided to implement a different approach that doesn't take advantage of the new Cabal changes. The reasons are:

As Herbert very aptly pointed out on Reddit: "Since Stack sandboxes everything maximum sharing between LTS versions can easily be implemented going back to GHC 7.0 without this new multi-instance support. This multi-instance support is needed if you want to accomplish the same thing without isolated sandboxes in a single package db."

There are some usability concerns around a single massive database with all packages in it. Specifically, there are potential problems around getting GHC to choose a coherent set of packages when using something like ghci or runghc. Hopefully some concept of views will be added (as Duncan described in the original proposal), but the implications still need to be worked out.

stack users are impatient (and I mean that in the best way possible). Why wait for a feature when we could have it now? While the Cabal Google Summer of Code project is complete, the changes are not yet merged to master, much less released. stack would need to wait until those changes are readily available to end users before relying on them.

stack's implementation

I came up with some complicated approaches to the problem, but ultimately a comment from Aaron Wolf rang true:

check the version differences and just copy compiled binaries from previous LTS for unchanged items

It turns out that this is really easy. The implementation ends up having two components:

Whenever a snapshot package is built, write a precompiled cache file containing the filepaths of the library's .conf file (from inside the package database) and all of the executables installed.

Before building a snapshot package, check for a precompiled cache file. If the file exists, copy over the executables and register the .conf file into the new snapshot's database.
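
The two steps can be sketched roughly as follows. This is an illustrative Python sketch, not stack's actual Haskell code: the JSON cache format, the function names write_precompiled_cache and reuse_precompiled, and the use of a plain file copy in place of real package registration are all assumptions made for the example.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from pathlib import Path


def write_precompiled_cache(cache_file: Path, conf_file: Path, executables: list[Path]) -> None:
    """After a build: record where the library .conf file and executables live."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "library": str(conf_file),
        "executables": [str(exe) for exe in executables],
    }
    # Hypothetical on-disk format; the post does not specify one.
    cache_file.write_text(json.dumps(record))


def reuse_precompiled(cache_file: Path, pkgdb_dir: Path, bin_dir: Path) -> bool:
    """Before a build: reuse earlier outputs if a cache file exists.

    Returns True when the package was reused, False when it must be built.
    """
    if not cache_file.exists():
        return False
    record = json.loads(cache_file.read_text())
    pkgdb_dir.mkdir(parents=True, exist_ok=True)
    bin_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for registering the .conf file into the new package database.
    shutil.copy(record["library"], pkgdb_dir)
    for exe in record["executables"]:
        shutil.copy(exe, bin_dir)
    return True


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    conf = root / "old-snapshot" / "pkgdb" / "example-1.0.conf"
    conf.parent.mkdir(parents=True)
    conf.write_text("name: example\n")
    cache = root / "precompiled" / "example-1.0" / "some-hash"

    write_precompiled_cache(cache, conf, [])
    reused = reuse_precompiled(cache, root / "new-snapshot" / "pkgdb", root / "new-snapshot" / "bin")
    print("reused" if reused else "needs building")
```

In this sketch a missing cache file simply means "build as usual", so the cache is purely an optimisation: nothing breaks if it is wiped.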

That precompiled cache file's path looks something like this:

/home/vagrant/.stack/precompiled/ghc-7.10.2/1.22.4.0/aeson-0.8.0.2/Vr6rCTNr+UeoWMN1qGJGhFfxIDSFqTgJixKuD6TtVEQ

This encodes the GHC version, Cabal version, package name, and package version. The last component is a hash of all of the configuration information, including flags, GHC options, and dependencies. Putting that hash in the filepath ensures that when we look up a precompiled package, we're getting something that matches what we'd be building ourselves now.
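
To make the path layout concrete, here is a minimal Python sketch of how such a path could be derived. It is not stack's implementation: the function name precompiled_cache_path, the way the configuration is serialised, and the choice of SHA-256 with URL-safe base64 are assumptions for illustration, and the dependency strings in the usage note are made-up values.

```python
import base64
import hashlib
from pathlib import PurePosixPath


def precompiled_cache_path(root, ghc, cabal, name, version, flags, ghc_options, dependencies):
    """Build <root>/precompiled/ghc-<ghc>/<cabal>/<name>-<version>/<hash>."""
    # Serialise the configuration deterministically so equal inputs give equal hashes.
    # (Assumed encoding; the post only says these inputs are hashed.)
    config = repr((sorted(flags.items()), list(ghc_options), sorted(dependencies)))
    digest = hashlib.sha256(config.encode("utf-8")).digest()
    # URL-safe base64 avoids '/' inside a path component; padding is dropped.
    key = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return (
        PurePosixPath(root)
        / "precompiled"
        / f"ghc-{ghc}"
        / cabal
        / f"{name}-{version}"
        / key
    )


if __name__ == "__main__":
    path = precompiled_cache_path(
        root="/home/vagrant/.stack",
        ghc="7.10.2",
        cabal="1.22.4.0",
        name="aeson",
        version="0.8.0.2",
        flags={},
        ghc_options=[],
        dependencies=["text-1.2.1.3"],  # illustrative dependency identifier
    )
    print(path)
```

Changing any hashed input, such as a flag or a dependency version, changes only the final path component, so the lookup misses and the package is built normally.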

The reason we can get away with this approach in stack is because of the invariants of a snapshot, namely: each snapshot has precisely one version of a package available, and therefore we have no need to deal with the new multi-instance installations GHC 7.10 supports. This also means no concern around views: a snapshot database is by its very nature a view.
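
Because a snapshot pins exactly one version per package, it can be modelled as a plain mapping from package name to version, and deciding what can be shared between two snapshots becomes a simple comparison. The Python sketch below is a simplified model, not stack's logic: the function name reusable_packages is hypothetical, the version numbers are illustrative, and it assumes a package is reusable only if its own version and all of its transitive dependencies are unchanged.

```python
def reusable_packages(old, new):
    """Return the names in `new` that could reuse builds made for `old`.

    Each snapshot is a dict: {name: (version, [dependency names])}.
    One entry per name mirrors the one-version-per-package invariant.
    """
    memo = {}

    def unchanged(name):
        if name in memo:
            return memo[name]
        memo[name] = False  # provisional value; also guards against cyclic input
        if name in old and name in new:
            old_version, old_deps = old[name]
            new_version, new_deps = new[name]
            memo[name] = (
                old_version == new_version
                and sorted(old_deps) == sorted(new_deps)
                and all(unchanged(dep) for dep in new_deps)
            )
        return memo[name]

    return {name for name in new if unchanged(name)}


if __name__ == "__main__":
    # Illustrative snapshots; the version numbers are made up for the example.
    older = {
        "text": ("1.2.1.3", []),
        "aeson": ("0.8.0.2", ["text"]),
        "lens": ("4.12.3", ["text"]),
    }
    newer = {
        "text": ("1.2.1.3", []),
        "aeson": ("0.8.0.2", ["text"]),
        "lens": ("4.13", ["text"]),
    }
    print(sorted(reusable_packages(older, newer)))
```

In this model, bumping a low-level package invalidates everything built on top of it, which matches the idea that dependencies are part of the hashed configuration.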

Advantages

Decreased compile times
Decreased disk space usage

Downsides

You can't reliably delete a single snapshot, as there can be
files shared between different snapshots. Deleting a single
snapshot was never an officially supported feature previously, but
if you knew what you were doing, you could do it safely.

After discussing with others, this trade-off seems acceptable: the overall decrease in disk space usage means that the desire to delete a single snapshot will be reduced. When real disk space reclaiming needs to happen, the recommended approach will be to wipe all snapshots and start over, which (1) will be an infrequent occurrence, and (2) due to the faster compile times, will be less burdensome.