Reproducible builds: a status update
With the yearly Reproducible Build Summit starting today, now’s a good time for an update on what has happened in Guix land in that area!
Isolated build environments are very helpful to achieve reproducible builds, but they are not sufficient: timestamps and non-determinism can still make a package build non-reproducible. Developers can rely on guix build --check and guix challenge to identify non-reproducible builds.
This article provides an overview of the progress made to fix non-reproducibility issues in packages over the year, and then goes on to show a very concrete way for Guix to take advantage of reproducible builds.
Building reproducibly
Tools that produce build artifacts occupy a key role: if their output is non-reproducible, then lots of packages that use them will be non-reproducible as a result. Among those packages, we fixed:
- GNU R (timestamps in .rds files and man pages; random temporary file names recorded in generated files);
- GNU Guile (order-sensitive symbol generation during macro expansion);
- Ghostscript (timestamps and UUIDs in generated PDF files);
- GNU groff (timestamps in generated files);
- gdk-pixbuf (unsorted directory entries ending up in generated cache files);
- Perl build system (perllocal.pod files were produced in a non-deterministic way).
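As an illustration of the "unsorted directory entries" class of bug, here is a minimal Python sketch. It is not taken from any of the packages above: the function name cache_listing and the file names are made up. The point is only that directory listings come back in arbitrary order, so a tool that writes them into a generated file needs to sort them first.

```python
import os
import tempfile

def cache_listing(directory: str) -> str:
    """Return one entry name per line, always in sorted order.

    Hypothetical stand-in for a tool that records directory contents
    in a generated cache file.  os.listdir() returns entries in
    arbitrary order, so sorting is what makes the output stable.
    """
    return "".join(name + "\n" for name in sorted(os.listdir(directory)))

with tempfile.TemporaryDirectory() as tmp:
    # Made-up file names, created in a deliberately unsorted order.
    for name in ("png.so", "jpeg.so", "gif.so"):
        open(os.path.join(tmp, name), "w").close()
    print(cache_listing(tmp), end="")
```

Without the call to sorted, two builds of the same inputs could legitimately produce cache files that differ only in line order.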
Sometimes we think that an issue is rare and we embark on a trip to fix individual packages that are affected… until we realize that it’s common enough to deserve a global, once-and-for-all fix. This is what happened with timestamps in gzip headers: after blissfully assuming that “almost everyone” uses the -n flag of gzip, we finally introduced a build phase to automatically strip timestamps from gzip headers. This is a subset of what Debian’s strip-nondeterminism achieves, but hey, Scheme integration matters to us!
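To make the gzip case concrete, here is a minimal Python sketch of what stripping the timestamp amounts to. It is an illustration only, not the Scheme build phase used in Guix: the name strip_gzip_timestamp is made up, and the sketch only touches the first member of a gzip stream. It relies on the gzip format (RFC 1952) storing the modification time in bytes 4 to 7 of the header.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def strip_gzip_timestamp(data: bytes) -> bytes:
    """Return DATA with the MTIME header field (bytes 4-7) set to zero.

    Hypothetical helper for illustration; only the first gzip member
    is modified.
    """
    if data[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    return data[:4] + b"\x00" * 4 + data[8:]

payload = b"hello, reproducible world\n"

# The same payload compressed at two different times: the outputs differ...
first = gzip.compress(payload, mtime=1_000_000)
second = gzip.compress(payload, mtime=2_000_000)
assert first != second

# ...but they are byte-identical once the timestamp is zeroed, and the
# compressed payload itself is unaffected.
assert strip_gzip_timestamp(first) == strip_gzip_timestamp(second)
assert gzip.decompress(strip_gzip_timestamp(first)) == payload
```

A zeroed MTIME field is also what gzip -n produces, since it tells gzip not to save the original timestamp.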
There are a number of well-identified issues left to address: Python bytecode, GTK+ icon theme caches, TeX Live, and more. Often, the issue database initiated by Debian is a great resource to find out about issues and fixes.
And the result is…
We recently gained a new build farm called berlin.guixsd.org, which is slated to replace our existing build farm at mirror.hydra.gnu.org. Having set it up as an independent build farm (berlin does not download binaries from hydra), we can challenge build reproducibility by comparing the binaries produced on each of these build farms. Comparing the results of two independent build farms, with different hardware and kernel versions, maximizes the chances of catching all sorts of non-reproducibility issues. The result with today’s master is… drum roll:
$ guix challenge $(guix package -A | cut -f1) \
--substitute-urls="https://mirror.hydra.gnu.org https://berlin.guixsd.org"
…
6,501 store items were analyzed:
- 5,048 (77.6%) were identical
- 533 (8.2%) differed
- 920 (14.2%) were inconclusive
We’re somewhere between 78% and 91%, depending on how the inconclusive items would turn out. That is not as good as Debian yet, but we know what to do next!
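The range follows directly from the guix challenge counts. As a quick check in Python: the lower bound counts only the items known to be identical, and the upper bound assumes every inconclusive item would turn out identical.

```python
# Counts reported by guix challenge above.
identical, differed, inconclusive = 5048, 533, 920
total = identical + differed + inconclusive
assert total == 6501

def percent(count: int, total: int) -> float:
    """Return COUNT as a percentage of TOTAL, rounded to one decimal."""
    return round(100 * count / total, 1)

lower = percent(identical, total)                 # only known-identical items
upper = percent(identical + inconclusive, total)  # if all inconclusive matched

print(f"between {lower}% and {upper}%")
```

This gives 77.6% and 91.8%, which corresponds to the 78% to 91% range quoted above (the upper figure apparently truncated rather than rounded).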
The inconclusive comparisons here can be due to a package that failed to build on one machine, for instance because its test suite fails in a non-deterministic way, or simply because one of the build farms is lagging behind. guix challenge lists all the problematic packages, which makes it easy to retrieve the faulty binaries and investigate.
Reproducible builds = faster downloads!
There’s a very practical advantage to reproducible builds: anyone who publishes binaries is in essence a mirror of our build farm.
Until now, Guix’s public key infrastructure (PKI) was used pretty rigidly: you could download binaries from a server if and only if you had previously authorized its public key. So to download binaries from the person next to you, you would first need to retrieve their public key and authorize it. In addition to being inconvenient, it has the drawback of being an all-or-nothing decision: you would now accept any binary coming from that person. Can’t we do better?
We realized there’s an easy way to exploit the mirroring property mentioned above: assuming I trust binaries from mirror.hydra.gnu.org, then I can download from anyone who publishes the exact same binaries. Put this way, it seems obvious, but it required some adjustments to the substitute code.
To understand what’s going on, let’s look at the metadata guix publish
produces, in a format inherited from Hydra:
$ wget -q -O - https://berlin.guixsd.org/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw.narinfo
StorePath: /gnu/store/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
URL: nar/gzip/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
Compression: gzip
NarHash: sha256:18v7dgny1xna7f53mbkj8bk4y2f00l5rjk2k6hj166kjv964lz7r
NarSize: 637360
References: 3x53yv4v144c9xp02rs64z7j597kkqax-gcc-5.4.0-lib 8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4 n6nvxlk2j8ysffjh3jphn1k5silnakh6-glibc-2.25
FileSize: 218663
System: x86_64-linux
Deriver: pi8686q63rwr4md90vm3qxwhk2g2fvqa-sed-4.4.drv
Signature: 1;berlin.guixsd.org;KHNpZ25hdHVyZSAKIChkYXRhIAogIChmbGFncyByZmM2OTc5KQogIChoYXNoIHNoYTI1NiAjQTRDRjUyMTVGNzlBOEUxRkFBNjIyOEQwQjk0QjMyMTZCRkY1RjA1NkQxMzZENUEzNTFGM0I2OTYzQzc1MzQzMiMpCiAgKQogKHNpZy12YWwgCiAgKGVjZHNhIAogICAociAjMDFDM0NGMEIzRUMwNkIwRUNGMTJEMTU4MkNCMzA2RjkzMEU2Njc1NDNFOEQ2NkZCRjhDRUY4QkQwMkMzOTg1NCMpCiAgIChzICMwRTg2MUEyRjI3MDg2MjVBRDkzMDg5RjFFRjE4NzUwQjIzQjM0RTA5MkFFRkQ3RTlFNkZCMjlCMkMwMURFNjI5IykKICAgKQogICkKIChwdWJsaWMta2V5IAogIChlY2MgCiAgIChjdXJ2ZSBFZDI1NTE5KQogICAocSAjOEQxNTZGMjk1RDI0QjBEOUE4NkZBNTc0MUE4NDBGRjJEMjRGNjBGN0I2QzQxMzQ4MTRBRDU1NjI1OTcxQjM5NCMpCiAgICkKICApCiApCg==
This “narinfo” gives us, among other things, the hash of the sed binary
that berlin.guixsd.org
obtained, the URL where it can be downloaded,
and a signature on this metadata (a base64-encoded canonical
s-expression).
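For readers who want to poke at a narinfo themselves, here is a minimal Python sketch that splits it into fields and decodes the signature. It is not Guix’s own parser: the function names are made up, and the sample reuses fields from the narinfo above with the signature cut down to the first 16 characters of its base64 payload for brevity. The sketch assumes the Signature field consists of a leading version number, a host name, and the base64-encoded s-expression, separated by semicolons, as it appears above.

```python
import base64

def parse_narinfo(text: str) -> dict:
    """Parse 'Key: value' lines into a dictionary (illustrative only)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " only, since values such as
        # "sha256:..." contain colons themselves.
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields

def decode_signature(field: str):
    """Split a Signature field into (version, host, s-expression bytes)."""
    version, host, payload = field.split(";", 2)
    return int(version), host, base64.b64decode(payload)

# Fields copied from the narinfo above; the signature payload is
# truncated to its first 16 base64 characters.
sample = """\
StorePath: /gnu/store/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
URL: nar/gzip/8kib1cirdv0qbmn9hdkjzjfx3n5nw1yw-sed-4.4
Compression: gzip
NarHash: sha256:18v7dgny1xna7f53mbkj8bk4y2f00l5rjk2k6hj166kjv964lz7r
NarSize: 637360
Signature: 1;berlin.guixsd.org;KHNpZ25hdHVyZSAK
"""

info = parse_narinfo(sample)
print(info["NarHash"])
print(decode_signature(info["Signature"]))
```

Decoding the full payload from the narinfo above in the same way yields the complete s-expression, including the signed hash and the public key.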
Guix has supported the ability to specify several substitute servers for a while, with --substitute-urls, but it would filter out narinfos signed by an unauthorized key. The main change was thus to keep narinfos with a hash identical to that advertised by one of the authorized narinfos. For example, if I run:
$ guix build sed \
--substitute-urls="https://somebody.example.org https://mirror.hydra.gnu.org"
Guix will fetch a narinfo from both URLs. If somebody’s narinfo claims the same hash as hydra’s, then Guix will download the actual binary from somebody, which will hopefully be faster than downloading from hydra. Of course, when the download completes, guix-daemon verifies that the hash is really as advertised in the narinfo, such that somebody.example.org cannot trick me into downloading a different binary.
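The selection rule can be sketched in a few lines of Python. This is only a model of the idea, not the actual substitute code, which is written in Scheme: the Narinfo record, the usable_narinfos function, the evil.example.org server and its hash are all invented for the example, and signature checking is reduced to a boolean.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Narinfo:
    server: str       # where the narinfo was fetched from
    nar_hash: str     # the advertised NarHash field
    authorized: bool  # True if signed by a key we have authorized

def usable_narinfos(narinfos):
    """Keep authorized narinfos, plus unauthorized ones that advertise
    a hash also advertised by some authorized narinfo."""
    trusted_hashes = {n.nar_hash for n in narinfos if n.authorized}
    return [n for n in narinfos if n.nar_hash in trusted_hashes]

SED_HASH = "sha256:18v7dgny1xna7f53mbkj8bk4y2f00l5rjk2k6hj166kjv964lz7r"

candidates = [
    Narinfo("https://somebody.example.org", SED_HASH, authorized=False),
    Narinfo("https://mirror.hydra.gnu.org", SED_HASH, authorized=True),
    # Invented server advertising a different, made-up hash.
    Narinfo("https://evil.example.org", "sha256:made-up-hash", authorized=False),
]

for narinfo in usable_narinfos(candidates):
    print(narinfo.server)
```

Here somebody.example.org is kept because it advertises the hash that the authorized server vouches for, while the third server is dropped. If no authorized narinfo is available, nothing unauthorized is kept at all.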
The future
This feature landed in September, and will be in the forthcoming Guix release.
Among the ideas floating around, one is to
have guix publish
advertise itself on the local network via Avahi.
Guix could, optionally, discover neighboring guix publish
instances
and add them to its list of substitute servers. Binaries could
sometimes be downloaded from the local network, which should be faster.
More generally, the role of our build farm shifts from providing binaries to providing metadata about binaries. We can entirely decouple the choice of a metadata server from the choice of a binary provider.
Longer-term, binaries could very well be downloaded from a content-addressed store such as IPFS or GNUnet without having to forgo our existing infrastructure. Peer-to-peer distribution of binaries has been on our minds for a while, but we hadn’t quite realized this decoupling and how it would allow us to support a smooth transition.
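The decoupling can be pictured with a small Python sketch: the hash comes from a trusted metadata source, while the bytes may come from anywhere, as long as they hash to the expected value. Everything here is hypothetical. The fetch_verified function and the provider callables are invented, and the sketch hashes raw bytes with a hexadecimal SHA-256 digest rather than reproducing the hash format and archive serialization that Guix uses.

```python
import hashlib

def fetch_verified(expected_sha256: str, providers):
    """Return the first blob whose SHA-256 matches EXPECTED_SHA256.

    PROVIDERS is an iterable of zero-argument callables returning
    bytes; they stand in for untrusted mirrors, peers on the local
    network, or a content-addressed store.
    """
    for fetch in providers:
        blob = fetch()
        if hashlib.sha256(blob).hexdigest() == expected_sha256:
            return blob
    raise LookupError("no provider returned the expected content")

genuine = b"pretend this is a binary archive"
# In this model, the expected hash is what the trusted metadata
# server would advertise.
trusted_hash = hashlib.sha256(genuine).hexdigest()

providers = [
    lambda: b"tampered content",  # an untrusted provider serving bad data
    lambda: genuine,              # another provider serving the real thing
]

assert fetch_verified(trusted_hash, providers) == genuine
```

Because the check depends only on the content, the providers need no keys and no prior authorization, which is what makes peer-to-peer and content-addressed distribution fit this model.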
These are all exciting prospects, and a nice practical consequence of reproducible builds!