Adventures on the quest for long-term reproducible deployment
Rebuilding software five years later, how hard can it be? It can’t be that hard, especially when you pride yourself on having a tool that can travel in time and that does a good job at ensuring reproducible builds, right?
In hindsight, we can tell you: it’s more challenging than it
seems. Users attempting to travel 5 years back with guix time-machine
are (or were) unavoidably going to hit bumps on the road—a real
problem because that’s one of the use cases Guix aims to support well,
in particular in a reproducible
research context.
In this post, we look at some of the challenges we face while traveling back, how we are overcoming them, and open issues.
The vision
First of all, one clarification: Guix aims to support time travel, but we’re talking of a time scale measured in years, not in decades. We know all too well that this is already very ambitious—it’s something that probably nobody except Nix and Guix are even trying. More importantly, software deployment at the scale of decades calls for very different, more radical techniques; it’s the work of archivists.
Concretely, Guix 1.0.0 was released in 2019 and our goal is to allow users to travel as far back as 1.0.0 and redeploy software from there, as in this example:
$ guix time-machine -q --commit=v1.0.0 -- \
environment --ad-hoc python2 -- python
> guile: warning: failed to install locale
Python 2.7.15 (default, Jan 1 1970, 00:00:01)
[GCC 5.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
(The command above uses guix environment
, the predecessor of guix shell
,
which didn’t exist back then.)
It’s only 5 years ago but it’s pretty much remote history on the scale
of software evolution—in this case, that history comprises major
changes in Guix
itself and
in Guile.
How well does such a command work? Well, it depends.
The project has two build farms; bordeaux.guix.gnu.org
has been
keeping substitutes (pre-built binaries) of everything it built since
roughly 2021, while ci.guix.gnu.org
keeps substitutes for roughly two
years, but there is currently no guarantee on the duration
substitutes may be retained.
Time traveling to a period where substitutes are available is
fine: you end up downloading lots of binaries, but that’s OK, you rather
quickly have your software environment at hand.
Bumps on the build road
Things get more complicated when targeting a period in time for which
substitutes are no longer available, as was the case for v1.0.0
above.
(And really, we should assume that substitutes won’t remain available
forever: fellow NixOS hackers recently had to seriously consider
trimming their 20-year-long history of
substitutes
because the costs are not sustainable.)
Apart from the long build times, the first problem that arises in the absence of substitutes is source code unavailability. I’ll spare you the details for this post—that problem alone would deserve a book. Suffice to say that we’re lucky that we started working on integrating Guix with Software Heritage years ago, and that there has been great progress over the last couple of years to get closer to full package source code archival (more precisely: 94% of the source code of packages available in Guix in January 2024 is archived, versus 72% of the packages available in May 2019).
So what happens when you run the time-machine
command above? It
brings you to May 2019, a time for which none of the official build
farms had substitutes until a few days ago. Ideally, thanks to
isolated build
environments,
you’d build things for hours or days, and in the end all those binaries
will be here just as they were 5 years ago. In practice though, there
are several problems that isolation as currently implemented does not
address.
Among those, the most frequent problem is time traps: software build processes that fail after a certain date (these are also referred to as “time bombs” but we’ve had enough of these and would rather call for a ceasefire). This plagues a handful of packages out of almost 30,000 but unfortunately we’re talking about packages deep in the dependency graph. Here are some examples:
- OpenSSL unit tests fail after a certain date because some of the X.509 certificates they use have expired.
- GnuTLS had similar issues; newer versions rely on datefudge to fake the date while running the tests and thus avoid that problem altogether.
- Python 2.7, found in Guix 1.0.0, also had that problem with its TLS-related tests.
- OpenJDK would fail to build at some
point with this interesting
message:
Error: time is more than 10 years from present: 1388527200000
(the build system would consider that its data about currencies is likely outdated after 10 years). - Libgit2, a dependency of Guix, had (has?) a time-dependent test.
- MariaDB tests started failing in 2019.
Someone traveling to v1.0.0
will hit several of these, preventing
guix time-machine
from completing. A serious bummer, especially to
those who’ve come to Guix from the perspective of making their research
workflow
reproducible.
Time traps are the main road block, but there’s more! In rare cases, there’s software influenced by kernel details not controlled by the build daemon:
- Tests of the hwloc hardware locality library would fail when running on a Btrfs file system.
In a handful of cases, but important ones, builds might fail when performed on certain CPUs. We’re aware of at least two cases:
- Python 3.9 to 3.11 would set a signal handler stack too small for use on Intel Sapphire Rapids Xeon CPUs (it’s more complicated than this but the end result is: it will no longer build on modern hardware).
- Firefox would reportedly crash on Raptor Lake CPUs running an buggy version of their firmware.
Neither time traps nor those obscure hardware-related issues can be avoided with the isolation mechanism currently used by the build daemon. This harms time traveling when substitutes are unavailable. Giving up is not in the ethos of this project though.
Where to go from here?
There are really two open questions here:
- How can we tell which packages need to be “fixed”, and how: building at a specific date, on a specific CPU?
- How can keep those aspects of the build environment (time, CPU variant) under control?
Let’s start with #2. Before looking for a solution, it’s worth remembering where we come from. The build daemon runs build processes with a separate root file system, under dedicated user IDs, and in separate Linux namespaces, thereby minimizing interference with the rest of the system and ensuring a well-defined build environment. This technique was implemented by Eelco Dolstra for Nix in 2007 (with namespace support added in 2012), at a time where the word container had to do with boats and before “Docker” became the name of a software tool. In short, the approach consists in controlling the build environment in every detail (it’s at odds with the strategy that consists in achieving reproducible builds in spite of high build environment variability). That these are mere processes with a bunch of bind mounts makes this approach inexpensive and appealing.
Realizing we’d also want to control the build environment’s date,
we naturally turn to Linux namespaces to address that—Dolstra, Löh, and
Pierron already suggested something along these lines in the conclusion
of their 2010 Journal of Functional Programming
paper. Turns out
there is now a time
namespace.
Unfortunately it’s limited to CLOCK_MONOTONIC
and CLOCK_BOOTTIME
clocks; the manual page states:
Note that time namespaces do not virtualize the
CLOCK_REALTIME
clock. Virtualization of this clock was avoided for reasons of complexity and overhead within the kernel.
I hear you say: What about
datefudge and
libfaketime?
These rely on the LD_PRELOAD
environment variable to trick the dynamic
linker into pre-loading a library that provides symbols such as
gettimeofday
and clock_gettime
. This is a fine approach in some
cases, but it’s too fragile and too intrusive when targeting arbitrary
build processes.
That leaves us with essentially one viable option: virtual machines
(VMs). The full-system QEMU lets you specify the initial real-time
clock of the VM with the -rtc
flag, which is exactly what we need
(“user-land” QEMU such as qemu-x86_64
does not support it). And of
course, it lets you specify the CPU model to emulate.
News from the past
Now, the question is: where does the VM fit? The author considered writing a package transformation that would change a package such that it’s built in a well-defined VM. However, that wouldn’t really help: this option didn’t exist in past revisions, and it would lead to a different build anyway from the perspective of the daemon—a different derivation.
The best strategy appeared to be offloading: the build daemon can offload builds to different machines over SSH, we just need to let it send builds to a suitably-configured VM. To do that, we can reuse some of the machinery initially developed for childhurds that takes care of setting up offloading to the VM: creating substitute signing keys and SSH keys, exchanging secret key material between the host and the guest, and so on.
The end result is a service for Guix System users that can be configured in a few lines:
(use-modules (gnu services virtualization))
(operating-system
;; …
(services (append (list (service virtual-build-machine-service-type))
%base-services)))
The default setting above provides a 4-core VM whose initial date is January 2020, emulating a Skylake CPU from that time—the right setup for someone willing to reproduce old binaries. You can check the configuration like this:
$ sudo herd configuration build-vm
CPU: Skylake-Client
number of CPU cores: 4
memory size: 2048 MiB
initial date: Wed Jan 01 00:00:00Z 2020
To enable offloading to that VM, one has to explicitly start it, like so:
$ sudo herd start build-vm
From there on, every native build is offloaded to the VM. The key part is that with almost no configuration, you get everything set up to build packages “in the past”. It’s a Guix System only solution; if you run Guix on another distro, you can set up a similar build VM but you’ll have to go through the cumbersome process that is all taken care of automatically here.
Of course it’s possible to choose different configuration parameters:
(service virtual-build-machine-service-type
(virtual-build-machine
(date (make-date 0 0 00 00 01 10 2017 0)) ;further back in time
(cpu "Westmere")
(cpu-count 16)
(memory-size (* 8 1024))
(auto-start? #t)))
With a build VM with its date set to January 2020, we have been able to
rebuild Guix and its dependencies along with a bunch of packages such as
emacs-minimal
from v1.0.0
, overcoming all the time traps and other
challenges described earlier. As a side effect, substitutes
are now available from ci.guix.gnu.org
so you can even try this at
home without having to rebuild the world:
$ guix time-machine -q --commit=v1.0.0 -- build emacs-minimal --dry-run
guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
38.5 MB would be downloaded:
/gnu/store/53dnj0gmy5qxa4cbqpzq0fl2gcg55jpk-emacs-minimal-26.2
For the fun of it, we went as far as v0.16.0
, released in December
2018:
guix time-machine -q --commit=v0.16.0 -- \
environment --ad-hoc vim -- vim --version
This is the furthest we can go since channels and the underlying mechanisms that make time travel possible did not exist before that date.
There’s one “interesting” case we stumbled upon in that process: in OpenSSL 1.1.1g (released April 2020 and packaged in December 2020), some of the test certificates are not valid before April 2020, so the build VM needs to have its clock set to May 2020 or thereabouts. Booting the build VM with a different date can be done without reconfiguring the system:
$ sudo herd stop build-vm
$ sudo herd start build-vm -- -rtc base=2020-05-01T00:00:00
The -rtc …
flags are passed straight to QEMU, which is handy when
exploring workarounds…
The time-travel
continuous integration
jobset has been set up to
check that we can, at any time, travel back to one of the past releases.
This at least ensures that Guix itself and its dependencies have
substitutes available at ci.guix.gnu.org
.
Reproducible research workflows reproduced
Incidentally, this effort rebuilding 5-year-old packages has allowed us to fix embarrassing problems. Software that accompanies research papers that followed our reproducibility guidelines could no longer be deployed, at least not without this clock twiddling effort:
- code of [Re] Storage Tradeoffs in a Collaborative Backup Service for Mobile Devices, submitted as part of the ReScience Ten Years Reproducibility Challenge in June 2020, and which is precisely about showcasing reproducible deployment with Guix;
- code of the 2022 Nature Scientific Data article entitled Toward practical transparent verifiable and long-term reproducible research using Guix, which relied on an April 2020 revision of Guix to deploy (Simon Tournier who co-authored the paper reported earlier on a failed attempt showing just how challenging it was).
It’s good news that we can now re-deploy these 5-year-old software environments with minimum hassle; it’s bad news that holding this promise took extra effort.
The ability to reproduce the environment of software that accompanies research work should not be considered a mundanity or an exercise that’s “overkill”. The ability to rerun, inspect, and modify software are the natural extension of the scientific method. Without a companion reproducible software environment, research papers are merely the advertisement of scholarship, to paraphrase Jon Claerbout.
The future
The astute reader surely noticed that we didn’t answer question #1 above:
How can we tell which packages need to be “fixed”, and how: building at a specific date, on a specific CPU?
It’s a fact that Guix so far lacks information about the date, kernel, or CPU model that should be used to build a given package. Derivations purposefully lack that information on the grounds that it cannot be enforced in user land and is rarely necessary—which is true, but “rarely” is not the same as “never”, as we saw. Should we create a catalog of date, CPU, and/or kernel annotations for packages found in past revisions? Should we define, for the long-term, an all-encompassing derivation format? If we did and effectively required virtual build machines, what would that mean from a bootstrapping standpoint?
Here’s another option: build packages in VMs running in the year 2100, say, and on a baseline CPU. We don’t need to require all users to set up a virtual build machine—that would be impractical. It may be enough to set up the project build farms so they build everything that way. This would allow us to catch time traps and year 2038 bugs before they bite.
Before we can do that, the virtual-build-machine
service needs to be
optimized. Right now, offloading to build VMs is as heavyweight as
offloading to a separate physical build machine: data is transferred
back and forth over SSH over TCP/IP. The first step will be to run SSH
over a paravirtualized transport instead such as AF_VSOCK
sockets.
Another avenue would be to make /gnu/store
in the guest VM an overlay
over the host store so that inputs do not need to be transferred and
copied.
Until then, happy software (re)deployment!
Acknowledgments
Thanks to Simon Tournier for insightful comments on a previous version of this post.
Salvo indicação em contrário, as postagens do blog neste site são protegidos por direitos autorais de seus respectivos autores e publicados sob os termos de a licença CC-BY-SA 4.0 e as da Licença de Documentação Livre GNU (versão 1.3 ou posterior, sem Seções Invariantes, sem Textos de capa frontal e sem textos de contracapa).