Identifying software

Ludovic Courtès, Maxim Cournoyer, Jan Nieuwenhuizen, Simon Tournier — March 4, 2024

What does it take to “identify software”? How can we tell what software is running on a machine to determine, for example, what security vulnerabilities might affect it?

In October 2023, the US Cybersecurity and Infrastructure Security Agency (CISA) published a white paper entitled Software Identification Ecosystem Option Analysis that looks at existing options to address these questions. The publication was followed by a request for comments; our comment as Guix developers was not submitted in time to be published, but we’d like to share it here.

Software identification for cybersecurity purposes is a crucial topic, as the white paper explains in its introduction:

Effective vulnerability management requires software to be trackable in a way that allows correlation with other information such as known vulnerabilities […]. This correlation is only possible when different cybersecurity professionals know they are talking about the same software.

The Common Platform Enumeration (CPE) standard has been designed to fill that role; it is used to identify software as part of the well-known Common Vulnerabilities and Exposures (CVE) process. But CPE is showing its limits as an extrinsic identification mechanism: the human-readable identifiers chosen by CPE fail to capture the complexity of what “software” is.

We think functional software deployment as implemented by Nix and Guix, coupled with the source code identification work carried out by Software Heritage, provides a unique perspective on these matters.

On Software Identification

The Software Identification Ecosystem Option Analysis white paper released by CISA in October 2023 studies options towards the definition of a software identification ecosystem that can be used across the complete, global software space for all key cybersecurity use cases.

Our experience lies in the design and development of GNU Guix, a package manager, software deployment tool, and GNU/Linux distribution, which emphasizes three key elements: reproducibility, provenance tracking, and auditability. We explain in the following sections our approach and how it relates to the goal stated in the aforementioned white paper.

Guix produces binary artifacts of varying complexity from source code: package binaries, application bundles (container images to be consumed by Docker and related tools), system installations, system bundles (container and virtual machine images).

All these artifacts qualify as “software” and so does source code. Some of this “software” comes from well-identified upstream packages, sometimes with modifications added downstream by packagers (patches); binary artifacts themselves are the byproduct of a build process where the package manager uses other binary artifacts it previously built (compilers, libraries, etc.) along with more source code (the package definition) to build them. How can one identify “software” in that sense?

Software is dual: it exists in source form and in binary, machine-executable form. The latter is the outcome of a complex computational process taking source code and intermediary binaries as input.

Our thesis can be summarized as follows:

We consider that the requirements for source code identifiers differ from the requirements to identify binary artifacts.
Our view, embodied in GNU Guix, is that:
Source code can be identified in an unambiguous and distributed fashion through inherent identifiers such as cryptographic hashes.
Binary artifacts, instead, need to be the byproduct of a comprehensive and verifiable build process itself available as source code.

In the next sections, to clarify the context of this statement, we show how Guix identifies source code, how it defines the source-to-binary path and ensures its verifiability, and how it provides provenance tracking.

Source Code Identification

Guix includes package definitions for almost 30,000 packages. Each package definition identifies its origin—its “main” source code as well as patches. The origin is content-addressed: it includes a SHA256 cryptographic hash of the code (an inherent identifier), along with a primary URL to download it.

Since source is content-addressed, the URL can be thought of as a hint. Indeed, we connected Guix to the Software Heritage source code archive: when source code vanishes from its original URL, Guix falls back to downloading it from the archive. This is made possible thanks to the use of inherent (or intrinsic) identifiers both by Guix and Software Heritage.
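The idea that a URL is only a hint can be sketched in a few lines of shell. This is an illustration under stated assumptions, not Guix's actual downloader: the function names sha256_of and fetch_by_hash are hypothetical, local file paths stand in for the upstream URL and the archive, and GNU coreutils (sha256sum, mktemp) are assumed to be available.

```shell
#!/bin/sh
# Illustrative sketch only; not how Guix implements its fallback.

# sha256_of FILE: print the hexadecimal SHA-256 digest of FILE.
sha256_of() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# fetch_by_hash EXPECTED OUTPUT CANDIDATE...
# Try each candidate location in turn (local paths stand in for URLs)
# and keep the first one whose content matches the expected hash.
fetch_by_hash() {
    want=$1; out=$2; shift 2
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ "$(sha256_of "$candidate")" = "$want" ]; then
            cp "$candidate" "$out"
            return 0
        fi
    done
    return 1
}

# Demo: the "upstream" file has vanished, a "mirror" serves different
# content, and the "archive" still holds the expected bytes.
workdir=$(mktemp -d)
printf 'hello source\n' > "$workdir/archive-copy.tar"
printf 'tampered\n' > "$workdir/mirror-copy.tar"
expected=$(sha256_of "$workdir/archive-copy.tar")

fetch_by_hash "$expected" "$workdir/source.tar" \
    "$workdir/upstream-gone.tar" \
    "$workdir/mirror-copy.tar" \
    "$workdir/archive-copy.tar" \
    && echo "fetched source matching $expected"
```

Because the hash, not the location, decides whether a download is accepted, any number of fallback locations can be added without weakening the guarantee.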

More information can be found in this 2019 blog post and in the documents of the Software Hash Identifiers (SWHID) working group.

Reproducible Builds

Guix provides a verifiable path from source code to binaries by ensuring reproducible builds. To achieve that, Guix builds upon the pioneering research work of Eelco Dolstra that led to the design of the Nix package manager, with which it shares the same conceptual foundation.

Namely, Guix relies on hermetic builds: builds are performed in isolated environments that contain nothing but explicitly-declared dependencies—where a “dependency” can be the output of another build process or source code, including build scripts and patches.

An implication is that builds can be verified independently. For instance, for a given version of Guix, guix build gcc should produce the exact same binary, bit-for-bit. To facilitate independent verification, guix challenge gcc compares the binary artifacts of the GNU Compiler Collection (GCC) as built and published by different parties. Users can also compare to a local build with guix build gcc --check.
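The comparison that such independent verification relies on can be sketched as follows. This is a conceptual illustration only, not what guix challenge does internally: the function same_build and the file names are hypothetical, small text files stand in for real build outputs, and GNU coreutils are assumed.

```shell
#!/bin/sh
# Illustrative sketch only: compare the "same" artifact as obtained
# from several parties by comparing cryptographic digests.

# same_build FILE...: succeed only if all files have the same SHA-256 digest.
same_build() {
    distinct=$(sha256sum "$@" | cut -d ' ' -f 1 | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}

# Demo: two servers publish identical bytes; a third build embeds a
# timestamp and is therefore not bit-for-bit identical.
workdir=$(mktemp -d)
printf 'compiler-bytes' > "$workdir/from-server-a"
printf 'compiler-bytes' > "$workdir/from-server-b"
printf 'compiler-bytes built at 12:00' > "$workdir/from-local-build"

if same_build "$workdir/from-server-a" "$workdir/from-server-b"; then
    echo "server A and server B agree"
fi
if ! same_build "$workdir/from-server-a" "$workdir/from-local-build"; then
    echo "local build differs: investigate the source of non-determinism"
fi
```

A mismatch does not say which party is wrong; it only signals that the build is not reproducible or that one of the artifacts was altered.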

As with Nix, build processes are identified by derivations, which are low-level, content-addressed build instructions; derivations may refer to other derivations and to source code. For instance, /gnu/store/c9fqrmabz5nrm2arqqg4ha8jzmv0kc2f-gcc-11.3.0.drv uniquely identifies the derivation to build a specific variant of version 11.3.0 of the GNU Compiler Collection (GCC). Changing the package definition (the patches being applied, the build flags, the set of dependencies), or similarly changing one of the packages it depends on, leads to a different derivation (more information can be found in Eelco Dolstra's PhD thesis).

Derivations form a graph that captures the entirety of the build processes leading to a binary artifact. In contrast, mere package name/version pairs such as gcc 11.3.0 fail to capture the breadth and depth of the elements that lead to a binary artifact. This is a shortcoming of systems such as the Common Platform Enumeration (CPE) standard: it fails to express whether a vulnerability that applies to gcc 11.3.0 applies to it regardless of how it was built, patched, and configured, or whether certain conditions are required.
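A toy model makes the difference with name/version pairs concrete. The sketch below is not the hashing scheme used by Guix or Nix: the function drv_id, the build flag, the patch name, and the glibc version are all hypothetical and chosen for illustration, and GNU coreutils are assumed. It only shows that when an identifier is computed from a package's own definition and from the identifiers of its inputs, a change anywhere in the graph propagates to everything that depends on it.

```shell
#!/bin/sh
# Illustrative sketch only; real derivation hashing is more involved.

# drv_id NAME VERSION FLAGS [INPUT_ID...]
# Compute a toy identifier from the package's own definition and from
# the identifiers of the inputs it is built with.
drv_id() {
    printf '%s\n' "$@" | sha256sum | cut -c 1-32
}

# The same gcc 11.3.0 definition, built against two variants of its input.
libc=$(drv_id glibc 2.35 "")
gcc=$(drv_id gcc 11.3.0 "--enable-lto" "$libc")

libc_patched=$(drv_id glibc 2.35 "patch=hypothetical-fix.patch")
gcc_rebuilt=$(drv_id gcc 11.3.0 "--enable-lto" "$libc_patched")

echo "gcc 11.3.0 against stock libc:   $gcc"
echo "gcc 11.3.0 against patched libc: $gcc_rebuilt"
```

Both results would be reported as gcc 11.3.0 by a name/version scheme, yet they are distinct builds with distinct identifiers.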

Full-Source Bootstrap

Reproducible builds alone cannot ensure the source-to-binary correspondence: the compiler could contain a backdoor, as demonstrated by Ken Thompson in Reflections on Trusting Trust. To address that, Guix goes further by implementing so-called full-source bootstrap: for the first time, literally every package in the distribution is built from source code, starting from a very small binary seed. This gives an unprecedented level of transparency, allowing code to be audited at all levels, and improving robustness against the “trusting-trust attack” described by Ken Thompson.

The European Union recognized the importance of this work through an NLnet Privacy & Trust Enhancing Technologies (NGI0 PET) grant allocated in 2021 to Jan Nieuwenhuizen to further work on full-source bootstrap in GNU Guix, GNU Mes, and related projects, followed by another grant in 2022 to expand support to the Arm and RISC-V CPU architectures.

Provenance Tracking

We define provenance tracking as the ability to map a binary artifact back to its complete corresponding source. Provenance tracking is necessary to allow the recipient of a binary artifact to access the corresponding source code and to verify the source/binary correspondence if they wish to do so.

The guix pack command can be used to build, for instance, container images. Running guix pack -f docker python --save-provenance produces a self-describing Docker image containing the binaries of Python and its run-time dependencies. The image is self-describing because the --save-provenance flag leads to the inclusion of a manifest that describes which revision of Guix was used to produce this binary. A third party can retrieve this revision of Guix and from there view the entire build dependency graph of Python, view its source code and any patches that were applied, and do the same recursively for its dependencies.

To summarize, capturing the revision of Guix that was used is all it takes to reproduce a specific binary artifact. This is illustrated by the time-machine command. The example below deploys, at any time on any machine, the specific build artifact of the python package as it was defined in this Guix commit:

guix time-machine -q --commit=d3c3922a8f5d50855165941e19a204d32469006f \
  -- install python

In other words, because Guix itself defines how artifacts are built, the revision of the Guix source, together with the package name, unambiguously identifies the package’s binary artifact. As scientists, we build on this property to achieve reproducible research workflows, as explained in this 2022 article in Nature Scientific Data; as engineers, we value this property to analyze the systems we are running and determine which known vulnerabilities and bugs apply.
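As a small sketch of this property, the pair (Guix revision, package name) is enough to derive the command that rebuilds the artifact. The helper reproduce_command below is hypothetical; it only prints a command line and does not run Guix. It uses the build subcommand rather than install, on the assumption that a verifier wants to rebuild the artifact without changing their profile.

```shell
#!/bin/sh
# Illustrative sketch only: turn a (commit, package) pair into the
# command that would rebuild the corresponding artifact.

# reproduce_command COMMIT PACKAGE
# Print the command line; reject anything that is not a full
# 40-character lowercase hexadecimal commit ID.
reproduce_command() {
    commit=$1; package=$2
    case $commit in
        "" | *[!0-9a-f]*)
            echo "not a full commit ID: $commit" >&2
            return 1 ;;
    esac
    if [ "${#commit}" -ne 40 ]; then
        echo "not a full commit ID: $commit" >&2
        return 1
    fi
    printf 'guix time-machine -q --commit=%s -- build %s\n' "$commit" "$package"
}

reproduce_command d3c3922a8f5d50855165941e19a204d32469006f python
```

Requiring a full commit ID rather than a branch name or abbreviation keeps the identifier unambiguous over time.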

Again, a software bill of materials (SBOM) written as a mere list of package name/version pairs would fail to capture as much information. The Artifact Dependency Graph (ADG) of OmniBOR, while less ambiguous, falls short in two ways: it is too fine-grained for typical cybersecurity applications (at the level of individual source files), and it only captures the alleged source/binary correspondence of individual files but not the process to go from source to binary.

Conclusions

Inherent identifiers lend themselves well to unambiguous source code identification, as demonstrated by Software Heritage, Guix, and Nix.

However, we believe binary artifacts should instead be treated as the result of a computational process; it is that process that needs to be fully captured to support independent verification of the source/binary correspondence. For cybersecurity purposes, recipients of a binary artifact must be able to map it back to its source code (provenance tracking), with the additional guarantee that they must be able to reproduce the entire build process to verify the source/binary correspondence (reproducible builds and full-source bootstrap). As long as binary artifacts result from a reproducible build process, itself described as source code, identifying binary artifacts boils down to identifying the source code of their build process.

These ideas are developed in the 2022 scientific paper Building a Secure Software Supply Chain with GNU Guix.
