What is a reproducible build?
Deterministic builds and reproducible builds are slightly different concepts, but for the purposes of this article they can be treated as the same thing.
A reproducible build is one where multiple executions of the build process, given the same inputs and the same build environment, produce exactly the same results. This property matters for software development, distribution, and security validation.
A build is reproducible if it produces exactly the same output regardless of when and where it is run. No matter which computer you run it on, what time of day it is, or which external services you access over the network, a reproducible build produces the same output, byte for byte. This is useful both in development (reproducible builds are easy to share across developer machines) and in production (it is easy to confirm that the results of a reproducible build have not been tampered with: rerun the build on your own machine and check that the results match).
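The "rerun and compare" check described above can be sketched as follows. This is only an illustration: the function names `tree_digest` and `builds_match` are made up for this example, and the sketch compares file paths and contents only (file modes and empty directories are ignored).

```python
import hashlib
from pathlib import Path


def tree_digest(root: str) -> str:
    """Return a SHA-256 digest covering every file path and its bytes under root."""
    base = Path(root)
    digest = hashlib.sha256()
    # Sort by relative POSIX path so directory traversal order never affects the result.
    files = sorted(
        (p for p in base.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(base).as_posix(),
    )
    for path in files:
        rel = path.relative_to(base).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so different file layouts cannot hash alike.
        digest.update(len(rel).to_bytes(8, "big") + rel)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()


def builds_match(first_output: str, second_output: str) -> bool:
    """True when two build output directories hold identical files, byte for byte."""
    return tree_digest(first_output) == tree_digest(second_output)
```

Running the build twice (ideally on two different machines) and calling `builds_match` on the two output directories gives a quick yes/no answer; a single embedded timestamp is enough to make it return False.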
The three pillars of reproducible builds
Pillar 1: Repeatable builds
Build repeatability concerns what happens on the build machine itself. Assuming our build inputs are available and nothing changes in the world around us, does our build produce the same output when repeated?
Deterministic installation plan
The first, simplest, and most obvious requirement in a repeatable build is a deterministic dependency installation plan.
In most languages, this is as simple as checking in a lock file. Modern build tools often let projects express their direct dependency requirements as constraints, and then resolve those constraints into an installation plan (a list of dependency name and version pairs to install). Many of these tools also generate lock files that serialize the installation plan. Developers can commit these lock files to version control so that future builds use the same dependency names and versions.
Note that the build of each dependency must also be deterministic (not just the version selection), and a deterministic installation plan alone does not give us that!
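To make the constraint-to-plan step concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `AVAILABLE` registry contents are invented, and the "same major version" constraint is a simplification of the range syntax real package managers use.

```python
import json

# Hypothetical registry contents: package name -> published versions.
AVAILABLE = {
    "left-pad": ["1.0.0", "1.2.0", "1.3.0", "2.0.0"],
    "lodash": ["4.17.20", "4.17.21"],
}


def parse(version: str) -> tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(constraints: dict[str, int], available: dict[str, list[str]]) -> dict[str, str]:
    """Pick the highest published version within each required major version."""
    plan = {}
    for name, major in constraints.items():
        candidates = [v for v in available[name] if parse(v)[0] == major]
        if not candidates:
            raise ValueError(f"no version of {name} satisfies major {major}")
        plan[name] = max(candidates, key=parse)
    return plan


def serialize_lock(plan: dict[str, str]) -> str:
    """Serialize the plan with sorted keys so the lock file text is stable."""
    return json.dumps(plan, indent=2, sort_keys=True) + "\n"
```

The point of committing the output of `serialize_lock` is that `resolve` depends on what the registry contains at the time it runs: publish a new `1.4.0` and the same constraints resolve differently, while the committed plan stays fixed.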
Deterministic build
Once we know what to build, our build itself (including our own code and the build of dependency code) must actually be deterministic.
This may not actually be a problem for projects without a compile step! For example, a Node project whose dependencies are all pure JavaScript needs no additional work to be effectively deterministic.
For projects that do include compilation or transpilation (source-to-source compilation) steps, ensuring determinism is by far the hardest part of making a build reproducible. The compilation process can implicitly introduce non-determinism in a number of ways, including:
- Turing-complete build scripts, which can change the compiled output at will.
- Post-installation scripts that rely on file system lookups for executables or on network calls.
- C bindings to system-installed packages, where systems with different headers may produce different outputs.
- Build steps that read files outside of version control.
- Build steps that generate timestamps from the system time.
- Build steps that download dependencies from the network that are not expressed in the installation plan (for example, an NPM dependency that downloads a cached binary build of its C bindings from GitHub).
- Behavior that changes based on the currently set environment variables, when the environment variable configuration is not committed along with the build.
Not all of these behaviors necessarily introduce non-determinism when set up correctly, but configuring the build process properly can be complex and difficult. For example, you can read this blog post about non-determinism in Chromium builds. Many of these issues can be mitigated by controlling the local build environment, which we discuss in the next section.
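As one illustration of the timestamp item in the list above, here is a sketch of packaging build output so that the archive bytes depend only on file paths and contents. It is not taken from the original article: `deterministic_tar` is a hypothetical helper, `SOURCE_DATE_EPOCH` is a convention from the Reproducible Builds project, and falling back to 0 when it is unset is an assumption of this sketch.

```python
import io
import os
import tarfile
from pathlib import Path


def deterministic_tar(source_dir: str) -> bytes:
    """Pack source_dir into an uncompressed tar with fixed metadata."""
    # Take the timestamp from SOURCE_DATE_EPOCH rather than the system clock.
    mtime = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
    base = Path(source_dir)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as archive:
        # Sorted order: the file system's listing order must not leak into the archive.
        files = sorted(
            (p for p in base.rglob("*") if p.is_file()),
            key=lambda p: p.relative_to(base).as_posix(),
        )
        for path in files:
            data = path.read_bytes()
            info = tarfile.TarInfo(name=path.relative_to(base).as_posix())
            info.size = len(data)
            info.mtime = mtime            # same timestamp on every machine
            info.mode = 0o644             # ignore local permission bits
            info.uid = info.gid = 0       # ignore the local user and group
            info.uname = info.gname = ""
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

Fixing the mode to `0o644` drops executable bits, which is acceptable for a sketch but not for every project. Note also that gzip-compressing the result would add a timestamp of its own unless that is pinned as well.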
Pillar 2: Immutable environment
Even with repeatable builds, we need to make sure that the build inputs don't change. Often, this means building on an immutable snapshot of our environment.
Immutable local environment
As discussed above, a common source of build non-determinism is relying on "dependencies" that the build tool does not capture. C bindings to system libraries are the most common example, but other local environmental factors, such as environment variable settings and files outside version control, can also affect the build.
An easy way to mitigate this issue is to run the build in a known, immutable container. For example, a container runtime like Docker helps ensure that everyone uses the same system dependencies and the same environment variables, and runs on the same file system. In addition, it is easy to verify that the contents of a container match a known-good build container, and if needed, the container can be discarded entirely and recreated from the known-good image.
Note that we specifically said known containers or known container images. It's not enough to just commit a Dockerfile! Why? Because a Dockerfile does not by itself describe a fully reproducible build process for a Docker image: it does not run in an immutable global environment.
Immutable global environment
Build systems often interact with external services to complete tasks such as version resolution and dependency downloads. But external services change frequently.
Running apt install nodejs today gives different results than it did last year, and will probably give different results again next year. That's why a Dockerfile by itself can't describe a reproducible build: running the same Dockerfile at different points in time produces different build outputs!
The simple mitigation here is to pin the build's inputs wherever possible, specifying an exact version (ideally with an exact content hash as well) so that future builds use the same version as the current build. But external services can also change their behavior unexpectedly. A truly pessimistic reproducible build runs internal mirrors for as many of its network resources as possible.
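Checking a download against a pinned content hash can be sketched in a few lines. The names `HashMismatch` and `verify_artifact` are made up for this example, and SHA-256 is an assumption; use whatever digest your tooling records.

```python
import hashlib


class HashMismatch(Exception):
    """Raised when a downloaded artifact does not match its pinned digest."""


def verify_artifact(data: bytes, pinned_sha256: str) -> bytes:
    """Return data unchanged if its SHA-256 matches the pinned hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != pinned_sha256.lower():
        raise HashMismatch(f"expected {pinned_sha256}, got {actual}")
    return data
```

A version number only says what you asked for; the hash says what you actually received. If an external service starts serving different bytes under the same version, the build fails loudly instead of silently producing a different output.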
Pillar 3: Resource availability
Let's say our build is repeatable and the world under our feet doesn't change. All we need now is access to the build inputs. Seems simple, right? Well...
The registry sometimes fails
Most Node developers have experienced at least one NPM outage, during which build pipelines that did not cache or mirror NPM packages were disrupted. Many Node developers have also experienced the left-pad and faker removals, which severely damaged the NPM ecosystem and effectively amounted to outages.
The only reliable way to mitigate such build breaks is to run your own package registry mirror. When the external service is unavailable, the mirror can remain online; when the official registry deletes an old package, the mirror can continue to serve it. The same principle applies to other remote services: unless you run your own mirror, a build pipeline is only as available as the services it depends on.
Choosing to run a mirror of a service is always a delicate trade-off. On the one hand, registries like NPM have dedicated engineering and operations teams with the expertise to keep these systems online. On the other hand, it's much easier to run a small mirror for a small set of dependencies than to mirror all of NPM. You should make mirroring decisions based on the specifics of each service, taking into account the external service's historical reliability and your team's build availability and staffing needs.
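From the build's point of view, a mirror is just another source to try. A minimal sketch of that lookup, with the fetchers passed in as plain functions so that no particular registry client is assumed (`fetch_with_fallback` and the `Fetcher` type are hypothetical names):

```python
from typing import Callable, Sequence

# A fetcher takes a package name and returns its bytes, or raises on failure.
Fetcher = Callable[[str], bytes]


def fetch_with_fallback(
    package: str, sources: Sequence[tuple[str, Fetcher]]
) -> tuple[str, bytes]:
    """Try each (label, fetcher) in order; return the first label and bytes that succeed."""
    errors = []
    for label, fetch in sources:
        try:
            return label, fetch(package)
        except Exception as exc:  # a real client would catch narrower network errors
            errors.append(f"{label}: {exc}")
    raise RuntimeError(f"{package} unavailable from all sources: " + "; ".join(errors))
```

Listing the mirror first means an upstream outage or a deleted package goes unnoticed by the build as long as the mirror holds a copy. Listing it second treats the mirror purely as a safety net. Which order fits depends on the trade-offs discussed above.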
Vendoring ensures maximum availability
An easy way to ensure maximum availability of your project's dependencies is to vendor them. Most package managers support some form of "vendoring", which means that instead of relying on downloads from external services, we store the dependency source code in version control alongside our own source code. For example, in Node, this might look like committing node_modules to source control.
While this solution isn't perfect (depending on how vendoring and your project are set up, it can put a lot of strain on your version control system), it's often the simplest and easiest way to achieve maximum availability.