From Ant to Gradle

2024-03-17

The build system at work consisted of an Ant build for Java, various node and npm scripts, and a lot of bash to bundle our software. It did the job, but it showed its age: no parallelization and no good dependency management. It was time to migrate to something more robust and modern.

Context

The main backend code was composed of Java and Scala. Ant was used as a build tool, and Ivy for the dependency management.
The frontend build and packaging were done with node and npm.

A Makefile was used as the entry point for the whole build, along with a bunch of shell scripts. Calling the Makefile with parallel execution (-j) made the whole build unstable and prone to crashing.

So the build was:

orchestrated by a Makefile
dependent on multiple tools: make, ant, ivy, sbt, node, npm, bash, python, …
dependent on whatever versions of those tools happened to be present: it relied on the versions installed by the operating system, or on those installed by developers when they set up their environment. Specific versions were not really enforced
without real caching: even if some parts could be cached a little, the build process had no real knowledge of what had been done previously and what needed to be redone
sequential and slow: all those tools did their work sequentially, so the whole build was sequential too. It took about 15 minutes to download dependencies, compile, prepare stuff…

A new tool

Someone on the team started moving the build to a new tool, and he chose Gradle, which I had never used and had only heard about.

It provides:

parallel builds
better task and dependency management
a single tool that can provide the same version of Java (via toolchains) and of node and npm (via plugins) to everyone
potential remote cache for CI to further speed up the build
the build scripts are written in Groovy, which I am not fond of… we just need to live with it

Due to some circumstances, the person who started the work left, and I was left alone with no real external support (see below). I nonetheless decided to continue the migration in my free time, maybe about half a day per week.

Migration steps

As I was left mostly alone on the project, I decided on the following steps to ensure a smooth migration from one build system to another. It would be more work, but for such a task, safety and confidence come first:

do not modify the original build (or as little as possible): everything has to be done on the side
as a consequence, both must live side by side and not interact or impact each other
add the new build little by little, without interfering with my other main daily tasks
keep a Makefile as the main entry point of the build, with the same targets, so that developers and third-party scripts calling it would not need to change; a new Makefile.gradle was introduced for this
ability to roll back, or to switch from one system to the other easily. With a new Makefile.gradle, it would just be a matter of switching files
ensure full binary compatibility for jars and other outputs between the two build systems
add a new CI task in parallel to build with Gradle, while keeping the old system

Reproducibility

As mentioned above, I wanted to ensure binary compatibility between the legacy and the new Gradle build.

make the Gradle output reproducible by itself:

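    // Apply to every archive task (jar, zip, ...) so archives do not depend on timestamps or file order
    tasks.withType(AbstractArchiveTask).configureEach {
        preserveFileTimestamps = false   // do not record file timestamps in the archive
        reproducibleFileOrder = true     // add entries in a stable order
    }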

write a script to compare the jar outputs; it needed to check the bytecode itself (simply comparing the corresponding .class files), so that differing dates inside the jars would not show up as differences
do the same thing for frontend generated files
use the CI to build with both systems, then call the comparison script to check for differences; run this task daily
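
The original comparison script is not shown here, so what follows is only a minimal sketch of the idea in Python. The function names `class_digests` and `compare_jars` are hypothetical. It hashes the bytes of each .class entry, so entry timestamps and entry order inside the jar are ignored.

```python
import hashlib
import zipfile


def class_digests(jar_path):
    """Map each .class entry name in the jar to the SHA-256 of its bytes."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            info.filename: hashlib.sha256(jar.read(info)).hexdigest()
            for info in jar.infolist()
            if info.filename.endswith(".class")
        }


def compare_jars(legacy_jar, gradle_jar):
    """Return sorted lists of (missing, extra, changed) .class entries."""
    legacy = class_digests(legacy_jar)
    gradle = class_digests(gradle_jar)
    missing = sorted(legacy.keys() - gradle.keys())  # only in the legacy jar
    extra = sorted(gradle.keys() - legacy.keys())    # only in the Gradle jar
    changed = sorted(
        name
        for name in legacy.keys() & gradle.keys()
        if legacy[name] != gradle[name]
    )
    return missing, extra, changed
```

A CI job could call `compare_jars` on each pair of jars and fail as soon as any of the three lists is non-empty.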

Example of differences in outputs between the two builds:

Scala files were compiled with debug information on the legacy build, so the -g option needed to be added to the Gradle build. I only found this out by looking at the compiled output. Even though it would not change behavior, it gave me confidence in the approach
checking the frontend build highlighted a few differences, as well as non-determinism in the legacy build system. These were fixed in both build systems
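
For generated frontend files, the same kind of check can be sketched over two output directories. This is an illustration, not the script that was actually used: `tree_digests` and `differing_files` are hypothetical names, and the directory layout is an assumption.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(legacy_dir, gradle_dir):
    """Return relative paths that are missing on one side or differ in content."""
    legacy = tree_digests(legacy_dir)
    gradle = tree_digests(gradle_dir)
    return sorted(
        name
        for name in legacy.keys() | gradle.keys()
        if legacy.get(name) != gradle.get(name)
    )
```

Running the same comparison on two consecutive runs of a single build is one way to spot non-determinism in that build.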

With Java provided via toolchains, and node and npm via plugins, the tool versions used for the build could be pinned instead of relying on the OS-provided ones. This helped migrate node and Java versions more smoothly, as developers and CI did not need to do anything: Gradle took care of checking the installation and downloading what was required.

Unit tests

The legacy build used Ant to create a jar with all test classes, then ran JUnit on this jar and exported the results as XML files.

At first, I wanted to keep exactly the same output, so I decided to build the same test jar using Gradle, and keep using Ant to produce the test results.
This allowed me to compare the two outputs, knowing that any difference could only come from the test jar itself and not from the part generating the results.

In the end, all tests were green, and the same number of tests was executed.
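
Checking that both builds ran the same number of tests can be automated from the XML results. The sketch below assumes the reports follow the usual JUnit XML layout, where each `<testsuite>` element carries `tests`, `failures` and `errors` attributes, and that suites are not nested. The function name `summarize_reports` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_reports(report_dir):
    """Sum the tests/failures/errors attributes of every <testsuite> found."""
    totals = {"tests": 0, "failures": 0, "errors": 0}
    for xml_file in sorted(Path(report_dir).glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # iter() also yields the root itself when it is a <testsuite>.
        for suite in root.iter("testsuite"):
            for key in totals:
                totals[key] += int(suite.get(key, 0))
    return totals
```

Comparing `summarize_reports` for the legacy results directory and for the Gradle one then shows at a glance whether the counts match.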

Some numbers

From the CI running the legacy build and the new Gradle build, we could get some numbers.

System	Legacy build	Gradle build (no cache)	Gradle build (with cache)
2 Cores VM	16 mins	14 mins (no real gain here, due to core count)	6 mins
8 Cores VM	14 mins	7 mins	3 mins
Local	15 mins	8 mins	4 mins

Even without the cache, build times were cut by about 50% on a multi-core system (14 to 7 minutes on the 8-core VM). The gains are mainly due to Gradle parallelization: the more cores, the faster the build. On a system with few cores, Gradle could not parallelize enough and there was almost no gain (16 to 14 minutes on 2 cores).
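
The reductions can be recomputed from the table above. In this small sketch the timings are copied from the table, and the names `TIMES`, `reduction_percent` and `reductions` are only illustrative.

```python
# Build times in minutes, copied from the table above.
TIMES = {
    "2 Cores VM": {"legacy": 16, "gradle": 14, "gradle_cached": 6},
    "8 Cores VM": {"legacy": 14, "gradle": 7, "gradle_cached": 3},
    "Local": {"legacy": 15, "gradle": 8, "gradle_cached": 4},
}


def reduction_percent(before, after):
    """Percentage of build time saved when going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)


def reductions(times=TIMES):
    """Compute the time saved versus the legacy build, per system."""
    return {
        system: {
            "no cache": reduction_percent(t["legacy"], t["gradle"]),
            "with cache": reduction_percent(t["legacy"], t["gradle_cached"]),
        }
        for system, t in times.items()
    }
```

Without cache this gives 12.5% on the 2-core VM, 50% on the 8-core VM and about 46.7% locally. With cache the figures are 62.5%, about 78.6% and about 73.3%.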

For developers, depending on what was modified and on the use of the Gradle cache, a new build could take anywhere from a few seconds to a few minutes at most.

Politics

As much as I hate politics, migrating the build system was a heavy political challenge, maybe more than the move itself. There was a lot of friction in introducing a new tool – why change the build that currently works? The gains were not obvious.

As with everything else, I decided to go step by step:

I worked alone for a few weeks to understand the legacy build, prepare the project, learn Gradle, and put things in place on the side little by little
once I had a working build and confidence in the build itself, I started talking about it with colleagues and management. They knew I was working on it, but the status was always a bit fuzzy as I only worked on it a few hours max every week
do not force adoption: that was important, as I was in the minority. Show the numbers, and let them speak for themselves. I made several presentations of the work to different engineering teams
as the new Gradle build was faster and worked, more and more colleagues started using it and helped test the migration. Of course, they also found a few bugs, and things to improve, but no blocker
and one day, once the CTO started using the build himself because the legacy build was too slow, I knew it was a win

It then became accepted that the Gradle build system was better, faster, and safer due to pinning versions and reproducibility. It was ok to enable the build by default for a new release:

how to move from one build to another? As I worked on creating a separate Makefile.gradle with the same targets, it was as simple as:

$ mv Makefile Makefile.legacy
$ mv Makefile.gradle Makefile

keep the legacy build as a backup for a few releases, in case of unforeseen problems: I wanted to be sure I could roll back should anything really bad happen (fortunately nothing did)
when we were confident enough, delete Makefile.legacy
then remove the old build stuff little by little. This is also a tedious task, but not really urgent

Conclusion

The new build system has now been in use for at least 2 or 3 years. The whole migration took almost a year of working mostly alone, half a day per week, but I can say it was a success.

What is important:

have a way to measure both builds side by side to show the gains
both builds must always be compatible: it really helps adoption, as there is no real blocker. If the new build does not work one day, developers can just switch back to the legacy build and we can fix the new one without (too much) pressure
it was a tedious project, but tenacity pays. Let adoption happen by itself: at first, it was not obvious that the project could be completed successfully. I communicated a bit, without enforcing anything, and once it was obvious that the new build was better, people started using it by themselves and the job was (mostly) done
we could leverage the build cache to improve build performance for the CI
much faster on machines with lots of cores, so as we change hardware the build is even faster. On my new M3 Pro laptop, everything builds in 4 minutes (!)
I hate Groovy