Broken Snowflake JDBC

A few days ago, a Snowflake process began failing without any changes to the software. It required a bit of investigation.

Context

Some Snowflake connections were suddenly failing with the following error:

Max retry reached for the download of chunk#0 (Total chunks: 28) retry: 7, error: net.snowflake.client.jdbc.SnowflakeSQLLoggedException: JDBC driver encountered communication error. Message: No trusted certificate found Verify that the hostnames and portnumbers in SYSTEM$ALLOWLIST are added to your firewall's allowed list. To troubleshoot your connection further, you can refer to this article: https://docs.snowflake.com/en/user-guide/client-connectivity-troubleshooting/overview.

A colleague also found a very weird scenario:

  • a request fetching a large amount of data through the driver failed
  • but if a small request was made first,
  • the same large request then worked

The first thing we did was look at the changes for that release, but nothing related to Snowflake was found.
It is always a good habit to search for such errors in the project's GitHub repository, and indeed someone had encountered the same kind of error.
The issue was fixed in the 3.24.2 Snowflake JDBC release.

The software was using 3.23.2, so the new release was a very good candidate. Updating the driver did indeed fix the issue.

However, it was still unclear what had triggered the error: the issue reported on GitHub was on Java 24 while our software was using Java 17, and we did not understand why it sometimes worked and sometimes did not.
Further investigation was needed to fully understand the problem.

OS Updates

Investigating further, we found that the failing version of the software was linked to a different version of the JVM:

  • java-17-openjdk-17.0.16.0.8-2.el8.x86_64 worked
  • java-17-openjdk-headless-17.0.17.0.10-1.el9.alma.1.x86_64 was failing

Based on the GitHub issue and the error message mentioning certificates, we can check the Java trust store:

$ keytool -list -keystore "/usr/lib/jvm/java-17/lib/security/cacerts" -alias starfieldclass2ca -storepass changeit

The Java version that worked had the certificate:

starfieldclass2ca, Oct 24, 2024, trustedCertEntry, Certificate fingerprint (SHA-256): 14:65:FA:20:53:97:B8:76:FA:A6:F0:A9:95:8E:55:90:E4:0F:CC:7F:AA:4F:B7:C2:C8:67:75:21:FB:5F:B6:58

While the one that failed did not:

Warning: use -cacerts option to access cacerts keystore
keytool error: java.lang.Exception: Alias <starfieldclass2ca> does not exist

The fix in the driver involved switching from a KeyManagerFactory to a TrustManagerFactory, which corrected the algorithm used to verify the certificate chain. It seems the previous validation was too strict and failed when specific certificates were missing from the trust store.
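This is not the driver's actual code, but a minimal sketch of the idea: a TrustManagerFactory initialized with the JVM's default cacerts store is what normally validates server certificate chains, and it also offers another way to check whether the Starfield root is present:

import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;
import java.security.KeyStore;
import java.security.cert.X509Certificate;

public class TrustedRoots {
    public static void main(String[] args) throws Exception {
        // Passing a null KeyStore makes the factory use the JVM's default
        // trust store (lib/security/cacerts).
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null);

        // The X509TrustManager is the component that validates server certificate chains.
        X509TrustManager tm = (X509TrustManager) tmf.getTrustManagers()[0];
        for (X509Certificate ca : tm.getAcceptedIssuers()) {
            // Print any trusted roots mentioning Starfield.
            String subject = ca.getSubjectX500Principal().getName();
            if (subject.contains("Starfield")) {
                System.out.println(subject);
            }
        }
    }
}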

AlmaLinux updates

We noticed that, while building the image for the new version of the software, the base OS minor version also changed.
The Dockerfile looks like:

FROM almalinux:9

RUN dnf update -y
# Then build and bundle a new software version

There are two issues here, which are somewhat redundant with each other:

  • almalinux:9 uses a major version tag; 9 is a floating tag that is updated whenever a new minor AlmaLinux version is released
  • dnf update installs the latest packages and, as a consequence, also upgrades to the latest released OS minor version

This configuration created a non-deterministic build.

It turns out that AlmaLinux 9.7 was released a few days ago, and the Docker image followed a few days later, right when a new image for the software was being built. Essentially, almalinux:9 previously pointed to almalinux:9.6, but it now points to almalinux:9.7.

That explained the different versions of the JVM between the two images:

  • java-17-openjdk-17.0.16.0.8-2.el8.x86_64 was on 9.6
  • java-17-openjdk-headless-17.0.17.0.10-1.el9.alma.1.x86_64 was on 9.7

The Dockerfile was changed to pin the latest released version:

FROM almalinux:9.7

RUN dnf update -y --releasever=9.7
# Then build and bundle a new software version

Diving deeper into the Snowflake JDBC driver

Understanding OCSP

My colleague went further to understand why a small request followed by a bigger one made the driver work.

The behavior may be linked to OCSP (Online Certificate Status Protocol, which we did not know about at the time). The Snowflake driver supports it.

We tried using disableOCSPChecks=true as a workaround, and with it the requests always worked.
There is still a certificate error in this case; the setting simply tells the driver to ignore it.
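For reference, the workaround is just a connection property. A minimal sketch, assuming the driver is on the classpath; the account URL and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class OcspWorkaround {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");                               // placeholder
        props.put("password", System.getenv("SNOWFLAKE_PASSWORD")); // placeholder
        // Workaround only: the driver skips OCSP revocation checks entirely,
        // so the underlying certificate problem is ignored, not fixed.
        props.put("disableOCSPChecks", "true");

        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}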

Snowflake driver connections

OCSP was a good lead at that point, but it did not explain why the connection sometimes worked and sometimes did not. Looking further into the documentation, we found that the Snowflake driver requires a “full connection” when fetching more than 100 KB of data.

Going back to the weird reproduction case:

  • a request fetching a large amount of data through the driver failed
  • but if a small request was made first,
  • the same large request then worked

This explains why the first, large request failed (it required a full connection, which hit the certificate error) and why the small request worked. But we still needed to understand why the final large request succeeded.
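In code, the reproduction looks roughly like this; the connection setup and LARGE_TABLE are placeholders, and per the documentation any query returning well over 100 KB of results needs the full connection:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class Reproduction {
    // 'conn' stands for whatever Snowflake connection the application already has.
    static void reproduce(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Running the large query first fails: its result is fetched in
            // chunks, which requires the "full connection".

            // A small query works: the result is returned inline.
            try (ResultSet small = stmt.executeQuery("SELECT 1")) {
                small.next();
            }

            // The same large query now succeeds.
            try (ResultSet big = stmt.executeQuery("SELECT * FROM LARGE_TABLE")) {
                while (big.next()) {
                    // consume rows
                }
            }
        }
    }
}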

OCSP response cache

Following the documentation on OCSP again, my colleague discovered the driver's OCSP response cache, which stores the response of the OCSP query in $HOME/.cache/snowflake/ocsp_response_cache.json to speed up subsequent requests.

Removing that file makes the problem reappear.

So we guessed that a first, small query to Snowflake does not go through the full connection stage but still creates the ocsp_response_cache.json file, which the second, larger request then reuses and therefore works.
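To test that hypothesis, the cache file can simply be deleted between runs. A small sketch, assuming the default Linux cache location mentioned above:

import java.nio.file.Files;
import java.nio.file.Path;

public class ClearOcspCache {
    public static void main(String[] args) throws Exception {
        // Default location of the driver's OCSP response cache on Linux.
        Path cache = Path.of(System.getProperty("user.home"),
                ".cache", "snowflake", "ocsp_response_cache.json");

        // Deleting the file forces the driver to perform the OCSP check again
        // on the next full connection, which makes the problem reappear.
        System.out.println("Cache present: " + Files.exists(cache));
        Files.deleteIfExists(cache);
    }
}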

Conclusion

The immediate fix for the issue was to bump the Snowflake JDBC driver to 3.24.2.

However, it also highlighted a few problems with our release process, and we changed a few things:

  • do not use a major version tag for the base image; pin it to a minor version (here 9.7) and upgrade it manually
  • add a recurring task to check for newer AlmaLinux versions and update the image accordingly
  • separate the image update from the software update: when changing something, we only want to change one thing at a time and retest everything, so that potential regressions like this one are easier to scope
  • the issue was spotted internally by a very old integration test, so we were happy it was there. We had previously wondered whether to remove it, but it helped detect the issue before it reached production

Also during the investigation, a few more things helped:

  • digging into the project's GitHub issues led us to a very good fix
  • AI (LLMs) are very handy for these kinds of problems; Gemini was a huge help in understanding the root cause