Terraform modules at scale

A bit of context

At TomTom, we develop a multitude of products with diverse requirements and various technologies. If “one size fits all” never quite applies in reality, that is even more true for the landscape of a company that successfully spans 30 years of technology and innovation in the mapping ecosystem.

However, as part of the Developer Experience team, we observed over time that each engineering team was required to operate its own infrastructure. This resulted in the wheel being reinvented many times, with each implementation specifically tailored to that team’s workload, but often lacking well-architected pillars or best practices.

We therefore established an internal initiative: the Cloud Reference Architecture team (which I’m leading). Our initial goal was to relieve engineering teams of the burden of architecting, yet another time, the infrastructure to run their applications. Our mission was to offer well-architected documents as well as a reference implementation that would work as an off-the-shelf, yet highly customizable, solution (since we knew “one size fits all” would not be achievable).

To achieve this we decided to focus on the fastest-emerging use case: Kubernetes clusters. For this we now offer, as an internal product, a full-fledged solution that enables teams to have a well-architected Kubernetes deployment in 20 minutes (fully integrated with the other technologies implemented internally in the company), without the need to investigate, test, design, and decide which technology to use and how to implement it.

This centralized codebase pattern also allows teams to delegate the updates and compatibility testing of the cloud infrastructure on top of which K8s runs, as well as of all the components we deploy as part of the off-the-shelf, ready-to-use solution.

Terraform modules at scale

That was a long introduction, but it offers a bit of perspective. In the rest of this post I will focus on how we offer this product at scale and ensure it is reliable and easy to consume.

It’s important to consider that not all engineering teams start with greenfield projects, so our personas are mainly the following:

  1. Greenfield or fast experimentation: a team that starts a brand-new product or just needs an easy accelerator to spin up new infra for some experiment on existing products.
  2. Established teams, with strong K8S experience and pre-existing infrastructure. They are interested in extending or replacing some components and having them centrally developed and maintained (e.g. they have a running cluster, and they need a proper implementation of cert-manager integrated with the internal certificate providers we support in the company).

CPK: a developer experience product and its components

For the above reasons we offer what we call CPK (short for Cloud Platform Kubernetes): an easy-to-deploy codebase that, in 20 minutes, gets your Helm chart running on a live cluster serving live traffic.

CPK is effectively a composition of modules (for infra, networking, and K8s) that implements best practices, integration with other internal technologies, and the complex network topology we support in our ecosystem. This composition (which we also simply call the “solution”) ships with sane defaults to support fast consumption without spending time on configuration.
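To give an idea of what “composition” means in practice, here is a hypothetical sketch of the solution root; the repository URL, module names, and versions are illustrative, not our actual internal layout.

```hcl
# Hypothetical sketch of the composition root: the solution wires pinned
# modules together and exposes only a handful of inputs with sane defaults.
module "network" {
  source = "git::https://github.com/acme/cpk-modules.git//modules/network?ref=v2.3.0" # illustrative source and version

  availability_zones = var.availability_zones
}

module "cluster" {
  source = "git::https://github.com/acme/cpk-modules.git//modules/aks-cluster?ref=v4.1.0"

  vm_size   = var.vm_size # sane default defined centrally
  subnet_id = module.network.subnet_id
}

module "cert_manager" {
  source = "git::https://github.com/acme/cpk-modules.git//modules/cert-manager?ref=v1.4.2"

  cluster_name = module.cluster.name
}
```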

Since CPK is a composition, we also release each component separately, so that the second persona above can leverage our artifacts to extend their infra, adopting best practices in a phased approach.

To enable engineers to adopt our product easily, we hold to some paramount tenets:

Technology decisions based on current de-facto standards, with an opinionated approach: we use Terraform as the configuration language and Terragrunt as the orchestrator, to help teams introduce DRY practices and approach the infrastructure with a layered pattern (a minimal sketch follows these tenets). When it comes to the components that are part of the K8S cluster, we as the central team opt for the best choice based on the emerging requirements in the company landscape, together with a proper evaluation of the most promising technology that will help us stay “future-proof”.

Configuration interface must be consistent: our codebase must be easy to consume, possibly with no changes to the defaults for fast experimentation; but if you do need to configure some aspect of the solution, we ensure the configuration experience is consistent everywhere.
This means we pay particular attention to the Terraform API interface (variables.tf) we expose to the engineering teams, and we try to “never break the API interface”.

Testing: our codebase is meant to be used in numerous and diverse scenarios, so we strive to offer a consistently tested set of Terraform modules. They are tested in “isolation” (what we consider unit tests) and in “composition”, plus with some in-place upgrade tests.
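As an illustration of the layered Terragrunt pattern from the first tenet, here is a minimal sketch of a single layer; the paths, module source, and version are hypothetical, not our internal layout.

```hcl
# layers/cluster/terragrunt.hcl - a minimal sketch of the layered pattern
# (paths, sources, and versions are illustrative).
include "root" {
  # shared remote-state and provider settings stay DRY in a parent folder
  path = find_in_parent_folders()
}

terraform {
  # each layer pins the module version it consumes
  source = "git::https://github.com/acme/cpk-modules.git//modules/aks-cluster?ref=v4.1.0"
}

dependency "network" {
  # a layer consumes the outputs of the layer below it
  config_path = "../network"
}

inputs = {
  vm_size   = "Standard_D4s_v5"
  subnet_id = dependency.network.outputs.subnet_id
}
```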

Below - and in more detail in the next article - is a deep dive, with some examples, into how we try to keep these promises.

Modules Implementation

Modules are the basic unit of functionality in Terraform. We offer a consistent experience in approaching both cloud infrastructure and K8S component deployment, and we keep the variables that are common across clouds consistently named (e.g. vm_size represents the virtual machine type for the K8S nodes regardless of the cloud vendor, and availability_zones is always a list of elements such as 1,2,3 or a,b,c). Most of the time the underlying network topology for the control plane doesn’t require changes beyond basic variables like these.
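A minimal sketch of what this convention looks like in a module’s variables.tf (the defaults shown here are illustrative):

```hcl
# variables.tf - the same names and shapes appear in every cloud flavor of a
# module (defaults are illustrative).
variable "vm_size" {
  description = "Virtual machine size for the K8S nodes, regardless of the cloud vendor."
  type        = string
  default     = "Standard_D4s_v5"
}

variable "availability_zones" {
  description = "Availability zones to spread the node pool across (e.g. 1,2,3 or a,b,c)."
  type        = list(string)
  default     = ["1", "2", "3"]
}
```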

Custom configurations do exist, and they are addressed differently per cloud.

When it comes to K8S components (e.g. the cert-manager we mentioned before), they are deployed with the Terraform Helm provider, and each module follows the same structure.

If some special integration is needed, we either release helper modules (glue code) or offer a feature switch on the module to enable the additional capability or the integration with an internal solution (e.g. cert-manager can be integrated with our internal Certificate Authority via a simple toggle, and csi-secret-driver can retrieve secrets from our secret stores, such as HashiCorp Vault, via a simple variable object).
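A hypothetical sketch of the feature-switch pattern, assuming a module that deploys cert-manager via the Helm provider and an enable_internal_ca toggle (the variable name and manifest path are invented for illustration):

```hcl
# One boolean toggles the glue resources for the internal Certificate Authority.
variable "enable_internal_ca" {
  description = "Integrate cert-manager with the internal Certificate Authority instead of the default Let's Encrypt issuer."
  type        = bool
  default     = false
}

resource "helm_release" "cert_manager" {
  name       = "cert-manager"
  repository = "https://charts.jetstack.io"
  chart      = "cert-manager"
  namespace  = "cert-manager"

  set {
    name  = "installCRDs"
    value = "true"
  }
}

# Glue code is only created when the toggle is on.
resource "kubernetes_manifest" "internal_ca_issuer" {
  count      = var.enable_internal_ca ? 1 : 0
  manifest   = yamldecode(file("${path.module}/manifests/internal-ca-issuer.yaml")) # hypothetical ClusterIssuer
  depends_on = [helm_release.cert_manager]
}
```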

Modules Documentation

Interfaces will only be used if they are usable, and part of usability is certainly documentation. But as we engineers all know, writing documentation is a tedious task. So let’s write it by automation.

This is the approach we use in our modules, compositions, and root stacks.

We use a GitHub workflow that kicks in on each pull request. It checks whether the code change requires documentation and generates it using the excellent terraform-docs. We keep a customized template as part of the centralized workflow configuration so that we produce consistent documentation everywhere.

We also ensure at development time that every variable and output is properly documented, by enforcing sanity checks on each pull request (before any other check starts). We use TFLint with a set of enforced rules: documentation, obviously, but also other important rules, such as the ones sketched below.
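A sketch of what such a .tflint.hcl could look like, using rules from the standard TFLint Terraform ruleset (the exact set we enforce internally may differ):

```hcl
# .tflint.hcl - illustrative configuration with the bundled terraform ruleset.
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# Every variable and output must carry a description.
rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_documented_outputs" {
  enabled = true
}

# Variables must declare an explicit type.
rule "terraform_typed_variables" {
  enabled = true
}

# Names must follow a consistent convention (snake_case by default).
rule "terraform_naming_convention" {
  enabled = true
}
```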

Modules at scale with confidence: Testing

In this section I would like to dive a bit deeper into what actually helps us build up the confidence to release this codebase centrally: testing.

In general, for each pull request we run a series of tests: for specific module functionality, and for the composition of modules together. For those tests we make intensive use of the managed K8S control plane from Azure (our primary cloud). In 20 minutes we are able to deploy and destroy multiple full-fledged clusters and run tests with a sample application on top of them.

Let’s see below what types of tests we introduced in our workflow, depending on whether we are working on single-module repositories or on the composition of them (the off-the-shelf solution).

Modules Testing

After all the linting and documentation checks pass and any additional commits (for auto-generated documentation) are created, the integration tests come next.

For each module we approach testing as follows, borrowing some concepts from software engineering:

Compilation Test aka “does it deploy?”

Terraform is not a compiled language, we know, but we declare that a Terraform module “compiles” when it actually deploys successfully with the default configuration we ship.

For this reason, each module implementation comes with one or more Example Recipes: the simplest Terraform configurations that deploy the module. They represent an extremely useful “learn by example” approach: engineers can easily see - with minimum entropy - the bare minimum required to deploy that module (e.g. for our glorified cert-manager example you need, as a bare minimum, a K8S network topology, a K8S control plane with a default node pool, an image-repository integration helper, the cert-manager module itself with the default Let’s Encrypt integration, and a K8S manifest representing a Certificate object).
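A hypothetical sketch of such a recipe (module paths and names are illustrative):

```hcl
# examples/cert-manager/main.tf - illustrative Example Recipe for one module.
module "network" {
  source = "../../modules/network" # hypothetical relative paths
}

module "cluster" {
  source    = "../../modules/aks-cluster"
  subnet_id = module.network.subnet_id
}

module "image_repository" {
  source       = "../../modules/image-repository"
  cluster_name = module.cluster.name
}

module "cert_manager" {
  source       = "../../modules/cert-manager"
  cluster_name = module.cluster.name
  # defaults ship with the Let's Encrypt integration enabled
}

# The Certificate object that the functional tests will assert on.
resource "kubernetes_manifest" "demo_certificate" {
  manifest   = yamldecode(file("${path.module}/certificate.yaml"))
  depends_on = [module.cert_manager]
}
```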

Our automation, driven by Terratest, will then try to deploy this Terraform recipe for every PR against that implementation (module or recipe). If it deploys, we put a green tick on the “compilation” of the module.

Unit Tests

We consider a module offering a single functionality to be a unit. For this reason we want to make sure the module not only “compiles” but also offers the promised functionality. Again, Terratest helps us here: we developed a reusable testing framework based on go test that leverages Terratest to run functional tests after the recipe above is deployed.

In the scenario of our cert-manager module, we gain confidence that it works when - after creating a Kubernetes Certificate object - we can see the certificate actually being issued (meaning the certificate request, order, and challenge are all successful). For this, the recipe above actually contains external-dns for DNS validation and workload-identity to support managing cloud resources from within K8S.

Terratest will then simply use the Go client to check, with retries, whether the deployed certificate becomes ready within an acceptable time.
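A minimal sketch of what such a functional test could look like, assuming the example recipe above and a Certificate named demo-cert (both hypothetical). For brevity this sketch shells out to kubectl via Terratest’s k8s helpers rather than using the raw Go client our framework uses:

```go
package test

import (
	"fmt"
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/k8s"
	"github.com/gruntwork-io/terratest/modules/retry"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

// TestCertManagerIssuesCertificate deploys the example recipe and waits for
// the Certificate object to become Ready. Paths and names are hypothetical.
func TestCertManagerIssuesCertificate(t *testing.T) {
	tfOptions := &terraform.Options{TerraformDir: "../examples/cert-manager"}
	defer terraform.Destroy(t, tfOptions)
	terraform.InitAndApply(t, tfOptions)

	kubectlOptions := k8s.NewKubectlOptions("", "", "default")

	// Poll until the Ready condition of the Certificate turns True.
	retry.DoWithRetry(t, "wait for certificate to become Ready", 30, 20*time.Second,
		func() (string, error) {
			out, err := k8s.RunKubectlAndGetOutputE(t, kubectlOptions,
				"get", "certificate", "demo-cert", "-o",
				`jsonpath={.status.conditions[?(@.type=="Ready")].status}`)
			if err != nil {
				return "", err
			}
			if out != "True" {
				return "", fmt.Errorf("certificate not Ready yet (status %q)", out)
			}
			return out, nil
		})
}
```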

As you might notice, that example recipe is actually a bit more complex than what I mentioned before. That’s acceptable, since it explains very well what is needed in “real life” to have seamless certificate management in a K8S cluster. Certainly there are more components in scope, but they give us more confidence that they work together (with their own pinned versions), and they also contribute to “documenting by example” how to use a module.

Integration Tests

We offer an opinionated composition of multiple modules with their specific (pinned) versions (what we call the Solution, or CPK as a code name). When they all deploy successfully together and offer the functionalities we expect, we consider our integration test successful.

In this scope we test the composition with Terratest as well, by deploying the entire off-the-shelf product, again with sane defaults. We also created a small K8S application that, in order to work, requires the interaction of all the modules we offer: it exposes an ingress and connects remotely to our Slack to report the output of its invocation.

For that to work: the Traefik ingress controller has to configure its pods properly, external-dns has to interact with the cloud provider via workload identities to add entries to the hosted zone associated with the K8S cluster, csi-secret-driver has to support retrieving the API keys our small app uses to talk to Slack, and metrics have to be exposed on the /metrics endpoint, where we expect to see them via a proper PromQL query on our VictoriaMetrics server.

Terratest will then deploy this full-fledged K8S stack, add our application on top, and try to hit the web endpoint of our application. If we receive a successful outcome on Slack, we are confident all the components are running properly and offering the functionality they are supposed to.
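A sketch of how such a smoke test could look; the solution directory and the app_endpoint output name are assumptions for illustration:

```go
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

// TestSolutionServesTraffic deploys the whole composition and smoke-tests the
// sample app through the ingress. Paths and output names are hypothetical.
func TestSolutionServesTraffic(t *testing.T) {
	tfOptions := &terraform.Options{
		TerraformDir:    "../solution",
		TerraformBinary: "terragrunt", // drive the Terragrunt stack through Terratest
	}
	defer terraform.Destroy(t, tfOptions)
	terraform.InitAndApply(t, tfOptions)

	appURL := terraform.Output(t, tfOptions, "app_endpoint")

	// Retry until Traefik, external-dns, and cert-manager have all done their job.
	http_helper.HttpGetWithRetry(t, appURL, nil, 200, "", 60, 15*time.Second)
}
```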

In-place upgrade tests

This type of test only runs for the “composition” of modules. That is indeed the primary use case we support: an accelerator that helps engineering teams get off-the-shelf infrastructure ready to host their own K8S applications. Once this is deployed, we want to make sure we don’t introduce unannounced or unplanned breaking changes (e.g. a changed variable interface, or a new component configuration that would break an existing deployment).

To this end, on every PR we test the possibility of upgrading the composed stack in place between two consecutive minor versions (we use Semantic Versioning to keep explicit track of the changes introduced between versions): we deploy the latest released version and then - again thanks to Terratest, Terragrunt, and our in-house framework - we update the pinnings and let Terratest try to upgrade the pre-existing K8S cluster to the PR head version. This test gives us the confidence that we can ship the changes in a minor/patch release.
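A sketch of the idea, assuming the composition takes a hypothetical cpk_version input used for pinning (the variable name and version values are invented for illustration):

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

// TestInPlaceUpgrade deploys the latest released version of the composition and
// then re-applies the PR head over the same state.
func TestInPlaceUpgrade(t *testing.T) {
	options := func(version string) *terraform.Options {
		return &terraform.Options{
			TerraformDir:    "../solution",
			TerraformBinary: "terragrunt",
			Vars:            map[string]interface{}{"cpk_version": version},
		}
	}

	released := options("1.42.0")  // latest released minor (illustrative)
	head := options("1.43.0-rc.1") // PR head version (illustrative)

	defer terraform.Destroy(t, head)
	terraform.InitAndApply(t, released) // bring the cluster up at the released version
	terraform.InitAndApply(t, head)     // upgrade in place; a breaking change fails here
}
```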

What we do NOT test

Since we offer basically a “library”, a codebase that each engineering team configures further, we decided not to include a continuous load-testing suite. Load testing - indeed - is very artificial if run on a non-production-like system instead of the service you are actually supposed to run in production. Having such a test on each PR would not add any confidence about a codebase that will most likely be further configured by each engineering team.

However, we still periodically run some artificial tests with the Azure Load Testing service: these are mainly aimed at verifying the behavior of new (major) versions of our Traefik ingress and, in general, checking the stability of the solution.

Chaos testing is only partially covered: we do test pod loss and network connectivity issues, but - to be fair - these tests are far from being structural in our pipelines, so that will be our next focus.

Conclusion

As you might have noticed, the most comprehensive part of this article is about testing. We want to offer a central codebase to deploy Kubernetes infrastructure, and there are so many components included that we can’t build confidence without automation that continuously checks whether every component works as expected and whether, all together, they integrate their functionalities with the company’s other solutions.

One takeaway: infrastructure is no different from software nowadays, it’s just more annoying to replace when you make big changes. But like software, it needs to be tested, continuously. Start introducing infrastructure testing and get acquainted with the tooling as soon as you write your first modules. We are very happy with the Gruntwork frameworks, Terratest and Terragrunt.

Bonus Note: auto-pinning

Modules live in different repositories (we use one repo per category of functionality to avoid repository sprawl). They are released continuously when the main branch passes all the tests. New versions are then continuously available to be pinned everywhere they are used, in compositions and elsewhere.
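To make “pinning” concrete, a pinned module consumption looks roughly like this (the URL and version are illustrative):

```hcl
# A pinned module source. A bot can detect the ref and bump it automatically
# whenever a newer tag is released.
module "external_dns" {
  source = "git::https://github.com/acme/cpk-modules.git//modules/external-dns?ref=v3.2.1"

  cluster_name = module.cluster.name
}
```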

This job can’t be done by a human. And shouldn’t be. For this reason we lean heavily on Renovate Bot to keep our pinnings constantly updated: for each new module release, Renovate opens a new pull request, it goes through all our pipelines, and when all the tests are successful, Renovate automatically approves and merges it.

NOTE: this is only active on modules where we know we have enough test coverage. I would not recommend auto-merge in other circumstances, where you don’t have enough confidence automatically provided by test suites.