Don’t keep your code DRY…

… but instead manage the duplication

“A little copying is better than a little dependency.”

This is so true, and I would argue that, when you are a human, even more copying is better than abstractions.

As human engineers we have a special power: we can imagine - via a kind of mind map - some levels of abstraction, to identify how our codebase will behave in a specific scenario. Our power comes with a limitation: we can't easily mind-map too many layers of these abstractions.
I mean, you can, and you can even hold on to it for a few seconds, but the stability of the outcome is very delicate, like a deep dream level in Inception (please make sure you watch that movie. It's not a prerequisite for this article, but you must do it. Stop.)

We rely on a database server to assemble the outcome of a bunch of data spread across multiple tables because that way data is efficiently managed; but the same normalization makes it inefficient for humans to access the data without proper tooling.

Sometimes it's inefficient for computers too, hence the de-normalization approach, where - indeed - we adopt the principle "a little copying is better (read: more efficient, in some cases) than a little dependency".
And, very often, as a human you can easily make up your mind reading a de-normalized table in a data store, simply because the data is now inline and you removed the additional abstractions (JOINs).

Where am I heading?

Well, recently I have the feeling that we are overwhelmed by frameworks, each one introducing yet another abstraction layer, supposedly to make developers' lives easier. Supposedly.
The sad truth, when you join a new company as part of an infrastructure/platform team, is that the learning curve to understand what is deployed and how it will behave is getting steeper and steeper.

But the technologies themselves did not become much more complex. It's the frameworks we introduced to keep everything "DRY" and avoid repetition and duplication that are making human onboarding miserable.
And even after onboarding, when it comes to making a change to infrastructure code, you have to git push - together with the code - some dose of hope that everything will end up where it should.

Yes, we can introduce tests, but testing cloud infrastructure is neither easy nor cheap. It's not as simple as a memory area you allocate, validate and drop.

Are you advocating to not go for DRY?

Absolutely not: we should avoid repetition and duplication. I'm only saying that a codebase is only as good as the level of maintainability it offers to a human (loosely inspired by the well-known principle in security).

Let me drop an example: Helm. It's a great framework for templating Kubernetes resources, very powerful and versatile. It enables you to be very DRY. At what cost, though?
I can recall numerous occasions where I, or my colleagues, had absolutely no idea how to foresee the impact of a change.
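To make the indirection concrete, here is a minimal, hypothetical sketch (chart layout and names are made up): to know the rendered replica count, you have to trace a helper, then values.yaml, then any command-line overrides, then the default.

# templates/_helpers.tpl (hypothetical chart)
{{- define "app.replicas" -}}
{{- .Values.replicas | default 1 -}}
{{- end }}

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
  # the effective value lives in none of these files alone:
  # helper -> values.yaml -> -f/--set overrides -> default
  replicas: {{ include "app.replicas" . }}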

Kustomize uses a different approach, layering overrides on overrides (see the sketch below).
A bit better, perhaps, but it does not solve the underlying human issue.
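As a sketch (file layout and names are hypothetical), the final manifest is the base plus every patch stacked on top, and you still have to do the stacking in your head:

# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replica-patch.yaml

# overlays/prod/replica-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web     # must match the name in the base
spec:
  replicas: 5   # silently overrides whatever the base declares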
And let's not forget that YAML itself is a configuration language supporting abstractions (have a look at YAML anchors) that you need to rebuild in your mind map while trying to understand what will happen.
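For instance, in this small made-up snippet you can't read the effective value of web.memory without expanding the anchor and the merge key first:

defaults: &defaults
  memory: 256Mi
  cpu: 100m

web:
  <<: *defaults   # merge key: pulls in everything from defaults
  memory: 512Mi   # then overrides the anchored memory value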

Another one: Terraform and Terragrunt implementations with complex (and non-standardized) folder structures for defining infrastructure and configurations.
You start abstracting away configurations, and you get to a point where you lose control of your - well-intentioned - optimization effort. The result is that, when you try to change something, you can't understand what is going to happen, and you need - at the very least - to go read a long plan of your entire infrastructure.

Balance is the key (no shit, Sherlock)

We need to realize we are not machines, and it's not very convenient or effective to have to simulate an entire infrastructure just to understand changes that should fall within the range of human pre-computability.

So - in short - we got too excited, and now the codebase is so DRY that we can't even imagine what will happen when we change a small bit of code.

Balance is the key: what if, instead of getting rid of all the duplication, we keep duplication - the boring kind - everywhere it's needed, but make sure this duplication is properly managed?

Yes, I'm advocating for managed duplication. What does this mean?
Many parts of our infrastructure codebases are straight duplicates and repetitions, but they make it easy to mentally pre-compute / envision the changes that are occurring.

These duplications are extremely valuable!

Now, the immediate question: what if this duplicated and repeated code changes over time? Do I need to open numerous, tedious pull requests everywhere?
No. This is the part where the duplication must be there, and must be managed.

This is where proper boilerplate tooling should give its best.
Boilerplate is not only needed to start up a project. Boilerplate is valuable when you can update it over time, perhaps from a central source of truth.

Most boilerplate projects address the bootstrap part in an excellent way, but they come up short when it comes to managing the boilerplated code.

I've tried to address this with SKA, a boilerplate framework that allows you to bootstrap a codebase, update it over time, and keep the duplication within the codebase managed. Have a look at the project if you want to read more.

What I would like to see is the duplication managed in a central repository, where it can be updated over time and pulled by the distributed consumer repositories by adopting a new version of the boilerplated code. Just like you pull a new version of a library or a module.

---
title: Lifecycle of boilerplate templates
---

sequenceDiagram
    actor P as Platform Team 
    participant B as Boilerplate Repository
    actor D as Infra Developer<br>(or SKA / Backstage)
    participant C as Consumer Repository<br>with duplications

    P -->> B: publish template
    D ->> B: use version v1 of template
    D ->> C: expand the template in final repository
    P -->> B: publish version v2
    P -->> B: publish version vN...
    D ->> B: use newer version of boilerplate 
    D ->> C: update the template in the final repository

Back to the Helm example. Yes, it's possible to distribute common resources via a Helm dependency (where you maintain your base deployment Kubernetes resources, for example) and update it over time. The difference?
As an engineer, I once again have no idea at a glance what the final result is. Instead, I need to run one more command to pull the new dependency and build the templated outcome. It's not making the engineer's life easier. We are back to trying to deduce the information in a normalized database by doing all the JOINs in our head.
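For reference, the distribution mechanism itself is simple enough; a hypothetical consumer chart would declare something like:

# Chart.yaml (hypothetical consumer chart)
apiVersion: v2
name: my-service
version: 0.1.0
dependencies:
  - name: base-deployment              # the shared base resources
    version: 2.0.0
    repository: https://charts.example.com

But to see what bumping that version actually changes, you still need to run helm dependency update and helm template, and diff the rendered output yourself.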

Besides the tool I started working on, I see a lot of potential for this in Backstage. It's probably the best framework I know for bootstrapping and managing codebases.
We should promote it and use it more to manage duplication in a human-accessible way.

Why doesn't this happen with full-fledged programming languages, e.g. Golang or Python?

I asked myself the same question. I believe it's because they are a very homogeneous domain, and an IDE that knows that specific homogeneous domain can make engineers' lives easier by offering inline helpers (from code usage to the impact of a refactor).
Within the infrastructure domain we are exposed to very diverse languages and frameworks, some chosen by us, others dictated by cloud vendors. And we are more exposed to integrating these frameworks with different patterns. IDEs are not good at helping us dive into these integration efforts.

I don't have much experience with Pulumi, but I think it would make some difference in making the knowledge domain more homogeneous.

But still, after you have your codebase, you need your pipelines, workflows, configurations for code style, Renovate, Snyk, etc. You typically try to centralize this stuff - again with good intentions - so that you remove duplication. But when it comes to dealing with reality, most of the time you wish you had that duplication inline in your repository, to be more efficient.
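A sketch of what that centralization often looks like in practice (organization, repository, and workflow names are made up): the consumer repository keeps only a thin stub, and the logic you actually need to read lives somewhere else.

# .github/workflows/ci.yaml in a consumer repository
name: ci
on: [push]
jobs:
  ci:
    # the real pipeline is defined in another repository; to
    # understand or debug it you have to leave this codebase
    uses: my-org/central-workflows/.github/workflows/ci.yaml@v3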

---
title: Example centralised boilerplate repository
---
classDiagram
  class BoilerplateRepository["Boilerplate Repository"] {
    /templates/pipeline
    /templates/go_codebase
    /templates/codescan
    /templates/editor_config
    ...
  }

  class ConsumerA["Consumer A"] {
    use centralised pipeline boilerplate
  }

  class ConsumerB["Consumer B"] {
    use codebase boilerplate
    use codescan boilerplate
  }

  BoilerplateRepository ..|> ConsumerA : expand template (pipeline)
  BoilerplateRepository ..|> ConsumerB : expand template (codebase, codescan)

Try out SKA for managing your boilerplate

So, if you want to have a look at SKA, please go ahead. It's still very young, but it tries to address part of the problem with an opinionated approach. Feedback is welcome.

I'm planning to extend the examples and showcase what I mentioned in this post: the possibility to centrally maintain boilerplates and use SKA as the tool for managing their lifecycle.

If you see the problem and you extend other frameworks to address it, it's like a nice party: the more the merrier!

A little copying is better than a little dependency. And if it's properly managed, don't be shy about copying and duplicating, as long as it's for humans.