Are we reinventing the cloud?
TL;DR: this post is a rant about the unnecessary complexity and duplication that, so far, I haven't found beneficial while running workloads on managed Kubernetes.
Incipit
Everything started from this comment on LinkedIn, a comment I support 100%.
original comment here
Delusional behaviour: Kubernetes everywhere
The more I use Kubernetes, the more I'm puzzled about the direction we are going.
Kubernetes is an excellent abstraction layer: it can be installed over pre-existing infrastructure, and it implements a robust framework on top of which you can deploy components that make the underlying infrastructure "software-defined".
But, there is a but
When you run on-premises infrastructure, most likely only a small part of it is software-defined (VMware, OpenStack), but not the entire set of services your workload needs. Hence the need for Kubernetes.
Software-defined is the core principle of every major cloud provider, and a robust framework is exactly what AWS, Azure and GCP already offer via their own APIs, SDKs and CLIs.
The Duplication
Now let's consider the components you need (yes, you must) to install on top of Kubernetes to offer additional services: access control, DNS management, load balancing and traffic routing, deployment automation, secrets management, to name a few; the list keeps growing.
Actually, let's make this list more pragmatic. As an engineer, in order to install and manage a workload on Kubernetes on AWS, I have to provision both cloud resources and K8s components. For fun, I will group them together to showcase what I believe are unnecessary duplications.
IAM
| on AWS | self-managed on K8s |
| --- | --- |
| IAM Roles | Roles |
| Policies | Role Bindings |
| Trust Relationships | Service Accounts |
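To make the duplication tangible, here is a minimal sketch, assuming boto3 and the official Kubernetes Python client and using hypothetical names (`app-reader`, namespace `default`): the same "who is allowed to do what" has to be declared once as an IAM role with a trust relationship, and once again as a ServiceAccount, Role and RoleBinding inside the cluster.

```python
import json
import boto3
from kubernetes import client, config

# --- AWS side: an IAM role with a trust relationship (hypothetical names) ---
iam = boto3.client("iam")
iam.create_role(
    RoleName="app-reader",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "ec2.amazonaws.com"},
                       "Action": "sts:AssumeRole"}],
    }),
)

# --- K8s side: the same intent expressed again as ServiceAccount + Role + RoleBinding ---
config.load_kube_config()
core = client.CoreV1Api()
rbac = client.RbacAuthorizationV1Api()

core.create_namespaced_service_account("default", {"metadata": {"name": "app-reader"}})
rbac.create_namespaced_role("default", {
    "metadata": {"name": "app-reader"},
    "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]}],
})
rbac.create_namespaced_role_binding("default", {
    "metadata": {"name": "app-reader"},
    "subjects": [{"kind": "ServiceAccount", "name": "app-reader", "namespace": "default"}],
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "app-reader"},
})
```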
Networking
| on AWS | self-managed on K8s |
| --- | --- |
| Security Groups | Network Policies |
| Load Balancer, Target Groups | Ingress Component, Routes |
| DNS Hosted Zone (Route53) | External-DNS Component |
| ACM Certificates | Cert-Manager Component |
Sidenote:
- with Security Groups you represent cloud components as actors; you don't have to deal with IPs (which makes sense in this era).
- with Network Policies the above works only as long as you stay within the realm of K8s; if you need to secure the path toward a cloud-managed database, you are back to dealing with IPs (see the sketch below).
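A minimal sketch of that sidenote, again assuming boto3 and the Kubernetes Python client, with hypothetical security group IDs and CIDR: on the AWS side the database rule references another security group (an actor), while the NetworkPolicy for the same path out of the cluster falls back to an IP block.

```python
import boto3
from kubernetes import client, config

# --- AWS side: allow traffic "from the app security group" - actors, not IPs ---
ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0db111111111111",  # hypothetical DB security group
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0app222222222222"}],  # hypothetical app SG
    }],
)

# --- K8s side: NetworkPolicies select pods by labels, but the path to a
# cloud-managed database is just an IP block again (hypothetical CIDR) ---
config.load_kube_config()
client.NetworkingV1Api().create_namespaced_network_policy("default", {
    "metadata": {"name": "allow-db-egress"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "backend"}},
        "policyTypes": ["Egress"],
        "egress": [{
            "to": [{"ipBlock": {"cidr": "10.20.0.0/24"}}],  # the database subnet, by IP
            "ports": [{"protocol": "TCP", "port": 5432}],
        }],
    },
})
```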
Scaling & Elasticity
| on AWS | self-managed on K8s |
| --- | --- |
| Autoscaling Groups, Launch Templates | Cluster Autoscaler / Karpenter Component |
Secrets Management
| on AWS | self-managed on K8s |
| --- | --- |
| Secrets Manager entities | External Secrets Operator Component, etcd Secrets |
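Here too, a small hedged sketch (boto3 plus the Kubernetes Python client, hypothetical secret names) of the same secret living twice: the source of truth in Secrets Manager, and the copy that the External Secrets Operator mirrors into the cluster as a K8s Secret, with its own lifecycle to look after.

```python
import boto3
from kubernetes import client, config

# --- AWS side: the secret lives in Secrets Manager (hypothetical SecretId) ---
sm = boto3.client("secretsmanager")
db_password = sm.get_secret_value(SecretId="prod/db-password")["SecretString"]

# --- K8s side: the External Secrets Operator mirrors it into the cluster as a
# K8s Secret (stored in etcd), which the workload reads instead ---
config.load_kube_config()
mirrored = client.CoreV1Api().read_namespaced_secret("db-password", "default")
print(sorted(mirrored.data.keys()))  # same value, second copy, second lifecycle to manage
```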
Observability
| on AWS | self-managed on K8s |
| --- | --- |
| CloudWatch agent | Kube Prometheus Stack, Grafana, Metrics Server Components, Logging Agent (fluentd, promtail, ...) |
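The same duplication shows up for a single business metric: the hedged sketch below pushes it to CloudWatch with boto3 and exposes it again for the Prometheus stack to scrape with `prometheus_client` (metric names are hypothetical).

```python
import boto3
from prometheus_client import Gauge, start_http_server

queue_depth = 42  # hypothetical business metric

# --- AWS side: push the value to CloudWatch ---
boto3.client("cloudwatch").put_metric_data(
    Namespace="MyApp",
    MetricData=[{"MetricName": "QueueDepth", "Value": queue_depth}],
)

# --- K8s side: expose the same value again for Prometheus to scrape ---
gauge = Gauge("queue_depth", "Number of items waiting in the queue")
gauge.set(queue_depth)
start_http_server(8000)  # /metrics endpoint scraped by the kube-prometheus stack
```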
Additionally, on K8s, to make life easier:
| self-managed on K8s |
| --- |
| Reloader, for ConfigMap/Secret rotation |
| Trivy, for keeping an eye on potential vulnerabilities popping up in the running containers |
Day 2
Now we have to keep the infrastructure up to date with the new functionality coming from the cloud provider (AWS SDK, deprecations, innovations), but on top of that we also need to:
- Periodically update the Kubernetes control plane
- Carefully rotate the Kubernetes nodes when the time comes (hopefully the workload has a Pod Disruption Budget)
- Check each of the self-managed components' images for vulnerabilities
- Periodically check the K8s manifests for API versions that are incompatible with upcoming K8s releases (see the sketch below)
- Periodically update the self-managed components (all of the above, at least; the more you add, the more release notes you need to check, each with its own cadence)
- Ensure proper compatibility between the components and the K8s version
- When there is a problem, open a bug upstream or, better, a Pull Request, wait for approval, and only then proceed with updating the impacted component.
The above is just the basic hygiene of your infrastructure, without factoring in the additional effort of modernising the K8s realm as well.
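As an illustration of the kind of chore that list implies, here is a minimal sketch in plain Python (PyYAML only, with a hand-picked and deliberately non-exhaustive deprecation map) that scans a directory of manifests for API versions that will break on upgrade; dedicated tools such as pluto or kubent do this properly.

```python
import sys
from pathlib import Path

import yaml  # PyYAML

# Hypothetical, non-exhaustive map of API versions removed in past K8s releases;
# always check the official deprecation guide for the release you are targeting.
DEPRECATED = {
    "extensions/v1beta1": "removed in 1.22 (use networking.k8s.io/v1, apps/v1, ...)",
    "policy/v1beta1": "removed in 1.25 (use policy/v1)",
    "autoscaling/v2beta2": "removed in 1.26 (use autoscaling/v2)",
}

def check(manifest_dir: str) -> int:
    """Print every manifest document using a deprecated apiVersion; return the count."""
    hits = 0
    for path in Path(manifest_dir).rglob("*.y*ml"):
        for doc in yaml.safe_load_all(path.read_text()):
            if isinstance(doc, dict) and doc.get("apiVersion") in DEPRECATED:
                hits += 1
                print(f"{path}: {doc.get('kind')} uses {doc['apiVersion']} "
                      f"({DEPRECATED[doc['apiVersion']]})")
    return hits

if __name__ == "__main__":
    sys.exit(1 if check(sys.argv[1] if len(sys.argv) > 1 else ".") else 0)
```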
The Questions
Is it fu@*@! worth it?
Wouldn't it be better to just offload everything to the cloud provider and use the commodities it offers to run the only thing we actually care about in our company: the final workload?
Why don't we use ECS + Fargate? Or Azure Containers, or GCP Cloud Run? We still need to provision the underlying network and related plumbing in any case.
AWS, Azure and GCP already offer, via their robust frameworks, all the services we need. The main difference is that when you install a component on top of Kubernetes, you don't have an SLA: it's managed by you, and you also need to come up with the proper maintenance, day-2 operations and upgrade patterns. And take care of the Kubernetes upgrades as well.
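For comparison, a hedged sketch (boto3, hypothetical account, image and network IDs) of what "just run the container" looks like on ECS + Fargate: one task definition and one run call, with the node lifecycle, the control plane and their SLAs on the provider's side.

```python
import boto3

ecs = boto3.client("ecs")

# Describe the workload once; no cluster components to install or upgrade.
task_def = ecs.register_task_definition(
    family="my-api",  # hypothetical names and ARNs throughout
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "my-api",
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-api:latest",
        "portMappings": [{"containerPort": 8080}],
    }],
)

# Run it on Fargate: the "nodes" are AWS's problem, not one of your Day-2 items.
ecs.run_task(
    cluster="default",
    launchType="FARGATE",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    networkConfiguration={"awsvpcConfiguration": {
        "subnets": ["subnet-0aaa1111"],             # hypothetical subnet
        "securityGroups": ["sg-0app222222222222"],  # hypothetical security group
        "assignPublicIp": "DISABLED",
    }},
)
```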
The Bet
If you are a multi-cloud company (are you really?), I'm more and more convinced that it would be less costly and less frustrating (incident-wise) to set up small teams of cloud experts to maintain simpler infrastructure, with reduced maintenance operations, and run the workload in each cloud in its vendor-locked-in flavour.