EKS in Production: Lessons from Two Very Different Clusters

Running EKS in production is a different game than spinning up a tutorial cluster. I’ve operated two production EKS environments with very different requirements, and the lessons from each shaped how I think about Kubernetes on AWS.

Cluster One: SaaS Platform

A SaaS product running on EKS 1.34 in eu-west-1. The stack: Karpenter v1.8.2 for node management, Bottlerocket as the node OS, Aurora PostgreSQL Serverless v2 for the data layer, and three distinct node pools.

This cluster needed to be cost-efficient during quiet periods and scale aggressively for peak load. The workloads were predictable enough to optimize around — web frontends, API services, and background workers.

Cluster Two: E-Mobility Startup

An EV charging platform on EKS 1.33. Completely different beast: 15+ Spring Boot microservices, EMQX MQTT broker for real-time charger communication, 27+ ECR repositories, and ElastiCache Redis for session and cache layers.

The challenge here wasn’t scale — it was complexity. Dozens of services with interdependencies, real-time protocol requirements (MQTT), and a team that needed to ship features fast without breaking the charging network.

Karpenter vs Cluster Autoscaler

Both clusters use Karpenter. The decision wasn’t close.

Cluster Autoscaler works at the node group level: it adds nodes to a group when pods are pending and removes nodes that sit underutilized. Karpenter works at the individual node level. It looks at the requirements of the pending pods, picks a low-cost instance type that satisfies them from the types the NodePool allows, and provisions exactly that.

The practical difference: a Cluster Autoscaler node group pinned to m5.2xlarge gives you an m5.2xlarge even when the pending pods would fit on an m5.large. Karpenter sizes each node to the pods that are waiting for it. On the SaaS cluster, switching to Karpenter cut compute costs by roughly 30%.

Node Pool Strategy

The SaaS cluster runs three node pools. The spot pool is defined like this:

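apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      requirements:
        # Spot capacity only; interruptible workloads land here
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Several similar instance types, so Karpenter has alternatives when one spot pool runs dry
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge",
                   "m6i.large", "m6i.xlarge", "m6a.large", "m6a.xlarge"]
        # Nodes may be placed in any of the three availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket
  disruption:
    # Consolidate empty or underutilized nodes, 60s after the last pod change on a node
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
  limits:
    # Upper bound on the total capacity this pool may provision
    cpu: "100"
    memory: 400Gi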

Spot — CI runners, batch jobs, non-critical workers. Wide instance type selection for availability.
On-demand — Production API services and anything customer-facing. No interruption risk.
Dedicated — GitLab runners that need consistent performance and can’t tolerate spot reclamation mid-pipeline.
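For contrast with the spot pool, here is a rough sketch of what an on-demand pool for customer-facing services could look like. This is not the cluster's actual manifest: the pool name, the instance category and generation filters, the longer consolidation delay, the one-node disruption budget, and the CPU limit are all assumptions chosen for illustration.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-api            # hypothetical name
spec:
  template:
    spec:
      requirements:
        # On-demand capacity only, so nodes are never reclaimed by AWS
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        # Assumption: general-purpose "m" family, generation 5 or newer
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Assumption: wait longer than the spot pool before consolidating
    consolidateAfter: 5m
    # Assumption: let Karpenter voluntarily disrupt at most one node at a time
    budgets:
      - nodes: "1"
  limits:
    cpu: "64"                    # illustrative cap
```

The disruption budget is the main design choice here: it slows consolidation down so that cost savings never come at the price of several API nodes disappearing at once.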

The e-mobility cluster uses a simpler split: on-demand for everything customer-facing (charging sessions can’t drop), spot for development and testing namespaces.

GitOps with ArgoCD

Both clusters use ArgoCD with the app-of-apps pattern. A root application points to a directory of Application manifests, each defining a service and its target environment.
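A minimal sketch of a root application for the app-of-apps pattern follows. The repository URL, the `apps/prod` directory, the application name, and the automated sync policy are placeholders and assumptions, not the actual setup of either cluster.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-prod                # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    # Placeholder repository; the directory holds one Application manifest per service
    repoURL: https://git.example.com/platform/gitops.git
    targetRevision: main
    path: apps/prod
  destination:
    # Child Application objects are created in the cluster ArgoCD itself runs in
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                # remove child apps whose manifests were deleted from Git
      selfHeal: true             # revert manual changes made in the cluster
```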

Three environments: dev, UAT, and prod. Each environment gets its own namespace and values overlay. Promotion is a PR that updates the image tag in the UAT or prod values file. No manual kubectl applies, no SSH into nodes.

The app-of-apps pattern scales well. When the e-mobility team added their 16th microservice, it was a new Application manifest and a Helm chart — ArgoCD picked it up on the next sync.
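Adding a service in this model comes down to one more child manifest in the directory the root application watches. The sketch below assumes a Helm chart stored in the same repository with one values file per environment; the service name, paths, and repository URL are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: billing-service-uat      # hypothetical service and environment
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/gitops.git   # placeholder
    targetRevision: main
    path: charts/billing-service
    helm:
      valueFiles:
        - values.yaml            # shared defaults
        - values-uat.yaml        # environment overlay; the image tag lives here
  destination:
    server: https://kubernetes.default.svc
    namespace: uat
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

With this layout, a promotion PR only touches the image tag in `values-uat.yaml` or its prod counterpart.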

Observability

Both clusters run the Prometheus + Loki + Grafana stack.

Prometheus — Metrics collection with ServiceMonitors for auto-discovery
Loki — Log aggregation without the Elasticsearch operational overhead
Grafana — Dashboards for cluster health, application metrics, and Karpenter node lifecycle
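As an illustration of the ServiceMonitor auto-discovery mentioned above, here is a minimal sketch. It assumes the Prometheus Operator is installed, that Prometheus selects ServiceMonitors by a `release` label, and that the target Service exposes a named metrics port; the names, labels, and scrape path are all assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service              # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: the label your Prometheus instance selects on
spec:
  # Look for matching Services in the prod namespace
  namespaceSelector:
    matchNames: ["prod"]
  selector:
    matchLabels:
      app.kubernetes.io/name: api-service
  endpoints:
    - port: http-metrics         # must match a named port on the Service
      path: /actuator/prometheus # typical for Spring Boot with Micrometer; adjust per service
      interval: 30s
```

Any new Service carrying the matching label gets scraped without touching the Prometheus configuration.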

The e-mobility cluster also ships MQTT broker metrics into Prometheus. When charger connectivity drops, we see it in Grafana before the customer support tickets arrive.

Secrets Management

External Secrets Operator (ESO) syncs secrets from AWS Secrets Manager into Kubernetes Secrets. No secrets in Git, no manual secret creation. When a secret rotates in Secrets Manager, ESO picks up the change on its next refresh and updates the Kubernetes Secret.
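A sketch of what one such sync might look like with External Secrets Operator. The store name, the Secrets Manager key, the property, and the refresh interval are hypothetical, and the `apiVersion` depends on the installed operator release (older installs use `external-secrets.io/v1beta1`).

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: api-db-credentials       # hypothetical name
  namespace: prod
spec:
  # How often the operator re-reads Secrets Manager; a rotation shows up within this window
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager    # assumption: a store configured for AWS Secrets Manager
  target:
    name: api-db-credentials     # the Kubernetes Secret the operator creates and owns
    creationPolicy: Owner
  data:
    - secretKey: DB_PASSWORD     # key inside the Kubernetes Secret
      remoteRef:
        key: prod/api/database   # hypothetical Secrets Manager secret name
        property: password       # JSON field inside that secret
```

Only this manifest lives in Git. The secret value itself never does.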

This pattern works across both clusters. The SaaS platform stores database credentials, API keys, and TLS certificates in Secrets Manager. The e-mobility platform adds MQTT broker credentials and third-party charging network API keys.

Terraform State Separation

One lesson I learned the hard way: don’t manage AWS infrastructure and Kubernetes resources in the same Terraform state.

I split it into two Terraform states per cluster:

Infrastructure state — VPC, subnets, EKS cluster, node IAM roles, RDS, ElastiCache
Kubernetes state — Helm releases (ArgoCD, Karpenter, External Secrets Operator), namespace definitions, RBAC

They have different lifecycles. Upgrading ArgoCD shouldn’t require a plan against the entire VPC. Separate states mean independent applies and smaller blast radius when something goes wrong.

What I’d Do Differently

On the e-mobility cluster, we started with Cluster Autoscaler and migrated to Karpenter mid-project. That migration — draining node groups, shifting workloads, validating spot instance behavior — took a full sprint. Start with Karpenter from day one.

On the SaaS cluster, we underinvested in pod disruption budgets early on. Karpenter’s consolidation is aggressive by default. Without proper PDBs, it will happily terminate nodes running your primary API pods during a consolidation cycle. Define PDBs before you enable consolidation.
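A minimal PodDisruptionBudget along these lines is sketched below. The name, namespace, and label selector are hypothetical, and the right `maxUnavailable` value depends on how many replicas the service runs.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb          # hypothetical name
  namespace: prod
spec:
  # Allow at most one API pod to be evicted at a time during voluntary disruptions
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: api-service   # must match the labels on the API pods
```

Karpenter drains nodes through the eviction API, so a budget like this limits how many API pods a consolidation can take down at once. It only helps when the Deployment runs more than one replica. With a single replica, `maxUnavailable: 1` allows the only pod to be evicted, and a `minAvailable: 1` budget would block the drain entirely.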

EKS in production is manageable if you treat it like infrastructure, not a playground. Automate the node lifecycle, separate your state, and invest in observability from the start. The clusters that hurt are the ones where these decisions get deferred.