The Kubernetes question
At some point, every growing engineering team has the conversation: "Should we move to Kubernetes?" Usually it happens after the third deployment incident in a month, or when the ops person quits and nobody knows how to manage the servers they left behind.
The answer is almost never a simple yes or no. Kubernetes solves real problems, but it creates new ones. The question isn't whether K8s is good technology -- it's whether your team is at the stage where the benefits outweigh the costs.
We've helped dozens of companies make this decision. Here's the framework we use.
Signs you've outgrown your current setup
Before evaluating solutions, be honest about whether you actually have a problem. These are the signals that your current infrastructure is holding you back:
- Deployments are manual or fragile -- someone SSH-ing into servers, running scripts, hoping nothing breaks. Every deploy is a minor emergency.
- Scaling is reactive -- you're manually adding instances when traffic spikes and forgetting to remove them when it drops. Your cloud bill reflects this.
- Service dependencies are tangled -- you're running 5+ services that need to discover each other, and you've duct-taped it together with hardcoded IPs or environment variables.
- Environment parity is a fantasy -- staging doesn't match production, local doesn't match staging, and "it works on my machine" is a weekly occurrence.
- You're spending more time on infrastructure than product -- your engineers are becoming part-time sysadmins, and it's slowing down feature delivery.
If you're nodding at three or more of these, you have an infrastructure problem. But Kubernetes isn't the only solution.
The real costs of Kubernetes
Let's be direct about what K8s actually costs, because the technology blogs rarely are.
Operational overhead. Kubernetes is a distributed system that manages other distributed systems. Cluster upgrades, node management, networking policies, RBAC configuration, persistent volume management, ingress controllers -- there's a reason "Kubernetes engineer" is its own job title. Expect to dedicate at least one senior engineer to K8s operations, or budget $3-5K/month for a managed platform or a dedicated DevOps contractor.
Learning curve. Your team needs to understand pods, deployments, services, ingress, ConfigMaps, Secrets, namespaces, Helm charts, and the debugging workflow when something goes wrong (and it will). Budget 4-6 weeks of reduced velocity as your team ramps up.
Managed service costs. EKS on AWS runs about $73/month per cluster before you add any worker nodes. GKE and AKS have similar pricing. Add node costs, load balancers, and persistent volumes -- a minimal production-grade cluster starts around $300-500/month and scales from there.
Complexity tax. Every new engineer you hire needs K8s knowledge. Every service you deploy needs manifests. Every debugging session involves kubectl commands and YAML archaeology. This is manageable at scale but painful for small teams.
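To make that tax concrete, here is what a typical "why is production broken?" session looks like -- an illustrative sketch, with the pod name and namespace as placeholders:
# A typical post-deploy debugging session (names are placeholders)
kubectl get pods -n production                                  # which pods are crashing?
kubectl describe pod api-server-7d4f9c-x2klq -n production      # events: OOMKilled? failed probe? image pull error?
kubectl logs api-server-7d4f9c-x2klq -n production --previous   # logs from the container that just died
kubectl rollout history deployment/api-server -n production     # what changed recently?
kubectl rollout undo deployment/api-server -n production        # roll back while you investigate
Every engineer on call needs this vocabulary before they can answer a 2 a.m. page.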
The alternatives (and when they're enough)
For many teams, these alternatives solve the same problems with a fraction of the complexity:
ECS Fargate (AWS). Our default recommendation for teams running 2-10 services on AWS. You get container orchestration, auto-scaling, service discovery, and load balancing without managing any nodes. It integrates natively with ALB, CloudWatch, and IAM. We've covered Fargate setup in detail in our Terraform guide. If you're an AWS shop and your workloads are straightforward, start here.
Railway / Render. Excellent for teams under 10 engineers who want zero infrastructure management. Push code, get a URL. Built-in databases, cron jobs, and environment management. You'll outgrow these eventually, but they can carry you through Series A.
Fly.io. Strong choice for globally distributed applications. Deploy containers to edge locations with a simple CLI. Good for latency-sensitive workloads but less mature for complex service architectures.
Cloud Run (GCP). Google's answer to Fargate. Scale-to-zero billing, automatic HTTPS, and tight integration with GCP services. Great for event-driven and API workloads.
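To put the complexity gap in perspective, a Cloud Run deployment is roughly one command -- a hedged sketch, assuming a container image already pushed to Artifact Registry, with project, image, and region as placeholders:
# Deploy a container to Cloud Run (illustrative names)
gcloud run deploy api-server \
  --image us-central1-docker.pkg.dev/your-project/containers/api-server:v1.2.3 \
  --region us-central1 \
  --allow-unauthenticated \
  --min-instances 0 \
  --max-instances 10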
| Solution | Best for | Team size | Monthly cost (typical) |
|---|---|---|---|
| Railway / Render | Simple apps, small teams | 1-10 | $50-500 |
| ECS Fargate | AWS-native microservices | 5-50 | $200-5,000 |
| Cloud Run | Event-driven, GCP shops | 5-50 | $100-3,000 |
| Fly.io | Global edge deployment | 3-30 | $100-2,000 |
| Kubernetes | Complex orchestration at scale | 15+ | $500-50,000+ |
When Kubernetes is actually worth it
K8s becomes the right choice when:
- You're running 10+ services with complex interdependencies, and you need fine-grained control over networking, scaling, and deployment strategies (canary, blue-green, rolling).
- You need multi-cloud or hybrid cloud -- Kubernetes is the only orchestration layer that runs identically across AWS, GCP, Azure, and on-premise. If vendor lock-in is a real concern (not a hypothetical one), K8s gives you portability.
- You need advanced scheduling -- GPU workloads, batch processing, stateful applications with specific affinity requirements, or mixed workload types on the same cluster.
- Your team is large enough to absorb the operational cost -- generally 15+ engineers, with at least 1-2 dedicated to platform/infrastructure.
- Regulatory or compliance requirements demand the level of network isolation, RBAC, and audit logging that K8s provides out of the box.
Getting started: a practical guide
If you've decided K8s is right for your team, here's how to start without drowning.
Step 1: Use a managed service
Do not run your own control plane. Use EKS, GKE, or AKS. The $73/month for a managed control plane is the best infrastructure money you'll spend.
# Create an EKS cluster with eksctl
eksctl create cluster \
  --name my-app-production \
  --region ca-central-1 \
  --version 1.29 \
  --nodegroup-name standard-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 6 \
  --managed
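Once the cluster is up, eksctl writes the kubeconfig entry for you, so a quick sanity check looks like this (cluster name and region as above):
# Verify you can reach the new cluster
kubectl get nodes
kubectl get pods --all-namespaces
# If you ever need to regenerate the kubeconfig entry later:
aws eks update-kubeconfig --name my-app-production --region ca-central-1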
Step 2: Set resource limits from day one
The number one cause of K8s outages we see is pods without resource limits consuming all available memory and crashing the node. Set limits on every deployment, no exceptions.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: your-registry/api-server:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
Requests are what the scheduler uses to place your pod -- the node reserves that much CPU and memory for it. Limits are the ceiling -- exceed them and your pod gets throttled (CPU) or OOM-killed (memory). Start with requests at 50% of limits and adjust based on real usage data.
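One way to gather that usage data -- a sketch assuming the metrics-server add-on is available (GKE ships it by default; on EKS you install it yourself), with the pod name as a placeholder:
# Compare actual consumption against your requests and limits (requires metrics-server)
kubectl top pods -n production
kubectl top nodes
# Inspect a specific pod to see whether it has been OOM-killed or restarted
kubectl describe pod api-server-7d4f9c-x2klq -n production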
Step 3: Use Helm for packaging
Helm charts let you template your Kubernetes manifests and manage configuration across environments. Don't copy-paste YAML between staging and production -- parameterize it.
# helm/api-server/values.yaml
replicaCount: 3
image:
  repository: your-registry/api-server
  tag: "latest"
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
  targetMemoryUtilization: 80
ingress:
  enabled: true
  hostname: api.yourapp.com
  tls: true
# helm/api-server/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Release.Name }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Release.Name }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilization }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilization }}
{{- end }}
# Deploy with Helm
helm upgrade --install api-server ./helm/api-server \
  --namespace production \
  --set image.tag=v1.2.3 \
  --wait --timeout 5m
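To keep staging and production from drifting, a common pattern is one values file per environment layered on top of the defaults -- a sketch, assuming hypothetical values-staging.yaml and values-production.yaml files alongside values.yaml:
# Per-environment overrides layered on top of values.yaml (file names are illustrative)
helm upgrade --install api-server ./helm/api-server \
  --namespace staging \
  -f ./helm/api-server/values-staging.yaml \
  --set image.tag=v1.2.3 \
  --wait --timeout 5m
Values passed with -f and --set override the defaults in values.yaml, so the chart itself stays identical across environments.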
Step 4: Invest in observability early
Kubernetes adds layers of abstraction. Without observability, debugging becomes guesswork. Install Prometheus and Grafana (or use Datadog's K8s integration) from day one. Monitor node resource utilization, pod restart counts, request latency by service, and deployment rollout status.
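One common way to get Prometheus and Grafana running is the community kube-prometheus-stack chart -- a sketch using the public prometheus-community Helm repository:
# Install Prometheus + Grafana via the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
This gives you node and pod dashboards, alerting rules, and scrape configs out of the box; you can layer service-level dashboards on top as your deployments grow.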
The bottom line
Kubernetes is not a maturity badge -- it's an operational decision with real trade-offs. If you're running a handful of services on AWS with a team under 15, ECS Fargate will serve you well with a fraction of the complexity. If you're running a complex service mesh, need multi-cloud portability, or have advanced scheduling requirements, K8s is worth the investment.
The worst outcome is adopting Kubernetes because it feels like the "serious" choice, then spending more time managing the cluster than building your product. Choose the simplest infrastructure that solves your actual problems, and upgrade when you have the signals -- and the team -- to justify it.