Completed building a production-grade LLM serving infrastructure on Kubernetes from scratch, learning how to orchestrate large language model inference through custom resources, operators, and sophisticated autoscaling patterns.
Project Overview
The nano-k8s-cluster project demonstrates how to serve LLMs on Kubernetes at production scale, moving beyond simple deployments to sophisticated orchestration patterns that handle the unique constraints of tensor-parallel models.
Core Technologies:
- Kubernetes: Control plane with minikube for local development
- vLLM: High-throughput LLM inference engine
- Go Operators: Production-grade controllers using Kubebuilder
- Observability Stack: Prometheus, Grafana, AlertManager
Two Serving Architectures:
- Monolithic: Single TP instance with router load balancing
- Disaggregated: Separate prefill (TP-8) and decode (TP-4) clusters with coordinated scaling
Fundamental Learning: Fixed-Shape Tensor Parallelism
The most critical insight is that tensor parallelism requires fixed GPU counts per model instance.
Mathematical Constraint:
tensorParallelSize = replicas × gpusPerPod
Example: For Llama-3-70B with TP-8:
- Need exactly 8 GPUs working as one unit
- Could be: 2 pods × 4 GPUs/pod, or 8 pods × 1 GPU/pod
- Cannot arbitrarily add/remove pods without breaking tensor parallelism
Implication: Traditional Kubernetes HPA fails for LLM serving because it tries to scale pods within a TP instance. Horizontal scaling means creating whole new LLMCluster instances, not adding pods to existing ones.
CRD Architecture Design
LLMCluster Custom Resource
Key Design Decisions:
spec:
  replicas: 2             # Number of pods
  gpusPerPod: 4           # GPUs per pod (tensorParallelSize / replicas)
  tensorParallelSize: 8   # Must equal replicas × gpusPerPod
  model: "llama-3-70b"
  routerEnabled: true
Validation: CRD validates that tensorParallelSize = replicas × gpusPerPod before creation, preventing invalid configurations.
Status Subresource: Separates status from spec to prevent user modification and enable proper condition tracking.
Conditions: Rich status communication (Ready, Progressing, Degraded) for operators to react to.
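A minimal sketch of what the Kubebuilder types behind this CRD could look like (field names and the CEL marker are illustrative assumptions, not the project's actual definitions); the validation rule expresses the fixed-shape invariant at admission time:
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Illustrative Kubebuilder types; the CEL rule enforces the fixed-shape
// invariant (tensorParallelSize = replicas × gpusPerPod) at admission time.
// +kubebuilder:validation:XValidation:rule="self.tensorParallelSize == self.replicas * self.gpusPerPod",message="tensorParallelSize must equal replicas * gpusPerPod"
type LLMClusterSpec struct {
    Replicas           int32  `json:"replicas"`
    GPUsPerPod         int32  `json:"gpusPerPod"`
    TensorParallelSize int32  `json:"tensorParallelSize"`
    Model              string `json:"model"`
    RouterEnabled      bool   `json:"routerEnabled,omitempty"`
}

// Status lives behind the /status subresource, so users cannot edit it through
// spec updates; conditions follow the standard metav1 shape (Ready, Progressing, Degraded).
type LLMClusterStatus struct {
    Conditions []metav1.Condition `json:"conditions,omitempty"`
    ReadyPods  int32              `json:"readyPods,omitempty"`
}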
LLMClusterAutoscaler Custom Resource
Fleet Scaling Pattern: Unlike traditional HPA that scales pods, this operator creates/destroys entire LLMCluster instances based on multi-metric evaluation with hysteresis to prevent oscillation.
Why Not Standard HPA:
- HPA scales pods within a deployment
- TP models require scaling whole instances
- Pod-level scaling breaks tensor parallelism constraints
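A rough sketch of the autoscaler's spec (field names are assumptions layered on the LLMClusterSpec sketched above, not the project's actual API) makes the fleet-scaling intent explicit: the unit of scaling is an LLMCluster instance, never a pod.
// Illustrative autoscaler spec: it owns a fleet of LLMCluster instances and
// scales the instance count, never the pods inside an instance.
type LLMClusterAutoscalerSpec struct {
    // Shape template for any new instance (reuses the LLMClusterSpec above).
    ClusterTemplate LLMClusterSpec `json:"clusterTemplate"`

    MinInstances int32 `json:"minInstances"`
    MaxInstances int32 `json:"maxInstances"`

    // Hysteresis: distinct up/down thresholds plus a cooldown between actions.
    ScaleUpThreshold   int32           `json:"scaleUpThreshold"`   // e.g. 80 (% GPU utilization)
    ScaleDownThreshold int32           `json:"scaleDownThreshold"` // e.g. 30
    CooldownPeriod     metav1.Duration `json:"cooldownPeriod"`     // e.g. "300s"
}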
Control/Data Plane Separation
The architecture implements clean separation between control and data planes:
Control Plane
- CRDs defining desired state
- Operators watching and reconciling
- Autoscaler making scaling decisions
- Controllers managing pod lifecycle
Data Plane
- Model serving pods (vLLM instances)
- Routers distributing requests
- Redis queues for request management
- Actual inference computation
Benefits:
- Independent evolution of control logic and serving stack
- Clear RBAC boundaries between teams
- Isolated failure domains
- Easier testing and debugging
Disaggregated Serving Architecture
Two-Phase Serving Pattern
Prefill Cluster (TP-8):
- Processes entire prompts
- Computes KV cache for all tokens
- Higher compute requirements
- Outputs: KV cache + last hidden state
Decode Cluster (TP-4):
- Generates tokens incrementally
- Uses cached KV from prefill
- Lower compute per token but higher frequency
- Continues generation until completion
Critical Coordination Constraint
maxPrefillPerDecode: 2 # Ratio limit
Purpose: Prevents decode cluster starvation from faster prefill operations.
Why It Matters: Without this constraint, rapid prefill operations could flood the decode cluster with pending requests, causing excessive latency for token generation.
Implementation: Router enforces the ratio in admission control, queuing prefill requests when decode cluster is at capacity.
StatefulSet vs Deployment
Why StatefulSet:
- Provides stable pod identity required for tensor parallelism
- Each pod gets a stable, ordinal hostname (&lt;statefulset-name&gt;-0, &lt;statefulset-name&gt;-1, ...)
- TP initialization relies on stable pod ordering
- Enables controlled rolling updates
OnDelete Update Strategy:
- Manual pod deletion for rolling updates
- Prevents uncontrolled simultaneous restarts
- A PodDisruptionBudget keeps at least 50% of pods available during voluntary disruptions
- Zero-downtime deployments
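A minimal sketch of how the operator might build this workload (names, labels, and the 50% figure mirror the bullets above; everything else is an assumption, and the pod template is omitted for brevity):
import (
    appsv1 "k8s.io/api/apps/v1"
    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// buildServingWorkload returns the StatefulSet and PodDisruptionBudget for one
// LLMCluster instance. The pod template (vLLM container, nvidia.com/gpu requests)
// is omitted for brevity.
func buildServingWorkload(name string, replicas int32) (*appsv1.StatefulSet, *policyv1.PodDisruptionBudget) {
    labels := map[string]string{"app": name}

    sts := &appsv1.StatefulSet{
        ObjectMeta: metav1.ObjectMeta{Name: name},
        Spec: appsv1.StatefulSetSpec{
            Replicas:    &replicas,
            ServiceName: name, // headless Service gives each pod a stable DNS name for TP init
            Selector:    &metav1.LabelSelector{MatchLabels: labels},
            // OnDelete: pods restart only when explicitly deleted, so an update
            // never tears down a whole tensor-parallel group at once.
            UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
                Type: appsv1.OnDeleteStatefulSetStrategyType,
            },
        },
    }

    minAvailable := intstr.FromString("50%")
    pdb := &policyv1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{Name: name},
        Spec: policyv1.PodDisruptionBudgetSpec{
            MinAvailable: &minAvailable,
            Selector:     &metav1.LabelSelector{MatchLabels: labels},
        },
    }
    return sts, pdb
}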
Performance Characteristics
Latency Budget:
- 2500ms p95 total for production serving
- Includes prefill, decode, and network overhead
- Drives architecture decisions (disaggregation helps meet this)
Throughput Calculation:
- Theoretical max: tokens/second based on GPU specs
- Real-world: ~85% efficiency
- Overhead: Network, coordination, queuing, KV cache transfer
Model Loading Time:
- ~45 seconds for Llama-3-70B TP-8
- Impacts autoscaling reaction time
- Must be accounted for in capacity planning
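As a back-of-the-envelope sizing helper, assuming the ~85% efficiency and ~45s load time above (the function names and the one-instance headroom policy are illustrative, not the project's code):
import "math"

// effectiveTokensPerSecond discounts the theoretical per-instance rate by the
// observed efficiency factor (~0.85) that covers network, coordination,
// queuing, and KV cache transfer overhead.
func effectiveTokensPerSecond(theoretical, efficiency float64) float64 {
    return theoretical * efficiency
}

// instancesNeeded sizes the fleet for a target aggregate token rate, keeping one
// spare instance because a Llama-3-70B TP-8 instance takes ~45s to load and
// cannot absorb traffic immediately after a scale-up decision.
func instancesNeeded(targetTokensPerSec, perInstance float64) int {
    return int(math.Ceil(targetTokensPerSec/perInstance)) + 1
}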
Fleet Autoscaling Logic
Multi-Metric Evaluation
Decision Factors:
- Request queue length (Redis)
- GPU utilization metrics (Prometheus)
- Request latency percentiles
- Pending request count
- Current instance count
Hysteresis:
scaleUpThreshold: 80
scaleDownThreshold: 30
cooldownPeriod: 300s
Purpose: Prevent oscillation by requiring thresholds to be crossed significantly before scaling actions, with cooldown periods between actions.
Scaling Decision Flow
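A sketch of that decision flow, using the illustrative thresholds above (the metric fields and constants are assumptions, not the operator's real types):
import "time"

const (
    scaleUpUtilPct      = 80.0
    scaleDownUtilPct    = 30.0
    cooldown            = 300 * time.Second
    maxQueuePerInstance = 32 // assumed backlog one instance can absorb before scaling
)

type FleetMetrics struct {
    GPUUtilizationPct float64 // mean across instances, from Prometheus
    QueueDepth        int     // pending requests in the Redis queue
}

type decision int

const (
    holdSteady decision = iota
    scaleUp
    scaleDown
)

// decide applies hysteresis: act only when a threshold is crossed decisively,
// and never twice within the cooldown window.
func decide(m FleetMetrics, instances int, lastAction, now time.Time) decision {
    if now.Sub(lastAction) < cooldown {
        return holdSteady // previous action still settling (model load alone takes ~45s)
    }
    switch {
    case m.GPUUtilizationPct > scaleUpUtilPct || m.QueueDepth > instances*maxQueuePerInstance:
        return scaleUp
    case m.GPUUtilizationPct < scaleDownUtilPct && m.QueueDepth == 0:
        return scaleDown
    default:
        return holdSteady
    }
}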
Operational Learnings
Resource Management
GPU Allocation Strategies:
- Dedicated Instances: Full GPUs for production workloads
- MIG Profiles: Multi-Instance GPU for multi-tenant scenarios
- Trade-off: MIG increases utilization but adds complexity
Right-Sizing:
- Model loading time vs utilization trade-off
- Larger batches = better GPU utilization but higher latency
- Smaller instances = faster scaling but more overhead
Failure Tolerance
Automatic Recovery:
- Controllers self-heal pod failures
- StatefulSet maintains stable identity
- Restart policies replace crashed pods
Rolling Updates:
- OnDelete strategy enables controlled updates
- PodDisruptionBudget ensures minimum availability
- Health checks determine readiness
Observability is Critical
What to Measure:
- Request queue depth and wait times
- Token generation throughput (tokens/second)
- GPU utilization (memory and compute)
- Pod lifecycle events (scheduling, startup, termination)
- Request latency percentiles (p50, p95, p99)
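As one way to expose these, the serving pods and router could register metrics with Prometheus client_golang (the metric names here are illustrative, not the project's actual series):
import "github.com/prometheus/client_golang/prometheus"

var (
    // Queue depth feeds the autoscaler's multi-metric evaluation.
    queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "llm_request_queue_depth",
        Help: "Requests waiting in the Redis queue.",
    })

    // End-to-end latency histogram; p50/p95/p99 are derived at query time.
    requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "llm_request_duration_seconds",
        Help:    "End-to-end request latency (prefill + decode + network).",
        Buckets: prometheus.ExponentialBuckets(0.1, 2, 8), // 0.1s .. ~12.8s
    })

    // Token throughput, the headline serving metric.
    tokensGenerated = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "llm_tokens_generated_total",
        Help: "Total tokens produced by the decode phase.",
    })
)

func init() {
    prometheus.MustRegister(queueDepth, requestLatency, tokensGenerated)
}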
Why It Matters:
- Can’t optimize what you don’t measure
- GPUs are expensive—underutilization wastes money
- Latency budgets drive architectural decisions
- Capacity planning requires accurate data
Implementation Patterns Discovered
Fixed-Shape Scaling Pattern
Problem: Traditional HPA scales pods within a deployment, breaking tensor parallelism.
Solution: Fleet scaling creates/destroys entire LLMCluster instances.
Implementation:
// Pseudo-code: the autoscaler reconciles the fleet, not individual pods.
func (r *AutoscalerReconciler) Reconcile(ctx context.Context) error {
    instances := r.listLLMClusters(ctx)
    metrics := r.collectMetrics(ctx)

    switch {
    case shouldScaleUp(metrics, instances):
        // Scale out by creating a whole new fixed-shape LLMCluster instance.
        return r.createLLMCluster(ctx, r.clusterTemplate())
    case shouldScaleDown(metrics, instances):
        // Scale in by deleting the oldest instance after draining it.
        return r.deleteLLMCluster(ctx, oldest(instances))
    }
    return nil
}
Phase Coordination Pattern
Problem: Disaggregated serving requires coordinating prefill and decode phases.
Solution: Router enforces maxPrefillPerDecode ratio and manages KV cache transfer.
Implementation:
func (r *Router) routeRequest(req Request) error {
    // Requests that already carry a KV cache from prefill go straight to decode;
    // the prefill ratio check must not block them.
    if req.hasKVCache {
        return r.sendToDecode(req)
    }

    // Admission control: cap admitted prefill work relative to decode capacity.
    // Written as a multiplication to avoid integer-division truncation.
    prefillCount := r.countPrefillInstances()
    decodeCount := r.countDecodeInstances()
    if prefillCount >= r.maxPrefillPerDecode*decodeCount {
        return r.queueRequest(req, "prefill-backlog")
    }
    return r.sendToPrefill(req)
}
Declarative Infrastructure Pattern
Problem: Manual infrastructure management is error-prone and doesn’t scale.
Solution: CRD-based declarative definitions—operators manage the “how”.
Benefits:
- Single source of truth (Git)
- Self-healing through reconciliation
- Version controlled infrastructure
- Easier testing and validation
Development Approach
Progressive Learning Path
1. Foundation:
- Basic CRDs and RBAC
- Simple operators with Kubebuilder
- Understanding Kubernetes reconciliation loop
2. Monolithic Serving:
- Single TP instance deployment
- Basic request routing
- Simple metrics collection
3. Production Features:
- Router integration with load balancing
- Fleet autoscaling with hysteresis
- Comprehensive observability
4. Advanced:
- Disaggregated prefill/decode serving
- Phase coordination and KV cache transfer
- Multi-metric scaling decisions
Testing Strategy
Integration Tests:
- End-to-end request flows
- CRD creation and reconciliation
- Scaling decision logic
Load Testing:
- Locust for performance verification
- Validate latency budgets
- Measure throughput under load
Failure Scenarios:
- Pod crashes and restarts
- Network partitions
- GPU out-of-memory errors
- Queue overflow conditions
Key Takeaways
Technical Insights
- Kubernetes excels at stateful AI workloads when combined with proper operators
- CRDs extend Kubernetes to handle domain-specific logic (fixed-shape TP)
- Tensor parallelism is the fundamental constraint that drives all architectural decisions
- Disaggregation enables optimization but adds coordination complexity
- Observability is non-negotiable for production LLM serving
Architectural Principles
- Fixed-shape scaling: Scale instances, not pods, for TP models
- Control/data plane separation: Independent evolution and clear boundaries
- Declarative infrastructure: Define desired state, let operators handle implementation
- Hysteresis in scaling: Prevent oscillation through thresholds and cooldowns
Operational Lessons
- Start simple: Progress from monolithic to disaggregated
- Measure everything: Comprehensive metrics drive good decisions
- Failure recovery must be automatic: Manual intervention doesn’t scale
- Cost awareness matters: GPU resources are expensive—optimize utilization
- Documentation is learning: Document as you build, not after you’re done
Future Directions
- Predictive Scaling: ML-based traffic prediction for proactive scaling
- Multi-tenancy: MIG partitioning for cost optimization
- Global Schedulers: Cross-cluster request routing
- Training Integration: Adding training workloads to the same infrastructure