Transmission 008 · 2026-06-08

ResNet-50 takes 60 seconds to load. The cluster didn't know that.

A slow PyTorch startup window caused 502s on freshly rolled pods. I added resource ceilings, a dual-metric HPA, and startup probes to close the gap.

The constraint

I started the machine this morning and the cluster was stopped. kubectl get pods refused the connection entirely. After minikube start, the ingress controller spent several minutes failing to acquire its leader-election lease from the API server. I ran the load test before that had settled.

50 requests. 50 failures. Every single one a socket drop.

10:37:08 │ INFO │ Progress:  5/50 (10%) —  0 ok,  5 err
10:37:08 │ INFO │ Progress: 10/50 (20%) —  0 ok, 10 err
10:37:10 │ INFO │ Progress: 15/50 (30%) —  0 ok, 15 err
...
10:37:16 │ INFO │ Progress: 50/50 (100%) — 0 ok, 50 err

║ Successes............................... 0     ║
║ Failures................................ 50    ║
║ Success Rate............................ 0.0%  ║
║ Error Code Breakdown:                         ║
║   HTTP N/A: 50 occurrence(s)                  ║

The ingress controller logs confirmed the problem: its leader-election lease requests to the API server were timing out during the restart window.

E0608 07:25:56.623698  6 leaderelection.go:452] "Error retrieving lease lock"
  err="Get https://10.96.0.1:443/.../ingress-nginx-leader: context deadline exceeded"

Once the control plane stabilised, I ran the test again. 49 of 50 succeeded. One HTTP 502 slipped through.

║ Successes............................... 49   ║
║ Failures................................ 1     ║
║ Success Rate............................ 98.0% ║
║ Error Code Breakdown:                         ║
║   HTTP 502: 1 occurrence(s)                   ║

What happened

I matched the 502 to a specific pod. The ingress access log pointed at 10.244.0.57.

10.244.0.1 - - [08/Jun/2026:07:53:40 +0000] "POST /predict HTTP/1.1" 502 150
  [default-simple-model-api-service-8000] [] 10.244.0.57:8000 0 2.757 502

kubectl get pods -o wide matched that IP to simple-model-api-deployment-56b5cd5474-w54kn. I pulled its logs.

[2026-06-08 07:24:01 +0000] [1]  [INFO] Starting gunicorn 22.0.0
[2026-06-08 07:24:02 +0000] [10] [INFO] Waiting for application startup.
[2026-06-08 07:24:02 +0000] [11] [INFO] Waiting for application startup.
[2026-06-08 07:25:01 +0000] [10] [INFO] Application startup complete.
[2026-06-08 07:25:01 +0000] [11] [INFO] Application startup complete.

The workers started at 07:24:02 and finished loading at 07:25:01. That is 59 seconds, because PyTorch loads the full ResNet-50 weights at startup. For that whole window the pod was in the Running state but could not yet serve traffic. A request was routed to it anyway, and it returned nothing.

The Deployment had no startup probe. Kubernetes had no way to know the difference between “container is running” and “model is loaded.”

The pods also had no memory limits. Five pods initialising ResNet-50 in parallel puts significant pressure on the host. One misconfigured rollout and the whole machine runs out of memory.


The resolution

I enabled the metrics server, then applied three manifests: a ConfigMap for environment variables, an updated Deployment with resource boundaries and lifecycle probes, and an HPA.

minikube addons enable metrics-server
kubectl apply -f kubernetes/configmap.yaml
kubectl apply -f kubernetes/deployment.yaml
kubectl apply -f kubernetes/hpa.yaml

The resource boundaries:

Setting             Value   Reason
requests.memory     1Gi     Reserves room for PyTorch weights at load time
limits.memory       1.5Gi   Caps growth so parallel startups don’t OOM host
HPA CPU target      70%     Triggers scale-out during inference load
HPA memory target   80%     Triggers scale-out if weight caching grows
Min replicas        3       Keeps a baseline for the HPA to work with
Max replicas        10      Bounds the cluster to what the host can support
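
For reference, the HPA rows of that table could be expressed roughly as follows. This is a sketch, not the actual kubernetes/hpa.yaml, which is not reproduced in this post. The object names come from the command output later in this entry; the autoscaling/v2 layout and the use of Utilization targets are assumptions.

```yaml
# Sketch of kubernetes/hpa.yaml (assumed layout, not the real file).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: simple-model-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simple-model-api-deployment
  minReplicas: 3          # baseline from the table
  maxReplicas: 10         # upper bound from the table
  metrics:
    # Utilization is measured as a percentage of each container's
    # resource *request*, averaged across the pods.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

With two metrics, the HPA works out a desired replica count for each and uses the larger one, so either CPU or memory pressure alone is enough to trigger scale-out. Because utilization is relative to requests, the CPU target only works if the containers declare a CPU request.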

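The startup probe gives each pod up to 90 seconds to pass its health check. Until it passes, the readiness and liveness probes are held off and Kubernetes routes no traffic to the pod; if it never passes, the container is restarted. After that, readiness and liveness probes run on the same /health endpoint.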
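
A Deployment along these lines would produce that behaviour. Treat it as a sketch under assumptions rather than the real kubernetes/deployment.yaml. The Deployment name, image tag, port 8000, /health path, and memory figures come from this post. The labels, container name, CPU request, pull policy, and the exact probe timings (5 seconds x 18 attempts = 90 seconds) are illustrative guesses.

```yaml
# Sketch of kubernetes/deployment.yaml (assumed layout, not the real file).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-model-api-deployment
spec:
  # replicas is left unset so the HPA owns the replica count.
  selector:
    matchLabels:
      app: simple-model-api            # hypothetical label
  template:
    metadata:
      labels:
        app: simple-model-api          # hypothetical label
    spec:
      containers:
        - name: simple-model-api       # hypothetical container name
          image: simple-model-api:latest
          imagePullPolicy: IfNotPresent  # assumption: use the locally built image
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: 1Gi
              cpu: 500m                # assumption: not stated in the post
            limits:
              memory: 1536Mi           # same quantity as 1.5Gi
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 5           # assumed interval
            failureThreshold: 18       # 18 x 5s = 90s startup budget
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10          # assumed interval
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10          # assumed interval
```

The startup budget is roughly failureThreshold x periodSeconds, so any pair whose product is 90 gives the same window. A 59-second model load fits inside it with about half a minute to spare.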

I rebuilt the image inside the Minikube Docker daemon so the cluster picked it up without a registry pull, then triggered a rolling restart and watched it finish cleanly.

eval $(minikube docker-env)
docker build -t simple-model-api:latest .
kubectl rollout restart deployment/simple-model-api-deployment
kubectl rollout status deployment/simple-model-api-deployment --timeout=300s
deployment "simple-model-api-deployment" successfully rolled out

After the rollout, the HPA reported both metrics below their targets.

$ kubectl get hpa simple-model-api-hpa
NAME                   REFERENCE                                TARGETS                        MINPODS   MAXPODS   REPLICAS
simple-model-api-hpa   Deployment/simple-model-api-deployment   cpu: 1%/70%, memory: 73%/80%   3         10        5

A port-forward smoke test confirmed the application was healthy end to end.

$ curl http://localhost:8080/health
{"success":true,"status":"healthy","model":"ResNet-50","version":"1.0.0"}

The cluster now knows ResNet-50 is slow to wake up, and it waits.
