Simple Model API — Production ML Inference Service
A FastAPI service wrapping ResNet-50 for image classification. Production patterns: health probes, correlation IDs, structured errors.
What this is
I built a FastAPI service that wraps a ResNet-50 model for image classification. This log ignores the model weights and focuses entirely on the production infrastructure patterns required to wrap machine learning code in a reliable HTTP API.
Why I built it
Most machine learning tutorials stop exactly at model.predict(). In production, naked prediction code fails due to cold-start latency, memory leaks, and untraceable errors. I built this repository as a strict reference architecture for handling startup gating, correlation tracing, and graceful out-of-memory (OOM) recovery.
The architecture
POST /predict → multipart upload → validation → preprocess → forward pass → softmax → top-k → JSON
GET /health → model-aware readiness probe (503 if not loaded)
GET /info → static service metadata
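The softmax and top-k stages at the end of the /predict pipeline can be sketched in plain Python. This is a minimal illustration, not the service's actual code: the function names softmax and top_k, the k default, and the output keys "label" and "probability" are assumptions, and a real handler would most likely use torch.softmax and torch.topk on the logits tensor instead.

```python
import math
from typing import Dict, List, Sequence, Union


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw logits into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(
    probs: Sequence[float], labels: Sequence[str], k: int = 5
) -> List[Dict[str, Union[str, float]]]:
    """Return the k most probable classes, highest first (hypothetical JSON shape)."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "probability": prob} for label, prob in ranked[:k]]
```

Sorting the full list is fine for a 1000-class ImageNet head; the result is a list of plain dicts, so it serializes to JSON without extra work.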
Key patterns
Lifespan-managed model loading
The application loads the weights into memory exactly once during ASGI server startup, using FastAPI’s lifespan context manager. I added a mandatory warmup forward pass immediately after loading. Without it, the first incoming user request takes 3x to 5x longer, because PyTorch defers one-time setup work (memory allocation, kernel selection and, on GPU, CUDA context initialization) until the first forward pass.
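from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from torchvision.models import ResNet50_Weights, resnet50

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model into memory once. weights=IMAGENET1K_V1 is the
    # non-deprecated equivalent of pretrained=True (torchvision >= 0.13).
    app.state.model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    app.state.model.eval()
    # Run one dummy forward pass so one-time setup happens before traffic arrives
    dummy_input = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        app.state.model(dummy_input)
    yield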
Correlation IDs
Every incoming request receives a unique UUID generated at the gateway layer. The application returns this ID in an X-Correlation-ID header on both successful and error responses. This allows upstream load balancers to match requests to responses without parsing the payload.
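One way to implement this is a small pure-ASGI middleware, sketched below with the standard library only. The class name CorrelationIdMiddleware, the fallback of minting a UUID when the gateway did not send one, and the "correlation_id" key stored under scope["state"] are all assumptions made for illustration; the repository may wire this differently.

```python
import uuid


class CorrelationIdMiddleware:
    """Echo the incoming X-Correlation-ID on the response, or mint one if absent."""

    HEADER = b"x-correlation-id"  # ASGI header names are lowercase bytes

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass lifespan and websocket events through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        incoming = dict(scope.get("headers") or [])
        # Assumption: fall back to a fresh UUID if the gateway did not set the header.
        correlation_id = incoming.get(self.HEADER) or str(uuid.uuid4()).encode("latin-1")
        # Hypothetical key: lets handlers and log formatters read the ID later.
        scope.setdefault("state", {})["correlation_id"] = correlation_id.decode("latin-1")

        async def send_with_id(message):
            # Headers travel only on the response-start message.
            if message["type"] == "http.response.start":
                headers = [
                    pair for pair in message.get("headers", [])
                    if pair[0].lower() != self.HEADER
                ]
                headers.append((self.HEADER, correlation_id))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_id)
```

Because the header is attached in the send wrapper, it lands on every response that passes through the middleware, including error responses produced by exception handlers further down the stack. With FastAPI it could be registered through app.add_middleware(CorrelationIdMiddleware).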
Defensive error handling
Inference can raise RuntimeError exceptions from the underlying C++ backend, for example on dimension mismatches or VRAM fragmentation. The prediction controller wraps the forward pass in a targeted try-except block and formats failures into a clean JSON structure instead of letting the exception surface as an opaque 500 response.
{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Input image tensor shape must be (3, 224, 224)."
  }
}
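The try-except wrapper that produces this envelope might look roughly like the sketch below. Only the envelope shape and the VALIDATION_ERROR code come from the example above; the helper names error_body and safe_predict, the OUT_OF_MEMORY and INFERENCE_ERROR codes, the "result" key, and the HTTP status codes are assumptions. The out-of-memory check relies on PyTorch's CUDA OOM message containing the phrase "out of memory".

```python
from typing import Any, Callable, Dict, Tuple


def error_body(code: str, message: str) -> Dict[str, Any]:
    """Build the structured error envelope shown above."""
    return {"success": False, "error": {"code": code, "message": message}}


def safe_predict(
    predict: Callable[[Any], Any], batch: Any
) -> Tuple[int, Dict[str, Any]]:
    """Run predict(batch) and map failures to (status_code, JSON body)."""
    try:
        return 200, {"success": True, "result": predict(batch)}
    except ValueError as exc:
        # Bad input detected before the forward pass, e.g. a wrong tensor shape.
        return 422, error_body("VALIDATION_ERROR", str(exc))
    except RuntimeError as exc:
        # PyTorch reports backend failures, including CUDA OOM, as RuntimeError.
        if "out of memory" in str(exc).lower():
            # Hypothetical code: tells the client that a retry may succeed.
            return 503, error_body("OUT_OF_MEMORY", "Inference ran out of memory.")
        # Hypothetical code: hide backend details from the client.
        return 500, error_body("INFERENCE_ERROR", "Model inference failed.")
```

Catching only ValueError and RuntimeError keeps the block targeted, so genuine programming errors still propagate. On a GPU deployment, the out-of-memory branch would also be a reasonable place to call torch.cuda.empty_cache() before responding.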
Deployment
The infrastructure uses the same health check specification for local Docker Compose and remote Kubernetes clusters. The probe ties container readiness directly to the application’s internal state: a container receives traffic only once /health reports that the model is loaded.
# Kubernetes deployment snippet
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 5
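The decision behind the /health endpoint that this probe calls can be reduced to a small, framework-agnostic function. This is a sketch under assumptions: the function name health_status and the body fields "status" and "model_loaded" are hypothetical; only the 503-when-not-loaded behavior comes from the architecture notes above.

```python
from typing import Any, Dict, Optional, Tuple


def health_status(model: Optional[Any]) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) for a model-aware readiness probe."""
    if model is None:
        # 503 fails the Kubernetes httpGet probe, which accepts only 200-399.
        return 503, {"status": "unavailable", "model_loaded": False}
    return 200, {"status": "ok", "model_loaded": True}
```

In a FastAPI handler, the model argument would presumably come from getattr(request.app.state, "model", None), with the returned pair passed to JSONResponse(status_code=..., content=...). Until the lifespan startup finishes, the attribute is missing and the probe keeps the pod out of rotation.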
What I learned
Pre-warming the PyTorch execution context during startup is mandatory for predictable latency. Running the warmup pass during the lifespan sequence moves the one-time initialization cost out of the request path, so the very first request sees roughly the same latency as the millionth.
I later appended a standalone batch utility (batch_inference_client.py) to process local image directories efficiently without hitting the HTTP throttling limits during validation tasks.
Status
The production-ready API infrastructure is fully operational.