Simple Model API — Production ML Inference Service

What this is

I built a FastAPI service that wraps a ResNet-50 model for image classification. This log ignores the model weights and focuses entirely on the production infrastructure patterns required to wrap machine learning code in a reliable HTTP API.

Why I built it

Most machine learning tutorials stop exactly at model.predict(). In production, naked prediction code fails due to cold-start latency, memory leaks, and untraceable errors. I built this repository as a strict reference architecture for handling startup gating, correlation tracing, and graceful out-of-memory (OOM) recovery.

The architecture

POST /predict  →  multipart upload  →  validation  →  preprocess  →  forward pass  →  softmax  →  top-k  →  JSON
GET  /health   →  model-aware readiness probe (503 if not loaded)
GET  /info     →  static service metadata
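
The tail of the /predict pipeline (softmax, top-k, JSON) can be sketched in plain Python. This is a minimal illustration, not the service's actual code: the function names, the "predictions" field, and the rounding are assumptions, and the real service presumably does this on tensors rather than lists.

```python
import math


def softmax(logits):
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=5):
    # Pair each label with its probability and keep the k most likely.
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "probability": round(p, 4)} for label, p in ranked[:k]]


def build_response(logits, labels, k=5):
    # Hypothetical success envelope, mirroring the "success" flag of the error format.
    return {"success": True, "predictions": top_k(logits, labels, k)}
```

Calling build_response([2.0, 1.0, 0.1], ["cat", "dog", "fox"], k=2) returns the two most probable labels in descending order.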

Key patterns

Lifespan-managed model loading

The application loads the weights into memory exactly once, during ASGI server startup, using FastAPI’s lifespan context manager. A mandatory warmup forward pass runs immediately after loading. Without it, the first incoming user request takes 3x to 5x longer, largely because PyTorch defers one-time setup work, such as memory allocation and backend kernel selection, to the first forward pass.

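from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from torchvision.models import ResNet50_Weights, resnet50

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model into memory once, at startup.
    # weights= replaces the deprecated pretrained=True (torchvision >= 0.13).
    app.state.model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    app.state.model.eval()

    # Warmup pass: pay PyTorch's one-time first-call cost before accepting traffic
    dummy_input = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        app.state.model(dummy_input)
    yield

# Register the lifespan handler so it runs before the first request is served
app = FastAPI(lifespan=lifespan)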

Correlation IDs

Every incoming request receives a unique UUID generated at the gateway layer. The application returns this ID in an X-Correlation-ID header on both successful and error responses. This allows upstream load balancers to match each response to its request without parsing the payload.
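
One way to attach the header to every response is a small pure-ASGI middleware. The sketch below is an illustration under assumptions, not the repository's code: the class name, the fallback to a locally generated UUID when the gateway sent no ID, and the demo helpers are all hypothetical.

```python
import asyncio
import uuid

HEADER = b"x-correlation-id"  # ASGI header names are lowercase bytes


class CorrelationIdMiddleware:
    """Echo the gateway's correlation ID on every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        # Reuse the gateway's ID; generate one only if it is missing (assumption).
        incoming = dict(scope.get("headers") or [])
        correlation_id = incoming.get(HEADER, b"").decode("latin-1") or str(uuid.uuid4())

        async def send_with_id(message):
            # Headers can only be set on the message that starts the response.
            if message["type"] == "http.response.start":
                headers = [(k, v) for k, v in message.get("headers", []) if k.lower() != HEADER]
                headers.append((HEADER, correlation_id.encode("latin-1")))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_id)


async def _echo_app(scope, receive, send):
    # Stand-in for the real application (hypothetical).
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def response_headers_for(request_headers):
    """Drive one fake request through the middleware; return the response headers."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "headers": request_headers}
    asyncio.run(CorrelationIdMiddleware(_echo_app)(scope, receive, send))
    return dict(sent[0]["headers"])
```

In a FastAPI application, a class of this shape could be registered with app.add_middleware(CorrelationIdMiddleware). Because the wrapper sits outside the route handlers, responses produced by error handlers inside the app pass through it and receive the same header.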

Defensive error handling

Inference can raise RuntimeError exceptions from PyTorch's underlying C++ backend because of dimension mismatches or VRAM fragmentation. The prediction controller wraps the forward pass in a targeted try-except block and formats failures into a clean JSON structure instead of letting the exception escape as an unstructured 500 response.

{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Input image tensor shape must be (3, 224, 224)."
  }
}
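
A minimal sketch of how such a wrapper might produce this envelope, kept framework-free so it runs on its own. The helper names, the INFERENCE_ERROR code, and the 422/500 status codes are assumptions for illustration; only the envelope shape and VALIDATION_ERROR come from the example above.

```python
import logging

logger = logging.getLogger("inference")


def error_envelope(code, message):
    # Same shape as the JSON error structure shown above.
    return {"success": False, "error": {"code": code, "message": message}}


def safe_predict(run_inference, image):
    """Return (http_status, json_body) instead of letting exceptions propagate."""
    try:
        return 200, {"success": True, "predictions": run_inference(image)}
    except ValueError as exc:
        # Caller-side problem, for example an image with the wrong shape.
        return 422, error_envelope("VALIDATION_ERROR", str(exc))
    except RuntimeError:
        # Backend failure: log the traceback, return a generic message to the client.
        logger.exception("inference failed")
        return 500, error_envelope("INFERENCE_ERROR", "Model inference failed.")
```

Catching only ValueError and RuntimeError keeps the block targeted: anything unexpected still surfaces through the framework's default handling rather than being silently relabeled.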

Deployment

The same health check specification is used for local Docker Compose and remote Kubernetes clusters. Because /health returns 503 until the model is loaded, the orchestrator marks a container as ready only when the application can actually serve predictions.

# Kubernetes deployment snippet
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 5
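
The probe only works if /health reflects the model's state. A minimal, framework-free sketch of that decision follows; ServiceState, the warmed_up flag, and the response bodies are hypothetical names, and only the 503-until-loaded behavior comes from the architecture described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceState:
    model: Optional[object] = None  # set once the weights are loaded
    warmed_up: bool = False         # set after the warmup forward pass


def health(state):
    """Return (http_status, json_body) for GET /health."""
    if state.model is None or not state.warmed_up:
        # Not ready: the readiness probe fails and no traffic is routed here.
        return 503, {"status": "loading"}
    return 200, {"status": "ok"}
```

With periodSeconds: 5, the orchestrator re-checks every five seconds, so the container starts receiving traffic shortly after the state flips to ready.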

What I learned

Pre-warming the PyTorch execution context during startup is mandatory for predictable latency. Running the warmup pass during the lifespan sequence moves the initial latency spike out of the request path, so the very first request matches the performance profile of the millionth.

I later added a standalone batch utility (batch_inference_client.py) that processes local image directories efficiently during validation tasks without hitting the HTTP throttling limits.

Status

The production-ready API infrastructure is fully operational.