Introduction

You have trained a machine learning model in a Jupyter notebook. It scores well on your test set. You show the metrics to your team and everyone is impressed. Then comes the question that separates hobby projects from real engineering: how do you get this model into production so that other services, applications, or users can actually call it?

The gap between a notebook prototype and a production deployment is where most ML projects stall. The model needs to be serialized, wrapped in an API, packaged with its dependencies, and deployed in a way that is reproducible, scalable, and observable. This tutorial bridges that gap. You will take a simple scikit-learn classifier, wrap it in a FastAPI REST API, containerize everything with Docker, and have a deployable, containerized service running in under an hour.

Every code snippet in this guide is copy-paste ready. By the end, you will have a Dockerfile, a trained model artifact, a FastAPI server with health checks and metadata endpoints, and a Docker Compose file for orchestration. No fluff, no hand-waving, just working code and the reasoning behind each decision.

Project Structure

Before writing any code, let us establish a clean project layout. Keeping the training script separate from the serving code is a best practice in MLOps because it decouples model creation from model deployment. Here is the full directory structure you will build:

project-structure
ml-docker-deploy/
├── train.py              # Train and serialize the model
├── app.py                # FastAPI application
├── model/
│   └── model.joblib      # Serialized model artifact (generated by train.py)
├── requirements.txt      # Python dependencies
├── Dockerfile            # Container build instructions
├── docker-compose.yml    # Multi-service orchestration
└── .dockerignore         # Files to exclude from the Docker context

The model/ directory holds the serialized model artifact. In a real-world pipeline, this artifact would typically come from a model registry or artifact store like MLflow or DVC, but for this tutorial we will generate it locally with a training script.

Step 1: Train and Save a Model

We need a trained model to deploy. For this tutorial we will use the classic Iris dataset and a RandomForestClassifier from scikit-learn. The Iris dataset is small and ships with sklearn, so there is nothing to download. The important thing here is not the model itself but the pattern: train a model, evaluate it, and serialize it to disk using joblib. This same pattern applies whether you are deploying a simple classifier or a complex deep learning pipeline.

train.py
"""train.py — Train a classifier and save it to disk."""
import json
import os

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ── Load data ────────────────────────────────────────────────
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# ── Train model ──────────────────────────────────────────────
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=4,
    random_state=42,
)
model.fit(X_train, y_train)

# ── Evaluate ─────────────────────────────────────────────────
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.4f}")

# ── Save the model artifact ──────────────────────────────────
os.makedirs("model", exist_ok=True)
model_path = os.path.join("model", "model.joblib")
joblib.dump(model, model_path)
print(f"Model saved to {model_path}")

# ── Save class names for the API to use ──────────────────────
meta = {
    "target_names": list(data.target_names),
    "feature_names": list(data.feature_names),
    "accuracy": round(acc, 4),
}
with open(os.path.join("model", "metadata.json"), "w") as f:
    json.dump(meta, f, indent=2)
print("Metadata saved to model/metadata.json")

Run the training script to generate the model artifact:

bash
pip install scikit-learn joblib
python train.py

You should see output showing a test accuracy of around 1.0 (the Iris dataset is small and its classes are easy to separate, so near-perfect accuracy is expected). More importantly, you now have a model/model.joblib file and a model/metadata.json file. The metadata file stores the class names and feature names so the API can return human-readable predictions instead of raw integer labels.

Step 2: Build the FastAPI Server

Now we wrap the model in a REST API. FastAPI is ideal for ML serving because it is async-capable, generates OpenAPI documentation automatically, and validates request bodies through Pydantic models. The API will load the model once at startup, accept feature arrays via POST requests, and return predictions with class names and probabilities.

app.py
"""app.py — FastAPI server for the ML model."""
import json
import time
from contextlib import asynccontextmanager
from pathlib import Path

import joblib
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

# ── Paths ────────────────────────────────────────────────────
MODEL_DIR = Path("model")
MODEL_PATH = MODEL_DIR / "model.joblib"
META_PATH = MODEL_DIR / "metadata.json"

# ── Global state ─────────────────────────────────────────────
model = None
metadata = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the model and metadata once at server startup."""
    global model, metadata
    if not MODEL_PATH.exists():
        raise RuntimeError(f"Model file not found at {MODEL_PATH}")
    model = joblib.load(MODEL_PATH)
    with open(META_PATH) as f:
        metadata = json.load(f)
    print(f"Model loaded. Classes: {metadata['target_names']}")
    yield
    model = None
    metadata = None


app = FastAPI(
    title="Iris Classifier API",
    version="1.0.0",
    description="Predict Iris flower species from sepal/petal measurements.",
    lifespan=lifespan,
)


# ── Pydantic schemas ─────────────────────────────────────────

class PredictionRequest(BaseModel):
    """Input features for a single prediction."""
    features: list[float] = Field(
        ...,
        min_length=4,
        max_length=4,
        description="Four numeric features: sepal_length, sepal_width, petal_length, petal_width",
        examples=[[5.1, 3.5, 1.4, 0.2]],
    )


class PredictionResponse(BaseModel):
    """The predicted class and confidence scores."""
    predicted_class: str
    confidence: float = Field(..., ge=0.0, le=1.0)
    probabilities: dict[str, float]
    inference_time_ms: float


class BatchPredictionRequest(BaseModel):
    """Input features for multiple predictions."""
    instances: list[list[float]] = Field(
        ...,
        min_length=1,
        max_length=100,
        description="A list of feature arrays, each with 4 numeric values.",
    )


class BatchPredictionResponse(BaseModel):
    predictions: list[PredictionResponse]


# ── Routes ───────────────────────────────────────────────────

@app.post("/predict", response_model=PredictionResponse)
async def predict(payload: PredictionRequest):
    """Return a single prediction with class probabilities."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded.")

    start = time.perf_counter()
    X = np.array([payload.features])
    proba = model.predict_proba(X)[0]
    idx = int(np.argmax(proba))
    elapsed = (time.perf_counter() - start) * 1000

    class_names = metadata["target_names"]
    return PredictionResponse(
        predicted_class=class_names[idx],
        confidence=round(float(proba[idx]), 4),
        probabilities={
            name: round(float(p), 4) for name, p in zip(class_names, proba)
        },
        inference_time_ms=round(elapsed, 2),
    )


@app.post("/predict/batch", response_model=BatchPredictionResponse)
async def predict_batch(payload: BatchPredictionRequest):
    """Return predictions for multiple instances in one call."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded.")

    for i, instance in enumerate(payload.instances):
        if len(instance) != 4:
            raise HTTPException(
                status_code=422,
                detail=f"Instance {i} must have exactly 4 features, got {len(instance)}.",
            )

    start = time.perf_counter()
    X = np.array(payload.instances)
    probas = model.predict_proba(X)
    elapsed = (time.perf_counter() - start) * 1000

    class_names = metadata["target_names"]
    results = []
    for proba in probas:
        idx = int(np.argmax(proba))
        results.append(
            PredictionResponse(
                predicted_class=class_names[idx],
                confidence=round(float(proba[idx]), 4),
                probabilities={
                    name: round(float(p), 4)
                    for name, p in zip(class_names, proba)
                },
                inference_time_ms=round(elapsed / len(payload.instances), 2),
            )
        )
    return BatchPredictionResponse(predictions=results)

There are several deliberate design choices here worth highlighting. The model is loaded once in the lifespan context manager, not on every request. This avoids the overhead of deserialization on every call, which for larger models could add seconds of latency. The response includes inference_time_ms so downstream consumers can monitor performance without guessing. The batch endpoint processes all instances in a single NumPy call, which is significantly faster than looping through individual predictions.

[Figure: server architecture diagram. Production ML serving: FastAPI handles requests, the model runs inference, Docker packages everything.]

Step 3: Add Health Check and Metadata

Production services need more than just a prediction endpoint. Container orchestrators like Kubernetes and Docker Swarm rely on health check endpoints to determine whether a container is alive and ready to serve traffic. A metadata endpoint helps operators and downstream teams understand what version of the model is running without digging through logs. Add the following routes to the bottom of app.py:

app.py
# ── Health and metadata endpoints ────────────────────────────

@app.get("/health")
async def health_check():
    """Liveness probe: returns 200 if the server is running."""
    return {"status": "healthy", "model_loaded": model is not None}


@app.get("/ready")
async def readiness_check():
    """Readiness probe: returns 200 only when the model is loaded."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not ready.")
    return {"status": "ready"}


@app.get("/info")
async def model_info():
    """Return model metadata: class names, feature names, accuracy."""
    if metadata is None:
        raise HTTPException(status_code=503, detail="Metadata not loaded.")
    return {
        "model_type": "RandomForestClassifier",
        "framework": "scikit-learn",
        "target_names": metadata["target_names"],
        "feature_names": metadata["feature_names"],
        "training_accuracy": metadata["accuracy"],
    }

The distinction between /health and /ready matters in orchestrated environments. The health (liveness) probe tells the orchestrator the process is alive. The readiness probe tells it the service is ready to accept traffic. During startup, the server is alive but not ready until the model finishes loading. Kubernetes uses this distinction to avoid routing traffic to containers that are still initializing.
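
If you want to see the readiness probe doing useful work outside Kubernetes, a deployment script or smoke test can poll /ready before sending any prediction traffic. The sketch below is an optional helper, not one of the tutorial's required files, and it assumes the requests package plus the port mapping used later in Step 5.

python
"""Optional smoke-test helper: wait until the API reports ready."""
import sys
import time

import requests

READY_URL = "http://localhost:8000/ready"  # assumes the default port mapping


def wait_until_ready(timeout_s: float = 60.0, poll_every_s: float = 1.0) -> bool:
    """Poll the readiness endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(READY_URL, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(poll_every_s)
    return False


if __name__ == "__main__":
    sys.exit(0 if wait_until_ready() else 1)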

Step 4: Write the Dockerfile

Docker packages your application, its dependencies, and the model artifact into a single portable image. Anyone with Docker installed can run your model without worrying about Python versions, OS differences, or dependency conflicts. Two files are needed: a requirements.txt that pins exact dependency versions, and the Dockerfile.

First, create a requirements.txt file with the exact dependencies:

requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
scikit-learn==1.6.1
joblib==1.4.2
numpy==2.2.2

Now the Dockerfile itself:

Dockerfile
# ── Stage 1: Build dependencies ──────────────────────────────
FROM python:3.12-slim AS builder

WORKDIR /app

# Install dependencies into a virtual env for clean copy later
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# ── Stage 2: Production image ────────────────────────────────
FROM python:3.12-slim

# Security: run as a non-root user
RUN groupadd -r mluser && useradd -r -g mluser mluser

WORKDIR /app

# Copy only the virtual env from the builder stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy application code and model artifact
COPY app.py .
COPY model/ ./model/

# Switch to non-root user
USER mluser

# Expose the port
EXPOSE 8000

# Health check — Docker will mark the container as unhealthy
# if this endpoint fails 3 times in a row
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]

# Start the server with production settings
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

This Dockerfile uses a multi-stage build. The first stage installs all pip dependencies into a virtual environment. The second stage copies only the virtual environment and application code, leaving behind the build tools, pip cache, and anything else that inflates the image. The result is a significantly smaller and more secure image. The HEALTHCHECK instruction tells Docker to periodically call the /health endpoint. If the endpoint fails three times in a row, Docker marks the container as unhealthy, which orchestrators use to trigger restarts or reroute traffic.

Also create a .dockerignore file to keep the build context small and avoid copying unnecessary files into the image:

.dockerignore
__pycache__/
*.pyc
.venv/
.git/
.gitignore
*.md
train.py
.env

Step 5: Build and Run

With the Dockerfile in place, building and running the container takes two commands. The build step installs dependencies and copies the model into the image. The run step starts the container and maps port 8000 on your machine to port 8000 inside the container.

bash
# Build the Docker image
docker build -t ml-api:latest .

# Run the container
docker run -d --name ml-api -p 8000:8000 ml-api:latest

# Check that the container is running
docker ps

# View the startup logs
docker logs ml-api

The -d flag runs the container in detached mode so your terminal stays available. When you run docker ps you should see the container listed with a status of "Up (health: starting)". Once the first health check runs (the Dockerfile sets a 30-second interval), the status switches to "healthy". If it stays at "starting", give the model a moment to finish loading and check docker logs ml-api for errors.

Step 6: Test the API

With the container running, test every endpoint to confirm everything works end to end. Start with the health check:

bash
# Health check
curl http://localhost:8000/health
# {"status": "healthy", "model_loaded": true}

# Readiness check
curl http://localhost:8000/ready
# {"status": "ready"}

# Model metadata
curl http://localhost:8000/info
# {"model_type": "RandomForestClassifier", "framework": "scikit-learn", ...}

Now send a prediction request. The Iris dataset expects four features: sepal length, sepal width, petal length, and petal width, all in centimeters:

bash
# Single prediction — this should return "setosa"
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

Expected response:

json
{
  "predicted_class": "setosa",
  "confidence": 1.0,
  "probabilities": {
    "setosa": 1.0,
    "versicolor": 0.0,
    "virginica": 0.0
  },
  "inference_time_ms": 0.45
}

Test the batch endpoint with multiple instances:

bash
# Batch prediction — mixed classes
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [5.1, 3.5, 1.4, 0.2],
      [6.7, 3.0, 5.2, 2.3],
      [5.9, 3.0, 4.2, 1.5]
    ]
  }'

You should see predictions for setosa, virginica, and versicolor respectively. Also visit http://localhost:8000/docs in your browser to see the automatically generated Swagger UI, where you can test every endpoint interactively.
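
If you would rather exercise the API from Python than from curl, the same calls can be made with the requests library. This client is only a convenience sketch and is not part of the deployed service.

python
"""Call the running API from Python (requires the requests package)."""
import requests

BASE_URL = "http://localhost:8000"

# Single prediction
resp = requests.post(f"{BASE_URL}/predict", json={"features": [5.1, 3.5, 1.4, 0.2]})
resp.raise_for_status()
result = resp.json()
print(result["predicted_class"], result["confidence"])

# Batch prediction
batch = {
    "instances": [
        [5.1, 3.5, 1.4, 0.2],
        [6.7, 3.0, 5.2, 2.3],
        [5.9, 3.0, 4.2, 1.5],
    ]
}
resp = requests.post(f"{BASE_URL}/predict/batch", json=batch)
resp.raise_for_status()
for pred in resp.json()["predictions"]:
    print(pred["predicted_class"], pred["confidence"])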

Watch Your Model File Size

This tutorial uses a small scikit-learn model (a few kilobytes), but real-world models can be hundreds of megabytes or several gigabytes. Large model files inflate your Docker image and slow down build times and deployments. For large models, consider these strategies:

  • Use multi-stage builds (as shown above) to keep the final image lean.
  • Store model artifacts in an external registry like S3, GCS, or MLflow and download them at container startup; a sketch follows this list.
  • Use Docker layer caching: copy requirements.txt and install dependencies before copying the model so that dependency layers are cached across builds.
  • For very large models, mount a volume instead of baking the weights into the image.
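
To make the second strategy concrete, here is a minimal sketch of fetching the artifact from object storage at container startup instead of baking it into the image. It assumes boto3 is installed and that the hypothetical MODEL_BUCKET and MODEL_KEY environment variables point at your artifact; neither is part of this tutorial's code.

python
"""Hypothetical startup helper: download the model from S3 if it is missing."""
import os
from pathlib import Path

import boto3  # assumed extra dependency, not in this tutorial's requirements.txt

MODEL_PATH = Path("model") / "model.joblib"


def ensure_model_artifact() -> Path:
    """Download the model artifact from S3 unless it already exists on disk."""
    if MODEL_PATH.exists():
        return MODEL_PATH
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
    s3 = boto3.client("s3")
    s3.download_file(
        os.environ["MODEL_BUCKET"],  # placeholder, e.g. "my-ml-artifacts"
        os.environ["MODEL_KEY"],     # placeholder, e.g. "iris/model.joblib"
        str(MODEL_PATH),
    )
    return MODEL_PATH

In app.py you would call ensure_model_artifact() at the start of the lifespan handler, before joblib.load, so the image no longer needs to ship the weights.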

Production Checklist

Before deploying to a live environment, walk through this checklist. These are the items that separate a demo from a production service.

  • Structured logging. Use Python's logging module or a library like structlog to emit JSON-formatted logs. Log every prediction with the input shape, predicted class, confidence, and latency. These logs are essential for debugging model drift, identifying slow requests, and building audit trails.
  • CORS configuration. If your API will be called from a browser-based frontend, add CORS middleware. FastAPI provides CORSMiddleware out of the box. Be explicit about allowed origins in production rather than using a wildcard; a minimal setup is sketched after this list.
  • Rate limiting. Protect your API from abuse and runaway clients. Use a library like slowapi that integrates directly with FastAPI, or offload rate limiting to a reverse proxy like Nginx or an API gateway.
  • Monitoring and alerting. Expose Prometheus metrics for request count, latency percentiles, error rates, and model inference time. Use Grafana or Datadog dashboards to visualize trends and set alerts for anomalies like sudden spikes in error rates or latency degradation.
  • Authentication. Add API key validation or OAuth2 bearer tokens using FastAPI's dependency injection system. Store secrets in environment variables or a secrets manager, never hardcoded in source code.
  • Graceful shutdown. Ensure in-flight requests complete before the container stops. Uvicorn handles SIGTERM gracefully by default, but set the Docker stop grace period appropriately with stop_grace_period in your Docker Compose file.
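
As an example of the CORS item above, FastAPI's built-in CORSMiddleware can be added to app.py in a few lines. The origin below is a placeholder for your actual frontend domain.

python
# CORS setup for app.py; the allowed origin is a placeholder, not part of this tutorial.
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://frontend.example.com"],  # replace with your real frontend origin
    allow_methods=["GET", "POST"],
    allow_headers=["Content-Type", "Authorization"],
)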

Going Further: Docker Compose

In a real deployment, your ML API rarely runs alone. You might have a reverse proxy for TLS termination, a Redis instance for caching predictions, or a Prometheus server for metrics. Docker Compose lets you define and run all of these services together with a single command. Here is a compose file that runs the ML API behind Nginx with a Redis cache:

docker-compose.yml
services:
  ml-api:
    build: .
    container_name: ml-api
    ports:
      - "8000:8000"
    environment:
      - LOG_LEVEL=info
      - WORKERS=2
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

  redis:
    image: redis:7-alpine
    container_name: ml-redis
    ports:
      - "6379:6379"
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    container_name: ml-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      ml-api:
        condition: service_healthy
    restart: unless-stopped

Start the entire stack with a single command:

bash
# Start all services in the background
docker compose up -d --build

# Check the status of all containers
docker compose ps

# View logs from the ML API service
docker compose logs -f ml-api

# Tear down the stack when done
docker compose down

The compose file sets resource limits on the ML API container to prevent a single runaway request from consuming all available memory. The Nginx service waits for the ML API health check to pass before starting, which ensures the proxy never routes traffic to an unready backend. This pattern scales well: you can add more services (a model registry, a Celery worker queue, a monitoring stack) by adding more entries to the same file.
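
The compose file starts Redis, but the API code in this tutorial never talks to it. If you want to cache predictions, a minimal sketch with the redis-py client looks like the following; the key scheme and one-hour TTL are assumptions, and you would add redis to requirements.txt.

python
"""Hypothetical prediction cache using redis-py.

Assumes the service name "redis" from docker-compose.yml resolves inside the
Compose network and that the redis package has been added to requirements.txt.
"""
import json

import redis

cache = redis.Redis(host="redis", port=6379, decode_responses=True)


def cached_predict(features: list[float], predict_fn) -> dict:
    """Return a cached result for identical feature vectors, otherwise compute and store it."""
    key = "pred:" + json.dumps(features)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = predict_fn(features)  # e.g. the logic inside the /predict route
    cache.setex(key, 3600, json.dumps(result))  # assumed 1-hour TTL
    return result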

GPU Support in Docker

If your model requires GPU inference, use the nvidia/cuda base image instead of python:3.12-slim, install the NVIDIA Container Toolkit on your host, and pass the --gpus all flag when running the container. In Docker Compose, add "deploy.resources.reservations.devices" with the "nvidia" driver. GPU acceleration can reduce inference latency by 10-100x for deep learning models.

Key Takeaways

  • Separate training from serving. Your training script produces a model artifact that the API server loads at startup. This decoupling means you can retrain and redeploy the model independently without changing any API code.
  • FastAPI gives you automatic validation, interactive docs, and async support with minimal boilerplate. It is one of the best frameworks for wrapping ML models in a REST API.
  • Docker multi-stage builds keep your production images small and secure by discarding build-time dependencies. Always run containers as a non-root user.
  • Health checks and readiness probes are not optional. They tell orchestrators when your container is alive and when it is ready to serve traffic, enabling zero-downtime deployments.
  • Include inference timing in your responses. It costs nothing to measure and gives downstream consumers the data they need to monitor performance without external instrumentation.
  • Docker Compose is the simplest path from a single container to a multi-service stack. Use it to add reverse proxies, caches, and monitoring alongside your model server.
  • For production, always add structured logging, CORS configuration, rate limiting, authentication, and monitoring. These concerns apply to every ML API, regardless of the model or framework you use.

The pattern demonstrated in this tutorial generalizes far beyond Iris classification. Whether you are serving a large language model, an image classifier, a recommendation engine, or a time-series forecaster, the architecture remains the same: train and serialize the model, wrap it in a FastAPI server, containerize with Docker, and orchestrate with Compose. Master this workflow once and you can deploy any model to production.