AI & Machine Learning

Zero-Trust Local AI: Hardening vLLM Docker Containers Behind Caddy with mTLS and OIDC for Secure Production Inference

By Geethu · 12 min read

Running large language models locally is no longer a novelty; by 2026 it is a standard pattern for teams that need predictable latency, data residency, and cost control. vLLM has become a common serving layer because it exposes an OpenAI-compatible HTTP API and is designed for high-throughput inference.

The security problem is that “local” does not automatically mean “safe.” A production inference endpoint is still a high-value target:

  • It can leak sensitive prompts and generated data.
  • It can be abused for expensive GPU denial-of-service.
  • It can become a pivot point into the rest of the network if container boundaries and credentials are weak.

A zero-trust posture assumes the network is hostile—even inside a homelab or “internal-only” VLAN. The practical implication is that every hop must be authenticated, authorized, encrypted, observable, and least-privileged.

This article presents an implementation-focused pattern for hardening vLLM in Docker behind Caddy, using:

  • mTLS for device/workload identity and encrypted transport.
  • OIDC for user and service authentication with modern SSO.
  • Network isolation + container hardening to reduce blast radius.
  • Metrics/logs to debug failure modes and detect abuse.

The goal is secure production inference without turning the stack into a bespoke identity platform.

Architecture Overview

Components

  • vLLM: Runs the model and serves an OpenAI-compatible API over HTTP (default 8000).
  • Internal TLS sidecar (Nginx): Terminates mTLS for upstream traffic to vLLM (because vLLM typically serves plain HTTP inside the container network).
  • Caddy: Edge reverse proxy, TLS termination, client certificate verification, and routing.
  • OIDC gateway (oauth2-proxy): Handles OIDC login, session cookies, and identity headers for apps behind a reverse proxy (paired with Caddy forward_auth).
  • Prometheus/Grafana (optional): Scrapes vLLM /metrics and visualizes key latency/queue/GPU cache signals. vLLM exposes Prometheus-compatible metrics at /metrics.

Traffic flow

Client → Caddy (mTLS + TLS)
Client presents a certificate issued by an internal CA. Caddy requires and verifies it.

Caddy → oauth2-proxy (forward_auth)
For protected routes, Caddy calls oauth2-proxy to verify an authenticated OIDC session.

Caddy → Nginx sidecar (mTLS)
Even after the edge is authenticated, the internal hop is also authenticated with mTLS. This prevents any other container on the network from talking directly to inference.

Nginx → vLLM (loopback HTTP)
Nginx forwards to 127.0.0.1:8000. For loopback to work across containers, the two containers must share a network namespace; in Compose this is done with network_mode: "service:…" (in Kubernetes, co-locating them in one pod achieves the same thing).

Why this composition

  • mTLS provides a strong, non-phishable identity factor at the transport layer.
  • OIDC provides centralized user/service identity and policy controls (MFA, device posture, conditional access) without hand-rolled auth.
  • Defense in depth: Compromise of one layer does not automatically grant inference access.

Implementation Breakdown

1) Certificates: internal CA and mTLS materials

Use an internal CA (Smallstep step-ca, Vault PKI, or an enterprise PKI). Smallstep is a common choice for mTLS automation in infrastructure environments.

Create a root CA and issue:

  • Server cert for ai.example.internal (Caddy edge)
  • Client certs for operators/services
  • Server cert for the internal Nginx mTLS endpoint
  • Client cert for Caddy → Nginx upstream authentication

Example with step CLI (illustrative; adapt to your PKI):

# Root CA (offline is preferable)
step certificate create "Local AI Root CA" root_ca.crt root_ca.key \
  --profile root-ca --no-password --insecure

# Issue an edge server cert for Caddy
step certificate create ai.example.internal caddy.crt caddy.key \
  --profile leaf --ca root_ca.crt --ca-key root_ca.key \
  --san ai.example.internal --san ai --no-password --insecure

# Issue a client cert for a human operator/device
step certificate create "DevOps Laptop" client-devops.crt client-devops.key \
  --profile leaf --ca root_ca.crt --ca-key root_ca.key \
  --san devops@example.com \
  --no-password --insecure

Operational notes:

  • Store root CA private keys offline.
  • Rotate leaf certs aggressively (days/weeks), automate issuance, and revoke promptly.
  • Separate Caddy’s public TLS identity from mTLS trust roots when possible (don’t reuse the same CA for everything).
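Before wiring certificates into Caddy, it is worth sanity-checking that a leaf chains to the intended root and carries the clientAuth EKU, which is what client-certificate verification ultimately depends on. A throwaway reproduction with plain openssl (names illustrative; adapt to your PKI):

```shell
# Throwaway root CA (in real deployments the root key stays offline)
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=Test Root CA" -days 1

# CSR + leaf cert signed with the clientAuth EKU
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
  -subj "/CN=test-client"
printf 'extendedKeyUsage=clientAuth\n' > eku.cnf
openssl x509 -req -in leaf.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out leaf.crt -days 1 -extfile eku.cnf

# Chain and EKU checks
openssl verify -CAfile ca.crt leaf.crt            # prints: leaf.crt: OK
openssl x509 -in leaf.crt -noout -ext extendedKeyUsage
```

A leaf that fails either check will be rejected at the edge with an "unknown ca" or "bad certificate" alert, which is the failure mode covered later in this article.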

Caddy supports client certificate authentication configuration via its TLS settings, including client_auth with trusted_ca_cert_file.

2) Docker Compose: isolated network + hardened services

This Compose file:

  • Puts only Caddy on the “public” network.
  • Keeps vLLM isolated on an internal network.
  • Forces all traffic to vLLM through the Nginx mTLS sidecar.
  • Adds conservative container hardening defaults.
services:
  caddy:
    image: caddy:2.8
    restart: unless-stopped
    ports:
      - "443:443"
      - "80:80"
    networks:
      - public
      - internal
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      - ./pki/caddy.crt:/etc/caddy/certs/caddy.crt:ro
      - ./pki/caddy.key:/etc/caddy/certs/caddy.key:ro
      - ./pki/root_ca.crt:/etc/caddy/pki/root_ca.crt:ro
      - ./pki/caddy_upstream_client.crt:/etc/caddy/pki/up_client.crt:ro
      - ./pki/caddy_upstream_client.key:/etc/caddy/pki/up_client.key:ro
      - caddy_data:/data
      - caddy_config:/config
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

  oauth2-proxy:
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
    restart: unless-stopped
    networks:
      - internal
    environment:
      OAUTH2_PROXY_PROVIDER: oidc
      OAUTH2_PROXY_OIDC_ISSUER_URL: "https://idp.example.com/realms/prod"
      OAUTH2_PROXY_CLIENT_ID: "local-ai"
      OAUTH2_PROXY_CLIENT_SECRET: "${OAUTH2_PROXY_CLIENT_SECRET}"
      OAUTH2_PROXY_COOKIE_SECRET: "${OAUTH2_PROXY_COOKIE_SECRET}"
      # "*" accepts any authenticated email; restrict to your org's domain(s) in production
      OAUTH2_PROXY_EMAIL_DOMAINS: "*"
      OAUTH2_PROXY_REDIRECT_URL: "https://ai.example.internal/oauth2/callback"
      OAUTH2_PROXY_HTTP_ADDRESS: "0.0.0.0:4180"
      OAUTH2_PROXY_UPSTREAMS: "static://200"
      OAUTH2_PROXY_REVERSE_PROXY: "true"
      OAUTH2_PROXY_SET_XAUTHREQUEST: "true"
      OAUTH2_PROXY_WHITELIST_DOMAINS: ".example.internal"
      OAUTH2_PROXY_COOKIE_SECURE: "true"
      OAUTH2_PROXY_COOKIE_SAMESITE: "lax"
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

  vllm:
    image: vllm/vllm-openai:latest # pin a specific version in production
    restart: unless-stopped
    # Share the sidecar's network namespace so loopback traffic between
    # Nginx and vLLM never leaves the pair of containers.
    network_mode: "service:vllm-mtls"
    # The image's entrypoint already launches the OpenAI-compatible server,
    # so only flags are passed here. API key enforcement is still useful
    # for defense-in-depth.
    command: >
      --model NousResearch/Meta-Llama-3-8B-Instruct
      --dtype auto
      --host 127.0.0.1
      --port 8000
      --api-key ${VLLM_API_KEY}
      --disable-access-log-for-endpoints "/health,/metrics"
    volumes:
      - vllm_models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    # Optional: keep container writable if your model cache needs it; otherwise prefer read_only.
    security_opt:
      - no-new-privileges:true

  vllm-mtls:
    image: nginx:1.27-alpine
    restart: unless-stopped
    networks:
      - internal
    # No published ports: Caddy reaches the sidecar at vllm-mtls:8443
    # over the internal network only.
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./pki/root_ca.crt:/etc/nginx/pki/root_ca.crt:ro
      - ./pki/vllm_server.crt:/etc/nginx/pki/server.crt:ro
      - ./pki/vllm_server.key:/etc/nginx/pki/server.key:ro
    read_only: true
    tmpfs:
      # Nginx needs a few writable paths even with a read-only root
      - /var/cache/nginx
      - /var/run
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

networks:
  public:
  internal:
    internal: true

volumes:
  caddy_data:
  caddy_config:
  vllm_models:

Why these choices matter:

  • internal: true removes external routing: containers on the internal network cannot reach the internet, and nothing outside the host can reach them. Only Caddy, attached to both networks, bridges traffic in.
  • vLLM binds to 127.0.0.1 and shares the sidecar's network namespace (network_mode: "service:vllm-mtls"), so only the sidecar can reach it.
  • Because the internal network has no egress, model weights must be pre-populated in the vllm_models volume (or pulled once with egress temporarily enabled).
  • Removing access logs for /health and /metrics reduces noise; vLLM supports this via --disable-access-log-for-endpoints.
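The Compose file also expects two secrets in the environment. The oauth2-proxy cookie secret must be exactly 16, 24, or 32 bytes, base64url-encoded; the one-liner below, matching oauth2-proxy's documented recipe, generates a 32-byte value:

```shell
# 32 random bytes, base64-encoded, then made URL-safe for oauth2-proxy
openssl rand -base64 32 | tr -- '+/' '-_'
```

Feed the output into OAUTH2_PROXY_COOKIE_SECRET; the VLLM_API_KEY can be any high-entropy string generated the same way.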

3) Nginx sidecar: enforce mTLS to the model API

nginx.conf terminates TLS and requires client certificates for any request. It proxies to vLLM over loopback HTTP.

events {}

http {
  server {
    listen 8443 ssl;

    ssl_certificate     /etc/nginx/pki/server.crt;
    ssl_certificate_key /etc/nginx/pki/server.key;

    # mTLS requirement
    ssl_client_certificate /etc/nginx/pki/root_ca.crt;
    ssl_verify_client on;

    # Basic hardening
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;

    # Limit request sizes to reduce abuse
    client_max_body_size 2m;

    location / {
      proxy_pass http://127.0.0.1:8000;
      proxy_http_version 1.1;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-Proto https;
      proxy_set_header X-Forwarded-For $remote_addr;

      # Streaming responses (SSE) require unbuffered proxying
      proxy_buffering off;

      # Optional timeouts (tune for streaming)
      proxy_read_timeout 300s;
      proxy_send_timeout 300s;
    }
  }
}

Key point: Even if an attacker lands on the internal Docker network, they still cannot talk to inference without presenting a valid client certificate trusted by the sidecar.
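The sidecar's behavior is easy to reproduce in miniature: a TLS server with required client verification refuses clients that cannot present a CA-signed certificate. The sketch below simulates this with Python's stdlib ssl module and a throwaway openssl-generated CA (purely illustrative; it is not part of the deployment, and it assumes the openssl CLI is on PATH):

```python
import os
import socket
import ssl
import subprocess
import tempfile
import threading

work = tempfile.mkdtemp()

def path(name: str) -> str:
    return os.path.join(work, name)

def sh(*args: str) -> None:
    subprocess.run(args, check=True, capture_output=True)

# Throwaway CA, then a server and a client leaf cert signed by it.
sh("openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
   "-keyout", path("ca.key"), "-out", path("ca.crt"),
   "-subj", "/CN=demo-ca", "-days", "1")
with open(path("ext.cnf"), "w") as f:
    f.write("basicConstraints=CA:FALSE\nextendedKeyUsage=serverAuth,clientAuth\n")
for name in ("server", "client"):
    sh("openssl", "req", "-newkey", "rsa:2048", "-nodes",
       "-keyout", path(name + ".key"), "-out", path(name + ".csr"),
       "-subj", "/CN=" + name)
    sh("openssl", "x509", "-req", "-in", path(name + ".csr"),
       "-CA", path("ca.crt"), "-CAkey", path("ca.key"), "-CAcreateserial",
       "-out", path(name + ".crt"), "-days", "1", "-extfile", path("ext.cnf"))

server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain(path("server.crt"), path("server.key"))
server_ctx.load_verify_locations(path("ca.crt"))
server_ctx.verify_mode = ssl.CERT_REQUIRED  # equivalent of nginx's ssl_verify_client on

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]

def serve_once() -> None:
    conn, _ = listener.accept()
    try:
        tls = server_ctx.wrap_socket(conn, server_side=True)
        tls.sendall(b"ok")
        tls.close()
    except (ssl.SSLError, OSError):
        conn.close()  # handshake rejected, e.g. no client certificate

def attempt(with_client_cert: bool) -> bool:
    t = threading.Thread(target=serve_once)
    t.start()
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False  # demo certs carry no SANs; the chain is still verified
    ctx.load_verify_locations(path("ca.crt"))
    if with_client_cert:
        ctx.load_cert_chain(path("client.crt"), path("client.key"))
    try:
        s = ctx.wrap_socket(socket.create_connection(("127.0.0.1", port)))
        data = s.recv(2)
        s.close()
        return data == b"ok"
    except (ssl.SSLError, ConnectionError):
        return False
    finally:
        t.join()

print(attempt(with_client_cert=False))  # False: rejected without a client cert
print(attempt(with_client_cert=True))   # True: mutual TLS succeeds
```

The same two outcomes are exactly what an attacker on the internal Docker network versus a legitimately configured Caddy will see against the Nginx sidecar.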

4) Caddy: edge TLS+mTLS, OIDC gate, and upstream mTLS

Caddy terminates public TLS, verifies client certificates, then uses forward_auth to oauth2-proxy for OIDC. Finally it proxies to the Nginx sidecar over mTLS, presenting its own upstream client certificate.

Caddyfile:

{
  # Reduce information leakage
  servers {
    protocols h1 h2
  }
}

ai.example.internal {
  encode zstd gzip

  # Edge TLS + require client certs (mTLS)
  tls /etc/caddy/certs/caddy.crt /etc/caddy/certs/caddy.key {
    client_auth {
      mode require_and_verify
      trusted_ca_cert_file /etc/caddy/pki/root_ca.crt
    }
  }

  # Health endpoint for load balancers (still mTLS-protected, and excluded
  # from the OIDC check below so probes don't need a session)
  respond /health 200

  # oauth2-proxy's own endpoints (login, callback, sign-out) must be
  # reachable by the browser and must bypass the auth check itself
  handle /oauth2/* {
    reverse_proxy oauth2-proxy:4180
  }

  # OIDC authentication via oauth2-proxy
  # forward_auth is Caddy’s opinionated auth_request-style integration.
  @protected not path /oauth2/* /health
  forward_auth @protected oauth2-proxy:4180 {
    uri /oauth2/auth
    copy_headers X-Auth-Request-User X-Auth-Request-Email Authorization Set-Cookie

    # Send unauthenticated browsers into the login flow
    @unauthorized status 401
    handle_response @unauthorized {
      redir * /oauth2/start?rd={scheme}://{host}{uri}
    }
  }

  # Protect vLLM OpenAI-compatible endpoints
  @openai_api {
    path /v1/* /metrics
  }

  handle @openai_api {
    reverse_proxy vllm-mtls:8443 {
      transport http {
        tls
        tls_trusted_ca_certs /etc/caddy/pki/root_ca.crt

        # Client cert for Caddy -> Nginx sidecar mTLS
        tls_client_auth /etc/caddy/pki/up_client.crt /etc/caddy/pki/up_client.key
      }
    }
  }

  # Default deny
  respond 404
}

Why the layered controls matter:

  • Client mTLS at the edge blocks anonymous network scanning and forces device identity before anything else.
  • OIDC via forward_auth adds user identity, MFA, group-based policy, and session management.
  • Upstream mTLS prevents lateral movement inside the container network and avoids “any container can call inference.”

Observability and Debugging

Metrics

vLLM exposes Prometheus-compatible metrics at /metrics on the OpenAI-compatible server. Key signals to scrape and alert on:

  • Request queue depth (waiting/running)
  • Time-to-first-token and per-token latency histograms
  • GPU KV cache usage / cache hit behavior
  • Error counters (5xx, request failures)

Operational guidance:

  • Scrape /metrics from the internal network only.
  • If /metrics must be exposed, protect it with the same mTLS + OIDC policy as the inference endpoints.
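A Prometheus scrape job that honors the mTLS boundary looks like the fragment below, assuming Prometheus has been issued its own client certificate from the internal CA (file paths and the job name are illustrative):

```yaml
scrape_configs:
  - job_name: "vllm"
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ["vllm-mtls:8443"]
    tls_config:
      ca_file: /etc/prometheus/pki/root_ca.crt      # trust the internal CA
      cert_file: /etc/prometheus/pki/prom_client.crt
      key_file: /etc/prometheus/pki/prom_client.key
      server_name: vllm-mtls                        # must match a SAN on the sidecar cert
```

This keeps metrics collection subject to the same workload-identity policy as inference traffic, rather than punching a plaintext hole for monitoring.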

Logs

Collect logs from:

  • Caddy access logs: request metadata, auth outcomes, upstream errors.
  • oauth2-proxy logs: OIDC handshake, cookie/session validation, redirects.
  • Nginx TLS logs: mTLS handshake failures (bad cert, unknown CA).
  • vLLM logs: model load events, OOM errors, request failures.

vLLM supports suppressing access logs for high-frequency endpoints to reduce log noise.
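On the Caddy side, structured access logging is enabled per site block; a minimal JSON-formatted configuration (file path and rotation values illustrative):

```
ai.example.internal {
  # ...existing tls / forward_auth / reverse_proxy config...

  log {
    output file /data/access.log {
      roll_size 50MiB
      roll_keep 5
    }
    format json
  }
}
```

JSON output makes auth outcomes and upstream errors straightforward to ship into whatever log pipeline already watches the rest of the stack.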

Failure modes and common misconfigurations

mTLS handshake fails at the edge

Symptoms: browser shows “bad certificate,” curl fails with tlsv1 alert unknown ca.

Causes: wrong trust root in Caddy, missing client cert, client cert not intended for clientAuth EKU.

OIDC loop / redirect storms

Symptoms: repeated redirects to /oauth2/start, never reaches API.

Causes: cookie secret mismatch, wrong redirect URL, domain mismatch, missing --reverse-proxy (oauth2-proxy setting).

Caddy → upstream fails

Symptoms: 502/504 from Caddy.

Causes: Caddy not presenting upstream client cert, Nginx requires mTLS but Caddy configured without it, wrong trusted CA.

Inference latency spikes

Symptoms: increased time-to-first-token, queue depth grows.

Causes: GPU saturation, KV cache pressure, too-large max_tokens, insufficient batching headroom, CPU throttling.

Security Considerations

Attack surface

  • Edge proxy (TLS termination, auth routing)
  • OIDC gateway (cookie/session handling)
  • Inference API (high-cost compute, sensitive payloads)
  • Model artifacts and cache (supply chain and integrity)

Hardening strategies

Least privilege containers

no-new-privileges, drop Linux caps, read-only filesystems where feasible.

Network isolation

  • Separate public and internal networks; publish only 443/80 from Caddy.
  • Bind vLLM to loopback and force sidecar mediation.

Policy layering

  • mTLS for device/workload identity + OIDC for user identity.
  • Keep vLLM’s --api-key enabled as an additional check.

Secrets management

Store OIDC client secrets and cookie secrets outside Compose:

  • Docker secrets, SOPS-encrypted env files, Vault Agent injection, or Kubernetes Secrets with sealed-secrets/external-secrets.

Keep TLS private keys on tmpfs or locked-down volumes; enforce file ownership and permissions.
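With plain Docker Compose, file-based secrets keep the OIDC client secret out of the process environment; oauth2-proxy can read it from a file via the corresponding _FILE variable (a sketch, assuming Compose file-backed secrets and the hypothetical path ./secrets/oidc_client_secret):

```yaml
services:
  oauth2-proxy:
    # ...as above, but drop OAUTH2_PROXY_CLIENT_SECRET from environment
    environment:
      OAUTH2_PROXY_CLIENT_SECRET_FILE: /run/secrets/oidc_client_secret
    secrets:
      - oidc_client_secret

secrets:
  oidc_client_secret:
    file: ./secrets/oidc_client_secret   # keep out of version control
```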

Rotate:

  • OIDC client secret (as required by IdP policy)
  • oauth2-proxy cookie secret (planned maintenance window)
  • mTLS leaf certs (automated)

Plugin risk management

Caddy can do OIDC with plugins (for example, caddy-security provides OIDC and JWT-based controls). However, third-party auth plugins increase the supply-chain and vulnerability-management burden; security research has previously found vulnerabilities in at least one widely used SSO-related Caddy plugin.

Mitigations:

  • Prefer “separate process” gateways (oauth2-proxy) when organizational policy requires conservative dependency surfaces.
  • If using plugins, pin builds, track advisories, and implement compensating controls (mTLS, network isolation, WAF/rate limits).

Performance Considerations

Bottlenecks

  • GPU memory and KV cache: primary limiter for concurrency and context length.
  • CPU scheduling: tokenization and request handling can become CPU-bound under load.
  • Network and TLS overhead: usually minor relative to model compute, but mTLS adds handshake cost for short-lived clients.

Scaling options

Vertical: more GPU memory, faster GPUs, tuned concurrency.

Horizontal:

  • Multiple vLLM replicas behind Caddy (or a L4/L7 load balancer).
  • Sticky routing for long streaming sessions if required.

Segmentation:

  • Separate deployments by model size or tenant to prevent noisy-neighbor effects.

Resource optimization

Use vLLM-native metrics to tune batching and concurrency. Consider:

  • Quantized model variants where acceptable
  • Request limits (max prompt size, max tokens)
  • Rate limiting at the edge (per user/client cert subject)
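Request limits map directly onto vLLM serve flags; an illustrative fragment in the Compose command form used earlier (values are starting points to tune against the /metrics signals above, not recommendations):

```yaml
    command: >
      --model NousResearch/Meta-Llama-3-8B-Instruct
      --max-model-len 8192
      --max-num-seqs 64
```

--max-model-len caps the combined prompt-plus-generation context, and --max-num-seqs caps concurrently scheduled sequences; together they bound the worst-case KV cache footprint a single burst of traffic can demand.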

Tradeoffs and Design Decisions

Why this approach vs alternatives

Caddy + oauth2-proxy vs “Caddy OIDC plugin everywhere”
forward_auth with oauth2-proxy keeps OIDC logic in a dedicated gateway, reduces custom Caddy builds, and aligns with common reverse-proxy auth_request patterns.

mTLS everywhere vs “OIDC only”
OIDC alone does not authenticate machines at the transport layer. mTLS blocks unauthenticated network access even if OIDC endpoints are reachable and reduces exposure to token theft/replay in internal hops.

Sidecar TLS vs “TLS in the app”
Many inference servers (including typical vLLM deployments) run plain HTTP behind a proxy for simplicity. vLLM’s standard Docker/OpenAI server pattern emphasizes HTTP serving; a sidecar is a practical way to enforce mTLS without patching the app.

What to avoid

  • Exposing vLLM directly on a LAN without mTLS and authorization gates.
  • Relying on “internal network = trusted network.”
  • Long-lived static certificates with no revocation/rotation plan.
  • Running auth plugins without pinned versions and a vulnerability response process.
  • Leaving /metrics unauthenticated; metrics often leak capacity and load patterns.

Conclusion

A zero-trust local AI inference stack in 2026 must treat the inference API as production-critical infrastructure: authenticated, authorized, encrypted, observable, and least-privileged by default.

This architecture delivers that by combining:

  • Caddy as a clean edge proxy with client certificate verification and forward_auth.
  • OIDC via oauth2-proxy for SSO-grade identity and policy integration.
  • End-to-end mTLS, including the internal hop to the inference service, enforced with an Nginx sidecar.
  • First-class observability using vLLM’s Prometheus metrics endpoint and structured proxy logs.

The result is a practical, publishable pattern for secure production inference that works in enterprise environments and advanced homelabs alike—without assuming a trusted network, and without turning model serving into an identity science project.

Geethu

Geethu is an educator with a passion for exploring the ever-evolving world of technology, artificial intelligence, and IT. In her free time, she delves into research and writes insightful articles, breaking down complex topics into simple, engaging, and informative content. Through her work, she aims to share her knowledge and empower readers with a deeper understanding of the latest trends and innovations.
