Zero-Trust Local AI: Hardening vLLM Docker Containers Behind Caddy with mTLS and OIDC for Secure Production Inference

Running large language models locally is no longer a novelty; by 2026 it is a standard pattern for teams that need predictable latency, data residency, and cost control. vLLM has become a common serving layer because it exposes an OpenAI-compatible HTTP API and is designed for high-throughput inference.
The security problem is that “local” does not automatically mean “safe.” A production inference endpoint is still a high-value target:
- It can leak sensitive prompts and generated data.
- It can be abused for expensive GPU denial-of-service.
- It can become a pivot point into the rest of the network if container boundaries and credentials are weak.
A zero-trust posture assumes the network is hostile—even inside a homelab or “internal-only” VLAN. The practical implication is that every hop must be authenticated, authorized, encrypted, observable, and least-privileged.
This article presents an implementation-focused pattern for hardening vLLM in Docker behind Caddy, using:
- mTLS for device/workload identity and encrypted transport.
- OIDC for user and service authentication with modern SSO.
- Network isolation + container hardening to reduce blast radius.
- Metrics/logs to debug failure modes and detect abuse.
The goal is secure production inference without turning the stack into a bespoke identity platform.
Architecture Overview
Components
- vLLM: Runs the model and serves an OpenAI-compatible API over HTTP (default 8000).
- Internal TLS sidecar (Nginx): Terminates mTLS for upstream traffic to vLLM (because vLLM typically serves plain HTTP inside the container network).
- Caddy: Edge reverse proxy, TLS termination, client certificate verification, and routing.
- OIDC gateway (oauth2-proxy): Handles OIDC login, session cookies, and identity headers for apps behind a reverse proxy (paired with Caddy forward_auth).
- Prometheus/Grafana (optional): Scrapes vLLM /metrics and visualizes key latency/queue/GPU cache signals. vLLM exposes Prometheus-compatible metrics at /metrics.
Traffic flow
Client → Caddy (mTLS + TLS)
Client presents a certificate issued by an internal CA. Caddy requires and verifies it.
Caddy → oauth2-proxy (forward_auth)
For protected routes, Caddy calls oauth2-proxy to verify an authenticated OIDC session.
Caddy → Nginx sidecar (mTLS)
Even after the edge is authenticated, the internal hop is also authenticated with mTLS. This prevents any other container on the network from talking directly to inference.
Nginx → vLLM (loopback HTTP)
Nginx forwards to 127.0.0.1:8000 inside the same pod/compose “service” namespace.
Why this composition
- mTLS provides a strong, non-phishable identity factor at the transport layer.
- OIDC provides centralized user/service identity and policy controls (MFA, device posture, conditional access) without hand-rolled auth.
- Defense in depth: Compromise of one layer does not automatically grant inference access.
Implementation Breakdown
1) Certificates: internal CA and mTLS materials
Use an internal CA (Smallstep step-ca, Vault PKI, or an enterprise PKI). Smallstep is a common choice for mTLS automation in infrastructure environments.
Create a root CA and issue:
- Server cert for ai.example.internal (Caddy edge)
- Client certs for operators/services
- Server cert for the internal Nginx mTLS endpoint
- Client cert for Caddy → Nginx upstream authentication
Example with step CLI (illustrative; adapt to your PKI):
```shell
# Root CA (keeping the key offline is preferable)
step certificate create "Local AI Root CA" root_ca.crt root_ca.key \
  --profile root-ca --no-password --insecure

# Issue an edge server cert for Caddy
step certificate create ai.example.internal caddy.crt caddy.key \
  --profile leaf --ca root_ca.crt --ca-key root_ca.key \
  --san ai.example.internal --san ai --no-password --insecure

# Issue a client cert for a human operator/device
step certificate create "DevOps Laptop" client-devops.crt client-devops.key \
  --profile leaf --ca root_ca.crt --ca-key root_ca.key \
  --san devops@example.com \
  --no-password --insecure
```
Operational notes:
- Store root CA private keys offline.
- Rotate leaf certs aggressively (days/weeks), automate issuance, and revoke promptly.
- Separate Caddy’s public TLS identity from mTLS trust roots when possible (don’t reuse the same CA for everything).
Caddy supports client-certificate verification natively in its tls settings, via client_auth with trusted_ca_cert_file.
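Before wiring certificates into the proxies, it is worth verifying the chain and the clientAuth EKU, since a missing EKU is a classic mTLS handshake failure. A self-contained openssl sketch (ca.* and client.* are throwaway demo files, not the step-issued ones above):

```shell
# Throwaway CA and client cert with the clientAuth EKU, then verification.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes \
  -keyout ca.key -out ca.crt -days 7 -subj "/CN=Demo Root CA"
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes \
  -keyout client.key -out client.csr -subj "/CN=demo-operator"
printf 'extendedKeyUsage=clientAuth\nsubjectAltName=email:devops@example.com\n' > client.ext
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 1 -out client.crt -extfile client.ext

# Chain must verify against the CA that Caddy/Nginx will trust...
openssl verify -CAfile ca.crt client.crt
# ...and the cert must be usable for TLS client authentication.
openssl x509 -in client.crt -noout -text | grep -A1 "Extended Key Usage"
```

The same two checks apply to any real leaf cert issued by your PKI.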
2) Docker Compose: isolated network + hardened services
This Compose file:
- Puts only Caddy on the “public” network.
- Keeps vLLM isolated on an internal network.
- Forces all traffic to vLLM through the Nginx mTLS sidecar.
- Adds conservative container hardening defaults.
```yaml
services:
  caddy:
    image: caddy:2.8
    restart: unless-stopped
    ports:
      - "443:443"
      - "80:80"
    networks:
      - public
      - internal
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      - ./pki/caddy.crt:/etc/caddy/certs/caddy.crt:ro
      - ./pki/caddy.key:/etc/caddy/certs/caddy.key:ro
      - ./pki/root_ca.crt:/etc/caddy/pki/root_ca.crt:ro
      - ./pki/caddy_upstream_client.crt:/etc/caddy/pki/up_client.crt:ro
      - ./pki/caddy_upstream_client.key:/etc/caddy/pki/up_client.key:ro
      - caddy_data:/data
      - caddy_config:/config
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

  oauth2-proxy:
    image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
    restart: unless-stopped
    networks:
      - internal
    environment:
      OAUTH2_PROXY_PROVIDER: oidc
      OAUTH2_PROXY_OIDC_ISSUER_URL: "https://idp.example.com/realms/prod"
      OAUTH2_PROXY_CLIENT_ID: "local-ai"
      OAUTH2_PROXY_CLIENT_SECRET: "${OAUTH2_PROXY_CLIENT_SECRET}"
      OAUTH2_PROXY_COOKIE_SECRET: "${OAUTH2_PROXY_COOKIE_SECRET}"
      OAUTH2_PROXY_EMAIL_DOMAINS: "*"  # restrict to your org's domain in production
      OAUTH2_PROXY_HTTP_ADDRESS: "0.0.0.0:4180"
      OAUTH2_PROXY_UPSTREAMS: "static://200"
      OAUTH2_PROXY_REVERSE_PROXY: "true"
      OAUTH2_PROXY_SET_XAUTHREQUEST: "true"
      OAUTH2_PROXY_WHITELIST_DOMAINS: ".example.internal"
      OAUTH2_PROXY_COOKIE_SECURE: "true"
      OAUTH2_PROXY_COOKIE_SAMESITE: "lax"
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

  vllm:
    image: vllm/vllm-openai:latest  # pin a specific release tag in production
    restart: unless-stopped
    networks:
      - model  # NOT on "internal": only the mTLS sidecar shares this network
    # The image's entrypoint already launches the OpenAI-compatible server,
    # so pass only flags. API key enforcement is still useful for defense in depth.
    command: >
      --model NousResearch/Meta-Llama-3-8B-Instruct
      --dtype auto
      --port 8000
      --api-key ${VLLM_API_KEY}
      --disable-log-requests
    volumes:
      - vllm_models:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]
    # Optional: keep container writable if your model cache needs it; otherwise prefer read_only.
    security_opt:
      - no-new-privileges:true

  vllm-mtls:
    image: nginx:1.27-alpine
    restart: unless-stopped
    depends_on: [vllm]
    networks:
      - internal  # reachable by Caddy
      - model     # the only other service that can reach vLLM
    # no published ports
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./pki/root_ca.crt:/etc/nginx/pki/root_ca.crt:ro
      - ./pki/vllm_server.crt:/etc/nginx/pki/server.crt:ro
      - ./pki/vllm_server.key:/etc/nginx/pki/server.key:ro
    read_only: true
    tmpfs:  # nginx needs a few writable paths even when read_only
      - /var/cache/nginx
      - /var/run
    security_opt:
      - no-new-privileges:true
    cap_drop: ["ALL"]

networks:
  public:
  internal:
    internal: true
  model:
    internal: true

volumes:
  caddy_data:
  caddy_config:
  vllm_models:
```
Why these choices matter:
- internal: true networks have no route to the outside world; Caddy is the only bridge between public and internal.
- vLLM is attached only to the model network, so the Nginx sidecar is the only service that can reach it and nothing on internal can bypass mTLS.
- --disable-log-requests keeps verbose per-request details out of vLLM's logs for high-volume traffic.
3) Nginx sidecar: enforce mTLS to the model API
nginx.conf terminates TLS and requires client certificates for any request. It proxies to vLLM over plain HTTP on the isolated model network.
```nginx
events {}

http {
    server {
        listen 8443 ssl;

        ssl_certificate     /etc/nginx/pki/server.crt;
        ssl_certificate_key /etc/nginx/pki/server.key;

        # mTLS requirement
        ssl_client_certificate /etc/nginx/pki/root_ca.crt;
        ssl_verify_client on;

        # Basic hardening
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_session_cache shared:SSL:10m;
        ssl_session_timeout 10m;

        # Limit request sizes to reduce abuse
        client_max_body_size 2m;

        location / {
            # "vllm" resolves only on the model network this sidecar shares
            proxy_pass http://vllm:8000;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;
            proxy_set_header X-Forwarded-For $remote_addr;

            # Streaming-friendly: no buffering, generous timeouts (tune as needed)
            proxy_buffering off;
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
}
```
Key point: Even if an attacker lands on the internal Docker network, they still cannot talk to inference without presenting a valid client certificate trusted by the sidecar.
4) Caddy: edge TLS+mTLS, OIDC gate, and upstream mTLS
Caddy terminates public TLS, verifies client certificates, then uses forward_auth to oauth2-proxy for OIDC. Finally it proxies to the Nginx sidecar over mTLS, presenting its own upstream client certificate.
Caddyfile:
```caddyfile
{
    # Reduce information leakage
    servers {
        protocols h1 h2
    }
}

ai.example.internal {
    encode zstd gzip

    # Edge TLS + require client certs (mTLS)
    tls /etc/caddy/certs/caddy.crt /etc/caddy/certs/caddy.key {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/pki/root_ca.crt
        }
    }

    # OIDC authentication via oauth2-proxy, skipped only for the LB health check.
    # forward_auth is Caddy's opinionated auth_request-style integration.
    @needs_auth not path /health
    forward_auth @needs_auth oauth2-proxy:4180 {
        uri /oauth2/auth
        copy_headers X-Auth-Request-User X-Auth-Request-Email Authorization Set-Cookie
    }

    # Health endpoint for load balancers (still mTLS-protected)
    respond /health 200

    # Protect vLLM OpenAI-compatible endpoints
    @openai_api {
        path /v1/* /metrics
    }
    handle @openai_api {
        reverse_proxy vllm-mtls:8443 {
            transport http {
                tls
                tls_trusted_ca_certs /etc/caddy/pki/root_ca.crt
                # Client cert for Caddy -> Nginx sidecar mTLS
                tls_client_auth /etc/caddy/pki/up_client.crt /etc/caddy/pki/up_client.key
            }
        }
    }

    # Default deny
    respond 404
}
```
Why the layered controls matter:
- Client mTLS at the edge blocks anonymous network scanning and forces device identity before anything else.
- OIDC via forward_auth adds user identity, MFA, group-based policy, and session management.
- Upstream mTLS prevents lateral movement inside the container network and avoids “any container can call inference.”
Observability and Debugging
Metrics
vLLM exposes Prometheus-compatible metrics at /metrics on the OpenAI-compatible server. Key signals to scrape and alert on:
- Request queue depth (waiting/running)
- Time-to-first-token and per-token latency histograms
- GPU KV cache usage / cache hit behavior
- Error counters (5xx, request failures)
Operational guidance:
- Scrape /metrics from the internal network only.
- If /metrics must be exposed, protect it with the same mTLS + OIDC policy as the inference endpoints.
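A minimal Prometheus scrape job for this setup might look like the following sketch. It assumes Prometheus runs on the internal network with its own client-cert pair (prom_client.crt/prom_client.key are hypothetical names) and scrapes through the mTLS sidecar:

```yaml
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/pki/root_ca.crt     # trust the internal CA
      cert_file: /etc/prometheus/pki/prom_client.crt
      key_file: /etc/prometheus/pki/prom_client.key
    static_configs:
      - targets: ["vllm-mtls:8443"]  # sidecar address; adjust to your Compose naming
```

Because the scrape presents a client certificate directly to the sidecar, it satisfies the mTLS policy without needing an OIDC session.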
Logs
Collect logs from:
- Caddy access logs: request metadata, auth outcomes, upstream errors.
- oauth2-proxy logs: OIDC handshake, cookie/session validation, redirects.
- Nginx TLS logs: mTLS handshake failures (bad cert, unknown CA).
- vLLM logs: model load events, OOM errors, request failures.
vLLM provides flags to reduce per-request log noise (for example, --disable-log-requests) so logs stay focused on errors and state changes.
Failure modes and common misconfigurations
mTLS handshake fails at the edge
Symptoms: browser shows “bad certificate,” curl fails with tlsv1 alert unknown ca.
Causes: wrong trust root in Caddy, missing client cert, client cert not intended for clientAuth EKU.
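Expired or soon-to-expire leaf certificates are another frequent cause, especially with the aggressive rotation recommended above. openssl's -checkend makes this easy to test (demo.key/demo.crt below are throwaway files standing in for a real leaf cert such as client-devops.crt):

```shell
# Create a 30-day demo cert, then ask: will it still be valid in 7 days (604800 s)?
# -checkend exits non-zero if the certificate expires within the window.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes \
  -keyout demo.key -out demo.crt -days 30 -subj "/CN=demo-leaf"
openssl x509 -in demo.crt -noout -enddate -checkend 604800 \
  && echo "OK: valid for at least 7 more days"
```

Running the same -checkend probe in a cron job or CI step turns silent expiry into an alert.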
OIDC loop / redirect storms
Symptoms: repeated redirects to /oauth2/start, never reaches API.
Causes: cookie secret mismatch, wrong redirect URL, domain mismatch, missing --reverse-proxy (oauth2-proxy setting).
Caddy → upstream fails
Symptoms: 502/504 from Caddy.
Causes: Caddy not presenting upstream client cert, Nginx requires mTLS but Caddy configured without it, wrong trusted CA.
Inference latency spikes
Symptoms: increased time-to-first-token, queue depth grows.
Causes: GPU saturation, KV cache pressure, too-large max_tokens, insufficient batching headroom, CPU throttling.
Security Considerations
Attack surface
- Edge proxy (TLS termination, auth routing)
- OIDC gateway (cookie/session handling)
- Inference API (high-cost compute, sensitive payloads)
- Model artifacts and cache (supply chain and integrity)
Hardening strategies
Least privilege containers
no-new-privileges, drop Linux caps, read-only filesystems where feasible.
Network isolation
- Separate public and internal networks; publish only 443/80 from Caddy.
- Bind vLLM to loopback and force sidecar mediation.
Policy layering
- mTLS for device/workload identity + OIDC for user identity.
- Keep vLLM’s --api-key enabled as an additional check.
Secrets management
Store OIDC client secrets and cookie secrets outside Compose:
- Docker secrets, SOPS-encrypted env files, Vault Agent injection, or Kubernetes Secrets with sealed-secrets/external-secrets.
Keep TLS private keys on tmpfs or locked-down volumes; enforce file ownership and permissions.
Rotate:
- OIDC client secret (as required by IdP policy)
- oauth2-proxy cookie secret (planned maintenance window)
- mTLS leaf certs (automated)
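oauth2-proxy requires the cookie secret to be exactly 16, 24, or 32 bytes (optionally base64-encoded), and a wrong length is a common source of startup failures and redirect loops. A one-liner to generate a compliant 32-byte value:

```shell
# 32 random bytes, URL-safe base64 — a valid OAUTH2_PROXY_COOKIE_SECRET
python3 -c 'import os, base64; print(base64.urlsafe_b64encode(os.urandom(32)).decode())'
```

Generate a fresh value at each planned rotation and roll it out with a restart; existing sessions are invalidated, which is the point.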
Plugin risk management
Caddy can do OIDC with plugins (for example, caddy-security provides OIDC and JWT-based controls). However, third-party auth plugins increase supply-chain and vulnerability management burden; prior security research has identified issues in at least one SSO-related Caddy plugin ecosystem.
Mitigations:
- Prefer “separate process” gateways (oauth2-proxy) when organizational policy requires conservative dependency surfaces.
- If using plugins, pin builds, track advisories, and implement compensating controls (mTLS, network isolation, WAF/rate limits).
Performance Considerations
Bottlenecks
- GPU memory and KV cache: primary limiter for concurrency and context length.
- CPU scheduling: tokenization and request handling can become CPU-bound under load.
- Network and TLS overhead: usually minor relative to model compute, but mTLS adds handshake cost for short-lived clients.
Scaling options
Vertical: more GPU memory, faster GPUs, tuned concurrency.
Horizontal:
- Multiple vLLM replicas behind Caddy (or a L4/L7 load balancer).
- Sticky routing for long streaming sessions if required.
Segmentation:
- Separate deployments by model size or tenant to prevent noisy-neighbor effects.
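Horizontal scaling needs no extra components: Caddy's reverse_proxy already load-balances across multiple upstreams. A sketch, assuming two hypothetical sidecar-fronted replicas named vllm-mtls-a and vllm-mtls-b:

```caddyfile
# Replaces the single-upstream reverse_proxy inside the API handle block
reverse_proxy vllm-mtls-a:8443 vllm-mtls-b:8443 {
    lb_policy least_conn   # favor the replica with the fewest active requests
    health_uri /health     # active health checks against each upstream
    health_interval 10s
    transport http {
        tls
        tls_trusted_ca_certs /etc/caddy/pki/root_ca.crt
        tls_client_auth /etc/caddy/pki/up_client.crt /etc/caddy/pki/up_client.key
    }
}
```

least_conn is a reasonable default for inference, since request cost varies widely; the health checks reuse the same mTLS transport as regular traffic.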
Resource optimization
Use vLLM-native metrics to tune batching and concurrency. Consider:
- Quantized model variants where acceptable
- Request limits (max prompt size, max tokens)
- Rate limiting at the edge (per user/client cert subject)
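Edge rate limiting can key directly on the mTLS identity. In the Nginx sidecar this is a few lines of standard limit_req configuration, bucketed by the client certificate's subject DN (the zone name and rates below are illustrative):

```nginx
# In the http block: one token bucket per client-certificate subject
limit_req_zone $ssl_client_s_dn zone=per_cert:10m rate=5r/s;

# In the location block that proxies to vLLM
limit_req zone=per_cert burst=10 nodelay;
limit_req_status 429;
```

Because the key is the verified certificate subject rather than an IP address, limits survive NAT and cannot be dodged by rotating source addresses.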
Tradeoffs and Design Decisions
Why this approach vs alternatives
Caddy + oauth2-proxy vs “Caddy OIDC plugin everywhere”
forward_auth with oauth2-proxy keeps OIDC logic in a dedicated gateway, reduces custom Caddy builds, and aligns with common reverse-proxy auth_request patterns.
mTLS everywhere vs “OIDC only”
OIDC alone does not authenticate machines at the transport layer. mTLS blocks unauthenticated network access even if OIDC endpoints are reachable and reduces exposure to token theft/replay in internal hops.
Sidecar TLS vs “TLS in the app”
Many inference servers (including typical vLLM deployments) run plain HTTP behind a proxy for simplicity. vLLM’s standard Docker/OpenAI server pattern emphasizes HTTP serving; a sidecar is a practical way to enforce mTLS without patching the app.
What to avoid
- Exposing vLLM directly on a LAN without mTLS and authorization gates.
- Relying on “internal network = trusted network.”
- Long-lived static certificates with no revocation/rotation plan.
- Running auth plugins without pinned versions and a vulnerability response process.
- Leaving /metrics unauthenticated; metrics often leak capacity and load patterns.
Conclusion
A zero-trust local AI inference stack in 2026 must treat the inference API as production-critical infrastructure: authenticated, authorized, encrypted, observable, and least-privileged by default.
This architecture delivers that by combining:
- Caddy as a clean edge proxy with client certificate verification and forward_auth.
- OIDC via oauth2-proxy for SSO-grade identity and policy integration.
- End-to-end mTLS, including the internal hop to the inference service, enforced with an Nginx sidecar.
- First-class observability using vLLM’s Prometheus metrics endpoint and structured proxy logs.
The result is a practical, publishable pattern for secure production inference that works in enterprise environments and advanced homelabs alike—without assuming a trusted network, and without turning model serving into an identity science project.



