vLLM Provider¶
vLLM is a high-throughput inference server that exposes an OpenAI-compatible API. It is typically used for self-hosted production inference of open-weight models.
Configuration¶
apiVersion: model.joch.dev/v1alpha1
kind: Model
metadata: { name: llama-3-1-70b-vllm }
spec:
  provider: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  endpoint:
    type: hosted
    baseUrl: http://vllm-llama:8000/v1
    auth:
      secretRef: { name: vllm-api-key }
  capabilities:
    text: true
    toolCalling: true        # set per model
    structuredOutput: true   # set per model
    streaming: true
  limits:
    contextWindowTokens: 128000
  routing:
    regions: [on-prem]
Authentication¶
vLLM supports an optional API key (--api-key). When set, the adapter sends Authorization: Bearer <key>.
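As a minimal sketch, assuming the server was started with --api-key and the key is exported as VLLM_API_KEY (an illustrative variable name), a request through the stock openai Python client looks like this:

import os

from openai import OpenAI

# baseUrl and model name come from the Model record above.
# If the server runs without --api-key, any placeholder string works.
client = OpenAI(
    base_url="http://vllm-llama:8000/v1",
    api_key=os.environ.get("VLLM_API_KEY", "EMPTY"),
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)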
Tool calling¶
Tool calling depends on the served model and on the vLLM version and launch flags. Declare the capability per Model record and let the router fall back to another model when the capability is missing. A request sketch follows.
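A tool-calling sketch against the same endpoint; the get_weather tool is hypothetical, and whether tool_calls come back populated depends on the model and server configuration:

from openai import OpenAI

client = OpenAI(base_url="http://vllm-llama:8000/v1", api_key="EMPTY")

# get_weather is a hypothetical tool, defined only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)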
Streaming¶
Streaming responses use the standard OpenAI SSE shape: chat.completion.chunk deltas terminated by a data: [DONE] event.
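A minimal streaming sketch with the same client; the client consumes the SSE framing and yields chunks:

from openai import OpenAI

client = OpenAI(base_url="http://vllm-llama:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Guard: some chunks (e.g. a trailing usage chunk) may carry no delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()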
Region / residency¶
Self-hosted, so residency is constrained by where you deploy vLLM. Use Model.spec.routing.regions to enforce placement.
Cost reporting¶
Token usage is still tracked; price it at 0 or at your internal chargeback rate.
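A chargeback sketch reading the usage block that comes back on every non-streaming response; the rates are illustrative placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://vllm-llama:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
)

# Illustrative per-1K-token rates; self-hosted inference is often priced at 0.
PROMPT_RATE = 0.0
COMPLETION_RATE = 0.0

u = resp.usage
cost = (u.prompt_tokens / 1000) * PROMPT_RATE + (u.completion_tokens / 1000) * COMPLETION_RATE
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} cost={cost:.4f}")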
Reference¶
- vLLM docs: https://docs.vllm.ai/