vLLM Provider

vLLM is a high-throughput inference server that exposes an OpenAI-compatible API. It is typically used for self-hosted production inference of open-weight models.

Configuration

apiVersion: model.joch.dev/v1alpha1
kind: Model
metadata: { name: llama-3-1-70b-vllm }
spec:
  provider: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  endpoint:
    type: hosted
    baseUrl: http://vllm-llama:8000/v1
  auth:
    secretRef: { name: vllm-api-key }
  capabilities:
    text: true
    toolCalling: true       # set per model
    structuredOutput: true  # set per model
    streaming: true
  limits:
    contextWindowTokens: 128000
  routing:
    regions: [on-prem]

Authentication

vLLM can be started with an optional API key (--api-key). When a key is configured via auth.secretRef, the adapter sends it as Authorization: Bearer <key>.
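
A minimal sketch of the referenced Secret, assuming the adapter reads the key from a field named apiKey (the field name is an assumption; match whatever your adapter expects):

apiVersion: v1
kind: Secret
metadata: { name: vllm-api-key }
type: Opaque
stringData:
  apiKey: <same value passed to vllm serve --api-key>   # assumption: field named apiKey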

Tool calling

Tool-calling support depends on the served model and the vLLM build. Declare the capability on each Model record and let the router fall forward to a capable model when it is missing.
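
For example, a deployment whose model or vLLM build does not expose tool calling would declare it off in the capabilities block (a sketch; only the relevant block is shown):

capabilities:
  text: true
  toolCalling: false       # this model/build pair does not expose tool calling
  structuredOutput: true
  streaming: true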

Streaming

Streamed responses use the standard OpenAI server-sent events (SSE) chunk format.

Region / residency

Self-hosted, so residency is constrained by where you deploy vLLM. Use Model.spec.routing.regions to enforce it.
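
A sketch of pinning a second deployment of the same model to its own region; the region label and endpoint hostname here are hypothetical:

apiVersion: model.joch.dev/v1alpha1
kind: Model
metadata: { name: llama-3-1-70b-vllm-eu }
spec:
  provider: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  endpoint:
    type: hosted
    baseUrl: http://vllm-llama.eu.svc:8000/v1   # hypothetical EU deployment
  routing:
    regions: [eu-on-prem]                       # hypothetical region label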

Cost reporting

Token usage is still tracked; price it as zero or apply your internal chargeback rate.

Reference