llama.cpp Provider

llama.cpp is a lightweight C/C++ inference engine for running LLMs locally. Its HTTP server (llama-server) exposes an OpenAI-compatible API.
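
Because the API is OpenAI-compatible, a standard OpenAI client can talk to llama-server directly. A minimal sketch in Python, assuming the openai package and the baseUrl shown in the configuration below (the model name sent by the client is largely informational; llama-server answers with whichever GGUF it loaded):

from openai import OpenAI

# Point the standard OpenAI client at the local llama-server endpoint.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # ignored unless the server was started with --api-key
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct.Q4_K_M.gguf",  # informational for llama-server
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)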

Configuration

apiVersion: model.joch.dev/v1alpha1
kind: Model
metadata: { name: llama-3-1-8b-llamacpp }
spec:
  provider: llama-cpp
  model: llama-3.1-8b-instruct.Q4_K_M.gguf
  endpoint:
    type: local
    baseUrl: http://localhost:8080/v1
  capabilities:
    text: true
    toolCalling: false   # depends on model + build
    structuredOutput: false
    streaming: true
  routing:
    regions: [on-prem]
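
Since capabilities.streaming is enabled, responses can also be consumed incrementally over the same endpoint. A sketch under the same assumptions as above (local server, openai Python client):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct.Q4_K_M.gguf",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one line."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()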

Authentication

Authentication is optional. Starting llama-server with --api-key <token> enables a shared secret; clients must then send the same token as a bearer credential in the Authorization header.
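
A minimal sketch of the client side, using a hypothetical token value:

from openai import OpenAI

# The client sends the key as "Authorization: Bearer <token>" on every request;
# it must match the value passed to llama-server via --api-key.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="example-shared-token",  # hypothetical; replace with your --api-key value
)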

When to use

  • Resource-constrained edge / on-prem deployments.
  • Offline workloads where Ollama or vLLM is too heavy.
  • Development with quantized models on CPU or modest GPUs.

Limits

  • Tool calling support is uneven; it depends on the model's chat template and the llama-server version and flags.
  • Throughput is significantly lower than vLLM for batch workloads.

Reference