Connect coding agents to any model

A protocol-translating proxy that connects Claude Code, Codex, OpenCode, and Qwen Code to local and upstream models, injecting the tools you expect.

MIT License · Open Source · Free Forever
[Architecture: coding agents (Claude Code, Codex CLI, OpenCode, Qwen Code, OpenClaw, your app) → go-llm-proxy (with web search) → local backends (vLLM, llama-server, Ollama) or cloud backends via API key (OpenAI, Anthropic, MiniMax, Zhipu (GLM))]

Connects agents to local models

Claude Code speaks Anthropic. Codex speaks Responses API. Your vLLM box speaks Chat Completions. The proxy translates automatically.

Multiplexes models

Route multiple models across multiple backends behind one endpoint. Name rewriting, per-model timeouts, per-key access control.
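As an illustration, a multiplexed config might look like the sketch below. Note that `timeout` and the per-key `models` list are hypothetical field names used only for illustration; see the configuration reference for the real ones.

```yaml
models:
  - name: fast-model
    backend: http://192.168.1.10:8000/v1
    timeout: 120s            # hypothetical field name
  - name: big-model
    backend: http://192.168.1.20:8000/v1
    timeout: 600s            # hypothetical field name

keys:
  - key: sk-intern
    name: intern
    models: [fast-model]     # hypothetical per-key restriction field
```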

Adds tools backends lack

Web search via Tavily or Brave Search, image description, PDF text extraction, OCR for scanned documents — executed at the proxy, injected transparently.

Compatibility matrix

What works with each coding agent through the proxy.

Claude Code Codex CLI OpenCode Qwen Code
Protocol
Native API Anthropic Messages OpenAI Responses Chat Completions Chat Completions
Translation needed auto-translated auto-translated passthrough passthrough
Core features
Text + streaming
Tool calling
Multi-turn tool loops
Reasoning display N/A N/A
Server-side features
Token usage tracking
Context compaction N/A N/A
Token counting endpoint N/A N/A N/A
Prompt caching passthrough N/A N/A N/A
Extended thinking N/A N/A
Proxy-side processing (details)
Web search (Tavily / Brave) ✓ proxy ✓ proxy ✓ MCP ✓ MCP
Image description ✓ vision ✓ vision ✓ vision ✓ vision
PDF text extraction ✓ proxy client-side
Scanned PDF / OCR ✓ OCR model ✓ OCR model
Conversation compaction N/A N/A N/A
Usage logging & reports
Configuration
Model slots Sonnet / Opus / Haiku Single model Build / Plan agents Multi-select
Config output settings.json + script config.toml + script JSON config settings.json
Setup guide Claude Code Codex CLI OpenCode Qwen Code

Web search intercepts server-side search tool calls, executes them via Tavily, and injects results back into the conversation. Image description routes user-attached images to a vision model; tool output images (PDF pages, screenshots) are routed to a dedicated OCR model for text extraction. PDF text extraction runs locally in pure Go; scanned PDFs fall back to the OCR model. All results are cached by content hash for instant follow-up turns. See the pipeline documentation for full details.
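The content-hash caching described above can be pictured in a few lines of Go. This is an illustrative sketch under the assumption of a SHA-256 hash, not the proxy's actual implementation; the cached description string is invented for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a stable key from raw content bytes, so a PDF or image
// attached again on a follow-up turn hits the cache instead of the OCR or
// vision model. SHA-256 is an assumption made for this sketch.
func cacheKey(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	cache := map[string]string{}
	page := []byte("%PDF-1.7 ... fake page bytes ...")

	key := cacheKey(page)
	if desc, ok := cache[key]; ok {
		fmt.Println("cache hit:", desc)
	} else {
		// Stand-in for the result of an expensive OCR call.
		cache[key] = "Invoice, 2 pages, total due on last page"
		fmt.Println("cache miss, stored under", key[:8])
	}

	// Second turn with identical bytes: instant hit, no model call.
	_, hit := cache[cacheKey(page)]
	fmt.Println("second turn hit:", hit)
}
```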

Quick start

Create a config.yaml with your models and keys:

listen: ":8080"

models:
  # Your coding model (vLLM, llama-server, Ollama, etc.)
  - name: my-model
    backend: http://192.168.1.10:8000/v1
    responses_mode: translate   # recommended for vLLM backends with Codex

  # Cloud API — native Anthropic passthrough
  - name: MiniMax-M2.5
    backend: https://api.minimax.io/anthropic
    api_key: your-minimax-key
    type: anthropic

  # Cloud API — auto-translated from any protocol
  - name: glm-5.1
    backend: https://api.z.ai/api/coding/paas/v4
    api_key: your-zhipu-key

  # Vision model — describes images for text-only backends
  - name: Qwen3-VL-8B
    backend: http://192.168.1.10:8001/v1
    supports_vision: true

  # OCR model — fast text extraction from documents and scanned PDFs
  - name: paddleOCR
    backend: http://192.168.1.10:8002/v1
    supports_vision: true

# Processing pipeline — handles images, PDFs, and web search transparently
processors:
  vision: Qwen3-VL-8B         # any vision-capable model from above
  ocr: paddleOCR              # PaddleOCR-VL-1.5 (0.9B) — fast, accurate
  web_search_key: tvly-...    # Tavily or Brave Search key (auto-detected)

keys:
  - key: sk-your-secret-key
    name: admin

Recommended processor models

Vision: Qwen3-VL-8B — best quality/speed balance for image description. Handles charts, screenshots, diagrams.
OCR: PaddleOCR-VL-1.5 (0.9B) — purpose-built for document parsing. 94.5% accuracy, 109 languages, ~2s/page. Tiny VRAM footprint.
Web search: Tavily (free: 1,000 req/month) or Brave Search (free: $5/month credit). Add your API key to processors.web_search_key — provider is auto-detected from the key prefix.
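The prefix-based auto-detection can be sketched in Go. The "tvly-" prefix comes from the example key above; treating everything else as Brave Search is an assumption made for this sketch, and the proxy's real detection logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// detectSearchProvider guesses the web search provider from the API key
// prefix. Tavily keys start with "tvly-"; in this sketch anything else is
// assumed to be a Brave Search key.
func detectSearchProvider(key string) string {
	if strings.HasPrefix(key, "tvly-") {
		return "tavily"
	}
	return "brave"
}

func main() {
	fmt.Println(detectSearchProvider("tvly-abc123")) // tavily
	fmt.Println(detectSearchProvider("some-other-key"))
}
```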

Run it:

# Binary (recommended)
./go-llm-proxy -config config.yaml

# Docker (limited testing — ongoing)
docker compose -f docker/docker-compose.yml up -d

Point any OpenAI- or Anthropic-compatible client at http://localhost:8080/v1. That's it.

The built-in config generator at GET / creates ready-to-use configs for each coding agent. Enable it with -serve-config-generator or serve_config_generator: true in config.

Features

Protocol translation: Anthropic Messages API ↔ Chat Completions. OpenAI Responses API ↔ Chat Completions. Auto-detected per backend.
Model multiplexing: multiple models across multiple backends. Name rewriting, per-model timeouts, per-key model restrictions.
Image description: images sent to text-only models are described by a vision-capable model and replaced with text. Concurrent processing (up to 5 in parallel), cached by content hash for instant follow-up turns.
PDF processing: text extraction for native PDFs. OCR via a dedicated fast model for scanned documents and page images. Results cached across turns.
Web search: server-side search tools (Claude Code, Codex) executed at the proxy via Tavily or Brave Search (auto-detected from the key prefix). Results displayed in the client UI. Streaming and non-streaming modes.
Context tracking: token usage from backends is reported to clients for context-window tracking. A token counting endpoint for per-section breakdowns is planned.
API key management: issue proxy keys with per-key model restrictions. Backend credentials stay on the server.
Usage monitoring: per-request logging to SQLite. Token counts, latency, per-user breakdowns. Web dashboard and CLI reports.
Config generator: built-in web UI creates ready-to-use configs for Claude Code, Codex, OpenCode, and Qwen Code.
Hot reload: config reloads on file save or SIGHUP. Add models or rotate keys without restarting.
Security: constant-time auth, IP rate limiting, SSRF protection, sanitized error responses, path allowlisting.

Supported backends

Anything that speaks the OpenAI Chat Completions or Anthropic Messages protocol works. Tested with:

vLLM llama-server Ollama OpenAI API Anthropic API MiniMax Zhipu (GLM)

How it works

1. Client request arrives (any protocol)
2. Protocol handler parses the request
3. Translate to Chat Completions (if needed)
4. Pipeline: describe images, extract PDF text, OCR page images, inject the search tool
5. Send to backend
6. Pipeline: execute web search if called, re-send with results
7. Translate the response back to the client protocol
8. Stream to the client

The pipeline is optional. Without processors configured, the proxy just translates and routes. With processors enabled, images, PDFs, and search work transparently on any backend.
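The optional pipeline can be pictured as a chain of request-transforming stages; with no processors configured, the chain is empty and the request passes through unchanged. A toy Go sketch of that idea (not the proxy's real types):

```go
package main

import "fmt"

// A stage transforms an outgoing request; the proxy runs stages in order.
type stage func(string) string

// pipeline composes stages into a single stage. An empty pipeline is the
// identity: pure translation and routing, nothing else.
func pipeline(stages ...stage) stage {
	return func(req string) string {
		for _, s := range stages {
			req = s(req)
		}
		return req
	}
}

func main() {
	describeImages := func(r string) string { return r + " +image-descriptions" }
	extractPDFs := func(r string) string { return r + " +pdf-text" }
	injectSearch := func(r string) string { return r + " +search-tool" }

	run := pipeline(describeImages, extractPDFs, injectSearch)
	fmt.Println(run("user-request"))

	// No processors configured: request passes through unchanged.
	fmt.Println(pipeline()("user-request"))
}
```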

Documentation

Configuration reference: all config fields, modes, and examples
Claude Code guide: Messages API translation, tool calling, web search
Codex CLI guide: Responses API translation, compaction, context windows
Processing pipeline: image description, PDF OCR, web search — per-client behavior
Docker deployment: one config file, one command
Security: hardening, rate limiting, deployment recommendations