Industry · 2026-06-29 · 6 min read

Coinbase Cuts AI Costs in Half With Smart Routing to GLM 5.2 and Kimi 2.7

Coinbase built an internal AI gateway that routes requests to open-weight models like GLM 5.2 and Kimi 2.7 at $0.95–$1.40 per million tokens — 5× cheaper than U.S. flagships. Defaults, prompt caching, and task-based routing pushed cache hits from 5% to 60%, cutting their AI bill in half even as token usage climbed. Here's the playbook — and how to copy it.

TL;DR

Coinbase built an internal AI gateway that routes requests across providers instead of hard-coding one frontier model per app.
The gateway leans on cheap open-weight models — China's GLM 5.2 and Kimi 2.7 — at $0.95–$1.40 per million tokens, vs $5+ for U.S. flagships like GPT-5.5 and Claude Opus.
Smart defaults + prompt caching + task-based routing lifted cache hit rates from ~5% to ~60%, letting token volume balloon without the bill following.
Independent operators like AI analyst Miles Deutscher report >50% savings using the same "barbell strategy" — frontier models for the hardest tasks, open models for everything else.
The catch: cheaper inference tends to induce more usage, so savings only stick if you also cap per-app budgets.

What Coinbase Actually Built

Coinbase's engineering team stopped pointing each internal app at a single LLM SDK and instead put a gateway in the middle. Every AI call inside the company — code review bots, customer-support drafts, fraud-signal summarizers, internal search — now flows through a shared service that decides which model to use, whether to cache, and how to handle fallbacks.

The routing logic is built around three levers:

1. Default to the cheapest model that clears the quality bar for a task class.

2. Prompt-cache aggressively so repeated system prompts and tool definitions don't get re-billed.

3. Escalate to frontier models only when the task warrants it — long-context reasoning, ambiguous customer escalations, or anything touching money movement.
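The first and third levers can be sketched as a small routing function. This is a minimal illustration, not Coinbase's implementation: the model identifiers, the task-class names, the `Request` fields, and the 100,000-token long-context cutoff are all assumptions invented for the example.

```python
from dataclasses import dataclass

# Hypothetical identifiers; substitute whatever your providers actually expose.
CHEAP_DEFAULTS = {
    "classification": "glm-5.2",
    "extraction": "glm-5.2",
    "summarization": "kimi-2.7",
    "draft": "kimi-2.7",
}
FALLBACK_CHEAP_MODEL = "glm-5.2"   # used for task classes with no explicit default
FRONTIER_MODEL = "frontier-model"  # placeholder for the expensive tier
LONG_CONTEXT_TOKENS = 100_000      # assumed cutoff, not a figure from the article


@dataclass(frozen=True)
class Request:
    task_class: str
    prompt_tokens: int
    touches_money: bool = False
    ambiguous_escalation: bool = False


def choose_model(req: Request) -> str:
    """Return the cheap default unless the request meets an escalation rule."""
    needs_frontier = (
        req.touches_money
        or req.ambiguous_escalation
        or req.prompt_tokens > LONG_CONTEXT_TOKENS
    )
    if needs_frontier:
        return FRONTIER_MODEL
    return CHEAP_DEFAULTS.get(req.task_class, FALLBACK_CHEAP_MODEL)
```

Keeping the escalation rules explicit and few makes the frontier model an exception that has to be justified per request, rather than the path every app takes by default.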

Before the gateway, the team estimates that roughly 5% of traffic hit a cache. After defaults, prefix-stable system prompts, and a unified context layer, ~60% of requests now hit cache. On most providers, cached input tokens are billed at 10–25% of the normal input price instead of full freight.
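To see what the cache-hit jump is worth, here is a rough back-of-the-envelope calculation. It assumes, as a simplification, that the hit rate applies uniformly to input tokens and that output tokens are unaffected; the function names are illustrative.

```python
def effective_input_price(list_price: float, hit_rate: float, cached_fraction: float) -> float:
    """Blended input price when `hit_rate` of tokens bill at `cached_fraction` of list."""
    if not (0.0 <= hit_rate <= 1.0 and 0.0 <= cached_fraction <= 1.0):
        raise ValueError("hit_rate and cached_fraction must be between 0 and 1")
    return list_price * ((1.0 - hit_rate) + hit_rate * cached_fraction)


def input_savings(before_hit_rate: float, after_hit_rate: float, cached_fraction: float) -> float:
    """Fractional drop in input cost when the hit rate moves from before to after."""
    before = effective_input_price(1.0, before_hit_rate, cached_fraction)
    after = effective_input_price(1.0, after_hit_rate, cached_fraction)
    return 1.0 - after / before


# The article's figures: hit rate 5% -> 60%, cached tokens at 10-25% of list price.
for cached_fraction in (0.10, 0.25):
    saving = input_savings(0.05, 0.60, cached_fraction)
    print(f"cached at {cached_fraction:.0%} of list: input cost down {saving:.0%}")
```

Under these assumptions the input-token bill drops by roughly 52% when cached tokens cost 10% of list price and by roughly 43% at 25%, which is in line with the "cut in half" headline for input-heavy workloads.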

The $0.95 vs $5+ Gap

This works because the price gap between open-weight Chinese models and U.S. flagships has gone from "interesting" to "embarrassing": $0.95–$1.40 per million tokens for GLM 5.2 and Kimi 2.7, versus $5+ for GPT-5.5 and Claude Opus.

For routine work (classification, extraction, summarization, structured output, draft generation), GLM 5.2 and Kimi 2.7 score within striking distance of the closed flagships on most public benchmarks. At a fifth of the per-token price or less, the math only fails when the task genuinely needs frontier reasoning.

The Barbell Strategy, Generalized

AI analyst Miles Deutscher independently reported >50% savings on his own production stack using what he calls a barbell strategy: frontier models on one end for the small slice of tasks that need them, open-weight models on the other end for the long tail — and almost nothing in the middle-tier "premium but not flagship" zone.

The pattern works because task value is bimodal. A handful of calls genuinely need GPT-5.5 or Claude Opus 4.8 — multi-hop reasoning, novel code generation, customer-money decisions. The vast majority don't. Middle-tier models are usually the worst trade in the stack: they cost 3–5× more than open weights without matching frontier quality on the hard tasks.
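The barbell arithmetic can be worked through with the prices quoted in this article. The sketch below assumes a $5 frontier price (the low end of "$5+"), a 10% frontier share measured in tokens, and identical token counts regardless of which model serves a task; all three are simplifications.

```python
def blended_price(frontier_share: float, frontier_price: float, open_price: float) -> float:
    """Average price per million tokens when `frontier_share` of tokens go to the frontier model."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return frontier_share * frontier_price + (1.0 - frontier_share) * open_price


FRONTIER_PRICE = 5.00  # assumed: low end of the article's "$5+"
FRONTIER_SHARE = 0.10  # assumed: one token in ten goes to the frontier model

for open_price in (0.95, 1.40):
    blended = blended_price(FRONTIER_SHARE, FRONTIER_PRICE, open_price)
    saving = 1.0 - blended / FRONTIER_PRICE
    print(f"open at ${open_price:.2f}: blended ${blended:.3f}/M, {saving:.0%} below all-frontier")
```

With these inputs the blended rate lands roughly 65–73% below an all-frontier stack. Real savings will be smaller if hard tasks consume disproportionately many tokens, which is one reason to measure the frontier share on logged traffic rather than assume it.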

How to Copy the Playbook

You don't need Coinbase's headcount to ship this. The minimum viable gateway is:

1. Centralize provider keys behind one internal service so every app calls *your* endpoint, not OpenAI's. This is the precondition for everything else.

2. Set a per-task default model. Start with GLM 5.2 or Kimi 2.7 for anything that isn't obviously frontier-grade. Make the frontier model an opt-in flag, not the default.

3. Standardize system prompts and tool definitions so prefixes are byte-stable. This is what unlocks the 5% → 60% cache-hit jump.

4. Add per-app monthly budgets with hard cutoffs. Cheaper tokens reliably induce more usage — if you don't cap it, the savings get spent on volume.

5. Log every call's cost so the routing layer can A/B model choices on real traffic, not vibes.
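Step 3 is the least obvious of the five, so here is a minimal sketch of one way to keep prefixes byte-stable. Providers typically match cached content by exact prefix, so any drift in whitespace, key order, or tool order can cause a miss. The helper names and the assumption that each tool definition is a dict with a `name` key are illustrative, not any provider's API.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list) -> bytes:
    """Serialize the system prompt and tool definitions into a canonical byte string."""
    canonical_tools = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),  # fixed tool order
        sort_keys=True,                                # fixed key order inside each tool
        separators=(",", ":"),                         # no incidental whitespace
        ensure_ascii=False,
    )
    return (system_prompt.strip() + "\n" + canonical_tools).encode("utf-8")


def prefix_fingerprint(prefix: bytes) -> str:
    """Short hash to log per call, so prefix drift shows up in the cost logs."""
    return hashlib.sha256(prefix).hexdigest()[:16]


# Two call sites that describe the same tools in a different order
# still produce identical prefix bytes.
tools_a = [{"name": "lookup", "description": "Find a record"},
           {"name": "audit", "description": "Write an audit entry"}]
tools_b = [{"description": "Write an audit entry", "name": "audit"},
           {"description": "Find a record", "name": "lookup"}]
assert build_prefix("You are a support assistant.", tools_a) == \
       build_prefix("You are a support assistant.\n", tools_b)
```

Anything that varies per request (timestamps, user names, retrieved documents) belongs after this prefix, not inside it. Logging the fingerprint alongside each call's cost makes it easy to spot an app whose prefix changes on every request.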

For pricing math on a specific workload, the AI Model Pricing Table shows live per-1M rates across all major providers, and the Prompt Caching Explained guide walks through the cache-hit math in detail.

The Counter-Argument: Cheaper Tokens, More Tokens

Not everyone is celebrating. The pushback is Jevons paradox applied to inference: when the unit cost of a thing drops, total consumption can rise by more than the price fell, so total spend goes up. Several enterprise teams that switched to cheap open models reported higher total spend within a quarter, not because the per-token price went up, but because cheap inference made it sensible to call the model on tasks that previously weren't worth it.

That's not an argument against routing. It's an argument for routing plus budgeting. The gateway pattern only saves money when the savings are *captured* — through caps, alerts, and per-team accountability — rather than reinvested in unbounded volume growth.
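A budget guard of the kind described here might look like the following in-memory sketch. The class and method names, the 80% alert threshold, and the app name are assumptions for illustration; a production version would persist spend, reset it monthly, and handle concurrent calls.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push an app past its monthly cap."""


@dataclass
class BudgetGuard:
    monthly_caps_usd: dict
    alert_fraction: float = 0.8  # assumed alert threshold, not from the article
    spent_usd: dict = field(default_factory=lambda: defaultdict(float))
    call_log: list = field(default_factory=list)

    def check(self, app: str, estimated_cost_usd: float) -> None:
        """Hard cutoff: refuse the call before it is sent if it would break the cap."""
        cap = self.monthly_caps_usd[app]
        if self.spent_usd[app] + estimated_cost_usd > cap:
            raise BudgetExceeded(f"{app} would exceed its ${cap:.2f} monthly cap")

    def record(self, app: str, model: str, cost_usd: float) -> bool:
        """Log the call's cost; return True once spend crosses the alert threshold."""
        self.spent_usd[app] += cost_usd
        self.call_log.append((app, model, cost_usd))
        return self.spent_usd[app] >= self.alert_fraction * self.monthly_caps_usd[app]


guard = BudgetGuard({"support-drafts": 10.0})
guard.check("support-drafts", 4.0)                       # within budget, no exception
alert = guard.record("support-drafts", "glm-5.2", 4.0)   # False: 40% of cap used
```

The gateway would call `check` before dispatching a request and `record` after the provider reports usage, so the same log that enforces the cap also supplies the per-call cost data for comparing models.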

Bottom Line

Coinbase's smart-routing gateway is the first widely discussed case of a name-brand enterprise putting open-weight Chinese models on the critical path for cost reasons and publishing the numbers. The savings (~50% on the AI bill, with cache-hit rates up 12×) should be reproducible by any team willing to centralize provider access and accept that the default model is no longer the most expensive one.

If you're still hard-coding GPT-5.5 or Claude Opus 4.8 as your application's only LLM, you're paying a frontier tax on routine work that GLM 5.2 and Kimi 2.7 handle for a fifth of the price. The Coinbase playbook says: build the gateway, set the defaults, cache the prefixes, cap the budgets — and let the frontier models earn their keep on the hard 10%.

See live rates on the LLM Leaderboard or run your own routing math with the Agent Loop Cost Estimator.