Benchmarks · 2026-06-26 · 6 min read
GLM-5.2 Tops Open-Weight Coding: 62.1% SWE-Bench Pro, 81% Terminal-Bench, $5.80/M
Z.ai's GLM-5.2 leads every open model on SWE-Bench Pro (62.1%) and Terminal-Bench 2.1 (81.0%), often beating OpenAI's GPT-5.5 and trailing only Claude Opus 4.8 — at $5.80 per million tokens vs Opus's $23. Free trials in Devin until July 5, plus local runs on consumer GPUs. Here's the benchmark breakdown and what it does to your coding-agent bill.
TL;DR
- GLM-5.2 is the #1 open-weight model on SWE-Bench Pro at 62.1% and Terminal-Bench 2.1 at 81.0% — beating OpenAI GPT-5.5 on both and trailing only Anthropic Claude Opus 4.8.
- Blended price: $5.80 per million tokens vs Opus 4.8 at ~$23 — roughly 4× cheaper for the same code-bench tier.
- Devin is offering free GLM-5.2 trials in-app through July 5, and the MIT-licensed weights run locally on consumer GPUs for near-zero marginal-cost inference.
- Qwen 3.6 remains a serious open alternative for agent loops, but GLM-5.2 owns the coding-benchmark crown for now.
- Net effect: the default model in coding IDEs and agent frameworks is no longer a closed flagship — it's mix-and-match, open-weight, and ~4–5× cheaper.
The Headline Numbers
Z.ai launched GLM-5.2 on June 16, 2026, and within ten days the benchmark results had settled at a level no open model had reached before. Two scores matter most for anyone shipping a coding agent:
SWE-Bench Pro tests real GitHub issue resolution against repo context. Terminal-Bench 2.1 grades multi-step shell-and-tool execution. GLM-5.2 is now the first open-weight model to live in the same tier as the closed flagships on both — not "close on benchmarks but unusable in the loop," but actually shipping fixed PRs at competitive pass rates.
Why $5.80 vs $23 Changes Defaults
The headline gap is roughly 4× on blended price, and wider still with prompt caching. For a typical coding-agent day (multiple file reads, long context, repeated tool calls) the math gets ugly for Opus:
That is $15K/month of pure savings for a 100-dev shop running coding agents — at near-identical SWE-Bench Pro pass rates. The CFO conversation writes itself.
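As a rough sanity check on that figure, the sketch below back-solves it from the blended rates quoted above ($5.80 vs roughly $23 per million tokens). It assumes the savings are computed on blended price and that all traffic moves from Opus to GLM; the per-developer token volume is an illustrative input, not a number from the benchmark data.

```python
# Blended prices per million tokens, as quoted in this article.
GLM_PRICE_PER_M = 5.80
OPUS_PRICE_PER_M = 23.00  # approximate


def monthly_savings(devs: int, tokens_m_per_dev: float) -> float:
    """Dollars saved per month by moving all traffic from Opus to GLM."""
    return devs * tokens_m_per_dev * (OPUS_PRICE_PER_M - GLM_PRICE_PER_M)


def tokens_needed_for_savings(devs: int, target_savings: float) -> float:
    """Million tokens per developer per month that would yield target_savings."""
    return target_savings / (devs * (OPUS_PRICE_PER_M - GLM_PRICE_PER_M))


if __name__ == "__main__":
    # How much usage does $15K/month across 100 devs imply?
    implied = tokens_needed_for_savings(devs=100, target_savings=15_000)
    print(f"Implied usage: {implied:.1f}M tokens per dev per month")
    # Hypothetical usage level, for comparison.
    print(f"Savings at 9M tokens/dev: ${monthly_savings(100, 9.0):,.0f}")
```

Under these assumptions, $15K/month for 100 developers corresponds to roughly 8.7 million blended tokens per developer per month. Plug in your own usage to see where your team lands.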
What Devs Actually Get
Hosted on Z.ai or via OpenRouter: $1.40 input / $4.20 output per million tokens, 1M-token context, dual reasoning modes (High / Max).
Free trial in Devin through July 5: Cognition built GLM-5.2 routing into Devin's app so teams can A/B against Claude on real PRs before committing. Anecdotal early reports: PR acceptance rates within ~3 points of Opus on JS/TS and Python repos.
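If you run that A/B yourself, a gap of a few points only means something with enough PRs behind it. The sketch below computes the difference in acceptance rate with a normal-approximation (Wald) 95% interval. The PR counts are invented for illustration, and the method is a standard textbook choice, not something Cognition or Z.ai prescribes.

```python
import math


def acceptance_diff(accepted_a: int, total_a: int,
                    accepted_b: int, total_b: int, z: float = 1.96):
    """Return (difference, low, high) for rate_a - rate_b with a Wald interval."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each arm needs at least one PR")
    p_a = accepted_a / total_a
    p_b = accepted_b / total_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    diff = p_a - p_b
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical counts: 200 PRs per arm, 140 vs 134 accepted.
    diff, low, high = acceptance_diff(140, 200, 134, 200)
    print(f"diff={diff:+.3f}  95% CI=({low:+.3f}, {high:+.3f})")
```

With 200 PRs per arm, a 3-point gap comes with an interval that spans zero, so it would take substantially more PRs to tell the two models apart.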
Local inference on consumer GPUs: The MIT-licensed weights are a 744B-parameter MoE with ~40B active, which means two consumer cards (24 GB RTX 4090 or 32 GB RTX 5090) plus offloading can run it usefully for single-developer coding loops. For internal tooling and air-gapped repos, marginal cost trends toward electricity only.
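To see why offloading is part of the picture, here is a back-of-the-envelope weight-memory estimate. It assumes weights dominate memory and ignores KV cache and activations. The quantization levels are illustrative choices, not configurations confirmed by Z.ai.

```python
TOTAL_PARAMS_B = 744   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 40   # approximate active parameters per token, in billions


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def offload_gb(params_billions: float, bits_per_weight: float,
               vram_gb: float) -> float:
    """Weights that do not fit in VRAM and must sit in system RAM or on disk."""
    return max(0.0, weight_gb(params_billions, bits_per_weight) - vram_gb)


if __name__ == "__main__":
    vram = 2 * 24  # assumed setup: two 24 GB cards
    for bits in (16, 8, 4):
        total = weight_gb(TOTAL_PARAMS_B, bits)
        active = weight_gb(ACTIVE_PARAMS_B, bits)
        spill = offload_gb(TOTAL_PARAMS_B, bits, vram)
        print(f"{bits:>2}-bit: total {total:.0f} GB, "
              f"active {active:.0f} GB, offloaded {spill:.0f} GB")
```

At 4-bit, the full weights come to about 372 GB while the roughly 40B active parameters are about 20 GB. The per-token working set is small relative to 48 GB of VRAM, but most of the model has to live in system RAM or on disk, so plan host memory accordingly.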
How It Stacks Against the Other Open Contenders
GLM-5.2 doesn't win every category. The honest scoreboard:
Translation: GLM-5.2 is the default for coding-IDE and SWE-agent workloads. Qwen 3.6 is still the safer pick for long agent loops that drift across many tool calls. DeepSeek remains the cheapest "good enough" generalist. Treat them as specialists in a router, not as a single replacement.
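A minimal sketch of that router idea follows. The model identifiers, the task categories, and the tool-call threshold are all hypothetical placeholders, not real provider slugs or tuned values. The point is only the shape of the dispatch.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; real provider slugs will differ.
CODING_MODEL = "glm-5.2"
AGENT_LOOP_MODEL = "qwen-3.6"
GENERALIST_MODEL = "deepseek"


@dataclass
class Task:
    kind: str                     # "code", "agent", or anything else
    expected_tool_calls: int = 0  # rough estimate supplied by the caller


def pick_model(task: Task, long_loop_threshold: int = 30) -> str:
    """Route a task to the specialist suggested for its workload."""
    # Long agent loops go to the model favored for tool-call stability.
    if task.kind == "agent" and task.expected_tool_calls >= long_loop_threshold:
        return AGENT_LOOP_MODEL
    # Coding-IDE and shorter SWE-agent work goes to the coding specialist.
    if task.kind in ("code", "agent"):
        return CODING_MODEL
    # Everything else falls back to the cheap generalist.
    return GENERALIST_MODEL


if __name__ == "__main__":
    for t in (Task("code"), Task("agent", 50), Task("agent", 5), Task("chat")):
        print(t, "->", pick_model(t))
```

Keeping the rules in one small function makes the routing policy easy to audit and to change as benchmark standings shift.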
The Vendor-Lock-In Story Just Broke
Three things collided this month:
1. An open model hit closed-flagship coding scores. That has been "next year" for two years. It is now.
2. It is roughly 4× cheaper hosted, and the marginal cost is close to zero self-hosted.
3. Big coding tools shipped first-class support fast. Devin's in-app trial is the loudest example, but Cursor, Cline, Aider and Continue all had GLM-5.2 presets within days.
Combined, the cost of switching off Opus 4.8 for routine coding work is now a config change, not a migration project. That's the actual news.
What To Do This Week
1. Route default coding traffic to GLM-5.2. Keep Opus 4.8 reserved for the hardest 10% — gnarly refactors, multi-repo reasoning, ambiguous specs.
2. Take the free Devin trial if you already use it — it's the fastest way to A/B PR acceptance on your own codebase before July 5.
3. Spin up a self-hosted node for internal repos. Even one 2×4090 box pays for itself in weeks at team scale and removes the egress-cost and privacy concerns of hosted Opus.
4. Cache aggressively. GLM-5.2's cached-input rate of $0.26 per million tokens makes long-context coding loops nearly free.
5. Keep Qwen 3.6 in the router for long agent loops where tool-call stability matters more than raw code-bench score.
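To put a number on item 4, the sketch below prices one agent turn at the hosted rates quoted earlier ($1.40 input, $4.20 output, $0.26 cached input, all taken as per million tokens). The token counts and the cache-hit fraction are made-up inputs for illustration.

```python
INPUT_PER_M = 1.40
CACHED_INPUT_PER_M = 0.26
OUTPUT_PER_M = 4.20


def turn_cost(input_tokens: int, output_tokens: int,
              cache_hit_fraction: float = 0.0) -> float:
    """Dollar cost of one request, given the share of input served from cache."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached = input_tokens * cache_hit_fraction
    fresh = input_tokens - cached
    total = (fresh * INPUT_PER_M
             + cached * CACHED_INPUT_PER_M
             + output_tokens * OUTPUT_PER_M)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical turn: 200K tokens of repo context in, 2K tokens out.
    cold = turn_cost(200_000, 2_000)
    warm = turn_cost(200_000, 2_000, cache_hit_fraction=0.9)
    print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

With these inputs the turn drops from about $0.29 uncached to about $0.08 at a 90% cache hit rate, and the saving repeats on every turn that reuses the same context.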
Bottom Line
For the first time, the best price-per-quality coding model on the market is open-weight, permissively licensed, and 4× cheaper than the closed flagship it competes with. GLM-5.2 didn't just match expectations for an open release — it shifted the default. If you're still defaulting coding-agent traffic to Opus 4.8, you're paying a Claude tax for results an MIT-licensed model now matches on the benchmarks that decide whether a PR ships.
See live pricing across providers on the AI Model Pricing Table, or compare GLM-5.2 head-to-head with the closed flagships in GLM-5.2 vs Claude Opus 4.8.