Claude benchmark scores live in three places that disagree: Anthropic release notes, Scale's SEAL leaderboard, and third-party trackers. The disagreement is scaffolding, not error. This page consolidates every current score with the exact API model ID and price per row, and labels which harness produced each number.
Master Table: Every Current Claude Model
Six Claude models were in active service as of June 9, 2026. Fable 5 went GA on June 9, 2026 as a new tier above Opus (currently suspended as of June 12, 2026, see note above). Opus 4.8 shipped May 28, 2026 at the same price as 4.7 and 4.6. Mythos 5 is the same underlying model as Fable 5 with safety classifiers lifted, restricted to approved Project Glasswing partners (also currently suspended).
| Model | API Model ID | SWE-bench Verified | SWE-bench Pro | Context / Max Output | $/MTok In / Out |
|---|---|---|---|---|---|
| Claude Fable 5 | claude-fable-5 | 95.0% | 80.3% | 1M / 128k | $10 / $50 |
| Claude Mythos 5 | claude-mythos-5 (restricted) | 93.9%* | 77.8%* | 1M / 128k | $10 / $50 |
| Claude Opus 4.8 | claude-opus-4-8 | 88.6% | 69.2% | 1M / 128k | $5 / $25 |
| Claude Opus 4.7 | claude-opus-4-7 | 87.6% | 64.3% | 1M / 128k | $5 / $25 |
| Claude Opus 4.6 | claude-opus-4-6 | 80.8% | 51.9%† | 1M / 128k | $5 / $25 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 79.6% | n/a | 1M / 64k | $3 / $15 |
| Claude Haiku 4.5 | claude-haiku-4-5 | 73.3% | 39.5%† | 200k / 64k | $1 / $5 |
| Claude Opus 4.5 (legacy) | claude-opus-4-5-20251101 | 80.9% | 45.9%† | 200k / 64k | $5 / $25 |
| Claude Sonnet 4.5 (legacy) | claude-sonnet-4-5-20250929 | 77.2% | 43.6%† | 200k / 64k | $3 / $15 |
SWE-bench Pro scores without a dagger are Anthropic-run. † = Scale SEAL leaderboard (standardized scaffolding), which runs well below vendor scaffolds. * = Claude Mythos Preview scores; Mythos 5 itself is not on public leaderboards. Verified scores from the llm-stats tracker (updated June 2026) and provider announcements.
SWE-bench Verified Scores
SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories. Models are scored on the percentage of tasks resolved correctly. Frontier scores are self-reported with provider scaffolds, so treat differences under ~1 point as noise.
| Rank | Model | Score | Provider |
|---|---|---|---|
| 1 | Claude Fable 5 | 95.0% | Anthropic |
| 2 | Claude Mythos Preview | 93.9% | Anthropic |
| 3 | Claude Opus 4.8 | 88.6% | Anthropic |
| 4 | Claude Opus 4.7 | 87.6% | Anthropic |
| 5 | Claude Opus 4.5 | 80.9% | Anthropic |
| 6 | Claude Opus 4.6 | 80.8% | Anthropic |
| 7 | DeepSeek-V4-Pro-Max | 80.6% | DeepSeek |
| 8 | Gemini 3.1 Pro | 80.6% | |
| 9 | MiniMax M3 | 80.5% | MiniMax |
| 10 | Qwen3.7 Max | 80.4% | Alibaba |
Source: llm-stats SWE-bench Verified tracker, June 2026.
The top six entries are all Claude. The bigger story is the compression below them: four open-weights or non-US models (DeepSeek V4 Pro Max, Gemini 3.1 Pro, MiniMax M3, Qwen3.7 Max) sit within 0.5 points of each other at ~80.5%, roughly where Opus 4.5 and 4.6 were a generation ago. Verified has a known contamination history and the 80% band is saturated. The 88.6% to 95.0% range above it is where the benchmark still differentiates.
SWE-bench Pro Scores: Vendor vs Standardized
SWE-bench Pro contains 1,865 tasks across 41 professional repositories, split into public, commercial (private), and held-out sets. Two score families exist and they are not comparable: Anthropic's vendor-run numbers (own scaffold) and Scale's SEAL leaderboard (standardized scaffold, same harness for every model). Vendor scaffolds run 15 to 30 points higher.
Anthropic-Run Scores (Vendor Scaffold)
| Model | Score |
|---|---|
| Claude Fable 5 | 80.3% |
| Claude Mythos Preview | 77.8% |
| Claude Opus 4.8 | 69.2% |
| Claude Opus 4.7 | 64.3% |
| GPT-5.5 | 58.6% |
| Gemini 3.1 Pro | 54.2% |
Scale SEAL Leaderboard (Standardized Scaffold, Public Set)
Scale runs every model through the same harness, which isolates model capability from scaffold engineering. The newest Claude models (4.7, 4.8, Fable 5) are not yet listed; Opus 4.6 is the top Claude entry.
| Rank | Model | Score | CI |
|---|---|---|---|
| 1 | GPT-5.4 (xHigh) | 59.1% | ±3.56 |
| 2 | Muse Spark | 55.0% | ±3.60 |
| 3 | Claude Opus 4.6 (thinking) | 51.9% | ±3.61 |
| 4 | Gemini 3.1 Pro (thinking) | 46.1% | ±3.60 |
| 5 | Claude Opus 4.5 | 45.9% | ±3.60 |
| 6 | Claude Sonnet 4.5 | 43.6% | ±3.60 |
| 7 | Gemini 3 Pro | 43.3% | ±3.60 |
| 8 | Claude Sonnet 4 | 42.7% | ±3.59 |
| 9 | GPT-5 (High) | 41.8% | ±3.49 |
| 10 | GPT-5.2 Codex | 41.0% | ±3.57 |
| 11 | Claude Haiku 4.5 | 39.5% | ±3.55 |
| 12 | Qwen3 Coder 480B | 38.7% | ±3.55 |
On the harder private (commercial) set, the ordering flips: Claude Opus 4.6 (thinking) leads at 47.1%, ahead of Muse Spark (44.7%) and GPT-5.4 xHigh (43.4%), with Gemini 3.1 Pro at 32.2%. Claude degrades less than competitors when moving from public to unseen commercial repositories, which is the closer proxy for production codebases.
Sources: SEAL public leaderboard and private leaderboard, June 9, 2026.
Scaffolding moves scores more than model choice
The same Opus 4.6 scores 51.9% on Scale's standardized harness and materially higher on vendor scaffolds. Search and context retrieval are the usual difference. In Morph internal benchmarks, adding WarpGrep as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while cutting cost 15.6% and latency 28%. WarpGrep runs in its own context window and issues up to 8 parallel tool calls per turn. Pricing: $0 for 100k requests, $1 per 1M on Pro.
Opus 4.8 vs Opus 4.7: What Changed
Opus 4.8 (released May 28, 2026) is a same-price upgrade over 4.7. Anthropic's release notes claim it is 4x less likely to let flaws pass in code review.
| Benchmark | Opus 4.8 | Opus 4.7 |
|---|---|---|
| SWE-bench Pro | 69.2% | 64.3% |
| SWE-bench Verified | 88.6% | 87.6% |
| GPQA Diamond | 93.6% | n/a |
| OSWorld | 83.4% | 82.3% (OSWorld-Verified) |
| GDPval-AA (Elo) | 1890 | n/a |
| Price ($/MTok in / out) | $5 / $25 | $5 / $25 |
| Fast mode ($/MTok in / out) | $10 / $50 | $30 / $150 |
| Training cutoff | Jan 2026 | Jan 2026 |
The fast-mode repricing is the underreported change: Opus 4.8 fast mode (research preview, ~2.5x faster output) costs $10/$50 per MTok, a third of the $30/$150 fast-mode price on Opus 4.6 and 4.7. On GDPval-AA, Opus 4.8's 1890 Elo compares to 1769 for GPT-5.5. One tokenizer caveat applies to both 4.7 and 4.8: models from Opus 4.7 onward (including Fable 5) use a new tokenizer that can produce up to 35% more tokens for the same text than pre-4.7 models, so per-request cost comparisons against 4.6 need re-baselining, not just a price-table read.
Terminal-Bench Scores
Terminal-Bench tests whether a model can operate a live terminal: environment management, debugging, multi-step system operations. It is the benchmark where Claude has historically trailed OpenAI, and the gap is narrowing rather than closed.
| Model | Version | Score |
|---|---|---|
| GPT-5.5 | Terminal-Bench 2.1 | 78.2% |
| Claude Opus 4.8 | Terminal-Bench 2.1 | 74.6% |
| Claude Opus 4.6 | Terminal-Bench 2.0 | 65.4% |
For the grounding query that lands here: Claude Opus 4.6 scored 65.4% on Terminal-Bench 2.0. Opus 4.8 scores 74.6% on Terminal-Bench 2.1, 3.6 points behind GPT-5.5's 78.2%. Versions 2.0 and 2.1 use different task sets, so the 65.4% and 74.6% figures are not directly comparable.
GPQA, OSWorld, and Agentic Benchmarks
Coding benchmarks measure patch quality. These measure the reasoning and computer-operation capability underneath it.
| Benchmark | Score | What It Measures |
|---|---|---|
| GPQA Diamond | 93.6% | Graduate-level science questions validated by domain experts |
| OSWorld | 83.4% | Operating a real desktop OS through screenshots and actions |
| Online-Mind2Web | 84% | Live web-browsing task completion |
| GDPval-AA | 1890 Elo | Economically valuable knowledge work (GPT-5.5: 1769) |
Anthropic's announcement footnotes add that Opus 4.8 is the first model to break 10% on a legal agent benchmark scored on an all-tasks-pass standard. Fable 5 extends the frontier further on Anthropic's hardest internal evals: 29.3% on FrontierCode Diamond versus 13.4% for Opus 4.8, and 29.8% on GDP.pdf vision tasks versus 24.9% for GPT-5.5. Mythos 5, the classifier-lifted variant, posts 78.0% capture on ExploitBench and 46.1% on BioMysteryBench, which is why it is gated to vetted Project Glasswing partners rather than self-serve.
Pricing per Million Tokens (June 2026)
Benchmark tables without prices hide the real decision. Full Anthropic API price list, with cache and batch rates:
| Model | Input | Output | Cache Hit | Batch In / Out |
|---|---|---|---|---|
| Claude Fable 5 / Mythos 5 | $10.00 | $50.00 | $1.00 | $5.00 / $25.00 |
| Claude Opus 4.8 / 4.7 / 4.6 / 4.5 | $5.00 | $25.00 | $0.50 | $2.50 / $12.50 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $1.50 / $7.50 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50 / $2.50 |
| Claude Opus 4.1 / 4 (deprecated) | $15.00 | $75.00 | n/a | n/a |
Modifiers that apply across models: 5-minute cache writes bill at 1.25x base input and 1-hour writes at 2x; cache reads at 0.1x. The Batch API is 50% off both input and output. Setting inference_geo: "us" (Opus 4.6+ and Sonnet 4.6+) adds a 1.1x multiplier on all token categories. The web search tool costs $10 per 1,000 searches. There is no long-context surcharge on the 1M-window models: Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 bill a 900k-token request at the same rate as a 9k one.
For context against competitors: GPT-5.5 is $5/$30 per MTok, gpt-5.4 is $2.50/$15, gpt-5.3-codex is $1.75/$14, and Gemini 3.1 Pro is $2/$12 for prompts up to 200k tokens ($4/$18 above). Opus 4.8 at $5/$25 sits between GPT-5.5 and gpt-5.4 on output price while leading both on SWE-bench Pro (69.2% vs 58.6% for GPT-5.5, Anthropic-run). See Anthropic API pricing for the full breakdown.
Claude API Model IDs and Versions (2026)
Exact strings to pass as model. From the 4.6 generation onward, IDs are dateless pinned snapshots; earlier models keep date suffixes.
| Model | API ID | Context | Max Output | Training Cutoff |
|---|---|---|---|---|
| Claude Fable 5 | claude-fable-5 | 1M | 128k | n/a (GA Jun 9, 2026) |
| Claude Mythos 5 | claude-mythos-5 | 1M | 128k | Glasswing partners only |
| Claude Opus 4.8 | claude-opus-4-8 | 1M (200k on MS Foundry) | 128k | Jan 2026 |
| Claude Opus 4.7 | claude-opus-4-7 | 1M | 128k | Jan 2026 |
| Claude Opus 4.6 | claude-opus-4-6 | 1M | 128k | Aug 2025 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 1M | 64k | Jan 2026 (reliable: Aug 2025) |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 (alias: claude-haiku-4-5) | 200k | 64k | Jul 2025 |
| Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 | 200k | 64k | n/a |
| Claude Opus 4.5 | claude-opus-4-5-20251101 | 200k | 64k | n/a |
The Batch API supports 300k output tokens on Opus 4.6+ and Sonnet 4.6 via the beta header output-300k-2026-03-24. Fable 5 is available on the Claude API, Claude Platform on AWS, Bedrock (anthropic.claude-fable-5), Vertex AI, and Microsoft Foundry.
Deprecation and Retirement Dates
| Model | API ID | Retires |
|---|---|---|
| Claude Sonnet 4 | claude-sonnet-4-20250514 | June 15, 2026 |
| Claude Opus 4 | claude-opus-4-20250514 | June 15, 2026 |
| Claude Opus 4.1 | claude-opus-4-1-20250805 | August 5, 2026 |
| Claude Haiku 3.5 | claude-3-5-haiku-20241022 | Retired (except Bedrock/Vertex) |
Which Claude Model Is Best for Coding?
Opus 4.8 is the default. It holds 88.6% SWE-bench Verified, 69.2% SWE-bench Pro, and 74.6% Terminal-Bench 2.1 at $5/$25 per MTok, half the output price of GPT-5.5 with a higher Pro score. Route around the default by task shape:
| Task | Model | Why |
|---|---|---|
| Default coding agent | Opus 4.8 | 69.2% SWE-bench Pro at $5/$25; fast mode at $10/$50 |
| Hardest long-horizon work | Fable 5 (currently suspended, see note) | 80.3% Pro, 95.0% Verified; 2x output price ($50/MTok) |
| Everyday edits, drafts, chat | Sonnet 4.6 | $3/$15 with the same 1M context window |
| Code review, tests, subagents | Haiku 4.5 | $1/$5; 39.5% SEAL Pro beats GPT-5.2 Codex's private-set 27.7% |
| Migrations needing huge context | Opus 4.8 or Sonnet 4.6 | 1M tokens, no long-context surcharge |
Fable 5's 11-point SWE-bench Pro lead over Opus 4.8 costs 2x per output token (currently suspended, see note above). When available, it pays off on tasks where a failed run wastes more than the token delta: large refactors, overnight autonomous runs, frontier-difficulty bugs. For multi-agent setups, pairing Opus 4.8 with a cheap search subagent beats upgrading the main model: that is the WarpGrep result above. Cross-vendor comparison lives at best AI model for coding.
Legacy Models: Claude 3.5 Sonnet and Earlier
Searches for Claude 3.5 Sonnet benchmarks still land here, so the official numbers, for the record: Claude 3.5 Sonnet (June 2024) scored 33.4% on SWE-bench Verified, 59.4% on GPQA Diamond, and 92.0% on HumanEval per Anthropic's launch announcement. The upgraded Claude 3.5 Sonnet (October 2024) raised SWE-bench Verified to 49.0%. Both were retired October 28, 2025; the recommended replacement is claude-sonnet-4-6.
| Model | Release | SWE-bench Verified |
|---|---|---|
| Claude 3.5 Sonnet | Jun 2024 | 33.4% |
| Claude 3.5 Sonnet (upgraded) | Oct 2024 | 49.0% |
| Claude 3.7 Sonnet | Feb 2025 | 62.3% |
| Claude Sonnet 4.5 | Sep 2025 | 77.2% |
| Claude Opus 4.5 | Nov 2025 | 80.9% |
| Claude Opus 4.6 | Feb 2026 | 80.8% |
| Claude Opus 4.7 | Early 2026 | 87.6% |
| Claude Opus 4.8 | May 2026 | 88.6% |
| Claude Fable 5 | Jun 2026 | 95.0% |
33.4% to 95.0% in two years. The 4.6 generation plateaued near 81% while Anthropic shipped 1M context and agentic tool use; 4.7 and 4.8 broke the plateau, and Fable 5 added another 6.4 points on top. The spread on SWE-bench Pro stays wide (80.3% to 39.5% across current Claude models alone), which is why Pro is the better differentiator at the frontier.
Frequently Asked Questions
What Claude models are available in 2026 (Opus, Sonnet, Haiku versions)?
Claude Fable 5 (claude-fable-5, GA June 9, 2026, currently suspended), Claude Mythos 5 (claude-mythos-5, Project Glasswing partners only, currently suspended), Claude Opus 4.8 (claude-opus-4-8, released May 28, 2026), Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5. Sonnet 4 and Opus 4 retire June 15, 2026; Opus 4.1 retires August 5, 2026.
What is the API model ID for Claude Sonnet 4.6?
claude-sonnet-4-6. No date suffix: from the 4.6 generation onward, Anthropic IDs are dateless pinned snapshots. 1M context, 64k max output, $3/$15 per MTok.
What is Claude Opus 4.8's SWE-bench Pro score?
69.2% in Anthropic's vendor-run evaluation, vs 64.3% for Opus 4.7, 58.6% for GPT-5.5, and 54.2% for Gemini 3.1 Pro. On Scale's standardized SEAL leaderboard, the top Claude entry is Opus 4.6 (thinking) at 51.9% public / 47.1% private; 4.7, 4.8, and Fable 5 are not yet listed there.
What is Claude Opus 4.6's Terminal-Bench score?
65.4% on Terminal-Bench 2.0. On the newer Terminal-Bench 2.1, Opus 4.8 scores 74.6% vs GPT-5.5's 78.2%. The two versions use different task sets and are not directly comparable.
What are Claude 3.5 Sonnet's official benchmark scores?
June 2024 release: 33.4% SWE-bench Verified, 59.4% GPQA Diamond, 92.0% HumanEval. October 2024 upgrade: 49.0% SWE-bench Verified. Retired October 28, 2025; replacement is claude-sonnet-4-6.
Which Claude model is best for coding?
Opus 4.8 by default (88.6% Verified, 69.2% Pro, $5/$25). Fable 5 (currently suspended, see note above) for the hardest long-horizon work if 2x output price is justified (95.0% Verified, 80.3% Pro, $10/$50). Sonnet 4.6 for everyday coding at $3/$15. Haiku 4.5 for high-volume review and subagent work at $1/$5. In multi-agent setups, a specialized search subagent like WarpGrep lifts whichever main model you pick.
What is Claude Fable 5 and how does it score?
Fable 5 is Anthropic's tier above Opus, GA June 9, 2026 (currently suspended as of June 12, 2026, see note above): 95.0% SWE-bench Verified, 80.3% SWE-bench Pro, 29.3% FrontierCode Diamond (vs 13.4% for Opus 4.8). $10/$50 per MTok, 1M context, 128k max output, adaptive thinking always on.
Do Claude models charge extra for long context?
No. Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 include the full 1M-token window at standard per-token pricing. GPT-5.5 also offers 1M, while Gemini 3.1 Pro raises rates above 200k tokens ($2/$12 to $4/$18). One cost caveat: the tokenizer on Opus 4.7+ can produce up to 35% more tokens for the same text than pre-4.7 models.
Why do Scale SEAL scores differ from Anthropic's scores?
Scaffolding. Scale runs every model through one standardized harness on SWE-bench Pro's public set; Anthropic runs its own agent scaffold. Opus-class models score in the high 60s on vendor scaffolds and low 50s on Scale's. Use SEAL numbers to compare models against each other, vendor numbers to track a single vendor's generation-over-generation progress.
Build with Claude + WarpGrep
WarpGrep lifted every model it was paired with on SWE-bench Pro, taking Opus 4.6 from 55.4% to 57.5% while cutting cost 15.6% and latency 28%. It runs in its own context window and issues 8 parallel tool calls per turn. $0 for 100k requests.
