Claude Benchmarks (2026): Fable 5 Hits 95% SWE-bench Verified. Every Model, Score, API ID, and Price

Every current Claude model benchmarked: Fable 5 (95% SWE-bench Verified), Opus 4.8 (88.6%, 69.2% SWE-bench Pro), Opus 4.7, Sonnet 4.6, Haiku 4.5. Exact API model IDs, $/MTok pricing, Terminal-Bench, GPQA, and legacy Claude 3.5 Sonnet scores.

June 9, 2026 · 1 min read
Claude Benchmarks (2026): Fable 5 Hits 95% SWE-bench Verified. Every Model, Score, API ID, and Price

Claude benchmark scores live in three places that disagree: Anthropic release notes, Scale's SEAL leaderboard, and third-party trackers. The disagreement is scaffolding, not error. This page consolidates every current score with the exact API model ID and price per row, and labels which harness produced each number.

Scores verified June 9, 2026

Master Table: Every Current Claude Model

Six Claude models were in active service as of June 9, 2026. Fable 5 went GA on June 9, 2026 as a new tier above Opus (currently suspended as of June 12, 2026, see note above). Opus 4.8 shipped May 28, 2026 at the same price as 4.7 and 4.6. Mythos 5 is the same underlying model as Fable 5 with safety classifiers lifted, restricted to approved Project Glasswing partners (also currently suspended).

95.0%
Fable 5 SWE-bench Verified
88.6%
Opus 4.8 SWE-bench Verified
69.2%
Opus 4.8 SWE-bench Pro
93.6%
Opus 4.8 GPQA Diamond
ModelAPI Model IDSWE-bench VerifiedSWE-bench ProContext / Max Output$/MTok In / Out
Claude Fable 5claude-fable-595.0%80.3%1M / 128k$10 / $50
Claude Mythos 5claude-mythos-5 (restricted)93.9%*77.8%*1M / 128k$10 / $50
Claude Opus 4.8claude-opus-4-888.6%69.2%1M / 128k$5 / $25
Claude Opus 4.7claude-opus-4-787.6%64.3%1M / 128k$5 / $25
Claude Opus 4.6claude-opus-4-680.8%51.9%†1M / 128k$5 / $25
Claude Sonnet 4.6claude-sonnet-4-679.6%n/a1M / 64k$3 / $15
Claude Haiku 4.5claude-haiku-4-573.3%39.5%†200k / 64k$1 / $5
Claude Opus 4.5 (legacy)claude-opus-4-5-2025110180.9%45.9%†200k / 64k$5 / $25
Claude Sonnet 4.5 (legacy)claude-sonnet-4-5-2025092977.2%43.6%†200k / 64k$3 / $15

SWE-bench Pro scores without a dagger are Anthropic-run. † = Scale SEAL leaderboard (standardized scaffolding), which runs well below vendor scaffolds. * = Claude Mythos Preview scores; Mythos 5 itself is not on public leaderboards. Verified scores from the llm-stats tracker (updated June 2026) and provider announcements.

SWE-bench Verified Scores

SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories. Models are scored on the percentage of tasks resolved correctly. Frontier scores are self-reported with provider scaffolds, so treat differences under ~1 point as noise.

RankModelScoreProvider
1Claude Fable 595.0%Anthropic
2Claude Mythos Preview93.9%Anthropic
3Claude Opus 4.888.6%Anthropic
4Claude Opus 4.787.6%Anthropic
5Claude Opus 4.580.9%Anthropic
6Claude Opus 4.680.8%Anthropic
7DeepSeek-V4-Pro-Max80.6%DeepSeek
8Gemini 3.1 Pro80.6%Google
9MiniMax M380.5%MiniMax
10Qwen3.7 Max80.4%Alibaba

Source: llm-stats SWE-bench Verified tracker, June 2026.

The top six entries are all Claude. The bigger story is the compression below them: four open-weights or non-US models (DeepSeek V4 Pro Max, Gemini 3.1 Pro, MiniMax M3, Qwen3.7 Max) sit within 0.5 points of each other at ~80.5%, roughly where Opus 4.5 and 4.6 were a generation ago. Verified has a known contamination history and the 80% band is saturated. The 88.6% to 95.0% range above it is where the benchmark still differentiates.

SWE-bench Pro Scores: Vendor vs Standardized

SWE-bench Pro contains 1,865 tasks across 41 professional repositories, split into public, commercial (private), and held-out sets. Two score families exist and they are not comparable: Anthropic's vendor-run numbers (own scaffold) and Scale's SEAL leaderboard (standardized scaffold, same harness for every model). Vendor scaffolds run 15 to 30 points higher.

Anthropic-Run Scores (Vendor Scaffold)

ModelScore
Claude Fable 580.3%
Claude Mythos Preview77.8%
Claude Opus 4.869.2%
Claude Opus 4.764.3%
GPT-5.558.6%
Gemini 3.1 Pro54.2%

Scale SEAL Leaderboard (Standardized Scaffold, Public Set)

Scale runs every model through the same harness, which isolates model capability from scaffold engineering. The newest Claude models (4.7, 4.8, Fable 5) are not yet listed; Opus 4.6 is the top Claude entry.

RankModelScoreCI
1GPT-5.4 (xHigh)59.1%±3.56
2Muse Spark55.0%±3.60
3Claude Opus 4.6 (thinking)51.9%±3.61
4Gemini 3.1 Pro (thinking)46.1%±3.60
5Claude Opus 4.545.9%±3.60
6Claude Sonnet 4.543.6%±3.60
7Gemini 3 Pro43.3%±3.60
8Claude Sonnet 442.7%±3.59
9GPT-5 (High)41.8%±3.49
10GPT-5.2 Codex41.0%±3.57
11Claude Haiku 4.539.5%±3.55
12Qwen3 Coder 480B38.7%±3.55

On the harder private (commercial) set, the ordering flips: Claude Opus 4.6 (thinking) leads at 47.1%, ahead of Muse Spark (44.7%) and GPT-5.4 xHigh (43.4%), with Gemini 3.1 Pro at 32.2%. Claude degrades less than competitors when moving from public to unseen commercial repositories, which is the closer proxy for production codebases.

Sources: SEAL public leaderboard and private leaderboard, June 9, 2026.

Scaffolding moves scores more than model choice

The same Opus 4.6 scores 51.9% on Scale's standardized harness and materially higher on vendor scaffolds. Search and context retrieval are the usual difference. In Morph internal benchmarks, adding WarpGrep as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while cutting cost 15.6% and latency 28%. WarpGrep runs in its own context window and issues up to 8 parallel tool calls per turn. Pricing: $0 for 100k requests, $1 per 1M on Pro.

Opus 4.8 vs Opus 4.7: What Changed

Opus 4.8 (released May 28, 2026) is a same-price upgrade over 4.7. Anthropic's release notes claim it is 4x less likely to let flaws pass in code review.

BenchmarkOpus 4.8Opus 4.7
SWE-bench Pro69.2%64.3%
SWE-bench Verified88.6%87.6%
GPQA Diamond93.6%n/a
OSWorld83.4%82.3% (OSWorld-Verified)
GDPval-AA (Elo)1890n/a
Price ($/MTok in / out)$5 / $25$5 / $25
Fast mode ($/MTok in / out)$10 / $50$30 / $150
Training cutoffJan 2026Jan 2026

The fast-mode repricing is the underreported change: Opus 4.8 fast mode (research preview, ~2.5x faster output) costs $10/$50 per MTok, a third of the $30/$150 fast-mode price on Opus 4.6 and 4.7. On GDPval-AA, Opus 4.8's 1890 Elo compares to 1769 for GPT-5.5. One tokenizer caveat applies to both 4.7 and 4.8: models from Opus 4.7 onward (including Fable 5) use a new tokenizer that can produce up to 35% more tokens for the same text than pre-4.7 models, so per-request cost comparisons against 4.6 need re-baselining, not just a price-table read.

Terminal-Bench Scores

Terminal-Bench tests whether a model can operate a live terminal: environment management, debugging, multi-step system operations. It is the benchmark where Claude has historically trailed OpenAI, and the gap is narrowing rather than closed.

ModelVersionScore
GPT-5.5Terminal-Bench 2.178.2%
Claude Opus 4.8Terminal-Bench 2.174.6%
Claude Opus 4.6Terminal-Bench 2.065.4%

For the grounding query that lands here: Claude Opus 4.6 scored 65.4% on Terminal-Bench 2.0. Opus 4.8 scores 74.6% on Terminal-Bench 2.1, 3.6 points behind GPT-5.5's 78.2%. Versions 2.0 and 2.1 use different task sets, so the 65.4% and 74.6% figures are not directly comparable.

GPQA, OSWorld, and Agentic Benchmarks

Coding benchmarks measure patch quality. These measure the reasoning and computer-operation capability underneath it.

BenchmarkScoreWhat It Measures
GPQA Diamond93.6%Graduate-level science questions validated by domain experts
OSWorld83.4%Operating a real desktop OS through screenshots and actions
Online-Mind2Web84%Live web-browsing task completion
GDPval-AA1890 EloEconomically valuable knowledge work (GPT-5.5: 1769)

Anthropic's announcement footnotes add that Opus 4.8 is the first model to break 10% on a legal agent benchmark scored on an all-tasks-pass standard. Fable 5 extends the frontier further on Anthropic's hardest internal evals: 29.3% on FrontierCode Diamond versus 13.4% for Opus 4.8, and 29.8% on GDP.pdf vision tasks versus 24.9% for GPT-5.5. Mythos 5, the classifier-lifted variant, posts 78.0% capture on ExploitBench and 46.1% on BioMysteryBench, which is why it is gated to vetted Project Glasswing partners rather than self-serve.

Pricing per Million Tokens (June 2026)

Benchmark tables without prices hide the real decision. Full Anthropic API price list, with cache and batch rates:

ModelInputOutputCache HitBatch In / Out
Claude Fable 5 / Mythos 5$10.00$50.00$1.00$5.00 / $25.00
Claude Opus 4.8 / 4.7 / 4.6 / 4.5$5.00$25.00$0.50$2.50 / $12.50
Claude Sonnet 4.6$3.00$15.00$0.30$1.50 / $7.50
Claude Haiku 4.5$1.00$5.00$0.10$0.50 / $2.50
Claude Opus 4.1 / 4 (deprecated)$15.00$75.00n/an/a

Modifiers that apply across models: 5-minute cache writes bill at 1.25x base input and 1-hour writes at 2x; cache reads at 0.1x. The Batch API is 50% off both input and output. Setting inference_geo: "us" (Opus 4.6+ and Sonnet 4.6+) adds a 1.1x multiplier on all token categories. The web search tool costs $10 per 1,000 searches. There is no long-context surcharge on the 1M-window models: Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 bill a 900k-token request at the same rate as a 9k one.

For context against competitors: GPT-5.5 is $5/$30 per MTok, gpt-5.4 is $2.50/$15, gpt-5.3-codex is $1.75/$14, and Gemini 3.1 Pro is $2/$12 for prompts up to 200k tokens ($4/$18 above). Opus 4.8 at $5/$25 sits between GPT-5.5 and gpt-5.4 on output price while leading both on SWE-bench Pro (69.2% vs 58.6% for GPT-5.5, Anthropic-run). See Anthropic API pricing for the full breakdown.

Claude API Model IDs and Versions (2026)

Exact strings to pass as model. From the 4.6 generation onward, IDs are dateless pinned snapshots; earlier models keep date suffixes.

ModelAPI IDContextMax OutputTraining Cutoff
Claude Fable 5claude-fable-51M128kn/a (GA Jun 9, 2026)
Claude Mythos 5claude-mythos-51M128kGlasswing partners only
Claude Opus 4.8claude-opus-4-81M (200k on MS Foundry)128kJan 2026
Claude Opus 4.7claude-opus-4-71M128kJan 2026
Claude Opus 4.6claude-opus-4-61M128kAug 2025
Claude Sonnet 4.6claude-sonnet-4-61M64kJan 2026 (reliable: Aug 2025)
Claude Haiku 4.5claude-haiku-4-5-20251001 (alias: claude-haiku-4-5)200k64kJul 2025
Claude Sonnet 4.5claude-sonnet-4-5-20250929200k64kn/a
Claude Opus 4.5claude-opus-4-5-20251101200k64kn/a

The Batch API supports 300k output tokens on Opus 4.6+ and Sonnet 4.6 via the beta header output-300k-2026-03-24. Fable 5 is available on the Claude API, Claude Platform on AWS, Bedrock (anthropic.claude-fable-5), Vertex AI, and Microsoft Foundry.

Deprecation and Retirement Dates

ModelAPI IDRetires
Claude Sonnet 4claude-sonnet-4-20250514June 15, 2026
Claude Opus 4claude-opus-4-20250514June 15, 2026
Claude Opus 4.1claude-opus-4-1-20250805August 5, 2026
Claude Haiku 3.5claude-3-5-haiku-20241022Retired (except Bedrock/Vertex)

Which Claude Model Is Best for Coding?

Opus 4.8 is the default. It holds 88.6% SWE-bench Verified, 69.2% SWE-bench Pro, and 74.6% Terminal-Bench 2.1 at $5/$25 per MTok, half the output price of GPT-5.5 with a higher Pro score. Route around the default by task shape:

TaskModelWhy
Default coding agentOpus 4.869.2% SWE-bench Pro at $5/$25; fast mode at $10/$50
Hardest long-horizon workFable 5 (currently suspended, see note)80.3% Pro, 95.0% Verified; 2x output price ($50/MTok)
Everyday edits, drafts, chatSonnet 4.6$3/$15 with the same 1M context window
Code review, tests, subagentsHaiku 4.5$1/$5; 39.5% SEAL Pro beats GPT-5.2 Codex's private-set 27.7%
Migrations needing huge contextOpus 4.8 or Sonnet 4.61M tokens, no long-context surcharge

Fable 5's 11-point SWE-bench Pro lead over Opus 4.8 costs 2x per output token (currently suspended, see note above). When available, it pays off on tasks where a failed run wastes more than the token delta: large refactors, overnight autonomous runs, frontier-difficulty bugs. For multi-agent setups, pairing Opus 4.8 with a cheap search subagent beats upgrading the main model: that is the WarpGrep result above. Cross-vendor comparison lives at best AI model for coding.

Legacy Models: Claude 3.5 Sonnet and Earlier

Searches for Claude 3.5 Sonnet benchmarks still land here, so the official numbers, for the record: Claude 3.5 Sonnet (June 2024) scored 33.4% on SWE-bench Verified, 59.4% on GPQA Diamond, and 92.0% on HumanEval per Anthropic's launch announcement. The upgraded Claude 3.5 Sonnet (October 2024) raised SWE-bench Verified to 49.0%. Both were retired October 28, 2025; the recommended replacement is claude-sonnet-4-6.

ModelReleaseSWE-bench Verified
Claude 3.5 SonnetJun 202433.4%
Claude 3.5 Sonnet (upgraded)Oct 202449.0%
Claude 3.7 SonnetFeb 202562.3%
Claude Sonnet 4.5Sep 202577.2%
Claude Opus 4.5Nov 202580.9%
Claude Opus 4.6Feb 202680.8%
Claude Opus 4.7Early 202687.6%
Claude Opus 4.8May 202688.6%
Claude Fable 5Jun 202695.0%

33.4% to 95.0% in two years. The 4.6 generation plateaued near 81% while Anthropic shipped 1M context and agentic tool use; 4.7 and 4.8 broke the plateau, and Fable 5 added another 6.4 points on top. The spread on SWE-bench Pro stays wide (80.3% to 39.5% across current Claude models alone), which is why Pro is the better differentiator at the frontier.

Frequently Asked Questions

What Claude models are available in 2026 (Opus, Sonnet, Haiku versions)?

Claude Fable 5 (claude-fable-5, GA June 9, 2026, currently suspended), Claude Mythos 5 (claude-mythos-5, Project Glasswing partners only, currently suspended), Claude Opus 4.8 (claude-opus-4-8, released May 28, 2026), Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5. Sonnet 4 and Opus 4 retire June 15, 2026; Opus 4.1 retires August 5, 2026.

What is the API model ID for Claude Sonnet 4.6?

claude-sonnet-4-6. No date suffix: from the 4.6 generation onward, Anthropic IDs are dateless pinned snapshots. 1M context, 64k max output, $3/$15 per MTok.

What is Claude Opus 4.8's SWE-bench Pro score?

69.2% in Anthropic's vendor-run evaluation, vs 64.3% for Opus 4.7, 58.6% for GPT-5.5, and 54.2% for Gemini 3.1 Pro. On Scale's standardized SEAL leaderboard, the top Claude entry is Opus 4.6 (thinking) at 51.9% public / 47.1% private; 4.7, 4.8, and Fable 5 are not yet listed there.

What is Claude Opus 4.6's Terminal-Bench score?

65.4% on Terminal-Bench 2.0. On the newer Terminal-Bench 2.1, Opus 4.8 scores 74.6% vs GPT-5.5's 78.2%. The two versions use different task sets and are not directly comparable.

What are Claude 3.5 Sonnet's official benchmark scores?

June 2024 release: 33.4% SWE-bench Verified, 59.4% GPQA Diamond, 92.0% HumanEval. October 2024 upgrade: 49.0% SWE-bench Verified. Retired October 28, 2025; replacement is claude-sonnet-4-6.

Which Claude model is best for coding?

Opus 4.8 by default (88.6% Verified, 69.2% Pro, $5/$25). Fable 5 (currently suspended, see note above) for the hardest long-horizon work if 2x output price is justified (95.0% Verified, 80.3% Pro, $10/$50). Sonnet 4.6 for everyday coding at $3/$15. Haiku 4.5 for high-volume review and subagent work at $1/$5. In multi-agent setups, a specialized search subagent like WarpGrep lifts whichever main model you pick.

What is Claude Fable 5 and how does it score?

Fable 5 is Anthropic's tier above Opus, GA June 9, 2026 (currently suspended as of June 12, 2026, see note above): 95.0% SWE-bench Verified, 80.3% SWE-bench Pro, 29.3% FrontierCode Diamond (vs 13.4% for Opus 4.8). $10/$50 per MTok, 1M context, 128k max output, adaptive thinking always on.

Do Claude models charge extra for long context?

No. Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 include the full 1M-token window at standard per-token pricing. GPT-5.5 also offers 1M, while Gemini 3.1 Pro raises rates above 200k tokens ($2/$12 to $4/$18). One cost caveat: the tokenizer on Opus 4.7+ can produce up to 35% more tokens for the same text than pre-4.7 models.

Why do Scale SEAL scores differ from Anthropic's scores?

Scaffolding. Scale runs every model through one standardized harness on SWE-bench Pro's public set; Anthropic runs its own agent scaffold. Opus-class models score in the high 60s on vendor scaffolds and low 50s on Scale's. Use SEAL numbers to compare models against each other, vendor numbers to track a single vendor's generation-over-generation progress.

Build with Claude + WarpGrep

WarpGrep lifted every model it was paired with on SWE-bench Pro, taking Opus 4.6 from 55.4% to 57.5% while cutting cost 15.6% and latency 28%. It runs in its own context window and issues 8 parallel tool calls per turn. $0 for 100k requests.