Claude Benchmarks (2026): Fable 5 Hits 95% SWE-bench Verified. Every Model, Score, API ID, and Price

Claude benchmark scores live in three places that disagree: Anthropic release notes, Scale's SEAL leaderboard, and third-party trackers. The disagreement is scaffolding, not error. This page consolidates every current score with the exact API model ID and price per row, and labels which harness produced each number.

Scores verified June 9, 2026

Master Table: Every Current Claude Model

Six Claude models were in active service as of June 9, 2026. Fable 5 went GA on June 9, 2026 as a new tier above Opus (currently suspended as of June 12, 2026, see note above). Opus 4.8 shipped May 28, 2026 at the same price as 4.7 and 4.6. Mythos 5 is the same underlying model as Fable 5 with safety classifiers lifted, restricted to approved Project Glasswing partners (also currently suspended).

95.0%

Fable 5 SWE-bench Verified

88.6%

Opus 4.8 SWE-bench Verified

69.2%

Opus 4.8 SWE-bench Pro

93.6%

Opus 4.8 GPQA Diamond

Model	API Model ID	SWE-bench Verified	SWE-bench Pro	Context / Max Output	$/MTok In / Out
Claude Fable 5	claude-fable-5	95.0%	80.3%	1M / 128k	$10 / $50
Claude Mythos 5	claude-mythos-5 (restricted)	93.9%*	77.8%*	1M / 128k	$10 / $50
Claude Opus 4.8	claude-opus-4-8	88.6%	69.2%	1M / 128k	$5 / $25
Claude Opus 4.7	claude-opus-4-7	87.6%	64.3%	1M / 128k	$5 / $25
Claude Opus 4.6	claude-opus-4-6	80.8%	51.9%†	1M / 128k	$5 / $25
Claude Sonnet 4.6	claude-sonnet-4-6	79.6%	n/a	1M / 64k	$3 / $15
Claude Haiku 4.5	claude-haiku-4-5	73.3%	39.5%†	200k / 64k	$1 / $5
Claude Opus 4.5 (legacy)	claude-opus-4-5-20251101	80.9%	45.9%†	200k / 64k	$5 / $25
Claude Sonnet 4.5 (legacy)	claude-sonnet-4-5-20250929	77.2%	43.6%†	200k / 64k	$3 / $15

SWE-bench Pro scores without a dagger are Anthropic-run. † = Scale SEAL leaderboard (standardized scaffolding), which runs well below vendor scaffolds. * = Claude Mythos Preview scores; Mythos 5 itself is not on public leaderboards. Verified scores from the llm-stats tracker (updated June 2026) and provider announcements.

SWE-bench Verified Scores

SWE-bench Verified contains 500 human-validated Python tasks from real GitHub repositories. Models are scored on the percentage of tasks resolved correctly. Frontier scores are self-reported with provider scaffolds, so treat differences under ~1 point as noise.

Rank	Model	Score	Provider
1	Claude Fable 5	95.0%	Anthropic
2	Claude Mythos Preview	93.9%	Anthropic
3	Claude Opus 4.8	88.6%	Anthropic
4	Claude Opus 4.7	87.6%	Anthropic
5	Claude Opus 4.5	80.9%	Anthropic
6	Claude Opus 4.6	80.8%	Anthropic
7	DeepSeek-V4-Pro-Max	80.6%	DeepSeek
8	Gemini 3.1 Pro	80.6%	Google
9	MiniMax M3	80.5%	MiniMax
10	Qwen3.7 Max	80.4%	Alibaba

Source: llm-stats SWE-bench Verified tracker, June 2026.

The top six entries are all Claude. The bigger story is the compression below them: four open-weights or non-US models (DeepSeek V4 Pro Max, Gemini 3.1 Pro, MiniMax M3, Qwen3.7 Max) sit within 0.5 points of each other at ~80.5%, roughly where Opus 4.5 and 4.6 were a generation ago. Verified has a known contamination history and the 80% band is saturated. The 88.6% to 95.0% range above it is where the benchmark still differentiates.

SWE-bench Pro Scores: Vendor vs Standardized

SWE-bench Pro contains 1,865 tasks across 41 professional repositories, split into public, commercial (private), and held-out sets. Two score families exist and they are not comparable: Anthropic's vendor-run numbers (own scaffold) and Scale's SEAL leaderboard (standardized scaffold, same harness for every model). Vendor scaffolds run 15 to 30 points higher.

Anthropic-Run Scores (Vendor Scaffold)

Model	Score
Claude Fable 5	80.3%
Claude Mythos Preview	77.8%
Claude Opus 4.8	69.2%
Claude Opus 4.7	64.3%
GPT-5.5	58.6%
Gemini 3.1 Pro	54.2%

Scale SEAL Leaderboard (Standardized Scaffold, Public Set)

Scale runs every model through the same harness, which isolates model capability from scaffold engineering. The newest Claude models (4.7, 4.8, Fable 5) are not yet listed; Opus 4.6 is the top Claude entry.

Rank	Model	Score	CI
1	GPT-5.4 (xHigh)	59.1%	±3.56
2	Muse Spark	55.0%	±3.60
3	Claude Opus 4.6 (thinking)	51.9%	±3.61
4	Gemini 3.1 Pro (thinking)	46.1%	±3.60
5	Claude Opus 4.5	45.9%	±3.60
6	Claude Sonnet 4.5	43.6%	±3.60
7	Gemini 3 Pro	43.3%	±3.60
8	Claude Sonnet 4	42.7%	±3.59
9	GPT-5 (High)	41.8%	±3.49
10	GPT-5.2 Codex	41.0%	±3.57
11	Claude Haiku 4.5	39.5%	±3.55
12	Qwen3 Coder 480B	38.7%	±3.55

On the harder private (commercial) set, the ordering flips: Claude Opus 4.6 (thinking) leads at 47.1%, ahead of Muse Spark (44.7%) and GPT-5.4 xHigh (43.4%), with Gemini 3.1 Pro at 32.2%. Claude degrades less than competitors when moving from public to unseen commercial repositories, which is the closer proxy for production codebases.

Sources: SEAL public leaderboard and private leaderboard, June 9, 2026.

Scaffolding moves scores more than model choice

The same Opus 4.6 scores 51.9% on Scale's standardized harness and materially higher on vendor scaffolds. Search and context retrieval are the usual difference. In Morph internal benchmarks, adding WarpGrep as a search subagent lifted Opus 4.6 from 55.4% to 57.5% on SWE-bench Pro while cutting cost 15.6% and latency 28%. WarpGrep runs in its own context window and issues up to 8 parallel tool calls per turn. Pricing: $0 for 100k requests, $1 per 1M on Pro.

Opus 4.8 vs Opus 4.7: What Changed

Opus 4.8 (released May 28, 2026) is a same-price upgrade over 4.7. Anthropic's release notes claim it is 4x less likely to let flaws pass in code review.

Benchmark	Opus 4.8	Opus 4.7
SWE-bench Pro	69.2%	64.3%
SWE-bench Verified	88.6%	87.6%
GPQA Diamond	93.6%	n/a
OSWorld	83.4%	82.3% (OSWorld-Verified)
GDPval-AA (Elo)	1890	n/a
Price ($/MTok in / out)	$5 / $25	$5 / $25
Fast mode ($/MTok in / out)	$10 / $50	$30 / $150
Training cutoff	Jan 2026	Jan 2026

The fast-mode repricing is the underreported change: Opus 4.8 fast mode (research preview, ~2.5x faster output) costs $10/$50 per MTok, a third of the $30/$150 fast-mode price on Opus 4.6 and 4.7. On GDPval-AA, Opus 4.8's 1890 Elo compares to 1769 for GPT-5.5. One tokenizer caveat applies to both 4.7 and 4.8: models from Opus 4.7 onward (including Fable 5) use a new tokenizer that can produce up to 35% more tokens for the same text than pre-4.7 models, so per-request cost comparisons against 4.6 need re-baselining, not just a price-table read.

Terminal-Bench Scores

Terminal-Bench tests whether a model can operate a live terminal: environment management, debugging, multi-step system operations. It is the benchmark where Claude has historically trailed OpenAI, and the gap is narrowing rather than closed.

Model	Version	Score
GPT-5.5	Terminal-Bench 2.1	78.2%
Claude Opus 4.8	Terminal-Bench 2.1	74.6%
Claude Opus 4.6	Terminal-Bench 2.0	65.4%

For the grounding query that lands here: Claude Opus 4.6 scored 65.4% on Terminal-Bench 2.0. Opus 4.8 scores 74.6% on Terminal-Bench 2.1, 3.6 points behind GPT-5.5's 78.2%. Versions 2.0 and 2.1 use different task sets, so the 65.4% and 74.6% figures are not directly comparable.

GPQA, OSWorld, and Agentic Benchmarks

Coding benchmarks measure patch quality. These measure the reasoning and computer-operation capability underneath it.

Benchmark	Score	What It Measures
GPQA Diamond	93.6%	Graduate-level science questions validated by domain experts
OSWorld	83.4%	Operating a real desktop OS through screenshots and actions
Online-Mind2Web	84%	Live web-browsing task completion
GDPval-AA	1890 Elo	Economically valuable knowledge work (GPT-5.5: 1769)

Anthropic's announcement footnotes add that Opus 4.8 is the first model to break 10% on a legal agent benchmark scored on an all-tasks-pass standard. Fable 5 extends the frontier further on Anthropic's hardest internal evals: 29.3% on FrontierCode Diamond versus 13.4% for Opus 4.8, and 29.8% on GDP.pdf vision tasks versus 24.9% for GPT-5.5. Mythos 5, the classifier-lifted variant, posts 78.0% capture on ExploitBench and 46.1% on BioMysteryBench, which is why it is gated to vetted Project Glasswing partners rather than self-serve.

Pricing per Million Tokens (June 2026)

Benchmark tables without prices hide the real decision. Full Anthropic API price list, with cache and batch rates:

Model	Input	Output	Cache Hit	Batch In / Out
Claude Fable 5 / Mythos 5	$10.00	$50.00	$1.00	$5.00 / $25.00
Claude Opus 4.8 / 4.7 / 4.6 / 4.5	$5.00	$25.00	$0.50	$2.50 / $12.50
Claude Sonnet 4.6	$3.00	$15.00	$0.30	$1.50 / $7.50
Claude Haiku 4.5	$1.00	$5.00	$0.10	$0.50 / $2.50
Claude Opus 4.1 / 4 (deprecated)	$15.00	$75.00	n/a	n/a

Modifiers that apply across models: 5-minute cache writes bill at 1.25x base input and 1-hour writes at 2x; cache reads at 0.1x. The Batch API is 50% off both input and output. Setting inference_geo: "us" (Opus 4.6+ and Sonnet 4.6+) adds a 1.1x multiplier on all token categories. The web search tool costs $10 per 1,000 searches. There is no long-context surcharge on the 1M-window models: Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 bill a 900k-token request at the same rate as a 9k one.

For context against competitors: GPT-5.5 is $5/$30 per MTok, gpt-5.4 is $2.50/$15, gpt-5.3-codex is $1.75/$14, and Gemini 3.1 Pro is $2/$12 for prompts up to 200k tokens ($4/$18 above). Opus 4.8 at $5/$25 sits between GPT-5.5 and gpt-5.4 on output price while leading both on SWE-bench Pro (69.2% vs 58.6% for GPT-5.5, Anthropic-run). See Anthropic API pricing for the full breakdown.

Claude API Model IDs and Versions (2026)

Exact strings to pass as model. From the 4.6 generation onward, IDs are dateless pinned snapshots; earlier models keep date suffixes.

Model	API ID	Context	Max Output	Training Cutoff
Claude Fable 5	claude-fable-5	1M	128k	n/a (GA Jun 9, 2026)
Claude Mythos 5	claude-mythos-5	1M	128k	Glasswing partners only
Claude Opus 4.8	claude-opus-4-8	1M (200k on MS Foundry)	128k	Jan 2026
Claude Opus 4.7	claude-opus-4-7	1M	128k	Jan 2026
Claude Opus 4.6	claude-opus-4-6	1M	128k	Aug 2025
Claude Sonnet 4.6	claude-sonnet-4-6	1M	64k	Jan 2026 (reliable: Aug 2025)
Claude Haiku 4.5	claude-haiku-4-5-20251001 (alias: claude-haiku-4-5)	200k	64k	Jul 2025
Claude Sonnet 4.5	claude-sonnet-4-5-20250929	200k	64k	n/a
Claude Opus 4.5	claude-opus-4-5-20251101	200k	64k	n/a

The Batch API supports 300k output tokens on Opus 4.6+ and Sonnet 4.6 via the beta header output-300k-2026-03-24. Fable 5 is available on the Claude API, Claude Platform on AWS, Bedrock (anthropic.claude-fable-5), Vertex AI, and Microsoft Foundry.

Deprecation and Retirement Dates

Model	API ID	Retires
Claude Sonnet 4	claude-sonnet-4-20250514	June 15, 2026
Claude Opus 4	claude-opus-4-20250514	June 15, 2026
Claude Opus 4.1	claude-opus-4-1-20250805	August 5, 2026
Claude Haiku 3.5	claude-3-5-haiku-20241022	Retired (except Bedrock/Vertex)

Which Claude Model Is Best for Coding?

Opus 4.8 is the default. It holds 88.6% SWE-bench Verified, 69.2% SWE-bench Pro, and 74.6% Terminal-Bench 2.1 at $5/$25 per MTok, half the output price of GPT-5.5 with a higher Pro score. Route around the default by task shape:

Task	Model	Why
Default coding agent	Opus 4.8	69.2% SWE-bench Pro at $5/$25; fast mode at $10/$50
Hardest long-horizon work	Fable 5 (currently suspended, see note)	80.3% Pro, 95.0% Verified; 2x output price ($50/MTok)
Everyday edits, drafts, chat	Sonnet 4.6	$3/$15 with the same 1M context window
Code review, tests, subagents	Haiku 4.5	$1/$5; 39.5% SEAL Pro beats GPT-5.2 Codex's private-set 27.7%
Migrations needing huge context	Opus 4.8 or Sonnet 4.6	1M tokens, no long-context surcharge

Fable 5's 11-point SWE-bench Pro lead over Opus 4.8 costs 2x per output token (currently suspended, see note above). When available, it pays off on tasks where a failed run wastes more than the token delta: large refactors, overnight autonomous runs, frontier-difficulty bugs. For multi-agent setups, pairing Opus 4.8 with a cheap search subagent beats upgrading the main model: that is the WarpGrep result above. Cross-vendor comparison lives at best AI model for coding.

Legacy Models: Claude 3.5 Sonnet and Earlier

Searches for Claude 3.5 Sonnet benchmarks still land here, so the official numbers, for the record: Claude 3.5 Sonnet (June 2024) scored 33.4% on SWE-bench Verified, 59.4% on GPQA Diamond, and 92.0% on HumanEval per Anthropic's launch announcement. The upgraded Claude 3.5 Sonnet (October 2024) raised SWE-bench Verified to 49.0%. Both were retired October 28, 2025; the recommended replacement is claude-sonnet-4-6.

Model	Release	SWE-bench Verified
Claude 3.5 Sonnet	Jun 2024	33.4%
Claude 3.5 Sonnet (upgraded)	Oct 2024	49.0%
Claude 3.7 Sonnet	Feb 2025	62.3%
Claude Sonnet 4.5	Sep 2025	77.2%
Claude Opus 4.5	Nov 2025	80.9%
Claude Opus 4.6	Feb 2026	80.8%
Claude Opus 4.7	Early 2026	87.6%
Claude Opus 4.8	May 2026	88.6%
Claude Fable 5	Jun 2026	95.0%

33.4% to 95.0% in two years. The 4.6 generation plateaued near 81% while Anthropic shipped 1M context and agentic tool use; 4.7 and 4.8 broke the plateau, and Fable 5 added another 6.4 points on top. The spread on SWE-bench Pro stays wide (80.3% to 39.5% across current Claude models alone), which is why Pro is the better differentiator at the frontier.

Frequently Asked Questions

What Claude models are available in 2026 (Opus, Sonnet, Haiku versions)?

Claude Fable 5 (claude-fable-5, GA June 9, 2026, currently suspended), Claude Mythos 5 (claude-mythos-5, Project Glasswing partners only, currently suspended), Claude Opus 4.8 (claude-opus-4-8, released May 28, 2026), Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5. Sonnet 4 and Opus 4 retire June 15, 2026; Opus 4.1 retires August 5, 2026.

What is the API model ID for Claude Sonnet 4.6?

claude-sonnet-4-6. No date suffix: from the 4.6 generation onward, Anthropic IDs are dateless pinned snapshots. 1M context, 64k max output, $3/$15 per MTok.

What is Claude Opus 4.8's SWE-bench Pro score?

69.2% in Anthropic's vendor-run evaluation, vs 64.3% for Opus 4.7, 58.6% for GPT-5.5, and 54.2% for Gemini 3.1 Pro. On Scale's standardized SEAL leaderboard, the top Claude entry is Opus 4.6 (thinking) at 51.9% public / 47.1% private; 4.7, 4.8, and Fable 5 are not yet listed there.

What is Claude Opus 4.6's Terminal-Bench score?

65.4% on Terminal-Bench 2.0. On the newer Terminal-Bench 2.1, Opus 4.8 scores 74.6% vs GPT-5.5's 78.2%. The two versions use different task sets and are not directly comparable.

What are Claude 3.5 Sonnet's official benchmark scores?

June 2024 release: 33.4% SWE-bench Verified, 59.4% GPQA Diamond, 92.0% HumanEval. October 2024 upgrade: 49.0% SWE-bench Verified. Retired October 28, 2025; replacement is claude-sonnet-4-6.

Which Claude model is best for coding?

Opus 4.8 by default (88.6% Verified, 69.2% Pro, $5/$25). Fable 5 (currently suspended, see note above) for the hardest long-horizon work if 2x output price is justified (95.0% Verified, 80.3% Pro, $10/$50). Sonnet 4.6 for everyday coding at $3/$15. Haiku 4.5 for high-volume review and subagent work at $1/$5. In multi-agent setups, a specialized search subagent like WarpGrep lifts whichever main model you pick.

What is Claude Fable 5 and how does it score?

Fable 5 is Anthropic's tier above Opus, GA June 9, 2026 (currently suspended as of June 12, 2026, see note above): 95.0% SWE-bench Verified, 80.3% SWE-bench Pro, 29.3% FrontierCode Diamond (vs 13.4% for Opus 4.8). $10/$50 per MTok, 1M context, 128k max output, adaptive thinking always on.

Do Claude models charge extra for long context?

No. Fable 5, Mythos 5, Opus 4.8/4.7/4.6, and Sonnet 4.6 include the full 1M-token window at standard per-token pricing. GPT-5.5 also offers 1M, while Gemini 3.1 Pro raises rates above 200k tokens ($2/$12 to $4/$18). One cost caveat: the tokenizer on Opus 4.7+ can produce up to 35% more tokens for the same text than pre-4.7 models.

Why do Scale SEAL scores differ from Anthropic's scores?

Scaffolding. Scale runs every model through one standardized harness on SWE-bench Pro's public set; Anthropic runs its own agent scaffold. Opus-class models score in the high 60s on vendor scaffolds and low 50s on Scale's. Use SEAL numbers to compare models against each other, vendor numbers to track a single vendor's generation-over-generation progress.

Build with Claude + WarpGrep

WarpGrep lifted every model it was paired with on SWE-bench Pro, taking Opus 4.6 from 55.4% to 57.5% while cutting cost 15.6% and latency 28%. It runs in its own context window and issues 8 parallel tool calls per turn. $0 for 100k requests.

Try WarpGrep

See SWE-bench Pro Scores

Fast Apply

WarpGrep

Compact

Model Router

DeepSeek

MiniMax

Qwen

Glance

Blog

Startup Credits

Students

Contact Us

About

Careers