The First Benchmark for Built Output

Which model builds better?

Same brief. Same constraints. Multiple models. Real rendered output you can see, compare, and vote on. Not chat responses. Not benchmark scores. Actual websites.

See the Comparison ↓ · Read the Research
3 Models Tested · $8.15 Total Cost · 18 Pages Generated · 0 Templates Used

Three models. One brief.
Judge for yourself.

Each model received identical inputs: client bio, services, testimonials, colour palette, typography, and 8 stock photos. Full creative freedom on layout, hierarchy, and composition.

Website generated by Claude Opus 4.6
Opus 4.6 · Anthropic · ◆ Best Design
Cost $6.32 · Time 680s · Tokens 172K

Website generated by Claude Sonnet 4.5
Sonnet 4.5 · Anthropic · Best Value
Cost $1.11 · Time 594s · Tokens 142K

Website generated by GPT-5.2
GPT-5.2 · OpenAI · Best Copy
Cost $0.72 · Time 687s · Tokens 113K
Output Arena Research · Feb 2026
“Opus is worth 6x the cost for creative work. The gap between Opus and Sonnet is bigger than the gap between Sonnet and GPT-5.2.”
From our first multi-model comparison

The numbers behind the builds

Every metric tracked. Every token counted. No vibes-only benchmarking.

Metric            Opus 4.6   Sonnet 4.5   GPT-5.2
Cost per site     $6.32      $1.11        $0.72
Generation time   680s       594s         687s
Input tokens      110K       85K          55K
Output tokens     62K        57K          58K
Images used       8          8            8
Headings          16         17           14
ARIA landmarks    11         13           15
Links             28         26           34
Schema.org        Yes        Yes          Yes
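Page metrics like these (headings, links, ARIA landmarks, images) can be counted mechanically from the rendered HTML. A minimal sketch using Python's standard-library parser; the `MetricCounter` class and the landmark heuristic are our own illustration, not Output Arena's actual pipeline:

```python
from html.parser import HTMLParser

# Explicit landmark roles plus tags that imply a landmark by default.
# This heuristic is an assumption for illustration only.
LANDMARK_ROLES = {"banner", "navigation", "main", "contentinfo",
                  "complementary", "search", "region", "form"}
LANDMARK_TAGS = {"nav", "main", "header", "footer", "aside"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class MetricCounter(HTMLParser):
    """Count simple per-page metrics while streaming through the HTML."""
    def __init__(self):
        super().__init__()
        self.counts = {"headings": 0, "links": 0, "landmarks": 0, "images": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADINGS:
            self.counts["headings"] += 1
        elif tag == "a":
            self.counts["links"] += 1
        elif tag == "img":
            self.counts["images"] += 1
        # Count each element as at most one landmark, whether it carries an
        # explicit role attribute or is an implicit landmark tag.
        if attrs.get("role") in LANDMARK_ROLES or tag in LANDMARK_TAGS:
            self.counts["landmarks"] += 1

def count_metrics(html: str) -> dict:
    parser = MetricCounter()
    parser.feed(html)
    return parser.counts
```

Run against each generated page, this yields the raw counts behind a table like the one above.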

Benchmarks exist for everything except what gets built

There are leaderboards for chat. Evals for RAG accuracy. Benchmarks for code. But nothing for rendered creative output. Nothing for the thing your users actually see.

  • Chat responses — Chatbot Arena
  • Creative writing — EQ-Bench
  • Code generation — SWE-bench
  • RAG accuracy — DeepEval, RAGAS
  • Reasoning — GPQA, FrontierMath
  • Rendered creative output — Nothing. Until now.

Define. Generate. Compare.

01

Define your brief

Upload a sitespec or pick from our templates. Set your palette, typography, style, and brand feeling. The brief stays constant across every model.

02

Pick your models

Choose which models to test. Opus, Sonnet, GPT-5.2, Gemini. Each one gets the same constraints and full creative freedom over layout and composition.

03

See the output

Real rendered websites, deployed and viewable. Side-by-side comparison with cost, speed, tokens, and page metrics. Vote on the winner.
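The three steps above amount to a small harness: hold the brief constant, loop over the chosen models, and record cost, time, and token metrics alongside each rendered site. A hedged sketch; the `Brief` fields and the `generate()` callback are hypothetical stand-ins, since Output Arena's internals are not public:

```python
from dataclasses import dataclass, field

@dataclass
class Brief:
    """Constant inputs shared by every model (fields are illustrative)."""
    palette: list
    typography: str
    style: str
    photos: list = field(default_factory=list)

def run_comparison(brief, models, generate):
    """Send the identical brief to each model and collect its output.

    `generate` is a caller-supplied function returning
    (rendered_site, cost_usd, seconds, tokens) for one model.
    """
    results = {}
    for model in models:
        site, cost, seconds, tokens = generate(model, brief)
        results[model] = {"site": site, "cost": cost,
                          "time_s": seconds, "tokens": tokens}
    return results
```

Because the brief object is built once and passed unchanged to every model, no model sees different constraints, which is the whole point of the comparison.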

Stop guessing.
Start benchmarking.

Output Arena is launching soon. Get early access to the platform and our research feed.