The First Benchmark for Built Output

Which model builds better?

Same brief. Same constraints. Multiple models. Real rendered output you can see, compare, and vote on. Not chat responses. Not benchmark scores. Actual websites.

See the Comparison ↓ · Read the Research
3 Models Tested · $8.15 Total Cost · 18 Pages Generated · 0 Templates Used

Three models. One brief.
Judge for yourself.

Each model received identical inputs: client bio, services, testimonials, colour palette, typography, and 8 stock photos. Full creative freedom on layout, hierarchy, and composition.

Website generated by Claude Opus 4.6
Opus 4.6 · Anthropic · ◆ Best Design
Cost $6.32 · Time 680s · Tokens 172K

Website generated by Claude Sonnet 4.5
Sonnet 4.5 · Anthropic · Best Value
Cost $1.11 · Time 594s · Tokens 142K

Website generated by GPT-5.2
GPT-5.2 · OpenAI · Best Copy
Cost $0.72 · Time 687s · Tokens 113K
Output Arena Research · Feb 2026
“Opus is worth 6x the cost for creative work. The gap between Opus and Sonnet is bigger than the gap between Sonnet and GPT-5.2.”
From our first multi-model comparison

The numbers behind the builds

Every metric tracked. Every token counted. No vibes-only benchmarking.

Metric            Opus 4.6   Sonnet 4.5   GPT-5.2
Cost per site     $6.32      $1.11        $0.72
Generation time   680s       594s         687s
Input tokens      110K       85K          55K
Output tokens     62K        57K          58K
Images used       8          8            8
Headings          16         17           14
ARIA landmarks    11         13           15
Links             28         26           34
Schema.org        Yes        Yes          Yes
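Page metrics like these (headings, links, ARIA landmarks, images) can be counted mechanically from the rendered HTML. A minimal sketch using Python's standard-library parser; the `MetricCounter` class and the landmark heuristic are our own illustration, not Output Arena's actual pipeline:

```python
from html.parser import HTMLParser

# Explicit landmark roles plus tags that imply a landmark by default.
# This heuristic is an assumption for illustration only.
LANDMARK_ROLES = {"banner", "navigation", "main", "contentinfo",
                  "complementary", "search", "region", "form"}
LANDMARK_TAGS = {"nav", "main", "header", "footer", "aside"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class MetricCounter(HTMLParser):
    """Count simple per-page metrics while streaming through the HTML."""
    def __init__(self):
        super().__init__()
        self.counts = {"headings": 0, "links": 0, "landmarks": 0, "images": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADINGS:
            self.counts["headings"] += 1
        elif tag == "a":
            self.counts["links"] += 1
        elif tag == "img":
            self.counts["images"] += 1
        # Count each element as at most one landmark, whether it carries an
        # explicit role attribute or is an implicit landmark tag.
        if attrs.get("role") in LANDMARK_ROLES or tag in LANDMARK_TAGS:
            self.counts["landmarks"] += 1

def count_metrics(html: str) -> dict:
    parser = MetricCounter()
    parser.feed(html)
    return parser.counts
```

Run against each generated page, this yields the raw counts behind a table like the one above.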

Benchmarks exist for everything except what gets built

There are leaderboards for chat. Evals for RAG accuracy. Benchmarks for code. But nothing for rendered creative output. Nothing for the thing your users actually see.

  • Chat responses — Chatbot Arena
  • Creative writing — EQ-Bench
  • Code generation — SWE-bench
  • RAG accuracy — DeepEval, RAGAS
  • Reasoning — GPQA, FrontierMath
  • Rendered creative output — Nothing. Until now.

Define. Generate. Compare.

01

Define your brief

Upload a sitespec or pick from our templates. Set your palette, typography, style, and brand feeling. The brief stays constant across every model.

02

Pick your models

Choose which models to test. Opus, Sonnet, GPT-5.2, Gemini. Each one gets the same constraints and full creative freedom over layout and composition.

03

See the output

Real rendered websites, deployed and viewable. Side-by-side comparison with cost, speed, tokens, and page metrics. Vote on the winner.
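The three steps above amount to a small harness: hold the brief constant, loop over the chosen models, and record cost, time, and token metrics alongside each rendered site. A hedged sketch; the `Brief` fields and the `generate()` callback are hypothetical stand-ins, since Output Arena's internals are not public:

```python
from dataclasses import dataclass, field

@dataclass
class Brief:
    """Constant inputs shared by every model (fields are illustrative)."""
    palette: list
    typography: str
    style: str
    photos: list = field(default_factory=list)

def run_comparison(brief, models, generate):
    """Send the identical brief to each model and collect its output.

    `generate` is a caller-supplied function returning
    (rendered_site, cost_usd, seconds, tokens) for one model.
    """
    results = {}
    for model in models:
        site, cost, seconds, tokens = generate(model, brief)
        results[model] = {"site": site, "cost": cost,
                          "time_s": seconds, "tokens": tokens}
    return results
```

Because the brief object is built once and passed unchanged to every model, no model sees different constraints, which is the whole point of the comparison.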

Stop guessing.
Start benchmarking.

Output Arena is launching soon. Get early access to the platform and our research feed.