Same brief. Same constraints. Multiple models. Real rendered output you can see, compare, and vote on. Not chat responses. Not benchmark scores. Actual websites.
Each model received identical inputs: client bio, services, testimonials, colour palette, typography, and 8 stock photos. Full creative freedom on layout, hierarchy, and composition.
> “Opus is worth 6x the cost for creative work. The gap between Opus and Sonnet is bigger than the gap between Sonnet and GPT-5.2.”
>
> — From our first multi-model comparison
Every metric tracked. Every token counted. No vibes-only benchmarking.
| Metric | Opus 4.6 | Sonnet 4.5 | GPT-5.2 |
|---|---|---|---|
| Cost per site | $6.32 | $1.11 | $0.72 |
| Generation time | 680s | 594s | 687s |
| Input tokens | 110K | 85K | 55K |
| Output tokens | 62K | 57K | 58K |
| Images used | 8 | 8 | 8 |
| Headings | 16 | 17 | 14 |
| ARIA landmarks | 11 | 13 | 15 |
| Links | 28 | 26 | 34 |
| Schema.org markup | Yes | Yes | Yes |
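The “6x” in the quote above comes straight from the cost row. A quick check of the ratios (figures taken from the table; the rounding is ours):

```typescript
// Cost per generated site, in USD, from the comparison table.
const costs = {
  opus: 6.32,
  sonnet: 1.11,
  gpt52: 0.72,
};

// Opus vs Sonnet — the "6x" headline figure.
const opusVsSonnet = costs.opus / costs.sonnet;   // ≈ 5.7
// Sonnet vs GPT-5.2, for comparison.
const sonnetVsGpt = costs.sonnet / costs.gpt52;   // ≈ 1.5

console.log(`Opus vs Sonnet: ${opusVsSonnet.toFixed(1)}x`);
console.log(`Sonnet vs GPT-5.2: ${sonnetVsGpt.toFixed(1)}x`);
```

So the price gap between the top two models is roughly four times larger than the gap between the bottom two, which is what the quote is pointing at.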
There are leaderboards for chat. Evals for RAG accuracy. Benchmarks for code. But nothing for rendered creative output. Nothing for the thing your users actually see.
Upload a sitespec or pick from our templates. Set your palette, typography, style, and brand feeling. The brief stays constant across every model.
Choose which models to test. Opus, Sonnet, GPT-5.2, Gemini. Each one gets the same constraints and full creative freedom over layout and composition.
Real rendered websites, deployed and viewable. Side-by-side comparison with cost, speed, tokens, and page metrics. Vote on the winner.
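The brief that stays constant across models can be pictured as a structured spec. A minimal sketch of what such a sitespec might contain — the field names and sample values here are illustrative assumptions, not Output Arena's actual schema:

```typescript
// Hypothetical sitespec shape — field names are illustrative,
// not Output Arena's actual format.
interface SiteSpec {
  clientBio: string;
  services: string[];
  testimonials: string[];
  palette: string[];                         // hex colours
  typography: { heading: string; body: string };
  brandFeeling: string;                      // e.g. "warm, grounded"
  photos: string[];                          // the 8 stock images
}

const brief: SiteSpec = {
  clientBio: "Family-run landscaping studio, est. 1998",
  services: ["Garden design", "Maintenance", "Hardscaping"],
  testimonials: ["They transformed our backyard completely."],
  palette: ["#2F4F2F", "#E8E4D8", "#C9A227"],
  typography: { heading: "Fraunces", body: "Inter" },
  brandFeeling: "warm, grounded, premium",
  photos: Array.from({ length: 8 }, (_, i) => `photo-${i + 1}.jpg`),
};

// The same brief object is sent to every model, unchanged.
console.log(brief.photos.length); // 8
```

Because the spec is identical for every model, any difference in the rendered sites comes from the model, not the inputs.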
Output Arena is launching soon. Get early access to the platform and our research feed.