I’ve been building a product called BirthBuild. It generates complete websites for doulas and birth workers from a conversation. Pick your colours, describe your vibe, the AI builds your site.

But which AI?

That’s the question nobody has good data on. Not benchmark scores. Not chat responses. Actual rendered output: real websites you can click around and judge with your own eyes.

So I built a testing harness. Same client brief. Same colour palette. Same typography. Same photos. Three models. Let them cook.

The setup

The brief is for a fictional doula called Dina Hart. She’s based in Brighton, offers birth doula support, postnatal care, and hypnobirthing courses. The spec includes her bio, services, testimonials, and contact details. Everything a real client would provide through the BirthBuild onboarding flow.

Each model gets identical constraints: the same brief, the same colour palette, the same typography, the same photos.

The models generate a complete design system (CSS custom properties, component styles, responsive breakpoints) and then six pages of semantic HTML using that design system. No templates. No component library. Pure LLM output.
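That two-stage pipeline (design system first, then every page generated against it) can be sketched roughly as follows. The function names, prompts, and page list here are my assumptions for illustration, not BirthBuild’s actual code:

```python
# Illustrative sketch of the two-stage pipeline. Names, prompts, and the
# page list are assumptions, not BirthBuild's actual implementation.
PAGES = ["home", "about", "services", "courses", "testimonials", "contact"]

def generate_site(call_model, brief: str) -> dict:
    """Stage 1: generate a design system. Stage 2: generate each page
    against that design system, so all six pages share one visual language."""
    design_system = call_model(
        "Create a design system (CSS custom properties, component styles, "
        f"responsive breakpoints) for this brief:\n{brief}"
    )
    pages = {
        name: call_model(
            f"Write semantic HTML for the {name} page using this design "
            f"system:\n{design_system}"
        )
        for name in PAGES
    }
    return {"design_system": design_system, "pages": pages}
```

The point of the split is consistency: the design system is generated once, then fed back into every page prompt, rather than letting each page invent its own styles.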

The contenders

Claude Opus 4.6 (Anthropic’s flagship)
Claude Sonnet 4.5 (Anthropic’s workhorse)
GPT-5.2 (OpenAI’s latest)

I also tried GPT-5.2 Pro. It errored out immediately. Turns out it’s a reasoning model that only works through OpenAI’s Responses API. Not a chat model. Not relevant here.

The numbers

              Opus 4.6    Sonnet 4.5    GPT-5.2
Cost          $6.32       $1.11         $0.72
Time          680s        594s          687s
Input tokens  110K        85K           55K
Output tokens 62K         57K           58K
Images        8           8             8
Headings      16          17            14
Landmarks     11          13            15
Links         28          26            34
Schema.org    Yes         Yes           Yes
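For context, the structural counts in that table (headings, landmarks, links, images) can be collected with a small HTML parser. This sketch is my approximation of how such metrics are gathered, not the harness’s actual code, and the landmark tag set is a simplification (real landmark detection also considers `role` attributes):

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Simplified: elements that map to implicit ARIA landmark roles.
LANDMARKS = {"header", "nav", "main", "aside", "footer", "section"}

class PageMetrics(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = {"headings": 0, "landmarks": 0, "links": 0, "images": 0}

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.counts["headings"] += 1
        elif tag in LANDMARKS:
            self.counts["landmarks"] += 1
        elif tag == "a":
            self.counts["links"] += 1
        elif tag == "img":
            self.counts["images"] += 1

def measure(html: str) -> dict:
    parser = PageMetrics()
    parser.feed(html)
    return parser.counts
```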

All three sites are structurally sound. All three pass basic accessibility checks. All three include Schema.org markup. The metrics don’t tell you much about which one is actually better. You need to look at them.

So let’s look at them

Opus 4.6 ($6.32)

Website generated by Opus 4.6

Opus builds the most polished site by a clear margin. The hero section uses a large background image with the headline “Your birth, your way” set in a warm serif. There’s real visual hierarchy. The about section uses an asymmetric layout with Dina’s portrait breaking out of the grid. Services are presented as distinct cards with enough breathing room. The testimonials section has a featured quote with a photo gallery. The footer has a proper CTA block.

The whole thing feels like a site a good designer built. Not a site an AI generated. That’s the difference.

Sonnet 4.5 ($1.11)

Website generated by Sonnet 4.5

Sonnet produces a clean, competent site. The hero has a green overlay on the background image with “Calm, Experienced Birth Support” as the headline. It works. But it’s safe. The layout is more conventional. Sections stack predictably. There’s less visual tension, fewer moments where your eye gets pulled somewhere interesting.

The text contrast on the hero is slightly off. Some of the body copy is hard to read against the background image. This is the kind of thing that happens when the model doesn’t have a strong sense of how layered text interacts with photography.
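This is also the one failure mode you can catch mechanically. WCAG defines a contrast ratio between two colours (4.5:1 is the AA threshold for body text), and it’s a few lines to compute. A minimal sketch of the standard formula, which a harness could run against sampled hero text and background colours:

```python
def _linearise(channel: int) -> float:
    """sRGB channel (0-255) to linear value, per the WCAG definition."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb) -> float:
    r, g, b = (_linearise(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05)."""
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)
```

Black on white scores the maximum 21:1; mid-grey (#777) on white lands around 4.5:1, right at the AA boundary, which is exactly the territory where layered hero text tends to fail. Text over a photo is harder still, since the effective background varies pixel by pixel.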

It’s a perfectly shippable site. You wouldn’t be embarrassed by it. But put it next to Opus and you can feel the gap.

GPT-5.2 ($0.72)

Website generated by GPT-5.2

GPT-5.2 takes a different approach to the header. Dina’s name and a brief description sit in the nav bar alongside a “Check availability” CTA button. It’s a smart choice. The hero headline “Steady, calm support for your birth and the weeks that follow” is arguably the best copy of the three. Warmer. More specific. More human.

The layout is clean and the information architecture is solid. Three CTAs in the hero (check availability, explore services, view gallery) give the visitor clear paths. The overall design is professional but less visually distinctive than Opus.

GPT-5.2’s real strength here is copywriting. The tone is warmer and more natural than both Claude models. If you’re building a site where the words matter more than the visual polish, this is worth noting.

What I actually learned

Opus is worth 6x the cost for creative work. Not for every task. But when you’re asking an LLM to make design decisions, compose layouts, create visual hierarchy, the premium model produces dramatically better output. The gap between Opus and Sonnet is bigger than the gap between Sonnet and GPT-5.2.

GPT-5.2 writes better copy. The tone is more natural, the headlines are more specific, the body text reads like a human wrote it. Both Claude models tend toward slightly more formal, slightly more generic language.

Sonnet is the production workhorse. At $1.11 per site, the economics are completely different. If you’re generating hundreds of sites, you can’t afford Opus. Sonnet gives you 80% of the quality at 18% of the cost. That’s the real tradeoff.

Structural quality is a wash. All three models produce valid semantic HTML, proper heading hierarchy, accessible landmarks, and Schema.org markup. The technical floor has risen to the point where structural compliance isn’t a differentiator. The differentiator is taste.

Nobody is benchmarking this. There are leaderboards for chat responses (Chatbot Arena), creative writing (EQ-Bench), and coding (SWE-bench). There is nothing for rendered creative output. No benchmark for “which model builds the better website.” The existing eval tools (DeepEval, LangWatch, Langfuse) test text accuracy, hallucination rates, and RAG quality. None of them can tell you whether a generated landing page looks good.

The test harness

I built the tool that produced this comparison. Same brief, controlled variables, multiple models, deployed output you can actually visit. Cost tracking, token counts, generation time, page metrics, side-by-side comparison.
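The per-run record behind a comparison like this is small. A sketch of the shape, populated with the numbers from the table above (the field names and helper functions are mine, not the harness’s API):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    model: str
    cost_usd: float
    seconds: int
    input_tokens: int
    output_tokens: int

RUNS = [
    RunResult("Opus 4.6", 6.32, 680, 110_000, 62_000),
    RunResult("Sonnet 4.5", 1.11, 594, 85_000, 57_000),
    RunResult("GPT-5.2", 0.72, 687, 55_000, 58_000),
]

def cheapest(runs) -> str:
    return min(runs, key=lambda r: r.cost_usd).model

def cost_ratio(runs, a: str, b: str) -> float:
    """Cost of model a as a fraction of model b's cost."""
    by_model = {r.model: r.cost_usd for r in runs}
    return by_model[a] / by_model[b]
```

Run over these numbers, `cost_ratio(RUNS, "Sonnet 4.5", "Opus 4.6")` comes out just under 0.18, which is where the “18% of the cost” figure above comes from.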

It started as internal tooling for BirthBuild. But the more I used it, the more obvious it became that this is a tool other people need. Anyone building a generation pipeline faces the same question: which model, which prompt version, did the update break my output?

I’m building it into something others can use. More on that soon.

Try it yourself

The three sites are live. Same brief, same constraints, different models. Judge for yourself.

Which one would you hire for your birth? Which one would you ship to a client?