A HELM-style benchmark for AI product judgment

Can an AI actually take a product to PMF?

Most LLM benchmarks grade one-shot answers. Real product management is a multi-turn game of focus, sequencing and capital under uncertainty. Specbench drops each LLM and agent harness into simulated startups and measures whether it can steer them to durable product-market fit — not whether it can write a nice PRD.


Top system: Disciplined PM Harness

Best PMF rate: 86%

Scenarios

Seeds / cell

Playable benchmark

Play the benchmark yourself

Two modes: replay real pivotal decisions from Slack, Airbnb, Stripe and 38 others — or run your own startup quarter by quarter and try to reach PMF before the runway runs out.

Step into the founder’s chair

You’ll face a series of real, pivotal decisions from the world’s top startups — Slack, Airbnb, Stripe, Netflix and more — with the outcome hidden, exactly as the founders saw them. Every call moves your trajectory-to-PMF meter. Make the wrong ones and you run out of runway. How far can you get?

Choose your run

49 real decision points in the deck. Every run — and the order of every answer — is shuffled, so there’s no pattern to game. You start under-funded, and a couple of wrong calls end the run.

Finished a run? Sign up to claim your rank on the player leaderboard below.

Player leaderboard


Leaderboard

Generated Jun 15, 2026

#	System	Type	Index	PMF rate	Survival	Decision quality	Avg. quarters to PMF
1	Disciplined PM Harness	Agent harness	68.3	86%	100%	7.1/10	11.1
2	Random	Baseline	17.8	0%	3%	5.9/10	n/a
3	Naive Growth-Chaser	Baseline	17.2	0%	16%	4.8/10	n/a

Disciplined PM Harness: Textbook PM strategy: focus, sequence build→sell, price to WTP, scale only after fit.

Random: Uniformly random focus, allocation and pricing. The control floor.

Naive Growth-Chaser: Chases the largest market, over-invests in GTM, overprices, and lacks focus discipline.

The Specbench Index (0–100) blends PMF attainment (40%), survival (15%), decision quality (20%), speed-to-fit (15%) and capital efficiency (10%). See the methodology.
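The blend above can be sketched as a plain weighted sum. This is a minimal illustration, not the benchmark's actual scoring code: the weights come from the text, but the function name, the component keys, and the assumption that every component is already normalized to a 0-100 scale are ours, and the example inputs are made up.

```python
# Weights as stated in the text; they sum to 1.0.
WEIGHTS = {
    "pmf_attainment": 0.40,
    "survival": 0.15,
    "decision_quality": 0.20,
    "speed_to_fit": 0.15,
    "capital_efficiency": 0.10,
}


def specbench_index(components: dict[str, float]) -> float:
    """Blend component scores into one 0-100 index.

    Assumption: each component is already normalized to 0-100.
    How speed-to-fit and capital efficiency are normalized is not
    described in the text, so that step is left to the caller.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be in [0, 100]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {
    "pmf_attainment": 50.0,
    "survival": 80.0,
    "decision_quality": 60.0,
    "speed_to_fit": 40.0,
    "capital_efficiency": 30.0,
}
print(round(specbench_index(example), 1))  # 53.0
```

Because PMF attainment carries 40% of the weight, a system that never reaches fit cannot score above 60 under this reading, however well it does elsewhere.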

Run your own agent against this benchmark

POST to our public API — your LLM or harness plays the simulation turn by turn and appears on a community leaderboard.

API docs

Where each system breaks

PMF attainment rate per system × scenario. Skill is not uniform: a system can nail a focused SaaS niche and still fail to crack consumer retention. Green = reaches fit reliably, red = never does.

System	SaaS: Find the Niche	Dev Tool: The Wedge	Consumer: Retention or Bust	Turnaround: 4 Quarters of Runway	Marketplace: Cold Start	Motion: PLG vs Sales-led	Regulated B2B: The Moat
Disciplined PM Harness	100%	100%	0%	100%	100%	100%	100%
Random	0%	0%	0%	0%	0%	0%	0%
Naive Growth-Chaser	0%	0%	0%	0%	0%	0%	0%
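
The per-scenario table can be rolled up into a single PMF rate and a list of weak spots. The sketch below copies the numbers from the table and assumes every scenario counts equally (same number of seeds per cell), which the page does not state. Under that assumption, the Disciplined PM Harness row averages to the 86% shown on the leaderboard. The function names are illustrative.

```python
SCENARIOS = [
    "SaaS: Find the Niche",
    "Dev Tool: The Wedge",
    "Consumer: Retention or Bust",
    "Turnaround: 4 Quarters of Runway",
    "Marketplace: Cold Start",
    "Motion: PLG vs Sales-led",
    "Regulated B2B: The Moat",
]

# PMF attainment (%) per scenario, in the column order above.
PMF_BY_SCENARIO = {
    "Disciplined PM Harness": [100, 100, 0, 100, 100, 100, 100],
    "Random": [0, 0, 0, 0, 0, 0, 0],
    "Naive Growth-Chaser": [0, 0, 0, 0, 0, 0, 0],
}


def overall_pmf_rate(system: str) -> float:
    """Unweighted mean across scenarios (assumes equal seeds per cell)."""
    rates = PMF_BY_SCENARIO[system]
    return sum(rates) / len(rates)


def weakest_scenarios(system: str) -> list[str]:
    """Scenarios where the system's PMF rate is at its minimum."""
    rates = PMF_BY_SCENARIO[system]
    lowest = min(rates)
    return [name for name, rate in zip(SCENARIOS, rates) if rate == lowest]


print(round(overall_pmf_rate("Disciplined PM Harness")))  # 86
print(weakest_scenarios("Disciplined PM Harness"))
```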

Two ways we measure product judgment

One benchmark, two complementary tests — synthetic stress-testing and a backtest against real history.

Simulator

Can it reach PMF from scratch?

Drop the system into 7 invented startups and run them quarter by quarter under uncertainty, competition and a finite balance sheet. This test measures whether the system can build fit, and it produces the leaderboard above.

See the scenarios
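
To make the quarter-by-quarter idea concrete, here is a toy loop in the same spirit. It is not the Specbench simulator: the state fields, the noise model, the starting cash, the 16-quarter cap and the 0.8 fit threshold are all invented for illustration. What it does show is the general shape of such a test: a policy acts each quarter, cash runs down, progress is noisy, and a fixed seed makes every run reproducible.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class State:
    quarter: int
    cash: float  # remaining runway, arbitrary units (hypothetical)
    fit: float   # 0..1 proxy for product-market fit (hypothetical)


# A policy sees the current state and returns how much to spend this quarter.
Policy = Callable[[State], float]


def run_startup(policy: Policy, seed: int, cash: float = 10.0,
                max_quarters: int = 16, pmf_threshold: float = 0.8) -> dict:
    """Play one toy startup; the same seed always yields the same result."""
    rng = random.Random(seed)
    state = State(quarter=0, cash=cash, fit=0.0)
    while state.quarter < max_quarters and state.cash > 0:
        # Clamp the request so the policy cannot overspend or go negative.
        spend = min(max(policy(state), 0.0), state.cash)
        # Spending helps, but the payoff is uncertain.
        gain = spend * rng.uniform(0.0, 0.2)
        state = State(state.quarter + 1, state.cash - spend,
                      min(1.0, state.fit + gain))
        if state.fit >= pmf_threshold:
            return {"pmf": True, "survived": True, "quarters": state.quarter}
    return {"pmf": False, "survived": state.cash > 0, "quarters": state.quarter}


def pmf_rate(policy: Policy, seeds: Iterable[int]) -> float:
    """Share of seeded runs in which the policy reached the fit threshold."""
    results = [run_startup(policy, seed) for seed in seeds]
    return sum(r["pmf"] for r in results) / len(results)


def steady(state: State) -> float:
    """Example policy: spend one unit of cash every quarter."""
    return 1.0


print(pmf_rate(steady, range(20)))
```

Averaging over many seeds per scenario is what turns a single noisy run into a rate that can be compared across systems.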

Decision Replay

Would it have chosen what winners chose?

Replay real pivotal decisions from famous startups (Slack, Airbnb, Stripe, Instagram, Netflix): give the AI the same situation with the outcome hidden, and check whether it picks the move history rewarded. The answer key is what actually happened.

Open Decision Replay
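
A replay score of this kind reduces to an agreement rate. The sketch below assumes a simple data shape (a situation, a set of options, and the option history rewarded) and shuffles the options before each question so that answer position carries no signal. The class, the function names, and the placeholder decisions are hypothetical, not the benchmark's real deck or API.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Decision:
    situation: str
    options: tuple[str, ...]
    rewarded: str  # the move history rewarded; never shown to the chooser


# A chooser sees the situation and the shuffled options, and returns one option.
Chooser = Callable[[str, Sequence[str]], str]


def replay_agreement(decisions: Sequence[Decision], choose: Chooser,
                     seed: int = 0) -> float:
    """Fraction of decisions where the chooser picked the rewarded move."""
    if not decisions:
        raise ValueError("need at least one decision")
    rng = random.Random(seed)
    hits = 0
    for decision in decisions:
        options = list(decision.options)
        rng.shuffle(options)  # option order changes with the seed
        pick = choose(decision.situation, options)
        if pick not in options:
            raise ValueError(f"chooser returned an unknown option: {pick!r}")
        hits += pick == decision.rewarded
    return hits / len(decisions)


# Placeholder deck, not real startup history.
deck = [
    Decision("Placeholder situation 1", ("A", "B", "C"), rewarded="A"),
    Decision("Placeholder situation 2", ("A", "B", "C"), rewarded="B"),
]


def always_a(situation: str, options: Sequence[str]) -> str:
    """Trivial chooser that ignores the situation and always answers A."""
    return "A"


print(replay_agreement(deck, always_a))  # 0.5
```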

Why we built a benchmark, not a brochure

We make a product that uses AI to help product teams. That only matters if AI is genuinely good at product judgment — so we measure it, adversarially and reproducibly, including against our own systems. Every number here is generated by a deterministic, open simulator you can rerun. The same harness that runs the leaderboard powers regression tests on the simulation itself.

Read the methodology