Can an AI actually take a product to PMF?
Most LLM benchmarks grade one-shot answers. Real product management is a multi-turn game of focus, sequencing and capital under uncertainty. Specbench drops each LLM and agent harness into simulated startups and measures whether it can steer them to durable product-market fit — not whether it can write a nice PRD.
Play the benchmark yourself
Two modes: replay real pivotal decisions from Slack, Airbnb, Stripe and 38 others — or run your own startup quarter by quarter and try to reach PMF before the runway runs out.
Step into the founder’s chair
You’ll face a series of real, pivotal decisions from the world’s top startups — Slack, Airbnb, Stripe, Netflix and more — with the outcome hidden, exactly as the founders saw them. Every call moves your trajectory-to-PMF meter. Make the wrong ones and you run out of runway. How far can you get?
49 real decision points in the deck. Every run — and the order of every answer — is shuffled, so there’s no pattern to game. You start under-funded, and a couple of wrong calls end the run.
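The double shuffle described above (deck order and answer order) can be sketched as follows. This is a minimal illustration, not the site's actual code: the `Decision` type, the `shuffled_run` function and the per-run `seed` are all hypothetical names, and the use of a seeded generator is an assumption made so that a given run can be reproduced.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical shape of one decision point in the deck."""
    company: str
    options: tuple[str, ...]


def shuffled_run(deck, seed):
    """Return a new run: decisions in shuffled order, each with shuffled options.

    A private random.Random(seed) keeps the shuffle reproducible for a given
    seed without touching the global random state.
    """
    rng = random.Random(seed)
    order = list(deck)
    rng.shuffle(order)  # shuffle which decision comes first
    return [
        # rng.sample with k=len(...) returns a shuffled copy of the options
        Decision(d.company, tuple(rng.sample(d.options, k=len(d.options))))
        for d in order
    ]
```

Two runs with different seeds will generally see the decisions and the answer positions in different orders, while the same seed replays the same run.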
Finished a run? Sign up to claim your rank on the player leaderboard below.
Player leaderboard
Leaderboard
Generated Jun 15, 2026
Disciplined PM Harness. Textbook PM strategy: focus, sequence build→sell, price to willingness to pay (WTP), scale only after fit.
Random. Uniformly random focus, allocation and pricing. The control floor.
Naive Growth-Chaser. Chases the largest market, over-invests in GTM, overprices, and lacks focus discipline.
The Specbench Index (0–100) blends PMF attainment (40%), survival (15%), decision quality (20%), speed-to-fit (15%) and capital efficiency (10%). See the methodology.
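As a rough illustration of how those weights combine, here is a sketch that assumes the blend is a plain weighted sum and that each component has already been normalized to the range 0 to 1. Both are assumptions (the methodology page is the authority), and the key names and `specbench_index` function are hypothetical; only the five weights come from the text above.

```python
# Weights as stated in the text; they sum to 100%.
WEIGHTS = {
    "pmf_attainment": 0.40,
    "survival": 0.15,
    "decision_quality": 0.20,
    "speed_to_fit": 0.15,
    "capital_efficiency": 0.10,
}


def specbench_index(components):
    """Blend normalized component scores (each in [0, 1]) into a 0-100 index.

    Assumes a linear weighted sum; the real methodology may differ.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Under these assumptions, a system that always reaches PMF but scores zero on everything else would land at about 40, which shows how heavily attainment dominates the blend.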
POST to our public API — your LLM or harness plays the simulation turn by turn and appears on a community leaderboard.
Where each system breaks
PMF attainment rate per system × scenario. Skill is not uniform: a system can nail a focused SaaS niche and still fail to crack consumer retention. Green = reaches fit reliably, red = never does.
| System | SaaS: Find the Niche | Dev Tool: The Wedge | Consumer: Retention or Bust | Turnaround: 4 Quarters of Runway | Marketplace: Cold Start | Motion: PLG vs Sales-led | Regulated B2B: The Moat |
|---|---|---|---|---|---|---|---|
| Disciplined PM Harness | 100% | 100% | 0% | 100% | 100% | 100% | 100% |
| Random | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Naive Growth-Chaser | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
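
Each cell in a table like the one above is an attainment rate: the share of runs for one system in one scenario that reached PMF. A minimal sketch of that aggregation follows, assuming each run is recorded as a `(system, scenario, reached_pmf)` tuple. The record shape and the function name are hypothetical, not the simulator's actual output format.

```python
from collections import defaultdict


def attainment_rates(runs):
    """Map (system, scenario) -> fraction of runs that reached PMF.

    `runs` is an iterable of (system, scenario, reached_pmf) tuples.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for system, scenario, reached_pmf in runs:
        key = (system, scenario)
        totals[key] += 1
        hits[key] += bool(reached_pmf)
    return {key: hits[key] / totals[key] for key in totals}
```

A rate of 1.0 corresponds to a 100% cell and 0.0 to a 0% cell. Pairs with no recorded runs are simply absent from the result rather than reported as zero.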
Two ways we measure product judgment
One benchmark, two complementary tests — synthetic stress-testing and a backtest against real history.
Can it reach PMF from scratch?
Drop the system into 7 invented startups and run them quarter by quarter under uncertainty, competition and a finite balance sheet. Measures whether it can build fit — the leaderboard above.
See the scenarios
Would it have chosen what winners chose?
Replay real pivotal decisions from famous startups (Slack, Airbnb, Stripe, Instagram, Netflix). The AI gets the same situation with the outcome hidden, and we check whether it picks the move history rewarded. The answer key is what actually happened.
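The backtest check can be sketched as a simple agreement rate: for each decision, compare the system's pick with the option the company actually took. The function name, the decision identifiers and the rule that unanswered decisions count as misses are all assumptions made for illustration, and the real scoring may weight decisions differently.

```python
def replay_agreement(picks, history):
    """Fraction of decisions where the system's pick matches the historical choice.

    `history` maps decision id -> option actually taken.
    `picks` maps decision id -> option the system chose; a decision with no
    pick counts as a miss (an assumption of this sketch).
    """
    if not history:
        raise ValueError("history must contain at least one decision")
    matched = sum(
        1 for decision_id, actual in history.items()
        if picks.get(decision_id) == actual
    )
    return matched / len(history)
```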
Open Decision Replay
Why we built a benchmark, not a brochure
We make a product that uses AI to help product teams. That only matters if AI is genuinely good at product judgment — so we measure it, adversarially and reproducibly, including against our own systems. Every number here is generated by a deterministic, open simulator you can rerun. The same harness that runs the leaderboard powers regression tests on the simulation itself.