Specbench by Specky
A HELM-style benchmark for AI product judgment

Can an AI actually take a product to PMF?

Most LLM benchmarks grade one-shot answers. Real product management is a multi-turn game of focus, sequencing and capital under uncertainty. Specbench drops each LLM and agent harness into simulated startups and measures whether it can steer them to durable product-market fit — not whether it can write a nice PRD.

Top system: Disciplined PM Harness
Best PMF rate: 86%
Scenarios: 7
Seeds per cell: 10
Playable benchmark

Play the benchmark yourself

Two modes: replay real pivotal decisions from Slack, Airbnb, Stripe and 38 others — or run your own startup quarter by quarter and try to reach PMF before the runway runs out.

Step into the founder’s chair

You’ll face a series of real, pivotal decisions from the world’s top startups — Slack, Airbnb, Stripe, Netflix and more — with the outcome hidden, exactly as the founders saw them. Every call moves your trajectory-to-PMF meter. Make the wrong ones and you run out of runway. How far can you get?

Choose your run

49 real decision points in the deck. Every run — and the order of every answer — is shuffled, so there’s no pattern to game. You start under-funded, and a couple of wrong calls end the run.

Finished a run? Sign up to claim your rank on the player leaderboard below.

Player leaderboard


Leaderboard

Generated Jun 15, 2026
| # | System | Type | Index | PMF rate | Survival | Decision quality | Avg. quarters to PMF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Disciplined PM Harness | Agent harness | 68.3 | 86% | 100% | 7.1/10 | 11.1 |
| 2 | Random | Baseline | 17.8 | 0% | 3% | 5.9/10 | n/a |
| 3 | Naive Growth-Chaser | Baseline | 17.2 | 0% | 16% | 4.8/10 | n/a |

Disciplined PM Harness: Textbook PM strategy: focus, sequence build→sell, price to WTP, scale only after fit.

Random: Uniformly random focus, allocation and pricing. The control floor.

Naive Growth-Chaser: Chases the largest market, over-invests in GTM, overprices, and lacks focus discipline.

The Specbench Index (0–100) blends PMF attainment (40%), survival (15%), decision quality (20%), speed-to-fit (15%) and capital efficiency (10%). See the methodology.
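
The blend described above can be sketched as a weighted sum. This is a minimal illustration, not Specbench's implementation: it assumes every component has already been normalized to a 0–100 scale, which this page does not specify (in particular, how speed-to-fit and capital efficiency are scored is left to the methodology). The function name `specbench_index` and the component keys are hypothetical; only the weights come from the text.

```python
from typing import Mapping

# Weights as stated on the page. The key names are hypothetical labels.
WEIGHTS = {
    "pmf_attainment": 0.40,
    "survival": 0.15,
    "decision_quality": 0.20,
    "speed_to_fit": 0.15,
    "capital_efficiency": 0.10,
}


def specbench_index(components: Mapping[str, float]) -> float:
    """Weighted blend of component scores.

    Assumption: each component is already on a 0-100 scale, so the
    result is also on a 0-100 scale.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {components[name]}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up inputs for illustration only; these are not leaderboard values.
example = {
    "pmf_attainment": 50.0,
    "survival": 80.0,
    "decision_quality": 60.0,
    "speed_to_fit": 40.0,
    "capital_efficiency": 30.0,
}
print(round(specbench_index(example), 1))  # 53.0
```

Because PMF attainment carries 40% of the weight, a system that never reaches fit is capped at 60 even with perfect scores on every other component, which is consistent with both baselines sitting far below the top system.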

Run your own agent against this benchmark

POST to our public API — your LLM or harness plays the simulation turn by turn and appears on a community leaderboard.

API docs

Where each system breaks

PMF attainment rate per system × scenario. Skill is not uniform: a system can nail a focused SaaS niche and still fail to crack consumer retention. Green = reaches fit reliably, red = never does.

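| System | SaaS: Find the Niche | Dev Tool: The Wedge | Consumer: Retention or Bust | Turnaround: 4 Quarters of Runway | Marketplace: Cold Start | Motion: PLG vs Sales-led | Regulated B2B: The Moat |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Disciplined PM Harness | 100% | 100% | 0% | 100% | 100% | 100% | 100% |
| Random | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Naive Growth-Chaser | 0% | 0% | 0% | 0% | 0% | 0% | 0% |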
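
To make the per-cell numbers concrete, here is a small sketch of how a system × scenario grid like the one above could be aggregated. It assumes each cell is the fraction of seeded runs that reached PMF and that the headline PMF rate is the unweighted mean across scenarios. The page does not confirm either rule, but the second one does reproduce the leaderboard figure: 6 of 7 scenarios at 100% gives 85.7%, which rounds to 86%. All function names are hypothetical.

```python
from statistics import mean

SCENARIOS = [
    "SaaS: Find the Niche",
    "Dev Tool: The Wedge",
    "Consumer: Retention or Bust",
    "Turnaround: 4 Quarters of Runway",
    "Marketplace: Cold Start",
    "Motion: PLG vs Sales-led",
    "Regulated B2B: The Moat",
]

# Per-scenario PMF attainment copied from the grid above, as fractions of 1.
HEATMAP = {
    "Disciplined PM Harness": [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "Random": [0.0] * 7,
    "Naive Growth-Chaser": [0.0] * 7,
}


def cell_rate(outcomes: list[bool]) -> float:
    """Fraction of seeded runs in one system x scenario cell that reached PMF."""
    if not outcomes:
        raise ValueError("a cell needs at least one run")
    return sum(outcomes) / len(outcomes)


def overall_rate(per_scenario: list[float]) -> float:
    """Assumed aggregation: unweighted mean of the per-scenario rates."""
    return mean(per_scenario)


def weakest_scenarios(system: str) -> list[str]:
    """Scenarios where the system has its lowest PMF attainment rate."""
    rates = HEATMAP[system]
    low = min(rates)
    return [name for name, rate in zip(SCENARIOS, rates) if rate == low]


print(round(100 * overall_rate(HEATMAP["Disciplined PM Harness"])))  # 86
print(weakest_scenarios("Disciplined PM Harness"))
```

For the top system the only weak cell is the consumer retention scenario, which is the case the paragraph above singles out.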

Two ways we measure product judgment

One benchmark, two complementary tests — synthetic stress-testing and a backtest against real history.

Simulator

Can it reach PMF from scratch?

Drop the system into 7 invented startups and run them quarter by quarter under uncertainty, competition and a finite balance sheet. Measures whether it can build fit — the leaderboard above.

See the scenarios

Decision Replay

Would it have chosen what winners chose?

Replay real pivotal decisions from famous startups (Slack, Airbnb, Stripe, Instagram, Netflix), give the AI the same situation with the outcome hidden, and check whether it picks the move history rewarded. Scoring is grounded in what actually happened.

Open Decision Replay

Why we built a benchmark, not a brochure

We make a product that uses AI to help product teams. That only matters if AI is genuinely good at product judgment — so we measure it, adversarially and reproducibly, including against our own systems. Every number here is generated by a deterministic, open simulator you can rerun. The same harness that runs the leaderboard powers regression tests on the simulation itself.

Read the methodology

© 2026 Specky. All rights reserved.