Methodology

How Specbench measures product judgment

Specbench is inspired by Stanford's HELM: instead of a single accuracy number, it scores systems holistically across many scenarios and metrics. The twist is the task — not answering questions, but running a company turn by turn toward product-market fit.

The task: a multi-turn company simulator

Each run places a system under test inside a simulated startup with a market, a balance sheet, a team, and a set of customer segments. Every quarter the system receives a dashboard of the company's state and must make one structured decision:

Focus — which single segment to serve this quarter.
Allocation — how to split finite effort across product, research, go-to-market and hiring.
Pricing — the price point, judged against each segment's willingness to pay.
Capital — whether to attempt a raise (only succeeds with a real traction story).
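
To make the four-part decision concrete, here is a minimal sketch of how one quarterly decision could be typed and sanity-checked. The names `QuarterDecision`, `EffortArea` and `validateDecision` are illustrative assumptions, not the repository's actual schema, and the rule that allocation shares sum to 1 is one plausible reading of "finite effort".

```typescript
// Hypothetical shape of one quarterly decision; all field names are assumptions.
type EffortArea = "product" | "research" | "goToMarket" | "hiring";

interface QuarterDecision {
  focusSegment: string;                    // the single segment served this quarter
  allocation: Record<EffortArea, number>;  // effort shares, assumed to sum to 1
  price: number;                           // price point for the quarter
  attemptRaise: boolean;                   // whether to try a fundraise
}

// Returns a list of problems; an empty list means the decision is well-formed.
function validateDecision(d: QuarterDecision, segments: string[]): string[] {
  const errors: string[] = [];
  if (!segments.includes(d.focusSegment)) {
    errors.push(`unknown segment: ${d.focusSegment}`);
  }
  const shares = Object.values(d.allocation);
  if (shares.some((s) => s < 0)) {
    errors.push("allocation shares must be non-negative");
  }
  const total = shares.reduce((a, b) => a + b, 0);
  if (Math.abs(total - 1) > 1e-6) {
    errors.push(`allocation must sum to 1, got ${total}`);
  }
  if (!(d.price > 0)) {
    errors.push("price must be positive");
  }
  return errors;
}
```

Returning a list of errors rather than throwing lets a runner log every problem with a malformed model output in one pass.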

The market model then resolves the quarter with interpretable dynamics: customer-acquisition cost rises sharply when product fit is low (you can buy clicks, not retention), churn is driven by weak fit, overpricing and competition, awareness compounds through word-of-mouth, and the team's morale and runway feed back into execution. Random shocks (competitor launches, PR events, viral moments) keep it honest. The simulation runs for the scenario's horizon (10–14 quarters) or until the company runs out of cash.

Systems under test

Specbench compares two categories against deterministic controls:

Raw LLMs — a single structured model call per quarter, no scaffolding. "Can the base model play PM?"
Agent harnesses — the model wrapped in a reasoning loop (situation analysis → decision → self-critique). This is the value scaffolding like Specky adds on top of a base model.
Baselines — a random agent (the floor) and a naive growth-chaser (chases the biggest TAM, over-spends on marketing, overprices). Any system worth shipping must clearly beat these.

The published baselines run fully offline and deterministically, so the leaderboard reproduces exactly and CI can regression-test the simulator's ability to rank skill. LLM and harness systems go through the same evaluation pipeline whenever API keys are supplied.

The PMF score

Each quarter the company gets a composite PMF score from 0–100, modeled on the signals practitioners actually use. Retention dominates — it is the truest signal of fit — followed by having a clear winning segment, sane unit economics (an LTV/CAC proxy), and organic pull:

PMF = 40% · retention + 20% · segment-fit concentration + 20% · unit economics + 20% · organic pull

A scenario counts as reaching PMF when the score holds at or above the scenario's threshold (≈70) for two consecutive quarters — sustained fit, not a one-quarter spike.
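
The weighted composite and the two-consecutive-quarters rule can be sketched as follows. This assumes each signal has already been normalized to the 0..1 range, which the text above does not specify; `PmfSignals`, `pmfScore` and `reachedPmf` are hypothetical names.

```typescript
// Assumption: each signal is pre-normalized to 0..1 before weighting.
interface PmfSignals {
  retention: number;
  segmentFit: number;     // segment-fit concentration
  unitEconomics: number;  // LTV/CAC proxy
  organicPull: number;
}

// Composite PMF score on a 0..100 scale using the published 40/20/20/20 weights.
function pmfScore(s: PmfSignals): number {
  const raw =
    0.4 * s.retention +
    0.2 * s.segmentFit +
    0.2 * s.unitEconomics +
    0.2 * s.organicPull;
  return 100 * Math.min(1, Math.max(0, raw));
}

// True once the score stays at or above the threshold for `streak` quarters in a row.
function reachedPmf(scores: number[], threshold = 70, streak = 2): boolean {
  let run = 0;
  for (const score of scores) {
    run = score >= threshold ? run + 1 : 0;
    if (run >= streak) return true;
  }
  return false;
}
```

For example, a quarterly series of 60, 72, 65, 71, 74 qualifies only at the end: the 72 is a one-quarter spike, while 71 followed by 74 is a sustained run.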

The Specbench Index

Per system × scenario we report HELM-style multi-metric cells (PMF rate, survival, mean peak PMF, capital efficiency, decision quality, robustness). These roll up into a single 0–100 Specbench Index using fixed, absolute weights — not min-max against the current field — so scores stay comparable across benchmark revisions:

40%	PMF attainment	Did it reach a durable PMF score and sustain it? The whole point.
15%	Survival	Did the company avoid running out of cash before the horizon?
20%	Decision quality	Were the moves defensible PM reasoning, independent of luck?
15%	Speed to fit	How quickly fit arrived, relative to the scenario horizon.
10%	Capital efficiency	PMF progress per dollar burned — focus beats brute force.
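
A minimal sketch of the roll-up, assuming each component has already been expressed on an absolute 0..100 scale (the per-component scaling is not described here, and the identifiers below are hypothetical):

```typescript
// Fixed, absolute weights from the table above; they sum to 1.
const INDEX_WEIGHTS = {
  pmfAttainment: 0.4,
  survival: 0.15,
  decisionQuality: 0.2,
  speedToFit: 0.15,
  capitalEfficiency: 0.1,
} as const;

type IndexComponent = keyof typeof INDEX_WEIGHTS;

// Weighted sum of component scores, each clamped to 0..100 (an assumption).
function specbenchIndex(components: Record<IndexComponent, number>): number {
  let total = 0;
  for (const key of Object.keys(INDEX_WEIGHTS) as IndexComponent[]) {
    const value = Math.min(100, Math.max(0, components[key]));
    total += INDEX_WEIGHTS[key] * value;
  }
  return total;
}
```

Because the weights are constants rather than derived from the current field, adding or removing a system from the leaderboard leaves every other system's Index unchanged.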

Decision Replay — backtesting against real history

The simulator tests judgment on invented companies. Decision Replay tests it on real ones. We curated pivotal decision points from documented startup history — Slack's pivot from a game to a chat tool, Airbnb's unscalable photo hustle, Stripe's developer-first wedge, Netflix's Qwikster blunder — and for each one we strip out the outcome and present the AI with exactly what the founders knew at the time: the situation, the signals, and the candidate moves.

The pick is then graded three ways:

Founder Alignment — the hindsight quality (0–100) of the move it chose, averaged across cases. The headline.
Best-move rate — how often it chose the single historically-best option.
Founder-match rate — how often it chose what the founders actually did. This diverges from best-move on cases where the founders made a famous mistake — choosing well there means not matching them.
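
The three grades can be sketched as below. The `ReplayCase` shape and the `gradeReplay` function are illustrative assumptions rather than the dataset's actual format, and ties for the best option are broken by taking the first one listed.

```typescript
// Hypothetical case format: each option carries a hindsight score from 0 to 100.
interface ReplayCase {
  options: { id: string; hindsightScore: number }[];
  founderChoice: string; // id of the move the founders actually made
}

interface ReplayGrades {
  founderAlignment: number; // mean hindsight score of the picks (0..100)
  bestMoveRate: number;     // share of cases where the top-scored option was picked
  founderMatchRate: number; // share of cases where the pick equals the founders' move
}

function gradeReplay(cases: ReplayCase[], picks: string[]): ReplayGrades {
  if (cases.length === 0 || cases.length !== picks.length) {
    throw new Error("need at least one case and exactly one pick per case");
  }
  let scoreSum = 0;
  let bestHits = 0;
  let founderHits = 0;
  cases.forEach((c, i) => {
    const picked = c.options.find((o) => o.id === picks[i]);
    if (!picked) throw new Error(`pick ${picks[i]} is not an option in case ${i}`);
    // Highest hindsight score wins; the first option wins a tie.
    const best = c.options.reduce((a, b) => (b.hindsightScore > a.hindsightScore ? b : a));
    scoreSum += picked.hindsightScore;
    if (picked.id === best.id) bestHits += 1;
    if (picked.id === c.founderChoice) founderHits += 1;
  });
  const n = cases.length;
  return {
    founderAlignment: scoreSum / n,
    bestMoveRate: bestHits / n,
    founderMatchRate: founderHits / n,
  };
}
```

On a case where the founders blundered, picking the top-scored option raises the best-move rate and lowers the founder-match rate, which is the divergence described above.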

Two honesty caveats. First, these are curated case studies distilled from public accounts, each with sources; the option scores encode a hindsight judgment, not a measurement, and the dataset ships as a strong seed set built to grow toward 100 rather than a fabricated round number. Second, the offline pickers are reference bounds, not contenders: the "Principle Oracle" reads the principle tags (a ceiling) and the "Anti-pattern" picks badly (a floor). Real model rows see only the situation text — never the tags or the outcome — and are the actual subject of the test.

Scoring decision quality

Outcomes can be lucky or unlucky, so we score the reasoning separately. A pure, offline heuristic rates each decision against PM first principles — focus, build-before-scale sequencing, pricing discipline, capital efficiency, and avoiding strategic thrash. When API keys are available, an LLM-as-judge can replace the heuristic to grade each move's rationale directly. Either way, decision quality is averaged across every quarter of a run.

Reproducibility

Every run is deterministic given a (system, scenario, seed) triple — all randomness flows through a single seeded generator, never Math.random. We average 10 seeds per cell to separate skill from variance, and report the standard deviation of peak PMF as a robustness signal. The simulator, scenarios, baselines and scoring are all in the repository under lib/specbench/; regenerate the leaderboard with npx tsx scripts/specbench/run.ts.
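
As a rough illustration of seed-driven determinism, the sketch below uses a mulberry32-style generator as a stand-in, since the generator Specbench actually uses is not named here. `summarizeSeeds` and the `runOnce` callback are hypothetical, and the standard deviation is the population form, which is also an assumption.

```typescript
// mulberry32-style PRNG: a small seeded generator returning floats in [0, 1).
// Stand-in only; the repository's real generator may differ.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Mean and population standard deviation of a non-empty list.
function meanAndStd(values: number[]): { mean: number; std: number } {
  if (values.length === 0) throw new Error("need at least one value");
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return { mean, std: Math.sqrt(variance) };
}

// Runs one simulation per seed; `runOnce` must draw all randomness from `rng`
// and return the run's peak PMF.
function summarizeSeeds(
  runOnce: (rng: () => number) => number,
  seeds: number[],
): { mean: number; std: number } {
  const peaks = seeds.map((seed) => runOnce(mulberry32(seed)));
  return meanAndStd(peaks);
}
```

Passing the generator into the run, rather than reading a global, is what makes a (system, scenario, seed) triple replay identically.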

Limitations & honesty

Specbench is a model of reality, not reality. The market dynamics are interpretable by design, which means they are also simplifications — real PMF involves messy qualitative judgment a simulator can't fully capture. We publish the weights and code precisely so the assumptions are inspectable and contestable. We also run our own systems through it, including when the results aren't flattering. Treat the Index as a structured, reproducible comparison — a strong signal, not a verdict.
