Tell Claude Code to build a payment API. It will produce a plan. Ten steps, clean formatting, green checkmarks. It looks thorough. It looks confident.
Step 7: "Implement webhook handling."
Four words. No mention of retry logic. No mention of signature verification. No mention of what happens when Stripe sends an event for a subscription type you haven't modeled yet.
The plan doesn't know what it doesn't know. And neither do you — until you're three hours deep and the whole thing stalls on an assumption nobody questioned.
That's the problem specfirst solves.
## Plans Are Binary. Reality Isn't.
Most planning tools treat specs as a gate. You either have one or you don't. Pass/fail. The plan exists, therefore proceed.
But a 10-step plan where step 7 is "figure out the database schema" isn't much better than no plan at all. The plan exists. The confidence is fake.
specfirst adds a gradient. Every step in your implementation plan gets scored across four dimensions:
- Requirement Clarity — Do we actually know what to build?
- Implementation Certainty — Do we know how to build it?
- Risk Awareness — Have we identified what can go wrong?
- Dependency Clarity — Are external dependencies resolved?
Each dimension scores 0.0 to 1.0 with calibrated anchors. No vibes. A 0.9 in Requirement Clarity means the acceptance criteria are specific and testable. A 0.4 means there are open questions that would change the implementation.
The scores compose into an overall plan confidence using the geometric mean, not the arithmetic mean. That's a deliberate choice: one catastrophic gap tanks the whole score. A plan that's 0.95 across five steps and 0.2 on one step isn't a 0.83 plan. It's a 0.73 plan. Because that one step will block everything downstream.
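The contrast is easy to check directly. A quick sketch using the step scores from that example:

```python
import math

# Six step scores: five strong steps and one weak one.
step_scores = [0.95, 0.95, 0.95, 0.95, 0.95, 0.2]

arithmetic = sum(step_scores) / len(step_scores)               # averages the gap away
geometric = math.prod(step_scores) ** (1 / len(step_scores))   # lets the gap dominate

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"geometric mean:  {geometric:.2f}")
```

The arithmetic mean lands around 0.83; the geometric mean lands around 0.73, pulled down by the single weak step.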
## What It Looks Like
Install it. Initialize it. Run the skill.
```bash
npm install -g @rezzedai/specfirst
specfirst init
```
Then in Claude Code:
```
/spec Build a Stripe payment processing API with subscriptions
```
specfirst runs a 7-step pipeline — analyze the request, clarify ambiguity, decompose into steps, score each step, check if the scope warrants full analysis, verify through chain-of-verification, and output the spec.
Here's what the output looks like for that payment API:
## Step 3: Webhook Event Processing
Confidence: 0.68 (REVIEW)
| Dimension | Score |
|--------------------------|-------|
| Requirement Clarity | 0.70 |
| Implementation Certainty | 0.60 |
| Risk Awareness | 0.65 |
| Dependency Clarity | 0.75 |
Flags:
- Refund policy for partial periods: unspecified
- Webhook retry/idempotency strategy: not defined
- 3 event types referenced but not mapped to handlers
That step didn't fail. It flagged. Yellow means "a human should look at this before an agent burns tokens building it." Green means proceed. Red means stop — critical gaps that will cascade.
The overall plan scored 0.72. Not red. Not green. The kind of score that tells you: refine steps 3, 4, and 6, then build with confidence. Don't guess. Don't hope. Know.
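The green/yellow/red bands above can be sketched as a simple mapping from composite score to action. The cutoffs here are illustrative assumptions for this sketch, not specfirst's actual thresholds:

```python
def status(confidence: float) -> str:
    """Map a composite confidence score to an action band.

    The 0.85 and 0.50 cutoffs are assumptions for illustration;
    specfirst's real thresholds may differ.
    """
    if confidence >= 0.85:
        return "PROCEED"  # green: build it
    if confidence >= 0.50:
        return "REVIEW"   # yellow: a human should look first
    return "STOP"         # red: critical gaps will cascade

print(status(0.68))  # the webhook step above: REVIEW
print(status(0.95))  # a confidently scored step: PROCEED
```

Under these cutoffs, the 0.72 overall plan lands squarely in the review band, which is the point: the score tells you where to spend attention, not just whether to proceed.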
## Why Geometric Mean
Arithmetic mean hides risk. Five green steps and one red step average out to yellow. That's a lie. The red step will block your build. Your agent will hit it, hallucinate a solution, and you'll spend an hour debugging something that should have been a five-minute conversation before writing code.
Geometric mean punishes that. One weak dimension drags the composite down hard. A plan with a 0.2 in any dimension reads as what it is: not ready.
This isn't academic. I watched agents burn 40-minute sessions building on plans that looked solid and weren't. The geometric mean catches what arithmetic mean papers over.
## Auto-Scope: Not Everything Needs a Spec
Add a health check endpoint? That doesn't need chain-of-verification and six failure modes.
specfirst detects trivial tasks automatically. Two steps, high confidence, well-defined scope — it skips the heavyweight analysis and gives you a compact spec:
## Health Check Endpoint
Overall Confidence: 0.95 (PROCEED)
Steps: 2 | Auto-scoped: lightweight
The right amount of process for the right size problem. A health check endpoint gets two lines. A payment API gets a full spec with verification. Spec weight scales with implementation risk.
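The auto-scope decision can be sketched as a small predicate. The function name, step shape, and cutoffs here are assumptions for illustration; specfirst's real heuristic is its own:

```python
# Illustrative sketch of an auto-scope decision: the lightweight path
# applies only when the plan is small AND every step is confidently scored.
def needs_full_spec(steps):
    small = len(steps) <= 2
    confident = all(s["confidence"] >= 0.9 for s in steps)
    return not (small and confident)

health_check = [
    {"name": "Add /health route", "confidence": 0.95},
    {"name": "Wire into router", "confidence": 0.95},
]
payment_api = [{"name": f"step {i}", "confidence": c}
               for i, c in enumerate([0.9, 0.8, 0.68, 0.7, 0.75, 0.85])]

print(needs_full_spec(health_check))  # False: compact spec is enough
print(needs_full_spec(payment_api))   # True: full pipeline with verification
```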
## The Workflow
specfirst isn't a standalone tool. It's the first move in a build cycle.
- /spec your feature — get confidence scores
- Fix the yellows — clarify requirements, resolve dependencies
- Build with a green plan — the agent knows what it's building and so do you
- specfirst review later — re-score against what actually shipped
The review command is the feedback loop. Compare the original spec against the implementation. Did the risks materialize? Were the scores accurate? Over time, your specs get sharper because you're calibrating against reality, not guessing.
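The calibration idea behind the review loop can be sketched in a few lines. This is a conceptual illustration of measuring how far predicted confidence sat from reality, not specfirst's implementation:

```python
# Conceptual sketch: mean absolute gap between each step's predicted
# confidence and its actual outcome (shipped clean = 1, needed rework = 0).
def calibration_error(records):
    return sum(abs(pred - float(ok)) for pred, ok in records) / len(records)

history = [
    (0.95, True),   # scored high, shipped without rework
    (0.68, False),  # flagged yellow, and it did need rework
    (0.90, True),
]
print(f"calibration error: {calibration_error(history):.2f}")
```

A score trending toward zero over successive reviews means the spec scores are tracking reality; a stubbornly high score means the anchors need recalibrating.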
## Zero Dependencies. Runs Anywhere.
specfirst is a Node CLI with zero external dependencies. No API keys. No cloud services. No accounts to create. Install it, init your project, run the skill. Works offline. Works in CI. Works in any Claude Code session.
```bash
specfirst list          # See all specs in your project
specfirst status        # Dashboard of plan confidence
specfirst review <spec> # Re-score against implementation
```
The entire thing is open source. Check the repo.
Plans are cheap. Confident plans are rare. specfirst makes the difference visible before you write a line of code.