Running Evals

Use Sero evals when you need a structured signal about prompt assembly or agent behavior. Snapshot evals are fast and local. Real LLM evals call providers and can cost money.

Pick the right command

Command              When to use                                   Cost/auth
pnpm eval:snapshot   Prompt assembly/cache drift checks            No live LLM calls; low/no provider cost.
pnpm eval            Full promptfoo eval against real providers    Requires credentials and may cost money.
pnpm eval:view       Inspect saved promptfoo results               No new model calls.

Run commands from the monorepo root.

Snapshot eval workflow

pnpm eval:snapshot

Snapshot evals assemble Sero prompt blocks and check presence, ordering, size, and metadata. Run them before committing changes to agent prompts, CLI prompt blocks, container prompt blocks, subagent prompt guidance, or session setup.
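
A minimal local gate before committing a prompt change might look like this (the commit message is illustrative):

pnpm eval:snapshot && git commit -m "Tighten container prompt block"   # commit only if the snapshot checks pass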

If a snapshot fails after an intentional prompt change, inspect the failure reason and update the relevant baseline in eval/scenarios/prompt-stability.yaml, landing the baseline update together with the code change that caused it.
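
When the change is intentional, one way to keep the baseline in sync is to confirm the failure, edit the baseline, and stage it alongside the prompt change. The prompt source path below is hypothetical:

pnpm eval:snapshot                              # confirm which baseline check fails
$EDITOR eval/scenarios/prompt-stability.yaml    # update the affected baseline values
pnpm eval:snapshot                              # re-run until the snapshot is green
git add src/agent/prompt-blocks.ts eval/scenarios/prompt-stability.yaml   # hypothetical prompt source + baseline
git commit -m "Adjust agent prompt and update snapshot baseline"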

Real LLM eval workflow

ANTHROPIC_API_KEY=... pnpm eval

Real evals use promptfoo plus Sero's eval provider. They create isolated temp workspaces under /tmp/sero-eval-*, initialize a clean Git repo, expose file tools, and use an eval-only sero-cli shim for deterministic platform checks.
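
Because runs use isolated workspaces under /tmp/sero-eval-*, any directories still on disk after a run can be inspected and, assuming nothing else uses that prefix, removed:

ls -d /tmp/sero-eval-*     # workspaces created by real eval runs
rm -rf /tmp/sero-eval-*    # optional cleanup once runs have finished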

Run them before releases, after model/SDK upgrades, or when changing agent behavior. They are not currently a required PR gate for every change.

Inspect results

pnpm eval:view

This opens promptfoo's local result viewer so you can compare pass/fail history, scores, model output, tool metadata, and scenario details.

Scenario matrix

Scenario file                          Mode       Coverage
eval/scenarios/prompt-stability.yaml   Snapshot   Prompt blocks, ordering, prompt size, cache-stability metadata.
eval/scenarios/file-ops.yaml           Real LLM   Read/write/edit behavior and latency guards in temp workspaces.
eval/scenarios/coding-tasks.yaml       Real LLM   React/TypeScript generation, null-safety fixes, utility generation.
eval/scenarios/cli-ops.yaml            Real LLM   Agent preference for sero-cli, workspace info, batch commands, VCS status.
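
If the pnpm eval script forwards extra arguments to promptfoo (an assumption about this repo's script wiring), a single scenario file can be targeted with promptfoo's -c flag:

pnpm eval -- -c eval/scenarios/file-ops.yaml   # pass-through args reach promptfoo only if the script forwards them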

Interpreting failures

  • Snapshot block missing — inspect the prompt-building source that should add that block.
  • Snapshot ordering changed — confirm whether prompt cache behavior intentionally changed.
  • Prompt grew too much — remove accidental verbosity or update the baseline only for intentional growth.
  • Real eval tool sequence failed — inspect tool metadata; the agent may have used raw tools instead of the expected platform tool.
  • LLM rubric failed — read the output before assuming product code is broken; rubrics can be noisy.
  • Auth/provider failure — check ANTHROPIC_API_KEY or profile auth state; see the check below.
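
For the auth case, a quick way to confirm the key is visible to the shell without printing its value:

test -n "$ANTHROPIC_API_KEY" && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is missing"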