Use Sero evals when you need a structured signal about prompt assembly or agent behavior. Snapshot evals are fast and local. Real LLM evals call providers and can cost money.
| Command | When to use | Cost/auth |
|---|---|---|
| pnpm eval:snapshot | Prompt assembly/cache drift checks | No live LLM calls; low/no provider cost. |
| pnpm eval | Full promptfoo eval against real providers | Requires credentials and may cost money. |
| pnpm eval:view | Inspect saved promptfoo results | No new model calls. |
Run commands from the monorepo root.
Snapshot evals assemble Sero prompt blocks and check presence, ordering, size, and metadata. Run them before committing changes to agent prompts, CLI prompt blocks, container prompt blocks, subagent prompt guidance, or session setup.
If a snapshot fails after an intentional prompt change, inspect the failure reason, then update the relevant baseline in eval/scenarios/prompt-stability.yaml only together with the code change that caused it.
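
The presence, ordering, and size checks described above can be pictured with a minimal sketch. The `PromptBlock` and `Baseline` shapes and the `checkSnapshot` function are hypothetical names invented for illustration; they are not Sero's actual types or the schema of prompt-stability.yaml, and the metadata checks are left out.

```typescript
// Hypothetical shapes: Sero's real prompt-block and baseline types are not shown in this document.
interface PromptBlock {
  id: string;
  text: string;
}

interface Baseline {
  expectedOrder: string[]; // unique block ids that must appear, in this relative order
  maxTotalChars: number; // upper bound on the assembled prompt size
}

// Returns human-readable failure reasons; an empty array means the snapshot passes.
function checkSnapshot(blocks: PromptBlock[], baseline: Baseline): string[] {
  const failures: string[] = [];
  const ids = blocks.map((b) => b.id);

  // Presence: every expected block must have been assembled.
  for (const id of baseline.expectedOrder) {
    if (!ids.includes(id)) failures.push(`missing block: ${id}`);
  }

  // Ordering: expected blocks that are present must keep their relative order.
  const present = baseline.expectedOrder.filter((id) => ids.includes(id));
  const actual = ids.filter((id) => present.includes(id));
  if (present.join("\n") !== actual.join("\n")) {
    failures.push(`order drift: expected ${present.join(" > ")}, got ${actual.join(" > ")}`);
  }

  // Size: the assembled prompt must stay within the baseline budget.
  const total = blocks.reduce((sum, b) => sum + b.text.length, 0);
  if (total > baseline.maxTotalChars) {
    failures.push(`prompt too large: ${total} > ${baseline.maxTotalChars} chars`);
  }

  return failures;
}
```

Returning a list of reasons instead of a single boolean keeps every drift visible in one run, which helps when deciding whether a baseline update is justified.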
Real evals use promptfoo plus Sero's eval provider. They create isolated temp workspaces under /tmp/sero-eval-*, initialize a clean Git repo, expose file tools, and use an eval-only sero-cli shim for deterministic platform checks.
Run them before releases, after model/SDK upgrades, or when changing agent behavior. They are not currently claimed as a required PR gate for every change.
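
As a rough illustration of the workspace isolation described above, the sketch below creates a fresh directory under /tmp with the sero-eval- prefix and initializes an empty Git repo in it. `createEvalWorkspace` and `removeEvalWorkspace` are hypothetical helpers, not the code in Sero's eval provider, and the `initGit` option exists only so the sketch can run on machines without Git.

```typescript
import { execFileSync } from "node:child_process";
import { mkdtempSync, rmSync } from "node:fs";
import { join } from "node:path";

const WORKSPACE_PREFIX = join("/tmp", "sero-eval-");

// Hypothetical helper: creates an isolated workspace and returns its path.
function createEvalWorkspace(options: { initGit?: boolean } = {}): string {
  // mkdtempSync appends random characters to the prefix, so runs never share a directory.
  const dir = mkdtempSync(WORKSPACE_PREFIX);
  if (options.initGit ?? true) {
    // Start from a clean repository so VCS checks see no prior history.
    execFileSync("git", ["init", "--quiet"], { cwd: dir });
  }
  return dir;
}

// Hypothetical helper: deletes a workspace created by createEvalWorkspace.
function removeEvalWorkspace(dir: string): void {
  // Refuse to delete anything that does not look like an eval workspace.
  if (!dir.startsWith(WORKSPACE_PREFIX)) {
    throw new Error(`not an eval workspace: ${dir}`);
  }
  rmSync(dir, { recursive: true, force: true });
}
```

A unique directory per run means one scenario's file writes cannot affect another scenario's assertions, and cleanup stays confined to the prefix.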
pnpm eval:view opens Promptfoo's local result viewer so you can compare pass/fail history, scores, model output, tool metadata, and scenario details.
| Scenario file | Mode | Coverage |
|---|---|---|
| eval/scenarios/prompt-stability.yaml | Snapshot | Prompt blocks, ordering, prompt size, cache-stability metadata. |
| eval/scenarios/file-ops.yaml | Real LLM | Read/write/edit behavior and latency guards in temp workspaces. |
| eval/scenarios/coding-tasks.yaml | Real LLM | React/TypeScript generation, null-safety fixes, utility generation. |
| eval/scenarios/cli-ops.yaml | Real LLM | Agent preference for sero-cli, workspace info, batch commands, VCS status. |
Real LLM evals need ANTHROPIC_API_KEY or profile auth state.
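
To avoid an accidental paid run, a pre-flight check along these lines may help. It is a sketch only: the scenario modes are copied from the table above, `preflight` is a hypothetical function that is not part of Sero, and it inspects only the ANTHROPIC_API_KEY environment variable. Profile auth state is not modeled, so a reported reason is a prompt to check your setup, not proof that the run would fail.

```typescript
type Mode = "snapshot" | "real-llm";

// Scenario files and modes as listed in the scenario table.
const SCENARIO_MODES: Record<string, Mode> = {
  "eval/scenarios/prompt-stability.yaml": "snapshot",
  "eval/scenarios/file-ops.yaml": "real-llm",
  "eval/scenarios/coding-tasks.yaml": "real-llm",
  "eval/scenarios/cli-ops.yaml": "real-llm",
};

// Hypothetical check: returns null when the scenario can proceed,
// or a reason string when something should be looked at first.
function preflight(
  scenario: string,
  env: Record<string, string | undefined>,
): string | null {
  const mode = SCENARIO_MODES[scenario];
  if (mode === undefined) return `unknown scenario: ${scenario}`;

  // Snapshot scenarios make no live LLM calls, so no credentials are needed.
  if (mode === "snapshot") return null;

  // Assumption: only the environment variable is checked; profile auth is ignored.
  const key = env.ANTHROPIC_API_KEY;
  if (typeof key !== "string" || key.trim() === "") {
    return "ANTHROPIC_API_KEY is not set; confirm profile auth before running";
  }
  return null;
}
```

In a Node script this could be called as `preflight("eval/scenarios/file-ops.yaml", process.env)` before invoking the eval.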