The wizard ships with several test lanes that escalate from fast in-process checks to slow real-binary integration. Pick the cheapest lane that catches your regression.
┌───────────────────────────┬──────────────────────────────┬───────────┐
│ Lane │ Catches │ Speed │
├───────────────────────────┼──────────────────────────────┼───────────┤
│ Unit (`pnpm test`) │ Pure logic, schemas, hooks │ ~40s │
│ Snapshot (in unit lane) │ Screen layout / copy │ ~40s │
│ Scenario (in unit lane) │ Wizard flow, scripted agent │ ~40s │
│ Smoke PTY │ Raw mode, signals, TTY │ ~30s/case│
│ E2E (`pnpm test:e2e`) │ Real CLI + framework wire │ 10+ min │
│ Eval (Phase 4 — TBD) │ Real Claude agent quality │ nightly │
└───────────────────────────┴──────────────────────────────┴───────────┘
vitest.config.ts. Includes everything under src/**/__tests__. The
existing 2700+ tests live here.
This is also where the scenario tests live — they run in-process
against the AgentDriver port using
createScriptedDriver to feed
canned message sequences into the real runAgent machinery. No
subprocess, no Claude SDK, no network. Use these for any deterministic
regression — flow ordering, tool-call sequencing, NDJSON event shapes,
exit-code routing.
Pattern:
import { setAgentDriver } from '../../lib/agent-driver';
import { createScriptedDriver, mk } from '../../lib/scripted-agent-driver';
beforeEach(() => {
const { driver, calls } = createScriptedDriver({
messages: [
mk.systemInit(),
mk.assistantText('thinking...'),
mk.resultSuccess('done'),
],
});
setAgentDriver(driver);
});
afterEach(() => setAgentDriver(null));Snapshot tests live alongside their components under
src/ui/tui/screens/__tests__/*.snap.test.tsx and use
renderSnapshot / makeStoreForSnapshot.
NDJSON output gets normalized through
redactNdjsonStream (lands in #518) before
diffing — see docs/agent-ndjson-contract.md
for the contract.
vitest.config.smoke.ts. Spawns the wizard via tsx bin.ts under a
node-pty pseudo-terminal so
TTY-only behavior surfaces:
- Ink only enters raw mode against a real
isTTYstdin. - Signals fire on real OS processes.
process.stdout.writeonly EPIPE-aborts on a real closed pipe.mode-config.ts's TTY detection runs against actualprocess.stdout.isTTY.
Kept intentionally small — one or two scenarios. PTY tests are 10–50× a unit test, and scaling them past a smoke layer destroys the dev feedback loop. For deterministic scenario coverage use the unit-lane scripted driver instead.
Skips automatically when node-pty isn't loadable (e.g., a host without a
prebuild and without a C++ toolchain). CI must ensure the toolchain is
present so the lane actually exercises real PTY behavior there.
e2e-tests/jest.config.ts. Builds the binary, spawns it in test
applications under e2e-tests/test-applications/, MSW-mocks the Amplitude
API, and replays recorded LLM responses keyed by request hash. Used to
verify framework-specific instrumentation lands correctly. Slow — not on
PR CI. See e2e-tests/README.md.
When the eval lane lands, cross-process LLM calls will be intercepted
via llmock — an HTTP server that
records /messages-shaped requests and replays them deterministically.
The wizard subprocess hits ANTHROPIC_BASE_URL=http://localhost:<port>
and the test suite owns the cassette JSON.
(The llmock dependency will be added when this Phase 4 feature is implemented.)
Why not extend MSW (which e2e-tests/ already uses)? MSW only intercepts
in the same Node process. The Claude Agent SDK runs in a child process, so
we need a real HTTP server. Cassettes are the right primitive for the
smoke tier of the eval lane (deterministic replay, zero cost). Live
LLM calls are reserved for the standard / full / deep tiers (real
agent quality measurement). The two answer different questions and must
not be conflated.
Status: not yet wired. Tracked as a Phase 4 deliverable.
| Symptom or invariant | Lane |
|---|---|
| Tool-call ordering / NDJSON event shape | Unit (scripted) |
| Screen layout / copy / hint bar contents | Unit (snapshot) |
| Pure helper / classifier / parser | Unit |
| Raw mode / signals / TTY-conditional code | Smoke PTY |
Real binary survives a clean --help / --version |
Smoke PTY |
| Real framework integration end-to-end (build + run) | E2E |
| "Did Claude do the right thing" | Eval (Phase 4) |
| Reward hacking / sycophancy / silent fallback | Eval (Phase 4) |
When a deterministic check is possible, prefer the unit lane — it gives the fastest feedback and the most precise regression signal. Reach for PTY only when the bug literally requires a TTY to reproduce.