skill-optimizer

Benchmark and self-optimize SDK, CLI, and MCP guidance so every agent model can use your tool reliably.

skill-optimizer runs your SDK / CLI / MCP docs against multiple LLMs, measures whether they call the right actions with the right arguments, and iteratively rewrites your SKILL.md / docs until a floor score is met across every model.

Requirements: Node.js 20+, an OpenRouter API key.

Installation

git clone https://github.com/fastxyz/skill-optimizer
cd skill-optimizer
npm install
npm run build
npm link        # makes `skill-optimizer` available globally

Quickstart

export OPENROUTER_API_KEY=sk-or-...

Step 1 — Scaffold config (run from your project root):

npx skill-optimizer init cli       # or: init sdk, init mcp

The wizard asks for your repo path, models to benchmark, and where your SKILL.md lives. It creates a skill-optimizer/ directory:

skill-optimizer.json — the main config (commit this)
.skill-optimizer/cli-commands.json — CLI surface manifest (template to edit, or auto-extracted)
.skill-optimizer/tools.json — MCP surface manifest (template to edit)

Step 2 — (CLI/MCP only) Extract your surface if code-first discovery yields nothing:

npx skill-optimizer import-commands --from ./src/cli.ts
# or for a compiled binary:
npx skill-optimizer import-commands --from my-cli --scrape

Step 3 — Run a benchmark:

npx skill-optimizer run --config ./skill-optimizer/skill-optimizer.json

Step 4 — Run the optimizer (iteratively improves your SKILL.md):

npx skill-optimizer optimize --config ./skill-optimizer/skill-optimizer.json

The optimizer never modifies your original SKILL.md — it works from versioned local copies in .skill-optimizer/ and prints a progress table at the end showing per-model improvement.

Non-interactive / CI mode:

# Accept all wizard defaults without prompts
npx skill-optimizer init cli --yes

# Load answers from a JSON file
npx skill-optimizer init --answers answers.json

answers.json format:

{
  "surface": "cli",
  "repoPath": "/absolute/path/to/your-repo",
  "models": ["openrouter/anthropic/claude-sonnet-4.6", "openrouter/openai/gpt-4o"],
  "maxTasks": 20,
  "maxIterations": 5,
  "entryFile": "src/cli.ts"
}

Key config fields in skill-optimizer/skill-optimizer.json:

| Field | What it does | Set it to |
| --- | --- | --- |
| `target.repoPath` | Root of the project being benchmarked | Absolute or relative path to your repo |
| `target.discovery.sources` | Source files to scan for callable methods/commands/tools | e.g. `["../src/index.ts"]` or `["../src/server.ts"]` |
| `target.skill` | Docs file the optimizer will edit | Path to your `SKILL.md` or equivalent guidance doc |
| `benchmark.models` | Models to benchmark | Valid OpenRouter model IDs |

How it works

1. Discover the callable surface (SDK methods / CLI commands / MCP tools) via tree-sitter or a manifest.
2. Scope the surface with target.scope.include / target.scope.exclude globs.
3. Generate tasks: one prompt per in-scope action, coverage-guaranteed.
4. Benchmark: every configured model attempts every task; a static evaluator checks action calls and arguments.
5. Verdict: PASS/FAIL against two gates (per-model floor, weighted average).
6. Optimize: create a local versioned copy of your SKILL.md (skill-v{N}.md in .skill-optimizer/), mutate it, re-benchmark, accept only if both gates hold, and roll back if not. The target repo's original skill file is never modified.
7. Recommendations: on FAIL, one critic call summarizes what to improve manually.
8. Progress table: after the optimizer finishes, a per-model table shows Baseline → each iteration → Final → Δ so you can see exactly where each model improved.

Configuration reference

See docs/reference/config-schema.md for the full generated config reference — auto-updated at every build.

See docs/reference/errors.md for all error codes, descriptions, and fix instructions.

Interpreting the verdict

Every benchmark run produces one of two verdicts: PASS or FAIL.

Two gates must both be satisfied for a PASS:

benchmark.verdict.perModelFloor (default 0.6): every model must pass at least this fraction of tasks. A single model below the floor fails the run, regardless of the average.
benchmark.verdict.targetWeightedAverage (default 0.7): the weighted average score across all models must reach this threshold.

benchmark.models[].weight (default 1.0): heavier-weighted models count more toward the weighted average. Use higher weights for flagship models you care most about.
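
As a rough illustration of how the two gates combine, here is a minimal TypeScript sketch. It is not the tool's actual implementation: the `ModelResult` and `VerdictGates` shapes and the function names are hypothetical, and it assumes the weighted average is the weight-normalized mean of per-model pass rates.

```typescript
// Hypothetical shapes for illustration only; the field names mirror the
// config keys quoted above but are not taken from the tool's source.
interface ModelResult {
  model: string;    // model identifier, e.g. an OpenRouter model ID
  passRate: number; // fraction of tasks passed, in [0, 1]
  weight?: number;  // benchmark.models[].weight, defaults to 1.0
}

interface VerdictGates {
  perModelFloor: number;         // default 0.6
  targetWeightedAverage: number; // default 0.7
}

const DEFAULT_GATES: VerdictGates = { perModelFloor: 0.6, targetWeightedAverage: 0.7 };

// Assumption: weights are normalized by their sum.
function weightedAverage(results: ModelResult[]): number {
  let weighted = 0;
  let total = 0;
  for (const r of results) {
    const w = r.weight ?? 1.0;
    weighted += w * r.passRate;
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}

function verdict(results: ModelResult[], gates: VerdictGates = DEFAULT_GATES): "PASS" | "FAIL" {
  // Gate 1: every model must reach the floor on its own.
  const floorHolds = results.every((r) => r.passRate >= gates.perModelFloor);
  // Gate 2: the weighted average must reach the target.
  const averageHolds = weightedAverage(results) >= gates.targetWeightedAverage;
  return floorHolds && averageHolds ? "PASS" : "FAIL";
}
```

Under these assumptions, a run where one model scores 0.95 and another scores 0.5 fails even though the equal-weight average is 0.725, because the second model sits below the 0.6 floor.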

The optimizer only accepts a mutation when:

the weighted average improves by at least minImprovement, AND
no model that was above the floor drops below it.
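
The acceptance rule above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `Scores` and `AcceptanceConfig` types are hypothetical, the `minImprovement` value used in the usage note is made up for the example, and "above the floor" is read as "at or above".

```typescript
// Hypothetical types for illustration; not the tool's real API.
type Scores = Record<string, number>; // model ID -> fraction of tasks passed

interface AcceptanceConfig {
  minImprovement: number; // required gain in the weighted average
  perModelFloor: number;  // benchmark.verdict.perModelFloor
}

function weightedAvg(scores: Scores, weights: Map<string, number>): number {
  let weighted = 0;
  let total = 0;
  for (const [model, score] of Object.entries(scores)) {
    const w = weights.get(model) ?? 1.0; // models without a weight count as 1.0
    weighted += w * score;
    total += w;
  }
  return total === 0 ? 0 : weighted / total;
}

function acceptMutation(
  before: Scores,
  after: Scores,
  weights: Map<string, number>,
  cfg: AcceptanceConfig,
): boolean {
  const gain = weightedAvg(after, weights) - weightedAvg(before, weights);
  const improvedEnough = gain >= cfg.minImprovement;
  // A model that was at or above the floor must not fall below it.
  const noFloorRegression = Object.entries(before).every(([model, prev]) => {
    if (prev < cfg.perModelFloor) return true; // already below: not a new regression
    return (after[model] ?? 0) >= cfg.perModelFloor;
  });
  return improvedEnough && noFloorRegression;
}
```

For example, with an illustrative `minImprovement` of 0.02 and a floor of 0.6, moving from `{ a: 0.7, b: 0.5 }` to `{ a: 0.55, b: 0.95 }` is rejected despite the higher average, because model `a` drops below the floor.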

Exit codes: 0 = PASS, 1 = FAIL — usable directly in CI pipelines.

Scope & coverage

Control which actions are benchmarked with target.scope:

target.scope.include (default ["*"]): glob patterns for actions to include.
target.scope.exclude (default []): glob patterns for actions to exclude.

The * wildcard matches any sequence of characters including dots and slashes — it is not limited to a single path segment.

Examples:

"Wallet.*" — includes all Wallet methods
"*.internal*" — excludes anything with "internal" anywhere in the name
"get_*" — includes only getter actions

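To make the wildcard semantics concrete, the sketch below translates a scope pattern into a regular expression in which `*` becomes `.*`, so it crosses dots and slashes. The helper names `globToRegExp` and `inScope` are hypothetical, and the sketch assumes that an exclude match overrides an include match; the tool's own matcher may differ in detail.

```typescript
// Convert a scope pattern to an anchored RegExp.
// Every character except `*` is treated literally.
function globToRegExp(pattern: string): RegExp {
  const body = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*"); // `*` matches any sequence, including dots and slashes
  return new RegExp(`^${body}$`);
}

// Assumption: an action is in scope when it matches at least one include
// pattern and no exclude pattern.
function inScope(action: string, include: string[] = ["*"], exclude: string[] = []): boolean {
  const matchesAny = (patterns: string[]): boolean =>
    patterns.some((p) => globToRegExp(p).test(action));
  return matchesAny(include) && !matchesAny(exclude);
}
```

Under these assumptions, `inScope("Wallet.send", ["Wallet.*"])` is true, while `inScope("Wallet.internalSign", ["*"], ["*.internal*"])` is false because the exclude pattern matches.
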
Task generation is coverage-guaranteed: every in-scope action gets at least one task. If the first generation pass misses any, a targeted retry runs (max 2 iterations). If coverage still fails, an error names the uncovered actions and suggests either fixing SKILL.md guidance or adding them to scope.exclude.
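
The coverage loop described above might look roughly like the following. This is a simplified sketch, not the tool's code: the `Task` shape and the `TaskGenerator` callback are hypothetical, the generator is synchronous here (a real one would call an LLM asynchronously), and each task is assumed to target exactly one action.

```typescript
// Hypothetical task shape: one prompt aimed at one action.
interface Task {
  prompt: string;
  action: string;
}

// Stand-in for the LLM-backed generator; receives the actions to cover.
type TaskGenerator = (actions: string[]) => Task[];

function uncoveredActions(actions: string[], tasks: Task[]): string[] {
  const covered = new Set(tasks.map((t) => t.action));
  return actions.filter((a) => !covered.has(a));
}

function generateWithCoverage(
  actions: string[],
  generate: TaskGenerator,
  maxRetries = 2, // the README states a maximum of 2 retry iterations
): Task[] {
  const tasks = generate(actions); // first generation pass
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const missing = uncoveredActions(actions, tasks);
    if (missing.length === 0) return tasks;
    tasks.push(...generate(missing)); // targeted retry for the gaps only
  }
  const missing = uncoveredActions(actions, tasks);
  if (missing.length > 0) {
    throw new Error(
      `Uncovered actions: ${missing.join(", ")}. ` +
        "Fix the SKILL.md guidance or add them to scope.exclude.",
    );
  }
  return tasks;
}
```

Retrying only the missing actions keeps the extra LLM spend proportional to the gap rather than to the whole surface.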

Cost notes

Rough LLM spend per run:

Baseline benchmark: N models × M tasks LLM calls.
Optimizer iteration: 1 mutation call + N models × M tasks re-benchmark per iteration.
Recommendations: 1 critic call, only on FAIL verdict.

No per-failure LLM calls — feedback is deterministic (structured failure details + patterns + passing/failing diffs).
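
The arithmetic above can be wrapped in a small estimator. This is a back-of-the-envelope sketch with hypothetical names (`RunShape`, `estimateLlmCalls`); it counts only the calls listed above, assumes every optimizer iteration runs, and does not include task-generation calls, which the list does not quantify.

```typescript
// Hypothetical input shape for the estimate.
interface RunShape {
  models: number;     // N: number of configured models
  tasks: number;      // M: number of generated tasks
  iterations: number; // optimizer iterations actually run
  failed: boolean;    // whether the final verdict is FAIL
}

interface CallEstimate {
  baseline: number;
  optimizer: number;
  recommendations: number;
  total: number;
}

function estimateLlmCalls({ models, tasks, iterations, failed }: RunShape): CallEstimate {
  const baseline = models * tasks;                     // N x M
  const optimizer = iterations * (1 + models * tasks); // 1 mutation call + N x M per iteration
  const recommendations = failed ? 1 : 0;              // one critic call, only on FAIL
  return { baseline, optimizer, recommendations, total: baseline + optimizer + recommendations };
}
```

For the `answers.json` example earlier (2 models, `maxTasks` 20, `maxIterations` 5), and assuming all 20 tasks are generated, this gives 40 baseline calls plus at most 5 × 41 = 205 optimizer calls, or 246 in total if the final verdict is FAIL.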

Dependencies

The optimizer's coding agent is powered by @mariozechner/pi-coding-agent — a small OSS wrapper around OpenRouter that handles agent sessions and tool loops. Models are accessed through OpenRouter — you need one API key for everything.

Troubleshooting

Missing OPENROUTER_API_KEY: Set it in your shell before running:

export OPENROUTER_API_KEY=sk-or-...

Dirty git: The optimizer requires a clean git state in the target repo (requireCleanGit: true by default). Commit or stash uncommitted changes before running. Note: the optimizer never writes to the target repo's skill file — it works from local versioned copies in .skill-optimizer/.

maxTasks < scope_size: benchmark.taskGeneration.maxTasks must be >= the number of in-scope actions. Run npx skill-optimizer --dry-run --config ./skill-optimizer.json to see the count without making LLM calls.

Empty scope: target.scope.include matched nothing. Check your glob patterns — remember * matches everything including dots.

Legacy skill-benchmark.json: Rename it to skill-optimizer.json. The loader will tell you if it finds the old name.

Contributing

See CONTRIBUTING.md.
