
feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt #164

Merged

jack-arturo merged 4 commits into verygoodplugins:main from flintfromthebasement:feat/reclassify-script-improvements on May 14, 2026

Conversation

@flintfromthebasement (Contributor) commented May 1, 2026

Why

scripts/reclassify_with_llm.py is a one-shot maintenance tool, but in its current form it's all-or-nothing: there's no way to dry-run it, no way to sample a subset, and no way to point it at anything except OpenAI. That makes it scary to actually run on a real corpus, and impossible to benchmark alternative classification models without forking.

This PR makes the script safe to run against a production corpus, makes it provider-agnostic, and tightens the classification prompt based on a 100-memory benchmark.

What changes

1. CLI flags for safe partial runs

| Flag | Purpose |
| --- | --- |
| `--limit N` | Cap memories processed (must be >= 1) |
| `--sample {head,random}` | How to pick memories when `--limit` is set (default: `head`) |
| `--seed N` | Reproducibility for `--sample random` |
| `--dry-run` | Classify but don't write back to FalkorDB |
| `--yes` | Skip the interactive confirmation prompt |
| `--provider {openai,openrouter}` | Force base URL + key (default: `openai`) |
| `--model M` | Override `CLASSIFICATION_MODEL` per-run |
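A minimal sketch of how these flags could be wired with argparse; the `positive_int` helper mirrors the `--limit >= 1` rule above, but the helper name, defaults, and exact wiring are illustrative rather than the merged code:

```python
import argparse

def positive_int(value: str) -> int:
    # Reject --limit 0 and negative values at parse time.
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer >= 1")
    return n

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Reclassify memories with an LLM")
    p.add_argument("--limit", type=positive_int, help="cap memories processed (>= 1)")
    p.add_argument("--sample", choices=["head", "random"], default="head",
                   help="how to pick memories when --limit is set")
    p.add_argument("--seed", type=int, help="seed for --sample random")
    p.add_argument("--dry-run", action="store_true",
                   help="classify but don't write back to FalkorDB")
    p.add_argument("--yes", action="store_true",
                   help="skip the interactive confirmation prompt")
    p.add_argument("--provider", choices=["openai", "openrouter"], default="openai")
    p.add_argument("--model", help="override CLASSIFICATION_MODEL for this run")
    return p
```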

Typical workflow now:

```bash
# 1. Sanity-check on 100 random memories, no writes
./scripts/reclassify_with_llm.py --limit 100 --sample random --seed 42 --dry-run

# 2. If the distribution looks right, commit to the full pass
./scripts/reclassify_with_llm.py --yes
```

For head sampling the cap is pushed into Cypher, so partial dry-runs no longer materialize the full fallback corpus.
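To make that concrete, here is a sketch of the query-selection logic under assumed helper and client names; the Cypher shape and FalkorDB client calls are illustrative, not the merged implementation:

```python
import random

def fetch_memory_ids(graph, limit=None, sample="head", seed=None):
    if limit is not None and sample == "head":
        # Head sampling: push the cap into Cypher so the full corpus is
        # never materialized for a partial run.
        result = graph.query(f"MATCH (m:Memory) RETURN m.id LIMIT {int(limit)}")
        return [row[0] for row in result.result_set]

    # Random sampling (or an unlimited run): fetch all ids, sample client-side.
    result = graph.query("MATCH (m:Memory) RETURN m.id")
    ids = [row[0] for row in result.result_set]
    if limit is not None and sample == "random":
        rng = random.Random(seed)  # --seed keeps random runs reproducible
        ids = rng.sample(ids, min(limit, len(ids)))
    return ids
```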

2. OpenRouter / OpenAI-compatible provider support

Adds three env vars: OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY (documented in .env.example and docs/ENVIRONMENT_VARIABLES.md). Same script can now target OpenRouter, LiteLLM, vLLM, Azure, or any OpenAI-compatible endpoint without code changes.

--provider forces the canonical base URL for its provider; provider-specific keys never fall back across providers (so OPENAI_API_KEY can't leak to a third-party endpoint).
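Roughly how that resolution could look; the env var names follow the PR, but the function itself and the `CLASSIFICATION_API_KEY` fallback order are a sketch, not the merged code:

```python
import os

CANONICAL_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}

def resolve_endpoint(provider: str) -> tuple[str, str]:
    # An explicit --provider pins the canonical base URL, and provider keys
    # never cross providers: OPENAI_API_KEY is only consulted for OpenAI.
    base_url = CANONICAL_BASE_URLS[provider]
    provider_key = "OPENROUTER_API_KEY" if provider == "openrouter" else "OPENAI_API_KEY"
    api_key = os.getenv(provider_key) or os.getenv("CLASSIFICATION_API_KEY")
    if not api_key:
        raise SystemExit(
            f"Set {provider_key} (or CLASSIFICATION_API_KEY) for --provider {provider}"
        )
    return base_url, api_key
```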

Includes a tolerant JSON extractor for models that don't honor response_format=json (Gemini families on OpenRouter return prose-wrapped JSON and otherwise crash the strict parser). response_format is now gated on the selected endpoint, not the model name.
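A tolerant extractor along these lines would cover the prose-wrapped case; this is a sketch of the technique, not the merged function:

```python
import json
import re

def extract_json_object(text: str) -> dict:
    try:
        # Fast path: the model honored response_format and returned bare JSON.
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: strip code fences, then take the outermost {...} span.
    cleaned = re.sub(r"```(?:json)?", "", text)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start : end + 1])
```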

3. Tightened SYSTEM_PROMPT

The prior prompt was a loose 7-bullet type list. The new prompt has strict definitions, keyword cues, and explicit priority rules ("Fact:" and descriptive statements go to Context, not Insight; chat/DM fragments aren't Decisions just because they contain "decided").
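For a sense of the shape only, a condensed excerpt of what such a prompt could look like, paraphrased from the rules above rather than copied from the merged SYSTEM_PROMPT:

```python
# Illustrative only: strict definitions plus priority rules, paraphrased from
# this PR's description, not the actual merged prompt text.
SYSTEM_PROMPT = """You classify a memory into exactly one type.

Definitions (abridged):
- Context: descriptive statements and "Fact:" entries about people, projects, or state.
- Decision: a deliberate choice with an outcome. A chat/DM fragment is NOT a
  Decision just because it contains the word "decided".
- Insight: a non-obvious conclusion or lesson learned. Never use as a catch-all.

Priority rules:
1. "Fact:" and descriptive statements -> Context, never Insight.
2. Recurring behavior -> Pattern or Habit, not Insight.

Respond with JSON only: {"type": "<Type>"}"""
```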

Empirical impact on a 100-memory sample (Gemini 3.1 Flash-Lite via OpenRouter):

| Type | Before (loose) | After (strict) |
| --- | --- | --- |
| Insight | 56% (catch-all) | 8% |
| False Decisions on DM/session fragments | several | 0 |
| Context, Pattern, Habit | underused | distribution closer to intent |

Out of scope

  • The startup-tick guard from the same flint-branch commit (automem/consolidation/runtime_scheduler.py) is not in this PR — it's a separate concern (FalkorDB RDB-loading race at init) and will land as its own PR.
  • discover_creative_associations (the rule-based "dreaming" edge inference in consolidation.py) is unchanged here. There's an open thought to LLM-replace that with the same Gemini 3.1 Flash-Lite + tight-prompt pattern — happy to file a separate issue if it's interesting to benchmark.

Test plan

  • ./scripts/reclassify_with_llm.py --help shows all new flags
  • --dry-run --limit 10 runs against a dev FalkorDB without writing
  • --provider openrouter --model google/gemini-3.1-flash-lite-preview --limit 10 --sample random --dry-run works end-to-end
  • --limit 0 and --limit -5 are rejected at CLI parse time
  • Default behavior (no flags, OpenAI) is unchanged from prior script

feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt

Three improvements to scripts/reclassify_with_llm.py to make it safe to
run on a real corpus and easy to retarget at different LLM providers.

1. CLI flags for safe partial runs:
   - --limit N        cap the number of memories processed
   - --sample N       random sample N memories (instead of first N)
   - --seed N         reproducibility for sampled runs
   - --dry-run        classify but don't write back to FalkorDB
   - --yes            skip the interactive confirmation prompt
   - --provider P     openai | openrouter (default: openai)
   - --model M        override CLASSIFICATION_MODEL per-run

   Lets you do a 100-memory sanity-check pass before committing to a
   full reclassification across thousands of records. The prior version
   was all-or-nothing.

2. OpenRouter / OpenAI-compatible support:
   - Adds OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY
     env vars so the same script can target any OpenAI-compatible endpoint
     (OpenRouter, LiteLLM, vLLM, Azure, etc.) without code changes.
   - Adds a tolerant JSON extractor for models that don't honor
     response_format=json (e.g. Gemini families on OpenRouter), which
     otherwise return prose-wrapped JSON and crash the strict parser.

3. Tightened SYSTEM_PROMPT:
   - Replaces the loose 7-bullet type list with strict definitions,
     keyword cues, and explicit priority rules ("Fact:" / descriptive
     statements go to Context, not Insight; chat/DM fragments aren't
     Decisions just because they contain the word "decided").
   - Empirical impact on a 100-memory sample using Gemini 3.1 Flash-Lite:
     - Insight share: 56% → 8% (was being used as a catch-all)
     - False Decision calls on session/DM fragments eliminated
     - Pattern, Context, Habit usage closer to the intended distribution

The script remains a one-shot maintenance tool — typically run after a
model swap, prompt change, or large bulk import — not a recurring task.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
jack-arturo pushed a commit that referenced this pull request May 1, 2026
…B load race (#165)

## Why

When `init_consolidation_scheduler()` runs a tick **immediately** after
spawning the worker thread, FalkorDB can still be loading its RDB
snapshot from disk. Every Redis command during that window returns:

> `LOADING Redis is loading the dataset in memory`

The eager tick catches the error, logs it, and bumps `last_run`
timestamps — silently skipping the day's decay / creative / cluster work
until tomorrow. The bigger the corpus, the longer the RDB load, the more
reliably this fires. On any restart-on-deploy host (Railway, Docker,
systemd) with a few thousand memories, it hits every deploy.

## What changes

One line in `automem/consolidation/runtime_scheduler.py:100` — drop the
eager `run_consolidation_tick_fn()` call after starting the worker
thread, and add a comment explaining why.

```diff
     state.consolidation_thread.start()
-    run_consolidation_tick_fn()
+    # Skip eager first tick: FalkorDB may still be loading its RDB snapshot at
+    # startup and the "Redis is loading the dataset in memory" error poisons
+    # the day's decay/creative run. The worker loop will fire its first tick
+    # after consolidation_tick_seconds, which is plenty of warm-up time.
     logger.info("Consolidation scheduler initialized")
```

## Why this is safe

- The worker loop still fires within `CONSOLIDATION_TICK_SECONDS`
(default 3600s = 1h). For decay/creative/cluster intervals measured in
days, a one-tick startup delay is invisible.
- The scheduler is timestamp-driven (`last_run` per task), not
edge-triggered. Missed intervals get picked up by the next loop
iteration, so nothing is "lost" by deferring (see the sketch after this list).
- Failure mode flips from "silent broken run" to "no run yet, will run
shortly" — strictly better.
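A minimal sketch of that timestamp-driven check; the names here are assumptions for illustration, not the actual `runtime_scheduler.py` internals:

```python
import time

def run_consolidation_tick(state, tasks):
    # Each task compares elapsed time since its own last_run against its
    # interval, so a deferred or missed window is simply picked up on the
    # next loop iteration rather than being lost.
    now = time.time()
    for name, task in tasks.items():
        if now - state.last_run.get(name, 0.0) >= task.interval_seconds:
            task.run()
            state.last_run[name] = now
```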

## Out of scope

- A more involved fix would actively probe FalkorDB readiness with
retries before the first tick. That's a bigger change and arguably
belongs at the FalkorDB-client layer, not here. This PR is the minimal,
low-risk fix.
- The `discover_creative_associations` / clustering improvements live in
#163 and #164.

## Test plan

- [ ] Service starts cleanly with no eager tick log entry
- [ ] Worker loop fires its first tick after
`CONSOLIDATION_TICK_SECONDS`
- [ ] Forcing a tick via `POST /consolidate` still works immediately
- [ ] On a restart with a large RDB, no `LOADING Redis is loading the
dataset in memory` errors appear in consolidation logs

Co-authored-by: Claude Opus 4.7 <[email protected]>

Copilot AI left a comment


Pull request overview

This PR updates the one-shot LLM reclassification script to support safer partial runs and OpenAI-compatible providers while tightening the classification prompt.

Changes:

  • Adds CLI flags for dry runs, limits/sampling, confirmation skipping, provider selection, and model override.
  • Adds OpenRouter/custom endpoint configuration and tolerant JSON extraction for prose-wrapped model output.
  • Replaces the classification system prompt with stricter type definitions and priority rules.

Review comment threads on scripts/reclassify_with_llm.py (10 threads, 7 marked Outdated)
@jack-arturo (Member) commented

@copilot apply changes based on the comments in this thread

jack-arturo and others added 2 commits May 13, 2026 18:33
- Validate --limit >= 1 at argparse parse time (positive_int)
- Stop falling back across providers — OPENAI_API_KEY never leaks to non-OpenAI endpoints
- Push --limit into Cypher for head sampling (no full-corpus materialization)
- Force canonical base URL when --provider is explicit
- Gate response_format=json_object on endpoint, not model name
- Sanitize JSON parse exception so corpus content can't leak via logs
- Provider-aware missing-API-key error
- Disclaim OpenAI gpt-4o-mini cost estimate when model differs
- Document OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY in .env.example, ENVIRONMENT_VARIABLES.md, and the script docstring

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jack-arturo merged commit a742602 into verygoodplugins:main on May 14, 2026
7 checks passed
