
feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt #164

Merged

jack-arturo merged 4 commits into verygoodplugins:main from flintfromthebasement:feat/reclassify-script-improvements on May 14, 2026

Conversation

@flintfromthebasement (Contributor) commented May 1, 2026

Why

scripts/reclassify_with_llm.py is a one-shot maintenance tool, but in its current form it's all-or-nothing: there's no way to dry-run it, no way to sample a subset, and no way to point it at anything except OpenAI. That makes it scary to actually run on a real corpus, and impossible to benchmark alternative classification models without forking.

This PR makes the script safe to run against a production corpus, makes it provider-agnostic, and tightens the classification prompt based on a 100-memory benchmark.

What changes

1. CLI flags for safe partial runs

| Flag | Purpose |
| --- | --- |
| `--limit N` | Cap memories processed (must be >= 1) |
| `--sample {head,random}` | How to pick memories when `--limit` is set (default: `head`) |
| `--seed N` | Reproducibility for `--sample random` |
| `--dry-run` | Classify but don't write back to FalkorDB |
| `--yes` | Skip the interactive confirmation prompt |
| `--provider {openai,openrouter}` | Force base URL + key (default: `openai`) |
| `--model M` | Override `CLASSIFICATION_MODEL` per-run |
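A minimal sketch of how these flags could be wired with argparse; the `positive_int` helper mirrors the `--limit >= 1` rule above, but the helper name, defaults, and exact wiring are illustrative rather than the merged code:

```python
import argparse

def positive_int(value: str) -> int:
    # Reject --limit 0 and negative values at parse time.
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer >= 1")
    return n

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Reclassify memories with an LLM")
    p.add_argument("--limit", type=positive_int, help="cap memories processed (>= 1)")
    p.add_argument("--sample", choices=["head", "random"], default="head",
                   help="how to pick memories when --limit is set")
    p.add_argument("--seed", type=int, help="seed for --sample random")
    p.add_argument("--dry-run", action="store_true",
                   help="classify but don't write back to FalkorDB")
    p.add_argument("--yes", action="store_true",
                   help="skip the interactive confirmation prompt")
    p.add_argument("--provider", choices=["openai", "openrouter"], default="openai")
    p.add_argument("--model", help="override CLASSIFICATION_MODEL for this run")
    return p
```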

Typical workflow now:

```bash
# 1. Sanity-check on 100 random memories, no writes
./scripts/reclassify_with_llm.py --limit 100 --sample random --seed 42 --dry-run

# 2. If the distribution looks right, commit to the full pass
./scripts/reclassify_with_llm.py --yes
```

For head sampling the cap is pushed into Cypher, so partial dry-runs no longer materialize the full fallback corpus.
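To make that concrete, here is a sketch of the query-selection logic under assumed helper and client names; the Cypher shape and FalkorDB client calls are illustrative, not the merged implementation:

```python
import random

def fetch_memory_ids(graph, limit=None, sample="head", seed=None):
    if limit is not None and sample == "head":
        # Head sampling: push the cap into Cypher so the full corpus is
        # never materialized for a partial run.
        result = graph.query(f"MATCH (m:Memory) RETURN m.id LIMIT {int(limit)}")
        return [row[0] for row in result.result_set]

    # Random sampling (or an unlimited run): fetch all ids, sample client-side.
    result = graph.query("MATCH (m:Memory) RETURN m.id")
    ids = [row[0] for row in result.result_set]
    if limit is not None and sample == "random":
        rng = random.Random(seed)  # --seed keeps random runs reproducible
        ids = rng.sample(ids, min(limit, len(ids)))
    return ids
```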

2. OpenRouter / OpenAI-compatible provider support

Adds three env vars: OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY (documented in .env.example and docs/ENVIRONMENT_VARIABLES.md). Same script can now target OpenRouter, LiteLLM, vLLM, Azure, or any OpenAI-compatible endpoint without code changes.

--provider forces the canonical base URL for its provider; provider-specific keys never fall back across providers (so OPENAI_API_KEY can't leak to a third-party endpoint).
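Roughly how that resolution could look; the env var names follow the PR, but the function itself and the `CLASSIFICATION_API_KEY` fallback order are a sketch, not the merged code:

```python
import os

CANONICAL_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}

def resolve_endpoint(provider: str) -> tuple[str, str]:
    # An explicit --provider pins the canonical base URL, and provider keys
    # never cross providers: OPENAI_API_KEY is only consulted for OpenAI.
    base_url = CANONICAL_BASE_URLS[provider]
    provider_key = "OPENROUTER_API_KEY" if provider == "openrouter" else "OPENAI_API_KEY"
    api_key = os.getenv(provider_key) or os.getenv("CLASSIFICATION_API_KEY")
    if not api_key:
        raise SystemExit(
            f"Set {provider_key} (or CLASSIFICATION_API_KEY) for --provider {provider}"
        )
    return base_url, api_key
```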

Includes a tolerant JSON extractor for models that don't honor response_format=json (Gemini families on OpenRouter return prose-wrapped JSON and otherwise crash the strict parser). response_format is now gated on the selected endpoint, not the model name.
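A tolerant extractor along these lines would cover the prose-wrapped case; this is a sketch of the technique, not the merged function:

```python
import json
import re

def extract_json_object(text: str) -> dict:
    try:
        # Fast path: the model honored response_format and returned bare JSON.
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: strip code fences, then take the outermost {...} span.
    cleaned = re.sub(r"```(?:json)?", "", text)
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start : end + 1])
```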

3. Tightened SYSTEM_PROMPT

The prior prompt was a loose 7-bullet type list. The new prompt has strict definitions, keyword cues, and explicit priority rules ("Fact:" and descriptive statements go to Context, not Insight; chat/DM fragments aren't Decisions just because they contain "decided").
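For a sense of the shape only, a condensed excerpt of what such a prompt could look like, paraphrased from the rules above rather than copied from the merged SYSTEM_PROMPT:

```python
# Illustrative only: strict definitions plus priority rules, paraphrased from
# this PR's description, not the actual merged prompt text.
SYSTEM_PROMPT = """You classify a memory into exactly one type.

Definitions (abridged):
- Context: descriptive statements and "Fact:" entries about people, projects, or state.
- Decision: a deliberate choice with an outcome. A chat/DM fragment is NOT a
  Decision just because it contains the word "decided".
- Insight: a non-obvious conclusion or lesson learned. Never use as a catch-all.

Priority rules:
1. "Fact:" and descriptive statements -> Context, never Insight.
2. Recurring behavior -> Pattern or Habit, not Insight.

Respond with JSON only: {"type": "<Type>"}"""
```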

Empirical impact on a 100-memory sample (Gemini 3.1 Flash-Lite via OpenRouter):

| Type | Before (loose) | After (strict) |
| --- | --- | --- |
| Insight | 56% (catch-all) | 8% |
| False Decisions on DM/session fragments | several | 0 |
| Context, Pattern, Habit | underused | distribution closer to intent |

Out of scope

  • The startup-tick guard from the same flint-branch commit (automem/consolidation/runtime_scheduler.py) is not in this PR — it's a separate concern (FalkorDB RDB-loading race at init) and will land as its own PR.
  • discover_creative_associations (the rule-based "dreaming" edge inference in consolidation.py) is unchanged here. There's an open thought to LLM-replace that with the same Gemini 3.1 Flash-Lite + tight-prompt pattern — happy to file a separate issue if it's interesting to benchmark.

Test plan

  • ./scripts/reclassify_with_llm.py --help shows all new flags
  • --dry-run --limit 10 runs against a dev FalkorDB without writing
  • --provider openrouter --model google/gemini-3.1-flash-lite-preview --limit 10 --sample random --dry-run works end-to-end
  • --limit 0 and --limit -5 are rejected at CLI parse time
  • Default behavior (no flags, OpenAI) is unchanged from prior script

feat(scripts): safer reclassify_with_llm.py with provider flags + tighter prompt

Three improvements to scripts/reclassify_with_llm.py to make it safe to
run on a real corpus and easy to retarget at different LLM providers.

1. CLI flags for safe partial runs:
   - --limit N        cap the number of memories processed
   - --sample N       random sample N memories (instead of first N)
   - --seed N         reproducibility for sampled runs
   - --dry-run        classify but don't write back to FalkorDB
   - --yes            skip the interactive confirmation prompt
   - --provider P     openai | openrouter (default: openai)
   - --model M        override CLASSIFICATION_MODEL per-run

   Lets you do a 100-memory sanity-check pass before committing to a
   full reclassification across thousands of records. The prior version
   was all-or-nothing.

2. OpenRouter / OpenAI-compatible support:
   - Adds OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY
     env vars so the same script can target any OpenAI-compatible endpoint
     (OpenRouter, LiteLLM, vLLM, Azure, etc.) without code changes.
   - Adds a tolerant JSON extractor for models that don't honor
     response_format=json (e.g. Gemini families on OpenRouter), which
     otherwise return prose-wrapped JSON and crash the strict parser.

3. Tightened SYSTEM_PROMPT:
   - Replaces the loose 7-bullet type list with strict definitions,
     keyword cues, and explicit priority rules ("Fact:" / descriptive
     statements go to Context, not Insight; chat/DM fragments aren't
     Decisions just because they contain the word "decided").
   - Empirical impact on a 100-memory sample using Gemini 3.1 Flash-Lite:
     - Insight share: 56% → 8% (was being used as a catch-all)
     - False Decision calls on session/DM fragments eliminated
     - Pattern, Context, Habit usage closer to the intended distribution

The script remains a one-shot maintenance tool — typically run after a
model swap, prompt change, or large bulk import — not a recurring task.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
jack-arturo pushed a commit that referenced this pull request May 1, 2026
…B load race (#165)

## Why

When `init_consolidation_scheduler()` runs a tick **immediately** after
spawning the worker thread, FalkorDB can still be loading its RDB
snapshot from disk. Every Redis command during that window returns:

> `LOADING Redis is loading the dataset in memory`

The eager tick catches the error, logs it, and bumps `last_run`
timestamps — silently skipping the day's decay / creative / cluster work
until tomorrow. The bigger the corpus, the longer the RDB load, the more
reliably this fires. On any restart-on-deploy host (Railway, Docker,
systemd) with a few thousand memories, it hits every deploy.

## What changes

One line in `automem/consolidation/runtime_scheduler.py:100` — drop the
eager `run_consolidation_tick_fn()` call after starting the worker
thread, and add a comment explaining why.

```diff
     state.consolidation_thread.start()
-    run_consolidation_tick_fn()
+    # Skip eager first tick: FalkorDB may still be loading its RDB snapshot at
+    # startup and the "Redis is loading the dataset in memory" error poisons
+    # the day's decay/creative run. The worker loop will fire its first tick
+    # after consolidation_tick_seconds, which is plenty of warm-up time.
     logger.info("Consolidation scheduler initialized")
```

## Why this is safe

- The worker loop still fires within `CONSOLIDATION_TICK_SECONDS`
(default 3600s = 1h). For decay/creative/cluster intervals measured in
days, a one-tick startup delay is invisible.
- The scheduler is timestamp-driven (`last_run` per task), not
edge-triggered. Missed intervals get picked up by the next loop
iteration, so nothing is "lost" by deferring (see the sketch after this list).
- Failure mode flips from "silent broken run" to "no run yet, will run
shortly" — strictly better.
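A minimal sketch of that timestamp-driven check; the names here are assumptions for illustration, not the actual `runtime_scheduler.py` internals:

```python
import time

def run_consolidation_tick(state, tasks):
    # Each task compares elapsed time since its own last_run against its
    # interval, so a deferred or missed window is simply picked up on the
    # next loop iteration rather than being lost.
    now = time.time()
    for name, task in tasks.items():
        if now - state.last_run.get(name, 0.0) >= task.interval_seconds:
            task.run()
            state.last_run[name] = now
```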

## Out of scope

- A more involved fix would actively probe FalkorDB readiness with
retries before the first tick. That's a bigger change and arguably
belongs at the FalkorDB-client layer, not here. This PR is the minimal,
low-risk fix.
- The `discover_creative_associations` / clustering improvements live in
#163 and #164.

## Test plan

- [ ] Service starts cleanly with no eager tick log entry
- [ ] Worker loop fires its first tick after
`CONSOLIDATION_TICK_SECONDS`
- [ ] Forcing a tick via `POST /consolidate` still works immediately
- [ ] On a restart with a large RDB, no `LOADING Redis is loading the
dataset in memory` errors appear in consolidation logs

Co-authored-by: Claude Opus 4.7 <[email protected]>

Copilot AI left a comment


Pull request overview

This PR updates the one-shot LLM reclassification script to support safer partial runs and OpenAI-compatible providers while tightening the classification prompt.

Changes:

  • Adds CLI flags for dry runs, limits/sampling, confirmation skipping, provider selection, and model override.
  • Adds OpenRouter/custom endpoint configuration and tolerant JSON extraction for prose-wrapped model output.
  • Replaces the classification system prompt with stricter type definitions and priority rules.

Review comment threads on scripts/reclassify_with_llm.py (10 threads, 7 marked Outdated)
@jack-arturo (Member) commented

@copilot apply changes based on the comments in this thread

jack-arturo and others added 2 commits May 13, 2026 18:33
- Validate --limit >= 1 at argparse parse time (positive_int)
- Stop falling back across providers — OPENAI_API_KEY never leaks to non-OpenAI endpoints
- Push --limit into Cypher for head sampling (no full-corpus materialization)
- Force canonical base URL when --provider is explicit
- Gate response_format=json_object on endpoint, not model name
- Sanitize JSON parse exception so corpus content can't leak via logs
- Provider-aware missing-API-key error
- Disclaim OpenAI gpt-4o-mini cost estimate when model differs
- Document OPENROUTER_API_KEY, CLASSIFICATION_BASE_URL, CLASSIFICATION_API_KEY in .env.example, ENVIRONMENT_VARIABLES.md, and the script docstring

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jack-arturo merged commit a742602 into verygoodplugins:main on May 14, 2026
7 checks passed
