Autonomous research on AI long-term memory systems, inspired by karpathy/autoresearch.
An AI agent autonomously modifies a memory system implementation (`memory_system.py`), runs an evaluation benchmark against 30 structured scenarios, and keeps or discards each change based on a quality score assigned by an LLM judge.
- `prepare.py` — FIXED: test scenarios, LLM judge, evaluation harness, metrics
- `memory_system.py` — AGENT MODIFIES: storage, retrieval, update, consolidation
- `program.md` — HUMAN EDITS: agent instructions (the "research org code")
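To make the agent's target concrete, here is a minimal sketch of what a `memory_system.py` interface could look like. The class name, method names, and naive retrieval/consolidation logic are illustrative assumptions for this README, not the actual API in the repository.

```python
# Illustrative sketch only; the real memory_system.py may differ.
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    metadata: dict = field(default_factory=dict)


class MemorySystem:
    """Hypothetical long-term memory: storage, retrieval, update, consolidation."""

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def store(self, text: str, **metadata) -> None:
        # Storage: keep the raw observation plus any tags the caller provides.
        self.items.append(MemoryItem(text=text, metadata=metadata))

    def retrieve(self, query: str, k: int = 5) -> list[MemoryItem]:
        # Retrieval: naive keyword overlap; the agent would likely replace this
        # with something smarter (embeddings, recency weighting, etc.).
        q = set(query.lower().split())
        scored = [(len(q & set(m.text.lower().split())), m) for m in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]

    def update(self, index: int, text: str) -> None:
        # Update: overwrite an existing memory in place.
        self.items[index].text = text

    def consolidate(self) -> None:
        # Consolidation: drop exact duplicates as a stand-in for summarization.
        seen, unique = set(), []
        for m in self.items:
            if m.text not in seen:
                seen.add(m.text)
                unique.append(m)
        self.items = unique
```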
```bash
cd autoresearch-memory
uv sync
uv run prepare.py  # run baseline evaluation (uses the local `claude` CLI as judge)
```

The evaluation produces a single `memory_score` (0-10, higher is better):

```
memory_score = 0.3*relevance + 0.3*correctness + 0.25*completeness + 0.15*(10 - noise)
```
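As a sanity check, the aggregation can be computed directly from the four judge-assigned sub-scores. The helper below is an assumption about how that arithmetic might look, not a copy of the code in `prepare.py`.

```python
def memory_score(relevance: float, correctness: float,
                 completeness: float, noise: float) -> float:
    """Aggregate four 0-10 judge sub-scores into a single 0-10 score.

    Noise is inverted: noisier retrieval (more irrelevant material surfaced)
    lowers the overall score.
    """
    return (0.3 * relevance
            + 0.3 * correctness
            + 0.25 * completeness
            + 0.15 * (10 - noise))


# Example: strong relevance/correctness but noisy output.
# 0.3*9 + 0.3*8 + 0.25*7 + 0.15*(10-6) = 2.7 + 2.4 + 1.75 + 0.6 = 7.45
assert abs(memory_score(9, 8, 7, 6) - 7.45) < 1e-9
```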
Point an AI agent (e.g. Claude Code) at `program.md` and let it go. It will autonomously experiment with `memory_system.py`, running the eval after each change and keeping only the changes that improve the score; the keep/discard loop is sketched below.
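A rough picture of one iteration of that loop, under stated assumptions: the helper names (`run_eval`, `experiment_step`) and the `memory_score:` output line are hypothetical stand-ins for whatever the agent actually does through its tooling.

```python
import shutil
import subprocess


def run_eval() -> float:
    """Hypothetical helper: run the benchmark and parse memory_score from stdout."""
    out = subprocess.run(["uv", "run", "prepare.py"], capture_output=True, text=True)
    # Assumes the harness prints a line like "memory_score: 7.45".
    for line in out.stdout.splitlines():
        if line.startswith("memory_score:"):
            return float(line.split(":", 1)[1])
    raise RuntimeError("memory_score not found in eval output")


def experiment_step(best_score: float) -> float:
    """One keep/discard iteration over memory_system.py."""
    shutil.copy("memory_system.py", "memory_system.py.bak")  # snapshot before editing
    # ... agent edits memory_system.py here ...
    score = run_eval()
    if score > best_score:
        return score                                          # keep the change
    shutil.copy("memory_system.py.bak", "memory_system.py")   # revert the change
    return best_score
```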