Context Compaction & Sessions

Overview

Lis uses rolling async context compaction to manage long conversations. When context grows too large, older messages are summarized into a compact block while the conversation continues uninterrupted. Each compaction creates a new session, preserving the summary and an embedding for future semantic search.

Design Decisions

Sessions — not merged summaries

Each session is a self-contained segment of a chat with its own summary. Full compaction always creates a new session. Summaries are never merged into each other.

  • Compaction → finalize current session (summary + embedding) → reassign kept messages to the new session → set chat.CurrentSessionId to the new session (sketched after this list).
  • /new or /clear → finalize session (async summary) → create new session. Messages stay with their original session.
  • /resume → reopen a previous session (full context) or create a new session with the target's summary injected (fallback).
  • Context for continuations: the parent session's summary is injected when building context, giving the AI continuity without merging.
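
A minimal sketch of that compaction path, assuming EF Core; SessionEntity and chat.CurrentSessionId are names from this doc, while ParentSessionId, the summarizer/embedder services, and the DbContext shape are assumptions:

```csharp
// Sketch only: assumes EF Core and hypothetical service/property names.
var oldSession = chat.CurrentSession;
oldSession.Summary   = await summarizer.SummarizeAsync(oldSession, ct);   // finalize: summary
oldSession.Embedding = await embedder.EmbedAsync(oldSession.Summary, ct); // finalize: embedding

var newSession = new SessionEntity { ChatId = chat.Id, ParentSessionId = oldSession.Id };
db.Sessions.Add(newSession);
await db.SaveChangesAsync(ct);          // newSession.Id is assigned here

chat.CurrentSessionId = newSession.Id;  // the conversation continues in the new session
await db.SaveChangesAsync(ct);
// Kept messages are reassigned to newSession separately (see "Message ownership" below).
```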

Message ownership — session_id

Each message has a NOT NULL session_id FK to session. Messages are loaded via WHERE m.SessionId == session.Id.

During compaction, messages after the split point are atomically reassigned to the new session via UPDATE message SET session_id = newSession.Id WHERE session_id = oldSession.Id AND id > splitMessageId.
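
The same statement expressed as an EF Core 7+ bulk update (the DbSet and property names are assumptions):

```csharp
// Atomically reassign every kept message after the split point to the new session.
// Equivalent to the UPDATE above; DbSet and property names are assumptions.
await db.Messages
    .Where(m => m.SessionId == oldSession.Id && m.Id > splitMessageId)
    .ExecuteUpdateAsync(set => set.SetProperty(m => m.SessionId, newSession.Id), ct);
```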

No token estimation — ever

Token counts come exclusively from provider response metadata (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens). The old EstimateTokens heuristic (chars / 3.5) is removed entirely.

  • Compaction trigger: based on actual input_tokens from the last response.
  • Pre-send check: for very large contexts, CountMessageTokensAsync (Anthropic's free endpoint) can validate before sending.
  • Per-message storage: actual token counts are stored as columns on MessageEntity, not buried in SkContent JSON (see the sketch below).
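
A sketch of the usage DTO those columns mirror; the property names are assumed to follow the provider fields one-to-one:

```csharp
// Assumed shape: properties mirror the provider usage fields and are copied
// onto MessageEntity columns as-is (no estimation anywhere).
public sealed record TokenUsage(
    int InputTokens,               // input_tokens
    int OutputTokens,              // output_tokens
    int CacheReadInputTokens,      // cache_read_input_tokens
    int CacheCreationInputTokens); // cache_creation_input_tokens
```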

Progressive compaction — tool pruning first

Two stages, applied in order (a trigger sketch follows the list):

  1. Tool output pruning — triggered when total tool result tokens (after ToolsPrunedThroughId) exceed ToolPruneThreshold. Replaces tool output with the bare function name (preserving FunctionResultContent metadata and CallId). Non-destructive (DB unchanged). Runs ONCE, stays stable → preserves prompt cache.

  2. Full compaction — triggered when input_tokens from the last response exceeds CompactionThreshold (default: 80% of ContextBudget). Keeps KeepRecentTokens of recent messages. Everything before that is summarized by the compaction LLM. Creates a new session with kept messages reassigned atomically.
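
A minimal sketch of both triggers; the option, service, and field names are assumptions that mirror the env vars documented below:

```csharp
// Sketch only: option, service, and field names are assumptions.

// Stage 1: tool output pruning. Non-destructive; later context builds replace
// pruned tool results with the bare function name.
if (toolResultTokensSincePrune >= options.ToolPruneThreshold)
    session.ToolsPrunedThroughId = lastToolResultMessageId;

// Stage 2: full compaction, fired asynchronously from actual input_tokens
// (never from an estimate). Default threshold: 80% of ContextBudget.
if (lastUsage.InputTokens >= options.CompactionThreshold)
    _ = compactionService.CompactAsync(chat.Id, ct);
```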

Tool summarization policy

Global env LIS_TOOL_SUMMARIZATION_POLICY:

  • auto — per-tool [ToolSummarization] attribute. Default: Prune.
  • keep_all — never prune. Safeguard: force-prunes if context blows past threshold.
  • keep_none — always prune everything.

Per-tool attribute SummarizationPolicy (usage illustrated after the list):

  • Prune — replace with one-liner.
  • Summarize — include full output in compaction prompt for LLM summarization (e.g., book content with emotional value).
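
A sketch of how a tool might opt into Summarize; the attribute and enum names come from this doc, but the constructor shape and the tools themselves are illustrative:

```csharp
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Illustrative only: ToolSummarization and SummarizationPolicy are named in this doc,
// but their exact signatures and these example tools are assumptions.
public sealed class BookTools
{
    // Full output goes into the compaction prompt so the LLM can summarize it.
    [ToolSummarization(SummarizationPolicy.Summarize)]
    public Task<string> ReadChapterAsync(string path) => File.ReadAllTextAsync(path);

    // No attribute: under the auto policy this defaults to Prune, so the result is
    // replaced by the bare function name once the prune threshold is hit.
    public Task<string> ListNotesAsync(string dir)
        => Task.FromResult(string.Join('\n', Directory.EnumerateFiles(dir)));
}
```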

Prompt caching — 4 breakpoints

Anthropic's prompt caching (up to 4 cache_control breakpoints):

  1. After system prompt — stable, rarely changes.
  2. After session summary (if exists) — changes only on compaction.
  3. At tool prune boundary — ensures pruning doesn't invalidate later messages' cache.
  4. Top-level automatic — cache_control on request body, auto-caches growing conversation.

Implemented via CacheControlHandler, a DelegatingHandler that injects cache_control markers into the HTTP request JSON.
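
A skeleton of such a handler, assuming System.Text.Json; only breakpoints 1 and 4 are shown, and the real CacheControlHandler's breakpoint selection is not reproduced here:

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json.Nodes;
using System.Threading;
using System.Threading.Tasks;

// Skeleton only: shows where cache_control markers get injected. Breakpoints 2 and 3
// (session summary, tool prune boundary) need message positions and are elided here.
public sealed class CacheControlHandlerSketch : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        if (request.Content is not null)
        {
            var body = JsonNode.Parse(await request.Content.ReadAsStringAsync(ct))!;

            // Breakpoint 1: last system content block (the stable system prompt).
            if (body["system"] is JsonArray system && system.Count > 0)
                system[^1]!["cache_control"] = new JsonObject { ["type"] = "ephemeral" };

            // Breakpoint 4: top-level cache_control (auto-caches the growing prefix).
            body["cache_control"] = new JsonObject { ["type"] = "ephemeral" };

            request.Content = JsonContent.Create(body);
        }
        return await base.SendAsync(request, ct);
    }
}
```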

Compaction client — separate provider

Compaction can use a different provider/model than the main conversation (e.g., Haiku for cheap summaries). Registered as a keyed IChatClient("compaction"). Falls back to the main client if not configured.
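
A sketch of the keyed registration and fallback, assuming .NET 8 keyed DI and Microsoft.Extensions.AI's IChatClient; the option plumbing and BuildAnthropicClient are placeholders:

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;

public static class CompactionClientRegistration
{
    // Sketch only: the "compaction" key comes from the doc; BuildAnthropicClient
    // stands in for the real Anthropic provider wiring.
    public static void AddCompactionClient(this IServiceCollection services, string? apiKey, string? model)
    {
        if (!string.IsNullOrEmpty(apiKey))
            services.AddKeyedSingleton<IChatClient>("compaction",
                (sp, _) => BuildAnthropicClient(apiKey, model));
        else
            // No dedicated compaction provider configured: fall back to the main client.
            services.AddKeyedSingleton<IChatClient>("compaction",
                (sp, _) => sp.GetRequiredService<IChatClient>());
    }

    private static IChatClient BuildAnthropicClient(string apiKey, string? model)
        => throw new NotImplementedException("placeholder for the real provider setup");
}
```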

Chat commands — intercepted before AI

/status, /new, /clear, /compact, /prune, /resume, /abort are handled by CommandRouter before AI processing. Commands bypass debouncing and execute immediately. Responses are persisted to message history. No AI tokens wasted on commands. Commands support arguments: /command [args].
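
A sketch of the interception shape; IChatCommand and CommandRouter are named in this doc, but the members and the ChatContext stand-in are assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-in type: the real code presumably passes the chat/session being processed.
public sealed record ChatContext(long ChatId);

public interface IChatCommand
{
    string Name { get; }  // e.g. "status", "resume"
    Task<string> ExecuteAsync(ChatContext chat, string args, CancellationToken ct);
}

public sealed class CommandRouter
{
    private readonly Dictionary<string, IChatCommand> _commands;

    public CommandRouter(IEnumerable<IChatCommand> commands)
        => _commands = commands.ToDictionary(c => "/" + c.Name);

    // Returns null when the text is not a command, so normal AI processing continues.
    public IChatCommand? Match(string text)
        => _commands.TryGetValue(text.Split(' ', 2)[0], out var cmd) ? cmd : null;
}
```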

Message queuing — queued flag

All message queries filter on !m.Queued to exclude messages queued while an AI response is in progress. See docs/MESSAGE_QUEUE.md for the full queuing model.
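
The session-scoped load combined with that filter might look like this (EF Core query shape assumed):

```csharp
// Assumed EF Core query shape: session-scoped history, excluding queued messages.
var history = await db.Messages
    .Where(m => m.SessionId == session.Id && !m.Queued)
    .OrderBy(m => m.Id)
    .ToListAsync(ct);
```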

Thinking effort — configurable

The ANTHROPIC_THINKING_EFFORT env var accepts low (1024), medium (4096), high (16384), or an exact token count.
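
A sketch of how that mapping might be resolved (the helper name is illustrative):

```csharp
// Maps ANTHROPIC_THINKING_EFFORT to a thinking token budget; null means thinking is off.
static int? ResolveThinkingBudget(string? value) => value?.Trim().ToLowerInvariant() switch
{
    null or "" => null,
    "low"      => 1024,
    "medium"   => 4096,
    "high"     => 16384,
    var exact  => int.TryParse(exact, out var tokens) ? tokens : (int?)null,
};
```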

Configuration Reference

| Env Var | Default | Purpose |
| --- | --- | --- |
| LIS_KEEP_RECENT_TOKENS | 4000 | Recent messages kept verbatim after compaction |
| LIS_TOOL_PRUNE_THRESHOLD | 8000 | Tool output tokens to trigger pruning |
| LIS_TOOL_KEEP_THRESHOLD | 2000 | Recent tool output tokens to keep unpruned |
| LIS_COMPACTION_THRESHOLD | 80% of budget | Input tokens to trigger full compaction |
| LIS_COMPACTION_NOTIFY | true | Notify user on compaction events |
| LIS_TOOL_SUMMARIZATION_POLICY | auto | auto, keep_all, or keep_none |
| LIS_RESUME_TOKEN_BUDGET | 70% of budget | Max tokens for full session resume |
| LIS_COMPACTION_PROVIDER | (main) | anthropic (other providers not yet supported) |
| LIS_COMPACTION_API_KEY | | API key for the compaction provider |
| LIS_COMPACTION_MODEL | | Model for summarization |
| ANTHROPIC_CACHE_TTL | 5m | 5m or 1h |
| ANTHROPIC_CACHE_ENABLED | true | Toggle prompt caching |
| ANTHROPIC_THINKING_EFFORT | (off) | Thinking budget |

Notification Formats

Tool pruning:

🔧 Tool outputs pruned (12.4k → 2.1k, -83%)
  📊 Context: 18.5k/150k (12%)

Full compaction:

⚙️ Compacted (157k → 9.1k)
  🔧 System: 1.8k tokens
  📝 Summary: 3.2k tokens
  💬 Kept context: 2.7k tokens
  🛠️ Tools: 1.4k tokens (1.1k defs + 300 calls)
  📊 Total: 9.1k/150k (6%)

Implementation

Key files

| File | Purpose |
| --- | --- |
| Lis.Persistence/Entities/SessionEntity.cs | Session data model with summary, embedding, token stats |
| Lis.Persistence/Entities/MessageEntity.cs | Message data model with session_id FK |
| Lis.Agent/CompactionService.cs | Async summarization, session lifecycle, embedding generation, message reassignment |
| Lis.Agent/ContextWindowBuilder.cs | History assembly with session/summary injection, tool pruning |
| Lis.Agent/ConversationService.cs | Session management, compaction triggers, command routing |
| Lis.Agent/ToolRunner.cs | Token usage extraction from streaming responses |
| Lis.Agent/Commands/ | Command framework: IChatCommand, CommandRouter, /status, /new, /compact, /prune, /resume |
| Lis.Providers/Anthropic/AnthropicProvider.cs | CacheControlHandler, thinking effort, cache config |
| Lis.Core/Util/ToolSummarizationAttribute.cs | Per-tool summarization policy attribute |
| Lis.Core/Channel/TokenUsage.cs | Token usage DTO |

Flow

  1. Message arrives → ConversationService.IngestMessageAsync (ensure session, persist with session_id)
  2. ConversationService.RespondAsync — load chat + session
  3. Check commands (/status, /new, /clear, /compact, /prune, /resume) → handle without AI
  4. Load messages from current session (WHERE m.SessionId == session.Id)
  5. Build context: system prompt → parent summary → session summary → messages (with pruning)
  6. Send to AI via ToolRunner (streaming, with tool loop)
  7. Extract TokenUsage from response metadata → update session stats + message columns
  8. Check compaction triggers (based on actual input_tokens):
    • Tool prune threshold → set ToolsPrunedThroughId
    • Compaction threshold → fire async CompactionService.CompactAsync
  9. Compaction: summarize → embed → finalize session → create new session → reassign kept messages (in transaction)

Prompt caching

The CacheControlHandler DelegatingHandler intercepts Anthropic API requests and injects up to 4 cache breakpoints:

  1. Last system content block — caches the stable system prompt
  2. Last session summary message — caches summaries (stable within session)
  3. Tool prune boundary — caches pruned content (stable once set, communicated via ToolContext.CacheBreakIndex)
  4. Top-level cache_control — auto-caches growing conversation prefix

The handler is inserted in the HttpClient pipeline before BearerAuthHandler. Cache stats (cache_read_input_tokens, cache_creation_input_tokens) are extracted from response metadata and stored per-message and per-session.