Description
Problem
When a session is processing a request (especially waiting for a slow model or a long tool call), the Control UI shows no indication of whether the session is alive, thinking, stalled, or crashed. Users experience "dead air" — minutes of silence with no feedback — which is indistinguishable from a crash or stall.
This is especially painful with:
- Slower models (long time-to-first-token)
- Multi-step agentic loops with tool calls
- Background/long-running operations
- Sessions with a history of instability
Proposed Solution
Session Heartbeat Emitter
Each active session emits a lightweight heartbeat signal at a regular interval (e.g., every 5 seconds) containing:
    interface SessionHeartbeat {
      sessionKey: string;
      state:
        | "idle"            // no active turn
        | "awaiting_model"  // sent request, waiting for first token
        | "streaming"       // receiving tokens from model
        | "tool_exec"       // executing a tool call
        | "tool_wait"       // waiting for tool result
        | "complete";       // turn finished, waiting for next input
      turnId?: string;      // current turn identifier
      model?: string;       // active model
      toolName?: string;    // currently executing tool
      lastTokenAt?: number; // timestamp of last received token
      startedAt: number;    // when current turn started
    }
Key Behaviors
- Transition-based emission: a heartbeat fires on every state transition (idle → awaiting_model → streaming → tool_exec, etc.) and on a periodic timer (5s) to catch stalls
- No extra model calls: this is pure session lifecycle metadata, not LLM traffic
- Stall detection: if the heartbeat shows awaiting_model or tool_exec for longer than a configurable threshold (e.g., 30s), the session is flagged as "potentially stalled"
- Crash detection: if no heartbeat is received for more than 15s, the UI shows the session as "unresponsive"
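The emission behavior described above could be sketched roughly as follows. This is a minimal illustration, not an existing OpenClaw API: `createHeartbeatEmitter`, `HeartbeatPatch`, the `sink` callback, and the injectable `now` clock are all hypothetical names, and timestamps are assumed to be epoch milliseconds.

```typescript
// Sketch only. Everything except the SessionHeartbeat fields is a hypothetical name.
type SessionState =
  | "idle" | "awaiting_model" | "streaming"
  | "tool_exec" | "tool_wait" | "complete";

interface SessionHeartbeat {
  sessionKey: string;
  state: SessionState;
  turnId?: string;
  model?: string;
  toolName?: string;
  lastTokenAt?: number; // assumed: epoch milliseconds
  startedAt: number;    // assumed: epoch milliseconds
}

type HeartbeatPatch = Partial<Omit<SessionHeartbeat, "sessionKey" | "state">>;

function createHeartbeatEmitter(
  sessionKey: string,
  sink: (hb: SessionHeartbeat) => void,
  intervalMs: number = 5_000,
  now: () => number = Date.now,
) {
  let current: SessionHeartbeat = { sessionKey, state: "idle", startedAt: now() };
  let timer: ReturnType<typeof setInterval> | undefined;

  // Send a copy so receivers cannot mutate the emitter's internal state.
  const emit = (): void => sink({ ...current });

  return {
    // Emit immediately on every state transition.
    transition(state: SessionState, patch: HeartbeatPatch = {}): void {
      current = { ...current, ...patch, state };
      emit();
    },
    // Re-send the latest state periodically so a stuck session stays visible.
    start(): void {
      if (timer === undefined) timer = setInterval(emit, intervalMs);
    },
    stop(): void {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
    emit,
  };
}
```

A session runner would call `start()` once, then `transition("awaiting_model", { turnId, model, startedAt })` when a request is sent, `transition("streaming", { lastTokenAt })` on the first token, and so on. No model traffic is involved; the emitter only reports lifecycle state it is told about.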
Control UI Integration
- Per-session status indicator (colored dot or badge): 🟢 active, 🟡 waiting, 🔴 stalled/dead
- Tooltip showing: state, current model, current tool, elapsed time
- Optional "last activity" timestamp per session
- Session list should sort/update in real time based on heartbeat
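One way the UI might derive the indicator and tooltip from a heartbeat is sketched below. The `Indicator` names, `classifySession`, `tooltip`, and the mapping of idle/complete to "active" are assumptions made for illustration; the 30s and 15s defaults come from the thresholds suggested under Key Behaviors. Because the proposed interface has no "entered current state at" field, this sketch approximates time-in-state from `lastTokenAt` or `startedAt`, which may not be what the final design wants.

```typescript
// Sketch only. Indicator names and the state-to-indicator mapping are assumptions.
type SessionState =
  | "idle" | "awaiting_model" | "streaming"
  | "tool_exec" | "tool_wait" | "complete";

// Same shape as the interface proposed above; timestamps assumed to be epoch ms.
interface SessionHeartbeat {
  sessionKey: string;
  state: SessionState;
  turnId?: string;
  model?: string;
  toolName?: string;
  lastTokenAt?: number;
  startedAt: number;
}

type Indicator = "active" | "waiting" | "stalled" | "unresponsive";

interface Thresholds {
  stallMs: number;   // how long awaiting_model/tool_exec may last before flagging
  silenceMs: number; // how long the UI tolerates receiving no heartbeat at all
}

const DEFAULT_THRESHOLDS: Thresholds = { stallMs: 30_000, silenceMs: 15_000 };

function classifySession(
  hb: SessionHeartbeat,
  receivedAt: number, // when the UI last received a heartbeat for this session
  now: number,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): Indicator {
  // Crash detection: the emitter itself has gone quiet.
  if (now - receivedAt > thresholds.silenceMs) return "unresponsive";

  if (hb.state === "awaiting_model" || hb.state === "tool_exec") {
    // No per-state timestamp exists, so the last token (or the turn start)
    // stands in as the last sign of progress.
    const lastProgress = Math.max(hb.lastTokenAt ?? 0, hb.startedAt);
    return now - lastProgress > thresholds.stallMs ? "stalled" : "waiting";
  }
  if (hb.state === "tool_wait") return "waiting";
  return "active"; // idle, streaming, complete
}

// Tooltip text: state, current model, current tool, elapsed time in the turn.
function tooltip(hb: SessionHeartbeat, now: number): string {
  const elapsedSec = Math.round((now - hb.startedAt) / 1000);
  return [hb.state, hb.model, hb.toolName, `${elapsedSec}s`]
    .filter(Boolean)
    .join(" | ");
}
```

In this mapping "active" would render as 🟢, "waiting" as 🟡, and both "stalled" and "unresponsive" as 🔴.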
API Surface
GET /api/sessions/:key/heartbeat → current heartbeat state
GET /api/sessions/heartbeat → all session heartbeats (batch)
Or expose via existing session list endpoint with an extended status field.
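The two routes could be served from an in-memory map of the latest heartbeat per session. The sketch below is framework-agnostic and purely illustrative: `handleHeartbeatGet`, `ApiResponse`, the `{ heartbeats: [...] }` batch shape, and the 404 bodies are assumptions, not part of any existing server.

```typescript
// Sketch only. Handler name, response shapes, and error bodies are hypothetical.
interface SessionHeartbeat {
  sessionKey: string;
  state: "idle" | "awaiting_model" | "streaming" | "tool_exec" | "tool_wait" | "complete";
  turnId?: string;
  model?: string;
  toolName?: string;
  lastTokenAt?: number;
  startedAt: number;
}

interface ApiResponse {
  status: number;
  body: unknown;
}

function handleHeartbeatGet(
  path: string,
  store: Map<string, SessionHeartbeat>, // latest heartbeat per session key
): ApiResponse {
  // Batch route: all session heartbeats.
  if (path === "/api/sessions/heartbeat") {
    return { status: 200, body: { heartbeats: Array.from(store.values()) } };
  }

  // Single-session route: /api/sessions/:key/heartbeat
  const match = /^\/api\/sessions\/([^/]+)\/heartbeat$/.exec(path);
  if (match === null) return { status: 404, body: { error: "not found" } };

  let key: string;
  try {
    key = decodeURIComponent(match[1] ?? "");
  } catch {
    return { status: 400, body: { error: "malformed session key" } };
  }

  const hb = store.get(key);
  return hb !== undefined
    ? { status: 200, body: hb }
    : { status: 404, body: { error: "unknown session" } };
}
```

The batch route has one path segment fewer than the per-session route, so the two do not collide even if a session happens to be keyed "heartbeat".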
Open Questions
- Should the heartbeat be WebSocket-based (push) or polling (pull)? WebSocket is better for real-time updates; polling is simpler to implement.
- Should stall thresholds be configurable per-session or global?
- Should there be an auto-recovery action (restart stalled session) exposed via the UI?
Alternatives Considered
- Model streaming alone — doesn't cover tool execution gaps or slow time-to-first-token
- Existing session list polling — already exists but lacks granular state info
- Log tailing — too heavy, requires parsing, not structured
Impact
This would significantly improve the operational experience of running OpenClaw, especially for users managing multiple sessions or using slower/cheaper models where latency is expected. It turns "is it dead?" from a guessing game into a visible status.