run_in_terminal: promote sync command to background after idle silence #316166

Draft

meganrogge wants to merge 4 commits into main from merogge/idle-silence-test

Conversation

@meganrogge (Collaborator) commented May 13, 2026

Re-land of #315885 on a standalone branch for isolated eval testing.

What this does

If a synchronous run_in_terminal call produces no terminal output for N ms (default 60000 ms, i.e. 60 s), a new idleSilence race candidate wins the foreground race and promotes the execution to background. The process is never killed — the model receives the terminal ID plus the output collected so far and can call get_terminal_output, send_to_terminal, or kill_terminal.
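The idle-silence candidate can be sketched roughly as follows. This is a minimal illustration using an injectable scheduler (the real implementation uses VS Code's RunOnceScheduler plus an onData listener; the class and method names here are assumptions, not the PR's actual API):

```typescript
// A schedule function fires a callback after `ms` and returns a cancel function.
type Schedule = (cb: () => void, ms: number) => () => void;

const defaultSchedule: Schedule = (cb, ms) => {
  const handle = setTimeout(cb, ms);
  return () => clearTimeout(handle);
};

// Fires `onIdleSilence` after `timeoutMs` of terminal silence; any output
// event resets the clock, so chatty commands (npm install, cargo build) never trip it.
class IdleSilenceWatcher {
  private cancel: (() => void) | undefined;

  constructor(
    private readonly timeoutMs: number,
    private readonly onIdleSilence: () => void,
    private readonly schedule: Schedule = defaultSchedule,
  ) {
    if (timeoutMs > 0) {
      this.arm(); // a value of 0 disables the feature entirely
    }
  }

  // Call from the terminal's onData listener: every chunk of output resets the timer.
  notifyData(): void {
    if (this.cancel) {
      this.arm();
    }
  }

  private arm(): void {
    this.cancel?.();
    this.cancel = this.schedule(this.onIdleSilence, this.timeoutMs);
  }

  // In the real code this is owned by the raceCleanup DisposableStore,
  // so it is disposed as soon as any other race candidate wins.
  dispose(): void {
    this.cancel?.();
    this.cancel = undefined;
  }
}
```

Injecting the scheduler keeps the reset/dispose logic deterministic under test, mirroring how the PR's unit tests can exercise the race without real timers.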

Changes (3 files)

  • runInTerminalTool.ts: New idleSilence race candidate using RunOnceScheduler + onData listener. Refactors _buildInputNeededSteeringText from mentionTimeout: boolean to hungHint: 'none' | 'timeout' | 'idleSilence' discriminator with per-mode wording.
  • terminalChatAgentToolsConfiguration.ts: New setting chat.tools.terminal.idleSilenceTimeoutMs (default 60000, 0 disables, experimental).
  • runInTerminalTool.test.ts: 3 unit tests for steering text across all hint modes.
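The new option can be tried out in settings.json. A hypothetical user-settings fragment — the key and default come from the PR description, the comment is mine:

```jsonc
{
  // Promote a sync run_in_terminal command to background after 60 s of silence.
  // Set to 0 to disable idle-silence promotion (experimental setting).
  "chat.tools.terminal.idleSilenceTimeoutMs": 60000
}
```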

Why this is safe

  • Commands that produce output regularly (npm install, cargo build, etc.) reset the timer and never trip.
  • The async path is unchanged — it already has OutputMonitor idle detection.
  • Listener + scheduler owned by raceCleanup DisposableStore — disposed when any other race candidate wins.
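The hungHint refactor described above can be sketched as a standalone function. The discriminator values come from the PR; the function shape and the exact wording of each message are illustrative assumptions (the real method is the private _buildInputNeededSteeringText on the tool class):

```typescript
// Discriminates why the sync execution was promoted to background.
type HungHint = 'none' | 'timeout' | 'idleSilence';

// Builds the steering text appended to the tool result, with per-mode wording
// instead of the old mentionTimeout: boolean flag.
function buildInputNeededSteeringText(terminalId: string, hungHint: HungHint): string {
  const base =
    `The command is still running in terminal ${terminalId}. ` +
    `Use get_terminal_output to poll it, send_to_terminal to provide input, ` +
    `or kill_terminal to stop it.`;
  switch (hungHint) {
    case 'timeout':
      return `The command hit the foreground timeout. ${base}`;
    case 'idleSilence':
      return `The command produced no output for the idle-silence window. ${base}`;
    case 'none':
      return base;
  }
}
```

An exhaustive switch over the union means the compiler flags any future hint mode that lacks dedicated wording, which a boolean parameter could not do.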

Megan Rogge added 2 commits May 11, 2026 18:05
If a synchronous run_in_terminal call produces no output for N ms, win the
foreground race with a new idleSilence candidate that mirrors the existing
timeout handler: promote the execution to background, return the terminal
ID + output collected so far, append a steering hint. The process is never
killed.

Gated on chat.tools.terminal.idleSilenceTimeoutMs (default 60000, 0
disables). Listener and scheduler are owned by the existing raceCleanup
DisposableStore so they go away when another candidate wins. Async
(waitStrategy === 'idle') path is unchanged.

Fixes #315884
Replace the boolean mentionTimeout parameter on _buildInputNeededSteeringText with a 'none' | 'timeout' | 'idleSilence' discriminator so the idle-silence promotion result no longer reuses the timeout wording. Add focused unit tests covering each mode.
Copilot AI review requested due to automatic review settings May 13, 2026 01:31
@meganrogge meganrogge self-assigned this May 13, 2026
@meganrogge meganrogge marked this pull request as draft May 13, 2026 01:31
@meganrogge meganrogge added this to the 1.121.0 milestone May 13, 2026
@meganrogge (Collaborator, Author)

/requires-eval-assessment terminalbench2 gpt-5.4,claude-opus-4.6,claude-opus-4.7

@meganrogge meganrogge added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@meganrogge meganrogge changed the title Merogge/idle silence test run_in_terminal: promote sync command to background after idle silence May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for 6127d25b49b46fcdbda2e7e99e7c764d80e197cb (step 1/2).

Copilot AI left a comment

Pull request overview

This PR extends the terminal chat agent “run in terminal” tool to support an idle-silence path: when a foreground/sync command produces no output for a configurable duration, the tool returns early, moves the execution to a background terminal, and provides updated steering guidance to the model.

Changes:

  • Add a new configuration setting chat.tools.terminal.idleSilenceTimeoutMs to control idle-silence promotion timing (0 disables).
  • Implement idle-silence promotion logic in RunInTerminalTool and adjust input-needed steering text to distinguish 'none' | 'timeout' | 'idleSilence'.
  • Add unit tests validating the steering text content across the new “hung hint” modes.
Summary per file:

| File | Description |
| --- | --- |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/test/electron-browser/runInTerminalTool.test.ts | Adds tests for steering text behavior across none/timeout/idle-silence modes. |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/common/terminalChatAgentToolsConfiguration.ts | Introduces the new idleSilenceTimeoutMs setting with schema/description. |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/browser/tools/runInTerminalTool.ts | Implements idle-silence promotion and updates steering text API/call sites. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

@github-actions Bot commented May 13, 2026

Base: ba911a64 Current: d1e7b002

No screenshot changes.

Co-authored-by: Copilot Autofix powered by AI <[email protected]>
@meganrogge meganrogge added ~requires-eval-assessment Evals will be run and will generate a report upon completion and removed ~requires-eval-assessment Evals will be run and will generate a report upon completion labels May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for ac12483253ad0e87ab389a990e66836878ade865 (step 1/2).

@meganrogge meganrogge removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
Resolve conflict in runInTerminalTool.ts: keep idleSilence race type + try/finally cleanup.

Co-authored-by: Copilot <[email protected]>
@meganrogge meganrogge added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for f47921fc80f79d8d21a2443985e1125b8be2d736 (step 1/2).

@vs-code-engineering (Contributor)

🚀 Queued eval-assessment publish build for d1e7b002a190e508170eda45bd36ef37329002bc (step 2/2).

@vs-code-engineering (Contributor)

🔬 Queued eval-assessment benchmark for 9232a54894.

Results will be posted back here when the run completes.

@vs-code-engineering (Contributor)

✅ Eval-assessment build published.

@vs-code-engineering vs-code-engineering Bot removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@vs-code-engineering (Contributor)

📊 Eval-assessment benchmark complete.

Eval-Agent Comparison

Candidate run: 63914249566137

Baseline runs: 25768037843, 25775688375, 25761410922

Detailed Findings

Run Comparison

vscode / terminalbench2

| | gpt-5.4 (candidate) | gpt-5.5 | claude-opus-4.7 | gpt-5.4-mini |
| --- | --- | --- | --- | --- |
| RunId | 63914249566137 | 25768037843 | 25775688375 | 25761410922 |
| Total Instances | 89 | 89 (+0.0% ➖) | 89 (+0.0% ➖) | 89 (+0.0% ➖) |
| Resolved Rate | 65.17% | 67.42% (-2.25pp 🔴) | 66.29% (-1.12pp 🔴) | 42.70% (+22.47pp 🟢) |
| Total Tokens | 41,124,051 | 47,378,772 (-15.2% 🟢) | 58,784,280 (-42.9% 🟢) | 132,629,045 (-222.5% 🟢) |
| Mean Input Tokens | 452,942 | 525,673 (-16.1% 🟢) | 648,433 (-43.2% 🟢) | 1,466,944 (-223.9% 🟢) |
| Mean Output Tokens | 9,126 | 6,673 (+26.9% 🔴) | 12,065 (-32.2% 🟢) | 23,270 (-155.0% 🟢) |
| Cache Rate (cached/input) | 83.58% | 87.78% (-4.20pp 🔴) | 94.90% (-11.32pp 🔴) | 88.77% (-5.19pp 🔴) |
| Total Steps | 1,351 | 1,422 (-5.3% 🟢) | 1,385 (-2.5% 🟢) | 2,027 (-50.0% 🟢) |
| Mean Steps/Instance | 15.18 | 15.98 (-5.3% 🟢) | 15.56 (-2.5% 🟢) | 22.78 (-50.1% 🟢) |

Legend: Indicators are from the candidate's perspective. 🟢 = candidate is better than this baseline. 🔴 = candidate is worse. ➖ = no meaningful difference.
Good for candidate: higher resolved rate, fewer tokens, higher cache rate, fewer steps.

BASELINE 2 Step 4 — heavy pip install in constrained environment:

run_in_terminal: pip install torch numpy pillow --quiet 2>&1 | tail -5

→ Installation fails; baseline never recovers.

CANDIDATE Step 5 — immediate stdlib pivot after first import failure:

MSG: "The container is missing both PyTorch and Pillow, so I'm switching to lower-level
inspection: checking whether the checkpoint is a zip-based PyTorch archive."
CMD: cd /app && file model.pth image.png && python - <<'PY'
import zipfile
for path in ['model.pth']:
    print(path, 'is_zip', zipfile.is_zipfile(path))
PY
// Result: resolved in 17 steps vs BASELINE 2's failure.


eval-agent msbench instance analyze 63914249566137 --instances terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1,gcode-to-text,terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1,terminalbench2.eval.x86_64.install-windows-3.11:msbench-0.1.1,custom-memory-heap-crash,<REDACTED: Generic Secret> --custom-instructions "Identify instances where the candidate successfully pivots from a blocked or unavailable primary approach to a working alternative, and compare against baselines that either stuck with the failing approach, gave up, or refused the task."


**Instances**: terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1, gcode-to-text, terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1, terminalbench2.eval.x86_64.install-windows-3.11:msbench-0.1.1, custom-memory-heap-crash, <REDACTED: Generic Secret> financial-document-processor, terminalbench2.eval.x86_64.largest-eigenval:msbench-0.1.1

---

## Appendix

<details>
<summary>Additional Patterns</summary>

**Weakness: False-positive safety refusals on security research tasks** (strength/weakness: weakness) — The candidate refuses legitimate CTF/security-evaluation tasks without inspecting any files, issuing zero tool calls, while baselines engage and solve them. — 2 instances: break-filter-js-from-html (refused XSS filter bypass task entirely), terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1 (BASELINE 3 refused; candidate correctly engaged)

**Weakness: Self-inflicted tool/file management conflicts** (weakness) — The candidate creates files via terminal heredocs then immediately attempts IDE tool operations on the same path, triggering "file already exists" errors; or uses patch tool on stale file context causing "invalid context" failures requiring re-read cycles — 4 instances: circuit-fibsqrt (3-step file creation conflict loop), caffe-cifar-10 (failed patch application requiring re-read), custom-memory-heap-crash (Valgrind flag typo requiring retry), configure-git-webserver (wrong path scope throughout)

**Weakness: Incorrect approach selection causing avoidable rebuild / re-work cascades** (weakness) — The candidate selects a heavier approach than necessary (e.g., rebuilding with a dependency instead of working around it), triggering multiple full rebuild cycles that exhaust steps — 3 instances: caffe-cifar-10 (3 full Caffe builds due to avoidable OpenCV dependency; BASELINE 1 used a Python workaround in one build), compile-compcert (installed incompatible distro Coq 8.18 before attempting opam only on the last step), qemu-alpine-ssh (never discovered `expect` tool, no ISO boot config inspection)

**Weakness: Premature convergence on task-violating shortcuts** (weakness) — The candidate identifies a clever shortcut (wrapping an existing binary, reading internal module state directly) that produces correct output but violates the task's core requirement, causing grader failure despite apparent success — 2 instances: path-tracing (C wrapper that `execl()`s the pre-existing `orig` binary instead of implementing ray tracing), model-extraction-relu-logits (reads `module.A1` directly instead of using `forward()` oracle queries)

**Weakness: Redundant verification passes after confirmed success** (weakness) — After grader-equivalent validation already passes, the candidate re-runs the same full benchmark or pipeline a second time for cosmetic or confidence reasons, wasting steps — 3 instances: hf-model-inference (4-step refactoring cycle to fix non-critical deprecation warning after service verified working), terminalbench2.eval.x86_64.distribution-search:msbench-0.1.1 (two nearly-identical validation scripts in consecutive steps after solution saved), reshard-c4-data (re-runs full compress pipeline after byte-identical round-trip confirmed)

**Strength: Thorough end-to-end verification that baselines skip** (strength) — The candidate performs deeper functional verification than baselines: running actual RPC calls (not just port checks), interactive debugger sessions, or byte-exact binary comparisons — 5 instances: build-pmars (launched pMARS in debugger mode and sent real interactive keystrokes), kv-store-grpc (inline gRPC client calling both SetVal and GetVal vs. BASELINE 2's port-only check), feal-linear-cryptanalysis (compiled provided C decryptor and used `cmp -s` for byte-exact cross-check), path-tracing-reverse (confirmed byte-for-byte pixel parity in exactly two compile rounds), vulnerable-secret (dual static XOR-decode + runtime overflow-trigger verification)

**Strength: Proactive full-surface sweeps before patching** (strength) — Before making any fixes, the candidate runs comprehensive searches across the full problem surface (all deprecated aliases, all credential patterns), preventing whack-a-mole failures — 3 instances: build-cython-ext (3 parallel greps covering full deprecated NumPy alias surface before any fixes; BASELINE 3 hit repeated single-alias fix cycles across 31 steps), sanitize-git-repo (dual-layer parallel credential scan — known token formats + broad key/secret patterns — before any edits), count-dataset-tokens (parallel Counter checks on actual domain distribution before writing tokenization code; BASELINE 2 used wrong filter and failed)

**Strength: Structured software engineering discipline on complex iterative tasks** (strength) — On tasks requiring repeated generation-test-fix cycles, the candidate creates persistent script files with targeted patches rather than re-pasting full heredocs, and uses static error checking before compilation — 2 instances: terminalbench2.eval.x86_64.regex-chess:msbench-0.1.1 (persistent `generate_re.py` with 9 surgical `apply_patch` calls + `get_errors` before each run; BASELINE 1 pasted entirely new 200-line inline scripts on every iteration), terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1 (`get_errors` called before `g++` invocation; correct output format on first attempt with no trailing newline fix needed)

**Strength: Workspace hygiene — cleanup of temporary artifacts** (strength) — After verification, the candidate explicitly removes temporary files, comparison artifacts, and local dependency directories, leaving a clean workspace — 4 instances: chess-best-move (removed `.pydeps` directory after verification), path-tracing-reverse (removed `cand.ppm`, `target.ppm`, `image.png`, reversed binary after confirming exact match), feal-linear-cryptanalysis (removed `plaintexts_from_c.txt` after `cmp -s` match), terminalbench2.eval.x86_64.regex-chess:msbench-0.1.1 (deleted `generate_re.py` and `validate_random.py` after producing deliverable)

</details>

<details>
<summary>Extraction commands</summary>

msbench-cli extract --run-id 63914249566137 --output out/63914249566137 --backend ces-dev1

msbench-cli extract --run-id 25768037843 --output out/25768037843 --backend ces-dev1

msbench-cli extract --run-id 25775688375 --output out/25775688375 --backend ces-dev1

msbench-cli extract --run-id 25761410922 --output out/25761410922 --backend ces-dev1


</details>
</details>
