run_in_terminal: promote sync command to background after idle silence #316166

Draft

meganrogge wants to merge 4 commits into main from merogge/idle-silence-test

Conversation

@meganrogge (Collaborator) commented May 13, 2026

Re-land of #315885 on a standalone branch for isolated eval testing.

What this does

If a synchronous run_in_terminal call produces no terminal output for N ms (default 60000 ms, i.e. 60 s), a new idleSilence race candidate wins the foreground race and promotes the execution to background. The process is never killed — the model receives the terminal ID plus the output collected so far and can call get_terminal_output, send_to_terminal, or kill_terminal.
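The idle-silence candidate can be sketched roughly as follows. This is a minimal illustration using an injectable scheduler (the real implementation uses VS Code's RunOnceScheduler plus an onData listener; the class and method names here are assumptions, not the PR's actual API):

```typescript
// A schedule function fires a callback after `ms` and returns a cancel function.
type Schedule = (cb: () => void, ms: number) => () => void;

const defaultSchedule: Schedule = (cb, ms) => {
  const handle = setTimeout(cb, ms);
  return () => clearTimeout(handle);
};

// Fires `onIdleSilence` after `timeoutMs` of terminal silence; any output
// event resets the clock, so chatty commands (npm install, cargo build) never trip it.
class IdleSilenceWatcher {
  private cancel: (() => void) | undefined;

  constructor(
    private readonly timeoutMs: number,
    private readonly onIdleSilence: () => void,
    private readonly schedule: Schedule = defaultSchedule,
  ) {
    if (timeoutMs > 0) {
      this.arm(); // a value of 0 disables the feature entirely
    }
  }

  // Call from the terminal's onData listener: every chunk of output resets the timer.
  notifyData(): void {
    if (this.cancel) {
      this.arm();
    }
  }

  private arm(): void {
    this.cancel?.();
    this.cancel = this.schedule(this.onIdleSilence, this.timeoutMs);
  }

  // In the real code this is owned by the raceCleanup DisposableStore,
  // so it is disposed as soon as any other race candidate wins.
  dispose(): void {
    this.cancel?.();
    this.cancel = undefined;
  }
}
```

Injecting the scheduler keeps the reset/dispose logic deterministic under test, mirroring how the PR's unit tests can exercise the race without real timers.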

Changes (3 files)

  • runInTerminalTool.ts: New idleSilence race candidate using RunOnceScheduler + onData listener. Refactors _buildInputNeededSteeringText from mentionTimeout: boolean to hungHint: 'none' | 'timeout' | 'idleSilence' discriminator with per-mode wording.
  • terminalChatAgentToolsConfiguration.ts: New setting chat.tools.terminal.idleSilenceTimeoutMs (default 60000, 0 disables, experimental).
  • runInTerminalTool.test.ts: 3 unit tests for steering text across all hint modes.
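The new option can be tried out in settings.json. A hypothetical user-settings fragment — the key and default come from the PR description, the comment is mine:

```jsonc
{
  // Promote a sync run_in_terminal command to background after 60 s of silence.
  // Set to 0 to disable idle-silence promotion (experimental setting).
  "chat.tools.terminal.idleSilenceTimeoutMs": 60000
}
```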

Why this is safe

  • Commands that produce output regularly (npm install, cargo build, etc.) reset the timer and never trip.
  • The async path is unchanged — it already has OutputMonitor idle detection.
  • Listener + scheduler owned by raceCleanup DisposableStore — disposed when any other race candidate wins.
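The hungHint refactor described above can be sketched as a standalone function. The discriminator values come from the PR; the function shape and the exact wording of each message are illustrative assumptions (the real method is the private _buildInputNeededSteeringText on the tool class):

```typescript
// Discriminates why the sync execution was promoted to background.
type HungHint = 'none' | 'timeout' | 'idleSilence';

// Builds the steering text appended to the tool result, with per-mode wording
// instead of the old mentionTimeout: boolean flag.
function buildInputNeededSteeringText(terminalId: string, hungHint: HungHint): string {
  const base =
    `The command is still running in terminal ${terminalId}. ` +
    `Use get_terminal_output to poll it, send_to_terminal to provide input, ` +
    `or kill_terminal to stop it.`;
  switch (hungHint) {
    case 'timeout':
      return `The command hit the foreground timeout. ${base}`;
    case 'idleSilence':
      return `The command produced no output for the idle-silence window. ${base}`;
    case 'none':
      return base;
  }
}
```

An exhaustive switch over the union means the compiler flags any future hint mode that lacks dedicated wording, which a boolean parameter could not do.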

Megan Rogge added 2 commits May 11, 2026 18:05
If a synchronous run_in_terminal call produces no output for N ms, win the
foreground race with a new idleSilence candidate that mirrors the existing
timeout handler: promote the execution to background, return the terminal
ID + output collected so far, append a steering hint. The process is never
killed.

Gated on chat.tools.terminal.idleSilenceTimeoutMs (default 60000, 0
disables). Listener and scheduler are owned by the existing raceCleanup
DisposableStore so they go away when another candidate wins. Async
(waitStrategy === 'idle') path is unchanged.

Fixes #315884
Replace the boolean mentionTimeout parameter on _buildInputNeededSteeringText with a 'none' | 'timeout' | 'idleSilence' discriminator so the idle-silence promotion result no longer reuses the timeout wording. Add focused unit tests covering each mode.
Copilot AI review requested due to automatic review settings May 13, 2026 01:31
@meganrogge meganrogge self-assigned this May 13, 2026
@meganrogge meganrogge marked this pull request as draft May 13, 2026 01:31
@meganrogge meganrogge added this to the 1.121.0 milestone May 13, 2026
@meganrogge (Collaborator, Author)

/requires-eval-assessment terminalbench2 gpt-5.4,claude-opus-4.6,claude-opus-4.7

@meganrogge meganrogge added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@meganrogge meganrogge changed the title Merogge/idle silence test run_in_terminal: promote sync command to background after idle silence May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for 6127d25b49b46fcdbda2e7e99e7c764d80e197cb (step 1/2).

Copilot AI left a comment

Pull request overview

This PR extends the terminal chat agent “run in terminal” tool to support an idle-silence path: when a foreground/sync command produces no output for a configurable duration, the tool returns early, moves the execution to a background terminal, and provides updated steering guidance to the model.

Changes:

  • Add a new configuration setting chat.tools.terminal.idleSilenceTimeoutMs to control idle-silence promotion timing (0 disables).
  • Implement idle-silence promotion logic in RunInTerminalTool and adjust input-needed steering text to distinguish 'none' | 'timeout' | 'idleSilence'.
  • Add unit tests validating the steering text content across the new “hung hint” modes.
Summary per file:

| File | Description |
| --- | --- |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/test/electron-browser/runInTerminalTool.test.ts | Adds tests for steering text behavior across none/timeout/idle-silence modes. |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/common/terminalChatAgentToolsConfiguration.ts | Introduces the new idleSilenceTimeoutMs setting with schema/description. |
| src/vs/workbench/contrib/terminalContrib/chatAgentTools/browser/tools/runInTerminalTool.ts | Implements idle-silence promotion and updates steering text API/call sites. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

@github-actions Bot commented May 13, 2026

Base: ba911a64 Current: d1e7b002

No screenshot changes.

Co-authored-by: Copilot Autofix powered by AI <[email protected]>
@meganrogge meganrogge added ~requires-eval-assessment Evals will be run and will generate a report upon completion and removed ~requires-eval-assessment Evals will be run and will generate a report upon completion labels May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for ac12483253ad0e87ab389a990e66836878ade865 (step 1/2).

@meganrogge meganrogge removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
Resolve conflict in runInTerminalTool.ts: keep idleSilence race type + try/finally cleanup.

Co-authored-by: Copilot <[email protected]>
@meganrogge meganrogge added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@vs-code-engineering (Contributor)

⏳ Queued vscode build for f47921fc80f79d8d21a2443985e1125b8be2d736 (step 1/2).

@vs-code-engineering (Contributor)

🚀 Queued eval-assessment publish build for d1e7b002a190e508170eda45bd36ef37329002bc (step 2/2).

@vs-code-engineering (Contributor)

🔬 Queued eval-assessment benchmark for 9232a54894.

Results will be posted back here when the run completes.

@vs-code-engineering (Contributor)

✅ Eval-assessment build published.

@vs-code-engineering vs-code-engineering Bot removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label May 13, 2026
@vs-code-engineering (Contributor)

📊 Eval-assessment benchmark complete.

Eval-Agent Comparison

Candidate run: 63914249566137

Baseline runs: 25768037843, 25775688375, 25761410922

Detailed Findings

Run Comparison

vscode / terminalbench2

| | gpt-5.4 (candidate) | gpt-5.5 | claude-opus-4.7 | gpt-5.4-mini |
| --- | --- | --- | --- | --- |
| RunId | 63914249566137 | 25768037843 | 25775688375 | 25761410922 |
| Total Instances | 89 | 89 (+0.0% ➖) | 89 (+0.0% ➖) | 89 (+0.0% ➖) |
| Resolved Rate | 65.17% | 67.42% (-2.25pp 🔴) | 66.29% (-1.12pp 🔴) | 42.70% (+22.47pp 🟢) |
| Total Tokens | 41,124,051 | 47,378,772 (-15.2% 🟢) | 58,784,280 (-42.9% 🟢) | 132,629,045 (-222.5% 🟢) |
| Mean Input Tokens | 452,942 | 525,673 (-16.1% 🟢) | 648,433 (-43.2% 🟢) | 1,466,944 (-223.9% 🟢) |
| Mean Output Tokens | 9,126 | 6,673 (+26.9% 🔴) | 12,065 (-32.2% 🟢) | 23,270 (-155.0% 🟢) |
| Cache Rate (cached/input) | 83.58% | 87.78% (-4.20pp 🔴) | 94.90% (-11.32pp 🔴) | 88.77% (-5.19pp 🔴) |
| Total Steps | 1,351 | 1,422 (-5.3% 🟢) | 1,385 (-2.5% 🟢) | 2,027 (-50.0% 🟢) |
| Mean Steps/Instance | 15.18 | 15.98 (-5.3% 🟢) | 15.56 (-2.5% 🟢) | 22.78 (-50.1% 🟢) |

Legend: Indicators are from the candidate's perspective. 🟢 = candidate is better than this baseline. 🔴 = candidate is worse. ➖ = no meaningful difference.
Good for candidate: higher resolved rate, fewer tokens, higher cache rate, fewer steps.

BASELINE 2 Step 4 — heavy pip install in constrained environment:

run_in_terminal: pip install torch numpy pillow --quiet 2>&1 | tail -5

→ Installation fails; baseline never recovers.

CANDIDATE Step 5 — immediate stdlib pivot after first import failure:

MSG: "The container is missing both PyTorch and Pillow, so I'm switching to lower-level
inspection: checking whether the checkpoint is a zip-based PyTorch archive."
CMD: cd /app && file model.pth image.png && python - <<'PY'
import zipfile
for path in ['model.pth']:
    print(path, 'is_zip', zipfile.is_zipfile(path))
PY
// Result: resolved in 17 steps vs BASELINE 2's failure.


eval-agent msbench instance analyze 63914249566137 --instances terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1,gcode-to-text,terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1,terminalbench2.eval.x86_64.install-windows-3.11:msbench-0.1.1,custom-memory-heap-crash,<REDACTED: Generic Secret> --custom-instructions "Identify instances where the candidate successfully pivots from a blocked or unavailable primary approach to a working alternative, and compare against baselines that either stuck with the failing approach, gave up, or refused the task."


**Instances**: terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1, gcode-to-text, terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1, terminalbench2.eval.x86_64.install-windows-3.11:msbench-0.1.1, custom-memory-heap-crash, <REDACTED: Generic Secret> financial-document-processor, terminalbench2.eval.x86_64.largest-eigenval:msbench-0.1.1

---

## Appendix

<details>
<summary>Additional Patterns</summary>

**Weakness: False-positive safety refusals on security research tasks** (strength/weakness: weakness) — The candidate refuses legitimate CTF/security-evaluation tasks without inspecting any files, issuing zero tool calls, while baselines engage and solve them. — 2 instances: break-filter-js-from-html (refused XSS filter bypass task entirely), terminalbench2.eval.x86_64.extract-moves-from-video:msbench-0.1.1 (BASELINE 3 refused; candidate correctly engaged)

**Weakness: Self-inflicted tool/file management conflicts** (weakness) — The candidate creates files via terminal heredocs then immediately attempts IDE tool operations on the same path, triggering "file already exists" errors; or uses patch tool on stale file context causing "invalid context" failures requiring re-read cycles — 4 instances: circuit-fibsqrt (3-step file creation conflict loop), caffe-cifar-10 (failed patch application requiring re-read), custom-memory-heap-crash (Valgrind flag typo requiring retry), configure-git-webserver (wrong path scope throughout)

**Weakness: Incorrect approach selection causing avoidable rebuild / re-work cascades** (weakness) — The candidate selects a heavier approach than necessary (e.g., rebuilding with a dependency instead of working around it), triggering multiple full rebuild cycles that exhaust steps — 3 instances: caffe-cifar-10 (3 full Caffe builds due to avoidable OpenCV dependency; BASELINE 1 used a Python workaround in one build), compile-compcert (installed incompatible distro Coq 8.18 before attempting opam only on the last step), qemu-alpine-ssh (never discovered `expect` tool, no ISO boot config inspection)

**Weakness: Premature convergence on task-violating shortcuts** (weakness) — The candidate identifies a clever shortcut (wrapping an existing binary, reading internal module state directly) that produces correct output but violates the task's core requirement, causing grader failure despite apparent success — 2 instances: path-tracing (C wrapper that `execl()`s the pre-existing `orig` binary instead of implementing ray tracing), model-extraction-relu-logits (reads `module.A1` directly instead of using `forward()` oracle queries)

**Weakness: Redundant verification passes after confirmed success** (weakness) — After grader-equivalent validation already passes, the candidate re-runs the same full benchmark or pipeline a second time for cosmetic or confidence reasons, wasting steps — 3 instances: hf-model-inference (4-step refactoring cycle to fix non-critical deprecation warning after service verified working), terminalbench2.eval.x86_64.distribution-search:msbench-0.1.1 (two nearly-identical validation scripts in consecutive steps after solution saved), reshard-c4-data (re-runs full compress pipeline after byte-identical round-trip confirmed)

**Strength: Thorough end-to-end verification that baselines skip** (strength) — The candidate performs deeper functional verification than baselines: running actual RPC calls (not just port checks), interactive debugger sessions, or byte-exact binary comparisons — 5 instances: build-pmars (launched pMARS in debugger mode and sent real interactive keystrokes), kv-store-grpc (inline gRPC client calling both SetVal and GetVal vs. BASELINE 2's port-only check), feal-linear-cryptanalysis (compiled provided C decryptor and used `cmp -s` for byte-exact cross-check), path-tracing-reverse (confirmed byte-for-byte pixel parity in exactly two compile rounds), vulnerable-secret (dual static XOR-decode + runtime overflow-trigger verification)

**Strength: Proactive full-surface sweeps before patching** (strength) — Before making any fixes, the candidate runs comprehensive searches across the full problem surface (all deprecated aliases, all credential patterns), preventing whack-a-mole failures — 3 instances: build-cython-ext (3 parallel greps covering full deprecated NumPy alias surface before any fixes; BASELINE 3 hit repeated single-alias fix cycles across 31 steps), sanitize-git-repo (dual-layer parallel credential scan — known token formats + broad key/secret patterns — before any edits), count-dataset-tokens (parallel Counter checks on actual domain distribution before writing tokenization code; BASELINE 2 used wrong filter and failed)

**Strength: Structured software engineering discipline on complex iterative tasks** (strength) — On tasks requiring repeated generation-test-fix cycles, the candidate creates persistent script files with targeted patches rather than re-pasting full heredocs, and uses static error checking before compilation — 2 instances: terminalbench2.eval.x86_64.regex-chess:msbench-0.1.1 (persistent `generate_re.py` with 9 surgical `apply_patch` calls + `get_errors` before each run; BASELINE 1 pasted entirely new 200-line inline scripts on every iteration), terminalbench2.eval.x86_64.pytorch-model-cli:msbench-0.1.1 (`get_errors` called before `g++` invocation; correct output format on first attempt with no trailing newline fix needed)

**Strength: Workspace hygiene — cleanup of temporary artifacts** (strength) — After verification, the candidate explicitly removes temporary files, comparison artifacts, and local dependency directories, leaving a clean workspace — 4 instances: chess-best-move (removed `.pydeps` directory after verification), path-tracing-reverse (removed `cand.ppm`, `target.ppm`, `image.png`, reversed binary after confirming exact match), feal-linear-cryptanalysis (removed `plaintexts_from_c.txt` after `cmp -s` match), terminalbench2.eval.x86_64.regex-chess:msbench-0.1.1 (deleted `generate_re.py` and `validate_random.py` after producing deliverable)

</details>

<details>
<summary>Extraction commands</summary>

msbench-cli extract --run-id 63914249566137 --output out/63914249566137 --backend ces-dev1

msbench-cli extract --run-id 25768037843 --output out/25768037843 --backend ces-dev1

msbench-cli extract --run-id 25775688375 --output out/25775688375 --backend ces-dev1

msbench-cli extract --run-id 25761410922 --output out/25761410922 --backend ces-dev1


</details>
</details>
