feat(llm): preserve reasoning blocks across Anthropic and Gemini turns (#200) #205
Phase 1 of open-multi-agent#200 — close the protocol gap where multi-turn extended thinking silently drops the signatures Anthropic and Gemini 3 require.

IR additions (src/types.ts):
- ReasoningBlock.signature / redactedData
- ToolUseBlock.signature
- AgentConfig.thinking: { enabled, budgetTokens?, effort? }

Anthropic adapter:
- Extract signature from thinking blocks; surface redacted_thinking as ReasoningBlock with redactedData; echo both back unchanged on subsequent turns; forward thinking: { type: 'enabled', budget_tokens } on chat() and stream(). Default budget 1024 (SDK-enforced minimum).

Gemini adapter:
- Extract thoughtSignature from the Part top level (verified against the @google/genai Part schema — NOT nested under functionCall); echo it back when present on both functionCall and thought-summary text Parts; forward thinkingConfig: { includeThoughts, thinkingBudget }. Unsigned reasoning is dropped on outgoing to keep context lean.

Bedrock:
- Comment refresh only — the IR now carries the signature fields, but adapter-side preservation is tracked as a follow-up.

Tests: +23 across the anthropic-adapter and gemini-adapter-contract suites, covering incoming extraction, outgoing round-trip, request-param forwarding, and the unsigned-reasoning drop path.

Deferred to follow-up PRs (per owner's reply on open-multi-agent#200):
- OpenAI reasoning_effort forwarding
- Anthropic adaptive thinking (Claude 4.7+)
- Anthropic 4.5 interleaved-thinking-2025-05-14 beta header detection
- Bedrock signature wire-up
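Roughly, the IR additions amount to the following (a sketch — the new fields come from the message above; the surrounding block shapes are simplified assumptions, not the real src/types.ts):

```typescript
// Sketch of the IR additions. Only signature / redactedData / thinking are
// from the PR description; the rest of each shape is an assumption.
interface ReasoningBlock {
  type: 'reasoning';
  text: string;
  /** Anthropic thinking.signature / Gemini thoughtSignature — echoed back verbatim. */
  signature?: string;
  /** Anthropic redacted_thinking.data — opaque; must round-trip unchanged. */
  redactedData?: string;
}

interface ToolUseBlock {
  type: 'tool_use';
  id: string;
  name: string;
  input: unknown;
  /** Gemini thoughtSignature on functionCall Parts. */
  signature?: string;
}

interface ThinkingConfig {
  enabled: boolean;
  budgetTokens?: number;
  effort?: 'low' | 'medium' | 'high'; // union as later narrowed in this PR series
}

interface AgentConfig {
  model: string;
  thinking?: ThinkingConfig;
}
```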
Rest looks right for the phase-1 scope.
Per @JackChen-me's review on open-multi-agent#205: `toAnthropicThinkingParam` defaulted budget_tokens to 1024 without considering maxTokens, so any caller passing `thinking: { enabled: true }` with `maxTokens <= 1024` (or any explicit budgetTokens >= maxTokens) would hit a runtime 400 from Anthropic.

Validate both constraints before the request:
- budget_tokens >= 1024 (SDK-documented minimum)
- budget_tokens < max_tokens (docs: "budget_tokens must be set to a value less than max_tokens"; the interleaved-thinking exception is not yet wired up — see RFC open-multi-agent#200 phase 2)

Throws with a clear message identifying which constraint failed and how to fix it, rather than letting Anthropic's cryptic 400 propagate. Threaded effectiveMaxTokens (= options.maxTokens ?? 4096) into both chat() and stream() call sites so the validator sees the same value the request will use.

+3 tests covering the two new throw paths and the regression scenario from the review (default budget 1024 + maxTokens 1024). Existing tests that relied on the chatOpts() default maxTokens=1024 were updated to specify a larger maxTokens so they exercise the happy path. 759/759 passing.
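A sketch of what that validation might look like (the helper name comes from the thread above; the exact signature and error wording here are assumptions):

```typescript
// Sketch of the budget validation described above. Not the real adapter code;
// ThinkingConfig and the return shape are simplified assumptions.
const MIN_BUDGET_TOKENS = 1024; // SDK-documented minimum

interface ThinkingConfig {
  enabled: boolean;
  budgetTokens?: number;
}

function toAnthropicThinkingParam(
  thinking: ThinkingConfig | undefined,
  effectiveMaxTokens: number, // = options.maxTokens ?? 4096 at the call sites
): { type: 'enabled'; budget_tokens: number } | undefined {
  if (!thinking?.enabled) return undefined;
  const budget = thinking.budgetTokens ?? MIN_BUDGET_TOKENS;
  if (budget < MIN_BUDGET_TOKENS) {
    throw new Error(
      `thinking.budgetTokens (${budget}) is below Anthropic's minimum of ${MIN_BUDGET_TOKENS}; raise budgetTokens.`,
    );
  }
  if (budget >= effectiveMaxTokens) {
    throw new Error(
      `thinking.budgetTokens (${budget}) must be less than maxTokens (${effectiveMaxTokens}); raise maxTokens or lower budgetTokens.`,
    );
  }
  return { type: 'enabled', budget_tokens: budget };
}
```

Note how the regression scenario from the review (default budget 1024 with maxTokens 1024) now fails fast with an actionable message instead of a provider-side 400.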
Good catch, you're right. Verified against the docs. Pushed.
Went with throw rather than clamp because OMA's pattern is explicit-over-magic and silently changing a user-specified
…ch openai
The Copilot and Azure-OpenAI adapters were quietly dropping seven
AgentConfig fields that openai.ts has always forwarded:
frequency_penalty, presence_penalty, top_p, top_k, min_p,
parallel_tool_calls, extraBody (spread)
A user setting `topP: 0.9` or `extraBody: { logit_bias: {...} }` on a
Copilot/Azure agent would see no effect on the wire, with no warning —
the AgentConfig accepted the field, the adapter silently ignored it.
Mirror the openai.ts param object exactly: same field set, same ordering
(sampling first, ...extraBody, then structural fields so extraBody can
override sampling defaults but cannot clobber model/messages/tools/stream).
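The ordering contract can be sketched as follows (illustrative shapes — `buildParams` and `Options` are stand-ins for the adapters' real param construction, not actual source):

```typescript
// Sketch of the field-ordering contract: sampling params first, extraBody
// spread in the middle, structural fields last. extraBody can therefore
// override sampling defaults but can never clobber model/messages/stream.
interface Options {
  model: string;
  temperature?: number;
  topP?: number;
  extraBody?: Record<string, unknown>;
}

function buildParams(options: Options, messages: unknown[]) {
  return {
    // Sampling params — overridable by extraBody below.
    temperature: options.temperature,
    top_p: options.topP,
    ...options.extraBody,
    // Structural fields — assigned after the spread, so they always win.
    model: options.model,
    messages,
    stream: false,
  };
}
```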
This was discovered while tracing the reasoning_effort change — it's not
strictly part of open-multi-agent#200, but the Copilot/Azure adapters were already in
this PR's surface area for the reasoning_effort fix and the silent-drop
behaviour is the same bug class as the budget-vs-maxTokens issue you
caught on open-multi-agent#205. Folding the parity fix in while the diff is open.
+6 tests (3 per adapter):
- forwards full sampling-param set + extraBody
- extraBody overrides sampling params (field-ordering contract)
- extraBody cannot override structural fields (model/stream)
784/784 passing.
…m parity in copilot/azure (#200) (#209)

* feat(openai): forward thinking.effort as reasoning_effort (#200)

Phase-1 follow-up to #205. Owner explicitly carved this out as separate work in their RFC #200 reply: "OpenAI reasoning_effort is worth adding, but keep it separate from the phase-1 protocol fix." The IR field `AgentConfig.thinking.effort` was added in #205 but only wired up for Anthropic / Gemini. This PR adds the OpenAI side.

Changes:
- openai.ts chat() and stream(): one-line `reasoning_effort: options.thinking?.effort` added before `...options.extraBody` so extraBody can override (consistent with how the other sampling params are ordered).
- types.ts AgentConfig.thinking JSDoc: refreshed to document the per-provider mapping now that OpenAI is wired up.

Design notes:
- The `'minimal'` effort value (gpt-5) isn't yet in the SDK 4.104 type union (`'low' | 'medium' | 'high' | null`), but the existing `as ChatCompletionCreateParamsNonStreaming` cast at the bottom of the params object already covers this, consistent with how `top_k` / `min_p` are forwarded. A test verifies pass-through.
- `thinking.enabled` and `thinking.budgetTokens` are intentionally ignored by the OpenAI adapter — OpenAI's reasoning surface is qualitative (`effort`) only, and the explicit-effort design was the owner's open-question resolution in #200.
- Scope deliberately tight: openai.ts only. Other OpenAI-compatible adapters (copilot, azure-openai, deepseek, grok, minimax, qiniu) are unchanged — they each route to providers with their own reasoning semantics, so a separate PR per adapter is the right unit.

+6 tests covering: chat forwarding, stream forwarding, gpt-5 'minimal' pass-through, omission when thinking absent, omission when effort unset (with budgetTokens set, to verify enabled/budgetTokens are correctly ignored), extraBody override of effort. 770/770 passing.
* feat(copilot, azure-openai): forward thinking.effort as reasoning_effort

Both adapters implement LLMAdapter directly (rather than extending OpenAIAdapter like grok/minimax/deepseek/qiniu do), so they didn't pick up the reasoning_effort forwarding from the openai.ts change in this PR. Mirror the same pattern to close the gap:
- Add `reasoning_effort: options.thinking?.effort` to chat() and stream() on both adapters.
- Add `as ChatCompletionCreateParams{NonStreaming,Streaming}` casts on the params objects (these didn't exist before because neither adapter forwarded any param outside the SDK's declared union — the cast is needed to pass the IR's `'minimal'` value through, same as openai.ts).

+8 tests across the copilot-adapter and azure-openai-adapter suites: chat forwarding, stream forwarding, gpt-5 'minimal' pass-through, omission when thinking absent / effort unset. 778/778 passing.

* fix(copilot, azure-openai): backfill sampling-param forwarding to match openai

* fix(copilot, azure-openai): drop top_k / min_p — vLLM-only, not accepted by these endpoints

Mid-PR correction. The previous "mirror openai.ts exactly" reasoning was sloppy: openai.ts forwards `top_k` and `min_p` deliberately for the local-server case (`provider: 'openai', baseURL: 'http://localhost:8000'` pointing at vLLM), but Copilot and Azure OpenAI both have hard-coded cloud endpoints with no equivalent escape hatch.

Verified against real docs:
- The Azure OpenAI Chat Completions reference lists `frequency_penalty` / `presence_penalty` / `top_p` / `parallel_tool_calls` / `logit_bias`, but explicitly does NOT include `top_k` or `min_p`.
- The vLLM docs confirm `top_k` / `min_p` are vLLM-only sampling extensions, not part of the OpenAI Chat Completions spec.
- Copilot uses `api.githubcopilot.com` — also a fixed cloud proxy that normalises to standard OpenAI Chat Completions.

So forwarding these as top-level fields on Copilot / Azure would be dead weight at best and a 400 at worst. Users who genuinely need them on either adapter can still pass them via `extraBody` (e.g. for a local proxy that re-routes to vLLM).

Removed from the copilot.ts and azure-openai.ts param objects in both chat() and stream(); updated the cast comments to drop the `top_k` / `min_p` mention; replaced the "forwards full set" tests with two narrower tests each: (a) forwards the OpenAI-cloud-compatible subset, (b) does NOT forward `top_k` / `min_p` (regression guard). 786/786 passing.

* docs(copilot): disclose that the API spec is reverse-engineered, not official

After verifying my assumptions per-field (per @MyPrototypeWhat's review push), confirmed:
- Azure OpenAI: the docs explicitly list frequency_penalty / presence_penalty / top_p / parallel_tool_calls (verbatim quotes from learn.microsoft.com).
- Copilot: GitHub doesn't publish a public chat/completions API reference. The available references are community-reverse-engineered (e.g. Alorse/copilot-to-api), and the documented request examples only show messages/model/max_tokens/temperature/stream/tools/tool_choice — the four sampling fields aren't called out.

Forwarding them anyway because:
1. Copilot's design promise is OpenAI Chat Completions compatibility.
2. The four fields are standard OpenAI fields, so the proxy promise should cover them.
3. Per-model behaviour (some routed reasoning models may ignore them) is out of scope for the adapter — the same situation as openai.ts forwarding topP to OpenAI's o-series, which the model itself ignores.
4. Removing them would re-introduce the silent-drop bug for users whose models DO honour the fields.

Added a JSDoc caveat in the chat() params block so future maintainers see the verification gap and don't mistake the forwarding for a verified contract.

* revert(copilot): keep param surface narrow — no public spec for the 4 sampling fields

Per @MyPrototypeWhat's review: "既然没有公开的证据证明,那就保持原状呗?" ("If there's no public evidence, just keep the original state, right?")

Agreed — Copilot's API isn't publicly documented by GitHub, and the reverse-engineered references only show model / messages / max_tokens / temperature / tools / stream / tool_choice. Forwarding additional sampling fields (frequency_penalty, presence_penalty, top_p, parallel_tool_calls) and extraBody on the assumption that "Copilot is OpenAI-compatible" was a guess, not evidence.

Reverting the Copilot side of the sampling-param backfill from this PR. The Copilot adapter now stays at its pre-PR field surface PLUS the new `reasoning_effort` (which is the actual #200 work, verified against the SDK's typed union).
Replaced the "forwards full set" + "does NOT forward top_k/min_p" tests with one consolidated "does NOT forward unverified fields" regression test, plus a positive test that the documented subset (temperature / max_tokens / tools) still works.

Azure-OpenAI side is unchanged — Microsoft Learn explicitly documents frequency_penalty / presence_penalty / top_p / parallel_tool_calls / logit_bias on the Azure OpenAI Chat Completions endpoint, so that backfill stays. 784/784 passing.

* refactor(copilot, azure-openai): replace object-level cast with field-level cast on reasoning_effort

The object-level `as ChatCompletionCreateParams{NonStreaming,Streaming}` cast was originally added to cover the IR's `'minimal'` reasoning-effort value (gpt-5), which isn't in SDK 4.104's type union. After the last commit's revert, that's the ONLY remaining type mismatch in either adapter — every other field they forward is already in the SDK union. A blanket object-level cast is overkill for a single field.

Replace it with a narrow field-level `as 'low' | 'medium' | 'high' | null | undefined` on the `reasoning_effort` line itself. Side benefit: TypeScript still type-checks the rest of the params object, so a future typo / wrong-type field would now be caught at compile time instead of slipping past the broad cast.

(Note: openai.ts keeps its object-level cast because it forwards `top_k` and `min_p` — those are field NAMES not in the SDK type at all, not just unsupported values, so a field-level cast doesn't help.)

Removed the now-unused `ChatCompletionCreateParamsNonStreaming` and `ChatCompletionCreateParamsStreaming` imports from both files. 784/784 passing. Lint clean.
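The field-level cast described above could look something like this (a sketch — `buildParams` and the type aliases are illustrative stand-ins, not the real adapter code):

```typescript
// Sketch of a field-level cast replacing an object-level one. SDKEffort
// mirrors the SDK 4.104 union described above; IREffort is a hypothetical
// wider IR union that (at this point in the PR) still includes 'minimal'.
type SDKEffort = 'low' | 'medium' | 'high' | null | undefined;
type IREffort = 'low' | 'medium' | 'high' | 'minimal' | undefined;

function buildParams(effort: IREffort, model: string) {
  return {
    model, // still type-checked: a wrong-typed field here fails to compile
    // Only this one line opts out of the SDK union check.
    reasoning_effort: effort as SDKEffort,
  };
}
```

With an object-level cast, a typo anywhere in the params object would slip past the compiler; narrowing the cast to the single mismatched field keeps everything else checked.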
* docs(azure-openai): trim verbose comments — keep only the WHY, drop the verification process

The previous comment block leaked PR-review process into source code:
- "Mirrors the field ordering contract in openai.ts" — decorative pointer.
- "(Confirmed against the Azure OpenAI Chat Completions API reference, which lists frequency_penalty/presence_penalty/top_p/parallel_tool_calls but not top_k or min_p.)" — verification work; belongs in the commit/PR description, not in long-lived source.

Kept the two genuinely load-bearing parts:
- The field-ordering rationale (sampling first → extraBody → structural) — without this, future contributors will reorder.
- A single-line note on why top_k/min_p are absent — defends against future "you forgot these" PRs.

* docs: trim verbose comments per CLAUDE.md ("only when WHY is non-obvious")

After @MyPrototypeWhat's audit ("是否必须的?" — "Is this necessary?"), trimmed three categories:
1. Copilot chat() narrow-surface comment: 9 lines → 3 lines. The explicit list of forwarded/not-forwarded fields was redundant with the source code immediately below it. Kept the load-bearing WHY ("Copilot's API isn't publicly documented").
2. Three test-rationale comments removed entirely (4-6 lines each). The test names already convey both WHAT and WHY:
   - "forwards the OpenAI-cloud-compatible sampling params and extraBody"
   - "does NOT forward vLLM-only top_k / min_p (Azure runs MS-hosted OpenAI models)"
   - "does NOT forward sampling params or extraBody that lack public Copilot spec coverage"

The verification process and bug history live in PR #209's description and the per-commit messages, not in long-lived test source. Net -21 lines. 784/784 still passing. Lint clean.

* refactor(thinking): drop 'minimal' from IR — route via extraBody (per @JackChen-me)

Per owner's call on #209: option 2.
Narrow `ThinkingConfig.effort` to the SDK 4.104-declared values (`'low' | 'medium' | 'high'`) and document that newer API values (gpt-5 `'minimal'`, GPT-5.5 `'none'`) travel via `extraBody: { reasoning_effort: '<value>' }` — the same escape hatch already used for the vLLM-only `top_k` / `min_p`.

Cleanup:
- src/types.ts: ThinkingConfig.effort union narrowed; JSDoc points users to extraBody for newer SDK-untyped values.
- src/llm/openai.ts: object-level cast comment trimmed (no longer needs to mention 'minimal'; the cast is still required for top_k/min_p/extraBody).
- src/llm/copilot.ts + src/llm/azure-openai.ts: field-level cast on `reasoning_effort` removed entirely. The IR type now satisfies the SDK type union, so no boundary cast is needed.
- tests/openai-adapter.test.ts + tests/azure-openai-adapter.test.ts: rewrote the 'minimal' tests to verify the extraBody route (the now-supported path) instead of the old IR-direct route.
- tests/copilot-adapter.test.ts: deleted the 'minimal' test outright. Copilot doesn't forward extraBody (narrow surface, no public spec), so neither route exists for Copilot users wanting 'minimal'. They use 'low' instead, or wait for an SDK bump.

783/783 passing. Lint clean. Net -7 lines. No behaviour change for users on `'low' | 'medium' | 'high'` (the entire SDK-typed surface).
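The extraBody escape hatch reads roughly like this (a sketch — `buildParams` and the options shape are illustrative, not the real openai.ts):

```typescript
// Sketch: the typed effort is forwarded first and extraBody is spread after
// it, so an SDK-untyped value like 'minimal' can ride in via extraBody and
// override the typed field.
interface ChatOptions {
  thinking?: { effort?: 'low' | 'medium' | 'high' };
  extraBody?: Record<string, unknown>;
}

function buildParams(options: ChatOptions) {
  return {
    reasoning_effort: options.thinking?.effort,
    ...options.extraBody, // e.g. { reasoning_effort: 'minimal' } wins here
  };
}
```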
Summary
Phase 1 of #200 — close the protocol gap where multi-turn extended thinking silently drops the signatures Anthropic and Gemini 3 require to continue a conversation. Scope follows the owner's approval comment verbatim.
IR changes (src/types.ts)
- `ReasoningBlock.signature` — Anthropic `thinking.signature` and Gemini 3 thought-summary `thoughtSignature`. Required to echo back unchanged on Anthropic; recommended on Gemini 3.
- `ReasoningBlock.redactedData` — Anthropic `redacted_thinking.data`. The block must still be echoed back on subsequent turns.
- `ToolUseBlock.signature` — Gemini `thoughtSignature` for functionCall Parts. Required by Gemini 3+ ("required for function calling signatures" per the Gemini docs).
- `AgentConfig.thinking: { enabled, budgetTokens?, effort? }` — explicit config. `effort` is carried separately from `budgetTokens` per the owner's open-question resolution.

Adapter changes
- `src/llm/anthropic.ts` — extract `signature` from `thinking` blocks; new `redacted_thinking` branch on incoming; outgoing `toAnthropicContentBlockParam` handles reasoning blocks (`signature` → `thinking` block param, `redactedData` → `redacted_thinking` block param, neither → drop). New `toAnthropicThinkingParam` helper forwards `thinking: { type: 'enabled', budget_tokens }` on both `chat()` and `stream()`. Default budget 1024 (SDK comment: must be ≥ 1024 and less than max_tokens).
- `src/llm/gemini.ts` — `thoughtSignature` is a top-level `Part` field per the `@google/genai` `Part` schema (verified at `genai.d.ts:8675`), NOT nested under `functionCall`. Extracted on both functionCall and thought-summary text Parts (Gemini 3 may sign both); echoed back on outgoing when present, dropped when unsigned to keep context lean. New `toGeminiThinkingConfig` helper forwards `thinkingConfig: { includeThoughts, thinkingBudget }`. A `fromGeminiPart` helper was extracted so `chat()` and `stream()` share one parsing implementation.
- `src/llm/bedrock.ts` — comment refresh only. The IR now carries the signature fields the previous comment said were missing; the Bedrock-side wire-up to `reasoningContent.reasoningText.signature` is tracked as a follow-up.
- `src/agent/agent.ts` + `src/agent/runner.ts` — one-line wire-ups to thread `thinking` from `AgentConfig` through `RunnerOptions` into `LLMChatOptions`.

Tests
+23 across `tests/anthropic-adapter.test.ts` (+12) and `tests/gemini-adapter-contract.test.ts` (+11). 756/756 passing (baseline 733 + 23 new). `npm run lint` clean for this PR (3 pre-existing `bedrock.ts` errors come from a missing optional peer dep, `@aws-sdk/client-bedrock-runtime` — unrelated and present on `main`).

Cross-provider asymmetry to flag
Important detail captured in the IR JSDoc: the same `signature` field name carries different protocol semantics depending on origin — Anthropic signs `ReasoningBlock`s (required round-trip), Gemini signs `Part`s containing functionCalls (required) or text + `thought: true` summaries (recommended). The IR doesn't try to model the difference; the adapters know how to interpret their own `signature`s.

Deferred to follow-up PRs (per owner)
- OpenAI `reasoning_effort` forwarding (explicit `effort` on `ThinkingConfig`; owner asked to keep it separate from this PR).
- Anthropic adaptive thinking (Claude 4.7+ uses `{ type: 'adaptive' }` and rejects `enabled` with HTTP 400). JSDoc on `toAnthropicThinkingParam` documents the constraint.
- Anthropic 4.5 `interleaved-thinking-2025-05-14` beta header detection.

Streaming
`signature_delta` accumulation is intentionally NOT separate work — the existing `stream.finalMessage()` path returns the assembled SDK content with signatures intact, so the new `fromAnthropicContentBlock` extraction handles streaming "for free" (covered by the "extracts signature from streamed final message" test).

Test plan
- `npm test` — 756/756 (baseline 733 + 23 new)
- `npm run lint` — no new errors (3 pre-existing bedrock-related errors confirmed via stash)
- Verified against platform.claude.com/docs/en/docs/build-with-claude/extended-thinking and the SDK type definitions in `@anthropic-ai/[email protected]`
- Verified against ai.google.dev/gemini-api/docs/thinking and the `@google/genai` `Part` schema (verified `thoughtSignature` is top-level on Part, not nested)
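For reference, the Part contract verified above sketches out like this (the top-level `thoughtSignature` placement follows the verification notes; the IR block shapes and the `fromGeminiPart` body are simplified assumptions, not the real src/llm/gemini.ts):

```typescript
// Sketch of Part parsing with thoughtSignature at the Part top level.
interface GeminiPart {
  text?: string;
  thought?: boolean; // true on thought-summary text Parts
  functionCall?: { name: string; args: Record<string, unknown> };
  thoughtSignature?: string; // top-level on Part, NOT nested under functionCall
}

type IRBlock =
  | { type: 'reasoning'; text: string; signature?: string }
  | { type: 'tool_use'; name: string; input: unknown; signature?: string }
  | { type: 'text'; text: string };

function fromGeminiPart(part: GeminiPart): IRBlock {
  if (part.functionCall) {
    return {
      type: 'tool_use',
      name: part.functionCall.name,
      input: part.functionCall.args,
      signature: part.thoughtSignature, // echoed back on outgoing when present
    };
  }
  if (part.thought) {
    return { type: 'reasoning', text: part.text ?? '', signature: part.thoughtSignature };
  }
  return { type: 'text', text: part.text ?? '' };
}
```

A shared helper like this is what lets chat() and stream() use one parsing implementation, per the adapter-changes note above.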