feat: task-level retry with exponential backoff #37
Conversation
Add `maxRetries`, `retryDelayMs`, and `retryBackoff` to task config. When a task fails and retries remain, the orchestrator waits with exponential backoff and re-runs the task with a fresh agent conversation. A `task_retry` event is emitted via `onProgress` for observability. Cascade failure only occurs after all retries are exhausted. Closes #30
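For illustration, a minimal sketch of consuming the new event, assuming a hypothetical `TaskRetryEvent` shape: the `attempt`, `maxAttempts`, `error`, and `nextDelayMs` fields match the `onRetry` callback in the diff below, while `type` and `taskId` are assumptions.

```ts
// Sketch of observing retries via onProgress. Payload fields follow the
// onRetry callback in the diff; `type` and `taskId` are assumed.
type TaskRetryEvent = {
  type: 'task_retry'
  taskId: string
  attempt: number
  maxAttempts: number
  error: string
  nextDelayMs: number
}

function onProgress(event: TaskRetryEvent | { type: string }): void {
  if (event.type === 'task_retry') {
    const e = event as TaskRetryEvent
    console.log(
      `task ${e.taskId}: attempt ${e.attempt}/${e.maxAttempts} failed (${e.error}); retrying in ${e.nextDelayMs}ms`,
    )
  }
}
```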
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7714e6189
```ts
const sharedMem = team.getSharedMemoryInstance()
if (sharedMem) {
  await sharedMem.write(assignee, `task:${task.id}:result`, result.output)

const maxAttempts = (task.maxRetries ?? 0) + 1
```
Validate retry count before deriving attempt loop
The new retry loop computes `maxAttempts` as `(task.maxRetries ?? 0) + 1` without bounding the input, so any negative `maxRetries` (from `runTasks` input or parsed specs) yields `maxAttempts <= 0` and the `for` loop never executes. Because the task was already set to `in_progress`, it is never completed or failed, and `executeQueue` can exit with a silently unexecuted task. Please clamp/validate `maxRetries` to a non-negative integer (or fail fast) before entering the loop.
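A minimal sketch of the requested clamp (the `clampAttempts` helper name is hypothetical, not a name from the PR):

```ts
// Clamp before deriving the attempt count so the loop always runs at least once.
function clampAttempts(maxRetries?: number): number {
  return Math.max(0, maxRetries ?? 0) + 1 // negative or missing => 1 attempt
}
```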
```ts
maxRetries: spec.maxRetries,
retryDelayMs: spec.retryDelayMs,
retryBackoff: spec.retryBackoff,
```
Parse retry fields from coordinator task specs
The `runTeam` path now threads `spec.maxRetries`, `spec.retryDelayMs`, and `spec.retryBackoff` into created tasks, but `parseTaskSpecs()` still only copies `title`/`description`/`assignee`/`dependsOn`, so these retry fields remain `undefined` for coordinator-generated tasks. In practice, retry settings emitted by the coordinator are dropped, so task-level retries only work via `runTasks` and not via `runTeam` as intended.
By design — retry config is a user-facing option for runTasks(). The coordinator in runTeam() is not expected to generate retry fields, so parseTaskSpecs() intentionally omits them.
- Extract `executeWithRetry()` as a testable exported function
- Add `computeRetryDelay()` with 30s max cap (prevents runaway backoff; sketched below)
- Remove retry fields from `ParsedTaskSpec` (dead code for runTeam path)
- Deduplicate retry event emission (single code path for both error types)
- Injectable delay function for test determinism
- Rewrite tests to call the real `executeWithRetry`, not a copy
- 15 tests covering: success, retry+success, retry+failure, backoff calculation, delay cap, delay function injection, no-retry default
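A sketch of what the capped backoff could look like; the `computeRetryDelay()` signature is an assumption, only the `delay * backoff^attempt` formula and the 30s cap come from the notes above.

```ts
// Exponential backoff, capped at 30s to prevent runaway delays.
const MAX_RETRY_DELAY_MS = 30_000

function computeRetryDelay(baseDelayMs: number, backoff: number, attempt: number): number {
  // delay * backoff^attempt grows geometrically; Math.min enforces the cap
  return Math.min(baseDelayMs * Math.pow(backoff, attempt), MAX_RETRY_DELAY_MS)
}

// e.g. base 1000ms, backoff 2: 1s, 2s, 4s, 8s, 16s, 30s (capped), 30s, ...
```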
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 972f9f0c74
```ts
const result = await run()
if (result.success) {
  return result
}
```
Aggregate token usage across retry attempts
When a task succeeds after one or more failed attempts, `executeWithRetry` returns only the final attempt result here, so token usage (and other execution metadata) from earlier attempts is discarded. Since `executeQueue` stores this single result in `ctx.agentResults` and `buildTeamRunResult()` sums `result.tokenUsage`, runs with retries underreport actual model consumption, which can skew billing/quota tracking and observability.
Fixed in d9b20c0 — executeWithRetry now accumulates totalUsage across all attempts and returns it in the final result.
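A sketch of that accumulation under assumed types; the result and usage shapes (`inputTokens`/`outputTokens`) are not in the diff and are illustrative only.

```ts
// Sum usage across all attempts, not just the final one.
interface TokenUsage { inputTokens: number; outputTokens: number }
interface AgentRunResult { success: boolean; output: string; tokenUsage: TokenUsage }

async function runWithAccumulatedUsage(
  run: () => Promise<AgentRunResult>,
  maxAttempts: number, // assumed >= 1 (clamped upstream)
): Promise<AgentRunResult> {
  const totalUsage: TokenUsage = { inputTokens: 0, outputTokens: 0 }
  let last!: AgentRunResult
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await run()
    // Count every attempt's tokens, including failed attempts.
    totalUsage.inputTokens += last.tokenUsage.inputTokens
    totalUsage.outputTokens += last.tokenUsage.outputTokens
    if (last.success) break
  }
  return { ...last, tokenUsage: totalUsage }
}
```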
- `maxRetries` clamped to >= 0 (negative values treated as no retry)
- `retryBackoff` clamped to >= 1 (prevents zero/negative delay oscillation)
- `retryDelayMs` clamped to >= 0
- Add tests for negative maxRetries and negative backoff

Addresses Codex review P1 on #37
Previously only the final attempt's `tokenUsage` was returned, causing under-reporting of actual model consumption when retries occurred. Now all attempts' token counts are summed in the returned result.

Addresses Codex review P2 (token usage) on #37
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63e1f7068a
```ts
  onRetry?: (data: { attempt: number; maxAttempts: number; error: string; nextDelayMs: number }) => void,
  delayFn: (ms: number) => Promise<void> = sleep,
): Promise<AgentRunResult> {
  const maxAttempts = Math.max(0, task.maxRetries ?? 0) + 1
```
Guard maxRetries against non-finite values
`executeWithRetry` derives `maxAttempts` directly from `task.maxRetries`, but `Math.max(0, task.maxRetries ?? 0)` does not sanitize `Infinity`/`NaN`. If a caller passes `maxRetries: Infinity` (possible via `runTasks` since this is plain JS input), the `for (attempt <= maxAttempts)` loop becomes effectively unbounded and the task never exits retrying, stalling `executeQueue`; `NaN` similarly produces a broken attempt bound. Please coerce to a finite non-negative integer before entering the loop.
Fixed in d9b20c0 — all retry fields are now guarded with Number.isFinite() before use.
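A sketch of what guarding all three fields might look like; the `sanitizeRetryConfig` helper name and the fallback defaults are assumptions, while the clamping rules come from the commits above.

```ts
// Coerce each retry field to a finite, clamped value before use.
function sanitizeRetryConfig(task: {
  maxRetries?: number
  retryDelayMs?: number
  retryBackoff?: number
}) {
  const finiteOr = (v: number | undefined, fallback: number): number =>
    typeof v === 'number' && Number.isFinite(v) ? v : fallback
  return {
    maxRetries: Math.max(0, Math.trunc(finiteOr(task.maxRetries, 0))), // no Infinity/NaN/negatives
    retryDelayMs: Math.max(0, finiteOr(task.retryDelayMs, 0)),
    retryBackoff: Math.max(1, finiteOr(task.retryBackoff, 1)), // backoff < 1 would shrink delays
  }
}
```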
Summary
Closes #30
- Add `maxRetries`, `retryDelayMs`, `retryBackoff` to task config
- Exponential backoff between attempts (`delay * backoff^attempt`)
- `task_retry` event via `onProgress` with attempt number, error, and next delay
- Works in both `runTeam()` and `runTasks()`

Usage
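A hypothetical example: the retry field names match the summary above, but the `orchestrator` instance and the exact `runTasks()` input shape are assumptions.

```ts
// Hypothetical call; only the retry field names are confirmed by the PR.
await orchestrator.runTasks([
  {
    title: 'fetch-upstream-data',
    assignee: 'researcher',
    description: 'Fetch and summarize the upstream dataset',
    maxRetries: 3,      // up to 3 retries after the first failure
    retryDelayMs: 1000, // wait 1s before the first retry
    retryBackoff: 2,    // then 2s, 4s, ... (exponential)
  },
])
```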
Files changed
- `src/types.ts`: retry fields on `Task`, add `task_retry` to `OrchestratorEvent`
- `src/task/task.ts`: `createTask()` accepts and passes through retry config
- `src/orchestrator/orchestrator.ts`: retry loop in `executeQueue()`, `sleep()` helper, plumbing in `runTasks()`/`loadSpecsIntoQueue()`
- `tests/task-retry.test.ts`: new retry test suite

Test plan
- `npm run build` — compiles clean
- `npm run lint` — type check passes
- `npm test` — 93/93 tests pass (7 new)