## Problem Statement
Over the past 3 days (Mar 27–30, 2026), the `dotnet-build-and-test` workflow for `merge_group` events has a 73% failure rate (22 failures out of 30 runs). All failures originate from integration test suites and occur across 7 different PRs regardless of the PR code changes — strong evidence of systemic flakiness rather than PR-specific bugs.
## Statistics
| Metric | Value |
|---|---|
| Total runs (3 days) | 30 |
| Failures | 22 (73%) |
| Successes | 6 (20%) |
| Cancelled | 2 (7%) |
### Per-PR Pass Rate
| PR | Total Runs | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| PR-4948 | 7 | 0 | 7 | 0% |
| PR-4665 | 5 | 0 | 5 | 0% |
| PR-4615 | 3 | 0 | 3 | 0% |
| PR-4502 | 2 | 0 | 2 | 0% |
| PR-4952 | 6 | 1 | 3 | 17% |
| PR-4915 | 2 | 1 | 1 | 50% |
| PR-4925 | 4 | 3 | 1 | 75% |
| PR-4858 | 1 | 1 | 0 | 100% |

(PR-4952's remaining 2 runs were cancelled, accounting for the 2 cancelled runs in the totals above.)
## Failing Test Suites
- `Microsoft.Agents.AI.DurableTask.IntegrationTests` — fails in ALL 7 affected PRs
- `Microsoft.Agents.AI.Hosting.AzureFunctions.IntegrationTests` — fails in a subset of PRs
## Detailed Failure Catalog
1. `ConsoleAppSamplesValidation.ReliableStreamingSampleValidationAsync` (DurableTask) ⭐ Most common
   - Error: `Not enough content before interrupt (got 0).`
   - Frequency: 5+ PRs (PR-4615, PR-4502, PR-4665, PR-4915, PR-4925)
   - Source: `ConsoleAppSamplesValidation.cs:566` → `SamplesValidationBase.cs:153`
   - Cause: The test sends a travel-planning prompt, waits for streaming content, then sends an interrupt. The LLM (gpt-5-nano) doesn't stream any content within the timeout window.
2. `ConsoleAppSamplesValidation.SingleAgentOrchestrationHITLSampleValidationAsync` (DurableTask)
   - Error: `Wasn't prompted with the second draft.` or `Wasn't prompted with the first draft.`
   - Frequency: PR-4615 (2 runs), PR-4665
   - Source: `ConsoleAppSamplesValidation.cs:243`
   - Cause: HITL sample: the AI generates content, the user rejects it, and the AI should regenerate. The draft notification doesn't arrive before the process is killed (~60s timeout).
3. `ConsoleAppSamplesValidation.LongRunningToolsSampleValidationAsync` (DurableTask)
   - Error: `Wasn't prompted with the first draft.`
   - Frequency: PR-4665
   - Cause: Same pattern as failure 2 above: the long-running tools sample doesn't produce a content draft within the expected timeframe.
4. `SamplesValidation.LongRunningToolsSampleValidationAsync` (AzureFunctions)
   - Error: `System.TimeoutException : Timeout waiting for 'Content published notification is logged'` or `Timeout waiting for 'Orchestration is requesting human feedback'`
   - Frequency: PR-4665 (2 runs), PR-4948
   - Cause: The Azure Functions version waits for specific log messages, but the orchestration doesn't reach those states within the timeout.
5. `SamplesValidation.ReliableStreamingSampleValidationAsync` (AzureFunctions)
   - Error: `TaskCanceledException : The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.`
   - Frequency: PR-4665
   - Cause: The HTTP request to the Azure Functions host times out at 100 seconds while waiting for the streaming response.
6. `ExternalClientTests.CallLongRunningFunctionToolsAsync` (DurableTask)
   - Error: `System.Threading.Tasks.TaskCanceledException : A task was canceled.`
   - Duration: exactly 1m 00s 001ms, indicating a hard 60-second timeout
   - Frequency: PR-4665
   - Cause: The 60-second CancellationToken timeout is too tight for LLM-backed function tool calls.
7. `ExternalClientTests.CallFunctionToolsAsync` (DurableTask)
   - Error: `System.Threading.Tasks.TaskCanceledException : A task was canceled.`
   - Duration: exactly 1m 00s 002ms, indicating a hard 60-second timeout
   - Frequency: PR-4665
   - Cause: Same as failure 6 above.
8. paths-filter job (non-test)
   - Error: `fatal: couldn't find remote ref gh-readonly-queue/main/pr-4952-...`
   - Frequency: 1 occurrence (PR-4952)
   - Cause: The merge-queue branch was deleted before `dorny/paths-filter@v3` could fetch it. Git race condition.
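Failures 1–7 share one shape: read a sample process's output until an expected marker appears, and fail when the marker doesn't arrive in time. A minimal sketch of that pattern follows; every name in it (`WaitForOutputAsync`, the timeout handling) is illustrative and not the repo's actual helper code.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative helper: scan a child process's stdout for an expected
// marker, or give up when the timeout elapses. Mirrors the failure
// shape above; none of these names come from the actual test code.
static async Task<bool> WaitForOutputAsync(
    StreamReader stdout, string expected, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        string? line;
        while ((line = await stdout.ReadLineAsync(cts.Token)) is not null)
        {
            if (line.Contains(expected))
                return true; // e.g. the "first draft" prompt arrived in time
        }
        return false; // process exited without producing the marker
    }
    catch (OperationCanceledException)
    {
        return false; // timed out: the case behind failures 1–5
    }
}
```

With a fixed `timeout`, any LLM latency spike past that bound turns into a hard failure, which is why the same tests pass or fail run to run.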
## Root Cause Analysis
All failures are timing/latency related. Every integration test failure falls into one of four categories:
- LLM response too slow — Azure OpenAI (gpt-5-nano, France region) not responding fast enough
- Orchestration timeout — Durable Task orchestrations don't complete within hard-coded timeouts
- HttpClient timeout — HTTP requests to local services time out
- Process lifecycle timing — Console app processes killed before producing expected output
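The 100-second figure in the HttpClient bucket matches .NET's default `HttpClient.Timeout` of 100 seconds. A sketch of raising it for LLM-backed requests (the 300-second value is an illustrative choice, not taken from the repo):

```csharp
using System;
using System.Net.Http;

// HttpClient.Timeout defaults to 100 seconds, matching the
// "100 seconds elapsing" TaskCanceledException in failure 5.
using var client = new HttpClient
{
    Timeout = TimeSpan.FromSeconds(300) // illustrative value for slow LLM streams
};
```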
Evidence this is flakiness, not bugs:
- Same tests pass in some runs and fail in others for the same PR (e.g., PR-4925: 75% pass rate)
- Failures occur across 7 unrelated PRs with different code changes
- All error messages are timing-related
- Tests hit a real Azure OpenAI endpoint, making them inherently non-deterministic
## Recommended Long-term Fixes
- Increase timeouts — The 60s process timeout and various log-waiting timeouts are too tight for LLM-backed tests
- Add retry logic — A single retry for the flakiest tests would dramatically improve pass rate
- Increase streaming content wait — ReliableStreamingSampleValidationAsync should wait longer before sending the interrupt
- Mock/stub LLM calls — Remove external dependency for deterministic testing
- Workflow-level retry — Add `strategy.max-attempts: 2` for the integration test job
- Investigate Azure OpenAI France latency — Check if the endpoint has experienced latency spikes
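As an illustration of the "add retry logic" item, here is a minimal one-retry wrapper. This is a sketch, not the repo's helper; a real implementation should retry only on the timeout exception types catalogued above, not on assertion failures.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative single-retry wrapper for a flaky async test body.
static async Task WithOneRetryAsync(Func<Task> test)
{
    try
    {
        await test(); // first attempt
    }
    catch (TaskCanceledException) // the timeout shape seen in failures 6–7
    {
        await test(); // one retry; a second failure propagates normally
    }
}
```

Given PR-4925's 75% single-run pass rate, one retry alone would push its expected pass rate above 90%, assuming failures are independent.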
## Immediate Mitigation
A companion PR will skip these 7 flaky tests with `[Fact(Skip = "Flaky: see #THIS_ISSUE")]` to unblock the merge queue. The tests should be re-enabled once the timeouts and retry logic are improved.
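For reference, the xUnit skip shape looks like this; the `#THIS_ISSUE` placeholder is from the text above, and the class/method pairing is just an example of where the attribute would land:

```csharp
using System.Threading.Tasks;
using Xunit;

public class ConsoleAppSamplesValidation
{
    // A skipped test reports the Skip reason in test output instead of running.
    // "#THIS_ISSUE" is a placeholder to be replaced with this issue's number.
    [Fact(Skip = "Flaky: see #THIS_ISSUE")]
    public Task ReliableStreamingSampleValidationAsync()
    {
        // body unchanged; it simply won't execute while the Skip reason is set
        return Task.CompletedTask;
    }
}
```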