
.NET: Flaky integration tests blocking merge queue (73% failure rate) #4971

@rogerbarreto

Description


Problem Statement

Over the past 3 days (Mar 27–30, 2026), the dotnet-build-and-test workflow for merge_group events has had a 73% failure rate (22 failures out of 30 runs). All failures originate in integration test suites and occur across 7 different PRs regardless of each PR's code changes — strong evidence of systemic flakiness rather than PR-specific bugs.

Statistics

| Metric | Value |
| --- | --- |
| Total runs (3 days) | 30 |
| Failures | 22 (73%) |
| Successes | 6 (20%) |
| Cancelled | 2 (7%) |

Per-PR Pass Rate

| PR | Total Runs | Passed | Failed | Pass Rate |
| --- | --- | --- | --- | --- |
| PR-4948 | 7 | 0 | 7 | 0% |
| PR-4665 | 5 | 0 | 5 | 0% |
| PR-4615 | 3 | 0 | 3 | 0% |
| PR-4502 | 2 | 0 | 2 | 0% |
| PR-4952 | 6 | 1 | 3 | 17% |
| PR-4915 | 2 | 1 | 1 | 50% |
| PR-4925 | 4 | 3 | 1 | 75% |
| PR-4858 | 1 | 1 | 0 | 100% |

Failing Test Suites

  1. Microsoft.Agents.AI.DurableTask.IntegrationTests — fails in ALL 7 PRs
  2. Microsoft.Agents.AI.Hosting.AzureFunctions.IntegrationTests — fails in some PRs

Detailed Failure Catalog

1. ConsoleAppSamplesValidation.ReliableStreamingSampleValidationAsync (DurableTask) ⭐ Most common

  • Error: Not enough content before interrupt (got 0).
  • Frequency: 5+ PRs (PR-4615, PR-4502, PR-4665, PR-4915, PR-4925)
  • Source: ConsoleAppSamplesValidation.cs:566 → SamplesValidationBase.cs:153
  • Cause: Test sends a travel planning prompt, waits for streaming content, then sends an interrupt. The LLM (gpt-5-nano) doesn't stream any content within the timeout window.
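
One hedged way to harden this pattern is to poll for streamed output before firing the interrupt, rather than interrupting on a fixed schedule. A minimal C# sketch — `readStreamOutputAsync` and `sendInterruptAsync` are illustrative stand-ins, not the repo's actual helpers:

```csharp
using System;
using System.Threading.Tasks;

static class StreamingInterruptHelper
{
    // Poll until the console app has streamed *some* content, then interrupt.
    // The delegates stand in for the test harness's real read/interrupt plumbing.
    public static async Task WaitForContentThenInterruptAsync(
        Func<Task<string>> readStreamOutputAsync,
        Func<Task> sendInterruptAsync,
        TimeSpan maxWait)
    {
        var deadline = DateTime.UtcNow + maxWait;
        var buffered = string.Empty;
        while (buffered.Length == 0 && DateTime.UtcNow < deadline)
        {
            buffered = await readStreamOutputAsync();
            if (buffered.Length == 0)
                await Task.Delay(TimeSpan.FromSeconds(1)); // poll interval is an assumption
        }
        if (buffered.Length == 0)
            throw new TimeoutException($"Not enough content before interrupt (got 0) after {maxWait}.");
        await sendInterruptAsync();
    }
}
```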

2. ConsoleAppSamplesValidation.SingleAgentOrchestrationHITLSampleValidationAsync (DurableTask)

  • Error: Wasn't prompted with the second draft. or Wasn't prompted with the first draft.
  • Frequency: PR-4615 (2 runs), PR-4665
  • Source: ConsoleAppSamplesValidation.cs:243
  • Cause: HITL sample — AI generates content, user rejects, AI should regenerate. The draft notification doesn't arrive before the process is killed (~60s timeout).
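
The same wait-for-notification shape recurs in entries 2–4; a generic polling helper (a sketch under assumed names, not an existing API in this repo) would make the timeout explicit and tunable instead of tying it to the process lifetime:

```csharp
using System;
using System.Threading.Tasks;

static class TestWait
{
    // Poll a condition (e.g. "the draft prompt appeared in the log") until it
    // holds or the timeout elapses, instead of killing the process on a fixed clock.
    public static async Task<bool> WaitForConditionAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (condition()) return true;
            await Task.Delay(pollInterval);
        }
        return condition(); // one last check at the deadline
    }
}
```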

3. ConsoleAppSamplesValidation.LongRunningToolsSampleValidationAsync (DurableTask)

  • Error: Wasn't prompted with the first draft.
  • Frequency: PR-4665
  • Cause: Same pattern as #2 — the long-running tools sample doesn't produce a content draft within the expected timeframe.

4. SamplesValidation.LongRunningToolsSampleValidationAsync (AzureFunctions)

  • Error: System.TimeoutException : Timeout waiting for 'Content published notification is logged' or Timeout waiting for 'Orchestration is requesting human feedback'
  • Frequency: PR-4665 (2 runs), PR-4948
  • Cause: Azure Functions version waits for specific log messages but orchestration doesn't reach those states within the timeout.

5. SamplesValidation.ReliableStreamingSampleValidationAsync (AzureFunctions)

  • Error: TaskCanceledException : The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
  • Frequency: PR-4665
  • Cause: HTTP request to the Azure Functions host times out at 100 seconds waiting for the streaming response.
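
If the 100-second default is the binding constraint, loosening it is a one-liner; a hedged sketch (the URL and the five-minute figure are illustrative assumptions, not the test's real values):

```csharp
using System;
using System.Net.Http;

// Give LLM-backed streaming endpoints more headroom than HttpClient's 100 s default.
using var client = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

// ResponseHeadersRead returns as soon as headers arrive, so the test can start
// consuming the stream instead of waiting for the full body to buffer.
using var response = await client.GetAsync(
    "http://localhost:7071/api/reliable-streaming", // illustrative URL, not the real route
    HttpCompletionOption.ResponseHeadersRead);
```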

6. ExternalClientTests.CallLongRunningFunctionToolsAsync (DurableTask)

  • Error: System.Threading.Tasks.TaskCanceledException : A task was canceled.
  • Duration: Exactly 1m 00s 001ms — hard 60-second timeout
  • Frequency: PR-4665
  • Cause: CancellationToken timeout of 60 seconds is too tight for LLM-backed function tool calls.
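
A sketch of widening that window — the five-minute bound is an assumption, not a measured value, and the `Func` stands in for the actual LLM-backed tool call:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ToolCallTimeout
{
    // Replace the hard 60 s cancellation window with a wider, configurable one.
    public static async Task CallWithWidenedTimeoutAsync(
        Func<CancellationToken, Task> callToolAsync,
        TimeSpan? timeout = null)
    {
        using var cts = new CancellationTokenSource(timeout ?? TimeSpan.FromMinutes(5));
        try
        {
            await callToolAsync(cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // Surface a clearer failure than a bare TaskCanceledException.
            throw new TimeoutException("Tool call did not complete within the widened window.");
        }
    }
}
```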

7. ExternalClientTests.CallFunctionToolsAsync (DurableTask)

  • Error: System.Threading.Tasks.TaskCanceledException : A task was canceled.
  • Duration: Exactly 1m 00s 002ms — hard 60-second timeout
  • Frequency: PR-4665
  • Cause: Same as #6.

8. paths-filter job (non-test)

  • Error: fatal: couldn't find remote ref gh-readonly-queue/main/pr-4952-...
  • Frequency: 1 occurrence (PR-4952)
  • Cause: Merge queue branch deleted before dorny/paths-filter@v3 could fetch it. Git race condition.

Root Cause Analysis

All failures are timing/latency related. Every integration test failure falls into one of four categories:

  1. LLM response too slow — Azure OpenAI (gpt-5-nano, France region) not responding fast enough
  2. Orchestration timeout — Durable Task orchestrations don't complete within hard-coded timeouts
  3. HttpClient timeout — HTTP requests to local services time out
  4. Process lifecycle timing — Console app processes killed before producing expected output

Evidence this is flakiness, not bugs:

  • Same tests pass in some runs and fail in others for the same PR (e.g., PR-4925: 75% pass rate)
  • Failures occur across 7 unrelated PRs with different code changes
  • All error messages are timing-related
  • Tests hit a real Azure OpenAI endpoint, making them inherently non-deterministic

Recommended Long-term Fixes

  1. Increase timeouts — The 60s process timeout and various log-waiting timeouts are too tight for LLM-backed tests
  2. Add retry logic — A single retry for the flakiest tests would dramatically improve pass rate
  3. Increase streaming content wait — ReliableStreamingSampleValidationAsync should wait longer before sending the interrupt
  4. Mock/stub LLM calls — Remove external dependency for deterministic testing
  5. Workflow-level retry — automatically re-run the integration test job on failure; GitHub Actions' strategy block has no built-in retry, so this means a retry wrapper around the test step or re-run automation
  6. Investigate Azure OpenAI France latency — Check if the endpoint has experienced latency spikes
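
Fix 2 can be sketched as a plain wrapper (a generic pattern, not an existing helper in this repo; attribute-based alternatives exist in community xUnit packages such as xRetry):

```csharp
using System;
using System.Threading.Tasks;

static class FlakyTestRetry
{
    // Run a test body up to maxAttempts times, swallowing failures on all but
    // the final attempt. A single retry (maxAttempts = 2) matches fix 2 above.
    public static async Task RunWithRetryAsync(Func<Task> testBody, int maxAttempts = 2)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await testBody();
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                await Task.Delay(TimeSpan.FromSeconds(5)); // brief backoff between attempts
            }
        }
    }
}
```

Retries mask latency spikes rather than fix them, so this should be paired with fixes 1 and 6 rather than used alone.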

Immediate Mitigation

A companion PR will skip these 7 flaky tests with [Fact(Skip = "Flaky: see #THIS_ISSUE")] to unblock the merge queue. The tests should be re-enabled once the timeouts and retry logic are improved.
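
For reference, this uses xUnit's built-in Skip parameter; a minimal example, with the class and test names mirroring the catalog above and the placeholder issue number left as-is:

```csharp
using System.Threading.Tasks;
using Xunit;

public class ConsoleAppSamplesValidation
{
    [Fact(Skip = "Flaky: see #THIS_ISSUE")]
    public Task ReliableStreamingSampleValidationAsync()
    {
        // Real test body unchanged; the Skip reason keeps the link back to this issue
        // so the test can be re-enabled once timeouts and retries are improved.
        return Task.CompletedTask;
    }
}
```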
