
.NET: Flaky integration tests blocking merge queue (73% failure rate) #4971

@rogerbarreto

Description


Problem Statement

Over the past 3 days (Mar 27–30, 2026), the dotnet-build-and-test workflow for merge_group events has had a 73% failure rate (22 failures out of 30 runs). All failures originate in integration test suites and occur across 7 different PRs regardless of each PR's code changes — strong evidence of systemic flakiness rather than PR-specific bugs.

Statistics

| Metric | Value |
| --- | --- |
| Total runs (3 days) | 30 |
| Failures | 22 (73%) |
| Successes | 6 (20%) |
| Cancelled | 2 (7%) |

Per-PR Pass Rate

| PR | Total Runs | Passed | Failed | Pass Rate |
| --- | --- | --- | --- | --- |
| PR-4948 | 7 | 0 | 7 | 0% |
| PR-4665 | 5 | 0 | 5 | 0% |
| PR-4615 | 3 | 0 | 3 | 0% |
| PR-4502 | 2 | 0 | 2 | 0% |
| PR-4952 | 6 | 1 | 3 | 17% |
| PR-4915 | 2 | 1 | 1 | 50% |
| PR-4925 | 4 | 3 | 1 | 75% |
| PR-4858 | 1 | 1 | 0 | 100% |

Failing Test Suites

  1. Microsoft.Agents.AI.DurableTask.IntegrationTests — fails in ALL 7 PRs
  2. Microsoft.Agents.AI.Hosting.AzureFunctions.IntegrationTests — fails in some PRs

Detailed Failure Catalog

1. ConsoleAppSamplesValidation.ReliableStreamingSampleValidationAsync (DurableTask) ⭐ Most common

  • Error: Not enough content before interrupt (got 0).
  • Frequency: 5+ PRs (PR-4615, PR-4502, PR-4665, PR-4915, PR-4925)
  • Source: ConsoleAppSamplesValidation.cs:566 → SamplesValidationBase.cs:153
  • Cause: Test sends a travel planning prompt, waits for streaming content, then sends an interrupt. The LLM (gpt-5-nano) doesn't stream any content within the timeout window.
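
One hedged way to harden this pattern is to poll for streamed output before firing the interrupt, rather than interrupting on a fixed schedule. A minimal C# sketch — `readStreamOutputAsync` and `sendInterruptAsync` are illustrative stand-ins, not the repo's actual helpers:

```csharp
using System;
using System.Threading.Tasks;

static class StreamingInterruptHelper
{
    // Poll until the console app has streamed *some* content, then interrupt.
    // The delegates stand in for the test harness's real read/interrupt plumbing.
    public static async Task WaitForContentThenInterruptAsync(
        Func<Task<string>> readStreamOutputAsync,
        Func<Task> sendInterruptAsync,
        TimeSpan maxWait)
    {
        var deadline = DateTime.UtcNow + maxWait;
        var buffered = string.Empty;
        while (buffered.Length == 0 && DateTime.UtcNow < deadline)
        {
            buffered = await readStreamOutputAsync();
            if (buffered.Length == 0)
                await Task.Delay(TimeSpan.FromSeconds(1)); // poll interval is an assumption
        }
        if (buffered.Length == 0)
            throw new TimeoutException($"Not enough content before interrupt (got 0) after {maxWait}.");
        await sendInterruptAsync();
    }
}
```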

2. ConsoleAppSamplesValidation.SingleAgentOrchestrationHITLSampleValidationAsync (DurableTask)

  • Error: Wasn't prompted with the second draft. or Wasn't prompted with the first draft.
  • Frequency: PR-4615 (2 runs), PR-4665
  • Source: ConsoleAppSamplesValidation.cs:243
  • Cause: HITL sample — AI generates content, user rejects, AI should regenerate. The draft notification doesn't arrive before the process is killed (~60s timeout).
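
The same wait-for-notification shape recurs in entries 2–4; a generic polling helper (a sketch under assumed names, not an existing API in this repo) would make the timeout explicit and tunable instead of tying it to the process lifetime:

```csharp
using System;
using System.Threading.Tasks;

static class TestWait
{
    // Poll a condition (e.g. "the draft prompt appeared in the log") until it
    // holds or the timeout elapses, instead of killing the process on a fixed clock.
    public static async Task<bool> WaitForConditionAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (condition()) return true;
            await Task.Delay(pollInterval);
        }
        return condition(); // one last check at the deadline
    }
}
```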

3. ConsoleAppSamplesValidation.LongRunningToolsSampleValidationAsync (DurableTask)

  • Error: Wasn't prompted with the first draft.
  • Frequency: PR-4665
  • Cause: Same pattern as #2 — the long-running tools sample doesn't produce a content draft within the expected timeframe.

4. SamplesValidation.LongRunningToolsSampleValidationAsync (AzureFunctions)

  • Error: System.TimeoutException : Timeout waiting for 'Content published notification is logged' or Timeout waiting for 'Orchestration is requesting human feedback'
  • Frequency: PR-4665 (2 runs), PR-4948
  • Cause: Azure Functions version waits for specific log messages but orchestration doesn't reach those states within the timeout.

5. SamplesValidation.ReliableStreamingSampleValidationAsync (AzureFunctions)

  • Error: TaskCanceledException : The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
  • Frequency: PR-4665
  • Cause: HTTP request to the Azure Functions host times out at 100 seconds waiting for the streaming response.
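
If the 100-second default is the binding constraint, loosening it is a one-liner; a hedged sketch (the URL and the five-minute figure are illustrative assumptions, not the test's real values):

```csharp
using System;
using System.Net.Http;

// Give LLM-backed streaming endpoints more headroom than HttpClient's 100 s default.
using var client = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

// ResponseHeadersRead returns as soon as headers arrive, so the test can start
// consuming the stream instead of waiting for the full body to buffer.
using var response = await client.GetAsync(
    "http://localhost:7071/api/reliable-streaming", // illustrative URL, not the real route
    HttpCompletionOption.ResponseHeadersRead);
```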

6. ExternalClientTests.CallLongRunningFunctionToolsAsync (DurableTask)

  • Error: System.Threading.Tasks.TaskCanceledException : A task was canceled.
  • Duration: Exactly 1m 00s 001ms — hard 60-second timeout
  • Frequency: PR-4665
  • Cause: CancellationToken timeout of 60 seconds is too tight for LLM-backed function tool calls.
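
A sketch of widening that window — the five-minute bound is an assumption, not a measured value, and the `Func` stands in for the actual LLM-backed tool call:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ToolCallTimeout
{
    // Replace the hard 60 s cancellation window with a wider, configurable one.
    public static async Task CallWithWidenedTimeoutAsync(
        Func<CancellationToken, Task> callToolAsync,
        TimeSpan? timeout = null)
    {
        using var cts = new CancellationTokenSource(timeout ?? TimeSpan.FromMinutes(5));
        try
        {
            await callToolAsync(cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // Surface a clearer failure than a bare TaskCanceledException.
            throw new TimeoutException("Tool call did not complete within the widened window.");
        }
    }
}
```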

7. ExternalClientTests.CallFunctionToolsAsync (DurableTask)

  • Error: System.Threading.Tasks.TaskCanceledException : A task was canceled.
  • Duration: Exactly 1m 00s 002ms — hard 60-second timeout
  • Frequency: PR-4665
  • Cause: Same as #6.

8. paths-filter job (non-test)

  • Error: fatal: couldn't find remote ref gh-readonly-queue/main/pr-4952-...
  • Frequency: 1 occurrence (PR-4952)
  • Cause: Merge queue branch deleted before dorny/paths-filter@v3 could fetch it. Git race condition.

Root Cause Analysis

All failures are timing/latency related. Every integration test failure falls into one of four categories:

  1. LLM response too slow — Azure OpenAI (gpt-5-nano, France region) not responding fast enough
  2. Orchestration timeout — Durable Task orchestrations don't complete within hard-coded timeouts
  3. HttpClient timeout — HTTP requests to local services time out
  4. Process lifecycle timing — Console app processes killed before producing expected output

Evidence this is flakiness, not bugs:

  • Same tests pass in some runs and fail in others for the same PR (e.g., PR-4925: 75% pass rate)
  • Failures occur across 7 unrelated PRs with different code changes
  • All error messages are timing-related
  • Tests hit a real Azure OpenAI endpoint, making them inherently non-deterministic

Recommended Long-term Fixes

  1. Increase timeouts — The 60s process timeout and various log-waiting timeouts are too tight for LLM-backed tests
  2. Add retry logic — A single retry for the flakiest tests would dramatically improve pass rate
  3. Increase streaming content wait — ReliableStreamingSampleValidationAsync should wait longer before sending the interrupt
  4. Mock/stub LLM calls — Remove external dependency for deterministic testing
  5. Workflow-level retry — automatically re-run the integration test job on failure; GitHub Actions' strategy block has no built-in retry, so this means a retry wrapper around the test step or re-run automation
  6. Investigate Azure OpenAI France latency — Check if the endpoint has experienced latency spikes
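
Fix 2 can be sketched as a plain wrapper (a generic pattern, not an existing helper in this repo; attribute-based alternatives exist in community xUnit packages such as xRetry):

```csharp
using System;
using System.Threading.Tasks;

static class FlakyTestRetry
{
    // Run a test body up to maxAttempts times, swallowing failures on all but
    // the final attempt. A single retry (maxAttempts = 2) matches fix 2 above.
    public static async Task RunWithRetryAsync(Func<Task> testBody, int maxAttempts = 2)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await testBody();
                return;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                await Task.Delay(TimeSpan.FromSeconds(5)); // brief backoff between attempts
            }
        }
    }
}
```

Retries mask latency spikes rather than fix them, so this should be paired with fixes 1 and 6 rather than used alone.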

Immediate Mitigation

A companion PR will skip these 7 flaky tests with [Fact(Skip = "Flaky: see #THIS_ISSUE")] to unblock the merge queue. The tests should be re-enabled once the timeouts and retry logic are improved.
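
For reference, this uses xUnit's built-in Skip parameter; a minimal example, with the class and test names mirroring the catalog above and the placeholder issue number left as-is:

```csharp
using System.Threading.Tasks;
using Xunit;

public class ConsoleAppSamplesValidation
{
    [Fact(Skip = "Flaky: see #THIS_ISSUE")]
    public Task ReliableStreamingSampleValidationAsync()
    {
        // Real test body unchanged; the Skip reason keeps the link back to this issue
        // so the test can be re-enabled once timeouts and retries are improved.
        return Task.CompletedTask;
    }
}
```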
