[grafana-otel-advisor] OTel improvement: emit synthetic exception event for agent failures with missing output #31305

@github-actions

OTel Instrumentation Improvement: emit synthetic exception event for agent failures with missing output

Analysis Date: 2026-05-10
Priority: High
Effort: Small (< 2h)

Problem

When an agent job ends with GH_AW_AGENT_CONCLUSION=failure and /tmp/gh-aw/agent_output.json is missing or unreadable (hard crash, OOM kill, runner termination, signal kill), the conclusion span is emitted with status.code = STATUS_CODE_ERROR (2) but no exception span event. Backends like Grafana Tempo, Honeycomb, and Datadog rely on exception events (with exception.type and exception.message) to surface failure reasons in the trace UI, group failures by exception type, and feed alerting rules.

The synthetic-exception path in actions/setup/js/send_otlp_span.cjs (lines 1390-1397) covers only the timed_out and cancelled outcomes, so generic failure-without-output runs leave a red span with no diagnostic event attached. A DevOps engineer paging on "agent failed" cannot answer "why did it fail?" from the span alone — they have to leave the trace, jump to GitHub Actions logs, and correlate manually.

Why This Matters (DevOps Perspective)
  • MTTR: For hard crashes (OOM, runner termination, segfault) the trace is the first place an on-call engineer looks. Currently those traces show status=ERROR with no exception event — engineers must context-switch to GitHub Actions logs to learn anything.
  • Alerting: Exception-based alerts ("alert when exception.type =~ "gh-aw.*" rate > X") miss this entire class of hard-failure runs because no exception event is emitted.
  • Dashboards: A common pattern is grouping failures by exception.type to see top failure modes. Right now gh-aw.AgentTimedOut and gh-aw.AgentCancelled are visible, but raw failure outcomes silently drop out of the breakdown.
  • Span hygiene: An OTel span with status.code = ERROR and no exception event gives trace backends nothing to display or group on, and breaks parity with the OTel exception semantic conventions, which record errors as exception span events.

Current Behavior

In actions/setup/js/send_otlp_span.cjs (around lines 1387-1399):

const buildSpanEvents = eventTimeMs => {
  const shouldEmitSyntheticException = hasNoReadableAgentOutput && (isAgentTimedOut || isAgentCancelled);
  if (outputErrors.length === 0) {
    if (shouldEmitSyntheticException) {
      const exceptionType = isAgentTimedOut ? "gh-aw.AgentTimedOut" : "gh-aw.AgentCancelled";
      const exceptionMessage = (statusMessage || `agent ${agentConclusion}`).slice(0, MAX_ATTR_VALUE_LENGTH);
      return [{ timeUnixNano: toNanoString(eventTimeMs), name: "exception", attributes: [buildAttr("exception.type", exceptionType), buildAttr("exception.message", exceptionMessage)] }];
    }
    return [];
  }
  // ... per-error events when outputErrors.length > 0 ...
};

The condition (isAgentTimedOut || isAgentCancelled) excludes the most common hard-failure mode: agentConclusion === "failure" with no readable agent_output.json. Those runs return [] (no events) even though statusCode === 2 is set later in the same function.

Note that isAgentNonOK = isAgentFailure || isAgentCancelled is already computed nearby and is used to drive the span status — but it is not used to drive the synthetic-exception emission.
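
The gap can be reproduced in isolation. The sketch below is a standalone stand-in for the predicate quoted above, not the code in send_otlp_span.cjs: the function name currentSyntheticEvents and the plain-object flags are hypothetical, and timestamps and buildAttr are left out for brevity.

```javascript
// Stand-in for the current synthetic-exception predicate (hypothetical helper,
// not the real buildSpanEvents). Flags are passed in explicitly for illustration.
function currentSyntheticEvents({ hasNoReadableAgentOutput, isAgentTimedOut, isAgentCancelled }) {
  const shouldEmit = hasNoReadableAgentOutput && (isAgentTimedOut || isAgentCancelled);
  if (!shouldEmit) return [];
  const type = isAgentTimedOut ? "gh-aw.AgentTimedOut" : "gh-aw.AgentCancelled";
  return [{ name: "exception", type }];
}

// Hard crash: conclusion is "failure", no output file, neither timed out nor cancelled.
const hardCrashEvents = currentSyntheticEvents({
  hasNoReadableAgentOutput: true,
  isAgentTimedOut: false,
  isAgentCancelled: false,
});
console.log(hardCrashEvents.length); // 0
```

With both flags false the predicate short-circuits, so the hard-crash case yields an empty events array even though the span status is ERROR.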

Proposed Change

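Reuse the existing isAgentNonOK predicate so any non-OK agent outcome (failure, timeout, cancellation) triggers a synthetic exception event when no agent_output.json is available. Because isAgentNonOK is defined as isAgentFailure || isAgentCancelled, this relies on a timed-out run also setting one of those two flags; if it does not, the predicate should be isAgentNonOK || isAgentTimedOut so the existing timeout coverage is preserved. Add a third exception type (gh-aw.AgentFailureWithoutOutput) so backends can distinguish hard crashes from clean timeouts/cancellations.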

// Proposed replacement in actions/setup/js/send_otlp_span.cjs
const buildSpanEvents = eventTimeMs => {
  // Emit a synthetic exception event whenever the agent did not finish OK and
  // there is no agent_output.json to extract concrete errors from. This covers
  // hard crashes (OOM, runner termination, signal kill) where the agent process
  // is killed before it can write structured output — not just clean timeouts
  // and cancellations.
  const shouldEmitSyntheticException = hasNoReadableAgentOutput && isAgentNonOK;
  if (outputErrors.length === 0) {
    if (shouldEmitSyntheticException) {
      let exceptionType;
      if (isAgentTimedOut) {
        exceptionType = "gh-aw.AgentTimedOut";
      } else if (isAgentCancelled) {
        exceptionType = "gh-aw.AgentCancelled";
      } else {
        exceptionType = "gh-aw.AgentFailureWithoutOutput";
      }
      const exceptionMessage = (statusMessage || `agent ${agentConclusion}`).slice(0, MAX_ATTR_VALUE_LENGTH);
      return [{
        timeUnixNano: toNanoString(eventTimeMs),
        name: "exception",
        attributes: [
          buildAttr("exception.type", exceptionType),
          buildAttr("exception.message", exceptionMessage),
        ],
      }];
    }
    return [];
  }
  // ... existing per-error events branch unchanged ...
};

This is a minimal change: one predicate swap + one extra branch in the type selection. No new attributes, no schema changes, no impact on non-failure paths.

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: Every red conclusion span (and dedicated agent span when emitted) carries an exception event. Backends that auto-detect exception events show "❗ exception" markers in the trace UI for hard-crashed agent runs, just as they do today for timeouts and cancellations. TraceQL queries like { status = error && event:name = "exception" }, and their Honeycomb equivalents, cover all failed runs instead of only the timeout/cancellation subset.
  • In the local JSONL mirror (/tmp/gh-aw/otel.jsonl): The conclusion span entry now has a non-empty events: [...] array on hard-crash runs, making post-hoc artifact-based debugging viable without a live collector.
  • For on-call engineers: A new exception.type = "gh-aw.AgentFailureWithoutOutput" value becomes alertable and groupable, surfacing the runs that previously had status=ERROR with no diagnostic context.
  • Dashboards: Stacked-bar "failures by exception type" panels start showing a gh-aw.AgentFailureWithoutOutput slice. Currently those runs are invisible to that breakdown.
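
For the JSONL mirror in particular, a small script can list the red spans that still lack an exception event. This is a minimal sketch under assumptions the issue does not confirm: that each line of the file is one JSON span object with an OTLP-style status.code and an events array. The function name findErrorSpansWithoutException and the sample span names are hypothetical.

```javascript
// Assumed shape (not confirmed by the source): one JSON span object per line,
// with OTLP-style `status.code` and an `events` array.
const STATUS_CODE_ERROR = 2;

function findErrorSpansWithoutException(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(line => line.trim() !== "")
    .map(line => JSON.parse(line))
    .filter(span => span.status && span.status.code === STATUS_CODE_ERROR)
    .filter(span => !(span.events || []).some(event => event.name === "exception"));
}

// Inline sample instead of reading the real file, to keep the sketch self-contained.
const sampleJsonl = [
  JSON.stringify({ name: "conclusion-a", status: { code: 2 }, events: [] }),
  JSON.stringify({ name: "conclusion-b", status: { code: 2 }, events: [{ name: "exception" }] }),
  JSON.stringify({ name: "conclusion-c", status: { code: 1 }, events: [] }),
].join("\n");

console.log(findErrorSpansWithoutException(sampleJsonl).map(span => span.name)); // [ 'conclusion-a' ]
```

To run it against a real artifact, the sample string could be replaced with require("node:fs").readFileSync("/tmp/gh-aw/otel.jsonl", "utf8"). After the proposed change, the result should be empty for hard-crash runs.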

Implementation Steps
  • Edit buildSpanEvents in actions/setup/js/send_otlp_span.cjs (around lines 1390-1397) to use isAgentNonOK and the three-way type selection shown above.
  • Add a unit test in actions/setup/js/action_conclusion_otlp.test.cjs (or action_otlp.test.cjs) that asserts: when GH_AW_AGENT_CONCLUSION=failure and agent_output.json is missing, the conclusion span carries exactly one exception event with exception.type = "gh-aw.AgentFailureWithoutOutput".
  • Extend the existing timed_out / cancelled synthetic-exception tests to also cover the new failure branch (parametrized table-driven test if convenient).
  • Run make test-unit (or cd actions/setup/js && npx vitest run) to confirm the new and existing tests pass.
  • Run make fmt to ensure formatting.
  • Open a PR referencing this issue.

Evidence from Live Grafana Data

The Grafana Tempo datasource (grafanacloud-traces) attached to this workflow returned zero traces for any {} TraceQL query across the last 30 days, and /api/v2/search/tags returned only the intrinsic scope with no resource.* or span.* tags populated:

GET /api/datasources/proxy/uid/grafanacloud-traces/api/search?q=%7B%7D&limit=10
→ {"metrics":{"completedJobs":3,"totalJobs":3},"traces":[]}

GET /api/datasources/proxy/uid/grafanacloud-traces/api/v2/search/tags
→ {"metrics":{},"scopes":[{"name":"intrinsic","tags":["duration","event:name",...]}]}

The Prometheus and Loki datasources also returned no gh-aw.*, gen_ai.*, or traces_spanmetrics_* series. So the recommendation in this issue is grounded in static analysis of the instrumentation code rather than live telemetry. The gap above is verifiable by running the existing action_otlp.test.cjs harness with GH_AW_AGENT_CONCLUSION=failure and no agent_output.json on disk — the resulting conclusion span has status.code = 2 and events = [], confirming the missing exception event.

Related Files
  • actions/setup/js/send_otlp_span.cjs (primary change site — buildSpanEvents inside sendJobConclusionSpan)
  • actions/setup/js/action_conclusion_otlp.cjs (no change; documents the env vars used)
  • actions/setup/js/action_otlp.test.cjs (test for the new exception-event emission path)
  • actions/setup/js/action_conclusion_otlp.test.cjs (parallel test surface)

Generated by the Daily Grafana OTel Instrumentation Advisor workflow

  • expires on May 17, 2026, 5:39 AM UTC
