OTel Instrumentation Improvement: emit synthetic exception event for agent failures with missing output
Analysis Date: 2026-05-10
Priority: High
Effort: Small (< 2h)
Problem
When an agent job ends with GH_AW_AGENT_CONCLUSION=failure and /tmp/gh-aw/agent_output.json is missing or unreadable (hard crash, OOM kill, runner termination, signal kill), the conclusion span is emitted with status.code = STATUS_CODE_ERROR (2) but no exception span event. Backends like Grafana Tempo, Honeycomb, and Datadog rely on exception events (with exception.type and exception.message) to surface failure reasons in the trace UI, group failures by exception type, and feed alerting rules.
The synthetic-exception path in actions/setup/js/send_otlp_span.cjs (lines 1390-1397) covers only the timed_out and cancelled outcomes, so generic failure-without-output runs leave a red span with no diagnostic event attached. An on-call engineer paged for "agent failed" cannot answer "why did it fail?" from the span alone: they have to leave the trace, open the GitHub Actions logs, and correlate manually.
Why This Matters (DevOps Perspective)
- MTTR: For hard crashes (OOM, runner termination, segfault) the trace is the first place an on-call engineer looks. Currently those traces show status=ERROR with no exception event, so engineers must context-switch to GitHub Actions logs to learn anything.
- Alerting: Exception-based alerts (for example, alert when exception.type =~ "gh-aw.*" rate > X) miss this entire class of hard-failure runs because no exception event is emitted.
- Dashboards: A common pattern is grouping failures by exception.type to see top failure modes. Right now gh-aw.AgentTimedOut and gh-aw.AgentCancelled are visible, but raw failure outcomes silently drop out of the breakdown.
- Span hygiene: An OTel span with status.code = ERROR and no exception event gives trace consumers nothing to diagnose, and it departs from how the OTel exception semantic conventions record errors on spans.
Current Behavior
In actions/setup/js/send_otlp_span.cjs (around lines 1387-1399):
    const buildSpanEvents = eventTimeMs => {
      const shouldEmitSyntheticException = hasNoReadableAgentOutput && (isAgentTimedOut || isAgentCancelled);
      if (outputErrors.length === 0) {
        if (shouldEmitSyntheticException) {
          const exceptionType = isAgentTimedOut ? "gh-aw.AgentTimedOut" : "gh-aw.AgentCancelled";
          const exceptionMessage = (statusMessage || `agent ${agentConclusion}`).slice(0, MAX_ATTR_VALUE_LENGTH);
          return [{ timeUnixNano: toNanoString(eventTimeMs), name: "exception", attributes: [buildAttr("exception.type", exceptionType), buildAttr("exception.message", exceptionMessage)] }];
        }
        return [];
      }
      // ... per-error events when outputErrors.length > 0 ...
    };
The condition (isAgentTimedOut || isAgentCancelled) excludes the most common hard-failure mode: agentConclusion === "failure" with no readable agent_output.json. Those runs return [] (no events) even though statusCode === 2 is set later in the same function.
Note that isAgentNonOK = isAgentFailure || isAgentCancelled is already computed nearby and is used to drive the span status — but it is not used to drive the synthetic-exception emission.
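The gap can be reproduced in isolation with a stripped-down model of the current gate. This is only a sketch, not the real module: currentSyntheticEvents and its options object are hypothetical stand-ins for the closure variables that buildSpanEvents reads, the flags are assumed to derive directly from the conclusion string, and the per-error branch is omitted.

```javascript
// Simplified model of the current synthetic-exception gate.
// Hypothetical helper: the real buildSpanEvents reads these values from its
// enclosing scope instead of taking them as an argument.
function currentSyntheticEvents({ agentConclusion, hasNoReadableAgentOutput }) {
  // Assumption: the flags are derived directly from the conclusion string.
  const isAgentTimedOut = agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";

  const shouldEmit = hasNoReadableAgentOutput && (isAgentTimedOut || isAgentCancelled);
  if (!shouldEmit) return [];

  const exceptionType = isAgentTimedOut ? "gh-aw.AgentTimedOut" : "gh-aw.AgentCancelled";
  return [
    {
      name: "exception",
      // Plain object used for brevity; the real code builds OTLP attributes.
      attributes: {
        "exception.type": exceptionType,
        "exception.message": `agent ${agentConclusion}`,
      },
    },
  ];
}

// Compare the three non-OK conclusions when agent_output.json is missing.
for (const agentConclusion of ["timed_out", "cancelled", "failure"]) {
  const events = currentSyntheticEvents({ agentConclusion, hasNoReadableAgentOutput: true });
  console.log(agentConclusion, "->", events.length, "event(s)");
}
```

Under these assumptions the timed_out and cancelled rows report one event each, while the failure row reports zero.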
Proposed Change
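Reuse the existing isAgentNonOK predicate so any non-OK agent outcome (failure, timeout, cancellation) triggers a synthetic exception event when no agent_output.json is available. Add a third exception type (gh-aw.AgentFailureWithoutOutput) so backends can distinguish hard crashes from clean timeouts/cancellations. One caveat: as defined above, isAgentNonOK does not mention isAgentTimedOut, so the swap preserves today's timeout event only if timed-out runs also set isAgentFailure. If they do not, the condition should be hasNoReadableAgentOutput && (isAgentNonOK || isAgentTimedOut).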
    // Proposed replacement in actions/setup/js/send_otlp_span.cjs
    const buildSpanEvents = eventTimeMs => {
      // Emit a synthetic exception event whenever the agent did not finish OK and
      // there is no agent_output.json to extract concrete errors from. This covers
      // hard crashes (OOM, runner termination, signal kill) where the agent process
      // is killed before it can write structured output, not just clean timeouts
      // and cancellations.
      const shouldEmitSyntheticException = hasNoReadableAgentOutput && isAgentNonOK;
      if (outputErrors.length === 0) {
        if (shouldEmitSyntheticException) {
          let exceptionType;
          if (isAgentTimedOut) {
            exceptionType = "gh-aw.AgentTimedOut";
          } else if (isAgentCancelled) {
            exceptionType = "gh-aw.AgentCancelled";
          } else {
            exceptionType = "gh-aw.AgentFailureWithoutOutput";
          }
          const exceptionMessage = (statusMessage || `agent ${agentConclusion}`).slice(0, MAX_ATTR_VALUE_LENGTH);
          return [{
            timeUnixNano: toNanoString(eventTimeMs),
            name: "exception",
            attributes: [
              buildAttr("exception.type", exceptionType),
              buildAttr("exception.message", exceptionMessage),
            ],
          }];
        }
        return [];
      }
      // ... existing per-error events branch unchanged ...
    };
This is a minimal change: one predicate swap + one extra branch in the type selection. No new attributes, no schema changes, no impact on non-failure paths.
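If the type selection is easier to test as a pure function, it could be factored out roughly as follows. This is a sketch under assumptions, not the module's actual layout: selectSyntheticExceptionType and syntheticExceptionEvents are hypothetical names, the flags are passed in explicitly rather than read from the closure, and the inline attribute objects and BigInt conversion stand in for buildAttr and toNanoString, whose real implementations are not shown in this issue.

```javascript
// Hypothetical pure helper mirroring the proposed three-way selection.
function selectSyntheticExceptionType({ isAgentTimedOut, isAgentCancelled }) {
  if (isAgentTimedOut) return "gh-aw.AgentTimedOut";
  if (isAgentCancelled) return "gh-aw.AgentCancelled";
  return "gh-aw.AgentFailureWithoutOutput";
}

// Returns one synthetic exception event, or [] when structured output exists
// or the agent finished OK.
function syntheticExceptionEvents(opts) {
  const { hasNoReadableAgentOutput, isAgentNonOK, message, eventTimeMs } = opts;
  if (!hasNoReadableAgentOutput || !isAgentNonOK) return [];
  return [
    {
      // Stand-in for toNanoString: integer milliseconds to a nanosecond string.
      timeUnixNano: (BigInt(eventTimeMs) * 1000000n).toString(),
      name: "exception",
      // Stand-in for buildAttr: OTLP JSON string attributes.
      attributes: [
        { key: "exception.type", value: { stringValue: selectSyntheticExceptionType(opts) } },
        { key: "exception.message", value: { stringValue: message } },
      ],
    },
  ];
}

// Usage: a hard crash with no agent_output.json on disk.
const events = syntheticExceptionEvents({
  hasNoReadableAgentOutput: true,
  isAgentNonOK: true,
  isAgentTimedOut: false,
  isAgentCancelled: false,
  message: "agent failure",
  eventTimeMs: 1700000000000,
});
console.log(events[0].attributes[0].value.stringValue); // gh-aw.AgentFailureWithoutOutput
```

Keeping the selection separate from event construction means the new branch can be unit-tested without building a whole span.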
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog: Every red conclusion span (and dedicated agent span when emitted) carries an exception event. Backends that auto-detect exception events show "❗ exception" markers in the trace UI for hard-crashed agent runs, just as they do today for timeouts and cancellations. TraceQL / Honeycomb queries like { status = error && event:name = "exception" } cover all failed runs instead of only the timeout/cancellation subset.
- In the local JSONL mirror (/tmp/gh-aw/otel.jsonl): The conclusion span entry now has a non-empty events: [...] array on hard-crash runs, making post-hoc artifact-based debugging viable without a live collector.
- For on-call engineers: A new exception.type = "gh-aw.AgentFailureWithoutOutput" value becomes alertable and groupable, surfacing the runs that previously had status=ERROR with no diagnostic context.
- Dashboards: Stacked-bar "failures by exception type" panels start showing a gh-aw.AgentFailureWithoutOutput slice. Currently those runs are invisible to that breakdown.
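The JSONL mirror also offers a way to check these outcomes offline. The sketch below counts ERROR spans that carry no exception event and groups the rest by exception.type. It assumes each JSONL line is a single span object with OTLP-style status.code and events[].attributes[] fields; the actual layout of /tmp/gh-aw/otel.jsonl is not documented in this issue, and summarizeErrorSpans and the sample span name are hypothetical.

```javascript
// Assumption: each JSONL line is one span object with OTLP-style
// `status.code` and `events[].attributes[]` fields.
function summarizeErrorSpans(jsonlText) {
  const summary = { withoutException: 0, byExceptionType: {} };
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const span = JSON.parse(line);
    if (!span.status || span.status.code !== 2) continue; // only ERROR spans

    const exceptions = (span.events || []).filter(e => e.name === "exception");
    if (exceptions.length === 0) {
      summary.withoutException += 1; // the gap described in this issue
      continue;
    }
    for (const event of exceptions) {
      const typeAttr = (event.attributes || []).find(a => a.key === "exception.type");
      const type = typeAttr && typeAttr.value ? typeAttr.value.stringValue : "(unknown)";
      summary.byExceptionType[type] = (summary.byExceptionType[type] || 0) + 1;
    }
  }
  return summary;
}

// Usage with three hand-written sample spans (not real telemetry).
const timedOutEvent = {
  name: "exception",
  attributes: [{ key: "exception.type", value: { stringValue: "gh-aw.AgentTimedOut" } }],
};
const sample = [
  JSON.stringify({ name: "conclusion", status: { code: 2 }, events: [] }),
  JSON.stringify({ name: "conclusion", status: { code: 2 }, events: [timedOutEvent] }),
  JSON.stringify({ name: "conclusion", status: { code: 1 }, events: [] }),
].join("\n");
console.log(summarizeErrorSpans(sample));
```

Once the change lands, withoutException should stay at zero for runs with no readable agent output, and the failure runs should instead appear under gh-aw.AgentFailureWithoutOutput.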
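Implementation Steps
- Update buildSpanEvents in actions/setup/js/send_otlp_span.cjs (around lines 1390-1397) to use isAgentNonOK and the three-way type selection shown above.
- Add a test in actions/setup/js/action_conclusion_otlp.test.cjs (or action_otlp.test.cjs) that asserts: when GH_AW_AGENT_CONCLUSION=failure and agent_output.json is missing, the conclusion span carries exactly one exception event with exception.type = "gh-aw.AgentFailureWithoutOutput".
- Extend the existing timed_out/cancelled synthetic-exception tests to also cover the new failure branch (parametrized table-driven test if convenient).
- Run make test-unit (or cd actions/setup/js && npx vitest run) to confirm the new and existing tests pass.
- Run make fmt to ensure formatting.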
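A table-driven check for the three outcomes might look like the sketch below. It is written against a hypothetical adapter, emitConclusionSpan, which would run the real code path with no agent_output.json on disk and return the captured span. The fakeEmit stand-in exists only so the sketch runs on its own, and the { key, value: { stringValue } } attribute shape is an assumption about what buildAttr produces.

```javascript
const assertStrict = require("node:assert/strict");

// Expected exception.type per agent conclusion when agent_output.json is missing.
const CASES = [
  { conclusion: "timed_out", expectedType: "gh-aw.AgentTimedOut" },
  { conclusion: "cancelled", expectedType: "gh-aw.AgentCancelled" },
  { conclusion: "failure", expectedType: "gh-aw.AgentFailureWithoutOutput" },
];

// `emitConclusionSpan` is a hypothetical adapter: given a conclusion string it
// returns the captured conclusion span ({ status, events }).
function checkSyntheticExceptions(emitConclusionSpan) {
  for (const { conclusion, expectedType } of CASES) {
    const span = emitConclusionSpan(conclusion);
    const exceptions = span.events.filter(e => e.name === "exception");
    assertStrict.equal(exceptions.length, 1, `${conclusion}: expected exactly one exception event`);
    const typeAttr = exceptions[0].attributes.find(a => a.key === "exception.type");
    assertStrict.equal(typeAttr.value.stringValue, expectedType);
  }
  return CASES.length; // number of cases that passed
}

// Stand-in adapter so the sketch is runnable; replace with the real harness.
const fakeEmit = conclusion => {
  const type =
    conclusion === "timed_out"
      ? "gh-aw.AgentTimedOut"
      : conclusion === "cancelled"
        ? "gh-aw.AgentCancelled"
        : "gh-aw.AgentFailureWithoutOutput";
  return {
    status: { code: 2 },
    events: [{ name: "exception", attributes: [{ key: "exception.type", value: { stringValue: type } }] }],
  };
};

console.log(checkSyntheticExceptions(fakeEmit), "cases passed");
```

With the current code, an adapter wired to the real module would be expected to fail on the failure row, since that run yields no events.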
Evidence from Live Grafana Data
The Grafana Tempo datasource (grafanacloud-traces) attached to this workflow returned zero traces for any {} TraceQL query across the last 30 days, and /api/v2/search/tags returned only the intrinsic scope with no resource.* or span.* tags populated:
    GET /api/datasources/proxy/uid/grafanacloud-traces/api/search?q=%7B%7D&limit=10
    → {"metrics":{"completedJobs":3,"totalJobs":3},"traces":[]}

    GET /api/datasources/proxy/uid/grafanacloud-traces/api/v2/search/tags
    → {"metrics":{},"scopes":[{"name":"intrinsic","tags":["duration","event:name",...]}]}
The Prometheus and Loki datasources also returned no gh-aw.*, gen_ai.*, or traces_spanmetrics_* series. So the recommendation in this issue is grounded in static analysis of the instrumentation code rather than live telemetry. The gap above is verifiable by running the existing action_otlp.test.cjs harness with GH_AW_AGENT_CONCLUSION=failure and no agent_output.json on disk — the resulting conclusion span has status.code = 2 and events = [], confirming the missing exception event.
Related Files
- actions/setup/js/send_otlp_span.cjs (primary change site: buildSpanEvents inside sendJobConclusionSpan)
- actions/setup/js/action_conclusion_otlp.cjs (no change; documents the env vars used)
- actions/setup/js/action_otlp.test.cjs (test for the new exception-event emission path)
- actions/setup/js/action_conclusion_otlp.test.cjs (parallel test surface)
Generated by the Daily Grafana OTel Instrumentation Advisor workflow