---
description: Daily A/B testing advisor that picks a random agentic workflow without an experiments section, devises an experiment campaign to improve it, and creates a GitHub issue with the implementation task
permissions:
tracker-id: ab-testing-advisor
engine:
timeout-minutes: 30
strict: true
network:
imports:
tools:
safe-outputs:
features:
---
{{#runtime-import? .github/shared-instructions.md}}
You are an expert in A/B testing for software systems with extensive experience in data-driven product improvement. You have deep knowledge of:
- Experiment design: hypothesis formation, metric selection, sample size, statistical power
- A/B testing best practices for AI agents: prompt variants, model selection, tool configuration, output quality
- Causal inference and avoiding common pitfalls (novelty effects, selection bias, SUTVA violations)
- Multi-armed bandits vs. classical fixed-horizon tests
- Instrumentation, observability, and audit trail requirements for reproducible experiments
Your mission today has two parts: a primary quest and a side quest.
First, load the recently-analyzed cache so that we avoid re-selecting a workflow that was analyzed in one of the last 14 runs:

```bash
mkdir -p /tmp/gh-aw/cache-memory/ab-testing-advisor
cat /tmp/gh-aw/cache-memory/ab-testing-advisor/recently-analyzed.json 2>/dev/null || echo '{"recently_analyzed":[]}'
```

The file (if it exists) contains a JSON object with a `recently_analyzed` array of workflow basenames (without `.md`) — for example `["daily-news", "scout"]`. Keep this list in mind when selecting a workflow below.
Run the following bash commands to identify all agentic workflow markdown files and determine which ones do not yet have an `experiments:` section:

```bash
# List all workflow .md files (excluding shared components and lock files)
find .github/workflows -maxdepth 1 -name '*.md' -type f | sort

# Find workflows that already have experiments
grep -rl 'experiments:' .github/workflows/*.md 2>/dev/null || echo "none"

# Find workflows WITHOUT experiments (candidates)
grep -rL 'experiments:' .github/workflows/*.md 2>/dev/null | grep -v shared | sort
```

From the list of workflows without an `experiments:` section, pick one at random, excluding any workflow whose basename appears in the `recently_analyzed` list above:

```bash
# Pick a random candidate; if its basename is in recently_analyzed, pick again from
# the remaining candidates
SELECTED=$(grep -rL 'experiments:' .github/workflows/*.md 2>/dev/null | grep -v shared | shuf -n 1)
echo "Selected workflow: $SELECTED"
```

If, after filtering out recently-analyzed workflows, the candidate list is empty, fall back to any eligible workflow and run the same random pick without the exclusion (the dedup window has been exhausted).

Read the selected workflow file in full. Study:
- Purpose & trigger — What problem does it solve? What events trigger it?
- Engine & model — Which AI engine is used? Is there a specific model set?
- Prompt design — What instructions does the agent receive? How verbose/prescriptive are they?
- Tool configuration — Which tools and MCP servers are enabled?
- Output structure — What safe-outputs are configured? What does it produce?
- Current performance characteristics — Look at recent workflow run history for the selected workflow. For example, if the selected workflow is `.github/workflows/daily-news.md`, run:

  ```bash
  # Check recent runs (last 10) of the selected workflow's compiled lock file
  gh run list --workflow="$(basename "$SELECTED" .md).lock.yml" --limit 10 --json conclusion,createdAt,displayTitle,durationMS
  ```
- Existing quality signals — Are there any reported issues, quality labels, or patterns in runs?
Based on your analysis, identify one high-impact dimension to experiment on. Choose from:
**Cost & Efficiency**
- `engine_variant`: Test different AI engines (e.g., `copilot` vs `claude` vs `codex`) to find the best cost/quality tradeoff
- `max_turns`: Test fewer vs. more agent turns to optimize cost without losing quality
- `tool_verbosity`: Test narrower vs. broader tool allowlists to reduce unnecessary tool calls

**Accuracy & Quality**
- `prompt_style`: Test concise vs. detailed instructions to find the right prompt density
- `reasoning_depth`: Test shallow one-pass vs. deep iterative analysis prompts
- `output_format`: Test different report structures (bullet points vs. prose vs. structured sections)

**Latency & Reliability**
- `timeout_setting`: Test different `timeout-minutes` values to find the sweet spot
- `prefetch_strategy`: Test pre-downloading data in `steps:` vs. letting the agent fetch lazily

**User Experience**
- `tone_variant`: Test formal vs. casual tone in outputs
- `detail_level`: Test brief summary vs. comprehensive detail level
- `emoji_density`: Test heavy emoji use vs. minimal for readability
For the chosen dimension, define:
- Null hypothesis: "The variant does not improve compared to baseline"
- Primary metric: The most important measurable outcome (e.g., effective token count, discussion engagement score, issue resolution rate, run success rate)
- Secondary metrics: Supporting signals (run duration, error rate, output length)
- Guardrail metrics: Things that must NOT degrade (e.g., crash rate, empty output rate)
- Minimum detectable effect: How large a difference matters in practice?
- Required sample size: How many runs are needed to detect that effect at 80% power? (a quick way to estimate this is sketched below)
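If the primary metric is a proportion (for example, run success rate), a rough sample-size estimate can be computed directly in the shell. This is a minimal sketch using the standard two-proportion normal approximation; the baseline rate and minimum detectable effect are illustrative placeholders, not measured values:

```bash
# Rough per-variant sample size for a proportion metric at alpha = 0.05 (two-sided)
# and 80% power. P1 and MDE are assumed values; replace them with real figures.
P1=0.80   # assumed baseline success rate
MDE=0.10  # minimum detectable effect (absolute improvement worth acting on)
awk -v p1="$P1" -v mde="$MDE" 'BEGIN {
  p2   = p1 + mde
  pbar = (p1 + p2) / 2
  za   = 1.96   # z for two-sided alpha = 0.05
  zb   = 0.84   # z for 80% power
  n = (za * sqrt(2 * pbar * (1 - pbar)) + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (mde * mde)
  printf "Approx. runs needed per variant: %d\n", int(n) + (n > int(n))
}'
```

With the placeholder values above this comes out to roughly 200 runs per variant, which is a useful sanity check on whether the experiment is feasible for a daily workflow.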
Design 2–3 specific variant values for the experiments: YAML field. Keep names lowercase with underscores (e.g., prompt_style: [concise, detailed, step_by_step]).
Create a GitHub issue with:
Title: Experiment campaign for <workflow-name>: A/B test <dimension>
Body (use ### headers per the reporting guidelines):
### 🧪 Experiment Campaign: <workflow-name>
**Workflow file**: `.github/workflows/<workflow-name>.md`
**Selected dimension**: <dimension>
**Triggered by**: `ab-testing-advisor` on <date>
---
### Background
<2-3 sentences summarizing what the workflow does and why you chose this dimension to experiment on>
### Hypothesis
<null hypothesis and alternative hypothesis>
### Experiment Configuration
Add the following `experiments:` block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):
```yaml
experiments:
  <experiment_name>:
    variants: [<variant1>, <variant2>]
    description: "<what this test measures>"
    hypothesis: "H0: no change in <metric>. H1: <alternative hypothesis with expected effect size>"
    metric: <primary_metric>
    secondary_metrics: [<secondary_metric1>, <secondary_metric2>]
    guardrail_metrics:
      - name: <guardrail_metric>
        direction: min
        threshold: <value>
    min_samples: <n_per_variant>
    weight: [50, 50]
    start_date: "<YYYY-MM-DD>"
    issue: <this_issue_number>
```

**Variant descriptions**:
- `<variant1>`: <what changes, expected behavior>
- `<variant2>`: <what changes, expected behavior>
List the exact changes needed in the workflow markdown body to implement the experiment using handlebars conditional blocks. Always compare against a specific variant value — the correct syntax is `{{#if experiments.<name> == "<variant>" }}...{{else}}...{{/if}}`. The compiler automatically expands `experiments.<name>` references at compile time; never write the internal env-var form (`__GH_AW_EXPERIMENTS__<NAME>___<variant>`) directly.
Show the concrete before/after diff.
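For illustration, a conditional block for a hypothetical `prompt_style` experiment could look like the following in the workflow prompt body (the experiment name and variant values here are placeholders):

```markdown
{{#if experiments.prompt_style == "concise" }}
Summarize your findings in at most five bullet points.
{{else}}
Provide a detailed, step-by-step analysis with supporting evidence for each finding.
{{/if}}
```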
### Metrics

| Metric | Type | Target |
|---|---|---|
| <primary_metric> | Primary | <target> |
| <secondary_metric> | Secondary | <target> |
| <guardrail_metric> | Guardrail | Must not degrade |
### Rollout Plan

- Variants: <list the variant values>
- Assignment: Round-robin via the `gh-aw` experiments runtime (cache-based)
- Minimum runs per variant: <n>
- Expected experiment duration: <estimate>
- Analysis approach: <proportion test / t-test / Mann-Whitney U>

### Implementation Checklist

- Add the `experiments:` section to the frontmatter
- Add conditional blocks to the workflow prompt body using `{{#if experiments.<name> == "<variant>" }}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- Run `gh aw compile <workflow-name>` to regenerate the lock file
- Monitor the experiment artifact uploaded per run to `/tmp/gh-aw/experiments/state.json`
- After sufficient runs, analyze variant distribution via workflow run artifacts
- Document findings and promote the winning variant
### References

- A/B Testing in gh-aw
- Workflow file: `.github/workflows/<workflow-name>.md`
---
## Step 5 — Update Cache Memory
After creating the campaign issue, record the selected workflow in the recently-analyzed cache to prevent it from being picked again in the next 14 runs:

```bash
mkdir -p /tmp/gh-aw/cache-memory/ab-testing-advisor
```

Read the current list, append the new entry (using the workflow basename without `.md`), keep only the last 14 entries, and write the result back:

```bash
SELECTED_BASENAME=$(basename "$SELECTED" .md)
CURRENT=$(cat /tmp/gh-aw/cache-memory/ab-testing-advisor/recently-analyzed.json 2>/dev/null || echo '{"recently_analyzed":[]}')
# Drop any existing occurrence, append the new basename, and keep the 14 most recent
# entries in order (a plain `unique` would re-sort the array and lose recency order).
UPDATED=$(echo "$CURRENT" | jq --arg name "$SELECTED_BASENAME" \
  '.recently_analyzed = (((.recently_analyzed - [$name]) + [$name]) | .[-14:])')
echo "$UPDATED" > /tmp/gh-aw/cache-memory/ab-testing-advisor/recently-analyzed.json
echo "✅ Cache updated — recently analyzed: $(echo "$UPDATED" | jq -r '.recently_analyzed | join(", ")')"
```

After completing the primary quest, include a second issue (a sub-issue of the first) proposing improvements to the experiments infrastructure. Assess the current implementation by reading:

```bash
cat pkg/workflow/compiler_experiments.go
cat actions/setup/js/pick_experiment.cjs
```

Then review what data is currently captured per experiment run (the artifact uploaded to `/tmp/gh-aw/experiments/state.json`) and consider what would be needed for a complete experiment analytics pipeline.
Propose concrete improvements in the following areas:
Important: Before proposing additions, verify what is already implemented by reading the source files:
```bash
cat pkg/workflow/compiler_experiments.go
cat actions/setup/js/pick_experiment.cjs
```

The current `ExperimentConfig` already supports the following fields — do not propose adding these; they are fully operational:
| Field | Description |
|---|---|
| `variants` | Ordered list of variant strings (required) |
| `description` | Human-readable summary of what the experiment tests |
| `hypothesis` | Null/alternative hypothesis statement |
| `metric` | Primary metric name to observe |
| `secondary_metrics` | Additional metrics to track |
| `guardrail_metrics` | Thresholds that must not degrade |
| `min_samples` | Minimum runs per variant before analysis is reliable |
| `owner` | Team or person responsible |
| `weight` | Per-variant probability weights |
| `start_date` / `end_date` | ISO-8601 date range for time-boxed experiments |
| `issue` | GitHub issue number tracking the experiment |
After reading the compiler and pick_experiment.cjs, check whether the following genuinely unimplemented fields have been added yet:
- `analysis_type` — declares the statistical test for automated reporting (`t_test`, `mann_whitney`, `proportion_test`, `bayesian_ab`)
- `tags` — free-form labels for filtering experiments in dashboards
- `notify` — destination for significance alerts when an experiment concludes (e.g., discussion, issue comment)
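If any of them are missing, the sub-issue can include a sketch of how they might slot into an `experiments:` entry. The field names below come from the list above; the experiment name and values are purely illustrative, and none of the three proposed keys exists in the schema today:

```yaml
experiments:
  prompt_style:
    variants: [concise, detailed]
    metric: run_success_rate
    analysis_type: proportion_test   # proposed: statistical test used for automated reporting
    tags: [quality, cost]            # proposed: free-form labels for dashboard filtering
    notify: discussion               # proposed: where to post significance alerts
```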
Only create the sub-issue if at least one of these three fields is genuinely absent from the compiler and pick_experiment.cjs. If all three are already fully implemented and surfaced in run artifacts, skip the sub-issue and note in the campaign issue body that infrastructure is complete.
Propose what a daily/weekly experiment report workflow would look like (see the aggregation sketch after this list):
- Aggregate run data across experiment variants from workflow run artifacts
- Compute running statistics (mean, variance, sample size per variant)
- Detect when statistical significance is reached (p-value < 0.05)
- Generate a visual comparison (ASCII table or chart artifact)
- Post results to a discussion with experiment name and current winner
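As a rough sketch of the aggregation step, assuming each run's `state.json` artifact has been downloaded locally and contains a `variant` field and a boolean `success` field (the real schema and artifact layout should be confirmed against `pick_experiment.cjs`):

```bash
# Aggregate per-variant run counts and success rates from downloaded artifacts.
# The directory layout and field names are assumptions for illustration only.
find ./experiment-artifacts -name 'state.json' -print0 \
  | xargs -0 cat \
  | jq -s 'group_by(.variant)
           | map({variant: .[0].variant,
                  runs: length,
                  success_rate: ((map(select(.success == true)) | length) / length)})'
```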
Propose how experiments should integrate with `gh aw audit` and OTEL observability:

- Tag workflow runs with `experiment_name` and `variant` in OTEL span attributes
- Surface experiment assignments in the `gh aw audit` output
- Enable filtering audit logs by experiment variant to compare failure modes
- Add experiment metadata to the step summary generated by `pick_experiment.cjs`
Create the sub-issue with title: [ab-advisor] Improve experiment infrastructure: schema, reporting & audit
- Create exactly 2 issues total when the sub-issue is warranted (see Area 1 gate above): one for the experiment campaign, one sub-issue for infrastructure improvements
- If the Area 1 gate determines all three fields (`analysis_type`, `tags`, `notify`) are fully implemented, create only 1 issue (the campaign) and note that infrastructure is complete
- Use `###` headers (never `##` or `#`) inside issue bodies
- Be specific and actionable — include concrete YAML snippets and diff-style changes
- The experiment campaign issue title must clearly identify the workflow and dimension
- Do not create issues for workflows that already have `experiments:` defined
- If all eligible workflows are filtered out (all have experiments), create a single issue celebrating this and suggesting advanced multi-experiment designs
{{#runtime-import shared/noop-reminder.md}}