This repository contains two AI-powered pipelines that generate executive summaries of GitHub repositories, projects, and portfolios over a configurable time window. Both pipelines produce the same three-tier output — repo-level, project-level, and portfolio-level summaries — and are designed for direct comparison.
| Pipeline | Approach | Models | Output suffix |
|---|---|---|---|
| Chat-based (`src/`) | Pre-fetches activity data, feeds it to OpenAI via Python | Configurable per tier | `_chatbased` |
| Agentic (`agentic/`) | GitHub Copilot CLI reads repos and activity autonomously | GitHub Copilot | `_agentbased` |
Both pipelines write to the same reports/ and reports_pdf/ folders. Each pipeline only cleans its own output files, so you can run both and keep both sets of results side by side.
Both pipelines share the same .env file. Use .env_example as a template.
- GitHub token: Go to GitHub.com > Settings > Developer settings > Personal access tokens > Tokens (classic) > Generate a token. Select the `repo` and `read:org` boxes. You need NIH-CFDE organization access.
- OpenAI API key: Go to the OpenAI API dashboard and generate an API key.
- GitHub Copilot subscription: You need an active GitHub Copilot subscription and must authenticate locally before running (see Section 2).
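With those credentials in hand, populate `.env`. A minimal sketch follows; the variable names here are assumptions, so copy the actual names from `.env_example`:

```shell
# Hypothetical variable names; mirror whatever .env_example actually defines.
GITHUB_TOKEN=ghp_your_token_here
OPENAI_API_KEY=sk-your_key_here
```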
The agentic pipeline uses GitHub Copilot CLI, which must be authenticated with your GitHub account before running. This only needs to be done once on your local machine.
```shell
# Install Copilot CLI locally if you haven't already
npm install -g @githubnext/github-copilot-cli

# Authenticate - this saves credentials to ~/.config/github-copilot/
copilot auth
```

Follow the prompts to log in with your GitHub account. You need an active GitHub Copilot subscription. Once authenticated, the Docker run command mounts these credentials into the container so Copilot can identify you without re-authenticating.
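Before launching the container, you can sanity-check that the credentials directory exists. This helper is illustrative and not part of the repo:

```shell
# Illustrative pre-flight check (not part of the repo): confirm Copilot
# credentials exist before launching the container, since the docker run
# command mounts them read-only.
check_copilot_auth() {
  # $1: credentials directory (defaults to where `copilot auth` writes)
  dir="${1:-$HOME/.config/github-copilot}"
  if [ -d "$dir" ]; then
    echo "copilot credentials found in $dir"
  else
    echo "no credentials in $dir; run 'copilot auth' first" >&2
    return 1
  fi
}
```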
Agentic pipeline:

```shell
docker build -t cfde-pipeline --no-cache .

docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  -v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro" \
  cfde-pipeline
```

Chat-based pipeline:

```shell
docker build -t cfde-pipeline --build-arg PIPELINE=chatbased --no-cache .
```
```shell
docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  cfde-pipeline
```

The `-v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro"` mount is only needed for the agentic pipeline; the chat-based pipeline does not need it.
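Since the two run commands differ only in that one extra mount, a small wrapper can emit the right `-v` flags per pipeline. This helper is illustrative and not shipped with the repo:

```shell
# Illustrative helper (not in the repo): print the -v flags for a pipeline,
# adding the Copilot credentials mount only for the agentic one.
mounts() {
  printf '%s\n' \
    "-v $PWD/data:/app/data" \
    "-v $PWD/reports:/app/reports" \
    "-v $PWD/reports_pdf:/app/reports_pdf"
  if [ "$1" = "agentic" ]; then
    printf '%s\n' "-v $HOME/.config/github-copilot:/root/.config/github-copilot:ro"
  fi
}
# Usage: docker run --rm --env-file .env $(mounts agentic) cfde-pipeline
```

Note the unquoted `$(mounts ...)` relies on word splitting, so this sketch assumes paths without spaces.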
- Time window: Change `--days=365` in `src/full.sh` or `agentic/full.sh`.
- Models (chat-based only): Change the `--model` flags in `src/full.sh`.
- Project cohort: `projects_seed.csv` is auto-generated by `build_projects_seed.py`. To use a different set of projects, update `data/projects_seed.csv` manually and remove the `build_projects_seed.py` call from `full.sh`.
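If you script the time-window change rather than editing by hand, a sed helper along these lines works (illustrative; `--days=365` is the flag named above):

```shell
# Illustrative: rewrite the --days flag in a full.sh script in place.
set_days() {
  # $1: script path, $2: new day count
  sed -i.bak "s/--days=[0-9][0-9]*/--days=$2/" "$1" && rm -f "$1.bak"
}
# e.g. set_days src/full.sh 90
```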
Both pipelines share steps 1–4. Summaries diverge at step 5.
1) Clean outputs
Removes only the current pipeline's output files from reports/ and reports_pdf/.
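The suffix-scoped cleanup can be sketched as follows (illustrative; the real scripts may differ in detail):

```shell
# Illustrative: delete only reports carrying this pipeline's suffix, so the
# other pipeline's outputs in the same folders survive.
clean_outputs() {
  # $1: output dir, $2: suffix (_chatbased or _agentbased)
  find "$1" -type f -name "*${2}*" -delete
}
```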
2) src/build_projects_seed.py
Fetches repository and project information from the CFDE-Eval core private repository and writes data/projects_seed.csv. To use a custom project cohort, skip this step and edit the CSV directly.
3) src/fetch_github_activity.py
Uses GraphQL to fetch all GitHub activity (commits, PRs, issues, releases, stars, forks) for all repositories in projects_seed.csv. Retry logic is included for network failures.
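The retry behavior has roughly this shape (illustrative; the actual retries live inside the Python script):

```shell
# Illustrative: run a command up to 3 times before giving up, mirroring the
# retry logic the fetch step implements in Python for network failures.
retry3() {
  n=1
  while [ "$n" -le 3 ]; do
    "$@" && return 0
    n=$((n + 1))
    sleep 1
  done
  return 1
}
# e.g. retry3 python src/fetch_github_activity.py --days=365
```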
4) src/normalize_activity.py and src/rollup_projects.py
Normalize raw data into cleaned parquet tables and per-project JSON rollups for downstream consumption.
5) src/summarize_repos.py
Shallow-clones each repository and uses a map-reduce approach over the codebase to infer its goal. Combines this with fetched activity data and calls an OpenAI model to produce a ## Summary and Goal + ## Recent Developments report per repository. Output: reports/<PROJECT_ID>__<owner>__<repo>__chatbased.md.
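The report naming convention used by both pipelines can be captured in one place (pattern taken from the filenames above):

```shell
# Per-repo report filename, following the
# <PROJECT_ID>__<owner>__<repo>__<suffix>.md pattern described above.
report_name() {
  # $1: project id, $2: owner, $3: repo, $4: suffix (chatbased or agentbased)
  echo "${1}__${2}__${3}__${4}.md"
}
```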
6) src/summarize_projects.py
Reads all repo-level _chatbased.md files for each project and calls an OpenAI model to synthesize a single project-level executive summary. Output: reports/<PROJECT_ID>__chatbased.md.
7) src/summarize_portfolio.py
Reads all project-level _chatbased.md files and calls an OpenAI model to produce a single portfolio-wide summary. Output: reports/_portfolio_full__chatbased.md.
5) agentic/run_repo_summaries.sh + agentic/build_activity_context.py
Shallow-clones each repository, injects the same activity data as the chat-based pipeline into a _activity_context.md file, then runs GitHub Copilot CLI inside the clone using the repo-summary skill. Copilot reads the code and activity autonomously and writes the summary. Output: reports/<PROJECT_ID>__<owner>__<repo>__agentbased.md.
6) agentic/run_project_summaries.sh
Creates a temporary working directory per project containing all its repo-level _agentbased.md files, then runs Copilot CLI using the project-summary skill to synthesize a project-level summary. Output: reports/<PROJECT_ID>__agentbased.md.
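The per-project staging step can be sketched like this (illustrative; the real logic lives in `run_project_summaries.sh`):

```shell
# Illustrative: copy one project's repo-level reports into a scratch
# directory for Copilot CLI to read.
gather_reports() {
  # $1: reports dir, $2: project id, $3: destination dir
  mkdir -p "$3"
  cp "$1/${2}__"*"__agentbased.md" "$3"/
}
```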
7) agentic/run_portfolio_summary.sh
Gathers all project-level _agentbased.md files into a working directory and runs Copilot CLI using the portfolio-summary skill to produce a portfolio-wide summary. Output: reports/_portfolio_full__agentbased.md.
8) src/make_pdfs.py
Converts all Markdown reports in reports/ to styled PDFs saved in reports_pdf/. Each pipeline's outputs are named with their respective suffix so both sets can coexist.