Analysis of GitHub Repository and Project Content

This repository contains two AI-powered pipelines that generate executive summaries of GitHub repositories, projects, and portfolios over a configurable time window. Both pipelines produce the same three-tier output — repo-level, project-level, and portfolio-level summaries — and are designed for direct comparison.

| Pipeline | Approach | Models | Output suffix |
| --- | --- | --- | --- |
| Chat-based (`src/`) | Pre-fetches activity data, feeds it to OpenAI via Python | Configurable per tier | `_chatbased` |
| Agentic (`agentic/`) | GitHub Copilot CLI reads repos and activity autonomously | GitHub Copilot | `_agentbased` |

Both pipelines write to the same reports/ and reports_pdf/ folders. Each pipeline only cleans its own output files, so you can run both and keep both sets of results side by side.
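Because both sets of reports live in the same folder, they can be paired by name for comparison. The following is a minimal sketch, assuming only the file-name suffixes described in this README; `pair_reports` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path

CHAT_SUFFIX = "_chatbased"
AGENT_SUFFIX = "_agentbased"


def pair_reports(names):
    """Group report file names by base name, one slot per pipeline.

    `names` is any iterable of Markdown file names. Files without a
    known pipeline suffix are ignored.
    """
    pairs = {}
    for name in sorted(names):
        stem = Path(name).stem  # drop the ".md" extension
        for key, suffix in (("chat", CHAT_SUFFIX), ("agent", AGENT_SUFFIX)):
            if stem.endswith(suffix):
                # Strip the suffix and any separator underscores before it.
                base = stem[: -len(suffix)].rstrip("_")
                pairs.setdefault(base, {})[key] = name
    return pairs
```

For example, `pair_reports(p.name for p in Path("reports").glob("*.md"))` would return a dictionary whose entries show which reports exist for one pipeline, the other, or both.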

1. Add the required tokens

Both pipelines share the same .env file. Use .env_example as a template.

Chat-based pipeline needs:

GitHub token: Go to GitHub.com > Settings > Developer settings > Personal Access Tokens > Tokens (classic) > Generate a token. Select the repo and read:org boxes. You need NIH-CFDE organization access.
OpenAI API key: Go to the OpenAI API platform and generate an API key.

Agentic pipeline additionally needs:

GitHub Copilot subscription: You need an active GitHub Copilot subscription and must authenticate locally before running (see Section 2).
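Before a long run, it can help to confirm that the .env file actually contains the required values. The sketch below is illustrative only: the variable names `GITHUB_TOKEN` and `OPENAI_API_KEY` are assumptions, so check .env_example for the names the pipelines really read.

```python
# ASSUMPTION: hypothetical variable names; .env_example defines the real ones.
REQUIRED_KEYS = ("GITHUB_TOKEN", "OPENAI_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(open(".env").read())` returns an empty list when every required key has a value.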

2. One-time setup: authenticate GitHub Copilot CLI (agentic pipeline only)

The agentic pipeline uses GitHub Copilot CLI, which must be authenticated with your GitHub account before running. This only needs to be done once on your local machine.

# Install Copilot CLI locally if you haven't already
npm install -g @githubnext/github-copilot-cli

# Authenticate — this saves credentials to ~/.config/github-copilot/
copilot auth

Follow the prompts to log in with your GitHub account. You need an active GitHub Copilot subscription. Once authenticated, the Docker run command mounts these credentials into the container so Copilot can identify you without re-authenticating.

3. Run the pipeline of your choice

Agentic (default)

docker build -t cfde-pipeline --no-cache .

docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  -v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro" \
  cfde-pipeline

Chat-based

docker build -t cfde-pipeline --build-arg PIPELINE=chatbased --no-cache .

docker run --rm --env-file .env \
  -v "$PWD/data:/app/data" \
  -v "$PWD/reports:/app/reports" \
  -v "$PWD/reports_pdf:/app/reports_pdf" \
  cfde-pipeline

The -v "$HOME/.config/github-copilot:/root/.config/github-copilot:ro" mount is only needed for the agentic pipeline. The chat-based pipeline does not need it.
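The difference between the two run commands can be expressed as a small helper that adds the Copilot credentials mount only for the agentic pipeline. This is a sketch that mirrors the commands above; `docker_run_args` is a hypothetical name, and the pipeline itself is still selected at build time through the `PIPELINE` build argument.

```python
def docker_run_args(pipeline, cwd, home, image="cfde-pipeline"):
    """Build the `docker run` argument list for the given pipeline."""
    if pipeline not in ("agentic", "chatbased"):
        raise ValueError(f"unknown pipeline: {pipeline!r}")

    args = ["docker", "run", "--rm", "--env-file", ".env"]
    # Both pipelines mount the same three working folders.
    for folder in ("data", "reports", "reports_pdf"):
        args += ["-v", f"{cwd}/{folder}:/app/{folder}"]
    # Only the agentic pipeline needs the read-only Copilot credentials.
    if pipeline == "agentic":
        args += ["-v", f"{home}/.config/github-copilot:/root/.config/github-copilot:ro"]
    args.append(image)
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`, with `cwd` and `home` taken from `os.getcwd()` and `os.path.expanduser("~")`.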

4. Customizing the pipeline

Time window: Change --days=365 in src/full.sh or agentic/full.sh.
Models (chat-based only): Change the --model flags in src/full.sh.
Project cohort: projects_seed.csv is auto-generated by build_projects_seed.py. To use a different set of projects, update data/projects_seed.csv manually and remove the build_projects_seed.py call from full.sh.
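To make the time window concrete, the sketch below shows one way a `--days` value could translate into a cutoff for activity timestamps. How the pipeline scripts actually interpret `--days` is not shown in this README, so treat the function names and the inclusive bounds as assumptions.

```python
from datetime import datetime, timedelta, timezone


def window_start(days, now=None):
    """Return the start of a window that ends at `now` and spans `days` days."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)


def in_window(timestamp, days, now=None):
    """Check whether an ISO 8601 timestamp (e.g. from GitHub) is in the window."""
    now = now or datetime.now(timezone.utc)
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    moment = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return window_start(days, now) <= moment <= now
```

With `--days=365`, activity older than one year before the run time would fall outside the window.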

Pipeline details

Both pipelines share steps 1–4 and the final PDF step (step 8). The summary steps (5–7) differ between pipelines.

Shared steps (both pipelines)

1) Clean outputs: Removes only the current pipeline's output files from reports/ and reports_pdf/.

2) src/build_projects_seed.py: Fetches repository and project information from the CFDE-Eval core private repository and writes data/projects_seed.csv. To use a custom project cohort, skip this step and edit the CSV directly.

3) src/fetch_github_activity.py: Uses GraphQL to fetch all GitHub activity (commits, PRs, issues, releases, stars, forks) for all repositories in projects_seed.csv. Retry logic is included for network failures.

4) src/normalize_activity.py and src/rollup_projects.py: Normalize raw data into cleaned parquet tables and per-project JSON rollups for downstream consumption.
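The clean step in the list above is what lets both pipelines share the output folders. A minimal sketch of suffix-scoped cleanup follows; the function names are hypothetical and the repository's own implementation may differ.

```python
from pathlib import Path


def select_for_cleanup(names, suffix):
    """Return the file names whose stem ends with the pipeline suffix."""
    return [name for name in names if Path(name).stem.endswith(suffix)]


def clean_outputs(suffix, folders=("reports", "reports_pdf")):
    """Delete only the files that belong to one pipeline; return what was removed."""
    removed = []
    for folder in folders:
        root = Path(folder)
        if not root.is_dir():
            continue
        names = [p.name for p in root.iterdir() if p.is_file()]
        for name in select_for_cleanup(names, suffix):
            (root / name).unlink()
            removed.append(root / name)
    return removed
```

Calling `clean_outputs("_chatbased")` leaves every `_agentbased` file untouched, and vice versa.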

Chat-based summary steps (`src/`)

5) src/summarize_repos.py: Shallow-clones each repository and uses a map-reduce approach over the codebase to infer its goal. Combines this with fetched activity data and calls an OpenAI model to produce a `## Summary and Goal` + `## Recent Developments` report per repository. Output: reports/<PROJECT_ID>__<owner>__<repo>__chatbased.md.

6) src/summarize_projects.py: Reads all repo-level _chatbased.md files for each project and calls an OpenAI model to synthesize a single project-level executive summary. Output: reports/<PROJECT_ID>__chatbased.md.

7) src/summarize_portfolio.py: Reads all project-level _chatbased.md files and calls an OpenAI model to produce a single portfolio-wide summary. Output: reports/_portfolio_full__chatbased.md.
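The map-reduce approach mentioned in step 5 can be sketched in a few lines. This is an illustration under assumptions, not the repository's code: the chunk size, the batch size, and the `summarize` callable (a stand-in for a model call) are all hypothetical.

```python
def chunk_text(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def map_reduce_summary(files, summarize, max_chars=4000, batch_size=5):
    """Summarize a codebase that is too large for a single model call.

    files: dict mapping file path to file content.
    summarize: callable taking a string and returning a shorter string.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so the reduce step shrinks")

    # Map: summarize every chunk of every file independently.
    partials = [
        summarize(f"File: {path}\n{chunk}")
        for path, content in sorted(files.items())
        for chunk in chunk_text(content, max_chars)
    ]
    if not partials:
        return ""

    # Reduce: merge partial summaries in batches until one remains.
    while len(partials) > 1:
        partials = [
            summarize("\n".join(partials[i:i + batch_size]))
            for i in range(0, len(partials), batch_size)
        ]
    return partials[0]
```

In practice `summarize` would wrap a model request with a prompt asking for the repository's goal; here any function from string to string works, which keeps the sketch testable offline.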

Agentic summary steps (`agentic/`)

5) agentic/run_repo_summaries.sh + agentic/build_activity_context.py: Shallow-clones each repository, injects the same activity data as the chat-based pipeline into a _activity_context.md file, then runs GitHub Copilot CLI inside the clone using the repo-summary skill. Copilot reads the code and activity autonomously and writes the summary. Output: reports/<PROJECT_ID>__<owner>__<repo>__agentbased.md.

6) agentic/run_project_summaries.sh: Creates a temporary working directory per project containing all its repo-level _agentbased.md files, then runs Copilot CLI using the project-summary skill to synthesize a project-level summary. Output: reports/<PROJECT_ID>__agentbased.md.

7) agentic/run_portfolio_summary.sh: Gathers all project-level _agentbased.md files into a working directory and runs Copilot CLI using the portfolio-summary skill to produce a portfolio-wide summary. Output: reports/_portfolio_full__agentbased.md.
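The staging described in step 6 amounts to selecting a project's repo-level reports and copying them into a scratch directory. The sketch below assumes the file-name pattern given above and that owner and repository names do not themselves contain a double underscore; the helper names are hypothetical, and the Copilot CLI invocation is left out because it is not shown in this README.

```python
import shutil
import tempfile
from pathlib import Path


def repo_reports_for(project_id, names, suffix="agentbased"):
    """Select repo-level reports (<PROJECT_ID>__<owner>__<repo>__<suffix>.md)."""
    prefix = f"{project_id}__"
    tail = f"__{suffix}.md"
    selected = []
    for name in sorted(names):
        if name.startswith(prefix) and name.endswith(tail):
            middle = name[len(prefix):-len(tail)]
            # Repo-level names carry "<owner>__<repo>" between prefix and tail;
            # the project-level report has nothing there and is skipped.
            if "__" in middle:
                selected.append(name)
    return selected


def stage_project(project_id, reports_dir):
    """Copy a project's repo-level reports into a fresh temporary directory."""
    source = Path(reports_dir)
    workdir = Path(tempfile.mkdtemp(prefix=f"{project_id}_"))
    names = [p.name for p in source.iterdir() if p.is_file()]
    for name in repo_reports_for(project_id, names):
        shutil.copy(source / name, workdir / name)
    return workdir
```

The caller is responsible for removing the returned directory, for example with `shutil.rmtree`, once the project summary has been written.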

Shared final step (both pipelines)

8) src/make_pdfs.py: Converts all Markdown reports in reports/ to styled PDFs saved in reports_pdf/. Each pipeline's outputs are named with their respective suffix so both sets can coexist.
