Analysing GitHub commits with R
@BasiaFusinska
barbarafusinska.com
barbara.fusinska@gmail.com
About me
Programmer
Math enthusiast
Sweet tooth
https://github.com/BasiaFusinska/RTalk
Goals
Agenda
• Data analysis
• Capturing GitHub data
• Using R in Data Analysis
• Azure ML
Data analysis process
Raw Data
Processed Data
Data Analysis & Visualization
Exploratory Data Analysis
Data Capture
What do we want to find out?
Language statistics
GitHut Visualisation
http://githut.info/
Data Capture
GitHub Archive
https://www.githubarchive.org/
Event Data
Data Source (Events/Logs/Files)
Store & reorganize
Query
Google BigQuery
https://cloud.google.com/bigquery/what-is-bigquery
GitHub API
https://developer.github.com/v3/
Why R?
• Created by Ross Ihaka & Robert Gentleman
• Name:
– First letter of the creators' first names
– Play on the name of S
– S-PLUS – commercial alternative
• Open source
• A leading language for statistical computing
Development environment
R Environment
• R project
– console environment
– http://www.r-project.org/
• IDE
– Any editor
– RStudio
http://www.rstudio.com/products/rstudio/download/
RStudio
Editor
Console
Environment (variables)
Plots
Files
Help
Packages
R Basics
Filtering
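A minimal sketch of filtering in base R, using made-up language names and counts (none of these values come from the talk). It shows logical indexing on a vector and the same idea applied to the rows of a data frame.

```r
# Hypothetical sample data: language names and pull request counts.
languages <- c("JavaScript", "Python", "R", "Ruby")
pull_requests <- c(120, 95, 8, 40)

# A logical vector marks the elements that satisfy a condition ...
is_busy <- pull_requests > 50

# ... and indexing with it keeps only those elements.
busy_languages <- languages[is_busy]
print(busy_languages)

# The same idea applied to the rows of a data frame.
stats <- data.frame(language = languages, count = pull_requests,
                    stringsAsFactors = FALSE)
busy_stats <- stats[stats$count > 50, ]
print(busy_stats)
```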
Goal: Language distribution
Distribution of active repositories per language
What is an active repository?
Source of truth
GitHub Archive CreateEvent
GitHub Archive PushEvent
GitHub Archive PullRequestEvent
Task: Reading Pull Requests
1. Read the file line by line and extract only pull request events
2. Extract id and language information
3. Count and visualise language distribution
Data: 1h GitHub Archive Events from 01-01-2015, 3 PM
Reading Events
Read Pull Requests
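One way the reading step could look, as a sketch rather than the code from the talk. The three event lines are invented stand-ins for GitHub Archive data, and matching on the `"type"` field with `grepl` is an assumption about how the events are told apart.

```r
# Three made-up event lines standing in for one hour of GitHub Archive data.
sample_lines <- c(
  '{"id":"1","type":"PushEvent","repo":{"id":10}}',
  '{"id":"2","type":"PullRequestEvent","repo":{"id":11}}',
  '{"id":"3","type":"CreateEvent","repo":{"id":12}}'
)

# With a real download this could be readLines(gzfile(path)), where path is
# a hypothetical local file; a text connection keeps the sketch self-contained.
con <- textConnection(sample_lines)
events <- readLines(con)
close(con)

# Keep only the lines whose type field marks a pull request event.
is_pull_request <- grepl('"type":"PullRequestEvent"', events, fixed = TRUE)
pull_request_lines <- events[is_pull_request]
print(length(pull_request_lines))
```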
Unique data
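A repository can appear in several events, so it should be counted once. The sketch below uses an invented data frame and assumes the repository id is the de-duplication key.

```r
# Hypothetical repositories extracted from events; id 11 appears twice.
repos <- data.frame(
  id = c(11, 12, 11, 13),
  language = c("R", "Python", "R", "Go"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an id after its first occurrence,
# so negating it keeps exactly one row per repository.
unique_repos <- repos[!duplicated(repos$id), ]
print(unique_repos)
```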
Language information
Language information output
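A sketch of extracting id and language from pull request events with the `jsonlite` package. The sample lines are invented, and the field path `payload$pull_request$base$repo` is an assumption about the event layout, not something stated on the slides.

```r
library(jsonlite)

# Made-up pull request events; the second has no language recorded.
pull_request_lines <- c(
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":11,"language":"R"}}}}}',
  '{"type":"PullRequestEvent","payload":{"pull_request":{"base":{"repo":{"id":12,"language":null}}}}}'
)

# Parse one line and return a one-row data frame with id and language.
# The nested path used here is assumed, not confirmed by the talk.
extract_repo <- function(line) {
  event <- fromJSON(line)
  repo <- event$payload$pull_request$base$repo
  language <- repo$language
  data.frame(
    id = repo$id,
    language = if (is.null(language)) NA_character_ else language,
    stringsAsFactors = FALSE
  )
}

# Stack the one-row results into a single data frame.
repos <- do.call(rbind, lapply(pull_request_lines, extract_repo))
print(repos)
```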
Missing information
Omitting information
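Missing languages show up as `NA` in R. A minimal sketch, on invented data, of counting the gaps and then dropping the incomplete rows:

```r
# Hypothetical repositories; one has no language information.
repos <- data.frame(
  id = c(11, 12, 13),
  language = c("R", NA, "Python"),
  stringsAsFactors = FALSE
)

# Count how many rows lack a language.
missing_count <- sum(is.na(repos$language))
print(missing_count)

# Keep only the rows where the language is known.
known_repos <- repos[!is.na(repos$language), ]
print(known_repos)
```

`na.omit(repos)` would give the same rows here, since language is the only column with gaps.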
Plotting
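Counting and plotting can be sketched with base R alone. The language vector below is invented, and the chart title and axis label are placeholders.

```r
# Hypothetical language column, one entry per repository.
languages <- c("R", "Python", "JavaScript", "Python", "JavaScript", "JavaScript")

# table() counts occurrences; sort() puts the most common language first.
language_counts <- sort(table(languages), decreasing = TRUE)
print(language_counts)

# One bar per language; las = 2 turns the labels so long names stay readable.
barplot(language_counts, las = 2,
        main = "Repositories per language", ylab = "Repositories")
```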
Now everything is sorted…
Goal: Analyze ACTUAL active repositories
Missing data
Language information
• Active repositories – Create, Push and PullRequest events
• Missing language information:
– Google BigQuery
– GitHub API
• Process various data sources
Google BigQuery
Different sources of data
• GitHub Archive:
– id,
– url in the form: https://api.github.com/repos/:name
– (rare cases) language
• Google BigQuery:
– no id,
– url in the form: https://github.com/:name
– language
Task: Reading Active Repositories
1. Read the file line by line and extract only create, push and pull request events
2. Extract id and url information
3. Read Google BigQuery data from saved file
4. Combine repositories data and Google data based on the same url and fill in missing language information
5. Count and visualise language distribution
Read GitHub Archive
Read Google Data
Repository data
Various data sources
Combining data
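The two sources use different URL forms, so the key has to be normalised before joining. A sketch under stated assumptions: the repositories and owners below are invented, and the join is done with base R `merge`, which may differ from what the talk used.

```r
# Hypothetical GitHub Archive repositories: API-style urls, mostly no language.
archive <- data.frame(
  id = c(11, 12, 13),
  url = c("https://api.github.com/repos/alice/alpha",
          "https://api.github.com/repos/bob/beta",
          "https://api.github.com/repos/carol/gamma"),
  language = c("R", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical Google BigQuery export: github.com urls with language, no id.
bigquery <- data.frame(
  url = c("https://github.com/bob/beta", "https://github.com/alice/alpha"),
  language = c("Python", "R"),
  stringsAsFactors = FALSE
)

# Rewrite API urls into the github.com form so both sources share a key.
archive$url <- sub("https://api.github.com/repos/", "https://github.com/",
                   archive$url, fixed = TRUE)

# Left join: keep every archive repository, attach BigQuery language if any.
combined <- merge(archive, bigquery, by = "url", all.x = TRUE,
                  suffixes = c("", ".bq"))

# Fill in the language only where the archive had none.
missing <- is.na(combined$language)
combined$language[missing] <- combined$language.bq[missing]
combined$language.bq <- NULL
print(combined)
```

Repositories found in neither source keep `NA` as their language and still need another lookup.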
Sorted data
Active repositories per language
Goal: Using GitHub API
Task: Retrieve language info for repository
• GET /repos/:owner/:repo/languages
• Owner: BasiaFusinska
• Repo: RWorkshop
GitHub API from R
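A sketch of calling the languages endpoint from R. The helper names `languages_url` and `fetch_languages` are hypothetical, and using `jsonlite::fromJSON` to fetch and parse in one step is an assumption, not necessarily the approach shown in the talk.

```r
# Build the endpoint url for a repository (hypothetical helper).
languages_url <- function(owner, repo) {
  sprintf("https://api.github.com/repos/%s/%s/languages", owner, repo)
}

# Fetch and parse the response (hypothetical helper). jsonlite::fromJSON
# accepts a url; the call needs network access and is rate limited.
fetch_languages <- function(owner, repo) {
  jsonlite::fromJSON(languages_url(owner, repo))
}

url <- languages_url("BasiaFusinska", "RWorkshop")
print(url)

# Not run here because it depends on the network:
# languages <- fetch_languages("BasiaFusinska", "RWorkshop")
```

The response maps each language name to the number of bytes of code in that language.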
Task: Calling GitHub Search
• GET /search/repositories
• Querying: q parameter
• Paging: page parameter
GitHub API Search
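A sketch of building a search request with the `q` and `page` parameters. The helpers `search_url` and `fetch_search_page` are hypothetical names, and the `language:R` query is only an illustration.

```r
# Build a search url; the query is percent-encoded so that characters such
# as ":" survive inside the q parameter (hypothetical helper).
search_url <- function(query, page = 1) {
  paste0("https://api.github.com/search/repositories",
         "?q=", URLencode(query, reserved = TRUE),
         "&page=", page)
}

# Fetch one page of results (hypothetical helper, needs network access).
fetch_search_page <- function(query, page = 1) {
  jsonlite::fromJSON(search_url(query, page))
}

url <- search_url("language:R", page = 2)
print(url)

# Not run here because it depends on the network:
# result <- fetch_search_page("language:R", page = 2)
```

Looping over increasing `page` values until a page comes back empty is one way to walk through all results, within the API's rate limits.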
Digression
Data…
Big Data in R
• What’s _Big_Data_ anyway?
• R processes data in memory
• Bring down only the data you need
• Stream the data from the database
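Because R holds data in memory, one option is to read a large file in chunks and keep only a running summary. The sketch below streams invented event lines from a text connection; with real data the connection would point at a file, and the chunk size of 2 is only for illustration.

```r
# Made-up event lines standing in for a file too large to load at once.
sample_lines <- c(
  '{"type":"PushEvent"}',
  '{"type":"PullRequestEvent"}',
  '{"type":"PushEvent"}',
  '{"type":"CreateEvent"}',
  '{"type":"PullRequestEvent"}'
)

con <- textConnection(sample_lines)
chunk_size <- 2
pull_request_count <- 0

# Read a few lines at a time; only the running count stays in memory.
repeat {
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break
  matches <- grepl('"type":"PullRequestEvent"', chunk, fixed = TRUE)
  pull_request_count <- pull_request_count + sum(matches)
}
close(con)
print(pull_request_count)
```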
Azure Machine Learning Studio
Azure ML R Script
Datasets
Reading pull requests experiments
Analyzing the whole day
R Script
Pull requests - visualization
To summarize…
• Data science is not rocket science; much of it is yak shaving
• Different sources – different truths
• Capturing & storing data
• Data science UI – visualization is the key
• Desktop – hypothesis, development
• Cloud – production
What’s next?
• Data Exploration in R, workshop
• barbarafusinska.com, blog
• katacoda.com, interactive learning platform
Thank you
