R - scripted data
History
Language
Packages
Tools
RPubs
Slidify
Shiny
A Brief History of R
– 1976 S - Bell Labs; Fortran
– John Chambers
– 1988 S Version 3; C language
● 1991 R Created
– Ross Ihaka and Robert Gentleman
● 1993 R Announced
– 1993 S licensed to StatSci (now Insightful)
● 2000 R Version 1.0.0 released
– 2004 S purchased from Lucent (2MM)
– 2008 TIBCO acquires Insightful (25MM)
Other “Stats” Tools
● R – additional, commercial support
Oracle: “Big Data Appliance” - R + Hadoop
+ Linux + NoSQL + Exadata(H/W)
IBM: R executing in Hadoop (massively
parallel in-database analytics)
● SAS (SAS Institute) dev. 1966, 1st rel 1972
● SPSS (IBM) 1st rel 1968
Model Development and
Execution Comparison
http://inside-bigdata.com/2014/06/25/revolution-r-enterprise-vs-sas-performance/
Oracle + INTEL Libraries
https://blogs.oracle.com/R/entry/oracle_r_distribution_performance_benchmark
Language
● Derivative of S (S-PLUS)
● Portable (includes Playstation 3)
● Interpreted, calls into C libraries
● Functional!
● GPL
● 40 year old technology
● Open Source (you want it, you do it)
Data Types
● Symbols refer to objects
● Object attributes
– names
– dimnames
– dimensions
– class
– length
– user defined attributes/metadata
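A minimal sketch of the attributes listed above, using only base R. The attribute name "units" and the variable names are illustrative assumptions, not part of the original slides.

```r
# A named numeric vector: 'names' is stored as an attribute of the object.
x <- c(a = 1, b = 2, c = 3)
names(x)            # "a" "b" "c"
length(x)           # 3

# Setting the 'dim' attribute turns a plain vector into a matrix.
m <- 1:6
dim(m) <- c(2, 3)
dimnames(m) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
class(m)            # "matrix" (plus "array" in R 4.0 and later)

# User-defined metadata can be attached with attr().
attr(x, "units") <- "cm"   # "units" is an arbitrary, illustrative name
attributes(x)              # lists both 'names' and 'units'
```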
Data Types
● Object types – single class, except list
– List
(may have mixed classes)
– Vectors
(scalar is a vector of length 1)
– Matrices
(vector with 'dimension' attribute)
(column major order)
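The object types above can be sketched in a few lines of base R. Variable names are illustrative.

```r
v <- c(1.5, 2, 3)            # atomic vector: every element has the same class
s <- 42                      # a "scalar" is just a vector of length 1
length(s)                    # 1

l <- list(1L, "two", TRUE)   # a list may mix classes
sapply(l, class)             # "integer" "character" "logical"

m <- matrix(1:6, nrow = 2)   # values fill column by column (column-major)
m[, 1]                       # 1 2
m[2, 3]                      # 6
```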
Data Types
● Object types
– Factors
● Categorical data (like an enumeration)
– Data frames
● Special list, each element has same length
● Elements are columns; their length is the number of rows
● Each element (column) has its own type
● row.names() attribute to name the rows
● Convert to matrix with data.matrix()
● Load with read.table(), read.csv()
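As a rough illustration of factors and data frames, the sketch below builds a small data frame, converts it with data.matrix(), and round-trips it through a temporary CSV file. The column names and values are made up for the example.

```r
# A factor stores categorical data as integer codes plus a set of levels.
size <- factor(c("small", "large", "small"), levels = c("small", "large"))
levels(size)        # "small" "large"
as.integer(size)    # 1 2 1

# A data frame is a list of equal-length columns, each with its own type.
df <- data.frame(id = 1:3, size = size, weight = c(1.2, 5.5, 0.9))
row.names(df) <- c("a", "b", "c")
sapply(df, class)   # "integer" "factor" "numeric"

# data.matrix() converts every column to numbers (factors become their codes).
dm <- data.matrix(df)

# Round trip through a CSV file in a temporary location.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)
```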
Data Types
● Object “atomic” classes
– character
– numeric (double precision real)
– integer
– complex
– logical (booleans)
Numeric includes Inf and NaN (integer has only NA)
1 / Inf == 0 !
any class can be NA
NaN is NA, NA is not NaN
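A short base R sketch of the atomic classes and the special values mentioned above; it can be pasted into a console to check each claim.

```r
class("a")      # "character"
class(1.5)      # "numeric"
class(2L)       # "integer"
class(1i)       # "complex"
class(TRUE)     # "logical"

1 / Inf == 0    # TRUE
0 / 0           # NaN

# NaN counts as missing, but a missing value is not NaN.
is.na(NaN)      # TRUE
is.nan(NA)      # FALSE

# Each atomic class has its own typed NA.
class(NA_integer_)     # "integer"
class(NA_character_)   # "character"
```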
Data Types
● Dates
– “Date” class
– Days since epoch (1970-01-01)
● Times
– “POSIXct” or “POSIXlt” class
– Seconds since epoch
● Coerce from string with as.Date(); to string with format()
● Generic functions include 'weekdays()',
'months()', 'quarters()'
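A small sketch of the date and time classes, assuming an arbitrary example date. Note that weekdays() and months() return locale-dependent names, so no output is shown for them.

```r
d <- as.Date("2014-07-04")          # parse a string into a Date
class(d)                            # "Date"
unclass(d)                          # days since 1970-01-01
format(d, "%Y/%m/%d")               # Date back to a string: "2014/07/04"

tm <- as.POSIXct("2014-07-04 12:00:00", tz = "UTC")
unclass(tm)                         # seconds since 1970-01-01 00:00:00 UTC

weekdays(d)                         # day name in the current locale
months(d)                           # month name in the current locale
quarters(d)                         # "Q3"
```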
Operators
● Grouping: ()
● Assignment: to<-from AND from->to
● Vectorized: + - ! * / ^ %% & |
● ~ ? : %/% %*% %o% %x% %in% < > == >=
<= && ||
● Element access: [[]] [] $
● Function argument types:
– symbol, symbol=default, ...
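The operators above are easiest to see on a small vector. This is a sketch with made-up values; the function f and its arguments are hypothetical.

```r
x <- c(1, 2, 3, 4)      # leftward assignment
10 -> y                 # rightward assignment

x * 2                   # 2 4 6 8 (vectorized)
x %% 2                  # 1 0 1 0 (modulo)
x %/% 2                 # 0 1 1 2 (integer division)
x > 2 & x < 4           # FALSE FALSE TRUE FALSE
x %in% c(2, 4)          # FALSE TRUE FALSE TRUE
x %*% x                 # 1 x 1 matrix holding the inner product, 30
dim(x %o% 1:2)          # 4 2 (outer product)

l <- list(a = 1:3, b = "text")
l["a"]                  # single bracket: a list of length 1
l[["a"]]                # double bracket: the element itself, 1 2 3
l$b                     # "text"

# Formal arguments: a plain symbol, a symbol with a default, and '...'.
f <- function(n, power = 2, ...) n^power
f(3)                    # 9
f(3, power = 3)         # 27
```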
Control Structures
● if, else
● for
● while
● repeat
● break, next, return
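A brief sketch that exercises each control structure once. The function first_negative is a hypothetical helper written for this example.

```r
# for + if + next: sum the odd numbers from 1 to 10.
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next        # skip even numbers
  total <- total + i
}
total                          # 25

# while: loop until the condition fails.
n <- 0
while (n < 3) n <- n + 1

# repeat + break: loop forever until told to stop.
k <- 0
repeat {
  k <- k + 1
  if (k >= 5) break
}

# return: leave a function early with a value.
first_negative <- function(v) {
  for (value in v) {
    if (value < 0) return(value)
  }
  NA                           # no negative value found
}
first_negative(c(3, -1, -5))   # -1
```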
Apply
● apply – apply functions over arrays
● lapply – apply functions over list / vector
● sapply – like lapply, but simplifies the result
● tapply – apply function over ragged array
● mapply – apply function to multiple objects
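One call per member of the apply family, as a sketch on toy data (all values are made up):

```r
m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)                        # row sums: 9 12
apply(m, 2, sum)                        # column sums: 3 7 11

squares_list <- lapply(1:3, function(i) i^2)   # always returns a list
squares_vec  <- sapply(1:3, function(i) i^2)   # simplified to a vector: 1 4 9

# tapply: apply a function to groups defined by a second vector.
grp <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), mean)   # a = 20, b = 30

# mapply: walk over several vectors in parallel.
sums <- mapply(function(x, y) x + y, 1:3, 4:6)                  # 5 7 9
```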
Functions
● Functions are objects
● Functional closure consists of:
– Formal argument list
– Function body (definition)
– Environment
● Each of these can be assigned to
● Assigning to the environment can eliminate
unwanted environment capture
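The three parts of a closure can be inspected and reassigned with formals(), body(), and environment(). The sketch below uses hypothetical functions (f, make_fn) and assumes no global variable named `big` exists.

```r
f <- function(x, y = 2) x + y

formals(f)$y <- 10          # change the default of a formal argument
body(f) <- quote(x * y)     # replace the function body
f(3)                        # 30

# A function created inside another function captures that function's
# local environment, including objects it never uses.
make_fn <- function() {
  big <- numeric(1e6)       # large local object, kept alive by the closure
  function(x) x + 1
}
g <- make_fn()
captured_before <- exists("big", envir = environment(g), inherits = FALSE)

# Reassigning the environment drops the reference to the captured locals.
environment(g) <- globalenv()
captured_after <- exists("big", envir = environment(g), inherits = FALSE)
g(1)                        # 2, the function still works
```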
Packages
● CRAN (Comprehensive R Archive Network)
– Main site, includes R download
● Bioconductor
– Analysis of genomic data
– Next generation high-throughput
sequencing
● R-forge
● GitHub and Personal repositories
Packages
● Analysis
– Statistical analysis (stats, linprog)
● Linear (and generalized linear) modeling
● Tree models
● Analysis of variance
– Machine learning (caret, kernlab)
● Classification and clustering (random forests, k-means, knn, etc.)
● Training and predictions
● Cross validation and error analysis
Packages
● Graphics
– Base graphics
● Plot: plot, hist, ...
● Annotate: text, lines, points, axis, ...
– Lattice
● Single command: xyplot, bwplot, ...
– ggplot2
● Single command: qplot
● Defining objects: aesthetics, geoms
● Chain commands: ggplot, geom_*, ...
Packages
● Data visualization
– rCharts (GitHub), converts visualizations to
JavaScript (e.g. d3.js)
http://www.google.com/trends/explore#q=R%20language%2C%20Data%20Visualization%2C%20D3.js%2C%20Processing.js&cmpt=q
Tools
● Command line
● RStudio (can run on remote Linux server)
● RKWard
● R Commander (tcl/tk)
● JGR – Java (GUI for R)
● Rattle - RGtk2
Tools
● Debugging
– Print statements!
– Interactive tools:
● traceback() – stack trace on error
● debug() – flags function for stepping
● browser() - stops function and enters debug
● trace() - insert trace statements
● recover() - modify error behavior, can
browse call stack
Tools
● Profiling
– “We should forget about small efficiencies,
say about 97% of the time: premature
optimization is the root of all evil”
– Donald Knuth
– system.time() - CPU, wall times
– Rprof() - use summaryRprof() to see results
● Do not use Rprof() and system.time()
together
● Calls to C/Fortran libraries not profiled
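A minimal timing sketch with system.time(). The function slow_sum is a deliberately naive, hypothetical example; actual timings depend on the machine, so none are shown.

```r
# An explicit loop, written slowly on purpose so there is something to time.
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

timing <- system.time(result <- slow_sum(1e5))
timing[["user.self"]]   # CPU seconds spent in R code
timing[["elapsed"]]     # wall-clock seconds

# The vectorized built-in does the same work in compiled code.
timing_fast <- system.time(result_fast <- sum(as.numeric(1:1e5)))
```

Comparing the two elapsed values is usually enough to decide whether a piece of code deserves closer profiling.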
Data Exploration
● Script it!
– If you can't repeat it, it didn't happen
● Get the data (ingest)
– Functions to download, uncompress,
unarchive, store, read, and organize
● Clean the data
– Handle missing and incomplete data,
impute values, identify outliers
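A sketch of the cleaning step on a toy vector. Median imputation and the boxplot rule for outliers are one possible choice among many, picked here only for illustration.

```r
x <- c(1, 2, NA, 3, 100)

sum(is.na(x))                 # 1 missing value
mean(x)                       # NA: missing values propagate
mean(x, na.rm = TRUE)         # 26.5

# Impute the missing value with the median of the observed values.
filled <- x
filled[is.na(filled)] <- median(x, na.rm = TRUE)   # median is 2.5

# Flag points beyond 1.5 times the interquartile range from the hinges.
outliers <- boxplot.stats(filled)$out              # 100
```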
Data Exploration
● Look at the data (models, visualization)
– Model – regressions (linear, logistic),
clustering, ANOVA
– Refine models and plot the result
● Look for systematic issues – unexpected
trends, bias, unexplained variance, error
estimates, residual analysis
● Explore complexity – number of explanatory
factors
– Plot the models
● What does it look like?
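As a rough example of the model-then-inspect loop, the sketch below fits a linear regression on the built-in mtcars data set. The choice of data set and variables is an assumption made for illustration.

```r
# Fit fuel economy as a linear function of weight.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                     # intercept and slope
r2 <- summary(fit)$r.squared  # share of variance explained

# Residual analysis: residuals should scatter around zero with no trend.
res <- residuals(fit)
summary(res)

# Uncomment to look at the model and its residuals:
# plot(mtcars$wt, mtcars$mpg); abline(fit)
# plot(fitted(fit), res); abline(h = 0)
```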
Reproducible Research
● Allows others to validate the work
● Ensures that the results are accepted
● Reduces the chance of errors propagating
– http://youtu.be/7gYIs7uYbMo
– 2010 Anil Potti resigns from Duke after
research was found flawed (off by 1!)
● Clinical trials based on the flawed research
were finally cancelled
● Closed data, non-reproducible research
exacerbated the problem
Reproducible Research
● Don't do things by hand – especially editing
spreadsheets to “clean up” data (removing
outliers, validating, editing) or downloading
files
● Actions taken by hand need very detailed
documentation to reproduce – such as
download sites and where the files were
saved
● GUIs are convenient, but can't be repeated
Reproducible Research
● Capture the steps in a script:
– download.file("http://...", "localfile.zip")
● Can be repeated as long as the link is
available. Can keep and manage the
downloaded file if that is an issue
– Use version control
● Capture small steps at a time (git is good
for this!)
● Can track changes and revert if needed
● Can use GitHub, Bitbucket, SourceForge to
publish the results as well
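The download step can be scripted so that it is safe to re-run. The helper fetch_once below is a hypothetical wrapper around download.file(), and the URL in the usage comment is a placeholder, not a real data source.

```r
# Download a file only if a local copy does not exist yet, so re-running
# the script reuses the managed copy instead of hitting the link again.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")   # "wb" keeps binary files intact
  }
  dest
}

# Usage (placeholder URL, replace with the real source):
# zip_path <- fetch_once("http://example.org/data.zip", "localfile.zip")
# unzip(zip_path, exdir = "data")
```

Keeping the script, rather than the downloaded file, under version control records where the data came from.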
Reproducible Research
● Capture environment – OS, tools, versions
● Don't save outputs – regenerate
– Ok to cache results while in use, but don't
store the results, just the code+data that
produced it
– If you keep intermediate files, document
how they were created
● Set random seed
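A short sketch of the last two points: fixing the random seed and recording the environment. The seed value 42 is arbitrary.

```r
# The same seed gives the same "random" numbers on every run.
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)        # TRUE

# Record the R version, OS, and attached packages alongside the results.
info <- sessionInfo()
R.version.string
```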
Sharing Research
● R Markdown – markdown with embedded R
– knitr package executes the R fragments
and embeds the code and results into
markdown, which can convert to HTML or
PDF
– Literate programming!
● Hosted documentation
– RPubs (rpubs.com)
– GitHub gh-pages (github.io)
Sharing Research
● Embedded presentations
– Author using slidify package
– R Markdown with embedded R code
– Creates HTML5 presentation slide deck
– Can include inline quizzes
Data Products
● Interactive visualizations
– shiny, shinyapps packages
– RStudio includes interactive display of
shiny applications during development
– Generates Bootstrap + HTML5 + JavaScript
+ d3 application
● Hosted!
– Hosted at shinyapps.io
– Private? Server images available (for
purchase)

R - the language
