what is
DATA SCIENCE?
- making an impact with data -
Ta Virot Chiraphadhanakul
Data Scientist, Facebook
Harvard Business Review (October 2012)
what is data science?
VALUABLE INSIGHTS
transforming data into
“No one should die because they cannot afford
health care and no one should go broke because
they get sick. If you agree please post this as your
status for the rest of the day”
Information Evolution in Social Networks (Adamic et al. 2014)
Social Influence in Social Advertising: Evidence from Field Experiments (Bakshy et al. 2012)
A 61-million-person experiment in social influence and political mobilization (Bond et al. 2012)
DATA PRODUCTS
transforming data into
email classification
spam detection
INTERESTING STORIES
and sometimes, transforming data into
http://www.google.org/flutrends/
Coordinated Migration (Facebook Data Science page)
https://jawbone.com/blog/2014-year-review/
How families interact on Facebook (Facebook Data Science page)
social good
data science for
making data accessible to everyone
OPEN DATA
https://www.alltuition.com/
www.allthebuses.com
http://mbtaviz.github.io/
http://fragile-success.rpa.org/maps/jobs.html
don’t have the data you want? just ask for it!
CROWDSOURCING DATA
http://www.openstreetmap.org/
http://www.streetbump.org/
http://www.waze.com/
https://www.23andme.com/
hack-a-thons, meet-ups, competitions, fellowships
BUILDING COMMUNITIES
https://seanjtaylor.github.io/out-for-justice/
http://ivory-infinity-763.appspot.com/
http://www.kaggle.com
hubway.virot.me bayareabikeshare.virot.me
ELEMENTS of data science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
CODING, CODING, CODING
STATISTICS
data wrangling (aka data janitor work)
BIG DATA, BIG PILE OF JUNK(?)
Harvard Business Review (April 2014)
http://xkcd.com/605/
ice cream consumption vs. murders
CORRELATION ⇏ CAUSATION
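A minimal sketch of how a hidden common cause can produce the ice cream vs. murders pattern. Everything here is an assumption made for illustration: the data are simulated, temperature is the hypothetical confounder, and the coefficients and noise levels are arbitrary. Neither variable depends on the other, yet they correlate until temperature is controlled for.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

def simulate(n=5000, seed=0):
    """Simulated (not real) data: temperature drives both outcomes."""
    rng = random.Random(seed)
    temperature = [rng.gauss(20, 8) for _ in range(n)]
    # Hypothetical coefficients; neither outcome depends on the other.
    ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
    murders = [0.5 * t + rng.gauss(0, 4) for t in temperature]
    return temperature, ice_cream, murders

temperature, ice_cream, murders = simulate()
raw = pearson(ice_cream, murders)
# Partial correlation: correlate what is left after removing temperature.
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(murders, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {adjusted:.2f}")
```

The raw correlation is clearly positive, while the adjusted one should sit near zero. In real data the confounder is often unobserved, so this adjustment is not always available.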
randomized controlled experiments, a/b testing
CAUSAL INFERENCE
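One common way to read out an a/b test is a two-proportion z-test on conversion counts. The sketch below is one possible analysis, not a method prescribed by these slides; the function name and the counts in the usage line are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in control and treatment.
    n_a, n_b: number of users randomly assigned to each group.
    Returns (z, p_value). Relies on the normal approximation, so it
    assumes reasonably large groups.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because users are assigned to groups at random, a difference that survives this test can be attributed to the treatment rather than to a confounder, which is what separates an experiment from an observed correlation.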
MACHINE LEARNING
regression, classification, clustering, collaborative filtering, etc.
MACHINE LEARNING TASKS
feature engineering
GARBAGE IN, GARBAGE OUT
cross-validation and
penalizing model complexity (regularization)
AVOID OVERFITTING
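The two ideas on this slide can be sketched together: k-fold cross-validation estimates out-of-sample error, and a ridge (L2) penalty shrinks the coefficients of a flexible model. This is a standard-library-only illustration under assumed settings (polynomial features, simulated noisy data, an arbitrary grid of penalty strengths); all function names are hypothetical.

```python
import random

def poly_features(x, degree):
    """[1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ridge(rows, ys, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2, intercept unpenalized."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    for i in range(1, p):  # index 0 is the intercept column
        xtx[i][i] += lam
    return solve(xtx, xty)

def predict(w, row):
    return sum(wi * xi for wi, xi in zip(w, row))

def cv_mse(xs, ys, degree, lam, k=5):
    """Mean squared error over k interleaved held-out folds."""
    rows = [poly_features(x, degree) for x in xs]
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train = [i for i in range(n) if i not in test_idx]
        w = fit_ridge([rows[i] for i in train], [ys[i] for i in train], lam)
        total += sum((predict(w, rows[i]) - ys[i]) ** 2 for i in test_idx)
    return total / n

# Simulated data: a straight line plus noise, fitted with a degree-6 model.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [3 * x + 1 + rng.gauss(0, 0.5) for x in xs]
for lam in (0.001, 0.1, 10.0):
    print(f"lam={lam}: cv mse = {cv_mse(xs, ys, degree=6, lam=lam):.3f}")
```

The penalty strength would normally be chosen as the value with the lowest cross-validated error. The interleaved fold split assumes the rows are already in random order; shuffle first if they are not.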
Nature (February 2013)
DOMAIN EXPERTISE
measure the right things
METRIC, METRIC, METRIC
Seven Pitfalls to Avoid when Running Controlled Experiments on the Web (Crook et al. 2009)
speed, simplicity, cost of obtaining data, …
BEYOND MODEL ACCURACY
VISUALIZATION
Graphs in Statistical Analysis (Anscombe 1973)
EXPLORATORY DATA ANALYSIS
summarize and visualize important characteristics of a data set
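A small sketch of why summaries alone are not enough, using the first two of the four data sets from Anscombe (1973). The values below are transcribed from the commonly reproduced version of that paper, not from these slides, so treat the transcription as an assumption. Both sets share the same x values and nearly identical summary statistics, yet one is a noisy line and the other a smooth curve.

```python
def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Anscombe's data sets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(f"set {name}: mean(y)={mean(y):.2f} "
          f"var(y)={sample_variance(y):.2f} corr(x, y)={pearson(x, y):.3f}")
```

The printed summaries agree to about two decimal places, which is why plotting the points is a necessary step before modelling.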
http://bl.ocks.org/mbostock/4063663
http://xkcd.com/1138/
THANK YOU
Ta Virot Chiraphadhanakul
http://ta.virot.me | @tvirot

Data Science 101