An Introduction to Data Mining with R

Yanchang Zhao
http://www.RDataMining.com

6 September 2013


Questions

Are you familiar with data mining and its techniques?

Have you used R before?

Have you used R in your data mining research or projects?


Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources

What is R?
R [1] is a free software environment for statistical computing
and graphics.
R can be easily extended with 4,728 packages available on
CRAN [2] (as of Sept 6, 2013).
Many other packages provided on Bioconductor [3], R-Forge [4],
GitHub [5], etc.
R manuals on CRAN [6]:
  An Introduction to R
  The R Language Definition
  R Data Import/Export
  ...

[1] http://www.r-project.org/
[2] http://cran.r-project.org/
[3] http://www.bioconductor.org/
[4] http://r-forge.r-project.org/
[5] https://github.com/
[6] http://cran.r-project.org/manuals.html

Why R?

R is widely used in both academia and industry.
R is ranked no. 1 again in the KDnuggets 2013 poll on Top
Languages for analytics, data mining, data science [7].
The CRAN Task Views [8] provide collections of packages for
different tasks:
  Machine learning & statistical learning
  Cluster analysis & finite mixture models
  Time series analysis
  Multivariate statistics
  Analysis of spatial data
  ...

[7] http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html
[8] http://cran.r-project.org/web/views/



Classification with R

Decision trees: rpart, party
Random forest: randomForest, party
SVM: e1071, kernlab
Neural networks: nnet, neuralnet, RSNNS
Performance evaluation: ROCR


The Iris Dataset
# iris data
str(iris)

## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 .
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1
## $ Species     : Factor w/ 3 levels "setosa","versicolor",..:
# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]
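
The split above labels each row independently, so the 70/30 proportions hold only approximately. A minimal sketch for checking the realised sizes, and for drawing an exact 70% sample instead, might look like this. The names split.sizes, train.idx, iris.train.exact and iris.test.exact are illustrative and do not appear in the original slides, and the exact-size variant is an alternative rather than the slide's method.

```r
# Label every row 1 (training) or 2 (test) independently, as on the slide.
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))

# Realised split sizes and proportions; these depend on the seed.
split.sizes <- table(ind)
print(split.sizes)
print(round(prop.table(split.sizes), 2))

# Alternative (an assumption, not the slide's method): sample exactly 70%
# of the row indices without replacement.
train.idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris.train.exact <- iris[train.idx, ]
iris.test.exact <- iris[-train.idx, ]
```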


Build a Decision Tree

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)


plot(iris.ctree)

[Figure: the fitted conditional inference tree. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes: Node 2 (n = 40), Node 5 (n = 21), Node 6 (n = 19), Node 7 (n = 32), each shown with a bar chart of class proportions.]


Prediction

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         2
##   virginica       0          0        14
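
The confusion matrix above can be summarised numerically. The sketch below re-enters the counts printed on the slide (rather than re-running the model) and derives overall accuracy plus per-class precision and recall. The object names are illustrative and not part of the original slides.

```r
# Counts copied from the slide: rows are predictions, columns are true classes.
classes <- c("setosa", "versicolor", "virginica")
conf.mat <- matrix(c(10,  0,  0,
                      0, 12,  2,
                      0,  0, 14),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pred = classes, actual = classes))

# Overall accuracy: correctly classified cases over all test cases.
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)

# Precision: share of each predicted class that is correct (row-wise).
precision <- diag(conf.mat) / rowSums(conf.mat)

# Recall: share of each true class that is recovered (column-wise).
recall <- diag(conf.mat) / colSums(conf.mat)

print(round(accuracy, 3))  # 36 / 38, i.e. 0.947
print(round(precision, 3))
print(round(recall, 3))
```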


Clustering with R

k-means: kmeans(), kmeansruns() [9]
k-medoids: pam(), pamk()
Hierarchical clustering: hclust(), agnes(), diana()
DBSCAN: fpc
BIRCH: birch

[9] Functions are followed with "()", and others are packages.

k-means Clustering
set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
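
One way to read the cross-table above is cluster purity: assign each cluster its majority species and count how many flowers agree. The sketch below re-enters the counts from the slide instead of re-running kmeans(), because cluster numbering and membership depend on the random start. The names km.tab, majority and purity are illustrative.

```r
# Counts copied from the slide: rows are species, columns are clusters.
km.tab <- matrix(c( 0, 50,  0,
                    2,  0, 48,
                   36,  0, 14),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(species = c("setosa", "versicolor", "virginica"),
                                 cluster = c("1", "2", "3")))

# Majority species within each cluster.
majority <- rownames(km.tab)[apply(km.tab, 2, which.max)]
names(majority) <- colnames(km.tab)
print(majority)

# Purity: flowers that belong to their cluster's majority species.
purity <- sum(apply(km.tab, 2, max)) / sum(km.tab)
print(round(purity, 3))  # 134 / 150, i.e. 0.893
```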


# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
       col=1:3, pch="*", cex=5)

[Figure: scatter plot of Sepal.Width against Sepal.Length, coloured by cluster, with the three cluster centers marked by asterisks.]


Density-based Clustering

library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33
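
The roles of eps and MinPts can be illustrated with base R alone. In DBSCAN a point is a core point when enough points lie within distance eps of it; clusters grow from core points, and points reachable from none of them are labelled as noise (cluster 0 above). The sketch below only identifies core points and is not a full DBSCAN implementation. Whether a point counts itself toward MinPts differs between implementations; this sketch counts it, which is an assumption rather than a statement about fpc.

```r
# Same data and parameters as on the slide.
x <- as.matrix(iris[-5])
eps <- 0.42
min.pts <- 5

# Pairwise Euclidean distances between all flowers.
d <- as.matrix(dist(x))

# Number of points within eps of each point (the point itself included).
n.neighbours <- rowSums(d <= eps)

# Core points have at least min.pts points in their eps-neighbourhood.
is.core <- n.neighbours >= min.pts
print(table(is.core, iris$Species))
```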


# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)

[Figure: plotcluster() projection of the data onto two discriminant coordinates (dc 1 on the horizontal axis, dc 2 on the vertical axis). Each point is drawn as its cluster number, with 0 marking outliers or noise.]



Association Rule Mining with R

Association rules: apriori(), eclat() in package arules
Sequential patterns: arulesSequence
Visualisation of associations: arulesViz


The Titanic Dataset
load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201    4

idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##      Class    Sex   Age Survived
## 501    3rd   Male Adult       No
## 477    3rd   Male Adult       No
## 674    3rd   Male Adult       No
## 766   Crew   Male Adult       No
## 1485   3rd Female Adult       No
## 1388   2nd Female Adult       No
## 448    3rd   Male Adult       No
## 590    3rd   Male Adult       No

Association Rule Mining

# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw, control=list(verbose=F),
parameter=list(minlen=2, supp=0.005, conf=0.8),
appearance=list(rhs=c("Survived=No", "Survived=Yes"),
default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)
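
The three quality measures used above (support, confidence and lift) can be computed by hand for a single rule. The sketch below does this for {Class=2nd, Age=Child} => {Survived=Yes} with base R only. It assumes that titanic.raw corresponds to R's built-in Titanic table, which also has 2201 passengers and the same four attributes; the slides do not state where the file comes from, so treat that as an assumption. All object names are illustrative.

```r
# Built-in contingency table as a data frame with a Freq column.
tt <- as.data.frame(Titanic)
n <- sum(tt$Freq)

# Row selectors for the rule's left-hand and right-hand sides.
lhs <- tt$Class == "2nd" & tt$Age == "Child"
rhs <- tt$Survived == "Yes"

# Support: share of all passengers matching an itemset.
supp.lhs <- sum(tt$Freq[lhs]) / n
supp.rhs <- sum(tt$Freq[rhs]) / n
supp.rule <- sum(tt$Freq[lhs & rhs]) / n

# Confidence: share of lhs passengers that also match rhs.
conf.rule <- supp.rule / supp.lhs

# Lift: confidence relative to the baseline rate of rhs.
lift.rule <- conf.rule / supp.rhs

print(round(c(support = supp.rule, confidence = conf.rule, lift = lift.rule), 3))
```

A lift well above 1 means the left-hand side makes survival considerably more likely than it is for passengers overall.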


#    lhs              rhs            support confidence  lift
# 1  {Class=2nd,
#     Age=Child}   => {Survived=Yes}   0.011      1.000 3.096
# 2  {Class=2nd,
#     Sex=Female,
#     Age=Child}   => {Survived=Yes}   0.006      1.000 3.096
# 3  {Class=1st,
#     Sex=Female}  => {Survived=Yes}   0.064      0.972 3.010
# 4  {Class=1st,
#     Sex=Female,
#     Age=Adult}   => {Survived=Yes}   0.064      0.972 3.010
# 5  {Class=2nd,
#     Sex=Male,
#     Age=Adult}   => {Survived=No}    0.070      0.917 1.354
# 6  {Class=2nd,
#     Sex=Female}  => {Survived=Yes}   0.042      0.877 2.716
# 7  {Class=Crew,
#     Sex=Female}  => {Survived=Yes}   0.009      0.870 2.692
# 8  {Class=Crew,
#     Sex=Female,
#     Age=Adult}   => {Survived=Yes}   0.009      0.870 2.692
# 9  {Class=2nd,
#     Sex=Male}    => {Survived=No}    0.070      0.860 1.271
# 10 {Class=2nd,

library(arulesViz)
plot(rules, method = "graph")

[Figure: "Graph for 12 rules". Edge width encodes support (0.006 - 0.192) and colour encodes lift (1.222 - 3.096). The left-hand-side itemsets are linked to {Survived=Yes} or {Survived=No}.]


Text Mining with R

Text mining: tm
Topic modelling: topicmodels, lda
Word cloud: wordcloud
Twitter data access: twitteR


Fetch Twitter Data
## retrieve tweets from the user timeline of @rdatamining
library(twitteR)
# tweets <- userTimeline('rdatamining')
load(file = "./data/rdmTweets.RData")
(nDocs <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 50)
## [1] "An R Reference Card for Data Mining is now"
## [2] "available on CRAN. It lists many useful R"
## [3] "functions and packages for data mining"
## [4] "applications."

# convert tweets to a data frame
df <- do.call("rbind", lapply(tweets, as.data.frame))

Text Cleaning
library(tm)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation & numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
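
To see what these cleaning steps do to a single document, the same ideas can be sketched with base R string functions, without the tm package. The example tweet is made up for illustration. Unlike the slide, this sketch removes the URL before the punctuation, so the pattern can simply match everything up to the next space; on the slide the punctuation is stripped first, which is why its pattern only needs to match alphanumeric characters after "http".

```r
# A made-up tweet used only for illustration.
tweet <- "Slides on Text Mining with R: http://t.co/abc123 (updated 2013)!"

clean <- tolower(tweet)                        # convert to lower case
clean <- gsub("http[^[:space:]]*", "", clean)  # remove URLs
clean <- gsub("[[:punct:]]", "", clean)        # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)        # remove numbers
clean <- gsub("[[:space:]]+", " ", clean)      # collapse repeated spaces
clean <- trimws(clean)                         # drop leading/trailing spaces

print(clean)  # "slides on text mining with r updated"
```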


Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners",
replacement="mining")
strwrap(myCorpus[320], width=50)
## [1] "r reference card data mining now available cran"
## [2] "list used r functions package data mining"
## [3] "applications"


Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,
control=list(wordLengths=c(1,Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))
##  [1] "analysis"     "big"          "computing"
##  [4] "data"         "examples"     "mining"
##  [7] "network"      "package"      "position"
## [10] "postdoctoral" "r"            "research"
## [13] "slides"       "social"       "tutorial"
## [16] "university"   "used"

Associations

# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
## examples     code  package
##     0.32     0.29     0.20

# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##           data         mahout recommendation           sets       supports
##           0.47           0.30           0.30           0.30           0.30
##       frequent        itemset
##           0.26           0.26
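
The association scores above are correlations between terms across documents: two terms score highly when they tend to occur in the same tweets. As a rough illustration of that idea, and not a reproduction of the internals of findAssocs(), the sketch below computes Pearson correlations between the rows of a tiny, made-up term-document matrix with base R.

```r
# Made-up term-document matrix: rows are terms, columns are documents,
# entries are term counts.
tdm <- rbind(r       = c(1, 1, 0, 1, 0),
             package = c(1, 1, 0, 0, 0),
             mining  = c(0, 0, 1, 1, 1))
colnames(tdm) <- paste0("doc", 1:5)

# cor() works column-wise, so transpose to correlate terms with each other.
term.cor <- cor(t(tdm))

# Correlation of every other term with "r".
assoc.r <- term.cor["r", colnames(term.cor) != "r"]
print(round(assoc.r, 2))
```

Here "package" co-occurs with "r" in two of the three documents that contain "r" and is positively correlated with it, while "mining" mostly appears in the other documents and is negatively correlated.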


Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)

[Figure: network of the terms in freq.terms, with edges linking terms whose correlation is at least 0.1.]


Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=T)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)

[Figure: word cloud of the terms that occur at least four times in the tweets; more frequent terms are drawn larger.]


Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##      Topic 1       Topic 2    Topic 3  Topic 4
## [1,] "data"        "r"        "r"      "research"
## [2,] "mining"      "package"  "time"   "position"
## [3,] "big"         "examples" "series" "data"
## [4,] "association" "used"     "users"  "university"
## [5,] "rules"       "code"     "talk"   "postdoctoral"
##      Topic 5     Topic 6     Topic 7    Topic 8
## [1,] "mining"    "group"     "data"     "analysis"
## [2,] "data"      "data"      "r"        "network"
## [3,] "slides"    "used"      "mining"   "social"
## [4,] "modelling" "software"  "analysis" "text"
## [5,] "tools"     "kdnuggets" "book"     "slides"



Time Series Analysis with R

Time series decomposition: decomp(), decompose(), arima(), stl()
Time series forecasting: forecast
Time Series Clustering: TSclust
Dynamic Time Warping (DTW): dtw
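
As a small illustration of the decomposition functions listed above, the sketch below applies decompose() and stl() from base R to the built-in AirPassengers series. The choice of dataset, the log transform and the s.window setting are assumptions made for this example and are not taken from the slides.

```r
# Monthly airline passenger counts, 1949-1960 (built-in dataset).
ap <- AirPassengers

# Classical decomposition into seasonal, trend and random components.
ap.dec <- decompose(ap)
print(round(ap.dec$figure, 1))  # the 12 estimated monthly seasonal effects

# Loess-based decomposition; the log makes the seasonal swing roughly
# constant in size, and "periodic" keeps the seasonal pattern fixed.
ap.stl <- stl(log(ap), s.window = "periodic")
print(head(ap.stl$time.series))

# plot(ap.dec) and plot(ap.stl) draw the components as stacked panels.
```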



Social Network Analysis with R

Packages: igraph, sna
Centrality measures: degree(), betweenness(), closeness(), transitivity()
Clusters: clusters(), no.clusters()
Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
Community detection: fastgreedy.community(), spinglass.community()
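
To make two of the measures above concrete without installing a graph package, the sketch below computes degree and global transitivity for a small, made-up undirected graph directly from its adjacency matrix. This is a base-R illustration of the definitions, not the igraph or sna API; in practice the package functions listed above would be used.

```r
# Made-up undirected graph on five nodes A..E with edges
# A-B, A-C, B-C, C-D and D-E.
nodes <- LETTERS[1:5]
adj <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # mirror the edges to make the matrix symmetric

# Degree: number of neighbours of each node.
deg <- rowSums(adj)
print(deg)

# Triangles: closed walks of length 3, each triangle counted 6 times.
n.triangles <- sum(diag(adj %*% adj %*% adj)) / 6

# Connected triples: pairs of neighbours around each node.
n.triples <- sum(choose(deg, 2))

# Global transitivity: share of connected triples that close into triangles.
transitivity.global <- 3 * n.triangles / n.triples
print(transitivity.global)
```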


R and Hadoop
Packages: RHadoop, RHive
RHadoop [10] is a collection of 3 R packages:
  rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
  rhdfs - connect to Hadoop Distributed File System (HDFS)
  rhbase - connect to the NoSQL HBase database

You can play with it on a single PC (in standalone or
pseudo-distributed mode), and code developed there will also
work on a cluster of PCs (in fully distributed mode)!
Step by step to set up my first R Hadoop system
http://www.rdatamining.com/tutorials/rhadoop

[10] https://github.com/RevolutionAnalytics/RHadoop/wiki

An Example of MapReducing with R
library(rmr2)
map <- function(k, lines) {
words.list <- strsplit(lines, "\\s")
words <- unlist(words.list)
return(keyval(words, 1))
}
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
mapreduce(input = input, output = output, input.format = "text",
map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)

From Jeffrey Breen's presentation on Using R with Hadoop

http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
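
The word-count job above needs a Hadoop setup to run. To follow its logic without one, the map, shuffle and reduce phases can be imitated locally in base R. The sketch below is only an illustration of the data flow: it does not use rmr2, and the input lines and object names are made up.

```r
# Made-up input: each element plays the role of one line of a text file.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Map phase: split lines into words and emit a (word, 1) pair per word.
words <- unlist(strsplit(lines, "\\s+"))
map.out <- data.frame(key = words, val = 1, stringsAsFactors = FALSE)

# Shuffle phase: group the emitted values by key.
grouped <- split(map.out$val, map.out$key)

# Reduce phase: sum the values collected for each key.
counts <- sapply(grouped, sum)
print(counts)
```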

Online Resources
RDataMining website
http://www.rdatamining.com

R Reference Card for Data Mining
R and Data Mining: Examples and Case Studies

RDataMining Group on LinkedIn (3100+ members)
http://group.rdatamining.com

RDataMining on Twitter (1200+ followers)
http://twitter.com/rdatamining

Free online courses
http://www.rdatamining.com/resources/courses

Online documents
http://www.rdatamining.com/resources/onlinedocs


The End

Thanks!
Email: yanchang(at)rdatamining.com
