AssociationRule.pdf

Association Analysis
UE 141 Spring 2013
Jing Gao
SUNY Buffalo

Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence,
not causality!

Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an
itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than
or equal to a minsup threshold
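The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming transactions are stored as Python sets; the function names `support_count`, `support`, and `is_frequent` are made up for this sketch and do not come from the slides.

```python
# Market-basket transactions from the table above, stored as sets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """True if the support of `itemset` is at least `minsup`."""
    return support(itemset, transactions) >= minsup

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```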

Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain both X and Y
– Confidence (c)
• Measures how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
$$ s = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{|T|} = \frac{2}{5} = 0.4 $$
$$ c = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{\sigma(\{\text{Milk, Diaper}\})} = \frac{2}{3} \approx 0.67 $$
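As a rough sketch, the two metrics can be computed directly from their definitions. The helper name `rule_metrics` is an assumption made for illustration, and the data is the five-transaction market-basket table from the earlier slide.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    X, Y = set(X), set(Y)
    n_x = sum(1 for t in transactions if X <= t)         # sigma(X)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)  # sigma(X u Y)
    s = n_xy / len(transactions)
    c = n_xy / n_x if n_x else 0.0  # confidence is undefined when X never occurs
    return s, c

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(f"s={s:.2f}, c={c:.2f}")  # s=0.40, c=0.67
```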

Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
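To get a feel for why listing every rule is prohibitive, the sketch below enumerates all rules X → Y with non-empty, disjoint X and Y over the six items of the market-basket example. The closed form 3^d − 2^(d+1) + 1 is a standard counting argument (each item goes to X, to Y, or to neither, minus the cases where X or Y is empty); it is not stated on the slide. The generator name `all_rules` is made up for this sketch.

```python
from itertools import combinations

def all_rules(items):
    """Yield every rule X -> Y with X and Y non-empty and disjoint."""
    items = sorted(items)
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            # Every non-empty proper subset X of the itemset is an antecedent.
            for r in range(1, k):
                for X in combinations(itemset, r):
                    Y = tuple(i for i in itemset if i not in X)
                    yield frozenset(X), frozenset(Y)

items = {"Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke"}
d = len(items)
n_rules = sum(1 for _ in all_rules(items))
print(n_rules, 3**d - 2**(d + 1) + 1)  # 602 602
```

Even six items already give 602 candidate rules, and the count grows exponentially with d.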

Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
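The observations can be checked with a short sketch that generates every binary partition of {Milk, Diaper, Beer} and computes support and confidence for each resulting rule. The variable names and the helper `count` are illustrative only.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Support count of `itemset` in the transactions above."""
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset({"Milk", "Diaper", "Beer"})
rules = {}
for r in range(1, len(itemset)):
    for X in combinations(sorted(itemset), r):
        X = frozenset(X)
        Y = itemset - X
        s = count(itemset) / len(transactions)  # same for every partition
        c = count(itemset) / count(X)           # depends on the antecedent X
        rules[(X, Y)] = (s, c)

for (X, Y), (s, c) in rules.items():
    print(sorted(X), "->", sorted(Y), f"s={s:.2f} c={c:.2f}")
```

Because the support depends only on the full itemset, it needs to be computed once per itemset; only the confidence has to be evaluated per rule.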

Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
• Frequent itemset generation is still
computationally expensive

Frequent Itemset Generation
[Figure: itemset lattice for items A to E, level by level]
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets
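A small sketch of the lattice, assuming items are single characters as in the figure. The function name `itemset_lattice` is hypothetical; it returns the itemsets grouped by size, including the empty (null) itemset.

```python
from itertools import combinations

def itemset_lattice(items):
    """Return the lattice as a list of levels; level k holds all k-itemsets."""
    items = sorted(items)
    return [[frozenset(c) for c in combinations(items, k)]
            for k in range(len(items) + 1)]

levels = itemset_lattice("ABCDE")
sizes = [len(level) for level in levels]
print(sizes, sum(sizes))  # [1, 5, 10, 10, 5, 1] 32
```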

Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
[Figure: each of the N transactions (maximum width w) is matched against a list of M candidates]
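The brute-force procedure can be sketched as a double loop over transactions and candidates. This is only an illustration on the five market-basket transactions; the counter `comparisons` is added here to make the N × M cost visible and is not part of the slide.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = sorted(set().union(*transactions))
# Every non-empty itemset in the lattice is a candidate.
candidates = [frozenset(c)
              for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

counts = dict.fromkeys(candidates, 0)
comparisons = 0
for t in transactions:        # N transactions
    for c in candidates:      # M candidates
        comparisons += 1
        if c <= t:            # subset test, cost grows with transaction width
            counts[c] += 1

N, M = len(transactions), len(candidates)
print(N, M, comparisons)  # 5 63 315
```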

Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
• Apriori principle holds due to the following property
of the support measure:
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
$$ \forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y) $$

Illustrating Apriori Principle
[Figure: itemset lattice from null to ABCDE over items A to E. Once an itemset is found to be infrequent, all of its supersets are pruned.]
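A quick empirical check of the anti-monotone property and of superset pruning, using the market-basket transactions from the first slides. The threshold `minsup_count = 2` is an assumption chosen for this sketch.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def count(itemset):
    """Support count of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

all_itemsets = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

# Anti-monotone property: X a proper subset of Y implies count(X) >= count(Y).
violations = [(X, Y) for X in all_itemsets for Y in all_itemsets
              if X < Y and count(X) < count(Y)]
print(len(violations))  # 0

# Pruning: every superset of an infrequent item can be skipped without counting.
minsup_count = 2  # assumed threshold for this example
infrequent = [i for i in items if count(frozenset({i})) < minsup_count]
pruned = [Y for Y in all_itemsets
          if len(Y) > 1 and any(i in Y for i in infrequent)]
print(infrequent, len(pruned))  # ['Eggs'] 31
```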

The Apriori Algorithm—An Example
Supmin = 2
Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E
C1 (counted in the 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3
L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3
C2 (generated from L1):
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
C2 (counted in the 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2
L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2
C3 (generated from L2):
Itemset
{B, C, E}
L3 (counted in the 3rd scan):
Itemset    sup
{B, C, E}  2
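The walk-through above can be reproduced with a compact level-wise sketch. This is not the slide's exact procedure: for brevity the join step simply unions pairs of frequent k-itemsets rather than using a prefix-based join, and the function name `apriori` and its `min_count` parameter are chosen for this illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # One database scan counts every candidate k-itemset (Ck).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(level)
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: keep a candidate only if all of its k-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this database the sketch yields the four itemsets of L1, the four of L2, and {B, C, E} with count 2. The candidates {A, B, C} and {A, C, E} are discarded in the prune step because {A, B} and {A, E} are infrequent.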

Mining Association Rules from Record Data
Session Id  Country    Session Length (sec)  Number of Web Pages viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Chrome        No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …
Example of Association Rule:
{(Number of Pages ∈ [5,10)) ∧ (Browser = Mozilla)} → {Buy = No}
How to apply association analysis formulation to record data?

Handling Categorical Attributes
• Transform categorical attribute into binary
variables
• Introduce a new “item” for each distinct
attribute-value pair
– Example: replace Browser Type attribute with
• Browser Type = Internet Explorer
• Browser Type = Mozilla
• Browser Type = Chrome
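One way to sketch this transformation is to turn each record into a set of "attribute=value" items. The records below are the categorical columns of the five sessions shown in the record-data table; the function name `to_transaction` and the "attr=value" string encoding are assumptions made for illustration.

```python
sessions = [
    {"Country": "USA", "Gender": "Male", "Browser Type": "IE", "Buy": "No"},
    {"Country": "China", "Gender": "Female", "Browser Type": "Chrome", "Buy": "No"},
    {"Country": "USA", "Gender": "Female", "Browser Type": "Mozilla", "Buy": "Yes"},
    {"Country": "Germany", "Gender": "Male", "Browser Type": "IE", "Buy": "Yes"},
    {"Country": "Australia", "Gender": "Male", "Browser Type": "Mozilla", "Buy": "No"},
]

def to_transaction(record):
    """Map a record to a set of binary items, one per attribute-value pair."""
    return {f"{attr}={value}" for attr, value in record.items()}

transactions = [to_transaction(r) for r in sessions]
print(sorted(transactions[0]))
# ['Browser Type=IE', 'Buy=No', 'Country=USA', 'Gender=Male']
```

Each record yields exactly one item per attribute, so items such as "Browser Type=IE" and "Browser Type=Mozilla" never appear in the same transaction.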

Handling Categorical Attributes
• Potential Issues
– What if an attribute has many possible values?
• Example: the Country attribute has more than 200 possible
values
• Many of the attribute values may have very low support
– Potential solution: aggregate the low-support attribute values
– What if the distribution of attribute values is highly
skewed?
• Example: 95% of the visitors have Buy = No
• Most of the items will be associated with the (Buy = No) item
– Potential solution: drop the highly frequent items
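Both potential solutions can be sketched as small preprocessing helpers. The function names, the "Other" label, the thresholds, and the toy transactions are all assumptions chosen for illustration; suitable cut-offs depend on the data.

```python
from collections import Counter

def aggregate_rare(values, min_count, other="Other"):
    """Replace attribute values seen fewer than `min_count` times by `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def drop_frequent_items(transactions, max_support):
    """Remove items whose support exceeds `max_support` from every transaction."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    too_common = {i for i, c in counts.items() if c / n > max_support}
    return [set(t) - too_common for t in transactions]

countries = ["USA", "China", "USA", "Germany", "Australia"]
print(aggregate_rare(countries, min_count=2))
# ['USA', 'Other', 'USA', 'Other', 'Other']

# Toy transactions in which Buy=No dominates (support 0.75).
toy = [
    {"Buy=No", "Browser=IE"},
    {"Buy=No", "Browser=Chrome"},
    {"Buy=No", "Browser=Mozilla"},
    {"Buy=Yes", "Browser=IE"},
]
filtered = drop_frequent_items(toy, max_support=0.7)
```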

Handling Continuous Attributes
• Different kinds of rules:
– Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
– Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
• Different methods:
– Discretization-based
– Statistics-based
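A minimal sketch of the discretization-based approach: each numeric value is mapped to a half-open interval, and the interval label becomes an item. The page counts are taken from the five sessions in the record-data table; the bin edges and the helper name `discretize` are hypothetical.

```python
def discretize(value, edges):
    """Map a number to the half-open interval [lo, hi) that contains it."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"[{lo},{hi})"
    return None  # value falls outside all bins

pages = [8, 10, 45, 4, 9]  # Number of Web Pages viewed, sessions 1 to 5
edges = [0, 5, 10, 50]     # hypothetical bin edges
items = [f"Pages={discretize(p, edges)}" for p in pages]
print(items)
# ['Pages=[5,10)', 'Pages=[10,50)', 'Pages=[10,50)', 'Pages=[0,5)', 'Pages=[5,10)']
```

The choice of bin edges matters: intervals that are too narrow produce items with low support, while intervals that are too wide can hide patterns.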

Question
• Will association analysis help Wal-mart?
– Start with the “beer and diaper” story
– Discuss possible benefits and challenges in using
association analysis for supermarkets
