Mining Sequential Patterns
Agrawal, Rakesh, and Ramakrishnan Srikant.
"Mining sequential patterns." In Proceedings of the Eleventh International
Conference on Data Engineering (ICDE), pp. 3-14. IEEE, 1995.
Presenter: Shaina Raza (PhD student)
Instructor: Dr. Cherie Ding
Outline
• Introduction to Sequential Mining
• Problem Definition
• Algorithm
• Performance
• Conclusion
Sequence
• A sequence is an ordered list of elements (transactions): s = <e1 e2 e3 …>
• Each element contains a collection of events (items): ei = {i1, i2, …, ik}
• Each element is attributed to a specific time or location.
• Length of a sequence: |s| = the number of elements in the sequence.
• A k-sequence is a sequence containing k events (items).
[Figure: a sequence drawn as a timeline t1, t2, …, tn of elements (transactions), each element being a set of events (items) such as E1, E2, E3, E4.]
Subsequence
• A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.

Data sequence            Subsequence     Contained?
<{2,4} {3,5,6} {8}>      <{2} {3,5}>     Yes
<{1,2} {3,4}>            <{1} {2}>       No
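
A minimal Python sketch of this containment check (an illustration, not the authors' code); sequences are represented as lists of item sets:

    def is_contained(sub, seq):
        """True if every element of `sub` is a subset of a distinct element of `seq`,
        with the order of elements preserved."""
        i = 0  # position in seq
        for a in sub:
            # advance through seq until an element containing `a` is found
            while i < len(seq) and not set(a) <= set(seq[i]):
                i += 1
            if i == len(seq):
                return False
            i += 1  # the next element of sub must match a strictly later element
        return True

    # The two rows of the table above:
    print(is_contained([{2}, {3, 5}], [{2, 4}, {3, 5, 6}, {8}]))  # True
    print(is_contained([{1}, {2}], [{1, 2}, {3, 4}]))             # False

Greedy matching is sufficient here: if any valid mapping exists, mapping each element to the earliest possible position also succeeds.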
Sequential Pattern Mining
• Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.
• Examples: web access patterns, weather prediction, telephone calling patterns, DNA sequences and gene structures.
Example mapping (customer purchase data):
• Sequence database: the customers
• Sequence: the purchase history of a given customer
• Element (transaction): the set of items bought by a customer at time t
• Event (item): books, grocery, CDs, etc.
Sequential Pattern Mining: Example
A sequence database:

SID   Sequence
1     <a(abc)(ac)d(cf)>
2     <(ad)c(bc)(ae)>
3     <(ef)(ab)(df)cb>
4     <eg(af)cbc>

• The support of the subsequence <(ab)c> is 2 (it is contained in sequences 1 and 3).
• Given a support threshold min_sup = 2, <(ab)c> is frequent and is therefore a sequential pattern.
• The length of the first sequence is 9 events; 'a' appears multiple times in it, but it contributes only once to the support of <a>.
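
Support counting over this small database could then look as follows (a sketch reusing the containment check from the previous example; the string notation is expanded by hand into lists of item sets):

    def is_contained(sub, seq):
        i = 0
        for a in sub:
            while i < len(seq) and not set(a) <= set(seq[i]):
                i += 1
            if i == len(seq):
                return False
            i += 1
        return True

    # The four data sequences above, written out as lists of item sets.
    database = {
        1: [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
        2: [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
        3: [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
        4: [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
    }

    def support(pattern, database):
        """Number of data sequences containing `pattern` (each sequence counts at most once)."""
        return sum(1 for seq in database.values() if is_contained(pattern, seq))

    print(support([{'a', 'b'}, {'c'}], database))  # 2, so <(ab)c> is a pattern for min_sup = 2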
APPROACHES TO SEQUENTIAL PATTERN MINING ALGORITHMS
• Apriori-like (concept introduction and an initial solution)
  • Agrawal & Srikant. Mining sequential patterns, ICDE'95
• Apriori-based
  • GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
  • SPADE (Zaki @ Machine Learning'00)
• Pattern-growth-based
  • FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
• Constraint-based
  • SPIRIT (Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
• Mining closed sequential patterns
  • CloSpan (Yan, Han & Afshar @ SDM'03)
Outline
• Introduction to Sequential Mining
• Problem Definition
• Algorithm
• Performance
• Conclusion
1. Problem Definition
• Given:
  • a database of customer sequences
  • a user-specified minimum support threshold, minsup
• Task:
  • find, among all sequences with support ≥ minsup, the maximal sequences
• Support for a sequence is defined as the fraction of total customers whose sequences contain it.
• Maximal sequence: a sequence that is not contained in any other sequence.
Outline
• Introduction to Sequential Mining
• Problem Definition
• Algorithm
• Performance
• Conclusion
2. The Algorithm (Phases)
1. Sort
2. Litemset
3. Transformation
4. Sequence (AprioriAll, AprioriSome, DynamicSome)
5. Maximal
1. Sort Phase
• Sort the database with Customer ID as the major key and Transaction Time as the minor key.
• Convert the original transaction DB into a customer-sequence DB.

Original database:
Customer ID   Transaction Time   Items Bought
1             June 25 '93        30
1             June 30 '93        90
2             June 10 '93        10, 20
2             June 15 '93        30
2             June 20 '93        40, 60, 70
3             June 25 '93        30, 50, 70
4             June 25 '93        30
4             June 30 '93        40, 70
4             July 25 '93        90
5             June 12 '93        90
Customer-sequence version of the database:
Customer ID   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

• With a minimum support of 40% (i.e., at least 2 of the 5 customers), the maximal sequential patterns are:
  <(30) (90)>
  <(30) (40 70)>
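
A minimal sketch of the sort phase on this example (not the authors' code; dates are simplified to ISO strings so that sorting them lexicographically matches chronological order):

    from collections import defaultdict

    # (customer_id, transaction_time, items) rows of the original database
    transactions = [
        (1, '1993-06-25', {30}), (1, '1993-06-30', {90}),
        (2, '1993-06-10', {10, 20}), (2, '1993-06-15', {30}), (2, '1993-06-20', {40, 60, 70}),
        (3, '1993-06-25', {30, 50, 70}),
        (4, '1993-06-25', {30}), (4, '1993-06-30', {40, 70}), (4, '1993-07-25', {90}),
        (5, '1993-06-12', {90}),
    ]

    def sort_phase(transactions):
        """Sort by (customer id, transaction time) and group the itemsets per customer,
        yielding one customer sequence (a time-ordered list of itemsets) per customer."""
        sequences = defaultdict(list)
        for cust, _, items in sorted(transactions, key=lambda t: (t[0], t[1])):
            sequences[cust].append(items)
        return dict(sequences)

    for cust, seq in sort_phase(transactions).items():
        print(cust, seq)   # e.g. 2 [{10, 20}, {30}, {40, 60, 70}]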
2. Litemset Phase
• Litemset (large itemset): an itemset with minimum support, where support is counted over customers (a customer supports an itemset if some single transaction of that customer contains it).
• Applying minsup = 2 customers to the original database (see the Sort Phase) gives the following large itemsets and their mapping to integers:

Large Itemset   Mapped To
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5

• Reason for the mapping: litemsets can be treated as single entities, so
  • two litemsets can be compared in constant time, and
  • the time needed to check whether a sequence is contained in a customer sequence is reduced.
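
A brute-force sketch of the litemset phase on the example data (the paper uses its Apriori itemset algorithm for this step; the exhaustive enumeration, the max_size cap and the integer numbering below are simplifications made for illustration):

    from itertools import combinations

    # customer sequences produced by the sort phase
    sequences = {
        1: [{30}, {90}],
        2: [{10, 20}, {30}, {40, 60, 70}],
        3: [{30, 50, 70}],
        4: [{30}, {40, 70}, {90}],
        5: [{90}],
    }
    MINSUP = 2  # 40% of 5 customers

    def litemset_phase(sequences, minsup, max_size=3):
        """Support of an itemset = number of customers who bought all its items
        in a single transaction; keep the itemsets with support >= minsup and
        map each one to a small integer."""
        support = {}
        for seq in sequences.values():
            seen = set()
            for transaction in seq:
                for size in range(1, min(len(transaction), max_size) + 1):
                    seen.update(combinations(sorted(transaction), size))
            for itemset in seen:              # each customer is counted once per itemset
                support[itemset] = support.get(itemset, 0) + 1
        large = sorted((s for s, c in support.items() if c >= minsup),
                       key=lambda s: (len(s), s))
        return {itemset: i + 1 for i, itemset in enumerate(large)}

    print(litemset_phase(sequences, MINSUP))
    # {(30,): 1, (40,): 2, (70,): 3, (90,): 4, (40, 70): 5}  (numbering may differ from the slide)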
3. Transformation Phase
• Replace each transaction with the set of all litemsets contained in that transaction.
• Transactions that contain no litemset are dropped.

Cust ID   Original Customer Sequence    Transformed Customer Sequence            After Mapping
1         <(30) (90)>                   <{(30)} {(90)}>                          <{1} {5}>
2         <(10 20) (30) (40 60 70)>     <{(30)} {(40), (70), (40 70)}>           <{1} {2, 3, 4}>
3         <(30 50 70)>                  <{(30), (70)}>                           <{1, 3}>
4         <(30) (40 70) (90)>           <{(30)} {(40), (70), (40 70)} {(90)}>    <{1} {2, 3, 4} {5}>
5         <(90)>                        <{(90)}>                                 <{5}>

Note: (10 20) is dropped for lack of support; (40 60 70) is replaced with the set of litemsets {(40), (70), (40 70)}, since 60 does not have minimum support.
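
A small sketch of the transformation step, assuming the litemset-to-integer mapping of the previous slides:

    # litemsets and their integer ids from the litemset phase (numbering as on the slide)
    litemset_ids = {
        frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
        frozenset({40, 70}): 4, frozenset({90}): 5,
    }

    def transform(sequence, litemset_ids):
        """Replace every transaction with the ids of all litemsets it contains;
        transactions containing no litemset are dropped."""
        transformed = []
        for transaction in sequence:
            ids = {lid for lset, lid in litemset_ids.items() if lset <= transaction}
            if ids:                     # e.g. (10 20) contains no litemset and is dropped
                transformed.append(ids)
        return transformed

    print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))  # [{1}, {2, 3, 4}]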
4. Sequence Phase
Two types of algorithms:
• Count-all: counts all large sequences, including non-maximal sequences.
  • AprioriAll
• Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first.
  • AprioriSome
  • DynamicSome
a) AprioriAll
• Based on the standard Apriori algorithm; counts all the large sequences, including non-maximal ones.
• Steps:
  • Candidate generation: join L(k-1) with itself to form Ck

        insert into Ck
        select p.litemset_1, …, p.litemset_(k-1), q.litemset_(k-1)
        from L(k-1) p, L(k-1) q
        where p.litemset_1 = q.litemset_1, …, p.litemset_(k-2) = q.litemset_(k-2)

  • Prune: delete all sequences c in Ck such that some (k-1)-subsequence of c is not in L(k-1).
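
A Python sketch of this candidate-generation step (an illustration, not the paper's code). Sequences are tuples of litemset ids; candidates that would repeat the litemset at the join position (e.g. <1 2 3 3>) are not generated, which is a simplifying assumption:

    from itertools import combinations

    def generate_candidates(large_prev):
        """Join L(k-1) with itself on the first k-2 litemsets, then prune every
        candidate that has a (k-1)-subsequence which is not itself large."""
        large_prev = set(large_prev)
        k = len(next(iter(large_prev))) + 1
        candidates = set()
        for p in large_prev:
            for q in large_prev:
                if p != q and p[:k - 2] == q[:k - 2]:   # join condition
                    candidates.add(p + (q[-1],))        # p.litemset_1 .. p.litemset_(k-1), q.litemset_(k-1)
        def all_subsequences_large(c):
            return all(tuple(c[i] for i in idx) in large_prev
                       for idx in combinations(range(k), k - 1))
        return {c for c in candidates if all_subsequences_large(c)}

    L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    print(generate_candidates(L3))  # {(1, 2, 3, 4)}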
AprioriAll (Example), minsup = 40% (2 customers)

Customer sequences (after transformation):
<{1 5} {2} {3} {4}>
<{1} {3} {4} {3 5}>
<{1} {2} {3} {4}>
<{1} {3} {5}>
<{4} {5}>

Large k-sequences Lk (sequence : support):
L1:  <1>:4  <2>:2  <3>:4  <4>:4  <5>:4
L2:  <1 2>:2  <1 3>:4  <1 4>:3  <1 5>:3  <2 3>:2  <2 4>:2  <3 4>:3  <3 5>:2  <4 5>:2
L3:  <1 2 3>:2  <1 2 4>:2  <1 3 4>:3  <1 3 5>:2  <2 3 4>:2
L4:  <1 2 3 4>:2

Candidates that do not reach minimum support, e.g. <3 4 5> (support 1) and <2 5> (support 0), are discarded.

Maximal large sequences (the answer):
Sequence    Support
<1 2 3 4>   2
<1 3 5>     2
<4 5>       2
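
Putting the pieces together, a count-all driver for this example could look like the following (reusing generate_candidates from the sketch above; contains checks whether the litemset ids of a candidate occur, in order, in the elements of a transformed customer sequence; candidates with repeated litemsets, such as <1 1>, are not considered, as noted earlier):

    def contains(pattern, seq):
        """True if the ids in `pattern` occur, in order, in distinct elements of `seq`."""
        i = 0
        for x in pattern:
            while i < len(seq) and x not in seq[i]:
                i += 1
            if i == len(seq):
                return False
            i += 1
        return True

    def apriori_all(sequences, minsup):
        """Count-all sketch: start from the large 1-sequences, then alternate
        candidate generation and support counting until no candidates remain."""
        items = sorted({x for seq in sequences for e in seq for x in e})
        Lk = {(x,) for x in items if sum(contains((x,), s) for s in sequences) >= minsup}
        large = []
        while Lk:
            large.extend(Lk)
            Ck = generate_candidates(Lk)
            Lk = {c for c in Ck if sum(contains(c, s) for s in sequences) >= minsup}
        return large

    sequences = [
        [{1, 5}, {2}, {3}, {4}],
        [{1}, {3}, {4}, {3, 5}],
        [{1}, {2}, {3}, {4}],
        [{1}, {3}, {5}],
        [{4}, {5}],
    ]
    all_large = apriori_all(sequences, minsup=2)
    # the maximal phase would then keep only <1 2 3 4>, <1 3 5> and <4 5>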
b) AprioriSome
• Avoids counting non-maximal sequences by counting the longer sequences first.
• A function next(), applied to the length of the sequences counted in the last pass, decides which length to count next.
• next(k) = k + 1 is the extreme case in which AprioriSome degenerates into AprioriAll.
• The choice of next() balances the time wasted counting non-maximal sequences against the time wasted counting extensions of small candidate sequences.
AprioriSome Phases

Forward phase:
• Find all large sequences of certain lengths only, e.g. lengths 1, 2, 4 and 6.
• Candidates for all lengths are still generated:
  • if the large sequences L(k-1) were counted, generate the candidates Ck from L(k-1);
  • if L(k-1) was not counted, generate Ck from the candidates C(k-1).

Backward phase:
• Find the remaining large sequences of the lengths skipped in the forward phase, e.g. count sequences of lengths 3 and 5.
• Delete the large sequences found in the forward phase that are not maximal.
AprioriSome Forward (Example)
next(k) = 2k, minsup = 2

Forward phase (lengths 1, 2 and 4 are counted; length 3 is skipped):
L2 (counted):                  <1 2>:2  <1 3>:4  <1 4>:3  <1 5>:3  <2 3>:2  <2 4>:2  <3 4>:3  <3 5>:2  <4 5>:2
C3 (generated, not counted):   <1 2 3>  <1 2 4>  <1 3 4>  <1 3 5>  <2 3 4>  <3 4 5>
C4:                            <1 2 3 4>
L4 (counted):                  <1 2 3 4>:2

(L3, shown on the slide for reference: <1 2 3>:2  <1 2 4>:2  <1 3 4>:3  <1 3 5>:2  <2 3 4>:2 — it is not actually counted in the forward phase.)
AprioriSome Backward (Example)
Backward phase: the skipped 3-sequences in C3 are handled. Candidates contained in the large 4-sequence <1 2 3 4> (i.e., <1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>) are deleted without being counted; the remaining candidates are counted, and <1 3 5> turns out to be large while <3 4 5> does not.
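
A compact and heavily simplified AprioriSome sketch (reusing generate_candidates and contains from the earlier sketches; next_len, the bookkeeping of counted lengths, and the handling of skipped lengths are illustrative assumptions rather than the paper's exact procedure; non-maximal sequences counted in the forward phase are left for the maximal phase to remove):

    def is_subseq(small, big):
        """True if the tuple `small` is an order-preserving subsequence of the tuple `big`."""
        it = iter(big)
        return all(x in it for x in small)

    def apriori_some(sequences, minsup, next_len=lambda k: 2 * k):
        """Forward phase: generate candidates for every length, but only count support
        for the lengths chosen by next_len. Backward phase: count the skipped lengths,
        longest first, skipping candidates already contained in a known large sequence."""
        items = sorted({x for seq in sequences for e in seq for x in e})

        def count(cands):
            return {c for c in cands if sum(contains(c, s) for s in sequences) >= minsup}

        C = {1: {(x,) for x in items}}
        L = {1: count(C[1])}
        counted, last, k = {1}, 1, 2
        while True:                                         # forward phase
            prev = L[k - 1] if (k - 1) in counted else C[k - 1]
            cand = generate_candidates(prev) if prev else set()
            if not cand:
                break
            C[k] = cand
            if k == next_len(last):
                L[k], counted, last = count(cand), counted | {k}, k
            k += 1
        for j in sorted(C, reverse=True):                   # backward phase
            if j not in counted:
                known = [s for lk in L.values() for s in lk]
                L[j] = count({c for c in C[j]
                              if not any(is_subseq(c, b) for b in known)})
        return [s for lk in L.values() for s in lk]

    # on the AprioriAll example data, the result again includes <1 2 3 4>, <1 3 5> and <4 5>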
c) DynamicSome
• Similar to AprioriSome, but:
  • AprioriSome generates Ck from C(k-1);
  • DynamicSome generates candidates "on the fly" from the customer sequences.
• A variable step decides how to jump between lengths.
• On-the-fly generation function:
  • otf(Lk, Lj, c) → the (k+j)-sequences formed by joining Lk with Lj that are contained in c
  • where Lk is the set of large k-sequences, Lj the set of large j-sequences, and c a customer sequence.
• Example: otf(L2, L2, <{1} {2} {3 7} {4}>) yields <1 2 3 4>, obtained by joining the large 2-sequences <1 2> and <3 4>, both contained in c = <e1 e2 e3 e4> with e1 = {1}, e2 = {2}, e3 = {3 7}, e4 = {4}.
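
A sketch of what an on-the-fly generation function could look like (the flat tuple representation and the earliest-occurrence matching are illustrative assumptions; the paper's otf-generate additionally keeps track of the start and end positions of subsequence occurrences):

    def otf(Lk, Lj, c):
        """Join a large k-sequence with a large j-sequence, both contained in the
        customer sequence c, such that the second occurs after the first ends;
        return the resulting (k+j)-sequences."""
        def end_of_match(pattern, seq, start=0):
            """Index just past the earliest occurrence of `pattern` in seq[start:], or None."""
            i = start
            for x in pattern:
                while i < len(seq) and x not in seq[i]:
                    i += 1
                if i == len(seq):
                    return None
                i += 1
            return i

        result = set()
        for xk in Lk:
            end = end_of_match(xk, c)
            if end is None:
                continue
            for xj in Lj:
                if end_of_match(xj, c, start=end) is not None:
                    result.add(xk + xj)
        return result

    L2 = {(1, 2), (3, 4), (1, 3)}                    # some large 2-sequences (illustrative)
    print(otf(L2, L2, [{1}, {2}, {3, 7}, {4}]))      # {(1, 2, 3, 4)}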
5. Maximal Phase
• Find the maximal sequences among the set of large sequences.
  • k-sequence: a sequence of length k
  • S: the set of all large sequences (n = length of the longest large sequence)

    for (k = n; k > 1; k--) do
        for each large k-sequence sk do
            delete from S all subsequences of sk

• In the examples above, this step was already performed when the maximal large sequences were reported.
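
A sketch of the maximal phase (using the is_subseq helper defined in the AprioriSome sketch above):

    def maximal_phase(large):
        """Keep only the maximal large sequences: drop every sequence that is a
        subsequence of some other large sequence."""
        S = set(large)
        for s in sorted(S, key=len, reverse=True):   # longest sequences first
            if s in S:                               # s may already have been deleted
                S -= {t for t in S if t != s and is_subseq(t, s)}
        return S

    large = {(1, 2, 3, 4), (1, 2, 3), (1, 3, 5), (1, 3), (3, 4), (4, 5)}
    print(maximal_phase(large))  # {(1, 2, 3, 4), (1, 3, 5), (4, 5)}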
Outline
• Introduction to Sequential Mining
• Problem Definition
• Finding Sequential Patterns (The Main Algorithm)
  • Sequence Phase (AprioriAll, AprioriSome, DynamicSome)
• Performance
• Conclusion
Performance
• DynamicSome generates too many candidates.
• AprioriSome does a little better than AprioriAll, since it avoids counting many non-maximal sequences.
• However, AprioriSome generates more candidates, which must remain memory-resident.
Outline
• Introduction to Sequential Mining
• Problem Definition
• Finding Sequential Patterns (The Main Algorithm)
  • Sequence Phase (AprioriAll, AprioriSome, DynamicSome)
• Performance
• Conclusion
Conclusion
• The authors proposed an algorithm for finding sequential patterns in a database of customer transactions.
• They proposed three alternative algorithms for the sequence phase:
  • AprioriAll
  • AprioriSome
  • DynamicSome
References: Mining Sequential Patterns
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI, 1997.
• M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 2001.
• J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01 (TKDE'04).
• J. Pei, J. Han, and W. Wang. Constraint-based sequential pattern mining in large databases. CIKM'02.
• X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. SDM'03.
• J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. ICDE'04.
• H. Cheng, X. Yan, and J. Han. IncSpan: Incremental mining of sequential patterns in large database. KDD'04.
• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99.
• J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.
Questions?