DFA Minimization Algorithms in
Map-Reduce
Iraj Hedayati Somarin
Master Thesis Defense – January 2016
Computer Science and Software Engineering
Faculty of Engineering and Computer Science
Concordia University
Supervisor: Gösta K. Grahne
Examiner: Brigitte Jaumard
Examiner: Hovhannes A. Harutyunyan
Chair: Rajagopalan Jayakumar
Outline
• Introduction
• DFA Minimization in Map-Reduce
• Cost Analysis
• Experimental Results
• Conclusion
1
INTRODUCTION
An introduction to the problem and the related work done so far
2
DFA, Big-Data and our Motivation
• Finite Automata
• Deterministic Finite Automata
• DFA Minimization is the process of:
• Removing unreachable states
• Merging non-distinguishable states
• What is Big-Data? (e.g. peta equals 2^50 or 10^15)
• Insufficient study of DFA minimization for data-intensive applications and
parallel environments
3
𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩
DFA Minimization Methods
(Watson, 1993)
[Taxonomy diagram, flattened] Methods based on the equivalence of states (≡): computing an equivalence relation, bottom-up or top-down, with layer-wise, unordered, or state-pairs refinement; point-wise computation; and Brzozowski's method.
Denote by π = {B₁, B₂, …, B_m} a partition of Q; then
p ≡_π q ↔ ∀w ∈ Σ*, ∃Bᵢ ∈ π: δ(p, w) ∈ Bᵢ ∧ δ(q, w) ∈ Bᵢ
4
Moore's Algorithm (Moore, 1956)
• Input is DFA 𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩ where 𝑘 = |Σ| and 𝑛 = |𝑄|
• Initialize the partition π = {0, 1} over Q where:
• ∀p ∈ Q: p belongs to block 0 if p ∈ Q ∖ F, and to block 1 if p ∈ F
• Iteratively refine the partition using equivalence relation in iteration 𝑖 (≡𝑖)
p ≡ᵢ q ↔ p ≡ᵢ₋₁ q ∧ ∀a ∈ Σ, δ(p, a) ≡ᵢ₋₁ δ(q, a)
• The initial partition is ≡₀
• Complexity: O(kn²) (the refinement loop is sketched below)
5
Hopcroft’s Algorithm (Hopcroft, 1971)
• The idea is avoiding some unnecessary operations
• Input is DFA A = ⟨Q, Σ, δ, s, F⟩ where k = |Σ| and n = |Q|
• Initialize the partition π = {0, 1} over Q where:
• ∀p ∈ Q: p belongs to block 0 if p ∈ Q ∖ F, and to block 1 if p ∈ F
• Keep list of splitters
• Iteratively divide the blocks using a splitter ⟨P, a⟩:
B ÷ ⟨P, a⟩ = {B₁, B₂} where
B₁ = {q ∈ B : δ(q, a) ∈ P} and B₂ = {q ∈ B : δ(q, a) ∉ P}
• Update the list of splitters
• Complexity = O(kn log n); number of iterations = O(kn) (a sketch follows below)
6
Hopcroft’s Algorithm (Example)
[Figure: a block B is split by the splitter ⟨P, a⟩ into B₁ and B₂; the splitter block P has itself been refined into P₁ and P₂]
Before the split: QUE = {⟨P, a⟩, ⟨P₁, a⟩, ⟨P₂, a⟩}
After the split: QUE = QUE ∪ {⟨B₁, a⟩}
7
Map-Reduce Model
[Figure: data flows from the DFS (original data, in chunks) through the mappers and then the reducers, and the result is written back to the DFS]
Replication Rate: ℛ = (size of mapped data) / (size of original data) (a toy simulation follows below)
8
Related Works in Parallel DFA
Minimization
1) Employing the EREW-PRAM model (Moore's method): (O(kn), O(n)) (Ravikumar and Xiong 1996)
2) Employing the CRCW-PRAM model (Moore's method): (O(kn log n), O(n / log n)) (Tewari et al. 2002)
3) Employing the Map-Reduce model (Moore's method) [Moore-MR]: ℛ = 3/2 (Harrafi 2015)
• The challenge is how to store the block numbers:
1) Parallel in-block sorting and rename blocks in serial
2) Parallel Perfect Hashing Function and partial sum
3) No action is taken
9
Cost Model
• Communication Complexity (Yao 1979 & Kushilevitz 1997)
• The Lower Bound Recipe for Replication Rate (Afrati et al. 2013)
• Computational Complexity of Map-Reduce (Turan 2015)
10
Cost Model – Communication Complexity
• Yao's two-party model: Alice holds x ∈ {0,1}ⁿ, Bob holds y ∈ {0,1}ⁿ, and together they compute f: {0,1}ⁿ × {0,1}ⁿ → {0,1}
• How much communication is required? 𝒟(f)
• Upper bound (worst case): 𝒟(f) ≤ n + 1
• Lower bound: 𝒟(f) ≥ log 𝒞(f), where 𝒞(f) is the number of f-monochromatic rectangles A × B (A ⊆ {0,1}ⁿ, B ⊆ {0,1}ⁿ) needed to partition the input space
• The fooling set is a well-known method for finding f-monochromatic rectangles
11
Cost Model – Lower Bound Recipe
(Afrati et al. 2013)
[Figure: the input I is distributed over n reducers; reducer i receives ρᵢ input records, bounded by the reducer capacity ρ, and can cover at most g(ρᵢ) records of the output O]
• Replication rate: ℛ = (Σᵢ ρᵢ) / |I|, summing over the n reducers
• The reducers must jointly cover the output: Σᵢ g(ρᵢ) ≥ |O|
• Writing each term as ρᵢ · g(ρᵢ)/ρᵢ and using that g(q)/q is non-decreasing (so g(ρᵢ)/ρᵢ ≤ g(ρ)/ρ) gives (g(ρ)/ρ) · Σᵢ ρᵢ ≥ |O|
• Hence ℛ ≥ ρ|O| / (g(ρ)|I|) (restated as a single display below)
12
Cost Model – Computational Complexity
(Turan 2015)
• Let us denote a Turing machine 𝑀 = (𝑚, 𝑟, 𝑛, 𝜌) where:
• 𝑚 indicates whether it is a mapping task (𝑚 = 1) or a reducer task (𝑚 = 0)
• 𝑟 indicates the round number
• 𝑛 indicates the input size
• 𝜌 indicates the reducer size
• 𝑀𝑅𝐶[f(n), g(n)]:
• ∃c, 0 < c < 1, such that each task is an O(nᶜ)-space and O(g(n))-time Turing machine M = (m, r, n, ρ), and the number of rounds is R = O(f(n)).
13
DFA MINIMIZATION IN
MAP-REDUCE
Proposed algorithms for minimizing a DFA in Map-Reduce model
14
Enhancement to Moore-MR
• Moore-MR (Harrafi 2015):
• Input 𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩
• Pre-Processing: generate Δ with records ⟨p, a, q, π_p ∈ {0,1}, D ∈ {+, −}⟩ from δ
• Mapping Schema: map every transition record of Δ based on 𝑝 if 𝐷 = + and based on 𝑝
and 𝑞 if 𝐷 = −
ℎ: 𝑄 → {1,2, … , 𝑛}
• Reducer Task: compute the new block number using Moore's method
• Note that, in order to accomplish the reducer task at reducer p, it requires π_q for every state q that p has a transition to; transitions with D = − are responsible for carrying this data (the mapping schema is sketched below)
• The challenge is that the new block numbers are concatenations of the old block number of the state and of its k successors, so after round r the size of each is (k + 1)^r
15
Enhancement to Moore-MR
PPHF-MR
• Given S ⊂ S′ and a range R with |R| ≪ |S′|, a PHF: S → R is a one-to-one function
• Mapping: map every record ⟨p, a, q, π_p, D⟩ to h(π_p)
• Reducer Task: assign new block numbers from the range [j · n, (j + 1) · n − 1], where j is the reducer number (sketched below)
Moore-MR-PPHF is obtained by applying
PPHF-MR after each iteration of Moore-MR
16
Hopcroft-MR
[Pipeline diagram, flattened]
• Pre-Processing: a PreProcessing job (Mapper, Reducer)
• Iterate while QUE is not empty: PartitionDetect (Mapper, Reducer) → BlockUpdate (Mapper, Reducer) → PPHF-MR (Mapper, Reducer)
• Finally, construct the minimal DFA
• Mapper keys, in job order: h(q), h(q), h(p), h(π_p)
• Records flowing between the jobs: transition ⟨p, a, q, π_p, π_q⟩, block tuple ⟨a, q, π_q⟩, update tuple ⟨p, π_p, π_p^new⟩; the data sets Δ and blocks[a, Bᵢ] are rewritten in every iteration
17
Hopcroft-MR vs. Hopcroft-MR-PAR
• In Hopcroft-MR we pick one splitter at a time while in Hopcroft-MR-PAR we pick
all the splitters from QUE
• In Hopcroft-MR: π_p^new = π_p + |π|
• In Hopcroft-MR-PAR: π_p^new = |π| × A_p + π_p
• where A_p is a bit vector over Σ with ⟨P, a⟩ ∈ QUE ∧ q ∈ P ∧ δ(p, a) = q → A_p[a] = 1 (both update rules are sketched below)
18
COST ANALYSIS
Analyzing cost measures for the proposed algorithms, as well as finding lower and upper bounds on each
19
Communication Cost Bounds
• Upper bound for the DFA minimization problem in parallel environments:
DCC(DFA minimization) ≤ O(kn³ log n), where k = |Σ| and n = |Q|
• Lower bound for the DFA minimization problem in parallel environments:
DCC(DFA minimization) = Ω(kn log n)
20
Lower Bound on Replication Rate
• ℛ ≥ (ρ × |O|) / (g(ρ) × |I|)
• g(ρ): for every input record (a transition) a reducer produces exactly one output record, hence g(ρ) = ρ
• The output consists of the updated transitions, so it is exactly the size of the input: |O| = |I|
• Therefore ℛ ≥ (ρ × |O|) / (g(ρ) × |I|) = (ρ × |I|) / (ρ × |I|) = 1
21
Moore-MR-PPHF
• ℛ = 3/2
• Communication Cost = r (ℛ|I| + |O|), where r is the number of Map-Reduce rounds
• Size of each record = O(log n + log k)
• |O| ∼ |I| = kn(log n + log k)
• r = O(n)
CC = O(n · kn(log n + log k)) = O(kn² (log n + log k))
22
Hopcroft-MR
• ℛ = 1
• Communication Cost = CC_Detection + CC_Update + CC_PPHF
• CC_Detection = O(n log n (log n + log k) + n log n (log n + log k) + n log n (log n + log k)) = O(n log n (log n + log k))
• CC_Update = CC_Update^Mapper + CC_Update^Reducer = CC_Update^Mapper
• CC_Update^Mapper = O(kn · (kn · (log k + log n) + n · (log k + log n))) + O(n log n · log n) = O((kn)² (log n + log k))
• CC_PPHF = O(kn² (log n + log k))
• Communication Cost = O((kn)² (log n + log k))
23
Hopcroft-MR-PAR
• ℛ = 1
• Communication Cost = CC_Detection + CC_Update + CC_PPHF
• CC_Update = O(kn² (log n + log k))
• Communication Cost = O(kn² (log n + log k))
24
Comparison of Complexity Measures
Algorithm                 | Replication Rate | Communication Cost       | Sensitive to Skewness
Lower Bound               | 1                | Ω(kn log n)              | -
Moore-MR (Harrafi 2015)   | 3/2              | O(nkⁿ)                   | No
Moore-MR-PPHF             | 3/2              | O(kn² (log n + log k))   | No
Hopcroft-MR               | 1                | O((kn)² (log n + log k)) | Yes
Hopcroft-MR-PAR           | 1                | O(kn² (log n + log k))   | Yes
25
EXPERIMENTAL
RESULTS
Plotting the results gathered from running the proposed algorithms on different data sets
26
Data Generator - Circular
[Figures: the input DFA and the minimized DFA]
27
Data Generator – Duplicated Random
[Figures: the input DFA and the minimized DFA]
28
Data Generator – Linear
29
Moore-MR vs. Moore-MR-PPHF
30
Circular DFA
31
Replicated Random DFA
32
Number of Rounds
33
CONCLUSION
Concluding the work done in this thesis and suggesting future work and open questions
34
Conclusion
• In this work we studied DFA minimization algorithms in Map-Reduce and PRAM
• Proposed an enhancement to a DFA minimization algorithm in Map-Reduce by
introducing PPHF in Map-Reduce
• Proposed a new algorithm in Map-Reduce based on Hopcroft’s method
• Found lower bounds on the Replication Rate in Map-Reduce and on the Communication Cost in parallel environments for the DFA minimization problem
• Studied different measures of Map-Reduce algorithms
• Found that two critical measures are missing: Sensitivity to Skewness and
Horizontal growth of data
35
Future Works
• Reducer Capacity vs. Number of Rounds trade-off
• Investigating other methods of minimization
• Extending complexity model and class
• Is it possible to compare Map-Reduce algorithms with algorithms in other models (PRAM, serial, etc.)?
36
Thank you
Questions & Answers
37

Editor's Notes

  • #5 FA are one of the most robust machines for modelling discrete phenomena, with a wide range of applications in which they play one of two roles: models or descriptors. An FA is represented by a directed graph where the states (accepting and non-accepting) are the vertices and each transition is an edge. A DFA is a variation of FA in which every state has an outgoing transition for each alphabet symbol. DFAs play a significant role in compiler and hardware design, software model testing and protocol verification problems, among others. Given a finite string over its alphabet, a DFA does the computation and accepts or rejects it. The set of strings accepted by a DFA is its language. Two or more DFAs may accept the same language; however, for each language there exists a unique DFA that is minimal w.r.t. the number of states. Unreachable states have no path from the initial state. Two states p and q are distinguishable if there is a string whose processing from p leads to a final state and from q to a non-final state. Big data is an amount of data that cannot be processed with ordinary processing units and techniques.
  • #6 Brzozowski: two rounds of reversal plus determinization. Two states are equivalent if the language computed from either of the two is the same. Point-wise: states are equivalent unless shown otherwise (functional programming). Equivalence relation: find distinguishable states based on equivalence classes; all are distinguishable initially (incremental); two initial classes, final and non-final; iterative refinement. Layer-wise: equivalence for strings of length i (Moore). Unordered: using a list (Hopcroft). State pairs: every pair of states is compared for equivalence (Hopcroft–Ullman).
  • #7 Assume input DFA A. Initialize the partition π over the states: if state p is final it belongs to class 1, and if it is non-final it belongs to class 0. Iteratively refine the partition using the equivalence relation: in iteration i, p and q are equivalent if and only if, for every alphabet symbol a with transitions (p, a, p') and (q, a, q'), the states p' and q' were in the same class in iteration i−1. The complexity is O(kn²): each iteration investigates all states and, for every state, all of its alphabet symbols; in the worst case n iterations are needed. However, it has been observed that only for a few automata the number of iterations exceeds log n.
  • #10 The main idea of the Map-Reduce model is to divide a huge amount of data into chunks and process each chunk separately. Each Map-Reduce task contains at least two phases: 1) Map and 2) Reduce. One of the major measures for a Map-Reduce algorithm is how much data it replicates over the reducers.
  • #13 Fooling set: a set of pairs (x_i, y_i) with f(x_i, y_i) = z for every i, such that for every i ≠ j, f(x_i, y_j) = z and f(x_j, y_i) = z do not both hold.
  • #32 Linear data set K=2 Q=[2-1024]
  • #33 Linear data set K=2 Q=[2-1024]
  • #34 Linear data set K=2 Q=[2-1024]
  • #35 Linear data set K=2 Q=[2-1024]