DFA Minimization Algorithms in
Map-Reduce
Iraj Hedayati Somarin
Master Thesis Defense – January 2016
Computer Science and Software Engineering
Faculty of Engineering and Computer Science
Concordia University
Supervisor: Gösta K. Grahne
Examiner: Brigitte Jaumard
Examiner: Hovhannes A. Harutyunyan
Chair: Rajagopalan Jayakumar
Outline
• Introduction
• DFA Minimization in Map-Reduce
• Cost Analysis
• Experimental Results
• Conclusion
1
INTRODUCTION
An introduction to the problem and the related work done so far
2
DFA, Big-Data and our Motivation
• Finite Automata
• Deterministic Finite Automata
• DFA Minimization is the process of:
• Removing unreachable states
• Merging non-distinguishable states
• What is Big-Data? (e.g. peta equals 2^50 or 10^15)
• Insufficient study of DFA minimization for data-intensive applications and
parallel environments
3
𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩
DFA Minimization Methods
(Watson, 1993)
[Taxonomy diagram, flattened] Methods based on the equivalence of states (≡): computing an equivalence relation, bottom-up or top-down, with layer-wise, unordered, or state-pairs refinement; point-wise computation; and Brzozowski's method.
Denote by π = {B₁, B₂, …, B_m} a partition of Q; then
p ≡_π q ↔ ∀w ∈ Σ*, ∃Bᵢ ∈ π: δ(p, w) ∈ Bᵢ ∧ δ(q, w) ∈ Bᵢ
4
Moore's Algorithm (Moore, 1956)
• Input is DFA 𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩ where 𝑘 = |Σ| and 𝑛 = |𝑄|
• Initialize the partition π = {0, 1} over Q where:
• ∀p ∈ Q: p belongs to block 0 if p ∈ Q ∖ F, and to block 1 if p ∈ F
• Iteratively refine the partition using equivalence relation in iteration 𝑖 (≡𝑖)
p ≡ᵢ q ↔ p ≡ᵢ₋₁ q ∧ ∀a ∈ Σ, δ(p, a) ≡ᵢ₋₁ δ(q, a)
• The initial partition is ≡₀
• Complexity: O(kn²) (the refinement loop is sketched below)
5
Hopcroft’s Algorithm (Hopcroft, 1971)
• The idea is avoiding some unnecessary operations
• Input is DFA A = ⟨Q, Σ, δ, s, F⟩ where k = |Σ| and n = |Q|
• Initialize the partition π = {0, 1} over Q where:
• ∀p ∈ Q: p belongs to block 0 if p ∈ Q ∖ F, and to block 1 if p ∈ F
• Keep list of splitters
• Iteratively divide the blocks using a splitter ⟨P, a⟩:
B ÷ ⟨P, a⟩ = {B₁, B₂} where
B₁ = {q ∈ B : δ(q, a) ∈ P} and B₂ = {q ∈ B : δ(q, a) ∉ P}
• Update the list of splitters
• Complexity = O(kn log n); number of iterations = O(kn) (a sketch follows below)
6
Hopcroft’s Algorithm (Example)
[Figure: a block B is split by the splitter ⟨P, a⟩ into B₁ and B₂; the splitter block P has itself been refined into P₁ and P₂]
Before the split: QUE = {⟨P, a⟩, ⟨P₁, a⟩, ⟨P₂, a⟩}
After the split: QUE = QUE ∪ {⟨B₁, a⟩}
7
Map-Reduce Model
[Figure: data flows from the DFS (original data, in chunks) through the mappers and then the reducers, and the result is written back to the DFS]
Replication Rate: ℛ = (size of mapped data) / (size of original data) (a toy simulation follows below)
8
Related Works in Parallel DFA
Minimization
1) Employing the EREW-PRAM model (Moore's method): (O(kn), O(n)) (Ravikumar and Xiong 1996)
2) Employing the CRCW-PRAM model (Moore's method): (O(kn log n), O(n / log n)) (Tewari et al. 2002)
3) Employing the Map-Reduce model (Moore's method) [Moore-MR]: ℛ = 3/2 (Harrafi 2015)
• The challenge is how to store the block numbers:
1) Parallel in-block sorting and rename blocks in serial
2) Parallel Perfect Hashing Function and partial sum
3) No action is taken
9
Cost Model
• Communication Complexity (Yao 1979 & Kushilevitz 1997)
• The Lower Bound Recipe for Replication Rate (Afrati et al. 2013)
• Computational Complexity of Map-Reduce (Turan 2015)
10
Cost Model – Communication Complexity
• Yao's two-party model: Alice holds x ∈ {0,1}ⁿ, Bob holds y ∈ {0,1}ⁿ, and together they compute f: {0,1}ⁿ × {0,1}ⁿ → {0,1}
• How much communication is required? 𝒟(f)
• Upper bound (worst case): 𝒟(f) ≤ n + 1
• Lower bound: 𝒟(f) ≥ log 𝒞(f), where 𝒞(f) is the number of f-monochromatic rectangles A × B (A ⊆ {0,1}ⁿ, B ⊆ {0,1}ⁿ) needed to partition the input space
• The fooling set is a well-known method for finding f-monochromatic rectangles
11
Cost Model – Lower Bound Recipe
(Afrati et al. 2013)
[Figure: the input I is distributed over n reducers; reducer i receives ρᵢ input records, bounded by the reducer capacity ρ, and can cover at most g(ρᵢ) records of the output O]
• Replication rate: ℛ = (Σᵢ ρᵢ) / |I|, summing over the n reducers
• The reducers must jointly cover the output: Σᵢ g(ρᵢ) ≥ |O|
• Writing each term as ρᵢ · g(ρᵢ)/ρᵢ and using that g(q)/q is non-decreasing (so g(ρᵢ)/ρᵢ ≤ g(ρ)/ρ) gives (g(ρ)/ρ) · Σᵢ ρᵢ ≥ |O|
• Hence ℛ ≥ ρ|O| / (g(ρ)|I|) (restated as a single display below)
12
Cost Model – Computational Complexity
(Turan 2015)
• Let us denote a Turing machine 𝑀 = (𝑚, 𝑟, 𝑛, 𝜌) where:
• 𝑚 indicates whether it is a mapping task (𝑚 = 1) or a reducer task (𝑚 = 0)
• 𝑟 indicates the round number
• 𝑛 indicates the input size
• 𝜌 indicates the reducer size
• 𝑀𝑅𝐶[f(n), g(n)]:
• ∃c, 0 < c < 1, such that each task is an O(nᶜ)-space and O(g(n))-time Turing machine M = (m, r, n, ρ), and the number of rounds is R = O(f(n)).
13
DFA MINIMIZATION IN
MAP-REDUCE
Proposed algorithms for minimizing a DFA in Map-Reduce model
14
Enhancement to Moore-MR
• Moore-MR (Harrafi 2015):
• Input 𝐴 = ⟨𝑄, Σ, 𝛿, 𝑠, 𝐹⟩
• Pre-Processing: generate Δ with records ⟨p, a, q, π_p ∈ {0,1}, D ∈ {+, −}⟩ from δ
• Mapping Schema: map every transition record of Δ based on 𝑝 if 𝐷 = + and based on 𝑝
and 𝑞 if 𝐷 = −
ℎ: 𝑄 → {1,2, … , 𝑛}
• Reducer Task: compute the new block number using Moore's method
• Note that, in order to accomplish the reducer task at reducer p, it requires π_q for every state q that p has a transition to; transitions with D = − are responsible for carrying this data (the mapping schema is sketched below)
• The challenge is that the new block numbers are concatenations of the old block number of the state and of its k successors, so after round r the size of each is (k + 1)^r
15
Enhancement to Moore-MR
PPHF-MR
• Given S ⊂ S′ and a range R with |R| ≪ |S′|, a PHF: S → R is a one-to-one function
• Mapping: map every record ⟨p, a, q, π_p, D⟩ to h(π_p)
• Reducer Task: assign new block numbers from the range [j · n, (j + 1) · n − 1], where j is the reducer number (sketched below)
Moore-MR-PPHF is obtained by applying
PPHF-MR after each iteration of Moore-MR
16
Hopcroft-MR
[Pipeline diagram, flattened]
• Pre-Processing: a PreProcessing job (Mapper, Reducer)
• Iterate while QUE is not empty: PartitionDetect (Mapper, Reducer) → BlockUpdate (Mapper, Reducer) → PPHF-MR (Mapper, Reducer)
• Finally, construct the minimal DFA
• Mapper keys, in job order: h(q), h(q), h(p), h(π_p)
• Records flowing between the jobs: transition ⟨p, a, q, π_p, π_q⟩, block tuple ⟨a, q, π_q⟩, update tuple ⟨p, π_p, π_p^new⟩; the data sets Δ and blocks[a, Bᵢ] are rewritten in every iteration
17
Hopcroft-MR vs. Hopcroft-MR-PAR
• In Hopcroft-MR we pick one splitter at a time while in Hopcroft-MR-PAR we pick
all the splitters from QUE
• In Hopcroft-MR: π_p^new = π_p + |π|
• In Hopcroft-MR-PAR: π_p^new = |π| × A_p + π_p
• where A_p is a bit vector over Σ with ⟨P, a⟩ ∈ QUE ∧ q ∈ P ∧ δ(p, a) = q → A_p[a] = 1 (both update rules are sketched below)
18
COST ANALYSIS
Analyzing cost measures for the proposed algorithms, as well as finding lower and upper bounds on each
19
Communication Cost Bounds
• Upper bound for the DFA minimization problem in parallel environments:
DCC(DFA minimization) ≤ O(kn³ log n), where k = |Σ| and n = |Q|
• Lower bound for the DFA minimization problem in parallel environments:
DCC(DFA minimization) = Ω(kn log n)
20
Lower Bound on Replication Rate
• ℛ ≥ (ρ × |O|) / (g(ρ) × |I|)
• g(ρ): for every input record (a transition) a reducer produces exactly one output record, hence g(ρ) = ρ
• The output consists of the updated transitions, so it is exactly the size of the input: |O| = |I|
• Therefore ℛ ≥ (ρ × |O|) / (g(ρ) × |I|) = (ρ × |I|) / (ρ × |I|) = 1
21
Moore-MR-PPHF
• ℛ = 3/2
• Communication Cost = r (ℛ|I| + |O|), where r is the number of Map-Reduce rounds
• Size of each record = O(log n + log k)
• |O| ∼ |I| = kn(log n + log k)
• r = O(n)
CC = O(n · kn(log n + log k)) = O(kn² (log n + log k))
22
Hopcroft-MR
• ℛ = 1
• Communication Cost = CC_Detection + CC_Update + CC_PPHF
• CC_Detection = O(n log n (log n + log k) + n log n (log n + log k) + n log n (log n + log k)) = O(n log n (log n + log k))
• CC_Update = CC_Update^Mapper + CC_Update^Reducer = CC_Update^Mapper
• CC_Update^Mapper = O(kn · (kn · (log k + log n) + n · (log k + log n))) + O(n log n · log n) = O((kn)² (log n + log k))
• CC_PPHF = O(kn² (log n + log k))
• Communication Cost = O((kn)² (log n + log k))
23
Hopcroft-MR-PAR
• ℛ = 1
• Communication Cost = CC_Detection + CC_Update + CC_PPHF
• CC_Update = O(kn² (log n + log k))
• Communication Cost = O(kn² (log n + log k))
24
Comparison of Complexity Measures
Algorithm                 | Replication Rate | Communication Cost       | Sensitive to Skewness
Lower Bound               | 1                | Ω(kn log n)              | -
Moore-MR (Harrafi 2015)   | 3/2              | O(nkⁿ)                   | No
Moore-MR-PPHF             | 3/2              | O(kn² (log n + log k))   | No
Hopcroft-MR               | 1                | O((kn)² (log n + log k)) | Yes
Hopcroft-MR-PAR           | 1                | O(kn² (log n + log k))   | Yes
25
EXPERIMENTAL
RESULTS
Plotting the results gathered from running the proposed algorithms on different data sets
26
Data Generator - Circular
[Figures: the input DFA and the minimized DFA]
27
Data Generator – Duplicated Random
[Figures: the input DFA and the minimized DFA]
28
Data Generator – Linear
29
Moore-MR vs. Moore-MR-PPHF
30
Circular DFA
31
Replicated Random DFA
32
Number of Rounds
33
CONCLUSION
Concluding the work done in this thesis and suggesting future work and open questions
34
Conclusion
• In this work we studied DFA minimization algorithms in Map-Reduce and PRAM
• Proposed an enhancement to a DFA minimization algorithm in Map-Reduce by
introducing PPHF in Map-Reduce
• Proposed a new algorithm in Map-Reduce based on Hopcroft’s method
• Found lower bounds on the Replication Rate in Map-Reduce and on the Communication Cost in parallel environments for the DFA minimization problem
• Studied different measures of Map-Reduce algorithms
• Found that two critical measures are missing: Sensitivity to Skewness and
Horizontal growth of data
35
Future Works
• Reducer Capacity vs. Number of Rounds trade-off
• Investigating other methods of minimization
• Extending complexity model and class
• Is it possible to compare Map-Reduce algorithms with algorithms in other models (PRAM, serial, etc.)?
36
Thank you
Questions & Answers
37

Editor's Notes

  • #5 FA are one of the most robust machines for modelling discrete phenomena, with a wide range of applications in which they play one of two roles: models or descriptors. An FA is represented by a directed graph where the states (accepting and non-accepting) are the vertices and each transition is an edge. A DFA is a variation of FA in which every state has an outgoing transition for each alphabet symbol. DFAs play a significant role in compiler and hardware design, software model testing and protocol verification problems, among others. Given a finite string over its alphabet, a DFA does the computation and accepts or rejects it. The set of strings accepted by a DFA is its language. Two or more DFAs may accept the same language; however, for each language there exists a unique DFA that is minimal w.r.t. the number of states. Unreachable states have no path from the initial state. Two states p and q are distinguishable if there is a string whose processing from p leads to a final state and from q to a non-final state. Big data is an amount of data that cannot be processed with ordinary processing units and techniques.
  • #6 Brzozowski: two rounds of reversal plus determinization. Two states are equivalent if the language computed from either of the two is the same. Point-wise: states are equivalent unless shown otherwise (functional programming). Equivalence relation: find distinguishable states based on equivalence classes; all are distinguishable initially (incremental); two initial classes, final and non-final; iterative refinement. Layer-wise: equivalence for strings of length i (Moore). Unordered: using a list (Hopcroft). State pairs: every pair of states is compared for equivalence (Hopcroft–Ullman).
  • #7 Assume input DFA A. Initialize the partition π over the states: if state p is final it belongs to class 1, and if it is non-final it belongs to class 0. Iteratively refine the partition using the equivalence relation: in iteration i, p and q are equivalent if and only if, for every alphabet symbol a with transitions (p, a, p') and (q, a, q'), the states p' and q' were in the same class in iteration i−1. The complexity is O(kn²): each iteration investigates all states and, for every state, all of its alphabet symbols; in the worst case n iterations are needed. However, it has been observed that only for a few automata the number of iterations exceeds log n.
  • #10 The main idea of the Map-Reduce model is to divide a huge amount of data into chunks and process each chunk separately. Each Map-Reduce task contains at least two phases: 1) Map and 2) Reduce. One of the major measures for a Map-Reduce algorithm is how much data it replicates over the reducers.
  • #13 Fooling set: a set of pairs (x_i, y_i) with f(x_i, y_i) = z for every i, such that for every i ≠ j, f(x_i, y_j) = z and f(x_j, y_i) = z do not both hold.
  • #32 Linear data set K=2 Q=[2-1024]
  • #33 Linear data set K=2 Q=[2-1024]
  • #34 Linear data set K=2 Q=[2-1024]
  • #35 Linear data set K=2 Q=[2-1024]