STRING MATCHING
Partha P. Chakrabarti & Aritra Hazra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
String Matching: The Problem
• Goal: Find pattern P[ ] of length M in a text T[ ] of length N.
– Typically, N >> M and N is very very large (M can also be large)!
• Example: Finding a keyword from a whole PDF document
Naïve (Brute-Force) Approach
• Check for pattern starting at each text position
– Recursive Formulation (naiveMatch_rec)
– Iterative Approach (naiveMatch_itr)
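Algorithm naiveMatch_rec (T[ ], N, P[ ], M)   // N, M are lengths; returns 1 if P occurs in T
  if (N < M) then return 0;
  else if (suffixMatch (T, N-1, P, M-1)) then return 1;   // does P end at T[N-1]?
  else return (naiveMatch_rec (T, N-1, P, M));            // drop the last text character
Algorithm suffixMatch (T[ ], i, P[ ], j)      // helper: does P[0..j] end at T[i]?
  if (j == -1) then return 1;
  else if (T[i] != P[j]) then return 0;
  else return (suffixMatch (T, i-1, P, j-1));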
Algorithm naiveMatch_itr (T[ ], N, P[ ], M)
  for i = 0 to N-M do {
    j = 0;
    while ( (j < M) and (T[i+j] == P[j]) ) do j++;
    if (j == M) then {
      match found starting at T[i]; break;
    }
  }
Overall Time Complexity: Θ(MN) in the worst case
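The iterative brute-force approach can be sketched in Python as below. This is a minimal illustration rather than the slide's exact code: the name `naive_match` and the convention of returning -1 when no match exists are assumptions made here.

```python
def naive_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):            # candidate start positions 0..N-M
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                        # extend the match by one character
        if j == m:                        # all M characters agreed
            return i
    return -1                             # no start position matched
```

Each of the N-M+1 start positions may cost up to M comparisons, which is where the Θ(MN) worst case comes from.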
Can Naïve String Search be made Better?
• Illustrating Example:
– Suppose we are searching in text for pattern BAAAAAAAAA
– Suppose we match 5 characters in pattern, with mismatch on 6th character
– We know previous 6 characters in text are BAAAAB (assuming, alphabet Σ = {A, B})
• How can we make string search algorithm more efficient?
– DO NOT check every overlapping occurrence of pattern string in text string
– DO make greater jumps and DO reduce number of comparisons
– DO NOT need to back up the pointer in text string
Reducing Overlapped Checking: by Memorization
• Additional storage remembering what has been SEEN in Text String previously
• State Machine as the data structure: DFA (Deterministic Finite Automaton)
– Finite number of states (including start state and halt state)
– Exactly one state transition for each char in alphabet
– Accept if sequence of state transitions leads to halt state
[Figure: DFA diagram; labels: Text String, Pattern String]
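The slide does not show how the state machine is built, so the following is only a sketch of one common textbook construction, under assumptions made here: the names `build_dfa` and `dfa_search` are hypothetical, the pattern is non-empty, and every text character belongs to the given alphabet. State j means "the first j pattern characters have just been seen"; state M is the halt state.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][char] -> next state, for states 0..M-1 (M is the halt state)."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m)]
    dfa[0][pattern[0]] = 1
    restart = 0                               # state reached after dropping one char
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[restart][c]       # mismatch: behave like the restart state
        dfa[j][pattern[j]] = j + 1            # match: advance to the next state
        restart = dfa[restart][pattern[j]]
    return dfa


def dfa_search(text: str, pattern: str, alphabet: str) -> int:
    """Return the start index of the first match, or -1; the text pointer never backs up."""
    dfa = build_dfa(pattern, alphabet)
    state = 0
    for i, c in enumerate(text):
        state = dfa[state][c]                 # exactly one transition per text character
        if state == len(pattern):             # halt state reached
            return i - len(pattern) + 1
    return -1
```

Because each text character triggers exactly one transition, the scan itself takes N steps; the cost moves into building a table with M × |Σ| entries.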
Knuth-Morris-Pratt (KMP) Algorithm: Definitions
• Some Necessary Definitions
– String of length N is given as S[0..N-1] = s_0 s_1 … s_{N-1} (where each s_i is from Σ)
– Substring of S[0..N-1] of length (j-i+1) is S[i..j] = s_i s_{i+1} ... s_{j-1} s_j (0 ≤ i ≤ j ≤ N-1)
– Prefix of S[0..N-1] of length k is given as S[0..k-1] = s_0 s_1 … s_{k-1} (1 ≤ k ≤ N)
– Suffix of S[0..N-1] of length l is given as S[N-l..N-1] = s_{N-l} s_{N-l+1} ... s_{N-1} (1 ≤ l ≤ N)
– Border: A substring that is a prefix as well as a suffix
• S[0..N-1] has a border of length k if S[0..k-1] = S[N-k..N-1]
• Proper Border if it is not the whole string itself
• Intuition: To find the longest proper border!!
← string of length N →
s_0 … s_{k-1} | s_k ... s_{N-k-1} | s_{N-k} ... s_{N-1}
prefix (length k) | | suffix (length k)
KMP Algorithm: Notions and Intuition
• Longest Proper Border → Failure Function
– Given pattern string P[0..M-1], we define failure function for each i (0 ≤ i ≤ M-1) as,
F(i) = MAXIMUM { k | 0 ≤ k ≤ i and P[0..k-1] = P[i-k+1..i] }
– Example:
i                                 0  1  2  3  4   5  6   7
P[i]                              a  b  c  a  b   a  b   c
Longest Proper Border of P[0..i]  ϕ  ϕ  ϕ  a  ab  a  ab  abc
F[i]                              0  0  0  1  2   1  2   3
• Intuition: Use failure function to jump/shift P[ ] by (k-F[k]+1) positions ahead (where P[0..k] is the part of the pattern matched before the mismatch)
• Proof: If shifting P by a smaller amount produced a match, then P[0..k] would have a proper border longer than F[k] → Contradiction!!
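The failure function can be checked straight from its definition with a brute-force sketch like the one below. The helper name `failure_by_definition` is an assumption made here, and the code is deliberately quadratic or worse: it is meant only to reproduce the table for "abcababc" and the shift amounts k-F[k]+1, not to be the efficient computation.

```python
def failure_by_definition(pattern: str) -> list[int]:
    """F[i] = length of the longest proper border of pattern[0..i]."""
    table = []
    for i in range(len(pattern)):
        prefix = pattern[: i + 1]
        best = 0
        for k in range(1, i + 1):             # proper border: k < len(prefix)
            if prefix[:k] == prefix[-k:]:     # length-k prefix equals length-k suffix
                best = k
        table.append(best)
    return table


F = failure_by_definition("abcababc")
# Shift applied when P[0..k] matched and the next character mismatched.
shifts = [k - F[k] + 1 for k in range(len(F))]
print(F)       # [0, 0, 0, 1, 2, 1, 2, 3]
print(shifts)  # [1, 2, 3, 3, 3, 5, 5, 5]
```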
KMP Algorithm: An Example
Text String:                  b a b c a b a b a b a c a a b
Pattern String:               a b a b a c a
Longest Proper Border Length: 0 0 1 2 3 0 1
[Figure: seven successive alignments of the Pattern String under the Text String, the last one ending in MATCH]
KMP Algorithm and Time Complexity
Time Complexity:
• Outer loop runs ≤ (N-M+1) times
• Each iteration of outer loop increments (i-j)
– (i-j) initializes to 0 and inner loop does not impact (i-j), as it increases both i & j
– when j is 0 (no character matched), i increases by 1 => (i-j) increases by 1
– when j ≥ 1, i is unchanged & j gets F[j-1]
• F[j-1] ≤ j-1 => i - F[j-1] ≥ (i-j)+1
• so j getting F[j-1] increases (i-j) by at least 1
• Inner loop only increments i, and i never exceeds N => at most N inner-loop steps in total
• O(N) time in total (as M ≤ N)
+ KMP_Match algorithm = O(N) time (≤ N-M+1 outer iterations, ≤ N inner steps)
+ Computing failure function = O(M) time
Algorithm KMP_Match (T[ ], N, P[ ], M)
  F[ ] ← ComputeFailureFunct (P[ ], M);
  i = 0; j = 0;                      // i: position in text, j: position in pattern
  while (i-j ≤ N-M) do {             // M-j ≤ N-i
    while ( (j < M) and (T[i] == P[j]) ) do {   // find longest matching prefix
      i++; j++;
    }
    if (j == M) then                 // report for match
      match found starting at T[i-M]
    if (j == 0) then i++;
    else j = F[j-1];                 // jump/shift using failure function
  }
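A runnable Python sketch of the matching loop follows. It keeps the slide's loop condition (i-j ≤ N-M) but, as assumptions made here, returns the index of the first match (or -1) instead of printing, and carries a compact `compute_failure` helper so the block is self-contained.

```python
def compute_failure(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper border of pattern[0..i]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]                   # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_match(text: str, pattern: str) -> int:
    """Return the start index of the first match, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    fail = compute_failure(pattern)
    i = j = 0                                 # i: text position, j: pattern position
    while i - j <= n - m:                     # the current alignment still fits
        while j < m and text[i] == pattern[j]:
            i += 1                            # find the longest matching prefix
            j += 1
        if j == m:
            return i - m                      # report the match
        if j == 0:
            i += 1                            # nothing matched: move on in the text
        else:
            j = fail[j - 1]                   # shift the pattern; i is never backed up
    return -1
```

On the example text b a b c a b a b a b a c a a b with pattern a b a b a c a, this returns index 6.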
KMP Algorithm: Computing Failure Function
Algorithm ComputeFailureFunct (P[ ], M)
  F[0] = 0; i = 1; j = 0;
  while (i < M) do {
    while ( (i < M) and (P[i] == P[j]) ) do {
      j++; F[i] = j; i++;
    }
    if (j == 0) then {
      F[i] = 0; i++;
    }
    else j = F[j-1];
  }
Failure Function computed by sliding the Pattern String over itself !
Time Complexity: O(M)
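To make the O(M) bound tangible, the sketch below follows the same i/j scheme and also counts character comparisons. The function name `compute_failure_counted` and the counter are additions made here for illustration. Every comparison either advances i (at most M-1 times) or shrinks j (which cannot happen more often than j grew), so the count stays within 2(M-1).

```python
def compute_failure_counted(pattern: str) -> tuple[list[int], int]:
    """Return (failure table, number of character comparisons performed)."""
    m = len(pattern)
    fail = [0] * m
    comparisons = 0
    i, j = 1, 0                               # pattern slid over itself: P[i] against P[j]
    while i < m:
        comparisons += 1
        if pattern[i] == pattern[j]:
            j += 1
            fail[i] = j                       # border of P[0..i] has length j
            i += 1
        elif j == 0:
            fail[i] = 0                       # no border at all
            i += 1
        else:
            j = fail[j - 1]                   # retry with the next shorter border
    return fail, comparisons


table, count = compute_failure_counted("abcababc")
print(table, count)   # [0, 0, 0, 1, 2, 1, 2, 3] 8
```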
Food-for-Thought: Exercise?
• String matching using KMP Algorithm searches only for first match
• Modify KMP Algorithm to perform the following:
① What changes will you make in the algorithm so that it can search for all
matches of pattern present in the text string?
• Example: Text = ABACAABAACAABABABAACAABBCA & Pattern = ACAAB
② When the matches may be overlapped, then how can you find all overlapping
matches as well?
• Example: Text = BABABABACABABABABACBABABAC & Pattern = ABABA
Hint: Try to bring modifications to the DFA and re-position your jumps/shifts!
Rabin-Karp Algorithm: Mathematical Overview
• Use mathematical computations
– Assume that the string is formed from Σ = {0, 1, 2, …, R-1} (radix-R notation, R = |Σ|)
– P ← numeric (radix-R) value of pattern string P[0..M-1] = p_0 p_1 … p_{M-1} (each p_i is from Σ)
• P = p_{M-1} + R (p_{M-2} + R (p_{M-3} + … + R (p_1 + R p_0) ... )) ← Horner's Rule [ Θ(M)-time ]
– T_i ← numeric (radix-R) value of M-window text-string starting at T[i], i.e. t_i t_{i+1} … t_{i+M-1}
• T_0 ← Compute similarly for t_0 t_1 … t_{M-1} using Horner's Rule in Θ(M)-time
– Example (…32145… in decimal): T_i = 5 + 10 × (4 + 10 × (1 + 10 × (2 + 10 × 3)))
• T_{i+1} = R (T_i – R^{M-1} t_i) + t_{i+M} ← Compute from T_i (shift M-length window) in Θ(1)-time
– Example (...[32145]6... → ...3[21456]...): T_{i+1} = 10 × (T_i – 10^{5-1} × 3) + 6
• Computation of T_1, T_2, …, T_{N-M} in Θ(N-M)-time
• When P = T_i, MATCH FOUND from index i at T[ ], i.e. p_0 p_1 … p_{M-1} = t_i t_{i+1} … t_{i+M-1}
Overall Time Complexity: Θ(N)
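The two formulas above can be tried out on the slide's decimal example with a short sketch. The function names `horner` and `roll` are hypothetical, and exact (unbounded) Python integers are assumed, so no modulus is involved yet.

```python
def horner(digits: list[int], radix: int) -> int:
    """Evaluate p_0 p_1 ... p_{M-1} as a radix-R number in Theta(M) steps."""
    value = 0
    for d in digits:
        value = value * radix + d             # shift left by one digit, add the next
    return value


def roll(value: int, leading: int, trailing: int, radix: int, m: int) -> int:
    """T_{i+1} = R * (T_i - R^(M-1) * t_i) + t_{i+M}."""
    return radix * (value - radix ** (m - 1) * leading) + trailing


t_i = horner([3, 2, 1, 4, 5], 10)             # window 32145
t_next = roll(t_i, leading=3, trailing=6, radix=10, m=5)
print(t_i, t_next)                            # 32145 21456
```

In a real matcher R^(M-1) would be computed once up front, which is what makes each roll a constant number of arithmetic operations.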
Rabin-Karp Algorithm: Efficient Computation
• Challenge: efficiently compute T_{i+1} given that we know T_i
– T_i = t_i R^{M-1} + t_{i+1} R^{M-2} + ... + t_{i+M-1} R^0 and T_{i+1} = t_{i+1} R^{M-1} + t_{i+2} R^{M-2} + ... + t_{i+M} R^0
• Key property: Can update function in constant time!
– T_{i+1} = (T_i – t_i R^{M-1}) R + t_{i+M}
(T_i: current value; – t_i R^{M-1}: subtract leading digit; × R: multiply by radix; + t_{i+M}: add new trailing digit)
Rabin-Karp Algorithm: An Example
T_0 = ((((3) * 10 + 1) * 10 + 4) * 10 + 1) * 10 + 5 = 31415
T_1 = 10 * (31415 – 10^4 * 3) + 9 = 14159
T_2 = 10 * (14159 – 10^4 * 1) + 2 = 41592
T_3 = 10 * (41592 – 10^4 * 4) + 6 = 15926
T_4 = 10 * (15926 – 10^4 * 1) + 5 = 59265
T_5 = 10 * (59265 – 10^4 * 5) + 3 = 92653
T_6 = 10 * (92653 – 10^4 * 9) + 5 = 26535
So, MATCH !! as P = T_6
Time: P in Θ(M); T_0 in Θ(M); T_1, …, T_{N-M} each in Θ(1), i.e. Θ(N-M) in worst-case
Overall Time-Complexity: Θ(N)
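The sequence T_0 … T_6 can be regenerated with the sketch below. The digit string "31415926535" is reconstructed here from the leading and trailing digits used in the lines above, and the name `window_values` is an assumption; arithmetic is exact, without a modulus.

```python
def window_values(digits: str, m: int, radix: int = 10) -> list[int]:
    """Numeric value of every length-m window of a digit string, via rolling updates."""
    if len(digits) < m:
        return []
    current = 0
    for ch in digits[:m]:                     # T_0 by Horner's rule, Theta(M)
        current = current * radix + int(ch)
    values = [current]
    high = radix ** (m - 1)                   # weight of the leading digit
    for i in range(len(digits) - m):          # T_1 .. T_{N-M}, Theta(1) each
        current = radix * (current - high * int(digits[i])) + int(digits[i + m])
        values.append(current)
    return values


values = window_values("31415926535", 5)
print(values)                 # [31415, 14159, 41592, 15926, 59265, 92653, 26535]
print(values.index(26535))    # 6
```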
Rabin-Karp Algorithm: Hash-map based Approach
• Demerit of computing P and T_i values:
– may be very large if M is long! (non-constant-time arithmetic operations)
• Solution: use Modular Hashing
– Compute a hash of P[0..M-1], say HP
– For each i, compute a hash of T[i..i+M-1], say HT
– If pattern hash (HP) ≠ text substring hash (HT), definitely NOT a match
– If pattern hash (HP) = text substring hash (HT), check for a VALID match
[Figure: Modular Hash with R=10 and H(k) = k (mod 997)]
Rabin-Karp Algorithm: Modular Hash-map Arithmetic
Modular hash function. Compute:
• T_i = t_i R^{M-1} + t_{i+1} R^{M-2} + ... + t_{i+M-1} R^0 (mod Q)
– Horner's method: Linear-time method to evaluate degree-(M-1) polynomial
• T_{i+1} = [ ( T_i (mod Q) – t_i * R^{M-1} (mod Q) ) R + t_{i+M} ] (mod Q)
– Efficient modular maths: To keep numbers small, take intermediate results modulo Q
26535 = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
= ((((2) * 10 + 6) * 10 + 5) * 10 + 3) * 10 + 5
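A small sketch of the modular versions of both steps, using R = 10 and Q = 997 as in the slides. The function names `modular_hash` and `roll_mod` are hypothetical. Note that Python's `%` always returns a non-negative result for a positive modulus; in languages where it can be negative, Q would need to be added back before the final reduction.

```python
def modular_hash(digits: str, radix: int = 10, q: int = 997) -> int:
    """Horner's rule with every intermediate result reduced modulo Q."""
    h = 0
    for ch in digits:
        h = (h * radix + int(ch)) % q
    return h


def roll_mod(h: int, leading: int, trailing: int, rm: int,
             radix: int = 10, q: int = 997) -> int:
    """Rolling update; rm is R^(M-1) mod Q, precomputed once."""
    return ((h - leading * rm) * radix + trailing) % q


rm = pow(10, 4, 997)                      # 10^4 mod 997 = 30
h0 = modular_hash("31415")                # hash of the first window
h1 = roll_mod(h0, leading=3, trailing=9, rm=rm)
print(modular_hash("26535"), h0, h1)      # 613 508 201
```

The rolled value 201 agrees with hashing the window 14159 from scratch, which is the property the matching loop relies on.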
Rabin-Karp Algorithm: Rolling Modular Hash-map
• First M entries: Use Horner's rule
• Remaining entries: Use rolling hash (and % or modulus to avoid overflow)
Rabin-Karp Algorithm (Pseudo-code)
Algorithm Rabin-Karp_StrMatch (TXT[ ], N, PAT[ ], M, R, Q)
  C = R^(M-1) mod Q; P = 0; T_0 = 0;
  for j = 0 to M-1 do { // Preprocessing
    P = (R*P + PAT[j]) mod Q; T_0 = (R*T_0 + TXT[j]) mod Q;
  }
  for i = 0 to N-M do { // Matching
    if (P == T_i) then
      if (PAT[0..M-1] == TXT[i..i+M-1]) then // rule out a hash collision
        match found starting at TXT[i];
    if (i < N-M) then
      T_{i+1} = (R (T_i – TXT[i] C) + TXT[i+M]) mod Q // add Q first if the difference is negative
  }
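The pseudo-code translates to Python roughly as follows. This is a sketch under assumptions made here: characters are mapped to numbers with `ord`, the defaults R = 256 and Q = 997 are illustrative only (a much larger prime Q is the usual choice to keep collisions rare), and the function returns the first match index or -1.

```python
def rabin_karp(text: str, pattern: str, radix: int = 256, q: int = 997) -> int:
    """Return the start index of the first match, or -1 if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    c = pow(radix, m - 1, q)                  # R^(M-1) mod Q
    p_hash = t_hash = 0
    for j in range(m):                        # preprocessing with Horner's rule
        p_hash = (radix * p_hash + ord(pattern[j])) % q
        t_hash = (radix * t_hash + ord(text[j])) % q
    for i in range(n - m + 1):                # matching
        # Equal hashes may be a collision, so confirm character by character.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                         # roll the window one position right
            t_hash = (radix * (t_hash - ord(text[i]) * c) + ord(text[i + m])) % q
    return -1
```

For instance, `rabin_karp("31415926535", "26535")` returns 6, the same position found in the decimal example.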
Comparative Study
[Table comparing the algorithms not recovered; surviving annotation: Θ(n+m) in practical cases]
n = text string length, m = pattern string length
Thank you
