An introduction to KMP algorithm
In this post, I will introduce the string matching problem and a clever solution, the KMP algorithm. I start from the brute force method and show how to deduce KMP from it.
Introduction
String Matching Problem
- Given strings `S` (length `m`) and `T` (length `n`), find `T` in `S`. The string `T` is called the pattern (denoted `P` in the rest of the post). A small example follows this list.
- Wide applications: string matching plays an essential role in various real-world problems.
  - Text processing: spell checking, text classification, information retrieval (especially in the NLP field).
  - Pattern recognition: bioinformatics, speech recognition.
  - Database management: efficient searching and retrieval.
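A minimal illustration of the problem (the strings below are made-up examples, not taken from any particular application):

```python
# Find every occurrence of T in S (a naive check, just to show the expected output)
S = "ababcabcab"
T = "abcab"
positions = [i for i in range(len(S) - len(T) + 1) if S[i:i+len(T)] == T]
print(positions)  # [2, 5] -- note that the two occurrences overlap
```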
KMP algorithm
- This is a time-efficient algorithm designed for the above problem.
- Summary of the key ideas in KMP (for those already familiar with the string matching problem):
  - Reduce the total number of comparison rounds --> utilize the information gained after each failed comparison.
  - Exploit the structure of the pattern string --> maintain an array that stores this useful information.
Brute Force method
- Before digging into the KMP algorithm, let's look at the method that defines the performance baseline.
```python
# Brute force algorithm for string matching
def bruteForce(S, P):
    for i in range(len(S) - len(P) + 1):
        if S[i : i + len(P)] == P:
            print(f'Find a matching point at position {i}')  # f-string
```
```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

// Brute force algorithm for string matching
void bruteForce(char *s, char *p) {
    int lenS = strlen(s), lenP = strlen(p);
    for(int i = 0; i <= lenS - lenP; i++) {
        bool flag = true;
        for(int j = 0; p[j] != '\0'; j++) {
            if(s[i+j] != p[j]) {
                flag = false;
                break;
            }
        }
        if(flag) printf("Find a matching point at position %d\n", i);
    }
}
```
- The intuition is pretty straightforward:
  1. We match the strings character by character; if no comparison fails before the end of the pattern string, a successful match has been found.
  2. For the entire string, we place the pattern at each starting position in `S` and repeat step 1.
Analysis
- Step 1 takes \(|P|\) time (the length of the pattern string, usually denoted `n`), and step 2 takes \(|S|\) time (the length of the original string, denoted `m`), so we can easily conclude that the time complexity is \(O(|P| \cdot |S|)\).
- Why is it so slow? Or rather, where could we improve?
  - Apparently we cannot optimize the comparison process in step 1: to check a match we have to perform a full comparison.
  - Naturally, we turn to whether we can reduce the number of matching rounds in step 2. Look at the example below.
    - After a comparison that fails partway through the pattern, brute force starts a new round of comparison at `S[1]`, then `S[2]`, even though these attempts are obviously doomed to fail. The KMP algorithm therefore proposes a mechanism to skip these impossible matchings.
    - The resulting process performs only the comparisons that could possibly succeed and skips the rest.
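To make this concrete, here is one possible illustration (the strings `S` and `P` below are example values chosen for this sketch, not taken from the original figures):

```python
# Where does brute force waste work?
S = "abcabcabd"
P = "abcabd"

# Starting at i = 0, brute force matches "abcab" and then fails at P[5] ('d' vs S[5] == 'c').
# It then restarts at i = 1 and i = 2, although we already know from the first round that
# S[1] == 'b' and S[2] == 'c' can never match P[0] == 'a'.
# KMP instead reuses the matched prefix: S[3..4] == P[3..4] == "ab" equals the pattern's
# own prefix P[0..1], so it resumes the comparison directly at P[2] vs S[5].
for i in range(3):
    j = 0
    while j < len(P) and S[i + j] == P[j]:
        j += 1
    print(f'start {i}: failed after matching {j} character(s)')
# start 0: failed after matching 5 character(s)
# start 1: failed after matching 0 character(s)
# start 2: failed after matching 0 character(s)
```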
Deducing the KMP algorithm
Partial Matching Table
- Following the intuition of skipping impossible matchings, we need a mechanism to decide which ones should be skipped. The KMP algorithm proposes a structure called the partial matching table, which stores exactly the information we need.
- The partial matching table has the same length as the pattern string and is denoted `next[]`. Given that the length of `P` is `n`, `next[i]` is defined as follows:
  - If the substring `P[0..i]` has a pair of identical proper prefix and suffix (`P[0..k-1] == P[i-(k-1)..i]` with `k <= i`), then `next[i]` is the largest such `k`; otherwise `next[i] = 0`.
  - For example, for the pattern string `abcabcd` we have `next = [0,0,0,1,2,3,0]`. When `i = 5` the substring is `abcabc`, so the longest identical prefix/suffix pair is `abc`, whose length is `3` (a quick check in code follows).
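  - A minimal check of this definition on the example pattern, using Python slicing (added here for illustration):

```python
# Check the definition of next[i] for p = "abcabcd" at i = 5, where next[5] = 3
p = "abcabcd"
i, k = 5, 3
prefix = p[0:k]                # P[0..k-1]   -> "abc"
suffix = p[i - k + 1 : i + 1]  # P[i-k+1..i] -> "abc"
assert prefix == suffix == "abc"
```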
- You may have noticed that, with the `next[]` array calculated, we can skip impossible matchings by consulting it during the search.
  - For example, with the pattern `abcabcd`: when a match fails at `P[6]`, we continue the comparison from `P[next[5]] = P[3]` instead of restarting from `P[0]`; when it fails at `P[3]`, `next[2] = 0` tells us to restart from `P[0]`.
  - From this we can conclude that we exploit the identical prefix and suffix to skip some comparisons. Say the mismatch occurs at `P[i]`; then within `P[0]~P[i-1]`, the first `next[i-1]` character(s) are the same as the last `next[i-1]` character(s), so the prefix can stand in for the suffix. In conclusion, we realign `P[next[i-1]]` with the text position where `P[i]` just failed.
  - In other words, the `next[]` array tells us which comparisons to skip and how many to skip.
```python
# Suppose we have calculated the next[] array
def search():
    s_pointer = 0
    p_pointer = 0
    while s_pointer < len(s):
        if s[s_pointer] == p[p_pointer]:  # If matched, increment both pointers
            s_pointer += 1
            p_pointer += 1
        elif p_pointer:                   # mismatched at p[p_pointer], move p_pointer back
            p_pointer = next[p_pointer - 1]
        else:                             # mismatched at p[0], move on in s
            s_pointer += 1
        if p_pointer == len(p):           # the whole pattern has been matched
            print(f'The matching start point is {s_pointer - p_pointer}')
            p_pointer = next[p_pointer - 1]  # keep looking for further matches
```
- How do we analyze the complexity of this step?
  - We can use amortized analysis: `s_pointer` increments at most `len(s)` times, therefore the time complexity is \(O(m)\). A short derivation follows.
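Spelling the amortized argument out a little (a short derivation added for completeness): every loop iteration either advances `s_pointer` or moves `p_pointer` back, and `p_pointer` can only move back as many times as it has advanced, which happens only together with an advance of `s_pointer`. Hence

\[
\#\text{iterations} \;\le\; \#\text{pointer advances} + \#\text{pointer retreats} \;\le\; m + m \;=\; 2m \;=\; O(m).
\]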
Calculating Partial Matching Table
Now we have drawn the overall picture of the KMP algorithm. The only problem remaining is how to calculate the `next[]` array. An easy way that comes to mind:
```python
# Naive way: for each x, try every candidate length from longest to shortest
def calNext(x):
    for i in range(x, 0, -1):
        if p[0:i] == p[x+1-i : x+1]:  # Note that p[0..x] has length x+1
            return i
    return 0

next = [calNext(x) for x in range(len(p))]
```
- However, this is slow, roughly \(O(n^2)\): for each entry `x` we try up to `x` candidate lengths, and each candidate requires a substring comparison.
An optimized way: compute each entry recursively from the earlier ones.
- Suppose we already know `next[x-1]` and want to calculate `next[x]`:
  - If `P[x] == P[next[x-1]]` --> `next[x] = next[x-1] + 1`.
  - If `P[x] != P[next[x-1]]`:
    - To simplify, we denote `next[x-1]` as `pre`.
    - When these two characters do not match, we should shorten `pre` and try the match again. We can utilize the property of the `next[]` array: `P[0]~P[pre-1]` (call it `A`) is identical to `P[x-pre]~P[x-1]` (call it `B`).
    - To shorten `pre`, we need to find the largest `k < pre` such that `prefix(A,k) == suffix(B,k)`.
    - Remember that `A` and `B` are identical, so the problem becomes finding the longest identical prefix and suffix of `A` itself, which is exactly `next[pre-1]`.
    - Therefore we conclude: iterate `pre = next[pre-1]` until `P[pre] == P[x]` (then `next[x] = pre + 1`) or `pre` reaches `0`.
```python
next = []

def calNext():
    next.append(0)  # Base case: next[0] = 0
    x = 1
    pre = 0
    while x < len(p):
        if p[pre] == p[x]:     # the identical prefix/suffix can be extended
            pre += 1
            next.append(pre)
            x += 1
        elif pre:              # mismatch: shorten pre using the table itself
            pre = next[pre - 1]
        else:                  # pre is already 0, no identical prefix/suffix
            next.append(0)
            x += 1
```
- We can see that this code looks similar to the one developed earlier. That is because they share the same essence: `search()` performs string matching between `S` and `P`, while `calNext()` performs string matching between `P` and itself!
- Similar to the search step, the same amortized argument gives a time complexity of \(O(n)\).
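Putting the pieces together, here is a quick end-to-end check (assuming `calNext()` and `search()` from the snippets above are in scope; the text string is an example chosen here):

```python
# End-to-end check with the functions defined above
p = "abcabcd"
next = []      # reset the global table before rebuilding it
calNext()
print(next)    # [0, 0, 0, 1, 2, 3, 0] -- matches the table derived by hand

s = "xxabcabcabcdyy"
search()       # prints: The matching start point is 5
```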
In conclusion, we conquer the string matching problem with time complexity \(O(m+n)\)!
Salute to the designers of this algorithm: Donald Knuth (K), James H. Morris (M), and Vaughan Pratt (P).