Algorithm Notes: KMP
Background
String searching is one of the most common interview problems. Of course, it can be solved by using the simple way that every body can work out, which is:
- Compare the template string to the target one by one character.
- When meet the difference, come back to the start of the template string and the next character of the last start of the target string".
The time complexity is , that is obviously not good. If we can solve it using KMP(Knuth–Morris–Pratt string-searching algorithm), maybe we can stand out in the interviews.
How To
The key point of KMP is to avoid the repeated comparisons. We'll take an example to illustrate how.
Consider to find the first occurance of string w
= "ABCDABD" in string s
= "ABCDABCDABDE".
s | A | B | C | D | A | B | c" style="text-align:center;">C | D | A | B | D | E |
---|---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |||||
w | A | B | C | D | A | B | D | |||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
When compared to the sixth character, we meet the difference. Now i = 6
, j = 6
.
We'll introduce a Partial Match Table, PMT in short. It will be explained later in this article.
w | A | B | C | D | A | b" style="text-align:center;">B | D |
---|---|---|---|---|---|---|---|
PMT | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
According to the PMT, we move j to 2, keep i unchanged, and we get:
s | A | B | C | D | A | B | c" style="text-align:center;">C | D | A | B | D | E |
---|---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |||||
w | A | B | C | D | A | B | D | |||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Then continue to compare, and we'll find the match.
So the process of KMP is:
- Cursor
i
of target strings
andj
of template stringw
both start from 0, and compare characters ofs
andw
one by one; - When difference appears, move
j
to the value ofPMT[j-1]
. Ifj==0
, just keepj=0
; - Continue to compare. Once reach the end of string
w
, the match is found and returni-j
. Otherwise, return -1.
Code is as follows:
function KMP(s, w) {
let i = 0, j = 0;
while (i < s.length && j < w.length) {
if (s[i] === w[j]) {
i++;
j++;
if (j === w.length) {
return i - j;
}
} else if (j > 0) {
j = pmt[j-1];
} else {
i++;
}
}
return -1;
}
Partial Match Table
Now let's explain the PMT. First we'll introduce prefix and suffix. Also take the w = 'ABCDABD'
as example.
Prefix: all substrings from 0 to second last character. 'A', 'AB', 'ABC', 'ABCD', 'ABCDA', 'ABCDAB'
.
Suffix: all substrings from 1 to last character. 'D', 'BD', 'ABD', 'DABD', 'CDABD', 'BCDABD'
.
The PMT value is the length of the longest same string between prefix and suffix. So we get:
w | A | B | C | D | A | B | D |
---|---|---|---|---|---|---|---|
PMT | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
'A'
has no prefix or suffix, so PMT value is 0;'AB'
has prefix of'A'
and suffix of'B'
, no same, so PMT value is 0;- …
'ABCDA'
has prefix of'A','AB','ABC', 'ABCD'
and suffix of'A', 'DA', 'CDA', 'BCDA'
, longest same'A'
, so PMT value is 1;'ABCDAB'
has prefix of'A','AB','ABC', 'ABCD', 'ABCDA'
and suffix of'B', 'AB', 'DAB', 'CDAB', 'BCDAB'
, longest same'AB'
, so PMT value is 2;- …
Code is as follows:
function PMT(w) {
let pmt = new Array(w.length).fill(0);
let i = 1, j = 0;
while (i < w.length) {
if (w[i] === w[j]) {
i++;
j++;
pmt[i] = j;
} else if (j > 0) {
j = pmt[j-1];
} else {
pmt[i] = 0;
i++;
}
}
}
PMT[0]
is 0 of course. So we start from the second character, that is i = 1
. We consider string w
as two strings and try to find the same substring. The second cursor j
start from 0.
A | b" style="text-align:center;">B | C | D | A | B | D | ||
---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
A | B | C | D | A | B | D | ||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
A
and B
is not same, so PMT[1] = 0
. Move i
to next, and keep j = 0
.
A | B | c" style="text-align:center;">C | D | A | B | D | |||
---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ||
A | B | C | D | A | B | D | |||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
A
and C
is not same, so PMT[2] = 0
. Move i
to next, and keep j = 0
.
A | B | C | d" style="text-align:center;">D | A | B | D | ||||
---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |||
A | B | C | D | A | B | D | ||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
We get PMT[3] = 0
.
A | B | C | D | a" style="text-align:center;">A | B | D | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ||||
A | B | C | D | A | B | D | |||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
We get one same character A
, so PMT[4] = 1
. Now let's move both i
and j
to next and keep comparing.
A | B | C | D | a" style="text-align:center;">A | b" style="text-align:center;">B | D | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ||||
A | B | C | D | A | B | D | |||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Two same characters AB
, so PMT[5] = 2
. Continue to move i
and j
to next.
A | B | C | D | a" style="text-align:center;">A | b" style="text-align:center;">B | d" style="text-align:center;">D | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ||||
A | B | C | D | A | B | D | |||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
C
and D
are not same. Like KMP, we keep i
unchanged, and move j
to PMT[j-1]
, which is 0
.
A | B | C | D | A | B | d" style="text-align:center;">D | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ||||||
A | B | C | D | A | B | D | |||||||
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
At last we get PMT[6] = 0
. But why j = PMT[j - 1]
?
When we meet difference, we will move the string below to the right, which means to move j
back. There are j-1, j-2, …, 0
. How to decide which is the best move? Remember that the target of PMT is to find the longest same prefix and suffix. PMT[j - 1]
is the longest length of the same prefix and suffix of 'w[0...j-1]'
. Let's say
Moreover, through previous comparison we get
Combine this two equations, and we get
If we make j = PMT[j-1]
, the above equation will turn into
which is the same as the second equation above! It means by making the shortest move of j
back, we will continue to find the longest prefix and suffix with least comparison.
Next Table
Next table is the permutation of PMT.
w | A | B | C | D | A | B | D |
---|---|---|---|---|---|---|---|
PMT | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
Next | -1 | 0 | 0 | 0 | 0 | 1 | 2 |
It sets Next[0] = -1
, and moves PMT right in one step. The purpose of Next table is to make the KMP code more concisely. I will not illustrate it here, as understanding PMT first is the key.