Algorithm Notes: KMP

10min read

Background

String searching is one of the most common interview problems. Of course, it can be solved by using the simple way that every body can work out, which is:

  • Compare the template string to the target one by one character.
  • When meet the difference, come back to the start of the template string and the next character of the last start of the target string".

The time complexity is O(m×n)O(m \times n), that is obviously not good. If we can solve it using KMP(Knuth–Morris–Pratt string-searching algorithm), maybe we can stand out in the interviews.

How To

The key point of KMP is to avoid the repeated comparisons. We'll take an example to illustrate how.

Consider to find the first occurance of string w = "ABCDABD" in string s = "ABCDABCDABDE".

s A B C D A B c" style="text-align:center;">C D A B D E
i 0 1 2 3 4 5 6
w A B C D A B D
j 0 1 2 3 4 5 6

When compared to the sixth character, we meet the difference. Now i = 6, j = 6.

We'll introduce a Partial Match Table, PMT in short. It will be explained later in this article.

w A B C D A b" style="text-align:center;">B D
PMT 0 0 0 0 1 2 0

According to the PMT, we move j to 2, keep i unchanged, and we get:

s A B C D A B c" style="text-align:center;">C D A B D E
i 0 1 2 3 4 5 6
w A B C D A B D
j 0 1 2 3 4 5 6

Then continue to compare, and we'll find the match.

So the process of KMP is:

  1. Cursor i of target string s and j of template string w both start from 0, and compare characters of s and w one by one;
  2. When difference appears, move j to the value of PMT[j-1]. If j==0, just keep j=0;
  3. Continue to compare. Once reach the end of string w, the match is found and return i-j. Otherwise, return -1.

Code is as follows:

function KMP(s, w) {
  let i = 0, j = 0;
  while (i < s.length && j < w.length) {
    if (s[i] === w[j]) {
      i++;
      j++;
      if (j === w.length) {
        return i - j;
      }
    } else if (j > 0) {
      j = pmt[j-1];
    } else {
      i++;
    }
  }
  return -1;
}

Partial Match Table

Now let's explain the PMT. First we'll introduce prefix and suffix. Also take the w = 'ABCDABD' as example.

Prefix: all substrings from 0 to second last character. 'A', 'AB', 'ABC', 'ABCD', 'ABCDA', 'ABCDAB'.

Suffix: all substrings from 1 to last character. 'D', 'BD', 'ABD', 'DABD', 'CDABD', 'BCDABD'.

The PMT value is the length of the longest same string between prefix and suffix. So we get:

w A B C D A B D
PMT 0 0 0 0 1 2 0
  • 'A' has no prefix or suffix, so PMT value is 0;
  • 'AB' has prefix of 'A' and suffix of 'B', no same, so PMT value is 0;
  • 'ABCDA' has prefix of 'A','AB','ABC', 'ABCD' and suffix of 'A', 'DA', 'CDA', 'BCDA' , longest same 'A', so PMT value is 1;
  • 'ABCDAB' has prefix of 'A','AB','ABC', 'ABCD', 'ABCDA' and suffix of 'B', 'AB', 'DAB', 'CDAB', 'BCDAB', longest same'AB', so PMT value is 2;

Code is as follows:

function PMT(w) {
  let pmt = new Array(w.length).fill(0);
  let i = 1, j = 0;
  while (i < w.length) {
    if (w[i] === w[j]) {
      i++;
      j++;
      pmt[i] = j;
    } else if (j > 0) {
      j = pmt[j-1];
    } else {
      pmt[i] = 0;
      i++;
    }
  }
}

PMT[0] is 0 of course. So we start from the second character, that is i = 1. We consider string w as two strings and try to find the same substring. The second cursor j start from 0.

A b" style="text-align:center;">B C D A B D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

A and B is not same, so PMT[1] = 0. Move i to next, and keep j = 0.

A B c" style="text-align:center;">C D A B D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

A and C is not same, so PMT[2] = 0. Move i to next, and keep j = 0.

A B C d" style="text-align:center;">D A B D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

We get PMT[3] = 0.

A B C D a" style="text-align:center;">A B D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

We get one same character A, so PMT[4] = 1. Now let's move both i and j to next and keep comparing.

A B C D a" style="text-align:center;">A b" style="text-align:center;">B D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

Two same characters AB, so PMT[5] = 2. Continue to move i and j to next.

A B C D a" style="text-align:center;">A b" style="text-align:center;">B d" style="text-align:center;">D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

C and D are not same. Like KMP, we keep i unchanged, and move j to PMT[j-1], which is 0.

A B C D A B d" style="text-align:center;">D
i 0 1 2 3 4 5 6
A B C D A B D
j 0 1 2 3 4 5 6

At last we get PMT[6] = 0. But why j = PMT[j - 1]?

When we meet difference, we will move the string below to the right, which means to move j back. There are j-1, j-2, …, 0. How to decide which is the best move? Remember that the target of PMT is to find the longest same prefix and suffix. PMT[j - 1] is the longest length of the same prefix and suffix of 'w[0...j-1]'. Let's say

w[0...PMT[j1]1]=w[jPMT[j1]...j1]w_{[0...PMT[j-1]-1]} = w_{[j-PMT[j-1]...j-1]}

Moreover, through previous comparison we get

w[ij...i1]=w[0...j1]w_{[i-j...i-1]} = w_{[0...j-1]}

Combine this two equations, and we get

w[0...PMT[j1]1]=w[iPMT[j1]...i1]w_{[0...PMT[j-1]-1]} = w_{[i-PMT[j-1]...i-1]}

If we make j = PMT[j-1], the above equation will turn into

w[0...j1]=w[ij...i1] w_{[0...j-1]} = w_{[i-j...i-1]}

which is the same as the second equation above! It means by making the shortest move of j back, we will continue to find the longest prefix and suffix with least comparison.

Next Table

Next table is the permutation of PMT.

w A B C D A B D
PMT 0 0 0 0 1 2 0
Next -1 0 0 0 0 1 2

It sets Next[0] = -1, and moves PMT right in one step. The purpose of Next table is to make the KMP code more concisely. I will not illustrate it here, as understanding PMT first is the key.


MORE FROM THE BLOG

Algorithm Notes: Traversal of Binary...

Mostly we know that the traversal of binary tree can be easily done by recursion....

3min read

Algorithm Notes: Understanding of Binary...

Recently, I have recovered my algorithm praticing. It's like something is messing in my mind, maybe due to the shallow...

4min read

Algorithm Notes: Binary Heap

A Binary Heap is a complete Binary Tree. It is either a Min Heap, where parent node's value is less...

2min read

Use React Context + useReducer...

React Hooks have brought us much convenience in writing concise readable code. Recently, I have been enjoying the usage of...

4min read