Advanced KMP String Search

Sooner or later you're going to have to deal with pattern matching at a lower level than you're typically accustomed. For instance, you may need to implement a search feature within a text editor. What follows is an example of the KMP (Knuth–Morris–Pratt) algorithm with a nice little hacker twist to give it slightly better performance.

For those who are unfamiliar, the basics of this algorithm are simple. Rather than perform a search against every character of the input ( i.e. strcmp(); ), we instead perform a match against only key positions of the string. This allows us to easily eliminate positions from the search space simply because the key positions do not match. Here's an example

the universe: ABAABACDEGAAA
the key:........EGA

On the first pass, we compare ABA to EGA using only key position 0. A is not equal to E so we can avoid that position and move to the next, which saves the time of doing a string comparison on the remaining pieces. So the next pass becomes

the universe: ABAABACDEGAAA
the key:..........EGA

B is not equal to E, so skip that position as well. Only when we've found a matching key position, do we start to evaluate the remaining characters. It may not seem like it, but this is actually a tremendous time savings over doing a naive strcmp() on each key position. However, a problem arises in the speed of the algorithm, when we have a universe that has many potential matches. For example:

the universe: EGXEECAEGADEGA
the key:........EGA

In this scenario, we spend a bit of extra time looking at sections that have similar key positions, ultimately to find out, that they never result in a perfect match. Using the KMP algorithm, this can't really be avoided. It's just how the algorithm works. But, it can be improved. Time for that hacker twist I talked about.

What happens if we examine two positions at once? We can effectively cut our comparison time in half by narrowing the search space during each pass from both ends. If we get a match with the first key position, and the second key position, we know that there may be a need to keep processing, and if not, we've found a entire section of the universe which can safely be ignored, as opposed to only a single key position.

For our purposes, I've chosen the initial key positions to be search_text[0] for the left key position, and strlen( search_text ) for the right. During each pass, non-matching sections of the universe are ignored, and then the left side is incremented, and the right side decremented. Saving even more time, since we no longer have to check the first and last positions. Hopefully you're still with me. If not, it's just the KMP algorithm working from both ends of the string. Simple eh?

So how to do this in C/C++? The toughest question we'll have to answer is: "How do we ignore an entire section of the universe?"

Well, how about using one of the standard containers? How about a stack? If we treat our search space as a stack, we can simply ignore non-matching patterns by popping them during each pass. The next pass will then automatically begin with only the sections of the search space that have the potential to form a match. The code below is that very scenario in a nutshell. It's only a demo of the algorithm, but with only slight modification you can use it for any of your string searching purposes. Enjoy.

Oh, and for more on string search algorithms be sure to check out the wiki.

Code
#include 
#include 

int main()
{
   std::vector search_space_stack;
   std::vector match_stack;

   char universe[] = "This is a very, very, very, hairy algorithmvery";
   char search_text[] = "very";

   int len_pattern = std::strlen(search_text);
   int previous_first_position = 0;

   // populate the universe (a.k.a search space) for the initial pass
   for(int i = 0; i < strlen(universe) - (len_pattern - 1); i++)
   {
      search_space_stack.push_back(i);
   }

   // the heart and soul of the search
   while( search_space_stack.empty() == false )
   {
      int start_position = previous_first_position;
      int end_position   = (len_pattern -1) - start_position;

      char first_char = search_text[start_position];
      char last_char = search_text[end_position];

      char test_char1 = universe[ search_space_stack.back() + start_position ];
      char test_char2 = universe[ search_space_stack.back() + end_position ];

      int result = (test_char1 == first_char) + (test_char2 == last_char);

      if( result != 2 )
      {
         search_space_stack.pop_back();
         previous_first_position += end_position;
      }
      else if( (end_position - start_position) < 1 )
      {
         match_stack.push_back( search_space_stack.back() );
         search_space_stack.pop_back();
      }
      else
      {
         previous_first_position++;
      }

   }

   // now to display our results
   for(int i = 0; i < match_stack.size(); i++)
   {
      char buffer[256];
      std::memset((void*)buffer,0,256);
      std::memcpy((void*)buffer,(void*)&universe[match_stack[i]],len_pattern);
      std::cout << match_stack[i] << " " << buffer << std::endl;
   }

   // clean up before exit
   match_stack.clear();

   return(0);
}

No comments:

Post a Comment