Sooner or later you're going to have to deal with pattern matching at a lower level than you're typically accustomed. For instance, you may need to implement a search feature within a text editor. What follows is an example of the KMP (Knuth–Morris–Pratt) algorithm with a nice little hacker twist to give it slightly better performance.
For those who are unfamiliar, the basics of this algorithm are simple. Rather than perform a search against every character of the input ( i.e. strcmp(); ), we instead perform a match against only key positions of the string. This allows us to easily eliminate positions from the search space simply because the key positions do not match. Here's an example
the universe: ABAABACDEGAAA
the key:........EGA
On the first pass, we compare ABA to EGA using only key position 0. A is not equal to E so we can avoid that position and move to the next, which saves the time of doing a string comparison on the remaining pieces. So the next pass becomes
the universe: ABAABACDEGAAA
the key:..........EGA
B is not equal to E, so skip that position as well. Only when we've found a matching key position, do we start to evaluate the remaining characters. It may not seem like it, but this is actually a tremendous time savings over doing a naive strcmp() on each key position. However, a problem arises in the speed of the algorithm, when we have a universe that has many potential matches. For example:
the universe: EGXEECAEGADEGA
the key:........EGA
In this scenario, we spend a bit of extra time looking at sections that have similar key positions, ultimately to find out, that they never result in a perfect match. Using the KMP algorithm, this can't really be avoided. It's just how the algorithm works. But, it can be improved. Time for that hacker twist I talked about.
What happens if we examine two positions at once? We can effectively cut our comparison time in half by narrowing the search space during each pass from both ends. If we get a match with the first key position, and the second key position, we know that there may be a need to keep processing, and if not, we've found a entire section of the universe which can safely be ignored, as opposed to only a single key position.
For our purposes, I've chosen the initial key positions to be search_text[0] for the left key position, and strlen( search_text ) for the right. During each pass, non-matching sections of the universe are ignored, and then the left side is incremented, and the right side decremented. Saving even more time, since we no longer have to check the first and last positions. Hopefully you're still with me. If not, it's just the KMP algorithm working from both ends of the string. Simple eh?
So how to do this in C/C++? The toughest question we'll have to answer is: "How do we ignore an entire section of the universe?"
Well, how about using one of the standard containers? How about a stack? If we treat our search space as a stack, we can simply ignore non-matching patterns by popping them during each pass. The next pass will then automatically begin with only the sections of the search space that have the potential to form a match. The code below is that very scenario in a nutshell. It's only a demo of the algorithm, but with only slight modification you can use it for any of your string searching purposes. Enjoy.
Oh, and for more on string search algorithms be sure to check out the wiki.
For those who are unfamiliar, the basics of this algorithm are simple. Rather than perform a search against every character of the input ( i.e. strcmp(); ), we instead perform a match against only key positions of the string. This allows us to easily eliminate positions from the search space simply because the key positions do not match. Here's an example
the universe: ABAABACDEGAAA
the key:........EGA
On the first pass, we compare ABA to EGA using only key position 0. A is not equal to E so we can avoid that position and move to the next, which saves the time of doing a string comparison on the remaining pieces. So the next pass becomes
the universe: ABAABACDEGAAA
the key:..........EGA
B is not equal to E, so skip that position as well. Only when we've found a matching key position, do we start to evaluate the remaining characters. It may not seem like it, but this is actually a tremendous time savings over doing a naive strcmp() on each key position. However, a problem arises in the speed of the algorithm, when we have a universe that has many potential matches. For example:
the universe: EGXEECAEGADEGA
the key:........EGA
In this scenario, we spend a bit of extra time looking at sections that have similar key positions, ultimately to find out, that they never result in a perfect match. Using the KMP algorithm, this can't really be avoided. It's just how the algorithm works. But, it can be improved. Time for that hacker twist I talked about.
What happens if we examine two positions at once? We can effectively cut our comparison time in half by narrowing the search space during each pass from both ends. If we get a match with the first key position, and the second key position, we know that there may be a need to keep processing, and if not, we've found a entire section of the universe which can safely be ignored, as opposed to only a single key position.
For our purposes, I've chosen the initial key positions to be search_text[0] for the left key position, and strlen( search_text ) for the right. During each pass, non-matching sections of the universe are ignored, and then the left side is incremented, and the right side decremented. Saving even more time, since we no longer have to check the first and last positions. Hopefully you're still with me. If not, it's just the KMP algorithm working from both ends of the string. Simple eh?
So how to do this in C/C++? The toughest question we'll have to answer is: "How do we ignore an entire section of the universe?"
Well, how about using one of the standard containers? How about a stack? If we treat our search space as a stack, we can simply ignore non-matching patterns by popping them during each pass. The next pass will then automatically begin with only the sections of the search space that have the potential to form a match. The code below is that very scenario in a nutshell. It's only a demo of the algorithm, but with only slight modification you can use it for any of your string searching purposes. Enjoy.
Oh, and for more on string search algorithms be sure to check out the wiki.
Code
No comments:
Post a Comment