Fast String Distance (SIFT) Algorithm
This article is obsolete, a better version of the algorithm has been published: Sift3
While researching different ways of measuring the distance between two strings, or how different they are, I've found of course the Levenstein algorithm. The problem with it is that it is slow. Searching more, I've seen some algorithms that seemed fast, but I didn't have the time or brain power to understand them. So I've devised my own algorithm, called SIFT. You might think the name comes from Siderite's Intelligent and Fast Technique, but it comes from the English word 'sift'. :)
How does it work? Well, the common scenario in comparing strings is that someone made a mistake, a typo. So in principle, the two strings should be very similar in order to be worth comparing them. So what I do is this:
There is an optimisation towards the safe side: if the sift similarity is big enough, perform the constly Levenstein distance.
Ok, it might not be so clear, let's take an example:
INTERESTING
INFORMATIVE
Step 1: remove all identical overlapping characters (sift)
TEESNG
FOMAVE
Now we have smaller words to check, let's suppose there was a typo, that means that part of the one word is offset with one or maybe two characters from the other. So we move them a bit, that's a 'shake'.
Step 2: shake
TEESNG
[]FOMAVE
Oops, no overlapping characters. We do this one or two times more and there is no result, so...
Step 3: return result
MaxLength(TEESNG, FOMAVE)=6
There you have it. The sifting algorithm, because it resembles sifting grain.
Not satisfied with such a simple example? Let's take another:
Click here
Tests have shown it to be pretty close to Levenstein, at least in the cases that matter, while being substantially faster.
While researching different ways of measuring the distance between two strings, or how different they are, I've found of course the Levenstein algorithm. The problem with it is that it is slow. Searching more, I've seen some algorithms that seemed fast, but I didn't have the time or brain power to understand them. So I've devised my own algorithm, called SIFT. You might think the name comes from Siderite's Intelligent and Fast Technique, but it comes from the English word 'sift'. :)
How does it work? Well, the common scenario in comparing strings is that someone made a mistake, a typo. So in principle, the two strings should be very similar in order to be worth comparing them. So what I do is this:
foreach phase
remove identical overlapping characters
shake strings
return number of shakes + the length of the longest string between the two.
There is an optimisation towards the safe side: if the sift similarity is big enough, perform the constly Levenstein distance.
Ok, it might not be so clear, let's take an example:
INTERESTING
INFORMATIVE
Step 1: remove all identical overlapping characters (sift)
TEESNG
FOMAVE
Now we have smaller words to check, let's suppose there was a typo, that means that part of the one word is offset with one or maybe two characters from the other. So we move them a bit, that's a 'shake'.
Step 2: shake
TEESNG
[]FOMAVE
Oops, no overlapping characters. We do this one or two times more and there is no result, so...
Step 3: return result
MaxLength(TEESNG, FOMAVE)=6
There you have it. The sifting algorithm, because it resembles sifting grain.
Not satisfied with such a simple example? Let's take another:
Click here
Tests have shown it to be pretty close to Levenstein, at least in the cases that matter, while being substantially faster.
Hello,
ReplyDeletethis is just a cut off from the Levensthein algorithm. You only check permutations, and not all of them but only the number of shakes. This means that you don't count deletions, insertions and even permutations sometimes (if shake number is low).
Br,
Andras
You are correct, but the way it was designed has a parallel feel to it. One could optimize it with bit operations and so on.
ReplyDeleteIt doesn't much matters now. Sift3 is a lot better and simpler to implement.