Wednesday, 4 April 2007

Super Fast and Accurate string distance algorithm: Sift3

Update November 2014: Sift4 is here!! Check out the new improved version of Sift here: Super Fast and Accurate string distance algorithm: Sift4

Update October 6 2014: New stuff, compare Levenstein vs Sift here:
Algorithm:      Levenstein      Sift
String 1:   String 2:  
Result:

Update June 25th 2013: I've decided to play a little with the suggestions in the comments and check for validity. This was spurned by the realization that a lot of people use my algorithm. So, in order to celebrate this, here is the "3B" version of the Sift3 algorithm:
It is made in Javascript, this time, as it was easier to test and has the following extra features:
  • a maxDistance value that tells the algorithm to stop if the strings are already too different.
  • two pointers c1 and c2, rather than a single pointer c and two offsets
  • Instead of dividing to 2 the total length of the strings compared, now I divide it with 1.5. Why? Because this way the value is closer to the Levenshtein distance computed per random strings

Happy usage!
The variant I posted was totally buggy. I removed it. Just use sift3Distance.
Update: the Sift algorithm is now on Github.

A while ago I wrote an entry here about Sift2, an improvement of Sift, the original and silly string distance algorithm. Now I am publishing Sift3, which is way more accurate and even simpler as an algorithm.

I found out that my algorithm is part of a class of algorithms that solve the Longest Common Substring problem, therefore I calculated the LCS, not the distance, then the distance from the LCS. The result is way more robust, easy to understand and closer to the Levenshtein algorithm both on random strings and user databases. Not to mention that there is no goto in this one.

BTW, if you are looking for an algorithm that detects switched words, this is not it :) This just looks for typos and small regional differences between the strings. I mean, you could normalize the strings, so that words are ordered by some mechanism, then it would work because the words wouldn't be switched :)

I promise to work on a word switching algorithm, but not in the near future.
Without further due, here is the code:

The C# code is a method in an object that has a private member maxOffset. As in Sift2 maxOffset should be around 5 and it represents the range in which to try to find a missing character.

public float Distance(string s1, string s2, int maxOffset)
{
if (String.IsNullOrEmpty(s1))
{
return String.IsNullOrEmpty(s2)
? 0
: s2.Length;
}
if (String.IsNullOrEmpty(s2))
{
return s1.Length;
}
int c = 0;
int offset1 = 0;
int offset2 = 0;
int lcs = 0;
while ((c + offset1 < s1.Length)
&&
(c + offset2 < s2.Length))
{
if (s1[c + offset1] ==
s2[c + offset2])
lcs++;
else
{
offset1 = 0;
offset2 = 0;
for (int i = 0;
i < maxOffset;
i++)
{
if ((c + i < s1.Length)
&&
(s1[c + i] == s2[c]))
{
offset1 = i;
break;
}
if ((c + i < s2.Length)
&&
(s1[c] == s2[c + i]))
{
offset2 = i;
break;
}
}
}
c++;
}
return (s1.Length + s2.Length)/2 - lcs;
}


And here is the T-Sql code. This version is actually an improvement of my original source, gracefully provided by Todd Wolf:

CREATE FUNCTION [DBO].[Sift3distance2]
(
@s1 NVARCHAR(3999),@s2 NVARCHAR(3999),@maxOffset INT
)
RETURNS FLOAT
AS 
BEGIN
DECLARE @s1LEN INT,@s2LEN INT

SELECT @s1LEN=Len(Isnull(@s1,'')),@s2LEN=Len(Isnull(@s2,''))

IF @s1LEN=0 RETURN @s2LEN
ELSE
IF @s2LEN=0 RETURN @s1LEN

IF Isnull(@maxOffset,0)=0 SET @maxOffset=5

DECLARE @currPos INT,@matchCnt INT,@wrkPos INT,@s1Offset INT,@s1Char VARCHAR,@s1Pos INT,@s1Dist INT,@s2Offset INT,@s2Char VARCHAR,@s2Pos INT,@s2Dist INT

SELECT @s1Offset=0,@s2Offset=0,@matchCnt=0,@currPos=0

WHILE(@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
BEGIN
SET @wrkPos=@currPos+1

IF(Substring(@s1,@wrkPos+@s1Offset,1)=Substring(@s2,@wrkPos+@s2Offset,1)) SET @matchCnt=@matchCnt+1
ELSE
BEGIN
SET @s1Offset=0

SET @s2Offset=0

SELECT @s1Char=Substring(@s1,@wrkPos,1),@s2Char=Substring(@s2,@wrkPos,1)

SELECT @s1Pos=Charindex(@s2Char,@s1,@wrkPos)-1,@s2Pos=Charindex(@s1Char,@s2,@wrkPos)-1

SELECT @s1Dist=@s1Pos-@currPos,@s2Dist=@s2Pos-@currPos

IF(@s1Pos>0 AND (@s1Dist<=@s2Dist OR @s2Pos<1) AND @s1Dist<@maxOffset) SET @s1Offset=(@s1Pos-@wrkPos)+1
ELSE
IF(@s2Pos>0 AND (@s2Dist<@s1Dist OR @s1Pos<1) AND @s2Dist<@maxOffset) SET @s2Offset=(@s2Pos-@wrkPos)+1
END

SET @currPos=@currPos+1
END

RETURN(@s1LEN+@s2LEN)/2.0-@matchCnt
END

It doesn't give the same exact results as my own code, yet the result is close enough and the speed is about 20% higher.

And thanks to Diogo Nechtan, the version in PHP:
function sift3Plus($s1, $s2, $maxOffset) {
$s1Length = strlen($s1); 
$s2Length = strlen($s2);
if (empty($s1)) {
return (empty($s2) ? 0 : $s2Length);
}
if (empty($s2)) {
return $s1Length;
}
$c1 = $c2 = $lcs = 0;

while (($c1 < $s1Length) && ($c2 < $s2Length)) {
if (($d = $s1{$c1}) == $s2{$c2}) {
$lcs++;
} else {
for ($i = 1; $i < $maxOffset; $i++) {
if (($c1 + $i < $s1Length) && (($d = $s1{$c1 + $i}) == $s2{$c2})) {
$c1 += $i;
break;
}
if (($c2 + $i < $s2Length) && (($d = $s1{$c1}) == $s2{$c2 + $i})) {
$c2 += $i;
break;
}
}
}
$c1++;
$c2++;
}
return (($s1Length + $s2Length) / 2 - $lcs);
}

And thanks to Fernando Jorge Mota, the version in Python:


Also, here is the Javascript version, used in Mailcheck, by Derrick Ko and Wei Lu.

function sift3Distance(s1, s2) {
if (s1 == null || s1.length === 0) {
if (s2 == null || s2.length === 0) {
return 0;
} else {
return s2.length;
}
}

if (s2 == null || s2.length === 0) {
return s1.length;
}

var c = 0;
var offset1 = 0;
var offset2 = 0;
var lcs = 0;
var maxOffset = 5;

while ((c + offset1 < s1.length) && (c + offset2 < s2.length)) {
if (s1.charAt(c + offset1) == s2.charAt(c + offset2)) {
lcs++;
} else {
offset1 = 0;
offset2 = 0;
for (var i = 0; i < maxOffset; i++) {
if ((c + i < s1.length) && (s1.charAt(c + i) == s2.charAt(c))) {
offset1 = i;
break;
}
if ((c + i < s2.length) && (s1.charAt(c) == s2.charAt(c + i))) {
offset2 = i;
break;
}
}
}
c++;
}
return (s1.length + s2.length) / 2 - lcs;
}

Another implementation, this time in Java, by Eclesia:


You might also be interested in a customised version in AutoKey, by Toralf:


Thanks all for your comments and I look forward to more. Just tell me it worked or not and, most important, why. Good luck!

66 comments:

  1. I like this. Using the SQL function, I have had a couple strange results right off the bat with a single back slash "\" making a really big difference. Is "\" a reserved character in SQL? This is SQL Server 2005.

    ReplyDelete
  2. Nevermind my previous post as my problem appears to have nothing to do with SQL server. I re-wrote it in ColdFusion script and experienced the exact same behavior. I am comparing a larger amount of text like a couple of paragraphs.
    If I were to compare two sentences which were idential with the exception of an extra letter being present at the beginning of on of the sentences, then the two would compare as being very different. Furthermore if I was to omit or add a single letter at the end of a string, they would compare as being very similar.
    That would obviously not be the case, since it was only a single-character difference in each scenario. Perhaps I shouldn't use this algorithm for larger strings such as sentences.

    ReplyDelete
  3. Sorry to be spamming you comments, but I wanted to say that it appears the behavior I reported in my second post was simply becase I had a maxOffset of 0. With a larger maxOffset I am able to insert several words into the middle or beginning of one string and it still produces a relativley close match.

    Of course, I can still trick it by simply re-arranging the same peices of data.
    For instance, if I was to compare the strings:
    "I like spam. Let's go to the store."
    and
    "Let's go to the store. I like spam."

    then they produce a pretty low comparison even though they are the same sentences, but rearranged.
    To compare those would be much more complicated though.

    ReplyDelete
  4. Thank you for your comments. I didn't have time to test this properly so I might have missed something ;)
    Can you give me some examples of where it didn't quite work?

    Also, this algorithm is basically finding typos. maxOffset should be around 5, and it detects (or it should :-/) mispelled words or small 4 letter word insertions and so on.

    What you need for getting around the "trick" with the word rearranging is an algorithm that searches and compares groups of two or three letters or even entire words. But that's beyond the scope of Sift3.

    Again, thank you for posting comments of your experiences with Sift.

    ReplyDelete
  5. In the C# sample, was maxOffset supposed to be a parameter like in the T-SQL version? Because it's not declared in the function itself.

    ReplyDelete
  6. oh, wait, I looked at the code for Sift2 and I see that it's a private field initialized during construction. This is an excellent example of why I am a proponent of decorating such fields, either with _ (Microsoft's recommendation) or m_ (the older, MFC-style version). If it had been called _maxOffset I probably would have figured it out myself.
    Anyways, thanks for the code. I'm trying to implement matching/duplicate elimination in-house and this will be my starting point. There's a lot of talk on the 'Net about edit distance and a lot of mathematical formulas that I probably would have understood 20 years ago in school, but very little actual code. Again, thanks.

    ReplyDelete
  7. Sorry about that. Published Sift3 in a bit of a hurry and I only copy pasted the method from an object.
    I am glad I could help, I am no mathematician and after looking at some of the formulas you are talking about and not understanding anything, I started from scratch with my silly brain.
    I am still wondering it worked :)

    ReplyDelete
  8. Okay, I have tested this function and I can say that it's not giving me the results I expect. For example, I compared the name "Dimas" to "Jim" and "Tina" and in both cases I got a distance of 2 when I should have gotten 3. To make sure we're both on the same page here, this is why I think I should have gotten three:
    Jim-> Change first letter, delete two letters at the end: 3
    Tina-> Change first letter, change third letter, delete one letter at the end: 3.
    Do you agree?

    ReplyDelete
  9. One more thing: your function is typed to return float but the return statement is operating on ints, so you'll never send back anything but an int. I don't think edit distance has the concept of anything but ints, anyways, so I think the function ought to reflect that.

    ReplyDelete
  10. Maybe it wasn't clear enough that this algorithm is aproximating the Levenshtein distance, it doesn't always get it right.

    I tried to minimize the difference between the Levenshtein distance and the result of this function and this is what I came up with. However, the first priority was speed.

    In order to improve on the accuracy, you should use Levenshtein for short words (let's say if they have less than 6 letters) or on the top similarity results from Sift. In this way you double check the results, but only on those strings similar enough to be worth it.

    As for the int, it's a good catch, but I am returning float so I don't have to cast anything when computing similarity and to compare with Sift2, which returned a float. Dividing by 2.0 increased the difference from Levenshtein, too :)

    ReplyDelete
  11. witch one is faster? Sift2 or Sift3?

    Tks

    ReplyDelete
  12. Sift 3 is better and faster.

    ReplyDelete
  13. I was looking for a method to compare two digital ham radio messages to see if they were the same message. Since there can be some line noise, it has to be a little forgiving.

    I tried Swift3 and at the default maxOffset of 5 it worked like crap in my test cases. Before giving up and moving on, I tried 20. Now it works like a champ.

    I'm very impressed.

    ReplyDelete
  14. I am glad I could help. Thank you for your comment!

    ReplyDelete
  15. Dear,

    I've downloaded and start goofing arround with your algo.
    first off - great work, some things are best left of simple :-)

    I'd like to propose 2 improvements that would not hit performance and may be of good use:

    1. make maxOffset varing according to strings lengths. i've found that u best use lower maxoffset when u have short words, than in a sentence. - easy to implement, no cpu overload, and results seem better (didn't heavy test it though).

    2. location priority - again simple implementation (i.e *log(pos)), as in some sentences the begining is more important than the end or vice versa

    I'd be happy to discuss it with you if you think my ideas has any value to you.

    best,
    DaberElay.

    ReplyDelete
  16. Thanks for the input. Your sugestions certainly seem reasonable. Alas, I lack the time to work on this algorithm at the moment.
    If you do transform your ideas in code (and maybe some test data as well), please let me know, I will add your contribution to this post.

    ReplyDelete
  17. I have a store of possibly misspelled words with a count of around 300,000. I have a dictionary of around 1000 terms. Your algo gave me some great results but takes quite a bit of time because of the loooping. I modified the SQL UDF to be more efficient and thought that I would send you my results. (Yes, I know I'm way too anal about spacing ;-) )

    CREATE FUNCTION [dbo].[Sift3Distance](@s1 nvarchar(3999), @s2 nvarchar(3999),@maxOffset int)
    RETURNS FLOAT
    AS
    BEGIN

    DECLARE @s1LEN INT,
    @s2LEN INT

    SELECT @s1LEN = LEN(ISNULL(@s1, '')),
    @s2LEN = LEN(ISNULL(@s2, ''))

    IF @s1LEN = 0
    RETURN @s2LEN
    ELSE IF @s2LEN = 0
    RETURN @s1LEN

    IF ISNULL(@maxOffset,0)=0
    SET @maxOffset=5

    DECLARE @currPos INT,
    @matchCnt INT,
    @wrkPos INT,

    @s1Offset INT,
    @s1Char VARCHAR,
    @s1Pos INT,
    @s1Dist INT,

    @s2Offset INT,
    @s2Char VARCHAR,
    @s2Pos INT,
    @s2Dist INT

    SELECT @s1Offset=0,
    @s2Offset=0,
    @matchCnt=0,
    @currPos=0


    WHILE (@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
    BEGIN
    SET @wrkPos = @currPos + 1
    IF (SUBSTRING(@s1,@wrkPos+@s1Offset,1)=SUBSTRING(@s2,@wrkPos+@s2Offset,1))
    SET @matchCnt=@matchCnt+1
    ELSE
    BEGIN
    SET @s1Offset=0
    SET @s2Offset=0

    SELECT @s1Char = SUBSTRING(@s1,@wrkPos,1),
    @s2Char = SUBSTRING(@s2,@wrkPos,1)

    SELECT @s1Pos = CHARINDEX(@s2Char, @s1, @wrkPos) - 1,
    @s2Pos = CHARINDEX(@s1Char, @s2, @wrkPos) - 1

    SELECT @s1Dist = @s1Pos-@currPos,
    @s2Dist = @s2Pos-@currPos

    IF(@s1Pos > 0 AND (@s1Dist <= @s2Dist OR @s2Pos<1) AND @s1Dist < @MaxOffset)
    SET @s1Offset = (@s1Pos-@wrkPos) + 1
    ELSE IF(@s2Pos > 0 AND (@s2Dist < @s1Dist OR @s1Pos<1) AND @s2Dist < @MaxOffset)
    SET @s2Offset = (@s2Pos-@wrkPos) + 1
    END

    SET @currPos=@currPos+1
    END

    RETURN (@s1LEN+@s2LEN)/2.0-@matchCnt
    END


    I'm still trying to determine if this will give me the throughput that I'm looking for but I sure appreciate the hard work that you did!

    ReplyDelete
  18. Ah, the good old days when I had the time to pursue my own ideas and research. Thank you for giving me feedback. For me it was a pleasure to create the algorithm and most of all to see it used.

    I will try to test your improvement and also to integrate the other cool ideas from the others, maybe change version to Sift4. If only I had the time...

    ReplyDelete
  19. I totally understand. It seems these days that I only learn new stuff for which I have a need. No more time to just go have some "fun". :-)

    I've been toying with the idea of adding a MaxDistance param to short circuit the loop when it falls into range to maybe shave a few more milliseconds off.

    Probably won't do much for the casual app but when your comparing 13,000 known terms to 300,000 possible misspellings each iteration means something!

    Will post anything else that comes out of my testing.

    ReplyDelete
  20. I have a few ideas. Already got a small algorithm that is twice as fast as Sift3, but terribly innacurate, partly based on your optimisation. Made it today.

    ReplyDelete
  21. One minor suggestion...include maxOffset as an input parameter and make the method Static.

    ReplyDelete
  22. Hey, it's me again. This post is pretty old, but I figured I owed you a heads up. Quite some time ago I ported your SIFT3 over to ColdFusion script and then modified the heck out of it as an experiment. I came up with a version that I used to highlight difference in longer strings of text (1 or 2 paragraphs). With a maxoffset of anywhere from 10 to 50 it would recognize spelling differences as well as minor insertions/deletions. It would highlight the differences by inserting HTML tags into the strings for display in a web page as well as reporting the difference count and percentage. The function is nowhere as small and fast as SIFT3 was, but since I based my work off of SIFT3 I wanted to be fair and let you know.

    If you want to see it, I have a small example of it along with the ColdFusion code posted on my site so I could share it with people. I credited your name and blog in the comments.

    http://www.bradwood.com/string_compare/


    Thanks!

    ~Brad

    ReplyDelete
  23. You are most welcome. I really appreciate you leaving this comment. I am glad I helped.

    ReplyDelete
  24. Thank you; an interesting piece of code. I expect to test/use it in near future and report on my experience.

    Simon

    ReplyDelete
  25. Sorry, I did not test it much, cannot judge how good it is. (You left a message on my blog asking for feedback, so here it is.)

    ReplyDelete
  26. Hi Siderite!

    Your 'Sift3' algortihm is just what I was looking for -- thank you very much for having made it! I immediately made a Java port of it for my project, where I was stuck with the traditionel and FAR too slow O(n^2) dynamic programming algorithm for string distance calculation. Sift3 was WAY faster and well suited to my kind of data, i.e. inaccurately OCR'ed text. However, studying Sift3 a bit deeper, I came to wonder why it makes use of just a single position pointer 'c' for both strings being compared, instead of two pointers "c1" and "c2", one for each string. The single-postition-pointer approach of Sift3 implies a fairly small limit (i.e., maxOffset) on the distance between equivalent indices in the strings. Why this limit? I wrote a two-position-pointers version of Sift3; 96,75% of the time it yields the same results as Sift3 with my data, and it is somewhat faster than Sift3, too. Here's the Java code (don't know how to make it look more readable in this forum; sorry!):

    public static int sift3Plus(CharSequence s1, CharSequence s2, int maxOffset) {
    final int s1Length = s1.length(), s2Length = s2.length();
    if(s1 == null || s1Length == 0)
    return((s2 == null || s2Length == 0) ? 0 : s2Length);
    if(s2 == null || s2Length == 0)
    return(s1Length);
    int c1 = 0;
    int c2 = 0;
    int lcs = 0;
    while((c1 < s1Length) && (c2 < s2Length)) {
    if((d = s1.charAt(c1)) == s2.charAt(c2))
    lcs++;
    else {
    for(int i = 1; i < maxOffset; i++) {
    if((c1 + i < s1Length) && ((d = s1.charAt(c1+i)) == s2.charAt(c2))) {
    c1 += i;
    break;
    }
    if((c2 + i < s2Length) && ((d = s1.charAt(c1)) == s2.charAt(c2+i))) {
    c2 += i;
    break;
    }
    }
    }
    c1++;
    c2++;
    }
    return((s1Length + s2Length)/2 - lcs);
    }

    Thanks again,
    Mikkel

    ReplyDelete
  27. It is a matter of semantics rather than code.

    The c pointer shows a general position from which offsets are computed. Assuming both strings are identical, then the c pointer just moves along, while the offsets are preserved. If they are not, offsets are changed to skip the differences, then c advances as before.

    The algorithm changes a little when no match is found. In that case the pointer is advanced to a position where there is a match between any two letters.

    You are correct that using two pointers would speed things up a little, but also you would find a problem when trying to reset the counter to a previous position.

    It's a matter of personal choice, I think.

    ReplyDelete
  28. Thanks for replying!

    Indeed it's a matter of semantics: Why would you want to reset, at any time, the position pointer to a previous position? I mean, once the algorithm has matched the character indexed 'c1' in the one string with the character indexed 'c2' in the other string, it seems weird, philosophically, to allow the algorithm to try, at some point, to re-match either c1 or c2 with yet another index -- which Sift3's resetting mechanism does. After all, any index of the one string should correspond at most to one index of the other one. This is why I consider a two-pointers version of the algorithm more straightforward.

    ReplyDelete
  29. Because the algorithm always looks at the first match for one of the letter at the current position. If a match is found, it moves the offset accordingly, up to maxOffset. If then it finds no match, it should start with the characters it skipped when it moved the offset.

    For something like AXXXXB and XXXXAB, first it matches A, then if must reset the offset to continue with B.

    But thinking of this I just realized the algorihtm has a small bug. If the offset is large, it might reach the end of the string. In that case it will not reset the offsets.

    ReplyDelete
  30. The two-pointer version has no need to do any such resetting to match A with A and then B with B in such a pair of strings. With >>A1234B<< and >>5678AB<<, c1 and c2 are first 0 and 4, respectively; both pointers are then incremented, to 1 and 5, and finally c1 is set to 5 as well. Cf. this "spypointish" output from a test run:

    Sift3Plus comparing >>A1234B<<, >>5678AB<<:
    c1 is 0, c2 is 0
    >>A1234B<<[0]=A doesn't match >>5678AB<<[0]=5
    -- >>A1234B<<[1]=1 doesn't match >>5678AB<<[0]=5
    -- >>A1234B<<[0]=A doesn't match >>5678AB<<[1]=6
    -- >>A1234B<<[2]=2 doesn't match >>5678AB<<[0]=5
    -- >>A1234B<<[0]=A doesn't match >>5678AB<<[2]=7
    -- >>A1234B<<[3]=3 doesn't match >>5678AB<<[0]=5
    -- >>A1234B<<[0]=A doesn't match >>5678AB<<[3]=8
    -- >>A1234B<<[4]=4 doesn't match >>5678AB<<[0]=5
    -- >>A1234B<<[0]=A matches >>5678AB<<[4]=A
    c1 is 0, c2 is 4
    c1 is 1, c2 is 5
    >>A1234B<<[1]=1 doesn't match >>5678AB<<[5]=B
    -- >>A1234B<<[2]=2 doesn't match >>5678AB<<[5]=B
    -- >>A1234B<<[3]=3 doesn't match >>5678AB<<[5]=B
    -- >>A1234B<<[4]=4 doesn't match >>5678AB<<[5]=B
    -- >>A1234B<<[5]=B matches >>5678AB<<[5]=B
    c1 is 5, c2 is 5
    LCS is AB

    Here it's always a matter of incrementing the position pointers, never one of resetting them to previous values.

    ReplyDelete
  31. What about AxxxB and xBAxxx ? wouldn't the incrementing of the A pointer prohibit the matching of B?

    ReplyDelete
  32. It sure would, it sure would. But why would you want to match the Bs when they aren't on the same side of the matched As? If we imagine that A and B have been switched, a string distance algorithm should certainly impose a penalty for this. Also, if we consider the algorithm an LCS finder, the two strings have no LCS "AB", just the LCS "A" (or "B").

    ReplyDelete
  33. The story of the algorithm is pretty funny. It started with removing the identical letters at the same index, then shifting the strings with an offset, then repeating the operation, until a number of operations has been completed. Hence the name of "sift", as the strings where moving back and forth.

    Sift2 and Sift3 are natural evolutions of the algorithm, by optimizing the code, not the concept. Later on, someone wrote about it belonging to the class of LCS algorithms, so I changed the sense of a variable and renamed it to lcs, as someone might want to compute the LCS and there is a clear formula relating distance and lcs.

    Now, I also remember (it was a long time ago after all) that one version of the algorithm used two pointers and no offset.

    In other words, I didn't invent this based on some previous work or on any knowledge of string distance algorithms, it's something I did out of my own head. Also the "logic" of the algo came only later, when I tried to explain to myself why it worked. So it's all a little backwards :)

    You may be right about possible optimizations, but right now I don't have the time to work on it. I am interested, though, as this is one of my soul children. So I will give a complete new analysis, maybe in another blog entry, when I can find a spare moment.

    Again, thank you for your interest and please let me know about any other ideas that you have about Sift. After all, Sift4 is long overdue :)

    ReplyDelete
  34. Thanks for your post!
    I converted it to Python: http://pastebin.com/iDUN0G1G

    Sadly, it is just slightly faster than my Damerau Levenshtein implementation...

    ReplyDelete
  35. What maxOffset did you use? Do you think there is a performance issue in the algorithm?

    ReplyDelete
  36. I have a suggestion, you could cache the length of s1 and s2. I don't know what the performance is for the len function in Python.

    ReplyDelete
  37. Damn it is so fast. It is not very accurate though. Or maybe the interpretation of Accuracy is different :)

    ReplyDelete
  38. :-P Perhaps. It was designed to find typos, not compare long texts and it performed well in (and was tuned by) tests on randomly generated strings.

    You use it to find if strings are basically the same with small changes rather than a perfect distance score.

    What sort of metric did you have in mind? Maybe a discussion on the desired functionality of the algorithm might help improve it.

    ReplyDelete
  39. I converted it to PHP code:

    function sift3Plus($s1, $s2, $maxOffset) {
    $s1Length = strlen($s1); $s2Length = strlen($s2);
    if (empty($s1)) return (empty($s2) ? 0 : $s2Length);
    if (empty($s2)) return $s1Length;
    $c1 = $c2 = $lcs = 0;

    while (($c1 < $s1Length) && ($c2 < $s2Length)) {
    if (($d = $s1{$c1}) == $s2{$c2}) $lcs++;
    else {
    for ($i = 1; $i < $maxOffset; $i++) {
    if (($c1 + $i < $s1Length) && (($d = $s1{$c1 + $i}) == $s2{$c2})) {
    $c1 += $i;
    break;
    }
    if (($c2 + $i < $s2Length) && (($d = $s1{$c1}) == $s2{$s2 + $i})) {
    $c2 += $i;
    break;
    }
    }
    }
    $c1++;
    $c2++;
    }
    return (($s1Length + $s2Length) / 2 - $lcs);
    }


    echo sift3Plus('nechtan', 'nectan', 4); // returns 1.5;

    ReplyDelete
  40. Thank you, Diogo! I will update the post. Please let me know how you used the algorithm and if it worked for you.

    ReplyDelete
  41. Here is a simple version of this script in Python: https://gist.github.com/3067867

    ReplyDelete
  42. Thanks, Fernando! I've updated the post with your code.

    ReplyDelete
  43. Thanks you for this algorithm.

    There is just an error in PHP code.
    $s2{$s2 + $i} should be $s2{$c2 + $i}

    ReplyDelete
  44. Thank you for the bug report! I am updating the post.

    ReplyDelete
  45. Hi Siderite - thank you for this algorithm. I hope you don't mind, but I have cited you and your algorithm in my Dissertation, and I have implemented it in MYSQL in a system I am writing (I hope that's OK?)

    Here is the MYSQL implementation for your blog. Once again - thank you.

    /***************************************** FN_Sift3Distance **************************************/

    /*6000 is guessed as the upper limit size of a FONT LIST or MIME INFO - either way, its enough for a fingerprint*/

    DROP FUNCTION IF EXISTS FN_Sift3Distance;

    $$CREATE FUNCTION `FN_Sift3Distance`(s1 varchar(6000), s2 varchar(6000), maxOffset int)

    RETURNS float



    BEGIN

    DECLARE s1LEN INT;

    DECLARE s2LEN INT;

    DECLARE currPos INT;

    DECLARE matchCnt INT;

    DECLARE wrkPos INT;

    DECLARE s1Offset INT;

    DECLARE s1Char CHAR;

    DECLARE s1Pos INT;

    DECLARE s1Dist INT;

    DECLARE s2Offset INT;

    DECLARE s2Char CHAR;

    DECLARE s2Pos INT;

    DECLARE s2Dist INT;

    DECLARE returnVar FLOAT;



    SET s1LEN = LENGTH(s1), s2LEN = LENGTH(s2), maxOffset = IFNULL(maxOffset,2);

    SET s1Offset = 0, s2Offset = 0, matchCnt=0, currPos=0;



    WHILE ( (currPos+s1Offset < s1LEN) AND (currPos+s2Offset < s2LEN) ) DO

    SET wrkPos = currPos + 1;

    IF ( SUBSTRING(s1,wrkPos+s1Offset,1) = SUBSTRING(s2,wrkPos+s2Offset,1) ) THEN

    SET matchCnt = matchCnt + 1;

    ELSE

    SET s1Offset = 0;

    SET s2Offset = 0;

    SET s1Char = SUBSTRING(s1,wrkPos,1), s2Char=SUBSTRING(s2,wrkPos,1);

    SET s1Pos = LOCATE(s2Char,s1,wrkPos)-1, s2Pos = LOCATE(s1Char,s2,wrkPos)-1;

    SET s1Dist = s1Pos-currPos, s2Dist = s2Pos-currPos;

    IF (s1Pos > 0 AND (s1Dist <= s2Dist OR s2Pos < 1) AND s1Dist < maxOffset) THEN

    SET s1Offset = (s1Pos-wrkPos) + 1;

    ELSEIF (s2Pos > 0 AND (s2Dist < s1Dist OR s1Pos < 1) AND s2Dist < maxOffset) THEN

    SET s2Offset= (s2Pos-wrkPos) + 1;

    END IF;

    END IF;

    SET currPos = currPos + 1;

    END WHILE;



    SET returnVAR = (s1LEN + s2LEN) / 2.0 - matchCnt;

    RETURN returnVAR;



    END$$







    /***************************************** START FN_Sift3Similarity **************************************/

    DROP FUNCTION IF EXISTS FN_Sift3Similarity;

    $$CREATE FUNCTION `FN_Sift3Similarity`(s1 varchar(6000), s2 varchar(6000), maxOffset int)

    RETURNS float



    BEGIN



    DECLARE distance float;

    DECLARE maxLen int;

    DECLARE returnVar float;



    SET distance= FN_Sift3Distance(s1,s2,maxOffset);

    SET maxLen = CASE WHEN LENGTH(s1) > LENGTH(s2) THEN

    LENGTH(s1)

    ELSE LENGTH(s2)

    END;



    SET returnVar = CASE WHEN maxLen = 0 THEN

    1

    ELSE

    (1- (distance / maxLen)) * 100

    END;

    RETURN returnVar;

    END$$

    /***************************************** END FN_Sift3Similarity **************************************/


    ReplyDelete
  46. Of course it's OK, buddy! Just tell me how it worked out in the end.

    ReplyDelete
  47. hello there...
    Looking for a very fast fuzzy compare function I stepped on yours.
    I found it really great BUT I could not find things like clermond ferrand using the captcha string clermon feran.
    After some research I found that it was because your formula was not taking into account the length of the two strings in the results...
    This can be easily corrected just by dividing the result by the sum of the lentgh of the chains...
    The exact formula would be :
    (length(string1) +length(string2)-matchcount)/(length(string1) +length(string2))...
    this will give you a number betwen 0 and 1 that will give 0 if the two string are exactly the same and one if they are 'totaly' different (not a char in common)...
    This would be something like a probabilty distance creating a cloud arround the string...
    you could use 1 minus the values if you want a fuzzy logic paramter. It would give you 1 if the two chains are exactly the same...
    I thinkt this formula will bring a great improvement over you algorythm and that it is giving it more sense... Just try it and see...
    If you need more explanation, you can join me on this email : catodique@hotmail.com (Carmelo Guarneri ).

    ReplyDelete
  48. sorry, this email will be better.
    email : catodique@gmail.com (Carmelo Guarneri ).

    ReplyDelete
  49. You are describing a similarity function, one that is computed from a distance function. For example the similarity formula I use when having the distance is 1-distance/max(l1,l2). You propose (l1 +l2-matchcount)/(l1+l2) which is 1-matchcount/(l1+l2) which is similar (pardon the pun).

    A distance is a length, a similarity is a percentage and there should be some relationship between them.

    I can quote (formula used with current variables) from http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Relation_to_other_problems: "The edit distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion, is: l1+l2-2*LCS" which also inspired me to refactor Sift to use the ending formula as it is now.

    ReplyDelete
  50. Hello, thanks much for the algorithm. Switching from levenshtein distance to it caused my application to run 100 times faster.

    I can provide ruby version of the code.


    By the way, do you have any standard example strings pairs for testing purpose?

    ReplyDelete
  51. Thanks, Victor!
    I don't have standard strings because I am lazy :) I tested it with randomly created strings - which I know don't really test anything - and with the data in different projects where I needed the algorithm.

    Incidentally I am now researching string kernels and something called http://en.wikipedia.org/wiki/Ukkonen%27s_algorithm. I don't know your specific requirements, though.

    ReplyDelete
  52. A java version , thanks for the algorithm

    public class SIFT3 extends StringDistance {
    private final static byte CASE_BITWISE_MASK = (byte)0x20;
    private final static float CASE_PUNISHMENT = 0.05f;
    private static SIFT3 instance;

    private SIFT3(){
    }

    public final static SIFT3 getInstance(){
    if(null==instance){
    instance = new SIFT3();
    }
    return instance;
    }

    @Override
    public float computeDistance(String s1, String s2 , int maxOffset) {
    if(null==s1){
    return null==s2?0:s2.length();
    }
    else if(null==s2){
    return null==s1?0:s1.length();
    }

    boolean isCaseDiff = false;

    int l1 = s1.length();
    int l2 = s2.length();

    int c = 0;
    int offset1 = 0;
    int offset2 = 0;
    float lcs = 0f;

    while((c + offset1 < l1) && (c + offset2 < l2)){
    if ((s1.charAt(c + offset1)|CASE_BITWISE_MASK) == (s2.charAt(c + offset2)|CASE_BITWISE_MASK)){
    lcs++;
    if(!isCaseDiff && s1.charAt(c + offset1)!=s2.charAt(c + offset2)){
    isCaseDiff=true;
    }
    }
    else{
    offset1 = 0;
    offset2 = 0;

    for (int i = 0;i< maxOffset;i++){
    if ((c + i < l1) && ((s1.charAt(c + i)|CASE_BITWISE_MASK) == (s2.charAt(c)|CASE_BITWISE_MASK))){
    offset1 = i;
    if(!isCaseDiff && s1.charAt(c + i)!=s2.charAt(c)){
    isCaseDiff=true;
    }
    break;
    }

    if ((c + i < l2) && ((s1.charAt(c)|CASE_BITWISE_MASK) == (s2.charAt(c + i)|CASE_BITWISE_MASK))){
    offset2 = i;
    if(!isCaseDiff && s1.charAt(c)!=s2.charAt(c+i)){
    isCaseDiff=true;
    }
    break;
    }
    }
    }
    c++;
    }

    return (l1+l2)/2.0f - lcs + (isCaseDiff?CASE_PUNISHMENT:0);
    }

    @Override
    public float computeSimilarity(String s1, String s2 , int maxOffset){
    if(null==s1){
    return null==s2?0:1f/s2.length();
    }
    else if(null==s2){
    return null==s1?0:1f/s1.length();
    }

    return 1.0f-computeDistance(s1,s2,maxOffset) / (float)Math.max(s1.length(), s2.length());
    }
    }

    ReplyDelete
  53. Thank you! But you added some extra stuff to it. It would be nice if you could explain it to others.

    ReplyDelete
  54. Hie Siderite.... I am thinking of using the MySQL implementation of your algorithm for my Honours project. However I don't understand what max offset is and thus I can't make use of it....

    The algorithms is as follows:
    DELIMITER $$

    CREATE DEFINER=`root`@`localhost` FUNCTION `FN_Sift3Similarity`(s1 text, s2 text, maxOffset int) RETURNS float
    BEGIN



    DECLARE distance float;

    DECLARE maxLen int;

    DECLARE returnVar float;



    SET distance= FN_Sift3Distance(s1,s2,maxOffset);

    SET maxLen = CASE WHEN LENGTH(s1) > LENGTH(s2) THEN

    LENGTH(s1)

    ELSE LENGTH(s2)

    END;



    SET returnVar = CASE WHEN maxLen = 0 THEN 1

    ELSE(1- (distance / maxLen)) * 100

    END;

    RETURN returnVar;

    END


    and the second one i found is :
    DELIMITER $$

    CREATE DEFINER=`root`@`localhost` FUNCTION `FN_Sift3Distance`(s1 text, s2 text, maxOffset int) RETURNS float
    BEGIN

    DECLARE s1LEN INT;

    DECLARE s2LEN INT;

    DECLARE currPos INT;

    DECLARE matchCnt INT;

    DECLARE wrkPos INT;

    DECLARE s1Offset INT;

    DECLARE s1Char CHAR;

    DECLARE s1Pos INT;

    DECLARE s1Dist INT;

    DECLARE s2Offset INT;

    DECLARE s2Char CHAR;

    DECLARE s2Pos INT;

    DECLARE s2Dist INT;

    DECLARE returnVar FLOAT;



    SET s1LEN = LENGTH(s1), s2LEN = LENGTH(s2), maxOffset = IFNULL(maxOffset,2);

    SET s1Offset = 0, s2Offset = 0, matchCnt=0, currPos=0;



    WHILE ( (currPos+s1Offset < s1LEN) AND (currPos+s2Offset < s2LEN) ) DO

    SET wrkPos = currPos + 1;

    IF ( SUBSTRING(s1,wrkPos+s1Offset,1) = SUBSTRING(s2,wrkPos+s2Offset,1) ) THEN

    SET matchCnt = matchCnt + 1;

    ELSE

    SET s1Offset = 0;

    SET s2Offset = 0;

    SET s1Char = SUBSTRING(s1,wrkPos,1), s2Char=SUBSTRING(s2,wrkPos,1);

    SET s1Pos = LOCATE(s2Char,s1,wrkPos)-1, s2Pos = LOCATE(s1Char,s2,wrkPos)-1;

    SET s1Dist = s1Pos-currPos, s2Dist = s2Pos-currPos;

    IF (s1Pos > 0 AND (s1Dist <= s2Dist OR s2Pos < 1) AND s1Dist < maxOffset) THEN

    SET s1Offset = (s1Pos-wrkPos) + 1;

    ELSEIF (s2Pos > 0 AND (s2Dist < s1Dist OR s1Pos < 1) AND s2Dist < maxOffset) THEN

    SET s2Offset= (s2Pos-wrkPos) + 1;

    END IF;

    END IF;

    SET currPos = currPos + 1;

    END WHILE;



    SET returnVAR = (s1LEN + s2LEN) / 2.0 - matchCnt;

    RETURN returnVAR;



    END

    ReplyDelete
  55. MaxOffset is a value that tells Sift how many position to search for a character. In my usual dealing I used the value of 5. You can use whatever you want. The algorithm will be slower as the value is higher. You can consider MaxOffset as the length of what you would call a word in a sentence.

    For example you have strings: abcdef and 12345abcdef. If MaxOffset is 5, then the algorithm will search for 'a' at the first position then advance and look again 5 times. It will find 'a' in the second string and conclude that the strings are different only in '12345'. If MaxOffset is 4, it will stop before finding 'a' and conclude the strings are very different.

    ReplyDelete
  56. Thank you so much Siderite. That was helpful :)

    ReplyDelete
  57. It's me again with the MySQL implementation... The results returned are a kind of far from accurate. Could you please look at it and suggest how best I can make it more efficient?. I posted the code in the post I was asking what offset is.

    Thank you for your assistance!

    ReplyDelete
  58. I've checked the MySQL implementation and it should work as well as the T-Sql one. I checked a few values for that and it seemed OK.

    If you could, give me some test strings and tell me the results you were expecting and the results you got.

    I've updated the post with a sift vs Levenstein calculator.

    ReplyDelete
  59. Hi. Good work.

    Sift3B gives the same result for
    ("a", "xx")
    and for
    ("a", "xa")
    like the strings of the second one have nothing alike,
    but then it gives a better match for
    ("a", "ax")

    In my implementation I put
    lcs += 1 / (i + 1);
    before the breaks

    ReplyDelete
  60. The Sift implementation in the post is normal Sift3. Sift3B is an experiment, really.

    Anyway, Sift is a little fuzzy. The algorithm's focus is performance. For tiny strings like that, performance is not an issue, obviously.

    I will look into that specific case, if I find the time, though.

    ReplyDelete
  61. Well, I understand not wanting to decrease the performance, but this doesn't do any noticeable damage.
    It would only do something extra once it had found a match and was ready to exit the loop.

    I used tiny strings for clarity, and because sift3Distance, as used in the post for testing, limits maxOffset to 5.

    If you try, with sift3Distance or sift3B,
    ("ABCDE", "xxxAxxxBxxxCxxxDxxxE", 20)
    and
    ("ABCDE", "xxxxxxxxxxxxxxxxxxxx", 20)
    both result are the same,
    both are completely different strings.
    But by increasing lcs a little bit when matches are found by moving the offset the first case would be a little less than totally different.

    I use lcs += 1 / (i + 1) because the farther away the less it increases lcs, but it always increases something, as opposed to nothing for the first or only character found in another part of the string.

    Anyway, with a little tweaking, I'm liking Sift more than Levenshtein.
    Thanks.

    ReplyDelete
  62. Looked over it and I realize what the problem is. In the for loop, when finding a match and setting offset1 or offset2, I don't increment lcs. If I do that, the result is the same no matter the first or second letter (x or a). However, the resulting lcs value is now a lot more optimistic than it should be. A cost of changing the offsets should be applied.

    I've reached the same formula as you in the first tries, even attempted to use lcs+=1-(i/maxOffset); and divided by 1.75, but I don't think it is really the correct approach and it is certainly not very scientific. I think lcs should be incremented conditionally, but I haven't found the correct condition yet.

    ReplyDelete
  63. Hi to all!
    I tried every implementation of the Sif3 algorithm that is present in this blog and in the discussion on the strings "mug" against "muggle" and "T EY1 B AH0 L" against "L EY1 B AH0 L".

    Well, for the first pair, the every implementation of the function returns me 1.5, that is what I expect, even if I variate a lot the maxOffset parameter (up to 1000). For the second pair the situation is different, and I really can't get the rid of it: the fucntion(s) returns my 1.0 for values of maxOffset ranging from 0 to 12 (that is the length of the strings-1), that is what I expect. For values from 13 up to 1000 (and maybe more), it returns me 13...that is strange, especially if compared with what happens for the first pair. Any suggestion?

    Thank you!

    ReplyDelete
  64. First of all, check out Sift4, it is much more well thought of.

    The explanation for your problem is the letter L, found at the beginning of one word and the end of the second. So the algorithm tries for 13 characters to find a match, it matches the end L, and then assumes that the rest of the string (what it jumped) is totally different and adds it to the distance - remember, I traded accuracy for speed, the algorithm is not recursive.

    The more precise Sift4 should take that into consideration and give you a better result.

    ReplyDelete