Monday 29 December 2014

The Martian, by Andy Weir

The Martian is a short and easy-to-read book about a guy who is accidentally abandoned on Mars. Andy Weir writes most of the book as the astronaut's log entries, but in a colloquial and funny way. I started reading the book because a lot of people praised it and there is also a Ridley Scott movie being made from it. I hope it won't suck (*cough*Interstellar*cough*).

What I found really appealing is not only the subject matter and all the science and engineering involved, but also the positive attitude of the main character in the face of adversity. All that science and engineering was cool too, though :) and presented in an easy-to-digest way (there are no equations anywhere :-P). After a while it becomes difficult to suspend disbelief, since there is accident after accident and Mars is really painted as the bad guy, trying to kill the protagonist, yet he always finds a way out of the problem at hand. I mean, if they make the movie follow the book, they need Matt Damon in it just because he has all that Bourne training and he'll need it to survive the set. Yet one cannot help rooting for Mark Watney anyway. There are some politics involved as well, but not that much; basically NASA is presented as an organization of scientists who want to get the job done, even if some are more cautious than others.

In summary, I think this is a book that any space geek should read. I finished it in two days, so it's not really a waste of anyone's time.

Wednesday 24 December 2014

Harry Potter and the Methods of Rationality, by Eliezer S. Yudkowsky

Merry Christmas and a Happy New Year, all!

I thought I would write this post as a Christmas present and let you know about this very cool book. I've read the Harry Potter books, so I could appreciate this one more, as it is fan fiction; however, it can be read on its own. My recommendation, though, is to know at least something about the Harry Potter universe (like having watched the movies).

That being said, imagine that Harry Potter had been a slightly different person, a combination of Sherlock Holmes and Richard Feynman, maybe. Highly intelligent, having read a lot of really serious books and having understood and integrated them into his own philosophy, this Harry Potter goes to the Hogwarts school of magic for two reasons: to apply the scientific method to magic and, having discovered its mysteries, to take over the world and stave off death.

Imagine what such a character would do with the stern and moralizing lectures of Dark Ages tutors and you can see why this book is really, really funny. But don't take it all as satire. The references, both serious and in jest, are real. The teachings and methods applied are real. All in all, of every book about child and young-adult heroes that I've read so far, this one presents the best role model yet! And I include Ender's Game here (which is also referenced in the book when Harry's adoptive father compares the two - hilarious).

I would like to say that I've finished the book, but I haven't actually done that. That is because the book is not something put on paper and sold by a publisher; instead it is an ongoing story offered freely on a blog-like site. Yes, you can read it online. I have read all that has been written so far and, if you consider the parallel universe of the original Harry Potter books, we are about a book and a half in.
Update: I have actually finished the book. :( The author finished writing it after 122 chapters. The ending was pretty cool, too, but I really wanted more. He writes: "This is the end of Harry Potter and the Methods of Rationality. I will write no sequel myself; I have said what I set out to say, and it is done. You have my enthusiastic consent to write within this universe yourself, if you wish."

So, the bottom line is that I love this book, even if it is a little inconsistent, in the sense that the style and the ideas do not keep the same rhythm throughout. You can read it at its completely free site: HPMOR. Its author, Eliezer S. Yudkowsky, is a scholar of human rationality and artificial intelligence. I don't know much about him, but let me tell you that, after reading Harry Potter and the Methods of Rationality, I am eager to familiarize myself with his work. I highly recommend this epic undertaking, which probably started as a satire, until its characters gained enough consistency to define their own solid and inspiring story.

Monday 8 December 2014

A technically free Internet

Right now there is a battle raging on that few of us are aware of. It is for one thing only, and that is control over the Internet: control over communications between people, whether it is a discussion about a two-tiered Internet, one free and one paid, or a ban instituted by one government or another on sites that are considered bad for you. It started as it usually does, with governments and corporations trying to get as much of the pie as possible. Only something was different: the Internet is so basic, so flexible, that the companies regulating its use and owning the hardware it runs on cannot control its flow or its direction. And as great strides have been made by intelligence and commercial entities alike to control the content and track the use, equally great strides have been made by individuals to conceal the use and escape monitoring and censorship. The biggest and most touted mechanism that allows anonymity on the Internet is called TOR, The Onion Router, and its concept is simple: encrypt all communications and randomly route requests through the TOR nodes so that the origin of the access is next to impossible to find. There are other, less known methods of doing this, but TOR is the most used and the best known. It is mostly used as a proxy to anonymize normal Internet access, though, and very few people actually use TOR to access TOR services only.

I am here to tell you that, first, TOR is not enough and, second, no other software will ever be enough for this kind of use. You see, the TOR nodes I was talking about are people using TOR on their computers and allowing other people to access the "normal" Internet through them. A lot of the TOR exit nodes that form the border between the anonymous TOR world and the transparent Internet are heavily monitored by everyone interested, if not actually run by them from the beginning. Like in the old example where the FBI was running an IP anonymizing proxy, those exit points are the weak spot of the TOR network. Another flaw is the fact that it works as a proxy for normal IP protocols. Some software (Bittorrent, for example) openly sends the originating IP in its data, so it doesn't matter if you go through TOR to download stuff: your IP is still there for the world to see. Since you cannot trust all the software that runs on your computer, you cannot completely trust using TOR as a proxy for anonymous Internet access.

The solution, I believe, is to implement the anonymizing and encryption features in the Internet itself. Make it so that there is no address for any of its users, or if there is one, make it something temporary, assigned for one connection or another, that can be easily recreated and changed. Do it in such a manner that no one will be able to control the DNS servers and the naming schemes, so that you can call your web site whatever you want, not have to pay for it, and be able to host it without broadcasting to the world where you are. The problems in implementing this are major, but not insurmountable. One of them is that encryption and complicated routing significantly increase access times. However, given the speed of the Internet today, that is not really a big problem anymore.

My thesis is that if freedom of speech, true freedom of speech, is implemented in a technical way, unbiased by any other rule than that you are free to communicate without fear, then no amount of intimidation will be able to break it. As always when human politics have encroached on the territory of personal freedom, the solution is usually technical, at least since Gutenberg made his printing press and probably way before that.

I am myself not skilled enough to think of all the aspects of such a new protocol for the Internet. I am also pretty sure that opposition to any attempt to build it will be huge. But what if we, technical people, got together and made this work? Borrowing parts from the enormously successful TOR, Bittorrent and Bitcoin, we could architect freedom rather than just talk about it in the context of some war or another. Think about it the next time when, in your free country, you get arrested for saying what you believe in, or for sharing what you know, or when you try to access a site and find that it is not there anymore, not for you at least.

Saturday 6 December 2014

Good and bad political TV shows


Last year there were three very good US political shows: Homeland, of course, then The Americans, which has two KGB agents pretending to be US citizens as its main protagonists, as well as The Assets, which was not that good, but was about Aldrich Ames, the infamous CIA officer who sold secrets to the Russians. All of these shows presented various intelligence services doing their best, and in good conscience, to further the interests of their countries. Motivated people, some you might love, some you might hate, but all doing things for the right reasons. Unfortunately, this year seems to be plagued by disgustingly propagandistic shows like Madam Secretary and State of Affairs, bent on showing the US as spotless white knights and their enemies as faceless evil villains.

Both seemingly want to put forward strong female characters with real power and responsibility, but they do it in a sledgehammer way that makes me cringe. Madam Secretary is about an ex-CIA analyst, now a university intellectual, who becomes the US Secretary of State after an unforeseen accident befalls the sitting secretary. Played by the beautiful Téa Leoni, she presides over an entire US administration presented as a bunch of do-gooders, plagued by the moral consequences of using their power and often having to sidestep their conscience in order to save the world from the bogeyman. Not only a powerful woman at work, she is also a mother, having to take care of her daughter and solve family problems together with her teacher husband. The entire image portrayed by the series is so artificial that you can actually see the pink dripping from where it was forcefully painted over all the black bits.

State of Affairs just started. The lead of the series is Grey's Anatomy star Katherine Heigl, also playing a CIA woman, one with a direct professional and personal connection to the US president, who is a black woman. Her fiancé was killed right in front of her by terrorists. He was none other than the president's son. In the first three episodes she has to make decisions to thwart the efforts of: Arab terrorists abducting American doctors and threatening them with decapitation, Russian submarines that steal American secrets by tapping undersea fiber optics, and Boko Haram terrorists kidnapping schoolgirls. Meanwhile she is being torn by the fact that the guy who killed her fiancé was a former asset of hers. She doesn't tell the president because... she would break her oaths to the CIA. The show has some darkness in it, but it is also artificial, as some hidden entity has proof that would incriminate her and a shady character who might be good or bad or both is circling like a vulture. Soon enough we'll discover her father is somehow involved and her fiancé is not dead and he is actually her brother or something.

To its credit, even after exhausting the original material, Homeland is not necessarily worse. There is also another show which I cannot yet say if I like or not, called The Honourable Woman. Great cast: Maggie Gyllenhaal, Stephen Rea (who is not a werewolf in this one... yet :D ) and others. It is a US/UK co-production, really atmospheric, really dark, but also a bit slow and obtuse. I may have had my brain affected by the previous shows, so I can't yet appreciate the value of this one. It seems to mix real politics with some really heavy stuff like assassinations, the economic battlefront between Israel and Palestine, arms smuggling, etc.

The reason why I wrote this post is that sometimes I feel like TV shows follow the politics of the moment too closely, so closely in fact that many a time they seem to be slightly ahead of it. The bad guys are entities that are not yet enemies of the US, or that behave in worse ways than their real-life counterparts; the people in charge often have to bend the rules in order to remain the good guys, even if officially, in reality, those people have not (yet) bent those rules, and so on. Unlike war movies that try to erase or at least erode the moral debt of nations and people involved in past wars, I feel there are now films and series that inject the apology before the act has even been committed. In a period when the US intelligence apparatus is being attacked from all sides by news reports of its activities, here they are, all these stories about the good old CIA, run by intellectual and beautiful women of power who maintain their morality like the good mothers of the nation that they are. Am I paranoid here?

Monday 10 November 2014

Super Fast and Accurate string distance algorithm: Sift4

Warning: the algorithm works perfectly well and is better than Sift3; however, you might want to consider it beta, as I am still looking for better implementation solutions that might change the structure of the code.

Try the JavaScript implementation on the original blog page, where an interactive demo lets you pick an algorithm (Levenshtein, Sift3 or Sift4), set the MaxOffset and compare two strings to see the resulting distance.


Update 28 Mar 2015: I've changed the algorithm significantly. The transpositions are now computed differently and the cost of a transposition in the final result is 1, rather than 0.5. Also, while I think a value of 1 is better conceptually, I noticed that Sift4 approximates Levenshtein a little better when the cost of a transposition is either 2 or a function depending on the offset difference between c2 and c1, especially when maxOffset grows. This can now be changed via the new options function transpositionCostEvaluator. The problem I am having now is more false positives when the letters/tokens of the two strings are the same, but their positions are jumbled differently. With small maxOffset values, like 5 or 10, the result is much better than Sift3; however, when maxOffset grows, lots of matches can be found and the cost of transpositions becomes very important.
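
For illustration, here is a minimal sketch of plugging a custom transposition cost into the extended version published below (longerTranspositionsAreMoreCostly is one of the sample evaluators included with the source; the strings are arbitrary):

// a sketch: make transpositions across larger offsets cost more
var distance = sift4('abcdef', 'badcfe', 10, {
    transpositionCostEvaluator: longerTranspositionsAreMoreCostly
});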

Update 27 Mar 2015: Thanks to Emanuele Bastianelli, who discovered a bug that appeared in an edge case, I've updated the algorithms. Now, at the end of the while loop, there is an extra check to prevent the algorithm from exiting prematurely, before computing the remaining tokens.

A really long time ago I wrote the third version of Sift, the string distance algorithm. It so happens that I am going to give a small presentation, here in Ispra, about this algorithm, so I had the opportunity to review it. I found some inconsistencies and I actually did some research in the field that gave me more ideas. So before giving the presentation I thought of publishing what I think is the fourth version. What's new:
  • 33% more accurate
  • three different variants: simple, common and general
  • new concepts added
  • support for custom value and matching functions, different tokenizer functions, etc.
  • actually tested with a (slightly more) serious test
  • more robust, working better for large values of maxOffset

Before I get into the details, I am publishing the algorithm here for the moment, no CodePlex or PasteBin or GitHub or whatever. Also, it is written in Javascript now, with the C# and T-SQL versions pending. Of course, it would be great if, as before, the community of people using the algorithm would implement it in various programming languages; however, I am a bit apprehensive because, more often than not, people came up with their own improvements or interpretations when translating the algorithm into another language. But support is always welcome!

I created a test that used random strings, but also a huge list of commonly used English phrases, as well as mutations on these strings, adding or removing small bits and so on. I then implemented Sift3, Levenshtein and the new algorithm and computed the error distance between the Levenshtein distance and the two Sift variants. This permitted me to see how the error evolves when changing the algorithm and the parameters. One thing I noticed is that when increasing the maxOffset value to large values like 15 or 20, the accuracy of Sift3 was going down. Also, as pointed out by one commenter on the Sift3 post, there are cases where Sift3(a,b) is different from Sift3(b,a). These are edge cases, but this one in particular grated on me.

After implementing Sift4, I can now tell you that the simple version is slightly better than Sift3 for small maxOffset values like 5, and the gap grows as the value increases. The common version is a bit more complex, but the error decreases by 33% and remains low for large maxOffset values. The extended or general version receives an options object that can change almost everything, but the most important piece is the tokenizer function. Imagine that you want to compute the distance based not on letters, but on n-grams (groups of n characters). Or that you want to compare strings by the words in the text, maybe even their synonyms. This can all be achieved just by changing the tokenizer function. The other parameters involve defining what it means for two tokens to match and what the value of their match is, etc.

One of the new concepts implemented is taken from the Jaro distance. Jaro seems a lot like Sift in the way it considers two characters to match if they are in close proximity. Also, if "the streams cross", like 'ab' vs 'ba', it considers them transpositions and removes some of their value from the distance. Actually, if I look at the implementation, it might be that I have independently discovered the Jaro distance. I will research this further. I don't know if the transposition calculation is optimal. At the moment it uses an array of all matches found up to a point, clearing it of values as the cursors move along the strings. The difference between the simple and the common versions of Sift4 is that the simple version does not compute transpositions at all and has no concept of maxDistance. In that respect it is a slightly fixed-up Sift3.

Another new concept added is that of the local substring. Imagine that the Largest Common Subsequence that Sift is actually trying to find, in order to determine the distance, is made of substrings separated by non-matching characters. Each of these substrings can be used to improve the distance function. For example, one could argue that 'abcdex' is closer to 'abcde' than 'abcxde' is, because even if the largest common subsequence is 5 in both cases, the largest common substring is 5 for the first string and only 3 for the second. The extended version of the algorithm allows for changing the value of each substring individually.
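
As a sketch of what this looks like with the extended version below (rewardLengthEvaluator is one of the sample localLengthEvaluators included at the end of the source; exact distances also depend on rounding and the other parameters):

// reward longer contiguous substrings when accumulating the lcss value
var opts = { localLengthEvaluator: rewardLengthEvaluator };
var d1 = sift4('abcde', 'abcdex', 5, opts); // one local substring of length 5, evaluated to ~4.83
var d2 = sift4('abcde', 'abcxde', 5, opts); // two local substrings of lengths 3 and 2, evaluated to ~2.75 + ~1.67 = ~4.42
// the lower evaluated lcss of the second pair pushes its distance up relative to the first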

Well, here they are, the three versions. The extended version has some examples at the end for possible parameters.

Simplest Sift4:
// Sift4 - simplest version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
function sift4(s1, s2, maxOffset) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.max(c1, c2); //using max to bypass the need for computing transpositions ('ab' vs 'ba')
            }
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i;
                    local_cs++;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c2 += i;
                    local_cs++;
                    break;
                }
            }
        }
        c1++;
        c2++;
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss);
}
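
A few quick sanity checks for the simplest version (results verified by hand-tracing the code above):

sift4('hello', 'hello', 5); // 0 - identical strings
sift4('hello', '', 5);      // 5 - the distance to an empty string is the other string's length
sift4('hello', 'helo', 5);  // 1 - one deletion, found within the maxOffset window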

Common Sift4:
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of characters to search for matching letters
// maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4(s1, s2, maxOffset, maxDistance) {
    if (!s1 || !s1.length) {
        if (!s2) {
            return 0;
        }
        return s2.length;
    }

    if (!s2 || !s2.length) {
        return s1.length;
    }

    var l1 = s1.length;
    var l2 = s2.length;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (s1.charAt(c1) == s2.charAt(c2)) {
            local_cs++;
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans++;
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans++;
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += local_cs;
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (maxDistance) {
            var temporaryDistance = Math.max(c1, c2) - lcss + trans;
            if (temporaryDistance >= maxDistance) return Math.round(temporaryDistance);
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += local_cs;
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += local_cs;
    return Math.round(Math.max(l1, l2) - lcss + trans); //add the cost of transpositions to the final result
}

Extended/General Sift4:
// Sift4 - extended version
// online algorithm to compute the distance between two strings in O(n)
// maxOffset is the number of positions to search for matching tokens
// options: the options for the function, allowing for customization of the scope and algorithm:
//         maxDistance: the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
//         tokenizer: a function to transform strings into vectors of tokens
//        tokenMatcher: a function to determine if two tokens are matching (equal)
//        matchingEvaluator: a function to determine the way a token match should be added to the local_cs. For example a fuzzy match could be implemented.
//        localLengthEvaluator: a function to determine the way the local_cs value is added to the lcss. For example longer continuous substrings could be awarded.
//        transpositionCostEvaluator: a function to determine the value of an individual transposition. For example longer transpositions should have a higher cost.
//        transpositionsEvaluator: a function to determine the way the total cost of transpositions affects the final result
// the options can and should be implemented at a class level, but this is the demo algorithm
function sift4(s1, s2, maxOffset, options) {

    options = extend(options, {
        maxDistance: null,
        tokenizer: function (s) { return s ? s.split('') : []; },
        tokenMatcher: function (t1, t2) { return t1 == t2; },
        matchingEvaluator: function (t1, t2) { return 1; },
        localLengthEvaluator: function (local_cs) { return local_cs; },
        transpositionCostEvaluator: function (c1, c2) { return 1; },
        transpositionsEvaluator: function (lcss, trans) { return lcss - trans; }
    });

    var t1 = options.tokenizer(s1);
    var t2 = options.tokenizer(s2);

    var l1 = t1.length;
    var l2 = t2.length;

    if (l1 == 0) return l2;
    if (l2 == 0) return l1;

    var c1 = 0; //cursor for string 1
    var c2 = 0; //cursor for string 2
    var lcss = 0; //largest common subsequence
    var local_cs = 0; //local common substring
    var trans = 0; //number of transpositions ('ab' vs 'ba')
    var offset_arr = []; //offset pair array, for computing the transpositions

    while ((c1 < l1) && (c2 < l2)) {
        if (options.tokenMatcher(t1[c1], t2[c2])) {
            local_cs += options.matchingEvaluator(t1[c1], t2[c2]);
            var isTrans = false;
            //see if current match is a transposition
            var i = 0;
            while (i < offset_arr.length) {
                var ofs = offset_arr[i];
                if (c1 <= ofs.c1 || c2 <= ofs.c2) {
                    // when two matches cross, the one considered a transposition is the one with the largest difference in offsets
                    isTrans = Math.abs(c2 - c1) >= Math.abs(ofs.c2 - ofs.c1);
                    if (isTrans) {
                        trans += options.transpositionCostEvaluator(c1, c2);
                    } else {
                        if (!ofs.trans) {
                            ofs.trans = true;
                            trans += options.transpositionCostEvaluator(ofs.c1, ofs.c2);
                        }
                    }
                    break;
                } else {
                    if (c1 > ofs.c2 && c2 > ofs.c1) {
                        offset_arr.splice(i, 1);
                    } else {
                        i++;
                    }
                }
            }
            offset_arr.push({
                c1: c1,
                c2: c2,
                trans: isTrans
            });
        } else {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            if (c1 != c2) {
                c1 = c2 = Math.min(c1, c2); //using min allows the computation of transpositions
            }
            //if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
            //so that we can have only one code block handling matches
            for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) {
                if ((c1 + i < l1) && options.tokenMatcher(t1[c1 + i], t2[c2])) {
                    c1 += i - 1;
                    c2--;
                    break;
                }
                if ((c2 + i < l2) && options.tokenMatcher(t1[c1], t2[c2 + i])) {
                    c1--;
                    c2 += i - 1;
                    break;
                }
            }
        }
        c1++;
        c2++;
        if (options.maxDistance) {
            var temporaryDistance = options.localLengthEvaluator(Math.max(c1, c2)) - options.transpositionsEvaluator(lcss, trans);
            if (temporaryDistance >= options.maxDistance) return Math.round(temporaryDistance);
        }
        // this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
        if ((c1 >= l1) || (c2 >= l2)) {
            lcss += options.localLengthEvaluator(local_cs);
            local_cs = 0;
            c1 = c2 = Math.min(c1, c2);
        }
    }
    lcss += options.localLengthEvaluator(local_cs);
    return Math.round(options.localLengthEvaluator(Math.max(l1, l2)) - options.transpositionsEvaluator(lcss, trans)); //add the cost of found transpositions
}

function extend(obj, def) {
    var result = {};
    for (var prop in def) {
        if (!obj || !obj.hasOwnProperty(prop)) {
            result[prop] = def[prop];
        } else {
            result[prop] = obj[prop];
        }
    }
    return result;
}

// possible values for the options
// tokenizers:
function nGramTokenizer(s, n) { //tokenizer:function(s) { return nGramTokenizer(s,2); }
    var result = [];
    if (!s) return result;
    for (var i = 0; i <= s.length - n; i++) {
        result.push(s.substr(i, n));
    }
    return result;
}

function wordSplitTokenizer(s) { //tokenizer:wordSplitTokenizer
    if (!s) return [];
    return s.split(/\s+/);
}

function characterFrequencyTokenizer(s) { //tokenizer:characterFrequencyTokenizer (letters only)
    var result = [];
    for (var i = 0; i <= 25; i++) {
        var val = 0;
        if (s) {
            for (var j = 0; j < s.length; j++) {
                var code = s.charCodeAt(j);
                if (code == i + 65 || code == i + 97) val++;
            }
        }
        result.push(val);
    }
    return result;
}

//tokenMatchers:
function sift4TokenMatcher(t1, t2) { //tokenMatcher:sift4TokenMatcher
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity > 0.7;
}

//matchingEvaluators:
function sift4MatchingEvaluator(t1, t2) { //matchingEvaluator:sift4MatchingEvaluator
    var similarity = 1 - sift4(t1, t2, 5) / Math.max(t1.length, t2.length);
    return similarity;
}

//localLengthEvaluators:
function rewardLengthEvaluator(l) {
    if (l < 1) return l;  //0 -> 0
    return l - 1 / (l + 1); //1 -> 0.5, 2 -> 1.67, 9 -> 8.9 (approaches the full length for longer substrings)
}

function rewardLengthEvaluator2(l) {
    return Math.pow(l, 1.5); // 0 -> 0, 1 -> 1, 2 -> 2.83, 10 -> 31.62
}

//transpositionCostEvaluators:
function longerTranspositionsAreMoreCostly(c1, c2) {
    return Math.abs(c2 - c1) / 9 + 1;
}

As always, I will be most happy to hear whether you used my algorithm and how it performed, and to receive any suggestions you might have.


Here is some explanation for the options of the general algorithm.

It no longer searches for characters, but for tokens. That is why the default tokenizer function splits the values into characters, so that the algorithm works on an array of one-character-long tokens. Other options are possible, like splitting the strings by whitespace so that the comparisons are done on words, or transforming a string into an array of strings N characters long, the so-called N-grams. The tokenizer can be anything, like the characterFrequencyTokenizer, which turns each string into an array of 26 values representing the number of occurrences of each letter a-z.
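
For instance, a hypothetical n-gram comparison using the nGramTokenizer sample above:

// compare strings as sequences of bigrams instead of single characters
var distance = sift4('sift four', 'sift for', 5, {
    tokenizer: function (s) { return nGramTokenizer(s, 2); }
});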

The tokenMatcher function returns true if two tokens match. They can be fuzzy matched; for example, the sift4TokenMatcher uses Sift inside Sift to determine the character distance between two tokens and returns true if they match more than 70%.

The matchingEvaluator is a function that returns the value that will be added to the "common substring" length value when two tokens match. The default is 1, but one can use some other metric, like the similarity, for example. Of course, the common substring length loses its meaning when these functions change, but the variable local_cs is still used.

The localLengthEvaluator takes the length value of the local common substring and returns a value that will be added to the longest common subsequence value. Usually it returns the same value as the one provided, but some functions could reward longer substrings.
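
Putting the options together, a hypothetical call to the extended version might look like this (wordSplitTokenizer and sift4TokenMatcher are the sample functions from the source above; the strings are arbitrary):

// compare phrases word by word, with fuzzy matching between words
var distance = sift4('the quick brown fox', 'the quikc brown fox', 5, {
    tokenizer: wordSplitTokenizer,
    tokenMatcher: sift4TokenMatcher
});
// 'quikc' fuzzily matches 'quick' (character similarity 0.8, above the 0.7 threshold), so the distance stays low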

FAQ:

Q: Can you make Sift4 work case-insensitively?
A: Just turn the strings to lower or upper case before you compare them. Since this algorithm is more general, the concept of 'case' might not apply to every kind of token.
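
For example, a minimal wrapper:

// a trivial case-insensitive wrapper over the character-based version
function sift4CaseInsensitive(s1, s2, maxOffset) {
    return sift4((s1 || '').toLowerCase(), (s2 || '').toLowerCase(), maxOffset);
}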

Q: Can you make Sift4 compare strings based on their meaning, like using synonyms?
A: Use a tokenizer function that splits the strings into words, then replaces each word with the first of its synonyms. A more complex solution would require analysing the strings beforehand and turning them into some ordered form of synonyms or equivalent expressions, then using Sift4 with a word tokenizer (one is provided in the Extended algorithm source).
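
A sketch of such a tokenizer (the synonym map is a made-up example):

// hypothetical synonym normalization before comparison
var synonyms = { fast: 'quick', speedy: 'quick', rapid: 'quick' };
function synonymTokenizer(s) {
    if (!s) return [];
    return s.toLowerCase().split(/\s+/).map(function (w) {
        return synonyms[w] || w;
    });
}
// sift4('a fast fox', 'a speedy fox', 5, { tokenizer: synonymTokenizer }) then treats 'fast' and 'speedy' as the same token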

Q: I need an implementation for this programming language, can you help?
A: I can, but I might not have the time. Ask anyway, maybe I can be persuaded :)

Q: I have been using Sift3 until now, how do I upgrade to Sift4?
A: The best way I can think of is to implement Sift4 Simplest, as it needs only the Sift3 code and some minor changes. Since you never needed tokens before, I doubt you need them now. But if you do, I can help, see the above question.

Q: How can I reward you for this fantastic piece of software engineering?
A: While I did this for free and I don't expect to make any money out of it and while this algorithm is completely free to use and change as you see fit, I don't mind having a beer every now and then ;)

Q: Your algorithm really sucks because... reasons.
A: It may. I would be glad to discuss the reasons, though, and try to fix any problem you encounter.

Q: I compared Sift4 with another algorithm that is much more exact and there are differences.
A: Of course; they are different algorithms. This is a fuzzy distance calculator; it doesn't give you the exact value. There are still edge cases. But the idea of Sift is to be fast and relatively accurate, rather than very accurate. If you need more accuracy, try combining Sift with Levenshtein, for example, computing Levenshtein only where Sift says the strings are above a certain similarity.
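
A sketch of that combination (levenshtein() stands for any standard implementation; the threshold is an arbitrary choice):

// use Sift4 as a cheap pre-filter, then confirm with the exact but expensive Levenshtein
function accurateDistance(s1, s2, maxOffset, threshold) {
    var approx = sift4(s1, s2, maxOffset);
    var similarity = 1 - approx / Math.max(s1.length, s2.length);
    if (similarity < threshold) return approx; // too different, the cheap estimate is good enough
    return levenshtein(s1, s2); // similar enough to warrant the exact computation
}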

Q: I want to make maxOffset dependent on the length of the strings compared. Can you do that?
A: That is a perfect example of why maxOffset should be a parameter of the function rather than a member of the class. Since this implementation is, so far, Javascript only, just compute the maxOffset that is convenient to you before you compare.

Q: I want to vary the weight of matches based on the position of the match, for example matches at the beginning of the string could be more valuable than those at the end.
A: The position of the match is indeed not sent to the functions that can be specified in the options object of the Sift4 Extended, but that can be trivially changed in the code. I don't think this particular request is very common, though, and I prefer to keep it out of the published implementation to make the code easier to understand.

Q: I found a bug!
A: Let me know and I will try to fix it.

Q: If you need to compare large lists of strings, it is better to precompute some things, like specific hashes or suffix trees, etc. This will speed up the comparison tremendously!
A: Sift is what is called an online algorithm. It does not precompute anything, it just gets the two strings and the parameters for its functioning and returns the distance. You are correct in what you are saying, but that kind of solution is not in the scope of Sift, at least not version 4.

Q: What are the edge cases for Sift?
A: There are probably several, but I didn't really spot them. One of them is that one might find both letters at a position matching letters at other positions, but only one will count. Example: 'abxx' and 'bayy'. The algorithm will look at position 0, find no match, then try to find the closest match for each letter. Starting with position 0 in the first string, it will find 'a' matched at position 1 in the second. It will increase both cursors and lcss will be increased as well. The next check will be 'b', the character at position 1 in the first string, matched against position 2 in the second string. No match, therefore both cursors will be reset to 1 and the search starts again. The 'b' match is lost and the distance is 3 instead of 2. Also, I think there might be some situations where the cursors are not equal and the bigger of them reaches the end of its string, thus terminating the algorithm even though there could have been more matches. Incidentally, I tried to fix both these issues and the error relative to Levenshtein was not really affected, but I am not 100% sure of the implementation.

Q: The algorithm continues to be asymmetric, Sift4(s1,s2) can be different from Sift4(s2,s1).
A: Yes. This is one of the artifacts of the linear nature of the algorithm. There is a symmetric function, namely Math.min(Sift4(a,b), Sift4(b,a)), but it is twice as slow, obviously.
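
As a snippet:

// a symmetric (and twice as slow) wrapper, as described above
function sift4Symmetric(s1, s2, maxOffset) {
    return Math.min(sift4(s1, s2, maxOffset), sift4(s2, s1, maxOffset));
}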

Implementations in other languages:

C# implementation
public class Sift4
{
private Options _options;

public Sift4(Options options)
{
if (options == null) options = new Options();
if (options.Tokenizer == null)
{
options.Tokenizer = (s) => s == null
? new string[0]
: s.ToCharArray().Select(c => c.ToString()).ToArray();
}
if (options.TokenMatcher == null)
{
options.TokenMatcher = (t1, t2) => t1 == t2;
}
if (options.MatchingEvaluator == null)
{
options.MatchingEvaluator = (t1, t2) => 1;
}
if (options.LocalLengthEvaluator == null)
{
options.LocalLengthEvaluator = (l) => l;
}
if (options.TranspositionCostEvaluator == null)
{
options.TranspositionCostEvaluator = (c1, c2) => 1;
}
if (options.TranspositionsEvaluator == null)
{
options.TranspositionsEvaluator = (l, t) => l - t;
}
_options = options;
}

/// <summary>
/// General distance algorithm uses all the parameters in the options object and works on tokens
/// </summary>
/// <param name="s1"></param>
/// <param name="s2"></param>
/// <param name="maxOffset"></param>
/// <returns></returns>
public double GeneralDistance(string s1, string s2, int maxOffset)
{
var t1 = _options.Tokenizer(s1);
var t2 = _options.Tokenizer(s2);

var l1 = t1.Length;
var l2 = t2.Length;

if (l1 == 0) return _options.LocalLengthEvaluator(l2);
if (l2 == 0) return _options.LocalLengthEvaluator(l1);

var c1 = 0;  //cursor for string 1
var c2 = 0;  //cursor for string 2
var lcss = 0.0;  //largest common subsequence
var local_cs = 0.0; //local common substring
var trans = 0.0;  //cost of transpositions ('axb' vs 'xba')
var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

while ((c1 < l1) && (c2 < l2))
{
if (_options.TokenMatcher(t1[c1], t2[c2]))
{
local_cs += _options.MatchingEvaluator(t1[c1], t2[c2]);
var isTransposition = false;
var op = offset_arr.First;
while (op != null)
{  //see if current match is a transposition
var ofs = op.Value;
if (c1 <= ofs.C1 || c2 <= ofs.C2)
{
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
if (isTransposition)
{
trans += _options.TranspositionCostEvaluator(c1, c2);
}
else
{
if (!ofs.IsTransposition)
{
ofs.IsTransposition = true;
trans += _options.TranspositionCostEvaluator(ofs.C1, ofs.C2);
}
}
break;
}
else
{
var next_op = op.Next;
if (c1 > ofs.C2 && c2 > ofs.C1)
{
offset_arr.Remove(op);
}
op = next_op;
}
}
offset_arr.AddLast(new OffsetPair(c1, c2)
{
IsTransposition = isTransposition
});
}
else
{
lcss += _options.LocalLengthEvaluator(local_cs);
local_cs = 0;
if (c1 != c2)
{
c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
}
//if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches 
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
{
if ((c1 + i < l1) && _options.TokenMatcher(t1[c1 + i], t2[c2]))
{
c1 += i - 1;
c2--;
break;
}
if ((c2 + i < l2) && _options.TokenMatcher(t1[c1], t2[c2 + i]))
{
c1--;
c2 += i - 1;
break;
}
}
}
c1++;
c2++;
if (_options.MaxDistance != null)
{
var temporaryDistance = _options.LocalLengthEvaluator(Math.Max(c1, c2)) - _options.TranspositionsEvaluator(lcss, trans);
if (temporaryDistance >= _options.MaxDistance) return Math.Round(temporaryDistance, MidpointRounding.AwayFromZero);
}
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if ((c1 >= l1) || (c2 >= l2))
{
lcss += _options.LocalLengthEvaluator(local_cs);
local_cs = 0;
c1 = c2 = Math.Min(c1, c2);
}
}
lcss += _options.LocalLengthEvaluator(local_cs);
return Math.Round(_options.LocalLengthEvaluator(Math.Max(l1, l2)) - _options.TranspositionsEvaluator(lcss, trans), MidpointRounding.AwayFromZero); //apply transposition cost to the final result
}

/// <summary>
/// Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
/// </summary>
/// <param name="s1"></param>
/// <param name="s2"></param>
/// <param name="maxOffset"></param>
/// <param name="maxDistance"></param>
/// <returns></returns>
public static double CommonDistance(string s1, string s2, int maxOffset, int maxDistance = 0)
{
var l1 = s1 == null ? 0 : s1.Length;
var l2 = s2 == null ? 0 : s2.Length;

if (l1 == 0) return l2;
if (l2 == 0) return l1;

var c1 = 0;  //cursor for string 1
var c2 = 0;  //cursor for string 2
var lcss = 0;  //largest common subsequence
var local_cs = 0; //local common substring
var trans = 0;  //number of transpositions ('axb' vs 'xba')
var offset_arr = new LinkedList<OffsetPair>();  //offset pair array, for computing the transpositions

while ((c1 < l1) && (c2 < l2))
{
if (s1[c1] == s2[c2])
{
local_cs++;
var isTransposition = false;
var op = offset_arr.First;
while (op != null)
{  //see if current match is a transposition
var ofs = op.Value;
if (c1 <= ofs.C1 || c2 <= ofs.C2)
{
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTransposition = Math.Abs(c2 - c1) >= Math.Abs(ofs.C2 - ofs.C1);
if (isTransposition)
{
trans++;
}
else
{
if (!ofs.IsTransposition)
{
ofs.IsTransposition = true;
trans++;
}
}
break;
}
else
{
var next_op = op.Next;
if (c1 > ofs.C2 && c2 > ofs.C1)
{
offset_arr.Remove(op);
}
op = next_op;
}
}
offset_arr.AddLast(new OffsetPair(c1, c2)
{
IsTransposition = isTransposition
});
}
else
{
lcss += local_cs;
local_cs = 0;
if (c1 != c2)
{
c1 = c2 = Math.Min(c1, c2);  //using min allows the computation of transpositions
}
//if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches 
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++)
{
if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
{
c1 += i - 1;
c2--;
break;
}
if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
{
c1--;
c2 += i - 1;
break;
}
}
}
c1++;
c2++;
if (maxDistance > 0)
{
var temporaryDistance = Math.Max(c1, c2) - (lcss - trans);
if (temporaryDistance >= maxDistance) return temporaryDistance;
}
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if ((c1 >= l1) || (c2 >= l2))
{
lcss += local_cs;
local_cs = 0;
c1 = c2 = Math.Min(c1, c2);
}
}
lcss += local_cs;
return Math.Max(l1, l2) - (lcss - trans); //apply transposition cost to the final result
}

/// <summary>
/// Standard Sift algorithm, using strings and taking only maxOffset as a parameter
/// </summary>
/// <param name="s1"></param>
/// <param name="s2"></param>
/// <param name="maxOffset"></param>
/// <returns></returns>
public static int SimplestDistance(string s1, string s2, int maxOffset)
{
var l1 = s1 == null ? 0 : s1.Length;
var l2 = s2 == null ? 0 : s2.Length;

if (l1 == 0) return l2;
if (l2 == 0) return l1;

var c1 = 0;  //cursor for string 1
var c2 = 0;  //cursor for string 2
var lcss = 0;  //largest common subsequence
var local_cs = 0; //local common substring

while ((c1 < l1) && (c2 < l2))
{
if (s1[c1] == s2[c2])
{
local_cs++;
}
else
{
lcss += local_cs;
local_cs = 0;
if (c1 != c2)
{
c1 = c2 = Math.Max(c1, c2);
}
//if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches 
for (var i = 0; i < maxOffset && (c1 + i < l1 || c2 + i < l2); i++) //use || as in the JavaScript version: keep searching while either cursor can advance
{
if ((c1 + i < l1) && s1[c1 + i] == s2[c2])
{
c1 += i - 1;
c2--;
break;
}
if ((c2 + i < l2) && s1[c1] == s2[c2 + i])
{
c1--;
c2 += i - 1;
break;
}
}
}
c1++;
c2++;
}
lcss += local_cs;
return Math.Max(l1, l2) - lcss;
}

private class OffsetPair
{
public int C1 { get; set; }
public int C2 { get; set; }
public bool IsTransposition { get; set; }

public OffsetPair(int c1, int c2)
{
this.C1 = c1;
this.C2 = c2;
this.IsTransposition = false;
}
}

public class Options
{
/// <summary>
/// If set, the algorithm will stop if the distance reaches this value
/// </summary>
public int? MaxDistance { get; set; }

/// <summary>
/// The function that turns strings into a list of tokens (also strings)
/// </summary>
public Func<string, string[]> Tokenizer { get; set; }

/// <summary>
/// The function that determines if two tokens are matching (similar to characters being the same in the simple algorithm)
/// </summary>
public Func<string, string, bool> TokenMatcher { get; set; }

/// <summary>
/// The function that determines the value of a match of two tokens (the equivalent of adding 1 to the lcss when two characters match)
/// This assumes that the TokenMatcher function is a lot less expensive than this evaluator. If that is not the case, 
/// you can optimize the speed of the algorithm by using only the matching evaluator and then calculating if two tokens match on the returned value.
/// </summary>
public Func<string, string, double> MatchingEvaluator { get; set; }

/// <summary>
/// Determines if the local value (computed on subsequent matched tokens) must be modified.
/// In case one wants to reward longer matched substrings, for example, this is what you need to change.
/// </summary>
public Func<double, double> LocalLengthEvaluator { get; set; }

/// <summary>
/// The function determining the cost of an individual transposition, based on its counter positions.
/// </summary>
public Func<int, int, double> TranspositionCostEvaluator { get; set; }

/// <summary>
/// The function determining how the cost of transpositions affects the final result
/// Change it if it doesn't suit you.
/// </summary>
public Func<double, double, double> TranspositionsEvaluator { get; set; }
}
}

PHP implementation
// Sift4 - common version
// online algorithm to compute the distance between two strings in O(n)
// $maxOffset is the number of characters to search for matching letters
// $maxDistance is the distance at which the algorithm should stop computing the value and just exit (the strings are too different anyway)
function sift4($s1, $s2, $maxOffset, $maxDistance = 0) {
if (!$s1 || !strlen($s1)) {
if (!$s2) {
return 0;
}
return strlen($s2);
}

if (!$s2 || !strlen($s2)) {
return strlen($s1);
}

$l1 = strlen($s1);
$l2 = strlen($s2);

$c1 = 0; //cursor for string 1
$c2 = 0; //cursor for string 2
$lcss = 0; //largest common subsequence
$local_cs = 0; //local common substring
$trans = 0; //number of transpositions ('ab' vs 'ba')
$offset_arr = array(); //offset pair array, for computing the transpositions
while (($c1 < $l1) && ($c2 < $l2)) {
if (substr($s1, $c1, 1) == substr($s2, $c2, 1)) {
$local_cs++;
$isTrans = false;
$i = 0;
while ($i < sizeof($offset_arr)) { //see if current match is a transposition
$ofs = $offset_arr[$i];
if ($c1 <= $ofs['c1'] || $c2 <= $ofs['c2']) {
$isTrans = abs($c2 - $c1) >= abs($ofs['c2'] - $ofs['c1']);
if ($isTrans) {
$trans++;
} else {
if (!$offset_arr[$i]['trans']) {
$offset_arr[$i]['trans'] = true; //write back into the array; $ofs is only a copy in PHP
$trans++;
}
}
break;
} else {
if ($c1 > $ofs['c2'] && $c2 > $ofs['c1']) {
array_splice($offset_arr, $i, 1);
} else {
$i++;
}
}
}
array_push($offset_arr, array('c1' => $c1, 'c2' => $c2, 'trans' => $isTrans));
} else {
$lcss += $local_cs;
$local_cs = 0;
if ($c1 != $c2) {
$c1 = $c2 = min($c1, $c2); //using min allows the computation of transpositions
}
//if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches
for ($i = 0; $i < $maxOffset && ($c1 + $i < $l1 || $c2 + $i < $l2); $i++) {
if (($c1 + $i < $l1) && (substr($s1, $c1 + $i, 1) == substr($s2, $c2, 1))) {
$c1 += $i - 1;
$c2--;
break;
}
if (($c2 + $i < $l2) && (substr($s1, $c1, 1) == substr($s2, $c2 + $i, 1))) {
$c1--;
$c2 += $i - 1;
break;
}
}
}
$c1++;
$c2++;
if ($maxDistance) {
$temporaryDistance = max($c1, $c2) - $lcss + $trans;
if ($temporaryDistance >= $maxDistance)
return $temporaryDistance;
}
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if (($c1 >= $l1) || ($c2 >= $l2)) {
$lcss += $local_cs;
$local_cs = 0;
$c1 = $c2 = min($c1, $c2);
}
}
$lcss += $local_cs;
return max($l1, $l2) - $lcss + $trans; //apply transposition cost to final result
}

Thanks to Ferenc Szatmári for corrections in the PHP code

The Simplest and General versions of the algorithm remain as an exercise for you, since I haven't been working in PHP for more than a decade.

T-SQL implementation
---<summary>
---Static distance algorithm working on strings, computing transpositions as well as stopping when maxDistance was reached.
---</summary>
---<param name="s1"></param>
---<param name="s2"></param>
---<param name="maxOffset"></param>
---<param name="maxDistance"></param>
---<returns></returns>
CREATE FUNCTION Sift4Common(@s1          NVARCHAR(max), 
@s2          NVARCHAR(max), 
@maxOffset   INT, 
@maxDistance INT) RETURNS INT 
AS 
BEGIN 
DECLARE @l1 INT = Len(ISNULL(@s1, '')); 
DECLARE @l2 INT = Len(ISNULL(@s2, '')); 

IF ( @l1 = 0 ) 
RETURN @l2; 

IF ( @l2 = 0 ) 
RETURN @l1; 

DECLARE @c1 INT = 0; 
DECLARE @c2 INT = 0; 
DECLARE @lcss INT = 0; 
DECLARE @local_cs INT = 0; 
DECLARE @trans INT = 0; 
DECLARE @offset_arr TABLE 
( 
C1 INT, 
C2 INT,
Trans BIT
) 
DECLARE @i INT
DECLARE @temporaryDistance FLOAT
DECLARE @result INT
DECLARE @oc1 INT, @oc2 INT, @otr BIT
DECLARE @isTrans BIT

WHILE ( ( @c1 < @l1 ) 
AND ( @c2 < @l2 ) ) 
BEGIN 
IF ( Substring(@s1, @c1 + 1, 1) = Substring(@s2, @c2 + 1, 1) ) 
BEGIN 
SET @local_cs=@local_cs + 1; 
SET @isTrans=0
SET @oc1=NULL;
SELECT TOP 1 @oc1=o.C1,@oc2=o.C2,@otr=o.Trans
FROM @offset_arr o 
WHERE @c1 <= o.C1 OR @c2 <= o.C2
IF (@oc1 IS NOT NULL)
BEGIN
SET @isTrans=CASE WHEN ABS(@c2-@c1)>=ABS(@oc2-@oc1) THEN 1 ELSE 0 END
IF (@isTrans=1)
BEGIN
SET @trans=@trans+1
END
ELSE
IF (@otr=0)
BEGIN
SET @trans=@trans+1
UPDATE @offset_arr SET Trans=1 WHERE C1=@oc1 AND C2=@oc2
END
END

DELETE FROM @offset_arr 
WHERE  @c1 > C1 
AND @c1 > C2
AND @c2 > C1 
AND @c2 > C2; 

INSERT INTO @offset_arr 
VALUES     (@c1, @c2, @isTrans); 
END 
ELSE 
BEGIN 
SET @lcss = @lcss + @local_cs; 
SET @local_cs = 0; 

IF ( @c1 != @c2 ) 
-- using min allows the computation of transpositions 
BEGIN 
IF ( @c1 < @c2 ) 
BEGIN 
SET @c2=@c1; 
END 
ELSE 
BEGIN 
SET @c1=@c2; 
END 
END 

--if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
--so that we can have only one code block handling matches  
SET @i=0; 

WHILE ( @i < @maxOffset AND (@c1 + @i < @l1 OR @c2 + @i < @l2) ) 
BEGIN 
IF ( ( @c1 + @i < @l1 ) 
AND Substring(@s1, @c1 + @i + 1, 1) = 
Substring(@s2, @c2 + 1, 1) 
) 
BEGIN 
SET @c1 = @c1 + @i - 1; 
SET @c2=@c2 - 1; 

BREAK; 
END 

IF ( ( @c2 + @i < @l2 ) 
AND Substring(@s1, @c1 + 1, 1) = Substring(@s2, 
@c2 + @i + 1, 1 
) 
) 
BEGIN 
SET @c1 = @c1 - 1; 
SET @c2=@c2 + @i - 1; 

BREAK; 
END 

SET @i=@i + 1; 
END 
END 

SET @c1=@c1 + 1; 
SET @c2=@c2 + 1; 

IF ( @maxDistance > 0 ) 
BEGIN 
IF ( @c1 > @c2 ) 
BEGIN 
SET @temporaryDistance = @c1 - ( @lcss - @trans ); -- transposition cost is 1, matching the final result below
END 
ELSE 
BEGIN 
SET @temporaryDistance = @c2 - ( @lcss - @trans ); -- transposition cost is 1, matching the final result below
END 

IF ( @temporaryDistance >= @maxDistance ) 
RETURN Round(@temporaryDistance, 0); 
END 
IF ( ( @c1 >= @l1 ) OR ( @c2 >= @l2 ) )
BEGIN
SET @lcss = @lcss + @local_cs; 
SET @local_cs = 0; 

IF ( @c1 < @c2 ) 
BEGIN 
SET @c2=@c1; 
END 
ELSE 
BEGIN 
SET @c1=@c2; 
END 
END 
END 

SET @lcss = @lcss + @local_cs; 

--apply the transposition cost to the final result
IF ( @l1 > @l2 ) 
BEGIN 
SET @result = @l1 - (@lcss - @trans);
END 
ELSE 
BEGIN 
SET @result = @l2 - (@lcss - @trans);
END 

RETURN @result; 
END 

Clearly a general version of the algorithm is not possible in Transact SQL.

Here is a MySQL version, gracefully provided by Ferenc Szatmári:
BEGIN 
DECLARE l1 INT DEFAULT Length(IFNULL(s1, '')); 
DECLARE l2 INT DEFAULT Length(IFNULL(s2, '')); 


DECLARE c1 INT Default 0; 
DECLARE c2 INT default 0; 
DECLARE lcss INT default 0; 
DECLARE local_cs INT default 0; 
DECLARE trans INT default 0; 
DECLARE i INT;
DECLARE temporaryDistance FLOAT;
DECLARE result INT;
DECLARE oc1 INT;
DECLARE oc2 INT;
DECLARE otr smallint;
DECLARE isTrans smallint;

drop temporary table if exists offset_arr;
CREATE TEMPORARY TABLE IF not EXISTS offset_arr
( 
C1 INT, 
C2 INT,
Trans BIT
)engine=memory;


/*      delete from offset_arr;*/


IF l1 = 0 THEN
RETURN l2; 
END IF;
IF l2 = 0 THEN 
RETURN l1; 
END IF;

WHILE ( ( c1 < l1 ) AND ( c2 < l2 ) ) 
DO 
IF ( Substring(s1, c1 + 1, 1) = Substring(s2, c2 + 1, 1) ) THEN

SET local_cs=local_cs + 1; 
SET isTrans=0;
SET oc1=NULL;
SELECT  o.C1, o.C2,o.Trans  into oc1, oc2, otr
FROM offset_arr o 
WHERE c1 <= o.C1 OR c2 <= o.C2
LIMIT 1;
IF oc1 IS NOT NULL THEN

SET isTrans=CASE WHEN ABS(c2-c1)>=ABS(oc2-oc1) THEN 1 ELSE 0 END;
IF (isTrans=1) THEN
SET trans=trans+1;
ELSE
IF (otr=0) THEN


SET trans=trans+1;
UPDATE offset_arr SET Trans=1 WHERE C1=oc1 AND C2=oc2;
END IF;
END IF;  
END IF;


DELETE FROM offset_arr 
WHERE  c1 > C1 
AND c1 > C2
AND c2 > C1 
AND c2 > C2; 


INSERT INTO offset_arr 
VALUES (c1, c2, isTrans); 

ELSE 

SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 != c2 ) THEN
-- using min allows the computation of transpositions 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;
END IF;


/*if matching tokens are found, remove 1 from both cursors (they get incremented at the end of the loop)
--so that we can have only one code block handling matches  */
SET i=0; 


label : WHILE ( i < maxOffset AND (c1 + i < l1 OR c2 + i < l2) ) do


IF ( ( c1 + i < l1 ) 
AND Substring(s1, c1 + i + 1, 1) = Substring(s2, c2 + 1, 1) 
) THEN


SET c1 = c1 + i - 1; 
SET c2=c2 - 1; 


leave label; 
END IF; 


IF ( ( c2 + i < l2 ) 
AND Substring(s1, c1 + 1, 1) = Substring(s2, c2 + i + 1, 1) 
)THEN 


SET c1 = c1 - 1; 
SET c2=c2 + i - 1; 


leave label;  
END IF;


SET i=i + 1; 
END WHILE;
END IF; 


SET c1=c1 + 1; 
SET c2=c2 + 1; 


IF ( maxDistance > 0 ) THEN


IF ( c1 > c2 ) THEN
SET temporaryDistance = c1 - ( lcss - trans ); /* transposition cost is 1, matching the final result below */
ELSE 
SET temporaryDistance = c2 - ( lcss - trans ); /* transposition cost is 1, matching the final result below */
END IF;


IF ( temporaryDistance >= maxDistance ) THEN
RETURN Round(temporaryDistance, 0); 
END IF;  
END IF;
IF ( ( c1 >= l1 ) OR ( c2 >= l2 ) ) THEN
SET lcss = lcss + local_cs; 
SET local_cs = 0; 


IF ( c1 < c2 ) THEN
SET c2=c1; 
ELSE 
SET c1=c2; 
END IF;
END IF;
END while;


SET lcss = lcss + local_cs; 


/*apply the transposition cost to the final result*/
IF ( l1 > l2 ) THEN
SET result = l1 - (lcss - trans);
ELSE 
SET result = l2 - (lcss - trans);
END IF;
RETURN result; 
END



Java implementation
Here is a Java version, gracefully provided by Nathanæl Fischer:
/**
 * Sift4 - common version
 * online algorithm to compute the distance between two strings in O(n)
 * Algorithm by siderite, java port by Nathan Fischer 2016
 * https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html
 * @param s1
 * @param s2
 * @param maxOffset the number of characters to search for matching letters
 * @return
 */
public static double sift4(String s1, String s2, int maxOffset) {
class Offset{
int c1;
int c2;
boolean trans;

Offset(int c1, int c2, boolean trans) {
this.c1 = c1;
this.c2 = c2;
this.trans = trans;
}
}

if(s1 == null || s1.isEmpty())
return s2 == null ? 0 : s2.length();

if(s2 == null || s2.isEmpty())
return s1.length();

int l1=s1.length();
int l2=s2.length();

int c1 = 0;  //cursor for string 1
int c2 = 0;  //cursor for string 2
int lcss = 0;  //largest common subsequence
int local_cs = 0; //local common substring
int trans = 0;  //number of transpositions ('ab' vs 'ba')
LinkedList<Offset> offset_arr=new LinkedList<>();  //offset pair array, for computing the transpositions

while ((c1 < l1) && (c2 < l2)) {
if (s1.charAt(c1) == s2.charAt(c2)) {
local_cs++;
boolean isTrans=false;
//see if current match is a transposition
int i=0;
while (i<offset_arr.size()) {
Offset ofs=offset_arr.get(i);
if (c1<=ofs.c1 || c2 <= ofs.c2) {
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTrans=Math.abs(c2-c1)>=Math.abs(ofs.c2-ofs.c1);
if (isTrans) {
trans++;
} else {
if (!ofs.trans) {
ofs.trans=true;
trans++;
}
}
break;
} else {
if (c1>ofs.c2 && c2>ofs.c1) {
offset_arr.remove(i);
} else {
i++;
}
}
}
offset_arr.add(new Offset(c1, c2, isTrans));
} else {
lcss+=local_cs;
local_cs=0;
if (c1!=c2) {
c1=c2=Math.min(c1,c2);  //using min allows the computation of transpositions
}
//if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches
for (int i = 0; i < maxOffset && (c1+i<l1 || c2+i<l2); i++) {
if ((c1 + i < l1) && (s1.charAt(c1 + i) == s2.charAt(c2))) {
c1+= i-1;
c2--;
break;
}
if ((c2 + i < l2) && (s1.charAt(c1) == s2.charAt(c2 + i))) {
c1--;
c2+= i-1;
break;
}
}
}
c1++;
c2++;
// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
if ((c1 >= l1) || (c2 >= l2)) {
lcss+=local_cs;
local_cs=0;
c1=c2=Math.min(c1,c2);
}
}
lcss+=local_cs;
return Math.round(Math.max(l1,l2)- lcss +trans); //add the cost of transpositions to the final result
}

Powershell implementation of simple Sift4, by Kirk Sayre:
function Calc-Sift4Distance {

<# 
.SYNOPSIS 

Compute the edit distance between 2 strings using the sift4 string
edit distance algorithm.

.PARAMETER s1

The 1st string

.PARAMETER s2

The 2nd string

.PARAMETER maxOffset

The maximum offset at which to search for matching characters. The default
is 10.

.RETURN

The # of edits needed to make the given 2 strings equal.
#>

Param(
[Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
[String]
$s1,

[Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
[String]
$s2,

[Parameter(ValueFromPipelineByPropertyName = $True)]
[Int]
$maxOffset = 10
)

# Handle null or empty strings.
if ((-not $s1) -or ($s1.length -eq 0)) {
if (-not $s2) {
return 0
}
return $s2.length
}

if ((-not $s2) -or ($s2.length -eq 0)) {
return $s1.length
}

# Initialization.
$l1 = $s1.length
$l2 = $s2.length
$c1 = 0 # Cursor for string 1.
$c2 = 0 # Cursor for string 2.
$lcss = 0 # Largest common subsequence.
$local_cs = 0 # Local common substring.

# Scan strings.
while (($c1 -lt $l1) -and ($c2 -lt $l2)) {
if ($s1[$c1] -eq $s2[$c2]) {
$local_cs++
}
else {
$lcss += $local_cs
$local_cs = 0
if ($c1 -ne $c2) {
# Using max to bypass the need for computing transpositions ('ab' vs 'ba').
$c1 = $c2 = (@($c1, $c2) | Measure -Max).Maximum
}

for ($i = 0; (($i -lt $maxOffset) -and ((($c1 + $i) -lt $l1) -or (($c2 + $i) -lt $l2))); $i++) {
if ((($c1 + $i) -lt $l1) -and ($s1[$c1 + $i] -eq $s2[$c2])) {
$c1 += $i
$local_cs++
break
}

if ((($c2 + $i) -lt $l2) -and ($s1[$c1] -eq $s2[$c2 + $i])) {
$c2 += $i
$local_cs++
break
}
}
}
$c1++
$c2++
}
$lcss += $local_cs
return [math]::Round((@($l1, $l2) | Measure -Max).Maximum - $lcss)
}

Thanks to Hugo Amaro, a C++ implementation:
#include <vector>    // std::vector for the offset pair array
#include <cmath>     // std::abs
#include <algorithm> // std::min, std::max

struct sift_offset {
int c1;
int c2;
bool trans;
};
 
template<typename T>
int sift4(T * s1, int s1_size, T * s2, int s2_size, int maxOffset, int maxDistance) {
if (!s1 || !s1_size) {
if (!s2) {
return 0;
}
return s2_size;
}
 
if (!s2 || !s2_size) {
return s1_size;
}
 
int l1 = s1_size;
int l2 = s2_size;
 
int c1 = 0;  //cursor for string 1
int c2 = 0;  //cursor for string 2
int lcss = 0;  //largest common subsequence
int local_cs = 0; //local common substring
int trans = 0;  //number of transpositions ('ab' vs 'ba')
std::vector<sift_offset> offset_arr;  //offset pair array, for computing the transpositions
 
while ((c1 < l1) && (c2 < l2)) {
if (s1[c1] == s2[c2]) {
local_cs++;
bool isTrans = false;
//see if current match is a transposition
int i = 0;
while (i<offset_arr.size()) {
sift_offset ofs=offset_arr[i];
if (c1<=ofs.c1 || c2 <= ofs.c2) {
// when two matches cross, the one considered a transposition is the one with the largest difference in offsets
isTrans=std::abs(c2-c1)>=std::abs(ofs.c2-ofs.c1);
 
if (isTrans)
{
trans++;
}
else
{
if (!ofs.trans) {
ofs.trans = true;
trans++;
}
}
break;
}
else {
if (c1>ofs.c2 && c2>ofs.c1) {
offset_arr.erase(offset_arr.begin()+i);
 
}
else {
i++;
}
}
}
offset_arr.push_back({
c1,
c2,
isTrans
});
}
else {
lcss += local_cs;
local_cs = 0;
if (c1 != c2) {
c1 = c2 = std::min(c1, c2);  //using min allows the computation of transpositions
}
//if matching characters are found, remove 1 from both cursors (they get incremented at the end of the loop)
//so that we can have only one code block handling matches 
for (int i = 0; i < maxOffset && (c1+i<l1 || c2+i<l2); i++) {
if ((c1 + i < l1) && (s1[c1 + i] == s2[c2])) {
c1+= i-1; 
c2--;
break;
}
if ((c2 + i < l2) && (s1[c1] == s2[c2 + i])) {
c1--;
c2+= i-1;
break;
}
}
}
c1++;
c2++;
if (maxDistance)
{
int temporaryDistance = std::max(c1, c2) - lcss + trans;
if (temporaryDistance >= maxDistance) return temporaryDistance;
}

// this covers the case where the last match is on the last token in list, so that it can compute transpositions correctly
// (this check belongs inside the while loop, as in the JavaScript version)
if ((c1 >= l1) || (c2 >= l2))
{
lcss += local_cs;
local_cs = 0;
c1 = c2 = std::min(c1, c2);
}
}
lcss += local_cs;
return std::max(l1, l2) - lcss + trans; //add the cost of transpositions to the final result
}

You can find a Go implementation here, written by Jason W. Hutchinson.
There is also a Swift implementation here.
A Perl 6 implementation can be found here.