Finding the intersection of two large sorted arrays
I am going to discuss in this post an interview question that pops up from time to time. The solution that is usually presented as best is the same, regardless of the inputs. I believe this to be a mistake. Let me explore this with you.
The problem
The problem is simple: given two sorted arrays of very large size, find the most efficient way to compute their intersection (the list of common items in both).
The solution that is given as correct is described here (you will have to excuse its Javiness), for example. The person who provided the answer made a great effort to list various solutions along with their O complexity, and the answer inspires confidence, as coming from someone who knows what they are talking about. But how correct is it? Another blog post describing the problem and hinting at some extra information that might influence the result is here.
Implementation
Let's start with some code:
var rnd = new Random();
var n = 10000000; // 1e+7, so that the generated values still fit into an int
int[] arr1, arr2;
generateArrays(rnd, n, out arr1, out arr2);
var sw = new Stopwatch();
sw.Start();
var count = intersect(arr1, arr2).Count();
sw.Stop();
Console.WriteLine($"{count} intersections in {sw.ElapsedMilliseconds}ms");
Here I am creating two arrays of size n, using a generateArrays method, then I am counting the number of intersections and displaying the time elapsed. In the intersect method I will also count the number of comparisons, so that we avoid for now the complexities of Big O notation (pardon the pun).
As for the generateArrays method, I will use a simple incremented value to make sure the values are sorted, but also randomly generated:
private static void generateArrays(Random rnd, int n, out int[] arr1, out int[] arr2)
{
    arr1 = new int[n];
    arr2 = new int[n];
    int s1 = 0;
    int s2 = 0;
    for (var i = 0; i < n; i++)
    {
        // each value is the previous one plus a random step between 1 and 99
        s1 += rnd.Next(1, 100);
        arr1[i] = s1;
        s2 += rnd.Next(1, 100);
        arr2[i] = s2;
    }
}
Note that n is 1e+7, so that the values fit into an integer. If you try a larger value it will overflow and result in negative values, so the array would not be sorted.
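The overflow limit is easy to check with a bit of arithmetic. This is a minimal sketch, assuming the worst case in which every step is the maximum of 99 (rnd.Next(1, 100) never returns 100). The helper names are mine, and the snippet is written as C# top-level statements so it can run as a single file.

```csharp
using System;

// Hypothetical helpers: the largest value the generator could reach after n steps.
long WorstCaseLastValue(long n, int maxStep) => n * maxStep;
bool FitsInInt32(long n, int maxStep) => WorstCaseLastValue(n, maxStep) <= int.MaxValue;

Console.WriteLine(FitsInInt32(10_000_000, 99));   // True:  990,000,000 is below int.MaxValue
Console.WriteLine(FitsInInt32(100_000_000, 99));  // False: 9,900,000,000 would wrap around
```

Switching the running sums and the arrays to long would lift the limit, at the cost of twice the memory.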
Time to explore ways of intersecting the arrays. Let's start with the recommended implementation:
private static IEnumerable<int> intersect(int[] arr1, int[] arr2)
{
    var p1 = 0;
    var p2 = 0;
    var comparisons = 0;
    while (p1 < arr1.Length && p2 < arr2.Length)
    {
        var v1 = arr1[p1];
        var v2 = arr2[p2];
        comparisons++;
        switch (v1.CompareTo(v2))
        {
            case -1: // v1 is smaller: advance in the first array
                p1++;
                break;
            case 0: // common value: emit it and advance in both arrays
                p1++;
                p2++;
                yield return v1;
                break;
            case 1: // v2 is smaller: advance in the second array
                p2++;
                break;
        }
    }
    Console.WriteLine($"{comparisons} comparisons");
}
Note that I am not counting the comparisons of the two pointers p1 and p2 with the Length of the arrays (which could be optimized by caching the length). They cost just as much as comparing the array values, yet we discount them in the name of calculating a fictitious growth rate complexity. I am going to keep discounting them from here on as well. The optimization of the code itself is not part of this post.
Running the code I get the following output:
19797934 comparisons
199292 intersections in 832ms
The number of comparisons is directly proportional to the value of n, approximately 2n. That is because we walk through all the values in both arrays. If we populate the arrays with odd and even numbers respectively, for example, so that there are no intersections, the number of comparisons will be almost exactly 2n.
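To make the odds-and-evens case concrete, here is a minimal sketch. CountMergeComparisons is a name I made up for the occasion; it does the same two-pointer walk as intersect, but only counts the value comparisons. The snippet is written as C# top-level statements so it can run as a single file. Each comparison advances exactly one pointer and the loop stops as soon as one array is exhausted, so the count comes out at 2n - 1, which is 2n for all practical purposes.

```csharp
using System;
using System.Linq;

// Hypothetical helper: the two-pointer walk, counting only value comparisons.
long CountMergeComparisons(int[] a, int[] b)
{
    long comparisons = 0;
    int p1 = 0, p2 = 0;
    while (p1 < a.Length && p2 < b.Length)
    {
        comparisons++;
        if (a[p1] < b[p2]) p1++;
        else if (a[p1] > b[p2]) p2++;
        else { p1++; p2++; }
    }
    return comparisons;
}

var n = 1000;
var odds = Enumerable.Range(0, n).Select(i => 2 * i + 1).ToArray();  // 1, 3, 5, ...
var evens = Enumerable.Range(0, n).Select(i => 2 * i + 2).ToArray(); // 2, 4, 6, ...
Console.WriteLine(CountMergeComparisons(odds, evens)); // 1999, that is 2n - 1
```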
Experiments
Now let me change the intersect method, make it more general:
private static IEnumerable<int> intersect(int[] arr1, int[] arr2)
{
    var p1 = 0;
    var p2 = 0;
    var comparisons = 0;
    while (p1 < arr1.Length && p2 < arr2.Length)
    {
        var v1 = arr1[p1];
        var v2 = arr2[p2];
        comparisons++;
        switch (v1.CompareTo(v2))
        {
            case -1:
                p1 = findIndex(arr1, v2, p1, ref comparisons);
                break;
            case 0:
                p1++;
                p2++;
                yield return v1;
                break;
            case 1:
                p2 = findIndex(arr2, v1, p2, ref comparisons);
                break;
        }
    }
    Console.WriteLine($"{comparisons} comparisons");
}
private static int findIndex(int[] arr, int v, int p, ref int comparisons)
{
    p++;
    while (p < arr.Length)
    {
        comparisons++;
        if (arr[p] >= v) break;
        p++;
    }
    return p;
}
Here I've replaced the increment of the pointers with a findIndex method that keeps incrementing the value of the pointer until the end of the array is reached or a value greater than or equal to the one we are searching for is found. The functionality of the method remains the same, since the same effect would have been achieved by the main loop. But now we are free to try to tweak the findIndex method to obtain better results. But before we do that, I am going to P-hack the shit out of this science and generate the arrays differently.
Here is a method of generating two arrays that are different because all of the elements of the first are smaller than those of the second. At the very end we put a single element that is equal, for the fun of it.
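private static void generateArrays(Random rnd, int n, out int[] arr1, out int[] arr2)
{
    arr1 = new int[n];
    arr2 = new int[n];
    for (var i = 0; i < n - 1; i++)
    {
        arr1[i] = i;     // 0 .. n-2
        arr2[i] = i + n; // n .. 2n-2, all larger than anything in arr1
    }
    // the single common value, at the very end of both arrays
    arr1[n - 1] = n * 3;
    arr2[n - 1] = n * 3;
}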
This is the worst case scenario for the algorithm and the number of comparisons is promptly 2n. But what if we used binary search (which the StackOverflow answer dismissed as having O(n*log n) complexity instead of O(n))? Well, then... the output becomes
49 comparisons
1 intersections in 67ms
Here is the code for the findIndex method that would do that:
private static int findIndex(int[] arr, int v, int p, ref int comparisons)
{
    // search only to the right of the current position
    var start = p + 1;
    var end = arr.Length - 1;
    if (start > end) return start;
    while (true)
    {
        var mid = (start + end) / 2;
        var val = arr[mid];
        if (mid == start)
        {
            // the interval cannot be narrowed any further
            comparisons++;
            return val < v ? mid + 1 : mid;
        }
        comparisons++;
        switch (val.CompareTo(v))
        {
            case -1:
                start = mid + 1;
                break;
            case 0:
                return mid;
            case 1:
                end = mid - 1;
                break;
        }
    }
}
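If you would rather not hand-roll the search, the framework's Array.BinarySearch can do the same job. This is a minimal sketch under a few assumptions: the name FindIndexBuiltIn is made up, there is no comparison counter (the built-in method does not expose one), and the values are distinct, since with duplicates the method may land on any of the equal elements. When the value is missing, Array.BinarySearch returns the bitwise complement of the index of the first larger element, which is the position we want to jump to.

```csharp
using System;

// Hypothetical replacement for findIndex: the first index after p whose value
// is >= v, or arr.Length when there is no such element.
int FindIndexBuiltIn(int[] arr, int v, int p)
{
    var start = p + 1;
    if (start >= arr.Length) return arr.Length;
    var i = Array.BinarySearch(arr, start, arr.Length - start, v);
    return i >= 0 ? i : ~i; // a negative result encodes the insertion point
}

var data = new[] { 1, 3, 5, 7, 9 };
Console.WriteLine(FindIndexBuiltIn(data, 6, 0));   // 3: the first value >= 6 is 7
Console.WriteLine(FindIndexBuiltIn(data, 100, 0)); // 5: past the end of the array
```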
49 comparisons is smack on the value of 2*log2(n). Yeah, sure, the data we used was doctored, so let's return to the randomly generated one. In that case, the number of comparisons grows horribly:
304091112 comparisons
199712 intersections in 5095ms
which is larger than n*log2(n).
Why does that happen? Because in the randomly generated data the binary search finds its worst case scenario: trying to find the first value. It divides the problem efficiently, but it still has to go through all the data to reach the first element. Surely we can't use this for a general scenario, even if it is fantastic for one specific case. And here is my qualm with the O notation: without specifying the type of input, the solution is just probabilistically the best. Is it?
Let's compare the results so far. We have three ways of generating data: randomly with increments from 1 to 100, odds and evens, small and large values. Then we have two ways of computing the next index to compare: linear and binary search. The approximate numbers of comparisons are as follows:
| findIndex | Random | OddsEvens | SmallLarge |
|---|---|---|---|
| Linear | 2n | 2n | 2n |
| Binary search | 3/2*n*log(n) | 2*n*log(n) | 2*log(n) |
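As a sanity check, the formulas in the table can be evaluated for n = 1e+7 and put next to the outputs quoted earlier (49 comparisons for the doctored data, 304091112 for the random data). This is only a back-of-the-envelope sketch, assuming the logarithms are base 2.

```csharp
using System;

double n = 1e7;
double log2n = Math.Log(n, 2); // about 23.25

Console.WriteLine($"2*log2(n)     = {2 * log2n:F1}");       // compare with the measured 49
Console.WriteLine($"n*log2(n)     = {n * log2n:E2}");       // lower than the measured 3.04E+008
Console.WriteLine($"3/2*n*log2(n) = {1.5 * n * log2n:E2}"); // the table's estimate for random data
```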
Alternatives
Can we create a hybrid findIndex that would have the best of both worlds? I will certainly try. Here is one possible solution:
private static int findIndex(int[] arr, int v, int p, ref int comparisons)
{
    var inc = 1;
    while (true)
    {
        // if the jump would leave the array, fall back to single steps
        if (p + inc >= arr.Length) inc = 1;
        if (p + inc >= arr.Length) return arr.Length;
        comparisons++;
        switch (arr[p + inc].CompareTo(v))
        {
            case -1: // still too small: advance and double the step
                p += inc;
                inc *= 2;
                break;
            case 0:
                return p + inc;
            case 1: // overshot: halve the step, or stop if it is already 1
                if (inc == 1) return p + inc;
                inc /= 2;
                break;
        }
    }
}
What am I doing here? If I find the value, I return the index; if the value is smaller, not only do I advance the index, but I also increase the speed of the next advance; if the value is larger, then I slow down until I get to 1 again. Warning: I do not claim that this is the optimal algorithm, this is just something that was annoying me and I had to explore it.
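The speed-up-then-slow-down idea is a close relative of what is usually called exponential or galloping search: double the step until you overshoot, then binary search inside the last interval. The sketch below is that textbook variant, not the findIndex above, and the names GallopIndex and IntersectGalloping are made up for illustration. It assumes arrays small enough that doubling the step cannot overflow an int, and it is written as C# top-level statements so it can run as a single file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// First index after p whose value is >= v, or arr.Length if there is none.
int GallopIndex(int[] arr, int v, int p)
{
    var lo = p;   // everything up to and including lo is known to be < v
    var step = 1;
    // Gallop: keep doubling the step while the probed value is still too small.
    while (lo + step < arr.Length && arr[lo + step] < v)
    {
        lo += step;
        step *= 2;
    }
    var hi = Math.Min(lo + step, arr.Length); // arr[hi] >= v, or hi is the end
    lo++;
    // Binary search for the first value >= v in [lo, hi).
    while (lo < hi)
    {
        var mid = lo + (hi - lo) / 2;
        if (arr[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

IEnumerable<int> IntersectGalloping(int[] a, int[] b)
{
    int p1 = 0, p2 = 0;
    while (p1 < a.Length && p2 < b.Length)
    {
        if (a[p1] < b[p2]) p1 = GallopIndex(a, b[p2], p1);
        else if (a[p1] > b[p2]) p2 = GallopIndex(b, a[p1], p2);
        else { yield return a[p1]; p1++; p2++; }
    }
}

var big = Enumerable.Range(1, 10).ToArray(); // 1 .. 10
var few = new[] { 3, 8, 11 };
Console.WriteLine(string.Join(", ", IntersectGalloping(big, few))); // 3, 8
```

The difference from the findIndex above is what happens after an overshoot: this variant never doubles again inside an interval it has already bounded.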
OK. Let's see some results. I will decrease the value of n even more, to a million. Then I will generate the values with random increases of up to 10, 100 and 1000. Let's see all of it in action! This time the figures are the actual counts of comparisons (in millions):
| findIndex | Random10 | Random100 | Random1000 | OddsEvens | SmallLarge |
|---|---|---|---|---|---|
| Linear | 2 | 2 | 2 | 2 | 2 |
| Binary search | 30 | 30 | 30 | 40 | 0.00004 |
| Accelerated search | 3.4 | 3.9 | 3.9 | 4 | 0.0002 |
So for the general cases, the number of comparisons increases by at most a factor of two, while for specific cases the decrease can be four orders of magnitude!
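For anyone who wants to reproduce the trend on a small scale, here is a minimal harness sketch. It reuses the logic of the linear and accelerated findIndex versions from above, but the shape of the harness is my own assumption for the sake of a single runnable file: a shared counter instead of the ref parameter, the names LinearIndex, AcceleratedIndex and CountComparisons, and n = 1000 with the SmallLarge data.

```csharp
using System;

long comparisons = 0; // shared counter, reset at the start of every run

int LinearIndex(int[] arr, int v, int p)
{
    p++;
    while (p < arr.Length)
    {
        comparisons++;
        if (arr[p] >= v) break;
        p++;
    }
    return p;
}

int AcceleratedIndex(int[] arr, int v, int p)
{
    var inc = 1;
    while (true)
    {
        if (p + inc >= arr.Length) inc = 1;
        if (p + inc >= arr.Length) return arr.Length;
        comparisons++;
        var c = arr[p + inc].CompareTo(v);
        if (c == 0) return p + inc;
        if (c < 0) { p += inc; inc *= 2; }   // too small: advance and speed up
        else if (inc == 1) return p + inc;   // overshot with the smallest step
        else inc /= 2;                       // overshot: slow down
    }
}

// Runs the generalized intersect loop and returns the number of value comparisons.
long CountComparisons(int[] a, int[] b, Func<int[], int, int, int> findIndex)
{
    comparisons = 0;
    int p1 = 0, p2 = 0;
    while (p1 < a.Length && p2 < b.Length)
    {
        comparisons++;
        var c = a[p1].CompareTo(b[p2]);
        if (c < 0) p1 = findIndex(a, b[p2], p1);
        else if (c > 0) p2 = findIndex(b, a[p1], p2);
        else { p1++; p2++; }
    }
    return comparisons;
}

// SmallLarge data: everything in small is below everything in large, except the last value.
var n = 1000;
var small = new int[n];
var large = new int[n];
for (var i = 0; i < n - 1; i++) { small[i] = i; large[i] = i + n; }
small[n - 1] = large[n - 1] = n * 3;

var linear = CountComparisons(small, large, LinearIndex);
var accelerated = CountComparisons(small, large, AcceleratedIndex);
Console.WriteLine($"linear: {linear}, accelerated: {accelerated}");
```

Swapping in a different generator (random increments, odds and evens) is a matter of replacing the loop that fills small and large.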
Conclusions
Because I had all of this in my head, I made a fool of myself at a job interview. I couldn't reason through all of the things I wrote here in a few minutes and so I had to clear my head by composing this long monstrosity.
Is the best solution the one in O(n)? Most of the time. The algorithm is simple, with no hidden comparisons, and one can understand why it would be universally touted as a good solution. But it's not the best in every case. I have demonstrated here that I can minimize the extra comparisons in standard scenarios and get immense improvements for specific inputs, like arrays that have chunks of elements smaller than the next value in the other array. I would also risk saying that this findIndex version is adaptive to the conditions at hand, with improbable scenarios as worst cases. It works reasonably well for randomly distributed arrays, it does wonders for "chunky" arrays (this includes the case when one array is much smaller than the other) and thus is a contender for some kinds of uses.
What I wanted to explore and now express is that finding the upper growth rate of an algorithm is just part of the story. Sometimes the best implementation fails for not adapting to the real input data. I will say this, though, for the default algorithm: it works with IEnumerables, since it never needs to jump forward over some elements. This intuitively gives me reason to believe that it could be optimized using the array/list indexing. Here it is, in IEnumerable fashion:
private static IEnumerable<int> intersect(IEnumerable<int> arr1, IEnumerable<int> arr2)
{
    // the using blocks make sure both enumerators are disposed when the iteration ends
    using (var e1 = arr1.GetEnumerator())
    using (var e2 = arr2.GetEnumerator())
    {
        var loop = e1.MoveNext() && e2.MoveNext();
        while (loop)
        {
            var v1 = e1.Current;
            var v2 = e2.Current;
            switch (v1.CompareTo(v2))
            {
                case -1:
                    loop = e1.MoveNext();
                    break;
                case 0:
                    loop = e1.MoveNext() && e2.MoveNext();
                    yield return v1;
                    break;
                case 1:
                    loop = e2.MoveNext();
                    break;
            }
        }
    }
}
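Because it only ever moves forward, the same walk works on sequences that are generated lazily and never stored in an array. This is a minimal sketch of a generic variant, with made-up names IntersectSorted and Multiples. It assumes C# 8 or later for the using declarations, non-null values, and top-level statements for brevity. It also tests only the sign of CompareTo, which is all that the IComparable contract promises.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical generic variant: intersects two ascending sequences in a single pass.
IEnumerable<T> IntersectSorted<T>(IEnumerable<T> first, IEnumerable<T> second)
    where T : IComparable<T>
{
    using var e1 = first.GetEnumerator();
    using var e2 = second.GetEnumerator();
    var more = e1.MoveNext() && e2.MoveNext();
    while (more)
    {
        var c = e1.Current.CompareTo(e2.Current);
        if (c < 0) more = e1.MoveNext();
        else if (c > 0) more = e2.MoveNext();
        else
        {
            yield return e1.Current;
            more = e1.MoveNext() && e2.MoveNext();
        }
    }
}

// A lazily generated ascending sequence: k, 2k, 3k, ...
IEnumerable<int> Multiples(int k, int count)
{
    for (var i = 1; i <= count; i++) yield return i * k;
}

var common = IntersectSorted(Multiples(3, 10), Multiples(5, 6)).ToList();
Console.WriteLine(string.Join(", ", common)); // 15, 30
```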
Extra work
The source code for a project that tests my various ideas can be found on GitHub. There you can find the following algorithms:
- Standard - the O(m+n) one described above
- Reverse - same, but starting from the end of the arrays
- Binary Search - looks for values in the other array using binary search. Complexity O(m*log(n))
- Smart Choice - when m*log(n)<m+n, it uses the binary search, otherwise the standard one
- Accelerating - the one that speeds up when looking for values
- Divide et Impera - recursive algorithm that splits arrays by choosing the middle value of one and binary searching it in the other. Due to the complexity of the recursiveness, it can't be taken seriously, but sometimes gives surprising results
- Middle out - it takes the middle value of one array and binary searches it in the other, then uses Standard and Reverse on the resulting arrays
- Pair search - I had high hopes for this, as it looks two positions in front instead of one. Really good for some cases, though generally it is a bit more than Standard
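To give an idea of what the Binary Search and Smart Choice entries amount to, here is a minimal sketch. It is not the code from the GitHub project: the names PreferBinarySearch and IntersectByBinarySearch are made up, the logarithm is assumed to be base 2, and the values are assumed to be distinct. m is the length of the smaller array and n that of the larger one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Smart Choice rule from the list above: m*log(n) < m + n (base 2 is my assumption).
bool PreferBinarySearch(int m, int n) => m * Math.Log(n, 2) < (double)m + n;

// Looks up every value of the small array in the large one, never searching
// to the left of the previous hit. Roughly O(m*log(n)).
IEnumerable<int> IntersectByBinarySearch(int[] small, int[] large)
{
    var from = 0;
    foreach (var v in small)
    {
        if (from >= large.Length) yield break;
        var i = Array.BinarySearch(large, from, large.Length - from, v);
        if (i >= 0) { yield return v; from = i + 1; }
        else from = ~i; // not found: continue from the first larger element
    }
}

Console.WriteLine(PreferBinarySearch(10, 1_000_000));        // True: tiny first array
Console.WriteLine(PreferBinarySearch(1_000_000, 1_000_000)); // False: equal sizes
var hits = IntersectByBinarySearch(new[] { 3, 8, 11 }, Enumerable.Range(1, 10).ToArray());
Console.WriteLine(string.Join(", ", hits)); // 3, 8
```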
The testing tool takes all algorithms and runs them on randomly generated arrays:
- Lengths m and n are chosen randomly from 1 to 1e+6
- A random number s of up to 100 "spikes" is chosen
- m and n are split into s+1 equal parts
- For each spike a random integer range is chosen and filled with random integer values
- At the end, the rest of the list is filled with any random values
Results
For a really small first array, the Binary Search is king. For equal size arrays, usually the Standard algorithm wins. However, there are plenty of cases when Divide et Impera and Pair Search win, usually not by much. Sometimes it happens that Accelerating Search is better than Standard, but Pair Search wins! I still have the nagging feeling that Pair Search can be improved. I feel it in my gut! However, I have too many other things to do to dwell on this.
Maybe one of you can find the solution! Your mission, should you choose to accept it, is to find a better algorithm for intersecting sorted arrays than the boring standard one.