Saturday 29 October 2016

Controlling JSON serialization in .Net Core Web API (Serialize enum values as strings, not integers)

.Net Core Web API uses Newtonsoft's Json.NET for JSON serialization. In other situations, when you want to control Json.NET options, you would do something like this:
JsonConvert.DefaultSettings = () =>
{
    var settings = new JsonSerializerSettings();
    // do something with settings
    return settings;
};
However, in this case it doesn't work. The way to do it here is to use the fluent configuration interface and hook into the ConfigureServices(IServiceCollection services) method, after the call to .AddMvc(), like this:
services
    .AddMvc()
    .AddJsonOptions(options =>
    {
        var settings = options.SerializerSettings;
        // do something with settings
    });

In my particular case I wanted to serialize enums as strings, not as integers. To do that, you need to use the StringEnumConverter class. For example, if you wanted to serialize the Gender property of a person as a string, you could define the entity like this:
public class Person
{
    public string Name { get; set; }

    [JsonConverter(typeof(StringEnumConverter))]
    public GenderEnum Gender { get; set; }
}
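
The GenderEnum type itself is not shown here; assuming a minimal definition like the one below (the member names are my guess, based on the "male" value in the serialized output further down), the default serializer would emit 0 or 1, while the converter emits the member name:
public enum GenderEnum
{
    Female,
    Male
}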

In order to do this globally, add the converter to the settings converter list:
services
    .AddMvc()
    .AddJsonOptions(options =>
    {
        options.SerializerSettings.Converters.Add(new StringEnumConverter
        {
            CamelCaseText = true
        });
    });

Note that in this case, I also instructed the converter to use camel case. The result of the serialization ends up as:
{"name":"James Carpenter","age":51,"gender":"male"}

Saturday 22 October 2016

Beware LINQ OrderBy in performance sensitive cases

I was doing this silly HackerRank algorithm challenge and I got the solution right, but it would always time out on test 7. I racked my brain on all sorts of different ideas, but to no avail. I was ready to throw in the towel and check out other people's solutions, only they were all in C++ and seemed pretty similar to my own. And then I made a code change and the test passed: I had replaced LINQ's OrderBy with Array.Sort.

Intrigued, I started investigating. The idea was to create a sorted integer array from a space delimited string of values. I had used Console.ReadLine().Split(' ').Select(s => int.Parse(s)).OrderBy(v => v); and it consumed over 7% of the total CPU of the test. Now I was using var arr = Console.ReadLine().Split(' ').Select(s => int.Parse(s)).ToArray(); Array.Sort(arr); and the CPU usage for that piece of code was 1.5%. So the OrderBy version was almost five times slower. How do the two implementations differ?
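
Out of curiosity, here is a rough sketch of my own (not the contest solution) of how one might compare the two approaches on a large generated input; it should show the same trend:
using System;
using System.Diagnostics;
using System.Linq;

class SortComparison
{
    static void Main()
    {
        // Build a large space delimited string of random numbers (sizes are arbitrary)
        var random = new Random(42);
        var line = string.Join(" ", Enumerable.Range(0, 1000000).Select(_ => random.Next().ToString()));

        // LINQ OrderBy (materialized with ToArray so the sort actually runs)
        var sw = Stopwatch.StartNew();
        var sorted = line.Split(' ').Select(s => int.Parse(s)).OrderBy(v => v).ToArray();
        sw.Stop();
        Console.WriteLine("OrderBy:    " + sw.ElapsedMilliseconds + " ms");

        // Array.Sort on the parsed array
        sw.Restart();
        var arr = line.Split(' ').Select(s => int.Parse(s)).ToArray();
        Array.Sort(arr);
        sw.Stop();
        Console.WriteLine("Array.Sort: " + sw.ElapsedMilliseconds + " ms");
    }
}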

Array.Sort should be simple: an in-place quicksort, the best general solution for this sort (heh heh heh) of problem. How about Enumerable.OrderBy? It returns an OrderedEnumerable which internally uses a Buffer<T> to get all the values into a container, then uses an EnumerableSorter to ... quicksort the values. Hmm...

Let's get back to Array.Sort. It's not as straightforward as it seems. First of all, it "tries" an SZSort. If that works, fine, return the result. This is an external native code implementation of QuickSort for several native value types (more on that here). Then it goes to a SorterObjectArray that chooses, based on the framework target, to use either an IntrospectiveSort or a DepthLimitedQuickSort. Even the implementation of this DepthLimitedQuickSort is much, much more complex than the quicksort used by OrderBy. IntrospectiveSort seems to be the one preferred going forward and is also heavily optimized, yet less complex and perhaps easier to understand. It uses quicksort, heapsort and insertion sort together.

Now, before you go all "OrderBy sucks!", read more about it. This StackOverflow list of answers seems to indicate that in the case of strings, at least, the performance is similar. There are a lot of other interesting things there as well. OrderBy uses a "stable" QuickSort, meaning that two items that compare as equal will appear in their original order. Array.Sort does not guarantee that.
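
To illustrate the stability point with a toy example of my own (not from the linked answers): sorting by age with OrderBy keeps Ana before Bob, while Array.Sort with a comparison makes no such promise:
using System;
using System.Linq;

class StabilityDemo
{
    static void Main()
    {
        var people = new[]
        {
            new { Name = "Ana", Age = 30 },
            new { Name = "Bob", Age = 30 },
            new { Name = "Cid", Age = 20 }
        };

        // OrderBy is stable: Cid, Ana, Bob - Ana and Bob keep their original order
        var stable = people.OrderBy(p => p.Age).ToArray();

        // Array.Sort is not guaranteed to be stable: Cid comes first,
        // but Ana and Bob may end up in either order
        var copy = people.ToArray();
        Array.Sort(copy, (a, b) => a.Age.CompareTo(b.Age));

        Console.WriteLine(string.Join(", ", stable.Select(p => p.Name)));
        Console.WriteLine(string.Join(", ", copy.Select(p => p.Name)));
    }
}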

Anyway, the performance difference in my particular case seems to come from the native code implementation of the sort for integers, rather than algorithmic improvements, although I don't have the time right now to grab the various implementations and test them properly. However, just from the way the code reads, I would bet the IntrospectiveSort will compare favorably to the simple Quicksort implementation used in OrderBy.

Thursday 13 October 2016

My first DMCA notice

Today I received two DMCA notices. One of them might have been legitimate, but the second was for a file which started with
/*
Copyright (c) 2010, Yahoo! Inc. All rights reserved.
Code licensed under the BSD License:
http://developer.yahoo.com/yui/license.html
version: 2.8.1
*/
Nice, huh?

The funny part is that these are files on my Google Drive, which are not used anywhere anymore and are accessible only by people with a direct link to them. Well, I removed the sharing on them, just in case. The DMCA is even more horrid than I thought. The links in it are general links to a search engine for notices (not to the actual notice) and to some legalese documents, the email comes from noreply-6b094097@google.com, and any hope that I might fight this is clearly and intentionally quashed by the way the document is worded.

So remember: Google Drive is not yours, it's Google's. I wonder if I would have gotten the DMCA even if the file was not being shared. There is a high chance I would have, since no one should have been using the link directly anyway.

Bleah, lawyers!

Monday 10 October 2016

Disqus customer support is non-existent [Blogger comment synchronization broken]

I have enabled Disqus comments on this blog and it is supposed to work like this: every old comment from Blogger has to be imported into Disqus and every new comment from Disqus also needs to be saved in the Blogger system. Importing works just fine, but "syncing" does not. Every time someone posts a comment I receive this email:
Hi siderite,
 
You are receiving this email because you've chosen to sync your
comments on Disqus with your Blogger blog. Unfortunately, we were not
able to access this blog.
 
This may happen if you've revoked access to Disqus. To re-enable,
please visit:
https://siderite.disqus.com/admin/discussions/import/platform/blogger/
 
Thanks,
The Disqus Team
Of course, I have not revoked any access, but I "re-enable" it just the same, only to be presented with a link to resync that doesn't work. I mean, it is so crappy that it throws the JavaScript error "e._ajax is undefined" on a line where e._ajax is used instead of e.ajax, and even if that had worked, it uses a config object that is not defined.

It doesn't really matter, because the ajax call just accesses (well, it should access) https://siderite.disqus.com/admin/discussions/import/platform/blogger/resync/. And guess what happens when I go there: I receive an email that the Disqus access in Blogger has been revoked.

No reply from the Disqus team for months, for me or for anybody else having this problem. They have a silly page that explains that, of course, they are not at fault: Blogger did some refactoring and broke their system. Yeah, I believe that. They probably renamed the ajax function in jQuery as well. Damn Google!

Friday 7 October 2016

Fragmentation in an SQL table with lots of inserts, updates and deletes

I ran into an interesting case today, where we needed to manipulate data from tens of thousands of people daily. Assuming we use table rows for the information, we get a table in which rows are constantly added, updated and deleted. The issue is with the space allocated in table pages.

SQL works like this: if it needs space, it allocates some as a "page" which can contain multiple records. When you delete records the space is not reclaimed; it remains as is (this is called ghosting). The exception is when all records in a page are deleted, in which case the page is reused as an empty page. When you update a record with more data than it held before (for example when you have a variable length column), the page is split, with the rest of the records on the page moved to a new page.

In a heap table (no clustered index) the space inside pages is reused for new records or for updated records that don't fit in their allocated space. However, if you use a clustered index, like a primary key, the space is not reused, since there needs to be a correlation between the value of the column and its position in the page. And here lies the problem: you may end up with a lot of pages with very few records in them. A typical page is 8 kilobytes, so a table with a few integers per record can hold hundreds of records on a single page.

Fragmentation can be internal, within a page, as described above, or external, between pages, when recycled pages are used for data that is out of order. To get a large swathe of records, the disk might be worked hard jumping from page to page to fetch what is logically a contiguous blob of data. It is disk input/output that kills a database.

OK, back to our case. A possible solution was to store all the data for a user in a "blob", a VARBINARY column. For reads or changes, only the disk space occupied by the blob would be touched, with C# code handling everything else. It's what is called trading CPU for IO, which is generally good. However, this NoSQL-like idea smelled bad to me. We are supposed to trust our databases, not work against them. The solution I chose was to monitor index fragmentation and occasionally rebuild or reorganize the clustered index. I am willing to bet that reading/writing the equivalent of several pages of table data is going to be more expensive than selecting just the changes I want to make. Also, rebuilding the index will end up storing all the data for a user in the same place anyway.
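
As a sketch of that monitoring idea (the table name dbo.People, the index name PK_People, the connection string and the usual 5%/30% thresholds below are illustrative, not from the actual project), one could periodically check sys.dm_db_index_physical_stats and reorganize or rebuild when fragmentation grows:
// A minimal sketch of the approach, assuming a People table with a clustered primary key.
using System;
using System.Data.SqlClient;

class IndexMaintenance
{
    static void Main()
    {
        var connectionString = "Server=.;Database=MyDb;Integrated Security=true";
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Average fragmentation (percent) of the clustered index (index_id = 1)
            var check = new SqlCommand(@"
                SELECT avg_fragmentation_in_percent
                FROM sys.dm_db_index_physical_stats(
                    DB_ID(), OBJECT_ID('dbo.People'), 1, NULL, 'LIMITED')", connection);
            var fragmentation = Convert.ToDouble(check.ExecuteScalar());

            // Common rule of thumb: reorganize between 5% and 30%, rebuild above 30%
            string action = fragmentation > 30 ? "REBUILD"
                          : fragmentation > 5 ? "REORGANIZE"
                          : null;
            if (action != null)
            {
                new SqlCommand("ALTER INDEX PK_People ON dbo.People " + action, connection)
                    .ExecuteNonQuery();
            }

            Console.WriteLine("Fragmentation: " + fragmentation.ToString("0.0") + "% -> " + (action ?? "no action"));
        }
    }
}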

However, this case made me think. Here is a situation in which the solution might have been (and, in a similar case implemented by someone else, actually was) to micromanage the way the database works. It made me question using a clustered index/primary key on a table at all.

These articles helped me understand more:

Tuesday 4 October 2016

Undo a Perforce changelist

I had this problem with Perforce where I accidentally Reconciled my offline work with all the files in the /bin and /obj folders, resulting in a huge 6000+ file changelist. OK, a simple one-button mistake, so surely there must be some one-button way of undoing what I just did. It appears there is not.

In order to fix this I have to follow these steps:
  1. Change the settings of Perforce to show files even in changelists larger than 1000 items (the default value)
  2. Select by hand, in the changelist window, the files from the obj and bin folders and use Revert on them
  3. Revert the few other files that were unwanted in the changelist, like .suo and .user files - note that Revert on added files doesn't delete them, it just unadds them
  4. Create a file with paths to ignore and then use p4 set P4IGNORE=<filename> for future reconcile work

What didn't work was adding a filename or path filter when visualizing the changelist, since that is a changelist filter, not a file filter. It will show you changelists that contain files matching the pattern, but it will not filter the files inside the changelists themselves.

For reference, the p4ignore file I used looked like this:
p4ignore
bin/Debug
obj/Debug
*.suo
*.user
Note that I also added the p4ignore file itself, although the file was not in any Perforce repository (yet).

"But, Siderite, you should use Git (or whatever source control is the newest fad at the moment)!" Wish that could, my friend, wish that I could.

The Trials (The Red #2) and Going Dark (The Red #3), by Linda Nagata

Having been so pleasantly surprised by First Light, the first book in The Red series, I quickly read the next two books: The Trials and Going Dark. However, possibly due to my high expectations, I have been disappointed by the continuation. Linda Nagata seemed to have reached that sweet spot between current tech trends and emergent future that makes stories feel both hard sci-fi and realistic. The integration between man and machine, the politics run by shadowy megarich "dragons" from the background, nuclear bombs detonated in major US cities, artificial intelligence and so on. The potential was immense!

Yet the author chose to continue the story on the same flat note, like an ode to Stockholm Syndrome, where the hero gets repeatedly coerced into running missions that at first seem bullshit, but in the end are rationalized by him as necessary and even dutiful. The constant intrusion of external forces into his emotional balance by way of direct brain stimulation also leaves his feelings and motivations completely disconnected from those of the reader. A strange choice, considering the vast possibilities opened by the first book. Frankly, it felt like Nagata enjoyed writing the first book and then was forced by publishers to make it "a trilogy", since that is the norm for fantasy and science fiction, but her heart wasn't really in it.

I don't want to spoil the ending, such as it is, but I will say it was disappointing as well, with no real closure for the reader on any of the important questions raised in First Light. Too bad, since I felt the story was beginning to touch on important subjects that needed to be discussed at a deeper level than just "boots on the ground".