Geek Love, by Katherine Dunn

Geek Love is another title that misled me. Geek, in this book, refers to the original definition of the word, "a carnival performer who does wild or disgusting acts", and not a cool lovable geek like myself. The subject is "controversial": the life of an albino hunchbacked dwarf woman as a member of a carnival family.

There is a lot going on in the book. Carnival people poison themselves in order to make their children as freaky as possible. Said children are then raised only if they are strange enough. The failures are either preserved in glass jars if they are too mutated, or given away if they are too normal. The successes range from the main character, to a sociopathic hairless cult leader with flippers instead of limbs, to Siamese twin sisters who share the same lower body, to a telekinetic, God-like child who only wants to be loved and gets manipulated into doing stuff for others. The sisters later give birth to a grotesquely obese child, fathered by a man with half his face blown off who ejaculates inside them at the moment of his death. The death and the face blowing are unrelated. There is more, like a rich heiress who pays beautiful women to mutilate themselves in order to have a better life, unencumbered by the sexual desire of her subjects or of the people surrounding them.

The book itself was pretty innovative, but rather boring. If I had had an alternative, I wouldn't have finished it, but as I had none, I am a bit proud of having finished it. If nothing else, the book is strange enough to be interesting. Also the writing is pretty good, introverted, taking the reader inside the mind of a person who considers normal people too bland and treasures her deformity and that of her daughter, fathered through telekinesis with the sperm of her brother.

Don't get me wrong, it was not painful reading the book and I do not regret having read it. However, I wish there were something more interesting in the story than the strangeness of one's thoughts.

Portable Game Notation and parsing it with regular expressions

Short version: here is the link to the uploaded .NET regular expression. (look in the comment for the updated version)

I noticed that the javascript code that I am using to parse PGN chess games and display them is rather slow and I wanted to create my own PGN parser, one that would be optimal in speed. "It should be easy", I thought, as I was imagining getting the BNF syntax for PGN, copy pasting it into a parser generator and effortlessly getting the Javascript parser that would spit out all secrets of the game. It wasn't easy.

First of all, the BNF notation for Portable Game Notation was not complete. Sure, text was used to explain the leftovers, but there was no real information about it in any of the "official" PGN pages or Wikipedia. Software and chess related FTPs and websites seemed to be terribly obsolete or missing altogether.

Then there was the parser generator. Wikipedia tells me that ANTLR is pretty good, as it can spew Javascript code on the other end. I downloaded it (a .jar Java file - ugh!), ran it, pasted BNF into it... got a generic error. 10 minutes later I was learning that ANTLR does not support BNF, but only its own notation. Searches for tools that would do the conversion automatically led me to smartass RTFM people who explained how easy it is to do it manually. Maybe they should have done it for me, then.

After all this (and many fruitless searches on Google) I decided to use regular expressions. After all, it might make a lot of sense to have a parser in a language like C#, but the difference in speed between a Javascript implementation and a native regular expression should be pretty large, no matter how much they optimize the engine. Ok, let's define the rules of a PGN file then.

In a PGN file, a game always starts with some tags, explaining what the event is, who played, when, etc. The format of a tag is [name "value"]. There are PGN files that do not have this marker, but then there wouldn't be more than one game inside. The regular expression for a tag is: (\[\s*(?<tagName>\w+)\s*"(?<tagValue>[^"]*)"\s*\]\s*)+. Don't be scared, it only means some empty space maybe, then a word, some empty space again, then a quoted string that does not contain quotes, then some empty space again, all in square brackets and maybe followed by more empty space, all of this appearing at least once.
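As a rough illustration of the tag rule, here is a minimal sketch in Javascript. It assumes an engine with named groups (ES2018) and matchAll (ES2020); the helper name parseTags and the sample header are made up for the example, and the trailing + is dropped so that each tag becomes its own match.

```javascript
// Tag pattern from the text, without the trailing "+" so each tag is one match.
const tagPattern = /\[\s*(?<tagName>\w+)\s*"(?<tagValue>[^"]*)"\s*\]\s*/g;

// parseTags is a hypothetical helper: it collects every [name "value"] pair.
function parseTags(pgn) {
  const tags = {};
  for (const match of pgn.matchAll(tagPattern)) {
    tags[match.groups.tagName] = match.groups.tagValue;
  }
  return tags;
}

// Made-up sample game header, for illustration only.
const sampleHeader = '[Event "Casual game"]\n[White "Nokia"]\n[Black "Me"]\n\n1. e4 e5 *';
const parsedTags = parseTags(sampleHeader);
console.log(parsedTags);
```

Matching one tag at a time sidesteps the fact that a repeated group in Javascript only remembers its last iteration.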

So far so good, now comes the list of moves. The simplest possible move looks like 1. e4, so a move number and a move. But there are more things that can be added to a move. For starters, the move for black could be following next (1. e4 e5) or a bit after, maybe if there are commentaries or variations for the move of the white player (1... e5). The move itself has a variety of possible forms:
  • e4 - pawn moved to e4
  • Nf3 - knight moved to f3
  • Qxe5 - queen captured on e5
  • R6xf6 - the rook on the 6th rank captured on f6
  • Raa8 - the rook on file a moved to a8
  • Na1xc2 - the knight at a1 captured on c2
  • f8=Q - pawn moved to f8 and promoted to queen
  • dxe8=B - pawn on the d file captured on e8 and promoted to bishop

There is more information about the moves. If you give check, you must end the move with a + sign, if you mate you end it with #, and if the move is weird, special, very good or very bad, you can end it with stuff like !!, !?, ?, !, etc., which are the PGN version of WTF?!. And if that is not enough, there are some numbers called NAG which are supposed to represent a numeric, language independent, status code. Also, the letters that represent the pieces are not language independent, so a French PGN might look completely different from an English one.

So let's attempt a regular expression for the move only. I will not implement NAG or piece letters for non-English languages: (?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?. I know, scary. But it means a letter in the list PNBRQK, one for each possible type of chess piece, which may or may not appear, then a letter between a and h, which would represent a file, then a number between 1 and 8, which would represent a rank. Both letter and number might not appear, since they are only hints about where the piece that moved was coming from. Then there is a possible letter x, indicating a capture, then, finally, the destination coordinates for the move. There may follow an equal sign and another piece, in case of promotion. An astute reader might observe that this also matches a rook that promotes to something else, for example. This is not completely strict. If this long expression is not matched, maybe something that looks like OO, O-O, OOO or O-O-O could be matched, representing the two possible types of castling, when a rook and a king move at the same time around each other, provided neither has moved yet. And to top it off, we allow for some empty space and the characters ! and ? in order to let chess annotators express their feelings.
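To check the move expression against the forms listed earlier, here is a small sketch in Javascript. The ^ and $ anchors are my addition so that whole strings are tested, the unnecessary escapes are dropped, and the name isMove is hypothetical.

```javascript
// Move pattern from the text, anchored so the whole string must be a move.
const movePattern =
  /^(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:=[PNBRQK])?|O(-?O){1,2})[+#]?(\s*[!?]+)?$/;

// isMove is a hypothetical helper name.
const isMove = (text) => movePattern.test(text);

const candidates = [
  "e4", "Nf3", "Qxe5", "R6xf6", "Raa8", "Na1xc2", "f8=Q", "dxe8=B",
  "O-O", "O-O-O", "OO", "OOO", "e4+", "Qh7#", "Nf3!?", "Nf3 !!",
  "e9", "Zf3", "O",
];

const accepted = candidates.filter(isMove);
const rejected = candidates.filter((text) => !isMove(text));
console.log({ accepted, rejected });

// The pattern is deliberately loose: a "promoting rook" still passes.
console.log(isMove("Rd8=Q"));
```

Only the last three candidates (e9, Zf3 and a lone O) should end up in the rejected list.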

It's not over yet. PGN notation allows for commentaries, which are bits of text inside curly brackets {what an incredibly bad move!} and also variations. The variations show possible outcomes from the main branch. They are lists of moves that are enclosed in round brackets. The branches can be multiple and they can branch themselves! Now, this is a problem, as regular expressions are not recursive. But we only need to match variations and then reparse them in code when found. So, let's attempt a regular expression. It is getting quite big already, so let's add some tokens that can represent already discussed bits. I will use a @ sign to enclose the tokens. Here we go:
  • @tags@ - we will use this as a marker for one or more tags
  • @move@ - we will use this as a marker for the complicated move syntax explained above
  • (?<moveNumber>\d+)(?<moveMarker>\.|\.{3})\s*(?<moveValue>@move@)(?:\s*(?<moveValue2>@move@))?\s* - the move number, 1 or 3 dots, some empty space, then a move. It can be followed directly by another move, for black. Let's call this @line@
  • (?:\{(?<varComment>[^\}]*?)\}\s*)? - this is a comment match, something enclosed in curly brackets; we'll call it @comment@
  • (?:@line@@variations@@comment@)* - wow, so simple! Multiple lines, each maybe followed by variations and a comment. This would be a @list@ of moves.
  • (?<endMarker>1\-?0|0\-?1|1/2\-?1/2|\*)?\s* - this is the end marker of a game. It should be there, but in some cases it is not. It shows the final score or an unfinished match. We'll call it @ender@
  • (?<pgnGame>\s*@tags@@list@@ender@) - The final tokenised regular expression, containing an entire PGN game.

But it is not over yet. Remember @variations@? We did not define it, and with good reason. A good approximation would be (?:\((?<variation>.*)\)\s*)*, which defines something enclosed in parentheses. But it would not work well. Regular expressions are greedy by default, so it would just get the first round bracket and everything up to the last one found in the file! Using the non greedy marker ? would not work either, as the match would stop at the first closing bracket inside a variation. Comments might contain parenthesis characters as well.
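The greedy versus non greedy problem is easy to see on a short example. This is only a sketch in Javascript (assuming an engine with named groups); the movetext string is made up for the demonstration.

```javascript
// Made-up movetext with a nested variation and a second, separate variation.
const movetext = "1. e4 (1. d4 d5 (1... Nf6)) 1... e5 (1... c5) 2. Nf3";

// Greedy: runs from the first "(" to the last ")" in the text.
const greedy = movetext.match(/\((?<variation>.*)\)/).groups.variation;

// Non greedy: stops at the first ")", cutting the nested variation short.
const lazy = movetext.match(/\((?<variation>.*?)\)/).groups.variation;

console.log(greedy); // 1. d4 d5 (1... Nf6)) 1... e5 (1... c5
console.log(lazy);   // 1. d4 d5 (1... Nf6
```

Neither result is the first variation, 1. d4 d5 (1... Nf6), which is why some knowledge of what a variation contains is needed.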

The only solution is to match a variation more precisely, so that some sort of syntax checking is being performed. We know that a variation contains a list of moves, so we can use that, by defining @variations@ as (?:\((?<variation>@list@)\)\s*)*. @list@ already contains @variations@, though, so we can do this a number of times, to the maximum supported branch depth, then replace the final variation with the generic "everything goes" approximation from above. When we read the results of the match, we just take the variation matches and reparse them with the list subexpression, programmatically, and check extra syntax features, like the move numbers being consecutive.
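The expansion to a fixed depth can be sketched as follows. This is not the uploaded expression, just an illustration in Javascript under a few assumptions: the names MOVE, LINE, COMMENT, listPattern and topLevelVariations are mine, the fallback is written as the non greedy .*?, and the named groups are replaced with non-capturing ones, because Javascript (unlike .Net) rejects the same group name repeated at several nesting levels.

```javascript
// Token definitions, following the @move@, @line@ and @comment@ tokens above.
const MOVE = String.raw`(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:=[PNBRQK])?|O(?:-?O){1,2})[+#]?(?:\s*[!?]+)?`;
const LINE = String.raw`\d+(?:\.{3}|\.)\s*${MOVE}(?:\s*${MOVE})?\s*`;
const COMMENT = String.raw`(?:\{[^}]*\}\s*)?`;

// Expands @list@ to a fixed nesting depth. Past that depth the variation
// body falls back to the "everything goes" approximation.
function listPattern(depth) {
  const inner = depth > 0 ? listPattern(depth - 1) : ".*?";
  const variations = String.raw`(?:\(${inner}\)\s*)*`;
  return `(?:${LINE}${variations}${COMMENT})*`;
}

// Extracts the text of each top level variation so it can be reparsed in code.
function topLevelVariations(movetext, depth) {
  const pattern = new RegExp(String.raw`\((${listPattern(depth - 1)})\)`, "g");
  return Array.from(movetext.matchAll(pattern), (match) => match[1]);
}

// Made-up movetext for the demonstration.
const game = "1. e4 (1. d4 d5 (1... Nf6)) 1... e5 (1... c5) 2. Nf3";
const wholeList = new RegExp(`^${listPattern(1)}$`);

console.log(wholeList.test(game));            // balanced brackets: accepted
console.log(wholeList.test("1. e4 (1. d4"));  // unclosed variation: rejected
console.log(topLevelVariations(game, 1));
```

Each extracted variation can then be fed through the same functions again, which is the reparsing step described above.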

It is no wonder that at the Regular Expressions Library site there was no expression for PGN. I made the effort to upload it, maybe other people refine it and make it even better. Here is the link to the uploaded regular expression. The complete regular expression is here:

Note: the flavour of the regular expression above is .Net. Javascript engines older than ES2018 do not support named groups, the things between the angle brackets, so if you want to make it work there, remove the ?<name> constructs from it.

Now to work on the actual javascript (ouch!)

Update: I took my glorious regular expression and used it in some Javascript code, only to find out that groups in Javascript do not act like collections of found items, but hold only the last match. In other words, if you match 'abc' with (.)* (match as many characters in a row as possible, and capture each character separately) you will get an array that contains 'abc' as the first item and 'c' as the second. That's insane!
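The behaviour is easy to reproduce, and one possible workaround (my suggestion, not necessarily what the final parser does) is to match the repeated item on its own with the global flag and collect the matches. A minimal sketch, assuming an engine that has matchAll:

```javascript
// A repeated group keeps only its last iteration.
const single = "abc".match(/(.)*/);
console.log(single[0]); // abc
console.log(single[1]); // c

// Workaround sketch: match the repeated item by itself, globally.
const each = Array.from("abc".matchAll(/./g), (match) => match[0]);
console.log(each); // [ 'a', 'b', 'c' ]
```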

Update: As per Matty's suggestion, I've added the less used RxQ move syntax (I do have a hunch that it is not complete, for example stuff like RxN2, RxNa or RxNa2 might also be accepted, but they are not implemented in the regex). I also removed the need for at least one PGN tag. To avoid false positives you might still want to use the + versus the * notation after the tagName/tagValue construct. The final version is here:


The Regexlib version has also been updated in a comment (I don't know how - or if it is possible - to edit the original).

Wireless news

There is this childish game called "cordless phone", which funny enough is older than any possible concept of wireless telephony, where in a large group of people a message is sent to someone else by whispering it to your neighbour. Since humans are not network routers, small mistakes creep up in the message as it is copied and resent (hmm, there should be a genetic reference here somewhere as well).

The point is that, given enough people with their own imperfections and/or agendas, a message gets distorted as the number of middle men increases. It also happens in the world of news. Some news company invests in news by paying investigative reporters. The news is created by a human interpreting things from eye witness accounts to scientific papers, but then it is reported by other news agencies, where the original information is not the main source, but the previous news report. Then marketing rears its ugly head, as the titles need to be more shocking, more impressive, forcing the hapless reader to open that link, pick up that paper, etc. Occasionally there are translation errors, but mostly it is about idiots who don't and can't understand what they are reporting on, so the original message gets massacred!

So here is one of today's news items, re-reported by Romanian media, after translation and obfuscation and marketization (and retranslation by me, sorry): "Einstein was wrong? A particle that is travelling at more than the speed of light has been discovered". In the body, written a little better, "Elementary subatomic particle" got translated as "Elementary particle of matter". Dear "science" reporters, the neutrino is not a particle that needed discovering and it is not part of normal matter, with which it interacts very little. What is new is just the strange behaviour of the faster than light travel, which is only hinted at by some data that may or may not be correct and is refuted by other data, like that from supernova explosions, information that you haven't even bothered to copy paste into your article. And, as if this was not enough, the comments of the readers, kind of like myself ranting here probably, are making the reporter seem brilliant in comparison.

Is there a solution? Not really. People should try to find the original source of messages as much as possible, or at least a reporting source that is professional enough to not skew the information too much when summarizing it for the general public. A technical solution could work that would analyse news reports, group them per topic, then remove copies and translations, red flag emotional language or hidden divergent messages and ignore the titles altogether, maybe generate new ones. And while I know this is possible to do, it would be very difficult (but possibly rewarding) as software goes. One thing is for certain: reading the titles and assuming that they correctly summarize the complete articles is a terrible mistake, alas, one that is very common.

T-SQL: Determining the byte size of a text in a specific encoding

There was this FTP surrogate program that used SQL as a filesystem. I needed to store the size of the file (which was an HTML template and was stored as NTEXT) in the row where the content was stored. The problem is that the size of a text in a Microsoft SQL Server NTEXT column is two bytes per character, while the actual size of the content, stored web-like in UTF8, was different: almost half of that.

I thought that there must be an easy way to compute it, trying to cast the string to TEXT then using LEN, trying DATALENGTH, BINARY, etc. Nothing worked. In the end I made my own function, because the size of a string in UTF8 is documented on the Wikipedia page of that encoding: 1 byte for ASCII characters (character code<128), 2 bytes for codes less than 2048, 3 for codes less than 65536 and 4 for the rest. So here is the sql function that computes the size in UTF8:

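-- dbo.Utf8ByteSize is a placeholder name
CREATE FUNCTION dbo.Utf8ByteSize
(
    @text NVARCHAR(max)
)
RETURNS INT
AS
BEGIN
    DECLARE @i INT, @size INT, @val INT
    SELECT @i=1, @size=0
    WHILE (@i<=LEN(@text))
    BEGIN
        -- code of the character at position @i
        SET @val=UNICODE(SUBSTRING(@text,@i,1))
        SET @size=@size+
            CASE
                WHEN @val<128 THEN 1
                WHEN @val<2048 THEN 2
                WHEN @val<65536 THEN 3
                ELSE 4
            END
        SET @i=@i+1
    END
    RETURN @size
END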

A similar approach would work for any other encoding.
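For comparison, the same counting rule can be sketched in Javascript. This is not the SQL function above, just the same thresholds applied per code point; the name utf8Size is hypothetical, and TextEncoder (available in browsers and in recent Node versions) is used only as a cross-check.

```javascript
// utf8Size is a hypothetical helper applying the thresholds from the text.
function utf8Size(text) {
  let size = 0;
  // for...of walks the string by code point, so a surrogate pair counts once.
  for (const ch of text) {
    const val = ch.codePointAt(0);
    if (val < 128) size += 1;
    else if (val < 2048) size += 2;
    else if (val < 65536) size += 3;
    else size += 4;
  }
  return size;
}

// Cross-check against the platform's own UTF-8 encoder.
const sampleText = "<p>caf\u00e9 \u20ac \u{1F600}</p>";
console.log(utf8Size(sampleText), new TextEncoder().encode(sampleText).length);
```

The two numbers printed should be equal for any input made of valid characters.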

Paragraphs containing block elements in XHTML

I was updating the content of a span element via AJAX when I noticed that the content was duplicated. It was no rocket science: get element, replace content with the same thing that was rendered when the page was first loaded, just with updated values. How could it be duplicated?

Investigating the DOM in the browser (any browser, btw) I noticed something strange: when the page was first loaded, the content was next to the container element, not inside it. I looked at the page source, only to see that it was, by all counts, correct: a paragraph element containing a div and a table. The DOM would show the div inside the paragraph element and the table as the sibling of the paragraph element. The behavior was fixed if the page DOCTYPE was changed from XHTML into something else.

It appears that block elements should not be inside layout elements, like p. The browsers are attempting to "fix" this problem and so they change the DOM, assuming that if a table starts inside the paragraph, then you must have forgotten to close the paragraph. If I was adding it via ajax, the browser did not seem to want to fix the content in any way, as I was manipulating the DOM directly and there was no parsing phase.

State of the Union

I have been reviewing my blog posts for the last few months and I noticed a troubling trend: a lot more social commentary and hobby related stuff than actual tech work. Check out this statistic of posts in the last three months:
  • TV and Movie: 5
  • Books: 6
  • Personal or hobby: 6
  • Social commentary: 1
  • Tech: 8
8 is marginally more than 6, but split them between misc and programming and you get 18 misc for 10 programming (with some overlapping). And consider that two of the tech posts were attempts to fix something that did not work so well.

What does this mean? Do I not learn new stuff at work? Am I not interested in tech work anymore? Am I working too much and not having time to blog? Well, it is a bit of all. I am interested in tech work, but right now I am fighting to adapt to the new job. I am learning new stuff, but that is more office related than new frontiers of programming. And I am a bit tired as well.

I have been thinking of cool tech stuff to share with you at least in this post, but I could find none. I am reading a lot of blogs with new information about stuff ranging from Windows 8, .Net 5, the future of C# and Visual Studio to videos of Vesta, things that verge on proving the dark matter model is wrong and amazing BIOS rootkits, but that is not what I am doing.

So let me summarize the technical state of my work so far:
  • Scrum - my workplace uses Scrum as a development practice and invests a lot in maintaining the quality of its implementation. I've learned a lot about the advantages, but also the disadvantages of the practice (there is nothing as annoying as an Outlook alert that you need to do the daily scrum meeting when you are concentrated on a task)
  • Visual Basic - as the original application that was bought by my employing company 5 years ago was written in Visual Basic, large portions of it are still VB. That only proves my point that refactoring code should be a priority, not a nice to have option. I wonder how many developing hours, research hours and hair roots could have been saved if the company would have invested in moving the application to a readable and canonical code form. I also wonder if the guy that invented Visual Basic is now burning in hell, as so many devs with whom I've talked about VB seem to want.
  • Visual Basic - it just deserves two bullet points, for the bullet reason only at least. Also, try converting C# generic and lambda expression code to Visual Basic. Hilarious!
  • Computing power - I am now working on a laptop that has a Quad Core i7 processor, 8GB of RAM and a Solid State Drive. And I still want it 10 times faster. It seems to me that computing power is only keeping up with the size of the software projects and the complexity of the tools used to develop them, so that the total compile time for a project remains constant. Also, if for some reason the company issues you with a computer powerful enough to break the constant, they also need to enforce drive encryption so as to compensate.
  • Continuous Integration and Unit Testing - it gives one a good feeling of comfort to know that after "it works on my machine", the source control server can compile, test and run the software successfully (while you are working at something else, no less).
  • Software Patterns - there are people who can think and visualize software patterns. They can architect any piece of code and make it really neat. However, it now seems to me that over-architected software is just as hard to read and follow as non-architected software. Fortunately for me, my colleagues are more the smart "let's make it work" type.

That is about it. No magical silver bullet practices, no amazing software, no technological edge code, just plain software shop work.

Entourage ended

As an avid viewer of TV series and movies alike, I am always discussing the latest shows with my friends and I have been surprised to notice that not many knew about Entourage. I consider it a shame, as this series is exactly what a TV show needs to be and so few actually manage to do what it does.

Entourage is the story of a young talented actor who rises from anonymity with the help of three childhood friends. They are practically brothers and, even if he is the only one of them who "made it", they still live together and share everything, while navigating the weird world of Hollywood. The format of the show is short half an hour episodes that never leave you hanging when they end and that, for me at least, always provide a good feeling. I am not talking about silly ha-ha comedy here, I am talking about a lightweight dramedy that makes you smile. At the end of an episode you don't want more, you feel content, and you only begin to crave more when that contentment wears off. This is what today's media shows have forgotten how to do!

I was a bit sad to see the eighth and last season of the show end with its eighth episode a few days ago. I really wanted more of this and now that Entourage is no more, I know it will be hard to find a show that brings the same peace of mind after each episode. And if you haven't heard of or tried Entourage, you should. Good show!

The Sparrow (and a bit of Children of God), by Mary Doria Russell

The Sparrow is, in my opinion, a good example of politics in literature, as it won several prestigious prizes, but is, in my own view, only slightly above average. To quickly summarize the book, it's a Jesuit meets New World story, only the plot is set in the future and the new world is another planet.

Maybe I am just bitter because after tedious chapters about the relationships between the characters, the sci-fi was minimal and outdated. In fact, the author herself admitted that the kind of story she wanted to write could not be set on Earth because there are no new worlds here; this leads me to believe that the sci-fi was incidental and it bloody feels that way, too.

I will give credit where it is due, though. The woman documented herself well and presented interesting characters in great detail. Also the story itself is pretty solid, albeit a bit boring and focused too much on the religious. Other than that, it felt like a 1960's book. I was shocked to see that it was published in 1996. Some of the technical details described in the book were outdated even then.

About the plot, a discreet mission to a nearby planet is organized by the Jesuits, because all the other players in the field, like the United Nations, are too bureaucratic and talk about it more than actually doing something. The story starts with the return of the only survivor, priest Emilio Sandoz, and then the book continues with back and forth lines of now and then. The ending felt rushed and a bit anticlimactic. Maybe I was just not in an empathetic mood and couldn't care less about the religious and cultural sensitivities of the people involved; of all the characters, only Jimmy, the tall technical guy, felt agreeable. Who would have guessed? :)

Bottom line: I will not read the sequel book and that should say a lot. The book is not bad, though, and I may be berating it too much. A thing is certain, this is at the lower edge of the sci-fi scale, regardless of writing quality.

Update: In lack of a good book to read, I started on Children of God, the sequel to The Sparrow. Russell's talent for tedium, futile philosophy and rape fantasies reaches new heights in this book, so much that I just couldn't finish it. I usually finish watching movies and reading books even if they are bad, just for the experiential value alone, but I couldn't do it with this one. You have the same pointless priest, wallowing in self pity and then pathetically letting himself be used again, all the time asserting that he won't. And just so the reader would have no thoughts that the man could find a way out, a chapter of one of his kidnappers getting old on Rakhat, smack in the middle of the book, erases that possibility. What would be the point of reading on? Just for the ridiculous discussions about different Gods between random people? No. Just avoid this one.

Me playing black against my Nokia cellphone

Here is another chess game, one that is remarkable not as a chess game, but for how I perceived it. I thought I played on par with the cell phone up until the end, where I usually falter and this time the cell did. After ChessMaster's analysis, I've realised that the AI made huge mistakes, as have I. I chose to play black because I usually play white and whenever I play as black I am at a loss as to what to do, so the few games I've played lately with my cell or PDA were as black.

Also, in this post I am using a new style of annotation. ChessMaster saves PGN files in two ways: with analysis or with auto annotation, not both. I find the auto annotations very helpful when they explain how the game could have played, but not so helpful when I move a piece and it says I moved a piece or some other obvious thing like that. The analysis is more cryptic, but very helpful in understanding what the computer thought. Therefore I took the annotations I found useful and added them to the analysis file. I hope this is more helpful for the reader.

At the end you will see me play really strange, and that is because I was using my queen as a rook only while checkmating. The uncommented variation is the short end.

Well, I just removed the variation and the pointless mate. Just imagine the cell phone gained consciousness and resigned after I queened the pawn :) Also, if you see the post loading really slow, you should wait for it to end. I will try to optimize the chess board plugin when I have the time.

B00 King's Pawn Opening. The King's Pawn opening move is both popular and logical. It controls the center, opens lines for both the Queen and the Bishop, and usually leads to an open game in which tactics, rather than slow maneuvering,




C20 King's Pawn Game. Black responds symmetrically, making a direct challenge to the central squares.


