Sunday, 25 September 2011

Portable Game Notation and parsing it with regular expressions

Short version: here is the link to the uploaded .NET regular expression. (look in the comment for the updated version)

I noticed that the javascript code that I am using to parse PGN chess games and display it is rather slow and I wanted to create my own PGN parser, one that would be optimal in speed. "It should be easy", I thought, as I was imagining getting the BNF syntax for PGN, copy pasting it into a parser generator and effortlessly getting the Javascript parser that would spit out all secrets of the game. It wasn't easy.

First of all, the BNF notation for Portable Game Notation was not complete. Sure, text was used to explain the left overs, but there was no real information about it in any of the "official" PGN pages or Wikipedia. Software and chess related FTPs and websites seemed to be terrible obsolete or missing altogether.

Then there was the parser generator. Wikipedia tells me that ANTLR is pretty good, as it can spew Javascript code on the other end. I downloaded it (a .jar Java file - ugh!), ran it, pasted BNF into it... got a generic error. 10 minutes later I was learning that ANTLR does not support BNF, but only its own notation. Searches for tools that would do the conversion automatically led me to smartass RTFM people who explained how easy it is to do it manually. Maybe they should have done for me, then.

After all this (and many fruitless searches on Google) I decided to use regular expressions. After all, it might make a lot of sense to have a parser in a language like C#, but the difference in speed between a Javascript implementation and a native regular expression should be pretty large, no matter how much they optimize the engine. Ok, let's define the rules of a PGN file then.

In a PGN file, a game always starts with some tags, explaining what the event is, who played, when, etc. The format of a tag is [name "value"]. There are PGN files that do not have this marker, but then there wouldn't be more than one game inside. The regular expression for a tag is: (\[\s*(?<tagName>\w+)\s*"(?<tagValue>[^"]*)"\s*\]\s*)+. Don't be scared, it only means some empty space maybe, then a word, some empty space again, then a quoted string that does not contain quotes, then some empty space again, all in square brackets and maybe followed by more empty space, all of this appearing at least once.

So far so good, now comes the list of moves. The simplest possible move looks like 1. e4, so a move number and a move. But there are more things that can be added to a move. For starters, the move for black could be following next (1. e4 e5) or a bit after, maybe if there are commentaries or variations for the move of the white player (1... e5). The move itself has a variety of possible forms:
  • e4 - pawn moved to e4
  • Nf3 - knight moved to f3
  • Qxe5 - queen captured on e5
  • R6xf6 - the rook on the 6 rank captured on f6
  • Raa8 - The rook on file a moved to a8
  • Ka1xc2 - the knight at a1 captured on c2
  • f8=Q - pawn moved to f8 and promoted to queen
  • dxe8=B - pawn on the d file captured on e8 and promoted to bishop


There is more information about the moves. If you give check, you must end it with a + sign, if you mate you end with #, if the move is weird, special, very good, very bad, you can end it with stuff like !!, !?, ?, !, etc which are the PGN version of WTF?!. And if that is not enough, there are some numbers called NAG which are supposed to represent a numeric, language independent, status code. Also, the letters that represent the pieces are not language independent, so a French PGN might look completely different from an English one. So let's attempt a regular expression for the move only. I will not implement NAG or other pieces for non-English languages: (?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?). I know, scary. But it means a letter in the list PNBRQK, one for each possible type of chess piece, which may appear or it may not, then a letter between a and h, which would represent a file, then a number between 1 and 8 which would represent a rank. Both letter and number might not appear, since they represent hints on where the piece that moved was coming from. Then there is a possible letter x, indicating a capture, then, finally, the destination coordinates for the move. There follows an equal sign and another piece, in case of promotion. An astute reader might observe that this also matches a rook that promotes to something else, for example. This is not completely strict. If this long expression is not matched, maybe something that looks like OO, O-O, OOO or O-O-O could be matched, representing the two possible types of castling, when a rook and a king move at the same time around each other, provided neither had not moved yet. And to top it off, we allow for some empty space and the characters ! and ? in order to let chess annotators express their feelings.

It's not over yet. PGN notation allows for commentaries, which are bits of text inside curly brackets {what an incredibly bad move!} and also variations. The variations show possible outcomes from the main branch. They are lists of moves that are enclosed in round brackets. The branches can be multiple and they can branch themselves! Now, this is a problem, as regular expressions are not recursive. But we only need to match variations and then reparse them in code when found. So, let's attempt a regular expression. It is getting quite big already, so let's add some tokens that can represent already discussed bits. I will use a @ sign to enclose the tokens. Here we go:
  • @tags@ - we will use this as a marker for one or more tags
  • @move@ - we will use this as a marker for the complicated move syntax explained above
  • (?<moveNumber>\d+)(?<moveMarker>\.|\.{3})\s*(?<moveValue>@move@)(?:\s*(?<moveValue2>@move@))?\s* - the move number, 1 or 3 dots, some empty space, then a move. It can be followed directly by another move, for black. Lets call this @line@
  • (?:\{(?<varComment>[^\}]*?)\}\s*)? - this is a comment match, something enclosed in curly brackets; we'll call it @comment@
  • (?:@line@@variations@@comment@)* - wow, so simple! Multiple lines, each maybe followed by variations and a comment. This would be a @list@ of moves.
  • (?<endMarker>1\-?0|0\-?1|1/2\-?1/2|\*)?\s* - this is the end marker of a game. It should be there, but in some cases it is not. It shows the final score or an unfinished match. We'll call it @ender@
  • (?<pgnGame>\s*@tags@@list@@ender@) - The final tokenised regular expression, containing an entire PGN game.


But it is not over yet. Remember @variations@ ? We did not define it and with good reason. A good approximation would be (?:\((?<variation>.*)\)\s*)*, which defines something enclosed in parenthesis. But it would not work well. Regular expressions are greedy by default, so it would just get the first round bracket and everything till the last found in the file! Using the non greedy marker ? would not work either, as the match will stop after the first closing bracket inside a variation. Comments might contain parenthesis characters as well.

The only solution is to better match a variation so that some sort of syntax checking is being performed. We know that a variation contains a list of moves, so we can use that, by defining @variations@ as (?:\((?<variation>@list@)\)\s*)*. @list@ already contains @variations@, though, so we can do this a number of times, to the maximum supported branch depth, then replace the final variation with the generic "everything goes" approximation from above. When we read the results of the match, we just take the variation matches and reparse them with the list subexpression, programatically, and check extra syntax features, like the number of moves being subsequent.

It is no wonder that at the Regular Expressions Library site there was no expression for PGN. I made the effort to upload it, maybe other people refine it and make it even better. Here is the link to the uploaded regular expression. The complete regular expression is here:
(?<pgnGame>\s*(?:\[\s*(?<tagName>\w+)\s*"(?<tagValue>[^"]*)"\s*\]\s*)+(?:(?<moveNumber>\d+)(?<moveMarker>\.|\.{3})\s*(?<moveValue>(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?)(?:\s*(?<moveValue2>(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?))?\s*(?:\(\s*(?<variation>(?:(?<varMoveNumber>\d+)(?<varMoveMarker>\.|\.{3})\s*(?<varMoveValue>(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?)(?:\s*(?<varMoveValue2>(?:[PNBRQK]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?))?\s*(?:\((?<varVariation>.*)\)\s*)?(?:\{(?<varComment>[^\}]*?)\}\s*)?)*)\s*\)\s*)*(?:\{(?<comment>[^\}]*?)\}\s*)?)*(?<endMarker>1\-?0|0\-?1|1/2\-?1/2|\*)?\s*)


Note: the flavour of the regular expression above is .Net. Javascript does not support named tags, the things between the angle brackets, so if you want to make it work for js, remove ?<name> constructs from it.

Now to work on the actual javascript (ouch!)

Update: I took my glorious regular expression and used it in a javascript code only to find out that groups in Javascript do not act like collections of found items, but only the last match. In other words, if you match 'abc' with (.)* (match as many characters in a row, and capture each character in part) you will get an array that contains 'abc' as the first item and 'c' as the second. That's insane!

Update: As per Matty's suggestion, I've added the less used RxQ move syntax (I do have a hunch that it is not complete, for example stuff like RxN2, RxNa or RxNa2 might also be accepted, but they are not implemented in the regex). I also removed the need for at least one PGN tag. To avoid false positives you might still want to use the + versus the * notation after the tagName/tagValue construct. The final version is here:

(?<pgnGame>\s*(?:\[\s*(?<tagName>\w+)\s*"(?<tagValue>[^"]*)"\s*\]\s*)*(?:(?<moveNumber>\d+)(?<moveMarker>\.|\.{3})\s*(?<moveValue>(?:[PNBRQK]?[a-h]?[1-8]?x?(?:[a-h][1-8]|[NBRQK])(?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?)(?:\s*(?<moveValue2>(?:[PNBRQK]?[a-h]?[1-8]?x?(?:[a-h][1-8]|[NBRQK])(?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?))?\s*(?:\(\s*(?<variation>(?:(?<varMoveNumber>\d+)(?<varMoveMarker>\.|\.{3})\s*(?<varMoveValue>(?:[PNBRQK]?[a-h]?[1-8]?x?(?:[a-h][1-8]|[NBRQK])(?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?)(?:\s*(?<varMoveValue2>(?:[PNBRQK]?[a-h]?[1-8]?x?(?:[a-h][1-8]|[NBRQK])(?:\=[PNBRQK])?|O(-?O){1,2})[\+#]?(\s*[\!\?]+)?))?\s*(?:\((?<varVariation>.*)\)\s*)?(?:\{(?<varComment>[^\}]*?)\}\s*)?)*)\s*\)\s*)*(?:\{(?<comment>[^\}]*?)\}\s*)?)*(?<endMarker>1\-?0|0\-?1|1/2\-?1/2|\*)?\s*)

The Regexlib version has also been updated in a comment (I don't know how - or if it is possible - to edit the original).

4 comments:

  1. Nice regexes dude :-)
    There's a ply format that you don't list or parse - the destination square doesn't have to be specified in algebraic form; it might instead name the piece on the destination square.

    For example:
    RxQ

    This is permitted as long as it is unambiguous, ie when there is only one rook of the moving colour that can capture a Queen of enemy colour.

    Also, the meaning of the English in the section about the return value of Javascript's String.prototype.match is not clear to me. Perhaps you could clarify it?

    As far as I understand you, you expect a greedy /(.)*/ to perform non-greedily, returning ['a', 'b', 'c'] on input 'abc', and you expect match() to return multiple matches even when the 'global' flags was not specified.

    By experiment, '123'.match(/(.)+?/g) does what you want. Note the 'g' specifier.

    The only way I can see to get the result you describe, "an array that contains 'abc' as the first item and 'c' as the second" ie ['abc', 'c'] is to accidentally omit the 'g' modifier and do:
    '123'.match(/(.)*/)
    in which case ['abc', 'c'] is the correct return value.

    There's nothing 'crazy' going on - the MDN documentation explains what's happening reasonably clearly:

    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match?redirectlocale=en-US&redirectslug=JavaScript%2FReference%2FGlobal_Objects%2FString%2Fmatch

    ReplyDelete
  2. First of all thank you for the detailed comment (which proves you actually read my entire entry! :-O )

    I was not familiar with the RxQ notation. I guess I should include it.

    I did not mean non-greedy matching, I was referring to the way the Regex class works in C#. For var match=new Regex(@"(.)*").Match("abc"); it would return a Match object with Success=true, a Value="abc" and a Groups[1].Value="c". But match.Groups[1].Captures would be a CaptureCollection holding the values of "a", "b" and "c".

    In comparison, in Javascript there is no similar functionality, the returned object having preserved just the values and not all the captures. In conclusion, there is no way to match an entire PGN game, then move through the resulting match object and getting all the moves and information. You can only use it to detect if a PGN game is present in a string.

    Again, thanks for posting.

    ReplyDelete
  3. "There's a ply format that you don't list or parse - the destination square doesn't have to be specified in algebraic form; it might instead name the piece on the destination square.

    For example:
    RxQ

    This is permitted as long as it is unambiguous, ie when there is only one rook of the moving colour that can capture a Queen of enemy colour."

    This is rubbish. RxQ is never allowed in PGN. It is used in Descriptive Notation, the old P-K4 style.

    Your Regex shouldn't support RxQ any more than it should support calling the Knight a Weasel and having moves like King takes Weasel.

    ReplyDelete
  4. Maybe not allowed in PGN but in online chess games where manual input is accepted, moves like RxQ and even rxq or possible rq should be taken care of.

    ReplyDelete