Thursday, April 2, 2009

DFA - Behind the Scenes

Just posting for anyone interested. Something I learned while handling a large part of the finite state automation code was working with subsets, full sets, and so on. In order to ease the pain of working with a large series of characters, I created a 'RegularLanguageBitArray' which originally wrapped the Systems.Collections.BitArray. One thing I found with using it is with the frequency at which I was editing the values, it would probably be quicker to implement my own variation on the problem. Bitwise And, Or, Exclusive Or were all handled on a bit by bit basis, which doesn't really make sense if you're working with sets that can potentially span thousands of bits.

Thus the 'MinimalSetData' was created. A specially written variant of BitArray, a simple data structure aimed at easing the use of subsets which are a part of a bigger set. They contain an offset on where the first bit is set, a length, as well as the length of the full set they're a part of. On top of this, inverted sets were common in the reg-ex-ish syntax I provided for the parser compiler, so having a means to have all these together for reuse later on was helpful.

I utilized this special bit array functionality in three areas: Character ranges, token sub-set ranges, and syntactical ranges (for what rules, tokens, and token sub-elements).

The next step is to utilize the information provided by a few other classes (which I'll go into later) and perform look-ahead resolution. Based upon what I've been reading the type of parser I'm creating is similar to the Earley Parser, since I'm trying for a non-backtracking system that'll probably use a state-based system to handle the parse process. Here's hoping it doesn't have the same operational time (O(n3)).