Well,
After two years in the project, I decided a few days ago to start a major overhaul of the compiler's build code. In these two years I learned the basic structure of NFA, DFA, recognizers vs. transducers, look-ahead determination, creating a state-machine which can figure out the look-ahead of the langauge at any given point; however, in all of this, the code has become quite cluttered and needs a major rewrite.
I often had to spend a good twenty minutes to find my bearings in code I had just written the day before.
To avoid creating inferior code, I decided that it was a good time to take what I learned and put it to use: create common abstractions on the DFA/NFA constructs observed within the language. This will allow not only cleaner code, but code which can be more readily adapted to new uses later down the line.
To track the progress of this refresh, I've posted a checklist to its associated CodePlex discussions page.
Friday, September 4, 2009
Build: Refresh
Friday, August 28, 2009
LL(∞)
Finally,
I have the parser compiler somewhere that I can say is progress, and with that, comes knowing that the class of parser I'm writing a compiler for: LL(∞). This still places it in the LL parser category (top-down, left-most derivative); however, it means that the lookahead for the grammars is unbound to any specific k.
I've written a very simplistic recognizer (below) to go with the look-ahead state machine. Once the ahead aspect was finished it was child's play. Now comes the hard part: figuring out if this is a sufficiently structured machine to allow for data mining. Recognition is one thing, but capturing the essential elements is another. I don't doubt that this project is just beginning.
public ICSCSharpFileRule Parse(string filename)
{
Stream s = new FileStream(filename, FileMode.Open);
this.currentScanner.SetStream(s);
CSStreamState currentState = rootState;
var ambiguityStack = Pusher(new { Location = long.MinValue, Entry = (CSharpReaderScanData.Entry)null, State = (CSStreamState)null, Transition = new TokenTransition() });
var errorPoints = Pusher(new { Location = long.MinValue, Transition = new TokenTransition() });
var errorPoint = errorPoints.Pop();
var lastAmbiguity = ambiguityStack.Pop();
errorPoint = null;
lastAmbiguity = null;
int states = 0;
bool hasValidExit = false;
while (true)
{
if (lastAmbiguity != null)
this.currentScanner.MoveTo(lastAmbiguity.Location + lastAmbiguity.Entry.Length);
var currentScanData = this.currentScanner.NextToken(currentState.Transitions.FullSet);
if (currentScanData.Count == 0)
errorPoints.Push(new { Location = currentScanData.Location, Transition = currentState.Transitions.FullSet });
else
{
foreach (var entry in currentScanData)
ambiguityStack.Push(new { Location = currentScanData.Location, Entry = entry, State = currentState, Transition = entry.GetTransition() });
}
if (ambiguityStack.Count == 0)
break;
var currentAmbiguity = ambiguityStack.Pop();
//Unhinged token check.
if (currentAmbiguity.State.Transitions.ContainsKey(currentAmbiguity.Transition))
currentState = currentAmbiguity.State.Transitions[currentAmbiguity.Transition];
else
⊞...⊟/* *
* Token doesn't exist in valid set.
* It's still parsed because it can be
* ignored.
* */
currentState = currentAmbiguity.State;
⊞...⊟/* *
* If the entire language parser has reached
* either the end of the file, or there are
* no valid transitions remaining, the parse
* is done.
* */
if (currentAmbiguity.Entry.ID == CSharpReaderTokens.EndOFile ||
currentState.Transitions.Count == 0)
{
hasValidExit = true;
break;
}
lastAmbiguity = currentAmbiguity;
}
if (!hasValidExit)
{
foreach (var currentErrorPoint in errorPoints)
{
if (errorPoint == null)
errorPoint = currentErrorPoint;
else if (currentErrorPoint.Location > errorPoint.Location)
errorPoint = currentErrorPoint;
}
Console.WriteLine("{0} - Expected {1}, States: {2}", errorPoint.Location, errorPoint.Transition, states);
}
s.Dispose();
return null;
}
public static Stack<T> Pusher<T>(T root)
{
var result = new Stack<T>();
result.Push(root);
return result;
}
One of the fun things I've gained from all this is a better understanding of what types of lists to use and when to use them (Stack<T>, Queue<T>, HashSet<T>, List<T>). I also goofed and used LINQ to perform an order operation on the transitions within a given state. While alone not a terrible thing, it isn't exactly a bright one, either. By itself on a 5300+ state parse, it added 1.2 seconds to the evaluation time. I've also cleaned up the first/follow set calculations (before they were entirely separate) and added a third level of cache (to the transition table construction).
The third level of cache only yields a time savings of a few hundred milliseconds on an expression-heavy verify, but the frequency at which the cache finds a match was rather surprising (more often matched than didn't match).
As an interesting point of comparison: a little over a year ago, to parse a 150KB document without context sensitivity, it took 800 milliseconds. The same file, without context sensitivity takes 25 milliseconds, roughly 32 times faster. Guess deterministic finite automation is faster :]
Sadly, with context sensitivity a smaller file (but with higher cyclomatic complexity) takes 1.3 seconds to parse the first time. The second time it takes 50 milliseconds. The difference is due to the necessary code to construct the lookahead initial/first/follow sets and associated transition tables.
Hopefully there's still some work I can do to expidite the entire matter. I have some ideas; however such optimizations I wish to delay until I have a properly functioning parser. I don't want to further complicate an already complex matter.
Saturday, July 25, 2009
C♯ Parser
Well,
It's taken a long time and a lot of research but I'm finally making headway on the C♯ parser I've been writing (well writing the parser compiler to write...).
Previously, I decided to attempt a backtrack-less, left-most derivation, recursive descent parser. In the end, all the research told me: it's not worth it. Other LL(*) implementations, such as ANTLR, that use recursive descent and error recovery features to handle back-tracking, typically involve large amounts of spaghetti code that make reading or discovering the original language description nigh impossible.
So I thought: there has to be a better way. So far the research into that has been a struggle, there's little information, from what I can tell, on non-recursive descent parsing techniques that are top-down left-most derivation in nature, especially if you're interested in LL(*) vs LL(k).
Another issue with Recursive Descent is Left recursion, in three different varieties: Direct, indirect and hidden. Hidden is the most irksome case, but important nonetheless. To alleviate this entire issue, I shifted gears and started down a deterministic model that would work similar to a DFA for Regular Expressions, except this one would accept recursion.
Instead of pre-compiling this information into the application, due to cyclic model issues, I decided that it would be more appropriate to focus on embedding the logic to compute this directly in the resultant application. With this, the application can delay large computational operations related to look-ahead/follow sets until the information is absolutely necessary (as in, what's next right now?)
Presently I'm hand-writing the logic, using the C# parser as a basis for data, and once I'm done I can bake the logic associated into the parser builder. I'll post more once there's more to be said.
Sunday, June 14, 2009
Language Integrated Query - Manual or Aided Objectified Building
Well,
Today's project was implementing an objectified representation of Language Integrated Query. Even before I started, I knew that I wanted it to be as seamless as the original, so I decided to create two variants of the structure: your standard AST form, and a 'builder' form. Where the builder takes responsibility for selecting the proper body type (select, group, select into, group into), initializing the expression, including the clauses in the order you specify them, and then finally building the expression.
Here's an example of both versions, I used an example from the C# Programming guide (How to: Perform Left Outer Joins) because I needed a simple example for demonstrative purposes.
First up is the method to build the expressions through the normal do it yourself method:
ILinqExpression queryTestA = new LinqExpression();
queryTestA.From = new LinqFromClause(/* rangeVariableName: */"person", /* rangeSource: */"people".GetSymbolExpression());
var queryTestABody = new LinqSelectBody();
queryTestA.Body = queryTestABody;
queryTestABody.Clauses.Join(/* rangeVariableName: */"pet", /* rangeSource: */"pets".GetSymbolExpression(), /* conditionLeft: */"person".GetSymbolExpression(), /* conditionRight: */"pet".Fuse("Owner"), /* intoRangeName: */"gj");
queryTestABody.Clauses.From(/* rangeVariableName: */"subPet", /* rangeSource: */"gj".Fuse("DefaultIfEmpty").Fuse(new IExpression[0]));
queryTestABody.Selection = "string".Fuse("Format").Fuse("{0,-15}{1}".ToPrimitive(), "person".Fuse("FirstName"), ((Symbol)"subPet").EqualTo(IntermediateGateway.NullValue).IIf("string".Fuse("Empty"), "subPet".Fuse("Name")));
Next is the seamless version using the aid:
ILinqExpression queryTestB = LinqHelper
.From(/* rangeVariableName: */"person", /* rangeSource: */(Symbol)"people")
.Join(/* rangeVariableName: */ "pet", /* rangeSource: */(Symbol)"pets", /* conditionLeft: */(Symbol)"person", /* conditionRight: */"pet".Fuse("Owner"), /* intoRangeName: */"gj")
.From(/* rangeVariableName: */"subPet", /* rangeSource: */"gj".Fuse("DefaultIfEmpty").Fuse(new IExpression[0]))
.Select(/* selection: */"string".Fuse("Format").Fuse("{0,-15}{1}".ToPrimitive(), "person".Fuse("FirstName"), ((Symbol)"subPet").EqualTo(IntermediateGateway.NullValue).IIf("string".Fuse("Empty"), "subPet".Fuse("Name")))).Build();
So far it seems to pretty well emphasize the difference. Earlier today the selection line was horrendous due to constructing binary operations, I hadn't yet created extensions for simplifying their construction. Ensuring the operands are affixed to the proper precedence created very nasty looking code. I didn't use the original example's anonymous type initialization because I simply haven't created the ground work for it yet, so that's probably next on my list.
Here's an output to the console on printing the string version of the expression (defaults to C# syntax in the ToString overrides, for this programmer's convenience):
from person in people
join pet in pets on person equals pet.Owner into gj
from subPet in gj.DefaultIfEmpty()
select string.Format("{0,-15}{1}", person.FirstName, subPet == null ? string.Empty : subPet.Name)
And in doing so I noticed one thing, I have to be careful about examples, I used a 'string' instead of 'String' as a symbol. While C# might understand that when emitting, and VB's nicety of being case-insensitive will to, other language's that don't support it, won't.
