After a few years of working on a side project, aimed at replacing the simpler Code Generation Framework named the Objectified Intermediate Language, I've finally started to get to the point where the functional parity between the two is equivalent. The newer OIL framework allows more and understands the concepts behind true generic typing.
The initial tests of constant folding, type-parameter constraint validation, short-circuiting, simple member look-up, expression typing, and type inference, have all been removed for a simple reason: the tests involved no scope, had no concept of access level controls, and lacked the ingenuity of the real thing.
Once work has been finished on the generic replacement mechanisms I'll focus on structuring the actual phase based compiler.
It's been a long haul, but I've only just begun.
Saturday, September 17, 2011
Abstraction - Core
Friday, September 4, 2009
Build: Refresh
Well,
After two years in the project, I decided a few days ago to start a major overhaul of the compiler's build code. In these two years I learned the basic structure of NFA, DFA, recognizers vs. transducers, look-ahead determination, creating a state-machine which can figure out the look-ahead of the langauge at any given point; however, in all of this, the code has become quite cluttered and needs a major rewrite.
I often had to spend a good twenty minutes to find my bearings in code I had just written the day before.
To avoid creating inferior code, I decided that it was a good time to take what I learned and put it to use: create common abstractions on the DFA/NFA constructs observed within the language. This will allow not only cleaner code, but code which can be more readily adapted to new uses later down the line.
To track the progress of this refresh, I've posted a checklist to its associated CodePlex discussions page.
Friday, August 28, 2009
LL(∞)
Finally,
I have the parser compiler somewhere that I can say is progress, and with that, comes knowing that the class of parser I'm writing a compiler for: LL(∞). This still places it in the LL parser category (top-down, left-most derivative); however, it means that the lookahead for the grammars is unbound to any specific k.
I've written a very simplistic recognizer (below) to go with the look-ahead state machine. Once the ahead aspect was finished it was child's play. Now comes the hard part: figuring out if this is a sufficiently structured machine to allow for data mining. Recognition is one thing, but capturing the essential elements is another. I don't doubt that this project is just beginning.
public ICSCSharpFileRule Parse(string filename)
{
Stream s = new FileStream(filename, FileMode.Open);
this.currentScanner.SetStream(s);
CSStreamState currentState = rootState;
var ambiguityStack = Pusher(new { Location = long.MinValue, Entry = (CSharpReaderScanData.Entry)null, State = (CSStreamState)null, Transition = new TokenTransition() });
var errorPoints = Pusher(new { Location = long.MinValue, Transition = new TokenTransition() });
var errorPoint = errorPoints.Pop();
var lastAmbiguity = ambiguityStack.Pop();
errorPoint = null;
lastAmbiguity = null;
int states = 0;
bool hasValidExit = false;
while (true)
{
if (lastAmbiguity != null)
this.currentScanner.MoveTo(lastAmbiguity.Location + lastAmbiguity.Entry.Length);
var currentScanData = this.currentScanner.NextToken(currentState.Transitions.FullSet);
if (currentScanData.Count == 0)
errorPoints.Push(new { Location = currentScanData.Location, Transition = currentState.Transitions.FullSet });
else
{
foreach (var entry in currentScanData)
ambiguityStack.Push(new { Location = currentScanData.Location, Entry = entry, State = currentState, Transition = entry.GetTransition() });
}
if (ambiguityStack.Count == 0)
break;
var currentAmbiguity = ambiguityStack.Pop();
//Unhinged token check.
if (currentAmbiguity.State.Transitions.ContainsKey(currentAmbiguity.Transition))
currentState = currentAmbiguity.State.Transitions[currentAmbiguity.Transition];
else
⊞...⊟/* *
* Token doesn't exist in valid set.
* It's still parsed because it can be
* ignored.
* */
currentState = currentAmbiguity.State;
⊞...⊟/* *
* If the entire language parser has reached
* either the end of the file, or there are
* no valid transitions remaining, the parse
* is done.
* */
if (currentAmbiguity.Entry.ID == CSharpReaderTokens.EndOFile ||
currentState.Transitions.Count == 0)
{
hasValidExit = true;
break;
}
lastAmbiguity = currentAmbiguity;
}
if (!hasValidExit)
{
foreach (var currentErrorPoint in errorPoints)
{
if (errorPoint == null)
errorPoint = currentErrorPoint;
else if (currentErrorPoint.Location > errorPoint.Location)
errorPoint = currentErrorPoint;
}
Console.WriteLine("{0} - Expected {1}, States: {2}", errorPoint.Location, errorPoint.Transition, states);
}
s.Dispose();
return null;
}
public static Stack<T> Pusher<T>(T root)
{
var result = new Stack<T>();
result.Push(root);
return result;
}
One of the fun things I've gained from all this is a better understanding of what types of lists to use and when to use them (Stack<T>, Queue<T>, HashSet<T>, List<T>). I also goofed and used LINQ to perform an order operation on the transitions within a given state. While alone not a terrible thing, it isn't exactly a bright one, either. By itself on a 5300+ state parse, it added 1.2 seconds to the evaluation time. I've also cleaned up the first/follow set calculations (before they were entirely separate) and added a third level of cache (to the transition table construction).
The third level of cache only yields a time savings of a few hundred milliseconds on an expression-heavy verify, but the frequency at which the cache finds a match was rather surprising (more often matched than didn't match).
As an interesting point of comparison: a little over a year ago, to parse a 150KB document without context sensitivity, it took 800 milliseconds. The same file, without context sensitivity takes 25 milliseconds, roughly 32 times faster. Guess deterministic finite automation is faster :]
Sadly, with context sensitivity a smaller file (but with higher cyclomatic complexity) takes 1.3 seconds to parse the first time. The second time it takes 50 milliseconds. The difference is due to the necessary code to construct the lookahead initial/first/follow sets and associated transition tables.
Hopefully there's still some work I can do to expidite the entire matter. I have some ideas; however such optimizations I wish to delay until I have a properly functioning parser. I don't want to further complicate an already complex matter.
Saturday, July 25, 2009
C♯ Parser
Well,
It's taken a long time and a lot of research but I'm finally making headway on the C♯ parser I've been writing (well writing the parser compiler to write...).
Previously, I decided to attempt a backtrack-less, left-most derivation, recursive descent parser. In the end, all the research told me: it's not worth it. Other LL(*) implementations, such as ANTLR, that use recursive descent and error recovery features to handle back-tracking, typically involve large amounts of spaghetti code that make reading or discovering the original language description nigh impossible.
So I thought: there has to be a better way. So far the research into that has been a struggle, there's little information, from what I can tell, on non-recursive descent parsing techniques that are top-down left-most derivation in nature, especially if you're interested in LL(*) vs LL(k).
Another issue with Recursive Descent is Left recursion, in three different varieties: Direct, indirect and hidden. Hidden is the most irksome case, but important nonetheless. To alleviate this entire issue, I shifted gears and started down a deterministic model that would work similar to a DFA for Regular Expressions, except this one would accept recursion.
Instead of pre-compiling this information into the application, due to cyclic model issues, I decided that it would be more appropriate to focus on embedding the logic to compute this directly in the resultant application. With this, the application can delay large computational operations related to look-ahead/follow sets until the information is absolutely necessary (as in, what's next right now?)
Presently I'm hand-writing the logic, using the C# parser as a basis for data, and once I'm done I can bake the logic associated into the parser builder. I'll post more once there's more to be said.
