Sunday, June 14, 2015

Beginning Automation for Parser Generation

Got here quite a bit sooner than expected!

I decided to hand-write the parser, applying the lessons learned from the lexer, and took away a few things:

  • The lexer is fast - around 6-8 times faster than ANTLR4's, assuming the future GetLineIndex and GetColumnIndex don't eat into that time.  I suspect I'll be happy just to match ANTLR4 by the end of this.  This doesn't approach the speed of a hand-written C++ parser, but I'm happy for now.
  • The parser context building phase is slow.
    • I plan on lifting the conceptual components of a given parse into two formats:
      1. Concrete - This is the actual structure of the parse
      2. Abstract - This is the fat-trimmed version of the CST (aka: AST)
    • In my hand-written test (using the super-simple JSON language), building both contexts as you parse is what's slow; given the potential for ambiguity that will arise from a 'Collapse Point' perspective (see below), I think the AST phase should be deferred.
Collapse Points are rules that exist solely as alternations of other rules or tokens.  They serve no purpose other than outlining what's valid at that point.  As a shortcut, you can condense the AST so that the node which references the alternation points directly at the chosen alternative.

So if I have:
A ::=> B | C;

X ::= A;

B ::= ...;
C ::= ...;

The resulting grammar would build a tree like so:

X - A - B
X - A - C

But with 'Collapse points' we get (for the AST):
X - B
X - C

This doesn't seem like a major change, but you might have five or six different 'collapse points' in a row.  If you had:
X - A - B - C - D - E - (F | G | H | I)
...(assume many other variants are here)...
where A, B, C, D, and E are all 'collapse points', your X would point directly to F, G, H, or I, instead of hopping through five levels of indirection.
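To make that concrete, here's a rough sketch of how a CST-to-AST projection can skip chains of collapse points.  The type and member names here are made up for illustration; they are not the generator's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical CST node; names are illustrative only.
class CstNode
{
    public string Rule;
    public bool IsCollapsePoint;              // pure alternation of other rules/tokens
    public List<CstNode> Children = new List<CstNode>();

    public CstNode(string rule, bool collapse, params CstNode[] children)
    {
        Rule = rule;
        IsCollapsePoint = collapse;
        Children.AddRange(children);
    }

    // AST projection: walk through any chain of collapse points so the
    // parent links directly to the first non-collapse descendant.
    public CstNode ToAst()
    {
        var node = this;
        while (node.IsCollapsePoint && node.Children.Count == 1)
            node = node.Children[0];
        return new CstNode(node.Rule, false,
            node.Children.Select(c => c.ToAst()).ToArray());
    }
}
```

With X - A - B where A is a collapse point, X.ToAst() yields X - B directly.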

Now this is all fine and good, but my next task is implementing this through automation.  I think if I make the CST build the AST on demand only, we can shave a lot of time off of the resultant parse.  Right now ANTLR4 parses certain geological data in 600ms; I'm at 400ms, and I want that lower.  I'm doing a little more work than ANTLR4 on the context lifting, so I hope this time can drop further.
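The on-demand part can be sketched as deferring the AST projection behind a Lazy&lt;T&gt;, so no AST work happens during the parse itself.  Again, these are illustrative names, not the real context types:

```csharp
using System;

// Sketch: the CST context is built eagerly while parsing; the AST view is
// only materialized the first time someone asks for it.
class RuleContext
{
    public string Rule { get; }
    private readonly Lazy<RuleContext> _ast;

    public RuleContext(string rule)
    {
        Rule = rule;
        // BuildAst runs once, on first access to Ast - never during the parse.
        _ast = new Lazy<RuleContext>(BuildAst);
    }

    public RuleContext Ast => _ast.Value;

    private RuleContext BuildAst()
    {
        // The real implementation would walk the children, trimming the fat
        // and skipping collapse points; this stub just returns itself.
        return this;
    }
}
```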

For those who are interested, here's a general idea of what the hand-written version looks like (which is close to how the automated version will look):

private IEntryRule ParseEntryInternal(int laDepth)
{
    var localContext = this._currentContext = new JsonReaderRuleContext(this.TokenStream, JsonReaderSymbols.EntryRule);
    var entryResult = new EntryRule() { Context = localContext, Parent = this._currentRule };
    this._currentRule = entryResult;
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    long exitState = -1;
    JsonReaderSymbols identity;
    IJsonReaderRule body = null;
    this.State = 7;
    identity = this.SymbolStream.LookAhead(currentLADepth);
    switch (identity)
    {
        case JsonReaderSymbols.JsonPunctuationToken_ArrayStart:
            this.State = 8;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseArrayInternal(currentLADepth++);
            goto DecisionPoint;
        case JsonReaderSymbols.JsonPunctuationToken_ObjectStart:
            this.State = 9;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseObjectInternal(currentLADepth++);
            goto DecisionPoint;
    }
DecisionPoint:
    int symbolLength = currentLADepth - laDepth;
    switch (exitState)
    {
        case 8:
        case 9:
            break;
        case 7:
            //ToDo: Error reporting based off of what is valid at the time.
            break;
    }
    /* -----------------\
    |  State 7          |
    |      [Object]->9  |
    |      [Array]->8   |
    \----------------- */
    // Edge state 8
    // Edge state 9
    this._currentRule = entryResult.Parent;
    this._currentContext = (JsonReaderRuleContext)entryResult.Context.Parent;
    return entryResult;
}
