Tuesday, June 30, 2015

Predictions (Nearly) There

Well,

After reaching LL(1) status earlier, I've been working on making sure the predictions are accurate.  After a lot of reworking, they appear to be getting closer.

A preview is below:


private int _Predict2OnMultExp(int laDepth)
{
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    int exitState = -1;
    this._state = 20;
PredictMultiTargetState_20:
    switch (this._symbolStream.LookAhead(currentLADepth))
    {
        case SCReaderSymbols.SCRMultExpRecursive:
            exitState = this._state = 21;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case SCReaderSymbols.SCTOperators_Divide:
                case SCReaderSymbols.SCTOperators_Modulus:
                case SCReaderSymbols.SCTOperators_Multiply:
                    exitState = this._state = 25;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_25;
                default:
                    this._symbolStream.PopNone();
                    return 12;
            }
        case SCReaderSymbols.SCRPrimExp:
            exitState = this._state = 22;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case SCReaderSymbols.SCTOperators_Divide:
                case SCReaderSymbols.SCTOperators_Modulus:
                case SCReaderSymbols.SCTOperators_Multiply:
                    exitState = this._state = 25;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_25;
                default:
                    this._symbolStream.PopNone();
                    return 11;
            }
        case SCReaderSymbols.SCRMultExp:
            exitState = this._state = 24;
            exitLADepth = ++currentLADepth;
            return 12;
        case SCReaderSymbols.SCTIdentifier:
        case SCReaderSymbols.SCTNumber:
            exitState = this._state = 23;
            exitLADepth = ++currentLADepth;
            return 11;
        default:
            goto DecisionPoint;
    }
PredictMultiTargetState_25:
    return 12;
DecisionPoint:
    return 0;
}

I still have a lot of work to do cleaning up decision results, but you'll notice the rule above is left recursive. In testing (by hand-writing it) it's quite possible to generate a recursive descent parser that performs reductions and handles left recursion. I just need to test further to see whether it applies to all variations of left recursion.
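The reduction approach can be sketched in a few lines. This is a hypothetical Python model of the idea, not the generated C# above: parse the non-recursive alternative of MultExp first, then keep reducing the result into a new MultExp node while another multiplicative operator follows.

```python
# Illustrative sketch: MultExp ::= MultExp ('*' | '/' | '%') PrimExp | PrimExp
# handled in recursive descent by parsing the non-recursive variation first,
# then reducing the result into itself while an operator follows.

def parse_prim_exp(tokens, pos):
    # Stand-in for the real PrimExp rule: one token per primary expression.
    return ('PrimExp', tokens[pos]), pos + 1

def parse_mult_exp(tokens, pos):
    """tokens is a list like ['3', '*', '4', '%', '2']; returns (tree, pos)."""
    left, pos = parse_prim_exp(tokens, pos)            # non-recursive variation
    while pos < len(tokens) and tokens[pos] in ('*', '/', '%'):
        op = tokens[pos]
        right, pos = parse_prim_exp(tokens, pos + 1)
        left = ('MultExp', left, op, right)            # reduce into itself
    return left, pos
```

The loop produces the left-associated tree the left-recursive rule describes, without the parser ever calling itself on its own first symbol.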

Friday, June 19, 2015

LL(1) Achieved

Well, I've achieved LL(1) status, which means that if a language only requires one token of look-ahead, the generator can parse it.
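As a toy illustration of what LL(1) means in practice (the table and names below are invented for the example, not taken from the generator): every parse decision can be resolved by consulting a (rule, look-ahead token) table with exactly one token of look-ahead.

```python
# Hypothetical predict table for a JSON-like 'Value' rule: each decision
# needs only the current rule and a single look-ahead token.
PREDICT = {
    ('Value', '{'): 'Object',
    ('Value', '['): 'Array',
    ('Value', 'number'): 'NumberLiteral',
}

def predict(rule, lookahead_token):
    # One token of look-ahead fully determines the production (or an error).
    return PREDICT.get((rule, lookahead_token), 'error')
```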

So far, explicit captures are working; the object model that represents the rules doesn't quite procure the data from the context yet, but it's a start (i.e. the data is there, but there are no properties to get at it).  I'm currently working on LL(∞); the major issue at the moment appears to be indirectly left-recursive states.  They throw the system for a loop, but I'm sure it's just a matter of time.

The prediction states are going to take a little while to get right.  The prediction pseudo-code appeared correct in initial tests, but I'll need to be careful that the prediction flow is accurate during reduction states.  Directly left-recursive rules will need to provide a boolean value to denote their initial state: once the non-recursive variation has been parsed, the rule can reduce to itself and check again.
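A rough Python sketch of that flag-based idea (all names hypothetical; here the accumulated node stands in for the boolean, with `left is None` playing the role of "initial state"):

```python
def parse_atom(tokens, pos):
    # Stand-in for the non-recursive alternative of the rule.
    return ('Atom', tokens[pos]), pos + 1

def parse_add_exp(tokens, pos, left=None):
    """AddExp ::= AddExp '+' Atom | Atom, driven by an initial-state indicator."""
    if left is None:
        # Initial state: only the non-recursive variation is legal here.
        left, pos = parse_atom(tokens, pos)
        return parse_add_exp(tokens, pos, left)   # reduce to itself, check again
    if pos < len(tokens) and tokens[pos] == '+':
        right, pos = parse_atom(tokens, pos + 1)
        return parse_add_exp(tokens, pos, ('AddExp', left, '+', right))
    return left, pos                              # nothing more to reduce
```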

Sunday, June 14, 2015

Beginning Automation for Parser Generation

Got here quite a bit sooner than expected!

I decided to hand-write the necessary steps for the Parser from the lessons learned from the Lexer, and I had a few things to take away with it:

  • The lexer is fast - Around 6-8 times faster than ANTLR4, assuming the future GetLineIndex and GetColumnIndex don't eat that time.  I suspect I'll be happy to match ANTLR4 by the end of this.  This doesn't approach C++ hand-written parser speeds, but I'm happy for now.
  • The parser context building phase is slow.
    • I plan on lifting the conceptual components of a given parse into two formats:
      1. Concrete - This is the actual structure of the parse
      2. Abstract - This is the fat-trimmed version of the CST (aka: AST)
    • In my hand-written test (using the super simple JSON language), the dual context building as you parse is what's slow; given the potential for ambiguity that will arise from a 'Collapse Point' perspective (see below), I think the 'AST' phase should be deferred.
Collapse Points are rules that exist solely as alternations of other rules or tokens.  They serve no purpose other than to outline what's valid at that point.  As a shortcut, you could in theory (and in practice now, I guess) condense the AST such that the node above the alternation points directly to the chosen alternative.

So if I have:
A ::=> B | C;

X ::= A;

B ::= ...;
C ::= ...;

The resulting grammar should build a tree like so:

X - A - B
or
X - A - C

But with 'Collapse points' we get (for the AST):
X - B
or
X - C

This doesn't seem like a major change, but you might have five or six different 'collapse points' in a row.  If you had:
X - A - B - C - D - E - (F | G | H | I)
...(assume many other variants are here)...
Where A, B, C, D, and E are all 'collapse points', your X would point directly to F, G, H, or I, instead of having to hop through five levels of indirection.
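The collapse can be modeled as a simple post-pass over the CST (a Python sketch with illustrative names; in the real system the set of collapse points would come from the grammar analysis rather than being hard-coded):

```python
# Rules that are pure alternations contribute nothing to the AST and are
# skipped, so the parent hops straight to the chosen alternative.
COLLAPSE_POINTS = {'A', 'B', 'C', 'D', 'E'}

def to_ast(node):
    """node is (rule_name, [children]) or a raw token string."""
    if isinstance(node, str):
        return node
    rule, children = node
    kids = [to_ast(c) for c in children]
    if rule in COLLAPSE_POINTS and len(kids) == 1:
        return kids[0]                 # collapse: skip the alternation node
    return (rule, kids)
```

Running this over a chain like X - A - B - F yields X pointing directly at F.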

Now this is all fine and good, but my next task is implementing this through automation.  I think if I make the CST build the AST on demand only, we can shave a LOT of time off of the resultant parse.  Right now ANTLR4 parses certain geological data in 600ms; I'm at 400ms, and I want that lower.  I'm doing a little more work than ANTLR4 on the context lifting, so I hope this time can drop down further.
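One way to build the AST on demand, as suggested above, is to make it a lazily computed, cached projection of the CST. A Python sketch with hypothetical names (the real system is C#, where a memoized property would play the same role):

```python
from functools import cached_property

COLLAPSE_POINTS = {'A'}   # illustrative; would come from grammar analysis

def collapse(node):
    # Same collapse-point projection as before, applied lazily.
    if isinstance(node, str):
        return node
    rule, children = node
    kids = [collapse(c) for c in children]
    return kids[0] if rule in COLLAPSE_POINTS and len(kids) == 1 else (rule, kids)

class ParseResult:
    """The CST is built eagerly during the parse; the AST view is deferred."""
    def __init__(self, cst):
        self.cst = cst

    @cached_property
    def ast(self):
        # Computed on first access, then memoized; parses that never
        # inspect the AST pay nothing for it.
        return collapse(self.cst)
```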

For those who are interested, here's a general idea of what the hand-written version looks like (which is close to how the automated version will look):

private IEntryRule ParseEntryInternal(int laDepth)
{
    var localContext = this._currentContext = new JsonReaderRuleContext(this.TokenStream, this._currentContext, JsonReaderSymbols.EntryRule);
    var entryResult = new EntryRule() { Context = this._currentContext, Parent = this._currentRule };
    this._currentRule = entryResult;
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    long exitState = -1;
    JsonReaderSymbols identity;
    IJsonReaderRule body = null;
MultitargetState_7:
    this.State = 7;
    identity = this.SymbolStream.LookAhead(currentLADepth);
    switch (identity)
    {
        case JsonReaderSymbols.JsonPunctuationToken_ArrayStart:
            this.State = 8;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseArrayInternal(currentLADepth++);
            goto DecisionPoint;
        case JsonReaderSymbols.JsonPunctuationToken_ObjectStart:
            this.State = 9;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseObjectInternal(currentLADepth++);
            goto DecisionPoint;
        default:
            goto DecisionPoint;
    }
DecisionPoint:
    int symbolLength = currentLADepth - laDepth;
    localContext.DelineateCapture("Body", body);
    localContext.DoReduction(this._symbolStream, laDepth, symbolLength);
 
    switch (exitState)
    {
        case 8:
        case 9:
            break;
        case 7:
            //ToDo: Error reporting based off of what is valid at the time.
            break;
    }
    /* -----------------\
    |  State 7          |
    |      [Object]->9  |
    |      [Array]->8   |
    \----------------- */
    // Edge state 8
    // Edge state 9
    this._currentRule = entryResult.Parent;
    this._currentContext = (JsonReaderRuleContext)entryResult.Context.Parent;
    return entryResult;
}

Monday, June 1, 2015

Lexer Generation

So far the system generates a semi-functional Lexer.  I got a lot further than anticipated in a single day of rewriting it with context sensitivity.  So much so, in fact, that I need to start writing a proper parser!  The Lexer identifies the ambiguities on the terminal edges of the grammar-wide DFA.  If an ambiguity is present, it knows whether the symbol is ambiguous based on its use within the rules, predictions, and follow predictions.  If two symbols are possible, it just needs the parser state to tell which of them it is.  If it is ambiguous, it has the context to tell which ambiguities from the language are relevant.
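A rough sketch of that disambiguation handshake (Python, all names invented for the example): when a terminal DFA edge accepts more than one symbol, the lexer filters the candidates by what the current parser state allows, and only reports a genuine ambiguity when more than one survives.

```python
# Hypothetical follow sets: which symbols are legal in each parser state.
FOLLOW = {
    'after_type': {'identifier'},
    'expression': {'identifier', 'keyword_like'},
}

def resolve(candidates, parser_state):
    """candidates: symbols the DFA accept state covers; returns the single
    valid symbol, or the surviving list if the context can't decide."""
    valid = [s for s in candidates if s in FOLLOW[parser_state]]
    return valid[0] if len(valid) == 1 else valid
```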