Got here quite a bit sooner than expected!
I decided to hand-write the necessary steps for the Parser, applying the lessons learned from the Lexer, and I came away with a few takeaways:
- The lexer is fast - around 6 to 8 times faster than ANTLR4, assuming the future GetLineIndex and GetColumnIndex don't eat that time (there's a sketch of one way to keep them cheap after this list). I suspect I'll be happy just to match ANTLR4's speed by the end of this. It doesn't approach hand-written C++ parser speeds, but I'm happy for now.
- The parser context building phase is slow.
- I plan on lifting the conceptual components of a given parse into two formats:
- Concrete - This is the actual structure of the parse
- Abstract - This is the fat-trimmed version of the CST (aka: AST)
- In my hand-written test (using the super-simple JSON language), the dual context building as you parse is what's slow. Given the potential ambiguity that will arise from a 'Collapse Point' perspective (see below), I think the 'AST' phase should be deferred.
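On the GetLineIndex/GetColumnIndex point above: one way to keep them from eating lexer time is to record newline offsets during lexing and only binary-search them when a line or column is actually asked for. This is just a sketch of that idea; the class and member names are hypothetical, not the project's actual API.

using System.Collections.Generic;

public sealed class LineIndex
{
    // Absolute offsets of every '\n' seen during lexing, in ascending order.
    private readonly List<int> newlineOffsets = new List<int>();

    // Called by the lexer whenever it consumes a newline; O(1) per newline.
    public void NoteNewline(int offset)
    {
        this.newlineOffsets.Add(offset);
    }

    // 1-based line number for an absolute character offset, via binary search.
    public int GetLineIndex(int offset)
    {
        int index = this.newlineOffsets.BinarySearch(offset);
        // A negative result is the bitwise complement of the insertion point,
        // i.e. the count of newlines strictly before 'offset'.
        return (index >= 0 ? index : ~index) + 1;
    }

    // 1-based column number: distance from the previous newline (or file start).
    public int GetColumnIndex(int offset)
    {
        int line = this.GetLineIndex(offset);
        int lineStart = line == 1 ? 0 : this.newlineOffsets[line - 2] + 1;
        return offset - lineStart + 1;
    }
}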
Back to the collapse points. So if I have:
A ::=> B | C;
X ::= A;
B ::= ...;
C ::= ...;
The resulting grammar should build a tree like so:
X - A - B
or
X - A - C
But with 'Collapse points' we get (for the AST):
X - B
or
X - C
This doesn't seem like a major change, but you might have five or six different 'collapse points' in a row. If you had:
X - A - B - C - D - E - (F | G | H | I)
...(assume many other variants are here)...
Where A, B, C, D, and E are all 'collapse points', your X would point directly to F, G, H, or I, instead of having to hop through five levels of indirection.
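Here's a rough sketch of what that collapse could look like when lifting the concrete tree into the abstract one. The interfaces and names are hypothetical, purely to illustrate skipping single-child collapse points; the real automated version will look different.

using System.Collections.Generic;
using System.Linq;

public interface ICstNode
{
    // True for rules like A, B, C, D, E above that exist only as pass-throughs.
    bool IsCollapsePoint { get; }
    IReadOnlyList<ICstNode> Children { get; }
}

public static class AstLift
{
    // Skips consecutive single-child collapse points so X ends up pointing
    // straight at F, G, H or I instead of hopping through the indirection.
    public static ICstNode Collapse(ICstNode node)
    {
        while (node.IsCollapsePoint && node.Children.Count == 1)
            node = node.Children[0];
        return node;
    }

    // The children a node keeps in its abstract (AST) view.
    public static IEnumerable<ICstNode> AbstractChildren(ICstNode node)
    {
        return node.Children.Select(Collapse);
    }
}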
Now this is all fine and good, but my next task is implementing this through automation. I think that if I make the CST build the AST on demand only, we can shave a lot of time off of the resulting parse. Right now ANTLR4 parses certain geological data in 600ms and I'm at 400ms; I want that lower. I am doing a little more work than ANTLR4 on the context lifting, so I hope that time can drop further.
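A minimal sketch of the on-demand idea, assuming the parse builds only the concrete context and the abstract view is materialized (and cached) the first time someone reads it. The types below are illustrative, not the real API.

using System;

public sealed class ConcreteNode
{
    private readonly Lazy<AbstractNode> abstractView;

    public ConcreteNode()
    {
        // Nothing AST-related happens during the parse itself; the lift runs
        // at most once, and only if somebody actually reads AbstractView.
        this.abstractView = new Lazy<AbstractNode>(this.LiftToAbstract);
    }

    // The fat-trimmed view, built and cached on first access.
    public AbstractNode AbstractView
    {
        get { return this.abstractView.Value; }
    }

    private AbstractNode LiftToAbstract()
    {
        // Collapse-point elimination and other trimming would happen here.
        return new AbstractNode();
    }
}

public sealed class AbstractNode
{
}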
For those who are interested, here's a general idea of what the hand-written version looks like (which is close to how the automated version will look):
private IEntryRule ParseEntryInternal(int laDepth)
{
    // Push a new context/rule pair for the Entry rule.
    var localContext = this._currentContext = new JsonReaderRuleContext(this.TokenStream, _currentContext, JsonReaderSymbols.EntryRule);
    var entryResult = new EntryRule() { Context = _currentContext, Parent = this._currentRule };
    this._currentRule = entryResult;
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    long exitState = -1;
    JsonReaderSymbols identity;
    IJsonReaderRule body = null;
MultitargetState_7:
    this.State = 7;
    identity = this.SymbolStream.LookAhead(currentLADepth);
    switch (identity)
    {
        case JsonReaderSymbols.JsonPunctuationToken_ArrayStart:
            this.State = 8;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseArrayInternal(currentLADepth++);
            goto DecisionPoint;
        case JsonReaderSymbols.JsonPunctuationToken_ObjectStart:
            this.State = 9;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseObjectInternal(currentLADepth++);
            goto DecisionPoint;
        default:
            exitState = this.State;
            goto DecisionPoint;
    }
DecisionPoint:
    int symbolLength = currentLADepth - laDepth;
    localContext.DelineateCapture("Body", body);
    localContext.DoReduction(this._symbolStream, laDepth, symbolLength);
    switch (exitState)
    {
        case 8:
        case 9:
            // Valid exits: an array or object body was parsed.
            break;
        case 7:
            //ToDo: Error reporting based off of what is valid at the time.
            break;
    }
    /* -----------------\
       |    State 7     |
       |  [Object]->9   |
       |  [Array] ->8   |
       \-----------------*/
    // Edge state 8
    // Edge state 9
    this._currentRule = entryResult.Parent;
    this._currentContext = (JsonReaderRuleContext)entryResult.Context.Parent;
    return entryResult;
}