Friday, November 27, 2015

Method Body Byte Stream Disassembly

Been taking a break from the Parser aspects only to parse data of a different flavor.

L_00000: /*                    No Operation (0000) */ nop
L_00001: /*                       Load Arg0 (0002) */ ldarg.0
L_00002: /*  Load Int32 Constant Short Form (001F) */ ldc.i4.s 0x14
L_00004: /*  Branch If Less Than Short Form (0032) */ blt.s L_0000A
L_00006: /*  Load Int32 Constant Short Form (001F) */ ldc.i4.s 0x14
L_00008: /* Unconditional Branch Short Form (002B) */ br.s L_0000C
L_0000A: /*  Load Int32 Constant Short Form (001F) */ ldc.i4.s 0x9
L_0000C: /*                    No Operation (0000) */ nop
L_0000D: /*                     Call Method (0028) */ call MemberReference: Name = WriteLine, Signature = void (int32), Class=TypeRef: Name = Console, Namespace = System
L_00012: /*                    No Operation (0000) */ nop
L_00013: /*                          Return (002A) */ ret

Above are the results of finalizing what I started in 2011: the parsing of the method body IL into its actual structure.  I likely won't take it further than the above, but it's still neat that it basically mirrors what you see in ILDasm and .NET Reflector in IL mode.  A notable exception is how it denotes method/member references, as I didn't see the need to break them down further.
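For reference, the branch logic the listing above encodes can be reconstructed directly from the IL semantics.  Below is a Python stand-in (the method and parameter names are assumptions; the constants and branch targets come straight from the disassembly):

```python
def disassembled_method(arg0):
    # ldarg.0; ldc.i4.s 0x14; blt.s L_0000A: branch taken when arg0 < 20
    if arg0 < 20:
        value = 9    # L_0000A: ldc.i4.s 0x9
    else:
        value = 20   # L_00006: ldc.i4.s 0x14, then br.s skips the other branch
    print(value)     # call System.Console::WriteLine(int32)
    return value
```

Note that `blt.s` pops the constant 0x14 (20) and the argument, branching when the argument is less than the constant, which is why the 9 sits on the taken path.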

This will help me in determining whether the output of the next phase is accurate: the linker/compiler.

I have a lot to learn still, but the quickest way to learn it is to do it.

Saturday, July 11, 2015

Boot Strapping Nearly Hit

I still have a little ways to go but OILexer is effectively nearly capable of Boot Strapping.

I found oddities within the grammar where my hand-written parser handled things incorrectly or, more accurately, inconsistently.

I have a few more things to fix up on the predictions: follow ambiguities don't appear to behave correctly when a rule is being parsed as part of a reduction (which is itself done within a prediction).

As a result, if a rule is being parsed as a reduction, and that rule enters an exit state, if the token is potentially ambiguous it yields a situation where the next token's identity is marked as 'none'.  A quick fix for this would be to let that rule terminate and check for the none on the stack and clear the look-ahead, but I feel that's the wrong approach and I'm just putting a band-aid on the problem.  It's especially concerning if the ambiguity that's possible in that edge is valid for the current rule to continue.  Allowing lexical ambiguities opened up a whole new can of worms!

After I fix this, I'm thinking about allowing the tokenizer to terminate early in its scan based off of the rule's context.  This will allow for situations where two tokens are identical up to some point, but one of them continues on; if that longer token isn't valid in the language at the time, terminating early could enable certain oddly lexically ambiguous languages to parse accurately.

Take type identifiers, for instance: the type-name can contain 'Version='.  So unless you made your Version token parse 'Version', '=', and #.#.#.# together, the terminal that's present to specify 'Version' followed by '=' will never be registered on the token stream, because the QualifiedIdentifier would consume all of its characters.  One option is to break apart the QualifiedIdentifier into a rule and separate its elements with a '=', but in testing this is hackish and doesn't appear to work for the worst case:
Version=, Version\=, Version=,...

Why someone would do that is beyond me, but .NET itself has no issues parsing it, though it does cause issues with the Visual Studio Hosting Process.  It seems to really dislike even the assembly name matching that pattern, let alone the type (which would have to be built using ilasm.)
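That context-driven early termination could look roughly like the sketch below (hypothetical Python, not OILexer's actual API; `SHL` and `LT` stand in for two tokens that are identical up to a point, like '<<' and '<'):

```python
import re

TOKENS = {
    "SHL":   re.compile(r"<<"),
    "LT":    re.compile(r"<"),
    "IDENT": re.compile(r"\w+"),
}

def scan(text, pos, valid):
    """Longest match, restricted to the token kinds the current rule
    context allows; returns (kind, end_position) or None."""
    best = None
    for kind in valid:
        m = TOKENS[kind].match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (kind, m.end())
    return best

# Plain longest-match over "a<<b" at the '<' yields SHL, but inside a
# context where SHL is not valid, the scan terminates early at LT.
```

The key point is that the set of valid tokens is supplied by the parser's current state, so the lexer never commits to a longer match the language can't accept at that position.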

Tuesday, June 30, 2015

Predictions (Nearly) There


After reaching LL(1) status earlier, I've been working on making sure the predictions are accurate.  After a lot of reworking they appear to be getting closer.

A preview is below:

private int _Predict2OnMultExp(int laDepth)
{
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    int exitState = -1;
    this._state = 20;
    switch (this._symbolStream.LookAhead(currentLADepth))
    {
        case SCReaderSymbols.SCRMultExpRecursive:
            exitState = this._state = 21;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case SCReaderSymbols.SCTOperators_Divide:
                case SCReaderSymbols.SCTOperators_Modulus:
                case SCReaderSymbols.SCTOperators_Multiply:
                    exitState = this._state = 25;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_25;
                    return 12;
            }
        case SCReaderSymbols.SCRPrimExp:
            exitState = this._state = 22;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case SCReaderSymbols.SCTOperators_Divide:
                case SCReaderSymbols.SCTOperators_Modulus:
                case SCReaderSymbols.SCTOperators_Multiply:
                    exitState = this._state = 25;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_25;
                    return 11;
            }
        case SCReaderSymbols.SCRMultExp:
            exitState = this._state = 24;
            exitLADepth = ++currentLADepth;
            return 12;
        case SCReaderSymbols.SCTIdentifier:
        case SCReaderSymbols.SCTNumber:
            exitState = this._state = 23;
            exitLADepth = ++currentLADepth;
            return 11;
            goto DecisionPoint;
    }
    return 12;
    return 0;
}

I still have a lot of work necessary to clean up decision results (you can see the unreachable returns after the gotos above).  But you'll notice the rule above is a left-recursive rule.  In testing (by hand-writing it), it's very possible to generate a recursive descent parser that does reductions to handle left recursion.  I just need to test further to see if it applies to all variations of left-recursion.
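The hand-written shape of that reduction-based handling of direct left recursion can be sketched like this (Python stand-in for the generated C#; `parse_prim` is a hypothetical placeholder for the real PrimExp parse):

```python
def parse_prim(tokens):
    # Hypothetical PrimExp: a single identifier/number token.
    return tokens.pop(0)

def parse_mult_exp(tokens):
    # MultExp ::= MultExp ('*' | '/' | '%') PrimExp | PrimExp
    # Parse the non-recursive alternative first, then reduce-to-self
    # while the look-ahead still continues the recursive alternative.
    left = parse_prim(tokens)
    while tokens and tokens[0] in ("*", "/", "%"):
        op = tokens.pop(0)
        left = (op, left, parse_prim(tokens))
    return left
```

Parsing the non-recursive alternative first and looping on the operator is what turns the left-recursive grammar rule into something a recursive descent parser can execute without infinite regress.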

Friday, June 19, 2015

LL(1) Achieved

Well, I've achieved LL(1) state, which means that if a language only requires one token of look-ahead, it can be parsed.
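As a toy illustration of what LL(1) means in practice, a single token of look-ahead selects every production.  The sketch below uses a hypothetical two-production grammar (S ::= 'a' S | 'b'), nothing OILexer-specific:

```python
# LL(1) parse table: (nonterminal, look-ahead token) -> production body.
TABLE = {("S", "a"): ["a", "S"], ("S", "b"): ["b"]}

def parse_ll1(tokens):
    """Table-driven LL(1) parse; one token of look-ahead picks each rule."""
    stack, pos = ["S"], 0
    while stack:
        top = stack.pop()
        if top in ("a", "b"):          # terminal: must match the input
            assert tokens[pos] == top
            pos += 1
        else:                          # nonterminal: expand via the table
            stack.extend(reversed(TABLE[(top, tokens[pos])]))
    return pos == len(tokens)
```

Anything that needs more than that single token of look-ahead is where the prediction machinery from the later posts takes over.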

So far explicit captures are working, the object model that represents the rules doesn't quite procure the data from the context, but it's a start (i.e. the data is there, but no properties to get the data).  I'm currently working on LL(∞), currently the major issue appears to be Indirectly Left Recursive states.  They throw the system for a loop, but I'm sure it's just a matter of time.

The prediction states are going to take a little while to get correct.  The prediction pseudo-code appeared correct in initial tests, but I'll need to be careful that the prediction flow is accurate during reduction states.  Directly Left Recursive rules will need to provide a boolean value to denote their initial state, once the non-recursive variation has been parsed, it can reduce to itself, and check again.

Sunday, June 14, 2015

Beginning Automation for Parser Generation

Got here quite a bit sooner than expected!

I decided to hand-write the necessary steps for the Parser from the lessons learned from the Lexer, and I had a few things to take away with it:

  • The lexer is fast - Around 6-8 times faster than ANTLR4, assuming the future GetLineIndex and GetColumnIndex don't eat that time.  I suspect I'll be happy to match ANTLR4 by the end of this.  This doesn't approach hand-written C++ parser speeds, but I'm happy for now.
  • The parser context building phase is slow.
    • I plan on lifting the conceptual components of a given parse into two formats:
      1. Concrete - This is the actual structure of the parse
      2. Abstract - This is the fat-trimmed version of the CST (aka: AST)
    • In my hand-written test (using the super simple JSON language), the dual context building as you parse is what's slow; given the potential for ambiguity that will arise from a 'Collapse Point' perspective (see below), I think the 'AST' phase should be deferred.
The Collapse Points are rules that exist as alternations of other rules or tokens.  They serve no other purpose than to outline what's valid at that point.  For a shortcut, you could in theory (and in practice now I guess) condense the AST version such that the node of the alternation points directly to the alternation.

So if I have:
A ::=> B | C;

X ::= A;

B ::= ...;
C ::= ...;

 The resulting grammar should build a tree like so:

X - A - B
X - A - C

But with 'Collapse points' we get (for the AST):
X - B
X - C

This doesn't seem like a major change, but you might have five or six different 'collapse points' in a row.  If you had:
X - A - B - C - D - E - (F | G | H | I)
...(assume many other variants are here)...
Where A, B, C, D, E are all 'collapse points', your X would point directly to F, G, H, or I, instead of having to hop through five levels of indirection.
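The collapse transformation can be sketched on a tuple-shaped tree (illustrative Python; the `(kind, child...)` node shape is an assumption, not the real object model):

```python
def collapse(node, collapse_points):
    """Replace single-child 'collapse point' nodes with their child,
    so X - A - B becomes X - B in the abstract view."""
    kind, *children = node
    children = [collapse(c, collapse_points) if isinstance(c, tuple) else c
                for c in children]
    if kind in collapse_points and len(children) == 1:
        return children[0]
    return (kind, *children)
```

A chain of five or six collapse points in a row disappears in one recursive pass, which is exactly the indirection saving described above.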

Now this is all fine and good, but my next task is implementing this through automation.  I think if I make the CST build the AST on demand only, we can shave a LOT of time off of the resultant parse.  Right now ANTLR4 parses certain geological data in 600ms; I'm at 400ms, and I want that lower.  I am doing a little more work than ANTLR4 on the context lifting, so I hope this time can drop down further.

For those who are interested, here's a general idea of what the hand-written version looks like (which is close to how the automated version will look.)

private IEntryRule ParseEntryInternal(int laDepth)
{
    var localContext = this._currentContext = new JsonReaderRuleContext(this.TokenStream, JsonReaderSymbols.EntryRule);
    var entryResult = new EntryRule() { Context = _currentContext, Parent = this._currentRule };
    this._currentRule = entryResult;
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    long exitState = -1;
    JsonReaderSymbols identity;
    IJsonReaderRule body = null;
    this.State = 7;
    identity = this.SymbolStream.LookAhead(currentLADepth);
    switch (identity)
    {
        case JsonReaderSymbols.JsonPunctuationToken_ArrayStart:
            this.State = 8;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseArrayInternal(currentLADepth++);
            goto DecisionPoint;
        case JsonReaderSymbols.JsonPunctuationToken_ObjectStart:
            this.State = 9;
            exitState = this.State;
            exitLADepth = currentLADepth;
            body = ParseObjectInternal(currentLADepth++);
            goto DecisionPoint;
    }
DecisionPoint:
    int symbolLength = currentLADepth - laDepth;
    switch (exitState)
    {
        case 8:
        case 9:
            break;
        case 7:
            //ToDo: Error reporting based off of what is valid at the time.
            break;
    }
    /* -----------------\
    |  State 7          |
    |      [Object]->9  |
    |      [Array]->8   |
    \----------------- */
    // Edge state 8
    // Edge state 9
    this._currentRule = entryResult.Parent;
    this._currentContext = (JsonReaderRuleContext)entryResult.Context.Parent;
    return entryResult;
}

Monday, June 1, 2015

Lexer Generation

So far the system now generates a semi-functional Lexer.  I got a lot further than anticipated in a single day on rewriting it with context sensitivity; so much so, in fact, that I need to start writing a proper parser!  The Lexer identifies the ambiguities on the terminal edges of the grammar-wide DFA.  If an ambiguity is present, it knows whether the symbol is ambiguous based off of its use within the rules, predictions, and follow predictions.  If two symbols are possible, it just needs the parser state to tell which of them it is.  If it is ambiguous, it has the context to tell which ambiguities from the language are relevant.

Saturday, May 30, 2015

Lexical Ambiguities - Check

On the ambiguity front, instead of lifting them into a rule, like the last post suggested, I went for a different approach: ambiguity symbols.

The reason for this is the following:
In lifting them into their own rule, this would drastically alter the actual analysis and structure of the language.  Every place a given ambiguity would be observed, I would have to restructure the machinery.  In short, this is complicated and the tradeoff just isn't there.

Using ambiguity symbols allows me to maintain the concept of a callstack, without changing how that callstack is perceived by the end user and simplifies things on the analysis front.  The smaller the footprint (impact) caused by this, the better.

Now to do this, I still have to do a lexical ambiguity check on all terminals and create the unique permutations that can exist, be it two terminals that overlap, or ten.  Then take those and create subsets of those: so if three terminals, A, B, and C, are ambiguous, the valid permutations are { A, B, C }, { A, B }, { A, C }, and { B, C }; each sub-permutation represents a single ambiguity identity.  After this step, the normal analysis takes place, adding at the end of transition-table creation a step to do ambiguity detection.
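The permutation step above can be sketched with itertools (a hedged illustration; the real analysis attaches more context to each identity than a bare frozenset):

```python
from itertools import combinations

def ambiguity_identities(terminals):
    """Every subset of size >= 2 of a set of mutually ambiguous terminals;
    each subset becomes one ambiguity identity, ordered by set length."""
    terms = sorted(terminals)
    return [frozenset(subset)
            for size in range(len(terms), 1, -1)
            for subset in combinations(terms, size)]

# Three ambiguous terminals yield exactly the four identities described:
# { A, B, C }, { A, B }, { A, C }, { B, C }.
```

Ordering by set length matters later, since the larger overlaps subsume the smaller ones during transition-table intersection.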

Track the ambiguities which are actually encountered, and after the static analysis, remove from the equation the ambiguities that were not observed within the language.  Compiler warnings are provided to note that they existed, so there is still context for them.

The extra 'unused' ambiguities are removed to reduce the need to do logical checks against them during the lexing phase.  The lexer will be context aware to the degree that it knows what is valid next, specifically for the purpose of ambiguity detection. If no ambiguities exist at a given edge, there's no real check performed.

So far this ... increased the run-time of the analysis phase by about 100%.  The ambiguity detection is simplistic, so that isn't the issue; rather, the ambiguous paths that result are basically 'merged', leading to more static analysis being necessary.  The original, unambiguous paths are still present in the analysis.  I want to get it done before I worry about optimizing this and determining whether it's possible to reduce it further.

So far I've done more refactoring to the flow control of the Compiler phase to simplify generation of the result object model, things are basically a mess, and I wanted to clean it up and make it readable before moving onto writing the full on Lexer and then to the Parser.

I'm suspending one of the original ideas I wanted to handle: lexical extractions, where you can specify in your tokens parts that you want to call out as important (the sign of an exponent on a number, for instance, the whole-part of a number, et cetera.)  I'll likely move this into a post-parse phase as a non-deterministic capturing system.  I talked with Russ Cox back in 2012 about using a deterministic scheme to do lifting of common components in lexing and he said a non-deterministic automata would lend to a simpler architecture.  So it'll be revisited.

Friday, May 22, 2015

OILexer - Refactoring and Lexical Ambiguities

As I progress down this path, I learn more about the concepts involved in automating a parser; however, the further I go more questions arise.

About three weeks ago I mentally stepped through the process the parser would need to follow to solve certain situations and realized: Contextual Keywords are a problem.

In tools like ANTLR, semantic predicates can usually assist in contextual ambiguity resolution.  In other cases, you merely change the language description.  Take, for instance, 'from': in ANTLR 4, you resolve this by making such ambiguities resolve to the same token, but specifying an identifier rule that includes the keyword as part of its definition.  Where you actually intend to use 'from', you have a separate rule which resolves to just the token, and nothing else.

Personally I don't favor this approach; it feels like you're violating the DRY principle, and with sufficient context the lexer/parser in concert should be able to differentiate between the two.

So I'm taking the following approach:

  • Taking the Nondeterministic Finite Automata (NFA) of all tokens, maintaining context of which token each state is derived from.
    • Create a unified version that represents an 'all token automata.'
    • Create a deterministic finite automaton (DFA) from this.
    • Find all Edge states of the DFA.
    • On edges with two or more tokens you have a Lexical Ambiguity.
    • Create a set of each combination of those tokens.
    • For each possible combination, create a rule within the language that represents the ambiguity.
      • Mark the rule as an ambiguity rule, important for later.
  • During the first/follow set resolution
    • You know the combination that created each ambiguity
    • Order your ambiguities by set length
    •  Intersect against the result transition table for every ambiguity
      • For each match, replace the ambiguous hit's tokens with the rule reference
      • Repeat until no ambiguity hits occur.
    • Force Branch Prediction on ambiguous rules
      • The token(s) that follow will be back-traceable to the original path that requires the original token, this context will allow you to determine which course should have been followed.
      • Special cases need to be injected during this process to be aware of other possible ambiguities that occur.
  • During path determination
    • When what's next is a known ambiguity rule
    • The forced branch prediction will tell you what actual path should've been followed, jump to the appropriate state based on the results of the forced branch prediction
This approach is intended for cases where the keyword is purely contextual; standard keywords are always keywords.  This is resolved by token precedence assignment: Keywords is defined as being of a higher precedence than Identifier via:

Keywords>Identifier := ... 
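At a DFA edge accepting more than one token, that precedence can be applied roughly as below (a hypothetical sketch; the real transition table carries far more state than a bare set of token names):

```python
def resolve_edge(accepting_tokens, precedence):
    """An edge state accepting two or more tokens is a lexical ambiguity.
    Explicit precedence (Keywords > Identifier) settles it outright;
    otherwise the ambiguity is kept and deferred to parser context."""
    if len(accepting_tokens) == 1:
        return next(iter(accepting_tokens))
    for token in precedence:            # highest precedence first
        if token in accepting_tokens:
            return token
    return None                         # genuine ambiguity: defer to parse
```

Contextual keywords are precisely the `None` case: no static precedence applies, so the forced branch prediction described above has to decide.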

In other news, I've also been slightly refactoring OILexer.  When I started this project in 2008, I had originally named things very simply: 'GD' was the acronym for 'GrammarDefinition', since the project was so early, that didn't bother me.  I've since changed this to be: OilexerGrammar (no acronym.)   So now IGDFile becomes IOilexerGrammarFile, IGDFileOptions becomes IOilexerGrammarFileOptions, and so on.

A similar set of refactors was done on \b(?<Identifier>(_\w+|[\w-[0-9_]]\w*)Entry)\b types.  They are now named OilexerGrammar${Identifier}.  With about six additional commits of fixing incorrect matches.

Sunday, April 5, 2015

Versioned Runtime Environment

In an effort to make Abstraction adaptable to multiple versions of .NET, I've been analyzing the various versions against each other and aim to create two things:

  1. An XML document and schema that can be used to translate known identities between the versions so binding to a given type can be done in a version independent manner.
    1. To achieve this requires knowledge of the .NET assemblies and their identities.  
    2. I could encode common concepts within .NET through code, but this is an error prone task due to the version of the library not often matching that of the framework.  
      1. The Visual Basic libraries are a good example.
  2. A statically typed set of libraries that expose the members to allow expressions to be easier to compose on the framework dependent types.
    1. This will require a lot of effort on my part.  Each library will have its own accompanying expr library; mscorlib will have mscorlib.expr.dll.  This can be viewed as overkill, but the simplicity it offers can't be beat.
    2. One such example is Console.WriteLine(string), there would then be two ways in which you could construct this:
      1. writeLineExp = "Console".Fuse("WriteLine").Invoke(myStringExpression);
      2. writeLineExp = ConsoleExp.WriteLine(myStringExpression);
    3. The difference above isn't obvious: whereas the Fuse method will emit the correct code, ConsoleExp.WriteLine will avoid typing errors because it utilizes the compiler to understand the structure of the document.  The Fuse method is also disadvantaged by refactoring, since you can very easily emit incorrect code through a refactor that is done carelessly.
    4. The latter would also be easier to bind to the types because they would be baked into the expression assembly.
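The difference between the two forms can be mimicked in a few lines (a hypothetical Python mock-up of the expression model, not Abstraction's real API):

```python
class Expr:
    """A trivially simple code-fragment holder for illustration."""
    def __init__(self, text):
        self.text = text

def fuse(target, member):
    # Stringly typed: a typo in 'member' still builds an expression and
    # only fails when the emitted code is finally compiled.
    return Expr(f"{target}.{member}")

def invoke(callee, *args):
    return Expr(f"{callee.text}({', '.join(a.text for a in args)})")

class ConsoleExp:
    @staticmethod
    def WriteLine(arg):
        # Statically typed surface: a typo here is caught by the compiler
        # of the *generator*, not of the generated code.
        return invoke(fuse("Console", "WriteLine"), arg)
```

Both routes produce the same emitted text; the typed surface just moves the error detection earlier, which is the whole argument for the `.expr` libraries.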
Right now I'm formalizing the XSD that will maintain the logical representation of the type model.  The resulting file will likely be pretty large (1.8MB for the outline with no members; this will skyrocket once members are added in, I suspect to somewhere around 18-25MB); I'm hoping compression will help with that some.

Do note that evaluating every version of the framework by isolating the identity resolution per runtime version takes quite a bit of memory.  Right now it takes about 2.2GB of memory to do the analysis.  Once the XML file is generated, if I'm writing a language service, I can use the ExtensionAttribute and adapt to the version without the annoying effort that it requires today.  The automated process is nearly 100% accurate, whereas doing it by hand is tiresome and relies on 'Helper' classes that define common types: this is what I want to stay away from.

There'll likely be two versions of the metadata:
  1. Limited to Libraries and Types
    1. This is basically finished now, I just need to review it to make sure that types have a unique version for each breaking change, anything missing means I'm missing a library of that version.
  2. Full Metadata - Libraries, Types and Members.
    1. This will be used to automate the expression model system. This tool will not be for your average programmer, but programmers who specialize in:
      1. Writing code that writes code (tools which generate code to inject into standard MS Build processes)
      2. Writing code that writes libraries directly (compilers)

Tuesday, February 24, 2015

T*y♯ - C♯ esque language with OILexer mixed in.

Well, after lots of work, I think the predictive logic is to the point where I can finally start the major work of creating the object model, lexer, parser and so on.

As of right now it takes 14 minutes to compute the language's predictions with 105 million comparisons during the course of the Rule NFA and DFA Generation, State->State Ambiguity Prediction resolution, and Follow-State ambiguity resolution.

The dangling else issue is resolved by only concerning itself with situations where what follows on an edge state matches what is required by the calling rule.  So, the typical dangling else issue won't be a concern because the parent context doesn't require the information.
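In other words, a follow ambiguity is only escalated when the symbols that can follow an edge state intersect what the calling rule actually requires.  Sketched with hypothetical symbol sets:

```python
def follow_ambiguity_matters(edge_follow, caller_requires):
    """The classic dangling 'else': the nested if's edge state can be
    followed by 'else', but the parent context never requires that
    information, so no prediction work is spent on it."""
    return bool(edge_follow & caller_requires)
```

When the intersection is empty, the parser simply binds the 'else' to the nearest if, which is the conventional resolution anyway.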

Currently OILexer takes three minutes to handle the general case look-ahead prediction, but 11 minutes to do 'Object Model Construction'.  In this I also handle the follow edge ambiguities, so I need to expand the steps of the process to time them appropriately.

Behind the scenes in this I also handle the identification of key named elements.  If you call out a rule or token by name, it'll create a capture bucket for it.  I've yet to bring this all together, but the scaffolding is there.

I would upload and link the file that denotes the transitional points of the language, but the traversable HTML it generates is 585MB in size.  Suffice it to say, it doesn't currently process the language in x86 mode; I have to compile in x64 mode to give it more leg room.

I'll be posting back as I make more progress, I just know something is going to bite me in the arse, but time will tell.

This is just the parser phase, the process for handling the compiler aspects is taken on by Abstraction itself, which is a static compiler framework.  I'm looking forward to the road ahead!

Saturday, February 21, 2015

LL(*) Again

Seems the further I get the farther I seem from my goal. Decided to further expand the Pseudocode generated by the system and found gobs of bugs and shortcuts that don't make sense (from the view of actually figuring out where to go next, a->b->c, parse b->b->c??)

C'est la vie, you learn by making mistakes, I thought it was going through a C♯-esque language too quickly, turns out, it was!

I've noticed a few things: my grammar was ambiguous in multiple areas, leading to an indeterminate state explosion that was planning on eating all of my system's resources (all 12 GB).

The path sets that result from constructing just the statements and expressions for the language yield a 133MB file of pseudo code for disambiguation.  A majority of this is just tracking the paths that were responsible for reaching that decision point.

My next step is to dig into the T*y♯ language to find out why it's failing assertions on certain predictive logic.

Monday, January 26, 2015

Revived! -- LL(*) prediction.

I'm finally making progress in this project.

Working on predictive LL(*) style parsers which will provide a prediction mechanism any time there is a non LL(1) compliant choice to be made.

Let's say you have a grammar which describes a class member declaration:

ClassMemberDeclaration                   ::=>
    ConstantDeclaration                     |
    FieldDeclaration                        |
    MethodDeclaration                       |
    LanguageDeclaration                     |
    PropertyDeclaration                     |
    EventDeclaration                        |
    IndexerDeclaration                      |
    OperatorDeclaration                     |
    ConstructorDeclaration                  |
    DestructorDeclaration                   |
    StaticConstructorDeclaration            |
    TypeDeclaration                         ;

All of these members are allowed 'Attributes' which begin with a '['.  Typically this would cause an LL(1) parser generator to halt, but here OILexer does the work to guess which path is valid:

// State 298 requires prediction: Predict298OnClassMemberDeclaration.
Within this method, when it encounters the '[' for the attributes it knows that all paths point to 'Attributes' so instead of trying to guess all of the tokens for attributes and then what's after that will disambiguate, it will simply call the ParseAttributes() and disambiguate further from there.

The look-ahead table is symbolic in nature, so be it a token or a rule, it'll go into the stream.
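A toy model of that symbolic stream is below (a Python stand-in with made-up symbol names; the real generated code is C#):

```python
class SymbolStream:
    """Look-ahead over symbols: raw tokens and already-reduced rules alike."""
    def __init__(self, symbols):
        self._symbols = list(symbols)

    def look_ahead(self, depth):
        return self._symbols[depth] if depth < len(self._symbols) else "EOF"

def predict_class_member(stream, la_depth=0):
    # Every alternative admits Attributes, so '[' means: parse Attributes
    # as a single unit, then disambiguate on whatever follows it.
    symbol = stream.look_ahead(la_depth)
    if symbol == "[":
        return "parse Attributes, then re-predict"
    if symbol == "AttributesRule":      # a reduced rule re-enters the stream
        return predict_class_member(stream, la_depth + 1)
    return {"const": "ConstantDeclaration",
            "event": "EventDeclaration"}.get(symbol, "unknown")
```

Because the reduced `AttributesRule` goes back into the stream as a single symbol, the prediction never has to enumerate every token an attribute list could contain.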

The result of the 'predict' method is the target state that the deterministic machine generated for a given rule would have entered, had it been able to guess properly in one token.  So it not only tells it which rule to call, but where to go next.

The state machines are heavily reduced for now, but as I learn the viability of the process, they'll become better decidable.  There is a caveat: there might be possible incorrect guesses for now, and here's why:
  • On edge states which overlap the owning rule's tokens, the prediction might guess incorrectly.
I'll be looking into fixing this, but for now I want to construct a parser from this point, and anything beyond that is a bonus.