Saturday, May 30, 2015

Lexical Ambiguities - Check

On the ambiguity front, instead of lifting ambiguities into a rule as the last post suggested, I went with a different approach: ambiguity symbols.

The reason for this is the following:
Lifting ambiguities into their own rule would drastically alter the actual analysis and structure of the language: every place a given ambiguity was observed, I would have to restructure the machinery. In short, it's complicated and the trade-off just isn't there.

Using ambiguity symbols allows me to maintain the concept of a callstack without changing how that callstack is perceived by the end user, and it simplifies things on the analysis front.  The smaller the footprint (impact) this causes, the better.

Now to do this, I still have to run a lexical ambiguity check on all terminals and build the unique combinations that can exist, whether that's two terminals that overlap or ten.  Then I take those and create their subsets, so if three terminals A, B, and C are ambiguous, the valid combinations are { A, B, C }, { A, B }, { A, C }, and { B, C }; each subset represents a single ambiguity identity.  After this step the normal analysis takes place, with an ambiguity-detection step added at the end of transition table creation.
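As a rough sketch of that subset step (hypothetical names, not the actual OILexer types), the combinations of two or more overlapping terminals can be enumerated with a simple bit-mask walk:

using System.Collections.Generic;

static class AmbiguityCombinations
{
    // Enumerate every combination of two or more overlapping terminals;
    // each combination stands for one ambiguity identity.
    public static IEnumerable<HashSet<string>> Of(IReadOnlyList<string> overlapping)
    {
        int n = overlapping.Count;
        for (int mask = 1; mask < (1 << n); mask++)
        {
            var subset = new HashSet<string>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0)
                    subset.Add(overlapping[i]);
            if (subset.Count >= 2)
                yield return subset;
        }
    }
}

Feeding { A, B, C } through this yields { A, B }, { A, C }, { B, C }, and { A, B, C }, matching the four ambiguity identities above.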

I track the ambiguities that are actually encountered and, after the static analysis, remove from the equation those that were never observed within the language.  Compiler warnings are emitted for the removed ones, so there is still a record that they existed.

The extra 'unused' ambiguities are removed to reduce the need to perform logical checks against them during the lexing phase.  The lexer will be context-aware to the degree that it knows what is valid next, specifically for the purpose of ambiguity detection; if no ambiguities exist at a given edge, there's no real check performed.
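A minimal sketch of how I picture that edge check, with hypothetical names (ambiguityIdFor, validNextAmbiguities) standing in for whatever the generated lexer ends up using:

using System.Collections.Generic;

static class AmbiguityCheck
{
    // matchedTokens: the bit set of terminals accepted at the current DFA edge.
    // ambiguityIdFor: maps a token bit set to its ambiguity symbol, if one exists.
    // validNextAmbiguities: ambiguity symbols that are legal in the current context.
    public static int Resolve(
        int matchedTokens,
        IReadOnlyDictionary<int, int> ambiguityIdFor,
        ISet<int> validNextAmbiguities)
    {
        // No ambiguities valid at this edge: hand the match back untouched, no check at all.
        if (validNextAmbiguities.Count == 0)
            return matchedTokens;

        // Otherwise report the ambiguity symbol so the parse context can pick
        // the concrete token later.
        if (ambiguityIdFor.TryGetValue(matchedTokens, out int symbol) &&
            validNextAmbiguities.Contains(symbol))
            return symbol;

        return matchedTokens;
    }
}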

So far this ... increased the run-time of the analysis phase by about 100%.  The ambiguity detection itself is simplistic, so that isn't the issue; the resulting ambiguous paths are essentially 'merged', which makes more static analysis necessary, and the original, unambiguous paths are still present in the analysis.  I want to get it working before I worry about optimizing this and determining whether it's possible to reduce it further.

I've also done more refactoring of the flow control in the Compiler phase to simplify generation of the result object model; things are basically a mess, and I wanted to clean them up and make them readable before moving on to writing the full-on Lexer and then the Parser.

I'm suspending one of the original ideas I wanted to handle: lexical extractions, where you can mark parts of a token that you want to call out as important (the sign of an exponent on a number, for instance, or the whole part of a number, et cetera).  I'll likely move this into a post-parse phase as a non-deterministic capturing system.  I talked with Russ Cox back in 2012 about using a deterministic scheme to lift out common components during lexing, and he said a non-deterministic automaton would lend itself to a simpler architecture.  So it will be revisited.

Friday, May 22, 2015

OILexer - Refactoring and Lexical Ambiguities

As I progress down this path, I learn more about the concepts involved in automating a parser; however, the further I go, the more questions arise.

About three weeks ago I mentally stepped through the process the parser would need to follow to solve certain situations and realized: Contextual Keywords are a problem.

In tools like ANTLR, semantic predicates can usually assist in contextual ambiguity resolution.  In other cases you merely change the language description.  Take 'from', for instance: in ANTLR 4 you resolve this by having such ambiguities resolve to a single token, then specifying an identifier rule that includes the keyword as part of its definition.  Where you actually intend to use 'from', you have a separate rule that resolves to just that token and nothing else.

Personally, I don't favor this approach; it feels like a violation of the DRY principle, and with sufficient context the lexer and parser working in concert should be able to differentiate between the two.

So I'm taking the following approach:

  • Take the Nondeterministic Finite Automata (NFA) of all tokens, maintaining context of which token each state is derived from.
    • Create a unified version that represents an 'all-token automaton.'
    • Create a deterministic finite automaton (DFA) from this.
    • Find all Edge states of the DFA.
    • On edges with two or more tokens you have a Lexical Ambiguity (see the sketch after this list).
    • Create a set of each combination of those tokens.
    • For each possible combination, create a rule within the language that represents the ambiguity.
      • Mark the rule as an ambiguity rule, important for later.
  • During the first/follow set resolution
    • You know the combination that created each ambiguity
    • Order your ambiguities by set length
    • Intersect against the resulting transition table for every ambiguity
      • For each match, replace the ambiguous hit's tokens with the rule reference
      • Repeat until no ambiguity hits occur.
    • Force Branch Prediction on ambiguous rules
      • The token(s) that follow will be back-traceable to the original path that requires the original token; this context will allow you to determine which course should have been followed.
      • Special cases need to be injected during this process to account for other possible ambiguities that occur.
  • During path determination
    • When what's next is a known ambiguity rule
    • The forced branch prediction will tell you which actual path should have been followed; jump to the appropriate state based on its results.
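Assuming each DFA state already records which original tokens its merged NFA states came from (the real OILexer state types are more involved than this), the detection step from the list boils down to something like:

using System.Collections.Generic;
using System.Linq;

sealed class DfaState
{
    public bool IsEdge { get; set; }                                        // accepting ("edge") state
    public HashSet<string> SourceTokens { get; } = new HashSet<string>();   // tokens the state derives from
}

static class LexicalAmbiguityDetector
{
    // Every edge state built from two or more tokens marks a lexical ambiguity;
    // its distinct token set becomes the identity used to create the ambiguity rule.
    public static IEnumerable<HashSet<string>> Detect(IEnumerable<DfaState> states)
    {
        return states
            .Where(s => s.IsEdge && s.SourceTokens.Count >= 2)
            .Select(s => s.SourceTokens)
            .Distinct(HashSet<string>.CreateSetComparer());
    }
}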
This approach is intended for cases where the keyword is purely contextual; standard keywords are always keywords.  That case is resolved by token precedence assignment: Keywords is defined as having a higher precedence than Identifier via:

Keywords>Identifier := ... 

In other news, I've also been slightly refactoring OILexer.  When I started this project in 2008, I had originally named things very simply: 'GD' was the acronym for 'GrammarDefinition', and since the project was so early, that didn't bother me.  I've since changed this to OilexerGrammar (no acronym).  So now IGDFile becomes IOilexerGrammarFile, IGDFileOptions becomes IOilexerGrammarFileOptions, and so on.

A similar set of refactors was done on types matching \b(?<Identifier>(_\w+|[\w-[0-9_]]\w*)Entry)\b; they are now named OilexerGrammar${Identifier}, with about six additional commits to fix incorrect matches.
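For illustration, the core of that rename is just a regex replace along these lines (a one-off sketch, not the exact tooling used for the refactor):

using System.Text.RegularExpressions;

static class EntryRename
{
    // Rewrites a name such as "TokenEntry" to "OilexerGrammarTokenEntry",
    // using .NET's character-class subtraction in the original pattern.
    public static string Apply(string source) =>
        Regex.Replace(
            source,
            @"\b(?<Identifier>(_\w+|[\w-[0-9_]]\w*)Entry)\b",
            "OilexerGrammar${Identifier}");
}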