Friday, May 22, 2015

OILexer - Refactoring and Lexical Ambiguities

As I progress down this path, I learn more about the concepts involved in automating a parser; however, the further I go more questions arise.

About three weeks ago I mentally stepped through the process the parser would need to follow to solve certain situations and realized: Contextual Keywords are a problem.

In tools like ANTLR semantic predicates can usually assist in contextual ambiguity resolution.  In other cases, you merely change the language description.  Take for instance 'from', in ANTLR 4, you resolve this by making such ambiguities resolve to the same token, but specify an identifier rule that includes the keyword as part of its definition.  Where you actually intend to use 'from' you have a separate rule which resolves to just the token, and nothing else.

Personally I don't favor this approach, it feels like you're violating the DRY principle, and with sufficient context the lexer/parser in concert should be able to differentiate between the two.

So I'm taking the following approach:

  • Taking the Nondeterministic Finite Automata (NFA) of all tokens, maintaining context of which token each state is derived from.
    • Create a unified version that represents an 'all token automata.'
    • Create a deterministic finite automation (DFA) from this.
    • Find all Edge states of the DFA.
    • On edges with two or more tokens you have a Lexical Ambiguity.
    • Create a set of each combination of those tokens.
    • For each possible combination, create a rule within the language that represents the ambiguity.
      • Mark the rule as an ambiguity rule, important for later.
  • During the first/follow set resolution
    • You know the combination that created each ambiguity
    • Order your ambiguities by set length
    •  Intersect against the result transition table for every ambiguity
      • For each match, replace the ambiguous hit's tokens with the rule reference
      • Repeat until no ambiguity hits occur.
    • Force Branch Prediction on ambiguous rules
      • The token(s) that follow will be back-traceable to the original path that requires the original token, this context will allow you to determine which course should have been followed.
      • Special cases need injected during this process to be aware of other possible ambiguities that occur.
  • During path determination
    • When what's next is a known ambiguity rule
    • The forced branch prediction will tell you what actual path should've been followed, jump to the appropriate state based on the results of the forced branch prediction
This approach is intended for cases where the keyword is purely contextual, standard keywords are always keywords.  This is resolved by token precedence assignment.  Keywords is defined as being of a higher precedence than Identifier via:

Keywords>Identifier := ... 

In other news, I've also been slightly refactoring OILexer.  When I started this project in 2008, I had originally named things very simply: 'GD' was the acronym for 'GrammarDefinition', since the project was so early, that didn't bother me.  I've since changed this to be: OilexerGrammar (no acronym.)   So now IGDFile becomes IOilexerGrammarFile, IGDFileOptions becomes IOilexerGrammarFileOptions, and so on.

A similar set of refactors was done on \b(?<Identifier>(_\w+|[\w-[0-9_]]\w*)Entry)\b types.  They are now named OilexerGrammar${Identifier}.  With about six additional commits of fixing incorrect matches.

No comments: