Saturday, May 30, 2015

Lexical Ambiguities - Check

On the ambiguity front, instead of lifting them into a rule, like the last post suggested, I went for a different approach: ambiguity symbols.

The reason for this is the following:
In lifting them into their own rule, this would drastically alter the actual analysis and structure of the language.  Every place a given ambiguity would be observed, I would have to restructure the machinery.  In short, this is complicated and the tradeoff just isn't there.

Using ambiguity symbols allows me to maintain the concept of a callstack, without changing how that callstack is perceived by the end user and simplifies things on the analysis front.  The smaller the footprint (impact) caused by this, the better.

Now to do this, I still have to do a lexical ambiguity check on all terminals, create the unique permutations that can exist, be it two terminals that overlap, or ten.  Then take those and create subsets of those, so if three terminals, A, B, and C are ambiguous, the valid permutations are: { A, B, C }, { A, B }, { A, C }, and { B, C }, each sub permutation represents a single ambiguity identity.  After this step, the normal analysis takes place, adding at the end of transition table creation, a step to do ambiguity detection.

Track the ambiguities which are actually encountered, and after the static analysis, remove the ambiguities from the equation that were not observed within the language.  Compiler warnings are provided for context that they existed, so there is still context for them.

The extra 'unused' ambiguities are removed to reduce the need to do logical checks against them during the lexing phase.  The lexer will be context aware to the degree that it knows what is valid next, specifically for the purpose of ambiguity detection. If no ambiguities exist at a given edge, there's no real check performed.

So far this ... increased the run-time of the analysis phase by about 100%.  The ambiguity detection is simplistic, that isn't the issue, the ambiguous paths that result are basically 'merged' leading to more static analysis to be necessary.  The original, unambiguous, paths are still present in the analysis.  I want to get it done before I worry about optimizing this and determining whether it's possible to reduce it further.

So far I've done more refactoring to the flow control of the Compiler phase to simplify generation of the result object model, things are basically a mess, and I wanted to clean it up and make it readable before moving onto writing the full on Lexer and then to the Parser.

I'm suspending one of the original ideas I wanted to handle: lexical extractions, where you can specify in your tokens parts that you want to call out as important (the sign of an exponent on a number, for instance, the whole-part of a number, et cetera.)  I'll likely move this into a post-parse phase as a non-deterministic capturing system.  I talked with Russ Cox back in 2012 about using a deterministic scheme to do lifting of common components in lexing and he said a non-deterministic automata would lend to a simpler architecture.  So it'll be revisited.

No comments: