Monday, March 30, 2009

LL(k) Parser - Tail Ambiguity

One of the fun things about doing it yourself is that you learn about the annoying hurdles the hard way. The most recent of these is what I'm calling 'tail ambiguity', which is basically a lock created by the required part of a calling rule after the current rule exits. The lock exists between the optional aspect of the current rule and the required aspect of (one of) the calling rule(s).

A simple example: say you have a rule encased in '(' and ')', and between the two it calls a second rule, which ends with an optional series of ')' elements. You would have to determine, in the calling rule, how many of those ')' elements belong to the optional part of the current rule and how many belong to the calling rule. Things get more complicated when you consider recursion on the first rule, the one encased in parentheses.
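To make this concrete, here's a hypothetical grammar fragment (rough EBNF-style notation, not my actual toy grammar) that exhibits the problem:

Outer ::= '(' Inner ')' ;
Inner ::= Item ( ')' )* ;

Given the input "( Item ) )", a non-backtracking parser sitting inside Inner can't decide locally whether a ')' belongs to Inner's optional tail or is the required ')' that Outer is still waiting on, and recursion on Outer makes the required count variable as well.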

After pondering this constantly since I encountered it (by making a toy grammar for kicks, with no intent of finding it), I think I nearly have a solution; I just want to make sure before I spend a few days implementing something that may not work. I haven't even begun to cover what happens when the optional tail of a rule is locked between multiple rules, and those calling rules have potentially more than one token locked between them, let alone the lock that might occur with the token's pattern itself.

The good news, if you can call it that, is this only applies to the optional part of the rule; however, since such an assumption can't be made safely, I'll likely have to reduce this to something resembling a standard look-ahead ambiguity and utilize the same logic. Except the problem here is that it's not fixed like standard rules; it's mutable, since the depth, the specific rules, and so on are not constant. Which means I'll have to adjust my overall look-ahead concept and come up with a dynamic solution that can handle standard look-ahead locks and tail locks, and combines them into something else entirely.

Guess that's what I get for trying to make a non-backtracking LL(k) parser. Such are the joys of programming.

Sunday, March 22, 2009

Look-ahead Exploration - LL(k)

Well, about a week and a few dozen failures later, I've learned a thing or two.

The initial attempt at a 'syntax tree' yielded a system wherein I had a true tree of the syntax, but the problem was it didn't do a damn bit of good for a recursive descent parser. Extracting which rule was related to which look-ahead was difficult and hackish, at best a guess, and a guess doesn't work if you don't want a backtracking RD parser.

So I decided to use the syntax tree in a limited form: it expresses each rule in its flat form, with no branching from that point. Next, I obtain the full look-ahead for the rule at a given point; this does depth exploration and tells me the full scope of tokens available for a given look, but not which rule it unifies on. So I took it a step further and created a look-ahead path tree. This yields a dataset that tells me which tokens are used and what path is necessary to get to them (i.e., a stateless transition table that only explores rules relative to a given token). After obtaining the full look-ahead, I condensed the results so that common paths for the same token aren't duplicated (this ensures the logic built from this set is minimal, not redundant).
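A rough sketch of what I mean by the path table, in C#; the names here are illustrative stand-ins, not the project's actual types:

using System.Collections.Generic;
using System.Linq;

/* Hypothetical sketch of a look-ahead path table: for each token, record
 * the chain of rules a recursive descent parser would enter to reach it.
 * Adding a path condenses duplicates, so the decision logic generated
 * from the table stays minimal. */
public sealed class LookAheadPathTable
{
    private readonly Dictionary<string, List<string[]>> pathsByToken =
        new Dictionary<string, List<string[]>>();

    public void Add(string token, string[] rulePath)
    {
        List<string[]> paths;
        if (!this.pathsByToken.TryGetValue(token, out paths))
            this.pathsByToken[token] = paths = new List<string[]>();
        // Condense: identical paths for the same token are stored once.
        if (!paths.Any(existing => existing.SequenceEqual(rulePath)))
            paths.Add(rulePath);
    }

    public IEnumerable<string[]> PathsFor(string token)
    {
        List<string[]> paths;
        return this.pathsByToken.TryGetValue(token, out paths)
            ? paths
            : Enumerable.Empty<string[]>();
    }
}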

The next step is to combine the above data with the rule flat forms and build a more accurate tree of what's necessary. It'll contain the look-ahead information along with everything else, which should assist in the overall logic construction. I'll have to add a special case for direct and indirect left recursion, but it should be fine, I hope.

Once this nonsense is done, and I have it recognizing the language, I can focus on integrating the previously structured data sets. If it doesn't work, I'll just rewrite that part until it does.

Saturday, March 14, 2009

Language Data Analysis and Syntactical Analysis

Well, presently I'm working on two aspects of the language analysis phase: data dependencies (what it exports in the form of a Concrete Syntax Tree) and syntactical analysis. Both are turning out to be harder than expected, though the data analysis is a lot easier than the syntactical aspect by comparison. Below is a small snippet from the console (reformatted for reading) that describes the data-wise union of the sub-rules of a given rule in a language:

RelationalExpression
    Left                 False  RuleDataNode (RelationalExpression)
    Operator             False  EnumDataNode (Operators: LessThan | LessThanOrEqualTo | GreaterThan | GreaterThanOrEqualTo | IsType | CastToOrNull)
    Right                False  RuleDataNode (TypeReference)
                         False  RuleDataNode (ShiftLevelExpression)
ConditionalExpression
    Term                 False  RuleDataNode (LogicalOrExpression)
    QuestionDelimeter    False  EnumDataNode (Operators: ConditionalOperator)
    TruePart             False  RuleDataNode (ConditionalExpression)
    DecisionDelimeter    False  EnumDataNode (Operators: Colon)
    FalsePart            False  RuleDataNode (ConditionalExpression)
UnaryOperatorExpression
    PrefixUnaryOperator  True   EnumDataNode (Operators: Increment | Decrement | Addition | Subtraction)
    Operand              False  RuleDataNode (Operand)
                         False  RuleDataNode (Expression)
                         False  RuleDataNode (SingleUnitExpression)
                         False  RuleDataNode (CreateNewObjectExpression)
    UnnamedEntity        False  EnumDataNode (Operators: LeftParenthesis | RightParenthesis)
                         False  EnumDataNode (Operators: RightParenthesis)
    TargetType           False  RuleDataNode (TypeReference)


Fairly straightforward, though a bit cryptic unless you've seen the language description. I used concepts similar to those in the lexical analysis: rule element merging and so on. Each individual rule will utilize the intersection of the full dataset after it's been intersected across all rules. This should give me the main rule interface/class's data set and the individual sub-sets of each permutation of the rule.

As shown above, RelationalExpression splits on the 'Right' term. This is because 'is' and 'as' pair the Left term with a TypeReference, while the other relational operators pair it with a ShiftLevelExpression.

There are a few special notes. The 'True' or 'False' column represents the multiplicity of the elements. In cases where alternation is used, two elements with the same name but different types are placed in the same group; the final data extraction phase will sort them out as needed.

Additionally, sub-groups defined in a rule will also be lifted into a special sub-set data series. I'll cover this more once I have a more practical example.

Then there's the syntactical analysis phase, which is taking a large part of my focus due to its complexity. I've figured out how to mine the initial token set for a given rule at its starting point, but I think the next step is to start from the 'Root' rule and branch out from there. Once I have more concrete data I'll go into more detail. It's also going to require a reanalysis of the lexical aspects to discern overlapping elements; where elements overlap, it'll divulge exactly how much, so if two literal sets overlap on a few elements, the analysis need only concern itself with those, and if the current rule doesn't require any of the overlapped elements, it's a non-issue. The trickier cases are capture recognizer type tokens such as Identifiers, which will effectively trump Keywords, causing the look-ahead to increase; that's no big deal.
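The 'initial token set' mining is essentially a FIRST-set computation. A minimal sketch of the idea in C#, assuming a simplified rule model in which each rule is one flat sequence (these types are illustrative, not the project's):

using System.Collections.Generic;

/* Minimal FIRST-set sketch over a simplified grammar model: each rule is
 * a flat sequence of elements, where an element is either a token or a
 * reference to another rule. Optional elements let the scan continue past
 * them. Left recursion is not handled; the 'visited' set merely stops
 * infinite descent. */
public static class FirstSets
{
    public sealed class Element
    {
        public string Token;    // set when the element is a token
        public string RuleRef;  // set when the element references a rule
        public bool Optional;
    }

    public static HashSet<string> FirstOf(string rule, Dictionary<string, List<Element>> rules)
    {
        return FirstOf(rule, rules, new HashSet<string>());
    }

    private static HashSet<string> FirstOf(string rule, Dictionary<string, List<Element>> rules, HashSet<string> visited)
    {
        var result = new HashSet<string>();
        if (!visited.Add(rule))
            return result; // already exploring this rule; stop the descent
        foreach (Element e in rules[rule])
        {
            if (e.Token != null)
                result.Add(e.Token); // a token contributes itself
            else
                result.UnionWith(FirstOf(e.RuleRef, rules, visited));
            if (!e.Optional)
                break; // a required element ends the initial-token scan
        }
        return result;
    }
}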

I'll probably cross-link the data analysis aspect with the syntactical aspect prior to the logical branching exploration phase. That way, when ambiguities arise, it can discern which data elements are impacted and lift those into an ambiguity type, so they remain intact but are noted as ambiguous.

Funny how the moment I think I'm almost done, the stack gets bigger.

Edit: Figures, as soon as I post this, I find a bug.

It appears I made a flub: UnaryOperatorExpression contains two unnamed Operators elements, but one is LeftParenthesis | RightParenthesis and the other is RightParenthesis. The issue is that the LeftParenthesis entries should have been merged into one, and likewise the RightParenthesis entries. So now I need to add code to find the best match when merging, instead of taking the first element of equal nature it can be merged with.
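The fix will amount to something like the following; a hedged C# sketch with stand-in types (the real data nodes carry far more information):

using System.Collections.Generic;
using System.Linq;

/* Sketch of best-match merging: instead of merging an incoming element
 * into the first existing element of the same kind, score every candidate
 * by how many enum members the two share, and merge into the highest
 * scorer. */
public sealed class EnumDataNode
{
    public HashSet<string> Members = new HashSet<string>();

    public static void MergeBestMatch(List<EnumDataNode> existing, EnumDataNode incoming)
    {
        EnumDataNode best = null;
        int bestScore = 0;
        foreach (EnumDataNode candidate in existing)
        {
            // Score by the number of members the two nodes have in common.
            int score = candidate.Members.Intersect(incoming.Members).Count();
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }
        if (best != null)
            best.Members.UnionWith(incoming.Members);
        else
            existing.Add(incoming); // nothing overlaps; keep it separate
    }
}

Scoring by shared members means a LeftParenthesis group merges with the other LeftParenthesis group, rather than with the first compatible node it happens across.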

Edit 2: Fixed, though now I have another bug. The joys of programming.

Monday, March 9, 2009

Context Awareness for Simple Lexical Analysis

I've been working on the project quite a bit lately; the major thing of late is context awareness for state machines. Unlike the much older hack attempt from a year-plus ago, which used a radix tree structure or something similar, this uses a proper DFA implementation.

Most of the compiler compilers I've used seem to rely on state transition tables (a series of integers describing which state to go to on the next character), something I've decided against due to their hard-to-follow nature. For the sake of verifying that the code is doing what it should, I'm going with hard-coded output, which should also make certain other aspects easier to implement, such as context awareness.
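For contrast, here's roughly what the table-driven style looks like; a hypothetical fragment mirroring the first-letter dispatch of the machine below, not taken from any particular generator:

/* Hypothetical table-driven equivalent of the machine's initial state:
 * rows are states, columns are character classes, and the stored value
 * is the next state (-1 means no transition). */
static class TableDrivenExample
{
    // Columns:  'a' 'e' 'f' 'm' 'p' 'r' 't'
    static readonly int[,] Transitions =
    {
        {  1,  5,  3,  4,  6,  2,  7 },  // state 0: dispatch on first letter
    };

    static readonly string Alphabet = "aefmprt";

    public static int Next(int state, char ch)
    {
        int column = Alphabet.IndexOf(ch);
        return column < 0 ? -1 : Transitions[state, column];
    }
}

The logic is compact, but the correctness lives entirely in the numbers, which is exactly what makes it hard to audit by eye.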

Below, I've included a simple state machine; its focus is parsing the attribute targets for a set of Custom Attribute declarations.


public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.AssemblyTarget) != AttributeTargetCases.None))
                    {
                        // a

                        this.state = 1;
                        return true;
                    }
                    break;
                case 'r':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.ReturnTypeTarget) != AttributeTargetCases.None))
                    {
                        // r

                        this.state = 2;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.FieldTarget) != AttributeTargetCases.None))
                    {
                        // f

                        this.state = 3;
                        return true;
                    }
                    break;
                case 'm':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.MethodTarget) != AttributeTargetCases.None))
                    {
                        // m

                        this.state = 4;
                        return true;
                    }
                    break;
                case 'e':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.EventTarget) != AttributeTargetCases.None))
                    {
                        // e

                        this.state = 5;
                        return true;
                    }
                    break;
                case 'p':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.PropertyTarget) != AttributeTargetCases.None))
                    {
                        // p

                        this.state = 6;
                        return true;
                    }
                    break;
                case 't':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.TypeTarget) != AttributeTargetCases.None))
                    {
                        // t

                        this.state = 7;
                        return true;
                    }
                    break;
            }
            break;

/* Truncated for publication limitations */

As you can tell, it short-circuits on the first letter, and that's as far as it needs to go, though it's a very simple example. A better, or at least more realistic, example is shown below:

/* Excerpt from CSKeywordsTokenStateMachine */
public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                /* Truncated for publication limitations */
                case 'r':
                    if (((this.ActiveSets & (KeywordCases.Case2 | KeywordCases.Case3)) != KeywordCases.None)
                        && (((this.AllowedInSet2 & (KeywordsCase2.ReadOnly | KeywordsCase2.Ref)) != KeywordsCase2.None)
                            || ((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None)))
                    {
                        // r

                        this.state = 15;
                        return true;
                    }
                    break;
            }
            break;
        /* Second state after 'r' */
        case 15:
            switch (@char)
            {
                case 'e':
                    // re

                    this.state = 231;
                    return true;
            }
            break;
        
        /* Third character, after 're': further decisions are
         * made here to limit path travel for context awareness. */

        case 231:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedInSet2 & KeywordsCase2.ReadOnly) != KeywordsCase2.None))
                    {
                        // rea

                        this.state = 232;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedInSet2 & KeywordsCase2.Ref) != KeywordsCase2.None))
                    {
                        this.exitState = ExitStates.Ref;
                        // ref

                        this.state = -1;
                        return true;
                    }
                    break;
                case 't':
                    if (((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None))
                    {
                        // ret

                        this.state = 233;
                        return true;
                    }
                    break;
            }
            break;
        /* Truncated for publication limitations */
    }
    // No transition for this character in the current state.
    return false;
}

As you can see, in cases where multiple elements overlap on the first few characters, it was necessary to include them all in the check. Further into the code, it continues to refine the conditions, reducing the time spent parsing a given token when the extra work isn't needed.

The literals defined in the syntax of a given grammar are 'deliteralized' and turned into literal references. The first instance of a literal in the definition of a token (a token being made up of nothing but literals) is used. For explicit control, the token and literal names can be specified, should one have the need.

In cases where more flexibility is needed, a capturing recognizer is used. I'll probably go through later and write it properly so it can handle the captures, but for now I'm just using this as a boot-strapper, so it's not a big deal.
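For the identifier case, the recognizer amounts to something like this; a minimal sketch assuming plain C-style identifiers, not the project's actual recognizer:

/* Minimal capturing recognizer for C-style identifiers: a letter or
 * underscore first, then letters, digits, or underscores, accumulating
 * the capture as it goes. */
public sealed class IdentifierRecognizer
{
    private readonly System.Text.StringBuilder capture = new System.Text.StringBuilder();

    public string Capture { get { return this.capture.ToString(); } }

    public bool Next(char @char)
    {
        bool ok = this.capture.Length == 0
            ? char.IsLetter(@char) || @char == '_'
            : char.IsLetterOrDigit(@char) || @char == '_';
        if (ok)
            this.capture.Append(@char);
        return ok;
    }
}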

Friday, March 6, 2009

NFA - DFA - and so on

I'd like to say I've been super busy, but the reality is quite the opposite.

My job's entered work-share, which means the hours I was working (60+) were cut to less than half.

In other news, I decided to use the extra time to focus on some other areas of language research, namely NFA -> DFA -> minimal-state state machines. So far it's going very well. The next step, since I'm finished with tokens for the scope of the project, is to do the same for syntax elements. This will require a fair bit more effort, since the DFA implementation I've got working is simple, to say the least.
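The NFA -> DFA step is the standard subset construction. A compact sketch in C#, over a deliberately simplified NFA (no epsilon transitions, accepting states and minimization omitted, types illustrative):

using System.Collections.Generic;
using System.Linq;

/* Subset construction over a simplified NFA: each DFA state is the set of
 * NFA states reachable on the input so far. Minimization would follow as
 * a separate pass (not shown). */
public static class SubsetConstruction
{
    // nfa[state][ch] = set of next NFA states
    public static Dictionary<string, Dictionary<char, string>> Determinize(
        Dictionary<int, Dictionary<char, HashSet<int>>> nfa, int start)
    {
        var dfa = new Dictionary<string, Dictionary<char, string>>();
        var pending = new Queue<HashSet<int>>();
        pending.Enqueue(new HashSet<int> { start });
        while (pending.Count > 0)
        {
            HashSet<int> current = pending.Dequeue();
            string name = Name(current);
            if (dfa.ContainsKey(name))
                continue; // this state set was already processed
            var moves = new Dictionary<char, string>();
            dfa[name] = moves;
            // Union the NFA moves of every member state, per character.
            foreach (char ch in current.Where(nfa.ContainsKey)
                                       .SelectMany(s => nfa[s].Keys)
                                       .Distinct())
            {
                var next = new HashSet<int>();
                foreach (int s in current)
                {
                    Dictionary<char, HashSet<int>> row;
                    HashSet<int> targets;
                    if (nfa.TryGetValue(s, out row) && row.TryGetValue(ch, out targets))
                        next.UnionWith(targets);
                }
                moves[ch] = Name(next);
                pending.Enqueue(next);
            }
        }
        return dfa;
    }

    private static string Name(HashSet<int> states)
    {
        return "{" + string.Join(",", states.OrderBy(s => s)) + "}";
    }
}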

Right now I'm creating a mock-up for the syntax analysis, to get a good idea of what I'm in for. I have a fairly good idea of how I want it to work; next is getting there. There's a lot of other work to be done with regard to the Concrete Syntax Tree it'll build of the rules, and so on.

I'll post more once I have it to a point where it doesn't just emit a bunch of token state machines.