Monday, March 9, 2009

Context Awareness for Simple Lexical Analysis

I've been working on the project quite a bit lately; the major recent addition is context awareness for state machines. Unlike the much older hack attempt from a year-plus ago, which used a radix tree structure or something similar, this uses a proper DFA implementation.

Most of the compiler compilers I've used seem to rely on state transition tables (arrays of integers that say which state to go to on the next character), something I've decided against because they're hard to follow. To make it easy to verify that the code is doing what it should, I'm going with hard-coded switch statements instead, which should also make certain other aspects easier to implement, such as context awareness.
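
For contrast, here's a rough sketch of the table-driven style I'm avoiding; the states, characters, and transitions in it are made up purely for illustration, not taken from the generated code:

// Hypothetical table-driven DFA fragment: rows are states, columns are
// character classes, and each cell holds the next state (-1 = reject).
private static readonly int[,] transitions =
{
    //        'a'  'r'  'f'  other
    /* 0 */ {  1,   2,   3,  -1 },
    /* 1 */ { -1,  -1,  -1,  -1 },
    /* 2 */ { -1,  -1,  -1,  -1 },
    /* 3 */ { -1,  -1,  -1,  -1 },
};

private static int Classify(char c)
{
    // Map a character onto a column index; anything unlisted falls into
    // the catch-all class.
    switch (c)
    {
        case 'a': return 0;
        case 'r': return 1;
        case 'f': return 2;
        default: return 3;
    }
}

public bool NextTableDriven(char @char)
{
    int next = transitions[this.state, Classify(@char)];
    if (next == -1)
        return false;
    this.state = next;
    return true;
}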

Below, I've included a simple state machine; its focus is parsing the attribute targets for a set of custom attribute declarations.


public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.AssemblyTarget) != AttributeTargetCases.None))
                    {
                        // a

                        this.state = 1;
                        return true;
                    }
                    break;
                case 'r':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.ReturnTypeTarget) != AttributeTargetCases.None))
                    {
                        // r

                        this.state = 2;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.FieldTarget) != AttributeTargetCases.None))
                    {
                        // f

                        this.state = 3;
                        return true;
                    }
                    break;
                case 'm':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.MethodTarget) != AttributeTargetCases.None))
                    {
                        // m

                        this.state = 4;
                        return true;
                    }
                    break;
                case 'e':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.EventTarget) != AttributeTargetCases.None))
                    {
                        // e

                        this.state = 5;
                        return true;
                    }
                    break;
                case 'p':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.PropertyTarget) != AttributeTargetCases.None))
                    {
                        // p

                        this.state = 6;
                        return true;
                    }
                    break;
                case 't':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.TypeTarget) != AttributeTargetCases.None))
                    {
                        // t

                        this.state = 7;
                        return true;
                    }
                    break;
            }
            break;

*truncated for publication limitations*
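
The checks above assume a [Flags] enumeration for the allowed targets; something along these lines, though the values shown here are just a guess at the actual declaration:

// Assumed shape of the flags enum the state machine masks against; the
// member names come from the excerpt, the values are illustrative.
[Flags]
public enum AttributeTargetCases
{
    None             = 0,
    AssemblyTarget   = 1 << 0,
    ReturnTypeTarget = 1 << 1,
    FieldTarget      = 1 << 2,
    MethodTarget     = 1 << 3,
    EventTarget      = 1 << 4,
    PropertyTarget   = 1 << 5,
    TypeTarget       = 1 << 6,
}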

As the state machine above shows, it short-circuits on the first letter, which is as far as it needs to go for these targets. It is, admittedly, a very simple example; a better, more realistic one is shown below:

/* Excerpt from CSKeywordsTokenStateMachine */
public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                /* Truncated for publication limitations */
                case 'r':
                    if ((((this.ActiveSets & (KeywordCases.Case2 | KeywordCases.Case3)) != KeywordCases.None) && (((this.AllowedInSet2 & (KeywordsCase2.ReadOnly | KeywordsCase2.Ref)) != KeywordsCase2.None) || ((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None))))
                    {
                        // r

                        this.state = 15;
                        return true;
                    }
                    break;
            }
            break;
        /* Second state after 'r' */
        case 15:
            switch (@char)
            {
                case 'e':
                    // re

                    this.state = 231;
                    return true;
            }
            break;
        
        /*
         * Third character after 're'; further
         * decisions made to limit path travel for
         * context awareness.
         */

        case 231:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedInSet2 & KeywordsCase2.ReadOnly) != KeywordsCase2.None))
                    {
                        // rea

                        this.state = 232;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedInSet2 & KeywordsCase2.Ref) != KeywordsCase2.None))
                    {
                        this.exitState = ExitStates.Ref;
                        // ref

                        this.state = -1;
                    }
                    break;
                case 't':
                    if (((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None))
                    {
                        // ret

                        this.state = 233;
                        return true;
                    }
                    break;
            }
            break;
        /* Truncated for publication limitations */
    }
    // No transition matched for the current state and character.
    return false;
}

As you can see, in cases where multiple elements overlap in their first few characters, it was necessary to include them all in the initial check. Further into the code, it continues to refine the set of candidates, cutting off paths, and the time spent parsing a given token, as soon as they're no longer needed.
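
To give an idea of how one of these machines gets driven, here's a rough caller sketch; the ExitState property and its None value are assumptions on my part rather than the actual surface of the class:

// Hypothetical driver: feed characters until the machine rejects one,
// then check whether it landed on a recognized keyword.
public static bool TryMatchKeyword(string text, CSKeywordsTokenStateMachine machine)
{
    foreach (char c in text)
    {
        // Next(...) returns false once the character cannot extend any
        // keyword still allowed by the active context sets.
        if (!machine.Next(c))
            break;
    }
    // A terminal keyword such as "ref" records an exit state when its
    // last character is consumed.
    return machine.ExitState != ExitStates.None;
}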

The literals defined in the syntax of a given grammar are 'deliteralized' and turned into literal references. By default, the first instance of the literal found in the definition of a token (one made up of nothing but literals) is used; for explicit control, the token and literal name can be specified, should one have the need.

In cases that need more flexibility, a capturing recognizer is used. I'll probably go back later and write it properly so it can actually handle the captures, but for now I'm just using this as a bootstrapper, so it's not a big deal.
