Friday, September 4, 2009

Build: Refresh

Well,

After two years on the project, I decided a few days ago to start a major overhaul of the compiler's build code. In those two years I learned the basic structure of NFAs and DFAs, recognizers vs. transducers, look-ahead determination, and how to create a state machine that can figure out the look-ahead of the language at any given point; however, through all of this, the code has become quite cluttered and needs a major rewrite.

I often had to spend a good twenty minutes to find my bearings in code I had just written the day before.

To avoid creating inferior code, I decided that it was a good time to take what I learned and put it to use: create common abstractions over the DFA/NFA constructs observed within the language. This will allow not only cleaner code, but also code that can be more readily adapted to new uses down the line.
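To make that concrete, here's a minimal sketch of the kind of shared abstraction I have in mind; the interface and member names below are hypothetical placeholders, not the final design:

using System.Collections.Generic;

// Hypothetical shape of a common automaton-state abstraction: every state
// exposes its outgoing transitions and whether it is a valid stopping point.
public interface IAutomatonState<TCheck, TTargets>
{
    // Outgoing edges, keyed by the condition (character range, token set,
    // and so on) that activates them; TTargets would be a single state for
    // a DFA and a set of states for an NFA.
    IDictionary<TCheck, TTargets> OutTransitions { get; }

    // Whether stopping in this state yields a valid match.
    bool IsEdge { get; }
}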

To track the progress of this refresh, I've posted a checklist to its associated CodePlex discussions page.

Friday, August 28, 2009

LL(∞)

Finally,

I have the parser compiler somewhere I can call progress, and with that comes knowing the class of parser I'm writing a compiler for: LL(∞). This still places it in the LL parser category (top-down, left-most derivation); however, it means that the look-ahead for the grammars is not bounded by any specific k.

I've written a very simplistic recognizer (below) to go with the look-ahead state machine. Once the look-ahead aspect was finished, the recognizer itself was child's play. Now comes the hard part: figuring out whether this machine is structured well enough to allow for data mining. Recognition is one thing, but capturing the essential elements is another. I don't doubt that this project is just beginning.

public ICSCSharpFileRule Parse(string filename)
{

    Stream s = new FileStream(filename, FileMode.Open);
    this.currentScanner.SetStream(s);

    CSStreamState currentState = rootState;
    // Seed each stack with a placeholder entry so the anonymous types can be
    // inferred, then immediately pop and discard the placeholders below.
    var ambiguityStack = Pusher(new { Location = long.MinValue, Entry = (CSharpReaderScanData.Entry)null, State = (CSStreamState)null, Transition = new TokenTransition() });
    var errorPoints = Pusher(new { Location = long.MinValue, Transition = new TokenTransition() });
    var errorPoint = errorPoints.Pop();
    var lastAmbiguity = ambiguityStack.Pop();
    errorPoint = null;
    lastAmbiguity = null;
    int states = 0;
    bool hasValidExit = false;
    while (true)
    {
        if (lastAmbiguity != null)
            this.currentScanner.MoveTo(lastAmbiguity.Location + lastAmbiguity.Entry.Length);
        var currentScanData = this.currentScanner.NextToken(currentState.Transitions.FullSet);
        if (currentScanData.Count == 0)
            errorPoints.Push(new { Location = currentScanData.Location, Transition = currentState.Transitions.FullSet });
        else
        {
            foreach (var entry in currentScanData)
                ambiguityStack.Push(new { Location = currentScanData.Location, Entry = entry, State = currentState, Transition = entry.GetTransition() });
        }
        if (ambiguityStack.Count == 0)
            break;
        var currentAmbiguity = ambiguityStack.Pop();
        states++; // count the states stepped through, for the diagnostic output below
        //Unhinged token check.
        if (currentAmbiguity.State.Transitions.ContainsKey(currentAmbiguity.Transition))
            currentState = currentAmbiguity.State.Transitions[currentAmbiguity.Transition];
        else
            /* *
             * Token doesn't exist in valid set.
             * It's still parsed because it can be
             * ignored.
             * */
            currentState = currentAmbiguity.State;
        
        /* *
         * If the entire language parser has reached
         * either the end of the file, or there are
         * no valid transitions remaining, the parse
         * is done.
         * */

        if (currentAmbiguity.Entry.ID == CSharpReaderTokens.EndOFile ||
            currentState.Transitions.Count == 0)
        {
            hasValidExit = true;
            break;
        }
        lastAmbiguity = currentAmbiguity;
    }
    if (!hasValidExit)
    {
        foreach (var currentErrorPoint in errorPoints)
        {
            if (errorPoint == null)
                errorPoint = currentErrorPoint;
            else if (currentErrorPoint.Location > errorPoint.Location)
                errorPoint = currentErrorPoint;
        }
        Console.WriteLine("{0} - Expected {1}, States: {2}", errorPoint.Location, errorPoint.Transition, states);
    }
    s.Dispose();
    return null;
}

public static Stack<T> Pusher<T>(T root)
{
    var result = new Stack<T>();
    result.Push(root);
    return result;
}


One of the fun things I've gained from all this is a better understanding of which collection types to use and when (Stack&lt;T&gt;, Queue&lt;T&gt;, HashSet&lt;T&gt;, List&lt;T&gt;). I also goofed and used LINQ to perform an ordering operation on the transitions within a given state. Not a terrible thing on its own, but not a bright one either: by itself, on a 5300+ state parse, it added 1.2 seconds to the evaluation time. I've also cleaned up the first/follow set calculations (before, they were entirely separate) and added a third level of cache (to the transition table construction).
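For the curious, here's a rough sketch of the shape of the fix, assuming the ordering is requested repeatedly during the parse; TokenTransition and Precedence are stand-in names, not the project's actual API:

// Requires System.Linq. Order the transitions once on first request and
// reuse the array, instead of re-running OrderBy on every visit.
private TokenTransition[] orderedTransitions;

public IEnumerable<TokenTransition> OrderedTransitions
{
    get
    {
        if (this.orderedTransitions == null)
            this.orderedTransitions =
                this.Transitions.Keys.OrderBy(t => t.Precedence).ToArray();
        return this.orderedTransitions;
    }
}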

The third level of cache only yields a time savings of a few hundred milliseconds on an expression-heavy verification run, but the frequency with which the cache finds a match was rather surprising (it matched more often than not).



As an interesting point of comparison: a little over a year ago, parsing a 150KB document without context sensitivity took 800 milliseconds. The same file, still without context sensitivity, now takes 25 milliseconds, roughly 32 times faster. I guess deterministic finite automata are faster :]

Sadly, with context sensitivity a smaller file (but with higher cyclomatic complexity) takes 1.3 seconds to parse the first time. The second time it takes 50 milliseconds. The difference is due to the necessary code to construct the lookahead initial/first/follow sets and associated transition tables.

Hopefully there's still some work I can do to expedite the entire matter. I have some ideas; however, I want to delay such optimizations until I have a properly functioning parser. I don't want to further complicate an already complex matter.

Saturday, July 25, 2009

C♯ Parser

Well,

It's taken a long time and a lot of research, but I'm finally making headway on the C♯ parser I've been writing (well, writing the parser compiler to write...).

Previously, I decided to attempt a backtrack-less, left-most derivation, recursive descent parser. In the end, all the research told me: it's not worth it. Other LL(*) implementations, such as ANTLR, that use recursive descent and error recovery features to handle back-tracking, typically involve large amounts of spaghetti code that makes reading it, or discovering the original language description from it, nigh impossible.

So I thought: there has to be a better way. So far the research into that has been a struggle; from what I can tell, there's little information on non-recursive-descent parsing techniques that are top-down, left-most derivation in nature, especially if you're interested in LL(*) rather than LL(k).

Another issue with recursive descent is left recursion, in three different varieties: direct, indirect, and hidden. Hidden is the most irksome case, but important nonetheless. To alleviate this entire issue, I shifted gears and started down a deterministic model that would work similarly to a DFA for regular expressions, except this one would accept recursion.

Instead of pre-compiling this information into the application, due to cyclic model issues, I decided that it would be more appropriate to focus on embedding the logic to compute this directly in the resultant application. With this, the application can delay large computational operations related to look-ahead/follow sets until the information is absolutely necessary (as in, what's next right now?)
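A minimal sketch of what that deferral could look like in the generated parser, assuming hypothetical TokenSet and ComputeFollowSet members; the only point is that the expensive set construction runs on first use and is cached afterward:

// Compute a rule state's look-ahead/follow set the first time that state is
// actually reached, then cache it, rather than pre-computing every set up front.
private readonly Dictionary<int, TokenSet> followSets = new Dictionary<int, TokenSet>();

private TokenSet GetFollowSet(int ruleStateId)
{
    TokenSet result;
    if (!this.followSets.TryGetValue(ruleStateId, out result))
    {
        // The expensive walk across the rule graph happens here, once.
        result = this.ComputeFollowSet(ruleStateId);
        this.followSets.Add(ruleStateId, result);
    }
    return result;
}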

Presently I'm hand-writing the logic, using the C# parser as a basis for data, and once I'm done I can bake the associated logic into the parser builder. I'll post more once there's more to be said.

Sunday, June 14, 2009

Language Integrated Query - Manual or Aided Objectified Building

Well,

Today's project was implementing an objectified representation of Language Integrated Query. Even before I started, I knew that I wanted it to be as seamless as the original, so I decided to create two variants of the structure: your standard AST form, and a 'builder' form, where the builder takes responsibility for selecting the proper body type (select, group, select into, group into), initializing the expression, including the clauses in the order you specify them, and then finally building the expression.

Here's an example of both versions. I used an example from the C# Programming Guide (How to: Perform Left Outer Joins) because I needed a simple example for demonstrative purposes.

First up is building the expression through the normal do-it-yourself method:

ILinqExpression queryTestA = new LinqExpression();
queryTestA.From = new LinqFromClause(/* rangeVariableName: */"person", /* rangeSource: */"people".GetSymbolExpression());
var queryTestABody = new LinqSelectBody();
queryTestA.Body = queryTestABody;
queryTestABody.Clauses.Join(/* rangeVariableName: */"pet",    /* rangeSource: */"pets".GetSymbolExpression(), /* conditionLeft: */"person".GetSymbolExpression(), /* conditionRight: */"pet".Fuse("Owner"), /* intoRangeName: */"gj");
queryTestABody.Clauses.From(/* rangeVariableName: */"subPet", /* rangeSource: */"gj".Fuse("DefaultIfEmpty").Fuse(new IExpression[0]));
queryTestABody.Selection = "string".Fuse("Format").Fuse("{0,-15}{1}".ToPrimitive(), "person".Fuse("FirstName"), ((Symbol)"subPet").EqualTo(IntermediateGateway.NullValue).IIf("string".Fuse("Empty"), "subPet".Fuse("Name")));


Next is the seamless version using the aid:
ILinqExpression queryTestB = LinqHelper
                             .From(/* rangeVariableName: */"person", /* rangeSource: */(Symbol)"people")
                             .Join(/* rangeVariableName: */   "pet", /* rangeSource: */(Symbol)"pets", /* conditionLeft: */(Symbol)"person", /* conditionRight: */"pet".Fuse("Owner"), /* intoRangeName: */"gj")
                             .From(/* rangeVariableName: */"subPet", /* rangeSource: */"gj".Fuse("DefaultIfEmpty").Fuse(new IExpression[0]))
                             .Select(/* selection: */"string".Fuse("Format").Fuse("{0,-15}{1}".ToPrimitive(), "person".Fuse("FirstName"), ((Symbol)"subPet").EqualTo(IntermediateGateway.NullValue).IIf("string".Fuse("Empty"), "subPet".Fuse("Name")))).Build();


So far it seems to emphasize the difference pretty well. Earlier today the selection line was horrendous due to constructing binary operations by hand; I hadn't yet created extensions to simplify their construction, and ensuring the operands are affixed at the proper precedence made for very nasty-looking code. I didn't use the original example's anonymous type initialization because I simply haven't created the groundwork for it yet, so that's probably next on my list.

Here's the console output from printing the string version of the expression (the ToString overrides default to C# syntax, for this programmer's convenience):
from person in people
join pet in pets on person equals pet.Owner into gj
from subPet in gj.DefaultIfEmpty()
select string.Format("{0,-15}{1}", person.FirstName, subPet == null ? string.Empty : subPet.Name)


And in doing so I noticed one thing: I have to be careful about examples. I used 'string' instead of 'String' as a symbol. While C# will understand that when emitting, and VB's nicety of being case-insensitive will too, other languages that don't support the alias won't.

Friday, May 22, 2009

Lambda Expressions - Hoisting

Just musing over the concept of lambda expressions, a.k.a. anonymous methods. If what I understand is correct, the entire method of hoisting involves a simple analysis of the scopes involved within the hoist. You'd merely have to create a scope dependency chain within each scope. If a scope depends on another scope, you evaluate the associated variables and move them forward. This process would be nested: if a sub-scope depended upon a scope two levels higher, the parent scope would also depend upon its parent's scope, albeit indirectly. The scopes, in essence, would reference one another as instances of the classes that represent each individual scope.
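As a rough illustration of the idea (this is roughly what the C# compiler does today, paraphrased with invented names), a captured local is hoisted into a class that represents its scope, and the lambda becomes a method on that class:

using System;

// Original code:
static Func<int, int> MakeScaler()
{
    int factor = 3;                  // local captured by the lambda
    return x => x * factor;
}

// Roughly what hoisting produces:
class ScopeFrame                     // the scope, reified as a class
{
    public int factor;
    public int Scale(int x) { return x * this.factor; }
}

static Func<int, int> MakeScalerHoisted()
{
    var frame = new ScopeFrame();    // one instance per activation of the scope
    frame.factor = 3;
    return frame.Scale;              // the lambda, now a method on the frame
}

A lambda nested two scopes deep would simply capture a reference to the outer frame inside its own frame, which is the indirect dependency described above.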

I'll post more once I've got a working implementation within the code generation framework.

Building a framework for compilers...

I'm taking a break from the Parser Compiler madness to work on some functionality that would be useful for a compiler: what I call an AssemblyWorkspace. It's basically a large-scale wrapper around the concept of an INamespaceParent (which is also a type parent). In creating an AssemblyWorkspace, you provide it with the assembly that is its 'core' assembly; a typical example would be the Intermediate Project you're building through the infrastructure. On top of the core assembly, there are the referenced assemblies.

The namespaces of the workspace are a union of all the namespaces within its scope, and the same goes for the types. When a type name is encountered and there are multiple types by that name in the current scope, a light-weight wrapper type, an ambiguity type, is introduced into the system. The ambiguity type contains a list of the types that are locked for the name.
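To make that concrete, here's a hedged sketch of what such a wrapper could look like; the interface names are illustrative, not the Abstraction Project's actual API:

using System.Collections.Generic;

// Illustrative only: when a name resolves to more than one type across the
// core and referenced assemblies, lookup yields a wrapper listing every
// candidate rather than silently picking one.
public interface IAmbiguousType : IType
{
    // The types currently locked for the name.
    IEnumerable<IType> Candidates { get; }
}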

Now the concept is simple, but making it work is a lot of work. I'm making it as simple and straightforward as any other INamespaceParent, so you can iterate the namespace names, types, and so on without having to worry about where a given type comes from. Its aim is to simplify the symbol resolution phase. It'll be used for the global scope that exists beyond the current type hierarchy the code structure has available to it. I'll likely create a scope system that uses an AssemblyWorkspace as its core aspect and go from there. The other fun issue is making it malleable, so that when references are inserted and removed the workspace stays current. The final issue is speed: the higher-level the implementation, the more likely it is to be slow, and the Abstraction Project is very high level. It's meant to be as simple as can be, and every concept that can be expressed in a simplified form, is.

Common example:

var myAssembly = IntermediateGateway.CreateAssembly("myAssembly");
myAssembly.AssemblyInformation.AssemblyVersion = new Version(1, 0);
var defaultNamespace = myAssembly.Namespaces.Add("MyNamespace.Subnamespace");
var testClass = defaultNamespace.Classes.Add("TestClass");
var bitwiseAndOp = testClass.BinaryOperatorCoercions.Add(CoercibleBinaryOperators.BitwiseAnd, testClass);


The scope implementation will have an alias system, an import system (for namespaces) and general querying functionality to simplify lookup.

On an amusing note, I create some hellacious type names:
IntermediateGenericSegmentableInstantiableType<IClassCtorMember,IIntermediateClassCtorMember,IClassEventMember,IIntermediateClassEventMember,IntermediateClassEventMember.EventMethodMember,IClassFieldMember,IIntermediateClassFieldMember,IClassIndexerMember,IIntermediateClassIndexerMember,IClassMethodMember,IIntermediateClassMethodMember,IClassPropertyMember,IIntermediateClassPropertyMember,IClassType,IIntermediateClassType,IntermediateClassType>.BinaryOperatorMember

Thursday, April 2, 2009

DFA - Behind the Scenes

Just posting for anyone interested. Something I learned while handling a large part of the finite state automata code was how to work with subsets, full sets, and so on. In order to ease the pain of working with a large series of characters, I created a 'RegularLanguageBitArray', which originally wrapped System.Collections.BitArray. One thing I found was that, given the frequency at which I was editing the values, it would probably be quicker to implement my own variation on the problem: bitwise And, Or, and Exclusive Or were all handled on a bit-by-bit basis, which doesn't really make sense when you're working with sets that can potentially span thousands of bits.

Thus 'MinimalSetData' was created: a specially written variant of BitArray, a simple data structure aimed at easing the use of subsets which are part of a bigger set. Each instance contains an offset marking where the first bit is set, a length, and the length of the full set it belongs to. On top of this, inverted sets were common in the reg-ex-ish syntax I provided for the parser compiler, so having a means of keeping all of this together for reuse later on was helpful.
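A minimal sketch of the shape of such a structure, with assumed member names (the real MinimalSetData certainly differs in the details):

// Sketch of a subset-of-a-larger-set bit vector: only the interesting span is
// stored, along with where it sits in the full set and whether it's inverted.
public struct MinimalSetSketch
{
    public uint Offset;       // index of the first set bit within the full set
    public uint Length;       // number of bits actually stored
    public uint FullLength;   // size of the full set this is a subset of
    public bool IsInverted;   // whether the payload represents the complement
    public uint[] Bits;       // packed payload, 32 bits per element

    // Word-wise OR rather than bit-by-bit; both operands are assumed here to
    // share the same offset, purely to keep the sketch short.
    public static uint[] Or(uint[] left, uint[] right)
    {
        uint[] longer = left.Length >= right.Length ? left : right;
        uint[] shorter = left.Length >= right.Length ? right : left;
        uint[] result = (uint[])longer.Clone();
        for (int i = 0; i < shorter.Length; i++)
            result[i] |= shorter[i];
        return result;
    }
}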

I utilized this special bit array functionality in three areas: character ranges, token sub-set ranges, and syntactical ranges (covering which rules, tokens, and token sub-elements apply).

The next step is to utilize the information provided by a few other classes (which I'll go into later) and perform look-ahead resolution. Based upon what I've been reading, the type of parser I'm creating is similar to the Earley parser, since I'm aiming for a non-backtracking system that will probably use a state-based approach to handle the parse process. Here's hoping it doesn't have the same operational time (O(n³)).

Monday, March 30, 2009

LL(k) Parser - Tail Ambiguity

One of the fun things about doing it yourself is that you learn about some annoying hurdles the hard way. The most recent is what I'm calling 'tail ambiguity', which is basically a lock created by the required part of a calling rule after the rule being parsed exits: the optional tail of the rule being parsed overlaps with the required part of (one of) the calling rule(s).

A simple example: you have a rule encased in '(' and ')', and between the two you call a second rule which ends with an optional series of ')' sets. On the calling rule, you would have to determine how many of those ')' elements belong to the optional part of the called rule and how many belong to the calling rule. Things get more complicated when you consider recursion on the first rule encased in parentheses.

After pondering this constantly since I encountered it (by making a toy grammar for kicks, without the intent to find it), I think I nearly have a solution; I just want to make sure before I spend a few days implementing something that may not work. I haven't even begun to cover what happens when the optional tail of a rule is locked between multiple rules, and those calling rules have potentially more than one token locked between them, let alone the lock that might occur with the token's pattern itself.

The good news is that this only applies to the optional part of the rule, if you can call that good news; however, since such an assumption can't be relied upon, I'll likely have to reduce this system to something similar to a standard look-ahead ambiguity and utilize the same logic. Except the problem here is that it isn't fixed like standard rules; it's mutable, since the depth, the specific rules, and so on are not constant. Which means I'll have to adjust my overall look-ahead concept and come up with a dynamic solution that can handle standard look-ahead locks and tail locks, combining them into something else entirely.

Guess that's what I get for trying to make a non-backtracking LL(k) parser. Such are the joys of programming.

Sunday, March 22, 2009

Look ahead exploration - LL(k)

Well, about a week and a few dozen failures later, I've learned a thing or two.

The initial attempt at a 'syntax tree' yielded a system wherein I had a true tree of the syntax, but the problem was that it didn't do a damn bit of good for a recursive descent parser. Extracting the information about which rule was related to which look-ahead was difficult and hackish, at best a guess, which doesn't work if you don't want a backtracking RD parser.

So I decided to use the syntax tree in a limited form: it expresses each rule in its flat form with no branching from that point. Next, I obtain the full look-ahead for the rule at a given point in the rule; this does depth exploration and tells me the full scope of tokens available for a given look, but it doesn't tell me which one it unifies on, and so on. So I took it a step further and created a look-ahead path tree. This yields a dataset which informs me of which tokens are used and what path is necessary to get to them (i.e. a stateless transition table that only explores rules relative to a given token). After obtaining the full look-ahead, I condensed the results so that common paths for the same token aren't duplicated (this ensures that the logic built for this set is minimal, and not redundant).
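For a sense of the shape of that dataset, here's a hedged sketch of a path-tree node; the types are placeholders rather than the project's actual classes:

using System.Collections.Generic;

// One node in a look-ahead path tree: for a single token, the chain of rules
// that must be descended through to reach it from the current point.
public class LookAheadPathNode
{
    // The token this path ultimately leads to.
    public string Token;

    // The rule traversed at this step on the way to the token.
    public string Rule;

    // Further steps; empty when the token is directly reachable here. Common
    // prefixes are shared, which is what the condensing step above produces.
    public List<LookAheadPathNode> Children = new List<LookAheadPathNode>();
}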

The next step is to utilize the above data with the rule flat forms and build a more accurate tree of what's necessary; it'll contain the look-ahead information as well as everything else, which should assist in the overall logic construction. I'll have to add a special case for direct and indirect left recursion, but it should be fine, I hope.

Once this nonsense is done, and I have it recognizing the language, I can focus on integrating the previously structured data sets. If it doesn't work, I'll just rewrite that part until it does.

Saturday, March 14, 2009

Language Data Analysis and Syntactical Analysis

Well, presently I'm working on two aspects of the language analysis phase: data dependencies (what it exports in the form of a concrete syntax tree) and syntactical analysis. Both are turning out to be harder than expected, though the data analysis is a lot easier than the syntactical aspect by comparison. Below is a small snippet from the console (reformatted for reading) that describes the data-wise union of the sub-rules of a given rule in a language:

RelationalExpression
    Left                 False  RuleDataNode (RelationalExpression)
    Operator             False  EnumDataNode (Operators: LessThan | LessThanOrEqualTo | GreaterThan | GreaterThanOrEqualTo | IsType | CastToOrNull)
    Right                False  RuleDataNode (TypeReference)
                         False  RuleDataNode (ShiftLevelExpression)
ConditionalExpression
    Term                 False  RuleDataNode (LogicalOrExpression)
    QuestionDelimeter    False  EnumDataNode (Operators: ConditionalOperator)
    TruePart             False  RuleDataNode (ConditionalExpression)
    DecisionDelimeter    False  EnumDataNode (Operators: Colon)
    FalsePart            False  RuleDataNode (ConditionalExpression)
UnaryOperatorExpression
    PrefixUnaryOperator  True   EnumDataNode (Operators: Increment | Decrement | Addition | Subtraction)
    Operand              False  RuleDataNode (Operand)
                         False  RuleDataNode (Expression)
                         False  RuleDataNode (SingleUnitExpression)
                         False  RuleDataNode (CreateNewObjectExpression)
    UnnamedEntity        False  EnumDataNode (Operators: LeftParenthesis | RightParenthesis)
                         False  EnumDataNode (Operators: RightParenthesis)
    TargetType           False  RuleDataNode (TypeReference)


Fairly straightforward, though a bit cryptic unless you see the language description. I used concepts similar to those in the lexical analysis: rule element merging and so on. Each individual rule will utilize the intersection of the full dataset after it's been intersected across all rules. This should give me the main rule interface/class's data set and the individual sub-sets of each permutation of the rule.

As above, the RelationalExpression splits on the 'Right' term. This is because 'is' and 'as' pair the Left term with a type reference, while the other relational operators pair it with a ShiftLevelExpression.

There are a few special notes. The 'True' or 'False' column represents the multiplicity of the elements. In cases where alternation is used, if two elements are named the same but their types differ, they'll be placed in the same group; the final data extraction phase will sort them out as needed.

Additionally, sub-groups defined in a rule will also be lifted into a special sub-set data series. I'll cover this more once I have a more practical example.

Then there's the syntactical analysis phase. This is taking a large part of my focus due to its complexity. I've figured out how to mine what I need to determine the initial token set for a given rule at its starting point, but I think the next step is to start from the 'Root' rule and branch out from there. Once I have more concrete data I'll go into more detail. It's also going to require a reanalysis of the lexical aspects to discern overlapping elements: where they overlap, it'll divulge exactly how much, so if two literal sets overlap on some elements, it need only concern itself with those; if the current rule doesn't require any of the overlapped elements, then it's a non-issue. The major issues are capture-recognizer type tokens, such as identifiers, which will effectively trump keywords, causing the look-ahead to increase, which is no big deal.

I'll probably cross-link the data analysis aspect with the syntactical aspect prior to the logical branching exploration phase; this way, when ambiguities arise, it can discern which data elements are impacted and lift those into an ambiguity type, so that they remain intact but are noted as an ambiguity.

Funny how the moment I think I'm almost done, the stack gets bigger.

Edit: Figures as I post this, I find a bug.

It appears I made a flub: UnaryOperatorExpression contains two unnamed Operators elements, but one is LeftParenthesis | RightParenthesis and the other is RightParenthesis. The issue is that both LeftParenthesis entries should have been merged into one, and both RightParenthesis entries should have been merged into one. So now I need to add code to find the best match when merging, instead of the first equally natured element it can be merged with.

Edit 2: Fixed - Though now I have another bug. The joys of programming.

Monday, March 9, 2009

Context Awareness for Simple Lexical analysis

I've been working on the project quite a bit lately; the major thing of late is context awareness for state machines. Unlike the much older hack attempt from a year-plus ago, which used a radix tree structure or something similar, this uses a proper DFA implementation.

Most of the compiler compilers I've used seem to rely on a series of state transition tables (a series of integers indicating which state to go to on the next character), something I've decided against due to their hard-to-follow nature. For the sake of being able to verify that the code is doing what it should, I'm going with a hard-coded result, which should also make certain other aspects, such as context awareness, easier to implement.
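For contrast, here's a hedged sketch of the table-driven style I decided against; the states and character classes are invented, and their opacity is exactly the problem:

// A typical table-driven core: compact, but the language it encodes is invisible.
static readonly int[,] Transitions =
{
    //             'a'  'r'  'f'  other
    /* state 0 */ {  1,   2,   3,  -1 },
    /* state 1 */ { -1,  -1,  -1,  -1 },
    /* state 2 */ { -1,  -1,  -1,  -1 },
    /* state 3 */ { -1,  -1,  -1,  -1 },
};

static int Next(int state, int characterClass)
{
    // -1 means no transition: the token ends, or the input is invalid here.
    return Transitions[state, characterClass];
}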

Below, I've included a simple state machine; its focus is on parsing the attribute targets for a set of custom attribute declarations.


public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.AssemblyTarget) != AttributeTargetCases.None))
                    {
                        // a

                        this.state = 1;
                        return true;
                    }
                    break;
                case 'r':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.ReturnTypeTarget) != AttributeTargetCases.None))
                    {
                        // r

                        this.state = 2;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.FieldTarget) != AttributeTargetCases.None))
                    {
                        // f

                        this.state = 3;
                        return true;
                    }
                    break;
                case 'm':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.MethodTarget) != AttributeTargetCases.None))
                    {
                        // m

                        this.state = 4;
                        return true;
                    }
                    break;
                case 'e':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.EventTarget) != AttributeTargetCases.None))
                    {
                        // e

                        this.state = 5;
                        return true;
                    }
                    break;
                case 'p':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.PropertyTarget) != AttributeTargetCases.None))
                    {
                        // p

                        this.state = 6;
                        return true;
                    }
                    break;
                case 't':
                    if (((this.AllowedAttributeTargets & AttributeTargetCases.TypeTarget) != AttributeTargetCases.None))
                    {
                        // t

                        this.state = 7;
                        return true;
                    }
                    break;
            }
            break;

*truncated for publication limitations*

As you can tell, it short-circuits on the first letter, and that's as far as it needs to go for this case, though it's clearly a very simple example. A better, or at least more realistic, example is shown below:

/* Excerpt from CSKeywordsTokenStateMachine */
public bool Next(char @char)
{
    switch (this.state)
    {
        case 0:
            switch (@char)
            {
                /* Truncated for publication limitations */
                case 'r':
                    if ((((this.ActiveSets & (KeywordCases.Case2 | KeywordCases.Case3)) != KeywordCases.None) && (((this.AllowedInSet2 & (KeywordsCase2.ReadOnly | KeywordsCase2.Ref)) != KeywordsCase2.None) || ((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None))))
                    {
                        // r

                        this.state = 15;
                        return true;
                    }
                    break;
            }
            break;
        /* Second state after 'r' */
        case 15:
            switch (@char)
            {
                case 'e':
                    // re

                    this.state = 231;
                    return true;
                    break;
            }
            break;
        
        /* *
         * Third character after 're', further
         * decisions made to limit path travel for
         * context awareness.
         * */

        case 231:
            switch (@char)
            {
                case 'a':
                    if (((this.AllowedInSet2 & KeywordsCase2.ReadOnly) != KeywordsCase2.None))
                    {
                        // rea

                        this.state = 232;
                        return true;
                    }
                    break;
                case 'f':
                    if (((this.AllowedInSet2 & KeywordsCase2.Ref) != KeywordsCase2.None))
                    {
                        this.exitState = ExitStates.Ref;
                        // ref

                        this.state = -1;
                    }
                    break;
                case 't':
                    if (((this.AllowedInSet3 & KeywordsCase3.Return) != KeywordsCase3.None))
                    {
                        // ret

                        this.state = 233;
                        return true;
                    }
                    break;
            }
            break;
        /* Truncated for publication limitations */
    }
}

As you can see, in cases where multiple elements overlap on the first few characters, it was necessary to include them all in the check. Further into the code, it continues to refine the conditions, reducing the time spent parsing a given token when it isn't needed in the current context.

The literals defined in the syntax of a given grammar are 'deliteralized' and turned into literal references. The first instance of a literal in the definition of a token (which is made up of nothing but literals) is used. For explicit control, the token and literal name can be specified, should one have the need.

In cases needing more flexibility, a capturing recognizer is used. I'll probably go through later and write it properly so it can handle the captures, but for now I'm just using this as a boot-strapper, so it's not a big deal.