Wednesday, June 15, 2016

Grokking the CLI - Part 2: Meta Meanderings

Background

This is Part 2 in a series on understanding the metadata within the .NET Common Language Infrastructure, as defined by ECMA-335.

Part 1 can be found here.

Five Streams to Rule them All

Last time we went over how to get to the hints that point at where the data is located. We had to understand the MS-DOS header, the PE image, and its Data Dictionaries just to get a pointer to the real data.

Now that we have that, we can start peeling back the layers that make up the metadata.

The Relative Virtual Address we received at the end of our last foray points us to a MetadataRoot structure, which outlines the Metadata Streams. Once you've read in the MetadataRoot, you'll read in a series of StreamHeaders, which point to the individual streams (all prefixed with a pound sign #, which I'm omitting for this post). Those streams are as follows, with a sketch of reading the headers after the list:
  • Strings: The strings which make up the names of things, such as the names of methods, properties, and so on.
  • Blob: Binary blobs which make up constants, type specifications, method signatures, member reference signatures, and so on.
  • US: User Strings - when you use string constants in code (e.g. Console.WriteLine("TEST");), the strings end up here.
  • Guid: The Guids that get generated by the .NET compilers end up here. You probably didn't know they existed. The only metadata table that refers to this stream is the Module table.
  • ~ or -: The Metadata Table Stream, perhaps the most complex and 'varied' stream of them all. As of this writing, there are 38 tables supported in .NET metadata.
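To make the shape of this concrete, here's a minimal sketch of reading the MetadataRoot and the StreamHeaders that follow it (ECMA-335 Partition II, §24.2.1-24.2.2). This isn't my library's code; the names are made up, and it assumes a plain System.IO.BinaryReader positioned at the start of the metadata root:
struct StreamHeader
{
    public uint Offset;   // relative to the start of the metadata root
    public uint Size;
    public string Name;   // "#Strings", "#Blob", "#US", "#GUID", "#~" (or "#-")
}

static StreamHeader[] ReadStreamHeaders(System.IO.BinaryReader reader)
{
    if (reader.ReadUInt32() != 0x424A5342)               // 'BSJB' signature
        throw new System.BadImageFormatException("Invalid metadata signature.");
    reader.ReadUInt16();                                  // MajorVersion
    reader.ReadUInt16();                                  // MinorVersion
    reader.ReadUInt32();                                  // Reserved
    uint versionLength = reader.ReadUInt32();             // padded length of the version string
    reader.ReadBytes((int)versionLength);                 // version string, e.g. "v4.0.30319"
    reader.ReadUInt16();                                  // Flags
    ushort streamCount = reader.ReadUInt16();             // how many stream headers follow

    var headers = new StreamHeader[streamCount];
    for (int i = 0; i < streamCount; i++)
    {
        headers[i].Offset = reader.ReadUInt32();
        headers[i].Size = reader.ReadUInt32();
        var name = new System.Text.StringBuilder();
        int nameBytes = 0;
        byte b;
        while ((b = reader.ReadByte()) != 0) { name.Append((char)b); nameBytes++; }
        nameBytes++;                                       // count the terminator
        while (nameBytes % 4 != 0) { reader.ReadByte(); nameBytes++; }   // names are padded to 4-byte boundaries
        headers[i].Name = name.ToString();
    }
    return headers;
}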

Strings

In .NET all metadata streams are 'compressed' in some way, and the Strings stream is no exception.  One of the simplest ways this stream achieves 'compression' is right-side similarity: if you have two strings in your assembly, MyItem and Item, they are saved as the single string MyItem, suffixed with a null character (0x00).

The metadata tables that refer to this string heap give indexes into its data. An index points to where a string starts, and the null character marks where it ends.

So if the string MyItem is at relative location 0, Item is at relative location 2.
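Laid out as bytes, that example looks like this:
// The heap bytes for the example above: 'MyItem' followed by its terminator.
//
//   offset:  0    1    2    3    4    5    6
//   byte:   0x4D 0x79 0x49 0x74 0x65 0x6D 0x00
//            M    y    I    t    e    m   \0
//
// A table cell holding string index 0 yields "MyItem"; one holding index 2
// starts two bytes in and yields "Item" from the very same bytes.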
Here's the basic approach I used to read the stream. To make things simpler, I wrote a 'Substream' that makes relative indexes easier to calculate, so I don't need to add the section's offset each time I read something.
private unsafe bool ReadSubstring(uint index)
{
    lock (this.syncObject)
    {
        reader.BaseStream.Position = index;
        /* *
         * To save space, smaller strings that exist as the 
         * tail end of another string, are condensed accordingly.
         * *
         * It's quicker to construct the strings from the 
         * original source than it is to iterate through the
         * location table used to quickly look items up.
         * */
        uint loc = index;
        while (loc < base.Size)
        {
            byte current = reader.ReadByte();
            if (current == 0)
                break;
            loc++;
        }
        uint size = loc - index;
        reader.BaseStream.Position = index;
        byte[] result = new byte[size];

        for (int i = 0; i < size; i++)
            result[i] = reader.ReadByte();
        this.AddSubstring(ConvertUTF8ByteArray(result), index);
        return loc < base.Size;
    }
}

Blob

I would say that the Blob stream increases the difficulty a little. Blobs start with their length as a compressed int (details under User Strings below). From there we get a multitude of signature kinds; I would go over all of them here, but a great article already exists for that. I've included an excerpt from the blob SignatureParser below to show that it's similar to normal parsing, except that it parses binary instead of text:
internal static ICliMetadataMethodSignature ParseMethodSignature(EndianAwareBinaryReader reader, CliMetadataFixedRoot metadataRoot, bool canHaveRefContext = true)
{

    const CliMetadataMethodSigFlags legalFlags                =
            CliMetadataMethodSigFlags.HasThis                 |
            CliMetadataMethodSigFlags.ExplicitThis            ;

    const CliMetadataMethodSigConventions legalConventions    =
            CliMetadataMethodSigConventions.Default           |
            CliMetadataMethodSigConventions.VariableArguments |
            CliMetadataMethodSigConventions.Generic           |
            CliMetadataMethodSigConventions.StdCall           |
            CliMetadataMethodSigConventions.Cdecl             ;

    const int legalFirst  = (int)legalFlags                   |
                            (int)legalConventions             ;
    byte firstByte        = reader.ReadByte();
    if ((firstByte & legalFirst) == 0 && firstByte != 0)
        throw new BadImageFormatException("Unknown calling convention encountered.");
    var callingConvention = ((CliMetadataMethodSigConventions)firstByte) & legalConventions;
    var flags = ((CliMetadataMethodSigFlags)firstByte) & legalFlags;


    int paramCount;
    int genericParamCount = 0;
    if ((callingConvention & CliMetadataMethodSigConventions.Generic) == CliMetadataMethodSigConventions.Generic)
        genericParamCount = CliMetadataFixedRoot.ReadCompressedUnsignedInt(reader);
    paramCount = CliMetadataFixedRoot.ReadCompressedUnsignedInt(reader);
    ICliMetadataReturnTypeSignature returnType = ParseReturnTypeSignature(reader, metadataRoot);
    bool sentinelEncountered = false;
    if (canHaveRefContext)
    {
        ICliMetadataVarArgParamSignature[] parameters = new ICliMetadataVarArgParamSignature[paramCount];
        for (int i = 0; i < parameters.Length; i++)
        {
            byte nextByte = (byte)(reader.PeekByte() & 0xFF);
            if (nextByte == (byte)CliMetadataMethodSigFlags.Sentinel)
                if (!sentinelEncountered)
                {
                    flags |= CliMetadataMethodSigFlags.Sentinel;
                    sentinelEncountered = true;
                    reader.ReadByte();
                }
            parameters[i] = (ICliMetadataVarArgParamSignature)ParseParam(reader, metadataRoot, true, sentinelEncountered);
        }
        return new CliMetadataMethodRefSignature(callingConvention, flags, returnType, parameters);
    }
    else
    {
        ICliMetadataParamSignature[] parameters = new ICliMetadataParamSignature[paramCount];
        for (int i = 0; i < parameters.Length; i++)
            parameters[i] = ParseParam(reader, metadataRoot);
        return new CliMetadataMethodDefSignature(callingConvention, flags, returnType, parameters);
    }
}
There are a few oddities in the blob signatures. The StandAloneSig table, for instance, usually only defines local variable types, that is, descriptions of the types used for the local variables in methods. I found an exception in 2012 when I started writing the CLI metadata parser: field signatures. Had I known where to look, I would've found that the two people mentioned in the first answer had stumbled across this a few years prior. Please note, the answer in the first linked MSDN post links to a blog that no longer exists; the second MSDN link points to their discovery, but with less detail. It turns out field signatures in the StandAloneSig table are there to support debugging the constants you specify in your code, but only when compiling in Debug mode.
The irony here is that the constants in the excerpt above were what clued me in to the out-of-place field signatures.
Blobs also contain details about the constants used and about custom attributes. I haven't personally gotten to the constants because I haven't needed them yet.

US - User Strings

User Strings are pretty self-explanatory: they're the stream that method bodies refer to when loading string constants to work with them. This stream behaves much more like the Blob stream than the normal Strings stream. All user strings start with a compressed int, which is read like so:
public static int ReadCompressedUnsignedInt(EndianAwareBinaryReader reader, out byte bytesUsed)
{
    byte compressedFirstByte           = reader.ReadByte();
    const int sevenBitMask             = 0x7F;
    const int fourteenBitmask          = 0xBF;
    const int twentyNineBitMask        = 0xDF;
    bytesUsed                          = 1;
    int decompressedResult             = 0;

    if ((compressedFirstByte & sevenBitMask) == compressedFirstByte)
        decompressedResult             = compressedFirstByte;
    else if ((compressedFirstByte & fourteenBitmask) == compressedFirstByte)
    {
        byte hiByte                    = (byte)(compressedFirstByte & 0x3F);
        byte loByte                    = reader.ReadByte();
        decompressedResult             = loByte | hiByte << 8;
        bytesUsed                      = 2;
    }
    else if ((compressedFirstByte & twentyNineBitMask) == compressedFirstByte)
    {
        byte hiWordHiByte              = (byte)(compressedFirstByte & 0x1F);
        byte hiWordLoByte              = reader.ReadByte();
        byte loWordHiByte              = reader.ReadByte();
        byte loWordLoByte              = reader.ReadByte();
        decompressedResult             = loWordLoByte | loWordHiByte << 8 | hiWordLoByte << 16 | hiWordHiByte << 24;
        bytesUsed                      = 4;
    }
    return decompressedResult;
}
In the case of User Strings, all byte counts are odd. The last byte contains a zero (0) or a one (1) depending on whether or not the string needs special processing beyond UTF-8 handling (Partition II, §24.2.4 of the ECMA-335 spec). As I was writing this post, I realized I ignore this byte entirely! Looks like I have a ToDo entry in my future.
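Putting the pieces together, here's a minimal sketch of reading a single user string, assuming a plain BinaryReader and a compressed-length reader shaped like the one above (the method name and parameters here are mine):
static string ReadUserString(System.IO.BinaryReader reader, long usHeapStart, uint offset)
{
    reader.BaseStream.Position = usHeapStart + offset;
    byte bytesUsed;
    // Assumes a ReadCompressedUnsignedInt overload for this reader type, as shown above.
    int byteCount = ReadCompressedUnsignedInt(reader, out bytesUsed);  // odd: includes the trailing flag byte
    if (byteCount == 0)
        return string.Empty;                                           // offset 0 is the empty entry
    byte[] utf16Bytes = reader.ReadBytes(byteCount - 1);               // user strings are stored as UTF-16
    byte specialHandlingFlag = reader.ReadByte();                      // the 0-or-1 byte this post admits to ignoring
    return System.Text.Encoding.Unicode.GetString(utf16Bytes);
}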

Guid

The Guid stream is the simplest, but that doesn't mean it's less important. The Guids stored in this area are used to distinguish the Modules produced by one build from those of another. Interestingly, Roslyn now offers deterministic builds, which guarantee the MVID (Module Version Identifier) will be identical from build to build.
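Reading it boils down to arithmetic: GUID heap indexes, such as the Module table's Mvid column, are 1-based, and each GUID occupies 16 bytes. A minimal sketch (names are mine):
static System.Guid ReadGuid(System.IO.BinaryReader reader, long guidHeapStart, uint index)
{
    if (index == 0)
        return System.Guid.Empty;                                    // an index of zero means 'no GUID'
    reader.BaseStream.Position = guidHeapStart + (index - 1) * 16L;  // 1-based, 16 bytes apiece
    return new System.Guid(reader.ReadBytes(16));
}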

To Be Continued...

The Table Stream (#~ or #-) will require a post all of its own. Stay tuned for the next installment.

Saturday, June 11, 2016

Grokking the CLI - Part 1: Meta Mayhem

Background

Over the past eight years I've been writing two projects: OILexer and Abstraction. Abstraction's focus was initially on outlining a Compiler framework for the .NET CLI.

As time went on, it became clear that reflection just wasn't enough for me to get the nitty-gritty details that I needed to roll my own compiler.

"But wait", you say, ".NET has Reflection, Reflection.Emit and AssemblyBuilder built in!"

Yes, it is true: .NET lets you emit assemblies as you please out of the box; however, it requires a pretty deep understanding of its Common Intermediate Language (CIL). You're also limited to building for the version of .NET you're running under.

You can adjust the resulting assembly's configuration file, or use the underlying CLR native APIs, but the former doesn't really generate a 2.0 library, and the latter is a bit more complex because at that point you're not really using AssemblyBuilder / ILGenerator any longer.

Introduction

Let's say you want to write your own compiler and you want complete control; that's our focus today. The first thing you must understand is this: it's not going to be quick or easy.

As it stands, I've written a library that understands .NET metadata in all its tables and can do basic ILDasm-level disassembly on method bodies.  I can translate object graphs into code, but I haven't yet gotten to the point of taking those graphs and writing metadata out of them.  So we'll focus this series on what I have done.
There's a lot to understand.  I won't repeat the exact nature of things here, because ECMA-335 explains it more completely than I could; however, I will give a basic overview:
  1. PE Headers - The parts that make up the portable executable, the executable format used by Windows operating systems.
  2. CLI Header - The extension to the PE headers that outlines where the Metadata (#3) is, where the entry point is, the version of the framework you're targeting, and so on.
  3. CLI Data Streams - The actual data: the constants you use, the metadata tables, the names of things within those tables, and the binary blobs.

PE Headers

MS DOS Header

This section starts out with details that many may not realize still exist as part of Windows executables: the MS-DOS header.  Most .NET applications use a fixed set of bytes to represent this MS-DOS based program, with the exception of a field named 'lfanew', which holds an offset to the Portable Executable signature; the Windows version of the application follows that signature.
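In practice, the only field you need from the DOS header is 'lfanew'. Here's a minimal sketch of skipping past the stub (not my library's code; names are mine):
static long FindPESignature(System.IO.BinaryReader reader)
{
    reader.BaseStream.Position = 0x3C;               // 'lfanew' lives at a fixed offset in the DOS header
    uint lfanew = reader.ReadUInt32();               // file offset of the PE signature
    reader.BaseStream.Position = lfanew;
    if (reader.ReadUInt32() != 0x00004550)           // "PE\0\0", read little-endian
        throw new System.BadImageFormatException("PE signature not found.");
    return reader.BaseStream.Position;               // the COFF file header begins here
}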

Windows Headers

The Windows headers tend to be a bit better defined within the ECMA-335 specification.  Since they're not the major focus here, we'll describe them briefly:
  1. PEHeader - Lays out the structure of the PE image 
    1. Coff (Common Object File Format) Sections
  2. PE Optional Header - For .NET applications this is mandatory, as its Data Dictionaries point to the details we're interested in. If you're writing your own metadata reader, you must understand the PE Optional Header to a certain degree. You must fully grok it if you plan on writing a compiler (or at least fake it well enough that the .NET apps you write don't fail to load.)
    1. Standard Fields
    2. NT Specific Fields
    3. Data Dictionaries - A series of address/size pairs which point to data within the PE image.
      1. We're interested in the CLI Header pointer, which gives us the location of the data for .NET applications (a sketch of reading this entry follows the list).
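As a rough sketch (again, made-up names, not my library's code): the directory table begins at offset 96 of the optional header for PE32 images, or 112 for PE32+, each entry is an RVA/size pair, and the CLI header is entry 14.
static void ReadCliHeaderDirectory(System.IO.BinaryReader reader, long optionalHeaderStart,
                                   out uint cliHeaderRva, out uint cliHeaderSize)
{
    reader.BaseStream.Position = optionalHeaderStart;
    ushort magic = reader.ReadUInt16();                          // 0x10B = PE32, 0x20B = PE32+
    long directoryTable = optionalHeaderStart + (magic == 0x20B ? 112 : 96);
    reader.BaseStream.Position = directoryTable + 14 * 8;        // entry 14, 8 bytes per entry
    cliHeaderRva = reader.ReadUInt32();
    cliHeaderSize = reader.ReadUInt32();
}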

CLI Header

The CLI Header contains:
  1. Cb - the size of the header
  2. MajorRuntimeVersion
  3. MinorRuntimeVersion
  4. MetaData - the RVA and Size of the metadata sections
  5. Flags - CLI-relevant details about the image, such as 'Requires 32-bit process', 'Signed', 'Native entry point defined', 'IL only'
  6. EntryPointToken - Token to the 'Main' method for applications (non-libraries)
  7. Resources
  8. StrongNameSignature
  9. CodeManagerTable - Always Zero
  10. VTableFixups
  11. ExportAddressTableJumps
  12. ManagedNativeHeader
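On disk the header is a fixed 72-byte structure. Here's a rough sketch of its layout (the struct and field names are mine, mirroring the list above; each of the later entries is an RVA/size pair):
struct CliHeader
{
    public uint   Cb;                                    // size of this header: 72
    public ushort MajorRuntimeVersion;
    public ushort MinorRuntimeVersion;
    public uint   MetaDataRva, MetaDataSize;             // where the metadata root lives
    public uint   Flags;
    public uint   EntryPointToken;
    public uint   ResourcesRva, ResourcesSize;
    public uint   StrongNameSignatureRva, StrongNameSignatureSize;
    public uint   CodeManagerTableRva, CodeManagerTableSize;       // always zero
    public uint   VTableFixupsRva, VTableFixupsSize;
    public uint   ExportAddressTableJumpsRva, ExportAddressTableJumpsSize;
    public uint   ManagedNativeHeaderRva, ManagedNativeHeaderSize;
}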

RVA

Relative Virtual Address: A relative virtual address is a 'pointer' to where the data should be in a virtual sense.  Once a library or application is loaded, the locations of things may change based on how the operating system wants to arrange the data.  These virtual addresses give you a sense of where the data is, and usually require you to 'resolve' them.  Since we're not actually loading the library to execute it, we can simplify our resolution rules.

You take the Coff sections of the PE image and scan through them until the RVA falls within the range defined by a given Coff section's Virtual Address and the size of its raw data.
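A minimal sketch of that resolution, with a stand-in Section type of my own (a real reader would use whatever section model it already has):
struct Section
{
    public uint VirtualAddress;    // where the section lands once loaded
    public uint SizeOfRawData;     // how many bytes of it exist in the file
    public uint PointerToRawData;  // where those bytes start in the file
}

static long ResolveRva(Section[] sections, uint rva)
{
    foreach (var section in sections)
    {
        if (rva >= section.VirtualAddress &&
            rva < section.VirtualAddress + section.SizeOfRawData)
            return rva - section.VirtualAddress + section.PointerToRawData;  // the file offset
    }
    throw new System.BadImageFormatException("RVA does not fall within any section.");
}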

Getting to the Point of it All

We resolve the RVA of the CLI Header, read it in, then resolve the RVA of its MetaData entry (#4 above) to get the location of the data streams within the CLI application.

It took knowing about all of the above just to find where the data is!  We haven't even begun grokking it yet.

To Be Continued...

We'll start breaking into the metadata streams and their meanings in the next installment.

Nearly at Dogfooding state

This post was accidentally promoted to today's date: I edited it through Blogger's dashboard and, for some reason, it got bumped to the current day.

Well when I started this journey, I didn't have the slightest clue about what I was doing. I still don't know what I'm doing but I at least have an idea of what I want to be doing.

Over the past eight years I've been developing the parser OILexer uses to read grammar files, and it has its fair share of oddities lying around: production rules and tokens refer to 'flag' items in different ways, for instance; tokens say so via 'IsFlag=true;', while rules just use a bang character (!).

The risk of developing show stopping infinite loops is a real concern when you're developing an interactive parser that's more often in a bad parse state than it is in a proper parse state.

Now OILexer is capable of reproducing this parser (in a much more regular and consistent way), and the result appears to run 60-80 times faster.

Granted this is preliminary and I'm still implementing the error recovery, but things look quite promising.

Monday, May 30, 2016

Visitor Models and Maintaining Large Systems

After maintaining a hundred-plus types by hand, in one version that returns a value and takes a provided context parameter and another that is just a void-based visitor, I decided it would behoove me to automate the process of generating my visitor models.
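Boiled down, the two flavors look roughly like this (the interface and node names here are illustrative, not the actual ones):
public class BinaryExpression { }      // placeholder node types
public class ConstantExpression { }

public interface IExpressionVisitor
{
    void Visit(BinaryExpression expression);
    void Visit(ConstantExpression expression);
    // ...one overload per node type; the real model has a hundred-plus
}

public interface IExpressionVisitor<TResult, TContext>
{
    TResult Visit(BinaryExpression expression, TContext context);
    TResult Visit(ConstantExpression expression, TContext context);
    // ...the same overloads again, which is why generating both variants pays off
}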

I have a system that handles reflection, code translation, and so on, and the next step for this project is transformation, or taking code and manipulating it based on what the code is.

I plan on having two steps to the transformation:

  1. 'Refactor Impact' analysis
  2. Transforming
Refactor impact analysis is basically just asking 'tell me what you are planning to do to the code model'.  I figure this way I can group items that have the same level of impact; ones that aren't planning on majorly overhauling the object model, such as those modifying an expression or a statement, or localizing their changes to a given method rather than its siblings, could be parallelized.

The next step would be the actual manipulation of the model, taking what we know from step 1 and applying the transformations.

Granted, this will be much easier in theory than in practice; however, I plan on using the system I used to automate the visitor models to automate the analysis/transformation as well.  Mostly because it's just too much work for a single developer to do in their spare time (that, and I'm lazy.)

Sunday, April 10, 2016

OILexer - The start of syntax error notices. (Continued)

On the syntax verification aspect, so far, so good.

Using the previous example as a jumping-off point, I've implemented basic syntax awareness within the generated parsers.  Left-recursive rules seem to be a bit 'iffy', but I'll be rewriting that area of focus soon anyway (I think I've hit upon a pattern in the approach; it's just an arduous process.)

Below is a screenshot of a grammar file 'Errors.oilexer' which has an error at character 4, a '-'.



OILexer (bottom) seems to somewhat disagree about what's valid at this point:
I haven't investigated why OILexer seems to think that '(' is valid, but I might find it's incorrect in its understanding of the language. It also seems to think that 'End of line' is always a valid component, as it appears with most of its syntax errors.

Wednesday, April 6, 2016

OILexer - The start of syntax error notices.

Long overdue, but below is a screenshot of a simple language starting to flex its knowledge of the language.

These are the command-line arguments for OILexer, represented as a grammar in OILexer itself. It didn't take a whole lot of work to get there by hand; the next step is replicating that behavior within OILexer itself, then using the resulting information to move the project further. I had been staying my hand at this process, but I'm currently at a dead end in another area, so I figured now is as good a time as any to start.

Sunday, February 7, 2016

OILexer - Main Goals

OILexer is a project I've been working on since 2007 (or 2008? I forget), though only sporadically.

My main goal with the project is to make writing a language parser as clean, and simple, as possible.

Other systems typically offer mechanisms for custom code injection on how to handle corner cases or specially parse certain constructs.

With OILexer the aim is to mitigate much of that and have the tool figure it out (ambiguities and what to do) on its own.

There are a few reasons I took this approach:

  1. When I started, parsing was something I wasn't terribly good at, so allowing 'custom language-targeted text' would've been error prone.
  2. I think part of my concern was that I didn't want a 'single target language' grammar; I wanted to make it so that if you wanted a VB.NET parser, you'd simply switch languages.
    1. Do note: my VB output mechanism isn't written as of right now, but the objects which represent the resulting code are 'general purpose' constructs, so it would be as simple as writing a VB translator.

This has many advantages, as well as a few disadvantages:

Advantages/Goals:

  1. The grammars are much cleaner, which could lead you to focus on the structure of the language itself, versus 'what do I do here'
  2. Precedence is simple.
    1. You merely need to specify 'X' rules that outline the precedence. Where 'X' is the number of levels you intend to go.
    2. With 'Collapse' point rules, you can add intermediate steps between them to 'flatten' the resulted hierarchy.
    3. These intermediate step rules add one (1) level to each rule, but templates mean you can write a template that handles creating these for you.
  3. Left recursion should be handled for you, without left factoring.
  4. Contextual keywords should be handled for you.
    1. If 'identifier' collides with a new keyword that you only want as a keyword in certain cases, the program should be smart enough to figure this out.
    2. Early tests show this is possible, so long as the keyword doesn't cause the rule to terminate with the same tokens as the other variant does.
    3. Two tokens which overlap identically should only be considered 'ambiguous' if they show up at the same time.
      1. Example: If I have two '(' tokens that are used for different reasons, it shouldn't call them ambiguous if they aren't used in the same place.
      2. Even if they do, it should tell which is intended by what comes next, just like any other prediction.
  5. You should get both a Concrete Syntax Tree as well as an Abstract Syntax Tree.
    1. Collapse points, mentioned above, mean that 'rules you don't care about' shouldn't be present in the final mockup that you receive.
      1. Typically these rules represent what's valid at a given juncture, like a 'Class Member', or a child of a class.
      2. Usually you don't care so much about the 'Class Member' node, you care what kind of node it ended up being. How you got there isn't so important at the AST level.
      3. Another good example is a 'Type Declaration', when debugging, you don't want to have to expand a 'Type Declaration' each time you are trying to see what kind of type it is.
      4. Further, if it's also a Class Member, you'd have to expand two levels, Class Member -and- Type Declaration. Not needing to do this saves time.
      5. Beyond just 'expanding nodes' when you're using a visitor model over the AST, those 'irrelevant nodes' merely increase the stack pressure, removing them from the final AST simplifies their evaluation.
    2. If you don't call out a particular node by name, you don't care about it, so the 'AST' will ignore everything not called out by name.
      1. Most notable case is Type Declarations, such as a Class Declaration. You don't need to know that they wrote 'class' at the start. The fact that you're staring at a class declaration object says they did.
      2. This further expands to 'Single Operator' mathematical expressions.
        1. Conditional Operator which uses '?' and ':'
        2. Logical Or Expressions which use '||'
    3. Since we're human, having the 'Concrete Syntax Tree' will be useful when you want to verify it's parsing correctly.
      1. If I get it wrong (which I'm still human)
      2. If you find you got it wrong and need a way to look at the constituent symbols which make up a rule.
        1. Once I hit the 'Bootstrapping' phase, I found that my hand-written parser was not following the language description language's own rules!
        2. I had to write in alternative legal parses to account for this issue until I fixed the hand-written parser (which in retrospect, my hand-written parser kind of sucks.)

Disadvantages:

  1. Prediction code can get pretty nasty.
    1. Left-recursive predictions are horrible right now, I'm drafting up a process to fix this, but it'll take some time.
    2. Some languages enter 'Recursively Enumerable' territory, causing OILexer to never end its evaluation.
      1. Some of these situations may well be fixed by the Left recursive prediction fix I'll be working on soon.
      2. Short answer is: I botched Indirect Left Recursion for complex languages (by complex, mostly means 'real world').
  2. Collapse Points can create lots of 'placeholder' types.
    1. To solve the issue of removing collapsible rules from the AST, I used inheritance to represent situations where a rule actually means something else at the same time.
    2. If Class Declaration is also a Type Declaration, Class Declaration must have a derivation which implements the placeholder interface for Type Declaration.
    3. If Class Declaration is also a Type Declaration, which is also a Class Member, we must therefore have a derivation of Class Declaration's Type Declaration variant which also implements the Class Member placeholder interface (effectively a class within a class.)
    4. Essentially the inheritance graph of a given AST node goes as deep as the rabbit hole that leads to it, with respect to collapse points.
    5. The saving grace here is that it's done for you, and it should just work (a sketch follows this list).
  3. There's no good 'naming scheme' to represent ambiguous identities.
    1. If you have a keyword named 'Yield' which is ambiguous with 'Identifier', it might be called Ambiguity1, or Ambiguity21, depending on how many are in the language and when it was encountered.
    2. I didn't bother trying to create the name from the tokens which are ambiguous, because the ambiguities may be between more than two tokens (or fifty tokens, if you're nuts.)
    3. This makes reading code with ambiguous matches a bit arduous.
    4. I do generate documentation comments for the ambiguities, but they don't appear in Debug Mode :(
  4. Follow predictions require discriminator logic, which can get messy.
    1. If a token for a given rule is on an edge state, we must check to see if accepting that token would cause the parse to fail, because the calling rule required it to parse successfully.
      1. There are valid cases when the parent requires it and we can still accept it: if the parent parse can assuredly succeed without the token we want to consume, we can happily gobble it up in that rule. But not before we can guarantee that.
    2. Currently this is semi-broken for complex cases. If the rule being parsed is the result of a 'predictive reduction' the context of the ambiguity is not yet known within this discriminator logic (meaning: it'll likely reject certain valid parses.)
  5. Error detection is a 'placeholder' and doesn't really work.
    1. I can mark a rule as invalid, but I don't yet capture why it is invalid.
    2. To do this accurately, I need to capture the state it failed at, so I can ask the parser what was valid at that point, based on the rule stack at the time.
    3. This is deliberately left undone for now because its function may change depending on how I implement indirect left recursion code.
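To make disadvantage 2 above concrete, here's a sketch of the kind of type graph that collapse points produce (names are illustrative, not OILexer's actual output):
public interface ITypeDeclarationContext { }   // placeholder for the collapsed 'Type Declaration' rule
public interface IClassMemberContext { }       // placeholder for the collapsed 'Class Member' rule

public class ClassDeclarationContext { }       // the rule you actually named

// A Class Declaration appearing where a Type Declaration was expected:
public class ClassDeclarationAsTypeDeclarationContext :
    ClassDeclarationContext, ITypeDeclarationContext { }

// ...and where that Type Declaration was itself a Class Member:
public class ClassDeclarationAsClassMemberContext :
    ClassDeclarationAsTypeDeclarationContext, IClassMemberContext { }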
OILexer is a project of pure fun for me. I think once I fix the 'indirect left recursive failures' and the 'tail discriminator' concerns, and properly implement error detection and recovery, I'll have something truly useful. Though knowing me, once I get there, I'll use the bootstrapped parser to rewrite the parsing phase of the program. Attempting this will give me practical experience in how to use the resulting AST/CST structures, and shed light on what other creepy crawlies lie in the mix that need to be fixed.

Monday, February 1, 2016

Parse Times Looking Promising

After doing some initial testing, the bootstrap-capable parser seems able to best OILexer itself at parsing its own language. When I started this project back in 2008, the experience I had writing parsers was limited at best. The new mechanics put in place do things I could only hope to do by hand: unbounded look-ahead. Granted, the number of tokens varies, but in cases where the look-ahead requires a loop, it seems to handle it just fine. Some initial testing, comparing the two parsers on the same file sets, shows the automated parser is quite fast compared to its hand-written brethren:

Hand Built Parser:

Automated Parser

The paths for the hand-built parser are longer because the files are copied to the project's output directory. I've blurred the full path to avoid giving away too much about my personal system.

I'm rather shocked that there are cases where it is as much as a hundred and fifty times faster. These are running-in-the-debugger figures (64-bit target), so release times may vary. If anything, I think the automated parser will only get faster.

There are a few things to note: the automated parser doesn't handle errors. It can detect them, but as far as saying where they are after the fact, it's a no-go. It's also worth noting that OILexer's hand-written parser likely still has a few areas where it will hit an 'infinite loop', as I've encountered one or two in the live-edit mode within Visual Studio during its development (ok, more than one or two.) These have been patched along the way, but they're usually due to unexpected input and the fact that the hand-written parser does some wonky peeking at the lexical view to avoid needing to parse certain characters, which... sometimes don't even have valid tokens!

I suspect the automated build will slow down some once Error Handling is introduced. I'm getting ready to 'break stuff' in light of issues with Indirect Left Recursion that I think I have a logical plan of action for, but it's not a simple fix. More on that later (once I have figured it out :)

It's interesting to me to compare the automated version's 'prediction' versus the hand-written variant.

Automated Determination:
private int _Predict14OnTokenItem(int laDepth)
{
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    int exitState = -1;
    this._state = 673;
PredictMultiTargetState_673:
    switch (this._symbolStream.LookAhead(currentLADepth))
    {
        case OilexerParserSymbols.BuiltInLanguageCommands_BaseEncode:
        case OilexerParserSymbols.BuiltInLanguageCommands_Scan:
        case OilexerParserSymbols.BuiltInLanguageCommands_Subtract:
            exitState = this._state = 681;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.CharacterRange:
            exitState = this._state = 680;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.LanguageOperatorToken_BeginGroup:
            exitState = this._state = 679;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Char:
        case OilexerParserSymbols.String:
            exitState = this._state = 678;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Identifier:
            exitState = this._state = 677;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Ambiguity1:
        case OilexerParserSymbols.Ambiguity2:
        case OilexerParserSymbols.Ambiguity3:
            exitState = this._state = 674;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case OilexerParserSymbols.LanguageOperatorToken_BeginGroup:
                    exitState = this._state = 676;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_676;
                case OilexerParserSymbols.LanguageOperatorToken_IdentitySeparator:
                case OilexerParserSymbols.LanguageOperatorToken_ItemIsRecognizer:
                case OilexerParserSymbols.LanguageOperatorToken_OneOrMore:
                case OilexerParserSymbols.LanguageOperatorToken_OptionsSeparator:
                case OilexerParserSymbols.LanguageOperatorToken_RepetitionRangeStart:
                case OilexerParserSymbols.LanguageOperatorToken_ZeroOrMore:
                case OilexerParserSymbols.LanguageOperatorToken_ZeroOrOne:
                    exitState = this._state = 675;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_675;
                default:
                    goto DecisionPoint;
            }
        default:
            goto DecisionPoint;
    }
PredictMultiTargetState_676:
    goto DecisionPoint;
PredictMultiTargetState_675:
    goto DecisionPoint;
DecisionPoint:
    switch (exitState)
    {
        case 674: case 675: case 677:
            return 548 /* SoftReferenceScannableEntryItem */;
        case 676: case 681:
            return 546 /* BuiltInCommandTokenItem */;
        case 678:
            return 549 /* LiteralTokenItem */;
        case 679:
            return 547 /* GroupTokenItem */;
        case 680:
            return 550 /* CharacterRangeTokenItem */;
    }
    return 0 /* The prediction did not match a valid alt. */;
}
The hand-written variant, doing prediction within the TokenExpression parse that uses the TokenItems:
private void ParseTokenBody(List<ITokenExpression> expressions)
{
    int l = this.CurrentTokenizer.GetLineIndex(this.StreamPosition);
    int ci = this.CurrentTokenizer.GetColumnIndex(this.StreamPosition);
    long p = this.StreamPosition;
    List<ITokenItem> seriesItemMembers = new List<ITokenItem>();
    while (true)
    {
        ITokenItem item = null;
        IOilexerGrammarToken igdt = LookAhead(0);
        if (igdt == null && CurrentTokenizer.CurrentError != null)
        {
            currentTarget.SyntaxErrors.SyntaxError(CurrentTokenizer.CurrentError);
 
            break;
        }
        else
        {
            if (igdt == null)
            {
                LogError(OilexerGrammarParserErrors.Expected, "identifier, '(' or ';'");
                return;
            }
            switch (igdt.TokenType)
            {
                case OilexerGrammarTokenType.Comment:
                    var commentToken = (OilexerGrammarTokens.CommentToken)igdt;
                    if (captureRegions && commentToken.MultiLine)
                    {
                        var gdResult = (OilexerGrammarFile)currentTarget.Result;
                        gdResult.AddCommentRegion(commentToken);
                    }
                    PopAhead();
                    continue;
                case OilexerGrammarTokenType.CharacterLiteral:
                    item = ParseTokenCharLiteral();
                    break;
                case OilexerGrammarTokenType.StringLiteral:
                    item = ParseTokenStringLiteral();
                    break;
                case OilexerGrammarTokenType.Identifier:
                    item = ParseTokenReferenceItem();
                    break;
                case OilexerGrammarTokenType.CharacterRange:
                    item = ParseTokenCharacterRange();
                    break;
                case OilexerGrammarTokenType.NumberLiteral:
                    LogError(OilexerGrammarParserErrors.Unexpected, "number", igdt.Position);
                    goto yield;
                case OilexerGrammarTokenType.PreprocessorDirective:
                    LogError(OilexerGrammarParserErrors.Unexpected, "preprocessor", igdt.Position);
                    goto yield;
                case OilexerGrammarTokenType.Operator:
                    switch (((OilexerGrammarTokens.OperatorToken)(igdt)).Type)
                    {
                        case OilexerGrammarTokens.OperatorType.Pipe:
                            ClearAhead();
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.LeftParenthesis:
                            ClearAhead();
                            item = (ITokenGroupItem)ParseTokenGroup();
                            break;
                        case OilexerGrammarTokens.OperatorType.RightParenthesis:
                            if (parameterDepth <= 0)
                                LogError(OilexerGrammarParserErrors.Expected, ";", igdt.Position);
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.GreaterThan:
                        case OilexerGrammarTokens.OperatorType.Comma:
                            if (commandDepths.Contains(parameterDepth))
                            {
                                ClearAhead();
                                goto yield;
                            }
                            PopAhead();
                            LogError(OilexerGrammarParserErrors.Expected, "identifier, '(', or ';'");
                            PopAhead();
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.SemiColon:
                            ClearAhead();
                            if (templateDepth != 0 || parameterDepth != 0)
                            {
                                Expect((templateDepth == 0 ? ">" : string.Empty) + (parameterDepth > 0 ? ")" : string.Empty), this.StreamPosition);
                            }
                            goto yield;
                        default:
                            goto default_2;
                    }
                    break;
                default:
                default_2:
                    LogError(OilexerGrammarParserErrors.Expected, "identifier, '(' or ';'");
                    PopAhead();
                    break;
            }
        }
        if (item != null)
        {
            ClearAhead();
            if (TokenizerLookAhead(0) == ':')
            {
                ParseItemOptions(item);
            }
            ParseItemRepeatOptions(item);
            seriesItemMembers.Add(item);
        }
    }
yield:
    expressions.Add(new TokenExpression(seriesItemMembers, CurrentTokenizer.FileName, l, ci, p));
}

Wednesday, January 20, 2016

Debugger Type Proxies are Awesome

In OILexer, you can define a rule as an alternation of other rules, such as:

A ::=> /* The '>' represents a point of collapse. */
   B |
   C |
   D ;

I was working on OILexer's code-gen process, and I've always been a little miffed by the amount of deep diving into object graphs I've had to do for left-recursive representations of mathematical operator precedence (maths: * before +, and so on).

Sometimes, to see that someone typed 5, I would have to go n levels deep, where n is the number of precedence levels defined in the language.

For a rule like the 'A' above, instead of making a type that represents 'A' and giving it a single child of either a B, C, or D, the logical solution was to use inheritance: if a 'D' appears in place of that 'A', you get a derivative of D which is also an A.

This is all well and good, but when you're trying to debug an app, it can create some developer-time bottlenecks.

Here's an example of what debugging was like before the automated DebuggerTypeProxy instances:

After automating the debugger type proxies:

You'll quickly notice that getting to the meat of the definition is near immediate by comparison.

The example, in both screenshots, represents the opening concept, just expanded out a bit further.

This does create some nasty object hierarchies, but trading generated types for the simplicity of browsing the object hierarchy makes it worth it in the end.
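For anyone who hasn't used them, here's a small hand-written example of the kind of proxy involved (the names are illustrative; the real ones are generated). The RootHidden member is what makes the intermediate levels vanish in the debugger:
using System.Diagnostics;

public class MultiplyExpression { public override string ToString() { return "5"; } }   // stand-in precedence level

[DebuggerTypeProxy(typeof(AddExpressionDebugProxy))]
public class AddExpression
{
    public MultiplyExpression Left;   // without the proxy you expand this...
    public AddExpression Right;       // ...and this, level after level, to find the '5'
}

public sealed class AddExpressionDebugProxy
{
    private readonly AddExpression expression;
    public AddExpressionDebugProxy(AddExpression expression) { this.expression = expression; }

    // RootHidden splices these children directly into the parent's node in the
    // watch window, so the wrapper level effectively disappears from view.
    [DebuggerBrowsable(DebuggerBrowsableState.RootHidden)]
    public object[] Children
    {
        get { return new object[] { this.expression.Left, this.expression.Right }; }
    }
}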

OILexer - Time for Errors

The next step with OILexer is error handling. I have very basic error detection logic in place, but as of right now it doesn't handle errors at all; it just kind of ignores them. On left-recursive rules it unwinds back to a 'valid' point and continues on.


On non-left recursive rules it marks the rule as having an error and goes from there.


What I'll likely end up doing is keeping track of how far into the stream a particular parse gets, and enumerating the error collection if the EOF marker hasn't been hit. Then, based on how far the 'bad' rule got, I can isolate what was valid at that point and provide a logical error for the results.


I might have to do occasional polling of the error stream as the parse continues and eliminate the false positives.


Once error handling is in place, my next step is to figure out how to detect recursively infinite systems, or more accurately, how to reduce them into something that can complete. Certain normal languages, like C#, don't complete; they just spin around the same sets of rules indefinitely.


This is largely due to rules that share common token patterns which are recursively enumerable in a way where they replicate the same pattern and just get longer and longer in depth. An example is relational comparison expressions versus generic types: both use Identifier, Less Than, Identifier, ad nauseam, and this results in something that 'does not complete'. The web of rule identities and the logical paths to each is a spectacular mess; putting logic in place that isolates these cases will be a challenge.


Once these are figured out, the likely solution will be to 'parse one, then try the other', since logically one will complete successfully and the other likely won't. The fun part is that it can sometimes lead to two valid paths, and the 'longest' winner takes home the prize.


After all this is said and done, I should have enough in place to 'boot strap' the real OILexer, an amalgamation between C#-esque syntax with OILexer's syntax mixed in.


Pattern matching is in itself an integral part of a large number of domains, and simplifying the means to which it can be used is one of my goals.


To that end, I'll need to do a lot of 'preprocessing' work with OILexer's data once that phase is hit, as it's extremely slow at doing its thing. This is largely due to the complex set analysis work it does, but once I have the metadata library writer, I could essentially write the tables that represent the OILexer code and pack them into the resulting assembly if they're as new as or newer than the source files that make up that portion of the assembly. There would need to be patch-up work, but they'd effectively be like the .obj files that C++-style compilers use.