Wednesday, June 15, 2016

Grokking the CLI - Part 2: Meta Meanderings

Background

This is Part 2 in a series on understanding the metadata within the .NET Common Language Infrastructure, as specified by ECMA-335.

Part 1 can be found here.

Five Streams to Rule them All

Last time we went over how to get to the hints that point at where the data is located.  We had to understand the MS DOS Header, the PE image, and its Data Directories just to get a pointer to the real data.

Now that we have that, we can start peeling back the layers that make up the metadata.

The Relative Virtual Address we received at the end of our last foray points us to a MetadataRoot structure, which outlines the Metadata Streams.  Once you've read in the MetadataRoot, you then read in a series of StreamHeaders, which point to the individual streams (all prefixed with a pound sign #, which I'm omitting for this post).  Those streams are as follows (a minimal reading sketch follows the list):
  • Strings: The strings that make up the names of things, such as the names of methods, properties, and so on.
  • Blob: Binary blobs that make up constants, type specifications, method signatures, member reference signatures, and so on.
  • US: User Strings - when you use string constants in code (e.g. Console.WriteLine("TEST");), the strings end up here.
  • Guid: The Guids generated by the .NET compilers end up here.  You probably didn't even know they existed.  The only metadata table that refers to this stream is the Module table.
  • ~ or -: The Metadata Table Stream, perhaps the most complex and 'varied' stream of them all.  As of this writing, there are 38 tables supported in .NET Metadata.
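To make the layout concrete, here's a minimal sketch of walking the MetadataRoot and its StreamHeaders (Partition II, §24.2.1-24.2.2 of the ECMA-335 spec). The method name, the assumption that a BinaryReader is already positioned at the metadata root, and the usual System.IO / System.Text / System.Collections.Generic usings are mine, not an excerpt from my actual reader:
private static void DumpStreamHeaders(BinaryReader reader)
{
    uint signature = reader.ReadUInt32();          // 0x424A5342, i.e. "BSJB"
    if (signature != 0x424A5342)
        throw new BadImageFormatException("Invalid metadata root signature.");
    ushort majorVersion = reader.ReadUInt16();
    ushort minorVersion = reader.ReadUInt16();
    reader.ReadUInt32();                           // Reserved, always zero
    int versionLength = reader.ReadInt32();        // padded out to a multiple of four
    string version = Encoding.UTF8.GetString(reader.ReadBytes(versionLength)).TrimEnd('\0');
    reader.ReadUInt16();                           // Flags, reserved
    ushort streamCount = reader.ReadUInt16();
    Console.WriteLine("Metadata version {0} ({1}.{2}), {3} streams", version, majorVersion, minorVersion, streamCount);
    for (int i = 0; i < streamCount; i++)
    {
        uint offset = reader.ReadUInt32();         // relative to the metadata root
        uint size = reader.ReadUInt32();
        // The stream name is null-terminated ASCII, padded to a four-byte boundary.
        var nameBytes = new List<byte>();
        byte current;
        while ((current = reader.ReadByte()) != 0)
            nameBytes.Add(current);
        reader.ReadBytes((4 - (nameBytes.Count + 1) % 4) % 4);
        Console.WriteLine("{0}: offset=0x{1:X}, size=0x{2:X}",
            Encoding.ASCII.GetString(nameBytes.ToArray()), offset, size);
    }
}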

Strings

In .NET, all streams are 'compressed' in some way, and the Strings stream is no exception.  One of the simplest ways this stream achieves 'compression' is through right-side similarity: if your assembly contains two strings, MyItem and Item, they're saved as the single string MyItem, suffixed with a null character (0x00).

The metadata tables that refer to this string heap store indexes into its data.  An index points to the place where a string starts, and the null character says where it finishes.

So if the string MyItem was at a relative location of 0, Item would be at a relative location of 2.
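To make that concrete, here's a hypothetical view of the bytes backing that shared entry (the heap contents are invented for illustration):
// Hypothetical Strings heap fragment for an assembly containing MyItem and Item;
// both names are served by the single null-terminated entry below.
byte[] stringsHeap = { (byte)'M', (byte)'y', (byte)'I', (byte)'t', (byte)'e', (byte)'m', 0x00 };
// A table row holding index 0 resolves to "MyItem"; one holding index 2 resolves to "Item".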
Here's the basic approach I used to read the stream. To make things simpler, I wrote a 'Substream' that makes relative indexes easier to calculate, so I don't need to worry about adding on the section's offset each time I read something.
private unsafe bool ReadSubstring(uint index)
{
    lock (this.syncObject)
    {
        reader.BaseStream.Position = index;
        /* *
         * To save space, smaller strings that exist as the 
         * tail end of another string, are condensed accordingly.
         * *
         * It's quicker to construct the strings from the 
         * original source than it is to iterate through the
         * location table used to quickly look items up.
         * */
        uint loc = index;
        while (loc < base.Size)
        {
            byte current = reader.ReadByte();
            if (current == 0)
                break;
            loc++;
        }
        uint size = loc - index;
        reader.BaseStream.Position = index;
        byte[] result = new byte[size];

        for (int i = 0; i < size; i++)
            result[i] = reader.ReadByte();
        this.AddSubstring(ConvertUTF8ByteArray(result), index);
        return loc < base.Size;
    }
}

Blob

I would say the Blob stream increases the difficulty a little bit. Blobs start with their lengths as a compressed int (details in User Strings below). From these blobs we get a multitude of signatures; I would go over all of them here, but a great article already exists for that. I've provided an excerpt from the blob SignatureParser below to show that it's similar to normal parsing, except that it parses binary instead of text:
internal static ICliMetadataMethodSignature ParseMethodSignature(EndianAwareBinaryReader reader, CliMetadataFixedRoot metadataRoot, bool canHaveRefContext = true)
{

    const CliMetadataMethodSigFlags legalFlags                =
            CliMetadataMethodSigFlags.HasThis                 |
            CliMetadataMethodSigFlags.ExplicitThis            ;

    const CliMetadataMethodSigConventions legalConventions    =
            CliMetadataMethodSigConventions.Default           |
            CliMetadataMethodSigConventions.VariableArguments |
            CliMetadataMethodSigConventions.Generic           |
            CliMetadataMethodSigConventions.StdCall           |
            CliMetadataMethodSigConventions.Cdecl             ;

    const int legalFirst  = (int)legalFlags                   |
                            (int)legalConventions             ;
    byte firstByte        = reader.ReadByte();
    if ((firstByte & legalFirst) == 0 && firstByte != 0)
        throw new BadImageFormatException("Unknown calling convention encountered.");
    var callingConvention = ((CliMetadataMethodSigConventions)firstByte) & legalConventions;
    var flags = ((CliMetadataMethodSigFlags)firstByte) & legalFlags;


    int paramCount;
    int genericParamCount = 0;
    if ((callingConvention & CliMetadataMethodSigConventions.Generic) == CliMetadataMethodSigConventions.Generic)
        genericParamCount = CliMetadataFixedRoot.ReadCompressedUnsignedInt(reader);
    paramCount = CliMetadataFixedRoot.ReadCompressedUnsignedInt(reader);
    ICliMetadataReturnTypeSignature returnType = ParseReturnTypeSignature(reader, metadataRoot);
    bool sentinelEncountered = false;
    if (canHaveRefContext)
    {
        ICliMetadataVarArgParamSignature[] parameters = new ICliMetadataVarArgParamSignature[paramCount];
        for (int i = 0; i < parameters.Length; i++)
        {
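            // The sentinel (0x41) marks where the fixed parameters end and the
            // variable arguments begin in a vararg call-site signature.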
            byte nextByte = (byte)(reader.PeekByte() & 0xFF);
            if (nextByte == (byte)CliMetadataMethodSigFlags.Sentinel)
                if (!sentinelEncountered)
                {
                    flags |= CliMetadataMethodSigFlags.Sentinel;
                    sentinelEncountered = true;
                    reader.ReadByte();
                }
            parameters[i] = (ICliMetadataVarArgParamSignature)ParseParam(reader, metadataRoot, true, sentinelEncountered);
        }
        return new CliMetadataMethodRefSignature(callingConvention, flags, returnType, parameters);
    }
    else
    {
        ICliMetadataParamSignature[] parameters = new ICliMetadataParamSignature[paramCount];
        for (int i = 0; i < parameters.Length; i++)
            parameters[i] = ParseParam(reader, metadataRoot);
        return new CliMetadataMethodDefSignature(callingConvention, flags, returnType, parameters);
    }
}
There are a few oddities in the blob signatures. The StandAloneSig table, for instance, usually only defines local variable types, that is, the description of which types were used for the local variables in methods. I found an exception in 2012 when I started writing the CLI metadata parser: field signatures. Had I been aware enough at the time, I would've found that the two people mentioned in the first linked answer had stumbled across this a few years prior. Please note that the answer in the first linked MSDN post points to a blog that no longer exists, and the second MSDN link points to their discovery, but with less detail. It turns out field signatures in the StandAloneSig table are there to support debugging the constants you specify in your code, but only when compiling in Debug mode.
The irony here is that the constants in the excerpt above were what clued me in to the out-of-place field signatures.
Blobs also contain details about the constants used, as well as custom attributes. I haven't personally gotten to the constants yet because I haven't needed to.
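Fetching a raw blob, before any signature parsing happens, is just a matter of reading the compressed length at the index a table row gives you and then taking that many bytes. Here's a minimal sketch, assuming the #Blob heap bytes are already in memory (the method and parameter names are mine); the length prefix uses the same scheme ReadCompressedUnsignedInt decodes below:
private static byte[] ReadBlob(byte[] blobHeap, int index)
{
    int position = index;
    byte first = blobHeap[position++];
    int length;
    if ((first & 0x80) == 0)                 // 1-byte length: 0xxxxxxx
        length = first;
    else if ((first & 0xC0) == 0x80)         // 2-byte length: 10xxxxxx xxxxxxxx
        length = ((first & 0x3F) << 8) | blobHeap[position++];
    else                                     // 4-byte length: 110xxxxx plus three more bytes
        length = ((first & 0x1F) << 24) | (blobHeap[position++] << 16)
               | (blobHeap[position++] << 8) | blobHeap[position++];
    byte[] result = new byte[length];
    Array.Copy(blobHeap, position, result, 0, length);
    return result;
}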

US - User Strings

User Strings are pretty self-explanatory: they're the stream that method bodies refer to when loading string constants to work with. This stream behaves much more like the Blob stream than the normal Strings stream. Every user string starts with a compressed int, which is read like so:
public static int ReadCompressedUnsignedInt(EndianAwareBinaryReader reader, out byte bytesUsed)
{
    byte compressedFirstByte           = reader.ReadByte();
    const int sevenBitMask             = 0x7F;
    const int fourteenBitmask          = 0xBF;
    const int twentyNineBitMask        = 0xDF;
    bytesUsed                          = 1;
    int decompressedResult             = 0;

    // 1-byte form: 0xxxxxxx encodes values 0x00-0x7F directly.
    if ((compressedFirstByte & sevenBitMask) == compressedFirstByte)
        decompressedResult             = compressedFirstByte;
    else if ((compressedFirstByte & fourteenBitmask) == compressedFirstByte)
    {
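        // 2-byte form: 10xxxxxx xxxxxxxx holds a 14-bit value, high bits first.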
        byte hiByte                    = (byte)(compressedFirstByte & 0x3F);
        byte loByte                    = reader.ReadByte();
        decompressedResult             = loByte | hiByte << 8;
        bytesUsed                      = 2;
    }
    else if ((compressedFirstByte & twentyNineBitMask) == compressedFirstByte)
    {
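        // 4-byte form: 110xxxxx plus three more bytes holds a 29-bit value, high bits first.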
        byte hiWordHiByte              = (byte)(compressedFirstByte & 0x1F);
        byte hiWordLoByte              = reader.ReadByte();
        byte loWordHiByte              = reader.ReadByte();
        byte loWordLoByte              = reader.ReadByte();
        decompressedResult             = loWordLoByte | loWordHiByte << 8 | hiWordLoByte << 16 | hiWordHiByte << 24;
        bytesUsed                      = 4;
    }
    return decompressedResult;
}
In the case of User Strings, all byte counts are odd. The last byte holds a zero (0) or a one (1), depending on whether the string contains characters that need handling beyond the basic 8-bit range (Partition II, §24.2.4 of the ECMA-335 spec). As I was writing this post, I realized I ignore this flag entirely! Looks like I have a ToDo entry in my future.
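For reference, here's a minimal sketch of reading a single #US entry, reusing ReadCompressedUnsignedInt above; the method name is mine, the usual System.Text using is assumed, and I'm assuming EndianAwareBinaryReader exposes ReadBytes like a plain BinaryReader:
public static string ReadUserString(EndianAwareBinaryReader reader)
{
    byte bytesUsed;
    // The compressed count includes the trailing flag byte,
    // so the UTF-16 payload is (count - 1) bytes long.
    int byteCount = ReadCompressedUnsignedInt(reader, out bytesUsed);
    if (byteCount == 0)
        return string.Empty;
    byte[] utf16Bytes = reader.ReadBytes(byteCount - 1);
    byte specialHandling = reader.ReadByte();    // the 0-or-1 byte mentioned above
    return Encoding.Unicode.GetString(utf16Bytes);
}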

Guid

The Guid stream is the simplest, but that doesn't mean it's less important. The Guids stored in this area are used to distinguish the Modules produced by one build from those of another. Interestingly, Roslyn now offers a deterministic build option, which guarantees the MVID (Module Version Identifier) will be identical from build to build.
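Reading it is about as simple as it gets: the stream is just a back-to-back sequence of 16-byte GUIDs, and the indexes stored in the metadata tables (such as the Module table's Mvid column) are 1-based. A minimal sketch, assuming the #Guid stream bytes are already in memory and with names of my own choosing:
private static Guid ReadHeapGuid(byte[] guidHeap, int oneBasedIndex)
{
    if (oneBasedIndex == 0)
        return Guid.Empty;                       // index zero means "no GUID"
    byte[] raw = new byte[16];
    Array.Copy(guidHeap, (oneBasedIndex - 1) * 16, raw, 0, 16);
    return new Guid(raw);
}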

To Be Continued...

The Table Stream (#~ or #-) will require a post all of its own. Stay tuned for the next installment.

Saturday, June 11, 2016

Grokking the CLI - Part 1: Meta Mayhem

Background

Over the past eight years I've been writing two projects: OILexer and Abstraction. Abstraction's focus was initially on outlining a Compiler framework for the .NET CLI.

As time went on, it became clear that reflection just wasn't enough for me to get the nitty-gritty details that I needed to roll my own compiler.

"But wait", you say, ".NET has Reflection, Reflection.Emit and AssemblyBuilder built in!"

Yes, it's true that .NET lets you emit assemblies as you please out of the box; however, it requires a pretty deep understanding of its Common Intermediate Language (CIL). You're also limited to building for the version of .NET you're running under.

You can adjust the resulting assembly's configuration file, or use the underlying CLR native APIs, but the former doesn't really generate a 2.0 library, and the latter is a bit more complex since at that point you're not really using AssemblyBuilder / ILGenerator any longer.

Introduction

Let's say you want to write your own compiler, and you want complete control; that's our focus today. The first thing you must understand is this: it's not going to be quick or easy.

As it stands, I've written a library that understands .NET metadata in all of its tables and can do basic ILDasm-level disassembly on method bodies.  I can translate object graphs into code, but I haven't yet gotten to the point of taking those graphs and writing metadata back out of them.  So we'll focus this series on what I have done.
There's a lot to understand.  I won't repeat the exact nature of things here, because the ECMA-335 explains it more completely than I could; however, I will give a basic overview:
  1. PE Headers - The parts that make up the portable executable, which is the executable format used by Windows operating systems.
  2. CLI Header - The extension to the PE headers that outlines where the Metadata (#3) is, where the entry point is, the version of the framework you're targeting, and so on.
  3. CLI Data Streams - The actual data that describes the constants you use, the metadata tables, the names of things within the metadata tables, and the binary blobs.

PE Headers

MS DOS Header

This section starts out with details that many may not know still exist as a part of Windows executables: the MS DOS header.  Most .NET applications use a fixed set of bytes to represent this MS-DOS based program, with the exception of a field named 'lfanew', which is an offset to the Portable Executable signature; the Windows portion of the application follows from there.
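Here's a minimal sketch of that hop, assuming 'path' points at the image on disk: lfanew lives at offset 0x3C of the MS-DOS header, and the four bytes it points to should spell out the PE signature.
using (var stream = File.OpenRead(path))
using (var reader = new BinaryReader(stream))
{
    stream.Position = 0x3C;                      // lfanew
    uint lfanew = reader.ReadUInt32();           // offset of the PE signature
    stream.Position = lfanew;
    if (reader.ReadUInt32() != 0x00004550)       // "PE\0\0", read little-endian
        throw new BadImageFormatException("Not a portable executable image.");
}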

Windows Headers

The Windows headers are a bit better defined within the ECMA-335 specification.  Since they're not the major focus here, I'll explain them simply:
  1. PEHeader - Lays out the structure of the PE image 
    1. Coff (Common Object File Format) Sections
  2. PE Optional Header - for .NET applications this is mandatory, as the Data Directories point to the details we're interested in.  If you're writing your own metadata reader, you must understand the PE Optional Header to a certain degree.  You must fully grok it if you plan on writing a compiler (or at least fake it well enough that the .NET apps you write don't fail to load).
    1. Standard Fields
    2. NT Specific Fields
    3. Data Directories - A series of address/size pairs that point to data within the PE image.
      1. We're interested in the CLI Header pointer, which gives us the location of the data for .NET applications (see the sketch after this list).
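Here's a minimal sketch of getting from the PE signature to that pointer, with offsets taken from the PE/COFF layout; 'reader' and 'lfanew' carry over from the MS-DOS sketch above, and the CLI header sits in data directory slot 14:
reader.BaseStream.Position = lfanew + 4 + 20;    // skip "PE\0\0" and the COFF file header
long optionalHeaderStart = reader.BaseStream.Position;
ushort magic = reader.ReadUInt16();              // 0x10B = PE32, 0x20B = PE32+
int directoriesOffset = magic == 0x20B ? 112 : 96;
reader.BaseStream.Position = optionalHeaderStart + directoriesOffset + 14 * 8;
uint cliHeaderRva  = reader.ReadUInt32();        // still an RVA; resolve it against the sections
uint cliHeaderSize = reader.ReadUInt32();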

CLI Header

The CLI Header contains the following fields (sketched as a struct after this list):
  1. Cb - the size of the header
  2. MajorRuntimeVersion
  3. MinorRuntimeVersion
  4. MetaData - the RVA and Size of the metadata sections
  5. Flags - CLI-relevant details about the image, such as 'Requires 32-bit process, Signed, native entry-point defined, IL-Only'
  6. EntryPointToken - Token to the 'Main' method for applications (non-libraries)
  7. Resources
  8. StrongNameSignature
  9. CodeManagerTable - Always Zero
  10. VTableFixups
  11. ExportAddressTableJumps
  12. ManagedNativeHeader
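Laid out as a struct, that's roughly the following; the field names are mine, the layout follows Partition II, §25.3.3, and each RVA/size pair occupies eight bytes:
struct CliHeader
{
    public uint Cb;                              // size of this header (72 bytes)
    public ushort MajorRuntimeVersion;
    public ushort MinorRuntimeVersion;
    public uint MetaDataRva, MetaDataSize;       // where the metadata root lives
    public uint Flags;                           // IL-only, 32-bit required, signed, ...
    public uint EntryPointToken;                 // token of 'Main' for applications
    public uint ResourcesRva, ResourcesSize;
    public uint StrongNameSignatureRva, StrongNameSignatureSize;
    public uint CodeManagerTableRva, CodeManagerTableSize;               // always zero
    public uint VTableFixupsRva, VTableFixupsSize;
    public uint ExportAddressTableJumpsRva, ExportAddressTableJumpsSize; // always zero
    public uint ManagedNativeHeaderRva, ManagedNativeHeaderSize;         // always zero
}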

RVA

Relative Virtual Address: A relative virtual address is a 'pointer' to where the data should be in a virtual sense.  Once a library or application is loaded, the locations of things may change based on how the operating system wants to arrange the data.  These virtual addresses give you a sense of where the data is, and usually require you to 'resolve' them.  Since we're not actually loading the library to execute it, we can simplify our resolution rules.

You take the COFF sections of the PE image and scan through them until the RVA falls within the range defined by a given section's Virtual Address and the size of its raw data; the file offset is then the RVA minus that section's Virtual Address, plus the section's pointer to its raw data.
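In code, that resolution is a short scan; the Section type here, with VirtualAddress, SizeOfRawData, and PointerToRawData fields, is a stand-in for however you model the COFF sections:
private static uint ResolveRva(Section[] sections, uint rva)
{
    foreach (var section in sections)
    {
        if (rva >= section.VirtualAddress &&
            rva < section.VirtualAddress + section.SizeOfRawData)
            return rva - section.VirtualAddress + section.PointerToRawData;
    }
    throw new BadImageFormatException("RVA does not fall within any section.");
}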

Getting to the Point of it All

We resolve the RVA of the CLI Header, read it in, then resolve the RVA of its MetaData field (#4 above) to get the location of the data streams within the CLI application.

It took knowing about all of the above just to find where the data is!  We haven't even begun grokking it yet.

To Be Continued...

We'll start breaking into the metadata streams and their meanings in the next installment.

Nearly at Dogfooding state

This post was accidentally promoted to today's date: I edited it through Blogger's dashboard and, for some reason, it got bumped to the current day.

Well when I started this journey, I didn't have the slightest clue about what I was doing. I still don't know what I'm doing but I at least have an idea of what I want to be doing.

Over the past eight years I've been developing the parser for OILexer to parse grammar files, and it's got its fair share of oddities lying around: production rules and tokens refer to 'flag' items in different ways; tokens mark a flag via 'IsFlag=true;', while production rules just use a bang character (!).

The risk of developing show stopping infinite loops is a real concern when you're developing an interactive parser that's more often in a bad parse state than it is in a proper parse state.

Now OILexer is capable of reproducing this parser (in a much more regular and consistent way), and the generated version appears to run 60-80 times faster.

Granted this is preliminary and I'm still implementing the error recovery, but things look quite promising.