Sunday, February 7, 2016

OILexer - Main Goals

OILexer is a project I've been working on since 2007 (or 2008? I forget) until now; though, only sporadically.

My main goal with the project is to make writing a language parser as clean, and simple, as possible.

Other systems typically offer mechanisms for custom code injection on how to handle corner cases or specially parse certain constructs.

With OILexer the aim is to mitigate much of that and have the tool figure it out (ambiguities and what to do) on its own.

There are a few reasons I took this approach:

  1. When I started, parsing was something I wasn't terribly good at, so allowing 'custom language-targeted text' would've been error prone.
  2. I think part of my concern was I didn't want a 'Single Target Language' grammar, I wanted to make it so that if you wanted a VB.NET parser, you'd just simply switch languages.
    1. Do note: my VB output mechanism isn't written as of right now, but the objects which represent the result code are in 'general purpose' constructs that would make it as easy as writing a VB Translator.

This has many advantages, as well as a few disadvantages:

Advantages/Goals:

  1. The grammars are much cleaner, which could lead you to focus on the structure of the language itself, versus 'what do I do here'
  2. Precedence is simple.
    1. You merely need to specify 'X' rules that outline the precedence. Where 'X' is the number of levels you intend to go.
    2. With 'Collapse' point rules, you can add intermediate steps between them to 'flatten' the resulted hierarchy.
    3. These intermediate step rules add one (1) level to each rule, but templates mean you can write a template that handles creating these for you.
  3. Left recursion should be handled for you, without left factoring.
  4. Contextual keywords should be handled for you.
    1. If 'identifier' collides with a new keyword that you only want as a keyword in certain cases, the program should be smart enough to figure this out.
    2. Early tests show this is possible, so long as the keyword doesn't cause the rule to terminate with the same tokens that the other variant.
    3. Two tokens which overlap identically should only be considered 'ambiguous' if they show up at the same time.
      1. Example: If I have two '(' tokens that are used for different reasons, it shouldn't call them ambiguous if they aren't used in the same place.
      2. Even if they do, it should tell which is intended by what comes next, just like any other prediction.
  5. You should get both a Concrete Syntax Tree as well as an Abstract Syntax Tree.
    1. Collapse points, mentioned above, mean that 'rules you don't care about' shouldn't be present in the final mockup that you receive.
      1. Typically these rules represent what's valid at a given juncture, like a 'Class Member', or a child of a class.
      2. Usually you don't care so much about the 'Class Member' node, you care what kind of node it ended up being. How you got there isn't so important at the AST level.
      3. Another good example is a 'Type Declaration', when debugging, you don't want to have to expand a 'Type Declaration' each time you are trying to see what kind of type it is.
      4. Further, if it's also a Class Member, you'd have to expand two levels, Class Member -and- Type Declaration. Not needing to do this saves time.
      5. Beyond just 'expanding nodes' when you're using a visitor model over the AST, those 'irrelevant nodes' merely increase the stack pressure, removing them from the final AST simplifies their evaluation.
    2. If you don't call out a particular node by name, you don't care about it, so the 'AST' will ignore everything not called out by name.
      1. Most notable case is Type Declarations, such as a Class Declaration. You don't need to know that they wrote 'class' at the start. The fact that you're staring at a class declaration object says they did.
      2. This further expands to 'Single Operator' mathematical expressions.
        1. Conditional Operator which uses '?' and ':'
        2. Logical Or Expressions which use '||'
    3. Since we're human, having the 'Concrete Syntax Tree' will be useful when you want to verify it's parsing correctly.
      1. If I get it wrong (which I'm still human)
      2. If you find you got it wrong and need a way to look at the constituent symbols which make up a rule.
        1. Once I hit the 'Bootstrapping' phase, I found that my hand-written parser was not following the language description language's own rules!
        2. I had to write in alternative legal parses to account for this issue until I fixed the hand-written parser (which in retrospect, my hand-written parser kind of sucks.)

Disadvantages:

  1. Prediction code can get pretty nasty.
    1. Left-recursive predictions are horrible right now, I'm drafting up a process to fix this, but it'll take some time.
    2. Some languages enter 'Recursively Enumerable' territory, causing OILexer to never end its evaluation.
      1. Some of these situations may well be fixed by the Left recursive prediction fix I'll be working on soon.
      2. Short answer is: I botched Indirect Left Recursion for complex languages (by complex, mostly means 'real world').
  2. Collapse Points can create lots of 'placeholder' types.
    1. To solve the issue of removing collapse-able rules from the AST, I used inheritance to represent situations where the rules actually mean something else at the same time.
    2. If Class Declaration is also a Type Declaration, Class Declaration must have a derivation which implements the placeholder interface for Type Declaration.
    3. If Class Declaration is also a Type Declaration, which is also a Class Member, we must therefore have a derivation of the Class Declaration's Type Declaration which is also implements the Class Member placeholder interface (effectively a class within a class.)
    4. Essentially the inheritance graphs of a given AST node go as far as the rabbit hole that leads to them does, with respect to Collapse points.
    5. The saving grace here is it's done for you, and it should just work.
  3. There's no good 'naming scheme' to represent ambiguous identities.
    1. If you have a keyword named 'Yield' which is ambiguous with 'Identifier', it might be called Ambiguity1, or Ambiguity21, depending on how many are in the language and when it was encountered.
    2. I didn't bother trying to create the name from the tokens which are ambiguous, because the ambiguities may be between more than two tokens (or fifty tokens, if you're nuts.)
    3. This makes reading code with ambiguous matches a bit arduous.
    4. I do generate documentation comments for the ambiguities, but they don't appear in Debug Mode :(
  4. Follow predictions require discriminator logic, which can get messy.
    1. If a token for a given rule is on an edge state, we must check to see if accepting that token would cause the parse to fail, because the calling rule required it to parse successfully.
      1. There are valid cases when the parent requires it and we can still accept it: if the parent parse can assuredly succeed without the token we want to consume, we can happily gobble it up in that rule. But not before we can guarantee that.
    2. Currently this is semi-broken for complex cases. If the rule being parsed is the result of a 'predictive reduction' the context of the ambiguity is not yet known within this discriminator logic (meaning: it'll likely reject certain valid parses.)
  5. Error detection is a 'placeholder' and doesn't really work.
    1. I can mark a rule as invalid, but I don't yet capture why it is invalid.
    2. To do this accurately, I need to capture the state it failed at, so I can ask the parser what was valid at that point, based on the rule stack at the time.
    3. This is deliberately left undone for now because its function may change depending on how I implement indirect left recursion code.
OILexer is a project of pure fun for me. I think once I fix the 'Indirect Left Recursive Failures', the 'Tail Discriminator' concerns, and properly implement the error detection and recovery, I'll have something truly useful. Though knowing me, once I get there, I'll use the bootstrapped parser to rewrite the parsing phase of the program. Attempting this will give me practical experience in how to use the result AST/CST structures, and shed light on what other creepy crawlies lie in the mix that need fixed.

Monday, February 1, 2016

Parse Times Looking Promising

After doing some initial testing, the boot-strap capable parser seems to be able to best OILexer itself in parsing its own language. When I started this project back in 2008, the experience I had writing parsers was limited at best. The new mechanics put in place do things I can only hope to do by hand: unbound look-ahead. Granted the number of tokens seems to vary, but in cases where the look-ahead requires a loop, it seems to handle it just fine. Some initial testing, comparing the two file sets, shows the automated parser is quite fast when compared to its hand-written brethren:

Hand Built Parser:

Automated Parser

The paths for the hand-built parser are longer because the files are copied to the output directory for the project. I've blurred the full path to avoid giving too much about my personal system.

I'm rather shocked that there are cases where it is as much as a hundred and fifty times faster. These are -running in the debugger- figures (64-bit target), so the release times may vary. If anything I think the automated parser will just get faster.

There's a few things to note: the automated parser doesn't handle errors. It can detect them, but as far as saying where they are after the fact, it's a no-go. It's worth noting that OILexer's hand-written parser likely has a few areas where it will still hit an 'infinite loop', as I've encountered one or two in the live-edit mode within Visual Studio during its development (ok, more than one or two.) These have been patched along the way, but they're usually due to unexpected input and the fact that the hand-written parser does some wonky peeking at the lexical view to avoid needing to parse certain characters, which... sometimes don't even have valid tokens!

I suspect the automated build will slow down some once Error Handling is introduced. I'm getting ready to 'break stuff' in light of issues with Indirect Left Recursion that I think I have a logical plan of action for, but it's not a simple fix. More on that later (once I have figured it out :)

It's interesting to me to compare the automated version's 'prediction' versus the hand-written variant.

Automated Determination:
private int _Predict14OnTokenItem(int laDepth)
{
    int exitLADepth = -1;
    int currentLADepth = laDepth;
    int exitState = -1;
    this._state = 673;
PredictMultiTargetState_673:
    switch (this._symbolStream.LookAhead(currentLADepth))
    {
        case OilexerParserSymbols.BuiltInLanguageCommands_BaseEncode:
        case OilexerParserSymbols.BuiltInLanguageCommands_Scan:
        case OilexerParserSymbols.BuiltInLanguageCommands_Subtract:
            exitState = this._state = 681;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.CharacterRange:
            exitState = this._state = 680;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.LanguageOperatorToken_BeginGroup:
            exitState = this._state = 679;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Char:
        case OilexerParserSymbols.String:
            exitState = this._state = 678;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Identifier:
            exitState = this._state = 677;
            exitLADepth = ++currentLADepth;
            goto DecisionPoint;
        case OilexerParserSymbols.Ambiguity1:
        case OilexerParserSymbols.Ambiguity2:
        case OilexerParserSymbols.Ambiguity3:
            exitState = this._state = 674;
            exitLADepth = ++currentLADepth;
            switch (this._symbolStream.LookAhead(currentLADepth))
            {
                case OilexerParserSymbols.LanguageOperatorToken_BeginGroup:
                    exitState = this._state = 676;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_676;
                case OilexerParserSymbols.LanguageOperatorToken_IdentitySeparator:
                case OilexerParserSymbols.LanguageOperatorToken_ItemIsRecognizer:
                case OilexerParserSymbols.LanguageOperatorToken_OneOrMore:
                case OilexerParserSymbols.LanguageOperatorToken_OptionsSeparator:
                case OilexerParserSymbols.LanguageOperatorToken_RepetitionRangeStart:
                case OilexerParserSymbols.LanguageOperatorToken_ZeroOrMore:
                case OilexerParserSymbols.LanguageOperatorToken_ZeroOrOne:
                    exitState = this._state = 675;
                    exitLADepth = ++currentLADepth;
                    goto PredictMultiTargetState_675;
                default:
                    goto DecisionPoint;
            }
        default:
            goto DecisionPoint;
    }
PredictMultiTargetState_676:
    goto DecisionPoint;
PredictMultiTargetState_675:
    goto DecisionPoint;
DecisionPoint:
    switch (exitState)
    {
        case 674: case 675: case 677:
            return 548 /* SoftReferenceScannableEntryItem */;
        case 676: case 681:
            return 546 /* BuiltInCommandTokenItem */;
        case 678:
            return 549 /* LiteralTokenItem */;
        case 679:
            return 547 /* GroupTokenItem */;
        case 680:
            return 550 /* CharacterRangeTokenItem */;
    }
    return 0 /* The prediction did not match a valid alt. */;
}
Hand-written variant, doing prediction within the TokenExpression that uses the TokenItems
private void ParseTokenBody(List<ITokenExpression> expressions)
{
    int l = this.CurrentTokenizer.GetLineIndex(this.StreamPosition);
    int ci = this.CurrentTokenizer.GetColumnIndex(this.StreamPosition);
    long p = this.StreamPosition;
    List<ITokenItem> seriesItemMembers = new List<ITokenItem>();
    while (true)
    {
        ITokenItem item = null;
        IOilexerGrammarToken igdt = LookAhead(0);
        if (igdt == null && CurrentTokenizer.CurrentError != null)
        {
            currentTarget.SyntaxErrors.SyntaxError(CurrentTokenizer.CurrentError);
 
            break;
        }
        else
        {
            if (igdt == null)
            {
                LogError(OilexerGrammarParserErrors.Expected, "identifier, '(' or ';'");
                return;
            }
            switch (igdt.TokenType)
            {
                case OilexerGrammarTokenType.Comment:
                    var commentToken = (OilexerGrammarTokens.CommentToken)igdt;
                    if (captureRegions && commentToken.MultiLine)
                    {
                        var gdResult = (OilexerGrammarFile)currentTarget.Result;
                        gdResult.AddCommentRegion(commentToken);
                    }
                    PopAhead();
                    continue;
                case OilexerGrammarTokenType.CharacterLiteral:
                    item = ParseTokenCharLiteral();
                    break;
                case OilexerGrammarTokenType.StringLiteral:
                    item = ParseTokenStringLiteral();
                    break;
                case OilexerGrammarTokenType.Identifier:
                    item = ParseTokenReferenceItem();
                    break;
                case OilexerGrammarTokenType.CharacterRange:
                    item = ParseTokenCharacterRange();
                    break;
                case OilexerGrammarTokenType.NumberLiteral:
                    LogError(OilexerGrammarParserErrors.Unexpected, "number", igdt.Position);
                    goto yield;
                case OilexerGrammarTokenType.PreprocessorDirective:
                    LogError(OilexerGrammarParserErrors.Unexpected, "preprocessor", igdt.Position);
                    goto yield;
                case OilexerGrammarTokenType.Operator:
                    switch (((OilexerGrammarTokens.OperatorToken)(igdt)).Type)
                    {
                        case OilexerGrammarTokens.OperatorType.Pipe:
                            ClearAhead();
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.LeftParenthesis:
                            ClearAhead();
                            item = (ITokenGroupItem)ParseTokenGroup();
                            break;
                        case OilexerGrammarTokens.OperatorType.RightParenthesis:
                            if (parameterDepth <= 0)
                                LogError(OilexerGrammarParserErrors.Expected, ";", igdt.Position);
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.GreaterThan:
                        case OilexerGrammarTokens.OperatorType.Comma:
                            if (commandDepths.Contains(parameterDepth))
                            {
                                ClearAhead();
                                goto yield;
                            }
                            PopAhead();
                            LogError(OilexerGrammarParserErrors.Expected, "identifier, '(', or ';'");
                            PopAhead();
                            goto yield;
                        case OilexerGrammarTokens.OperatorType.SemiColon:
                            ClearAhead();
                            if (templateDepth != 0 || parameterDepth != 0)
                            {
                                Expect((templateDepth == 0 ? ">" : string.Empty) + (parameterDepth > 0 ? ")" : string.Empty), this.StreamPosition);
                            }
                            goto yield;
                        default:
                            goto default_2;
                    }
                    break;
                default:
                default_2:
                    LogError(OilexerGrammarParserErrors.Expected, "identifier, '(' or ';'");
                    PopAhead();
                    break;
            }
        }
        if (item != null)
        {
            ClearAhead();
            if (TokenizerLookAhead(0) == ':')
            {
                ParseItemOptions(item);
            }
            ParseItemRepeatOptions(item);
            seriesItemMembers.Add(item);
        }
    }
yield:
    expressions.Add(new TokenExpression(seriesItemMembers, CurrentTokenizer.FileName, l, ci, p));
}