Sunday, February 7, 2016

OILexer - Main Goals

OILexer is a project I've been working on, sporadically, since 2007 (or 2008? I forget).

My main goal with the project is to make writing a language parser as clean and simple as possible.

Other systems typically offer mechanisms for injecting custom code to handle corner cases or to parse certain constructs specially.

With OILexer the aim is to avoid much of that and have the tool figure things out (ambiguities and what to do about them) on its own.

There are a few reasons I took this approach:

  1. When I started, parsing was something I wasn't terribly good at, so allowing 'custom language-targeted text' would've been error-prone.
  2. I didn't want a 'single target language' grammar; I wanted it so that if you wanted a VB.NET parser, you could simply switch output languages.
    1. Do note: my VB output mechanism isn't written as of right now, but the objects which represent the result code are 'general purpose' constructs, which would make adding it as easy as writing a VB translator.

This has many advantages, as well as a few disadvantages:


  1. The grammars are much cleaner, which lets you focus on the structure of the language itself, versus 'what do I do here'.
  2. Precedence is simple.
    1. You merely need to specify 'X' rules that outline the precedence, where 'X' is the number of levels you intend to support.
    2. With 'collapse point' rules, you can add intermediate steps between them to 'flatten' the resulting hierarchy.
    3. These intermediate step rules add one (1) level per rule, but with templates you can write one that creates them for you.
  3. Left recursion should be handled for you, without left factoring.
  4. Contextual keywords should be handled for you.
    1. If 'identifier' collides with a new keyword that you only want as a keyword in certain cases, the program should be smart enough to figure this out.
    2. Early tests show this is possible, so long as the keyword doesn't cause the rule to terminate with the same tokens as the other variant.
    3. Two tokens which overlap identically should only be considered 'ambiguous' if they show up at the same time.
      1. Example: If I have two '(' tokens that are used for different reasons, it shouldn't call them ambiguous if they aren't used in the same place.
      2. Even if they do, the parser should be able to tell which is intended by what comes next, just like any other prediction.
  5. You should get both a Concrete Syntax Tree as well as an Abstract Syntax Tree.
    1. Collapse points, mentioned above, mean that rules you don't care about won't be present in the final tree you receive.
      1. Typically these rules represent what's valid at a given juncture, like a 'Class Member', or a child of a class.
      2. Usually you don't care so much about the 'Class Member' node, you care what kind of node it ended up being. How you got there isn't so important at the AST level.
      3. Another good example is a 'Type Declaration': when debugging, you don't want to have to expand a 'Type Declaration' node every time you're trying to see what kind of type it is.
      4. Further, if it's also a Class Member, you'd have to expand two levels, Class Member -and- Type Declaration. Not needing to do this saves time.
      5. Beyond just 'expanding nodes': when you're using a visitor model over the AST, those 'irrelevant nodes' merely increase the stack pressure, so removing them from the final AST simplifies evaluation.
    2. If you don't call out a particular node by name, you don't care about it, so the 'AST' will ignore everything not called out by name.
      1. The most notable case is Type Declarations, such as a Class Declaration. You don't need to know that they wrote 'class' at the start; the fact that you're staring at a class declaration object says they did.
      2. This further expands to 'Single Operator' mathematical expressions.
        1. Conditional Operator which uses '?' and ':'
        2. Logical Or Expressions which use '||'
    3. Since we're human, having the 'Concrete Syntax Tree' will be useful when you want to verify the input is parsing correctly.
      1. In case I get it wrong (I'm still human, after all).
      2. In case you got it wrong and need a way to look at the constituent symbols which make up a rule.
        1. Once I hit the 'Bootstrapping' phase, I found that my hand-written parser was not following the language description language's own rules!
        2. I had to write in alternative legal parses to account for this issue until I fixed the hand-written parser (which, in retrospect, kind of sucks).
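The precedence layering and collapse points described above can be pictured with a short sketch. This is hypothetical illustration code, not OILexer output (OILexer generates parsers from a grammar; all names here are invented): each precedence level gets its own rule, and a level that matches only its sub-level returns the child directly instead of wrapping it, which is the effect a collapse point has on the final tree.

```python
# Hypothetical sketch: precedence via layered rules, with "collapse" of
# single-child intermediate nodes. One rule per precedence level; a level
# that parses only its sub-expression returns the child as-is, so the AST
# never contains a wrapper node you don't care about.
import re

def tokenize(src):
    return re.findall(r"\d+|[+*()]", src)

class Binary:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_add(self):           # level 1: lowest precedence
        node = self.parse_mul()
        while self.peek() == "+":  # left recursion handled as iteration
            self.next()
            node = Binary("+", node, self.parse_mul())
        return node                # no '+'? the child is returned unwrapped

    def parse_mul(self):           # level 2
        node = self.parse_primary()
        while self.peek() == "*":
            self.next()
            node = Binary("*", node, self.parse_primary())
        return node

    def parse_primary(self):       # level 3: highest precedence
        if self.peek() == "(":
            self.next()
            node = self.parse_add()
            assert self.next() == ")", "expected ')'"
            return node
        return int(self.next())

tree = Parser(tokenize("1+2*3")).parse_add()
```

Note how a bare `7` parses straight to the integer with no `Add`/`Mul` wrapper nodes in sight; that is exactly the "you only see the nodes you called out" behavior collapse points are after.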

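The contextual-keyword point (advantage 4) can likewise be sketched. A hand-wavy, hypothetical illustration (not OILexer's generated code; the rules are invented): a token spelled 'yield' is only treated as a keyword when what follows says the keyword reading can succeed; otherwise it falls back to being an identifier, so the two token identities never clash unless they truly appear in the same place.

```python
# Hypothetical sketch of contextual-keyword disambiguation: 'yield' collides
# with 'identifier', but a generated parser can pick the right identity from
# what comes next, just like any other prediction.
def classify(tokens, i):
    """Decide whether tokens[i] spelled 'yield' acts as a keyword here."""
    if tokens[i] != "yield":
        return "identifier"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    # Invented tie-breaker: if what follows continues an identifier use
    # (assignment, member access, ...), 'yield' was just a name; otherwise
    # the keyword reading is the one that can terminate successfully.
    return "identifier" if nxt in (None, "=", ".", ",") else "keyword"
```

The real work, of course, is that the tool derives this tie-breaker from the grammar automatically instead of asking the grammar author to write it.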

The disadvantages:

  1. Prediction code can get pretty nasty.
    1. Left-recursive predictions are horrible right now; I'm drafting a process to fix this, but it'll take some time.
    2. Some languages push the analysis into 'Recursively Enumerable' territory, causing OILexer to never finish its evaluation.
      1. Some of these situations may well be fixed by the Left recursive prediction fix I'll be working on soon.
      2. Short answer: I botched Indirect Left Recursion for complex languages (by complex, I mostly mean 'real world').
  2. Collapse Points can create lots of 'placeholder' types.
    1. To solve the issue of removing collapsible rules from the AST, I used inheritance to represent situations where a rule actually means something else at the same time.
    2. If Class Declaration is also a Type Declaration, Class Declaration must have a derivation which implements the placeholder interface for Type Declaration.
    3. If Class Declaration is also a Type Declaration, which is also a Class Member, we must therefore have a derivation of the Class Declaration's Type Declaration which also implements the Class Member placeholder interface (effectively a class within a class).
    4. Essentially, the inheritance graph of a given AST node goes as deep as the rabbit hole of collapse points that leads to it.
    5. The saving grace here is that it's done for you, and it should just work.
  3. There's no good 'naming scheme' to represent ambiguous identities.
    1. If you have a keyword named 'Yield' which is ambiguous with 'Identifier', it might be called Ambiguity1, or Ambiguity21, depending on how many are in the language and when it was encountered.
    2. I didn't bother trying to derive the name from the tokens which are ambiguous, because the ambiguity may span more than two tokens (or fifty tokens, if you're nuts).
    3. This makes reading code with ambiguous matches a bit arduous.
    4. I do generate documentation comments for the ambiguities, but they don't appear in Debug Mode :(
  4. Follow predictions require discriminator logic, which can get messy.
    1. If a token for a given rule is on an edge state, we must check whether accepting that token would cause the parse to fail, because the calling rule required it to parse successfully.
      1. There are valid cases where the parent requires it and we can still accept it: if the parent parse can assuredly succeed without the token we want to consume, we can happily gobble it up in that rule; but not before we can guarantee that.
    2. Currently this is semi-broken for complex cases. If the rule being parsed is the result of a 'predictive reduction', the context of the ambiguity is not yet known within this discriminator logic (meaning it will likely reject certain valid parses).
  5. Error detection is a 'placeholder' and doesn't really work.
    1. I can mark a rule as invalid, but I don't yet capture why it is invalid.
    2. To do this accurately, I need to capture the state it failed at, so I can ask the parser what was valid at that point, based on the rule stack at the time.
    3. This is deliberately left undone for now because its function may change depending on how I implement the indirect left recursion code.
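The placeholder-type growth from disadvantage 2 can be pictured with ordinary inheritance. A hypothetical sketch (OILexer targets C#-style output; these Python names are invented): each path of collapse points through which a rule can be reached gets its own derivation, so the collapsed rules vanish from the AST while the 'is-a' relationships survive.

```python
# Hypothetical sketch of collapse-point placeholder types. TypeDeclaration
# and ClassMember are collapsed rules that never appear as AST nodes; they
# survive only as marker bases on derivations of the concrete node.

class TypeDeclaration:            # placeholder for a collapsed rule
    pass

class ClassMember:                # another collapsed rule
    pass

class ClassDeclaration:           # the concrete node you actually care about
    def __init__(self, name):
        self.name = name

# ClassDeclaration reached through the TypeDeclaration rule:
class ClassDeclarationAsType(ClassDeclaration, TypeDeclaration):
    pass

# ...and reached through ClassMember -> TypeDeclaration (a nested class),
# one level deeper in the rabbit hole, so one more derivation:
class ClassDeclarationAsMember(ClassDeclarationAsType, ClassMember):
    pass

node = ClassDeclarationAsMember("Inner")
```

The upshot: a visitor asking "is this a Class Member?" still gets a yes via the type system, even though the Class Member rule itself never shows up as a node to expand.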

OILexer is a project of pure fun for me. I think once I fix the 'Indirect Left Recursive Failures', the 'Tail Discriminator' concerns, and properly implement error detection and recovery, I'll have something truly useful. Though knowing me, once I get there, I'll use the bootstrapped parser to rewrite the parsing phase of the program. Attempting this will give me practical experience with the resulting AST/CST structures, and shed light on what other creepy crawlies lie in the mix that need fixing.
