Friday, August 28, 2009

LL(∞)

Finally,

I have the parser compiler at a point where I can say it's progress, and with that comes knowing the class of parser I'm writing a compiler for: LL(∞). This still places it in the LL parser category (top-down, left-most derivation); however, it means that the lookahead for the grammars is not bound to any specific k.
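
To see why no fixed k suffices, consider a classic C# ambiguity (hypothetical names here, not taken from my grammar):

F(G<A, B>(7));   // a generic method call: one argument, G<A, B>(7)
F(G<A, B>7);     // two relational expressions: (G < A) and (B > 7)

The token that settles the matter is the one after the closing '>', and since the list between '<' and '>' can be arbitrarily long, no fixed lookahead depth can make the call.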

I've written a very simplistic recognizer (below) to go with the look-ahead state machine. Once the look-ahead aspect was finished, the rest was child's play. Now comes the hard part: figuring out whether this machine is sufficiently structured to allow for data mining. Recognition is one thing, but capturing the essential elements is another. I don't doubt that this project is just beginning.

public ICSCSharpFileRule Parse(string filename)
{

    Stream s = new FileStream(filename, FileMode.Open);
    this.currentScanner.SetStream(s);

    CSStreamState currentState = rootState;
    // Push and immediately pop dummy entries so the anonymous types can
    // be given named, strongly typed locals through type inference.
    var ambiguityStack = Pusher(new { Location = long.MinValue, Entry = (CSharpReaderScanData.Entry)null, State = (CSStreamState)null, Transition = new TokenTransition() });
    var errorPoints = Pusher(new { Location = long.MinValue, Transition = new TokenTransition() });
    var errorPoint = errorPoints.Pop();
    var lastAmbiguity = ambiguityStack.Pop();
    errorPoint = null;
    lastAmbiguity = null;
    int states = 0;
    bool hasValidExit = false;
    while (true)
    {
        if (lastAmbiguity != null)
            this.currentScanner.MoveTo(lastAmbiguity.Location + lastAmbiguity.Entry.Length);
        var currentScanData = this.currentScanner.NextToken(currentState.Transitions.FullSet);
        if (currentScanData.Count == 0)
            errorPoints.Push(new { Location = currentScanData.Location, Transition = currentState.Transitions.FullSet });
        else
        {
            foreach (var entry in currentScanData)
                ambiguityStack.Push(new { Location = currentScanData.Location, Entry = entry, State = currentState, Transition = entry.GetTransition() });
        }
        if (ambiguityStack.Count == 0)
            break;
        var currentAmbiguity = ambiguityStack.Pop();
        states++;
        //Unhinged token check.
        if (currentAmbiguity.State.Transitions.ContainsKey(currentAmbiguity.Transition))
            currentState = currentAmbiguity.State.Transitions[currentAmbiguity.Transition];
        else
        {
            /* Token doesn't exist in the valid set.
             * It's still parsed because it can be
             * ignored. */
            currentState = currentAmbiguity.State;
        }

        /* If the entire language parser has reached
         * either the end of the file, or there are
         * no valid transitions remaining, the parse
         * is done. */
        if (currentAmbiguity.Entry.ID == CSharpReaderTokens.EndOFile ||
            currentState.Transitions.Count == 0)
        {
            hasValidExit = true;
            break;
        }
        lastAmbiguity = currentAmbiguity;
    }
    if (!hasValidExit)
    {
        foreach (var currentErrorPoint in errorPoints)
        {
            if (errorPoint == null)
                errorPoint = currentErrorPoint;
            else if (currentErrorPoint.Location > errorPoint.Location)
                errorPoint = currentErrorPoint;
        }
        Console.WriteLine("{0} - Expected {1}, States: {2}", errorPoint.Location, errorPoint.Transition, states);
    }
    s.Dispose();
    // Recognizer only for now: no concrete parse result is built yet.
    return null;
}

// Creates a Stack<T> seeded with one element; used above so a stack of
// an otherwise unnameable anonymous type can be inferred.
public static Stack<T> Pusher<T>(T root)
{
    var result = new Stack<T>();
    result.Push(root);
    return result;
}


One of the fun things I've gained from all this is a better understanding of which collection types to use and when to use them (Stack&lt;T&gt;, Queue&lt;T&gt;, HashSet&lt;T&gt;, List&lt;T&gt;). I also goofed and used LINQ to perform an ordering operation on the transitions within a given state. While not a terrible thing on its own, it isn't exactly a bright one, either: by itself, on a 5300+ state parse, it added 1.2 seconds to the evaluation time. I've also cleaned up the first/follow set calculations (before, they were computed entirely separately) and added a third level of cache (to the transition table construction).
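
The goof, in miniature (a contrived sketch with hypothetical names, not the actual parser code):

using System;
using System.Collections.Generic;
using System.Linq;

static class OrderingGoof
{
    // Slow: runs on every state visit, re-sorting the same transitions
    // each time; on a 5300+ state parse, that overhead adds up.
    public static List<string> PerVisit(List<string> transitions)
    {
        return transitions.OrderBy(t => t).ToList();
    }

    // Better: sort once when the state is constructed and reuse the result.
    public static string[] Once(List<string> transitions)
    {
        var sorted = transitions.ToArray();
        Array.Sort(sorted);
        return sorted;
    }
}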

The third level of cache only yields a time savings of a few hundred milliseconds on an expression-heavy verification, but the frequency at which the cache finds a match was rather surprising (it matched more often than it didn't).
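
In miniature, such a cache is just a dictionary wrapped around the expensive construction step, with hit/miss counters of the sort that surfaced that statistic (hypothetical names; the real one sits in front of the transition table construction):

using System;
using System.Collections.Generic;

class BuildCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> cache = new Dictionary<TKey, TValue>();
    public int Hits;
    public int Misses;

    public TValue GetOrBuild(TKey key, Func<TKey, TValue> build)
    {
        TValue value;
        if (cache.TryGetValue(key, out value))
        {
            Hits++;               // surprisingly, the common case
            return value;
        }
        Misses++;
        value = build(key);       // the expensive construction step
        cache[key] = value;
        return value;
    }
}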

As an interesting point of comparison: a little over a year ago, parsing a 150KB document without context sensitivity took 800 milliseconds. The same file, still without context sensitivity, now takes 25 milliseconds, roughly 32 times faster. Guess deterministic finite automata are faster :]

Sadly, with context sensitivity, a smaller file (but with higher cyclomatic complexity) takes 1.3 seconds to parse the first time; the second time, it takes 50 milliseconds. The difference is the work of constructing the lookahead initial/first/follow sets and associated transition tables, which only has to happen on the first parse.
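
A minimal way to reproduce that first-run/second-run split (assuming a parser instance exposing the Parse method above):

var sw = System.Diagnostics.Stopwatch.StartNew();
parser.Parse(filename);       // pays the set/table construction cost
Console.WriteLine("First parse:  {0} ms", sw.ElapsedMilliseconds);

sw.Reset();
sw.Start();
parser.Parse(filename);       // tables already built and cached
Console.WriteLine("Second parse: {0} ms", sw.ElapsedMilliseconds);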

Hopefully there's still some work I can do to expedite the entire matter. I have some ideas; however, I want to delay such optimizations until I have a properly functioning parser. I don't want to further complicate an already complex matter.