Saturday, March 14, 2009

Language Data Analysis and Syntactical Analysis

Well, presently I'm working on two aspects of the language analysis phase: data dependencies (what it exports in the form of a Concrete Syntax Tree) and syntactical analysis. Both are turning out to be harder than expected, though the data analysis is a lot easier than the syntactical aspect. Below is a small snippet from the console (reformatted for reading) that describes the data-wise union of the sub-rules of a given rule in a language:

RelationalExpression
    Left      False  RuleDataNode (RelationalExpression)
    Operator  False  EnumDataNode (Operators: LessThan | LessThanOrEqualTo | GreaterThan | GreaterThanOrEqualTo | IsType | CastToOrNull)
    Right     False  RuleDataNode (TypeReference)
              False  RuleDataNode (ShiftLevelExpression)
ConditionalExpression
    Term               False  RuleDataNode (LogicalOrExpression)
    QuestionDelimeter  False  EnumDataNode (Operators: ConditionalOperator)
    TruePart           False  RuleDataNode (ConditionalExpression)
    DecisionDelimeter  False  EnumDataNode (Operators: Colon)
    FalsePart          False  RuleDataNode (ConditionalExpression)
UnaryOperatorExpression
    PrefixUnaryOperator  True   EnumDataNode (Operators: Increment | Decrement | Addition | Subtraction)
    Operand              False  RuleDataNode (Operand)
                         False  RuleDataNode (Expression)
                         False  RuleDataNode (SingleUnitExpression)
                         False  RuleDataNode (CreateNewObjectExpression)
    UnnamedEntity        False  EnumDataNode (Operators: LeftParenthesis | RightParenthesis)
                         False  EnumDataNode (Operators: RightParenthesis)
    TargetType           False  RuleDataNode (TypeReference)


Fairly straightforward, though a bit cryptic unless you've seen the language description. I used concepts similar to those in the Lexical Analysis: rule element merging and so on. Each individual rule will utilize the intersection of the full dataset after it's been intersected across all rules. This should give me the main rule interface/class's data set and the individual sub-sets of each permutation of the rule.

As shown above, the RelationalExpression splits on the 'Right' term. This is because the 'is' and 'as' operators take a type reference on the right, while the other relational operators take a ShiftLevelExpression.

A few special notes: the 'True' or 'False' column represents the multiplicity of the elements. Where alternation is used, two elements with the same name are placed in the same group even if their types differ; the final data extraction phase will sort them out as needed.
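To make the grouping concrete, here is a minimal Python sketch of the idea, not the actual generator code. The `Element` class, the `rule` and `enum` helpers, and the `union_alternatives` function are all hypothetical names, and the merge rules (enum values are unioned, rule references with different targets stay as separate variants under the same name) are an assumption based on the console output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str               # element name within the rule, e.g. "Right"
    kind: str               # "RuleDataNode" or "EnumDataNode"
    target: frozenset       # referenced rule name, or the enum's values
    multiple: bool = False  # the True/False multiplicity column


def rule(name, target, multiple=False):
    """Hypothetical helper: an element that references another rule."""
    return Element(name, "RuleDataNode", frozenset({target}), multiple)


def enum(name, *values, multiple=False):
    """Hypothetical helper: an element that captures one of several operators."""
    return Element(name, "EnumDataNode", frozenset(values), multiple)


def union_alternatives(alternatives):
    """Group the elements of every alternative by name.

    Assumed merge rules: enums with the same name have their values unioned,
    identical rule references collapse into one entry, and rule references
    with different targets are kept side by side in the same group.
    """
    groups = {}
    for alternative in alternatives:
        for el in alternative:
            variants = groups.setdefault(el.name, [])
            for i, seen in enumerate(variants):
                both_enum = seen.kind == el.kind == "EnumDataNode"
                same_rule = (seen.kind == el.kind == "RuleDataNode"
                             and seen.target == el.target)
                if both_enum or same_rule:
                    variants[i] = Element(el.name, el.kind,
                                          seen.target | el.target,
                                          seen.multiple or el.multiple)
                    break
            else:
                # No compatible entry yet: keep it as a new variant of this name.
                variants.append(el)
    return groups


# Two alternatives of RelationalExpression, taken from the console output.
relational = [
    [rule("Left", "RelationalExpression"),
     enum("Operator", "IsType", "CastToOrNull"),
     rule("Right", "TypeReference")],
    [rule("Left", "RelationalExpression"),
     enum("Operator", "LessThan", "LessThanOrEqualTo",
          "GreaterThan", "GreaterThanOrEqualTo"),
     rule("Right", "ShiftLevelExpression")],
]

merged = union_alternatives(relational)
for name, variants in merged.items():
    for v in variants:
        print(name, v.multiple, v.kind, " | ".join(sorted(v.target)))
```

In this sketch 'Left' and 'Operator' each end up with a single entry, while 'Right' keeps two variants, which is the split described above.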

Additionally, sub-groups defined in a rule will also be lifted into a special sub-set data series. I'll cover this more once I have a more practical example.

Then there's the Syntactical Analysis phase. This is taking a large part of my focus due to its complexity. I've figured out how to mine what I need to determine the initial token set for a given rule at its starting point, but I think the next step is to start from the 'Root' rule and branch out from there. Once I have more concrete data I'll go into more detail. It's also going to require a reanalysis of the lexical aspects to discern overlapping elements. Where they overlap, it'll determine exactly how much: if two literal sets overlap on some elements, it only needs to concern itself with those, and if the current rule doesn't require any of the overlapped elements, it's a non-issue. The main complication is capture-recognizer tokens such as Identifiers, which will effectively trump Keywords and cause the look-ahead to increase, which is no big deal.
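The initial token set of a rule is what parsing texts usually call its FIRST set. Below is a rough Python sketch of the textbook fixed-point computation, not the code from this project. The toy `grammar` borrows rule and token names from the console output, but its shape is made up for illustration, and the sketch assumes no alternative is empty or optional.

```python
def first_sets(grammar):
    """Compute the initial token set of every rule by fixed-point iteration.

    `grammar` maps a rule name to a list of alternatives; each alternative is
    a non-empty list of symbols. A symbol that is a key of `grammar` is a
    rule reference, anything else is treated as a token.
    """
    first = {name: set() for name in grammar}
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            for alternative in alternatives:
                head = alternative[0]
                candidates = first[head] if head in grammar else {head}
                if not candidates <= first[name]:
                    first[name] |= candidates
                    changed = True
    return first


# Hypothetical, heavily simplified grammar for illustration only.
grammar = {
    "RelationalExpression": [
        ["RelationalExpression", "LessThan", "ShiftLevelExpression"],
        ["RelationalExpression", "IsType", "TypeReference"],
        ["ShiftLevelExpression"],
    ],
    "ShiftLevelExpression": [["UnaryOperatorExpression"]],
    "UnaryOperatorExpression": [
        ["Increment", "Operand"],
        ["Operand"],
    ],
    "Operand": [
        ["Identifier"],
        ["LeftParenthesis", "TypeReference", "RightParenthesis", "Operand"],
    ],
    "TypeReference": [["Identifier"]],
}

first = first_sets(grammar)
for name in grammar:
    print(name, sorted(first[name]))
```

The loop repeats until no set grows, so left-recursive rules such as RelationalExpression settle without special handling. Starting from a root rule and branching outward would build on the same sets.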

I'll probably cross-link the data analysis aspect with the syntactical aspect prior to the logical branching exploration phase. That way, when ambiguities arise, it can discern which data elements are impacted and lift those into an ambiguity type, so they remain intact but are flagged as ambiguous.

Funny how the moment I think I'm almost done, the stack gets bigger.

Edit: Figures. As I post this, I find a bug.

It appears I made a flub: UnaryOperatorExpression contains two unnamed Operators elements, but one is LeftParenthesis | RightParenthesis and the other is RightParenthesis. Both LeftParenthesis entries should have been merged into one element, and both RightParenthesis entries into another. So now I need to add code to find the best match when merging, instead of using the first equally natured element it can be merged with.
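The difference between the two strategies can be sketched in Python. This is a guess at how the duplicate could arise, since the real merging code isn't shown here: `merge_first_match`, `merge_best_match`, and `merge_alternatives` are hypothetical names, and "best match" is assumed to mean "largest overlap in enum values".

```python
def merge_first_match(groups, incoming):
    """Merge into the first existing group of the same nature.

    Every group here is an enum set, so the first group always wins.
    """
    if groups:
        groups[0] |= incoming
    else:
        groups.append(set(incoming))


def merge_best_match(groups, incoming):
    """Merge into the group sharing the most values; otherwise start a new one."""
    best = max(groups, key=lambda g: len(g & incoming), default=None)
    if best is not None and best & incoming:
        best |= incoming
    else:
        groups.append(set(incoming))


def merge_alternatives(alternatives, merge):
    """Seed the groups from the first alternative, then fold in the rest."""
    groups = [set(g) for g in alternatives[0]]
    for alternative in alternatives[1:]:
        for incoming in alternative:
            merge(groups, set(incoming))
    return groups


# Two alternatives, each with an unnamed '(' element and an unnamed ')' element.
alternatives = [
    [{"LeftParenthesis"}, {"RightParenthesis"}],
    [{"LeftParenthesis"}, {"RightParenthesis"}],
]

print(merge_alternatives(alternatives, merge_first_match))
print(merge_alternatives(alternatives, merge_best_match))
```

With the first-match strategy the first group absorbs both parentheses and the second keeps only RightParenthesis, which is the shape seen in the console output. With the best-match strategy each parenthesis ends up in its own group.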

Edit 2: Fixed - Though now I have another bug. The joys of programming.
