Elevated Parser

From Esolang
Jump to navigation Jump to search

Elevated Parser is a tool for converting source code to abstract syntax tree based on syntax definitions. It can also be viewed as a programming language that determines whether the given source code can be parsed using the given syntax definitions. This parser is used to parse the majority of programming languages invented by User:Hakerh400 and a few other languages.

This parser can parse a subset of context-free languages, with the ability of backtracking. It recursively searches for the longest nonterminal that can be parsed at the current position in the source code. Once such non-terminal is found, even if backtrack operation occurs, it would not be parsed again, which reduces time complexity from exponentional to linear in some cases. This parser can parse some languages that cannot be parsed using regular expressions.

Unlike many other popular parsers, this one does not output a program (state machine) that parses source code, but rather directly outputs a syntax tree. The tree can later be traversed (top-down or bottom-up) to either generate bytecode for a specific interpreter, or some interpreters may directly work with the syntax tree to perform computation (possibly transforming it into a graph).

There are no collisions. If the given source code can be parsed in more than one way based on the given syntax rules, the parser will return one of the possible syntax trees.

Syntax notation

Syntax definition consists of one or more syntax rules.

Each syntax rule defines a single non-terminal. It has a name and then one or more patterns inside curly braces. Patterns are separated by | character.

A pattern contains one or more elements. Element can be one of:

  • Literal string - example: "abc"
  • Range of characters - example: [0-9]
  • Non-terminal - example: number

Here is an example:

integer{ digit | integer digit }
digit{ [0-9] }

In this example we have two syntax rules (two non-terminals): integer and digit. Rule for integer contains two patterns. One pattern is digit and the other is integer digit.

Arrays of elements

Each element will be matched exactly once, unless we explicitly define a minimal and maximal number of times the element can be matched. Range of element repetitions is written in curly braces directly after an element (no whitespace should appear). For example:

name{ [A-Z] [a-z]{3, 5} }

This rule defines a non-terminal name that starts with an uppercase letter and then 3, 4 or 5 lowercase letters appear.

If we provide exactly one number, it applies to both limits. For example, [a-z]{10} will match exactly 10 lowercase letters. We could also specify "up to 10" ([a-z]{*, 10}) or "10 or more" ([a-z]{10, *}) letters.

Quantifiers * and + are syntactic sugars for {0, *} and {1, *} respectively. So, digit* will match zero or more digits, while digit+ will match one or more digits.

Separators

When matching arrays of elements, it is sometimes very useful to also match separators between elements. Separators are written directly after quantifiers.

For example, if we want to match an array of integers, like this: [5, 123, 45, 11, 0, 2], we can define the following syntax:

array{ "[" integer*separator "]" }
integer{ [0-9]+ }
separator{ space* "," space* }
space{ " " }

Basically, integer*separator means that integer will be matched zero or more times, but between each two consecutive integers a separator must be matched. The separator rule matches zero or more spaces, then a single comma character and then again zero or more spaces.

Optional quantifier

Quantifier ? is syntactic sugar for {0, 1}.

Any character

Denoted by . symbol. It matches any character.

Non-greedy specifier

There is a special symbol ?. It is different from {0, 1} quantifier, because {0, 1} will not backtrack if parsing fails. For example, if we want ot parse a multiline comment, we may try to do it like this:

comment{ "/*" .* "*/" }

However, it does not work in this parser, because parser is greedy by default and it will match .* as much as possible (to the end of the source code, skipping any */ strings) and it will not backtrack. We can force backtracking by appending ? symbol (it is not a quantifier):

comment{ "/*" .*? "*/" }

Now it will try to match "*/" after each character from .*? element.

Note: constructions like a{ b?? } are allowed (the first ? is a quantifier, while the second ? is a non-greedy specifier).

Separator quantifiers

It is not allowed to inline separator quantifiers. For example, this is invalid syntax rule:

word{ [a-z]*[A-Z]+ }

because [A-Z] is a separator and it cannot have inline quantifier +. We need to separate it into a new non-terminal:

word{ [a-z]*upperCase }
upperCase{ [A-Z]+ }

Case-insensitive strings

Can be achieved by appending i character after a literal string. Example:

mul{ "mul"i space+ num space+ num }
num{ [0-9]+ }
space{ " " }

Lookahead

Two features recently added to the parser are positive and negative lookahead. Lookahead asserts that an element must be present or must not be present, but matched length of that element is zero (does not consume characters).

Positive lookahead can be achieved by prepending : character before an element, while negative lookahead can be achieved by prepending ! character before an element. Example:

element{ increment | group | num }
increment{ "inc"i !letter space element }
group{ "(" space element*space space ")" }
letter{ [a-zA-Z] }
num{ [0-9]+ }
space{ " "* }

Nonterminal element matches following strings:

  • inc 5
  • inc inc 100
  • inc2
  • inc ()
  • inc (3 4)
  • inc(3 4)
  • inc inc(3()4)
  • inc(inc1)

but does not match following strings:

  • incinc 11
  • incc 3
  • inc3c()
  • inc(inc(incn))

Examples

Addition

Addition is a left-associative operation. We can define it like this:

add{ expr "+" num | num }
expr{ num | add }
num{ [0-9]+ }

Many of the popular parsers (like Yacc, CUP, etc...) are unable to parse left-recursive definitions if written as-is and throw collision errors. However, this parser is able to parse it, and even if we specified expr "+" expr instead of expr "+" num, it would still be able to parse it properly (it would choose one possible AST).

Exponentiation

Exponentiation, as a right-associative operation, can be parsed like this:

pow{ pow "^" expr | num }
expr{ num | pow }
num{ [0-9]+ }

Processing abstract syntax tree

The tree is, by default, traversed in the bottom-up direction and user-defined functions for each non-terminal are called to construct an AST using user-defined classes.

User functions are called for each rule with a single argument that is an instance of ASTNode class. It has the following attributes (or getters):

  • pti - index (0-indexed) of the pattern that is matched
  • str - string representation of the matched element
  • es - array of matched elements
  • seps - array of matched separators (length is one less than es)
  • fst - alias for es[0]

Each element of es and seps arrays is an instance of ASTElem class. It has the following attributes (or getters):

  • str - string representation of the matched element part
  • arr - array of matched element parts (optionally replaced with user-defined classes from previous iterations)
  • fst - alias for arr[0]

Exaple: given the following syntax rules:

functionCall{ ident s0 args }
args{ "(" s0 arg*commaSep s0 ")"  }
arg{ ident | functionCall }
ident{ [a-zA-Z]+ }
commaSep{ s0 "," s0 }
s0{ " "* }

and the following source code we want to parse:

abc(x, d(y))

We will get the following AST:

ASTNode "functionCall"
 pti:  0
 str:  "abc(x, d(y))"
 es:   Array [
   0: ASTElem
     0: ASTNode "ident"
       ...
   1: ASTElem
     0: ASTNode "s0"
       ...
   2: ASTElem
     0: ASTNode "args"
       ...
 ]
 seps: []

Note: the ... parts are not shown because they depend on user-defined classes.

Interpreters

These interpreters use the parser.