Elevated Parser
Elevated Parser is a tool for converting source code to abstract syntax tree based on syntax definitions. It can also be viewed as a programming language that determines whether the given source code can be parsed using the given syntax definitions. This parser is used to parse the majority of programming languages invented by User:Hakerh400 and a few other languages.
This parser can parse a subset of context-free languages, with the ability of backtracking. It recursively searches for the longest nonterminal that can be parsed at the current position in the source code. Once such non-terminal is found, even if backtrack operation occurs, it would not be parsed again, which reduces time complexity from exponentional to linear in some cases. This parser can parse some languages that cannot be parsed using regular expressions.
Unlike many other popular parsers, this one does not output a program (state machine) that parses source code, but rather directly outputs a syntax tree. The tree can later be traversed (top-down or bottom-up) to either generate bytecode for a specific interpreter, or some interpreters may directly work with the syntax tree to perform computation (possibly transforming it into a graph).
There are no collisions. If the given source code can be parsed in more than one way based on the given syntax rules, the parser will return one of the possible syntax trees.
Syntax notation
Syntax definition consists of one or more syntax rules.
Each syntax rule defines a single non-terminal. It has a name and then one or more patterns inside curly braces. Patterns are separated by |
character.
A pattern contains one or more elements. Element can be one of:
- Literal string - example:
"abc"
- Range of characters - example:
[0-9]
- Non-terminal - example:
number
Here is an example:
integer{ digit | integer digit } digit{ [0-9] }
In this example we have two syntax rules (two non-terminals): integer
and digit
. Rule for integer
contains two patterns. One pattern is digit
and the other is integer digit
.
Arrays of elements
Each element will be matched exactly once, unless we explicitly define a minimal and maximal number of times the element can be matched. Range of element repetitions is written in curly braces directly after an element (no whitespace should appear). For example:
name{ [A-Z] [a-z]{3, 5} }
This rule defines a non-terminal name
that starts with an uppercase letter and then 3
, 4
or 5
lowercase letters appear.
If we provide exactly one number, it applies to both limits. For example, [a-z]{10}
will match exactly 10
lowercase letters. We could also specify "up to 10" ([a-z]{*, 10}
) or "10 or more" ([a-z]{10, *}
) letters.
Quantifiers *
and +
are syntactic sugars for {0, *}
and {1, *}
respectively. So, digit*
will match zero or more digits, while digit+
will match one or more digits.
Separators
When matching arrays of elements, it is sometimes very useful to also match separators between elements. Separators are written directly after quantifiers.
For example, if we want to match an array of integers, like this: [5, 123, 45, 11, 0, 2]
, we can define the following syntax:
array{ "[" integer*separator "]" } integer{ [0-9]+ } separator{ space* "," space* } space{ " " }
Basically, integer*separator
means that integer
will be matched zero or more times, but between each two consecutive integers a separator
must be matched. The separator
rule matches zero or more spaces, then a single comma character and then again zero or more spaces.
Optional quantifier
Quantifier ?
is syntactic sugar for {0, 1}
.
Any character
Denoted by .
symbol. It matches any character.
Non-greedy specifier
There is a special symbol ?
. It is different from {0, 1}
quantifier, because {0, 1}
will not backtrack if parsing fails. For example, if we want ot parse a multiline comment, we may try to do it like this:
comment{ "/*" .* "*/" }
However, it does not work in this parser, because parser is greedy by default and it will match .*
as much as possible (to the end of the source code, skipping any */
strings) and it will not backtrack. We can force backtracking by appending ?
symbol (it is not a quantifier):
comment{ "/*" .*? "*/" }
Now it will try to match "*/"
after each character from .*?
element.
Note: constructions like a{ b?? }
are allowed (the first ?
is a quantifier, while the second ?
is a non-greedy specifier).
Separator quantifiers
It is not allowed to inline separator quantifiers. For example, this is invalid syntax rule:
word{ [a-z]*[A-Z]+ }
because [A-Z]
is a separator and it cannot have inline quantifier +
. We need to separate it into a new non-terminal:
word{ [a-z]*upperCase } upperCase{ [A-Z]+ }
Case-insensitive strings
Can be achieved by appending i
character after a literal string. Example:
mul{ "mul"i space+ num space+ num } num{ [0-9]+ } space{ " " }
Lookahead
Two features recently added to the parser are positive and negative lookahead. Lookahead asserts that an element must be present or must not be present, but matched length of that element is zero (does not consume characters).
Positive lookahead can be achieved by prepending :
character before an element, while negative lookahead can be achieved by prepending !
character before an element. Example:
element{ increment | group | num } increment{ "inc"i !letter space element } group{ "(" space element*space space ")" } letter{ [a-zA-Z] } num{ [0-9]+ } space{ " "* }
Nonterminal element
matches following strings:
inc 5
inc inc 100
inc2
inc ()
inc (3 4)
inc(3 4)
inc inc(3()4)
inc(inc1)
but does not match following strings:
incinc 11
incc 3
inc3c()
inc(inc(incn))
Examples
Addition
Addition is a left-associative operation. We can define it like this:
add{ expr "+" num | num } expr{ num | add } num{ [0-9]+ }
Many of the popular parsers (like Yacc, CUP, etc...) are unable to parse left-recursive definitions if written as-is and throw collision errors. However, this parser is able to parse it, and even if we specified expr "+" expr
instead of expr "+" num
, it would still be able to parse it properly (it would choose one possible AST).
Exponentiation
Exponentiation, as a right-associative operation, can be parsed like this:
pow{ pow "^" expr | num } expr{ num | pow } num{ [0-9]+ }
Processing abstract syntax tree
The tree is, by default, traversed in the bottom-up direction and user-defined functions for each non-terminal are called to construct an AST using user-defined classes.
User functions are called for each rule with a single argument that is an instance of ASTNode
class. It has the following attributes (or getters):
pti
- index (0-indexed) of the pattern that is matchedstr
- string representation of the matched elementes
- array of matched elementsseps
- array of matched separators (length is one less thanes
)fst
- alias fores[0]
Each element of es
and seps
arrays is an instance of ASTElem
class. It has the following attributes (or getters):
str
- string representation of the matched element partarr
- array of matched element parts (optionally replaced with user-defined classes from previous iterations)fst
- alias forarr[0]
Exaple: given the following syntax rules:
functionCall{ ident s0 args } args{ "(" s0 arg*commaSep s0 ")" } arg{ ident | functionCall } ident{ [a-zA-Z]+ } commaSep{ s0 "," s0 } s0{ " "* }
and the following source code we want to parse:
abc(x, d(y))
We will get the following AST:
ASTNode "functionCall" pti: 0 str: "abc(x, d(y))" es: Array [ 0: ASTElem 0: ASTNode "ident" ... 1: ASTElem 0: ASTNode "s0" ... 2: ASTElem 0: ASTNode "args" ... ] seps: []
Note: the ...
parts are not shown because they depend on user-defined classes.
Interpreters
These interpreters use the parser.