Abstract syntax tree
An abstract syntax tree (AST) is an applicative tree encoding the wikipedia:abstract syntax of a program.
Motivation
In parsing theory, the grammar of a language equips every sentence with a concrete syntax tree (CST) whose vertices are productions. However, these trees generally do not have nice graph homomorphisms to the structures used in compilers for tasks like optimization or instruction selection. To simplify the problem, we can select an abstract grammar which is wikipedia:adjoint to the concrete grammar, and use the adjunction to forget the concrete details of the CST, resulting in an abstract tree.
Applications
Note that the above adjunction also comes with a free mapping which generates concrete syntax representing a given AST, known as the serializer of abstract syntax.
The primary application for ASTs prior to the wide availability of the Internet was in compiler engineering, where abstract syntax simplifies almost every piece of the traditional compiler pipeline. In particular, wikipedia:instruction selection is always witnessed by a tree homomorphism from abstract syntax to templated sequences of machine code, and algorithms like NOLTIS (Goldstein & Koes 2008) are merely quick searches for acceptable witnesses. Additionally, serialization serves as useful debugging and pretty-printing.
With the advent of large highly-networked rooms with many computers and the transition from data warehouses to data centers, perhaps more understandable as "warehouse-scale machines" (Barroso & Hölzle 2009), ASTs became a way to represent messages passed between computers. In particular, the serializer is used to produce network-safe messages of concrete syntax which are easy to parse, and the abstract syntax is exposed as an API over those messages.
Abstract binding trees
It can be useful to have generic bindings for ASTs which are not language-dependent. For any abstract grammar, we may add two additional productions: one for variable names and one for a one-variable binder. We also must wrap the original AST with variable-management annotations. In Zephyr ASDL, given an abstract syntax ast, we could have:
ast = … abt = Var(str name) | Abs(str name, abt term) | Tm(str* free, abt tree) | Pure(ast tree)
This construction gives the type of an abstract binding tree (ABT) (Harper 2012). Formal capture-avoiding substitution can be defined on ABTs up to alpha-equivalence without knowledge of the underlying syntax; in pseudocode:
-- XXX exercise for reader: write newNameNotIn, rename substitute value name body = case body of Var(n) -> if n == name then value else Var(n) -- NB: usually more complex than this! Abs(x, t) -> let x' = newNameNotIn t, t' = rename x x' t in Abs(x', substitute value name t') Tm(xs, t) -> Tm(xs, if name in xs then substitute value name t else t) Pure(t) -> Pure(t)
Any sort of binding within an abstract syntax can be lifted to an abstract binding. As such, ABTs generalize lambda calculi.
Note that ABTs themselves are only coherent up to alpha-equivalence; while names can be chosen by users, they may not be predictably preserved after substitution.
Herbrand structures
Any type of ASTs can be thought of as a wikipedia:Herbrand structure. Readers fluent in Prolog may think of ASTs as ground values generated by functors and atoms, for example. Similarly, a type of ABTs with substitution can be thought of as a wikipedia:Herbrand universe with logic variables which reify each substitution action via unification.
See also
- Zephyr ASDL, a language for describing abstract syntax
References
- Near-Optimal Instruction Selection on DAGs, Goldstein & Koes 2008
- The Datacenter as a Computer, Barroso & Hölzle 2009
- Practical Foundations for Programming Languages, Harper 2012 (pdf)