Formal grammar

This is still a work in progress. It may be changed in the future.

Formal grammars are models of formal languages. Grammars are classified into two major categories; generative, and analytic. Most formal grammars are generative, but some are analytic.

Definition

Generative grammars can be formalized as consisting of two disjoint sets of symbols, called terminals and nonterminals, as well as production rules. Production rules map strings of symbols (including at least one nonterminal) into new strings of symbols, replacing the string. Formal grammars are often nondeterministic, in two ways. The first is that they allow for a single string to map to multiple different strings. The second is in derivation order. Rules do not have precedence and can be applied in any order, so long as they match.

Productions are applied one at a time, as opposed to parallel grammars (like L-systems). A string is said to be derived from a prior string under a grammar if a single production rule can be applied to the prior to get the derived string. Repeated derivations may produce a string consisting of only terminals. If this occurs the string is called a sentence, and the derivation process halts. If a string has no valid derivations but still has nonterminals, the process also halts, but the string is not considered a sentence. The set of all sentences that a grammar produces from a start symbol is referred to as the grammar's language.

To illustrate the two types of nondeterminism, observe these examples:

N=\lbrace S\rbrace ,T=\lbrace a,b,c\rbrace \ S\rightarrow a\ S\rightarrow b\ S\rightarrow c

When applied to the starting symbol $S$ then we can freely choose to use any of the production rules. If we choose any of them we will result in a string of only terminals. The language generated by this grammar is $\lbrace a,b,c\rbrace$ . This demonstrates how one symbol (and therefore string) can nondeterministically map into multiple different symbols.

$N=\lbrace S,A,B,C\rbrace ,T=\lbrace a,b,c\rbrace S\rightarrow aBCS\rightarrow aSBCCB\rightarrow BCaB\rightarrow abbB\rightarrow bbbC\rightarrow bccC\rightarrow cc$

This grammar generates the language consisting of strings where there are $a$ s followed by $b$ s followed by $c$ s where the sequences have the same length. Here is an example production chain:

S
aSBC
aaBCBC
aaBBCC
aabBCC
aabbCC
aabbcC
aabbcc

Note how the string aaBCBC can map to either aaBBCC or aabCBC. This nondeterministic behavior is due to the lack of order between rules, rather than having multiple options for a single rule. Note a derivation path starting from aabCBC:

aabCBC
aabcBC

This string cannot map to any other string, as there is no rule for the nonterminals in this configuration. The derivation must therefore halt, but since the string is not made up of solely terminals, it is not a sentence and not part of the grammar's language.

When evaluating a grammar to determine what strings can be generated from a starting string, we can build up a tree, where the branches are different paths which are taken when multiple possible derivations exist for a given starting string. The leaves of the tree which have only terminals in the string are the sentences of the language. We can consider the terminal leaves of any node of the tree, not just the start. If we pick a node and consider its language, we take the Brzozowski derivative of the full language with respect to the derivation path. The Brzozowski derivative of a language with respect to a string is the set of all sentences (the language) which can be derived from the string.

Types

Type 0: unrestricted

The most general form of grammar is the unrestricted grammar. The above definition for general formal grammars is that of unrestricted grammars. Unrestricted grammars are able to model all recursively enumerable languages, the same languages that Turing machines recognize. This means that unrestricted grammars are Turing complete.

A proof of this can be given by reproducing a nondeterministic Turing machine's tape construction and transition rules in the replacement rules of a grammar. Halting machines result in a terminal string/sentence.

Type 1: context sensitive

If we restrict the values that a substring can be mapped to by disallowing symbol deletion, we lose the ability to directly model all recursively enumerable languages. The class of grammars which cannot delete symbols but are otherwise unrestricted are called context-sensitive, monotonic, linear bounded, or noncontracting grammars.

The reason that these languages are not able to model recursively enumerable languages is due to their inability to clean up. Consider a grammar which converts minimal binary strings into minimal hexadecimal strings, that is, digit strings without leading zeroes. Since hexadecimal strings are up to four times shorter, the string must contract/decrease in size in order to produce minimal hexadecimal strings. Of course for this example, the leading zeroes could be ignored by something which is using the output of this grammar. In fact, all recursively enumerable languages can be modeled by a noncontracting grammar whose final outputs include terminal symbols which are to be ignored, even if the ignore symbols can only be in certain parts of the string.

If all symbols must be considered then context sensitive grammars are in the same computational class as linear bounded automata, since any string can be decided to be a member of a particular context sensitive grammar by a nondeterministic linear bounded automaton. The procedure is to give the nondeterministic automaton two tapes, each with the same length as the input string. This is permitted since the automaton's tape length can be an affine function of the input string's size. The first tape starts out with the input and the second the start symbol. The machine iteratively, nondeterministically transforms the second tape according to the grammar, never choosing rules which cause the working string to be longer than the tape. If the machine cannot choose a rule which ensures the string fits in the tape, then the input word mustn't be part of the noncontracting grammar's language, since if the production string exceeds the length, it can never become smaller again. If at some point the machine produces the input string in any of its execution paths, then the input is part of the language.

This procedure can be reformulated for linear bounded automata with one tape, and might be reformulated for deterministic LBAs. The latter is an open problem.

Type 2: context free

If instead of disallowing erasure rules, we disallow rules which replace strings of multiple symbols, only allowing rules which replace symbols with strings, then the power is reduced further than context-sensitive grammars. These grammars are called context-free, since they cannot consider a symbol's context. Context-sensitive grammars model simply nested languages, which are recognizable by pushdown automata.

The vast majority of programming languages are parsed as if they are context-free languages. Most will perform context-sensitive analysis, but the initial parse phase will generally make use of context-free analysis.

Brainfuck (and most of its derivatives) is entirely context-free, meaning this nondeterministic context-free grammar can generate every valid brainfuck program (without I/O), that is, its language is precisely brainfuck:

$N=\lbrace S\rbrace ,T=\lbrace <,>,+,-,[,]\rbrace S\rightarrow <SS\rightarrow >SS\rightarrow +SS\rightarrow -SS\rightarrow \lambda S\rightarrow [S]S$

I/O support could be added with little difficulty, but would make the notation more confusing.

Type 3: regular

Regular grammars are the most restricted in common use. Regular grammars have three valid forms for their production rules:

A → b

A → bC

A → λ

That is, a nonterminal may either map to a terminal, a terminal followed by a nonterminal, or the empty string.

Regular languages are recognizable by finite state automata, meaning that regular grammars are equivalent in power to FSMs.

Regular grammars are perhaps the most common of all, despite being the most restrictive. Proper regular expressions are an alternate notation for regular grammars with equivalent power. Extensions to regular expressions increases the recognition power beyond regular languages.

Derivative

Regular languages are unique in that the derivative of the language can be computed from the grammar, the operation taking in a regular grammar and producing a new regular grammar with no derivative operators. Brzozowski showed this by describing a recursive algorithm which computes the derivative of a regular expression. Context-free and greater languages can recurse, their derivatives may also recurse, so Brzozowski's algorithm cannot generalize to produce e.g. context-free grammars from context-free grammars.^[1] In fact, since determining if two context-free grammars are equivalent is undecidable (unlike for regular grammars), no algorithm which produces the grammar of the derivative of a context-free language can exist.^{[citation needed]}

Semi-Thue

A semi-Thue system or grammar is similar to a formal grammar, except that there is only one unified set of symbols, and not a single fixed axiom. Semi-Thue systems are equivalent in power to unrestricted grammars, any semi-Thue system can be translated into an unrestricted grammar where each symbol from the semi-Thue system corresponds to a nonterminal.

Thue

Thue grammars are a subset of semi-Thue grammars where the production rules are symmetric. That is, if $A\rightarrow B$ then $B\rightarrow A$ and vice versa for all $A$ and $B$ .

Evaluation order

The evaluation order of formal grammars is not specified. This allows for grammars to nondeterministically generate every possible word of a language from a starting symbol. Evaluation orders can be specified, however, which gives rise to other systems.

Markov algorithms are unrestricted grammars where every rule is indexed by a unique number. Evaluation proceeds by determining which rules match the current partial derivation, then selecting the production from the rule with the lowest index. Markov algorithms are necessarily deterministic, since for any set of unique numbers there will always be one smallest number, meaning there will only ever be at most one possible matching rule.

L-systems are unrestricted grammars where all matching rules are applied at the same time, with the word being the concatenation of all of the valid productions at once.

Normal forms

While formal grammars can have almost any form for their matches and productions, certain normal forms are commonly used to quantify what sorts of syntaxes give rise to certain language sets. A normal form specifies where terminals and nonterminals can occur, as well as if the empty string λ can be produced.

Chomsky

Chomsky normal form is a form for context-free grammars where each rule can be one of:

A → BC

A → a

A → λ

Greibach

Greibach normal form is an alternate form for context-free grammars where each rule is in one of the forms:

A → a

A → aB

A → aBC

A → aBCD

...

That is, every rule maps a nonterminal to a terminal followed by any number of nonterminals.

Greibach normal form cannot generate the empty string, but can if S → λ is incorporated. Despite being unable to contract, Greibach normal form can represent any context-free grammar that doesn't include the empty string, even if the Chomsky normal form includes rules which map to the empty string.

Kuroda

Kuroda normal form is a form for context-sensitive grammars where each rule can be one of:

AB → CD

A → BC

A → B

A → a

Kuroda normal form is similar to Chomsky normal form, with the added ability to consider the context of a nonterminal. Kuroda normal form corresponds to type 1 languages, context sensitive/noncontracting. If the following form (deletion) is added, then this form can represent all type-0 languages, and is equivalent to unrestricted grammars:

A → λ

Pentonnen

Pentonnen normal form is a restricted variant of Kuroda normal form where each rule must be one of:

AB → AC

A → BC

A → a

That is, only the nonterminal to the left of a nonterminal can be used as context, and therefore information cannot flow from right to left, only left to right. This form can represent all type-1 grammars, or all type-0 if the deletion form is added:

A → λ

Specific formal grammars

Generative

L-systems

Analytic

Parsing expression grammars

Grammar-based languages

Analytic

META II

References

↑ Might, Matthew., Darais, David. (2010, Oct 24). Yacc is dead. arXiv. arXiv:1010.5023. https://doi.org/10.48550/arXiv.1010.5023

[1] Might, Matthew., Darais, David. (2010, Oct 24). Yacc is dead. arXiv. arXiv:1010.5023. https://doi.org/10.48550/arXiv.1010.5023

[1]

Formal grammar

Contents

Definition