Formal grammar

From Esolang
(Redirected from Context-sensitive grammar)
Jump to navigation Jump to search
This is still a work in progress. It may be changed in the future.

Formal grammars are models of formal languages. Grammars are classified into two major categories; generative, and analytic. Most formal grammars are generative, but some are analytic.

Definition

Generative grammars can be formalized as consisting of two disjoint sets of symbols, called terminals and nonterminals, as well as production rules. Production rules map strings of symbols (including at least one nonterminal) into new strings of symbols, replacing the string. Formal grammars are often nondeterministic, in two ways. The first is that they allow for a single string to map to multiple different strings. The second is in derivation order. Rules do not have precedence and can be applied in any order, so long as they match.

Productions are applied one at a time, as opposed to parallel grammars (like L-systems). A string is said to be derived from a prior string under a grammar if a single production rule can be applied to the prior to get the derived string. Repeated derivations may produce a string consisting of only terminals. If this occurs the string is called a sentence, and the derivation process halts. If a string has no valid derivations but still has nonterminals, the process also halts, but the string is not considered a sentence. The set of all sentences that a grammar produces from a start symbol is referred to as the grammar's language.

To illustrate the two types of nondeterminism, observe these examples:

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle N = \lbrace S \rbrace, T = \lbrace a, b, c \rbrace\ S \rarr a\ S \rarr b\ S \rarr c}

When applied to the starting symbol Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle S} then we can freely choose to use any of the production rules. If we choose any of them we will result in a string of only terminals. The language generated by this grammar is Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \lbrace a, b, c \rbrace} . This demonstrates how one symbol (and therefore string) can nondeterministically map into multiple different symbols.


Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle N = \lbrace S, A, B, C \rbrace, T = \lbrace a, b, c \rbrace S \rarr aBC S \rarr aSBC CB \rarr BC aB \rarr ab bB \rarr bb bC \rarr bc cC \rarr cc}

This grammar generates the language consisting of strings where there are Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle a} s followed by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle b} s followed by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle c} s where the sequences have the same length. Here is an example production chain:

S
aSBC
aaBCBC
aaBBCC
aabBCC
aabbCC
aabbcC
aabbcc

Note how the string aaBCBC can map to either aaBBCC or aabCBC. This nondeterministic behavior is due to the lack of order between rules, rather than having multiple options for a single rule. Note a derivation path starting from aabCBC:

aabCBC
aabcBC

This string cannot map to any other string, as there is no rule for the nonterminals in this configuration. The derivation must therefore halt, but since the string is not made up of solely terminals, it is not a sentence and not part of the grammar's language.

When evaluating a grammar to determine what strings can be generated from a starting string, we can build up a tree, where the branches are different paths which are taken when multiple possible derivations exist for a given starting string. The leaves of the tree which have only terminals in the string are the sentences of the language. We can consider the terminal leaves of any node of the tree, not just the start. If we pick a node and consider its language, we take the Brzozowski derivative of the full language with respect to the derivation path. The Brzozowski derivative of a language with respect to a string is the set of all sentences (the language) which can be derived from the string.

Types

Type 0: unrestricted

The most general form of grammar is the unrestricted grammar. The above definition for general formal grammars is that of unrestricted grammars. Unrestricted grammars are able to model all recursively enumerable languages, the same languages that Turing machines recognize. This means that unrestricted grammars are Turing complete.

A proof of this can be given by reproducing a nondeterministic Turing machine's tape construction and transition rules in the replacement rules of a grammar. Halting machines result in a terminal string/sentence.

Type 1: context sensitive

If we restrict the values that a substring can be mapped to by disallowing symbol deletion, we lose the ability to directly model all recursively enumerable languages. The class of grammars which cannot delete symbols but are otherwise unrestricted are called context-sensitive, monotonic, linear bounded, or noncontracting grammars.

The reason that these languages are not able to model recursively enumerable languages is due to their inability to clean up. Consider a grammar which converts minimal binary strings into minimal hexadecimal strings, that is, digit strings without leading zeroes. Since hexadecimal strings are up to four times shorter, the string must contract/decrease in size in order to produce minimal hexadecimal strings. Of course for this example, the leading zeroes could be ignored by something which is using the output of this grammar. In fact, all recursively enumerable languages can be modeled by a noncontracting grammar whose final outputs include terminal symbols which are to be ignored, even if the ignore symbols can only be in certain parts of the string.

If all symbols must be considered then context sensitive grammars are in the same computational class as linear bounded automata, since any string can be decided to be a member of a particular context sensitive grammar by a nondeterministic linear bounded automaton. The procedure is to give the nondeterministic automaton two tapes, each with the same length as the input string. This is permitted since the automaton's tape length can be an affine function of the input string's size. The first tape starts out with the input and the second the start symbol. The machine iteratively, nondeterministically transforms the second tape according to the grammar, never choosing rules which cause the working string to be longer than the tape. If the machine cannot choose a rule which ensures the string fits in the tape, then the input word mustn't be part of the noncontracting grammar's language, since if the production string exceeds the length, it can never become smaller again. If at some point the machine produces the input string in any of its execution paths, then the input is part of the language.

This procedure can be reformulated for linear bounded automata with one tape, and might be reformulated for deterministic LBAs. The latter is an open problem.

Type 2: context free

If instead of disallowing erasure rules, we disallow rules which replace strings of multiple symbols, only allowing rules which replace symbols with strings, then the power is reduced further than context-sensitive grammars. These grammars are called context-free, since they cannot consider a symbol's context. Context-sensitive grammars model simply nested languages, which are recognizable by pushdown automata.

The vast majority of programming languages are parsed as if they are context-free languages. Most will perform context-sensitive analysis, but the initial parse phase will generally make use of context-free analysis.

Brainfuck (and most of its derivatives) is entirely context-free, meaning this nondeterministic context-free grammar can generate every valid brainfuck program (without I/O), that is, its language is precisely brainfuck:

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle N = \lbrace S \rbrace, T = \lbrace <, >, +, -, [, ] \rbrace S \rarr <S S \rarr >S S \rarr +S S \rarr -S S \rarr \lambda S \rarr [S]S}

I/O support could be added with little difficulty, but would make the notation more confusing.

Type 3: regular

Regular grammars are the most restricted in common use. Regular grammars have three valid forms for their production rules:

Ab
AbC
Aλ

That is, a nonterminal may either map to a terminal, a terminal followed by a nonterminal, or the empty string.

Regular languages are recognizable by finite state automata, meaning that regular grammars are equivalent in power to FSMs.

Regular grammars are perhaps the most common of all, despite being the most restrictive. Proper regular expressions are an alternate notation for regular grammars with equivalent power. Extensions to regular expressions increases the recognition power beyond regular languages.

Regular expressions

Regular expressions are a recursive (context-free) metalanguage which describes regular languages, where the following are valid forms for an expression:

  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \lambda} the empty string
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle s} a symbol from some alphabet
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (E)} explicit grouping
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (E_0 \cdot E_1)} or simply Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (E_0E_1)} concatenation
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (E_0|E_1)} alternation (corresponding to nondeterminism)
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (E)^{\ast}} repetition, which is effectively Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (\lambda|(E|(EE|(EEE|(\dots)))))} for all finite lengths

All regular grammars can be encoded as regular expressions and every regular expression can be freely expanded to a regular grammar.

Regular expressions lend themselves to being generalized to contexts other than languages generated from symbols. Kleene algebras (Wikipedia) are effectively regular expressions applied to arbitrary sets, and refers to concatenation as product, alternation as addition, and adds an additive identity. Kleene algebras on sets of symbols typically correspond to regular expressions. This framing allows for a variety of problems to be expressed in terms of formal languages. For instance, different types of automata or matrices of Kleene algebras.[1] Further, since closed semirings (Wikipedia) are a specific case of Kleene algebras, a large amount of linear algebra can be done on Kleene algebras,[2] culminating in the generalization of the Floyd-Warshall algorithm (Wikipedia) to the Gauss–Jordan–Floyd–Warshall–Kleene algorithm.[3]

Derivative

Regular languages are unique in that the derivative of the language can be computed from the grammar, the operation taking in a regular grammar and producing a new regular grammar with no derivative operators. Context-free and greater languages can recurse, their derivatives may also recurse, so the below method cannot generalize to produce grammars from grammars.[4] In fact, since determining if two context-free grammars are equivalent is undecidable (unlike for regular grammars), no algorithm which produces the grammar of the derivative of a context-free language can exist.

The following procedure was given by Brzozowski.[5]

We first define the function Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(E)} which determines if an expression includes/matches the empty string. The function returns Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \lambda} if it contains the empty string or the empty set Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \empty} if it does not. Note that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \lambda \cdot s = s \cdot \lambda = s} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \empty \cdot s = s \cdot \empty = \empty} for all sequences of symbols Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle s} . A casewise definition can be given for the delta function:

  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(s) = \empty} for all single symbols and nonzero length sequences of symbols
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(\lambda) = \lambda}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(\empty) = \empty}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(s^{\ast}) = \lambda}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(E_0 \cdot E_1) = \delta(E_0) \cdot \delta(E_1)}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \delta(E_0 | E_1) = \delta(E_0) | \delta(E_1)}

With this, the derivative can be expressed in a similarly recursive manner:

  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle D_s(s) = \lambda}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle D_s(a) = \empty} where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle s \neq a}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle D_s(E^{\ast}) = D_s(E)\cdot E^{\ast}}
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle D_s(E_0 \cdot E_1) = (D_s(E_0) \cdot E_1) | (\delta(E_0)) \cdot D_s(E_1)) }
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle D_s(E_0 | E_1) = D_s(E_0) | D_s(E_1) }

The formulae for the empty inclusion and derivative of alternation are actually special cases of the operations for boolean set operations. A boolean set operation applied to two regular expressions is the set operation applied to the language of each expression. Alternation is equivalent to union. Concatenation is not a boolean set operation, hence the differing form.

Semi-Thue

A semi-Thue system or grammar is similar to a formal grammar, except that there is only one unified set of symbols, and not a single fixed axiom. Semi-Thue systems are equivalent in power to unrestricted grammars, any semi-Thue system can be translated into an unrestricted grammar where each symbol from the semi-Thue system corresponds to a nonterminal.

Thue

Thue grammars are a subset of semi-Thue grammars where the production rules are symmetric. That is, if Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle A \rarr B} then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle B \rarr A} and vice versa for all Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle A} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle B} .

Evaluation order

The evaluation order of formal grammars is not specified. This allows for grammars to nondeterministically generate every possible word of a language from a starting symbol. Evaluation orders can be specified, however, which gives rise to other systems.

Markov algorithms are unrestricted grammars where every rule is indexed by a unique number. Evaluation proceeds by determining which rules match the current partial derivation, then selecting the production from the rule with the lowest index. Markov algorithms are necessarily deterministic, since for any set of unique numbers there will always be one smallest number, meaning there will only ever be at most one possible matching rule.

L-systems are unrestricted grammars where all matching rules are applied at the same time, with the word being the concatenation of all of the valid productions at once.

Normal forms

While formal grammars can have almost any form for their matches and productions, certain normal forms are commonly used to quantify what sorts of syntaxes give rise to certain language sets. A normal form specifies where terminals and nonterminals can occur, as well as if the empty string λ can be produced.

Chomsky

Chomsky normal form is a form for context-free grammars where each rule can be one of:

ABC
Aa
Aλ

Greibach

Greibach normal form is an alternate form for context-free grammars where each rule is in one of the forms:

Aa
AaB
AaBC
AaBCD
...

That is, every rule maps a nonterminal to a terminal followed by any number of nonterminals.

Greibach normal form cannot generate the empty string, but can if Sλ is incorporated. Despite being unable to contract, Greibach normal form can represent any context-free grammar that doesn't include the empty string, even if the Chomsky normal form includes rules which map to the empty string.

Kuroda

Kuroda normal form is a form for context-sensitive grammars where each rule can be one of:

ABCD
ABC
AB
Aa

Kuroda normal form is similar to Chomsky normal form, with the added ability to consider the context of a nonterminal. Kuroda normal form corresponds to type 1 languages, context sensitive/noncontracting. If the following form (deletion) is added, then this form can represent all type-0 languages, and is equivalent to unrestricted grammars:

Aλ

Pentonnen

Pentonnen normal form is a restricted variant of Kuroda normal form where each rule must be one of:

ABAC
ABC
Aa

That is, only the nonterminal to the left of a nonterminal can be used as context, and therefore information cannot flow from right to left, only left to right. This form can represent all type-1 grammars, or all type-0 if the deletion form is added:

Aλ

Specific formal grammars

Generative

Analytic

Grammar-based languages

Analytic

References

  1. Kozen, Dexter. 1994. A Completeness Theorem for Kleene Algebras and the Algebra of Regular Events. Information and Computation, Volume 110, Issue 2, 366-390. https://doi.org/10.1006/inco.1994.1037
  2. Dolan, Stephen. 2013. Fun with semirings: a functional pearl on the abuse of linear algebra. SIGPLAN Not. 48, 9 (September 2013), 101–110. https://dl.acm.org/doi/10.1145/2500365.2500613
  3. O'Connor, Russell. 2011. A Very General Method of Computing Shortest Paths. https://r6.ca/blog/20110808T035622Z.html
  4. Might, Matthew., Darais, David. (2010, Oct 24). Yacc is dead. arXiv. arXiv:1010.5023. https://doi.org/10.48550/arXiv.1010.5023
  5. Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481–494. https://doi.org/10.1145/321239.321249