Regular expression

Regular expressions (also called regexp and regex) are a recursive (context-free) metalanguage which describe regular languages. A given regular expression is a metastring which describes a particular regular language. The structure of regular expressions is defined recursively, where expressions can be in one of these forms, and composed together:

$\lambda$ the empty string, also called $\epsilon$
$s$ a symbol from some alphabet
$(E)$ explicit grouping
$(E_{0}\cdot E_{1})$ or simply $(E_{0}E_{1})$ concatenation
$(E_{0}|E_{1})$ alternation (corresponding to nondeterminism)
$(E)^{\ast }$ repetition or Kleene closure, which is effectively $(\lambda |(E|(EE|(EEE|(\dots )))))$ for all finite lengths

Proper regular expressions consisting of only the above sub-expressions model exactly the regular languages, meaning that any regular expression can be converted/compiled into a finite state machine.

Because a regular expression is a string that describes strings, it is a meta-string. Also because of this, it needs either an extended alphabet from the strings it describes, or some concept of quoting.

Derivative

The Brzozowski derivative of a regular language is able to be computed from the grammar itself, producing another regular grammar. Brzozowski demonstrated this by describing an algorithm to compute the derivative of the language modeled by a regular expression, which produces a new regular expression.^[1]

We first define the function $\delta (E)$ which determines if an expression includes/matches the empty string. The function returns $\lambda$ if it contains the empty string or the empty set $\emptyset$ if it does not. Note that $\lambda \cdot s=s\cdot \lambda =s$ and $\emptyset \cdot s=s\cdot \emptyset =\emptyset$ for all sequences of symbols $s$ . A casewise definition can be given for the delta function:

$\delta (s)=\emptyset$ for all single symbols and nonzero length sequences of symbols
$\delta (\lambda )=\lambda$
$\delta (\emptyset )=\emptyset$
$\delta (s^{\ast })=\lambda$
$\delta (E_{0}\cdot E_{1})=\delta (E_{0})\cdot \delta (E_{1})$
$\delta (E_{0}|E_{1})=\delta (E_{0})|\delta (E_{1})$

With this, the derivative can be expressed in a similarly recursive manner:

$D_{s}(s)=\lambda$
$D_{s}(a)=\emptyset$ where $s\neq a$
$D_{s}(E^{\ast })=D_{s}(E)\cdot E^{\ast }$
$D_{s}(E_{0}\cdot E_{1})=(D_{s}(E_{0})\cdot E_{1})|(\delta (E_{0}))\cdot D_{s}(E_{1}))$
$D_{s}(E_{0}|E_{1})=D_{s}(E_{0})|D_{s}(E_{1})$

The formulae for the empty inclusion and derivative of alternation are actually special cases of the operations for boolean set operations. A boolean set operation applied to two regular expressions is the set operation applied to the language of each expression. Alternation is equivalent to union. Concatenation is not a boolean set operation, hence the differing form.

Derivatives can be used to determine if a string matches a regular expression by taking the derivative with respect to each character in turn. If the final derived language accepts the empty string, then the expression accepts the input string.

Mathematical generalization

Kleene algebras generalize regular expressions to contexts beyond languages generated from symbols. In a Kleene algebra, the operations of regular expressions are generalized: concatenation becomes non-commutative multiplication and alternation becomes commutative addition, along with some algebraic laws giving identities for both operations. In the other direction, Kleene algebras on sets of symbols typically correspond to regular expressions. This framing allows for a variety of problems to be expressed in terms of formal languages, since several types of automata carry Kleene algebras.^[2] Further, since closed semirings are a specific case of Kleene algebras and matrices over scalars of closed semirings are themselves also closed semirings, a large amount of linear algebra can be done on Kleene algebras,^[3] culminating in the generalization of the Floyd-Warshall algorithm to the Gauss–Jordan–Floyd–Warshall–Kleene algorithm.^[4]

Extensions

Regular expressions have been extended in many different ways. These extensions can be either regular (preserving the strict modeling of only regular languages) or non-regular (allowing for regexes to model more than the regular languages).

Regular

The wildcard character . is shorthand for alternation between all symbols. If the set of symbols is $\lbrace a,b,c,d\rbrace$ then the expression . is equivalent to (a|b|c|d) and matches the strings a, b, c, and d.
Character classes are a shorthand for matching one of many different characters. They are denoted by square brackets, and allow for ranges to be specified. (a|b|c|d|e) is equivalent to the character group [abcde] or [a-e]. Character classes can also match the opposite of the characters inside, so if the set of possible symbols is the English alphabet then [^a-e] is equivalent to [f-z]. Character classes can always be converted to alternation, and are thus regular.
Class escapes are shorthand for common character classes. For instance, in ASCII the class escape \d would be equivalent to [0-9].
The one-or-more closure + is similar to the Kleene star except that it does not match the empty string. a+ matches all strings of one or more as. (E)+ is always equivalent to (E)(E)*.
The optional quantifier ? matches an expression zero or one times. (E)? is always equivalent to $(\lambda |E)$ .
The range quantifier {n,m} matches an expression between n and m times. The range quantifier can be converted to alternation in a similar manner to ?.
Mathematically, regular expressions are nondeterministic and generative, meaning that the expression a* actually generates the strings a, aaaa, aaaaaaaa, as well as the empty string and all strings of a at every length, all at the same time. When used analytically to match strings, regular expressions make evaluation decisions. One decision is greediness. While a* could generate and therefore match aaa in the string aaaaa, greediness would have it generate and match the entire string. Greediness or non-greediness does not change the power of the language, since it is an implementation detail. Many regex implementations allow for quantifiers to be made non-greedy by putting a ? after it, like *?.
Lookahead allows for a regular expression to match only if the rest of the string matches an expression. Lookbehind is similar, but for a pattern on the string before.
Boundary assertions are a specific form of lookaround which specifically looks for certain boundaries. These boundaries can include the beginning or end of the string, or a valid word boundary, which typically includes the beginning or end of a string or whitespace.

Non-regular

Capture groups are a semantic use for grouping, where a group (denoted by parentheses) captures the text its enclosed expression matches, assigning it to a string register. Capture groups are often able to be referenced in the expression itself, like in the regex (a|b)c\1 where \1 expands to the string that the capture group (a|b) matches. This regex matches the strings aca and bcb but not acb or bca. Capture groups with in-expression references allows for regexes to implement at least some analytic context-sensitive grammars.
Parsing expression grammars (PEG) are a generalized form of regular expressions which allows for structural recursion. PEGs can easily model all context-free grammars, as well as at least some context-sensitive ones.
Definitions allow for structural recursion in much the same way PEG describes structural recursion, thereby allowing for the modeling of all context-free grammars.^[5]

Common use

Regular expressions are commonly used for find and replace. In this context, capture groups are almost always provided, with the replacement string able to reference them. If executed in a loop, regex replacement is Turing complete. Iterated regex is a model of computation explicitly based on this concept, although other languages based on the idea exist.

Languages

These are languages based on regular expressions. They are often more powerful than strict regular expressions. In fact, all of the languages listed below are Turing complete.

These languages are exactly equivalent to (some dialect) of regular expression:

Minreg the Cat: effectively a reformulation of proper regular expressions in reverse Polish notation
REXS: supports referencing capture groups in the match

These are based on regular expressions, but the computational class is not known:

Proof

References

↑ Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481–494. https://doi.org/10.1145/321239.321249
↑ Kozen, Dexter. 1994. A Completeness Theorem for Kleene Algebras and the Algebra of Regular Events. Information and Computation, Volume 110, Issue 2, 366-390. https://doi.org/10.1006/inco.1994.1037
↑ Dolan, Stephen. 2013. Fun with semirings: a functional pearl on the abuse of linear algebra. SIGPLAN Not. 48, 9 (September 2013), 101–110. https://dl.acm.org/doi/10.1145/2500365.2500613
↑ O'Connor, Russell. 2011. A Very General Method of Computing Shortest Paths. https://r6.ca/blog/20110808T035622Z.html
↑ Popov, Nikita. (2012, June 15). The true power of regular expressions. https://www.npopov.com/2012/06/15/The-true-power-of-regular-expressions.html

[1] Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481–494. https://doi.org/10.1145/321239.321249

[2] Kozen, Dexter. 1994. A Completeness Theorem for Kleene Algebras and the Algebra of Regular Events. Information and Computation, Volume 110, Issue 2, 366-390. https://doi.org/10.1006/inco.1994.1037

[3] Dolan, Stephen. 2013. Fun with semirings: a functional pearl on the abuse of linear algebra. SIGPLAN Not. 48, 9 (September 2013), 101–110. https://dl.acm.org/doi/10.1145/2500365.2500613

[4] O'Connor, Russell. 2011. A Very General Method of Computing Shortest Paths. https://r6.ca/blog/20110808T035622Z.html

[5] Popov, Nikita. (2012, June 15). The true power of regular expressions. https://www.npopov.com/2012/06/15/The-true-power-of-regular-expressions.html

[1]

[2]

[3]

[4]

[5]

Regular expression

Contents

Derivative

Mathematical generalization

Extensions

Regular

Non-regular

Common use

Languages

See also

References

Navigation menu

Regular expression

Derivative

Mathematical generalization

Extensions

Regular

Non-regular

Common use

Languages

See also

References

Navigation menu

Search