Regular expression
Regular expressions (also called regexp and regex) are a recursive (context-free) metalanguage which describe regular languages. A given regular expression is a metastring which describes a particular regular language. The structure of regular expressions is defined recursively, where expressions can be in one of these forms, and composed together:
- the empty string, also called
- a symbol from some alphabet
- explicit grouping
- or simply concatenation
- alternation (corresponding to nondeterminism)
- repetition or Kleene closure, which is effectively for all finite lengths
Proper regular expressions consisting of only the above sub-expressions model exactly the regular languages, meaning that any regular expression can be converted/compiled into a finite state machine.
Because a regular expression is a string that describes strings, it is a meta-string. Also because of this, it needs either an extended alphabet from the strings it describes, or some concept of quoting.
Derivative
The Brzozowski derivative of a regular language is able to be computed from the grammar itself, producing another regular grammar. Brzozowski demonstrated this by describing an algorithm to compute the derivative of the language modeled by a regular expression, which produces a new regular expression.[1]
We first define the function which determines if an expression includes/matches the empty string. The function returns if it contains the empty string or the empty set if it does not. Note that and for all sequences of symbols . A casewise definition can be given for the delta function:
- for all single symbols and nonzero length sequences of symbols
With this, the derivative can be expressed in a similarly recursive manner:
- where
The formulae for the empty inclusion and derivative of alternation are actually special cases of the operations for boolean set operations. A boolean set operation applied to two regular expressions is the set operation applied to the language of each expression. Alternation is equivalent to union. Concatenation is not a boolean set operation, hence the differing form.
Derivatives can be used to determine if a string matches a regular expression by taking the derivative with respect to each character in turn. If the final derived language accepts the empty string, then the expression accepts the input string.
Mathematical generalization
Regular expressions lend themselves to being generalized to contexts other than languages generated from symbols. Kleene algebras (Wikipedia) are effectively regular expressions applied to arbitrary sets, and refers to concatenation as product, alternation as addition, and adds an additive identity. Kleene algebras on sets of symbols typically correspond to regular expressions. This framing allows for a variety of problems to be expressed in terms of formal languages. For instance, different types of automata or matrices of Kleene algebras.[2] Further, since closed semirings (Wikipedia) are a specific case of Kleene algebras, a large amount of linear algebra can be done on Kleene algebras,[3] culminating in the generalization of the Floyd-Warshall algorithm (Wikipedia) to the Gauss–Jordan–Floyd–Warshall–Kleene algorithm.[4]
Extensions
Regular expressions have been extended in many different ways. These extensions can be either regular (preserving the strict modeling of only regular languages) or non-regular (allowing for regexes to model more than the regular languages).
Regular
- The wildcard character
.
is shorthand for alternation between all symbols. If the set of symbols is then the expression.
is equivalent to(a|b|c|d)
and matches the stringsa
,b
,c
, andd
. - Character classes are a shorthand for matching one of many different characters. They are denoted by square brackets, and allow for ranges to be specified.
(a|b|c|d|e)
is equivalent to the character group[abcde]
or[a-e]
. Character classes can also match the opposite of the characters inside, so if the set of possible symbols is the English alphabet then[^a-e]
is equivalent to[f-z]
. Character classes can always be converted to alternation, and are thus regular. - Class escapes are shorthand for common character classes. For instance, in ASCII the class escape
\d
would be equivalent to[0-9]
. - The one-or-more closure
+
is similar to the Kleene star except that it does not match the empty string.a+
matches all strings of one or more as.(E)+
is always equivalent to(E)(E)*
. - The optional quantifier
?
matches an expression zero or one times.(E)?
is always equivalent to . - The range quantifier
{n,m}
matches an expression between n and m times. The range quantifier can be converted to alternation in a similar manner to?
. - Mathematically, regular expressions are nondeterministic and generative, meaning that the expression
a*
actually generates the stringsa
,aaaa
,aaaaaaaa
, as well as the empty string and all strings of a at every length, all at the same time. When used analytically to match strings, regular expressions make evaluation decisions. One decision is greediness. Whilea*
could generate and therefore matchaaa
in the stringaaaaa
, greediness would have it generate and match the entire string. Greediness or non-greediness does not change the power of the language, since it is an implementation detail. Many regex implementations allow for quantifiers to be made non-greedy by putting a?
after it, like*?
. - Lookahead allows for a regular expression to match only if the rest of the string matches an expression. Lookbehind is similar, but for a pattern on the string before.
- Boundary assertions are a specific form of lookaround which specifically looks for certain boundaries. These boundaries can include the beginning or end of the string, or a valid word boundary, which typically includes the beginning or end of a string or whitespace.
Non-regular
- Capture groups are a semantic use for grouping, where a group (denoted by parentheses) captures the text its enclosed expression matches, assigning it to a string register. Capture groups are often able to be referenced in the expression itself, like in the regex
(a|b)c\1
where\1
expands to the string that the capture group(a|b)
matches. This regex matches the stringsaca
andbcb
but notacb
orbca
. Capture groups with in-expression references allows for regexes to implement at least some analytic context-sensitive grammars. - Parsing expression grammars (PEG) are a generalized form of regular expressions which allows for structural recursion. PEGs can easily model all context-free grammars, as well as at least some context-sensitive ones.
- Definitions allow for structural recursion in much the same way PEG describes structural recursion, thereby allowing for the modeling of all context-free grammars.[5]
Common use
Regular expressions are commonly used for find and replace. In this context, capture groups are almost always provided, with the replacement string able to reference them. If executed in a loop, regex replacement is Turing complete. Iterated regex is a model of computation explicitly based on this concept, although other languages based on the idea exist.
Languages
These are languages based on regular expressions. They are often more powerful than strict regular expressions. In fact, all of the languages listed below are Turing complete.
These languages are exactly equivalent to (some dialect) of regular expression:
- Minreg the Cat: effectively a reformulation of proper regular expressions in reverse Polish notation
- REXS: supports referencing capture groups in the match
These are based on regular expressions, but the computational class is not known:
See also
References
- ↑ Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481–494. https://doi.org/10.1145/321239.321249
- ↑ Kozen, Dexter. 1994. A Completeness Theorem for Kleene Algebras and the Algebra of Regular Events. Information and Computation, Volume 110, Issue 2, 366-390. https://doi.org/10.1006/inco.1994.1037
- ↑ Dolan, Stephen. 2013. Fun with semirings: a functional pearl on the abuse of linear algebra. SIGPLAN Not. 48, 9 (September 2013), 101–110. https://dl.acm.org/doi/10.1145/2500365.2500613
- ↑ O'Connor, Russell. 2011. A Very General Method of Computing Shortest Paths. https://r6.ca/blog/20110808T035622Z.html
- ↑ Popov, Nikita. (2012, June 15). The true power of regular expressions. https://www.npopov.com/2012/06/15/The-true-power-of-regular-expressions.html