Iterated regex
Iterated regex is a language/computational model where a single regular expression replacement is repeatedly applied against an input string.
Form
A possible canonical form for iterated regex programs could be :
match/replacement
Where match is some regular expression, and replacement is the replacement. A fuller syntax might be
match/replacement/options
Which permits matching options to be specified.
Execution proceeds by running this replacement on the input string to derive a new string, then running the replacement on the new string, and so on, ad infinitum. A halting condition may be specified, but there's no obvious canonical criterion.
Matching is inherently nondeterministic. To alleviate this, iterated regex matches from the start of the string, with greedy behavior.
The replacement can be done in one of two ways: partial, or full. Partial replacement first determines which part of the input was matched, then replaces only that portion with the replacement text. Full replacement destroys the entire input and replaces it with the replacement text.
Example Python implementation (partial replacement) which halts when the string contains HALT
:
import re def interpret(program, input_string): match, replacement = program.split("/") while "HALT" not in input_string: input_string = re.sub(match, replacement, input_string)
Variations
Variants where there are multiple rules can be imagined. There are several ways they could be handled.
One way is to have a cyclic program counter. In this version, each rule has an index, and each rule is executed in turn. When the final rule is executed, the execution wraps back around to the first.
A variant on the cyclic version is to instead have each step start at the first rule. It checks to see if the first rule matches any of the text. If not, it proceeds to check the second, then the third, etc. Once a matching rule is found, its replacement is enacted, and the next step starts again from the first rule. This is implemented in REBEL and is similar to Markov algorithms.
Another option is to handle execution nondeterministically. Like in the Markovian version, rules can only perform replacement if they match the string. Unlike the Markovian version, the rules are unordered. In this version, all rules are taken at the same time. Much like other nondeterministic languages, a stochastic choice could be taken instead to prune the execution to one path. This variant is similar to unrestricted grammars, with the stochastic case being similar to the language Thue.
Finally, a variant where all rules execute at the same time could also be given. This is similar to L-systems.
Dialects
Literal
The most restricted dialect for the matching language is the literal language. It is not a metalanguage, it is unable to match strings other than itself. It has only two forms of expression:
s
match the character sab
match the expressiona
followed byb
The literal matching dialect is unable to model all regular languages.
Full replacement semantics would result in all strings, no matter their content, being replaced with a constant, fixed replacement string, or the empty string. This dialect with these semantics is therefore incapable of computation.
If we allow for multiple rules, but keep full replacement, then various cycling behaviors could be performed, depending on how the rules are handled. Still no computation.
If partial matches and multiple rules are allowed, then the Turing complete Markov algorithms, unrestricted grammars, and L-systems can be modeled, as well as cyclic rewriting systems (which are likely also Turing complete).
For partial matches with a single rule, it is unclear what types of computation can be modeled. The question is the same as asking what sorts of computation a Markov algorithm with a single rule can perform.
Basic
The basic dialect is a very restricted form of regular expressions which is the intersection of the typical mathematical definition, and basic POSIX regex. In addition to the above, the following are added:
e*
match the expressione
zero or more times(e)
increase precedence and capture the string that the expressione
matched
Additionally, the replacement string may include backreferences in the form \n
for some decimal number n
. Parenthetical groups capture the text they match from the input string, and store it in a string register. The register that a given capture group corresponds to is determined by the occurrence number of the opening parenthesis of the group. In the color-coded example (())()
, the red group is 1, the green is 2, and the blue is 3. A backreference in the replacement text will expand with the value of the string register. A group which does not match, for instance a|(b)
on the string a
, will be empty.
Semantically, quantification and matching in general is greedy. Thus, while (a*b*)*b
on the string aaabaab
could match the first aaab
, since it is greedy, it will match the entire string. This allows for deterministic execution.
A single rule with basic syntax and full replacement is Turing complete, see below. The cases for single rule partial replacement, and multiple rule full and partial replacement, are also complete. This means the rest of the dialects are complete as well.
Normal
The normal dialect adds alternation, which is commonly seen in mathematical definitions.
a|b
match the expressiona
or the expressionb
Alternation is simply equivalent in power but not semantics as star, e.g. a|b
will match at least the same strings as a*b*
.
Convenience
This dialect adds the following on top of the normal dialect:
.
wildcard or universal class, matches any character from the symbol set[chars]
character class, matches any of the characters listed, supports ranges[^chars]
anticlass, matches the opposite of what the class would have matchede{n,m}
range, match the expressione
between n and m many times (inclusive), also supports{n}
for exact,{n,}
for minimum, and{,m}
for maximume+
match the expressione
one or more timese?
match the expressione
zero or one times^
match the start of the input string$
match the end of the input string
Copy
The copy dialect allows for modeling (at least some) context-sensitive languages. Instead of backreferences only being allowed in the replacement, the expression itself can reference them. It adds:
\n
match the text stored in the nth string register
Selected programs
The copy dialect allows for a small expression of Collatz sequences:
(^(a)((aa)*)$)|(^(a*)\6$)/\3\2\3\2\3\2\2\6
The first part before the pipe matches a string that starts with an a and has an even number of as following, that is, an odd number. The second part matches a string of as which is followed by the same number of as, that is, an even number. The first part of the replacement is for the odd case and consists of 3 copies of \3\2, the length of \3 is 2k, so \3\2 is 2k+1. An additional \2 is added, so the first part has the effect of replacing an odd string with 3n+1. The second part is \6 and simply replaces the string with half of itself. Only one side of the alternation will be matched, so either \2 (and potentially \3) have contents, or \6 has contents, but not both.
Universal machines
Iterated regex find and replace is Turing complete, User:LyricLy has demonstrated this by implementing a universal bitwise cyclic tag interpreter in the construct. This means that a while loop checking a string for some property, with the body consisting of a single regex find and replace, is Turing complete.
The idea behind LyricLy's method is to separate the string into two components, a program section and a data section. A valid program section has the form <(0|10|11)*#(0|10|11)*>
, with the data being (0|1)*
and therefore a full state description being <(0|10|11)*#(0|10|11)*>(0|1)*
. The universal machine operates by moving the code pointer #
along the program, interpreting the entire word after it as an instruction. There are effectively 4 instructions:
>(0|1)*
the loop around instruction: this is just the marker for the end of the program, when this is encountered, the pointer must return to the start0(0|10|11)*>(0|1)(0|1)*
deletion: the(0|1)
immediately after the>
is deleted1(0|1)(0|10|11)*>1(0|1)*
append: the(0|1)
immediately after the first1
is copied to the end of the data string1(0|1)(0|10|11)*>(0(0|1)*|λ)
no-op: nothing occurs
The full match string first matches the program before the code pointer, then tries to match one of the instructions. Each instruction has its own associated capture groups, and therefore its own desired replacement string. The first form puts the pointer before the program. The rest move the instruction before the code pointer, and put (most of) the rest of the string back as is (with potential additions). Since these forms are placed in a large alternation, all groups will be empty except for the program, and the groups for the specific instruction. This means that the replacement strings for each instruction can be combined together into one.