Thupit

From Esolang
Jump to navigation Jump to search

Thupit is an esoteric programming language that has been used in numerous Turing-completeness proofs over the years, but apparently without ever being named or fully specified. It was finally named by User:ais523 in 2024, but had existed long before that time, and it is unknown which year it was first used.

It is a simplification/tarpitification of Thue.

Specification

A Thupit program consists of an initial string – a string of characters – and a set of rules, each of which is a pair of a search string and a replace string. Implementations can choose for themselves what set of characters are supported within the strings, as long as at least two different characters are supported (although if implementations are capable of supporting a wide range of characters, they should aim to support at least the entirety of printable ASCII except for possibly " and \).

Program execution starts by initializing a "working string" as a copy of the initial string, then repeatedly looking for the any of the search strings within the working string. If exactly one is found, it is replaced with the corresponding replace string (and then the interpreter goes back to looking for a search string within the working string). If none are found, the program halts. If the working string ever contains more than one search string, or contains more than one copy of any of the search strings, the program has undefined behaviour, allowing the interpreter to do anything without violating the specification. Additionally, it is undefined behaviour to enter a "trivial loop" in which the working string returns to a state it has previously been in – in this situation, obeying the language's normal rules would create an infinite loop in which the working string changed through the same sequence of states over and over again, but implementations may behave in arbitrary other ways if this happens (including but not limited to halting). This applies specifically to loops in which the state of the working string is exactly the same on every loop iteration; infinite loops that never exactly repeat the contents of the working string are permitted and do not cause undefined behaviour (thus, in that situation, the program does not halt).

A program is a mathematical object (a string, plus a set of pairs of strings) rather than a source file. However, if programs need to be stored in a file, it is recommended to use a JSON-like notation in which a string is enclosed in double quotes, pairs are surrounded by square brackets and separated with commas, and a set is also surrounded by square brackets and separated with commas.

Variations

A variation of the language, Blank Tape Thupit, has an infinite number of implicit "blank characters" (which an implementation might represent as, e.g., 0 or ) at each end of the working string; a blank character within a search string can match a blank character past the edge of the working string. In this version, all the search and replace strings must have the same length. (The main version of the language needs to support replace strings that are longer than the search string in order to provide a method of expanding the amount of available memory; but the blank tape version can do so by making a replacement that overlaps the edge of the working string.)

Computational class

With many symbols

With sufficiently many symbols, Thupit is Turing complete, because it's possible to almost directly compile a Turing machine that starts on a blank tape into Thupit. The basic idea is to use one character per Turing machine symbol (e.g. 0, 1, etc.) to represent tape cells that are not pointed to by the tape pointer (the character specifies which symbol the tape cell is storing), and one character per state-symbol pair to represent tape cells that are pointed to by the tape pointer (the character specifies both which symbol the tape cell is storing, and the state the Turing machine is in) – the example below uses lowercase letters (representing the state) for symbol 0 and uppercase letters (representing the state) for symbol 1.

It is then possible to compile the Turing machine's transitions almost directly into Thupit. Each triple of (s,f,t) = (Turing machine state, tape symbol at tape pointer, tape symbol that the tape pointer moves onto) compiles into a Thupit rule: if the Turing machine transition specifies moving right, then the search string is sft (where sf means "the character that represents the pair of state s and symbol f) and the replace string is nat, where n is the new symbol specified in the transition and a is the new state; if the Turing machine specifies moving left, the same construction is done mirrored, with a search string of tsf and replace string of atn.

The Turing machine's halt transitions do not produce any Thupit rules. It is clear that only one Thupit rule can match at any given time: rules can only match if they contain the correct state-symbol character that represents the Turing machine's current state and current symbol (and there is only ever one state-symbol character in the working string at a time), and each of those rules has that symbol in the same place and a different symbol beside it (thus meaning that it's impossible for two of the rules to match simultaneously). This construction is thus sufficient to prove Blank Tape Thupit Turing-complete, even when search and replace strings are limited to length 2, assuming that the emulated Turing machine never enters a trivial loop (see below).

For the primary (non-blank-tape) version of the language, it is possible to emulate an infinite blank tape by using one character (e.g. () to represent the infinite zeroes to the left of the tape, and another (e.g. )) to represent the infinite zeroes to the right of the tape. For each rule where t is the Turing machine's blank symbol, a copy of that rule is added where t is the end-of-tape symbol, but with a new end-of-tape symbol added beyond the replacement, e.g. if there is a rule ["a0","1b"], a tape-extension rule ["a)","1b)"] would also be added.

To complete the proof, it needs to be shown that Turing-machines are Turing complete even if a trivial loop is considered to be undefined behaviour. An easy way to prove this is to compile a Turing-complete deterministic reversible language into a Turing machine, in a way that preserves the reversibility (e.g. Reversible Bitfuck compiles directly, via using one Turing machine state for each possible instruction pointer location in the Reversible Bitfuck program); for a deterministic reversible language that can detect an attempt to reverse past its startup state, it is impossible for the language to ever enter a trivial loop because it would be impossible to reverse back past the point at which the loop was first entered, but the startup state cannot be part of the loop.

Here's an example of a Turing machine (the 2-symbol 4-state busy beaver winner) compiled into Thupit:

[["a0","1b"],["a1","1B"],["a)","1b)"],
 ["0b","a1"],["1b","A1"],["(b","(a1"],
 ["d0","1d"],["d1","1D"],["d)","1d)"],
 ["0A","b1"],["1A","B1"],["(A","(b1"],
 ["0B","c0"],["1B","C0"],["(B","(c0"],
 ["0C","d1"],["1C","D1"],["(C","(d1"],
 ["D0","0a"],["D1","0A"],["D)","0a)"]]
"(a)"

With two symbols

Any Thupit program can be compiled into one that uses only two symbols, meaning that because it is Turing-complete with sufficiently many symbols, it must also be Turing-complete using only two symbols.

There are multiple constructions that work; this section discusses one that works in both the main and blank-tape versions of the language. The basic idea is to represent each symbol from the source program as 0001…000 in the compiled program, where the is made by concatenating a constant number of 001 and 011 substrings. For example, if there were four symbols in the source program, they would be represented in the compiled program as 0001001001000, 0001001011000, 0001011001000, and 0001011011000. This guarantees that rules made out of compiled symbols can only match at symbol boundaries (because the substrings 0001 and 1000 only ever appear at the start or end of a symbol, respectively). For a blank-tape construction, the blank symbol can instead be represented as a number of 0s of the correct length (e.g. 0000000000000) – as long as at least one non-blank symbol appears in the rule, that will enforce that the rule matches only at symbol boundaries (and a blank-tape program cannot have a rule whose search string consists entirely of blank symbols, as it would match infinitely often, causing undefined behaviour due to the multiple matches).

External resources

See also