The Subtyping Machine
The Subtyping Machine is an example of a "naturally occurring" esoteric programming language. The first known implementation was created by mistake in 2004, with the release of Java 5; but the language's existence was not publicly noted until 2017, by Radu Grigore, who named it, proved it Turing-complete and provided a less cumbersome syntax for it. User:ais523 adapted this syntax to be machine-readable, more regular, and more precisely specified, in addition to writing an interpreter, and a compiler into the original syntax that enables the language to be run on a Java interpreter. The language has some similarities to StackFlow, but is sufficiently different to not look like a derivative.
Syntax
A program in The Subtyping Machine consists of an initial state, plus a number of rules.
An identifier follows the same syntax as a Java identifier, except that the first character must not be _
or x
: these leading characters are reserved for the implementation. Identifiers are case-sensitive, and in particular may start with X
.
A rule consists of a match and a replacement. A match consists of two identifiers separated by <
, e.g. foo<bar
. A replacement consists of a sequence of identifiers followed by >
. A rule is written as a match, followed by =
, followed by a replacement.
An initial string consists of a sequence of identifiers, followed by <
, followed by another sequence of identifiers.
Horizontal whitespace is allowed anywhere except within an identifier, and mandatory between two consecutive identifiers within the same sequence. Each rule must be a single line, and a line of its own; likewise, the initial state must be written on a single line, that isn't shared with anything else. Comments run from #
to the end of the line, and have no influence on the operation of the program. Blank lines (and lines containing only comments) are permitted, and have no effect.
There are actually two possible syntaxes for a match, replacement, or initial string. One is as described above. The other possible syntax is produced via reversing the order of all the identifiers and </>
mark within the original string, and replacing <
with >
or vice versa. For example, the rule foo<bar = foo baz baz quux>
could also be written as bar>foo = <quux baz baz foo
. (Although it is legal to reverse the match of a rule but not the replacement, or vice versa, doing so tends to be confusing, because it does not change the actual semantics of the program even though it looks like it does. Readable programs will normally have the punctuation marks in the match and replacement pointing in opposite directions.)
Well-formedness rules
For the translation into Java to work correctly, the following rules must be followed:
- Each rule's replacement must contain an even number of identifiers.
- The initial state must have an odd number of identifiers to the left of the
<
, and an even number of identifiers to the right of the<
. - No identifier may appear on both the left of a
<
within a rule match, and the right of a<
within a rule match. (It is legal, and common, for the same identifier to appear in a large number of rule matches, but it must appear on the same side of the<
every time.)
All these rules are based on the meaning of the code, not the syntax, e.g. the rule matches a<b
and a>c
are incompatible because a
appears on both sides of a <
/>
, the pointy side in each case.
It is considered good style, although not required, for programmers to use different cases for identifiers that can appear on the broad side of a >
/<
in a rule match, and on the narrow side of a >
/<
in a rule match, to avoid accidentally mixing them.
Semantics
A running program in The Subtyping Machine has a state consisting of two sequences of identifiers, separated by <
or >
. (This is reversible in the same way that the syntax is.) These two sequences act very similarly to a tape of symbols (although unlike the tape in a language like brainfuck, it can change length during execution), or can alternatively be seen as stacks (with the top of the stack being the point next to the <
/>
). Steps of execution consist of "rewriting the tape around the <
/>
", or equivalently as popping the tops of the stacks and pushing a sequence to replace them.
There are two possible replacements:
- If the same identifier exists on both sides of the
<
/>
, the identifier is deleted from both sides and the<
/>
reverses. For example,a b c<c a
will be replaced bya b>a
. This general rule applies regardless of what rules exist in the program. - If a different identifier exists on each side of the
<
/>
, a search will be made in the program for a rule that matches. The rule's match has to specify identifiers in question (i.e. the top of both stacks), with the<
/>
pointing in the same direction. For example, witha b>a
, a matching rule would require a match ofb>a
(which might be written as the equivalenta<b
). If such a rule is found, the matched section of the tape will be replaced by the rule's replacement, oriented so that the<
/>
now points in the opposite direction.
In every case where a replacement occurs, the <
/>
will reverse direction.
If a replacement cannot occur, the program terminates. Programs have an exit code, of "success" or "failure", chosen according to the situation at the time of termination:
- If an identifier exists on each side of the
<
/>
, but there is no matching rule, the program exits in failure. (This includes the situation in which a rule almost matches, but the<
/>
points in the wrong direction.) - If an identifier exists on only the narrow side of the
<
/>
(with the sequence on the broad side being empty), the program also exits. This situation is usually a success exit; however, if the identifier touching the narrow side of the<
/>
in the state appears on the broad side of the<
/>
in any rule, it is a failure exit instead. - It is not possible for a well-formed program to have no identifier on the narrow side of the
<
/>
(as this always has an odd number of elements). The language could be extended to allow ill-formed programs, however; in that case, this program is treated as a failure exit unless there is no identifier on the broad side either, in which case it is treated as a success exit.
Implementations are allowed to try to detect infinite loops, but do not have to. If an implementation does detect an infinite loop, it may immediately exit the program in failure.
Example
Here's a small, contrived, example program:
# comment A> d = <s X B s X d B> d = < # another comment d X s A> d X d X d X s
and an example of the steps it goes through when run:
d X s A>d X d X d X s d X s<s X B s X d X d X d X s d X>X B s X d X d X d X s d<B s X d X d X d X s >s X d X d X d X s
This program then terminates in success (s
does not appear on the broad side of the <
/>
in any rule match).
Computational class
The Subtyping Machine's semantics are easily powerful enough to construct a two-counter machine, a two-stack machine, or a tape machine; often multiple steps are required to accomplish anything, due to the inability to write to both sides of the <
/>
and the fact that it keeps changing direction, but it's easy to maintain a symbol a short distance on the broad side of the <
/>
to remember what state you're in, and to use it like an instruction pointer.
The original Turing-completeness proof, by Radu Grigore, used a Turing machine as the language that was compiled into The Subtyping Machine.
Relationship to Java
As is clear from the sometimes weird restrictions observed in The Subtyping Machine, it was based on the semantics of a subset of an existing language, in this case Java's type system; a run of a program in The Subtyping Machine can be seen as an attempt to type-check a Java program (specifically, to determine whether one Java type is a subtype of another Java type, thus the name). This means that The Subtyping Machine can be compiled into Java in such a way that the process of attempting to compile the Java program will act as an interpreter for The Subtyping Machine; in other words, it runs its code at Java's compile time. In this context, a success exit relates to a program that type-checks successfully; a failure exit to a program with a type error.
The language is derived from a particular fragment of Java's type system: specifically, the fragment of contravariant generic interfaces with one (or no) generic parameter and no methods. This is easiest to understand from an example. Here's what the example program above looks like when compiled to compile-time Java:
interface xx {} interface A<x> {} interface B<x> {} interface d<x> extends A<s<? super X<? super B<? super s<? super X<? super d<x>>>>>>>, B<x>, xx {} interface X<x> extends xx {} interface s<x> extends xx {} class x { d<? super X<? super d<? super X<? super d<? super X<? super s<xx>>>>>>> xc; A<? super s<? super X<? super d<xx>>>> xd = xc; }
In order to type-check this program, the Java compiler must determine whether the assignment xd = xc;
is legal, and thus whether a value of xc
's type can necessarily be stored in a variable of xd
's type: it needs to check for a subtyping relationship. If the "outermost type wrappers" of the two types we're checking are the same, then the Java compiler can simply discard them and keep looking. Otherwise, it needs to look for an extends
relationship on xc
's type to make it match xd
's – the equivalent of a rule lookup – and the generic constraint within the extends
clause effectively causes a replacement (which could be much longer than the original type). Because all the types are contravariant, the direction in which the subtyping relationship is checked reverses at every step.
External resources
- The original paper about the language
- Interpreter, also compiler to Java; interprets by default (with no output other than exit status); use
-g
for debug output, or-j
to compile to Java rather than interpreting