Esimpl

From Esolang
Jump to navigation Jump to search

Esimpl is an esoteric programming language created by User:ais523 in 2022, with significant changes in 2023. Its intended purpose is as an intermediate representation for compilers and interpreters related to low-level esolangs (and Turing tarpits in particular), aiming to be powerful enough to express the typical low-level esolang easily, whilst sufficiently constrained to be easy to implement in esolangs itself. Efficiency was also a consideration in designing the language: many languages used for proving Turing-completeness, such as tag systems, tend to emulate programs written in other languages very slowly (e.g. the standard proof of Turing-completeness for tag systems incurs an exponential slowdown); Esimpl was designed to incur only a linear slowdown when compiling, e.g., brainfuck or a Turing machine into it.

The main inspirations for Esimpl were Advance The Wheel! and cyclic tag, but the resulting language ended up more like a generalized version of StackFlow. As for its name, it can be seen either as an abbreviation of "esoteric implementation", or as a cyclic anagram of "simple".

Data storage

A running Esimpl program stores its data in a number of semideques (technically an output-restricted deque: see wikipedia:Double-ended queue#Distinctions and sub-types). A semideque is a type of list/array/vector which can only be modified via three basic operations: adding one or more elements at the start; adding one or more elements at the end; or removing one element from the start (and branching based on which element was removed). Both queues and stacks are special cases of semideques (you can specialise a semideque into a queue via removing the "add at start" operation, or into a stack by removing the "add at end" operation). The elements stored in a semideque are non-negative integers; in theory these are unbounded, but because all integers added to a semideque must appear literally in the program, any given program will have some maximum for the elements it uses (and it is fairly easy to rewrite an Esimpl program so that it uses only 0 and 1 as semideque elements). Each program uses a fixed number of semideques, named using numbers ("semideque 0", "semideque 1", and so on); these are consecutive nonnegative integers starting from 0.

In addition to the semideques, an Esimpl program also has an instruction pointer, as usual for imperative programming languages; commands in an Esimpl program are grouped into stanzas, and the instruction pointer points to stanzas as a whole rather than individual commands. The instruction pointer does not change implicitly; each stanza must contain a control flow instruction (usually a pop-goto instruction) that specifies which stanza should run next.

An Esimpl program starts by specifying how many semideques it contains and their initial state, and also which stanza runs first. (The program then continues by specifying the commands that make up the program, and how they are arranged into stanzas and tables.) This specification is syntactically similar (but not identical) to a stanza, and known as "stanza 0",

Program structure: commands, stanzas, tables

An Esimpl program is arranged hierarchically in three levels: apart from stanza 0, the program is made of a number of tables, each of which consists of one or more stanzas, each of which consists of one or more commands. (These are non-overlapping – any given command belongs to exactly one stanza, and any given stanza (other than stanza 0) belongs to exactly one table.)

Tables are similar to the bodies of the "case"/"switch" statements seen in many practical imperative languages: rather than "if" statements that go to one of two places depending on whether a boolean is true or false, Esimpl's conditionals jump to one of n places depending on the value of an integer: if the integer is 0, the first stanza in the table runs, if the integer is 1, the second stanza in the table runs, if the integer is 2, the third stanza in the table runs, and so on. Each table is "linked to" either one particular semideque, or to user input (with this link specified in the program), and conditionals that use that table must branch based on an integer popped from that semideque (or from user input). Tables do not have names of their own, but rather are named using the number of the first stanza they contain (this means that only a subset of stanza numbers are used to name tables, so table names tend to be discontinuous).

Stanzas are blocks of "non-interfering" commands (in the sense that the commands within any given stanza could be run in any order and would produce the same result regardless of the order). Stanzas are implicitly numbered based on their location in the program: the first stanza after stanza 0 is numbered 1, the next 2, and so on. Stanzas that belong to the same table must appear consecutively in the program (and the order in which they appear determines which integer branches to which stanza), but the stanza numbering otherwise ignores the table structure (e.g. if a program consists of two tables, one of two stanzas and one of three stanzas, the first table will have stanzas numbered 1 and 2 and the second table will have stanzas numbered 3, 4, and 5). Stanzas within tables linked to user input are not legal targets for the goto command, and program execution cannot start at such a stanza either (only at a stanza within a table linked to a semideque).

Each stanza is formed of two parts:

  1. A possibly empty sequence of data instructions, which specify elements that should be added to semideques (at the start and end respectively) or output to the user;
  2. A control instruction, which optionally pops from a semideque or takes input from the user, and then (typically based on the value popped/read) specifies the stanza that should be run next (or that the program should halt).

It is undefined behaviour if a control instruction attempts to transfer control to stanza 0, or to a stanza past the end of the program, or to pop from an entirely empty semideque. It is also undefined behaviour if two data instructions within the same stanza attempt to produce output, or to push to the same end of the same semideque, or to push to the start of a semideque when the control command pops from the start of that semideque. (These restrictions mostly ensure that the order within which the commands within a stanza run does not matter; conceptually they all run concurrently / in parallel, although in practice most implementations will run them sequentially.) However, a no-op push does not conflict with a pop (you can push no elements to a semideque, and pop from it, in the same stanza), although it does conflict with other pushes to the same end of the same semideque.

Commands

The data commands are as follows:

push
Pushes zero or more elements onto the start of a given semideque: the list of elements to push is hardcoded as part of the push command, as is the order that they are pushed in. (Pushing no elements onto a semideque is allowed, but does nothing.) Because elements are popped only from the start of a semideque, this is a "stack-style" push.
pushback
Identical to the push instruction except that the elements are pushed onto the end rather than the start of the semideque (making this a "queue-style" push).
output
Pushes zero or more elements onto the back of a conceptual "output queue"; each of these elements must either be the integer 0 or the integer 1 (a program that tries to push anything else onto the output queue is syntactically invalid). Whenever the output queue contains one or more 1s, the prefix of the output queue that starts at its front and ends at its first 1 inclusive is popped from the output queue, and then a byte is output whose codepoint is the number of 0s popped this way. (Popping no 0s, i.e. just a single 1, is legal and produces a NUL byte. Popping 256 or more 0s simultaneously is undefined behaviour. To enable Esimpl to be implemented in languages that do not have binary I/O, implementations are not required to be able to output bytes with values above 126 correctly, although they should aim to implement all 256 byte values if doing so is possible in the language in which they are implemented.)

The control commands are as follows:

goto
Transfers control flow to the specified stanza. This can be any stanza that's within a table that's linked to a semideque. This command cannot be used to transfer control to a stanza within an input-linked table, nor to stanza 0 (attempting to do either of those goto operations is undefined behaviour). The syntax for the goto instruction specifies which semideque the table containing the stanza is linked to (this specification causes undefined behaviour if it incorrectly fails to match the table, but otherwise has no effect).
pop-goto
Pops a value from a specified semideque, then transfers control to a stanza within a specified table (with the table being specified using the number of the first stanza it contains). The specified table must be linked to the specified semideque, or this is undefined behaviour. If the value 0 is popped, the first stanza in the table runs; if a 1 is popped, the second stanza in the table runs; and so on.
input-goto
Attempts to pop a value from a conceptual "input queue", which is initially empty. Once the value is popped, transfers control to a stanza within a specified table, much the same way as with pop-goto (the table must be linked to input rather than to a semideque). If the input queue is empty, then before popping a value from it, one byte of user input is read, and pushed onto the back of the input queue in the same format as used for output (i.e. if a byte with codepoint n is read, the input queue is filled with n 0s followed by a 1). There is no requirement for implementations to be able to handle EOF conditions – Esimpl implementations are allowed to halt at end of input, or read garbage data – but for implementations that can reliably handle EOF, they should act as though a 2 was popped from the input queue if an attempt to refill it by reading input failed due to an EOF condition.
halt
Halts the program.

Esimpl implementations are encouraged to statically verify that control instructions will never overflow past the end of a table (i.e. each table contains at least enough stanzas to cover any value that could possibly be popped from the semideque it's linked to, which can be determined via verifying which values are pushed onto or start on that semideque; and input-linked tables contain at least three stanzas), and to reject any program for which such an overflow is possible; likewise, they are encouraged to statically verify that control instructions correctly specify a table (or stanza within a table) which is linked to the semideque they specify/pop from (or to input, if they pop from there). It is not a requirement for implementations to perform this verification (and programs which would fail the verification cause undefined behaviour at runtime if they do end up overflowing a table or transferring control to a table which is linked to the wrong thing).

Syntax

Esimpl supports two main syntaxes, a text syntax and a binary syntax. The binary syntax is designed primarily so that it requires very little code to convert into a convenient form for interpreters to use as an internal format for storing an Esimpl program, rather than being designed for efficiency.

Text syntax

In the text syntax, a command is written as the semideque being operated on (or, in the case of a goto instruction, that the target stanza is linked to), then a space, then the type of command, then a space, then the data (space-separated if the data is a list). Semideque numbers, stanza numbers, and data to be pushed onto semideques are expressed in decimal. For example, a command that pushes 1, 2, and 3 onto semideque 4 would be

4 push 1 2 3

When pushing multiple data onto a semideque at once, the list is ordered such that the first datum listed is the first that will be popped (i.e. in order to write the above command as three separate pushes in three separate stanzas, you would have to push 3, then 2, then 1; but to split up 4 pushback 1 2 3, you would pushback 1, then 2, then 3). Output is similar (but has no stanza number), e.g. to output a newline, you would write

output 0 0 0 0 0 0 0 0 0 0 1

The I/O commands, and halt, do not specify a semideque number.

Each command's name can be abbreviated to a single letter, if desired (although writing out the command's name in full is also legal). Here are the commands' abbreviated and full names:

o output
p push
q pushback

g goto
h halt
i input-goto
j pop-goto

As such, a complete stanza might look like this:

0 p 1 0 2 0
0 q 1 0 3 0
1 q 1 1 1
o 0 0
2 j 100

Commands must be separated by at least one newline; blank lines are permitted, as are comments (which run from # to a newline). Newlines cannot be placed within a command. A stanza's control command is always written after its data commands, never before them.

Stanzas are written in order, and do not need explicit separators (because the control commands uniquely identify the ends of stanzas). Tables, however, do have explicit separators; after the last stanza of a table (or after stanza 0) and before the first stanza of the next is a separator line of the form iotable or number table (to specify that the table immediately following the separator line is an input-linked table, or a table linked to the semideque numbered number, respectively). table and iotable can be abbreviated to t and u respectively.

Stanza 0 appears at the start of the program, and is formatted much as though it were an actual stanza, consisting of a list of "commands": for each semideque used by the program, its initial contents are specified in the form of a push command (which could be a no-op push specifying zero elements if the semideque is supposed to start out empty); and the stanza at which program execution starts is specified in the form of a goto command (listing the semideque to which its table is linked, as usual). The only syntactic difference from a "normal" stanza is that it does not belong to a table, and thus is not preceded by a table separator line.

Binary syntax

The binary syntax for Esimpl is intended for ease of parsing in low-powered esolangs (rather than for efficiency, like most binary formats are). As such, it uses only bytes with small values, and numbers are expressed in unary. Stanzas are "normalized" so that they contain a push and a pushback instruction for every semideque (which pushes or pushbacks an empty list if the stanza would not otherwise add to that semideque).

The binary representation of a stanza consists of the following sections (which appear in this order within the representation):

Table separator
Before the first stanza of each table comes a 0x0A byte. This is omitted for stanza 0 (because it does not belong to any table), and does not appear in the second or subsequent stanza of any table (its purpose is to identify which stanzas belong to which tables).
Table link information
The stanza starts with a sequence of bytes that specify which semideque the stanza's table is linked to (this information is duplicated for every stanza within the table). This consists of 0x05 bytes followed by 0x04 bytes: the number of 0x05 bytes specifies the semideque's number, and the section is then padded with 0x04 bytes until the length of the section is exactly equal to the number of semideques used by the program. (Thus, if there were three semideques, 04 04 04 specifies that the stanza's table is linked to semideque 0, 05 04 04 to semideque 1, and 05 05 04 to semideque 2.) For a stanza in an input-linked table, this section still has the same length, but consists entirely of 0x05 bytes (conceptually linking to a semideque "beyond the end" of those that the program uses). This section is entirely omitted in stanza 0.
Note that this section is purely about which semideque this stanza's table is linked to (as opposed to the rest of the encoding, in which everything that cares about table links cares about which semideque this stanza's control command's target's table is linked to).
Per-semideque data
The stanza continues with the data pushed onto semideque 0, then an 0x03 byte, then the data pushbacked onto semideque 0, then an 0x02 byte, then the data pushed onto semideque 1, then an 0x03 byte, and so on – this "pushing section" ends with the 0x02 after the data pushbacked onto the last semideque. The data itself is specified in a similar format as that used for I/O – a datum with value n is specified as n 0x00 bytes followed by an 0x01 byte (and as with the text format, the values are ordered such that the value that appears first is the value that will be popped first). For stanza 0, the encoding is slightly different: the initial data of semideque 0, then 0x02, then the initial data of semideque 1, then 0x02, and so on, ending with the 0x02 after the initial state of the last semideque (the difference from normal stanzas is that the 0x03 bytes are missing).
This section also encodes the stanza/table number for goto/pop-goto/input-goto instructions, by adding additional bytes unrelated to the push and pushback commands:
  • For goto, the section encoding the "push" command for the semideque to which the goto target's table is linked is modified by prepending the representation of an additional pushed element, whose value is the stanza number of the goto target. (For example, 2 push 1 2 would normally be encoded as 00 01 00 00 01 03, but if the stanza ended with 2 goto 3, the "push" section for semideque 2 would instead be encoded as 00 00 00 01 00 01 00 00 01 03.)
  • input-goto is basically the same as goto, but in this case, the data is prepended to the start of the per-semideque data (i.e. the "push" section for semideque 0) because the target table is linked to user input, rather than to a semideque.
  • For pop-goto, the "push" command for the semideque to which the target table is linked (which should be just 03, because a stanza can't contain a non-empty push and a pop aimed at the same end of the same semideque) is modified by prepending a number of 0x00 bytes within it equal to the stanza number of the first stanza in the target table. (Unlike a goto command, this encoding does not have a trailing 0x01 byte – this is the only difference in encoding between a pop-goto instruction, and a goto instruction which happens to target the first stanza of a table.)
Output instruction
After the per-semideque data comes a (possibly empty) sequence of bytes that specifies the stanza's output instruction. (The syntax does not distinguish between no-op output instructions that push an empty list onto the output queue, and the output instruction being entirely absent, as these are clearly equivalent.) In this sequence, 0x06 specifies a 0 being pushed to the output queue, and 0x07 specifies a 1 being pushed to the output queue; the total number of 0x06s and 0x07s is equal to the number of elements pushed onto the output queue by the output instruction.
Control instruction
The encoding of a stanza ends by encoding its control instruction, as follows:
  • For control instructions with a semideque number n (goto and pop-goto), that semideque number (and the fact that a goto/pop-goto command is in use) is encoded using an 0x09 byte (exception: in stanza 0, an 0x0D byte is used instead of the 0x09 byte), followed by n repeats of the two-byte sequence "0x03 0x02", finishing with an 0x08 byte. (The stanza/table target of the control instruction is not specified here, but rather in the per-semideque data; likewise, the distinction between goto and pop-goto is specified via whether the relevant part of the per-semideque data is omitting a trailing 01 or not, rather than being made here.)
  • For an input-goto instruction, this section consists of a single 0x0B byte. (The table targeted by the input-goto instruction was specified earlier, by pushing its number onto semideque 0.)
  • For a halt instruction, this section consists of a single 0x0C byte.
End of program
At the end of the program comes an 0x0E byte. Anything after this in the input file is not part of the program, and an implementation does not need to read it (but may choose to read and discard it if it wishes). Implementations that read the program from standard input should consider anything after the 0xE byte to be part of the program's input rather than part of the program itself.

Computational class

Esimpl is obviously Turing complete, because many Turing-complete languages can be compiled into it more or less directly. The most obvious is StackFlow, which can be defined as the special case of Esimpl in which pushback instructions are never used, and in which there is only one table linked to each semideque, plus a couple of restrictions to statically ensure that there's no undefined behaviour. The Waterfall Model also compiles into Esimpl directly, because it can be compiled directly into StackFlow. Turing machines can be compiled into Esimpl simply by splitting their tape into two stacks (causing every Turing machine state/symbol combination to be directly expressible as an Esimpl stanza, and every state to be directly expressible as an Esimpl table). Bignum brainfuck without negative-valued cells also has a fairly simple compilation into Esimpl, via representing the tape as two stacks, expressing the stack elements in unary with separators between them; if you do that, every brainfuck instruction except < and > then directly compiles into an Esimpl stanza (with < and > compiling into simple loops). Brainpocalypse can be compiled into Esimpl using a similar construction.

When compiling a Turing machine into Esimpl, the above construction incurs only a linear slowdown, meaning that Esimpl can be efficient in addition to being general. The pushback instruction is not used in this construction, so it can be removed from the language (making it fully stack-based) whilst still maintaining the same speed. (In fact, any Esimpl program can be modified to remove pushbacks via replacing one semideque with two stacks, which is a very similar construction to the implementation of a queue in terms of two stacks; in the general case, this makes the program a lot longer, but nonetheless incurs only a linear slowdown.) It is also possible to use binary rather than unary encodings of numbers, in order to gain similar efficiency when compiling number-based languages.

Esimpl is still Turing-complete if the push instruction is removed (except in stanza 0), making it a fully queue-based language; this variant of the language can, e.g., implement a tag system directly. When doing this, some efficiency is lost – after storing data on a semideque, it then takes time proportional to the size of that semideque to retrieve it again, which means that some programs that can be implemented in linear time on a Turing machine will have quadratic performance in queue-based Esimpl. (One example is a program that inputs an arbitrary string, and outputs that string in reverse.) However, all purely queue-based languages for which queue elements are drawn from a finite set share this restriction, and it is much easier to run Esimpl at the "queue-based speed limit" than it is for most queue-based esolangs (e.g. simple Turing-completeness constructions for tag systems run in exponential time, because control flow for tag systems is based on a sum of all elements in the queue rather than on elements in specific positions, and this makes it hard to store data other than with counter-like representations). This is the main reason why Esimpl has the otherwise redundant pushback instructions – it allows it to serve as an efficient intermediate representation when compiling into queue-based languages.

Esimpl is also still Turing-complete, with only a linear slowdown, using only 0 and 1 as elements of the semideques. The basic idea is to replace each stanza with two: the second stanza of the pair corresponds to the stanza in the original program, whereas the first stanza of the pair consists of a single pop-goto instruction (specifying the first stanza of the next pair, and the same semideque that was popped to reach it). The data in the stanzas, and in push and pushback instructions, is now encoded in Esimpl's normal modified unary (n 0s followed by a 1), and the resulting program will behave identically to the original.

It is also possible to write an Esimpl program to use only a single semideque (unfortunately, with a loss of performance). The general idea is to store all the semideques in a single queue, using an otherwise unused value as a separator, and split stanzas up into sections that read/write data separated by loops that rotate the queue to move the semideque you want to operate on to the start or end as appropriate.

See also

External resources