Precognition

From Esolang
Jump to navigation Jump to search

Precognition is an esoteric programming language, which was created by User:ais523 in 2018 upon realising that several partially-completed languages were converging to something of a similar point. It thus serves a wide range of purposes, being usable as anything from a simple text preprocessor to a powerful declarative language. The language intentionally has few commands, but is not a Turing tarpit (in that it's meant to be clear what commands you could use to write any given code, and you don't have to, e.g., invent arithmetic yourself to write in it).

The language is basically just a set of find-and-replace rules, somewhat comparable to regexes (i.e. the "generalised" regular expressions that are found in many mainstream programming languages, and that have features that mathematical regular expressions don't have). However, the language is non-deterministic in the mathematical sense, having a similar relationship to imperative languages as NP problems do to P problems (in the "P vs. NP" sense). In other words, a Precognition program can provide multiple possible rules that could match in any given situation, or multiple plausible replacements, and the interpreter will "magically" / "precognitively" pick whichever one happens to make the program work. (The simplest way to implement this is to run all the possibilities in parallel until one of them happens to find a solution, then discard all the others. Implementations are free to choose a more efficient algorithm if they wish.)

Data storage

At any given point in time, the state of a Precognition program is entirely captured by three data: a) which part of the program is currently executing (i.e. the instruction pointer); b) the data string; and c) which elements of the data string were created as a result of a replacement this round (and thus are not eligible to be replaced a second time within the round).

The most significant and interesting part of this is the data string, which is where an executing program will typically store all the data it needs. This is a potentially unbalanced s-expression, which is an ordered list for which each elements can be a lexeme or a groupeme.

The lexemes are drawn from an arbitrary set, the lexeme alphabet, for which all elements are distinguishable (i.e. for any two lexemes, you can tell if they're the same or not). Implementations are free to define their own lexeme alphabets, but typical sets would consist of characters in some encoding (e.g. printable ASCII characters), which are general-purpose but usable for communication with a user, plus a few internal-use lexemes that are not used for communication with a user. The lexeme alphabet must, at least, contain some method of representing numbers (e.g. via writing them in decimal). In this specification, lexemes will be represented as letters and decimal digits; English letters for regular lexemes, non-English letters for internal-use. Lexemes correspond to the "atoms" found in languages such as Lisp.

There are two sorts of groupeme; an opening groupeme and a closing groupeme, with all opening groupemes being equal to each other, and likewise all closing groupemes being equal to each other. These will be represented as [ and ] in this specification. (Again, implementations are free to define their own representation of the two groupemes; there's normally no reason to disqualify them from appearing in I/O done with the user, so they'll need some representation when that happens. A common choice would be for [ and ] to be groupemes and the rest of printable ASCII to be lexemes.)

Groupemes are bracket-like. A potentially unbalanced s-expression is defined recursively as balanced (i.e. just an s-expression string) if it consists entirely of parsemes, where a parseme is defined as either a lexeme, or (an opening groupeme followed by a balanced s-expression string followed by a closing groupeme). A groupeme that is not part of any parseme is considered unmatched.

Precognition has a dual view of its data string. It can be seen either in terms of its parsemes and unmatched groupemes, or as a list of lexemes and groupemes (without trying to match them). Thus, it's normal to be able to treat parsemes as a unit, but also normal to be able to replace groupemes individually, "re-bracketing" the surrounding parts of the data string and causing different parts of it to start matching each other.

Syntax

Precognition programs are strings written in a similar alphabet to that of the data string, e.g. the program can contain lexemes and groupemes (even unmatched groupemes). However, these are only used to represent literals. The actual commands that operate on these literals are a separate alphabet of two metagroupemes and fifteen metalexemes that are distinct from the groupemes and lexemes that make up the data; as such, the alphabet for the program as a whole is larger than that for the data that's being operated on. (This is why a Precognition implementation would typically use printable ASCII for the lexemes; it frees up the control codes, which are unprintable, for use as metalexemes.)

A Precognition program consists of a number of parts, optionally followed by an initial value for the data string. Each part in the program runs in sequence, but the parts are written in reverse order; the last part runs first, then the second-last part, etc., until eventually the first part has run and the program exits. There's no control flow from one part to the next; once a part has finished running, it's finished. (The intended metaphor is that a Precognition program can use another Precognition program to produce its initial data string; the first part of the program first runs the entire rest of the program to find its initial data string, then it can run. This naturally leads to the parts running in the opposite order to the order in which they are written. Historically, this mechanism was written to allow you to prepend a Precognition program to a program in a different language, to act as a preprocessor, and have things work even if a second preprocessor was already in use with the same program.)

There are two types of part; a Type I part and a Type II part. They are very similar in behaviour; the only difference (although it can be a pretty major difference!) is that a Type I part only runs for one round and then stops, whereas a Type II part runs rounds repeatedly until it can make no further changes. Parts are separated from each other and from the initial data string (if one is present) using the "part break" metalexeme (written as ; in this specification); a type I part is followed by two part breaks, a type II part by only a single part break. A final part, with no part break, is treated as type II.

Each part, in turn, consists of one or more rules. A rule has two sections: a match and a replacement. In the simplest case, these are both literals (i.e. strings of lexemes and groupemes). In more complex cases, these can contain metalexemes as a kind of equivalent to regex metacharacters: they allow a match or replacement to correspond to many possible (or even infinitely many possible) literals, via describing the whole set of possible literals rather than simply giving a single element directly. As such, the use of metalexemes makes it possible for a part to, effectively, contain rules generated according to a particular scheme, perhaps infinitely many such rules. (For example, the rule x'y:x''y, which contains the metalexeme ', would effectively be equivalent to the infinite set of rules xy:xy, xay:xaay, xby:xbby, …, xzy:xzzy, x[]y:x[][]y, xaay:xaaaay, …, because ' is defined as "any balanced s-expression string, but always the same string within any given rule".) Rules are separated via a "rule break" metalexeme; the match and replacement of a rule are separated via the same character. (A part must therefore contain an odd number of rule breaks. This fact is used to distinguish a part from an initial string, which would contain zero rule breaks.)

Metagroupemes are comparable to groupemes; they introduce a hierarchical structure within matches and replacements, just like groupemes can introduce a hierarchical structure within the data string. (In this specification, the metagroupemes are written as ( for opening and ) for closing.) Unlike groupemes in the data string, metagroupemes in the program must be matched (failing to do so is syntactically incorrect). The purpose of metagroupemes is to define the scope of metalexemes (e.g. many metalexemes apply only to the preceding program element, unless it's another metalexeme or a metagroupeme-balanced group).

There is no requirement for groupemes in the program to be matched. However, if a groupeme within a single match or replacement happens to be matched with another groupeme (rather than unmatched), they imply metagroupemes just outside the groupemes; a rule that looked like a[b]c[:[d] would be interpreted as a([b])c[:([d]). This saves typing in the common case where the structure of a rule and the structure of the data it operates on are similar. This matching is blocked by intervening unmatched metagroupemes; [(]) is interpreted as having no matched groupemes (although the two metagroupemes match each other), as the ( is unmatched between the [ and ], despite being matched in the context of the string as a whole.

Metalexemes

Here are the metalexemes that exist:

part break ;
Separates parts from each other, and from the initial string. Can be doubled to force the preceding part to be interpreted as type I.
rule break :
Separates rules from each other, and the match of a rule from its replacement.
element .
Represents any single lexeme or groupeme (but in any given execution of a rule, always the same lexeme/groupeme, in both the match and the replacement). This includes both matched and unmatched groupemes.
string 1 ', string 2 ", string 3 _
Each of these represents any balanced s-expression string, i.e. zero or more parsemes. However, in any given execution of a rule, each use of string 1 will always represent the same string, in both the match and replacement; likewise for string 2 and string 3.
parseme 1 -, parseme 2 =
Each of these represents any single parseme. In any execution of a given rule, each use of parseme 1 will always represent the same parseme, in both the match and replacement; likewise for parseme 2.
high-precedence or !, low-precedence or |
These metalexemes are used to write a rule with two possibilities; either the text to the left of the !, or the text to the right of the !, matches or is used as the replacement (likewise for |). A high-precedence or binds as tightly as possible; normally just a single lexeme, metalexeme, or metaparseme on each side of it (with metalexemes that modify something, such as +, it's forced to bind to both the metalexeme and the thing it modifies). A low-precedence or binds as loosely as possible, being stopped only by the metaparseme pair that contains it or by the edge of the match or replacement.
also &
This parses the same way as a low-precedence or, but has the opposite meaning: instead of generating two rules (one which matches/replaces the text on the left, and one which matches/replaces the text on the right), this generates only those rules which match the text on both sides. For example, (a&a) is equivalent to (a), and (a&b) won't match and is unusable as a replacement. This is typically used as a method of "defining variables", e.g. (abcde&') will match abcde, and also cause every ' in the rule to become equivalent to abcde.
except #
This is a sort of "reverse" version of "also"; only rules containing something that matches the description to the left of the #, but not the description to the right of the #, are generated. For example, (a~e#c) is equivalent to a!b!d!e.
Note that this does not create any form of "scope", so a match like -#[=*] doesn't do what you might expect; it doesn't match "parsemes that aren't just repeats of another parseme", but rather all nonempty parseme (because, say, [aaa] does not consist entirely of repeats of b, thus a workable rule could be generated with - as [aaa] and = as b). The difference between the right-hand-side of # and not in Prolog is comparable to the difference between findall and bagof in Prolog; # can be thought of looking for something specific that its LHS isn't equal to, rather than checking to see if the LHS couldn't possibly match any possible RHS.
maybe ?
Equivalent to !(), i.e. it's a high-precedence or where one of the possibilities is an empty string.
repeat *
This parses in the same way as ?, but allows the possibility to its left to be matched/generated multiple times; for example, a:b* can replace each a with , b, bb, bbb, etc..
nontrivial repeat +
This is like +, but disallows the possibility of an empty match (i.e. the thing to its left is generated at least once).
+ also has an entirely different meaning, when it appears at the start of a match or replacement. A leading + on a match causes á to be added to the start, and é to be added to the end, of both the match and replacement of the rule (thus +a+:b means áa+é:ábé). A leading + on a replacement causes the entirety of the matched text to be added at the start of the replacement; so for example, a+:+b is equivalent to a+&':'b. The interaction of both these cases at once is such that +a:+b means áaé:áabé.
range ~
Possibly the most complex metalexeme, which has two purposes. (In either case, it parses the same way as !, i.e. taking the smallest nonempty amount it can from each side. This has the highest precedence of any metalexeme, even above !. Unlike !, for which the associativity doesn't matter, ~ is left-associative, i.e. 1~2~3 is equivalent to (1~2)~3.)
If both the left and right of the range are literal regular lexemes, this matches any single lexeme that's inclusively "between" the two in value. What this means is defined by the implementation, but would typically be based on numerical, alphabetical, or codepoint order; for example, 4~7 would normally be equivalent to 4!5!6!7, and f~h to f!g!h. This interpretation may also be used for other literal parsemes, if the implementation has defined an ordering on them; for example, [12]~[14] might well be interpreted as equivalent to ([12]|[13]|[14]).
In other cases (e.g. in 4~(6) or (aa+&')~"), this is used for repeat with a count, and to handle conversion of numbers between unary and the implementation's numeric format (normally decimal). It matches a string consisting of a number of smaller strings, where what appears on the left of the ~ is a description of the smaller strings, and what appears on the right is the number of smaller strings that must appear. (The smaller strings don't all have to be equal, unless something else (such as an ') is forcing them to be equal, like in the example of (aa+&') above.)

Metalexemes in the initial string

The program can contain metalexemes in addition to the characters that can appear in the data string. This means that when an initial string is embedded in the program, it can contain metalexemes too.

For the most part, these are interpreted as though they were the replacement part of the only rule contained in a hypothetical extra part that ran before the program started, and that rule's match part happened to exactly match the initial string. For example, an initial string of ab!cd would pick as the initial string whichever option out of abd or acd makes the program halt.

However, "element" and "string" metalexemes are interpreted differently. . is replaced by a copy of standard input (thus allowing the program to take input and nonetheless have data of its own in the initial data string). ', ", and _ are replaced by copies of the first, second, and third command-line arguments, respectively.

Note that the initial string can also contain internal-use lexemes, even though these couldn't be part of input taken from the user.

Metalexeme precedence

Here's the precedence and associativity of the (non-nullary) metalexemes:

highest precedence

~    (left-associative)
!    (associativity irrelevant)
?*+  (postfix; relative precedence not observable)
(concatenation; associativity irrelevant)
&    (associativity irrelevant)
#    (left-associative)
|    (associativity irrelevant)
:    (associates pairwise, i.e. a:b:c:d:e:f is (a:b):(c:d):(e:f))
;    (right-associative)

lowest precedence

Program execution

At the very start of program execution, if no initial data string exists within the program, the initial string is taken from user input. If an initial data string does exist, that's used instead. In either situation, an internal-use lexeme is added to each end of the string: "start anchor" (written as á in this specification) at the start, "end anchor" (written as é) at the end.

The parts in a program execute in the reverse of the order in which they're written. This execution consists of multiple rounds; the rule is that the string produced by a replacement cannot itself be replaced until the next round (even if the replacement is empty, matches later in the round can't extend across both sides of the position of the replacement). A round ends when no further replacements are possible via rules in this part (either because none of them match, or because all the potential matches are against parts of the string which were generated this round). The actual activity of a round consists of finding the match part of some rule within the part of the data string that's untouched so far this round, replacing it with the replacement part of that rule, and repeating that process until the round cannot execute further. (The choice of which of the potentially infinite number of rules to use, and which part of the data string to match it against, are made precognitively, i.e. if possible, some choice is made which will eventually enable the program to terminate. If more than one such choice is possible, there are no constraints on which one the implementation chooses, except that if termination is possible it must eventually happen; the implementation isn't allowed to keep termination possible forever but postpone it indefinitely.)

At the end of a round, the same part gets to execute again, but only if a) it's type II and b) some replacement was actually possible. Otherwise, the execution moves onto the next part for the next round (i.e. the one that appeared before the current part in the program).

Once all the parts of a program have executed, the final data string is output, minus any internal-use lexemes (such lexemes typically won't have any encoding in the character set used for I/O and thus couldn't be output anyway). The program exits with an exit status indicating success.

Alternatively, if the implementation can prove that, despite any potential precognitive attempt to make things work, there'll always be more replacements possible, it exits with an exit status indicating failure. (If it is the case that there will always be more replacements, but the implementation can't prove it, it enters an infinite loop.)

Encoding table

A suggested "minimal" program alphabet for Precognition interpreters that want to keep the set of commands small consists of the 2 metagroupemes, the 15 metalexemes, the 2 groupemes, 2 internal-use lexemes (start anchor and end anchor; an implementation could provide more, but a minmal alphabet doesn't need any), and the 10 decimal digits (each lexemes), for a total of 32 possible characters that can appear within the program. Of course, most implementations will prefer much larger alphabets with more lexemes usable for communication with the user, or perhaps support multiple alphabets.

Here are some suggested representations of these characters (other than the digits, which are highly likely to be useful for communication with the user); both clear representations, and representations intended to avoid clashes with the "regular" lexemes used for I/O. The ASCII suggestions are the ones used in this specification, and may be useful when writing programs where the alphabet of lexemes is either very small (e.g. only digits, or only alphanumerics) or very large (and thus are represented in some other way than single characters). The control code suggestions avoid a clash with printable ASCII (and with commonly used control codes, such as tabs and newlines); the Latin-1 suggestions avoid a clash with anything in ASCII, whilst still being fairly easy to type. (Note that in both these cases, you'd probably want the groupeme characters to be something that the user could type directly; square brackets are the most common choice for list delimiters, so those might work well.) The codepoints map all these characters into the range 10-31 (allowing codepoints 0-9 to be regular lexemes representing decimal digits, thus fitting the 32 characters of the minimal alphabet into five-bit codepoints).

                        ASCII   Control code   Latin-1   Codepoint
start anchor              ^          STX          á         10
end anchor                $          ETX          é         11

opening groupeme          [         (SOH)        (¼)        12
closing groupeme          ]         (EOT)        (¾)        13

opening metagroupeme      (          SI           «         14
closing metagroupeme      )          SO           »         15

part break                ;          GS           ¦         16
rule break                :          RS           ¶         17
parseme 1                 -          ACK          ©         18
parseme 2                 =          NAK          ®         19
element                   .          SUB          °         20
string 1                  '          DC1          ¹         21
string 2                  "          DC2          ²         22
string 3                  _          DC3          ³         23
high-precedence or        !          US           ¡         24
low-precedence or         |          FS           º         25
also                      &          DC4          ª         26
except                    #          CAN          ¬         27
maybe                     ?          ENQ          ¿         28
repeat                    *          SYN          ×         29
nontrivial repeat         +          DLE          ÷         30
range                     ~          ETB          ½         31

Examples

Addition

In sufficiently simple Precognition programs, the actual precognition hardly matters, as there'll be only one way to apply rules anyway. For example, here's a program that adds two decimal integers, taken as command-line arguments:

á1~'é:';1~'1~"

This consists of a single part with a single rule, which can only possibly match the entire initial string (due to the á…é anchors; it matches a string of 1s of length ', and replaces them (together with the anchors!) with '. In other words, it's measuring the length of the data string, assuming it's a unary number.

It also defines the initial string, as consisting of 1 repeated ' times, followed by 1 repeated " times, i.e. the concatenation of the two input arguments (each interpreted as decimal). So we're effectively converting to unary, concatenating (to add the numbers), and converting back to decimal.

Primality test (imperative)

(Newlines have been added for readability.)

+1?2':neither:
+'+'2':composite:
+'2':prime:
+("#'+)2':+1;

1~.211

This program test for prime numbers via trial division, taking the number to test from standard input.

The data string is made of two parts: the number (in unary), and the current factor being tested (also in unary), separated with 2. We have three base cases: the input is 0 or 1 (giving us neither as the output); the input is exactly divisible by the trial factor, 2 or more times (giving us composite); or the input is equal to the trial factor (giving us prime, as we didn't find a factor below the number itself). All three are pretty simple (e.g. the composite case is matching the data string against á'+'2'é: "any number of copies of ', another ', a 2, and another ', for some '"; the only possible match is when ' is the trial factor, so the number must be a multiple).

The fourth rule of the part is more interesting. The match part here can only match when ' is the current factor being tested; thus it's looking for any input (") which is not composed of some number of copies of ' (#'+). If it finds such an input, it adds a 1 to the end of the data string, thus increasing the trial factor. (Note that the program would terminate even without the "not" criterion here; however, it's necessary to prevent composite numbers being reported as prime due to the possibility that the factor might otherwise be repeatedly incremented until it reaches the number.)

Primality test (declarative)

(Again, newlines have been added for readability.)

+1':prime;

+1?:neither:
+(1+1&')+':composite;

1~.

This program finds prime numbers in quite a different way, using two parts.

The first part to run attempts to express the unary number as 1? (i.e. 0 or 1), and replace it with neither. It also attempts to express it as '+' (i.e. two or more '), where ' is in turn of the form 1+ (i.e. one or more 1), and replace it with composite. If a rule can run, it must; otherwise the round wouldn't end. So we end up with neither, composite, or (if it's the original value, before the part can end. (This would happen even if the part were of type I; the type of the part doesn't matter because it can't do anything when it comes to a second round.)

The second part will replace anything which still has digits with prime, thus dealing with the remaining case.

Factorisation

(Again, newlines have been added for readability.)

, áxé::
áxa~'x:', áx;

x(aa+&')~("#0#1)x:x'xa~"x;

xa~.x

Factorisation can be implemented via a similar declarative technique. This time, we represent the numbers as strings of a between x; the first part to run will split such numbers into factors, when it finds them (and continue until no factors are left to find, i.e. we have a string with prime lengths of as). We exclude factors of 1 (and 0, which would create an infinite loop on input 0) explicitly via using #.

The other part converts those a strings back into decimal (and neatly comma-separates them; we're assuming that we have a comma lexeme available for this purpose), removing the leading x from each as it goes, using the á anchor to record what place in the data string it's reached; it also removes the trailing x and trailing comma once it reaches the end.

Note that this example states nothing about the order in which the factors appear, so they'll appear in an arbitrary order. What about producing them in sorted order?:

, áxé::
áxa~'x:', áx;

x'1+x'x:+;

x(aa+&')~("#0#1)x:x'xa~"x;

xa~.x

In this case, we've added an extra part that runs just after the splitting into factors. This part might seem to not do anything, because the match and replacement of its only rule are the same. However, it has a major effect on how the program works; this is a type II part, thus will keep running rounds until no replacement is possible, and thus if it matches, the part can never finish running because it'll continue to match indefinitely.

As such, the precognition mechanic takes over; the first part has to run in such a way that the second part doesn't match, lest the second part never be able to make progress. Note that it can't simply refuse to factorise a composite number, because then the first part wouldn't finish running (it's a type II part). The second part will match if a larger number appears before a smaller one, and (despite the fact that the ' could embed an x) can't match without a larger number appearing before a smaller one, and thus will force the output to end up sorted.

Computational class

Precognition is Turing complete. The easiest way to see this is to note that it embeds Thue (although it's powerful enough to allow many other Turing-complete languages to be implemented more or less directly in it).

See also

  • Thue: a simpler find-and-replace based language; also non-deterministic in its original version, although most programmers seem to misinterpret this as being probabilistic
  • Retina: a language based around regular expression matches and substitutions, intended to be practically applicable despite being esoteric
  • Underload: the first versions of Precognition were intended as a domain-specific language for implementing extensions to Underload
  • wikipedia:Prolog: a major inspiration for Precognition (Prolog often translates to Precognition more or less directly)
  • wikipedia:Lisp: the best-known example of s-expressions, which are what Precognition most commonly manipulates