User:Salpynx/Simple translation conjecture


A page for developing and clarifying my idea for a Simple translation mechanism where a command can act as a sort of self-interpreter, first mentioned at [[1]], and relating to the (in-progress?) formal definition of Simple translation.

Comments on the talk page welcome. I'm starting this in my userspace because I think the idea I'm working on is more general than a Picofuck attempt, although it should be useful for that. It also may encompass more than Simple translation, although I think that is the concept that covers what I'm trying to work with. Once I set the ideas down, we can see how well they match. Note that I'm not really coming at this from an instruction minimisation angle; that's something I haven't given much serious thought to, but I can see how it is interesting and related.

Update May 2020

ω ≠ ε

The original version of this conjecture proposes a system where ε (empty string) could be thought of as representing the end of a string or program. Thanks to good feedback from User:Orby and User:Int-e (via IRC) I see now why this cannot work using standard definitions.

User:Orby suggested terminology for a start-of-string symbol, α (alpha), and an end-of-string symbol, ω (omega). I'm running with this terminology to represent these concepts, which may or may not exist or be justifiable in any model. I'm defining ω as a generalised concept of an EOF (end of file) marker for abstract "program-strings", not relying on the notion of a 'file' as a container -- does this concept have an existing name? I am now certain that ω, if it exists at all, is distinct from ε; the original claim was a mistake. I'll need to edit the rest of this page to clear up the confusion that causes.

What I was trying to describe was a model that allowed an implicit ω. I believe that model can be described accurately as a Finite state transducer with the following modifications:

  1. No path from the start state may have an input label of ε.
  2. Any input string has a new symbol ω ∉ Σ (where Σ is the input alphabet) appended to it.
  • Start state input labels taken from: Σ ∪ {ω}
  • Non-start input labels taken from: Σ ∪ {ε, ω}

Modification 2 has the intended consequence that any path with an input label of ω leads to an end state.
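
As a sanity check that this modified transducer is well defined, here is a minimal Python sketch of it (all names are mine, invented for illustration; ε input labels on non-start states are omitted for brevity):

OMEGA = object()  # ω: a fresh symbol guaranteed not to be in the input alphabet Σ

def run_fst(transitions, start, finals, input_string):
    """transitions maps (state, symbol) -> (next_state, output_string).
    Per modification 2, ω is appended to the input before processing,
    so a transition labelled ω can only ever fire last."""
    state, output = start, []
    for symbol in list(input_string) + [OMEGA]:
        if (state, symbol) not in transitions:
            raise ValueError('input rejected')
        state, out = transitions[(state, symbol)]
        output.append(out)
    if state not in finals:
        raise ValueError('input rejected')
    return ''.join(output)

# Toy transducer: copy 'a' through unchanged, emit 'END' on ω.
T = {('s0', 'a'): ('s0', 'a'), ('s0', OMEGA): ('halt', 'END')}
print(run_fst(T, 's0', {'halt'}, 'aaa'))  # -> aaaEND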

I'm less confident now that this is justifiable as a general model, but at least it can be described properly for analysis. ε was the only implicit symbol I felt comfortable appending to an input string; I'm less sure about whether an implicit ω can be said to exist in all cases, or exactly how it exists in any case.

Just as I convinced myself that ω, if it exists, can only be a characteristic of specific languages and should not be part of a general model, I realised that the definition of wikipedia:finite-state transducers always permits an implicit α (because transition labels from initial states are explicitly defined to be from Σ ∪ {ε}). I don't really see why that is justified, but I'm also convinced that any ST construction that relies on prepending something for α can trivially be reformed to do the same thing without it. I'm less sure about ω though. I think I've demonstrated that having or allowing an ω is extremely powerful, and pretty much anything is possible in terms of ST. I strongly suspect that at least most ω constructions can be shifted to non-terminal symbols, but I think it's possible some very contrived cases won't be translatable. I'll certainly be pleasantly surprised if it turns out all ω-enabled ST tricks can be reformulated to something not relying on an implicit ω.

AFAICT, the standard definition of an input string (≈ "program string") accepted by a FST is a finite string with a knowable start point, but an unknowable end point. This seems to me more an accidental feature of the kind of machine interpreting the string than a necessary property of the "program string" itself. My machine was conceived as something that read a finite string from a buffer, where the knowledge that the buffer was fully consumed was readily available. I'm going to move on from arguing about whether α and ω actually exist, though, and focus on determining how to replicate the ST constructions without a single final ω.

Assorted thoughts on finite-state transducers and simple translations

FSTs seem like the perfect tool to analyse the simple translation concept. They describe regular languages, which is sufficient for simple translations. All STs could be described by a context-free grammar, but that is more powerful than necessary. All STs can be described by a pair of FSTs:

  • T1: A* → X ⊆ B*

and

  • T2: B* → Y ⊆ A*

An ST maps every possible string in A* to some string in B*, so the image of A* is a subset of B*. This means:

  • even syntactically invalid programs from A* are mapped to some string in B*
  • Every syntactically valid program in A* MUST translate to a syntactically valid one in B*, because of the ST equivalency requirement
  • Syntactically invalid A* program translations may or may not be syntactically valid in B*, and the result of execution is undefined

I think this is also true: if X = B* and Y = A*, then languages A and B are trivial substitutions of one another.
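
A toy illustration of the pair-of-FSTs view (the languages and tables here are invented, and the transducers are trivial single-state ones):

# T1 translates A = {a, b} into strings over B = {0, 1}; T2 goes back.
T1 = {'a': '00', 'b': '01'}
T2 = {'0': 'a', '1': 'b'}

def translate(table, program):
    return ''.join(table[c] for c in program)

x = translate(T1, 'ab')   # '0001': here X is a proper subset of B*
# If instead T1 = {'a': '0', 'b': '1'} and T2 is its inverse, then X = B*
# and Y = A*, the trivial-substitution case described above.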

The current definition uses the term 'command' for units of a language; I think this is acceptably loose terminology and doesn't need to be defined further. The formal definition should be in terms of symbols in an alphabet, but we can all agree that a 'command' is roughly some string of one or more symbols, the exact details of which are language dependent. FWIW, I have always been imagining the ST tables' LHS containing a single symbol, even for languages with multi-symbol commands. It's equivalent, and FSTs treat things that way with the various states which may be transitioned to.

For generalised simple translations: There may be reasons why a simple translation is not bi-directional (i.e. one language of the pair is not TC), and I think reflexive simple translations are valid and can be studied too.

Conclusion WRT ω's significance for simple translations

https://swtch.com/~rsc/regexp/regexp1.html : "Simple assertions like ^, $, and \b are easy to accommodate in an NFA, delaying the match one byte for forward assertions."

The proper terminology for ^ and $ is 'assertions', and it seems to be a feature of 'regexes' ... which are different from 'regular expressions'. These symbols "... can be viewed as assertions about the text around them", so about the context surrounding the program, as distinct from the program itself. $ is therefore not a feature of regular languages.

Parsing programming languages is generally regarded as a task for at least a push-down automaton (which can recognise context-free languages).

Lexing (a step before parsing) is often performed by a FSM. A simple-translation is a process equivalent to lexing (the current definition only requires a FSM to perform it).

Can all languages be effectively lexed / simple-translated by a FSM? ... it appears not, as many languages cannot be described as regular-languages if they are to be parsable.

Examples include: Unary, BCT, Python, any 2d language that permits pointer wrapping over its bottom edge, and likely many more. (These examples don't feel very special anymore, anything requiring matched parentheses counts too. The difference is that matched parens can generally be pushed through a lexer without modification.)

A simple translation can successfully operate on program source as if it were a regular-language, translating using a finite-state transducer, but the resulting translation can only be a regular-language too (this probably needs proof / clarification...). This is insufficient for the translation to be parsable (in many/most/all cases) in the target language.

The argument runs: a FST cannot possibly translate an arbitrary CFL feature from one language to a completely different but somehow corresponding CFL feature of another. More powerful parsing is required.

The minimum modification required to enable a simple-translation to function identically to the parsable source program is to include $ / ω as an explicit terminal symbol in the source regular-language, since the proper parsing / translating task can be deferred to the fixed translation of this symbol, regardless of the power of the lexer / simple-translation machine. The translation simply needs to copy across the raw data and leave the remaining work to a machine encoded in $/ω. The simple-translation definition deals with TC languages so we can be confident that a translation of the $/ω symbol can implement a parser at least as powerful as a push-down automaton.

Other CFL features of a source program considered as a regular-language can be pushed through the translation mechanism to a comparable construction, but not a $ / ω if it has a special semantic significance in the source language.

A simple-translation is therefore a process that maps a (potentially) context-free language A to (potentially) context-free language B using a finite-state transducer. The translation operates as if it is acting upon regular-languages, but the super-regular features, which enable them to encode semantics and function a particular way, must survive the translation, in both directions.

Conjecture

Any less-than-or-equal-to Turing complete language can be Simple translated to any Turing complete concatenative language.

(It's possible I'm using a non-standard definition of concatenative language here, I'll define what I mean below)

Implication

There exist more than one (actually infinitely many, due to NOPs) relatively simple (linear?) functions to convert a Gödel numbering of any less-than-or-equal-to Turing complete language into a Gödel numbering of any other TC language meeting the definition of 'concatenative' as used above, which performs the exact same computation. Note: applying the function itself will be straightforward, but deriving the parameters / constants for the function will be very complex in most cases, and the resulting code may be highly inefficient. But possible.

Method

  • Any and all programming languages can have their source code (syntax) encoded as some 'string' of symbols from an alphabet A.
  • Any and all programming languages can conceptually be interpreted by a hypothetical REPL which maintains a state and reads each symbol of a program sequentially and performs "computation" when it has consumed sufficient symbols to do so.

This REPL can be modelled as follows: at each step it takes an initial state, S₀, consumes the next program symbol, P₁, and modifies the state to produce S₁, which it passes to the next symbol, P₂.

(Sₙ₋₁) → Pₙ → (Sₙ)

Because this is a symbol-by-symbol REPL, it may need to store read but unprocessed symbols in State until it has consumed enough (a complete semantic unit) to know what action to take.

The claim is that this is always possible for any and all languages because, in the worst case, by definition of a 'program', the entire program may need to be read to obtain a semantically complete programmatic unit. The REPL can always know this from its current position in the symbol stream. The only formal requirement is to accept and recognise the existence of the empty string 'ε' as an element of all possible alphabets for any language. ε acts as an EOF / end-of-string marker, but will only be required for languages where this distinction is actually present in their semantics, i.e. it's not adding a semantic element, but modelling something that is actually required by the language semantics, regardless of whether a specification mentions it explicitly.

The empty symbol can be thought of as a potential answer to the REPL's question "What is the next symbol?"; to encode _anything_ (syntax or semantics) there have to be at least two possible answers: <next symbol> or ε. This models any unary-style encoding, where the language alphabet has to consist of two symbols, {<mark>, ε}, to represent a finite-length unary program.
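
To make the REPL model concrete, here is a minimal Python sketch (the toy language and all names are invented for illustration). It buffers consumed symbols in the state until they form a complete semantic unit, then acts, matching the (Sₙ₋₁) → Pₙ → (Sₙ) step above:

def step(state, symbol):
    """Consume one symbol: buffer it until the buffer is a complete
    semantic unit, then execute that unit against the state."""
    state = dict(state, buffer=state['buffer'] + symbol)
    if is_complete_unit(state['buffer']):
        state = execute(state['buffer'], state)
        state['buffer'] = ''
    return state

# Toy language: 'i' increments; 'p(<label>)' must be read in full
# before it can print the label and the accumulator.
def is_complete_unit(buf):
    return buf == 'i' or (buf.startswith('p(') and buf.endswith(')'))

def execute(unit, state):
    if unit == 'i':
        return dict(state, acc=state['acc'] + 1)
    print(unit[2:-1], state['acc'])
    return state

state = {'acc': 0, 'buffer': ''}
for p in 'iip(total)':      # the final unit spans many symbols
    state = step(state, p)  # prints: total 2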

For the purposes of Simple translation, this conjecture requires that ε is a first-class member of any LHS alphabet translation table, if needed to fully represent the semantics of that language, and can therefore be translated into some finite string of other symbols in a RHS language.

Similarly ε can appear on the RHS as a translation, but it does not mean terminate the program, and further translations can be appended as normal.

This is very interesting. I am hard pressed to say that ε is not always implicitly defined. In C, for example, if reading from a stream it is EOF and when reading from a string it is a NULL character. In this sense, ε doesn't mean terminate the program, but it does signal to the REPL, or interpreter, "no more symbols". I don't know if I follow you when you identify this with the empty string. If ε is the empty string, shouldn't it have no effect on the REPL (i.e. ++ = +ε+)? That is where I think ε as you're defining it is different from other LHS symbols. If ε is the empty string, then we should be able to insert it anywhere without disrupting semantics (i.e. it should be a NOP). What does it mean to say, "If ε appears in the middle of the program it is a NOP, but if ε appears at the end of the program it is EOF."? How does the REPL distinguish between those two conditions without having yet another implicit symbol to determine true EOF? Orby (talk) 14:27, 7 May 2020 (UTC)
I am going to argue that EOF and ε are distinct symbols. I think it's perfectly reasonable to include both on the LHS of a simple translation, but I think we may need to make their meaning clear. EOF, for example, has different meanings depending on the language. In most languages, EOF signals halt (that is, if the IP reaches this symbol, halt). In other languages, there is an implicit loop. In that case, EOF signals jump to start. If we include an explicit EOF marker on the LHS, we can create simple translations between languages with implicit loops and languages without. Additionally, some languages have particular behavior associated with a start symbol--- i.e. some kind of initialization. Some languages with implicit loops will immediately halt if certain entry conditions into the loop aren't met. What if we used α to denote the start symbol and ω to denote EOF. User:Ais523 called the inclusion of these symbols on the LHS a "generalized simple translation", as they capture a more general category of simple translation than requiring that ω0 = ω1 and α0 = α1 (where α0 is the start symbol in the source language and α1 is the start symbol in the destination language). Orby (talk) 14:42, 7 May 2020 (UTC)

Further claims

The syntax of a target language may be arbitrarily encoded, and still retain the property of being able to be Simple translated to another language. Extreme examples to illustrate:

  • Standard zip compressed Malbolge code can be simple translated to a TC language, symbol by symbol.
  • Reversed Python source code can be symbol-by-symbol simple translated to another language (including Python) using the above REPL construction.
    • note the reverse of ABCε is CBAε NOT εCBA.

In these examples obviously the full source needs to be consumed to perform the decode operation, but we have already accepted that this is allowable.

  • A language may be symbol-by-symbol simple translated into itself if it meets the concatenative definition, i.e. every allowable symbol (character) of Python 3 could be simple-translated into Python code which when concatenated and run, will produce exactly the same computation and results.

Similarly, any language with n symbols could be re-encoded to fewer symbols m, where m >= 2 (including ε); this is the syntax part of instruction minimisation. Equally, the alphabet could be arbitrarily expanded, and the simple-translation ability will still hold (see the sketch after these claims).

  • Infinite alphabets and infinite source programs aren't an obstacle or special case (there may be some pathological infinite programs, but I think generally any problems with infinite source can be equivalently simple-translated).
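
Here is a small sketch of the n-symbols-to-m-symbols re-encoding claim, using bf's eight commands and a fixed-width binary code (the 3-bit width is my arbitrary choice for illustration; any prefix-free code would do):

# Re-encode bf's 8 commands over the 2-symbol alphabet {'0', '1'}.
BF = '+-<>[],.'
encode = {c: format(i, '03b') for i, c in enumerate(BF)}
decode = {v: k for k, v in encode.items()}

program = '+++[-].'
binary = ''.join(encode[c] for c in program)

# A finite-state machine can decode by consuming 3 symbols per command:
recovered = ''.join(decode[binary[i:i + 3]] for i in range(0, len(binary), 3))
assert recovered == program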

Examples using a high level language

A relatively trivial symbol-by-symbol simple-translation of Python to Python using sed to make the translation substitutions (this won't handle single quotes properly):

sed "s/\(.\)/CODE = CODE + '\1' if 'CODE' in vars() else '\1'\n/g;s/$/CODE = CODE + '\\\n' if 'CODE' in vars() else '\\\n'/;\$aexec(CODE)" {python-input-file}

This uses the ε final-substitution 'trick' for illustration. A Python to Python translation that doesn't rely on this should also be possible, but will require more logic for each translation.
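
For example, working the translation rules above through by hand, the one-line source program x=1 becomes (the final newline is handled by the second sed rule, and exec(CODE) is appended by the \$a command):

CODE = CODE + 'x' if 'CODE' in vars() else 'x'
CODE = CODE + '=' if 'CODE' in vars() else '='
CODE = CODE + '1' if 'CODE' in vars() else '1'
CODE = CODE + '\n' if 'CODE' in vars() else '\n'
exec(CODE)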

To illustrate the claim that arbitrary encoding does not affect the construction, here is a 'reversed-python' simple translation:

sed "s/\(.\)/CODE = CODE + '\1' if 'CODE' in vars() else '\1'\n/g;s/$/CODE = CODE + '\\\n' if 'CODE' in vars() else '\\\n'/;\$aexec(CODE[::-1])" {reverse-python-input-file}

which produces valid executable Python character-by-character from the source:

)latot % "s% si latot ehT"(tnirp
i =+ latot    
)i(tnirp    
:)01(egnar ni i rof
0 = latot

:tset pool nohtyp #

Finally, to test the zipped Malbolge encoding concept, this script produces a simple translation symbol-by-symbol from a zipfile containing a single Python script:

#!/usr/bin/env python3
import sys
"""
   Program to take a zip file containing a single Python source file
   and 'simple-translate' it symbol-by-symbol to Python to produce an identical result to the uncompressed original code.
   Input via STDIN:
       zip-translate.py < python-source.zip
          OR to execute:
       python3 <(zip-translate.py < python-source.zip)
"""

interpreter = """import io, zipfile
zf = zipfile.ZipFile(io.BytesIO(CODE), 'r')
for fileinfo in zf.infolist():
    exec(zf.read(fileinfo).decode('utf-8'))
"""

with sys.stdin.buffer as f:
    while 1:
        c = f.read(1)
        if c:
            print("CODE = CODE + b'{byte}' if 'CODE' in vars() else b'{byte}'".format(byte=r'\x' + c.hex()))
        else:
            print(" # end of program, replace ε with code to decode and evaluate previously accumulated data:")
            print(interpreter)
            break

An example translation is: P ⇒ CODE = CODE + b'\x50' if 'CODE' in vars() else b'\x50' ... ε ⇒ {code which converts the accumulated CODE data to a bytestream, unzips it, and uses Python's builtin exec() to evaluate the resulting code}

Relying on final ε substitutions should not be necessary for all languages. It is necessary for unary encodings, but Python and bf derivatives should be able to avoid these if they are not desirable. I'm not sure how to formalise this though -- insisting ε is part of every alphabet when considered as a set seems reasonable in a formal sense, but it is clear that ε as a semantically meaningful 'next symbol' is not required for most languages, only some. It also seems unreasonable to restrict translations to only semantically meaningful symbols from a language, since ignorable whitespace and comments are a standard feature of many languages.

Hopefully these (arguably dumb) Python examples illustrate the point to some extent, and can be used as a guide for more serious attempts in other languages.

Concatenative semantics

The arbitrary re-encoding of syntax is one thing, but this 'Salpynx style' REPL construct appears to work because the semantics of many languages function the same way as strings, so a parallel can be drawn between syntax re-encoding and semantic re-encoding. This allows the specific semantics of each programmatic unit to be re-encoded in a way that reproduces the function of the original, but in a different way (e.g. postponing any expected action to the final instance of an ε, acting as a self-interpreter that performs anything needed, while every other previously occurring semantic unit has merely made a change to the state to indicate its existence to the ε self-interpreter).

Note: I'm running out of steam in this session, this semantic musing is the more critical bit, and I'm probably not doing it justice today. I will need to revisit.

The structure of strings, the syntax of a program, is that of a monoid.

  • The binary operation is concatenation, and the identity is ε.

The structure of a 'concatenative' language program (its semantics) is a sequence of 'programmatic units' which can function independently of other units. They take an input state, and optionally modify it. This is also a monoid.

  • The binary operation is program-sub-unit concatenation, and the identity is a NOP (always possible in a TC language, and as a programmatic sub-unit)
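
To make the parallel concrete, here is a minimal Python sketch of this monoid of programmatic units (the units and names are my own toy illustration, not part of the conjecture): units are functions from state to state, the binary operation runs one unit and passes its state to the next, and the identity is a NOP.

def concat(u1, u2):
    return lambda state: u2(u1(state))  # run u1, pass its state to u2

NOP = lambda state: state               # the identity element

inc = lambda s: dict(s, x=s['x'] + 1)   # two toy programmatic units
dbl = lambda s: dict(s, x=s['x'] * 2)

prog = concat(concat(inc, NOP), dbl)    # inc;NOP;dbl behaves as inc;dbl
assert prog({'x': 1}) == {'x': 4}
assert concat(NOP, inc)({'x': 0}) == concat(inc, NOP)({'x': 0})  # identity laws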


A concatenative programming language, for the purposes of this conjecture (apologies for potentially overloading an existing term), is one where at some level it can be divided up into these stand-alone semantic sub-units, which are then processed in sequence. A programmatic sub-unit of a program is one that could stand alone as a program itself.

  • + is a programmatic sub-unit in bf, and also a full program.
  • print("Hello, World!") is a programmatic sub-unit in Python, also a full program.
  • print(" is not a programmatic sub-unit in Python because it is semantically incomplete and cannot function as a stand-alone program.

Some languages (likely example: Fractran, raised by User:Ais523 in IRC) do not meet this criterion, because the entire program must be read before any computation can begin, and there is no way to divide the source code of any program into sub-units which can stand alone. The REPL construct above supports interpreting languages like this (interpretation is always deferred until the last symbol, which is perfectly valid), which is why such languages can be simple-translated FROM, into a different language. However these languages cannot be simple-translated TO, because there is no natural semantic concatenation possible within them.

A 2D language like Befunge is on the surface non-concatenative, but I think it can be written in a 1D concatenative way to enable it to appear on the RHS of a simple translation.

However if we treat a full program in a language (not natively divisible into complete sub-units) as the smallest possible semantic unit (as is the case), but allow these units to interoperate in exactly the way specified above, whereby a programmatic unit takes an initial state and optionally modifies it before passing it on to the next unit, the construction above is fully enabled again. We are simply running a program in such a language from a state determined by the final state of a prior program in a concatenative sequence which meets the definition above. I'm not sure of the full implications of this yet. Is this a new language, a meta-language based on whatever the sub-unit language is, or a meta-concept that applies to any programming language?

This makes me think harder about the definition of a command in the context of a simple translation. I have an intuitive sense of what that means. It is not equivalent to a programmatic sub-unit (e.g. [ in bf is clearly a command, but not a programmatic sub-unit). So in some sense a bf program "[blahblahblah]" is more akin to a Fractran program in this sense as the string doesn't take on any programmatic meaning until the final ] is encountered. Orby (talk) 14:12, 7 May 2020 (UTC)
Yes, I was definitely thinking about treating each programmatic sub-unit as indivisible, because that seems to work in the worst case scenario for translating when languages don't have much else in common, to prove translations are possible (but not necessarily optimal). The Picofuck translation I originally envisioned needed to have quite a bit of repeated logic to handle full (and nested!) loops as a single unit. I guessed there should be ways to simplify a translation if the two languages happen to have some features in common, but hadn't figured out how to formalise that. Reading your comment made me realise that while [ is not a stand-alone sub-unit like I defined, we can create an equally fragmented code part in Python to begin a loop on the same kind of condition. So RBF and Python share the ability to translate at a below-programmatic-sub-unit level... I don't yet know what to call that, but it feels like a good shortcut to take with RBF -- leveraging comparable features to minimise the translation size. I think that's what you did with Nanofuck, it's a maximally efficient translation, and why the size concept is good to evaluate these. I seem to be pursuing a maximally inefficient solution to prove a worst case is generally possible! Salpynx (talk) 20:58, 7 May 2020 (UTC)
Hrm. While it is easy to append a "begin-loop" fragment in Python (i.e. a high level substitute language) without knowing when the end will occur, I'm not as sure about passing the state neatly to and from a fragment like this. It might be ok, but I may need to try it to test. Salpynx (talk) 21:05, 7 May 2020 (UTC)

IO

In my paper notes I have covered how input and output can be incorporated into this model, which I think is the way to generalise all IO operations to TC languages regardless of whether they have built-in IO commands. I note that Simple_translation#Input_and_output already touches on this. IO doesn't really change anything above, but it can be formalised. Output is simply the portion of a generated state that is lost to, or not used by, any subsequent symbol operation. In a language without native IO, it can simply be written to an output state, then ignored (possibly even overwritten by subsequent operations? It is no longer part of the computation; it is sufficient that it occurred). Input is just the portion of the initial state received by a symbol operation that it may be instructed to check and take some action upon. It is not for the code to know whether it was user-supplied or supplied by some other pre-occurring process -- they are equivalent.
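
A minimal sketch of this IO-as-state view (the 'inp' / 'out' state keys and the toy units are hypothetical, chosen for illustration): input is part of the initial state, and output is a portion of state that no later unit reads.

def read_unit(state):
    # consume one symbol of input from the state
    return dict(state, acc=state['inp'][0], inp=state['inp'][1:])

def write_unit(state):
    # 'out' is never read again by later units; it is sufficient that it occurred
    return dict(state, out=state['out'] + [state['acc']])

state = {'inp': 'AB', 'out': [], 'acc': None}
state = write_unit(read_unit(state))
assert state['out'] == ['A'] and state['inp'] == 'B'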

Conclusion / WIP

I'm still working on this. I think what I'm doing is coming up with some parallel between syntactic and semantic encoding, and claiming syntactic encoding is very free, while semantic encoding is almost as free if certain (not especially stringent?) conditions are met. There is some range of encoding transforms and mappings that can be performed on the syntax and semantics that leaves them equivalent in some sense, and there are ways to divide each into sub-units (symbols) so that these mappings can be performed sequentially, symbol by symbol ... and this has something to do with the structure of monoids. I suspect I'm possibly reinventing some wheels. My interest in this stems from exploring how code can be composed using functions and Gödel numbers (see my Brainfoctal and Hello World! program construction).

I have been trying to write up the details of a (joke-like) Gödel numbering language for Deadfish where I created a function to produce a Brainfoctal equivalent code number, which was then run as bf. This involved creating a simple translation from Deadfish to bf, and the only way to handle Deadfish overflow was to encode the check in every translated command. At the time I had not encountered the Simple translation concept, but I was surprised how few languages had such simple translation equivalents to another, which I wanted to steal to enable easy Gödel number conversions. At the time I was only looking for one-way conversions, and the thing that excited me about User:Orby's definition was the two-way requirement, which was something I hadn't even consciously considered. The punchline of my Deadfish numbering was that the translation function was bijective, so that any Brainfoctal code would generate some float (non-integer Real number) in Gödelfish, which implies it is Turing complete. (I also came up with a function to generate an output encoding as an n-ary non-negative integer given a Deadfish Gödel number input, but I should write that up in its own article now that the wiki has math markup.) So again, my interest stems from playing with encodings, which culminates AFAICT in Gödel numbering and the ability to split those back out into arbitrary representations. I have another WIP that uses a Lenguage-like joke to enable the formation of potentially sensible syntactic units and new symbols which can be formed into real programs. Actually, I'm astounded at how well the simple translation concept covers some of the things I've been working on without quite realising what I was doing or fully having it formalised. Thanks User:Orby for attempting a formal definition!


Simple translation of BCT into Python, avoiding end-of-program symbol tricks

BCT is another language which requires information to indicate the program has been fully received before the cycling execution can begin. Here is a simple translation from BCT to Python that only requires the BCT program symbols 0 and 1 to be translated. No extra beginning or end code is required. It is, however, very repetitive. If we accept that finite-state transducers can always prepend an α sequence to the output, the cycle() def could be placed there instead of repeated for every symbol. The trick used here is to append syntax that 'undoes' the first part of a command before it is even executed. The absence of the 'undo', which occurs only on the final symbol, allows cycle() to execute on the last translated symbol only. I can't see how this could generalise to all languages though. It seems to be a feature that happens to exist in Python, and using simple translations to substitute fragments of programmatic sub-units is a bit cheat-like. Actually performing a complete computation then reversing it at every step would be an alternative, but then I can't see how you could avoid entering an infinite loop that you'd need to 'undo' once it ... completes. The translation prompts the user for a data string to operate on.

NB: these python substitutions DO NOT end (or begin) with a newline. The construction relies on the first and last parts appearing on the same line if they are adjacent.
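
The 'undo' trick relies on Python's conditional expressions evaluating lazily: when two translated symbols are adjacent, the trailing cycle(P) + 1 of one and the leading 0 if False else 1 of the next fuse into the single line cycle(P) + 10 if False else 1, whose call to cycle() is never evaluated. A quick stand-alone check (the throwaway cycle() here is mine, not the one from the table):

def cycle(P):
    raise RuntimeError('should never run mid-program')

P = '0'
cycle(P) + 10 if False else 1  # no error: the False branch means cycle(P) + 10 is never evaluated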

BCT symbol → Python equivalent
0 ⇒
0 if False else 1
def cycle(P):
    D = input("Input datastring:")
    while D:
        if P[0] == '0':
            D = D[1:]
        else:
            D += P[1] if D[0] == '1' else ''
            P = P[1:] + P[0]
        P = P[1:] + P[0]
        print(D)
    return -1
P = '0' if 'P' not in locals() else P + '0'
cycle(P) + 1
1 ⇒
0 if False else 1
def cycle(P):
    D = input("Input datastring:")
    while D:
        if P[0] == '0':
            D = D[1:]
        else:
            D += P[1] if D[0] == '1' else ''
            P = P[1:] + P[0]
        P = P[1:] + P[0]
        print(D)
    return -1
P = '1' if 'P' not in locals() else P + '1'
cycle(P) + 1

I believe this is a valid example of a simple translation, but it was created to test the bounds of the formal concept. I'm in no hurry to create the second table for Python to BCT. A two-way BCT / BF-like language is something I'd be interested in working towards.