MONOD

From Esolang

MONOD is a molecular-biology-influenced esoteric programming language named after the French biologist Jacques Monod. Since I have little better to do, I'll probably start documenting it here now that it's in workable shape. If anything passes for an original design document, it's probably this blog post (dead link).

Instructions

Codewheel

[Image: Monodv1diagram.png, captioned "Wat :||||||"]

The codewheel above is that for v1.0 of MONOD. It is based on the same principle as the circular representations of the real genetic code, being one of the better ways of displaying essentially three-dimensional data in two.

The main conceit of MONOD is that the sourcecode is a string over Σ = {G,A,T,C} and so looks like DNA. It is stored as plaintext with a .dna suffix. MONOD, like DNA, is divided into 64 triplets Σ³ = {GGG,GGA...CCC} each of which can code for an instruction/amino acid. The first member of the triplet is the innermost circle, with the second and third being read off, so TAT and TAA both => BIND. The interpreter reads in the string over Σ and converts it into instructions, also inaccurately called "codons" (this properly being the name for the Σ³ triplets).
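As a rough illustration of the first reading step, the sketch below splits a source string into triplets. The function name `triplets` is made up for this example and is not taken from the interpreter. Filtering to upper-case G, A, T, C follows the Null example further down the page; dropping a trailing incomplete triplet is an assumption.

```python
def triplets(source: str) -> list[str]:
    """Split MONOD source text into three-letter triplets."""
    # Keep only the four upper-case bases; every other character is ignored.
    bases = [ch for ch in source if ch in "GATC"]
    # Assumption: a trailing group of fewer than three bases is dropped.
    usable = len(bases) - len(bases) % 3
    return ["".join(bases[i:i + 3]) for i in range(0, usable, 3)]


print(triplets("AAACCCCCCAAATAATATGAG"))
# ['AAA', 'CCC', 'CCC', 'AAA', 'TAA', 'TAT', 'GAG']
```

Mapping each triplet to an instruction needs the codewheel, which is not reproduced here.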

Execution Control

Before explaining what the instructions are, a brief explanation of control of execution and a digression into data types in MONOD:

(real) Genomes are not so much programs as file structures from which the cell can pull up programs to execute for a million different purposes. MONOD works on the same principle. By itself, a .dna file is useless -- just as DNA floating around in a test-tube is.

There are three datatypes in MONOD - codons, binding sequences and integers. Codons are represented as four-letter capital strings and at present there are 16 - these are the operations. Integers are just integers. Binding sequences take the form "*aaabbb...nnn", where nnn is a lower case version of Σ³ -- *gagtat, *gggggg, *ggg are valid binding sequences, *ga, *gagagu, *tremme are not. * by itself is also valid. It is binding sequences that are effectively the main agents of control in MONOD.
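The binding-sequence rule above can be written as a regular expression. This is only a sketch of the stated rule, and the names `BINDING_SEQUENCE` and `is_binding_sequence` are invented for the example.

```python
import re

# An asterisk followed by zero or more lower-case triplets over g, a, t, c.
BINDING_SEQUENCE = re.compile(r"\*(?:[gatc]{3})*")


def is_binding_sequence(token: str) -> bool:
    """Return True if the whole token is a well-formed binding sequence."""
    return BINDING_SEQUENCE.fullmatch(token) is not None


for token in ["*gagtat", "*gggggg", "*ggg", "*", "*ga", "*gagagu", "*tremme"]:
    print(token, is_binding_sequence(token))
```

The first four tokens are accepted and the last three rejected, matching the examples in the text.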

Execution does not happen to notional "DNA" strings -- instead what is executed are "proteins", which are copies of sections of the .dna sourcecode. Since it is necessary to start somewhere, a protein is added to "prime" execution of the program, this being provided in .rna (or parsed .dna) form. .rna files are plaintext strings of codons, binding sequences and integers that *are* executed upon initiation. For any interesting programs to happen, this priming "Kadmon" protein should contain code that "transcrates" (or TRANs) one or more proteins into being and adds them to the list of proteins in existence (to begin with, just Kadmon). When the interpreter has finished executing the Kadmon protein (or any protein), it moves on to the next protein in the list of proteins produced.
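A minimal sketch of reading a .rna file into a token list follows. The function `parse_rna` is hypothetical, and it assumes tokens are separated by whitespace and that integers are written as plain non-negative decimals. The kadmon.rna shown later on this page fits that shape, but the real interpreter may be more lenient.

```python
def parse_rna(text):
    """Turn .rna text into a list of codons, binding sequences and integers."""
    tokens = []
    for word in text.split():
        if word.startswith("*"):
            tokens.append(word)        # binding sequence, kept as text
        elif word.isdigit():
            tokens.append(int(word))   # integer operand
        elif len(word) == 4 and word.isalpha() and word.isupper():
            tokens.append(word)        # four-letter codon name
        else:
            raise ValueError("unrecognised .rna token: %r" % word)
    return tokens


print(parse_rna("TRAN BIND *tatgag BEND BACK NULL LYSE"))
# ['TRAN', 'BIND', '*tatgag', 'BEND', 'BACK', 'NULL', 'LYSE']
```

The printed list has the same form as the one the sample session shows when priming KadA.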

Proteins, therefore, can interact with other proteins in two ways - by binding to them, or by TRANSing them into existence (in which case, the protein is automatically bound to them). Binding is achieved by matching of binding sequences - so that a protein with the correct codon sequence and the binding sequence *tatgag will look for every instance of that binding sequence in the genome. If the protein is attempting to TRANS all *tatgag proteins into existence, it will check to see if each instance of *tatgag is followed by a codon indicating that A Protein Starts Here, and if so, TRANSes all relevant proteins into existence. Binding is a similar process.

In MONOD, state is stored in two places -- locally, individual protein instantiations have an attribute "phosphorylation" derived from a biochemical term. In MONOD this is simply a (signed!) integer. Globally there is a "chemical array", 64 signed integers large. Proteins can increment or decrement this in several ways, and it is displayed finally when all proteins have been executed.

If a protein is bound to another, then certain operations which would change the state of the protein itself, or change state based on protein state will instead act according to, or upon the bound protein. In this way, proteins can act upon proteins that are yet to be executed and some long-distance control can be wielded.

If you have gotten this far, you are probably about ready to understand the...

Instruction List

If the operation has a dollar after it, eg. META$, then any triplet following it is interpreted as an integer, referred to as n. A trailing asterisk eg. BIND* indicates that a binding sequence follows, referred to as *x. This notation is strictly for illustrative purposes and is not present in actual code.

NULL / FOOO Both are nops: pass by, do nothing.
LYSE Upon reaching this instruction, the protein is deleted from the protein list. Any instructions after LYSE are therefore ignored.
  1. If a protein is not LYSE-terminated, the protein will not be deleted. The protein will be permitted to run for a certain number of cycles, whereupon it will be automatically lysed upon completion for making a nuisance of itself. This has some biological justification.
  2. If the protein is bound to another, the protein will instead LYSE its bound protein, and continue as usual. To lyse both proteins place LYSE LYSE, otherwise the protein will continue on a second execution cycle before it is LYSEd.
PHOS / DEPH Increments (PHOS) or decrements (DEPH) the protein's phosphorylation by 1. If a protein is bound, then the bound protein is in/decremented.
COPH$ If the value of the chemical array at index n is nonzero, then the instruction acts as PHOS, otherwise it is a nop.
META$ / CATA$ Increments (META) or decrements (CATA) the chemical array at index n by the phosphorylation of the protein. If bound, by the phosphorylation of the bound protein.
SENS$ Sets the phosphorylation state of the protein to the value of the chemical array at index n. If bound, the phosphorylation state of the bound protein is set.
JESU$ Sets the value of the chemical array at the m'th index to n, where m is the absolute value of the protein's phosphorylation state (the sample session reports this as "phosphor-sets"). If bound, the bound protein's phosphorylation state is used.
BIND* Marks the beginning of a binding site. Used in first-stage parsing, in execution, mostly a nop, but useful for identification of binding sites.
*BEND Marks the ending of a binding site. See BIND.
BACK/* If a protein is bound, then release the protein. If there is a binding sequence following (there may not be), look for all proteins containing the binding site *x anywhere in them, and bind to the first such protein found on the protein array.
TRAN (*) Look for all instances of *x on the genome. At each, proceed forward until a BEGN (v.sub.) codon is reached, and begin creating a protein, adding codons until a STOP (v.sub.) codon is reached.
  1. The binding sequence need not come directly after TRAN; the typical layout is TRAN BIND *x BEND BACK
  2. For example, suppose a genome contains: ... *gagcatgag NULL PHOS BEGN META 23 PHOS LYSE STOP LYSE ... Then a TRANSing protein would create a new protein FooA with sequence: META 23 PHOS LYSE
  3. Protein names are generated randomly and have the form XyzA where X,y,z are randomly selected letters of the alphabet. The primer protein is always called KadA. This is, again, biologically inspired.
COTR (*) As TRAN, but only if phosphorylation is positive. Not added to wheel yet.
BEGN / STOP TRAN infrastructure. See TRAN for details.
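
The TRAN example above can be sketched as a search over an already-decoded genome. The sketch treats the genome as a flat list of tokens and matches binding sequences by exact equality. Both are assumptions, as is the function name `transcribe`. It follows the "proceed forward until BEGN, then collect until STOP" wording rather than any stricter adjacency check.

```python
def transcribe(genome, binding):
    """Return the codon lists a TRAN on `binding` would turn into proteins."""
    proteins = []
    for i, token in enumerate(genome):
        if token != binding:
            continue
        try:
            # Walk forward from the binding sequence to the next BEGN,
            # then on to the STOP that closes the protein.
            start = genome.index("BEGN", i + 1)
            stop = genome.index("STOP", start + 1)
        except ValueError:
            continue  # assumption: a site without BEGN ... STOP yields nothing
        proteins.append(genome[start + 1:stop])
    return proteins


genome = ["*gagcatgag", "NULL", "PHOS", "BEGN",
          "META", 23, "PHOS", "LYSE", "STOP", "LYSE"]
print(transcribe(genome, "*gagcatgag"))
# [['META', 23, 'PHOS', 'LYSE']]
```

Naming the new proteins and binding the TRANSing protein to them are left out.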

Blank space on the codewheel indicates reserved triplets for future use. BALLS triplets are reserved reserved triplets, used when meaningless codons, written ????, are translated back into DNA code.

Sample Session

pthag@pthag-desktop:~/python/monod$ python monod.py -x bettergenome.dna kadmon.rna -e -s
      __  __  ___  _   _  ___  ____  
     |  \/  |/ _ \| \ | |/ _ \|  _ \ 
     | |\/| | | | |  \| | | | | | | |
     | |  | | |_| | |\  | |_| | |_| |
     |_|  |_|\___/|_| \_|\___/|____/ 
     robert harry nicodemus williams~

                Hello.

Priming with KadA (Kadmon)! ['TRAN', 'BIND', '*tatgag', 'BEND', 'BACK', 'NULL', 'LYSE']
Starting to execute KadA (Kadmon)
   KadA finds 4 "*tatgag" in genome.
   New Protein XhiA! ['PHOS', 'PHOS', 'PHOS', 'META', 21, 'FOOO', 'LYSE'] (genome offset 5)
   New Protein XagA! ['PHOS', 'META', 13, 'DEPH', 'JESU', 21, 'JESU', 2, 'BACK', 'BIND', '*catgat', 'BEND', 'PHOS', 'PHOS', 'PHOS', 'PHOS', 'JESU', 21, 'BACK', 'LYSE'] (genome offset 18)
   New Protein DjtA! ['BIND', '*catgat', 'BEND', 'PHOS', 'PHOS', 'PHOS', 'PHOS', 'PHOS', 'META', 8, 'LYSE'] (genome offset 46)
   New Protein CvjA! ['PHOS', 'PHOS', 'META', 52, 'NULL', 'BIND', '*', 'BEND', 'FOOO', 'FOOO', 'BEND', 'DEPH', 'LYSE'] (genome offset 63)
   KadA does nothing (BEND)
   KadA unbinds CvjA!
   KadA does nothing (NULL)
   KadA lyses itself.
Finished executing KadA (Kadmon)
Starting to execute XhiA.
   XhiA increased phosphorylation to 1
   XhiA increased phosphorylation to 2
   XhiA increased phosphorylation to 3
   XhiA increased register 21 by 3.
   XhiA does nothing (FOOO)
   XhiA lyses itself.
Finished executing XhiA.
Starting to execute XagA.
   XagA increased phosphorylation to 1
   XagA increased register 13 by 1.
   XagA decreased phosphorylation to 0
   XagA phosphor-sets register 0 to 21
   XagA phosphor-sets register 0 to 2
   XagA unbound nothing (BACK)
   XagA activates binding site *catgat
   XagA binds DjtA through *catgat
   XagA increases phosphorylation of DjtA to 1
   XagA increases phosphorylation of DjtA to 2
   XagA increases phosphorylation of DjtA to 3
   XagA increases phosphorylation of DjtA to 4
   XagA phosphor-sets register 4 by 21 to DjtA
   XagA unbinds DjtA!
   XagA lyses itself.
Finished executing XagA.
Starting to execute DjtA.
   DjtA does nothing (BIND)
   DjtA does nothing (BEND)
   DjtA increased phosphorylation to 5
   DjtA increased phosphorylation to 6
   DjtA increased phosphorylation to 7
   DjtA increased phosphorylation to 8
   DjtA increased phosphorylation to 9
   DjtA increased register 8 by 9.
   DjtA lyses itself.
Finished executing DjtA.
Starting to execute CvjA.
   CvjA increased phosphorylation to 1
   CvjA increased phosphorylation to 2
   CvjA increased register 52 by 2.
   CvjA does nothing (NULL)
   CvjA does nothing (BIND)
   CvjA does nothing (BEND)
   CvjA does nothing (FOOO)
   CvjA does nothing (FOOO)
   CvjA does nothing (BEND)
   CvjA decreased phosphorylation to 1
   CvjA lyses itself.
Finished executing CvjA.
Protein array empty. Stopping.
Final chemical memory state:
2  0  0  0 21  0  0  0 
9  0  0  0  0  1  0  0
0  0  0  0  0  3  0  0
0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
0  0  0  0  0  0  0  0
0  0  0  0  2  0  0  0
0  0  0  0  0  0  0  0

Example Programs

The One In The Sample

Not terribly exciting, but at least it makes numbers appear. If put on the spot, I guess you could say it sets chem[8] = (0+4)+5 = 9 and futzes around with chem[4].
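
The chem[21], chem[13] and chem[8] entries in the sample session's final memory state can be replayed by hand with a small model of the state. Everything named here (`Protein`, `target`, `phos`, `deph`, `meta`) is made up for this sketch and is not the interpreter's code. Only PHOS, DEPH and META are modelled, with binding reduced to setting a field, and JESU, TRAN and LYSE are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str
    phosphorylation: int = 0
    bound: Optional["Protein"] = None

    def target(self) -> "Protein":
        # State-changing operations act on the bound protein if there is one.
        return self.bound if self.bound is not None else self


chem = [0] * 64  # the global chemical array


def phos(p):
    p.target().phosphorylation += 1


def deph(p):
    p.target().phosphorylation -= 1


def meta(p, n):
    chem[n] += p.target().phosphorylation


# XhiA: PHOS PHOS PHOS META 21
xhi = Protein("XhiA")
for _ in range(3):
    phos(xhi)
meta(xhi, 21)

# XagA: PHOS META 13 DEPH, then four PHOS while bound to DjtA
xag, djt = Protein("XagA"), Protein("DjtA")
phos(xag)
meta(xag, 13)
deph(xag)
xag.bound = djt
for _ in range(4):
    phos(xag)
xag.bound = None

# DjtA: five PHOS of its own, then META 8
for _ in range(5):
    phos(djt)
meta(djt, 8)

print(chem[21], chem[13], chem[8])  # 3 1 9
```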

bettergenome.dna

AAACCCCCCAAATAATATGAGATAGCGCCCCCCCCCTACTTTTTTCAGGGCAAATAATATGAGATAGCGCCCTACACTGGGACTTTTACTAAAAATTAACATGATATACC
CCCCCCCCCCACTTTTAATCAGGGCAAAACTCCCTAATATGAGATAGCGTAACATGATATACCCCCCCCCCCCCCCTACAGACAGGGCAAATAATATGAGATAGCGCCCCC
CTACCTAAAATAAATTTTTTTTATAGGGCAGGGC

kadmon.rna

TRAN BIND *tatgag BEND BACK NULL LYSE

Null

null.dna

this genome is A useful one for debugging purposes - it does Absolutely nothing at All.

Useful, sure, but also enlightening! The interpreter ignores all characters other than A, G, C and T, so this is equivalent to

AAA

The Interpreter

The interpreter is written poorly in Python, and lives at MONOD/interpreterv1 for want of anywhere else to put it. It goes into py2exe quite nicely, and a py2exe build is available.

Command line options

-s -- prints ASCII equivalents of chemical array numbers next to it at the end
-e -- prints all messages (rather than some) in English.
-h -- is supposed to display help
-t -- mode for converting .rna (easier to write) files into .dna files
-n -- prints off the list of triplet-equivalents for numbers 0 to 64
-x foo.dna bar.rna -- main functionality
-c -- inputs text and spits out a DNA string version - useful for meaningful
      binding site names
-f -- old school version of -x, asks for filenames and waits for input

If none of -h, -t, -n, -x, -c, -f are given, then the interpreter reverts to debug mode, and executes the genome and Kadmon protein hardcoded into it.