NucleoCode

NucleoCode is a language designed to mimic the behaviour of a DNA double helix during transcription. Program length is limited only by the memory of the machine running it, so a program could be many times longer than the human genome. A NucleoCode program has three representations: two parallel strings of nucleotides; amino acids (with non-coding DNA still written as nucleotides); and Code.

Terminology
In NucleoCode, the program is referred to as the genome. Strictly speaking, it represents only a single double helix, but only one helix can run at any one time. The genome is subdivided into genes, which are executable code, and non-coding DNA, which contains metadata and control flow instructions. Genes are divided into codons, each 3 nucleotides long, which refer to amino acids, the instruction set of NucleoCode. Finally, both codons and non-coding DNA are made up of the 4 nucleotides A, C, T, and G.

Instruction Set
NucleoCode has a total of 20 amino acids, which correspond to the 20 amino acids used by living organisms. These can also be read as a base 20 number system when used as a numeric argument. When read as instructions, however, they can be subdivided into 8 'immutable' amino acids representing certain universal instructions and 12 'mutable' amino acids whose meanings depend on metadata assigned to the gene they occur in. The 8 immutable amino acids are given below (abbreviations follow IUPAC) in both amino acid and Code format:

M, goto   : Jumps to the named gene
W, return : Returns to just after the last goto
D, (      : Opens an expression
E, )      : Closes an expression
Q, var    : References the value of the named variable
I, =      : Assigns the value of the expression to the named variable
H, in     : Reads in either a number or a Unicode string, splits it, and stores the results in the named variables
K, out    : Splits the values of the arguments, then outputs them as either numbers or a Unicode string

The 12 mutable amino acids are N, T, R, S, P, L, A, G, V, Y, C, F.

There are two additional codons, named O and B, which are used exclusively as numbers. M is also used to begin a gene, and the final codon, X, is used to end a gene. O, B, and X correspond to the three STOP codons, Opal, Amber, and Ochre.
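The mapping from nucleotide notation to amino acid notation can be sketched in Python as below. This is only an illustration, not a reference implementation: it assumes NucleoCode follows the standard biological genetic code, and that O, B, and X map to Opal (TGA), Amber (TAG), and Ochre (TAA) in that order. The names CODON_TABLE and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (first base varies slowest).
# '*' marks the three stop codons, which are replaced further down.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}

# Assumption: O = Opal (TGA), B = Amber (TAG), X = Ochre (TAA).
CODON_TABLE.update({"TGA": "O", "TAG": "B", "TAA": "X"})


def translate(nucleotides: str) -> str:
    """Translate a nucleotide string into one-letter codes, codon by codon.

    Trailing nucleotides that do not fill a whole codon are ignored.
    """
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, usable, 3)
    )


print(translate("ATGTGGTAA"))  # ATG -> M, TGG -> W, TAA -> X
```

Under these assumptions, the variable-reference sequences CAA and CAG both translate to Q, which matches the var instruction listed above.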

The Genome
The genome consists of two strands, labelled 5' and 3'. Execution can switch between the two strands, and relative location is preserved. Switching may only occur in non-coding DNA, and is handled by the nucleotide sequence ACGT in the 5' strand, and therefore TGCA in the 3' strand. In nucleotide notation, both strands are written out, and each line must start with 5': for the 5' strand or 3': for the 3' strand. The genome can be written over multiple lines, each marked with the strand it belongs to, and the lines must alternate 5', 3', 5', 3', etc. In the other notations, switching is instead marked by writing the label of the other strand.
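As a rough illustration of the two-strand layout, the Python sketch below derives the 3' strand from the 5' strand, locates the switch sequence, and writes the genome as alternating lines. It assumes the 3' strand is the position-by-position complement of the 5' strand (as the ACGT / TGCA pairing suggests) and that a single space follows the strand label; the function names and the line width are hypothetical, and the check that a switch site actually lies in non-coding DNA is left out.

```python
# Base pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def three_prime(five_prime: str) -> str:
    """Return the 3' strand as the position-wise complement of the 5' strand."""
    return five_prime.translate(COMPLEMENT)


def switch_sites(five_prime: str) -> list[int]:
    """Return every index at which ACGT occurs in the 5' strand."""
    sites = []
    start = five_prime.find("ACGT")
    while start != -1:
        sites.append(start)
        start = five_prime.find("ACGT", start + 1)
    return sites


def format_genome(five_prime: str, width: int = 60) -> list[str]:
    """Split the genome into alternating 5' and 3' lines of at most `width` bases."""
    other = three_prime(five_prime)
    lines = []
    for i in range(0, len(five_prime), width):
        lines.append("5': " + five_prime[i:i + width])
        lines.append("3': " + other[i:i + width])
    return lines


for line in format_genome("TTACGTCCACGT", width=8):
    print(line)
```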

The Gene and Non-Coding DNA
All genes begin with the sequence ATG (written [ in both Code and amino acid notation; note that this also corresponds to M) and are ended by any of the stop codons, Ochre, Opal, or Amber (all written ]). Because the start sequence can occur at any frame in the strand, not just a whole number of codons from the beginning, the sequence ATG is disallowed in non-coding DNA. The genes themselves are made up of a series of instructions followed by their arguments. Arguments can themselves be instructions. For example, out;(num;(0));(num;(3)) outputs the number 3. Brackets must be placed around all arguments.

The three nucleotides before the beginning of a gene indicate which instruction set is used within that gene. As of the time of writing, the defined markers are CGT for arithmetic, TTA for logic, and ACT for flow control. Three nucleotides were chosen as this allows for plenty of expansion of the language. These three nucleotides are the minimum that must be present before every gene. Optionally, a 9-nucleotide 'name' can be given before the type marker. This name functions as the goto reference for the gene.

Within non-coding DNA, variables are defined using the sequence CAA or CAG (the same sequences used for referencing variables). The next 9 nucleotides are then the variable name.
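The gene layout described above can be sketched as a simple scanner in Python. This is a minimal sketch under several assumptions: any of TAA, TAG, or TGA read in frame after ATG ends the gene, an unterminated gene stops the scan, and the 9 nucleotides before the type marker are reported only as a candidate name, since the sequence alone does not show whether a name was given. The names find_genes, INSTRUCTION_SETS, and the dictionary keys are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # Ochre, Amber, Opal
INSTRUCTION_SETS = {"CGT": "arithmetic", "TTA": "logic", "ACT": "flow control"}


def find_genes(strand: str) -> list[dict]:
    """Scan one strand and return a description of each gene found."""
    genes = []
    start = strand.find("ATG")
    while start != -1:
        # Read whole codons after ATG until the first stop codon.
        codons, end = [], None
        for i in range(start + 3, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if codon in STOP_CODONS:
                end = i + 3
                break
            codons.append(codon)
        if end is None:
            break  # assumption: an unterminated gene ends the scan

        # The 3 nucleotides before ATG select the instruction set.
        marker = strand[start - 3:start] if start >= 3 else ""
        # The 9 nucleotides before the marker may be the optional name.
        name = strand[start - 12:start - 3] if start >= 12 else None
        genes.append({
            "start": start,
            "end": end,
            "instruction_set": INSTRUCTION_SETS.get(marker),
            "name_candidate": name,
            "codons": codons,
        })
        # Resume after this gene, so ATG codons inside it (goto) are skipped.
        start = strand.find("ATG", end)
    return genes


# Name GGGCCCAAA, marker CGT, then ATG AAA TAA.
print(find_genes("GGGCCCAAACGTATGAAATAACC"))
```

Resuming the search after each stop codon means an in-frame ATG inside a gene is treated as the goto instruction rather than as the start of a new gene.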