User:Conor O'Brien/Essays/Programming in Brainfuck

From Esolang
Jump to navigation Jump to search

Introduction

Brainfuck is a Turing tarpit, meaning it has few commands, but maintains status of "Turing complete", that is, being able to compute anything computable by a Turing machine. This means that it is technically possible to achieve any programming task. However, programming in brainfuck can be infuriating, especially without any sort of infrastructure and external support. Thus, it is often necessary for one to use or implement a debugger for this language. I will define a small extension to brainfuck herein:

An example of a debugger in brainfuck.
New command Effect
# Debug the program's memory. Ideally, this will give the programmer knowledge of all the cells used, the current cell, and where the debugging command is in the code.
@ Increment cell mark. Each cell is "marked" at 0. Upon incrementing this mark, the cell appears differently to the debugger, through aid of color or symbol.

This is my own personal first step to making programming in brainfuck bearable.

Ideal Code layout

I find that the most useful construct for more complicated programs is the following:

- A workspace section - A data section - A marker section

The workspace section would allow one to perform algorithms that require temporary spaces. It should contain only zeroes except for during specific algorithm execution. The data section contains all the information your program would like to store. The marker section allows you to navigate through the various data sections, and are filled with 1s, except for where a data section would end. The way this works is that these sections are "zipped" together. Each cell is followed by a cell from the next section, and is repeated cyclically. Instead of the sections being next to each other, like so:

   W0 W1 W2 W3 ... WN D0 D1 D2 D3 ... DN M0 M1 M2 M3 ... MN

We would organize the tape as such:

   W0 D0 M0 W1 D1 M1 W2 D2 M2 W3 D3 M3 ... WN DN MN ...

This allows for the marker section to be useful, since brainfuck excels at relative data manipulation, since all of the commands are indeed relative. After the structured memory, one can use it freely, perhaps for performing algorithms or printing text.

An example

Let's say you wanted to make the equivalent of the following C code snippet:

uint8_t age = 3;
char name[10] = "Harold";

In conventional brainfuck, this might look like this:

name = Harold:  -[>+<-------]>-<++++++++++++[>>++++++++>+++++++++>+++++++++>+++++++++>++++++++<<<<<<-]>>+>++++++>+++>>++++
age = 3:        <<<<<<+++

Which leaves the code looking like this:

00000:  003  072  097  114  111  108  100  000  000  000  000  000  .Harold.....

However, in our new system, it would be longer, but also cleaner. I will omit the actual code in favor of displaying only the memory:

00000:  000  003  000  000  072  001  000  097  001  000  114  001  ....H..a..r.
00012:  000  111  001  000  108  001  000  100  001  000  000  000  .o..l..d....

We can identify two tangible sections:

00000:  000  003  000  000  072  001  000  097  001  000  114  001  ....H..a..r.
        <--sectn.1-->  <---------------------------------section 2
00012:  000  111  001  000  108  001  000  100  001  000  000  000  .o..l..d....
        --------------------------------------------------------->

Each section ends with 000. Also, the string ends with three 000 cells; this is to show that the string ends with a null byte, so that loops like `[.>>>]` to print out each character (and similar iteration tasks) will terminate without any special logic.

This observation leads me right in to the beauty of this model: it is possible to represent many core types of data. All data is either a single element (such as a number), or a series of elements, whether it is a struct or a string, both rely on contiguous data storage, since a struct is merely named fields which are stored contiguously in memory. For example, struct { int a; int b; } is essentially the same thing as a 2-length int array, since brainfuck has no named constants to speak of.

Thus, all information can be readily represented using this model. As for data manipulation, any algorithm that requires a temporary cell can be performed with ease. However, when comparing cells that lie in different data sections, it is necessary to temporarily deform the given sections such that navigating along the data marker line would result in stopping prematurely at the designated cell. This would physically involve changing the marker cell from 1 to 0, and is simple enough of an act.

On printing text

The Brainfuck text encoder found on copy.sh suffices for programs whose sole purpose is to print text, but besides that, it serves little good to the brainfuck programmer. In this section, I will describe a more useful method for printing text, and a program that would help generate texts.

Motivation

When printing text, it is often desirable to use some small amount of memory. A secondary desire is for the actual printing of the text to be relatively short. No one wants code like ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++.+++++++++++++++++++++++++.--------------------.++++++++., even though it only uses one cell. To restate: it shouldn't be too long, and it shouldn't use too many cells.

As has been said, a common method of generating text is to use the aforementioned copy.sh generator or similar. The copy.sh interpreter comes from this answer on PPCG.se:

In Java, computes a short BF snippet which can convert any number to any other number. Each output byte is generated by either transforming the last output byte or a fresh 0 on the tape.

The snippets are generated in three ways. First by simple repetitions of + and - (e.g. ++++ converts 7 to 11), by combining known snippets (e.g. if A converts 5 to 50 and B converts 50 to 37, then AB converts 5 to 37) and simple multiplications (e.g. [--->+++++<] multiplies the current number by 5/3). The simple multiplications take advantage of wraparound to generate unusual results (e.g. --[------->++<]> generates 36 from 0, where the loop executes 146 times, with a total of 4 descending and 1 ascending wraparounds).

This is a rather interesting method, but I doubt its quality. I have found that this approach excels at printing shorter strings (say, 0 to 14 characters) and sometimes longer strings. Let's call this method the "concatenative method".

Let's for the sake of example we want to read a string into memory from the user, and consequently wish to display the following prompt:

Please enter your name: 

(There's a space after the colon.) This string is 24 bytes.

Using the concatenative method, we have the following:

[--->+<]>-.+[->+++<]>++.+++++++++.++++++.+++[->+++<]>.+++++++++++++.[-->+
++++<]>+++.--[->++++<]>+.----------.++++++.---.[-->+++++<]>+++.+[----->+<
]>+.-------------.++++++++++++.--------.+++[->+++<]>++.[-->+<]>+++.

286 bytes, 14 cells used, and a ratio of 11.9 instructions to encode 1 character.

If we wanted to optimize for cell usage, we could do it 2, but at the cost of 772 bytes, with a ratio of 32.2:

++++++++[>++++++++++++<-]>+-----------------.++++++++++++++++++++++++++++
.-------.----.++++++++++++++++++.--------------.-------------------------
--------------------------------------------.++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++.+++++++++.++++++.--------------
-.+++++++++++++.---------------------------------------------------------
-------------------------.+++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++.----------.++++++.---.--------
-------------------------------------------------------------------------
-.+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++.-------------.++++++++++++.--------.-----------------------------
--------------.--------------------------.

It is possible to beat both of these, in terms of bytes and cells used. Let me describe my own method.

Algorithm

For generating a string S in N cells (N ≥ 2), we will spit the code into N − 1 sections. These sections will be obtained by splitting on indices of maximal delta values in the sorted code points of S. One then finds the integer closest to the mean of each sections, which is called "the core". Then, one encodes the cores as multiples of some number k plus a constant ji. This allows one to use a familiar brainfuck construct. After, each character is then encoded via incrementing or decrementing by some amount the core cell for which the character might have been found in the original sections.

The actually positioning of the cores as achieved through brute forces; for C cores, the program iterates through all C! permutations of the cores, checking for optimal core placement in terms of minimizing code length.

I have implemented this algorithm here, and have found that 3 to 5 cores works best. I have a function called gen_bf_auto that brute forces the optimal core amount, between 1 and 7. (Which would imply cells from 2 to 8.) Using 5 cores to generate the above text uses 195 bytes and 6 cells, and a ratio of about 8.1:

++++++++++++++++[>+++>++++++>+++++++>++>+++++<<<<<-]>++++++++++>+++++>++>
>.<<------.<.----.>+++++++.<++++.>>.<<.>-----.++++++.<.>--.>.<+++++++.---
-------.++++++.---.>.<----.<----.>-.<++++.<.>>>.

There is one final advantage that this algorithm: that it can clear itself with [>]<[[-]<], assuming none of the final few bytes printed was a null character. This will point to the initial cell used, which must be initially empty.

Conclusion

From a ratio of about 11, we have achieved a ratio closer to 8.