User:Hakerh400/Prefix-free serialization

From Esolang
Jump to navigation Jump to search

Prefix-free serialization is a process of converting a complex data structure to a finite sequence of bits, such that no sequence is a prefix of another sequence and each sequence is a serialization of a valid data structure. Similarly, prefix-free deserialization is a reversed process. By definition, the process must be reversible.

Since all sequences of bits are finite and prefix-free, we may also talk about them as infinite sequences having the property that in each sequence after some finite number of bits there are infinitely many zeros (or ones, or random bits, etc). For each sequence we can determinte its length, so after the last bit whatever appears can be ignored, therefore having infinitely many zeros at the end is not a problem.

There are different ways of serializing some data structures in a prefix-free manner, but in this article we explain the most common algorithms. We start from primitive values (booleans, integers, strings, etc) and we refer to them when providing examples for more complex data structures.

When we say in this article that an algorithm is recommended or not recommended we mean that programming languages designed by User:Hakerh400 follow the recommended algorithms, while non-recommended ones would work, but we do not use them in the referenced languages.

Terminology

In this article we only explain the process of serialization. The reversed process (deserialization) can be constructed by reversing the algorithms provided.

At the beginning of a serialization process, the sequence of bits is empty. There is a single method writeBit(bit) that takes a bit as argument an appends it to the end of the sequence. For example

writeBit(1)
writeBit(0)
writeBit(1)
writeBit(1)

results in the following sequence

1011

Simple data structures

Fixed-sized integer

If we want to serialize uint8_t, or uint16_t, or uint32_t, or any other unsigned integer that is represented on a fixed number of bits, we simply write all of its bits, but it is recommended to write them from lowest to highest bit. For example, value 0x37 of type uint8_t will be serialized to the following sequence of bits

11101100

because 0x37 is 110111 in binary, but we serialize it on exactly 8 bits, so we append two extra zeros at the end. This method is called writeFixedSizedInt(value, numberOfBits). For the example above, we would use

writeFixedSizedInt(0x37, 8)

Varying-sized upper-bound integer

By fixed-sized integers we mean unsigned integers that have maximal value of 2n-1 for some integral number n. Varying-sized upper-bound integers are unsigned integers that may have any (but fixed) value as upper bound. Method for writing such integers is called simply write(value, max).

For example, if we want to serialize a letter of alphabet, we could use writeFixedSizedInt(letter, 5), but then we would have a problem: some of the values between 26 and 32 are illegal. If they appear in the serialized sequence, we would not be able to deserialize it, but we want the process of serialization to be a bijection. Or another example: if we want to serialize a digit, we could use writeFixedSizedInt(digit, 4), but again, values between 10 and 16 are illegal.

The solution is to use the write method. For letters, it would be write(letter, 25) and for digits it would be write(digit, 9). Note that writeFixedSizedInt may be a special case of write. In fact, if writeFixedSizedInt serializes bits from highest to lowest (not recommended), then it can be implemented like this:

writeFixedSizedInt(value, numberOfBits){
  max = 2 ^ numberOfBits - 1
  write(value, max)
}

In the pseudo-code above, ^ stands for exponentiation.

While writeFixedSizedInt is trivial to implement, it may not be the case for write. Here is how the serialization actually works. Unlike writeFixedSizedInt, here we serialize bits from highest to lowest.

Suppose we want to serialize digit 8 - it is write(8, 9). First, we write down max and value in binary:

            |
            V
max:        1001
value:      1000
serialized:

Arrow represent the current bit. The rule is simple: for each bit of the value, if so far all bits of max and value were equal (including the current bit) and the current bit of max is 0, then skip the current bit of value, otherwise write the current bit of the value to the output sequence. The current (first) bit is 1 both in max and value, so we write it to the output sequence:

             |
             V
max:        1001
value:      1000
serialized: 1

Now, the current bit is 0 both in max and value, but since all bits so far were equal (including the current bit) and the current bit is 0, we skip it and proceed to the next bit:

              |
              V
max:        1001
value:      1000
serialized: 1

Similar case here. All bits so far are equal and the current bit is 0, so skip it:

               |
               V
max:        1001
value:      1000
serialized: 1

Now the bits are not equal, so we definitely write the current bit of the value to the output sequence, obtaining the final output:

                |
                V
max:        1001
value:      1000
serialized: 10

As we can see, digit 8 is serialized to 10. For reference, here are serializations of all digits:

0 -> 0000
1 -> 0001
2 -> 0010
3 -> 0011
4 -> 0100
5 -> 0101
6 -> 0110
7 -> 0111
8 -> 10
9 -> 11

If value has less bytes than max, we prepend zeros. For example, the start of serialization for digit 1 would be:

            |
            V
max:        1001
value:      0001
serialized:

The term varying-sized comes from the fact that the number of bits required for serializing an integer of fixed maximal length may depend on the integer itself.

Note: If max is 1, then this function is identical to writeBit. If max is 0, this function has no effects. If value is larger than max, it is an error.

Having max equal to 0 may sometimes be useful to template array elements. In process of deserialization, read(max) when max is 0 also has no effects.

Arbitrary-sized non-negative integer

There are various techniques for serializing an arbitrary-large integer, but we recommend the following algorithm. Given an integer, we increment it by 1, then write it in binary (highest to lowest bit), remove the first bit (which must be 1), then reverse bits, before each bit prepend 1 and after all bits append 0. We call this method writeInt(value).

Lets serialize number 0. First, we increment it by 1, obtaining 1. Now we write it in binary - its simply 1. Then remove the first bit - we are left with no bits. Then prepend 1 before each bit - nothing to be done, we have no bits. Then append 0 - so now we have 0. Therefore number 0 is serialized to a single bit 0.

Lets now serialize number 1. First, we increment it by 1, obtaining 2. Now we write it in binary - its 10. Then remove the first bit - now we have 0. Then prepend 1 before each bit - we obtain 10. Then append 0 - so now we have 100. Therefore number 1 is serialized to 100.

Similarly, you can do it for other integers, here are some examples:

0  -> 0
1  -> 100
2  -> 110
3  -> 10100
4  -> 11100
5  -> 10110
6  -> 11110
7  -> 1010100
8  -> 1110100
9  -> 1011100
10 -> 1111100
11 -> 1010110
...

This algorithm is trivial to reverse. As a reference, language Bitwise Trance reads addresses from memory by applying process of deserialization using the reverse of this algorithm.

Arbitrary-sized signed integer

This method is called writeIntSign(value). It is similar to writeInt, except there is an extra sign bit (0 - negative, 1 - positive), which is added after the last bit if the first bit of the serialized sequence is 1. Examples:

0  -> 0
1  -> 1001
-1 -> 1000
2  -> 1101
-2 -> 1100
3  -> 101001
-3 -> 101000
...

Byte

Bytes are integers in range 0 to 255, so they can be serialized either using writeFixedSizedInt(byte, 8) or write(byte, 255). We recommend the latter.

Array

Method is writeArr(array). Before each element of the array we prepend 1, then serialize elements and after the last element append 0. Note that in the process of deserialization we must know how each element should be deserialized, so arrays are recommended to contain elements of same type (serialized in the same way).

Another way is to firstly serialize the length of the array using writeInt, then serialize all elements without prepending 1s and appending 0. This method is not recommended.

Using the former method, serializing an array of digits (see digit serialization above) containing three elements [8, 5, 9] would result in the following sequence:

 Digit 8 serialized to 10
 |
 |  Digit 5 serialized to 0101
 |  |
 |  |    Digit 9 serialized to 11
 |  |    |
 V  V    V
 -- ---- --
110101011110
-  -    -  -
^  ^    ^  ^
|  |    |  |
Bit before the first element
   |    |  |
   Bit before the second element
        |  |
        Bit before the third element
           |
           Bit after the last element

Sorted array

If we know than an array is sorted, we can serialize the first element using write or writeInt and then for the remaining elements serialize only the difference between the current element and the previous one (or invert the sign in case of non-increasing array). Method for this is writeArrSorted(array).

String

By strings we mean arrays of bytes (and UTF-8 strings can also be represented as arrays of bytes). So we serialize a string by prepending 1 before each byte, then serialize each byte using write(byte, 255) and append bit 0 at the end. This method is called writeStr(string).

Advanced data structures

We have seen how to serialize some of the fundamental data structures (may also be called primitives). In this chapter we explain how to serialize complex data structures involving nodes and references.

Directed graph

writeGraph(root)

This method takes a root node as argument and serializes all nodes that can be reached by following pointers, starting from the root node. It can be realized using a stack (LIFO), or a queue (FIFO) structure. We recommend using a queue. Here is a pseudo-code:

writeGraph(root){
  seen = []
  queue = [root]

  while(queue is not empty){
    node = pop the first element from the queue

    if(node is in seen){
      writeBit(0)
      write(index of node in seen, length of seen - 1)
    }else{
      if(seen is not empty){
        writeBit(1)
      }

      push node to seen
      serialize node content
      writeInt(number of node pointers)

      foreach(pointer ptr of node pointers){
        push ptr to the queue
      }
    }
  }
}

Node content is something that the node itself contains, excluding the pointers to other nodes (for example, a number value). Nodes may not have any content - in that case, simply skip serializing node content.

Here is an example of a graph we want to serialize:

    +---+    0     +---+
    | 2 |--------->| 6 |
    +---+          +---+
      |              ^
      |              |
      | 1            |
      |              |
      V              |
    +---+     1      |
+---| 5 |------------+
|   +---+
|     ^
|     |
+-----+
   0

There are three nodes and each node has an arbitrary-sized non-negative integer as its content. Node with number 2 has two pointers (labeled as 0 and 1), node with number 5 also has two pointers, while node with number 6 has no pointers. Let 2 be the root node. Here is how we serialize the graph:

iteration:  0
seen:       []
queue:      [2]
serialized: 

iteration:  1
seen:       [2]
queue:      [6, 5]
serialized: 110110

iteration:  2
seen:       [2, 6]
queue:      [5]
serialized: 1101101111100

iteration:  3
seen:       [2, 6, 5]
queue:      [5, 6]
serialized: 1101101111100110110110

iteration:  4
seen:       [2, 6, 5]
queue:      [6]
serialized: 110110111110011011011001

iteration:  5
seen:       [2, 6, 5]
queue:      []
serialized: 110110111110011011011001001

The final serialized graph is 110110111110011011011001001.