# User:Hakerh400/Prefix-free serialization

**Prefix-free serialization** is a process of converting a complex data structure to a finite sequence of bits, such that no sequence is a prefix of another sequence and each sequence is a serialization of a valid data structure. Similarly, **prefix-free deserialization** is a reversed process. By definition, the process must be reversible.

Since all sequences of bits are finite and prefix-free, we may also talk about them as infinite sequences having the property that in each sequence after some finite number of bits there are infinitely many zeros (or ones, or random bits, etc). For each sequence we can determinte its length, so after the last bit whatever appears can be ignored, therefore having infinitely many zeros at the end is not a problem.

There are different ways of serializing some data structures in a prefix-free manner, but in this article we explain the most common algorithms. We start from primitive values (booleans, integers, strings, etc) and we refer to them when providing examples for more complex data structures.

When we say in this article that an algorithm is *recommended* or *not recommended* we mean that programming languages designed by User:Hakerh400 follow the recommended algorithms, while non-recommended ones would work, but we do not use them in the referenced languages.

## Terminology

In this article we only explain the process of serialization. The reversed process (deserialization) can be constructed by reversing the algorithms provided.

At the beginning of a serialization process, the sequence of bits is empty. There is a single method `writeBit(bit)`

that takes a bit as argument an appends it to the end of the sequence. For example

writeBit(1) writeBit(0) writeBit(1) writeBit(1)

results in the following sequence

1011

## Simple data structures

### Fixed-sized integer

If we want to serialize `uint8_t`

, or `uint16_t`

, or `uint32_t`

, or any other unsigned integer that is represented on a fixed number of bits, we simply write all of its bits, but it is recommended to write them from lowest to highest bit. For example, value `0x37`

of type `uint8_t`

will be serialized to the following sequence of bits

11101100

because `0x37`

is `110111`

in binary, but we serialize it on exactly `8`

bits, so we append two extra zeros at the end. This method is called `writeFixedSizedInt(value, numberOfBits)`

. For the example above, we would use

writeFixedSizedInt(0x37, 8)

### Varying-sized upper-bound integer

By fixed-sized integers we mean unsigned integers that have maximal value of `2`

for some integral number ^{n}-1`n`

. Varying-sized upper-bound integers are unsigned integers that may have any (but fixed) value as upper bound. Method for writing such integers is called simply `write(value, max)`

.

For example, if we want to serialize a letter of alphabet, we could use `writeFixedSizedInt(letter, 5)`

, but then we would have a problem: some of the values between `26`

and `32`

are illegal. If they appear in the serialized sequence, we would not be able to deserialize it, but we want the process of serialization to be a bijection. Or another example: if we want to serialize a digit, we could use `writeFixedSizedInt(digit, 4)`

, but again, values between `10`

and `16`

are illegal.

The solution is to use the `write`

method. For letters, it would be `write(letter, 25)`

and for digits it would be `write(digit, 9)`

. Note that `writeFixedSizedInt`

may be a special case of `write`

. In fact, if `writeFixedSizedInt`

serializes bits from highest to lowest (not recommended), then it can be implemented like this:

writeFixedSizedInt(value, numberOfBits){ max = 2 ^ numberOfBits - 1 write(value, max) }

In the pseudo-code above, `^`

stands for exponentiation.

While `writeFixedSizedInt`

is trivial to implement, it may not be the case for `write`

. Here is how the serialization actually works. Unlike `writeFixedSizedInt`

, here we serialize bits from highest to lowest.

Suppose we want to serialize digit `8`

- it is `write(8, 9)`

. First, we write down `max`

and `value`

in binary:

| V max: 1001 value: 1000 serialized:

Arrow represent the current bit. The rule is simple: for each bit of the value, if so far all bits of `max`

and `value`

were equal (including the current bit) and the current bit of `max`

is `0`

, then skip the current bit of `value`

, otherwise write the current bit of the `value`

to the output sequence. The current (first) bit is `1`

both in `max`

and `value`

, so we write it to the output sequence:

| V max: 1001 value: 1000 serialized: 1

Now, the current bit is `0`

both in `max`

and `value`

, but since all bits so far were equal (including the current bit) and the current bit is `0`

, we skip it and proceed to the next bit:

| V max: 1001 value: 1000 serialized: 1

Similar case here. All bits so far are equal and the current bit is `0`

, so skip it:

| V max: 1001 value: 1000 serialized: 1

Now the bits are not equal, so we definitely write the current bit of the `value`

to the output sequence, obtaining the final output:

| V max: 1001 value: 1000 serialized: 10

As we can see, digit `8`

is serialized to `10`

. For reference, here are serializations of all digits:

0 -> 0000 1 -> 0001 2 -> 0010 3 -> 0011 4 -> 0100 5 -> 0101 6 -> 0110 7 -> 0111 8 -> 10 9 -> 11

If `value`

has less bytes than `max`

, we prepend zeros. For example, the start of serialization for digit `1`

would be:

| V max: 1001 value: 0001 serialized:

The term varying-sized comes from the fact that the number of bits required for serializing an integer of fixed maximal length may depend on the integer itself.

**Note:** If `max`

is `1`

, then this function is identical to `writeBit`

. If `max`

is `0`

, this function has no effects. If `value`

is larger than `max`

, it is an error.

Having `max`

equal to `0`

may sometimes be useful to template array elements. In process of deserialization, `read(max)`

when `max`

is `0`

also has no effects.

### Arbitrary-sized non-negative integer

There are various techniques for serializing an arbitrary-large integer, but we recommend the following algorithm. Given an integer, we increment it by `1`

, then write it in binary (highest to lowest bit), remove the first bit (which must be `1`

), then reverse bits, before each bit prepend `1`

and after all bits append `0`

. We call this method `writeInt(value)`

.

Lets serialize number `0`

. First, we increment it by `1`

, obtaining `1`

. Now we write it in binary - its simply `1`

. Then remove the first bit - we are left with no bits. Then prepend `1`

before each bit - nothing to be done, we have no bits. Then append `0`

- so now we have `0`

. Therefore number `0`

is serialized to a single bit `0`

.

Lets now serialize number `1`

. First, we increment it by `1`

, obtaining `2`

. Now we write it in binary - its `10`

. Then remove the first bit - now we have `0`

. Then prepend `1`

before each bit - we obtain `10`

. Then append `0`

- so now we have `100`

. Therefore number `1`

is serialized to `100`

.

Similarly, you can do it for other integers, here are some examples:

0 -> 0 1 -> 100 2 -> 110 3 -> 10100 4 -> 11100 5 -> 10110 6 -> 11110 7 -> 1010100 8 -> 1110100 9 -> 1011100 10 -> 1111100 11 -> 1010110 ...

This algorithm is trivial to reverse. As a reference, language Bitwise Trance reads addresses from memory by applying process of deserialization using the reverse of this algorithm.

### Arbitrary-sized signed integer

This method is called `writeIntSign(value)`

. It is similar to `writeInt`

, except there is an extra sign bit (0 - negative, 1 - positive), which is added after the last bit if the first bit of the serialized sequence is `1`

. Examples:

0 -> 0 1 -> 1001 -1 -> 1000 2 -> 1101 -2 -> 1100 3 -> 101001 -3 -> 101000 ...

### Byte

Bytes are integers in range `0`

to `255`

, so they can be serialized either using `writeFixedSizedInt(byte, 8)`

or `write(byte, 255)`

. We recommend the latter.

### Array

Method is `writeArr(array)`

. Before each element of the array we prepend `1`

, then serialize elements and after the last element append `0`

. Note that in the process of deserialization we must know how each element should be deserialized, so arrays are recommended to contain elements of same type (serialized in the same way).

Another way is to firstly serialize the length of the array using `writeInt`

, then serialize all elements without prepending `1`

s and appending `0`

. This method is not recommended.

Using the former method, serializing an array of digits (see digit serialization above) containing three elements `[8, 5, 9]`

would result in the following sequence:

Digit 8 serialized to 10 | | Digit 5 serialized to 0101 | | | | Digit 9 serialized to 11 | | | V V V -- ---- -- 110101011110 - - - - ^ ^ ^ ^ | | | | Bit before the first element | | | Bit before the second element | | Bit before the third element | Bit after the last element

### Sorted array

If we know than an array is sorted, we can serialize the first element using `write`

or `writeInt`

and then for the remaining elements serialize only the difference between the current element and the previous one (or invert the sign in case of non-increasing array). Method for this is `writeArrSorted(array)`

.

### String

By strings we mean arrays of bytes (and UTF-8 strings can also be represented as arrays of bytes). So we serialize a string by prepending `1`

before each byte, then serialize each byte using `write(byte, 255)`

and append bit `0`

at the end. This method is called `writeStr(string)`

.

## Advanced data structures

We have seen how to serialize some of the fundamental data structures (may also be called primitives). In this chapter we explain how to serialize complex data structures involving nodes and references.

### Directed graph

`writeGraph(root)`

This method takes a root node as argument and serializes all nodes that can be reached by following pointers, starting from the root node. It can be realized using a stack (LIFO), or a queue (FIFO) structure. We recommend using a queue. Here is a pseudo-code:

writeGraph(root){ seen = [] queue = [root] while(queue is not empty){ node = pop the first element from the queue if(node is in seen){ writeBit(0) write(index of node in seen, length of seen - 1) }else{ if(seen is not empty){ writeBit(1) } push node to seen serialize node content writeInt(number of node pointers) foreach(pointer ptr of node pointers){ push ptr to the queue } } } }

Node content is something that the node itself contains, excluding the pointers to other nodes (for example, a number value). Nodes may not have any content - in that case, simply skip serializing node content.

Here is an example of a graph we want to serialize:

+---+ 0 +---+ | 2 |--------->| 6 | +---+ +---+ | ^ | | | 1 | | | V | +---+ 1 | +---| 5 |------------+ | +---+ | ^ | | +-----+ 0

There are three nodes and each node has an arbitrary-sized non-negative integer as its content. Node with number `2`

has two pointers (labeled as `0`

and `1`

), node with number `5`

also has two pointers, while node with number `6`

has no pointers. Let `2`

be the root node. Here is how we serialize the graph:

iteration: 0 seen: [] queue: [2] serialized: iteration: 1 seen: [2] queue: [6, 5] serialized: 110110 iteration: 2 seen: [2, 6] queue: [5] serialized: 1101101111100 iteration: 3 seen: [2, 6, 5] queue: [5, 6] serialized: 1101101111100110110110 iteration: 4 seen: [2, 6, 5] queue: [6] serialized: 110110111110011011011001 iteration: 5 seen: [2, 6, 5] queue: [] serialized: 110110111110011011011001001

The final serialized graph is `110110111110011011011001001`

.