User:Hakerh400/Prefix-free serialization
Prefix-free serialization is a process of converting a complex data structure to a finite sequence of bits, such that no sequence is a prefix of another sequence and each sequence is a serialization of a valid data structure. Similarly, prefix-free deserialization is a reversed process. By definition, the process must be reversible.
Since all sequences of bits are finite and prefix-free, we may also talk about them as infinite sequences having the property that in each sequence after some finite number of bits there are infinitely many zeros (or ones, or random bits, etc). For each sequence we can determinte its length, so after the last bit whatever appears can be ignored, therefore having infinitely many zeros at the end is not a problem.
There are different ways of serializing some data structures in a prefix-free manner, but in this article we explain the most common algorithms. We start from primitive values (booleans, integers, strings, etc) and we refer to them when providing examples for more complex data structures.
When we say in this article that an algorithm is recommended or not recommended we mean that programming languages designed by User:Hakerh400 follow the recommended algorithms, while non-recommended ones would work, but we do not use them in the referenced languages.
Terminology
In this article we only explain the process of serialization. The reversed process (deserialization) can be constructed by reversing the algorithms provided.
At the beginning of a serialization process, the sequence of bits is empty. There is a single method writeBit(bit)
that takes a bit as argument an appends it to the end of the sequence. For example
writeBit(1) writeBit(0) writeBit(1) writeBit(1)
results in the following sequence
1011
Simple data structures
Fixed-sized integer
If we want to serialize uint8_t
, or uint16_t
, or uint32_t
, or any other unsigned integer that is represented on a fixed number of bits, we simply write all of its bits, but it is recommended to write them from lowest to highest bit. For example, value 0x37
of type uint8_t
will be serialized to the following sequence of bits
11101100
because 0x37
is 110111
in binary, but we serialize it on exactly 8
bits, so we append two extra zeros at the end. This method is called writeFixedSizedInt(value, numberOfBits)
. For the example above, we would use
writeFixedSizedInt(0x37, 8)
Varying-sized upper-bound integer
By fixed-sized integers we mean unsigned integers that have maximal value of 2n-1
for some integral number n
. Varying-sized upper-bound integers are unsigned integers that may have any (but fixed) value as upper bound. Method for writing such integers is called simply write(value, max)
.
For example, if we want to serialize a letter of alphabet, we could use writeFixedSizedInt(letter, 5)
, but then we would have a problem: some of the values between 26
and 32
are illegal. If they appear in the serialized sequence, we would not be able to deserialize it, but we want the process of serialization to be a bijection. Or another example: if we want to serialize a digit, we could use writeFixedSizedInt(digit, 4)
, but again, values between 10
and 16
are illegal.
The solution is to use the write
method. For letters, it would be write(letter, 25)
and for digits it would be write(digit, 9)
. Note that writeFixedSizedInt
may be a special case of write
. In fact, if writeFixedSizedInt
serializes bits from highest to lowest (not recommended), then it can be implemented like this:
writeFixedSizedInt(value, numberOfBits){ max = 2 ^ numberOfBits - 1 write(value, max) }
In the pseudo-code above, ^
stands for exponentiation.
While writeFixedSizedInt
is trivial to implement, it may not be the case for write
. Here is how the serialization actually works. Unlike writeFixedSizedInt
, here we serialize bits from highest to lowest.
Suppose we want to serialize digit 8
- it is write(8, 9)
. First, we write down max
and value
in binary:
| V max: 1001 value: 1000 serialized:
Arrow represent the current bit. The rule is simple: for each bit of the value, if so far all bits of max
and value
were equal (including the current bit) and the current bit of max
is 0
, then skip the current bit of value
, otherwise write the current bit of the value
to the output sequence. The current (first) bit is 1
both in max
and value
, so we write it to the output sequence:
| V max: 1001 value: 1000 serialized: 1
Now, the current bit is 0
both in max
and value
, but since all bits so far were equal (including the current bit) and the current bit is 0
, we skip it and proceed to the next bit:
| V max: 1001 value: 1000 serialized: 1
Similar case here. All bits so far are equal and the current bit is 0
, so skip it:
| V max: 1001 value: 1000 serialized: 1
Now the bits are not equal, so we definitely write the current bit of the value
to the output sequence, obtaining the final output:
| V max: 1001 value: 1000 serialized: 10
As we can see, digit 8
is serialized to 10
. For reference, here are serializations of all digits:
0 -> 0000 1 -> 0001 2 -> 0010 3 -> 0011 4 -> 0100 5 -> 0101 6 -> 0110 7 -> 0111 8 -> 10 9 -> 11
If value
has less bytes than max
, we prepend zeros. For example, the start of serialization for digit 1
would be:
| V max: 1001 value: 0001 serialized:
The term varying-sized comes from the fact that the number of bits required for serializing an integer of fixed maximal length may depend on the integer itself.
Note: If max
is 1
, then this function is identical to writeBit
. If max
is 0
, this function has no effects. If value
is larger than max
, it is an error.
Having max
equal to 0
may sometimes be useful to template array elements. In process of deserialization, read(max)
when max
is 0
also has no effects.
Arbitrary-sized non-negative integer
There are various techniques for serializing an arbitrary-large integer, but we recommend the following algorithm. Given an integer, we increment it by 1
, then write it in binary (highest to lowest bit), remove the first bit (which must be 1
), then reverse bits, before each bit prepend 1
and after all bits append 0
. We call this method writeInt(value)
.
Lets serialize number 0
. First, we increment it by 1
, obtaining 1
. Now we write it in binary - its simply 1
. Then remove the first bit - we are left with no bits. Then prepend 1
before each bit - nothing to be done, we have no bits. Then append 0
- so now we have 0
. Therefore number 0
is serialized to a single bit 0
.
Lets now serialize number 1
. First, we increment it by 1
, obtaining 2
. Now we write it in binary - its 10
. Then remove the first bit - now we have 0
. Then prepend 1
before each bit - we obtain 10
. Then append 0
- so now we have 100
. Therefore number 1
is serialized to 100
.
Similarly, you can do it for other integers, here are some examples:
0 -> 0 1 -> 100 2 -> 110 3 -> 10100 4 -> 11100 5 -> 10110 6 -> 11110 7 -> 1010100 8 -> 1110100 9 -> 1011100 10 -> 1111100 11 -> 1010110 ...
This algorithm is trivial to reverse. As a reference, language Bitwise Trance reads addresses from memory by applying process of deserialization using the reverse of this algorithm.
Arbitrary-sized signed integer
This method is called writeIntSign(value)
. It is similar to writeInt
, except there is an extra sign bit (0 - negative, 1 - positive), which is added after the last bit if the first bit of the serialized sequence is 1
. Examples:
0 -> 0 1 -> 1001 -1 -> 1000 2 -> 1101 -2 -> 1100 3 -> 101001 -3 -> 101000 ...
Byte
Bytes are integers in range 0
to 255
, so they can be serialized either using writeFixedSizedInt(byte, 8)
or write(byte, 255)
. We recommend the latter.
Array
Method is writeArr(array)
. Before each element of the array we prepend 1
, then serialize elements and after the last element append 0
. Note that in the process of deserialization we must know how each element should be deserialized, so arrays are recommended to contain elements of same type (serialized in the same way).
Another way is to firstly serialize the length of the array using writeInt
, then serialize all elements without prepending 1
s and appending 0
. This method is not recommended.
Using the former method, serializing an array of digits (see digit serialization above) containing three elements [8, 5, 9]
would result in the following sequence:
Digit 8 serialized to 10 | | Digit 5 serialized to 0101 | | | | Digit 9 serialized to 11 | | | V V V -- ---- -- 110101011110 - - - - ^ ^ ^ ^ | | | | Bit before the first element | | | Bit before the second element | | Bit before the third element | Bit after the last element
Sorted array
If we know than an array is sorted, we can serialize the first element using write
or writeInt
and then for the remaining elements serialize only the difference between the current element and the previous one (or invert the sign in case of non-increasing array). Method for this is writeArrSorted(array)
.
String
By strings we mean arrays of bytes (and UTF-8 strings can also be represented as arrays of bytes). So we serialize a string by prepending 1
before each byte, then serialize each byte using write(byte, 255)
and append bit 0
at the end. This method is called writeStr(string)
.
Advanced data structures
We have seen how to serialize some of the fundamental data structures (may also be called primitives). In this chapter we explain how to serialize complex data structures involving nodes and references.
Directed graph
writeGraph(root)
This method takes a root node as argument and serializes all nodes that can be reached by following pointers, starting from the root node. It can be realized using a stack (LIFO), or a queue (FIFO) structure. We recommend using a queue. Here is a pseudo-code:
writeGraph(root){ seen = [] queue = [root] while(queue is not empty){ node = pop the first element from the queue if(node is in seen){ writeBit(0) write(index of node in seen, length of seen - 1) }else{ if(seen is not empty){ writeBit(1) } push node to seen serialize node content writeInt(number of node pointers) foreach(pointer ptr of node pointers){ push ptr to the queue } } } }
Node content is something that the node itself contains, excluding the pointers to other nodes (for example, a number value). Nodes may not have any content - in that case, simply skip serializing node content.
Here is an example of a graph we want to serialize:
+---+ 0 +---+ | 2 |--------->| 6 | +---+ +---+ | ^ | | | 1 | | | V | +---+ 1 | +---| 5 |------------+ | +---+ | ^ | | +-----+ 0
There are three nodes and each node has an arbitrary-sized non-negative integer as its content. Node with number 2
has two pointers (labeled as 0
and 1
), node with number 5
also has two pointers, while node with number 6
has no pointers. Let 2
be the root node. Here is how we serialize the graph:
iteration: 0 seen: [] queue: [2] serialized: iteration: 1 seen: [2] queue: [6, 5] serialized: 110110 iteration: 2 seen: [2, 6] queue: [5] serialized: 1101101111100 iteration: 3 seen: [2, 6, 5] queue: [5, 6] serialized: 1101101111100110110110 iteration: 4 seen: [2, 6, 5] queue: [6] serialized: 110110111110011011011001 iteration: 5 seen: [2, 6, 5] queue: [] serialized: 110110111110011011011001001
The final serialized graph is 110110111110011011011001001
.