diff --git a/SERIALISATION.md b/SERIALISATION.md index f64ddf1..95970b4 100644 --- a/SERIALISATION.md +++ b/SERIALISATION.md @@ -16,32 +16,79 @@ I2C or SPI. All these require the data to be presented as linear sequences of bytes. The problem is how to convert an arbitrary Python object to such a sequence, and how subsequently to restore the object. -I am aware of four ways of achieving this, each with their own advantages and -drawbacks. In two cases the encoded strings comprise ASCII characters, in the -other two they are binary (bytes can take all possible values). +There are numerous standards for achieving this, five of which are readily +available to MicroPython. Each has its own advantages and drawbacks. In two +cases the encoded strings aim to be human readable and comprise ASCII +characters. In the others they comprise binary `bytes` objects where bytes can +take all possible values. The following are the formats with MicroPython +support: 1. ujson (ASCII, official) 2. pickle (ASCII, official) 3. ustruct (binary, official) - 4. protobuf [binary, unofficial](https://github.com/dogtopus/minipb) + 4. MessagePack [binary, unofficial](https://github.com/peterhinch/micropython-msgpack) + 5. protobuf [binary, unofficial](https://github.com/dogtopus/minipb) -The first two are self-describing: the format includes a definition of its +The `ujson` and `pickle` formats produce human-readable byte sequences. These +aid debugging. The use of ASCII data means that a delimiter can be used to +identify the end of a message. This is because it is possible to guarantee that +the delimiter will never occur within a message. A delimiter cannot be used +with binary formats because a message byte can take all possible values +including that of the delimiter. The drawback of ASCII formats is inefficiency: +the byte sequences are relatively long. + +Numbers 1, 2 and 4 are self-describing: the format includes a definition of its structure. This means that the decoding process can re-create the object in the absence of information on its structure, which may therefore change at runtime. -Further, `ujson` and `pickle` produce human-readable byte sequences which aid -debugging. The drawback is inefficiency: the byte sequences are relatively -long. They are variable length. This means that the receiving process must be -provided with a means to determine when a complete string has been received. +Self describing formats inevitably are variable length. This means that the +receiving process must be provided with a means to determine when a complete +message has been received. In the case of ASCII formats a delimiter may be used +but in the case of `MessagePack` this presents something of a challenge. -The `ustruct` and `protobuf` solutions are binary formats: the byte sequences -comprise binary data which is neither human readable nor self-describing. -Binary sequences require that the receiver has information on their structure -in order to decode them. In the case of `ustruct` sequences are of a fixed -length which can be determined from the structure. `protobuf` sequences are -variable length requiring handling discussed below. +The `ustruct` format is binary: the byte sequence comprises binary data which +is neither human readable nor self-describing. The problem of message framing +is solved by hard coding a fixed message structure and length which is known to +transmitter and receiver. In simple cases of fixed format data, `ustruct` +provides a simple, efficient solution. -The benefit of binary sequences is efficiency: sequence length is closer to the -information-theoretic minimum, compared to the ASCII options. +In `protobuf` and `MessagePack` messages are variable length; both can handle +data whose length varies at runtime. `MessagePack` also allows the message +structure to change at runtime. It is also extensible to enable the efficient +coding of additional Python types or instances of user defined classes. + +The `protobuf` standard requires transmitter and receiver to share a schema +which defines the message structure. Message length may change at runtime, but +structure may not. + +## 1.1 Transmission over unreliable links + +Consider a system where a transmitter periodically sends messages to a receiver +over a communication link. An aspect of the message framing problem arises if +that link is unreliable, meaning that bytes may be lost or corrupted in +transit. In the case of ASCII formats with a delimiter the receiver, once it +has detected the problem, can discard characters until the delimiter is +received and then wait for a complete message. + +In the case of binary formats it is generally impossible to re-synchronise to a +continuous stream of data. In the case of regular bursts of data a timeout can +be used. Otherwise "out of band" signalling is required where the receiver +signals the transmitter to request retransmission. + +## 1.2 Concurrency + +In `uasyncio` systems the transmitter presents no problem. A message is created +using synchronous code, then transmitted using asynchronous code typically with +a `StreamWriter`. In the case of ASCII protocols a delimiter - usually `b"\n"` +is appended. + +In the case of ASCII protocols the receiver can use `StreamReader.readline()` +to await a complete message. + +`ustruct` also presents a simple case in that the number of expected bytes is +known to the receiver which simply awaits that number. + +The variable length binary protocols present a difficulty in that the message +length is unknown in advance. A solution is available for `MessagePack`. # 2. ujson and pickle @@ -198,7 +245,47 @@ Output: (11, 22, b'the quick brown fox jumps over') ``` -# 4. Protocol Buffers +# 4. MessagePack + +Of the binary formats this is the easiest to use and can be a "drop in" +replacement for `ujson` as it supports the same four methods `dump`, `dumps`, +`load` and `loads`. An application might initially be developed with `ujson`, +the protocol being changed to `MessagePack` later. Creation of a `MessagePack` +string can be done with: +```python +import umsgpack +obj = [1.23, 2.56, 89000] +msg = umsgpack.dumps(obj) # msg is a bytes object +``` +Retrieval of the object is as follows: +```python +import umsgpack +# Retrieve the message msg +obj = umsgpack.dumps(msg) +``` +An ingenious feature of the standard is its extensibility. This can be used to +add support for additional Python types or user defined classes. This example +shows `complex` data being supported as if it were a native type: +```python +import umsgpack +from umsgpack_ext import mpext +with open('data', 'wb') as f: + umsgpack.dump(mpext(1 + 4j), f) # mpext() handles extension type +``` +Reading back: +```python +import umsgpack +import umsgpack_ext # Decoder only needs access to this module +with open('data', 'rb') as f: + z = umsgpack.load(f) +print(z) # z is complex +``` +Please see [this repo](https://github.com/peterhinch/micropython-msgpack). The +docs include references to the standard and to other implementations. The repo +includes an asynchronous receiver which enables incoming messages to be decoded +as they arrive while allowing other tasks to run concurrently. + +# 5. Protocol Buffers This is a [Google standard](https://developers.google.com/protocol-buffers/) described in [this Wikipedia article](https://en.wikipedia.org/wiki/Protocol_Buffers). @@ -230,7 +317,7 @@ inner `tuple` are strings, with element 0 defining the field's key. Subsequent elements define the field's data type; in most cases the data type is defined by a single string. -## 4.1 Installation +## 5.1 Installation The library comprises a single file `minipb.py`. It has a dependency, the `logging` module `logging.py` which may be found in @@ -238,7 +325,7 @@ The library comprises a single file `minipb.py`. It has a dependency, the On RAM constrained platforms `minipb.py` may be cross-compiled or frozen as bytecode for even lower RAM consumption. -## 4.2 Data types +## 5.2 Data types These are listed in [the docs](https://github.com/dogtopus/minipb/wiki/Schema-Representations). @@ -256,14 +343,14 @@ a subset may be used which maps onto Python data types: other platforms with special firmware builds. 7. 'X' An empty field. -## 4.2.1 Required and Optional fields +## 5.2.1 Required and Optional fields If a field is prefixed with `*` it is a `required` field, otherwise it is optional. The field must still exist in the data: the only difference is that a `required` field cannot be set to `None`. Optional fields can be useful, notably for boolean types which can then represent three states. -## 4.3 Application design +## 5.3 Application design The following is a minimal example which can be pasted at the REPL: ```python @@ -287,7 +374,7 @@ being saved to a binary file, the file will need an index. Where data is to be transmitted over and interface each string should be prepended with a fixed length "size" field. The following example illustrates this. -## 4.4 Transmitter/Receiver example +## 5.4 Transmitter/Receiver example These examples can't be cut and pasted at the REPL as they assume `send(n)` and `receive(n)` functions which access the interface. @@ -329,7 +416,7 @@ while True: # Do something with the received dict ``` -## 4.5 Repeating fields +## 5.5 Repeating fields This feature enables variable length lists to be encoded. List elements must all be of the same (declared) data type. In this example the `value` and `txt` @@ -357,13 +444,13 @@ tx = w.encode(data) rx = w.decode(tx) print(rx) ``` -### 4.5.1 Packed repeating fields +### 5.5.1 Packed repeating fields The author of `minipb` [does not recommend](https://github.com/dogtopus/minipb/issues/6) their use. Their purpose appears to be in the context of fixed-length fields which are outside the scope of pure Python programming. -## 4.6 Message fields (nested dicts) +## 5.6 Message fields (nested dicts) The concept of message fields is a Protocol Buffer notion. In MicroPython terminology a message field contains a `dict` whose contents are defined by @@ -404,7 +491,7 @@ print(rx) print(rx['nested'][2]['str2']) # Access inner dict instances ``` -### 4.6.1 Recursion +### 5.6.1 Recursion This is surely overkill in most MicroPython applications, but for the sake of completeness message fields can be recursive: