SERIALISATION.md: Add CBOR information.

master
Peter Hinch 2025-05-23 17:05:10 +01:00
rodzic d7529baff2
commit fdf4250f0c
1 zmienionych plików z 80 dodań i 66 usunięć

Wyświetl plik

@ -16,20 +16,23 @@ I2C or SPI. All these require the data to be presented as linear sequences of
bytes. The problem is how to convert an arbitrary Python object to such a bytes. The problem is how to convert an arbitrary Python object to such a
sequence, and how subsequently to restore the object. sequence, and how subsequently to restore the object.
There are numerous standards for achieving this, five of which are readily There are numerous standards for achieving this, six of which are readily
available to MicroPython. Each has its own advantages and drawbacks. In two available to MicroPython. Each has its own advantages and drawbacks. In two
cases the encoded strings aim to be human readable and comprise ASCII cases the encoded strings aim to be human readable and comprise ASCII
characters. In the others they comprise binary `bytes` objects where bytes can characters. In the others they comprise binary `bytes` objects where bytes can
take all possible values. The following are the formats with MicroPython take all possible values. The following are the formats with known MicroPython
support: support:
1. ujson (ASCII, official) Self-describing:
1. json [ASCII, official](http://docs.micropython.org/en/latest/library/json.html).
2. pickle (ASCII, official) 2. pickle (ASCII, official)
3. ustruct (binary, official) 3. MessagePack [binary, unofficial](https://github.com/peterhinch/micropython-msgpack)
4. MessagePack [binary, unofficial](https://github.com/peterhinch/micropython-msgpack) 4. CBOR [binary, unofficial](https://github.com/alexmrqt/micropython-cbor/tree/master)
5. protobuf [binary, unofficial](https://github.com/dogtopus/minipb) Requiring a schema:
5. struct [binary, official](http://docs.micropython.org/en/latest/library/struct.html)
6. protobuf [binary, unofficial](https://github.com/dogtopus/minipb)
The `ujson` and `pickle` formats produce human-readable byte sequences. These The `json` and `pickle` formats produce human-readable byte sequences. These
aid debugging. The use of ASCII data means that a delimiter can be used to aid debugging. The use of ASCII data means that a delimiter can be used to
identify the end of a message. This is because it is possible to guarantee that identify the end of a message. This is because it is possible to guarantee that
the delimiter will never occur within a message. A delimiter cannot be used the delimiter will never occur within a message. A delimiter cannot be used
@ -37,36 +40,45 @@ with binary formats because a message byte can take all possible values
including that of the delimiter. The drawback of ASCII formats is inefficiency: including that of the delimiter. The drawback of ASCII formats is inefficiency:
the byte sequences are relatively long. the byte sequences are relatively long.
Numbers 1, 2 and 4 are self-describing: the format includes a definition of its All bar 5 and 6 are are self-describing: the format includes a definition of
structure. This means that the decoding process can re-create the object in the its structure. This means that the decoding process can re-create the object in
absence of information on its structure, which may therefore change at runtime. the absence of information on its structure, which may therefore change at
Self describing formats inevitably are variable length. This is no problem runtime. Self describing formats inevitably are variable length. This is no
where data is being saved to file, but if it is being communicated across a problem where data is being saved to file, but if it is being communicated
link the receiving process needs a means to determine when a complete message across a link the receiving process needs a means to determine when a complete
has been received. In the case of ASCII formats a delimiter may be used but in message has been received. In the case of ASCII formats a delimiter may be used
the case of `MessagePack` this presents something of a challenge. but in the cases of `MessagePack` and `CBOR` this presents something of a
challenge.
The `ustruct` format is binary: the byte sequence comprises binary data which The `struct` format is binary: the byte sequence comprises binary data which
is neither human readable nor self-describing. The problem of message framing is neither human readable nor self-describing. The problem of message framing
is solved by hard coding a fixed message structure and length which is known to is solved by hard coding a fixed message structure and length which is known to
transmitter and receiver. In simple cases of fixed format data, `ustruct` transmitter and receiver. In simple cases of fixed format data, `struct`
provides a simple, efficient solution. provides a simple, efficient solution.
In `protobuf` and `MessagePack` messages are variable length; both can handle In `protobuf`, `CBOR` and `MessagePack` messages are variable length; all can
data whose length varies at runtime. `MessagePack` also allows the message handle data whose length varies at runtime. `MessagePack` and `CBOR` allow the
structure to change at runtime. It is also extensible to enable the efficient message structure to change at runtime. They are also extensible to enable the
coding of additional Python types or instances of user defined classes. efficient coding of additional Python types or instances of user defined classes.
The `protobuf` standard requires transmitter and receiver to share a schema The `protobuf` standard requires transmitter and receiver to share a schema
which defines the message structure. Message length may change at runtime, but which defines the message structure. Message length may change at runtime, but
structure may not. structure may not.
There has been some discussion of supporting [CBOR](https://cbor.io/). There is a Uniquely `CBOR` has an ability to accept data objects whose size is initially
MicroPython library [here](https://github.com/onetonfoot/micropython-cbor). This unknown. An encoder can receive a declaration that an array is to follow, then a
is a binary format with a focus on minimising message length. I have not yet had sequence of elements. A terminator signals the end of the array. This
time to study this. functionality is not provided in the
[MicroPython implementation](https://github.com/alexmrqt/micropython-cbor/tree/master).
Implementing such support would require a Python API to be defined.
## 1.1 Transmission over unreliable links ## 1.1 Protocol References
[MessagePack](https://github.com/msgpack/msgpack/tree/master)
[CBOR](https://cbor.io/)
[CBOR spec](https://www.rfc-editor.org/rfc/rfc8949.html)
## 1.2 Transmission over unreliable links
Consider a system where a transmitter periodically sends messages to a receiver Consider a system where a transmitter periodically sends messages to a receiver
over a communication link. An aspect of the message framing problem arises if over a communication link. An aspect of the message framing problem arises if
@ -80,7 +92,7 @@ continuous stream of data. In the case of regular bursts of data a timeout can
be used. Otherwise "out of band" signalling is required where the receiver be used. Otherwise "out of band" signalling is required where the receiver
signals the transmitter to request retransmission. signals the transmitter to request retransmission.
## 1.2 Concurrency ## 1.3 Concurrency
In `asyncio` systems the transmitter presents no problem. A message is created In `asyncio` systems the transmitter presents no problem. A message is created
using synchronous code, then transmitted using asynchronous code typically with using synchronous code, then transmitted using asynchronous code typically with
@ -90,19 +102,19 @@ is appended.
In the case of ASCII protocols the receiver can use `StreamReader.readline()` In the case of ASCII protocols the receiver can use `StreamReader.readline()`
to await a complete message. to await a complete message.
`ustruct` also presents a simple case in that the number of expected bytes is `struct` also presents a simple case in that the number of expected bytes is
known to the receiver which simply awaits that number. known to the receiver which simply awaits that number.
The variable length binary protocols present a difficulty in that the message The variable length binary protocols present a difficulty in that the message
length is unknown in advance. A solution is available for `MessagePack`. length is unknown in advance. A solution is available for `MessagePack`.
# 2. ujson and pickle # 2. json and pickle
These are very similar. `ujson` is documented These are very similar. `json` is documented
[here](http://docs.micropython.org/en/latest/library/ujson.html). `pickle` has [here](http://docs.micropython.org/en/latest/library/json.html). `pickle` has
identical methods so this doc may be used for both. identical methods so this doc may be used for both.
The advantage of `ujson` is that JSON strings can be accepted by CPython and by The advantage of `json` is that JSON strings can be accepted by CPython and by
other languages. The drawback is that only a subset of Python object types can other languages. The drawback is that only a subset of Python object types can
be converted to legal JSON strings; this is a limitation of the be converted to legal JSON strings; this is a limitation of the
[JSON specification](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf). [JSON specification](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf).
@ -114,7 +126,9 @@ The strings produced are incompatible with CPython's `pickle`, but can be
decoded in CPython by using the MicroPython decoder. There is a decoded in CPython by using the MicroPython decoder. There is a
[bug](https://github.com/micropython/micropython/issues/2280) in the [bug](https://github.com/micropython/micropython/issues/2280) in the
MicroPython implementation when running under MicroPython. A workround consists MicroPython implementation when running under MicroPython. A workround consists
of never encoding short strings which change repeatedly. of never encoding short strings which change repeatedly. The official MicroPython
library achieves its simplicity by invoking the compiler at runtime. This is
costly in RAM.
## 2.1 Usage examples ## 2.1 Usage examples
@ -130,11 +144,11 @@ print('Decoded data (partial):', v[3])
``` ```
JSON. Note that dictionary keys must be strings: JSON. Note that dictionary keys must be strings:
```python ```python
import ujson import json
data = {'1':'test', '2':1.414, '3': [11, 12, 13]} data = {'1':'test', '2':1.414, '3': [11, 12, 13]}
s = ujson.dumps(data) s = json.dumps(data)
print('Human readable data:', s) print('Human readable data:', s)
v = ujson.loads(s) v = json.loads(s)
print('Decoded data (partial):', v['3']) print('Decoded data (partial):', v['3'])
``` ```
@ -142,15 +156,15 @@ print('Decoded data (partial):', v['3'])
In real applications the data, and hence the string length, vary at runtime. In real applications the data, and hence the string length, vary at runtime.
The receiving process needs to know when a complete string has been received or The receiving process needs to know when a complete string has been received or
read from a file. In practice `ujson` and `pickle` do not include newline read from a file. In practice `json` and `pickle` do not include newline
characters in encoded strings. If the data being encoded includes a newline, it characters in encoded strings. If the data being encoded includes a newline, it
is escaped in the string: is escaped in the string:
```python ```python
import ujson import json
data = {'1':b'test\nmore', '2':1.414, '3': [11, 12, 13]} data = {'1':b'test\nmore', '2':1.414, '3': [11, 12, 13]}
s = ujson.dumps(data) s = json.dumps(data)
print('Human readable data:', s) print('Human readable data:', s)
v = ujson.loads(s) v = json.loads(s)
print('Decoded data (partial):', v['1']) print('Decoded data (partial):', v['1'])
``` ```
If this is pasted at the REPL you will observe that the human readable data If this is pasted at the REPL you will observe that the human readable data
@ -165,9 +179,9 @@ reading may be done using `readline` methods as in this code fragment where
`u` is a UART instance: `u` is a UART instance:
```python ```python
s = ujson.dumps(data) s = json.dumps(data)
# pickle produces a bytes object whereas ujson produces a string # pickle produces a bytes object whereas json produces a string
# In the case of ujson it is probably wise to convert to bytes # In the case of json it is probably wise to convert to bytes
u.write(s.encode()) u.write(s.encode())
# Pickle: # Pickle:
# u.write(s) # u.write(s)
@ -175,14 +189,14 @@ u.write(b'\n')
# receiver # receiver
s = u.readline() s = u.readline()
v = ujson.loads(s) # ujson can cope with bytes object v = json.loads(s) # json can cope with bytes object
# ujson and pickle can cope with trailing newline. # json and pickle can cope with trailing newline.
``` ```
# 3. ustruct # 3. struct
This is documented This is documented
[here](http://docs.micropython.org/en/latest/library/ustruct.html). The binary [here](http://docs.micropython.org/en/latest/library/struct.html). The binary
format is efficient, but the format of a sequence cannot change at runtime and format is efficient, but the format of a sequence cannot change at runtime and
must be "known" to the decoding process. Records are of fixed length. If data must be "known" to the decoding process. Records are of fixed length. If data
is to be stored in a binary random access file, the fixed record size means is to be stored in a binary random access file, the fixed record size means
@ -190,59 +204,59 @@ that the offset of a given record may readily be calculated.
Write a 100 record file. Each record comprises three 32-bit integers: Write a 100 record file. Each record comprises three 32-bit integers:
```python ```python
import ustruct import struct
fmt = 'iii' # Record format: 3 signed ints fmt = 'iii' # Record format: 3 signed ints
rlen = ustruct.calcsize(fmt) # Record length rlen = struct.calcsize(fmt) # Record length
buf = bytearray(rlen) buf = bytearray(rlen)
with open('myfile', 'wb') as f: with open('myfile', 'wb') as f:
for x in range(100): for x in range(100):
y = x * x y = x * x
z = x * 10 z = x * 10
ustruct.pack_into(fmt, buf, 0, x, y, z) struct.pack_into(fmt, buf, 0, x, y, z)
f.write(buf) f.write(buf)
``` ```
Read record no. 10 from that file: Read record no. 10 from that file:
```python ```python
import ustruct import struct
fmt = 'iii' fmt = 'iii'
rlen = ustruct.calcsize(fmt) # Record length rlen = struct.calcsize(fmt) # Record length
buf = bytearray(rlen) buf = bytearray(rlen)
rnum = 10 # Record no. rnum = 10 # Record no.
with open('myfile', 'rb') as f: with open('myfile', 'rb') as f:
f.seek(rnum * rlen) f.seek(rnum * rlen)
f.readinto(buf) f.readinto(buf)
result = ustruct.unpack_from(fmt, buf) result = struct.unpack_from(fmt, buf)
print(result) print(result)
``` ```
Owing to the fixed record length, integers must be constrained to fit the Owing to the fixed record length, integers must be constrained to fit the
length declared in the format string. length declared in the format string.
Binary formats cannot use delimiters as any delimiter character may be present Binary formats cannot use delimiters as any delimiter character may be present
in the data - however the fixed length of `ustruct` records means that this is in the data - however the fixed length of `struct` records means that this is
not a problem. not a problem.
For performance oriented applications, `ustruct` is the only serialisation For performance oriented applications, `struct` is the only serialisation
approach which can be used in a non-allocating fashion, by using pre-allocated approach which can be used in a non-allocating fashion, by using pre-allocated
buffers as in the above example. buffers as in the above example.
## 3.1 Strings ## 3.1 Strings
In `ustruct` the `s` data type is normally prefixed by a length (defaulting to In `struct` the `s` data type is normally prefixed by a length (defaulting to
1). This ensures that records are of fixed size, but is potentially inefficient 1). This ensures that records are of fixed size, but is potentially inefficient
as shorter strings will still occupy the same amount of space. Longer strings as shorter strings will still occupy the same amount of space. Longer strings
will silently be truncated. Short strings are packed with zeros. will silently be truncated. Short strings are packed with zeros.
```python ```python
import ustruct import struct
fmt = 'ii30s' fmt = 'ii30s'
rlen = ustruct.calcsize(fmt) # Record length rlen = struct.calcsize(fmt) # Record length
buf = bytearray(rlen) buf = bytearray(rlen)
ustruct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox') struct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox')
ustruct.unpack_from(fmt, buf) struct.unpack_from(fmt, buf)
ustruct.pack_into(fmt, buf, 0, 11, 22, 'rats') struct.pack_into(fmt, buf, 0, 11, 22, 'rats')
ustruct.unpack_from(fmt, buf) # Packed with zeros struct.unpack_from(fmt, buf) # Packed with zeros
ustruct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox jumps over the lazy dog') struct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox jumps over the lazy dog')
ustruct.unpack_from(fmt, buf) # Truncation struct.unpack_from(fmt, buf) # Truncation
``` ```
Output: Output:
```python ```python
@ -254,8 +268,8 @@ Output:
# 4. MessagePack # 4. MessagePack
Of the binary formats this is the easiest to use and can be a "drop in" Of the binary formats this is the easiest to use and can be a "drop in"
replacement for `ujson` as it supports the same four methods `dump`, `dumps`, replacement for `json` as it supports the same four methods `dump`, `dumps`,
`load` and `loads`. An application might initially be developed with `ujson`, `load` and `loads`. An application might initially be developed with `json`,
the protocol being changed to `MessagePack` later. Creation of a `MessagePack` the protocol being changed to `MessagePack` later. Creation of a `MessagePack`
string can be done with: string can be done with:
```python ```python