|Title:||N-Dimensional Array msgpack Extension Type|
|Post-History:||2020-04-14, 2020-04-21, 2020-07-02, 2020-07-05, 2020-12-09|
This YEP is now superseded by YEP-113. The yaq ecosystem no longer uses this msgpack-based custom RPC and uses Apache Avro instead.
This YEP describes a msgpack extension type suitable for N-Dimensional homogeneous arrays. This uses a subset of the Numpy Array Interface, with msgpack for serialization.
Msgpack provides a much more compact serialization for numeric types compared to JSON. However, msgpack arrays are heterogeneous, and thus each element must specify its datatype. For large arrays, this is potentially expensive, both in terms of serialization time and transport bandwidth. For large, potentially high dimensional, homogeneous arrays, the type information can be specified once and the data simply transmitted in native C-style contiguous form.
The type specification uses the integer value 110 (0x6e, ASCII 'n').
ext 8 stores an integer and a byte array whose length is up to (2^8)-1 bytes:
+--------+--------+-------+~~~~~~~~+
|  0xc7  |XXXXXXXX|  110  |  data  |
+--------+--------+-------+~~~~~~~~+

ext 16 stores an integer and a byte array whose length is up to (2^16)-1 bytes:
+--------+--------+--------+-------+~~~~~~~~+
|  0xc8  |YYYYYYYY|YYYYYYYY|  110  |  data  |
+--------+--------+--------+-------+~~~~~~~~+

ext 32 stores an integer and a byte array whose length is up to (2^32)-1 bytes:
+--------+--------+--------+--------+--------+-------+~~~~~~~~+
|  0xc9  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|  110  |  data  |
+--------+--------+--------+--------+--------+-------+~~~~~~~~+
XXXXXXXX is an 8-bit unsigned integer which represents N
YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N
ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a big-endian 32-bit unsigned integer which represents N
N is the length of data
data is a msgpack formatted Map as described below
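As a rough illustration of the three layouts above, the following sketch picks the smallest of ext 8, ext 16, and ext 32 for a given payload length and builds the header bytes using only the standard library. The names `EXT_CODE` and `ext_header` are hypothetical and not part of this YEP; in practice a msgpack library would normally produce these bytes. msgpack also defines fixext formats for very short payloads, which are ignored here because the Map payload is always longer than 16 bytes.

```python
import struct

# Extension type code from this YEP (0x6e, ASCII 'n').
EXT_CODE = 110


def ext_header(n):
    """Return the ext 8/16/32 header for a payload of n bytes.

    Hypothetical helper for illustration only.
    """
    if n < 0:
        raise ValueError("payload length must be non-negative")
    if n < 2**8:
        # 0xc7, 8-bit length, type code
        return struct.pack(">BBb", 0xC7, n, EXT_CODE)
    if n < 2**16:
        # 0xc8, 16-bit big-endian length, type code
        return struct.pack(">BHb", 0xC8, n, EXT_CODE)
    if n < 2**32:
        # 0xc9, 32-bit big-endian length, type code
        return struct.pack(">BIb", 0xC9, n, EXT_CODE)
    raise ValueError("payload too large for a msgpack ext type")
```

The full message is this header followed by the N bytes of the msgpack formatted Map.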
This describes the subset of version 3 of the Numpy Array Interface which is required.
The Map has four required keys: shape, typestr, data, and version.
shape: Tuple whose elements are the array size in each dimension. Each entry is an integer.
typestr: A string providing the basic type of the homogeneous array. The basic string format consists of 3 parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses.
The basic types supported by this protocol are a subset of those supported by Numpy.
These are chosen to maximize compatibility without relying on Python/Numpy specific behavior.
b: Boolean (integer with only True and False values)
u: Unsigned Integer
f: Floating Point
c: Complex Floating Point
data: C-style (row-major) contiguous bytes representing the contents of the array. (This differs from the Numpy specification, as it needs to be transmitted over the RPC, and is also therefore required)
version: An integer showing the version of the interface (i.e. 3 for this version). Be careful not to use this to invalidate objects exposing future versions of the interface.
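The four keys above can be checked on receipt with a small validator. The sketch below is only an illustration: the function names are hypothetical, and the accepted type codes mirror the list given above. It splits typestr into its three parts and confirms that the length of data matches the product of shape and the item size.

```python
import math
import re

# Byteorder character, one of the type codes listed above, then the item size.
_TYPESTR = re.compile(r"^([<>|])([bufc])(\d+)$")
_REQUIRED = ("shape", "typestr", "data", "version")


def parse_typestr(typestr):
    """Split a typestr such as '<f8' into (byteorder, kind, itemsize)."""
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError(f"unsupported typestr: {typestr!r}")
    byteorder, kind, size = match.groups()
    return byteorder, kind, int(size)


def check_interface(interface):
    """Validate a decoded Map; hypothetical helper, not part of the YEP."""
    missing = [key for key in _REQUIRED if key not in interface]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    byteorder, kind, itemsize = parse_typestr(interface["typestr"])
    # An empty shape describes a zero-dimensional array with one element.
    expected = math.prod(interface["shape"]) * itemsize
    if len(interface["data"]) != expected:
        raise ValueError(
            f"data holds {len(interface['data'])} bytes, expected {expected}"
        )
    return byteorder, kind, itemsize
```

For example, a 2 by 3 array with typestr `<f8` must carry exactly 48 bytes of data.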
Other parts of the Numpy specification (including both the optional fields and other datatypes) are explicitly NOT included in this specification. They are deemed too specific to Numpy/Python to be confident that the representations translate seamlessly to other potential implementations. As such, parsers MAY utilize these fields if provided, but are NOT REQUIRED to do so. Given that, sending these additional fields is STRONGLY discouraged, and may result in improper data transfer.
This packing incurs ~50 bytes of overhead regardless of the size of the array. (The exact cost is not completely fixed, as it depends on the number of dimensions and on how compactly the shape can be represented in msgpack native int types.) As such, this is a less efficient serialization than msgpack native arrays for arrays of fewer than ~40 members. However, since this cost is (relatively) fixed, the serialized size of large arrays does not grow as quickly as it would with native arrays.
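The overhead estimate can be reproduced with a little arithmetic on msgpack's header sizes. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a Map with exactly the four required keys, a native msgpack array in which every element is stored as a 9-byte float64, and hypothetical helper names throughout.

```python
import math


def _int_size(value):
    """Bytes msgpack uses for a non-negative integer."""
    for limit, size in ((2**7, 1), (2**8, 2), (2**16, 3), (2**32, 5)):
        if value < limit:
            return size
    return 9


def _str_size(text):
    """Bytes msgpack uses for a string, header included."""
    n = len(text.encode("utf-8"))
    return n + (1 if n < 32 else 2 if n < 2**8 else 3 if n < 2**16 else 5)


def _bin_size(n):
    """Bytes msgpack uses for n bytes of binary data, header included."""
    return n + (2 if n < 2**8 else 3 if n < 2**16 else 5)


def _array_header(n):
    return 1 if n < 16 else 3 if n < 2**16 else 5


def _ext_header(n):
    return 3 if n < 2**8 else 4 if n < 2**16 else 6


def ext_size(shape, typestr="<f8"):
    """Estimated size of an array packed as extension type 110."""
    nbytes = math.prod(shape) * int(typestr[2:])
    payload = (
        1  # fixmap header for the four keys
        + sum(_str_size(k) for k in ("shape", "typestr", "data", "version"))
        + _array_header(len(shape)) + sum(_int_size(d) for d in shape)
        + _str_size(typestr)
        + _bin_size(nbytes)
        + _int_size(3)  # version
    )
    return _ext_header(payload) + payload


def native_float64_size(n):
    """Estimated size of a flat msgpack array of n float64 values."""
    return _array_header(n) + 9 * n
```

Under these assumptions, a one-dimensional array of 40 float64 values comes to 362 bytes in this format versus 363 bytes as a native array, in line with the break-even point quoted above. Native integers can take as little as one byte each, so the break-even point differs for other types.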
This only extends capability, though it requires both client and daemon to implement parsing and packing in this format.
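import functools
import sys

import msgpack
from msgpack import ExtType


class PackException(Exception):
    """Raised when an array cannot be packed in this format."""


# Assumption: packb_ and unpackb_ are thin wrappers around msgpack;
# their original definitions are not shown in this YEP.
packb_ = functools.partial(msgpack.packb, use_bin_type=True)
unpackb_ = functools.partial(msgpack.unpackb, raw=False)


class ArrayInterface:
    """Minimal object exposing __array_interface__ to consumers such as Numpy."""

    def __init__(self, array_interface):
        # msgpack decodes the shape as a list; the interface expects a tuple.
        array_interface["shape"] = tuple(array_interface["shape"])
        self.__array_interface__ = array_interface


def pack_ndarray(ndarray):
    # Work on a copy so the dict being modified is never shared.
    interface = dict(ndarray.__array_interface__)
    if "strides" in interface:
        if interface["strides"] is not None:
            raise PackException(
                "Strided array attempted to pack, please pack as C-Contiguous"
            )
        del interface["strides"]
    if "descr" in interface and len(interface["descr"]) == 1:
        del interface["descr"]
    # Replace the memory pointer with the raw C-contiguous bytes.
    interface["data"] = ndarray.tobytes()
    ser = packb_(interface)
    return ExtType(110, ser)


def ext_hook(code, data):
    if code == 110:
        interface = ArrayInterface(unpackb_(data))
        # Only build a Numpy array if the application already imported Numpy.
        if "numpy" in sys.modules:
            import numpy as np

            return np.array(interface)
        return interface
    # Unknown extension types are passed through unchanged.
    return ExtType(code, data)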
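When Numpy has not been imported, the hook above hands back the bare interface object. A receiver in that position could decode the bytes itself; the following sketch does so with the standard library `struct` module and returns nested Python lists. It is an assumption-laden illustration: `to_nested_list` and the lookup table are hypothetical names, only the boolean, unsigned integer, and floating point entries shown are handled (complex is omitted), and shapes containing a zero-length dimension are not handled.

```python
import struct

# Map (type code, item size) to a struct format character.
_STRUCT_CODES = {
    ("b", 1): "?",
    ("u", 1): "B",
    ("u", 2): "H",
    ("u", 4): "I",
    ("u", 8): "Q",
    ("f", 4): "f",
    ("f", 8): "d",
}


def to_nested_list(interface):
    """Decode a Map with shape, typestr and data keys into nested lists."""
    typestr = interface["typestr"]
    byteorder, kind, itemsize = typestr[0], typestr[1], int(typestr[2:])
    try:
        code = _STRUCT_CODES[(kind, itemsize)]
    except KeyError:
        raise ValueError(f"unsupported typestr: {typestr!r}") from None
    # '|' means byte order is irrelevant; '<' selects struct's standard sizes.
    order = "<" if byteorder == "|" else byteorder
    shape = tuple(interface["shape"])
    count = 1
    for dim in shape:
        count *= dim
    flat = list(struct.unpack(f"{order}{count}{code}", interface["data"]))
    if not shape:
        return flat[0]  # zero-dimensional array: a single scalar
    # data is row-major, so group from the innermost dimension outwards.
    for dim in reversed(shape[1:]):
        flat = [flat[i:i + dim] for i in range(0, len(flat), dim)]
    return flat
```

With the hook above, this would be called as `to_nested_list(obj.__array_interface__)` on the returned object.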
Native msgpack arrays were deemed too verbose, since each element must specify its type. For large arrays, this adds a considerable amount of data.
Discussion can be found on the GitLab issue for this YEP.
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.