yaq-yeps/110


N-Dimensional Array msgpack Extension Type

YEP:110
Title:N-Dimensional Array msgpack Extension Type
Authors:Kyle Sunden
Status:rejected
Tags:standard
Post-History:2020-04-14, 2020-04-21, 2020-07-02, 2020-07-05, 2020-12-09

Superseded

This YEP is now superseded by YEP-113. The yaq ecosystem no longer uses this msgpack based custom RPC and uses Apache Avro

Abstract

This YEP describes a msgpack extension type suitable for N-Dimensional homogeneous arrays. This uses a subset of the Numpy Array Interface, with msgpack for serialization.

Table of Contents

Motivation

Msgpack provides a much more compact serialization for numeric types compared to JSON. However, msgpack arrays are heterogeneous, and thus each element must specify its datatype. For large arrays, this is potentially expensive, both in terms of serialization time and transport bandwidth. For large, potentially high dimensional, homogeneous arrays, the type information can be specified once and the data simply transmitted in native C-style contiguous form.

Specification

msgpack header

The type specification uses the integer value 110 (0x6e, ASCII 'n').

ext 8 stores an integer and a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+-------+~~~~~~~~+
|  0xc7  |XXXXXXXX|  110  |  data  |
+--------+--------+-------+~~~~~~~~+

ext 16 stores an integer and a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+-------+~~~~~~~~+
|  0xc8  |YYYYYYYY|YYYYYYYY|  110  |  data  |
+--------+--------+--------+-------+~~~~~~~~+

ext 32 stores an integer and a byte array whose length is upto (2^32)-1 bytes:
+--------+--------+--------+--------+--------+-------+~~~~~~~~+
|  0xc9  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|  110  |  data  |
+--------+--------+--------+--------+--------+-------+~~~~~~~~+

where * XXXXXXXX is a 8-bit unsigned integer which represents N * YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N * ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a big-endian 32-bit unsigned integer which represents N * N is a length of data * data is a msgpack formatted Map as described below

Array Interface contents

This describes the subset of version 3 of the Numpy Array Interface which is required.

The Map has four required keys: (shape, typestr, data, version):

shape: Tuple whose elements are the array size in each dimension. Each entry is an integer.

typestr: A string providing the basic type of the homogeneous array The basic string format consists of 3 parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses. The basic types supported by this protocol are a subset of those supported by Numpy. These are chosen to maximize compatibility without relying on Python/Numpy specific behavior.

data: C-style (row-major) contiguous bytes representing the contents of the array. (This differs from the Numpy specification, as it needs to be transmitted over the RPC, and is also therefore required)

version: An integer showing the version of the interface (i.e. 3 for this version). Be careful not to use this to invalidate objects exposing future versions of the interface.

Other parts of the Numpy specification (including both the optional fields and other datatypes) are explicitly NOT included in this specification. They are deemed to be too specific to Numpy/Python to be confident in that the representations translate seamlessly in other potential implementations. As such, parsers MAY utilize these fields if provided, but are NOT REQUIRED to do so. Given that, sending of these additional fields is STRONGLY discouraged, and may result in improper data transfer.

Size consideration

This packing incurs ~50 bytes of overhead (regardless of the size of the array). (The exact cost is not completely fixed, as it depends on the number of dimensions and the efficiency of representing the shape in msgpack native int types) As such, this is less efficient serialization than msgpack native arrays for arrays <~40 members. However, since this cost is (relatively) fixed, large arrays do not grow as quickly.

Backwards Compatibility

This only extends capability, though requires both client and daemon to implement parsing and packing in this format.

Reference Implementation

import msgpack

class ArrayInterface:
    def __init__(self, array_interface):
        array_interface["shape"] = tuple(array_interface["shape"])
        self.__array_interface__ = array_interface


def pack_ndarray(ndarray):
    interface = ndarray.__array_interface__
    if "strides" in interface:
        if interface["strides"] is not None:
            raise PackException(
                "Strided array attempted to pack, please pack as C-Contiguous"
            )
        del interface["strides"]
    if "descr" in interface and len(interface["descr"]) == 1:
        del interface["descr"]
    interface["data"] = ndarray.tobytes()
    ser = packb_(interface)
    return ExtType(110, ser)


def ext_hook(code, data):
    if code == 110:
        interface = ArrayInterface(unpackb_(data))
        if "numpy" in sys.modules:
            import numpy as np

            return np.array(interface)
        return interface

    return ExtType(code, data)

Rejected Ideas

Using msgpack arrays

These were deemed verbose due to each element needing to specify its type. For large arrays, this is a fair amount of more data.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


built 2023-10-10 23:52:42                                      CC0: no copyright