YEP: | 113 |
---|---|
Title: | N-Dimensional Array Avro Named Type |
Authors: | Kyle Sunden |
Status: | accepted |
Tags: | standard |
Post-History: | 2020-12-09, 2020-12-21 |
This YEP describes an Avro record type suitable for N-Dimensional homogeneous arrays. This uses a subset of the Numpy Array Interface, with Avro for serialization.
For large, potentially high dimensional, homogeneous arrays, the type information can be specified once and the data simply transmitted in native C-style contiguous form. The Array Interface provides an easy and portable way of specifying homogeneous arrays, including of types not natively supported by Avro itself (e.g. complex numeric). YEP-110 was a similar definition for when yaq used msgpack instead of Avro.
The ndarray type uses the following schema definition:
{
"name": "ndarray",
"type": "record",
"logicalType": "ndarray",
fields": [
{
"name": "shape",
"type": {
"items": "int",
"type": "array"
}
},
{
"name": "typestr",
"type": "string"
},
{
"name": "data",
"type": "bytes"
},
{
"name": "version",
"type": "int"
}
]
}
This describes the subset of version 3 of the Numpy Array Interface which is required.
The record has four required keys: (shape
, typestr
, data
, version
):
shape
: Tuple whose elements are the array size in each dimension. Each entry is an integer.
typestr
: A string providing the basic type of the homogeneous array The basic string format consists of 3 parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses.
The basic types supported by this protocol are a subset of those supported by Numpy.
These are chosen to maximize compatibility without relying on Python/Numpy specific behavior.
b
: Boolean (integer with only True and False values)i
: Integeru
: Unsigned Integerf
: Floating Pointc
: Complex Floating Pointdata
: C-style (row-major) contiguous bytes representing the contents of the array. (This differs from the Numpy specification, as it needs to be transmitted over the RPC, and is also therefore required)
version
: An integer showing the version of the interface (i.e. 3 for this version). Be careful not to use this to invalidate objects exposing future versions of the interface.
Other parts of the Numpy specification (including both the optional fields and other datatypes) are explicitly NOT included in this specification. They are deemed to be too specific to Numpy/Python to be confident in that the representations translate seamlessly in other potential implementations. As such, parsers MAY implement other datatypes if provided, but are NOT REQUIRED to do so. Given that, sending of datatypes not specified here is STRONGLY discouraged, and may result in improper data transfer.
Implementations are encouraged to automatically return language-native ndarray objects rather than returning the dictionary representing the map as transmitted.
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
built 2023-10-10 23:52:42 CC0: no copyright