yaq-yeps/113


N-Dimensional Array Avro Named Type

YEP:113
Title:N-Dimensional Array Avro Named Type
Authors:Kyle Sunden
Status:accepted
Tags:standard
Post-History:2020-12-09, 2020-12-21

Abstract

This YEP describes an Avro record type suitable for N-Dimensional homogeneous arrays. This uses a subset of the Numpy Array Interface, with Avro for serialization.

Table of Contents

Motivation

For large, potentially high dimensional, homogeneous arrays, the type information can be specified once and the data simply transmitted in native C-style contiguous form. The Array Interface provides an easy and portable way of specifying homogeneous arrays, including of types not natively supported by Avro itself (e.g. complex numeric). YEP-110 was a similar definition for when yaq used msgpack instead of Avro.

Specification

The ndarray type uses the following schema definition:

{
    "name": "ndarray",
    "type": "record",
    "logicalType": "ndarray",
    fields": [
       {
      "name": "shape",
      "type": {
         "items": "int",
         "type": "array"
      }
       },
       {
      "name": "typestr",
      "type": "string"
       },
       {
      "name": "data",
      "type": "bytes"
       },
       {
      "name": "version",
      "type": "int"
       }
    ]
}

Array Interface contents

This describes the subset of version 3 of the Numpy Array Interface which is required.

The record has four required keys: (shape, typestr, data, version):

shape: Tuple whose elements are the array size in each dimension. Each entry is an integer.

typestr: A string providing the basic type of the homogeneous array The basic string format consists of 3 parts: a character describing the byteorder of the data (<: little-endian, >: big-endian, |: not-relevant), a character code giving the basic type of the array, and an integer providing the number of bytes the type uses. The basic types supported by this protocol are a subset of those supported by Numpy. These are chosen to maximize compatibility without relying on Python/Numpy specific behavior.

data: C-style (row-major) contiguous bytes representing the contents of the array. (This differs from the Numpy specification, as it needs to be transmitted over the RPC, and is also therefore required)

version: An integer showing the version of the interface (i.e. 3 for this version). Be careful not to use this to invalidate objects exposing future versions of the interface.

Other parts of the Numpy specification (including both the optional fields and other datatypes) are explicitly NOT included in this specification. They are deemed to be too specific to Numpy/Python to be confident in that the representations translate seamlessly in other potential implementations. As such, parsers MAY implement other datatypes if provided, but are NOT REQUIRED to do so. Given that, sending of datatypes not specified here is STRONGLY discouraged, and may result in improper data transfer.

Implementations are encouraged to automatically return language-native ndarray objects rather than returning the dictionary representing the map as transmitted.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.


built 2023-10-10 23:52:42                                      CC0: no copyright