Schema specification

dsch schema specification.

In dsch, data is structured according to a given schema, which must be defined prior to working with the data (i.e. saving and loading). A schema is a tree-like hierarchical structure of schema nodes, each of which applies certain constraints to the data. In general, there are three different kinds of schema nodes:

  1. Items
    represent a data point (e.g. a string or a NumPy array). These are the leaves of the tree.
  2. Compilations
    represent a compound of data made up of multiple named fields. Each field is represented by another node and therefore supports its own constraints.
  3. Lists
    can contain multiple elements of the same type. The constraints are described with a single sub-node, but then applied to all data elements.

Compilations and lists both support all kinds of sub-nodes and can be nested. This makes it possible to specify arbitrary hierarchically structured schemas.
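To make the tree structure concrete, here is a minimal sketch using stand-in classes. ItemNode, CompilationNode, ListNode and depth are hypothetical names invented for this illustration; they are not the dsch classes documented below and only mirror the three kinds of nodes described above.

```python
from dataclasses import dataclass


@dataclass
class ItemNode:
    """Leaf node: a single data point, e.g. a string or an array."""
    kind: str


@dataclass
class CompilationNode:
    """Compound of named fields, each described by its own node."""
    subnodes: dict


@dataclass
class ListNode:
    """Many elements, all described by one shared sub-node."""
    subnode: object


def depth(node):
    """Return the nesting depth of a schema tree (a leaf counts as 1)."""
    if isinstance(node, CompilationNode):
        return 1 + max((depth(n) for n in node.subnodes.values()), default=0)
    if isinstance(node, ListNode):
        return 1 + depth(node.subnode)
    return 1


# A compilation holding a string and a list of compilations.
schema = CompilationNode({
    'operator': ItemNode('string'),
    'runs': ListNode(CompilationNode({
        'time': ItemNode('array'),
        'voltage': ItemNode('array'),
    })),
})
```

Note that the list node carries exactly one sub-node, no matter how many entries the data will hold later.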

Note

The classes in this module are used to specify a schema. This specification is different from what the user sees when interacting with the actual data. For example, a list-type schema node only has a single sub-node, since the schema node only specifies the constraints to be applied to the data. When users interact with the data, however, they use data nodes, which, in the case of lists, can contain multiple sub-nodes. Data nodes are defined in the dsch.data module.

class dsch.schema.Array(dtype, unit='', max_shape=None, min_shape=None, ndim=1, max_value=None, min_value=None, depends_on=None)

Schema node for NumPy ndarray values.

This node type accepts values of type numpy.ndarray.

In addition to the actual value(s), Arrays contain metadata:

  • The unit of the physical quantity that is represented by the values, e.g. ‘V’ for volts.

Also, Arrays support various constraints:

  • NumPy data type (numpy.dtype). This directly validates the array’s dtype. Note that the data type is always matched exactly, so one cannot require “any of the int-dtypes”. This attribute is non-optional, since many backends require knowledge of the data type for efficient storage.
  • Minimum and maximum array shape. Like numpy.ndarray.shape, these tuples limit the size of each dimension of the array. If both minimum and maximum shape are given, they must have the same length.
  • Number of array dimensions. If a minimum or maximum array shape is given, this is determined automatically. Otherwise, it can be given explicitly or left at the default value of 1.
  • Minimum and maximum array values. This constraint is applied to each individual value in the array, meaning that a single array element value outside the given boundaries causes validation to fail.
  • Linking with independent variables. Often, an array represents a variable that depends on another variable, e.g. a time-dependent voltage vector u(t). When t is registered as an independent variable of u, dsch ensures that the array sizes match. Independent variables can be defined by supplying the desired node’s name to the depends_on argument. Note that this only works if the dependent and independent variable’s nodes are both direct sub-nodes of the same Compilation. For multi-dimensional arrays, depends_on is a tuple, containing a node name (or None) for each of the array’s dimensions.

Note: The constraint parameters for array shape, values and variable dependencies all default to None, effectively disabling the corresponding validation step.
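The shape-related constraints above can be sketched in plain Python on shape tuples. This is only an illustration under stated assumptions, not the dsch implementation: check_shape and independent_lengths are hypothetical names, and the lengths of the independent variables are passed in directly instead of being looked up via depends_on.

```python
def check_shape(shape, min_shape=None, max_shape=None, independent_lengths=None):
    """Return a list of problems; an empty list means the shape passes.

    A limit left at None is skipped, mirroring the documented defaults.
    """
    problems = []
    limits = (('min_shape', min_shape), ('max_shape', max_shape),
              ('independent_lengths', independent_lengths))
    # Every given limit needs one entry per array dimension.
    for name, limit in limits:
        if limit is not None and len(limit) != len(shape):
            problems.append(f'{name} needs {len(shape)} entries, got {len(limit)}')
    if problems:
        return problems

    for axis, size in enumerate(shape):
        if min_shape is not None and size < min_shape[axis]:
            problems.append(f'axis {axis}: size {size} < minimum {min_shape[axis]}')
        if max_shape is not None and size > max_shape[axis]:
            problems.append(f'axis {axis}: size {size} > maximum {max_shape[axis]}')
        if independent_lengths is not None:
            expected = independent_lengths[axis]
            # None means "no independent variable for this dimension".
            if expected is not None and size != expected:
                problems.append(
                    f'axis {axis}: size {size} != independent length {expected}')
    return problems


# u(t) with 100 samples, where t also has 100 samples.
ok = check_shape((100,), max_shape=(1000,), independent_lengths=(100,))
# The same u, but t has only 99 samples.
mismatch = check_shape((100,), independent_lengths=(99,))
```

Here ok is an empty list, while mismatch holds a single entry describing the size conflict on axis 0.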

Variables:
  • dtype (numpy.dtype or str) – Required NumPy dtype.
  • unit (str) – Unit of the physical quantity, e.g. ‘V’ for volts. Unit (and values) should be given without any SI prefixes.
  • max_shape (tuple) – Maximum allowed array shape.
  • min_shape (tuple) – Minimum allowed array shape.
  • ndim (int) – Number of array dimensions.
  • max_value – Maximum allowed value for any element.
  • min_value – Minimum allowed value for any element.
  • depends_on (str or tuple) – Name(s) of nodes representing the independent variables.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data, independent_values)

Validate given data against the node’s constraints.

For Array nodes, this ensures that the given data is of type numpy.ndarray and that all constraints (dimensions, dtype etc.) are met.

If depends_on is set, the array dimensions are automatically validated against the independent variable’s array length. Therefore, in contrast to other node types, this method requires a second argument, supplying a tuple of numpy.ndarray representing the independent variables. The tuple’s length must consequently be equal to ndim and to the length of depends_on.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:
  • test_data – Data to be validated.
  • independent_values – Values of the independent variables.
Raises:

dsch.exceptions.ValidationError – if validation fails.

class dsch.schema.Bool

Schema node for scalar boolean values.

This node type only accepts values of type bool.

No configuration is required.

to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For Bool nodes, this ensures that the given data is of type bool.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.Bytes(min_length=None, max_length=None)

Schema node for bytes values.

This node type accepts regular Python byte strings, i.e. bytes objects. Constraints can be optionally configured for minimum and maximum length.

Variables:
  • min_length (int) – Minimum allowed bytes length.
  • max_length (int) – Maximum allowed bytes length.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For Bytes nodes, this ensures that the given data is of type bytes, and that its length is within the limits.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
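The behavior described for Bytes.validate (silent on success, exception on failure) could look roughly like the following. This is a stand-alone sketch: validate_bytes is a hypothetical function, and the local ValidationError class merely stands in for dsch.exceptions.ValidationError.

```python
class ValidationError(Exception):
    """Stand-in for dsch.exceptions.ValidationError."""


def validate_bytes(test_data, min_length=None, max_length=None):
    """Raise ValidationError unless test_data is bytes of an allowed length."""
    if not isinstance(test_data, bytes):
        raise ValidationError(f'expected bytes, got {type(test_data).__name__}')
    # A limit of None disables the corresponding check.
    if min_length is not None and len(test_data) < min_length:
        raise ValidationError(f'length {len(test_data)} < minimum {min_length}')
    if max_length is not None and len(test_data) > max_length:
        raise ValidationError(f'length {len(test_data)} > maximum {max_length}')


validate_bytes(b'abc', min_length=1, max_length=5)  # passes silently
```

Passing a str such as 'abc' instead of b'abc' fails the type check, even though its length would be acceptable.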
class dsch.schema.Compilation(subnodes, optionals=None)

Schema node for compound values composed from multiple named sub-nodes.

Usually, a Compilation is used to name and group related data. Together with List, this node type makes it possible to build arbitrary hierarchical schemas.

The corresponding data node is a subclass of dsch.data.Compilation and provides attributes corresponding to the schema node’s sub-node names. Each of those attributes then represents a data node. While the general functionality is more similar to a dict, the object representation is preferred because of the more compact “dotted” notation. This is especially relevant when nesting compilations, e.g. measurement.sampling.frequency vs. measurement['sampling']['frequency'].

Compilations support defining optional sub-nodes. This can be used when describing a data structure that has truly optional fields, i.e. fields that can be omitted without making the entire data unusable. The selected sub-nodes are then ignored when checking for completeness via dsch.data.Compilation.complete.

Variables:
  • subnodes – dict-like mapping of names to schema sub-nodes.
  • optionals (list) – List of names of sub-nodes that are ignored when checking for completeness.
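The effect of optionals on the completeness check can be illustrated with a small stand-alone function. How dsch.data.Compilation.complete decides that a field is filled is not specified here, so this sketch assumes that a missing or None value counts as empty; is_complete is a hypothetical name.

```python
def is_complete(values, subnode_names, optionals=None):
    """Return True if every non-optional field has a value.

    Assumption: a field counts as empty when it is missing or None.
    """
    optional = set(optionals or [])
    required = [name for name in subnode_names if name not in optional]
    return all(values.get(name) is not None for name in required)


names = ['time', 'voltage', 'comment']
values = {'time': [0.0, 0.1], 'voltage': [1.2, 1.3]}

strict = is_complete(values, names)                           # comment required
relaxed = is_complete(values, names, optionals=['comment'])   # comment ignored
```

With no optionals, the missing comment makes the data incomplete; declaring comment as optional makes the same data count as complete.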
classmethod from_dict(node_dict)

Create a new instance from a dict representation.

Parameters:node_dict – dict-representation of the node to be loaded.
Returns:New compilation-type schema node.
Return type:Compilation
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
class dsch.schema.Date(set_on_create=False)

Schema node for date values.

This node type accepts regular Python dates, i.e. datetime.date objects.

If set_on_create is set to True, the node value is automatically set to the current date whenever a new data node is created.

Variables:set_on_create (bool) – Automatically apply the current date on data node creation.
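As a rough illustration of set_on_create, the initial value of a new Date data node could be chosen as follows. initial_date is a hypothetical helper, and the assumption that a node starts out empty (None) when set_on_create is False is not taken from the dsch documentation.

```python
import datetime


def initial_date(set_on_create=False):
    """Return the value a freshly created Date data node would start with."""
    # Assumption: without set_on_create, the node starts without a value.
    return datetime.date.today() if set_on_create else None


empty = initial_date()
stamped = initial_date(set_on_create=True)
```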
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For Date nodes, this ensures that the given data is of type datetime.date.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.DateTime(set_on_create=False)

Schema node for datetime values.

This node type accepts regular Python datetimes, i.e. datetime.datetime objects.

If set_on_create is set to True, the node value is automatically set to the current date and time whenever a new data node is created.

Variables:set_on_create (bool) – Automatically apply the current date and time on data node creation.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For DateTime nodes, this ensures that the given data is of type datetime.datetime.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.List(subnode, max_length=None, min_length=None)

Schema node for lists of same-type elements.

A List is used to represent multiple data items that must meet the same constraints, e.g. a list of equally sized NumPy arrays. It uses a single schema node to specify the constraints for all entries. Note that this behavior is different from regular Python lists, which can contain arbitrary entries.

In addition to the sub-nodes, constraints can also be set for the List itself, specifying the maximum and minimum list length (i.e. number of items in the list).

Often, a List is used with a Compilation as its sub-node, which makes it possible to represent arbitrary hierarchical schemas.

Variables:
  • subnode – A single schema node, used to validate all list entries.
  • max_length (int) – Maximum number of list entries.
  • min_length (int) – Minimum number of list entries.
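The two levels of constraints on a list (one shared rule for all entries, plus limits on the list length) can be sketched separately. The names validate_length, validate_entries and require_str are hypothetical, and the local ValidationError stands in for dsch.exceptions.ValidationError; how dsch wires the per-entry check into its data nodes is not shown here.

```python
class ValidationError(Exception):
    """Stand-in for dsch.exceptions.ValidationError."""


def validate_length(entries, min_length=None, max_length=None):
    """Check only the number of entries; None disables a limit."""
    count = len(entries)
    if min_length is not None and count < min_length:
        raise ValidationError(f'{count} entries, minimum is {min_length}')
    if max_length is not None and count > max_length:
        raise ValidationError(f'{count} entries, maximum is {max_length}')


def validate_entries(entries, validate_entry):
    """Apply the single shared check to every entry in turn."""
    for entry in entries:
        validate_entry(entry)


def require_str(value):
    """Example per-entry rule: the value must be a str."""
    if not isinstance(value, str):
        raise ValidationError(f'expected str, got {type(value).__name__}')


names = ['alice', 'bob']
validate_length(names, min_length=1, max_length=3)
validate_entries(names, require_str)
```

Unlike a regular Python list, a mixed list such as ['alice', 42] would fail here, because the same rule is applied to every entry.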
classmethod from_dict(node_dict)

Create a new instance from a dict representation.

Parameters:node_dict – dict-representation of the node to be loaded.
Returns:New list-type schema node.
Return type:List
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For List nodes, this ensures that the list length is within the limits specified with max_length and min_length.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.Scalar(dtype, unit='', max_value=None, min_value=None)

Schema node for NumPy scalar values.

This node type accepts scalar values of all numeric NumPy scalar types, i.e. subclasses of numpy.number.

In addition to the actual value, Scalars contain some metadata:

  • The unit of the physical quantity that is represented by the value, e.g. ‘V’ for volts.

Also, Scalars support constraints:

  • NumPy data type (numpy.dtype). This directly validates the scalar’s dtype. Note that the data type is always matched exactly, so one cannot require “any of the int-dtypes”. This attribute is non-optional, since many backends require knowledge of the data type for efficient storage.
  • Minimum and maximum values.
Variables:
  • dtype (numpy.dtype or str) – Required NumPy dtype.
  • unit (str) – Unit of the physical quantity, e.g. ‘V’ for volts. Unit (and value) should be given without any SI prefixes.
  • max_value – Maximum allowed value.
  • min_value – Minimum allowed value.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For Scalar nodes, this ensures that the given data is of a subtype of numpy.number and that all constraints (dtype etc.) are met.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.SchemaNode

Base class for all kinds of schema nodes.

All schema node classes must derive from this class, which provides only very general functionality.

classmethod from_dict(node_dict)

Create a new instance from a dict representation.

Parameters:node_dict – dict-representation of the node to be loaded.
Returns:New schema node instance.
hash()

Calculate the node’s SHA256 hash.

The hash is calculated based on the JSON representation of the node. Consequently, identical node configurations result in the same hash.

Returns:SHA256 hash (hex) of the schema.
Return type:str
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
to_json()

Return the node representation as a JSON string.

The JSON data structure is identical to the dict returned by to_dict().

Returns:JSON representation of the node.
Return type:str
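The relationship between to_dict(), to_json() and hash() described above can be sketched with the standard library. The exact JSON formatting that dsch hashes is not specified here; this sketch assumes sorted keys so that equal configurations serialize identically, and the config keys in the example dicts are illustrative only.

```python
import hashlib
import json


def to_json(node_dict):
    """Serialize a node dict; sorted keys are an assumption of this sketch."""
    return json.dumps(node_dict, sort_keys=True)


def node_hash(node_dict):
    """Return the SHA256 hex digest of the node's JSON representation."""
    return hashlib.sha256(to_json(node_dict).encode('utf-8')).hexdigest()


# Same configuration, written in a different key order.
a = {'node_type': 'String', 'config': {'min_length': 3, 'max_length': 10}}
b = {'config': {'max_length': 10, 'min_length': 3}, 'node_type': 'String'}
# A different configuration.
c = {'node_type': 'String', 'config': {'min_length': 3, 'max_length': 11}}

same = node_hash(a) == node_hash(b)
different = node_hash(a) != node_hash(c)
```

Because the digest depends only on the serialized configuration, it can serve as a compact fingerprint for comparing two schemas.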
class dsch.schema.String(min_length=None, max_length=None)

Schema node for string values.

This node type accepts regular Python strings, i.e. str objects. Constraints can be optionally configured for minimum and maximum string length.

Variables:
  • min_length (int) – Minimum allowed string length.
  • max_length (int) – Maximum allowed string length.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For String nodes, this ensures that the given data is of type str, and that the string length is within the limits.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
class dsch.schema.Time(set_on_create=False)

Schema node for time values.

This node type accepts regular Python times, i.e. datetime.time objects.

If set_on_create is set to True, the node value is automatically set to the current time whenever a new data node is created.

Variables:set_on_create (bool) – Automatically apply the current time on data node creation.
to_dict()

Return the node representation as a dict.

The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.

Returns:dict-representation of the node.
Return type:dict
validate(test_data)

Validate given data against the node’s constraints.

For Time nodes, this ensures that the given data is of type datetime.time.

If validation succeeds, the method terminates silently. Otherwise, an exception is raised.

Parameters:test_data – Data to be validated.
Raises:dsch.exceptions.ValidationError – if validation fails.
dsch.schema.node_from_dict(node_dict)

Create a new node from its node_dict.

This is effectively a shorthand for choosing the correct node class and then calling its from_dict method.

Parameters:node_dict (dict) – dict-representation of the node.
Returns:New schema node with the specified type and configuration.
dsch.schema.node_from_json(json_str)

Create a new node from its JSON representation.

Parameters:json_str (str) – JSON representation of the node.
Returns:New schema node with the specified type and configuration.
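The two factory functions above can be pictured as a lookup on the node_type field followed by a call to the matching class's from_dict. The following is a self-contained sketch of that pattern, not the dsch source: StringSketch, BoolSketch, NODE_CLASSES and the *_sketch functions are hypothetical names.

```python
import json


class StringSketch:
    """Stand-in for a configurable node type."""

    def __init__(self, min_length=None, max_length=None):
        self.min_length = min_length
        self.max_length = max_length

    @classmethod
    def from_dict(cls, node_dict):
        return cls(**node_dict['config'])

    def to_dict(self):
        config = {'min_length': self.min_length, 'max_length': self.max_length}
        return {'node_type': 'String', 'config': config}


class BoolSketch:
    """Stand-in for a node type without configuration."""

    @classmethod
    def from_dict(cls, node_dict):
        return cls()

    def to_dict(self):
        return {'node_type': 'Bool', 'config': {}}


# Registry mapping node_type names to the classes that can load them.
NODE_CLASSES = {'String': StringSketch, 'Bool': BoolSketch}


def node_from_dict_sketch(node_dict):
    """Pick the class named by node_type and delegate to its from_dict."""
    node_class = NODE_CLASSES[node_dict['node_type']]
    return node_class.from_dict(node_dict)


def node_from_json_sketch(json_str):
    """Decode the JSON string, then reuse the dict-based factory."""
    return node_from_dict_sketch(json.loads(json_str))


node = node_from_dict_sketch(
    {'node_type': 'String', 'config': {'min_length': 3, 'max_length': 10}})
restored = node_from_json_sketch(json.dumps(node.to_dict()))
```

In this sketch, a node survives a round trip through its JSON representation with its configuration unchanged.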