Schema specification
dsch schema specification.
In dsch, data is structured according to a given schema, which must be defined prior to working with the data (i.e. saving and loading). A schema is a tree-like hierarchical structure of schema nodes, each of which applies certain constraints to the data. In general, there are three different kinds of schema nodes:
- Items
- represent a data point (e.g. a string or a NumPy array). These are the leaves of the tree.
- Compilations
- represent a compound of data made up from multiple named fields. Each field is represented by another node and therefore supports its own constraints.
- Lists
- can contain multiple elements of the same type. The constraints are described with a single sub-node, but then applied to all data elements.
Compilations and lists both support all kinds of sub-nodes and can be nested. This makes it possible to specify arbitrary hierarchically structured schemas.
Note
The classes in this module are used to specify a schema. This specification
is different from what the user sees when interacting with the actual data.
For example, a list-type schema node only has a single sub-node, since
the schema node only specifies the constraints to be applied to the data.
When users interact with the data, however, they use data nodes, which,
in the case of lists, can contain multiple sub-nodes.
Data nodes are defined in the dsch.data module.
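The difference between a list-type schema node and the data it describes can be illustrated with a small stand-alone sketch. The classes `ItemSchema` and `ListSchema` below are hypothetical stand-ins and not part of dsch; they only mimic the idea that a list schema holds a single sub-node whose constraints are applied to every data element.

```python
from dataclasses import dataclass


@dataclass
class ItemSchema:
    """Hypothetical leaf node: constrains a single value to one Python type."""
    value_type: type

    def validate(self, value):
        if not isinstance(value, self.value_type):
            raise TypeError(
                f"expected {self.value_type.__name__}, got {type(value).__name__}"
            )


@dataclass
class ListSchema:
    """Hypothetical list node: exactly one sub-node describes all elements."""
    subnode: ItemSchema

    def validate(self, values):
        # The single sub-node is applied to each data element in turn.
        for value in values:
            self.subnode.validate(value)


schema = ListSchema(subnode=ItemSchema(str))
data = ["alpha", "beta", "gamma"]  # three data elements, one schema sub-node
schema.validate(data)              # passes silently
```

The schema stays the same size no matter how many elements the data contains, which is the distinction the note above describes.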
-
class dsch.schema.Array(dtype, unit='', max_shape=None, min_shape=None, ndim=1, max_value=None, min_value=None, depends_on=None)
Schema node for NumPy ndarray values.
This node type accepts values of type numpy.ndarray.
In addition to the actual value(s), Arrays contain metadata:
- The unit of the physical quantity that is represented by the values, e.g. ‘V’ for volts.
Also, Arrays support various constraints:
- NumPy data type (numpy.dtype). This directly validates the array’s dtype. Note that the data type is always matched exactly, so one cannot require “any of the int-dtypes”. This attribute is non-optional, since many backends require knowledge of the data type for efficient storage.
- Minimum and maximum array shape. Like numpy.ndarray.shape, these tuples limit the size of each dimension of the array. If both minimum and maximum shape are given, they must have the same length.
- Number of array dimensions. If a minimum or maximum array shape is given, this is determined automatically. Otherwise, it can be given explicitly or left at the default value of 1.
- Minimum and maximum array values. This constraint is applied to each individual value in the array, meaning that a single array element value outside the given boundaries causes validation to fail.
- Linking with independent variables. Often, an array represents a variable that depends on another variable, e.g. a time-dependent voltage vector u(t). When t is registered as an independent variable of u, dsch ensures that the array sizes match. Independent variables can be defined by supplying the desired node’s name to the depends_on argument. Note that this only works if the dependent and independent variable’s nodes are both direct sub-nodes of the same Compilation. For multi-dimensional arrays, depends_on is a tuple, containing a node name (or None) for each of the array’s dimensions.
Note: The constraint parameters for array shape, values and variable dependencies all default to None, effectively disabling the corresponding validation step.
Variables:
- dtype (numpy.dtype or str) – Required NumPy dtype.
- unit (str) – Unit of the physical quantity, e.g. ‘V’ for volts. Unit (and values) should be given without any SI prefixes.
- max_shape (tuple) – Maximum allowed array shape.
- min_shape (tuple) – Minimum allowed array shape.
- ndim (int) – Number of array dimensions.
- max_value – Maximum allowed value for any element.
- min_value – Minimum allowed value for any element.
- depends_on (str or tuple) – Name(s) of nodes representing the independent variables.
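The shape and value constraints can be restated as a short sketch. The helpers `check_shape` and `check_values` are hypothetical and not dsch code; they work on plain shape tuples and flat sequences so the example runs without NumPy, and they treat a bound of None as "check disabled", following the note above.

```python
def check_shape(shape, min_shape=None, max_shape=None):
    """Return True if every dimension lies within the optional bounds."""
    if min_shape is not None:
        if len(min_shape) != len(shape):
            return False  # number of dimensions does not match
        if any(size < limit for size, limit in zip(shape, min_shape)):
            return False
    if max_shape is not None:
        if len(max_shape) != len(shape):
            return False
        if any(size > limit for size, limit in zip(shape, max_shape)):
            return False
    return True


def check_values(values, min_value=None, max_value=None):
    """Return True only if every single element lies within the bounds."""
    for value in values:
        if min_value is not None and value < min_value:
            return False
        if max_value is not None and value > max_value:
            return False
    return True


print(check_shape((3, 100), min_shape=(1, 10), max_shape=(3, 1000)))  # True
print(check_shape((4, 100), max_shape=(3, 1000)))                     # False
print(check_values([0.1, 0.5, 2.0], min_value=0.0, max_value=1.0))    # False
```

In the last call a single element (2.0) lies outside the bounds, which is enough to fail the whole array.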
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data, independent_values)
Validate given data against the node’s constraints.
For Array nodes, this ensures that the given data is of type numpy.ndarray and that all constraints (dimensions, dtype etc.) are met.
If depends_on is set, the array dimensions are automatically validated against the independent variable’s array length. Therefore, in contrast to other node types, this method requires a second argument, supplying a tuple of numpy.ndarray representing the independent variables. The tuple’s length must consequently be equal to ndim and to the length of depends_on.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters:
- test_data – Data to be validated.
- independent_values – Values of the independent variables.
Raises: dsch.exceptions.ValidationError – if validation fails.
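The size matching between a dependent array and its independent variables might look roughly like the following sketch. The function `check_dependencies` is hypothetical, and as a simplification it receives the length of each independent variable rather than the arrays themselves; it also raises ValueError instead of dsch.exceptions.ValidationError.

```python
def check_dependencies(shape, depends_on, independent_lengths):
    """Hypothetical check: each dimension must match its independent variable.

    shape               -- shape tuple of the dependent array
    depends_on          -- one node name (or None) per dimension
    independent_lengths -- length of each independent variable (or None)
    """
    if not (len(shape) == len(depends_on) == len(independent_lengths)):
        raise ValueError("one entry per array dimension is required")
    for size, name, length in zip(shape, depends_on, independent_lengths):
        if name is None:
            continue  # this dimension has no independent variable
        if size != length:
            raise ValueError(
                f"dimension linked to {name!r} has size {size}, expected {length}"
            )


# u(t): 1000 voltage samples that depend on 1000 time stamps.
check_dependencies(shape=(1000,), depends_on=("t",), independent_lengths=(1000,))

# Two-dimensional case: only the second dimension is linked to "t".
check_dependencies(
    shape=(3, 1000), depends_on=(None, "t"), independent_lengths=(None, 1000)
)
```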
-
class dsch.schema.Bool
Schema node for scalar boolean values.
This node type only accepts values of type bool.
No configuration is required.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For Bool nodes, this ensures that the given data is of type bool.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
-
class dsch.schema.Bytes(min_length=None, max_length=None)
Schema node for bytes values.
This node type accepts regular Python byte strings, i.e. bytes objects. Constraints can be optionally configured for minimum and maximum length.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For Bytes nodes, this ensures that the given data is of type bytes, and that its length is within the limits.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
-
class dsch.schema.Compilation(subnodes, optionals=None)
Schema node for compound values composed from multiple named sub-nodes.
Usually, a Compilation is used to name and group related data. Together with List, this node type allows building arbitrary hierarchical schemas.
The corresponding data node is a subclass of dsch.data.Compilation and provides attributes corresponding to the schema node’s sub-node names. Each of those attributes then represents a data node. While the general functionality is more similar to a dict, the object representation is preferred because of the more compact “dotted” notation. This is especially relevant when nesting compilations, e.g. measurement.sampling.frequency vs. measurement['sampling']['frequency'].
Compilations support defining optional sub-nodes. This can be used when describing a data structure that has truly optional fields, i.e. fields that can be omitted without making the entire data unusable. The selected sub-nodes are then ignored when checking for completeness via dsch.data.Compilation.complete.
Variables:
- subnodes – dict-like mapping of names to schema sub-nodes.
- optionals (list) – List of names of sub-nodes that are ignored when checking for completeness.
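The effect of optional sub-nodes on a completeness check, and the difference between dotted and dict-style access, can be sketched with the standard library alone. The function `is_complete` is a hypothetical helper, not the implementation behind dsch.data.Compilation.complete, and it models an unset field as a missing or None entry, which is an assumption.

```python
from types import SimpleNamespace


def is_complete(values, subnode_names, optionals=()):
    """Hypothetical completeness check that skips optional fields."""
    required = [name for name in subnode_names if name not in optionals]
    return all(values.get(name) is not None for name in required)


names = ["frequency", "comment"]
print(is_complete({"frequency": 48000.0}, names, optionals=["comment"]))   # True
print(is_complete({"comment": "test run"}, names, optionals=["comment"]))  # False

# Dotted access versus nested dict access for the same structure.
measurement = SimpleNamespace(sampling=SimpleNamespace(frequency=48000.0))
as_dict = {"sampling": {"frequency": 48000.0}}
print(measurement.sampling.frequency == as_dict["sampling"]["frequency"])  # True
```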
-
classmethod from_dict(node_dict)
Create a new instance from a dict representation.
Parameters: node_dict – dict-representation of the node to be loaded.
Returns: New compilation-type schema node.
Return type: Compilation
-
class dsch.schema.Date(set_on_create=False)
Schema node for date values.
This node type accepts regular Python dates, i.e. datetime.date objects.
If set_on_create is set to True, the node value is automatically set to the current date whenever a new data node is created.
Variables: set_on_create (bool) – Automatically apply the current date on data node creation.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For Date nodes, this ensures that the given data is of type datetime.date.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
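The set_on_create behaviour described for Date nodes can be pictured with the following sketch. The function `initial_date` is hypothetical, and modelling an unset node value as None is an assumption made only for this illustration.

```python
import datetime


def initial_date(set_on_create=False):
    """Hypothetical helper: initial value of a freshly created date node."""
    # With set_on_create, the value is filled with the current date;
    # otherwise the node starts out empty (modelled here as None).
    return datetime.date.today() if set_on_create else None


print(initial_date())                    # None
print(initial_date(set_on_create=True))  # the current date
```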
-
-
class dsch.schema.DateTime(set_on_create=False)
Schema node for datetime values.
This node type accepts regular Python datetimes, i.e. datetime.datetime objects.
If set_on_create is set to True, the node value is automatically set to the current date and time whenever a new data node is created.
Variables: set_on_create (bool) – Automatically apply the current date and time on data node creation.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For DateTime nodes, this ensures that the given data is of type datetime.datetime.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
-
class dsch.schema.List(subnode, max_length=None, min_length=None)
Schema node for lists of same-type elements.
A List is used to represent multiple data items that must meet the same constraints, e.g. a list of equally sized NumPy arrays. It uses a single schema node to specify the constraints for all entries. Note that this behavior is different from regular Python lists, which can contain arbitrary entries.
In addition to the sub-node, constraints can also be set for the List itself, specifying the maximum and minimum list length (i.e. number of items in the list).
Often, a List is used with a Compilation as its sub-node, which makes it possible to represent arbitrary hierarchical schemas.
-
classmethod from_dict(node_dict)
Create a new instance from a dict representation.
Parameters: node_dict – dict-representation of the node to be loaded.
Returns: New list-type schema node.
Return type: List
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For List nodes, this ensures that the list length is within the limits specified with max_length and min_length.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
class dsch.schema.Scalar(dtype, unit='', max_value=None, min_value=None)
Schema node for NumPy scalar values.
This node type accepts scalar values of all numeric NumPy scalar types, i.e. subclasses of numpy.number.
In addition to the actual value, Scalars contain some metadata:
- The unit of the physical quantity that is represented by the value, e.g. ‘V’ for volts.
Also, Scalars support constraints:
- NumPy data type (numpy.dtype). This directly validates the scalar’s dtype. Note that the data type is always matched exactly, so one cannot require “any of the int-dtypes”. This attribute is non-optional, since many backends require knowledge of the data type for efficient storage.
- Minimum and maximum values.
Variables:
- dtype (numpy.dtype or str) – Required NumPy dtype.
- unit (str) – Unit of the physical quantity, e.g. ‘V’ for volts. Unit (and value) should be given without any SI prefixes.
- max_value – Maximum allowed value.
- min_value – Minimum allowed value.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For Scalar nodes, this ensures that the given data is of a subtype of numpy.number and that all constraints (dtype etc.) are met.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
class dsch.schema.SchemaNode
Base class for all kinds of schema nodes.
All schema node classes must derive from this class, which provides very general functionality.
-
classmethod from_dict(node_dict)
Create a new instance from a dict representation.
Parameters: node_dict – dict-representation of the node to be loaded.
Returns: New schema node instance.
-
hash()
Calculate the node’s SHA256 hash.
The hash is calculated based on the JSON representation of the node. Consequently, identical node configurations result in the same hash.
Returns: SHA256 hash (hex) of the schema.
Return type: str
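Hashing a JSON representation can be sketched with the standard library as follows. The helper `node_hash` is hypothetical; in particular, serializing with sort_keys=True and the config keys used in the example dicts are assumptions, not a statement about how dsch produces its JSON.

```python
import hashlib
import json


def node_hash(node_dict):
    """Hypothetical helper: SHA256 hex digest of a node's JSON representation."""
    # sort_keys makes the JSON text independent of dict insertion order;
    # whether dsch serializes exactly this way is an assumption.
    text = json.dumps(node_dict, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = {"node_type": "String", "config": {"min_length": 1, "max_length": 10}}
b = {"config": {"max_length": 10, "min_length": 1}, "node_type": "String"}
c = {"node_type": "String", "config": {"min_length": 1, "max_length": 20}}

print(node_hash(a) == node_hash(b))  # True: same configuration, same hash
print(node_hash(a) == node_hash(c))  # False: configuration differs
print(len(node_hash(a)))             # 64 hex characters
```

Because the digest depends only on the serialized configuration, it can serve as a compact check that two schemas are identical.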
-
class dsch.schema.String(min_length=None, max_length=None)
Schema node for string values.
This node type accepts regular Python strings, i.e. str objects. Constraints can be optionally configured for minimum and maximum string length.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For String nodes, this ensures that the given data is of type str, and that the string length is within the limits.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
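The documented String checks, including the "silent on success, exception on failure" convention, can be restated in a few lines. Both `validate_string` and the local `ValidationError` class are stand-ins written for this sketch; the real exception lives in dsch.exceptions.

```python
class ValidationError(Exception):
    """Local stand-in for dsch.exceptions.ValidationError."""


def validate_string(test_data, min_length=None, max_length=None):
    """Hypothetical re-statement of the documented String checks."""
    if not isinstance(test_data, str):
        raise ValidationError(f"expected str, got {type(test_data).__name__}")
    if min_length is not None and len(test_data) < min_length:
        raise ValidationError(f"string shorter than {min_length} characters")
    if max_length is not None and len(test_data) > max_length:
        raise ValidationError(f"string longer than {max_length} characters")


validate_string("dsch", min_length=1, max_length=8)  # passes silently

try:
    validate_string("a very long string", max_length=8)
except ValidationError as err:
    print(err)  # string longer than 8 characters
```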
-
-
class dsch.schema.Time(set_on_create=False)
Schema node for time values.
This node type accepts regular Python times, i.e. datetime.time objects.
If set_on_create is set to True, the node value is automatically set to the current time whenever a new data node is created.
Variables: set_on_create (bool) – Automatically apply the current time on data node creation.
-
to_dict()
Return the node representation as a dict.
The representation dict includes a field node_type with the node class name and a field config with a dict of the configuration options.
Returns: dict-representation of the node.
Return type: dict
-
validate(test_data)
Validate given data against the node’s constraints.
For Time nodes, this ensures that the given data is of type datetime.time.
If validation succeeds, the method terminates silently. Otherwise, an exception is raised.
Parameters: test_data – Data to be validated.
Raises: dsch.exceptions.ValidationError – if validation fails.
-
-
dsch.schema.node_from_dict(node_dict)
Create a new node from its node_dict.
This is effectively a shorthand for choosing the correct node class and then calling its from_dict method.
Parameters: node_dict (dict) – dict-representation of the node.
Returns: New schema node with the specified type and configuration.
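Choosing a class from the node_type field and delegating to its from_dict method could be sketched as below. The classes `StringNode` and `BoolNode` and the `NODE_CLASSES` registry are hypothetical stand-ins; how dsch actually looks up the class is not stated here, so the dict-based registry is an assumption.

```python
class StringNode:
    """Hypothetical stand-in for a schema node class with configuration."""

    def __init__(self, min_length=None, max_length=None):
        self.min_length = min_length
        self.max_length = max_length

    @classmethod
    def from_dict(cls, node_dict):
        # The config dict holds the constructor options.
        return cls(**node_dict["config"])


class BoolNode:
    """Hypothetical stand-in for a node without configuration."""

    @classmethod
    def from_dict(cls, node_dict):
        return cls()


# Registry mapping the node_type field to the class that handles it.
NODE_CLASSES = {"String": StringNode, "Bool": BoolNode}


def node_from_dict(node_dict):
    """Pick the class named in node_type and delegate to its from_dict."""
    node_class = NODE_CLASSES[node_dict["node_type"]]
    return node_class.from_dict(node_dict)


node = node_from_dict({"node_type": "String", "config": {"max_length": 10}})
print(type(node).__name__, node.max_length)  # StringNode 10
```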