Advanced Topics

Schema design

Nested schemas

The example schema from the Tutorial is a very simple one, so let’s extend it. Suppose we want to store not just a single measurement result from our weather station, but multiple results taken at different times. A simple solution is to wrap the previous schema in a List:

schema_list = dsch.schema.List(
    dsch.schema.Compilation({
        'time': dsch.schema.DateTime(),
        'temperature': dsch.schema.Scalar(dtype='float', unit='°C'),
        'humidity': dsch.schema.Scalar(dtype='float', unit='%', min_value=0,
                                       max_value=100)
    })
)

If we now create a storage for this schema, we can use append() to add individual measurement results to the list:

import datetime

storage_list = dsch.create('list.h5', schema_list)
storage_list.data.append({
    'time': datetime.datetime.now(),
    'temperature': 21,
    'humidity': 42
})

Each list item is a Compilation that behaves exactly like in the Tutorial:

>>> storage_list.data[0].temperature.value
21.0

Alternatively, if you prefer to work with the data nodes directly, omit the argument to append(). This creates an empty data node whose fields can then be filled in individually:

storage_list.data.append()
storage_list.data[0].time.value = datetime.datetime.now()
storage_list.data[0].temperature.value = 21
storage_list.data[0].humidity.value = 42

By nesting Lists and Compilations, arbitrary schemas can be composed.

Schema extension

When working with measurement devices, a common approach is to implement small controller libraries for every device model. These libraries can be used by an application to control multiple devices in a measurement system, and to aggregate their data.

Dsch allows every library to define a schema for its result datasets. The application can then define another schema, incorporating all required library schemas and possibly adding other fields:

from lib1 import schema as schema_lib1
from lib2 import schema as schema_lib2
schema_app = dsch.schema.Compilation({
    'lib1': schema_lib1,
    'lib2': schema_lib2,
    'app_data': dsch.schema.Compilation(...),
})

This way, the original library schema is preserved, which means that programs expecting a library’s schema need not be changed to work with the application’s schema. This holds for the library itself as well as for any other libraries and applications consuming its data.

However, this requires one additional abstraction. The library (and possible other consuming programs) must be able to handle both cases:

“Library only” mode, where the library’s schema is the only (and therefore the top-level) schema inside a storage.
“Application” mode, where the library’s schema is only a subset of a broader schema.

The difference is that in “library only” mode, the library has to create (or load) and possibly save the entire storage, while in “application” mode, this is the application’s responsibility. To simplify this, dsch provides the PseudoStorage class, which automates the process. It can be initialized with either a string argument:

pseudo = dsch.PseudoStorage('example_lib_data.h5', schema_lib1)

… or with a data node argument:

storage = dsch.load('example_app_data.h5')
pseudo = dsch.PseudoStorage(storage.data.lib1, schema_lib1)

In both cases, pseudo.data now provides the data node corresponding to the top-level node in schema_lib1. The library does not need to perform any further checks or decisions in this regard.

Additionally, PseudoStorage can be used as a context manager:

with pseudo as p:
    p.data.spam = 2342
    p.data.eggs = True

This automatically handles saving the storage when leaving the context, if appropriate (i.e. if we are in “library only” mode).
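The two modes can be illustrated with a minimal sketch in plain Python. This is not dsch's implementation: FakeStorage and PseudoStorageSketch are hypothetical stand-ins, plain dicts replace data nodes, and saving only on a clean exit from the context is an assumption made here for illustration.

```python
class FakeStorage:
    """Hypothetical stand-in for a storage: holds data and counts saves."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self.save_count = 0

    def save(self):
        self.save_count += 1


class PseudoStorageSketch:
    """Illustrates the two modes; this is not dsch's PseudoStorage."""

    def __init__(self, target):
        if isinstance(target, str):
            # "Library only" mode: we own the storage and must save it.
            self.storage = FakeStorage(target)
            self.data = self.storage.data
        else:
            # "Application" mode: the caller owns the storage.
            self.storage = None
            self.data = target

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumption: save only in "library only" mode and only if the
        # block finished without raising an exception.
        if self.storage is not None and exc_type is None:
            self.storage.save()
        return False  # never swallow exceptions


# "Library only" mode: leaving the context saves the storage.
with PseudoStorageSketch('example_lib_data.h5') as lib_mode:
    lib_mode.data['spam'] = 2342

# "Application" mode: the application saves its own storage later.
app_storage = FakeStorage('example_app_data.h5')
app_storage.data['lib1'] = {}
with PseudoStorageSketch(app_storage.data['lib1']) as app_mode:
    app_mode.data['spam'] = 2342
```

In both cases the code inside the with block is identical, which is the point of the abstraction: only the constructor argument decides who is responsible for saving.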

Using PseudoStorage is the recommended way for using dsch in library code.

Multiple schema versions

Sometimes, schema changes cannot be avoided, so a new version must be designed. However, backwards compatibility is usually desired, at least on the data consumption side.

When using load(), this can be achieved by simply not setting the required_schema argument. The storage’s schema_node attribute can then be checked for compatibility, and the subsequent data handling steps can be adapted accordingly.

When using PseudoStorage, a different approach is required since the schema_node argument cannot be omitted upon object creation. This is because the PseudoStorage must “know” the desired schema for cases in which it has to create a new storage.

To use multiple schema versions with PseudoStorage, supply the schema_alternatives argument on object creation:

current_schema = dsch.schema.Compilation(...)
old_schema = dsch.schema.Compilation(...)
pseudo = dsch.PseudoStorage(storage_path, schema_node=current_schema,
                            schema_alternatives=(old_schema,))

Now, when loading from a storage or a data node, pseudo will first check the detected schema against current_schema (because that was specified as schema_node). If these do not match, every schema in schema_alternatives is tried in turn. Only if none of them matches is an InvalidSchemaError raised. When creating new storages, only schema_node is used and schema_alternatives is not considered.

An arbitrary number of alternative schemas can be specified through schema_alternatives, and each can be given as either the schema node object or as a string, representing the corresponding schema node’s hash.
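The matching order described above can be sketched in plain Python. This is an illustration only, not dsch's code: schemas are represented as plain dicts, the hash is assumed to be a SHA256 over a JSON serialization, and SchemaMismatch, schema_hash() and select_schema() are hypothetical names.

```python
import hashlib
import json


class SchemaMismatch(Exception):
    """Hypothetical stand-in for dsch's InvalidSchemaError."""


def schema_hash(schema):
    """Hash a schema given as a plain dict (assumed serialization)."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()


def select_schema(detected_hash, schema_node, schema_alternatives=()):
    """Return the first candidate whose hash equals ``detected_hash``.

    ``schema_node`` is tried first, then each alternative in order.
    Alternatives may be schema dicts or precomputed hash strings.
    """
    for candidate in (schema_node, *schema_alternatives):
        if isinstance(candidate, str):
            candidate_hash = candidate
        else:
            candidate_hash = schema_hash(candidate)
        if candidate_hash == detected_hash:
            return candidate
    raise SchemaMismatch('no known schema matches the stored data')


current_schema = {'temperature': 'float', 'humidity': 'float'}
old_schema = {'temperature': 'float'}

# Data written with the old schema is still accepted via the alternatives.
stored_hash = schema_hash(old_schema)
matched = select_schema(stored_hash, current_schema, (old_schema,))
```

Consuming code can then branch on which schema matched and adapt its data handling to the older layout.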

Querying field state

“Complete” and “empty” fields

As presented in the tutorial, all data nodes have an empty attribute that, if True, indicates the absence of a value for this node. For Compilation, empty works recursively.

Note

To restore a non-empty node back to the empty state, i.e. entirely remove the stored data, use the clear() method.

For practical use, it can be helpful to know whether a dataset contains all required information, i.e. whether it is complete. Therefore, data nodes also have a complete attribute, which indicates the presence of a value.

Note that complete is simply the inverse of empty only for regular data nodes. For Compilation, both are evaluated recursively over all sub-nodes: complete is True if all sub-nodes are complete, while empty is True if all sub-nodes are empty. Thus, for a partially filled Compilation, both can be False at the same time.
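To make the recursive semantics concrete, here is a minimal sketch in plain Python. Leaf and Group are hypothetical stand-ins for a regular data node and a Compilation data node. They only mirror the rules described above and are not dsch classes.

```python
class Leaf:
    """Hypothetical stand-in for a regular data node holding one value."""

    def __init__(self, value=None):
        self.value = value

    @property
    def empty(self):
        return self.value is None

    @property
    def complete(self):
        # For a regular node, complete is simply the inverse of empty.
        return not self.empty


class Group:
    """Hypothetical stand-in for a Compilation data node."""

    def __init__(self, **children):
        self.children = children

    @property
    def empty(self):
        # Empty only if every sub-node is empty.
        return all(child.empty for child in self.children.values())

    @property
    def complete(self):
        # Complete only if every sub-node is complete.
        return all(child.complete for child in self.children.values())


# One field filled, one field missing: neither empty nor complete.
partial = Group(temperature=Leaf(21.0), humidity=Leaf())
print(partial.empty, partial.complete)  # False False
```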

Optional fields in Compilations

Some schemas may contain optional fields, i.e. fields that are not required for a dataset to be considered “complete”. For example, a measurement result might contain a “comment” field that is not strictly required for the dataset to make sense. In this case, complete should return True even if no comment is provided.

This behaviour can be achieved by passing a list of optional field names as the optionals argument when initializing the Compilation schema node:

schema = dsch.schema.Compilation({
    'time': dsch.schema.Array(dtype='float', unit='s'),
    'voltage': dsch.schema.Array(dtype='float', unit='V'),
    'comment': dsch.schema.String(),
}, optionals=['comment'])
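The effect of optionals on completeness can be sketched as follows. The helper is_complete() is hypothetical and works on a plain dict in which None marks an empty field. It only illustrates the rule that optional fields are skipped when completeness is evaluated.

```python
def is_complete(values, optionals=()):
    """Return True if every non-optional field in ``values`` has a value.

    ``values`` maps field names to stored values, with None meaning
    "empty". This is a hypothetical helper, not part of dsch.
    """
    return all(value is not None
               for name, value in values.items()
               if name not in optionals)


record = {'time': [0.0, 0.1], 'voltage': [1.2, 1.3], 'comment': None}
print(is_complete(record))                         # False
print(is_complete(record, optionals=['comment']))  # True
```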

Validation

Ensuring a compatible schema

When loading a storage, dsch can ensure that it conforms to a specific schema. Then, consuming code can rely on the data to really be structured in the expected way. Schemas are automatically identified by a SHA256 hash, which can be queried via any schema node’s hash() or a storage’s schema_hash(). Once determined, it can be given to load() as the require_schema argument, causing dsch to raise a RuntimeError if the to-be-loaded storage has a different schema:

# Avoid shadowing Python's built-in hash() with the variable name.
schema_hash = known_good_storage.schema_hash()
unknown_storage = dsch.load(path_to_storage, require_schema=schema_hash)

Inter-node validation

Usually, validation only covers a single node at a time, so each node’s value is validated against that node’s own constraints. This is insufficient for e.g. digital signals, like a measured voltage over time, which could be represented as two Array instances voltage and time. In this case, time is the independent variable and voltage depends on time, which implicitly requires the two arrays to have equal length and time to be one-dimensional.

Automatic validation of these constraints can be achieved by providing a depends_on argument to the dependent variable’s schema node:

schema = dsch.schema.Compilation({
    'time': dsch.schema.Array(dtype='float'),
    'voltage': dsch.schema.Array(dtype='float', depends_on=('time',))
})

That argument must be an iterable of field names corresponding to all independent variables, so this also works for arrays of higher dimensionality. For example, a 2-dimensional matrix could have two entries in depends_on, one for each dimension. If no independent variable exists for a particular dimension, None may be specified instead of a field name.
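The constraints implied by depends_on can be sketched in plain Python. This is not dsch's validation code: shape() and check_depends_on() are hypothetical helpers, arrays are represented as rectangular nested lists, and requiring exactly one depends_on entry per dimension is an assumption made for this illustration.

```python
def shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0] if data else None
    return tuple(dims)


def check_depends_on(fields, name, depends_on):
    """Check one dependent array against its independent variables.

    Hypothetical helper: raises ValueError if a constraint is violated.
    """
    dependent_shape = shape(fields[name])
    # Assumption: one depends_on entry per dimension of the dependent array.
    if len(depends_on) != len(dependent_shape):
        raise ValueError('need one depends_on entry per dimension')
    for axis, independent in enumerate(depends_on):
        if independent is None:
            continue  # no independent variable for this dimension
        independent_shape = shape(fields[independent])
        if len(independent_shape) != 1:
            raise ValueError(f'{independent!r} must be one-dimensional')
        if independent_shape[0] != dependent_shape[axis]:
            raise ValueError(
                f'{name!r} axis {axis} does not match {independent!r}')


fields = {
    'time': [0.0, 0.1, 0.2],
    'voltage': [1.0, 1.1, 1.2],
    'matrix': [[1, 2], [3, 4], [5, 6]],
}
check_depends_on(fields, 'voltage', ('time',))      # lengths match
check_depends_on(fields, 'matrix', ('time', None))  # second axis is free
```

Passing None for the second dimension of matrix skips the length check for that axis, mirroring the rule for dimensions without an independent variable.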