Tutorial¶
This tutorial will give you a quick overview on working with DSCH. Just make sure you have installed it and get started!
Defining a Schema¶
Let’s start with a simple example.
First, we must import dsch itself:
import dsch
Suppose we want to store data of a little weather station that measures temperature and humidity, plus the date and time of each measurement. With this set of quantities in mind, we can construct a schema that essentially defines the structure of our dataset. In this example, the schema might look like this:
schema = dsch.schema.Compilation({
'time': dsch.schema.DateTime(),
'temperature': dsch.schema.Scalar(dtype='float', unit='°C'),
'humidity': dsch.schema.Scalar(dtype='float', unit='%', min_value=0,
max_value=100)
})
Here, we use a Compilation as a container for our data
fields, allowing us to name the individual fields. Also, we define the units to
be used for the physical quantities and sensible value limits for the humidity.
For numerical data, the dtype (corresponding to NumPy dtype)
must also be given.
Opening a Storage¶
Dsch can store data using a number of backends, e.g. different file formats. However, since backends could also be implemented for databases or other non-file storage engines, we avoid the term “file” and use “storage” instead. A storage always holds data corresponding to exactly one schema.
To open a storage, we can either use dsch.load() or
dsch.create(), depending on whether the storage
already exists. In both cases, we must provide a location for the storage, e.g.
the path to a file. Since we started from scratch, we must use
dsch.create(), which requires our previously
defined schema as an argument:
storage = dsch.create('test.h5', schema)
If we wanted to open an existing storage, we would not need to provide the schema, as it is automatically loaded:
storage = dsch.load('test.h5')
Note that in both cases, we did not have to explicitly state the backend to be
used. This is because dsch automatically detects that this is a file path ending
in “.h5” and chooses the HDF5 backend accordingly. For instance, if we wrote
test.npz, the NumPy npz backend would be chosen instead. This auto-detection
can also be overridden if required, see create() and
load().
Accessing Data¶
Once we have a storage object, we can start accessing the data. All data is
provided via the data attribute of a storage, which is structured in the
exact way that we previously defined in our schema. So, in our example,
storage.data (the top-level data node) is a Compilation containing three
child nodes (time, temperature and humidity).
Child nodes of Compilations are represented as attributes, so they can be easily
accessed with the “dotted” notation.
Note, however, that these data nodes are not simply the stored values, but
objects with additional functionality. The actual stored value is available
through the data node’s value attribute:
>>> storage.data.time.value
[...]
NodeEmptyError: Node is empty. The value of empty nodes is undefined.
Well, we should have expected this! Since we just created a new, empty file, there is no data present. In fact, we can check whether a node is empty:
>>> storage.data.time.empty
True
This even works for Compilation nodes, where all child nodes are checked recursively:
>>> storage.data.empty
True
The empty attribute is an example of functionality that data nodes provide
beyond simply storing a value.
Depending on the node type and the backend in use, there are different
functional ranges.
Of course, we can also assign new variables for any node, providing a shortcut for access:
>>> temp = storage.data.temperature
>>> temp.empty
True
Modifying Data¶
The data stored in a data node can be changed by setting the value
attribute. This is also the way to apply an initial value to an empty node:
import datetime
storage.data.time.value = datetime.datetime.now()
storage.data.temperature.value = 21
storage.data.humidity.value = 42
Now, we can inspect the filled data structure:
>>> storage.data.empty
False
>>> storage.data.temperature.value
21.0
An alternative to setting all values individually is to use the Compilation’s
replace method, which accepts a dict:
storage.data.replace({
'time': datetime.datetime.now(),
'temperature': 21,
'humidity': 42
})
This is equivalent to the example above.
Data Validation¶
All data can be validated against the constraints defined in the schema.
For example, our schema states that the value for humidity must be in the
range from 0 to 100.
Since we previously set that value to 42, validation succeeds (i.e. terminates
silently):
>>> storage.data.humidity.validate()
However, if we set an out-of range value, a
ValidationError is raised:
>>> storage.data.humidity.value = 123
>>> storage.data.humidity.validate()
[...]
ValidationError: Maximum value exceeded. (Expected: 100. Got: 123.0)
Of course, we can also validate the entire storage in a single step:
>>> storage.validate()
[...]
SubnodeValidationError: Field "humidity" failed validation: Maximum value exceeded. (Expected: 100. Got: 123.0)
Note that now, a SubnodeValidationError is raised, providing
details on the affected node.
Storing Data¶
For all current backends, changes to the data inside a storage are not
automatically written to disk.
To do that, you must call save() explicitly:
>>> storage.save()
[...]
SubnodeValidationError: Field "humidity" failed validation: Maximum value exceeded.
Oh, right, we still have that invalid value set for humidity! As we can see,
data is, by default, automatically validated before saving. This prevents us
from accidentally producing files with invalid for physically impossible values.
Of course, when we provide a sensible value again, we can easily save our file:
>>> storage.data.humidity.value = 42
>>> storage.save()
Conclusion¶
Handling data with dsch is easy! Just define a schema, open a storage for it, and fill it with data - that’s it for basic usage patterns!
Of course, there are a few more features in dsch that you might want to use. These are presented in short blocks in Advanced Topics.