Working, tested prototype
Tackling H5MD flexibility
In the past few weeks, I’ve been able to reuse much of my initial zarrtraj code to make H5MD files cloud-streamable, but H5MD presents a much larger challenge in terms of parsing complexity. Here are a few examples:
- All time datasets are optional
- Two datasets that describe the same quantity (time, for example) can use different units
- Any dataset can be time-dependent or time-independent
This means I have to find a balance between allowing flexibility and constraining what H5MD files MDAnalysis can
practically read. In general, the more flexibility allowed, the harder it is to write a reader that is fast.
So far, there are only a few H5MD rules that zarrtraj has to break, including the rule that the simulation box and position integration step datasets have to be links to the same underlying data, since this isn’t possible in zarr.
There are a few things I’ve found that have made managing the complexity of H5MD and streaming much easier.
First, abstracting out the idea of an H5MDElement as it is defined in the H5MD specification has made interacting with the datasets in code much easier (thanks to Edis Jakupovic for this idea). Basically, this means that instead of constantly writing out checks that look like this:
step_ds = zarr_group["particles/trajectory/position/step"]
if step_ds.shape == ():
    # fixed mode: scalar stride with an optional "offset" attribute
    step = step_ds.attrs.get("offset", 0) + step_ds[()] * frame
else:
    # explicit mode: one stored step per frame
    step = step_ds[frame]
I can interact with the file much more easily like this, with the H5MDElement handling all the complexity of fixed, explicit, time-independent, and time-dependent datasets:
element.step[i]
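To make the idea concrete, here is a minimal sketch of how such a wrapper might hide the fixed versus explicit distinction behind plain indexing. It is not zarrtraj's actual implementation: `FakeDataset`, `StepView`, and `ElementSketch` are hypothetical names, the dataset is a stand-in for a zarr or h5py dataset, and the sketch assumes that in fixed mode the step of frame `i` is `offset + i * stride`, with `offset` stored as an optional attribute.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a zarr/h5py dataset (data, shape, attrs)."""

    data: Any
    attrs: dict = field(default_factory=dict)

    @property
    def shape(self):
        # A bare number models a scalar dataset; a list models a 1-D dataset.
        return (len(self.data),) if isinstance(self.data, list) else ()

    def __getitem__(self, key):
        # ds[()] reads a scalar dataset; anything else indexes the list.
        return self.data if key == () else self.data[key]


class StepView:
    """Indexable view that resolves fixed vs. explicit step storage once."""

    def __init__(self, dataset):
        self._ds = dataset
        self._fixed = dataset.shape == ()

    def __getitem__(self, i):
        if self._fixed:
            # Assumption: fixed mode means offset + i * stride.
            return self._ds.attrs.get("offset", 0) + self._ds[()] * i
        return self._ds[i]


class ElementSketch:
    """Hypothetical element wrapper exposing `step` with uniform indexing."""

    def __init__(self, group):
        self.step = StepView(group["step"])


fixed = ElementSketch({"step": FakeDataset(10, {"offset": 5})})
explicit = ElementSketch({"step": FakeDataset([0, 7, 20])})
print(fixed.step[3], explicit.step[2])
```

The calling code never needs to know which storage mode the file uses; the branch lives in one place instead of being repeated at every access.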
I’ve also found that moto is much easier to use as a session-scoped server. If many pytest test classes and methods require mock AWS services to test streaming, then frequently starting, stopping, or resetting the server adds both code complexity and runtime overhead. Simply starting the server when testing begins and stopping it when all tests are finished makes the tests faster and the code cleaner.
Finally, I’ve found it useful to decouple file reading from file validation. One class validates the file for compliance with the H5MD format, and the other performs the heavy IO of loading data into an MDAnalysis Timestep object. This also makes it much easier to experiment with different caching strategies, since there is no coupling between file validation and IO.
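The split might look something like the following sketch, which uses nested dictionaries as a stand-in for a zarr group hierarchy. The names `validate_layout`, `FrameLoader`, and `H5MDValidationError` are hypothetical, and the checks shown are only a small, illustrative subset of what a real H5MD validator would need.

```python
class H5MDValidationError(ValueError):
    """Raised when a file layout does not match what the reader expects."""


def validate_layout(root, particle_group="trajectory"):
    """Check structure only; reads no bulk data."""
    if "h5md" not in root:
        raise H5MDValidationError("missing 'h5md' metadata group")
    particles = root.get("particles", {})
    if particle_group not in particles:
        raise H5MDValidationError(f"missing particles/{particle_group}")
    position = particles[particle_group].get("position")
    if position is None or "value" not in position:
        raise H5MDValidationError("position element has no 'value' dataset")


class FrameLoader:
    """Does IO only; assumes validate_layout() has already passed."""

    def __init__(self, root, particle_group="trajectory"):
        self._value = root["particles"][particle_group]["position"]["value"]

    def load(self, frame):
        # A caching strategy could be swapped in here without
        # touching any of the validation code.
        return self._value[frame]


root = {
    "h5md": {},
    "particles": {"trajectory": {"position": {"value": [[0.0], [1.0]]}}},
}
validate_layout(root)
print(FrameLoader(root).load(1))
```

Because the loader never re-checks the layout, its hot path stays small, and different caching schemes can be tried by replacing `load` alone.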
I’m excited to get started on an H5MD cloud writer and get to the optimization and experimentation phase where I try to beat current reading and writing speed!