Tackling H5MD flexibility

In the past few weeks, I’ve been able to reuse much of my initial zarrtraj code in making H5MD files cloud-streamable, but H5MD presents a much larger challenge in terms of parsing complexity. Here are a few examples:

  • All time datasets are optional
  • Two datasets that contain data on the same parameter (like time for example) can use two different units
  • Any dataset can be time-dependent or time-independent

This means I have to find a balance between allowing flexibility and constraining what H5MD files MDAnalysis can practically read. In general, the more flexibility allowed, the harder it is to write a reader that is fast. So far, zarrtraj has only had to break a few H5MD rules, including the rule that the integration step datasets for the simulation box and the positions have to be links to the same underlying data, since this isn’t possible in zarr.

There are a few things I’ve found that have made managing the complexity of H5MD and streaming much easier. First, abstracting out the idea of an H5MDElement, as it is defined in the H5MD specification, has simplified interacting with the datasets in code (thanks to Edis Jakupovic for this idea). Basically, this means instead of constantly writing out checks that would look like this:

step_ds = zarr_group["particles/trajectory/positions/step"]
if step_ds.shape == ():
    # Fixed storage: a scalar step increment plus an optional "offset" attribute
    offset = step_ds.attrs.get("offset", 0)
    step = offset + step_ds[()] * frame
else:
    # Explicit storage: one stored step per frame
    step = step_ds[frame]

I can interact with the file much more easily like this, having the H5MDElement handle all the complexity of fixed, explicit, time-independent, and time-dependent datasets:

element.step[i]
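To make the idea concrete, here is a minimal sketch of how such a wrapper could hide the fixed vs. explicit step check. It is not zarrtraj's actual H5MDElement: the names FakeDataset, StepView, and ElementSketch are made up for illustration, and FakeDataset is only a stand-in for a zarr dataset so the example runs without any files.

```python
class FakeDataset:
    """Hypothetical stand-in for a zarr dataset: values plus an attrs dict."""

    def __init__(self, data, attrs=None):
        self._data = data
        self.attrs = dict(attrs or {})

    @property
    def shape(self):
        # Scalars report an empty shape, like a 0-d array.
        return (len(self._data),) if isinstance(self._data, list) else ()

    def __getitem__(self, key):
        return self._data if key == () else self._data[key]


class StepView:
    """Indexable view that hides fixed vs. explicit step storage."""

    def __init__(self, dataset):
        self._ds = dataset

    def __getitem__(self, frame):
        if self._ds.shape == ():
            # Fixed storage: scalar increment plus an optional "offset" attribute.
            offset = self._ds.attrs.get("offset", 0)
            return offset + self._ds[()] * frame
        # Explicit storage: one stored step per frame.
        return self._ds[frame]


class ElementSketch:
    """Hypothetical wrapper around one time-dependent element group."""

    def __init__(self, group):
        self.step = StepView(group["step"])
        self.value = group["value"]


# Same calling code for both storage layouts.
fixed = ElementSketch({
    "step": FakeDataset(10, attrs={"offset": 100}),
    "value": FakeDataset([[0.0], [1.0], [2.0], [3.0]]),
})
explicit = ElementSketch({
    "step": FakeDataset([0, 5, 20]),
    "value": FakeDataset([[0.0], [1.0], [2.0]]),
})
print(fixed.step[3], explicit.step[2])
```

The branching lives in one place, so the reader code that consumes an element never needs to know which storage layout the file used.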

I’ve also found moto is much easier to use as a session-scoped server. When many pytest test classes and methods need mock AWS services to test streaming, repeatedly starting, stopping, or resetting the server complicates the code and adds overhead. Simply starting the server when testing begins and stopping it when all tests are finished makes tests faster and code cleaner.
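A session-scoped setup along these lines might look like the conftest.py sketch below. It assumes moto is installed with its server extra and that port 5000 is free. The fixture name moto_endpoint and the constants are my own choices for illustration, not necessarily what zarrtraj's test suite uses.

```python
import pytest

# Assumed values for this sketch; pick any free local port.
MOTO_PORT = 5000
MOTO_ENDPOINT = f"http://127.0.0.1:{MOTO_PORT}"


@pytest.fixture(scope="session")
def moto_endpoint():
    """Start one mock AWS server for the whole test session."""
    # Skip dependent tests instead of erroring if moto's server is unavailable.
    server_module = pytest.importorskip("moto.server")
    server = server_module.ThreadedMotoServer(
        ip_address="127.0.0.1", port=MOTO_PORT
    )
    server.start()
    try:
        # Tests point their S3 client at this URL.
        yield MOTO_ENDPOINT
    finally:
        # Runs once, after the last test in the session has finished.
        server.stop()
```

Any test that takes moto_endpoint as an argument then shares the same running server. The tradeoff is that state such as buckets persists between tests, so tests should use distinct bucket or key names.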

Finally, I’ve found it useful to separate file reading from file validation. One class validates the file for compliance with the H5MD format, and the other performs the heavy IO of loading data into an MDAnalysis Timestep object. This also makes it much easier to experiment with different caching strategies, since there is no coupling between file validation and IO.
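As a rough illustration of that split, the sketch below keeps the structural check and the frame loading in two unrelated classes. The class names, the required keys, and the dict-based cache are all assumptions made for this example. Plain dicts and lists stand in for the zarr group and its datasets.

```python
class H5MDFormatError(ValueError):
    """Raised when the (hypothetical) layout check fails."""


class LayoutValidator:
    """Checks structure only; never loads frame data."""

    REQUIRED = ("step", "value")

    def validate(self, element):
        missing = [name for name in self.REQUIRED if name not in element]
        if missing:
            raise H5MDFormatError(f"element is missing: {', '.join(missing)}")


class FrameReader:
    """Does the IO; assumes the element has already been validated."""

    def __init__(self, element, cache=None):
        self._value = element["value"]
        # The cache is injected, so strategies can change without touching validation.
        self._cache = {} if cache is None else cache

    def read(self, frame):
        if frame not in self._cache:
            self._cache[frame] = self._value[frame]
        return self._cache[frame]


element = {
    "step": [0, 10, 20],
    "value": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
}
LayoutValidator().validate(element)  # validate once, up front
reader = FrameReader(element)        # then read as often as needed
second_frame = reader.read(1)
```

Because FrameReader never re-checks the layout, swapping its cache for a different strategy leaves the validator untouched.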

I’m excited to get started on an H5MD cloud writer and get to the optimization and experimentation phase, where I try to beat current reading and writing speeds!
