Spec: Versioned Resource Tree Structure

Core to the data platform is the requirement to store "resources" (data entities) that are versioned and can express hierarchies.

We implement a "versioned resource tree" data structure that serves as the foundation for the data platform.

Logically, a resource is an entity with a payload attached. This payload can be an arbitrary JSON object, or a reference to a row in another table. A resource can have a parent resource associated with it. This is a "contains" relationship, where the parent resource is the "container" and the child resource is the "contained".

Example: a DICOMStudy resource has a payload describing its location on the PACS and names a Patient resource as its logical parent. The Patient resource has a payload holding, for example, an identifier for the patient and basic demographic data. It is in turn contained in another resource, a clinical trial, and so on.

A resource is versioned. The versioning covers a) the payload contents, b) the parent relation, and c) as a consequence, the children. This allows us to retrieve a historical version of a resource, with its old payload or its old set of children.

Example: A clinical trial consists of 100 patients at time point X. We train a model on this dataset. As time goes by, the trial grows to 110 patients. For reproducibility, however, we may want to recover the state of the trial at time point X, with only the original 100 patients.
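
The trial example can be sketched in a few lines of Python. This is only an illustration under assumptions: the in-memory `store`, the `ResourceVersion` class, the helper names, and the use of "1" and "2" as version strings are hypothetical and not part of the spec.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (identifier, version)


@dataclass(frozen=True)
class ResourceVersion:
    """One immutable version of a resource (hypothetical in-memory model)."""
    identifier: str
    version: str
    payload: dict
    parent: Optional[Key] = None
    children: Tuple[Key, ...] = ()


# Hypothetical store: every version is kept, keyed by (identifier, version).
store: Dict[Key, ResourceVersion] = {}


def put(rv: ResourceVersion) -> None:
    store[(rv.identifier, rv.version)] = rv


def get(identifier: str, version: str) -> ResourceVersion:
    return store[(identifier, version)]


# Version "1" of the trial: 100 patients at time point X.
patients_v1 = tuple((f"patient-{i}", "1") for i in range(100))
put(ResourceVersion("trial-a", "1", {"name": "Trial A"}, children=patients_v1))

# Version "2": ten more patients have been enrolled.
patients_v2 = patients_v1 + tuple((f"patient-{i}", "1") for i in range(100, 110))
put(ResourceVersion("trial-a", "2", {"name": "Trial A"}, children=patients_v2))

# Re-obtaining the original training set only requires the old version key.
training_set = get("trial-a", "1").children
```

Because a new version never overwrites an old one, the set of 100 patients stays reachable after the trial has grown.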

Table Schemas

The resources table defines the identifiers of all logical resources:

| field      | type   |
|------------|--------|
| identifier | string |
| version    | string |

The resource_versions table contains the actual versions of resources along with their payloads and parent-relationships:

| field             | type      | comment                                        |
|-------------------|-----------|------------------------------------------------|
| identifier        | string    | PK (identifier, version)                       |
| version           | string    | PK (identifier, version)                       |
| payload_uid       | string    | non-enforced FK to payload_table               |
| payload_table     | string    | the name of the table that contains the payload |
| parent_identifier | string    | the identifier of the parent resource          |
| parent_version    | string    | the version of the parent resource             |
| children          | JSON      | a list of child resources                      |
| is_deleted        | boolean   | flag whether resource is soft-deleted          |
| created_at        | timestamp | when this version was created                  |
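
As a rough sketch, the resource_versions schema could be expressed in SQLite as follows. The column types, defaults, the `insert_version` helper, and the shape of the children JSON (a list of [identifier, version] pairs) are assumptions made for illustration; the spec does not fix them.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource_versions (
    identifier        TEXT NOT NULL,
    version           TEXT NOT NULL,
    payload_uid       TEXT,               -- non-enforced FK to payload_table
    payload_table     TEXT,
    parent_identifier TEXT,
    parent_version    TEXT,
    children          TEXT NOT NULL DEFAULT '[]',   -- JSON-encoded list
    is_deleted        INTEGER NOT NULL DEFAULT 0,   -- SQLite has no boolean type
    created_at        TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (identifier, version)
);
""")


def insert_version(identifier, version, children):
    """Hypothetical helper: store one version with its children list."""
    conn.execute(
        "INSERT INTO resource_versions (identifier, version, children) "
        "VALUES (?, ?, ?)",
        (identifier, version, json.dumps(children)),
    )


insert_version("trial-a", "1", [["patient-0", "1"]])
insert_version("trial-a", "2", [["patient-0", "1"], ["patient-1", "1"]])

# The composite primary key rejects a second row for the same version.
try:
    insert_version("trial-a", "2", [])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# An old version remains retrievable by its full key.
row = conn.execute(
    "SELECT children FROM resource_versions WHERE identifier = ? AND version = ?",
    ("trial-a", "1"),
).fetchone()
children_v1 = json.loads(row[0])
```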

Payload tables require a uid field to join on later. E.g. payload_patient:

| field | type   | comment                 |
|-------|--------|-------------------------|
| uid   | string | PK                      |
| name  | string | the name of the patient |
| age   | int    | the age of the patient  |
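
Because payload_table names the table to join on, resolving a payload takes two steps. The sketch below shows one way to do it with SQLite; the `load_payload` function, the `ALLOWED_PAYLOAD_TABLES` allow-list, and the sample rows are hypothetical. The allow-list is there because a table name cannot be passed as a bound SQL parameter and must therefore be validated before it is interpolated into the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE payload_patient (uid TEXT PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE resource_versions (
    identifier TEXT, version TEXT, payload_uid TEXT, payload_table TEXT,
    PRIMARY KEY (identifier, version)
);
INSERT INTO payload_patient VALUES ('p-uid-1', 'Jane Doe', 42);
INSERT INTO resource_versions
    VALUES ('patient-0', '1', 'p-uid-1', 'payload_patient');
""")

# Hypothetical allow-list of known payload tables.
ALLOWED_PAYLOAD_TABLES = {"payload_patient"}


def load_payload(identifier, version):
    """Resolve the polymorphic payload reference of one resource version."""
    rv = conn.execute(
        "SELECT payload_uid, payload_table FROM resource_versions "
        "WHERE identifier = ? AND version = ?",
        (identifier, version),
    ).fetchone()
    if rv is None:
        raise KeyError((identifier, version))
    table = rv["payload_table"]
    if table not in ALLOWED_PAYLOAD_TABLES:
        raise ValueError(f"unknown payload table: {table!r}")
    row = conn.execute(
        f"SELECT * FROM {table} WHERE uid = ?", (rv["payload_uid"],)
    ).fetchone()
    # The FK is not enforced, so the payload row may be missing.
    return dict(row) if row is not None else None


payload = load_payload("patient-0", "1")
```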

Operations

We define 5 core requests to interact with the resource storage:

Each of these requests will trigger a version update of all impacted resources. This includes any parent resources!
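
The propagation to parents can be illustrated as follows. This sketch assumes that "any parent resources" means every ancestor up to the root and that versions are integer counters; both are assumptions, as are the `parents` and `versions` dictionaries and the function names.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: identifier -> parent identifier (None for the root).
parents: Dict[str, Optional[str]] = {
    "trial-a": None,
    "patient-0": "trial-a",
    "patient-1": "trial-a",
    "study-0": "patient-0",
}

# Hypothetical current version counter per resource.
versions: Dict[str, int] = {identifier: 1 for identifier in parents}


def impacted_by(identifier: str) -> List[str]:
    """Return the resource itself followed by all of its ancestors."""
    chain = []
    current: Optional[str] = identifier
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


def apply_update(identifier: str) -> List[str]:
    """Bump the version of the updated resource and of every ancestor."""
    bumped = impacted_by(identifier)
    for resource_id in bumped:
        versions[resource_id] += 1
    return bumped


# Updating the study also creates new versions of its patient and the trial.
bumped = apply_update("study-0")
```

Siblings such as patient-1 are not on the path to the root and keep their version.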

When performing many operations at once (such as creating a whole set of new resources), we want to avoid an explosion of version numbers. To address this, we implement an InMemoryTransactionManager, which "previews" the joint impact of a set of operations and increments the version number of each impacted resource only once.
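
A minimal sketch of the idea follows. Only the class name comes from this spec; the methods `stage_create`, `stage_update`, `preview`, and `commit`, the integer version counters, and the dictionary-based storage are hypothetical.

```python
from typing import Dict, List, Optional, Set


class InMemoryTransactionManager:
    """Collects operations and bumps each impacted resource once (sketch)."""

    def __init__(self, parents: Dict[str, Optional[str]], versions: Dict[str, int]):
        self._parents = parents      # identifier -> parent identifier
        self._versions = versions    # identifier -> current version counter
        self._new_parents: Dict[str, Optional[str]] = {}
        self._touched: List[str] = []

    def stage_create(self, identifier: str, parent: Optional[str]) -> None:
        self._new_parents[identifier] = parent
        self._touched.append(identifier)

    def stage_update(self, identifier: str) -> None:
        self._touched.append(identifier)

    def _parent_of(self, identifier: str) -> Optional[str]:
        if identifier in self._new_parents:
            return self._new_parents[identifier]
        return self._parents.get(identifier)

    def preview(self) -> Set[str]:
        """Union of all staged resources and their ancestors."""
        impacted: Set[str] = set()
        for identifier in self._touched:
            current: Optional[str] = identifier
            # Stop early once we reach a resource whose ancestors are covered.
            while current is not None and current not in impacted:
                impacted.add(current)
                current = self._parent_of(current)
        return impacted

    def commit(self) -> Set[str]:
        impacted = self.preview()
        self._parents.update(self._new_parents)
        for resource_id in impacted:
            # New resources start at version 1; existing ones are bumped once.
            self._versions[resource_id] = self._versions.get(resource_id, 0) + 1
        self._new_parents.clear()
        self._touched.clear()
        return impacted


parents = {"trial-a": None}
versions = {"trial-a": 1}
manager = InMemoryTransactionManager(parents, versions)
for i in range(10):
    manager.stage_create(f"patient-{i}", parent="trial-a")
impacted = manager.commit()
```

Creating the ten patients one by one would move the trial from version 1 to version 11; batching them moves it to version 2.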

Assumptions & Constraints

  • We do not store payload data in a simple JSON column, but in separate tables. This makes the foreign-key relationship polymorphic, which is not particularly native to relational databases, but it allows us to enforce a schema on payloads and to migrate data as schemas change. We want to provide "schema guarantees" to users and keep all data accessible at any point in time, so storing every version of a resource in every possible compatible schema is simply infeasible. We are sacrificing perfect reproducibility for simplicity and feasibility.

Considerations

Why not model studies, trials, etc. as "folders"? Because we want to associate the folders with (queryable) data just as much as the resources themselves. It made no sense to differentiate between a "folder node" and a "leaf node".

Future Improvements

  • Move children lists into their own table to deduplicate data. When updating payloads but not any children, we could avoid storing the same list of children again by moving the children lists to a separate table indexed by their hash.
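
One possible shape for this deduplication is a content-addressed lookup, sketched below. The use of SHA-256, the canonical JSON encoding, the assumption that the order of children does not matter, and the names `children_lists` and `store_children` are all illustrative choices, not decisions made in this spec.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical stand-in for a children table indexed by content hash.
children_lists: Dict[str, List[List[str]]] = {}


def store_children(children: List[List[str]]) -> str:
    """Store a children list once and return the hash that references it."""
    # Sort and encode compactly so equal lists always yield the same hash.
    canonical = json.dumps(sorted(children), separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    children_lists.setdefault(digest, json.loads(canonical))
    return digest


kids = [["patient-1", "1"], ["patient-0", "1"]]

# A payload-only update re-uses the same children list and the same hash.
hash_v1 = store_children(kids)
hash_v2 = store_children(list(reversed(kids)))

# Adding a child produces a new list and therefore a new hash.
hash_v3 = store_children(kids + [["patient-2", "1"]])
```

A resource version would then store only the hash instead of the full list.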

  • Figure out how to support hard deletes. A legal requirement of our clinical trial data collection is that patients have the right to request the comprehensive deletion of their data. Right now, however, we only model soft deletes. Hard deletes might be implementable by removing all payloads and renaming the resource identifiers.