Spec: Versioned Resource Tree Structure
Core to the data platform is the requirement to store "resources" (data entities) that are versioned and can express hierarchies.
We implement a "versioned resource tree" data structure that serves as the foundation for the data platform.
Logically, a resource is an entity with a payload attached. This payload can be an arbitrary JSON object, or a reference to a row in another table. A resource can have a parent resource associated with it. This is a "contains" relationship, where the parent resource is the "container" and the child resource is the "contained".
Example: a DICOMStudy resource has a payload describing its location on the PACS, and defines a Patient resource as its logical parent. The Patient resource has a payload containing, for example, an identifier for the patient as well as basic demographic information. It is in turn contained in another resource, a clinical trial, and so on.
A resource is versioned. This versioning covers a) the payload contents, b) the parent relation, and c) as a consequence, the children. It allows us to retrieve a historical version of a resource with an old payload or an old set of children.
Example: A clinical trial consists of 100 patients at time point X. We train a model on this dataset. As time goes by, the trial grows to 110 patients. For reproducibility, however, we might want to re-obtain the state of the trial at time point X, with only the original 100 patients.
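The reproducibility scenario above can be sketched with a small in-memory model. This is only an illustration: the `ResourceVersion` class, the integer version numbers, and the `trial-1` / `patient-N` identifiers are assumptions made for the example and are not defined by this spec.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ResourceVersion:
    """One immutable version of a resource (hypothetical shape)."""
    identifier: str
    version: int
    payload: dict
    parent: Optional[str]
    children: Tuple[str, ...]


def patients(n: int) -> Tuple[str, ...]:
    """Build n hypothetical patient identifiers."""
    return tuple(f"patient-{i}" for i in range(n))


# Every version is kept; growth adds a new version instead of overwriting.
history: Dict[Tuple[str, int], ResourceVersion] = {
    ("trial-1", 1): ResourceVersion("trial-1", 1, {"name": "Trial"}, None, patients(100)),
    ("trial-1", 2): ResourceVersion("trial-1", 2, {"name": "Trial"}, None, patients(110)),
}


def get_version(identifier: str, version: int) -> ResourceVersion:
    """Look up an exact historical version of a resource."""
    return history[(identifier, version)]


at_time_x = get_version("trial-1", 1)  # state the model was trained on
latest = get_version("trial-1", 2)     # state after the trial grew
```

Pinning a training run to `("trial-1", 1)` rather than to "the latest trial" is what makes the original 100-patient dataset recoverable later.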
Table Schemas
The resources table defines the identifiers of all logical resources:
| field | type |
|---|---|
| identifier | string |
| version | string |
The resource_versions table contains the actual versions of resources along with their payloads and parent-relationships:
| field | type | comment |
|---|---|---|
| identifier | string | PK (identifier, version) |
| version | string | PK (identifier, version) |
| payload_uid | string | non-enforced FK to payload_table |
| payload_table | string | the name of the table that contains the payload |
| parent_identifier | string | the identifier of the parent resource |
| parent_version | string | the version of the parent resource |
| children | JSON | a list of child resources |
| is_deleted | boolean | flag whether resource is soft-deleted |
| created_at | timestamp | when this version was created |
Payload tables require a uid field so that resource versions can be joined to them. For example, payload_patient:
| field | type | comment |
|---|---|---|
| uid | string | PK |
| name | string | the name of the patient |
| age | int | the age of the patient |
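As a rough sketch, the resource_versions and payload_patient tables could be expressed in SQLite as follows, together with a lookup that resolves the polymorphic payload reference. The SQLite column types, the `KNOWN_PAYLOAD_TABLES` allow-list, the `load_payload` helper, and the sample row values are assumptions for illustration; the spec does not prescribe a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource_versions (
    identifier        TEXT NOT NULL,
    version           TEXT NOT NULL,
    payload_uid       TEXT,              -- non-enforced FK into payload_table
    payload_table     TEXT,
    parent_identifier TEXT,
    parent_version    TEXT,
    children          TEXT,              -- JSON-encoded list
    is_deleted        INTEGER NOT NULL DEFAULT 0,
    created_at        TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (identifier, version)
);
CREATE TABLE payload_patient (
    uid  TEXT PRIMARY KEY,
    name TEXT,
    age  INTEGER
);
""")

# Table names cannot be bound as SQL parameters, so they are checked
# against an allow-list before being interpolated into the query.
KNOWN_PAYLOAD_TABLES = {"payload_patient"}

conn.execute("INSERT INTO payload_patient VALUES (?, ?, ?)", ("p-uid-1", "Jane Doe", 42))
conn.execute(
    "INSERT INTO resource_versions "
    "(identifier, version, payload_uid, payload_table, children) "
    "VALUES (?, ?, ?, ?, ?)",
    ("patient-1", "1", "p-uid-1", "payload_patient", "[]"),
)


def load_payload(conn: sqlite3.Connection, identifier: str, version: str) -> dict:
    """Resolve the payload row for one resource version."""
    row = conn.execute(
        "SELECT payload_uid, payload_table FROM resource_versions "
        "WHERE identifier = ? AND version = ?",
        (identifier, version),
    ).fetchone()
    if row is None:
        raise KeyError((identifier, version))
    payload_uid, payload_table = row
    if payload_table not in KNOWN_PAYLOAD_TABLES:
        raise ValueError(f"unknown payload table: {payload_table}")
    cur = conn.execute(f"SELECT * FROM {payload_table} WHERE uid = ?", (payload_uid,))
    columns = [col[0] for col in cur.description]
    return dict(zip(columns, cur.fetchone()))


payload = load_payload(conn, "patient-1", "1")
```

Because the database cannot enforce the payload_uid reference, a lookup like this has to handle a missing payload row itself; the sketch omits that check for brevity.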
Operations
We define 5 core requests to interact with the resource storage:
- Creating a new resource: ResourceCreateRequest
- Updating the payload of an existing resource: ResourcePayloadChangeRequest
- Changing the parent of an existing resource: ResourceParentChangeRequest
- Soft-deleting a resource: ResourceDeleteRequest
- Un-deleting a resource: ResourceUndeleteRequest
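One way to picture the five requests is as plain immutable data classes. Only the class names come from this spec; every field shown here (`identifier`, `payload_table`, `payload_uid`, `parent_identifier`, `new_parent_identifier`) is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ResourceCreateRequest:
    identifier: str
    payload_table: str
    payload_uid: str
    parent_identifier: Optional[str] = None  # None for a root resource


@dataclass(frozen=True)
class ResourcePayloadChangeRequest:
    identifier: str
    payload_table: str
    payload_uid: str


@dataclass(frozen=True)
class ResourceParentChangeRequest:
    identifier: str
    new_parent_identifier: Optional[str]


@dataclass(frozen=True)
class ResourceDeleteRequest:
    identifier: str


@dataclass(frozen=True)
class ResourceUndeleteRequest:
    identifier: str


# Convenience alias covering all five request types.
ResourceRequest = Union[
    ResourceCreateRequest,
    ResourcePayloadChangeRequest,
    ResourceParentChangeRequest,
    ResourceDeleteRequest,
    ResourceUndeleteRequest,
]

requests = [
    ResourceCreateRequest("patient-1", "payload_patient", "p-uid-1", parent_identifier="trial-1"),
    ResourceDeleteRequest("patient-1"),
    ResourceUndeleteRequest("patient-1"),
]
```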
Each of these requests will trigger a version update of all impacted resources. This includes any parent resources!
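The propagation to parents can be illustrated with a minimal sketch that walks from the changed resource up to the root and bumps every resource on the way. The `parents` and `versions` dictionaries, the integer version counters, and the helper names are assumptions for the example, not the actual implementation.

```python
from typing import Dict, List, Optional


def impacted_resources(parents: Dict[str, Optional[str]], identifier: str) -> List[str]:
    """Return the resource itself followed by all of its ancestors, root last."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = identifier
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    return chain


def bump_versions(
    versions: Dict[str, int],
    parents: Dict[str, Optional[str]],
    identifier: str,
) -> Dict[str, int]:
    """Return a copy of versions with every impacted resource incremented once."""
    updated = dict(versions)
    for rid in impacted_resources(parents, identifier):
        updated[rid] += 1
    return updated


# Hypothetical tree: trial-1 contains two patients; patient-1 contains study-1.
parents = {"trial-1": None, "patient-1": "trial-1", "patient-2": "trial-1", "study-1": "patient-1"}
versions = {"trial-1": 1, "patient-1": 1, "patient-2": 1, "study-1": 1}

after = bump_versions(versions, parents, "study-1")
# study-1, patient-1 and trial-1 are bumped; the sibling patient-2 is untouched.
```

A parent change would presumably impact both the old and the new ancestor chain; the sketch covers only a single chain.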
When performing many operations at the same time (such as creating a whole set of new resources), we want to avoid an explosion of version numbers.
To address this, we implement an InMemoryTransactionManager, which "previews" the joint impact of a set of operations and increments the version number of each impacted resource only once.
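The batching idea can be sketched as follows: operations only mark resources as impacted, and version numbers are incremented once per resource on commit. `ToyTransactionManager` is a hypothetical stand-in written for this example; its methods, the integer versions, and the identifiers are assumptions and do not describe the real InMemoryTransactionManager.

```python
from typing import Dict, Optional, Set


class ToyTransactionManager:
    """Collects impacted resources across operations; bumps each once on commit."""

    def __init__(self, parents: Dict[str, Optional[str]], versions: Dict[str, int]):
        self.parents = dict(parents)
        self.versions = dict(versions)
        self._dirty: Set[str] = set()

    def _mark_with_ancestors(self, identifier: str) -> None:
        current: Optional[str] = identifier
        while current is not None:
            self._dirty.add(current)
            current = self.parents.get(current)

    def create(self, identifier: str, parent: Optional[str] = None) -> None:
        if identifier in self.versions:
            raise ValueError(f"{identifier} already exists")
        if parent is not None and parent not in self.versions:
            raise ValueError(f"unknown parent {parent}")
        self.parents[identifier] = parent
        self.versions[identifier] = 0  # becomes 1 on commit
        self._mark_with_ancestors(identifier)

    def change_payload(self, identifier: str) -> None:
        if identifier not in self.versions:
            raise ValueError(f"unknown resource {identifier}")
        self._mark_with_ancestors(identifier)

    def commit(self) -> Dict[str, int]:
        """Apply one version increment per impacted resource."""
        for rid in self._dirty:
            self.versions[rid] += 1
        self._dirty.clear()
        return dict(self.versions)


manager = ToyTransactionManager(parents={"trial-1": None}, versions={"trial-1": 1})
for i in range(10):
    manager.create(f"patient-{i}", parent="trial-1")
result = manager.commit()
# trial-1 moves from version 1 to 2, not to 11.
```

Applied one request at a time, the same ten creations would have produced ten new versions of the trial.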
Assumptions & Constraints
- We do not store payload data in a simple JSON column, but in separate payload tables. This makes the foreign-key relationship polymorphic, which relational databases do not support natively, but it allows us to enforce a schema on payloads and to migrate data as schemas change. We want to provide "schema guarantees" to users and to keep all data accessible at any point in time; storing every version of a resource in every possible compatible schema is therefore simply infeasible. We are sacrificing perfect reproducibility for simplicity and feasibility.
Considerations
Why not model studies, trials, etc. as "folders"? Because we want to associate the folders with (queryable) data just as much as the resources themselves. It made no sense to differentiate between a "folder node" and a "leaf node".
Future Improvements
- Move children lists into their own table to deduplicate data. When a payload is updated but the children are not, we could avoid storing the same list of children again by moving the children lists to a separate table indexed by their hash.
- Figure out how to support hard deletes. A legal requirement of our clinical trial data collection is that patients have the right to request the comprehensive deletion of their data. Right now, however, we only model soft deletes. Hard deletes might be implementable by removing all payloads and renaming the resource identifiers.
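The hash-indexed children table from the first improvement could look roughly like this. The sketch assumes that a child is an (identifier, version) pair and that the order of children is irrelevant, so lists are sorted before hashing; the choice of SHA-256 over canonical JSON and the `children_lists` dictionary standing in for the table are also assumptions.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Child = Tuple[str, str]  # (identifier, version), an assumed representation


def children_hash(children: List[Child]) -> str:
    """Hash a children list independently of its ordering."""
    canonical = json.dumps(sorted(children), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stand-in for the separate table: hash -> children list.
children_lists: Dict[str, List[Child]] = {}


def store_children(children: List[Child]) -> str:
    """Store the list once and return the key a resource version would reference."""
    key = children_hash(children)
    children_lists.setdefault(key, sorted(children))
    return key


kids = [("patient-2", "1"), ("patient-1", "1")]
key_v1 = store_children(kids)                  # resource version 1
key_v2 = store_children(list(reversed(kids)))  # version 2: new payload, same children
```

Both resource versions end up referencing the same stored list, so a payload-only update adds no new children data.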