
Modeling Data

The data platform project assumes that all VIRDX-relevant data can be well represented in a tabular format, combined with blob storage for large binary data such as imaging.

The data platform provides a number of tables by defining what are effectively Pydantic schemas. In the following, we will go over each data layer and the modeling decisions made in each.

Source Layer

Data admitted to the source layer does not undergo any processing or structuring. We store the original source files exactly as they arrive, solely to have a system of record for the raw data.

Here, the tables in Postgres are effectively an index and do not contain useful information by themselves. In the GenericFile table, we store the following for each path:

  • its filename and the relative path it came from
  • its hash value
  • the location it was copied to in the blob storage system.
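As a rough sketch of what building such an index record could look like (function and field names here are illustrative assumptions, not the platform's actual schema; the blob layout shown is hypothetical):

```python
import hashlib
from pathlib import Path


def ingest_file(path: Path, source_root: Path, blob_prefix: str) -> dict:
    """Build a GenericFile-style index record for one source file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "filename": path.name,
        "relative_path": str(path.relative_to(source_root)),
        "sha256": digest,
        # Content-addressed blob location: a hypothetical layout choice.
        "blob_location": f"{blob_prefix}/{digest}",
    }
```

Storing the hash both deduplicates identical files and lets us verify later that the blob copy still matches the original.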

As we work with a lot of DICOM images and regularly query them based on their metadata attributes, we introduce a dedicated DICOMFile table that stores everything a GenericFile stores, plus a number of extracted DICOM tags.

That's it. We've run scripts that "upload" files from disk to the data platform, effectively "copying" the data. In itself this is not very valuable, as it is just a copy of the file system. However, this file system is now better queryable and has received stable identifiers, which will become relevant later on.

Integration Layer

The integration layer defines a number of schemas/tables that are our best effort at providing universally compatible schemas for all kinds of data we work with: from patients to MR images to pathologists' assessments and more. The schemas represent our current consensus on how to best represent data in universal structures while preserving as many details as possible. This is a difficult task for which there is no perfect solution, and we may update these schemas as we learn more about our data and our needs.

The goal here is to abstract away from the exact source a piece of data came from. We harmonize data across sources to effectively eliminate any source-specific data wrangling in subsequent processes.

We define models such as Patient, ImagingStudy, PIRADSAssessment and more. Details on available models can be found in the schema reference.
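To give a feel for the shape of such models, here is a minimal sketch using plain dataclasses as a stand-in for the platform's actual Pydantic schemas; all field names are assumptions, and the real definitions live in the schema reference:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative shapes only -- the real schemas differ in field names and detail.

@dataclass
class Patient:
    patient_id: str
    birth_year: Optional[int] = None

@dataclass
class Measurement:
    patient_id: str
    measured_on: date
    kind: str        # e.g. "psa", "prostate_volume"
    value: float
    unit: str

psa = Measurement("pat001", date(2025, 10, 10), "psa", 4.2, "ng/mL")
```

The key point is source independence: a PSA value looks the same here whether it originated from a trial export, a hospital system, or a spreadsheet.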

Application Layer

Data within the application layer is created by transforming integration-level and (if needed) source-level data. Such application-level datasets are not explicitly structured/typed via modeled tables.

Instead, we store the data-generating process/code as an Application, and the output of this process is an Artefact, which can correspond to arbitrary data stored in the blob storage. The tables for applications and artefacts simply hold references and serve as an index and record of the processing and the result.
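A minimal sketch of this index-only design (class and field names are hypothetical; the point is that both tables hold references, never the data itself):

```python
from dataclasses import dataclass

@dataclass
class Application:
    name: str
    code_ref: str          # reference to the generating code, e.g. a commit hash

@dataclass
class Artefact:
    application: str       # name of the producing Application
    blob_location: str     # where the output lives in blob storage

app = Application("segmentation-v1", "deadbeef")
art = Artefact(app.name, "blobs/segmentation-v1/output-001")
```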

More on this can be read in the artefacts how-to.

Deep Dive: the Resource model

Tables across all three data layers are backed by the same fundamental management approach: they are all considered payloads of resources.

What does this mean? As one of the goals for the data platform is to ensure reproducibility and access control for any data, we needed one core modeling backend to build on.

We came up with the concept of a general-purpose Resource - a resource defines the logical entity of any data item. This can range from a clinical datasource (such as "the Bamberg trial") to a measured PSA value ("Patient 43's PSA lab value from January 1st 1900") to an individual DICOM file. Resources themselves hold little information:

  • an identifier: this is a unique name for the resource. To indicate what type a resource is, we follow a type/identifier naming scheme, e.g. datasource/basel, or series/1.2.409813. Indicating hierarchy as part of the identifier is also an option: psa/basel/pat_b341h42/psa-99.
  • a parent: core to the platform data structures is modeling data as trees. Every resource can optionally have a single parent resource. This allows us to model the hierarchy of our data well: a datasource consists of patients, patients have MRI studies as well as PSA measurements associated with them, and an MRI study consists of multiple series and perhaps PI-RADS readings.
  • the payload - this is the key component of a resource. All the aforementioned tables for data across the source, integration, and application layers are actually payload tables. A resource entry points to a payload in one of a number of payload tables. An ImagingStudy payload may hold information on which DICOMFile entries it concerns. A Measurement payload may hold the date, value, unit, and type of a measurement (e.g., PSA, prostate volume, PSA density). A PIRADSAssessment payload may hold the PI-RADS reading provided by a radiologist or VxAnnotate annotator.
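The three components above can be sketched as a single small structure; the class, its fields, and the type-prefix convention shown here are illustrative assumptions, not the platform's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Resource:
    identifier: str          # e.g. "datasource/basel" or "series/1.2.409813"
    parent: Optional[str]    # identifier of the parent resource, if any
    payload_table: str       # which payload table holds the actual data
    payload_uid: str         # row reference within that payload table

    @property
    def type(self) -> str:
        # By convention, the part before the first "/" names the type.
        return self.identifier.split("/", 1)[0]

r = Resource(
    identifier="psa/basel/pat_b341h42/psa-99",
    parent="patient/basel/pat_b341h42",
    payload_table="payload_measurements",
    payload_uid="uid-123",
)
```

Because every table row hangs off such a resource, walking the parent chain reconstructs the data tree from any leaf.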

By building on top of these resources, we can now centrally implement reproducibility and access control by associating this information with the Resource table instead of having to define it for 20+ different payload tables individually.

Versioning (SCD-2)

The platform uses Slowly Changing Dimension Type 2 for versioning.

  • resources table: index of all logical resources.
  • resource_versions table: all versions of all resources, each with valid_from and valid_to timestamps.

Say we store a resource psa/pat001/2025-10-10 that holds information on a patient's PSA value. Then resources features a single row with the identifier psa/pat001/2025-10-10. For each version of this PSA measurement, we have a row in resource_versions that holds the detailed information. If the recorded PSA value needs to be corrected, we add a new row. Importantly, each row in resource_versions has a valid_from and a valid_to field. Say we push a new version at timestamp T: we set the previous row's valid_to=T and create the new row with valid_from=T and valid_to=null.
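This close-then-open bookkeeping can be sketched in a few lines; this is an in-memory stand-in (hypothetical names, integer timestamps), not the platform's actual database logic:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResourceVersion:
    resource_identifier: str
    payload: dict
    valid_from: int
    valid_to: Optional[int] = None   # None means "currently valid"

def push_version(versions: List[ResourceVersion],
                 identifier: str, payload: dict, t: int) -> None:
    """Close the currently open version (if any), then open a new one at t."""
    for v in versions:
        if v.resource_identifier == identifier and v.valid_to is None:
            v.valid_to = t
    versions.append(ResourceVersion(identifier, payload, valid_from=t))
```

Note that old rows are never mutated beyond closing their validity window, which is what preserves full history.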

How does this solve our versioning needs?

  • We want to retrieve all patients that were part of the Bamberg datasource at timepoint T. We can query:
    SELECT * FROM resource_versions
    WHERE
    parent_identifier = 'datasource/bamberg'
    AND payload_table = 'resource_patients'
    AND valid_from <= T
    AND (valid_to IS NULL OR valid_to > T);
  • Any older version of a resource can be obtained:
    SELECT m.psa_value
    FROM resource_versions rv
    JOIN payload_measurements m ON rv.payload_uid = m.uid
    WHERE
    rv.resource_identifier = 'psa/pat001/2025-10-10'
    AND rv.valid_from <= T
    AND (rv.valid_to IS NULL OR rv.valid_to > T);
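The first of these queries can be exercised end-to-end against a minimal stand-in table using sqlite3; the column names follow the query above, while the table contents and everything else are assumed for illustration:

```python
import sqlite3

# Minimal stand-in for resource_versions; the real schema has more columns.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE resource_versions (
        resource_identifier TEXT,
        parent_identifier   TEXT,
        payload_table       TEXT,
        valid_from          INTEGER,
        valid_to            INTEGER    -- NULL means currently valid
    )
""")
con.executemany(
    "INSERT INTO resource_versions VALUES (?, ?, ?, ?, ?)",
    [
        ("patient/bamberg/p1", "datasource/bamberg", "resource_patients", 10, None),
        ("patient/bamberg/p2", "datasource/bamberg", "resource_patients", 10, 20),
        ("patient/basel/p9",   "datasource/basel",   "resource_patients", 10, None),
    ],
)

def bamberg_patients_at(t: int) -> list:
    """All patients that were part of the Bamberg datasource at time t."""
    cur = con.execute("""
        SELECT resource_identifier FROM resource_versions
        WHERE parent_identifier = 'datasource/bamberg'
          AND payload_table = 'resource_patients'
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
    """, (t, t))
    return sorted(row[0] for row in cur)
```

Querying before and after patient p2's version was closed at t=20 shows the time-travel behavior: p2 appears at t=15 but not at t=25.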

Soft Deletes

A resource is soft-deleted by setting the valid_to field on its current version and not creating a new, open-ended row. This effectively excludes the resource from any future queries while preserving historical correctness.
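A minimal sketch of this, using plain dicts with valid_from/valid_to keys as a stand-in for resource_versions rows (names are illustrative):

```python
def soft_delete(versions: list, identifier: str, t: int) -> bool:
    """Close the currently open version of a resource at time t.

    Each version is a dict with at least "resource_identifier", "valid_from"
    and "valid_to" keys (valid_to is None while a version is current).
    Unlike an update, no new open-ended row is created afterwards.
    Returns True if an open version was found and closed.
    """
    for v in versions:
        if v["resource_identifier"] == identifier and v["valid_to"] is None:
            v["valid_to"] = t
            return True
    return False
```

Point-in-time queries before t still see the resource; queries at or after t no longer do.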

Hard deletion is not yet implemented. One option is to remove all relevant rows and make sure that any child resources get assigned new parent resources; alternatively, we could rename identifiers and wipe any payload data.