VxPlatform

New name needed

What

VxPlatform is our data platform for structured data. We store Resources in the platform - a resource can represent anything from a clinical trial, to a measured PSA value, to an acquired DICOM image series.

Resources themselves hold little information:

  • an identifier: this is a unique name for the resource. To indicate a resource's type, we follow a type/identifier naming scheme, e.g. trial/basel, or series/1.2.409813. Indicating hierarchy as part of the identifier is also an option: psa/basel/pat_b341h42/psa-99.
  • a parent: modeling data as trees is core to the platform's data structures. Every resource can optionally have a single parent resource. This lets us model the hierarchy of our data well: a trial consists of patients, patients have associated MRI studies as well as PSA measurements, and an MRI study consists of multiple series and possibly PI-RADS readings.
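
As an illustration of the identifier and parent fields described above, here is a minimal Python sketch. It is not the platform's actual data model: the `Resource` class, the `ancestors` helper, and the `patient/basel/pat_b341h42` identifier are hypothetical and exist only to show how a single optional parent yields a tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a platform resource (not the real class)."""
    identifier: str                # follows the type/identifier scheme
    parent: Optional[str] = None   # identifier of the single parent, if any

    @property
    def type(self) -> str:
        # The type is the first segment of the identifier, e.g. "trial".
        return self.identifier.split("/", 1)[0]


def ancestors(resource: Resource, index: Dict[str, Resource]) -> List[str]:
    """Walk parent links up to the root; return identifiers, nearest first."""
    chain = []
    current = resource
    while current.parent is not None:
        current = index[current.parent]
        chain.append(current.identifier)
    return chain


# Example tree: trial -> patient -> PSA measurement (patient id is made up).
resources = [
    Resource("trial/basel"),
    Resource("patient/basel/pat_b341h42", parent="trial/basel"),
    Resource("psa/basel/pat_b341h42/psa-99", parent="patient/basel/pat_b341h42"),
]
index = {r.identifier: r for r in resources}
psa = index["psa/basel/pat_b341h42/psa-99"]
```

Here `psa.type` evaluates to `"psa"`, and `ancestors(psa, index)` walks from the patient up to the trial.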

Key to a resource is its payload. Payloads hold the actual information associated with a resource and must follow one of the pre-defined schemas. Schemas can be added and modified in general - they just require explicit migration.

A DICOMStudy payload may hold information about where the DICOMs are located - this could be a path on disk or a URL to an Orthanc instance. A PSAMeasurement payload may hold the date, value, and unit of a PSA measurement. A VXAScoringAnnotation payload may hold the PI-RADS reading provided by a VxAnnotate annotator.

Payloads can also define JSON columns, which allow arbitrary unstructured data to be stored in the table.
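
To make the idea of a schema-bound payload with an optional JSON column concrete, here is a rough Python sketch. The field names (`measured_on`, `value`, `unit`, `extra`), the `SCHEMAS` registry, and `parse_payload` are assumptions for illustration; the platform's real schema definitions may look quite different.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict


@dataclass
class PSAMeasurement:
    """Hypothetical payload schema: date, value, and unit of a PSA measurement."""
    measured_on: date
    value: float
    unit: str
    # Stand-in for a JSON column: arbitrary unstructured data.
    extra: Dict[str, Any] = field(default_factory=dict)


# Hypothetical registry of the pre-defined schemas.
SCHEMAS = {"PSAMeasurement": PSAMeasurement}


def parse_payload(schema_name: str, raw: Dict[str, Any]):
    """Build a payload object, rejecting data that does not fit the schema."""
    schema = SCHEMAS[schema_name]   # KeyError for an unknown schema name
    return schema(**raw)            # TypeError on missing or unexpected fields


payload = parse_payload(
    "PSAMeasurement",
    {"measured_on": date(2025, 1, 1), "value": 4.2, "unit": "ng/mL"},
)
```

Note that this sketch only checks field names, not field types; a real implementation would likely validate values as well.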

Exploring data

Visit 192.168.10.102:2700 while connected to the VPN to access the web frontend.

Using the CLI

Install the client library:

pixi global install vxp_client

Point the client at the desired data platform API, which is served on port 2701:

vxp config --url http://192.168.10.102:2701/

You can now use the CLI to explore available datasets:

vxp datasets list
> ...

# this will download the latest version of the viseg-train dataset to the specified folder
vxp datasets download -n viseg-train -o ./dataset
> ...

The download command writes a resources.jsonl file to the specified directory. This file contains all resources along with their payloads.
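
If you want to inspect the downloaded file from Python, a generic JSON Lines reader is enough. The sketch below assumes only that each non-empty line of resources.jsonl is one JSON object; it makes no assumptions about the field names inside each record, and `read_resources` is a hypothetical helper, not part of vxp_client.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_resources(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Usage (path taken from the download example above):
# resources = read_resources("./dataset/resources.jsonl")
# print(len(resources))
```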

However, we are usually interested in aggregated .csv files or entire DICOM files. For this, we provide materializers: they take resources and their payloads as inputs and perform the relevant processing.

For example, run the bids materializer to 1) load the DICOM files from the provided references, 2) convert the DICOMs into Volumes, and 3) structure the output according to the BIDS standard. Combine it with the clinicalcsv materializer to obtain .csv files containing all clinical data:

vxp materialize -m bids -m clinicalcsv -i ./dataset
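
Conceptually, a materializer is a function from resources and payloads to output files. The following sketch shows that idea for a CSV output; it is not the clinicalcsv implementation, and the record layout (`id`, `payload`) and the `materialize_csv` name are assumptions made for illustration.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


def materialize_csv(resources: Iterable[Dict[str, Any]], columns: List[str]) -> str:
    """Flatten each resource's payload into one CSV row (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", *columns],
        extrasaction="ignore",   # drop payload keys that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for resource in resources:
        # Missing columns are written as empty cells (DictWriter's default).
        writer.writerow({"id": resource["id"], **resource["payload"]})
    return buffer.getvalue()


rows = [
    {"id": "psa/1", "payload": {"value": 4.2, "unit": "ng/mL", "note": "x"}},
    {"id": "psa/2", "payload": {"value": 6.0}},
]
csv_text = materialize_csv(rows, ["value", "unit"])
```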

What are Datasets?

A key feature of VxPlatform is strict versioning of resources. This matters because we want reproducibility when training and evaluating ML models.

To support this, a dataset is not simply a list of resource IDs along with their version tags. Instead, a dataset is a Python script that defines the logic by which the dataset is compiled. Our datasets are rarely a fixed, static set of resources; they are a logical selection over our growing body of data (e.g. "any T2 image from a patient with no csPCa and biopsy-results present"). Dataset scripts are defined and version-controlled in the data-platform repo.
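
To illustrate what "a dataset is a selection rule, not a list" means, here is a self-contained sketch of the example rule quoted above. It does not use the real dataset script API, which is not shown on this page: the `select` function, the record layout, and the payload keys (`biopsy_available`, `cspca`, `sequence`, `patient`) are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def select(resources: Iterable[Dict[str, Any]]) -> List[str]:
    """Select T2 series whose patient has biopsy results and no csPCa."""
    resources = list(resources)
    # Step 1: patients that satisfy the clinical criteria.
    eligible_patients = {
        r["id"]
        for r in resources
        if r["type"] == "patient"
        and r["payload"].get("biopsy_available") is True
        and r["payload"].get("cspca") is False
    }
    # Step 2: T2 series belonging to those patients.
    return sorted(
        r["id"]
        for r in resources
        if r["type"] == "series"
        and r["payload"].get("sequence") == "T2"
        and r["payload"].get("patient") in eligible_patients
    )


sample = [
    {"id": "patient/a", "type": "patient",
     "payload": {"biopsy_available": True, "cspca": False}},
    {"id": "patient/b", "type": "patient",
     "payload": {"biopsy_available": True, "cspca": True}},
    {"id": "series/1", "type": "series",
     "payload": {"sequence": "T2", "patient": "patient/a"}},
    {"id": "series/2", "type": "series",
     "payload": {"sequence": "T2", "patient": "patient/b"}},
    {"id": "series/3", "type": "series",
     "payload": {"sequence": "ADC", "patient": "patient/a"}},
]
selected = select(sample)
```

On this sample only `series/1` is selected. As new patients and series are added, re-running the same rule picks them up automatically, which a static ID list would not.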

VxPlatform "locks" a dataset by executing its script against the current state of our data warehouse. The resulting "lockfile" contains the IDs and version numbers of all resources that the Python script selected.
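
The locking step can be pictured as pinning each selected ID to the version it has at compile time. The sketch below is an assumption about the idea, not the actual lockfile format: the `lock` function, the JSON layout, and the integer version numbers are hypothetical.

```python
import json
from typing import Dict, Iterable


def lock(selected_ids: Iterable[str], versions: Dict[str, int]) -> str:
    """Pin each selected resource to its current version (hypothetical format)."""
    entries = [
        {"id": resource_id, "version": versions[resource_id]}
        for resource_id in sorted(selected_ids)
    ]
    return json.dumps(entries, indent=2)


# Current state of the (toy) warehouse: resource id -> latest version.
versions = {"series/1": 3, "series/2": 1}
lockfile = lock(["series/1"], versions)

# Later updates to the warehouse do not affect the existing lockfile.
versions["series/1"] = 4
```

Because the lockfile is a snapshot, training runs that reference it keep seeing version 3 of `series/1` even after the resource is updated.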

# if you want to (re-)compile an existing, centrally defined dataset
vxp compile -n viseg-train
> Compiled and saved as `viseg-train-2025-10-01-19-42`

# if you want to download a specific lockfile
vxp datasets download -n viseg-train -l viseg-train-2025-10-01-19-42

# for local dev purposes, you might want to define your own dataset script
# this command produces a resources.jsonl in the specified directory
vxp compile -f ./my_custom_dataset.py -o ./