
Integrating datasets

VxData provides database tables with a useful format for representing clinical trials and research data. Our job is to integrate existing, messy datasets into the clean and unified structure vxData expects.

In this how-to guide, we outline how to think about and implement scripts that integrate an existing dataset into vxData once. The guide does not cover scripts that run continuously, e.g. to keep data up to date with an external source of truth.

This guide only touches on a few of the data models implemented in vxData. Consult the payload definitions at packages/vxdata-schemas/src/vxdata.schemas/payloads.py for more details on the available data models.

Step 1: Prepare the source dataset

Make sure the dataset to be integrated is stored on disk in its most original form. It might make sense to e.g. unzip things, but there should be basically no other processing steps applied to the dataset on disk.

Ideally, store this data in /mnt/storage/projects/vxdata/<dataset-name>.

Step 2: Write the script to upload this data in vxData format

Given this source dataset, we now want a single script that reads from this path and adds the appropriate entries to the vxData tables. This script should be idempotent and should not require any parametrization beyond the target API endpoint.

In general, scripts will set up a connection to the desired vxData deployment and submit typed *Create schemas to the API — flat Pydantic models that carry both the resource-level fields (identifier, parent_identifier, …) and the payload fields in one object.

In our context, scripts are usually run once. We have found the following structure effective:

import argparse

from vxdata.sdk import Client
from vxdata.schemas.create import PatientCreate  # import the *Create types you use

HARDCODED_PATH_TO_DATA = "/mnt/storage/..."

def main(client: Client):
    # high-level orchestration: collect the resources, then submit them
    patients = find_patients()
    client.patients.create(patients)

def find_patients() -> list[PatientCreate]:
    ...

if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument("--api", type=str, required=True)
    args = argparser.parse_args()

    client = Client(args.api)
    main(client)

The top-level main should provide an easy-to-understand overview of the basic orchestration workflow. The details are, however, entirely up to you as the person integrating the dataset. You may prefer a strongly dataframe-oriented approach, a "per patient loop" perspective, a "per target resource" perspective, or something else.

We will now go through a few of the commonly performed steps in such scripts.

Understanding resources in vxData

VxData is centered around the notion of a resource: a single entity with a stable identifier. Each resource may have at most one parent resource: a clearly defined, hierarchically superior other resource. For example, a captured PSA value resource may have the corresponding patient resource as its parent, and an extracted NIfTI volume may have the MRI acquisition session as its parent. This enforces a natural hierarchy and grouping of resources.

Note that there may be scenarios in which it is not entirely clear what exactly is the best parent to set. This ambiguity should be flagged and discussed with the team.

Each resource may be associated with a payload: a model of additional "metadata". The payload is where the most interesting data is captured: details on the patient, the MRI study, or the actual lab values are stored in these tables.

Resource identifiers (aka their names) follow a standardized format across resources on vxData: <resource-type>/<resource-name>. For example, a datasource is named datasource/essen01, a patient is named patient/BB12345, and a PSA value may be named psa/BB12345/2019-09-12. It is oftentimes convenient to have the resource name be hierarchical as in the PSA example. From the resource identifier alone, it should be very clear what the resource contains.
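The naming convention above can be sketched with two small helpers. This is a minimal illustration, not part of the vxData SDK: make_identifier and split_identifier are hypothetical names, and the validation rules are assumptions based only on the <resource-type>/<resource-name> format described here.

```python
def make_identifier(resource_type: str, *name_parts: str) -> str:
    """Join a resource type and one or more name parts into `<type>/<name>`."""
    if not resource_type or "/" in resource_type:
        raise ValueError(f"invalid resource type: {resource_type!r}")
    if not name_parts or any(not part for part in name_parts):
        raise ValueError("at least one non-empty name part is required")
    return "/".join((resource_type, *name_parts))


def split_identifier(identifier: str) -> tuple[str, str]:
    """Split `<type>/<name>` at the first slash; the name may itself contain slashes."""
    resource_type, sep, name = identifier.partition("/")
    if not sep or not resource_type or not name:
        raise ValueError(f"not a valid resource identifier: {identifier!r}")
    return resource_type, name


psa_id = make_identifier("psa", "BB12345", "2019-09-12")
print(psa_id)                    # psa/BB12345/2019-09-12
print(split_identifier(psa_id))  # ('psa', 'BB12345/2019-09-12')
```

Building identifiers through one helper keeps hierarchical names consistent across a script and makes the script easier to re-run with identical output.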

Resources contain the following fields in general:

  • identifier
  • parent_identifier
  • license: should be set if we are ingesting a public dataset. For all internal data, use "VIRDX".
  • access_level: set to dev by default - we'll make more use of this in the future.
  • derived_from: list[str]: a field to allow you to record some form of lineage. E.g. an annotation may declare that it was derived from an AnnotationSession resource.
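Before submitting a batch, it can help to check that every parent_identifier points at something that exists. The following is a rough sketch under stated assumptions: missing_parents is a hypothetical helper, resources are reduced to (identifier, parent_identifier) pairs, and the study/... identifier is made up for the example.

```python
from typing import Optional


def missing_parents(
    resources: list[tuple[str, Optional[str]]],
    known: frozenset[str] = frozenset(),
) -> dict[str, str]:
    """Map identifier -> parent_identifier for parents found neither in the batch nor in `known`."""
    present = {identifier for identifier, _ in resources} | set(known)
    return {
        identifier: parent
        for identifier, parent in resources
        if parent is not None and parent not in present
    }


batch = [
    ("psa/BB12345/2019-09-12", "patient/BB12345"),
    ("volume/BB12345/001/t2", "study/BB12345/001"),  # hypothetical study identifier
]
# `known` stands for identifiers that already exist in the target deployment
print(missing_parents(batch, known=frozenset({"patient/BB12345"})))
```

A non-empty result signals either a typo in an identifier or a parent that still needs to be created earlier in the script.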

Creating the datasource

The highest level of object modeled in vxData is the datasource: a representation of the overall trial or dataset some data is coming from. Existing datasources are for example: Basel, Essen01, PICAI, ProstateX, ...

from vxdata.schemas.create import DataSourceCreate

# run this once in the beginning of your script
client.datasources.create(
    DataSourceCreate(
        identifier="datasource/my-funny-dataset",
        parent_identifier=None,  # OK for datasources
        description="short funny title",  # not really interesting for datasources
        shorthand="FD",  # used in patient prefixes
    )
)

Modeling patients

Datasets are typically hierarchically structured and have some notion of a single patient. Patient IDs/names can be very chaotic and inconsistent across datasets. We therefore standardize to a new, vxData-specific patient name, while keeping a reference to the patient name used in the original source data.

In the course of your script, while iterating over the patients you determine, include the following code:

# .register() allocates and creates the patient; raises if one already exists.
# For idempotent scripts, catch the error and fall back to .resolve() instead.
new_patient_id = client.patients.register(
    external_uid="<the name the dataset uses>",  # e.g. ProstateX001, W842sl4_f3PquL, ...
    datasource_id="datasource/my-funny-dataset",
)
# new_patient_id is of the form `patient/<shorthand><five-digit-number>`, e.g. `patient/FD12345`.

The call returns the vxData-provided patient identifier. Use this resource identifier in all subsequent resources, e.g. as the parent identifier, or use its name part (e.g. FD12345) within other resource identifiers.
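The register-then-resolve fallback mentioned in the code comment could look roughly like the sketch below. Several things here are assumptions: the exception raised for an existing patient is not named in this guide, so it is passed in as a parameter; .resolve() is assumed to accept the same keyword arguments as .register(); and FakePatients is an in-memory stand-in used only to make the example runnable.

```python
from typing import Any


def register_or_resolve(
    patients_api: Any,
    external_uid: str,
    datasource_id: str,
    already_exists_error: type[Exception],
) -> str:
    """Return the vxData patient identifier, creating the patient only if needed."""
    try:
        return patients_api.register(external_uid=external_uid, datasource_id=datasource_id)
    except already_exists_error:
        # Assumption: .resolve() takes the same keyword arguments as .register().
        return patients_api.resolve(external_uid=external_uid, datasource_id=datasource_id)


class FakePatients:
    """In-memory stand-in for client.patients, for illustration only."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, str], str] = {}

    def register(self, external_uid: str, datasource_id: str) -> str:
        key = (datasource_id, external_uid)
        if key in self._ids:
            raise KeyError(key)  # stands in for the real "already exists" error
        self._ids[key] = f"patient/FD{len(self._ids) + 1:05d}"
        return self._ids[key]

    def resolve(self, external_uid: str, datasource_id: str) -> str:
        return self._ids[(datasource_id, external_uid)]


fake = FakePatients()
first = register_or_resolve(fake, "ProstateX001", "datasource/my-funny-dataset", KeyError)
second = register_or_resolve(fake, "ProstateX001", "datasource/my-funny-dataset", KeyError)
print(first, second)  # both calls yield the same identifier
```

In a real script, pass client.patients and the SDK's actual exception class instead of the fake and KeyError.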

Patients have a default_split field indicating the split they are assigned to. We adhere to this data split in all our experiments to ensure no data leakage. The field itself is populated by a central data splitting script, so during ingestion, you should most likely leave this field set to None.

Modeling studies

For MRI-containing datasets, we model individual radiologist visits as ImagingStudy resources.

For example, you should include the identifier of the original imaging study. This may be the DICOM UID by which you have grouped series into a study, or the acquisition date of the study if the DICOM UID is not consistent with what we would consider a proper imaging session.

Modeling image data

So far, we have only considered adding table entries to vxData. Now, we turn to uploading the NIfTI and other files to the S3-compatible storage backend.

So far, we've mostly been using the Volume payload to model a converted NIfTI. Most of its fields match what we used in vicom, but that may change in the future. This payload makes it easy to retrieve images based on e.g. series descriptions, scanner manufacturers, or acquisition parameters.

The typical flow for integrating imaging data is: prepare the files on local disk (e.g. converted DICOM → NIfTI + JSON sidecar), hand them to client.storage.upload(...) (which will handle the upload), and use the returned S3 URLs as path_nii / path_json on a VolumeCreate that is parented to the ImagingStudy.

from pathlib import Path
from vxdata.schemas.create import VolumeCreate

# one entry per volume we want to register
local_files = {
    "t2": {"nii": Path("/tmp/pat01/t2.nii.gz"), "json": Path("/tmp/pat01/t2.json")},
    "adc": {"nii": Path("/tmp/pat01/adc.nii.gz"), "json": Path("/tmp/pat01/adc.json")},
}

# uploads all files under a shared prefix and mirrors the dict shape
s3_urls = client.storage.upload(local_files, group="volumes")
# s3_urls == {"t2": {"nii": "s3://.../t2.nii.gz", "json": "s3://.../t2.json"},
#             "adc": {"nii": "s3://.../adc.nii.gz", "json": "s3://.../adc.json"}}

resources = [
    VolumeCreate(
        identifier=f"volume/BB12345/001/{name}",
        parent_identifier=study_id,  # the ImagingStudy identifier
        volume_uid=f"sub-BB12345_ses-001_{name}",
        path_nii=paths["nii"],
        path_json=paths["json"],
        volume_type=...,
        b_value=...,
        # TODO: populate metadata based on the converted JSON sidecar
    )
    for name, paths in s3_urls.items()
]
client.volumes.create(resources)

When uploading larger datasets, make sure to consider batching and failure-recovery/-cleanup.
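One possible shape for batching with simple failure recovery is sketched below. It is a generic illustration, not an SDK feature: batched and submit_in_batches are hypothetical helpers, the batch size and retry counts are arbitrary, and the broad except clause should be narrowed to the errors the client actually raises.

```python
import time
from collections.abc import Callable, Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def submit_in_batches(
    items: Sequence[T],
    submit: Callable[[list[T]], object],
    size: int = 100,
    retries: int = 3,
    backoff_s: float = 1.0,
) -> list[int]:
    """Submit items batch by batch; return the indices of batches that kept failing."""
    failed: list[int] = []
    for index, batch in enumerate(batched(items, size)):
        for attempt in range(1, retries + 1):
            try:
                submit(list(batch))
                break
            except Exception:  # narrow this to the client's real error types
                if attempt == retries:
                    failed.append(index)  # record for later cleanup or a re-run
                else:
                    time.sleep(backoff_s * attempt)  # linear backoff before retrying
    return failed
```

A call such as submit_in_batches(resources, client.volumes.create) would then return the batches that need attention. Note that a retried batch may have been partially written on the failed attempt, which is one more reason to keep the script idempotent.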

Modeling histopathology

Pathology is modeled as two resource types: a PathologySpecimen (the tissue) and one or more PathologyAssessment children that describe findings on that tissue. This split keeps "what tissue was taken" cleanly separated from "what was found in it", and allows for multiple independent assessments (e.g. different pathologists, re-reads, derived summary findings) of the same tissue.

The type field on PathologySpecimen captures the procedure (BIOPSY, RPE, RESECTION, EXCISION, OTHER_OR_UNKNOWN). Datasets often contain a diverse mix of clinical reports with differing levels of detail. VxData supports both easy querying and full detail by providing two flags for PathologySpecimen entities:

  • The is_summary flag indicates whether an entity is a high-level, aggregate summary or an individual component of a pathological workup. For example, a biopsy report states aggregate results: a systematic biopsy was performed and, overall, the biopsy is graded with Gleason score X. The pathology specimen entry created for this information has is_summary=True, and a pathology assessment entry with the Gleason scores is created that references the specimen. Additionally, the report may provide per-core results: individual cores have is_summary=False and are represented by additional PathologySpecimen entries, alongside PathologyAssessment entries if available. An RPE workup may contain a similar hierarchy of resources.
  • As a pathological workup may consist of multiple batches of biopsy cores, we also model which specimen is the single "root" pathology specimen. For each patient with pathology data available, there should be a single specimen entry with is_event_root=True. This entry should not be a child of another specimen, but the absolute root of the pathology information. This makes querying for a patient's "global" pathology status very easy downstream.
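The "one root specimen per patient" rule lends itself to a quick sanity check before submitting. The sketch below is an assumption-laden illustration: SpecimenStub is a hypothetical stand-in for the real create schema, and because a specimen's parent may be another specimen, the owning patient is tracked in a separate patient_id field here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpecimenStub:
    """Hypothetical stand-in carrying only the fields this check needs."""
    identifier: str
    patient_id: str
    is_event_root: bool


def patients_without_single_root(specimens: list[SpecimenStub]) -> dict[str, int]:
    """Map patient id -> number of root specimens, for patients where that number is not 1."""
    roots: Counter[str] = Counter()
    for specimen in specimens:
        # adding 0 still registers the patient, so patients with no root are reported too
        roots[specimen.patient_id] += int(specimen.is_event_root)
    return {patient_id: count for patient_id, count in roots.items() if count != 1}


specimens = [
    SpecimenStub("specimen/FD00001/biopsy-summary/1", "patient/FD00001", True),
    SpecimenStub("specimen/FD00001/biopsy-core/1", "patient/FD00001", False),
    SpecimenStub("specimen/FD00002/biopsy-core/1", "patient/FD00002", False),
]
print(patients_without_single_root(specimens))  # {'patient/FD00002': 0}
```

An empty result means every patient with pathology data has exactly one root specimen.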

Location is modeled as a set of orthogonal, optional enum fields — side (left/right/bilateral/midline), level (apex → base), zone (PZ/TZ/CZ/AS), ap_position, laterality, and tissue_location — each filled in only to the granularity the source actually reports. Anything the source encodes that does not cleanly map to one of these enums (free-text region strings, procedure notes, site-specific codes) goes into the catch-all other_data: dict field rather than being forced into a shape it does not fit.
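As a small illustration of "fill in only what the source reports", the sketch below derives a side value from a free-text region string and always keeps the raw text for other_data. The helper name, the keyword matching, and the lowercase enum spellings are assumptions; check the actual enum definitions in the payload schemas before using anything like this.

```python
from typing import Optional

# Assumed spellings of the `side` enum values; verify against the payload schemas.
SIDE_KEYWORDS = {"left", "right", "bilateral", "midline"}


def map_side(raw_region: str) -> tuple[Optional[str], dict[str, str]]:
    """Return (side or None, other_data) for a free-text region description."""
    tokens = raw_region.lower().replace(",", " ").split()
    matches = {token for token in tokens if token in SIDE_KEYWORDS}
    # Only commit to a side when the text is unambiguous; otherwise leave it unset.
    side = next(iter(matches)) if len(matches) == 1 else None
    return side, {"region": raw_region}


print(map_side("Left apex, lateral"))  # ('left', {'region': 'Left apex, lateral'})
print(map_side("left and right base"))  # (None, {'region': 'left and right base'})
```

Keeping the raw string in other_data means an ambiguous or unmapped description is never lost, even when the enum field stays empty.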

from vxdata.schemas.create import PathologySpecimenCreate, PathologyAssessmentCreate

# `resources`, `sh`, `new_patient_id`, `biopsy_date`, and `cores` are defined earlier in the script
# a biopsy event: one summary specimen + N core specimens, each with its own assessment
summary_sid = f"specimen/{sh}/biopsy-summary/1"
resources.append(PathologySpecimenCreate(
    identifier=summary_sid,
    parent_identifier=new_patient_id,
    type="BIOPSY",
    is_summary=True,
    is_event_root=True,
    date=biopsy_date,
    total_cores=12,
    is_systematic=True,
))
resources.append(PathologyAssessmentCreate(
    identifier=f"assessment/{sh}/biopsy-summary/1",
    parent_identifier=summary_sid,
    affected_cores=3, cancer_type="tumor", gleason_primary=4, gleason_secondary=3,
))

for idx, core in enumerate(cores, 1):
    core_sid = f"specimen/{sh}/biopsy-core/{idx}"
    resources.append(PathologySpecimenCreate(
        identifier=core_sid,
        parent_identifier=summary_sid,  # cores hang off the summary
        type="BIOPSY",
        is_summary=False,
        is_event_root=False,
        side=core.side,
        level=core.level,
        zone=core.zone,
        other_data={"region": core.raw_region_text},  # dump anything unmodeled here
    ))
    resources.append(PathologyAssessmentCreate(
        identifier=f"assessment/{sh}/biopsy-core/{idx}",
        parent_identifier=core_sid,
        gleason_primary=core.gleason_primary,
        gleason_secondary=core.gleason_secondary,
    ))

See apps/vxdata-jobs/src/worker/f_20260410_essen01/04_extract_clinical.py for a worked, production-scale example covering biopsy summaries, individual cores, and RPE specimens from one source dataset.

Modeling masks and maps

The way PIRADS lesions, biopsy results, and RPE results are reported is extremely heterogeneous. For now, our modeling is strongly application-focused: for MRI use cases, we store VoxelMap entries, each consisting of a NIfTI file associated with a "task". Tasks cover, for example, anatomy segmentation, zone segmentation, PIRADS, or ISUP.

Each voxel in a VoxelMap holds a task-specific value. E.g. an integer value 2 in a voxel may represent a PIRADS 1 annotation provided by an annotator at some point. For now, we do not model e.g. PIRADS lesions or pixel-precise biopsy results at a finer granularity than the VoxelMap.

Concretely, a VoxelMap payload carries three things: a path_nii pointing at the NIfTI on S3, a task label identifying what the voxel values mean (current vocabulary: isup, cspca, pca, pirads, lesion, anatomy, prostate, pz, tz), and a reference_mri field that holds the Volume identifier this mask was drawn against. The integer → semantic mapping of voxel values is recorded in the free-form labels: dict field (e.g. {"0": "background", "1": "prostate"}), so consumers never need to guess what a value means.
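Since the labels dict is meant to cover every value in the mask, a small check can catch gaps before upload. This is a minimal sketch with a hypothetical helper name; it works on any iterable of voxel values, so how the values are read from the NIfTI (for example, the unique values of the loaded array) is left to the script.

```python
from collections.abc import Iterable


def unlabeled_values(voxel_values: Iterable[float], labels: dict[str, str]) -> set[int]:
    """Return the integer voxel values that have no entry in `labels`."""
    present = {int(value) for value in voxel_values}
    labeled = {int(key) for key in labels}  # keys are strings such as "0", "1"
    return present - labeled


labels = {"0": "background", "1": "prostate"}
print(unlabeled_values([0, 1, 1, 2], labels))  # {2}
```

A non-empty result means the mask contains values that consumers would have to guess at, so either the labels dict or the mask needs fixing.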

A note on the parent of a VoxelMap: we currently do not have a best-practice parent for VoxelMaps. They may be parented to the ImagingStudy, the reference MRI volume, or the AnnotationSession they were created from. Here, the parent does not need to encode a critical piece of information, as we can recover everything relevant from the reference_mri or derived_from fields.

Typical ingestion looks like: collect all NIfTI files for one batch into a temp directory, upload them in one storage.upload call, and then create one VoxelMap resource per file pointing at the returned S3 path:

import shutil
import tempfile
from pathlib import Path
from vxdata.schemas.create import VoxelMapCreate

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "voxelmaps"
    out.mkdir()
    for mask_path, vm_id, *_ in batch:
        shutil.copy(mask_path, out / (vm_id.replace("/", "_") + ".nii.gz"))

    # upload while the temporary directory still exists
    storage_root = client.storage.upload(out, group="voxelmaps").rstrip("/")

client.voxel_maps.create([
    VoxelMapCreate(
        identifier=vm_id,
        parent_identifier=study_id,
        path_nii=f"{storage_root}/{vm_id.replace('/', '_')}.nii.gz",
        task="prostate",
        reference_mri=ref_volume_id,
        labels={"0": "background", "1": "prostate"},
    )
    for mask_path, vm_id, study_id, ref_volume_id, _ in batch
])

Step 3: Apply the script

Make sure your script accepts a destination API endpoint to run against.

You must never point an untested, unreviewed data transformation/integration script at the production vxData deployment. Instead, spin up a (local) dev deployment and inject backup data if needed. See README.md for more information on this.

Once the script is well-tested and behaving according to your expectations, request a review on your PR. Make sure to include 1) what data the source dataset holds, 2) how this is mapped into vxData schemas, 3) summary stats or screenshots of data visible. Update the Data Main Doc on Notion with notes on the dataset.

Once the PR is merged onto data-platform@main, we will run it against the production database.

Appendix: Integrating large datasets

When trying to integrate large amounts of data (hundreds of GB, terabytes) by uploading it to S3, it becomes cumbersome to run client.storage.upload all the time.

For this setting, the MinIO CLI client (mc) provides a convenient mirror command that offers high-performance syncing from a disk path to S3. See apps/vxdata-jobs/src/worker/source_2026_01_28_ingest_files/mirror.py for a pattern: we define a directory on disk that has a known and desired structure, then run mc mirror to upload the exact same data into S3. Parsing, and therefore resource payload creation, can run at high throughput by accessing the files on disk. As the file structure both on disk and on S3 is known (we specify it explicitly), we can then adapt the S3 path references in the resource payloads ourselves.
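The path bookkeeping for this pattern can be sketched as follows. This is not the content of mirror.py: the helper names, the alias, bucket, and prefix values are all hypothetical, and the assumption that mc mirror reproduces the directory's relative layout under the target prefix should be verified against a small test upload.

```python
from pathlib import Path, PurePosixPath


def mirror_command(local_root: Path, alias: str, bucket: str, prefix: str) -> list[str]:
    """Build the `mc mirror SOURCE TARGET` invocation for a local directory."""
    return ["mc", "mirror", str(local_root), f"{alias}/{bucket}/{prefix}"]


def s3_url_for(local_file: Path, local_root: Path, bucket: str, prefix: str) -> str:
    """Derive the S3 URL a mirrored file ends up at, from its path relative to the root."""
    relative = local_file.relative_to(local_root)
    key = PurePosixPath(prefix) / PurePosixPath(relative.as_posix())
    return f"s3://{bucket}/{key}"


root = Path("/data/my-funny-dataset")  # hypothetical staging directory
print(mirror_command(root, "myminio", "vxdata", "volumes/my-funny-dataset"))
print(s3_url_for(root / "pat01" / "t2.nii.gz", root, "vxdata", "volumes/my-funny-dataset"))
# s3://vxdata/volumes/my-funny-dataset/pat01/t2.nii.gz
```

The command list can be handed to subprocess.run(..., check=True) once, while s3_url_for is called per file when filling path_nii or path_json in the resource payloads.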