Data Ingestion
The apps/worker project contains functionality to upload existing data from the hard disks to the database.
Currently, we support the following trials / data sources:
- Bamberg: clinical values
- Basel: clinical values, references to DICOMs
- PI-CAI: clinical values, references to .mha files
- Dasa: clinical values
- VxAnnotate: all annotation data
Technical Setup
To provide a better overview of the available data ingestion and data processing routines, as well as their past executions, we manage their execution centrally via Prefect.
Why? A growing list of Python scripts performing different parts of data ingestion became increasingly messy. In addition, it was often unclear what had already been run, and by whom. The Prefect setup aims to address these issues.
There are several ways to execute these scripts:
- In local dev: spin up the worker container. You can then use the configured Prefect server (either via the web UI or via its CLI) to trigger the execution of a workflow.
- In local dev: in apps/worker, a regular pixi env is still available. Simply run pixi run python flows/<workflow>.py. Prefect will spin up a temporary Prefect server for this, which is fine for local runs (a minimal flow sketch follows this list).
- In prod: use the web frontend or configure the CLI appropriately.
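For orientation, a file under flows/ is an ordinary Prefect flow. The sketch below is a hypothetical minimal example; the flow name, tasks, and parameters are placeholders, not an actual workflow from this repo.

from prefect import flow, task

@task
def load_records(path: str) -> list[dict]:
    # Placeholder: read the prepared files from `path` and return one dict per record.
    return []

@task
def upload_records(records: list[dict], api_url: str) -> None:
    # Placeholder: push the records to the target deployment via the client library.
    print(f"would upload {len(records)} records to {api_url}")

@flow(name="ingest-example-data")
def ingest_example_data(path: str, api_url: str) -> None:
    records = load_records(path)
    upload_records(records, api_url)

if __name__ == "__main__":
    # Running the file directly starts the temporary local Prefect API mentioned above.
    ingest_example_data(path="/mnt/storage/<dataset>", api_url="<API URL of the target deployment>")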
Background: the Prefect Setup
A self-hosted instance of Prefect server needs to be running.
The worker application in this monorepo provides a Dockerfile and Python files that are spun up as a Prefect Worker.
Once the container is running, it registers itself with the Prefect server as a workflow execution environment.
From the web UI, we can pick a data workflow, configure it, and execute it.
The worker mounts our shared storage disks and communicates with VxPlatform using the standard client library.
The worker container is launched in detached mode by the deployment scripts - this can happen both locally and in production.
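Concretely, registering means the container runs a Prefect worker process that connects to the server's API and polls a work pool for scheduled flow runs. A rough sketch of that step (the server URL and pool name are placeholders, not necessarily what our Dockerfile uses):

PREFECT_API_URL="https://<prefect-server>/api" prefect worker start --pool "<work-pool-name>"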
Running Ingestion Scripts
Option A: from the UI
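In the Prefect web UI, the ingestion workflows appear under Deployments; select the one you need, start a run, and set its parameters in the run dialog.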
Option B: from the CLI
From any environment where the Prefect CLI is configured to point at the target Prefect server, you can run, for example:
prefect deployment run 'ingest-vxannotate-data'
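If your CLI is not yet pointed at the right server, it can be configured first (the URL is a placeholder for the Prefect server of the target environment):

prefect config set PREFECT_API_URL="https://<prefect-server>/api"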
Dataset Details
Bamberg Trial Data
We can upload clinical data, DICOM references, and Volume references separately.
Each script ensures that the minimal set of other required resources is created alongside the uploaded data (e.g. the clinical-data upload creates ImagingStudy resources to properly organize PI-RADS reports).
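As an illustration, that behaviour boils down to a get-or-create step before the actual upload. The sketch below uses purely hypothetical endpoints and payloads; the actual scripts go through the platform client library.

import requests

def ensure_imaging_study(api_url: str, patient_id: str, study_uid: str) -> dict:
    """Return the ImagingStudy for this study UID, creating it first if it does not exist yet."""
    # Hypothetical endpoint and payload -- the real platform API differs.
    existing = requests.get(f"{api_url}/imaging-studies", params={"study_uid": study_uid}).json()
    if existing:
        return existing[0]
    created = requests.post(
        f"{api_url}/imaging-studies",
        json={"patient_id": patient_id, "study_uid": study_uid},
    )
    return created.json()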
Uploading clinical data
Clinical data was exported from legacy virdx-platform projects as CSV files and prepared in the /mnt/storage/projects/vxplatform/bamberg-clinical/data_cleaned directory.
In the worker pixi project, we can then run the following to upload the information:
pixi run python ingest-bamberg/clinicals.py \
--path /mnt/storage/projects/vxplatform/bamberg-clinical/data_cleaned \
--url <API URL of the target deployment>
Uploading DICOM references
TODO
pixi run python ingest-bamberg/dicoms.py \
--path /mnt/storage/clinical_trials/bamberg/dicoms \
--url <API URL of the target deployment>
Uploading Volume references
pixi run python ingest-bamberg/volumes.py \
--path /mnt/storage/clinical_trials/bamberg/train \
--url <API URL of the target deployment>
Basel Trial Data
TODO
VxAnnotate Data
We run regular exports of VxAnnotate data to the platform.
pixi run python ingest-vxannotate/main.py \
--vxa-url <API URL of VxAnnotate instance> \
--project <project ID in VxAnnotate> \
--auth <CF authorization cookie needed for API access> \
--url <API URL of the target deployment>
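When triggering the same export through the Prefect deployment instead of the standalone script, the script's arguments become deployment parameters. A sketch (the parameter names are assumptions derived from the CLI flags above and may not match the actual deployment's parameter names):

prefect deployment run 'ingest-vxannotate-data' \
--param vxa_url="<API URL of VxAnnotate instance>" \
--param project="<project ID in VxAnnotate>" \
--param auth="<CF authorization cookie>" \
--param url="<API URL of the target deployment>"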
PI-CAI Data
The data lives in /mnt/storage/data/mri/public_datasets/picai and is organized into the following subdirectories (a small path-pairing sketch follows the list):
- images_processed: for each case, three .nii.gz files: <case-id>_adc.nii.gz, <case-id>_hbv.nii.gz, <case-id>_t2w.nii.gz
- labels_processed: for each case, a single .nii.gz file: <case-id>.nii.gz
- viseg_masks: for each case, a volume (.nii.gz + .json) named anatomy_<case-id>.nii.gz/.json
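To make the layout concrete, here is a small sketch that pairs up the files belonging to one case. The directory structure follows the listing above; the helper itself is only an illustration and not part of any ingestion script.

from pathlib import Path

PICAI_ROOT = Path("/mnt/storage/data/mri/public_datasets/picai")

def collect_case(case_id: str) -> dict:
    """Gather all files belonging to one PI-CAI case from the three subdirectories."""
    images = PICAI_ROOT / "images_processed"
    return {
        "adc": images / f"{case_id}_adc.nii.gz",
        "hbv": images / f"{case_id}_hbv.nii.gz",
        "t2w": images / f"{case_id}_t2w.nii.gz",
        "label": PICAI_ROOT / "labels_processed" / f"{case_id}.nii.gz",
        "anatomy_mask": PICAI_ROOT / "viseg_masks" / f"anatomy_{case_id}.nii.gz",
        "anatomy_meta": PICAI_ROOT / "viseg_masks" / f"anatomy_{case_id}.json",
    }

if __name__ == "__main__":
    # Example: derive case IDs from the label files and report any missing companion files.
    for label_file in sorted((PICAI_ROOT / "labels_processed").glob("*.nii.gz")):
        case_id = label_file.name.removesuffix(".nii.gz")
        missing = [k for k, p in collect_case(case_id).items() if not p.exists()]
        if missing:
            print(f"{case_id}: missing {missing}")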