Data Ingestion
The apps/worker project contains functionality to upload existing data from the hard disks to the database.
Currently, we support the following trials / data sources:
- Bamberg: clinical values
- Basel: clinical values, references to DICOMs
- PI-CAI: clinical values, references to .mha files
- Dasa: clinical values
- VxAnnotate: all annotation data
Technical Setup
To provide a better overview of the available data ingestion and data processing routines, as well as their past executions, we manage their execution centrally via Prefect.
Why? A growing list of Python scripts performing different parts of data ingestion became increasingly messy. In addition, it was often unclear what had already been run, and by whom. The Prefect setup aims to address these issues.
There are several ways to execute these scripts:
- In local dev: spin up the worker container. You can then use the configured Prefect server (either via the web UI or via its CLI) to trigger the execution of a workflow.
- In local dev: in apps/worker, a regular pixi env is still available. Simply run pixi run python flows/<workflow>.py. Prefect will spin up a temporary Prefect server for this, which is fine for local runs (a minimal flow sketch follows this list).
- In prod: use the web frontend or configure the CLI appropriately.
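For orientation, a file under flows/ is an ordinary Prefect flow. The sketch below is a hypothetical minimal example; the flow name, tasks, and parameters are placeholders, not an actual workflow from this repo.

from prefect import flow, task

@task
def load_records(path: str) -> list[dict]:
    # Placeholder: read the prepared files from `path` and return one dict per record.
    return []

@task
def upload_records(records: list[dict], api_url: str) -> None:
    # Placeholder: push the records to the target deployment via the client library.
    print(f"would upload {len(records)} records to {api_url}")

@flow(name="ingest-example-data")
def ingest_example_data(path: str, api_url: str) -> None:
    records = load_records(path)
    upload_records(records, api_url)

if __name__ == "__main__":
    # Running the file directly starts the temporary local Prefect API mentioned above.
    ingest_example_data(path="/mnt/storage/<dataset>", api_url="<API URL of the target deployment>")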
Background: the Prefect Setup
A self-hosted instance of Prefect server needs to be running.
The worker application in this monorepo provides a Dockerfile and Python files that are spun up as a Prefect Worker.
Once the container is running, it registers itself with the Prefect server as a workflow execution environment.
From the web UI, we can pick a data workflow, configure it, and execute it.
The worker mounts our shared storage disks and communicates with VxPlatform using the standard client library.
The worker container is launched in detached mode by the deployment scripts - this can happen both locally and in production.
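Concretely, registering means the container runs a Prefect worker process that connects to the server's API and polls a work pool for scheduled flow runs. A rough sketch of that step (the server URL and pool name are placeholders, not necessarily what our Dockerfile uses):

PREFECT_API_URL="https://<prefect-server>/api" prefect worker start --pool "<work-pool-name>"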
Running Ingestion Scripts
Option A: from the UI
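In the Prefect web UI, the ingestion workflows appear under Deployments; select the one you need, start a run, and set its parameters in the run dialog.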
Option B: from the CLI
From any environment where the Prefect CLI is configured to point at the target Prefect server, you can run, for example:
prefect deployment run 'ingest-vxannotate-data'
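If your CLI is not yet pointed at the right server, it can be configured first (the URL is a placeholder for the Prefect server of the target environment):

prefect config set PREFECT_API_URL="https://<prefect-server>/api"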
Dataset Details
Bamberg Trial Data
We can upload clinical data, DICOM references, and Volume references separately.
Each script ensures that the minimal set of other required resources is created alongside the uploaded data (e.g. the clinical-data upload creates ImagingStudy resources to properly organize PI-RADS reports).
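As an illustration, that behaviour boils down to a get-or-create step before the actual upload. The sketch below uses purely hypothetical endpoints and payloads; the actual scripts go through the platform client library.

import requests

def ensure_imaging_study(api_url: str, patient_id: str, study_uid: str) -> dict:
    """Return the ImagingStudy for this study UID, creating it first if it does not exist yet."""
    # Hypothetical endpoint and payload -- the real platform API differs.
    existing = requests.get(f"{api_url}/imaging-studies", params={"study_uid": study_uid}).json()
    if existing:
        return existing[0]
    created = requests.post(
        f"{api_url}/imaging-studies",
        json={"patient_id": patient_id, "study_uid": study_uid},
    )
    return created.json()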
Uploading clinical data
Clinical data was exported from legacy virdx-platform projects as CSV files and prepared in the /mnt/storage/projects/vxplatform/bamberg-clinical/data_cleaned directory.
In the worker pixi project, we can then run the following to upload the information:
pixi run python ingest-bamberg/clinicals.py \
--path /mnt/storage/projects/vxplatform/bamberg-clinical/data_cleaned \
--url <API URL of the target deployment>
Uploading DICOM references
TODO
pixi run python ingest-bamberg/dicoms.py \
--path /mnt/storage/clinical_trials/bamberg/dicoms \
--url <API URL of the target deployment>
Uploading Volume references
pixi run python ingest-bamberg/volumes.py \
--path /mnt/storage/clinical_trials/bamberg/train \
--url <API URL of the target deployment>
Basel Trial Data
TODO
VxAnnotate Data
We run regular exports of VxAnnotate data to the platform.
pixi run python ingest-vxannotate/main.py \
--vxa-url <API URL of VxAnnotate instance> \
--project <project ID in VxAnnotate> \
--auth <CF authorization cookie needed for API access> \
--url <API URL of the target deployment>
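When triggering the same export through the Prefect deployment instead of the standalone script, the script's arguments become deployment parameters. A sketch (the parameter names are assumptions derived from the CLI flags above and may not match the actual deployment's parameter names):

prefect deployment run 'ingest-vxannotate-data' \
--param vxa_url="<API URL of VxAnnotate instance>" \
--param project="<project ID in VxAnnotate>" \
--param auth="<CF authorization cookie>" \
--param url="<API URL of the target deployment>"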
PI-CAI Data
The data lives in /mnt/storage/data/mri/public_datasets/picai and is organized into the following subdirectories (a small path-pairing sketch follows the list):
- images_processed: for each case, three .nii.gz files: <case-id>_adc.nii.gz, <case-id>_hbv.nii.gz, <case-id>_t2w.nii.gz
- labels_processed: for each case, a single .nii.gz file: <case-id>.nii.gz
- viseg_masks: for each case, a volume (.nii.gz + .json) named anatomy_<case-id>.nii.gz/.json
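To make the layout concrete, here is a small sketch that pairs up the files belonging to one case. The directory structure follows the listing above; the helper itself is only an illustration and not part of any ingestion script.

from pathlib import Path

PICAI_ROOT = Path("/mnt/storage/data/mri/public_datasets/picai")

def collect_case(case_id: str) -> dict:
    """Gather all files belonging to one PI-CAI case from the three subdirectories."""
    images = PICAI_ROOT / "images_processed"
    return {
        "adc": images / f"{case_id}_adc.nii.gz",
        "hbv": images / f"{case_id}_hbv.nii.gz",
        "t2w": images / f"{case_id}_t2w.nii.gz",
        "label": PICAI_ROOT / "labels_processed" / f"{case_id}.nii.gz",
        "anatomy_mask": PICAI_ROOT / "viseg_masks" / f"anatomy_{case_id}.nii.gz",
        "anatomy_meta": PICAI_ROOT / "viseg_masks" / f"anatomy_{case_id}.json",
    }

if __name__ == "__main__":
    # Example: derive case IDs from the label files and report any missing companion files.
    for label_file in sorted((PICAI_ROOT / "labels_processed").glob("*.nii.gz")):
        case_id = label_file.name.removesuffix(".nii.gz")
        missing = [k for k, p in collect_case(case_id).items() if not p.exists()]
        if missing:
            print(f"{case_id}: missing {missing}")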