
Including source data in the data platform

We are moving towards three tiers of data in the data platform:

  • Source data: raw, immutable data as obtained from external sources.
  • Integration data: normalized representations of patients, studies, pathology and radiology assessments, etc.
  • Application data: use-case specific representations of data, highly customized to the use case.

Currently, the data platform contains only a mixture of integration data (unified modeling of patients, studies, measurements, assessments) and application data (vicom Volumes, an opinionated conversion of source DICOMs). To move towards including source data, this PR adds ingestion scripts that load raw data from disk and store it in the data platform.

This includes:

  • Dasa Data
    • All DICOM files
    • The clinical information Excel file
  • Basel Data
    • All DICOM files
    • The clinical information Excel file
  • Bamberg Data
    • All DICOM files
    • All clinical documents provided
  • ProstateX
    • All DICOM files
    • All biopsy result and clinical information CSV files
    • All rcuocolo lesion masks
  • PRUS
    • All DICOM files (both MRI and Ultrasound)
    • All biopsy location information
    • TODO
  • PICAI
    • All .mha files
    • The file mapping PICAI cases to ProstateX
    • All clinical value files
    • All segmentation masks, both anatomical and lesion

To support these uploads, we introduce two new tables/payloads:

  • GenericFile
  • DICOMFile

Moreover, this PR will update the existing ingestion scripts to use the platform-stored files instead of loading them from disk.
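
A minimal sketch of what reading a platform-stored file could involve on the ingestion side. It assumes that `s3_url` has the form `s3://<bucket>/<key>` (the format is not specified in this PR) and that downloaded bytes are checked against the stored `sha256`. The helper names `split_s3_url` and `verify_sha256` are hypothetical, and the actual download call is left out.

```python
import hashlib
from urllib.parse import urlparse


def split_s3_url(s3_url: str) -> tuple[str, str]:
    """Split an s3://bucket/key URL into (bucket, key).

    Assumption: s3_url uses the s3:// scheme; adjust if the platform stores
    HTTPS or presigned URLs instead.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def verify_sha256(payload: bytes, expected_sha256: str) -> bool:
    """Return True if the payload hashes to the checksum stored on the record."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256.lower()
```

An ingestion script would call `split_s3_url(record.s3_url)`, fetch the object with whatever S3 client the platform uses, and reject the payload if `verify_sha256` returns `False`.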

Implementation Plan

1: Set up schemas

class FileGroup:
    identifier: str
    datasource_id: str
    relative_path: str

class GenericFile:
    identifier: str
    sha256: str
    metadata: dict | None
    datasource_id: str
    relative_path: str
    filename: str
    s3_url: str

class DICOMFile:
    identifier: str
    sha256: str
    metadata: dict | None
    datasource_id: str
    relative_path: str
    filename: str
    s3_url: str

    # === SOP Common ===
    SOPClassUID: str
    SOPInstanceUID: str

    # === Patient ===
    PatientID: str
    IssuerOfPatientID: str | None
    PatientName: str | None
    PatientBirthDate: str | None

    # === Study ===
    StudyInstanceUID: str
    AccessionNumber: str | None
    StudyDate: str | None
    StudyTime: str | None
    StudyDescription: str | None
    StudyID: str | None
    ReferringPhysicianName: str | None

    # === Series ===
    SeriesInstanceUID: str
    Modality: str | None
    SeriesNumber: str | None
    SeriesDescription: str | None
    SeriesDate: str | None
    SeriesTime: str | None
    ProtocolName: str | None
    BodyPartExamined: str | None

    # === Frame of Reference ===
    FrameOfReferenceUID: str | None

    # === Equipment ===
    Manufacturer: str | None
    ManufacturerModelName: str | None
    InstitutionName: str | None
    InstitutionalDepartmentName: str | None
    SoftwareVersions: str | None

    # === General Image ===
    InstanceNumber: str | None
    ImageType: str | None
    ContentDate: str | None
    ContentTime: str | None
    ImageComments: str | None
    BurnedInAnnotation: str | None

    # === Image Plane ===
    ImagePositionPatient: str | None
    ImageOrientationPatient: str | None
    SliceThickness: str | None
    SpacingBetweenSlices: str | None
    PixelSpacing: str | None

    # === Image Pixel ===
    Rows: str | None
    Columns: str | None
    BitsAllocated: str | None
    BitsStored: str | None
    HighBit: str | None
    PixelRepresentation: str | None
    PhotometricInterpretation: str | None
    SamplesPerPixel: str | None
    NumberOfFrames: str | None

    # === MR Image ===
    ScanningSequence: str | None
    SequenceName: str | None
    SequenceVariant: str | None
    RepetitionTime: str | None
    EchoTime: str | None
    InversionTime: str | None
    FlipAngle: str | None
    MagneticFieldStrength: str | None
    EchoTrainLength: str | None
    NumberOfAverages: str | None

    # === Diffusion ===
    DiffusionBValue: str | None

    # === Acquisition ===
    AcquisitionUID: str | None
    AcquisitionDate: str | None
    AcquisitionTime: str | None
    AcquisitionDateTime: str | None
    AcquisitionDuration: str | None
    ImagesInAcquisition: str | None

    # === Counts (query-level) ===
    ModalitiesInStudy: str | None
    NumberOfStudyRelatedSeries: str | None
    NumberOfStudyRelatedInstances: str | None
    NumberOfSeriesRelatedInstances: str | None
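
As an illustration of how an ingestion script might fill the disk-derivable `GenericFile` fields, here is a minimal sketch using only the standard library. Several points are assumptions rather than part of this PR: the `GenericFileRecord` class and the `sha256_of_file`, `describe_file`, and `scan_directory` helpers are hypothetical names, the identifier scheme `file/<datasource>/<path>` is modeled on the example identifiers in this description, and `relative_path` is taken to mean the directory part of the path without the filename. `metadata` and `s3_url` are omitted because they depend on the upload step.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class GenericFileRecord:
    """Hypothetical subset of GenericFile that can be derived from disk alone."""
    identifier: str
    sha256: str
    datasource_id: str
    relative_path: str
    filename: str


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large DICOM or .mha files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(root: Path, path: Path, datasource_name: str) -> GenericFileRecord:
    """Build a record for one file located below root."""
    relative = path.relative_to(root)
    parent = relative.parent.as_posix()
    return GenericFileRecord(
        # Assumed identifier scheme, modeled on "file/bamberg/001/BB001.pdf".
        identifier=f"file/{datasource_name}/{relative.as_posix()}",
        sha256=sha256_of_file(path),
        datasource_id=f"datasource/{datasource_name}",
        # Assumption: relative_path is the directory part, "" for top-level files.
        relative_path="" if parent == "." else parent,
        filename=path.name,
    )


def scan_directory(root: Path, datasource_name: str) -> list[GenericFileRecord]:
    """Describe every regular file below root, in a stable sorted order."""
    return [
        describe_file(root, path, datasource_name)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]
```

Storing the checksum at ingestion time lets later consumers detect both corrupted transfers and duplicate files across transfers.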

With these schemas, the resulting resource structure looks, for example, like this:

Datasource datasource/bamberg
FileGroup group/bamberg/raw_data
FileGroup group/bamberg/raw_data/transfer_01
GenericFile file/bamberg/001/BB001.pdf
GenericFile file/bamberg/001/123456.dcm
GenericFile file/bamberg/001/789.dcm
GenericFile file/bamberg/001/XYZ.dcm
...
Datasource datasource/vxannotate
FileGroup group/vxa/proj_123/case_456/
GenericFile file/vxa/proj_123/case_456/mask_01.nii.gz
GenericFile file/vxa/proj_123/case_456/mask_01_w_patho.nii.gz
GenericFile file/vxa/proj_123/case_456/assessment.json
GenericFile file/vxa/proj_123/case_456/assessment_w_patho.json

(The hierarchy is defined via parent relationships.)
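
To make the parent relationships concrete, the following sketch rebuilds an indented tree from flat resources that each point at their parent. The `Resource` class, its `parent` field, and `render_tree` are hypothetical names used for illustration only, and the nesting chosen in the usage example (files below `transfer_01`) is an assumption, since the listing above does not state which group each file belongs to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical flat resource: a kind, an identifier, and an optional parent identifier."""
    kind: str
    identifier: str
    parent: Optional[str] = None


def render_tree(resources: list[Resource]) -> list[str]:
    """Return one indented line per resource, children listed below their parent."""
    children: dict[Optional[str], list[Resource]] = {}
    for resource in resources:
        children.setdefault(resource.parent, []).append(resource)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for resource in children.get(parent_id, []):
            lines.append("  " * depth + f"{resource.kind} {resource.identifier}")
            walk(resource.identifier, depth + 1)

    walk(None, 0)  # roots are the resources without a parent
    return lines


example = [
    Resource("Datasource", "datasource/bamberg"),
    Resource("FileGroup", "group/bamberg/raw_data", parent="datasource/bamberg"),
    Resource("FileGroup", "group/bamberg/raw_data/transfer_01", parent="group/bamberg/raw_data"),
    # Assumed parent: the source listing does not say which group holds this file.
    Resource("GenericFile", "file/bamberg/001/BB001.pdf", parent="group/bamberg/raw_data/transfer_01"),
]
print("\n".join(render_tree(example)))
```

Keeping the hierarchy in parent pointers rather than in the identifiers means a file can be regrouped without changing its identifier.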