
VIRDX data platform

The data platform is our central application for storing the ground truth of VIRDX data.

How-tos: achieving goals as a newcomer

  • Getting started — frontend, installing the Python client, first connection
  • Accessing data — querying tables, downloading and uploading files
  • Artefacts — running applications and retrieving versioned outputs

Tutorials: learning how to accomplish things

Explainers: learning things in depth

  • Database structure — data layers, the Resource model, SCD-2 versioning
  • Deployment — docker-compose and Kubernetes deployments, backups

Reference: looking things up

Motivation & Overview

The primary goals of the data platform are to

  1. ingest all incoming data sources, whether proprietary, public, or acquired, regardless of data modality;
  2. provide a unified representation of the data that addresses the heterogeneous nature of individual datasets;
  3. make all data easily accessible to research and applications.

Secondary goals for the data platform include:

  • Enabling reproducibility for experimentation.
  • Enforcing consistent train/val/test splits across experiments.
  • Providing mechanisms for access control and auditability by serving as the central gateway for all data access.
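As an illustration of what consistent splits can mean in practice, one common technique is to derive the split from a stable hash of an identifier, so that the same patient always lands in the same split regardless of who runs the experiment. This is a minimal sketch only: the function name `assign_split`, the `salt` value, and the 80/10/10 fractions are assumptions made for the example, and the data platform may enforce splits differently.

```python
import hashlib


def assign_split(patient_id, salt="split-v1", val_fraction=0.1, test_fraction=0.1):
    """Hypothetical helper: map an identifier to 'train', 'val', or 'test'."""
    # Hash the salted identifier so the result depends only on its inputs,
    # not on row order, random seeds, or the machine running the code.
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes into a number between 0 and 1.
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_fraction:
        return "test"
    if u < test_fraction + val_fraction:
        return "val"
    return "train"


# Repeated calls with the same identifier always agree.
print(assign_split("patient-001") == assign_split("patient-001"))
```

Splitting by patient identifier rather than by individual record keeps all data of one patient in the same split, which avoids leakage between training and evaluation data.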

How is the data platform set up?

The data platform is a central service deployed in our infrastructure. You use a web client or the Python client to view, query, and download data from this central deployment. The client talks to the data platform API, which in turn coordinates a Postgres table store and a MinIO-based blob storage system. The data platform effectively acts as a CRUD (Create, Read, Update, Delete) layer for the data stored in Postgres and MinIO. Because much of the data lives in Postgres tables, most of what you interact with can be represented as DataFrames or CSVs. Data that cannot be represented in tables, such as MRI data, pathology data, and PDFs, is stored in the blob storage.
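The division of labour between table storage and blob storage can be sketched with in-memory stand-ins. Everything in this example is hypothetical: `PlatformSketch`, its methods, the `scans` table, and the idea that a table row stores a key pointing at its blob are assumptions for illustration, not the real API or schema. SQLite stands in for Postgres and a dict stands in for MinIO.

```python
import sqlite3


class PlatformSketch:
    """Hypothetical CRUD facade; NOT the real data platform API or client."""

    def __init__(self):
        # Assumption: in-memory SQLite stands in for Postgres,
        # and a plain dict stands in for the MinIO blob storage.
        self.tables = sqlite3.connect(":memory:")
        self.tables.execute(
            "CREATE TABLE scans (scan_id TEXT PRIMARY KEY, modality TEXT, blob_key TEXT)"
        )
        self.blobs = {}

    def create_scan(self, scan_id, modality, payload):
        # Tabular metadata goes to the table store, the binary payload to blob storage.
        blob_key = f"scans/{scan_id}"
        self.blobs[blob_key] = payload
        self.tables.execute(
            "INSERT INTO scans VALUES (?, ?, ?)", (scan_id, modality, blob_key)
        )

    def read_scan(self, scan_id):
        row = self.tables.execute(
            "SELECT modality, blob_key FROM scans WHERE scan_id = ?", (scan_id,)
        ).fetchone()
        if row is None:
            return None
        modality, blob_key = row
        return {"scan_id": scan_id, "modality": modality, "payload": self.blobs[blob_key]}

    def update_modality(self, scan_id, modality):
        self.tables.execute(
            "UPDATE scans SET modality = ? WHERE scan_id = ?", (modality, scan_id)
        )

    def delete_scan(self, scan_id):
        # Remove both the table row and the blob it points to.
        self.blobs.pop(f"scans/{scan_id}", None)
        self.tables.execute("DELETE FROM scans WHERE scan_id = ?", (scan_id,))


platform = PlatformSketch()
platform.create_scan("s1", "MRI", b"\x00\x01")
platform.update_modality("s1", "pathology")
print(platform.read_scan("s1"))
```

The point of the sketch is that callers only ever see one interface, while the layer behind it decides which storage backend holds which part of a record.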

What data is stored in the data platform?

The three primary goals above are reflected in the three tiers of data available in the data platform:

  1. Raw, immutable source data: the data that we acquire by running clinical trials, downloading public datasets, or buying data from external partners. We store this data in its original form, unmodified. This is the data you're used to seeing in the shared NFS folders at /mnt/storage. We call this type of data the "source-level data".
  2. Normalized data: by transforming raw data into consistent schemas and formats, we abstract away the details and complexities of the individual, highly heterogeneous data sources. This makes it much easier to answer questions such as how many patients with a given disease status are available to our research. Because this data is structured consistently, we can move it into a SQL table store, which allows for rapid querying and transformation. Blob data such as imaging data remains in the blob storage (the S3 bucket). We call this type of data the "integration-level data".
  3. Application-specific data: while the normalized data provides consistency, it is not necessarily ergonomic to work with in specific projects. To meet the modeling demands of individual projects, the data platform allows for workflows that transform source-level and integration-level data into application-specific forms. This process may involve heavy filtering and transformation operations. We call this type of data the "application-level data".
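The relationship between the three tiers can be illustrated with a toy example. The datasets, field names, and functions below are invented for this sketch and do not reflect the platform's actual schemas or workflows; they only show heterogeneous source records being mapped onto one schema and then filtered into a project-specific form.

```python
# Source-level: two hypothetical datasets with different field names and units.
source_a = [{"pid": "A1", "age_years": 64, "dx": "positive"}]
source_b = [{"patient": "B7", "age_months": 720, "diagnosis": "NEG"}]


def normalize_a(rec):
    # Map dataset A onto the shared (assumed) integration-level schema.
    return {
        "patient_id": rec["pid"],
        "age_years": rec["age_years"],
        "diagnosis_positive": rec["dx"] == "positive",
    }


def normalize_b(rec):
    # Dataset B uses months and upper-case codes, so convert both.
    return {
        "patient_id": rec["patient"],
        "age_years": rec["age_months"] // 12,
        "diagnosis_positive": rec["diagnosis"] == "POS",
    }


# Integration-level: one consistent schema across all sources.
integration = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]


def application_view(rows, min_age):
    # Application-level: filtered and reshaped for one specific project.
    return [
        {"patient_id": r["patient_id"], "label": int(r["diagnosis_positive"])}
        for r in rows
        if r["age_years"] >= min_age
    ]


print(application_view(integration, min_age=62))
```

Note that the source records are never modified: each tier is derived from the one before it, which keeps the raw data immutable.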

Unless you are working on a project with tailored application-level data, you should almost always be interacting with integration-level data. For information on how and why data is modeled, refer to the explainer on database structure.

What is currently out-of-scope for the data platform?

  • The data platform is not a storage location for quick dumps of experiment results. It is currently not designed for users constantly uploading data themselves.

Getting Started

See the getting started guide.