VxData
VxData is the central data system for the data ecosystem at VirDx.
These docs are structured as follows:
- Guides -- for learning and getting started
- Getting Started: first steps with vxData.
- Reading Data: retrieving individual resources, querying large amounts of data, loading files.
- Writing Data: creating new resources, updating existing resources.
- Artifacts: the philosophy behind artifacts, a model for experiment workflows.
- Integrating Datasets: mental models around integrating new datasets.
- Developing: how to spin up local dev deployments for safely working with vxData and its data.
- Explainers -- for a deep dive into topics
- Data Modeling: an explainer on SQL database internals.
- Deployment: how the system is deployed and how it interacts with other systems.
- Reference -- for looking things up
- API Surface: the API endpoints provided by the data platform.
- SDK Surface: the SDK interfaces exposed.
- VxData 1.0 Migration Guide: an overview of the changes introduced in vxData version 1.0.
Motivation & Overview
The primary goals of the data platform are to:
- ingest all incoming data sources, whether proprietary, public, or acquired, regardless of data modality.
- provide a unified representation of the data, addressing the heterogeneous nature of individual datasets.
- make all data easily accessible to research and applications.
Secondary goals for the data platform include:
- Enabling reproducibility for experimentation.
- Enforcing consistent train/val/test splits across experiments.
- Providing mechanisms for access control & auditability by becoming the central gateway for any data access.
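Consistent splits are commonly achieved by deriving the split from a stable identifier instead of from random shuffling. The sketch below shows one such approach; it is not necessarily how vxData assigns splits. The function name `assign_split`, the patient ID format, and the default 80/10/10 ratios are assumptions made for illustration.

```python
import hashlib


def assign_split(patient_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a stable ID to 'train', 'val', or 'test' deterministically.

    Hypothetical helper: hashing the ID means the same patient always lands
    in the same split, regardless of run order or dataset size.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    # Interpret the first 6 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


if __name__ == "__main__":
    # Hypothetical patient IDs; rerunning this prints the same assignment each time.
    for pid in ["patient-001", "patient-002", "patient-003"]:
        print(pid, assign_split(pid))
```

Because the assignment depends only on the ID, two experiments that never share code or state still agree on which patients belong to the test set.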
How is the data platform set up?
The data platform is a central service deployed in our infrastructure. You use a web client or Python client to view, query, and download data from this central deployment. The client talks to the data platform API, which in turn coordinates with Postgres table storage and a MinIO-based blob storage system. The data platform effectively acts as a CRUD (Create, Read, Update, Delete) layer for data stored in Postgres and MinIO. Most of the data you interact with can be represented as DataFrames/CSVs, since much of it lives in Postgres tables. Data that cannot be represented in tables is stored in the blob storage: MRI data, pathology data, PDFs, and more.
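The division of labor described above can be illustrated with a toy model. This is a minimal sketch, not the actual vxData API: the class `PlatformSketch`, its methods, and the `scans` table are hypothetical, an in-memory SQLite database stands in for Postgres, and a plain dictionary stands in for MinIO.

```python
import sqlite3


class PlatformSketch:
    """Toy stand-in for the API layer: rows go to SQL, files go to a blob store."""

    def __init__(self) -> None:
        # In-memory SQLite stands in for Postgres; a dict stands in for MinIO.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE scans (scan_id TEXT PRIMARY KEY, patient_id TEXT, blob_key TEXT)"
        )
        self.blobs: dict[str, bytes] = {}

    def create_scan(self, scan_id: str, patient_id: str, payload: bytes) -> None:
        # The binary payload goes to blob storage; the table stores only its key.
        blob_key = f"scans/{scan_id}.bin"
        self.blobs[blob_key] = payload
        self.db.execute(
            "INSERT INTO scans VALUES (?, ?, ?)", (scan_id, patient_id, blob_key)
        )

    def read_scan(self, scan_id: str) -> tuple[str, bytes]:
        # Look up the tabular row first, then fetch the blob it points to.
        row = self.db.execute(
            "SELECT patient_id, blob_key FROM scans WHERE scan_id = ?", (scan_id,)
        ).fetchone()
        if row is None:
            raise KeyError(scan_id)
        patient_id, blob_key = row
        return patient_id, self.blobs[blob_key]

    def update_patient(self, scan_id: str, patient_id: str) -> None:
        # Metadata updates touch only the table; the blob is left unchanged.
        self.db.execute(
            "UPDATE scans SET patient_id = ? WHERE scan_id = ?", (patient_id, scan_id)
        )

    def delete_scan(self, scan_id: str) -> None:
        # Remove the row and the blob it references.
        _, blob_key = self._row(scan_id)
        self.db.execute("DELETE FROM scans WHERE scan_id = ?", (scan_id,))
        del self.blobs[blob_key]

    def _row(self, scan_id: str) -> tuple[str, str]:
        row = self.db.execute(
            "SELECT patient_id, blob_key FROM scans WHERE scan_id = ?", (scan_id,)
        ).fetchone()
        if row is None:
            raise KeyError(scan_id)
        return row


if __name__ == "__main__":
    platform = PlatformSketch()
    platform.create_scan("scan-1", "patient-1", b"\x00\x01")
    print(platform.read_scan("scan-1"))
```

The point of the sketch is the routing: queryable metadata lives in a table, while large binary payloads live in blob storage and are referenced by key.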
What data is stored in the data platform?
The three primary goals above are reflected in the tiers of data available in the data platform:
- Raw, immutable source data - the data that we acquire by running clinical trials, downloading public datasets, or buying data from external partners. We store this data in its original form, as is. This is the data you're used to seeing in the shared NFS folders at /mnt/storage. We call this type of data the "source-level data".
- Normalized data - by transforming raw data into consistent schemas and formats, we abstract away the details and complexities of the individual, highly heterogeneous data sources. This makes it much easier to answer questions such as how many patients with a given disease status are available to our research. Because this data is structured consistently, we can move it into a SQL table store, allowing for rapid querying and transformation. Blob data such as imaging data remains stored in the S3-compatible blob storage. We call this type of data the "integration-level data".
- Application-specific data - while the normalized data provides consistency, it is not necessarily ergonomic to work with in specific projects. To meet the modeling demands of individual projects, the data platform allows for workflows that transform source- and integration-level data into application-specific forms of the data. This process may involve heavy filtering and transformation operations. We call this type of data the "application-level data".
Unless you are working on a project with tailored application-level data, you should almost always be interacting with integration-level data. For information on how and why data is modeled, refer to the explainer on database structure.
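To make the three tiers concrete, the sketch below walks two made-up source datasets through normalization into one table and then derives a project-specific view. Everything here is hypothetical and chosen for illustration: the field names, the `patients` schema, and the helper functions are not the vxData schema or SDK, and SQLite stands in for the SQL table store.

```python
import sqlite3

# Source-level: two hypothetical datasets with different field names and encodings.
SOURCE_A = [{"PatientID": "A-1", "dx": "positive"}, {"PatientID": "A-2", "dx": "negative"}]
SOURCE_B = [{"subject": "B-7", "disease_status": 1}, {"subject": "B-8", "disease_status": 1}]


def normalize_a(rec: dict) -> tuple[str, str, bool]:
    # Dataset A encodes disease status as a string label.
    return (rec["PatientID"], "dataset_a", rec["dx"] == "positive")


def normalize_b(rec: dict) -> tuple[str, str, bool]:
    # Dataset B encodes disease status as 0/1.
    return (rec["subject"], "dataset_b", rec["disease_status"] == 1)


def build_integration_table() -> sqlite3.Connection:
    """Integration-level: one consistent schema across both sources."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (patient_id TEXT, source TEXT, has_disease INTEGER)")
    rows = [normalize_a(r) for r in SOURCE_A] + [normalize_b(r) for r in SOURCE_B]
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    return conn


def positive_patient_ids(conn: sqlite3.Connection) -> list[str]:
    """Application-level: a filtered view tailored to one hypothetical project."""
    cur = conn.execute(
        "SELECT patient_id FROM patients WHERE has_disease = 1 ORDER BY patient_id"
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = build_integration_table()
    # With a shared schema, a cross-dataset question becomes a single query.
    count = conn.execute("SELECT COUNT(*) FROM patients WHERE has_disease = 1").fetchone()[0]
    print(count)                       # 3
    print(positive_patient_ids(conn))  # ['A-1', 'B-7', 'B-8']
```

The per-source normalizers are the only place that knows about each dataset's quirks; everything downstream works against the shared schema.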
Getting Started
See the getting started guide.