Reading Data
VxData exposes many different tables, modeling different aspects of the data domains we're working with.
At its core, VxData is a CRUD layer - a Create-Read-Update-Delete system for structured data. The SDK directly implements this mental model by exposing the following general-purpose methods to work with data:
client.resources.retrieve(<identifier>)
client.resources.create(<resource>)
client.resources.update(<identifier>, <update>)
client.resources.delete(<identifier>)
Every "entity" in VxData is considered a resource with its own unique, stable identifier.
For example, our Bamberg trial is considered a resource with the ID datasource/bamberg.
Patient number 17 in this trial is considered a resource too - with identifier patient/BB00017.
Every single MRI volume, PSA measurement, or diagnostic report of this patient is likewise its own resource with its own unique identifier.
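The identifiers shown above all follow a `<type>/<key>` shape. A minimal sketch of splitting one into its parts, assuming the resource type is always the first slash-separated segment; the helper `split_identifier` is hypothetical and not part of the SDK:

```python
def split_identifier(identifier: str) -> tuple[str, str]:
    """Split '<type>/<key>' into (type, key); the key may itself contain slashes."""
    resource_type, sep, key = identifier.partition("/")
    if not sep or not resource_type or not key:
        raise ValueError(f"not a '<type>/<key>' identifier: {identifier!r}")
    return resource_type, key


print(split_identifier("datasource/bamberg"))        # ('datasource', 'bamberg')
print(split_identifier("patient/BB00017"))           # ('patient', 'BB00017')
print(split_identifier("imagingstudy/BB00017/001"))  # ('imagingstudy', 'BB00017/001')
```

Only the first slash is treated as a separator, so nested keys such as `BB00017/001` stay intact.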
Each resource is associated with exactly one of a fixed set of predefined payload types. The payload is the structured data that holds the resource's actual content.
A patient is associated with a PatientPayload, which contains fields such as the patient's birth year.
An MRI volume on the other hand is associated with a VolumePayload and contains fields such as b_value or image_plane.
All payloads are defined in the vxdata-schemas package, which comes with the SDK; look there for the full definition of each payload.
Retrieving Specific Resources
VxData SDK makes it easy to obtain the data for specific resources you may be interested in. To provide you with better type hints, we have implemented payload-type-specific namespaces.
The generic client.resources.retrieve() method returns ResourceResponse objects. The payload-specific namespaces return more useful, payload-specific response types instead:
from vxdata.schemas.response import *

# A single identifier returns a PatientResponse
patient = client.patients.retrieve("patient/BB00017")
# A list of identifiers returns a list[ImagingStudyResponse]
studies = client.imaging_studies.retrieve([
    "imagingstudy/BB00017/001",
    "imagingstudy/BB00017/002",
])
# client.pathology_assessments.retrieve(...) returns a PathologyAssessmentResponse
# ...
These methods are useful for obtaining the data for specific resources you already know the identifiers for.
Querying Resources
To fetch resources in bulk based on their properties rather than on known identifiers, use the SDK's query() method:
from vxdata.sdk import Client, F
client = Client()
patients = client.patients.query().collect()
volumes = (
    client.volumes.query()
    .filter(F.volume_type == "T2", F.image_plane == "transversal")
    .limit(100)
    .collect()
)
rpe = (
    client.pathology_specimens.query()
    .filter(F.type == "RPE")
    .collect_as_pydantic()
)
The query builder is intended for fast retrieval of large amounts of data for a single payload type.
Its collect() method returns a Polars DataFrame for you to work with.
Alternatively, call collect_as_pydantic() instead of collect() to receive the results as the corresponding pydantic response models.
Explore available resource types and their fields in the frontend query tab or by looking at the payload schemas in the codebase.
A more detailed guide on querying can be found below.
Table data is returned directly as Polars DataFrames. Blob data (MRIs, DICOMs, PDFs) is stored on S3.
Tables may contain URLs of blob files that are stored on S3 - for example, the Volume table holds a column path_nii that points to the location of the corresponding nifti file.
There are two ways to fetch these files: (a) download individual files by specifying their S3 URLs, or (b) download every file referenced in a Polars DataFrame at once.
Download individual S3 URLs
from pathlib import Path

# Single file
local_path = client.storage.download("s3://bucket/path/to/file.nii.gz", dest_dir=Path("./downloads"))
# Multiple files
local_paths = client.storage.download(
    ["s3://bucket/file1.nii.gz", "s3://bucket/file2.nii.gz"],
    dest_dir=Path("./downloads"),
)
Download all S3 URLs in a DataFrame
from pathlib import Path
volumes = client.volumes.query().filter(...).collect() # build your dataframe of interest here
# Automatically detects S3 URL columns, downloads in parallel to disk, and replaces paths in dataframe with local paths
volumes_local = client.storage.materialize(volumes, dest_dir=Path("./downloads"))
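The comment above describes three steps: find the columns that hold S3 URLs, download the files, and swap in local paths. The first and last step can be sketched without any network access. This is an illustration only: `find_s3_columns` and `local_path_for` are hypothetical helpers, the table is modeled as a plain dict of columns, and the on-disk layout actually chosen by `client.storage.materialize()` may differ.

```python
from pathlib import Path
from urllib.parse import urlparse


def find_s3_columns(table: dict[str, list]) -> list[str]:
    """Return names of columns whose non-null values are all s3:// URLs."""
    found = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        if present and all(isinstance(v, str) and v.startswith("s3://") for v in present):
            found.append(name)
    return found


def local_path_for(url: str, dest_dir: Path) -> Path:
    """Map s3://bucket/key to dest_dir/bucket/key (an assumed layout)."""
    parsed = urlparse(url)
    return dest_dir / parsed.netloc / parsed.path.lstrip("/")


table = {
    "identifier": ["volume/DA00001/study-1/series-1"],
    "path_nii": ["s3://bucket/path/to/file.nii.gz"],
}
for column in find_s3_columns(table):
    # Replace each URL with the local path it would be stored under
    table[column] = [str(local_path_for(url, Path("./downloads"))) for url in table[column]]
```

Keeping the bucket name in the local path avoids collisions when two buckets contain the same key.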
Querying in Detail
The platform contains various resource types such as patients, imaging studies, volumes, and clinical assessments. To see what resources are available and explore their data, visit the Query tab in the frontend.
The client uses a chainable query builder pattern. You construct a query by chaining methods like .filter(), .select(), and .limit(). The query is only executed when you call .collect(), which returns a Polars DataFrame with the results.
# Get all volumes from the platform
volumes = client.volumes.query().collect()
patients = client.patients.query().collect()
# Chainable filtering
volumes = (
    client.volumes.query()
    .filter(F.volume_type == "T2",
            F.image_plane == "transversal")
    .limit(100)
    .collect()
)
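To make the deferred execution concrete, here is a toy builder over an in-memory tuple of rows. It is not the SDK's implementation: `ToyQuery` and its predicate-based `filter()` are assumptions made purely for illustration. Each chained call only records a step and returns a new builder; nothing is evaluated until `collect()` is called.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

Row = dict[str, object]


@dataclass(frozen=True)
class ToyQuery:
    rows: tuple
    predicates: tuple = ()
    columns: Optional[tuple] = None
    max_rows: Optional[int] = None

    def filter(self, *predicates: Callable[[Row], bool]) -> "ToyQuery":
        # Record the predicates; multiple predicates are combined with AND
        return replace(self, predicates=self.predicates + predicates)

    def select(self, *columns: str) -> "ToyQuery":
        return replace(self, columns=columns)

    def limit(self, n: int) -> "ToyQuery":
        return replace(self, max_rows=n)

    def collect(self) -> list[Row]:
        # Only here is any row actually inspected
        out = [r for r in self.rows if all(p(r) for p in self.predicates)]
        if self.max_rows is not None:
            out = out[: self.max_rows]
        if self.columns is not None:
            out = [{c: r[c] for c in self.columns} for r in out]
        return out


rows = (
    {"identifier": "volume/1", "volume_type": "T2", "image_plane": "transversal"},
    {"identifier": "volume/2", "volume_type": "T1", "image_plane": "transversal"},
    {"identifier": "volume/3", "volume_type": "T2", "image_plane": "sagittal"},
)
query = (
    ToyQuery(rows)
    .filter(lambda r: r["volume_type"] == "T2", lambda r: r["image_plane"] == "transversal")
    .select("identifier")
    .limit(100)
)
result = query.collect()
print(result)  # [{'identifier': 'volume/1'}]
```

Because every step returns a new immutable builder, a partially built query can be stored and extended in different directions without the variants affecting each other.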
Field Accessor (F)
The F accessor provides operator overloading for filter syntax:
from vxdata.sdk import F
# Comparison operators
F.volume_type == "T2" # equals
F.volume_type != "T1" # not equals
F.slice_thickness > 2.5 # greater than
F.slice_thickness >= 2.5 # greater or equal
F.slice_thickness < 5.0 # less than
F.slice_thickness <= 5.0 # less or equal
# Methods
F.series_description.contains("T2")
F.patient_id.startswith("DA")
F.patient_id.is_in(["DA001", "DA002", "DA003"])
F.slice_thickness.between(2.0, 4.0)
F.acquisition_datetime.after("2024-01-01")
F.acquisition_datetime.before("2025-01-01")
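Operator overloading of this kind is commonly built on `__getattr__` plus Python's rich-comparison methods. The sketch below is a generic illustration of that pattern, not the SDK's actual `F`: `ToyField`, `ToyExpr`, `F_toy`, and the `matches()` method are hypothetical names, and expressions here are evaluated against plain dicts rather than translated into a server-side query.

```python
import operator
from typing import Any, Callable


class ToyExpr:
    """A filter expression: a row predicate plus a readable description."""

    def __init__(self, test: Callable[[dict], bool], text: str):
        self.test = test
        self.text = text

    def matches(self, row: dict) -> bool:
        return self.test(row)

    def __repr__(self) -> str:
        return self.text


class ToyField:
    def __init__(self, name: str):
        self.name = name

    def _compare(self, op: Callable[[Any, Any], bool], symbol: str, value: Any) -> ToyExpr:
        return ToyExpr(lambda row: op(row[self.name], value), f"{self.name} {symbol} {value!r}")

    # Comparisons return an expression object instead of a bool
    def __eq__(self, value): return self._compare(operator.eq, "==", value)
    def __ne__(self, value): return self._compare(operator.ne, "!=", value)
    def __gt__(self, value): return self._compare(operator.gt, ">", value)
    def __ge__(self, value): return self._compare(operator.ge, ">=", value)
    def __lt__(self, value): return self._compare(operator.lt, "<", value)
    def __le__(self, value): return self._compare(operator.le, "<=", value)

    def is_in(self, values) -> ToyExpr:
        values = list(values)
        return ToyExpr(lambda row: row[self.name] in values, f"{self.name} in {values!r}")


class _Accessor:
    def __getattr__(self, name: str) -> ToyField:
        # Any attribute access yields a field of that name
        return ToyField(name)


F_toy = _Accessor()

row = {"volume_type": "T2", "slice_thickness": 3.0, "patient_id": "DA001"}
print(F_toy.volume_type == "T2")                       # volume_type == 'T2'
print((F_toy.slice_thickness > 2.5).matches(row))      # True
print(F_toy.patient_id.is_in(["DA002"]).matches(row))  # False
```

The key design point is that `==` and friends build a description of the comparison instead of evaluating it, which is what lets a query builder defer the work until the query runs.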
Metadata Fields
These metadata fields exist for every resource type:
| Field | Description |
|---|---|
| identifier | Resource identifier (e.g., volume/DA00001/...) |
| parent_identifier | Parent resource identifier |
| valid_from | Version valid from timestamp |
| valid_to | Version valid to timestamp |
| created_at | Creation timestamp |
| license | License string |
Additional fields vary by resource type and can be seen in the frontend on /query.
# Get specific volumes by identifier
volumes = (
    client.volumes.query()
    .filter(F.identifier.is_in([
        "volume/DA00001/study-1/series-1",
        "volume/DA00002/study-1/series-1",
    ]))
    .collect()
)
Note: Don't use F.parent_identifier to filter by datasource or patient, as it only matches direct parents. Use Lineage Filters instead to query all descendants.
Lineage Filters
Resources form a hierarchy: Datasource → Patient → ImagingStudy → Volume. Each resource automatically has lineage fields (datasource_id, patient_id, study_id) derived from its ancestry. Use these to filter resources at any level:
from vxdata.sdk import Client, F, has_ancestor
client = Client()
# All volumes from a datasource
volumes = client.volumes.query().filter(F.datasource_id == "datasource/dasa").collect()
# Multiple datasources
volumes = client.volumes.query().filter(F.datasource_id.is_in(["datasource/dasa", "datasource/basel"])).collect()
# All resources under a patient
volumes = client.volumes.query().filter(F.patient_id == "patient/DA00001").collect()
# All resources under a study
volumes = client.volumes.query().filter(F.study_id == "imagingstudy/DA00001/study-1").collect()
# Generic ancestor filter (for other ancestor types)
volumes = client.volumes.query().filter(has_ancestor("datasource/dasa")).collect()
Combine lineage filters with other filters:
# Axial T2 volumes from DASA
volumes = (
    client.volumes.query()
    .filter(
        F.datasource_id == "datasource/dasa",
        F.volume_type == "T2",
        F.image_plane == "transversal",
    )
    .collect()
)
| Filter | Description | Example |
|---|---|---|
| F.datasource_id | Resources under datasource | F.datasource_id == "datasource/dasa" |
| F.patient_id | Resources under patient | F.patient_id == "patient/DA001" |
| F.study_id | Resources under study | F.study_id == "imagingstudy/DA001/123" |
| has_ancestor(value) | Generic ancestor filter | has_ancestor("datasource/dasa") |
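The difference between `parent_identifier` and the lineage fields can be illustrated with a small parent map. This is a sketch under assumptions: the parent links between the example identifiers are assumed from the hierarchy described above, and `ancestors()` is a hypothetical helper that walks those links, which is conceptually what an ancestor filter has to match against.

```python
# child identifier -> direct parent identifier (assumed links, for illustration)
PARENT = {
    "patient/DA00001": "datasource/dasa",
    "imagingstudy/DA00001/study-1": "patient/DA00001",
    "volume/DA00001/study-1/series-1": "imagingstudy/DA00001/study-1",
}


def ancestors(identifier: str) -> list[str]:
    """Walk parent links upward and return all ancestors, nearest first."""
    chain = []
    current = PARENT.get(identifier)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


volume = "volume/DA00001/study-1/series-1"
print(PARENT[volume] == "datasource/dasa")     # False: the direct parent is the study
print("datasource/dasa" in ancestors(volume))  # True: the datasource is an ancestor
```

A filter on the direct parent therefore misses the volume when given a datasource or patient, while a lineage-based filter finds it.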
Combining Filters with AND / OR
By default, multiple filters passed to .filter() are combined with AND. You can use & (AND) and | (OR) operators to build more complex expressions:
from vxdata.sdk import F
# OR: match T2 or ADC volumes
volumes = (
    client.volumes.query()
    .filter((F.volume_type == "T2") | (F.volume_type == "ADC"))
    .collect()
)
# AND + OR: T2 or ADC volumes from a specific datasource
volumes = (
    client.volumes.query()
    .filter(
        ((F.volume_type == "T2") | (F.volume_type == "ADC"))
        & (F.datasource_id == "datasource/dasa")
    )
    .collect()
)
# Combine a pre-built filter with another condition
in_sources = F.datasource_id.is_in(["datasource/dasa", "datasource/basel"])
specimens = (
    client.pathology_specimens.query()
    .filter(in_sources & (F.is_event_root == True))
    .collect()
)
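Parentheses control grouping. & binds tighter than |, matching Python's standard operator precedence, and both bind tighter than comparison operators such as ==, so each comparison must be wrapped in its own parentheses. You can nest arbitrarily: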
# (A or B) and (C or D)
.filter(
    ((F.volume_type == "T2") | (F.volume_type == "ADC"))
    & ((F.datasource_id == "datasource/dasa") | (F.datasource_id == "datasource/basel"))
)
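Python's own parser can be used to check how a filter expression groups. The sketch below uses the standard-library `ast` module on plain placeholder names (no SDK objects are involved) to compare expressions with and without parentheses; `top_level` is a helper written for this example.

```python
import ast


def top_level(expr: str) -> str:
    """Name of the outermost AST node (and its operator) that Python sees in expr."""
    node = ast.parse(expr, mode="eval").body
    if isinstance(node, ast.BinOp):
        return f"BinOp:{type(node.op).__name__}"
    return type(node).__name__


# Parenthesized comparisons: the outermost operation is the OR
print(top_level("(a == 'T2') | (a == 'ADC')"))  # BinOp:BitOr
# Without parentheses, Python parses a chained comparison around ('T2' | a)
print(top_level("a == 'T2' | a == 'ADC'"))      # Compare
# & binds tighter than |, so this groups as x | (y & z)
print(top_level("x | y & z"))                   # BinOp:BitOr
```

The second case is the one to watch for: it is syntactically valid, so Python will not flag it, but it does not express an OR of two comparisons.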
Subqueries
A query that ends in .select() can be passed to .is_in() to embed it as a subquery. For array-valued columns such as derived_from, the example below instead runs the inner query first and then matches the list elements client-side with Polars:
import polars as pl

# Get volumes linked to submitted annotation sessions
# Note: derived_from is an array, so we check if any element matches
session_ids = (
    client.annotation_sessions.query()
    .filter(F.source == "manual-masks", F.status == "submitted")
    .select("identifier")
    .collect()["identifier"]
    .to_list()
)
masks = (
    client.volumes.query()
    .filter(F.volume_type == "SEG")
    .collect()
    .filter(
        pl.col("derived_from").list.eval(pl.element().is_in(session_ids)).list.any()
    )
)
This first fetches the session IDs, then filters volumes where any derived_from element matches.
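The Polars expression above keeps a row when at least one entry of its `derived_from` list is among the fetched session IDs. The same predicate in plain Python, using made-up identifiers, looks like this (a sketch of the matching logic only, not a replacement for the query):

```python
# Hypothetical identifiers, invented for this example
session_ids = {"annotationsession/a", "annotationsession/b"}

rows = [
    {"identifier": "volume/1", "derived_from": ["annotationsession/a", "volume/0"]},
    {"identifier": "volume/2", "derived_from": ["volume/0"]},
    {"identifier": "volume/3", "derived_from": []},
]

# Keep rows where any element of derived_from is a known session ID
matching = [
    row["identifier"]
    for row in rows
    if any(parent in session_ids for parent in row["derived_from"])
]
print(matching)  # ['volume/1']
```

Rows with an empty `derived_from` list never match, since `any()` over an empty sequence is false.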
ANY/ALL Comparisons
For numeric comparisons against subquery results, use ANY() and ALL():
from vxdata.sdk import F, ANY, ALL
# Volumes thicker than any T1 volume
thick_volumes = (
    client.volumes.query()
    .filter(
        F.slice_thickness > ANY(
            client.volumes.query()
            .filter(F.volume_type == "T1")
            .select("slice_thickness")
        )
    )
    .collect()
)
# Volumes thinner than all T2 volumes
thin_volumes = (
    client.volumes.query()
    .filter(
        F.slice_thickness < ALL(
            client.volumes.query()
            .filter(F.volume_type == "T2")
            .select("slice_thickness")
        )
    )
    .collect()
)
Supported operators: >, >=, <, <=
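Assuming `ANY()` and `ALL()` follow the usual SQL semantics (an assumption; the text above does not spell this out), `x > ANY(values)` holds when `x` exceeds at least one value, and `x < ALL(values)` holds when `x` is below every value. A plain-Python sketch with hypothetical helper names and invented thickness values:

```python
def gt_any(x: float, values: list[float]) -> bool:
    """x > ANY(values): true if x is greater than at least one value."""
    return any(x > v for v in values)


def lt_all(x: float, values: list[float]) -> bool:
    """x < ALL(values): true if x is smaller than every value."""
    return all(x < v for v in values)


thicknesses = [3.0, 4.0, 5.0]  # invented subquery result

print(gt_any(3.5, thicknesses))  # True  (3.5 > 3.0)
print(gt_any(3.0, thicknesses))  # False (not greater than any value)
print(lt_all(2.0, thicknesses))  # True  (below every value)
print(lt_all(3.5, thicknesses))  # False (3.0 is smaller)
```

For a non-empty result, both conditions reduce to a comparison against the smallest value: `x > ANY(values)` is `x > min(values)` and `x < ALL(values)` is `x < min(values)`. For an empty result, the sketch gives false for ANY and true for ALL.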
Column Selection
Use .select() to fetch only specific columns:
# Only fetch the columns you need
masks = (
    client.volumes.query()
    .filter(F.volume_type == "SEG")
    .select("path_nii", "segmentation_reference_uid")
    .collect()
)
Note: When .select() is used inside .is_in(), it creates a subquery. When used before .collect(), it limits the returned columns.
Combining with Polars
Since .collect() returns a Polars DataFrame, you can use Polars operations to join and aggregate data from multiple queries:
import polars as pl
from vxdata.sdk import Client, F
client = Client()
# Get axial T2 volumes for patients from bamberg
volumes = (
    client.volumes.query()
    .filter(F.volume_type == "T2",
            F.image_plane == "transversal",
            F.datasource_id == "datasource/bamberg")
    .select("identifier", "patient_id")
    .collect()
)
# Get highest ISUP per patient
pathology = (
    client.pathology_assessments.query()
    .filter(F.datasource_id == "datasource/bamberg")
    .select("patient_id", "isup")
    .collect()
    .group_by("patient_id")
    .agg(pl.col("isup").max().alias("max_isup"))
)
# Get highest PIRADS per patient
pirads = (
    client.pirads_assessments.query()
    .filter(F.datasource_id == "datasource/bamberg")
    .select("patient_id", "score")
    .collect()
    .group_by("patient_id")
    .agg(pl.col("score").max().alias("max_pirads"))
)
# Join everything together
result = (
    volumes
    .join(pathology, on="patient_id", how="left")
    .join(pirads, on="patient_id", how="left")
)
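The shape of the joined result can be reasoned about without Polars. The sketch below mirrors the per-patient maximum and the left join in plain Python on invented rows; the variable names are local to this example. Patients with no pathology rows end up with `None`, which corresponds to a null in the left-joined DataFrame.

```python
# Invented rows standing in for the query results
volume_rows = [
    {"identifier": "volume/1", "patient_id": "patient/BB00001"},
    {"identifier": "volume/2", "patient_id": "patient/BB00002"},
]
pathology_rows = [
    {"patient_id": "patient/BB00001", "isup": 2},
    {"patient_id": "patient/BB00001", "isup": 4},
]

# group_by("patient_id") + max("isup")
max_isup: dict[str, int] = {}
for row in pathology_rows:
    pid = row["patient_id"]
    max_isup[pid] = max(row["isup"], max_isup.get(pid, row["isup"]))

# Left join: every volume row is kept, missing matches become None
joined = [{**v, "max_isup": max_isup.get(v["patient_id"])} for v in volume_rows]
for row in joined:
    print(row)
```

Because the join is a left join on the volume table, the number of output rows equals the number of volume rows, provided the right-hand table has at most one row per patient, which the aggregation guarantees.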