My First Cohort

What is a Cohort?

A cohort is a carefully selected group of medical data that meets specific criteria for research or analysis. Think of it as creating a snapshot of your dataset at a specific point in time, filtered to include only the patients and data that are relevant for your study.

For example, you might create a cohort of:

  • All patients born after 1970 with at least one MRI scan
  • Patients with specific pathology findings and corresponding imaging
  • Training and validation splits for machine learning models

Cohorts in VXPlatform are:

  • Versioned: They capture data as it existed at a specific timestamp
  • Reproducible: The same cohort script always produces the same Lockfiles for a given timestamp
  • Shareable: Lockfiles can be shared, while the actual data remains secure
  • Split-aware: They can automatically divide data into training, validation, and test sets

Cohort Building Process

Getting Started

In this tutorial, we'll create a simple cohort that selects the 100 most recent transversal T2 MRI volumes and splits them into training (70%), validation (10%), and test (20%) sets.

Step 1: Create a Cohort Script

A cohort script is a Python function that queries the platform and returns a dictionary of resource identifiers organized by split (train/validation/test).
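In outline, such a script looks roughly like this (a minimal sketch only; the complete, working example follows below):

from vxp_client import client as vc
from vxp_client.datasets.decorator import dataset_compilation_entrypoint


@dataset_compilation_entrypoint()
def select_resources(client: vc.PlatformClient):
    # ... query the platform and choose the resources you want ...

    # Return the selected resource identifiers grouped by split name.
    return {
        "train": [],       # identifiers of resources used for training
        "validation": [],  # identifiers of resources used for validation
        "test": [],        # identifiers of resources used for testing
    }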

Querying Data

There are two ways to get data from the platform:

Option 1: Get all data at once (simplest for exploration)

# Get all payload types as a dictionary of DataFrames
dfs = client.get_all_dataframes()

# Access individual DataFrames and use Polars to filter/transform
volumes = dfs["Volume"]
patients = dfs["Patient"]
imaging_studies = dfs["ImagingStudy"]

The platform returns Polars DataFrames that you can filter and transform as needed.
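For instance, continuing from the dfs dictionary above, a quick sketch of ordinary Polars filtering (the volume_type and image_plane columns are the same ones used later in this tutorial):

import polars as pl

volumes = dfs["Volume"]

# Count transversal T2 volumes with plain Polars filters
n_t2_transversal = (
    volumes
    .filter(pl.col("volume_type") == "T2")
    .filter(pl.col("image_plane") == "transversal")
    .height
)
print(f"{n_t2_transversal} transversal T2 volumes")

# Or summarise how many volumes exist per volume type
print(volumes["volume_type"].value_counts())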

Option 2: Query specific resource types with filters (more efficient)

# Query only volumes with server-side filtering
t2_volumes = client.query_resources(
    "Volume",
    filters=[Equals("volume_type", "T2")],
    as_json=False,
)

This approach is more efficient because it only fetches the resource types you need and applies the filters on the server. See the querying guide for details on available filters and query options.

Example Script

Create a file called my_first_cohort.py:

import polars as pl
from vxp_client import client as vc
from vxp_client.datasets.decorator import dataset_compilation_entrypoint


@dataset_compilation_entrypoint()
def select_resources(
    client: vc.PlatformClient,
) -> dict[str, list[dict[str, object]]]:
    """
    A simple example cohort: select the 100 most recent transversal T2 volumes and split them.
    """

    # Get all data from the platform
    dfs = client.get_all_dataframes()

    # Select transversal T2 volumes, take the 100 most recent
    volumes = (
        dfs["Volume"]
        .filter(pl.col("volume_type") == "T2")
        .filter(pl.col("image_plane") == "transversal")
        .sort("_created_at", descending=True)
        .head(100)
    )

    # Split into train (70%), validation (10%), and test (20%)
    n_volumes = len(volumes)
    n_train = int(n_volumes * 0.70)
    n_val = int(n_volumes * 0.10)

    train_volumes = volumes[:n_train]
    val_volumes = volumes[n_train:n_train + n_val]
    test_volumes = volumes[n_train + n_val:]

    return {
        "train": train_volumes.select("_identifier").to_series().to_list(),
        "validation": val_volumes.select("_identifier").to_series().to_list(),
        "test": test_volumes.select("_identifier").to_series().to_list(),
    }

Key points:

  • Use @dataset_compilation_entrypoint() decorator to mark the function as a cohort script
  • Use client.get_all_dataframes() to fetch all data at once - it returns a dictionary of Polars DataFrames
  • Use Polars operations (.filter(), .sort(), .head(), etc.) for data manipulation
  • Return a dictionary with split names as keys and lists of resource identifiers as values
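The example above splits by recency order (the most recent volumes land in the training set). If you prefer a randomized split, a minimal sketch is to shuffle with a fixed seed before slicing; the fixed seed keeps the script deterministic, so compiling at the same timestamp still yields the same lockfile:

# Shuffle deterministically, then slice into 70/10/20 as before
shuffled = volumes.sample(fraction=1.0, shuffle=True, seed=42)

n_train = int(len(shuffled) * 0.70)
n_val = int(len(shuffled) * 0.10)

train_volumes = shuffled[:n_train]
val_volumes = shuffled[n_train:n_train + n_val]
test_volumes = shuffled[n_train + n_val:]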

For more efficient queries with server-side filtering, see the querying guide.
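As a rough sketch of what that could look like for this cohort, here is the same selection using query_resources. Three details are assumptions rather than documented behaviour: that multiple filters in the list combine as a logical AND, that as_json=False returns a Polars DataFrame (as get_all_dataframes does), and that the import path for Equals is as shown; check the querying guide for the actual semantics.

from vxp_client import client as vc
from vxp_client.datasets.decorator import dataset_compilation_entrypoint
from vxp_client.query import Equals  # NOTE: import path assumed; see the querying guide


@dataset_compilation_entrypoint()
def select_resources(client: vc.PlatformClient):
    # Fetch only T2 volumes, filtered on the server (AND semantics assumed)
    volumes = client.query_resources(
        "Volume",
        filters=[
            Equals("volume_type", "T2"),
            Equals("image_plane", "transversal"),
        ],
        as_json=False,
    )

    # Sort and limit client-side, exactly as in the main example
    volumes = volumes.sort("_created_at", descending=True).head(100)

    n_train = int(len(volumes) * 0.70)
    n_val = int(len(volumes) * 0.10)

    return {
        "train": volumes[:n_train].select("_identifier").to_series().to_list(),
        "validation": volumes[n_train:n_train + n_val].select("_identifier").to_series().to_list(),
        "test": volumes[n_train + n_val:].select("_identifier").to_series().to_list(),
    }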

Step 2: Compile the Cohort (Using CLI)

The compilation step runs your script and creates a lockfile - a snapshot of which resources were selected at a specific timestamp.

# Compile using a local script file
vxp cohort compile my-first-cohort --file my_first_cohort.py

# Or use the script that is already registered in the package
vxp cohort compile my-first-cohort

This creates a lockfile like my-first-cohort-2025-12-08T16-15-31.lock that contains:

  • The timestamp when it was compiled
  • Lists of resource identifiers for each split (train/validation/test)
  • Metadata about the cohort
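If you want to inspect what was captured before populating, you can load the lockfile with the client's Lockfile class (used again in Step 5). Printing the object is a safe first look; the exact attribute names of the lockfile schema aren't covered here.

from pathlib import Path

from vxp_client.lockfile import Lockfile

# Load the compiled lockfile produced by the step above
lock = Lockfile.from_file(Path("my-first-cohort-2025-12-08T16-15-31.lock"))

# Inspect it; the timestamp, per-split identifier lists, and metadata
# described above are all recorded in this object
print(lock)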

Step 3: Compile the Cohort (Using Python Client)

Alternatively, you can compile cohorts programmatically:

from vxp_client import cohort
from pathlib import Path

# Compile from a script file
lockfile = cohort.compile(
    name="my-first-cohort",
    file=Path("my_first_cohort.py"),
    output=Path("."),
    timestamp="2025-12-08T16:15:31",
    url="http://192.168.10.101:2700",
)

print(f"Cohort compiled: {lockfile}")

Step 4: Populate the Cohort Data (Using CLI)

Once you have a lockfile, you can fetch the actual data:

# Export as CSV files (one per payload type and split)
vxp cohort populate my-first-cohort-2025-12-08T16-15-31.lock \
  --output ./data \
  --format csv

# Export as Parquet files (more efficient for large datasets)
vxp cohort populate my-first-cohort-2025-12-08T16-15-31.lock \
  --output ./data \
  --format parquet

# Export as JSONL (one JSON object per line)
vxp cohort populate my-first-cohort-2025-12-08T16-15-31.lock \
  --output ./data \
  --format jsonl

This will create files like:

data/
├── Volume.my-first-cohort.2025-12-08T16-15-31.train.csv
├── Volume.my-first-cohort.2025-12-08T16-15-31.validation.csv
└── Volume.my-first-cohort.2025-12-08T16-15-31.test.csv

Step 5: Populate the Cohort Data (Using Python Client)

You can also populate cohorts programmatically:

from vxp_client import cohort
from vxp_client.lockfile import Lockfile
from pathlib import Path

# Load the lockfile
lock = Lockfile.from_file(Path("my-first-cohort-2025-12-08T16-15-31.lock"))

# Populate and get DataFrames
data = cohort.populate(
    lockfile=lock,
    url="http://192.168.10.101:2700",
    format="csv",  # returns DataFrames grouped by split and payload type
)

# Access the data
for split_name, dataframes_with_types in data.items():
    print(f"\n{split_name} split:")
    for payload_type, df in dataframes_with_types:
        print(f"  {payload_type}: {len(df)} rows")

        # You can now work with the DataFrame
        # df.write_csv(f"{split_name}_{payload_type}.csv")
        # df.write_parquet(f"{split_name}_{payload_type}.parquet")

Step 6: Use Your Cohort Data

Now you have organized, versioned data ready for your research or machine learning pipeline:

import polars as pl

# Load your training data
train_volumes = pl.read_csv("data/Volume.my-first-cohort.2025-12-08T16-15-31.train.csv")

# Analyze or train your model
print(f"Training set: volumes")
print(f"Volume types: ")
print(f"Image planes: ")

# The same process works for validation and test sets
val_volumes = pl.read_csv("data/Volume.my-first-cohort.2025-12-08T16-15-31.validation.csv")
test_volumes = pl.read_csv("data/Volume.my-first-cohort.2025-12-08T16-15-31.test.csv")

print(f"Validation set: volumes")
print(f"Test set: volumes")

Key Concepts

  • Lockfile: A snapshot of your cohort at a specific timestamp. Share this to ensure reproducibility.
  • Splits: Organize your data into train/validation/test (or any custom splits you define).
  • Versioning: The timestamp ensures you always get the same data, even as the platform evolves.
  • Registered Scripts: Scripts in vxp_client/cohort_scripts/registered_scripts/ can be compiled by name without providing a file path.

Next Steps

  • Learn more about querying resources
  • Explore payload schemas to understand available data types
  • Check out more complex cohort examples in the registered_scripts directory