Working with Datasets

Datasets in VxPlatform are a set of features that allow for the efficient definition & sharing of reproducible sets of data.

A dataset recipe is a logical grouping of resources by certain criteria or rules. For example: the viseg training recipe describes the set of all T2 images where we have segmentation annotations available.

A dataset lockfile is a list of resources together with their exact version numbers. For example: the lockfile of the viseg training dataset lists the specific T2 images and masks that the recipe selected at the time the dataset was locked.

Recipes are defined through Python code that implements the logic used to select which resources to include and which to exclude. Lockfiles are necessary for reproducibility: as our trial data grows continuously, a lockfile records the exact set of resources a recipe included at a certain point in time.

You can develop & lock recipes locally on your own. Once you are ready to share the results with others, you can submit your script to the data-platform repository and register it to be available to others for easy access.

Creating a Dataset Recipe

A "recipe" is nothing but a script that queries a VxPlatform instance for relevant resources. The output of a recipe is effectively a list of resources along with specific version numbers.

You can use the following template for your Python file to make it compatible with existing vxp_client tooling, including the VxPlatform backend:

import vxp_client

@vxp_client.dataset_compilation_entrypoint()
def all_basel_data(
    client: vxp_client.PlatformClient
) -> dict[str, list[dict]]:
    """Returns all data associated with the Basel trial."""
    r_basel = client.get_resource("trial/basel")
    ...
    result = {
        "train": [ ... ],
        "test": [ ... ]
    }
    return result

Your script does not need to contain anything else - the dataset_compilation_entrypoint decorator takes care of the rest.

How do I pick resources?

Inside of the decorated function, you can use whatever tooling you want. There are no restrictions on what packages to import, or by what logic to select data. You could query the weather forecast or ask an LLM what resources to include.
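As one illustration of selection logic that stays reproducible, the sketch below assigns resources to splits by hashing an identifier, so the same resource always lands in the same split regardless of input order. It is a minimal sketch, not part of vxp_client: the helper names and the "id" key are hypothetical, and the resource dicts stand in for whatever your queries return.

```python
import hashlib


def assign_split(resource_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a resource ID to "train" or "test"."""
    # Hash the ID so the assignment is stable across runs and machines.
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"


def split_resources(
    resources: list[dict], test_fraction: float = 0.2
) -> dict[str, list[dict]]:
    """Group resource dicts into the {"train": [...], "test": [...]} shape."""
    result: dict[str, list[dict]] = {"train": [], "test": []}
    for resource in resources:
        # "id" is a hypothetical key; use whatever uniquely identifies a resource.
        result[assign_split(resource["id"], test_fraction)].append(resource)
    return result
```

Because the assignment depends only on the identifier, adding new resources later does not move existing ones between splits.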

Locking a Dataset

Once your recipe script is ready, you can try locking it using:

vxp datasets lock --file path/to/your/script.py

This will produce a resources.train.jsonl and resources.test.jsonl file on your disk.

Note that the train/test identifiers come from the object returned by the dataset recipe script: the top-level keys of the returned dictionary name the "splits" of the dataset. Downstream tooling takes this information into account during processing (e.g. by moving files into separate folders).
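The relationship between split keys and file names can be sketched as follows. This is not the actual implementation behind vxp datasets lock, only an illustration of the naming scheme described above. It assumes each resource is a JSON-serializable dict, and write_split_files is a hypothetical helper.

```python
import json
from pathlib import Path


def write_split_files(result: dict[str, list[dict]], out_dir: Path) -> list[Path]:
    """Write one resources.<split>.jsonl file per top-level key of a recipe result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split, resources in result.items():
        path = out_dir / f"resources.{split}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for resource in resources:
                # JSON Lines: one JSON object per line.
                fh.write(json.dumps(resource) + "\n")
        written.append(path)
    return written
```

A recipe returning the keys "train" and "test" would therefore yield resources.train.jsonl and resources.test.jsonl under this scheme.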

Processing a Lockfile

We now have one or more dataset lockfiles to work with.

You could now parse these JSONL files yourself and, for example, copy image data from the paths specified in resources such as DICOMSeries or Volume.
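A minimal sketch of parsing a lockfile by hand might look like this. It assumes the .jsonl files follow the usual JSON Lines convention of one JSON object per line. The entry fields are not documented here, so the sketch does not access any, and iter_lockfile is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_lockfile(path: Path) -> Iterator[dict]:
    """Yield one parsed entry per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Point at the offending line to make broken files easy to fix.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

Usage would be along the lines of `for entry in iter_lockfile(Path("resources.train.jsonl")): ...`, inspecting each entry to find the paths you need.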

vxp_client also implements & exposes a few methods for commonly used operations on the resources contained in such dataset lockfiles:

vxp materialize -f resources.train.jsonl \
    -m bids \
    -m clinical-to-csv \
    -m ...

Sharing & Publishing a Dataset

To make your dataset recipe available on VxPlatform, you need to register it in the dataset_repository project. This means putting your script in the apps/dataset_repository folder and adding an entry to the apps/dataset_repository/datasets.yaml file. The entry in the YAML file needs to look like this:

- name: my_dataset_name
  version: 1
  description: A description of what this dataset contains.
  script_location: ./relative/path/to/your/script/from/dataset_repository/root.py

Commit this to a separate branch and open a PR to get this deployed.
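Before opening the PR, a quick local sanity check of the new entry can catch typos. The sketch below is not part of the platform tooling. It assumes all four keys shown in the example entry are required and that script_location is resolved relative to the dataset_repository root. check_entry is a hypothetical helper that operates on an entry already parsed into a dict.

```python
from pathlib import Path

# Keys taken from the example entry; treating all of them as required is an assumption.
REQUIRED_KEYS = ("name", "version", "description", "script_location")


def check_entry(entry: dict, repository_root: Path) -> list[str]:
    """Return a list of problems found in one parsed datasets.yaml entry."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    location = entry.get("script_location")
    # Verify that the referenced script exists relative to the repository root.
    if location is not None and not (repository_root / location).is_file():
        problems.append(f"script not found: {location}")
    return problems
```

An empty list means the entry passed these basic checks.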