Working with Datasets

Datasets in VxPlatform are a set of features for efficiently defining and sharing reproducible sets of data.

A dataset recipe is a logical grouping of resources by certain criteria or rules. For example: the viseg training recipe describes the set of all T2 images where we have segmentation annotations available.

A dataset lockfile captures the resource identifiers selected for a specific timestamp. Each lockfile stores the shared _timestamp alongside per-split lists of identifiers, ensuring we can fetch the same snapshot again later.

Recipes are defined through Python code that implements the logic used to select which resources to include and which to exclude. Lockfiles are necessary for reproducibility: because our trial data grows continuously, a lockfile records the exact set of resources a recipe included at a certain point in time.

You can develop & lock recipes locally on your own. Once you are ready to share the results with others, you can submit your script to the data-platform repository and register it to be available to others for easy access.

Creating a Dataset Recipe

A "recipe" is nothing but a script that queries a VxPlatform instance for relevant resources. The output of a recipe is a mapping of dataset splits to resource identifiers.

You can use the following template for your Python file to make it compatible with existing vxp_client tooling, including the VxPlatform backend:

import vxp_client


@vxp_client.dataset_compilation_entrypoint()
def all_basel_data(
    client: vxp_client.PlatformClient,
) -> dict[str, list[str]]:
    """Returns all data associated with the Basel trial."""
    r_basel = client.get_resource("trial/basel")
    ...
    # Top-level keys name the dataset splits; values list resource identifiers.
    result = {
        "train": [
            "trial/basel",
            "study/123",
        ],
        "test": [ ... ],
    }
    return result

Your script does not need to contain anything else - the dataset_compilation_entrypoint decorator takes care of the rest.

How do I pick resources?

Inside of the decorated function, you can use whatever tooling you want. There are no restrictions on what packages to import, or by what logic to select data. You could query the weather forecast or ask an LLM what resources to include.
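As a minimal sketch of one possible selection strategy, the helper below assigns identifiers to splits by hashing each identifier, so that re-running the recipe keeps every resource in the same split regardless of input order. The name split_by_hash and the default 80/20 ratio are illustrative assumptions and not part of vxp_client; the list of identifiers would come from whatever queries your recipe runs.

```python
import hashlib


def split_by_hash(
    identifiers: list[str], test_fraction: float = 0.2
) -> dict[str, list[str]]:
    """Assign each identifier to "train" or "test" based on a stable hash."""
    splits: dict[str, list[str]] = {"train": [], "test": []}
    # Sorting makes the order inside each split independent of input order.
    for identifier in sorted(identifiers):
        digest = hashlib.sha256(identifier.encode("utf-8")).digest()
        # Turn the first 8 bytes of the digest into a number between 0 and 1.
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        target = "test" if bucket < test_fraction else "train"
        splits[target].append(identifier)
    return splits
```

The returned dictionary has the same shape as the recipe's return value, so it could be returned directly from the decorated function.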

Locking a Dataset

Once your recipe script is ready, you can try locking it using:

vxp datasets lock --file path/to/your/script.py --timestamp 2024-01-01T00:00:00Z

This will produce files like resources.train.2024-01-01T00-00-00Z.jsonl and resources.test.2024-01-01T00-00-00Z.jsonl on your disk.

If you omit --timestamp, the CLI will use the current UTC time when invoking the script.
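The file names above suggest that the colons in the timestamp are replaced with hyphens. The sketch below reproduces that naming so you can locate lockfiles from a script. It is inferred from the example file names only; lockfile_name is a hypothetical helper, and the actual behavior of the CLI may differ.

```python
from datetime import datetime, timezone
from typing import Optional


def lockfile_name(split: str, timestamp: Optional[datetime] = None) -> str:
    """Build a name such as resources.train.2024-01-01T00-00-00Z.jsonl."""
    if timestamp is None:
        # Mirror the documented default: the current UTC time.
        timestamp = datetime.now(timezone.utc)
    # Assumed convention: colons in the time part become hyphens.
    stamp = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"resources.{split}.{stamp}.jsonl"
```

Pass a timezone-aware datetime; a naive one would be interpreted as local time by astimezone.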

The dataset script writes a small JSON file of the form:

{
  "_timestamp": "2024-01-01T00:00:00Z",
  "train": ["trial/basel", "study/123"],
  "test": ["study/456"]
}

Note that the train/test identifiers come from the return value of our dataset recipe script: the first level of keys in the returned dictionary indicates the "split" of the dataset. Downstream tooling takes this information into account in its processing (e.g. by moving files into separate folders).
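If you want to read such a file yourself, a minimal sketch using only the standard library could look like the following. It assumes the JSON layout shown above, with _timestamp as the only non-split key; parse_lockfile is a hypothetical helper, not a vxp_client function.

```python
import json


def parse_lockfile(text: str) -> tuple[str, dict[str, list[str]]]:
    """Separate the shared timestamp from the per-split identifier lists."""
    data = json.loads(text)
    # Every remaining top-level key is treated as a split name.
    timestamp = data.pop("_timestamp")
    return timestamp, data
```

For the example above, this returns the timestamp string and a dictionary with the keys "train" and "test".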

Processing a Lockfile

We now have one or more dataset lockfiles to work with.

You could now parse these JSON files yourself and, for example, copy image data from the paths specified in DICOMSeries or Volume resources.

The vxp_client client implements and exposes a few methods for commonly used operations on the resources contained in such dataset lockfiles:

vxp processors -f resources.train.2024-01-01T00-00-00Z.jsonl \
  -m bids \
  -m clinical-to-csv \
  -m ...

Sharing & Publishing a Dataset

To make your dataset recipe available on VxPlatform, you need to register it in the dataset_repository project. To do so, put your script in the apps/dataset_repository folder and add an entry to the apps/dataset_repository/datasets.yaml file. The entry needs to look like this:

- name: my_dataset_name
  version: 1
  description: A description of what this dataset contains.
  script_location: ./relative/path/to/your/script/from/dataset_repository/root.py

Commit this to a separate branch and open a PR to get this deployed.
