ClearML
clearml.conf MinIO Setup
To use MinIO as a data bucket for storing artifacts such as model checkpoints or datasets, you
have to adjust the clearml.conf file, usually located at ~/clearml.conf.
More detailed information can be found here.
First, create a key/secret pair. Open the local MinIO console in your browser, e.g. http://viopsy-pc:10000, and log in (default credentials: username minioadmin, password minioadmin). Navigate to the "Access Keys" tab in the menu bar and click "Create access key" in the upper right corner. Fill out the form and copy the "Access Key" and the "Secret Key"; you will need them for the clearml.conf file (see below).
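To sanity-check a freshly created key pair outside of ClearML, you can list the buckets it can access with a small boto3 sketch. Note this is an assumption on our part, not part of the ClearML setup: the boto3 dependency, the helper name, and the plain-HTTP endpoint are all illustrative.

```python
def make_minio_client(endpoint: str, access_key: str, secret_key: str):
    """Return a boto3 S3 client pointed at the local MinIO endpoint."""
    import boto3  # deferred import so the helper can be defined without boto3 installed

    return boto3.client(
        "s3",
        endpoint_url=f"http://{endpoint}",  # MinIO speaks the S3 API over plain HTTP here
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )

# Example (requires a running MinIO instance):
# s3 = make_minio_client("viopsy-pc:10000", "ACCESS_KEY", "SECRET_KEY")
# print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```

If the key pair is wrong, the `list_buckets()` call fails with an access-denied error, which is quicker to debug than a failing ClearML upload.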
Optional: If you want to create a new bucket to use as storage, navigate to the "Buckets" option in the menu bar and click "Create Bucket" in the upper right corner.
Then replace the aws entry in the clearml.conf file with the following values:
aws {
s3 {
credentials: [
{
host: "{DOMAIN}" # url has no http:// or s3://, only the domain
bucket: "{BUCKET_NAME}" # name of bucket you would like to use
key: "{ACCESS_KEY}"
secret: "{SECRET_KEY}"
multipart: false
secure: false
}
]
}
}
Example:
aws {
s3 {
credentials: [
{
host: "viopsy-pc:10000" # url has no http:// or s3://, only the domain
bucket: "clearml" # name of bucket you would like to use
key: "VT8zNshq7FvgXbPLzVfv"
secret: "qTRD9GCxkodgBhA2BbVJrqoRmhEBjJEzgim3hDkB"
multipart: false
secure: false
}
]
}
}
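The same host and bucket also make up the s3:// output_uri that tasks use when uploading artifacts. A hypothetical little helper for assembling it, with an environment-variable override like the one used in the dataset-generation snippet further down this page (the helper name and defaults are ours):

```python
import os


def build_output_uri(default_host: str = "viopsy-pc:10000", bucket: str = "clearml") -> str:
    """Assemble the s3:// output URI from the MinIO host and bucket name.

    The MINIO_HOST environment variable, when set, overrides the default host
    (useful for running the same script locally and in CI).
    """
    host = os.environ.get("MINIO_HOST", default_host)
    return f"s3://{host}/{bucket}"


# usage: Task.init(..., output_uri=build_output_uri())
```

Keeping this in one place avoids typos in the URI when several scripts upload to the same bucket.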
Dataset Loading
Using data from a ClearML dataset is straightforward: you simply request a local copy of the dataset. The data is downloaded to a local cache, so it is only fetched once. This ensures that we all use the same datasets, and that changes to datasets over time are logged.
from clearml import Dataset
dataset_name = ...
# the alias is used to register the dataset as an input dataset of the current task
alias = 'training_dataset'
dataset = Dataset.get(
dataset_name=dataset_name, alias=alias
)
# download the dataset locally
local_path = dataset.get_local_copy()
By default, ClearML always loads the newest version of a dataset. To get a specific version, use the dataset_version argument. You can also restrict the search to published datasets with the only_published argument. For full documentation on all arguments, see the ClearML documentation.
dataset = Dataset.get(
    dataset_name=dataset_name, alias=alias, dataset_version='1.0.0', only_published=True)
Source datasets
Original, unprocessed datasets are added in ClearML under datasets_source. This makes them accessible to all projects, and individual projects can then create their own datasets based on any processing needs. Currently, two source datasets are registered in ClearML: PROSTATEx-2 and PRUS. If you use these datasets in your projects, try to inherit from them! Please add any new, unprocessed datasets here as well.
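As a minimal sketch, fetching one of these source datasets could look like the following. We assume here that the source datasets live in a project called datasets_source (matching the generation example below); the helper name is ours:

```python
def get_source_dataset(dataset_name: str, alias: str = "source_dataset"):
    """Fetch one of the registered source datasets (e.g. "PROSTATEx-2" or "PRUS")."""
    from clearml import Dataset  # deferred import so the helper is importable without clearml

    return Dataset.get(
        dataset_name=dataset_name,
        dataset_project="datasets_source",  # assumed project holding the source datasets
        alias=alias,  # registers the dataset as an input of the current task
    )

# Example (requires a ClearML server):
# prostatex = get_source_dataset("PROSTATEx-2")
# local_path = prostatex.get_local_copy()
```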
Clearml Dataset Generation
To make our data processing reproducible, datasets should be uploaded to ClearML rather than used only as local copies. By saving the data generation process in a ClearML task, the generation of the dataset is also fully reproducible. To capture all information about the code, it is best to create a ClearML task and use it to generate the dataset as follows:
import os

from clearml import Dataset, Task

# make it work in a GitHub action and locally
minio_host = os.environ.get("MINIO_HOST", "viopsy-pc:10000")
# create the task
task = Task.init(
project_name="datasets_source", # project name
task_name="new_dataset_name", # name of the dataset to create
output_uri=f"s3://{minio_host}/clearml",
)
# create the dataset from the task
prostatex_dataset = Dataset.create(
dataset_tags=["PROSTATEx-2", "source_data", "public_data", "mri"],
use_current_task=True,
)
This creates an empty dataset, to which you can then add files and metadata:
from pathlib import Path

# this will add all files in data_folder to a folder called 'images' in the dataset
prostatex_dataset.add_files(data_folder, dataset_path='images')
# add a df as metadata (this can for example be a csv file)
prostatex_dataset.set_metadata(df)
# add this current script as metadata (nice to have easily accessible)
prostatex_dataset.set_metadata(Path(__file__), "source_script")
Once all data has been added, you can upload and finalize the dataset. After finalizing, no files or metadata can be added anymore.
# upload all files to clearml/minio
prostatex_dataset.upload()
# finish the dataset
prostatex_dataset.finalize()
task.close()
!!!tip
If you have finished a dataset and it is ready to be used in projects, go to the respective ClearML task and set it to published. This indicates that it is ready for use and should not be deleted.
Dataset Inheritance
To generate a new dataset from other ClearML datasets, you can use inheritance. This ensures the new dataset contains all data from its parent datasets, without duplicating the underlying files.
# create task
task = Task.init(
project_name="vireg",
task_name=cfg.output_dataset,
output_uri=f"s3://{get_minio_backend()}/clearml",
)
# dataset to inherit from (use alias to log it as an input dataset of the ClearML data generation task)
dataset_id = Dataset.get(
    dataset_name="parent_dataset", alias="input_dataset_parent_dataset"
).id
# this dataset will now contain all files in the parent dataset
child_dataset = Dataset.create(
parent_datasets=[dataset_id], # to inherit from multiple datasets, simply add more ids in the list
use_current_task=True,
)
By inheriting from datasets, ClearML automatically creates an inheritance graph which visualizes the data generation process (for an example see here):
As this feature tracks the dependencies very nicely, it can sometimes be useful to define a dataset as a parent even though all files will be changed (for example, when converting from the original DICOMs to Vicom NIfTI files). In that case you can inherit from the dataset, remove all files from it, and then add the new files. While this feels a little hacky, I (Linde) have not found a better way to do it.
# remove all files from the dataset
child_dataset.remove_files('*', recursive=True)
# add the newly generated files
child_dataset.add_files(path)
Dataset Versioning
If you create a dataset that already exists, ClearML generates a new version of it. This means that if you have to regenerate a dataset (e.g. because some data has changed), ClearML neatly keeps track of all previous versions. To avoid storing the dataset in full again each time a new version is generated, it is best practice to inherit from the previous version when you are only adding or removing files. This is easily achieved with the writable_copy flag of the Dataset.get() function, which creates a new version of the dataset with the current dataset as its parent.
# get a new version of the dataset, with the current dataset as its parent
dataset = Dataset.get(
    dataset_name=dataset_name, alias='input_dataset', writable_copy=True)
# this will now work (it would throw an error if `writable_copy` were set to False)
dataset.add_files(path)
Model Loading
How to load models from ClearML.
Note: It is also possible to connect external models to existing tasks. See the ClearML documentation for more details.
1. Load a specific model based on a model_id or a name/project/tag/published combination
from clearml import InputModel
# returns a single model even if multiple models fit the specified parameters
model = InputModel(
model_id=None,
name="TestModel",
project="TestProject",
tags=['best', 'worst'],
only_published=False
)
model_weights = model.get_weights()
2. Query a list of models based on specific attributes
from clearml import Model
model_list = Model.query_models(
project_name='TestProject', # Only models from `TestProject` project
model_name="TestModel", # Only models with model name
tags=['latest', '-best'], # Only models with the `latest` tag or without the `best` tag (see the ClearML documentation for more filter logic)
only_published=False, # If `True`, only published models are returned
include_archived=True, # If `True`, include archived models
max_results=5, # Maximum number of models returned in the list
metadata={"key":"value"} # Only models with matching metadata
)
model_0 = model_list[0].get_local_copy()