ClearML
clearml.conf MinIO Setup
To use MinIO as a data bucket for storing artifacts such as model checkpoints or datasets, you
have to adjust the clearml.conf file, usually located at ~/clearml.conf.
More detailed information can be found here.
First, create a key/secret pair. Open the local MinIO console in your browser, e.g. http://viopsy-pc:10000, and log in (default credentials: username minioadmin, password minioadmin). Navigate to the "Access Keys" tab in the menu bar and click "Create access key" in the upper right corner. Fill out the form and copy the "Access Key" and the "Secret Key"; you will need them for the clearml.conf file (see below).
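To sanity-check a freshly created key pair outside of ClearML, you can list the buckets it can access with a small boto3 sketch. Note this is an assumption on our part, not part of the ClearML setup: the boto3 dependency, the helper name, and the plain-HTTP endpoint are all illustrative.

```python
def make_minio_client(endpoint: str, access_key: str, secret_key: str):
    """Return a boto3 S3 client pointed at the local MinIO endpoint."""
    import boto3  # deferred import so the helper can be defined without boto3 installed

    return boto3.client(
        "s3",
        endpoint_url=f"http://{endpoint}",  # MinIO speaks the S3 API over plain HTTP here
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )

# Example (requires a running MinIO instance):
# s3 = make_minio_client("viopsy-pc:10000", "ACCESS_KEY", "SECRET_KEY")
# print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```

If the key pair is wrong, the `list_buckets()` call fails with an access-denied error, which is quicker to debug than a failing ClearML upload.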
Optional: If you want to create a new bucket to use as storage, navigate to the "Buckets" option in the menu bar and click "Create Bucket" in the upper right corner.
Then replace the aws entry in the clearml.conf file with the following values:
aws {
s3 {
credentials: [
{
host: "{DOMAIN}" # url has no http:// or s3://, only the domain
bucket: "{BUCKET_NAME}" # name of bucket you would like to use
key: "{ACCESS_KEY}"
secret: "{SECRET_KEY}"
multipart: false
secure: false
}
]
}
}
Example:
aws {
s3 {
credentials: [
{
host: "viopsy-pc:10000" # url has no http:// or s3://, only the domain
bucket: "clearml" # name of bucket you would like to use
key: "VT8zNshq7FvgXbPLzVfv"
secret: "qTRD9GCxkodgBhA2BbVJrqoRmhEBjJEzgim3hDkB"
multipart: false
secure: false
}
]
}
}
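The same host and bucket also make up the s3:// output_uri that tasks use when uploading artifacts. A hypothetical little helper for assembling it, with an environment-variable override like the one used in the dataset-generation snippet further down this page (the helper name and defaults are ours):

```python
import os


def build_output_uri(default_host: str = "viopsy-pc:10000", bucket: str = "clearml") -> str:
    """Assemble the s3:// output URI from the MinIO host and bucket name.

    The MINIO_HOST environment variable, when set, overrides the default host
    (useful for running the same script locally and in CI).
    """
    host = os.environ.get("MINIO_HOST", default_host)
    return f"s3://{host}/{bucket}"


# usage: Task.init(..., output_uri=build_output_uri())
```

Keeping this in one place avoids typos in the URI when several scripts upload to the same bucket.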
Dataset Loading
Using data from a ClearML dataset is straightforward: you simply request a local copy of the dataset. The data is downloaded to a local cache, so it is only fetched once. This ensures that we all use the same datasets, and that changes to datasets over time are logged.
from clearml import Dataset
dataset_name = ...
# the alias is used to register the dataset as an input dataset of the current task
alias = 'training_dataset'
dataset = Dataset.get(
dataset_name=dataset_name, alias=alias
)
# download the dataset locally
local_path = dataset.get_local_copy()
By default, ClearML always loads the newest version of a dataset. To get a specific version, use the dataset_version argument. You can also restrict the search to published datasets with the only_published argument. For full documentation on all arguments, see the ClearML documentation.
dataset = Dataset.get(
    dataset_name=dataset_name, alias=alias, dataset_version='1.0.0', only_published=True)
Source datasets
Original, unprocessed datasets are added in ClearML under datasets_source. This makes them accessible to all projects, and individual projects can then create their own datasets based on any processing needs. Currently, two source datasets are registered in ClearML: PROSTATEx-2 and PRUS. If you use these datasets in your projects, try to inherit from them! Please add any new, unprocessed datasets here as well.
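As a minimal sketch, fetching one of these source datasets could look like the following. We assume here that the source datasets live in a project called datasets_source (matching the generation example below); the helper name is ours:

```python
def get_source_dataset(dataset_name: str, alias: str = "source_dataset"):
    """Fetch one of the registered source datasets (e.g. "PROSTATEx-2" or "PRUS")."""
    from clearml import Dataset  # deferred import so the helper is importable without clearml

    return Dataset.get(
        dataset_name=dataset_name,
        dataset_project="datasets_source",  # assumed project holding the source datasets
        alias=alias,  # registers the dataset as an input of the current task
    )

# Example (requires a ClearML server):
# prostatex = get_source_dataset("PROSTATEx-2")
# local_path = prostatex.get_local_copy()
```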
Clearml Dataset Generation
To make our data processing reproducible, datasets should be uploaded to ClearML rather than used only as local copies. By saving the data generation process in a ClearML task, the generation of the dataset is also fully reproducible. To capture all information about the code, it is best to create a ClearML task and use it to generate the dataset as follows:
import os

from clearml import Dataset, Task

# make it work in a GitHub action and locally
minio_host = os.environ.get("MINIO_HOST", "viopsy-pc:10000")
# create the task
task = Task.init(
project_name="datasets_source", # project name
task_name="new_dataset_name", # name of the dataset to create
output_uri=f"s3://{minio_host}/clearml",
)
# create the dataset from the task
prostatex_dataset = Dataset.create(
dataset_tags=["PROSTATEx-2", "source_data", "public_data", "mri"],
use_current_task=True,
)
This creates an empty dataset, to which you can then add files and metadata:
from pathlib import Path

# this will add all files in data_folder to a folder called 'images' in the dataset
prostatex_dataset.add_files(data_folder, dataset_path='images')
# add a df as metadata (this can for example be a csv file)
prostatex_dataset.set_metadata(df)
# add this current script as metadata (nice to have easily accessible)
prostatex_dataset.set_metadata(Path(__file__), "source_script")
Once all data has been added, you can upload and finalize the dataset. After finalizing, no files or metadata can be added anymore.
# upload all files to clearml/minio
prostatex_dataset.upload()
# finish the dataset
prostatex_dataset.finalize()
task.close()
!!!tip
If you have finished a dataset and it is ready to be used in projects, go to the respective ClearML task and set it to published. This indicates that it is ready for use and should not be deleted.
Dataset Inheritance
To generate a new dataset from other ClearML datasets, you can use inheritance. This ensures the new dataset contains all data from its parent datasets, without duplicating the underlying files.
# create task
task = Task.init(
project_name="vireg",
task_name=cfg.output_dataset,
output_uri=f"s3://{get_minio_backend()}/clearml",
)
# dataset to inherit from (use alias to log it as an input dataset of the ClearML data generation task)
dataset_id = Dataset.get(
    dataset_name="parent_dataset", alias="input_dataset_parent_dataset"
).id
# this dataset will now contain all files in the parent dataset
child_dataset = Dataset.create(
parent_datasets=[dataset_id], # to inherit from multiple datasets, simply add more ids in the list
use_current_task=True,
)
By inheriting from datasets, ClearML automatically creates an inheritance graph which visualizes the data generation process (for an example see here):
As this feature tracks the dependencies very nicely, it can sometimes be useful to define a dataset as a parent even though all files will be changed (for example, when converting from the original DICOMs to Vicom NIfTI files). In that case you can inherit from the dataset, remove all files from it, and then add the new files. While this feels a little hacky, I (Linde) have not found a better way to do it.
# remove all files from the dataset
child_dataset.remove_files('*', recursive=True)
# add the newly generated files
child_dataset.add_files(path)
Dataset Versioning
If you create a dataset that already exists, ClearML generates a new version of it. This means that if you have to regenerate a dataset (e.g. because some data has changed), ClearML neatly keeps track of all previous versions. To avoid storing the dataset in full again each time a new version is generated, it is best practice to inherit from the previous version when you are only adding or removing files. This is easily achieved with the writable_copy flag of the Dataset.get() function, which creates a new version of the dataset with the current dataset as its parent.
# get a new version of the dataset, with the current dataset as its parent
dataset = Dataset.get(
    dataset_name=dataset_name, alias='input_dataset', writable_copy=True)
# this will now work (it would throw an error if `writable_copy` were set to False)
dataset.add_files(path)
Model Loading
How to load models from ClearML.
Note: It is also possible to connect external models to existing tasks. See the ClearML documentation for more details.
1. Load a specific model based on a model_id or a name/project/tag/published combination
from clearml import InputModel
# returns a single model even if multiple models fit the specified parameters
model = InputModel(
model_id=None,
name="TestModel",
project="TestProject",
tags=['best', 'worst'],
only_published=False
)
model_weights = model.get_weights()
2. Query a list of models based on specific attributes
from clearml import Model
model_list = Model.query_models(
project_name='TestProject', # Only models from `TestProject` project
model_name="TestModel", # Only models with model name
tags=['latest', '-best'], # Only models with the `latest` tag or without the `best` tag (see the ClearML documentation for more filter logic)
only_published=False, # If `True`, only published models are returned
include_archived=True, # If `True`, include archived models
max_results=5, # Maximum number of models returned in the list
metadata={"key":"value"} # Only models with matching metadata
)
model_0 = model_list[0].get_local_copy()