Hyperparameter Optimization in AutoMM
=====================================

Hyperparameter optimization (HPO) is a method for tuning the hyperparameters of machine learning models. ML algorithms have many interacting hyperparameters that span an enormous search space, and the search space of deep learning methods is even larger than that of traditional ML algorithms. Tuning over such a large search space is challenging, but AutoMM provides various options for guiding the fitting process based on your domain knowledge and your computing resource constraints.

Create Image Dataset
--------------------

In this tutorial, we again use a subset of the `Shopee-IET dataset `__ from Kaggle for demonstration purposes. Each image contains a clothing item, and the corresponding label specifies its clothing category. Our subset of the data contains the following possible labels: ``BabyPants``, ``BabyShirt``, ``womencasualshoes``, ``womenchiffontop``.

We can load the dataset by downloading it automatically from a URL:

.. code:: python

    import autogluon.core as ag
    from autogluon.multimodal import MultiModalPredictor
    from autogluon.vision import ImageDataset
    from datetime import datetime

    # Download the archive and split it into train and test sets.
    train_data, _, test_data = ImageDataset.from_folders('https://autogluon.s3.amazonaws.com/datasets/shopee-iet.zip')
    # Keep half of the training data to speed up the demonstration.
    train_data = train_data.sample(frac=0.5)
    print(train_data)

.. parsed-literal::
    :class: output

    /home/ci/opt/venv/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
      warnings.warn(

.. parsed-literal::
    :class: output

    Downloading /home/ci/.gluoncv/archive/shopee-iet.zip from https://autogluon.s3.amazonaws.com/datasets/shopee-iet.zip...

.. parsed-literal::
    :class: output

    100%|██████████| 40895/40895 [00:01<00:00, 21595.23KB/s]

.. parsed-literal::
    :class: output

    data/
    ├── test/
    └── train/
                                                     image  label
    278  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      1
    355  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      1
    684  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      3
    180  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      0
    579  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      2
    ..                                                 ...    ...
    370  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      1
    613  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      3
    561  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      2
    171  /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      0
    50   /home/ci/.gluoncv/datasets/shopee-iet/data/tra...      0

    [400 rows x 2 columns]

This dataset contains 400 data points in total. The ``image`` column stores the path to each image, and the ``label`` column stores its class label.

The Regular Model Fitting
-------------------------

Recall that to use the default settings predefined by AutoGluon, we can fit the model using ``MultiModalPredictor`` in just a few lines of code:

.. code:: python

    predictor_regular = MultiModalPredictor(label="label")

    # Time the fit so it can be compared with the HPO run later.
    start_time = datetime.now()
    predictor_regular.fit(
        train_data=train_data,
        hyperparameters={"model.timm_image.checkpoint_name": "ghostnet_100"}
    )
    end_time = datetime.now()
    elapsed_seconds = (end_time - start_time).total_seconds()
    elapsed_min = divmod(elapsed_seconds, 60)
    print("Total fitting time: ", f"{int(elapsed_min[0])}m{int(elapsed_min[1])}s")

.. parsed-literal::
    :class: output

    Global seed set to 123
    No path specified.
    Models will be saved in: "AutogluonModels/ag-20221115_202303/"
    Downloading: "https://github.com/huawei-noah/CV-backbones/releases/download/ghostnet_pth/ghostnet_1x.pth" to /home/ci/.cache/torch/hub/checkpoints/ghostnet_1x.pth
    Auto select gpus: [0]
    Using 16bit native Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                            | Params
    ----------------------------------------------------------------------
    0 | model             | TimmAutoModelForImagePrediction | 3.9 M
    1 | validation_metric | Accuracy                        | 0
    2 | loss_func         | CrossEntropyLoss                | 0
    ----------------------------------------------------------------------
    3.9 M     Trainable params
    0         Non-trainable params
    3.9 M     Total params
    7.813     Total estimated model params size (MB)
    Epoch 0, global step 1: 'val_accuracy' reached 0.20000 (best 0.20000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=0-step=1.ckpt' as top 3
    Epoch 0, global step 3: 'val_accuracy' reached 0.32500 (best 0.32500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=0-step=3.ckpt' as top 3
    Epoch 1, global step 4: 'val_accuracy' reached 0.32500 (best 0.32500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=1-step=4.ckpt' as top 3
    Epoch 1, global step 6: 'val_accuracy' reached 0.42500 (best 0.42500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=1-step=6.ckpt' as top 3
    Epoch 2, global step 7: 'val_accuracy' reached 0.46250 (best 0.46250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=2-step=7.ckpt' as top 3
    Epoch 2, global step 9: 'val_accuracy' reached 0.45000 (best 0.46250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=2-step=9.ckpt' as top 3
    Epoch 3, global step 10: 'val_accuracy' reached 0.50000 (best 0.50000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=3-step=10.ckpt' as top 3
    Epoch 3, global step 12: 'val_accuracy' reached 0.57500 (best 0.57500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=3-step=12.ckpt' as top 3
    Epoch 4, global step 13: 'val_accuracy' reached 0.61250 (best 0.61250), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=4-step=13.ckpt' as top 3
    Epoch 4, global step 15: 'val_accuracy' reached 0.62500 (best 0.62500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=4-step=15.ckpt' as top 3
    Epoch 5, global step 16: 'val_accuracy' reached 0.63750 (best 0.63750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=5-step=16.ckpt' as top 3
    Epoch 5, global step 18: 'val_accuracy' was not in top 3
    Epoch 6, global step 19: 'val_accuracy' reached 0.67500 (best 0.67500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=6-step=19.ckpt' as top 3
    Epoch 6, global step 21: 'val_accuracy' reached 0.66250 (best 0.67500), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=6-step=21.ckpt' as top 3
    Epoch 7, global step 22: 'val_accuracy' reached 0.70000 (best 0.70000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=7-step=22.ckpt' as top 3
    Epoch 7, global step 24: 'val_accuracy' reached 0.68750 (best 0.70000), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=7-step=24.ckpt' as top 3
    Epoch 8, global step 25: 'val_accuracy' reached 0.73750 (best 0.73750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=8-step=25.ckpt' as top 3
    Epoch 8, global step 27: 'val_accuracy' reached 0.70000 (best 0.73750), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202303/epoch=8-step=27.ckpt' as top 3
    Epoch 9, global step 28: 'val_accuracy' was not in top 3
    Epoch 9, global step 30: 'val_accuracy' was not in top 3
    `Trainer.fit` stopped: `max_epochs=10` reached.

.. parsed-literal::
    :class: output

    Total fitting time:  0m50s

Let’s check the test accuracy of the fitted model:

.. code:: python

    scores = predictor_regular.evaluate(test_data, metrics=["accuracy"])
    print('Top-1 test acc: %.3f' % scores["accuracy"])

.. parsed-literal::
    :class: output

    Top-1 test acc: 0.650

Use HPO During Model Fitting
----------------------------

If you would like more control over the fitting process, you can configure hyperparameter optimization (HPO) in ``MultiModalPredictor`` by passing additional options through the ``hyperparameters`` and ``hyperparameter_tune_kwargs`` arguments of ``fit``. ``MultiModalPredictor`` supports the options described below.
We use the `Ray Tune `__ ``tune`` library as the backend, so we need to pass in either a `Tune search space `__ or an `AutoGluon search space `__, which will be converted to a Tune search space.

1. Define the search space of the ``hyperparameters`` values used to train the neural networks:

   ::

       hyperparameters = {
           "optimization.learning_rate": tune.uniform(0.00005, 0.005),
           "optimization.optim_type": tune.choice(["adamw", "sgd"]),
           "optimization.max_epochs": tune.choice([10, 20]),
           "model.timm_image.checkpoint_name": tune.choice(["swin_base_patch4_window7_224", "convnext_base_in22ft1k"])
       }

   This is an example, not an exhaustive list. You can find the full list of supported hyperparameters in `Customize AutoMM `__.

2. Define the search strategy for HPO with ``hyperparameter_tune_kwargs``. You can pass in a string or initialize a ``ray.tune.schedulers.TrialScheduler`` object.

   a. Specify how to search through your chosen hyperparameter space (supports ``random`` and ``bayes``):

      ::

          "searcher": "bayes"

   b. Specify how to schedule the jobs that train a network under a particular hyperparameter configuration (supports ``FIFO`` and ``ASHA``):

      ::

          "scheduler": "ASHA"

   c. Specify the number of HPO trials to run:

      ::

          "num_trials": 20
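To make the roles of the searcher and the scheduler more concrete, here is a small, library-free sketch. It is not how Ray Tune or AutoMM implement these components: the searcher is plain random sampling, the scheduler is a simplified synchronous successive-halving rule (ASHA itself is asynchronous), and ``toy_val_accuracy`` is a made-up stand-in for real training. All function names and the scoring formula are assumptions introduced for illustration.

```python
import math
import random

# Hypothetical search space: each hyperparameter name maps to a sampling function.
search_space = {
    "optimization.learning_rate": lambda rng: rng.uniform(0.00005, 0.005),
    "optimization.optim_type": lambda rng: rng.choice(["adamw", "sgd"]),
}


def sample_config(rng):
    """Random 'searcher': draw one configuration from the search space."""
    return {name: sampler(rng) for name, sampler in search_space.items()}


def toy_val_accuracy(config, epoch):
    """Made-up score standing in for real training (not a real model).

    It improves with the number of epochs and peaks near learning_rate = 1e-3.
    """
    lr = config["optimization.learning_rate"]
    quality = math.exp(-abs(math.log10(lr) + 3.0))
    bonus = 0.05 if config["optimization.optim_type"] == "adamw" else 0.0
    return min(1.0, (quality + bonus) * (1.0 - 0.5 ** epoch))


def successive_halving(configs, rungs=(1, 4, 16), keep_fraction=0.5):
    """Simplified synchronous 'scheduler': keep the top fraction at each rung."""
    survivors = list(configs)
    for epoch in rungs:
        ranked = sorted(
            survivors, key=lambda c: toy_val_accuracy(c, epoch), reverse=True
        )
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]  # the rest are stopped early
    return survivors[0]


rng = random.Random(123)
num_trials = 8
trials = [sample_config(rng) for _ in range(num_trials)]
best = successive_halving(trials)
print(best)
```

In this sketch, ``num_trials`` controls how many configurations the searcher proposes, while the scheduler decides how long each one is allowed to train before weaker configurations are stopped.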
Let’s run HPO over combinations of learning rates and backbone models:

.. code:: python

    from ray import tune

    predictor_hpo = MultiModalPredictor(label="label")

    # Search space: a continuous learning-rate range and two candidate backbones.
    hyperparameters = {
        "optimization.learning_rate": tune.uniform(0.00005, 0.001),
        "model.timm_image.checkpoint_name": tune.choice(["ghostnet_100", "mobilenetv3_large_100"])
    }
    hyperparameter_tune_kwargs = {
        "searcher": "bayes",  # or "random"
        "scheduler": "ASHA",
        "num_trials": 2,
    }

    # Time the HPO run so it can be compared with the regular fit.
    start_time_hpo = datetime.now()
    predictor_hpo.fit(
        train_data=train_data,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    )
    end_time_hpo = datetime.now()
    elapsed_seconds_hpo = (end_time_hpo - start_time_hpo).total_seconds()
    elapsed_min_hpo = divmod(elapsed_seconds_hpo, 60)
    print("Total fitting time: ", f"{int(elapsed_min_hpo[0])}m{int(elapsed_min_hpo[1])}s")

.. parsed-literal::
    :class: output

    Global seed set to 123
    No path specified. Models will be saved in: "AutogluonModels/ag-20221115_202355/"
    You can enable each individual trial using multiple gpus by installing ray_lightning.Supported ray_lightning versions and the compatible torch lightning versions are {'0.2.x': '1.5.x'}.
    /home/ci/opt/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py:642: DeprecationWarning: `checkpoint_dir` in `func(config, checkpoint_dir)` is being deprecated. To save and load checkpoint in trainable functions, please use the `ray.air.session` API:

    from ray.air import session

    def train(config):
        # ...
        session.report({"metric": metric}, checkpoint=checkpoint)

    For more information please see https://docs.ray.io/en/master/ray-air/key-concepts.html#session

      warnings.warn(

.. parsed-literal::
    :class: output

    == Status ==
    Current time: 2022-11-15 20:24:54 (running for 00:00:56.32)
    Memory usage on this node: 6.8/31.0 GiB
    Using AsyncHyperBand: num_stopped=1
    Bracket: Iter 4096.000: None | Iter 1024.000: None | Iter 256.000: None | Iter 64.000: None | Iter 16.000: 0.875 | Iter 4.000: 0.8125 | Iter 1.000: 0.25312499701976776
    Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/13.79 GiB heap, 0.0/6.9 GiB objects (0.0/1.0 accelerator_type:T4)
    Current best trial: 6f2d1126 with val_accuracy=0.862500011920929 and parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}
    Result logdir: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20221115_202355
    Number of trials: 2/2 (2 TERMINATED)

    Trial name  status      loc            model.timm_image....  optimization.lear...  iter  total time (s)  val_accuracy
    6f2d1126    TERMINATED  10.0.0.6:2793  ghostnet_100                   0.000763469    20        40.4923         0.8625
    7710b42e    TERMINATED  10.0.0.6:2793  mobilenetv3_lar_0cb0           0.000658298     1         2.25543        0.15
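The ``Bracket`` line in the status output lists checkpoints ("rungs") at iterations 1, 4, 16, 64, and so on, which is where the ASHA scheduler compares trials and may stop the weaker ones. These values are consistent with a geometric schedule. The sketch below reproduces them under the assumption of a grace period of 1 and a reduction factor of 4; both values are inferred from the log rather than read from the AutoMM configuration, and ``asha_rungs`` is a hypothetical helper, not a Ray Tune function.

```python
def asha_rungs(grace_period=1, reduction_factor=4, max_iter=4096):
    """Return rung milestones grace_period * reduction_factor**k up to max_iter."""
    rungs = []
    milestone = grace_period
    while milestone <= max_iter:
        rungs.append(milestone)
        milestone *= reduction_factor
    return rungs


print(asha_rungs())  # [1, 4, 16, 64, 256, 1024, 4096]
```

In the table above, trial ``7710b42e`` shows ``iter`` equal to 1, which matches a trial that was stopped at the first rung.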


.. parsed-literal::
    :class: output

    Trial 6f2d1126 reported val_accuracy=0.29 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.81 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.85 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.86 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.89 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.88 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 reported val_accuracy=0.86 with parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}.
    Trial 6f2d1126 completed. Last result: val_accuracy=0.862500011920929,should_checkpoint=True
    Trial 7710b42e reported val_accuracy=0.15 with parameters={'optimization.learning_rate': 0.0006582978183027599, 'model.timm_image.checkpoint_name': 'mobilenetv3_large_100'}. This trial completed.
    Total fitting time:  1m5s

Let’s check the test accuracy of the fitted model after HPO:

.. code:: python

    scores_hpo = predictor_hpo.evaluate(test_data, metrics=["accuracy"])
    print('Top-1 test acc: %.3f' % scores_hpo["accuracy"])

.. parsed-literal::
    :class: output

    Top-1 test acc: 0.863

In the training log, you should be able to see the current best trial, as below:

::

    Current best trial: 6f2d1126 with val_accuracy=0.862500011920929 and parameters={'optimization.learning_rate': 0.0007634687340859385, 'model.timm_image.checkpoint_name': 'ghostnet_100'}

After this simple 2-trial HPO run over different learning rates and models, the test accuracy improved from 0.650 with the out-of-the-box solution in the previous section to 0.863. HPO helps select the combination of hyperparameters with the highest validation accuracy.
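The selection rule in the last sentence can be written down in a few lines. The following is a minimal sketch that uses the two final results copied from the trial table in the log; the list layout and field names are illustrative assumptions, not an AutoMM or Ray Tune API.

```python
# Final results copied from the trial table in the training log.
trial_results = [
    {
        "trial": "6f2d1126",
        "checkpoint_name": "ghostnet_100",
        "learning_rate": 0.000763469,
        "val_accuracy": 0.8625,
    },
    {
        "trial": "7710b42e",
        "checkpoint_name": "mobilenetv3_large_100",
        "learning_rate": 0.000658298,
        "val_accuracy": 0.15,
    },
]

# Pick the configuration with the highest validation accuracy.
best_trial = max(trial_results, key=lambda result: result["val_accuracy"])
print(best_trial["trial"], best_trial["val_accuracy"])  # 6f2d1126 0.8625
```

Note that the selection is based on validation accuracy; the test set is used only afterwards to evaluate the chosen model.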