AutoMM Detection - Quick Start on a Tiny COCO Format Dataset

In this section, our goal is to quickly finetune a pretrained model on a small dataset in COCO format and evaluate it on the test set. Both the training and test sets are in COCO format. See Convert Data to COCO Format for how to convert other datasets to COCO format.

Setting up the imports

Make sure mmcv and mmdet are installed:

#!pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0  # To use object detection, downgrade the torch version if it's >=2.2
!mim install "mmcv==2.1.0"  # For Google Colab, use the line below instead to install mmcv
#!pip install "mmcv==2.1.0" -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.1.0/index.html
!pip install "mmdet==3.2.0"

Looking in links: https://download.openmmlab.com/mmcv/dist/cu124/torch2.5.0/index.html
Requirement already satisfied: mmcv==2.1.0 in /home/ci/opt/venv/lib/python3.11/site-packages (2.1.0)
Requirement already satisfied: addict in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (2.4.0)
Requirement already satisfied: mmengine>=0.3.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (0.10.5)
Requirement already satisfied: numpy in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (1.26.4)
Requirement already satisfied: packaging in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (24.2)
Requirement already satisfied: Pillow in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (11.1.0)
Requirement already satisfied: pyyaml in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (6.0.2)
Requirement already satisfied: yapf in /home/ci/opt/venv/lib/python3.11/site-packages (from mmcv==2.1.0) (0.43.0)
Requirement already satisfied: matplotlib in /home/ci/opt/venv/lib/python3.11/site-packages (from mmengine>=0.3.0->mmcv==2.1.0) (3.10.0)
Requirement already satisfied: rich in /home/ci/opt/venv/lib/python3.11/site-packages (from mmengine>=0.3.0->mmcv==2.1.0) (13.9.4)
Requirement already satisfied: termcolor in /home/ci/opt/venv/lib/python3.11/site-packages (from mmengine>=0.3.0->mmcv==2.1.0) (2.5.0)
Requirement already satisfied: opencv-python>=3 in /home/ci/opt/venv/lib/python3.11/site-packages (from mmengine>=0.3.0->mmcv==2.1.0) (4.11.0.86)
Requirement already satisfied: platformdirs>=3.5.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from yapf->mmcv==2.1.0) (4.3.6)
Requirement already satisfied: contourpy>=1.0.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (4.55.8)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (1.4.8)
Requirement already satisfied: pyparsing>=2.3.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (3.2.1)
Requirement already satisfied: python-dateutil>=2.7 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (2.9.0.post0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from rich->mmengine>=0.3.0->mmcv==2.1.0) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from rich->mmengine>=0.3.0->mmcv==2.1.0) (2.19.1)
Requirement already satisfied: mdurl~=0.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->mmengine>=0.3.0->mmcv==2.1.0) (0.1.2)
Requirement already satisfied: six>=1.5 in /home/ci/opt/venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->mmengine>=0.3.0->mmcv==2.1.0) (1.17.0)
Requirement already satisfied: mmdet==3.2.0 in /home/ci/opt/venv/lib/python3.11/site-packages (3.2.0)
Requirement already satisfied: matplotlib in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (3.10.0)
Requirement already satisfied: numpy in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (1.26.4)
Requirement already satisfied: pycocotools in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (2.0.8)
Requirement already satisfied: scipy in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (1.15.1)
Requirement already satisfied: shapely in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (2.0.7)
Requirement already satisfied: six in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (1.17.0)
Requirement already satisfied: terminaltables in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (3.1.10)
Requirement already satisfied: tqdm in /home/ci/opt/venv/lib/python3.11/site-packages (from mmdet==3.2.0) (4.67.1)
Requirement already satisfied: contourpy>=1.0.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (4.55.8)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (24.2)
Requirement already satisfied: pillow>=8 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (11.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (3.2.1)
Requirement already satisfied: python-dateutil>=2.7 in /home/ci/opt/venv/lib/python3.11/site-packages (from matplotlib->mmdet==3.2.0) (2.9.0.post0)
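
As a quick sanity check after installation, the installed versions can be compared against the pins used above. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of AutoGluon, mmcv, or mmdet.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Pins taken from the install commands above.
expected_versions = {"mmcv": "2.1.0", "mmdet": "3.2.0"}

for name, expected in expected_versions.items():
    found = installed_version(name)
    status = "ok" if found == expected else "check"
    print(f"{name}: expected {expected}, found {found} [{status}]")
```

A `found` value of `None` means the package is not installed in the current environment.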

To start, let’s import MultiModalPredictor:

from autogluon.multimodal import MultiModalPredictor

/home/ci/opt/venv/lib/python3.11/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \

We also import some other packages that will be used in this tutorial:

import os
import time

from autogluon.core.utils.loaders import load_zip

Downloading Data

We have the sample dataset ready in the cloud. Let’s download it:

zip_file = "https://automl-mm-bench.s3.amazonaws.com/object_detection_dataset/tiny_motorbike_coco.zip"
download_dir = "./tiny_motorbike_coco"

load_zip.unzip(zip_file, unzip_dir=download_dir)
data_dir = os.path.join(download_dir, "tiny_motorbike")
train_path = os.path.join(data_dir, "Annotations", "trainval_cocoformat.json")
test_path = os.path.join(data_dir, "Annotations", "test_cocoformat.json")

Downloading ./tiny_motorbike_coco/file.zip from https://automl-mm-bench.s3.amazonaws.com/object_detection_dataset/tiny_motorbike_coco.zip...

0%|          | 0.00/21.8M [00:00<?, ?iB/s]
48%|████▊     | 10.5M/21.8M [00:00<00:00, 105MiB/s]
96%|█████████▋| 21.0M/21.8M [00:00<00:00, 103MiB/s]
100%|██████████| 21.8M/21.8M [00:00<00:00, 74.2MiB/s]

When using a COCO-format dataset, the input is the JSON annotation file of the dataset split. In this example, trainval_cocoformat.json is the annotation file of the train-and-validate split, and test_cocoformat.json is the annotation file of the test split.
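
To get a feel for what such an annotation file contains, the sketch below reads a COCO-format JSON file and counts its images, annotations, and categories. It relies only on the standard COCO top-level keys (`images`, `annotations`, `categories`). The helper `summarize_coco` and the toy data are hypothetical and exist only so the example runs without the download.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_coco(annotation_path):
    """Return basic counts from a COCO-format annotation file."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    per_category = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return {
        "num_images": len(coco["images"]),
        "num_annotations": len(coco["annotations"]),
        "categories": sorted(id_to_name.values()),
        "annotations_per_category": dict(per_category),
    }


# Tiny synthetic annotation file (hypothetical data, not the tutorial dataset).
toy = {
    "images": [{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50], "area": 5000, "iscrowd": 0},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [200, 40, 60, 120], "area": 7200, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "motorbike"}, {"id": 2, "name": "person"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_cocoformat.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy, f)
    summary = summarize_coco(toy_path)

print(summary)
```

With the tutorial data, calling `summarize_coco(train_path)` or `summarize_coco(test_path)` should report the same fields for the real splits.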

Creating the MultiModalPredictor

We select the "medium_quality" preset, which uses a YOLOX-large model pretrained on the COCO dataset. This preset is fast to finetune and to run inference with, and it is easy to deploy. We also provide the "high_quality" preset with a DINO-ResNet50 model and the "best_quality" preset with a DINO-SwinL model; these give much higher performance but are slower and use more GPU memory.

presets = "medium_quality"

We create the MultiModalPredictor with the selected preset. We need to set problem_type to "object_detection" and provide a sample_data_path so that the predictor can infer the categories of the dataset. Here we provide train_path; any other split of this dataset works as well. We also provide a path to save the predictor. If path is not specified, the predictor is saved to an automatically generated, timestamped directory under AutogluonModels.

# Init predictor
import uuid

model_path = f"./tmp/{uuid.uuid4().hex}-quick_start_tutorial_temp_save"

predictor = MultiModalPredictor(
    problem_type="object_detection",
    sample_data_path=train_path,
    presets=presets,
    path=model_path,
)

Finetuning the Model

The learning rate, number of epochs, and batch size are included in the preset, so there is no need to specify them. Note that by default we use a two-stage learning rate during finetuning, in which the model head uses a 100x learning rate. Applying the higher learning rate only to the head layers makes the model converge faster during finetuning. It usually gives better performance as well, especially on small datasets with hundreds or thousands of images. We also time the fit process to get a better sense of its speed. We ran it on a g4.2xlarge EC2 machine on AWS, and part of the command output is shown after the fit call.
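
The two-stage learning rate idea can be illustrated with a small sketch that splits parameters into a backbone group and a head group, giving the head a 100x learning rate. This is not AutoMM's actual implementation. The function `build_param_groups`, the base learning rate, and the non-head parameter names are assumptions for illustration; only the `bbox_head.` prefix is taken from the training log. Parameter names stand in for tensors so the example runs without PyTorch.

```python
import math


def build_param_groups(param_names, base_lr, head_prefixes, head_lr_mult=100.0):
    """Split parameters into backbone and head groups with different learning rates."""
    head, backbone = [], []
    for name in param_names:
        if any(name.startswith(prefix) for prefix in head_prefixes):
            head.append(name)
        else:
            backbone.append(name)
    return [
        {"params": backbone, "lr": base_lr},                 # pretrained layers: small steps
        {"params": head, "lr": base_lr * head_lr_mult},      # new head: 100x larger steps
    ]


# Hypothetical parameter names, except for the bbox_head entries seen in the log.
names = [
    "backbone.stem.conv.weight",
    "neck.out_convs.0.weight",
    "bbox_head.multi_level_conv_cls.0.weight",
    "bbox_head.multi_level_conv_cls.0.bias",
]
groups = build_param_groups(names, base_lr=1e-4, head_prefixes=("bbox_head.",))

for group in groups:
    print(f"lr={group['lr']:.4f} params={group['params']}")
```

The intuition is that the classification head is re-initialized for the new set of categories (the size-mismatch messages in the log reflect this), so it benefits from larger updates than the pretrained layers.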

start = time.time()
predictor.fit(train_path)  # Fit
train_end = time.time()

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Downloading yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth from https://download.openmmlab.com/mmdetection/v2.0/yolox/yolox_l_8x8_300e_coco/yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth...
Loads checkpoint by local backend from path: yolox_l_8x8_300e_coco_20211126_140236-d3bd2b23.pth
The model and loaded state dict do not match exactly

size mismatch for bbox_head.multi_level_conv_cls.0.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 256, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.0.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([10]).
size mismatch for bbox_head.multi_level_conv_cls.1.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 256, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.1.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([10]).
size mismatch for bbox_head.multi_level_conv_cls.2.weight: copying a param with shape torch.Size([80, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 256, 1, 1]).
size mismatch for bbox_head.multi_level_conv_cls.2.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([10]).

=================== System Info ===================
AutoGluon Version:  1.2.1b20250206
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count:          8
Pytorch Version:    2.5.1+cu124
CUDA Version:       12.4
Memory Avail:       28.41 GB / 30.95 GB (91.8%)
Disk Space Avail:   WARNING, an exception (FileNotFoundError) occurred while attempting to get available disk space. Consider opening a GitHub Issue.
===================================================
Using default root folder: ./tiny_motorbike_coco/tiny_motorbike/Annotations/... Specify `model.mmdet_image.coco_root=...` in hyperparameters if you think it is wrong.
AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save
    ```
Seed set to 0
0%|          | 0.00/217M [00:00<?, ?iB/s]
99%|█████████▉| 216M/217M [00:12<00:00, 18.6MiB/s]
/home/ci/opt/venv/lib/python3.11/site-packages/mmengine/runner/checkpoint.py:347: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(filename, map_location=map_location)
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.43GB/15.0GB (Used/Total)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name              | Type                             | Params | Mode 
-------------------------------------------------------------------------------
0 | model             | MMDetAutoModelForObjectDetection | 54.2 M | train
1 | validation_metric | MeanAveragePrecision             | 0      | train
-------------------------------------------------------------------------------
54.2 M    Trainable params
0         Non-trainable params
54.2 M    Total params
216.620   Total estimated model params size (MB)
592       Modules in train mode
0         Modules in eval mode
/home/ci/opt/venv/lib/python3.11/site-packages/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/ci/opt/venv/lib/python3.11/site-packages/torch/functional.py:534: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3595.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/ci/opt/venv/lib/python3.11/site-packages/mmdet/models/task_modules/assigners/sim_ota_assigner.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
Epoch 2, global step 15: 'val_map' reached 0.33114 (best 0.33114), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=2-step=15.ckpt' as top 1
Epoch 5, global step 30: 'val_map' reached 0.34902 (best 0.34902), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=5-step=30.ckpt' as top 1
Epoch 8, global step 45: 'val_map' reached 0.35936 (best 0.35936), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=8-step=45.ckpt' as top 1
Epoch 11, global step 60: 'val_map' reached 0.43478 (best 0.43478), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=11-step=60.ckpt' as top 1
Epoch 14, global step 75: 'val_map' was not in top 1
Epoch 17, global step 90: 'val_map' reached 0.44727 (best 0.44727), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=17-step=90.ckpt' as top 1
Epoch 20, global step 105: 'val_map' was not in top 1
Epoch 23, global step 120: 'val_map' was not in top 1
Epoch 26, global step 135: 'val_map' reached 0.44859 (best 0.44859), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=26-step=135.ckpt' as top 1
Epoch 29, global step 150: 'val_map' reached 0.45323 (best 0.45323), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=29-step=150.ckpt' as top 1
Epoch 32, global step 165: 'val_map' was not in top 1
Epoch 35, global step 180: 'val_map' was not in top 1
Epoch 38, global step 195: 'val_map' was not in top 1
Epoch 41, global step 210: 'val_map' reached 0.45324 (best 0.45324), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=41-step=210.ckpt' as top 1
Epoch 44, global step 225: 'val_map' reached 0.45510 (best 0.45510), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=44-step=225.ckpt' as top 1
Epoch 47, global step 240: 'val_map' reached 0.45563 (best 0.45563), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/epoch=47-step=240.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=50` reached.
/home/ci/autogluon/multimodal/src/autogluon/multimodal/utils/checkpoint.py:63: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  avg_state_dict = torch.load(checkpoint_paths[0], map_location=torch.device("cpu"))["state_dict"]  # nosec B614
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).

Notice that at the end of each progress bar, if the checkpoint at the current stage is saved, the log prints the model's save path. In this example, it is the model_path we set above, a directory under ./tmp whose name ends in -quick_start_tutorial_temp_save.

Print out the time and we can see that it’s fast!

print("This finetuning takes %.2f seconds." % (train_end - start))

This finetuning takes 541.77 seconds.

Evaluation

To evaluate the model we just trained, run the following code.

The evaluation results are shown in the command line output. The first line is the mAP under the COCO standard, and the second line is the mAP under the VOC standard (also called mAP50). For more details about these metrics, see COCO's evaluation guideline. Note that, to demonstrate fast finetuning, we use the "medium_quality" preset; you could get better results on this dataset by simply using the "high_quality" or "best_quality" preset, or by customizing your own model and hyperparameter settings (see Customization). Some other examples are available at Fast Fine-tune Coco and High Performance Fine-tune Coco.

predictor.evaluate(test_path)
eval_end = time.time()

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
saving file at /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095124/object_detection_result_cache.json
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.01s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.08s).
Accumulating evaluation results...
DONE (t=0.04s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.358
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.516
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.379
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.215
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.450
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.751
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.416
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.440
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.392
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.522
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.812

Using default root folder: ./tiny_motorbike_coco/tiny_motorbike/Annotations/... Specify `model.mmdet_image.coco_root=...` in hyperparameters if you think it is wrong.
/home/ci/opt/venv/lib/python3.11/site-packages/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
A new predictor save path is created. This is to prevent you to overwrite previous predictor saved here. You could check current save path at predictor._save_path. If you still want to use this path, set resume=True
No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095124"

Print out the evaluation time:

print("The evaluation takes %.2f seconds." % (eval_end - train_end))

The evaluation takes 1.85 seconds.
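
For intuition about the COCO and VOC numbers in the evaluation output: both depend on the intersection over union (IoU) between a predicted box and a ground-truth box. COCO-style mAP averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95, while mAP50 uses only 0.50. The sketch below computes IoU for boxes in COCO `[x, y, width, height]` form and lists the thresholds at which one prediction would still count as a match. It is a simplified illustration, not the full COCO evaluation procedure (which also handles score ranking, per-category matching, and crowd regions); the helper name and the box values are hypothetical.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds used by COCO-style mAP (0.50:0.05:0.95).
coco_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical ground-truth box and a prediction shifted 10 pixels to the right.
ground_truth = [0, 0, 100, 100]
prediction = [10, 0, 100, 100]

iou = iou_xywh(ground_truth, prediction)
matched_at = [t for t in coco_thresholds if iou >= t]

print(f"IoU = {iou:.3f}")
print(f"Counts as a match at thresholds: {matched_at}")
```

A prediction that is only slightly misaligned passes the 0.50 threshold easily but fails the stricter ones, which is why the COCO-style mAP is lower than mAP50 in the results.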

We can load a new predictor from the previous save path (model_path), and we can also reset the number of GPUs to use if not all devices are available:

# Load and reset num_gpus
new_predictor = MultiModalPredictor.load(model_path)
new_predictor.set_num_gpus(1)

Load pretrained checkpoint: /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/tmp/cda81f8095574f0e892107ef43cb18fc-quick_start_tutorial_temp_save/model.ckpt
/home/ci/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:2117: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(path, map_location=torch.device("cpu"))["state_dict"]  # nosec B614

Evaluating the new predictor gives us exactly the same result:

# Evaluate new predictor
new_predictor.evaluate(test_path)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
saving file at /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095128/object_detection_result_cache.json
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.09s).
Accumulating evaluation results...
DONE (t=0.04s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.358
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.516
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.379
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.215
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.450
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.751
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.416
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.440
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.392
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.522
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.812
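
The AP and AR rows above are reported at different IoU (intersection over union) thresholds; in the COCO convention, `IoU=0.50:0.95` averages over thresholds from 0.50 to 0.95 in steps of 0.05. As a rough illustration of what a single threshold means, the sketch below computes the IoU of two boxes in `[x1, y1, x2, y2]` form. The helper `box_iou` and the example boxes are hypothetical and are not part of AutoGluon or of the evaluation code that produced the numbers above.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes, for illustration only
predicted = [10, 10, 50, 50]
ground_truth = [20, 20, 60, 60]

iou = box_iou(predicted, ground_truth)
print(f"IoU = {iou:.3f}")
print("counts as a match at IoU=0.50:", iou >= 0.50)
```

With these two boxes the overlap is too small to count as a match at the 0.50 threshold, which is why stricter thresholds such as 0.75 generally give lower AP than 0.50.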

Using default root folder: ./tiny_motorbike_coco/tiny_motorbike/Annotations/... Specify `model.mmdet_image.coco_root=...` in hyperparameters if you think it is wrong.
/home/ci/opt/venv/lib/python3.11/site-packages/mmdet/models/backbones/csp_darknet.py:118: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
A new predictor save path is created. This is to prevent you to overwrite previous predictor saved here. You could check current save path at predictor._save_path. If you still want to use this path, set resume=True
No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095128"

{'map': 0.3583638102025215,
 'mean_average_precision': 0.3583638102025215,
 'map_50': 0.5162189109732803,
 'map_75': 0.37926466733124664,
 'map_small': 0.21460996477647665,
 'map_medium': 0.45018566230019214,
 'map_large': 0.7510578004619188,
 'mar_1': 0.25046276720695326,
 'mar_10': 0.4161428235846841,
 'mar_100': 0.4395503875968992,
 'mar_small': 0.3920833333333334,
 'mar_medium': 0.5222222222222223,
 'mar_large': 0.8122986954565902}
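
One way to check that two predictors really produce the same metrics is to compare the returned dictionaries key by key. The sketch below assumes both results are flat dictionaries of floats like the one printed above; `metrics_match` is a hypothetical helper, and the two example dictionaries reuse a subset of the values shown above purely for illustration.

```python
import math


def metrics_match(a, b, rel_tol=1e-9):
    """Return True if both dicts have the same keys and numerically equal values."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[key], b[key], rel_tol=rel_tol) for key in a)


# Illustrative values only (a subset of the metrics printed above)
original_metrics = {"map": 0.3583638102025215, "map_50": 0.5162189109732803}
reloaded_metrics = {"map": 0.3583638102025215, "map_50": 0.5162189109732803}

print(metrics_match(original_metrics, reloaded_metrics))
```

In the tutorial's setting, the same helper could be called on the outputs of `predictor.evaluate(test_path)` and `new_predictor.evaluate(test_path)`.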

To learn how to set the hyperparameters and finetune the model for higher performance, see AutoMM Detection - High Performance Finetune on COCO Format Dataset.

Inference¶

Now that we have gone through model setup, finetuning, and evaluation, this section covers inference. Specifically, we lay out the steps for using the model to make predictions and visualize the results.

To run inference on the entire test set, run:

pred = predictor.predict(test_path)
print(len(pred))
print(pred[:3])

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
50
                                               image  \
0  ./tiny_motorbike_coco/tiny_motorbike/Annotatio...   
1  ./tiny_motorbike_coco/tiny_motorbike/Annotatio...   
2  ./tiny_motorbike_coco/tiny_motorbike/Annotatio...   

                                              bboxes  
0  [{'class': 'bicycle', 'class_id': 0, 'bbox': [...  
1  [{'class': 'motorbike', 'class_id': 7, 'bbox':...  
2  [{'class': 'person', 'class_id': 8, 'bbox': [1...


No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095130"
Saved detection results to /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095130/result.txt

The output pred is a pandas DataFrame with two columns, image and bboxes.

In image, each row contains the image path.

In bboxes, each row is a list of dictionaries, each one representing a bounding box: {"class": <predicted_class_name>, "class_id": <predicted_class_id>, "bbox": [x1, y1, x2, y2], "score": <confidence_score>}
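
To make the layout concrete, the sketch below filters one row's list of detections by confidence score and counts the remaining boxes per class. The detections are made up and only mirror the dictionary layout described above; `filter_detections` is a hypothetical helper, not an AutoGluon function.

```python
from collections import Counter


def filter_detections(bboxes, conf_threshold=0.4):
    """Keep only the detections whose score is at least conf_threshold."""
    return [det for det in bboxes if det["score"] >= conf_threshold]


# Made-up detections in the same layout as one entry of the bboxes column
example_bboxes = [
    {"class": "person", "class_id": 8, "bbox": [12.0, 30.0, 80.0, 200.0], "score": 0.91},
    {"class": "motorbike", "class_id": 7, "bbox": [40.0, 90.0, 220.0, 240.0], "score": 0.75},
    {"class": "person", "class_id": 8, "bbox": [5.0, 5.0, 20.0, 40.0], "score": 0.12},
]

kept = filter_detections(example_bboxes, conf_threshold=0.4)
counts = Counter(det["class"] for det in kept)
print(len(kept), dict(counts))
```

With real predictions, the same helper could be applied to a single row, for example `filter_detections(pred.iloc[0]["bboxes"])`.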

Note that, by default, predictor.predict does not save the detection results to a file.

To run inference and save results, run the following:

pred = predictor.predict(test_path, save_results=True, as_coco=False)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!

No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095131"
Saved detection results to /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095131/result.txt
No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095131-001"
Saved detection results as dataframe to /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095131/result.txt

Here, we save pred to a .csv file that follows exactly the same layout as pred. You can use a predictor initialized in any way (e.g., a finetuned predictor or a predictor with a pretrained model).
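
For readers who want the predictions in COCO style instead, the sketch below converts one image's detections into entries shaped like the COCO detection results format, where each box is stored as `[x, y, width, height]` rather than `[x1, y1, x2, y2]`. This is an illustration under stated assumptions, not a description of what `as_coco=True` writes: `to_coco_results` is a hypothetical helper, the input values are made up, and it assumes `class_id` can be used directly as the COCO `category_id`, which may not hold for every dataset.

```python
def to_coco_results(image_id, bboxes):
    """Convert detections with [x1, y1, x2, y2] boxes to COCO-style result entries."""
    results = []
    for det in bboxes:
        x1, y1, x2, y2 = det["bbox"]
        results.append(
            {
                "image_id": image_id,
                # Assumption: class_id matches the dataset's category id
                "category_id": det["class_id"],
                # COCO stores boxes as [x, y, width, height]
                "bbox": [x1, y1, x2 - x1, y2 - y1],
                "score": det["score"],
            }
        )
    return results


# Made-up detection, for illustration only
example_bboxes = [
    {"class": "person", "class_id": 8, "bbox": [10.0, 20.0, 50.0, 80.0], "score": 0.9},
]

coco_results = to_coco_results(image_id=0, bboxes=example_bboxes)
print(coco_results)
```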

Visualizing Results¶

To visualize the results, make sure OpenCV is installed. If it is not installed yet, install it by running:

!pip install opencv-python

Requirement already satisfied: opencv-python in /home/ci/opt/venv/lib/python3.11/site-packages (4.11.0.86)
Requirement already satisfied: numpy>=1.21.2 in /home/ci/opt/venv/lib/python3.11/site-packages (from opencv-python) (1.26.4)

To visualize the detection bounding boxes, run the following:

from autogluon.multimodal.utils import ObjectDetectionVisualizer

conf_threshold = 0.4  # Specify a confidence threshold to filter out unwanted boxes
image_result = pred.iloc[30]

img_path = image_result.image  # Select an image to visualize

visualizer = ObjectDetectionVisualizer(img_path)  # Initialize the Visualizer
out = visualizer.draw_instance_predictions(image_result, conf_threshold=conf_threshold)  # Draw detections
visualized = out.get_image()  # Get the visualized image

from PIL import Image
from IPython.display import display
img = Image.fromarray(visualized, 'RGB')
display(img)

[Image: the selected test image with the predicted bounding boxes drawn on it]

Testing on Your Own Data¶

You can also predict on your own images, supplied in various input formats. The following is an example:

Download the example image:

from autogluon.multimodal import download
image_url = "https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/detection/street_small.jpg"
test_image = download(image_url)

Downloading street_small.jpg from https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/detection/street_small.jpg...


Run inference on data described by a JSON file in COCO format (see Convert Data to COCO Format for more details about the COCO format). Because the image root defaults to the parent of the folder that contains the annotation file, we put the annotation file in its own folder:

import json
import os

# Create an input file for the demo
data = {"images": [{"id": 0, "width": -1, "height": -1, "file_name": test_image}], "categories": []}
os.makedirs("input_data_for_demo", exist_ok=True)  # Safe to re-run if the folder already exists
input_file = "input_data_for_demo/demo_annotation.json"
with open(input_file, "w") as f:
    json.dump(data, f)

pred_test_image = predictor.predict(input_file)
print(pred_test_image)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
                                     image  \
0  input_data_for_demo/../street_small.jpg   

                                              bboxes  
0  [{'class': 'person', 'class_id': 8, 'bbox': [2...

Using default root folder: input_data_for_demo/... Specify `model.mmdet_image.coco_root=...` in hyperparameters if you think it is wrong.
Saved detection results to /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095131-001/result.txt
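
The same idea extends to several images. The sketch below writes a minimal COCO-style JSON file that lists multiple images for inference only, following the demo above in using `-1` as a placeholder for width and height and leaving `categories` empty. The helper `write_inference_annotation` and the file names are hypothetical; it is run against a temporary directory here so that it does not touch your working folder.

```python
import json
import os
import tempfile


def write_inference_annotation(image_paths, annotation_path):
    """Write a minimal COCO-style JSON that lists images for inference only."""
    data = {
        "images": [
            {"id": i, "width": -1, "height": -1, "file_name": path}
            for i, path in enumerate(image_paths)
        ],
        "categories": [],
    }
    folder = os.path.dirname(annotation_path)
    if folder:  # Only create a folder when the path actually contains one
        os.makedirs(folder, exist_ok=True)
    with open(annotation_path, "w") as f:
        json.dump(data, f)
    return annotation_path


with tempfile.TemporaryDirectory() as tmp_dir:
    annotation_file = write_inference_annotation(
        ["street_small.jpg", "another_image.jpg"],  # hypothetical file names
        os.path.join(tmp_dir, "input_data_for_demo", "demo_annotation.json"),
    )
    with open(annotation_file) as f:
        loaded = json.load(f)
    print([img["file_name"] for img in loaded["images"]])
```

As in the demo, each `file_name` is expected to be resolvable relative to the image root, so the paths you pass in should be chosen with the default root in mind.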

Run inference on data in a list of image file names:

pred_test_image = predictor.predict([test_image])
print(pred_test_image)

              image                                             bboxes
0  street_small.jpg  [{'class': 'person', 'class_id': 8, 'bbox': [2...

No path specified. Models will be saved in: "AutogluonModels/ag-20250206_095134"
Saved detection results to /home/ci/autogluon/docs/tutorials/multimodal/object_detection/quick_start/AutogluonModels/ag-20250206_095134/result.txt
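
When the images live in a folder, the list of file names can be built programmatically. The sketch below collects image files from a directory using only the standard library; `list_images` is a hypothetical helper, the set of extensions is an assumption you may need to adjust, and the demo files are empty placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your data
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def list_images(folder):
    """Return the sorted paths of image files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty placeholder files for the demo
    for name in ["b.jpg", "a.PNG", "notes.txt"]:
        (Path(tmp_dir) / name).touch()
    image_files = list_images(tmp_dir)
    print([Path(p).name for p in image_files])
```

The resulting list could then be passed to the predictor in the same way as `[test_image]` above.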

Other Examples¶

See AutoMM Examples for other examples of using AutoMM.

Customization¶

To learn how to customize AutoMM, please refer to Customize AutoMM.

Citation¶

@article{DBLP:journals/corr/abs-2107-08430,
  author    = {Zheng Ge and
               Songtao Liu and
               Feng Wang and
               Zeming Li and
               Jian Sun},
  title     = {{YOLOX:} Exceeding {YOLO} Series in 2021},
  journal   = {CoRR},
  volume    = {abs/2107.08430},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.08430},
  eprinttype = {arXiv},
  eprint    = {2107.08430},
  timestamp = {Tue, 05 Apr 2022 14:09:44 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-08430.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}