Multimodal Prediction

For problems on multimodal data tables that contain image, text, and tabular data, AutoGluon provides MultiModalPredictor (abbreviated as AutoMM), which automatically selects and fuses deep learning backbones from popular packages such as timm, huggingface/transformers, and CLIP. You can use it to build models for problems that involve image, text, and tabular features, e.g., predicting a product's price from its description, photo, and other metadata, or matching images with text descriptions.

AutoMM also handles each modality on its own, so you can use it for standard NLP and vision tasks such as sentiment classification, intent detection, paraphrase detection, and image classification. Moreover, AutoMM can serve as a base model in the multi-layer stack ensemble of TabularPredictor.
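As a rough illustration of the workflow described above, the following minimal sketch wraps training and prediction in one helper. The function name `fit_and_predict`, its arguments, and the default time budget are hypothetical choices made for this example. It assumes the `autogluon.multimodal` package is installed and that the inputs are pandas DataFrames whose columns hold text, image paths, and tabular values, with one column used as the label.

```python
def fit_and_predict(train_df, test_df, label, time_limit=300):
    """Train a MultiModalPredictor on train_df and predict on test_df.

    train_df / test_df: pandas DataFrames (assumed), one row per example.
    label: name of the column to predict (e.g. a hypothetical "price" column).
    time_limit: training budget in seconds (an arbitrary default for this sketch).
    """
    # Imported lazily so that merely defining this helper does not
    # require AutoGluon to be installed.
    from autogluon.multimodal import MultiModalPredictor

    # The predictor inspects the columns and picks backbones on its own;
    # only the label column has to be named here.
    predictor = MultiModalPredictor(label=label)
    predictor.fit(train_data=train_df, time_limit=time_limit)

    # Returns one prediction per row of test_df.
    return predictor.predict(test_df)
```

A call could look like `fit_and_predict(train_df, test_df, label="price")`, where the DataFrames and the `price` column are placeholders for your own data. The quick-start tutorials listed below cover real datasets and the available options in detail.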

The tutorials below show how to use AutoMM to solve problems that involve image, text, and tabular data.

Text Prediction and Entity Extraction

AutoMM for Text Prediction - Quick Start (text_prediction/beginner_text.html)

How to train high-quality text prediction models with MultiModalPredictor in under 5 minutes.

AutoMM for Text Prediction - Multilingual Problems (text_prediction/multilingual_text.html)

How to use MultiModalPredictor to build models on datasets with languages other than English.

Named Entity Recognition with AutoMM - Quick Start (text_prediction/ner.html)

How to use MultiModalPredictor for entity extraction.

Image Prediction

AutoMM for Image Classification - Quick Start (image_prediction/beginner_image_cls.html)

How to train image classification models with MultiModalPredictor.

Zero-Shot Image Classification with CLIP (image_prediction/clip_zeroshot.html)

How to enable zero-shot image classification in AutoMM via a pretrained CLIP model.

Object Detection

Quick Start on a Tiny COCO Format Dataset (object_detection/quick_start/quick_start_coco.html)

How to train a high-quality object detection model with MultiModalPredictor in under 5 minutes on a COCO format dataset.

Prepare COCO2017 Dataset (object_detection/data_preparation/prepare_coco17.html)

How to prepare the COCO2017 dataset for object detection.

Prepare Pascal VOC Dataset (object_detection/data_preparation/prepare_voc.html)

How to prepare the Pascal VOC dataset for object detection.

Prepare Watercolor Dataset (object_detection/data_preparation/prepare_watercolor.html)

How to prepare the Watercolor dataset for object detection.

Convert VOC Format Dataset to COCO Format (object_detection/data_preparation/voc_to_coco.html)

How to convert a dataset from VOC format to COCO format for object detection.
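One core step in such a conversion is the bounding-box layout: VOC annotations store corner coordinates (xmin, ymin, xmax, ymax), while COCO annotations store [x, y, width, height] with (x, y) as the top-left corner. The sketch below shows only that arithmetic. The function names are hypothetical, and it does not apply the one-pixel offset that some converters use for VOC's 1-based coordinates, nor does it handle the surrounding XML and JSON file formats.

```python
def voc_box_to_coco(xmin, ymin, xmax, ymax):
    """Convert a VOC corner box to a COCO [x, y, width, height] box."""
    if xmax < xmin or ymax < ymin:
        raise ValueError("expected xmin <= xmax and ymin <= ymax")
    # The top-left corner is unchanged; the far corner becomes a size.
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def coco_box_to_voc(x, y, width, height):
    """Inverse conversion: COCO [x, y, width, height] to VOC corners."""
    if width < 0 or height < 0:
        raise ValueError("expected non-negative width and height")
    return [x, y, x + width, y + height]


if __name__ == "__main__":
    box = voc_box_to_coco(10, 20, 110, 70)
    print(box)                    # [10, 20, 100, 50]
    print(coco_box_to_voc(*box))  # [10, 20, 110, 70]
```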

Fast Finetune on COCO Format Dataset (object_detection/finetune/detection_fast_finetune_coco.html)

How to quickly finetune a pretrained model on a dataset in COCO format.

High Performance Finetune on COCO Format Dataset (object_detection/finetune/detection_high_performance_finetune_coco.html)

How to finetune a pretrained model for high performance on a dataset in COCO format.

Inference using a pretrained model - Quick Start (object_detection/inference/detection_inference_quick_start.html)

How to run inference with a pretrained model on a small dataset in COCO format.

Inference using a pretrained model - COCO dataset (object_detection/inference/detection_inference_coco.html)

How to run inference with a pretrained model on the COCO dataset.

Inference using a pretrained model - VOC dataset (object_detection/inference/detection_inference_voc.html)

How to run inference with a pretrained model on the VOC dataset.

Evaluate Pretrained YOLOv3 on COCO Format Dataset (object_detection/evaluation/detection_eval_yolov3_coco.html)

How to evaluate the very fast pretrained YOLOv3 model on a dataset in COCO format.

Evaluate Pretrained Faster R-CNN on COCO Format Dataset (object_detection/evaluation/detection_eval_fasterrcnn_coco.html)

How to evaluate the high-performance pretrained Faster R-CNN model on a dataset in COCO format.

Evaluate Pretrained Deformable DETR on COCO Format Dataset (object_detection/evaluation/detection_eval_ddetr_coco.html)

How to evaluate the higher-performance pretrained Deformable DETR model on a dataset in COCO format.

Evaluate Pretrained Faster R-CNN on VOC Format Dataset (object_detection/evaluation/detection_eval_fasterrcnn_voc.html)

How to evaluate the pretrained Faster R-CNN model on a dataset in VOC format.

Matching

Text-to-text Matching with AutoMM - Quick Start (matching/text2text_matching.html)

How to use AutoMM for text-to-text matching.

Semantic Textual Search with AutoGluon Multimodal Matching (matching/semantic_search.html)

How to use semantic embeddings to improve search ranking performance.
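The general idea behind embedding-based ranking can be sketched without any model: given a query vector and one vector per document, score each document by cosine similarity and sort. The snippet below is a generic illustration under that assumption, not the tutorial's implementation. The function names and the toy two-dimensional vectors are hypothetical; in practice the vectors would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (doc_index, score) pairs, highest score first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    for index, score in rank_by_similarity([1.0, 0.0], docs):
        print(index, round(score, 3))
```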

Extract Image/Text Embeddings in AutoMM for Matching Problems (matching/clip_embedding.html)

How to use CLIP to extract embeddings for retrieval problems.

Multimodal Classification / Regression

AutoMM for Text + Tabular - Quick Start (mulitmodal_prediction/multimodal_text_tabular.html)

How MultiModalPredictor can be applied to multimodal data tables with a mix of text, numerical, and categorical columns. Here, we train a model to predict the price of books.

AutoMM for Image + Text + Tabular - Quick Start (mulitmodal_prediction/beginner_multimodal.html)

How to use MultiModalPredictor to train a model that predicts the adoption speed of pets.

Advanced Topics

Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning (advanced_topics/efficient_finetuning_basic.html)

How to take advantage of larger foundation models with the help of parameter-efficient finetuning. The tutorial combines IA^3, BitFit, and gradient checkpointing to finetune FLAN-T5-XL.

Customize AutoMM (advanced_topics/customization.html)

How to customize AutoMM configurations.