Multimodal Prediction#

For problems on multimodal data tables that contain image, text, and tabular data, AutoGluon provides MultiModalPredictor (abbreviated as AutoMM), which automatically selects, fuses, and tunes foundation models from popular packages such as timm, huggingface/transformers, CLIP, and MMDetection.

AutoMM can solve standard NLP and vision tasks such as sentiment classification, intent detection, paraphrase detection, and image classification. It can also handle multimodal problems that involve images, text, tabular features, object bounding boxes, named entities, etc. Moreover, AutoMM can be used as a base model in the multi-layer stack ensemble of AutoGluon Tabular, and it powers the FT-Transformer in TabularPredictor.
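As a rough sketch of the basic workflow, the snippet below wraps training and prediction in one function. It assumes the data are pandas DataFrames whose target column is named `label`; the helper name `train_and_predict` and the 60-second time limit are illustrative choices, not part of AutoMM. The import is done lazily so the file can be loaded without AutoGluon installed.

```python
def train_and_predict(train_df, test_df, label="label", time_limit=60):
    """Fit a MultiModalPredictor on train_df and return predictions for test_df.

    Hypothetical helper for illustration: train_df and test_df are assumed to be
    pandas DataFrames, and `label` names the target column in train_df.
    """
    # Imported inside the function so that merely defining it
    # does not require AutoGluon to be installed.
    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label=label)
    # time_limit is given in seconds; 60 is an arbitrary small budget for a demo.
    predictor.fit(train_data=train_df, time_limit=time_limit)
    return predictor.predict(test_df)
```

The tutorials listed on this page cover the task-specific options, such as presets and problem types, that this sketch leaves at their defaults.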

Here are some example use-cases of AutoMM:

  • Multilingual text classification: Tutorial

  • Predicting pets’ popularity based on their description, photo, and other metadata: Tutorial, Example

  • Predicting the price of books: Tutorial

  • Scoring students’ essays: Example

  • Image classification: Tutorial

  • Object detection: Tutorial, Example

  • Extracting named entities: Tutorial

  • Search for relevant text / image via text queries: Tutorial

  • Document Classification (Experimental): Tutorial

Below, we break down the functionality of AutoMM and provide a step-by-step guide for each part.

Text Data#

AutoMM for Text Prediction - Quick Start

How to train high-quality text prediction models with MultiModalPredictor.

AutoMM for Text Prediction - Multilingual Problems

How to use MultiModalPredictor to build models on datasets with languages other than English.

AutoMM for Named Entity Recognition - Quick Start

How to use MultiModalPredictor for entity extraction.

Image Data – Classification / Regression#

AutoMM for Image Classification - Quick Start

How to train image classification models with MultiModalPredictor.

Zero-Shot Image Classification with CLIP

How to enable zero-shot image classification in AutoMM via a pretrained CLIP model.

Image Data – Object Detection#

Quick Start on a Tiny COCO Format Dataset

How to train a high-quality object detection model with MultiModalPredictor in under 5 minutes on a COCO-format dataset.

Prepare COCO2017 Dataset

How to prepare the COCO2017 dataset for object detection.

Prepare Pascal VOC Dataset

How to prepare the Pascal VOC dataset for object detection.

Prepare Watercolor Dataset

How to prepare the Watercolor dataset for object detection.

Convert VOC Format Dataset to COCO Format

How to convert a dataset from VOC format to COCO format for object detection.
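To make the conversion concrete, here is a minimal standard-library sketch of the core step: a VOC box is stored as corner coordinates (`xmin`, `ymin`, `xmax`, `ymax`), while a COCO `bbox` is `[x, y, width, height]`. This is not the converter used in the tutorial. The sample XML, the `categories` mapping, and both function names are made up for illustration, and the sketch ignores the one-pixel offset that some converters apply to VOC coordinates.

```python
import xml.etree.ElementTree as ET

# A made-up VOC annotation used only for this illustration.
VOC_XML = """<annotation>
  <filename>000001.jpg</filename>
  <size><width>353</width><height>500</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>"""


def voc_box_to_coco(xmin, ymin, xmax, ymax):
    """Convert corner coordinates to COCO's [x, y, width, height]."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def voc_xml_to_coco(xml_text, image_id, categories):
    """Return a COCO-style image record and annotation list for one VOC file."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    image = {
        "id": image_id,
        "file_name": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
    }
    annotations = []
    for ann_id, obj in enumerate(root.iter("object"), start=1):
        box = obj.find("bndbox")
        corners = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        x, y, w, h = voc_box_to_coco(*corners)
        annotations.append({
            "id": ann_id,
            "image_id": image_id,
            "category_id": categories[obj.findtext("name")],
            "bbox": [x, y, w, h],
            "area": w * h,
            "iscrowd": 0,
        })
    return image, annotations


image, annotations = voc_xml_to_coco(VOC_XML, image_id=1, categories={"dog": 1})
print(image)
print(annotations)
```

A full COCO file would additionally collect these records under top-level `images`, `annotations`, and `categories` keys across the whole dataset.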

Object Detection with DataFrame

How to use the pd.DataFrame format for object detection.

Document Data#

AutoMM for Scanned Document Classification

How to use MultiModalPredictor to build a scanned document classifier.

Classifying PDF Documents with AutoMM

How to use MultiModalPredictor to build a PDF document classifier.

Matching#

Text-to-text Matching with AutoMM - Quick Start

How to use AutoMM for text-to-text matching.

Image-to-Image Matching with AutoMM - Quick Start

How to use AutoMM for image-to-image matching.

Image-to-Text Matching with AutoMM - Quick Start

How to use AutoMM for image-to-text matching.

Zero Shot Image-to-Text Matching with AutoMM

How to use AutoMM for zero-shot image-to-text matching.

Text Semantic Search with AutoMM

How to use semantic embeddings to improve search ranking performance.
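The idea behind embedding-based ranking can be sketched without any model at all: represent the query and each document as a vector, then sort documents by cosine similarity to the query. The three-dimensional vectors and document ids below are toy values invented for this example. In practice the vectors would come from a trained embedding model, and the function names here are hypothetical rather than part of AutoMM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scores = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real ones would have hundreds of dimensions.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.3, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Cosine similarity ignores vector length, so documents are compared by direction only. This is a common choice for text embeddings, though other similarity measures are possible.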

Multimodal Data#

AutoMM for Text + Tabular - Quick Start

How MultiModalPredictor can be applied to multimodal data tables with a mix of text, numerical, and categorical columns.

AutoMM for Image + Text + Tabular - Quick Start

How to use MultiModalPredictor to train a model on image, text, numerical, and categorical data.

AutoMM for Entity Extraction with Text and Image - Quick Start

How to use MultiModalPredictor to train a model for multimodal named entity recognition.

Advanced Topics#

Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning

How to take advantage of larger foundation models with the help of parameter-efficient finetuning. In the tutorial, we will combine IA^3, BitFit, and gradient checkpointing to finetune FLAN-T5-XL.

Hyperparameter Optimization in AutoMM

How to do hyperparameter optimization in AutoMM.

Knowledge Distillation in AutoMM

How to do knowledge distillation in AutoMM.

Continuous Training with AutoMM

How to continue training in AutoMM.

Customize AutoMM

How to customize AutoMM configurations.

AutoMM Presets

How to use AutoMM presets.

Few Shot Learning with FewShotSVMPredictor

How to use an SVM combined with feature extraction for few-shot learning.

How to use FocalLoss

How to use focal loss in AutoMM.
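For orientation, the standard binary focal loss is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below implements that formula in plain Python. It is not AutoMM's implementation, and the defaults `gamma=2.0` and `alpha=0.25` are commonly quoted values assumed here for illustration.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example.

    p is the predicted probability of the positive class and y is the true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confidently wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
print(easy, hard)
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy. Larger `gamma` values down-weight easy examples more strongly, which is why the loss is often used on imbalanced data.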

Faster Prediction with TensorRT

How to use TensorRT to accelerate AutoMM model inference.