AutoMM for Text + Tabular - Quick Start
In many applications, text data may be mixed with numeric/categorical data.
AutoGluon’s MultiModalPredictor can train a single neural network that jointly operates on multiple feature types,
including text, categorical, and numerical columns. The general idea is to embed the text, categorical and numeric fields
separately and fuse these features across modalities. This tutorial demonstrates such an application.
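The "embed separately, then fuse" idea can be sketched in a few lines of plain Python. This is a toy illustration only, not AutoGluon's implementation: the encoders (`embed_text`, `embed_category`, `embed_numeric`), the concatenation-based `fuse`, and the untrained linear head in `predict` are all hypothetical stand-ins for the pretrained Transformer and learned layers a real model would use.

```python
import random
import zlib

EMBED_DIM = 4  # toy size; real models use hundreds of dimensions
rng = random.Random(0)

def embed_text(text):
    """Toy text encoder: hash each token into one of EMBED_DIM buckets, then average."""
    vec = [0.0] * EMBED_DIM
    tokens = text.lower().split()
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % EMBED_DIM] += 1.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]

_category_table = {}

def embed_category(value):
    """Toy categorical encoder: one randomly initialised vector per distinct value."""
    if value not in _category_table:
        _category_table[value] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return _category_table[value]

_numeric_weights = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]

def embed_numeric(x):
    """Toy numeric encoder: project the scalar onto EMBED_DIM weights."""
    return [x * w for w in _numeric_weights]

def fuse(row):
    """Fuse the modalities by concatenating their embeddings."""
    return (embed_text(row["Synopsis"])
            + embed_category(row["BookCategory"])
            + embed_numeric(row["Reviews"]))

_head_weights = [rng.uniform(-1, 1) for _ in range(3 * EMBED_DIM)]

def predict(row):
    """Linear head on the fused features; untrained, so the value is arbitrary."""
    return sum(w * f for w, f in zip(_head_weights, fuse(row)))

row = {
    "Synopsis": "A handful of grain is found in the pocket",
    "BookCategory": "Crime, Thriller & Mystery",
    "Reviews": 4.1,
}
print(len(fuse(row)))  # 12: three modalities, EMBED_DIM features each
```

In a trained model the encoders and the head are learned jointly, so the gradient from the price loss shapes all three embeddings at once.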
import numpy as np
import pandas as pd
import warnings
import os
warnings.filterwarnings('ignore')
np.random.seed(123)
!python3 -m pip install openpyxl
Collecting openpyxl
Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
Book Price Prediction Data
For demonstration, we use the book price prediction dataset from the MachineHack Book Price Prediction Hackathon. Our goal is to predict a book’s price given various features like its author, the abstract, the book’s rating, etc.
!mkdir -p price_of_books
!wget https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip -O price_of_books/Data.zip
!cd price_of_books && unzip -o Data.zip
!ls price_of_books/Participants_Data
--2024-12-13 07:42:20-- https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip
Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.217.1.220, 52.217.138.9, 52.217.116.249, ...
Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.217.1.220|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3521673 (3.4M) [application/zip]
Saving to: ‘price_of_books/Data.zip’
price_of_books/Data 0%[ ] 0 --.-KB/s
price_of_books/Data 100%[===================>] 3.36M --.-KB/s in 0.08s
2024-12-13 07:42:20 (40.0 MB/s) - ‘price_of_books/Data.zip’ saved [3521673/3521673]
Archive: Data.zip
inflating: Participants_Data/Data_Test.xlsx
inflating: Participants_Data/Data_Train.xlsx
inflating: Participants_Data/Sample_Submission.xlsx
Data_Test.xlsx Data_Train.xlsx Sample_Submission.xlsx
train_df = pd.read_excel(os.path.join('price_of_books', 'Participants_Data', 'Data_Train.xlsx'), engine='openpyxl')
train_df.head()
| | Title | Author | Edition | Reviews | Ratings | Synopsis | Genre | BookCategory | Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Prisoner's Gold (The Hunters 3) | Chris Kuzneski | Paperback,– 10 Mar 2016 | 4.0 out of 5 stars | 8 customer reviews | THE HUNTERS return in their third brilliant no... | Action & Adventure (Books) | Action & Adventure | 220.00 |
| 1 | Guru Dutt: A Tragedy in Three Acts | Arun Khopkar | Paperback,– 7 Nov 2012 | 3.9 out of 5 stars | 14 customer reviews | A layered portrait of a troubled genius for wh... | Cinema & Broadcast (Books) | Biographies, Diaries & True Accounts | 202.93 |
| 2 | Leviathan (Penguin Classics) | Thomas Hobbes | Paperback,– 25 Feb 1982 | 4.8 out of 5 stars | 6 customer reviews | "During the time men live without a common Pow... | International Relations | Humour | 299.00 |
| 3 | A Pocket Full of Rye (Miss Marple) | Agatha Christie | Paperback,– 5 Oct 2017 | 4.1 out of 5 stars | 13 customer reviews | A handful of grain is found in the pocket of a... | Contemporary Fiction (Books) | Crime, Thriller & Mystery | 180.00 |
| 4 | LIFE 70 Years of Extraordinary Photography | Editors of Life | Hardcover,– 10 Oct 2006 | 5.0 out of 5 stars | 1 customer review | For seven decades, "Life" has been thrilling t... | Photography Textbooks | Arts, Film & Photography | 965.62 |
We apply some basic preprocessing: we convert the Reviews and Ratings columns to numeric values, and we transform Price to a log scale, log(price + 1).
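def preprocess(df):
    df = df.copy(deep=True)
    # 'Reviews' holds strings like '4.0 out of 5 stars'; strip the suffix and keep the number.
    df.loc[:, 'Reviews'] = pd.to_numeric(df['Reviews'].apply(lambda ele: ele[:-len(' out of 5 stars')]))
    # 'Ratings' holds strings like '8 customer reviews'; drop thousands separators and the suffix.
    df.loc[:, 'Ratings'] = pd.to_numeric(df['Ratings'].apply(lambda ele: ele.replace(',', '')[:-len(' customer reviews')]))
    # Log-transform the target: log(price + 1).
    df.loc[:, 'Price'] = np.log(df['Price'] + 1)
    return df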
train_subsample_size = 1500 # subsample for a faster demo; you can try larger values
test_subsample_size = 5
train_df = preprocess(train_df)
train_data = train_df.iloc[100:].sample(train_subsample_size, random_state=123)
test_data = train_df.iloc[:100].sample(test_subsample_size, random_state=245)
train_data.head()
| | Title | Author | Edition | Reviews | Ratings | Synopsis | Genre | BookCategory | Price |
|---|---|---|---|---|---|---|---|---|---|
| 949 | Furious Hours | Casey Cep | Paperback,– 1 Jun 2019 | 4.0 | NaN | ‘It’s been a long time since I picked up a boo... | True Accounts (Books) | Biographies, Diaries & True Accounts | 5.743003 |
| 5504 | REST API Design Rulebook | Mark Masse | Paperback,– 7 Nov 2011 | 5.0 | NaN | In todays market, where rival web services com... | Computing, Internet & Digital Media (Books) | Computing, Internet & Digital Media | 5.786897 |
| 5856 | The Atlantropa Articles: A Novel | Cody Franklin | Paperback,– Import, 1 Nov 2018 | 4.5 | 2.0 | #1 Amazon Best Seller! Dystopian Alternate His... | Action & Adventure (Books) | Romance | 6.893656 |
| 4137 | Hickory Dickory Dock (Poirot) | Agatha Christie | Paperback,– 5 Oct 2017 | 4.3 | 21.0 | There’s more than petty theft going on in a Lo... | Action & Adventure (Books) | Crime, Thriller & Mystery | 5.192957 |
| 3205 | The Stanley Kubrick Archives (Bibliotheca Univ... | Alison Castle | Hardcover,– 21 Aug 2016 | 4.6 | 3.0 | In 1968, when Stanley Kubrick was asked to com... | Cinema & Broadcast (Books) | Humour | 6.889591 |
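The NaN values in the Ratings column above appear to come from the fixed-length suffix slice in preprocess: a singular string such as '1 customer review' is exactly as long as the suffix ' customer reviews', so the slice leaves an empty string, which pandas converts to NaN. A minimal sketch of a more tolerant parser, assuming the string formats shown in the raw table (`leading_number` is a hypothetical helper, not part of the tutorial or AutoGluon):

```python
import re

def leading_number(text):
    """Return the number at the start of text as a float, or NaN if there is none.

    Thousands separators (commas) are removed before conversion.
    """
    match = re.match(r"\s*(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return float("nan")
    return float(match.group(1).replace(",", ""))

print(leading_number("4.0 out of 5 stars"))      # 4.0
print(leading_number("1,234 customer reviews"))  # 1234.0
print(leading_number("1 customer review"))       # 1.0
```

It could be applied with `df['Ratings'].apply(leading_number)` in place of the slice. Doing so changes the training data, so the numbers shown in this tutorial's output would no longer match.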
Training
We can simply create a MultiModalPredictor and call predictor.fit() to train a model that operates across all feature types.
Internally, the neural network will be automatically generated based on the inferred data type of each feature column.
To save time, we subsample the data and only train for three minutes.
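To give a feel for what "inferred data type" can mean, here is a simple heuristic sketch. It is not AutoGluon's actual inference logic; the function `infer_column_type` and its thresholds `max_categories` and `min_avg_tokens` are assumptions made for illustration.

```python
def infer_column_type(values, max_categories=20, min_avg_tokens=5):
    """Guess whether a column is 'numerical', 'categorical', or 'text'.

    Heuristic: all-numeric columns are numerical; string columns with long
    values are text; string columns with few distinct short values are
    categorical; everything else falls back to text.
    """
    non_null = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in non_null):
        return "numerical"
    strings = [str(v) for v in non_null]
    avg_tokens = sum(len(s.split()) for s in strings) / max(len(strings), 1)
    if avg_tokens >= min_avg_tokens:
        return "text"
    if len(set(strings)) <= max_categories:
        return "categorical"
    return "text"

print(infer_column_type([4.0, 3.9, None]))                 # numerical
print(infer_column_type(["Humour", "Romance", "Humour"]))  # categorical
print(infer_column_type([
    "THE HUNTERS return in their third brilliant novel",
    "A layered portrait of a troubled genius",
]))                                                        # text
```

When the automatic guess is wrong for a column, the fit() signature shown in the traceback below includes a column_types argument that can be used to override it.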
from autogluon.multimodal import MultiModalPredictor
import uuid
time_limit = 3 * 60 # set to a larger value in your applications
model_path = f"./tmp/{uuid.uuid4().hex}-automm_text_book_price_prediction"
predictor = MultiModalPredictor(label='Price', path=model_path)
predictor.fit(train_data, time_limit=time_limit)
/home/ci/opt/venv/lib/python3.11/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
from torch.distributed.optim import \
=================== System Info ===================
AutoGluon Version: 1.2b20241213
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
Pytorch Version: 2.5.1+cu124
CUDA Version: 12.4
Memory Avail: 28.42 GB / 30.95 GB (91.8%)
Disk Space Avail: 182.31 GB / 255.99 GB (71.2%)
===================================================
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (9.115699967822062, 3.6109179126442243, 6.02567, 0.7694)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
AutoMM starts to create your model. ✨✨✨
To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction
```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params | Mode
------------------------------------------------------------------
0 | model | MultimodalFusionMLP | 110 M | train
1 | validation_metric | MeanSquaredError | 0 | train
2 | loss_func | MSELoss | 0 | train
------------------------------------------------------------------
110 M Trainable params
0 Non-trainable params
110 M Total params
442.755 Total estimated model params size (MB)
84 Modules in train mode
225 Modules in eval mode
Epoch 0, global step 4: 'val_rmse' reached 1.17493 (best 1.17493), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=0-step=4.ckpt' as top 3
Epoch 0, global step 10: 'val_rmse' reached 0.98727 (best 0.98727), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=0-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_rmse' reached 1.41323 (best 0.98727), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=1-step=14.ckpt' as top 3
Epoch 1, global step 20: 'val_rmse' reached 0.97952 (best 0.97952), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=1-step=20.ckpt' as top 3
Epoch 2, global step 24: 'val_rmse' reached 1.01497 (best 0.97952), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=2-step=24.ckpt' as top 3
Epoch 2, global step 30: 'val_rmse' reached 0.87318 (best 0.87318), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=2-step=30.ckpt' as top 3
Epoch 3, global step 34: 'val_rmse' was not in top 3
Time limit reached. Elapsed time is 0:03:00. Signaling Trainer to stop.
Epoch 3, global step 36: 'val_rmse' reached 0.85707 (best 0.85707), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/multimodal_prediction/tmp/46aef9153c8c4178ba1591985d30323b-automm_text_book_price_prediction/epoch=3-step=36.ckpt' as top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[8], line 7
5 model_path = f"./tmp/{uuid.uuid4().hex}-automm_text_book_price_prediction"
6 predictor = MultiModalPredictor(label='Price', path=model_path)
----> 7 predictor.fit(train_data, time_limit=time_limit)
File ~/autogluon/multimodal/src/autogluon/multimodal/predictor.py:529, in MultiModalPredictor.fit(self, train_data, presets, tuning_data, max_num_tuning_data, id_mappings, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_predictor, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, predictions, labels, predictors)
526 assert isinstance(predictors, list)
527 learners = [ele if isinstance(ele, str) else ele._learner for ele in predictors]
--> 529 self._learner.fit(
530 train_data=train_data,
531 presets=presets,
532 tuning_data=tuning_data,
533 max_num_tuning_data=max_num_tuning_data,
534 time_limit=time_limit,
535 save_path=save_path,
536 hyperparameters=hyperparameters,
537 column_types=column_types,
538 holdout_frac=holdout_frac,
539 teacher_learner=teacher_learner,
540 seed=seed,
541 standalone=standalone,
542 hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
543 clean_ckpts=clean_ckpts,
544 id_mappings=id_mappings,
545 predictions=predictions,
546 labels=labels,
547 learners=learners,
548 )
550 return self
File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:666, in BaseLearner.fit(self, train_data, presets, tuning_data, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_learner, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, **kwargs)
659 self.prepare_fit_args(
660 time_limit=time_limit,
661 seed=seed,
662 standalone=standalone,
663 clean_ckpts=clean_ckpts,
664 )
665 fit_returns = self.execute_fit()
--> 666 self.on_fit_end(
667 training_start=training_start,
668 strategy=fit_returns.get("strategy", None),
669 strict_loading=fit_returns.get("strict_loading", True),
670 standalone=standalone,
671 clean_ckpts=clean_ckpts,
672 )
674 return self
File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:610, in BaseLearner.on_fit_end(self, training_start, strategy, strict_loading, standalone, clean_ckpts)
607 self._fit_called = True
608 if not self._is_hpo:
609 # top_k_average is called inside hyperparameter_tune() when building the final predictor.
--> 610 self.top_k_average(
611 save_path=self._save_path,
612 top_k_average_method=self._config.optim.top_k_average_method,
613 strategy=strategy,
614 strict_loading=strict_loading,
615 # Not strict loading if using parameter-efficient finetuning
616 standalone=standalone,
617 clean_ckpts=clean_ckpts,
618 )
620 training_end = time.time()
621 self._total_train_time = training_end - training_start
File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:1449, in BaseLearner.top_k_average(self, save_path, top_k_average_method, strategy, last_ckpt_path, strict_loading, standalone, clean_ckpts)
1440 logger.info(
1441 f"Start to fuse {len(top_k_model_paths)} checkpoints via the greedy soup algorithm."
1442 )
1444 self._load_state_dict(
1445 path=top_k_model_paths[0],
1446 prefix=prefix,
1447 strict=strict_loading,
1448 )
-> 1449 best_score = self.evaluate(self._tuning_data, metrics=[eval_metric])[self._eval_metric_name]
1450 for i in range(1, len(top_k_model_paths)):
1451 cand_avg_state_dict = average_checkpoints(
1452 checkpoint_paths=ingredients + [top_k_model_paths[i]],
1453 )
KeyError: 'rmse'
Prediction
We can easily obtain predictions and extract data embeddings using the MultiModalPredictor.
predictions = predictor.predict(test_data)
print('Predictions:')
print('------------')
print(np.exp(predictions) - 1)
print()
print('True Value:')
print('------------')
print(np.exp(test_data['Price']) - 1)
performance = predictor.evaluate(test_data)
print(performance)
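Because the label was transformed with log(price + 1), the score from evaluate is measured on the log scale. To report error in the original price units, the predictions can be mapped back first. The following is a small sketch with made-up predicted values; `to_price` and `rmse` are hypothetical helpers, and `pred_log` does not come from a trained model.

```python
import math

def to_price(log_price):
    """Invert the log(price + 1) transform applied in preprocess."""
    return math.expm1(log_price)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# True prices taken from the raw table above; predictions are invented.
true_log = [math.log1p(220.00), math.log1p(965.62)]
pred_log = [5.2, 6.5]  # hypothetical model outputs on the log scale

print("RMSE (log scale):  ", rmse(true_log, pred_log))
print("RMSE (price scale):", rmse([to_price(v) for v in true_log],
                                  [to_price(v) for v in pred_log]))
```

The two numbers are not interchangeable: an error on the log scale is roughly a relative error, while the price-scale error is dominated by expensive books.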
embeddings = predictor.extract_embedding(test_data)
embeddings.shape
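One common use of such embeddings is to compare rows by cosine similarity, for example to find books the model treats as similar. The sketch below uses short invented vectors in place of real output; `cosine_similarity` is a hypothetical helper, and real embeddings would have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented stand-ins for three rows of an embedding matrix.
book_a = [0.9, 0.1, 0.3]
book_b = [0.8, 0.2, 0.4]
book_c = [-0.5, 0.9, -0.2]

print(cosine_similarity(book_a, book_b))  # close to 1: similar direction
print(cosine_similarity(book_a, book_c))  # negative: dissimilar direction
```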
Other Examples
See AutoMM Examples for more examples of AutoMM.
Customization
To learn how to customize AutoMM, please refer to Customize AutoMM.