AutoMM for Text - Multilingual Problems#
People around the world speak thousands of languages. According to SIL International’s Ethnologue: Languages of the World, there are more than 7,100 spoken and signed languages. Web data today is highly multilingual, and many real-world problems involve text written in languages other than English.
In this tutorial, we introduce how MultiModalPredictor can help you build multilingual models. For the purpose of demonstration,
we use the Cross-Lingual Amazon Product Review Sentiment dataset, which
comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese.
We will demonstrate how to use MultiModalPredictor to build sentiment classification models on the German fold of this dataset in two ways:
Finetune the German BERT
Cross-lingual transfer from English to German
Note: We also recommend checking Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning to see how to achieve better performance via parameter-efficient finetuning.
Load Dataset#
The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages.
Here, we load the English and German folds of the dataset. In the label column, 0 means negative sentiment and 1 means positive sentiment.
!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d .
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
sep='\t', header=None, names=['label', 'text']) \
.sample(1000, random_state=123)
train_de_df.reset_index(inplace=True, drop=True)
test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
sep='\t', header=None, names=['label', 'text']) \
.sample(200, random_state=123)
test_de_df.reset_index(inplace=True, drop=True)
print(train_de_df)
label text
0 0 Dieser Film, nur so triefend von Kitsch, ist h...
1 0 Wie so oft: Das Buch begeistert, der Film entt...
2 1 Schon immer versuchten Männer ihre Gefühle geg...
3 1 Wenn man sich durch 10 Minuten Disney-Trailer ...
4 1 Eine echt geile nummer zum Abtanzen und feiern...
.. ... ...
995 0 Ich dachte dies wäre ein richtig spannendes Bu...
996 0 Wer sich den Schrott wirklich noch ansehen möc...
997 0 Sicher, der Film greift ein aktuelles und hoch...
998 1 Dieser Bildband lässt das Herz von Sarah Kay-F...
999 1 ...so das war nun mein drittes Buch von Jenny-...
[1000 rows x 2 columns]
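Before finetuning, it can be worth confirming that a sampled fold is roughly class-balanced. A minimal sketch on a toy frame in the same two-column format as the TSV folds (run the same `value_counts` call on `train_de_df` to inspect the real ratios, which will differ):

```python
import pandas as pd

# Toy frame with the same shape as the TSV folds: a binary `label` plus raw `text`.
toy_df = pd.DataFrame({
    'label': [0, 1, 1, 1],
    'text': ['bad', 'great', 'nice', 'fun'],
})

# Fraction of each class, sorted by frequency.
balance = toy_df['label'].value_counts(normalize=True)
print(balance.to_dict())  # → {1: 0.75, 0: 0.25}
```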
train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
sep='\t',
header=None,
names=['label', 'text']) \
.sample(1000, random_state=123)
train_en_df.reset_index(inplace=True, drop=True)
test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
sep='\t',
header=None,
names=['label', 'text']) \
.sample(200, random_state=123)
test_en_df.reset_index(inplace=True, drop=True)
print(train_en_df)
label text
0 0 This is a film that literally sees little wron...
1 0 This music is pretty intelligent, but not very...
2 0 One of the best pieces of rock ever recorded, ...
3 0 Reading the posted reviews here, is like revis...
4 1 I've just finished page 341, the last page. It...
.. ... ...
995 1 This album deserves to be (at least) as popula...
996 1 This book, one of the few that takes a more ac...
997 1 I loved it because it really did show Sagan th...
998 1 Stuart Gordons "DAGON" is a unique horror gem ...
999 0 I've heard Al Lee speak before and thought tha...
[1000 rows x 2 columns]
Finetune the German BERT#
Our first approach is to finetune the German BERT model pretrained by deepset.
Since MultiModalPredictor integrates with Huggingface/Transformers (as explained in Customize AutoMM),
we can directly load the German BERT model available in Huggingface/Transformers via the checkpoint name bert-base-german-cased.
To keep the experiment short, we finetune for just 2 epochs.
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label='label')
predictor.fit(train_de_df,
hyperparameters={
'model.hf_text.checkpoint_name': 'bert-base-german-cased',
'optimization.max_epochs': 2
})
No path specified. Models will be saved in: "AutogluonModels/ag-20230622_213033/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 0]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨
- AutoGluon version is 0.8.1b20230622.
- Pytorch version is 1.13.1+cu117.
- Model will be saved to "/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033".
- Validation metric is "roc_auc".
- To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033
```
Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai
1 GPUs are detected, and 1 GPUs will be used.
- GPU 0 name: Tesla T4
- GPU 0 memory: 15.74GB/15.84GB (Free/Total)
CUDA version is 11.7.
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------------------------------------
0 | model | HFAutoModelForTextPrediction | 109 M
1 | validation_metric | BinaryAUROC | 0
2 | loss_func | CrossEntropyLoss | 0
-------------------------------------------------------------------
109 M Trainable params
0 Non-trainable params
109 M Total params
218.166 Total estimated model params size (MB)
Epoch 0, global step 3: 'val_roc_auc' reached 0.64021 (best 0.64021), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_roc_auc' reached 0.78431 (best 0.78431), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_roc_auc' reached 0.80512 (best 0.80512), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_roc_auc' reached 0.81032 (best 0.81032), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=1-step=14.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=2` reached.
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model 🎉🎉🎉
- To load the model, use the code below:
```python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033")
```
- You can open a terminal and launch Tensorboard to visualize the training log:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033
```
- If you are not satisfied with the model, try to increase the training time,
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f38800aa910>
score = predictor.evaluate(test_de_df)
print('Score on the German Testset:')
print(score)
Score on the German Testset:
{'roc_auc': 0.7294671474358974}
score = predictor.evaluate(test_en_df)
print('Score on the English Testset:')
print(score)
Score on the English Testset:
{'roc_auc': 0.53406692794694}
We can see that the model achieves good performance on the German test set but performs poorly on the English one. Next, we will show how to enable cross-lingual transfer so you can get a model that magically works for both German and English.
Cross-lingual Transfer#
In real-world scenarios, it is quite common to have trained a model on English data and then want to extend it to support other languages such as German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model that translates sentences from the other language (e.g., German) into English, and then apply the English model. However, as shown in “Unsupervised Cross-lingual Representation Learning at Scale”, there is a better and cheaper way to achieve cross-lingual transfer, enabled by large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to conduct zero-shot cross-lingual transfer, meaning you can directly apply a model trained on an English dataset to datasets in other languages. It also outperforms the “TRANSLATE-TEST” baseline, which translates data from other languages into English and applies the English model.
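For intuition, the TRANSLATE-TEST baseline described above can be sketched as a two-stage pipeline. Everything below is purely illustrative: `translate` is a stand-in stub (a real system would call a machine-translation model), and `english_model` is a toy keyword classifier, not a trained predictor:

```python
# Illustrative sketch of the TRANSLATE-TEST baseline: translate foreign-language
# text into English, then apply an English-only model. Both stages are stubs.

def translate(texts, src_lang, tgt_lang='en'):
    # Stand-in for a machine-translation system; here just a tiny lookup table.
    table = {
        'Das Buch ist großartig.': 'The book is great.',
        'Der Film war schrecklich.': 'The movie was terrible.',
    }
    return [table.get(t, t) for t in texts]

def english_model(texts):
    # Toy English sentiment classifier: 1 = positive, 0 = negative.
    return [1 if ('great' in t or 'good' in t) else 0 for t in texts]

german_reviews = ['Das Buch ist großartig.', 'Der Film war schrecklich.']
preds = english_model(translate(german_reviews, src_lang='de'))
print(preds)  # → [1, 0]
```

Zero-shot transfer with a multilingual backbone skips the translation stage entirely: the same model consumes the German text directly.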
In AutoGluon, you can simply set presets="multilingual" in MultiModalPredictor to load a backbone that is suitable for zero-shot transfer.
Internally, we will automatically use state-of-the-art models like DeBERTa-V3.
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label='label')
predictor.fit(train_en_df,
presets='multilingual',
hyperparameters={
'optimization.max_epochs': 2
})
No path specified. Models will be saved in: "AutogluonModels/ag-20230622_213218/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨
- AutoGluon version is 0.8.1b20230622.
- Pytorch version is 1.13.1+cu117.
- Model will be saved to "/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213218".
- Validation metric is "roc_auc".
- To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213218
```
Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai
(In this build, the fit call failed while converting the DeBERTa-V3 tokenizer; the traceback is truncated here to its final error.)

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
score_in_en = predictor.evaluate(test_en_df)
print('Score in the English Testset:')
print(score_in_en)
score_in_de = predictor.evaluate(test_de_df)
print('Score in the German Testset:')
print(score_in_de)
We can see that the model works for both German and English!
Let’s also inspect the model’s performance on Japanese:
test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
sep='\t', header=None, names=['label', 'text']) \
.sample(200, random_state=123)
test_jp_df.reset_index(inplace=True, drop=True)
print(test_jp_df)
print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
score_in_jp = predictor.evaluate(test_jp_df)
print('Score in the Japanese Testset:')
print(score_in_jp)
Amazingly, the model also works for Japanese!
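Beyond `evaluate`, the fitted predictor can label new raw text directly. A quick sketch, assuming the multilingual `predictor` trained above is still in memory (the review sentences here are made up):

```python
import pandas as pd

# Made-up reviews in three languages; the column name must match training ('text').
new_reviews = pd.DataFrame({'text': [
    'Dieses Buch hat mir sehr gut gefallen.',  # German
    'This product broke after two days.',      # English
    '素晴らしい映画でした。',                    # Japanese
]})

# `predict` returns class labels; `predict_proba` returns class probabilities.
predictions = predictor.predict(new_reviews)
probabilities = predictor.predict_proba(new_reviews)
print(predictions.tolist())
```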
Other Examples#
You may go to AutoMM Examples to explore other examples of AutoMM.
Customization#
To learn how to customize AutoMM, please refer to Customize AutoMM.