AutoMM for Text - Multilingual Problems#


People around the world speak many languages. According to SIL International’s Ethnologue: Languages of the World, there are more than 7,100 spoken and signed languages. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we show how MultiModalPredictor can help you build multilingual models. For the purpose of demonstration, we use the Cross-Lingual Amazon Product Review Sentiment dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will demonstrate how to use MultiModalPredictor to build sentiment classification models on the German fold of this dataset in two ways:

  • Finetune the German BERT

  • Cross-lingual transfer from English to German

Note: We also recommend checking Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning to learn how to achieve better performance via parameter-efficient finetuning.

Load Dataset#

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, 0 means negative sentiment and 1 means positive sentiment.

!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d .
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
                .sample(1000, random_state=123)
train_de_df.reset_index(inplace=True, drop=True)

test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_de_df.reset_index(inplace=True, drop=True)
print(train_de_df)
     label                                               text
0        0  Dieser Film, nur so triefend von Kitsch, ist h...
1        0  Wie so oft: Das Buch begeistert, der Film entt...
2        1  Schon immer versuchten Männer ihre Gefühle geg...
3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
4        1  Eine echt geile nummer zum Abtanzen und feiern...
..     ...                                                ...
995      0  Ich dachte dies wäre ein richtig spannendes Bu...
996      0  Wer sich den Schrott wirklich noch ansehen möc...
997      0  Sicher, der Film greift ein aktuelles und hoch...
998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
999      1  ...so das war nun mein drittes Buch von Jenny-...

[1000 rows x 2 columns]
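Before training, it can be worth checking that a sampled split is reasonably class-balanced. A minimal sketch, using a toy DataFrame as a stand-in for `train_de_df` (the toy data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical stand-in for train_de_df: a tiny frame with the same
# ['label', 'text'] schema used throughout this tutorial.
toy_df = pd.DataFrame({
    'label': [0, 0, 1, 1, 1],
    'text': ['schlecht', 'enttäuschend', 'gut', 'toll', 'super'],
})

# Fraction of each sentiment class; a severe imbalance would suggest
# drawing a stratified sample or choosing a different evaluation metric.
class_ratio = toy_df['label'].value_counts(normalize=True)
print(class_ratio)
```

The same `value_counts(normalize=True)` call applied to the real `train_de_df['label']` column reveals the class balance of the 1,000-review sample used below.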
train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
                .sample(1000, random_state=123)
train_en_df.reset_index(inplace=True, drop=True)

test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
               .sample(200, random_state=123)
test_en_df.reset_index(inplace=True, drop=True)
print(train_en_df)
     label                                               text
0        0  This is a film that literally sees little wron...
1        0  This music is pretty intelligent, but not very...
2        0  One of the best pieces of rock ever recorded, ...
3        0  Reading the posted reviews here, is like revis...
4        1  I've just finished page 341, the last page. It...
..     ...                                                ...
995      1  This album deserves to be (at least) as popula...
996      1  This book, one of the few that takes a more ac...
997      1  I loved it because it really did show Sagan th...
998      1  Stuart Gordons "DAGON" is a unique horror gem ...
999      0  I've heard Al Lee speak before and thought tha...

[1000 rows x 2 columns]

Finetune the German BERT#

Our first approach is to finetune the German BERT model pretrained by deepset. Since MultiModalPredictor integrates with Huggingface/Transformers (as explained in Customize AutoMM), we can directly load the German BERT model available in Huggingface/Transformers, using the checkpoint name bert-base-german-cased. To keep the experiment short, we only finetune for 2 epochs.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label='label')
predictor.fit(train_de_df,
              hyperparameters={
                  'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                  'optimization.max_epochs': 2
              })
No path specified. Models will be saved in: "AutogluonModels/ag-20230622_213033/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1, 0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.1b20230622.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

1 GPUs are detected, and 1 GPUs will be used.
   - GPU 0 name: Tesla T4
   - GPU 0 memory: 15.74GB/15.84GB (Free/Total)
CUDA version is 11.7.

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 109 M 
1 | validation_metric | BinaryAUROC                  | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
218.166   Total estimated model params size (MB)
Epoch 0, global step 3: 'val_roc_auc' reached 0.64021 (best 0.64021), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_roc_auc' reached 0.78431 (best 0.78431), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_roc_auc' reached 0.80512 (best 0.80512), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_roc_auc' reached 0.81032 (best 0.81032), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033/epoch=1-step=14.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=2` reached.
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213033
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f38800aa910>
score = predictor.evaluate(test_de_df)
print('Score on the German Testset:')
print(score)
Score on the German Testset:
{'roc_auc': 0.7294671474358974}
score = predictor.evaluate(test_en_df)
print('Score on the English Testset:')
print(score)
Score on the English Testset:
{'roc_auc': 0.53406692794694}
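For context on these numbers: roc_auc is the probability that the model scores a randomly chosen positive review higher than a randomly chosen negative one, so 1.0 means perfect separation and 0.5 means chance level. A minimal pure-Python sketch of the metric (for illustration only; AutoGluon computes it internally):

```python
def roc_auc(labels, scores):
    """ROC AUC computed directly as the probability that a randomly
    chosen positive example is scored higher than a randomly chosen
    negative one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

By this reading, the German-only BERT is genuinely useful on German (~0.73) but barely better than random on English (~0.53).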

The model achieves good performance on the German test set but performs poorly on the English one. Next, we will show how to enable cross-lingual transfer so you can get a single model that magically works for both German and English.

Cross-lingual Transfer#

In real-world scenarios, it is quite common to have trained a model for English and then want to extend it to other languages such as German. This setting is known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate the sentences from the other language (e.g., German) into English and then apply the English model. However, as shown in “Unsupervised Cross-lingual Representation Learning at Scale”, there is a better and more cost-effective way to do cross-lingual transfer, enabled by large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to perform zero-shot cross-lingual transfer, meaning that you can directly apply a model trained on an English dataset to datasets in other languages. It also outperforms the “TRANSLATE-TEST” baseline, which translates the data from other languages into English and applies the English model.
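To make the comparison concrete, the two pipelines can be sketched with stub functions. Neither `translate_de_to_en` nor `english_model` is a real AutoGluon API; both are hypothetical placeholders for illustration:

```python
def translate_de_to_en(text: str) -> str:
    """Hypothetical stand-in for a machine translation model."""
    lookup = {'gut': 'good', 'schlecht': 'bad'}
    return ' '.join(lookup.get(word, word) for word in text.split())

def english_model(text: str) -> int:
    """Hypothetical stand-in for a sentiment model trained only on English."""
    return 1 if 'good' in text else 0

# TRANSLATE-TEST baseline: translate first, then apply the English model.
print(english_model(translate_de_to_en('gut')))  # 1

# Zero-shot transfer: a multilingual backbone (like XLM-R, or the backbone
# loaded by presets='multilingual' below) is applied to the German text
# directly, with no translation step and no extra inference cost.
```

The zero-shot route avoids maintaining a separate translation system, which is what makes it the more cost-effective option described above.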

In AutoGluon, you can simply set presets="multilingual" in MultiModalPredictor to load a backbone that is suitable for zero-shot transfer. Internally, it will automatically use state-of-the-art multilingual models such as DeBERTa-V3.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label='label')
predictor.fit(train_en_df,
              presets='multilingual',
              hyperparameters={
                  'optimization.max_epochs': 2
              })
No path specified. Models will be saved in: "AutogluonModels/ag-20230622_213218/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.1b20230622.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213218".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230622_213218
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai
score_in_en = predictor.evaluate(test_en_df)
print('Score in the English Testset:')
print(score_in_en)
score_in_de = predictor.evaluate(test_de_df)
print('Score in the German Testset:')
print(score_in_de)

We can see that the model works for both German and English!

Let’s also inspect the model’s performance on Japanese:

test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_jp_df.reset_index(inplace=True, drop=True)
print(test_jp_df)
print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
score_in_jp = predictor.evaluate(test_jp_df)
print('Score in the Japanese Testset:')
print(score_in_jp)

Amazingly, the model also works for Japanese!

Other Examples#

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization#

To learn how to customize AutoMM, please refer to Customize AutoMM.