Classifying PDF Documents with AutoMM#

PDF is short for Portable Document Format, one of the most popular document formats. PDFs are everywhere, from personal resumes to business contracts, and from commercial brochures to government documents. The format is prized for its portability: the receiver sees the same document regardless of operating system or device.

With AutoMM, you can build machine learning models on PDF documents just as you would on other modalities such as text and images, without worrying about PDF processing. In this tutorial, we show how to classify PDF documents automatically with AutoMM using document foundation models. Let’s get started!

Get the PDF document dataset#

We have created a small PDF dataset via manual crawling for demonstration purposes. It consists of two categories, resumes and historical documents (downloaded from milestone documents). We picked 20 PDF documents for each category.

Now, let’s download the dataset and split it into training and test sets.

import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
from autogluon.core.utils.loaders import load_zip

download_dir = './ag_automm_tutorial_pdf_classifier'
zip_file = "https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip"
load_zip.unzip(zip_file, unzip_dir=download_dir)

dataset_path = os.path.join(download_dir, "pdf_docs_small")
pdf_docs = pd.read_csv(f"{dataset_path}/data.csv")
train_data = pdf_docs.sample(frac=0.8, random_state=200)
test_data = pdf_docs.drop(train_data.index)

Downloading ./ag_automm_tutorial_pdf_classifier/file.zip from https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip...

100%|██████████| 12.7M/12.7M [00:00<00:00, 95.6MiB/s]
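The split above pairs `sample(frac=0.8)` with `drop(train_data.index)`. The sketch below applies the same pattern to a toy DataFrame (the file names are made up and stand in for the real `data.csv`) to show that the two subsets are disjoint and that, because the split is random rather than stratified, it is worth checking the class balance of each subset.

```python
import pandas as pd

# Toy stand-in for data.csv: 20 "resume" and 20 "historical" rows.
# The file names are hypothetical and only illustrate the split pattern.
toy_docs = pd.DataFrame({
    "doc_path": [f"doc_{i}.pdf" for i in range(40)],
    "label": ["resume"] * 20 + ["historical"] * 20,
})

# Same pattern as the tutorial: sample 80% for training, keep the rest for testing.
toy_train = toy_docs.sample(frac=0.8, random_state=200)
toy_test = toy_docs.drop(toy_train.index)

# The two subsets share no rows and together cover the whole dataset.
overlap = set(toy_train.index) & set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))

# A random split is not stratified, so inspect the class balance of each subset.
print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

With 40 documents, this leaves 32 for training and 8 for testing, so any test metric on a dataset of this size is only a rough indication.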

Now, let’s visualize one of the PDF documents. Here, we use the S3 URL of the PDF document and IFrame to show it in the tutorial.

from IPython.display import IFrame
IFrame("https://automl-mm-bench.s3.amazonaws.com/doc_classification/historical_1.pdf", width=400, height=500)

As you can see, this is an American historical document in PDF format. To make sure the MultiModalPredictor can locate the documents correctly, we need to overwrite the document paths with paths expanded relative to the download folder.

from autogluon.multimodal.utils.misc import path_expander

DOC_PATH_COL = "doc_path"

train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
print(test_data.head())

                                             doc_path   label
 /home/ci/autogluon/docs/tutorials/multimodal/d...  resume
/home/ci/autogluon/docs/tutorials/multimodal/d...  resume
/home/ci/autogluon/docs/tutorials/multimodal/d...  resume
/home/ci/autogluon/docs/tutorials/multimodal/d...  resume
/home/ci/autogluon/docs/tutorials/multimodal/d...  resume
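To make the path step more concrete, here is a minimal sketch of what expanding a path against a base folder amounts to. The helper `expand_path` is a hypothetical stand-in written for illustration, not the library's `path_expander`, and the relative path is made up; the sketch only assumes that expansion means joining with the base folder and converting to an absolute path. The second helper, `missing_files`, is also hypothetical and shows a quick sanity check you could run before training.

```python
import os

def expand_path(relative_path, base_folder):
    """Hypothetical stand-in: join with base_folder and return an absolute path."""
    return os.path.abspath(os.path.join(base_folder, relative_path))

def missing_files(paths):
    """Return the paths that do not exist on disk (a sanity check before training)."""
    return [p for p in paths if not os.path.exists(p)]

base = "./ag_automm_tutorial_pdf_classifier"
# Made-up relative path, used only to show the transformation.
expanded = expand_path("pdf_docs_small/resume/example.pdf", base)

print(expanded)
print(os.path.isabs(expanded))
print(missing_files([expanded]))
```

Running a check like `missing_files` on the `doc_path` column after expansion can catch a wrong base folder early, before any training time is spent.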

Create a PDF Document Classifier#

You can create a PDF classifier easily with MultiModalPredictor. All you need to do is create a predictor and fit it with the training dataset above. AutoMM handles the details automatically, such as (1) detecting whether the dataset contains PDF documents; (2) converting the PDFs into a format that the model can recognize; and (3) detecting and recognizing the text in the PDF documents.
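As an illustration of step (1), the sketch below shows one plausible way to decide whether a column of file paths refers to PDF documents: check the file extension and the `%PDF-` signature at the start of the file. This is an assumption made for illustration, not AutoMM's actual detection logic, and the helper names `looks_like_pdf` and `is_pdf_column` are hypothetical.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file has a .pdf extension and starts with the PDF signature."""
    if not path.lower().endswith(".pdf"):
        return False
    try:
        with open(path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Unreadable or missing files are treated as non-PDF.
        return False

def is_pdf_column(paths):
    """Treat a column as a PDF column only if it is non-empty and every entry looks like a PDF."""
    paths = list(paths)
    return len(paths) > 0 and all(looks_like_pdf(p) for p in paths)

# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pdf_file = os.path.join(tmp, "sample.pdf")
    txt_file = os.path.join(tmp, "notes.txt")
    with open(pdf_file, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(txt_file, "wb") as f:
        f.write(b"plain text")
    pdf_only = is_pdf_column([pdf_file])
    mixed = is_pdf_column([pdf_file, txt_file])

print(pdf_only, mixed)
```

Requiring every entry to pass is a deliberately strict choice for this sketch; a real pipeline might tolerate a few unreadable files.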

Here, label is the name of the column that contains the target variable to predict; in our example, it is “label”. We set the training time limit to 120 seconds for demonstration purposes.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label="label")
predictor.fit(
    train_data=train_data,
    hyperparameters={
        "model.document_transformer.checkpoint_name": "microsoft/layoutlm-base-uncased",
        "optimization.top_k_average_method": "best",
    },
    time_limit=120,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20230616_222833/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['historical', 'resume']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Detected data scarcity. Consider running using the preset 'few_shot_text_classification' for better performance.
INFO:lightning_fabric.utilities.seed:Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.1b20230616.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

1 GPUs are detected, and 1 GPUs will be used.
   - GPU 0 name: Tesla T4
   - GPU 0 memory: 15.74GB/15.84GB (Free/Total)
CUDA version is 11.7.

INFO:pytorch_lightning.utilities.rank_zero:Using 16bit None Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                | Params
----------------------------------------------------------
0 | model             | DocumentTransformer | 112 M 
1 | validation_metric | BinaryAUROC         | 0     
2 | loss_func         | CrossEntropyLoss    | 0     
----------------------------------------------------------
112 M     Trainable params
0         Non-trainable params
112 M     Total params
225.259   Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 1: 'val_roc_auc' reached 0.83333 (best 0.83333), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833/epoch=0-step=1.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 2: 'val_roc_auc' reached 0.83333 (best 0.83333), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833/epoch=1-step=2.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 3: 'val_roc_auc' reached 0.83333 (best 0.83333), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833/epoch=2-step=3.ckpt' as top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 4: 'val_roc_auc' was not in top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 5: 'val_roc_auc' was not in top 3
INFO:pytorch_lightning.utilities.rank_zero:Epoch 5, global step 6: 'val_roc_auc' was not in top 3
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/document/AutogluonModels/ag-20230616_222833
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon

<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f5e44820550>
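The training log reports `roc_auc` as the validation metric. As a rough guide to reading that number, the sketch below computes ROC AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The labels and scores are made up for illustration and are unrelated to the tutorial's validation set, and the function is a plain-Python sketch, not the implementation AutoMM uses.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up toy data: three positives and two negatives.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.1]

# Five of the six positive/negative pairs are ordered correctly.
print(roc_auc(toy_labels, toy_scores))
```

With only a handful of validation documents, the metric can take just a few distinct values, so identical scores across several epochs are not surprising on a dataset this small.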
