Adding a custom model to AutoGluon (Advanced)

Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.

In this tutorial we will cover advanced custom model options that go beyond the topics covered in Adding a custom model to AutoGluon.

It is assumed that you have fully read through Adding a custom model to AutoGluon prior to this tutorial.

Loading the data

First we will load the data. For this tutorial we will use the adult income dataset because it has a mix of integer, float, and categorical features.

from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample for faster demo

train_data.head(5)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	class
6118	51	Private	39264	Some-college	10	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	>50K
23204	58	Private	51662	10th	6	Married-civ-spouse	Other-service	Wife	White	Female	0	0	8	United-States	<=50K
29590	40	Private	326310	Some-college	10	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	44	United-States	<=50K
18116	37	Private	222450	HS-grad	9	Never-married	Sales	Not-in-family	White	Male	0	2339	40	El-Salvador	<=50K
33964	62	Private	109190	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	15024	0	40	United-States	>50K

Force features to be passed to models without preprocessing / dropping

You may want to do this when your model logic requires a particular column to always be present, regardless of its content. For example, suppose you are fine-tuning a pre-trained language model that expects a feature indicating the language of the text in each row, which dictates how the text is preprocessed. If the training data includes only one language, the language identifier feature would be dropped before the model is fit unless you make this adjustment.

Force features to not be dropped in model-specific preprocessing

To keep a custom model from dropping features that have only 1 unique value, add the following _get_default_auxiliary_params method to your custom model class:

from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After  {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params
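
To make the effect of drop_unique concrete, the following plain-Python sketch mimics the rule it controls: a column whose values are all identical is removed unless dropping is disabled. It is only an illustration under that assumption, not AutoGluon's implementation. The helper name drop_constant_columns and the toy data are hypothetical.

```python
# Illustrative sketch only: mimics the "drop features with 1 unique value" rule.
# `drop_constant_columns` is a hypothetical helper, not part of AutoGluon.

def drop_constant_columns(columns: dict, drop_unique: bool = True) -> dict:
    """Return the columns to keep; constant columns are removed when drop_unique is True."""
    if not drop_unique:
        return dict(columns)
    return {name: values for name, values in columns.items() if len(set(values)) > 1}


toy_data = {
    'age': [51, 58, 40],
    'language': ['en', 'en', 'en'],  # only one unique value
}

kept_default = list(drop_constant_columns(toy_data))
kept_override = list(drop_constant_columns(toy_data, drop_unique=False))
print(kept_default)
print(kept_override)
```

With the default setting only 'age' survives, while drop_unique=False keeps the constant 'language' column as well.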

Force features to not be dropped in global preprocessing

While the above fix for model-specific preprocessing works if the feature is still present after global preprocessing, it won’t help if the feature was already dropped before getting to the model. For this, we need to create a new feature generator class which separates the preprocessing logic between normal features and user override features.

Here is an example implementation:

# WARNING: To use this in practice, you must put this code in a separate python file
#  from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator


class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()

        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),

                # Preprocessing logic that handles special features user wishes to treat separately, here we simply skip preprocessing for these features.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators

The above code splits the preprocessing logic of a feature depending on whether it is tagged with the 'user_override' special type in feature metadata. To tag three features ['age', 'native-country', 'dummy_feature'] in this way, you can do the following:

# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'

from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)

print('Before inserting overrides:')
print(feature_metadata)

feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)

print('After inserting overrides:')
print(feature_metadata)

Before inserting overrides:
('int', [])    :  6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', [])                   : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override'])    : 1 | ['age']
('object', [])                : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']

Note that this is only one example implementation of a custom feature generator that has bifurcated preprocessing logic. Users can make their tagging and feature generator logic arbitrarily complex to fit their needs. In this example, we perform the standard preprocessing on non-tagged features, and we pass tagged features through IdentityFeatureGenerator, a no-op that does not alter the features in any way. Instead of an IdentityFeatureGenerator, you could use any kind of feature generator to suit your needs.
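
The split described above can be pictured with a small plain-Python sketch: features tagged 'user_override' are passed through untouched, while the rest go through some transformation (here, a toy integer encoding of string values). This is an illustration of the routing idea under those assumptions, not how BulkFeatureGenerator is implemented. The names route_features and encode_as_codes are hypothetical.

```python
# Illustrative sketch only: routes features by tag, loosely mirroring the
# normal / 'user_override' split. All function names here are hypothetical.

def encode_as_codes(values: list) -> list:
    """Toy stand-in for standard preprocessing: map each distinct value to an integer code."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


def route_features(columns: dict, special_types: dict, tag: str = 'user_override') -> dict:
    """Preprocess untagged features, then append tagged features unchanged."""
    normal = {name: encode_as_codes(values) for name, values in columns.items()
              if tag not in special_types.get(name, [])}
    overridden = {name: list(values) for name, values in columns.items()
                  if tag in special_types.get(name, [])}
    return {**normal, **overridden}


columns = {
    'native-country': ['United-States', 'El-Salvador'],
    'sex': ['Female', 'Male'],
}
special_types = {'native-country': ['user_override']}

result = route_features(columns, special_types)
print(result)
```

In this sketch the untagged 'sex' column is encoded, and the tagged 'native-country' column comes out last with its raw string values intact.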

Putting it all together

# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]

# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y)  # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)

# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)

X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)

Notice how the user_override features were not preprocessed:

print(list(X_preprocessed.columns))
X_preprocessed.head(5)

['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']

	fnlwgt	education-num	sex	capital-gain	capital-loss	hours-per-week	workclass	education	marital-status	occupation	relationship	race	age	native-country	dummy_feature
6118	39264	10	0	0	0	40	3	14	1	4	5	4	51	United-States	dummy value
23204	51662	6	0	0	0	8	3	0	1	8	5	4	58	United-States	dummy value
29590	326310	10	1	0	0	44	3	14	1	3	0	4	40	United-States	dummy value
18116	222450	9	1	0	2339	40	3	11	3	12	1	4	37	El-Salvador	dummy value
33964	109190	13	1	15024	0	40	3	9	1	4	0	4	62	United-States	dummy value

Now let’s see what happens when we send this data to fit a dummy model:

dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)

Before DummyModel Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModel Preprocessing (14 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  
6118              0             0              40   United-States  
23204             0             0               8   United-States  
29590             0             0              44   United-States  
18116             0          2339              40     El-Salvador  
33964         15024             0              40   United-States

<__main__.DummyModel at 0x7f1e0b005e90>

Notice how the model dropped dummy_feature during the preprocess call. Now let’s see what happens if we use DummyModelKeepUnique:

dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)

Before DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  \
6118              0             0              40   United-States   
23204             0             0               8   United-States   
29590             0             0              44   United-States   
18116             0          2339              40     El-Salvador   
33964         15024             0              40   United-States   

      dummy_feature  
6118    dummy value  
23204   dummy value  
29590   dummy value  
18116   dummy value  
33964   dummy value

<__main__.DummyModelKeepUnique at 0x7f1e5d738ad0>

Now dummy_feature is no longer dropped!

The above code logic can be reused to test your own complex model implementations: simply replace DummyModelKeepUnique with your custom model and check that it keeps the features you want to use.
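
One way to automate that check is a small helper that compares the column lists before and after preprocessing and reports any required feature that went missing. The sketch below is a hypothetical helper (find_dropped_features is not an AutoGluon function) and uses a shortened version of the column lists printed earlier in this tutorial.

```python
# Illustrative sketch only: `find_dropped_features` is a hypothetical helper.

def find_dropped_features(before: list, after: list, required: list) -> list:
    """Return the required features that were present before preprocessing but are missing after."""
    after_set = set(after)
    return [name for name in required if name in before and name not in after_set]


before = ['age', 'native-country', 'dummy_feature']
after_dummy_model = ['age', 'native-country']  # DummyModel dropped dummy_feature
after_keep_unique = ['age', 'native-country', 'dummy_feature']

required = ['age', 'native-country', 'dummy_feature']
dropped_by_dummy_model = find_dropped_features(before, after_dummy_model, required)
dropped_by_keep_unique = find_dropped_features(before, after_keep_unique, required)
print(dropped_by_dummy_model)
print(dropped_by_keep_unique)
```

An empty result means every required feature survived. In practice you could pass list(X.columns) captured before and after the preprocess call of your own model.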

Keeping Features via TabularPredictor

Now let’s demonstrate how to do this via TabularPredictor in far fewer lines of code. Note that this code will raise an exception if run in this tutorial because the custom model and feature generator must exist in other files for them to be serializable. Therefore, we will not run the code in the tutorial. (It will also raise an exception because DummyModel isn’t a real model.)

from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # This is another way to get same result as using DummyModelKeepUnique
    }
)