.. _sec_tabularquick:

Predicting Columns in a Table - Quick Start
===========================================

Via a simple ``fit()`` call, AutoGluon can produce highly accurate models to
predict the values in one column of a data table based on the rest of the
columns' values. Use AutoGluon with tabular data for both classification and
regression problems. This tutorial demonstrates how to use AutoGluon to
produce a classification model that predicts whether or not a person's income
exceeds $50,000.

To start, import AutoGluon and the ``TabularPrediction`` module, aliased here
as ``task``:

.. code:: python

    import autogluon as ag
    from autogluon import TabularPrediction as task

Load training data from a CSV file into an AutoGluon Dataset object. This
object is essentially equivalent to a `Pandas DataFrame
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`__
and the same methods can be applied to both.

.. code:: python

    train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
    train_data = train_data.head(500)  # subsample 500 data points for faster demo
    print(train_data.head())

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073

.. parsed-literal::
    :class: output

       age   workclass  fnlwgt   education  education-num       marital-status  \
    0   25     Private  178478   Bachelors             13        Never-married
    1   23   State-gov   61743     5th-6th               3        Never-married
    2   46     Private  376789     HS-grad               9        Never-married
    3   55           ?  200235     HS-grad               9   Married-civ-spouse
    4   36     Private  224541     7th-8th               4   Married-civ-spouse

               occupation   relationship    race      sex  capital-gain  \
    0        Tech-support      Own-child   White   Female             0
    1    Transport-moving  Not-in-family   White     Male             0
    2       Other-service  Not-in-family   White     Male             0
    3                   ?        Husband   White     Male             0
    4   Handlers-cleaners        Husband   White     Male             0

       capital-loss  hours-per-week  native-country   class
    0             0              40   United-States   <=50K
    1             0              35   United-States   <=50K
    2             0              15   United-States   <=50K
    3             0              50   United-States    >50K
    4             0              40     El-Salvador   <=50K

Note that we loaded data from a CSV file stored in the cloud (an AWS S3
bucket), but you can specify a local file path instead if you have already
downloaded the CSV file to your own machine (e.g., using ``wget``).
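For example, if you had already downloaded the file, loading it looks the same
except for the path. A minimal sketch, assuming a hypothetical local file named
``train.csv`` in the current directory:

.. code:: python

    from autogluon import TabularPrediction as task

    # 'train.csv' is a hypothetical local copy of the training data;
    # point file_path at wherever you saved the downloaded file.
    train_data = task.Dataset(file_path='train.csv')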
Each row in the table ``train_data`` corresponds to a single training example.
In this particular dataset, each row corresponds to an individual person, and
the columns contain various characteristics reported during a census.

Let's first use these features to predict whether the person's income exceeds
$50,000 or not, which is recorded in the ``class`` column of this table.

.. code:: python

    label_column = 'class'
    print("Summary of class variable: \n", train_data[label_column].describe())

.. parsed-literal::
    :class: output

    Summary of class variable:
     count        500
    unique         2
    top        <=50K
    freq         394
    Name: class, dtype: object

Now use AutoGluon to train multiple models:

.. code:: python

    dir = 'agModels-predictClass'  # specifies folder where to store trained models
    predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)

.. parsed-literal::
    :class: output

    Beginning AutoGluon training ...
    AutoGluon will save models to agModels-predictClass/
    AutoGluon Version:  0.0.12b20200713
    Train Data Rows:    500
    Train Data Columns: 15
    Preprocessing data ...
    Here are the 2 unique label values in your data:  [' <=50K', ' >50K']
    AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed).
    If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Selected class <--> label mapping:  class 1 = >50K, class 0 = <=50K
    Train Data Class Count: 2
    Feature Generator processed 500 data points with 14 features
    Original Features (raw dtypes):
        int64 features: 6
        object features: 8
    Original Features (inferred dtypes):
        int features: 6
        object features: 8
    Generated Features (special dtypes):
    Final Features (raw dtypes):
        int features: 6
        category features: 8
    Final Features:
        int features: 6
        category features: 8
    Data preprocessing and feature engineering runtime = 0.06s ...
    AutoGluon will gauge predictive performance using evaluation metric: accuracy
    To change this, specify the eval_metric argument of fit()
    AutoGluon will early stop models using evaluation metric: accuracy
    Fitting model: RandomForestClassifierGini ...
        0.83     = Validation accuracy score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: RandomForestClassifierEntr ...
        0.83     = Validation accuracy score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesClassifierGini ...
        0.83     = Validation accuracy score
        0.4s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesClassifierEntr ...
        0.82     = Validation accuracy score
        0.4s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsClassifierUnif ...
        0.8      = Validation accuracy score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsClassifierDist ...
        0.75     = Validation accuracy score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: LightGBMClassifier ...
        0.86     = Validation accuracy score
        0.14s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: CatboostClassifier ...
        0.85     = Validation accuracy score
        0.44s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: NeuralNetClassifier ...
        0.86     = Validation accuracy score
        3.9s     = Training runtime
        0.02s    = Validation runtime
    Fitting model: LightGBMClassifierCustom ...
        0.84     = Validation accuracy score
        0.41s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: weighted_ensemble_k0_l1 ...
        0.87     = Validation accuracy score
        0.33s    = Training runtime
        0.0s     = Validation runtime
    AutoGluon training complete, total runtime = 8.67s ...

Next, load separate test data to demonstrate how to make predictions on new
examples at inference time:

.. code:: python

    test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
    y_test = test_data[label_column]  # values to predict
    test_data_nolab = test_data.drop(labels=[label_column], axis=1)  # delete label column to prove we're not cheating
    print(test_data_nolab.head())

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
.. parsed-literal::
    :class: output

       age          workclass  fnlwgt     education  education-num  \
    0   31            Private  169085          11th              7
    1   17   Self-emp-not-inc  226203          12th              8
    2   47            Private   54260     Assoc-voc             11
    3   21            Private  176262  Some-college             10
    4   17            Private  241185          12th              8

           marital-status        occupation relationship    race      sex  \
    0   Married-civ-spouse             Sales         Wife   White   Female
    1        Never-married             Sales    Own-child   White     Male
    2   Married-civ-spouse   Exec-managerial      Husband   White     Male
    3        Never-married   Exec-managerial    Own-child   White   Female
    4        Never-married    Prof-specialty    Own-child   White     Male

       capital-gain  capital-loss  hours-per-week  native-country
    0             0             0              20   United-States
    1             0             0              45   United-States
    2             0          1887              60   United-States
    3             0             0              30   United-States
    4             0             0              20   United-States

We use our trained models to make predictions on the new data and then
evaluate performance:

.. code:: python

    predictor = task.load(dir)  # unnecessary, just demonstrates how to load previously-trained predictor from file
    y_pred = predictor.predict(test_data_nolab)
    print("Predictions: ", y_pred)
    perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

.. parsed-literal::
    :class: output

    Evaluation: accuracy on test data: 0.8131845634148838
    Evaluations on test data:
    {
        "accuracy": 0.8131845634148838,
        "accuracy_score": 0.8131845634148838,
        "balanced_accuracy_score": 0.6360612950251102,
        "matthews_corrcoef": 0.4016929570045743,
        "f1_score": 0.8131845634148838
    }

.. parsed-literal::
    :class: output

    Predictions:  [' <=50K' ' <=50K' ' <=50K' ... ' <=50K' ' <=50K' ' <=50K']

.. parsed-literal::
    :class: output

    Detailed (per-class) classification report:
    {
        " <=50K": {
            "precision": 0.8169220369535827,
            "recall": 0.9731579653737753,
            "f1-score": 0.8882219636185459,
            "support": 7451
        },
        " >50K": {
            "precision": 0.7760358342665173,
            "recall": 0.2989646246764452,
            "f1-score": 0.43164123326066645,
            "support": 2318
        },
        "accuracy": 0.8131845634148838,
        "macro avg": {
            "precision": 0.79647893561005,
            "recall": 0.6360612950251102,
            "f1-score": 0.6599315984396061,
            "support": 9769
        },
        "weighted avg": {
            "precision": 0.8072205098956834,
            "recall": 0.8131845634148838,
            "f1-score": 0.779883942022726,
            "support": 9769
        }
    }
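Besides the predicted class labels, you may also want the predicted class
probabilities, for example to apply your own decision threshold. A minimal
sketch, assuming your AutoGluon version exposes the predictor's
``predict_proba()`` method, which for binary problems returns the probability
of the positive class (here ``>50K``):

.. code:: python

    # Sketch: class probabilities instead of hard labels (assumes predict_proba()
    # is available in your AutoGluon version; for binary tasks it returns P(positive class)).
    y_prob = predictor.predict_proba(test_data_nolab)
    print("Estimated P(>50K) for the first 5 rows:", y_prob[:5])

    # Hypothetical stricter rule: only predict >50K when the estimated probability exceeds 0.7.
    y_pred_strict = [' >50K' if p > 0.7 else ' <=50K' for p in y_prob]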
Now you're ready to try AutoGluon on your own tabular datasets! As long as
they're stored in a popular format like CSV, you should be able to achieve
strong predictive performance with just 2 lines of code:

::

    from autogluon import TabularPrediction as task
    predictor = task.fit(train_data=task.Dataset(file_path=<file-name>), label=<variable-name>)

Description of fit():
---------------------

Here we discuss what happened during ``fit()``.

Since there are only two possible values of the ``class`` variable, this was a
binary classification problem, for which an appropriate performance metric is
*accuracy*. AutoGluon automatically infers this as well as the type of each
feature (i.e., which columns contain continuous numbers vs. discrete
categories). AutoGluon can also automatically handle common issues like
missing data and rescaling feature values.

We did not specify separate validation data, so AutoGluon automatically chose
a random training/validation split of the data. The data used for validation
is separated from the training data and is used to determine the models and
hyperparameter values that produce the best results. Rather than just a single
model, AutoGluon trains multiple models and ensembles them together to boost
predictive performance. By default, AutoGluon tries to fit various types of
models, including neural networks and tree ensembles.

Each type of model has various hyperparameters, which the user would
traditionally have to specify. AutoGluon automates this process, automatically
and iteratively testing hyperparameter values to produce the best performance
on the validation data. This involves repeatedly training models under
different hyperparameter settings and evaluating their performance. The
process can be computationally intensive, so ``fit()`` can parallelize it
across multiple threads (and machines, if distributed resources are
available). To control runtimes, you can specify various arguments in
``fit()`` as demonstrated in the subsequent **In-Depth** tutorial.

For tabular problems, ``fit()`` returns a ``Predictor`` object. Besides
inference, this object can also be used to view a summary of what happened
during fit:

.. code:: python

    results = predictor.fit_summary()

.. parsed-literal::
    :class: output

    *** Summary of fit() ***
    Estimated performance of each model:
                              model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer
    0       weighted_ensemble_k0_l1       0.87       0.141810  4.773642                0.000752           0.332977            1       True
    1            LightGBMClassifier       0.86       0.009270  0.141447                0.009270           0.141447            0       True
    2           NeuralNetClassifier       0.86       0.021867  3.903942                0.021867           3.903942            0       True
    3            CatboostClassifier       0.85       0.008350  0.440086                0.008350           0.440086            0       True
    4      LightGBMClassifierCustom       0.84       0.009938  0.408290                0.009938           0.408290            0       True
    5    RandomForestClassifierEntr       0.83       0.109695  0.499359                0.109695           0.499359            0       True
    6      ExtraTreesClassifierGini       0.83       0.109921  0.395275                0.109921           0.395275            0       True
    7    RandomForestClassifierGini       0.83       0.110334  0.500405                0.110334           0.500405            0       True
    8      ExtraTreesClassifierEntr       0.82       0.109875  0.396431                0.109875           0.396431            0       True
    9      KNeighborsClassifierUnif       0.80       0.107783  0.007162                0.107783           0.007162            0       True
    10     KNeighborsClassifierDist       0.75       0.107865  0.006966                0.107865           0.006966            0       True
    Number of models trained: 11
    Types of models trained: {'WeightedEnsembleModel', 'TabularNeuralNetModel', 'RFModel', 'LGBModel', 'CatboostModel', 'KNNModel', 'XTModel'}
    Bagging used: False
    Stack-ensembling used: False
    Hyperparameter-tuning used: False
    User-specified hyperparameters:
    {'default': {'NN': [{}], 'GBM': [{}], 'CAT': [{}], 'RF': [{'criterion': 'gini', 'AG_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'AG_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}], 'XT': [{'criterion': 'gini', 'AG_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'AG_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}], 'KNN': [{'weights': 'uniform', 'AG_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'AG_args': {'name_suffix': 'Dist'}}], 'custom': [{'num_boost_round': 10000, 'num_threads': -1, 'objective': 'binary', 'verbose': -1, 'boosting_type': 'gbdt', 'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 5, 'two_round': True, 'seed_value': 0, 'AG_args': {'model_type': 'GBM', 'name_suffix': 'Custom', 'disable_in_hpo': True}}]}}
    Plot summary of models saved to file: agModels-predictClass/SummaryOfModels.html
    *** End of fit() summary ***

From this summary, we can see that AutoGluon trained many different types of
models as well as an ensemble of the best-performing ones. The summary also
describes the actual models that were trained during fit and how well each
model performed on the held-out validation data.
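Depending on your AutoGluon version, the predictor may also provide a
``leaderboard()`` method that returns this model comparison as a DataFrame;
passing a labeled dataset additionally scores every trained model on that
data. A sketch under that assumption:

.. code:: python

    # Sketch: per-model comparison table (assumes leaderboard() is available in your version).
    # Passing the labeled test data adds test-set scores alongside the validation scores.
    lb = predictor.leaderboard(test_data)
    print(lb)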
We can also view what properties AutoGluon automatically inferred about our
prediction task:

.. code:: python

    print("AutoGluon infers problem type is: ", predictor.problem_type)
    print("AutoGluon categorized the features as: ", predictor.feature_types)

.. parsed-literal::
    :class: output

    AutoGluon infers problem type is:  binary
    AutoGluon categorized the features as:

AutoGluon correctly recognized our prediction problem to be a binary
classification task and decided that variables such as ``age`` should be
represented as integers, whereas variables such as ``workclass`` should be
represented as categorical objects.
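As the training log noted, if this inference were ever wrong for your data,
you can state the problem type yourself by passing the ``problem_type``
argument to ``fit()`` (one of ``'binary'``, ``'multiclass'``, or
``'regression'``). A minimal sketch, with a hypothetical output folder name:

.. code:: python

    # Sketch: declare the problem type explicitly instead of relying on AutoGluon's inference.
    # 'agModels-predictClassBinary' is a hypothetical folder name for this variant.
    predictor_explicit = task.fit(train_data=train_data, label=label_column,
                                  problem_type='binary',
                                  output_directory='agModels-predictClassBinary')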
Regression (predicting numeric table columns):
----------------------------------------------

To demonstrate that ``fit()`` can also automatically handle regression tasks,
we now try to predict the numeric ``age`` variable in the same table based on
the other features:

.. code:: python

    age_column = 'age'
    print("Summary of age variable: \n", train_data[age_column].describe())

.. parsed-literal::
    :class: output

    Summary of age variable:
     count    500.00000
    mean      38.31400
    std       13.85436
    min       17.00000
    25%       27.00000
    50%       37.00000
    75%       47.00000
    max       90.00000
    Name: age, dtype: float64

We again call ``fit()``, this time imposing a time limit (in seconds), and
also demonstrate a shorthand method to evaluate the resulting model on the
test data (which contain labels):

.. code:: python

    predictor_age = task.fit(train_data=train_data, output_directory="agModels-predictAge", label=age_column, time_limits=60)
    performance = predictor_age.evaluate(test_data)

.. parsed-literal::
    :class: output

    Beginning AutoGluon training ... Time limit = 60s
    AutoGluon will save models to agModels-predictAge/
    AutoGluon Version:  0.0.12b20200713
    Train Data Rows:    500
    Train Data Columns: 15
    Preprocessing data ...
    Here are the first 10 unique label values in your data:  [25, 23, 46, 55, 36, 51, 33, 18, 43, 41]
    AutoGluon infers your prediction problem is: regression  (because dtype of label-column == int and many unique label-values observed).
    If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Feature Generator processed 500 data points with 14 features
    Original Features (raw dtypes):
        object features: 9
        int64 features: 5
    Original Features (inferred dtypes):
        object features: 9
        int features: 5
    Generated Features (special dtypes):
    Final Features (raw dtypes):
        int features: 5
        category features: 9
    Final Features:
        int features: 5
        category features: 9
    Data preprocessing and feature engineering runtime = 0.05s ...
    AutoGluon will gauge predictive performance using evaluation metric: root_mean_squared_error
    To change this, specify the eval_metric argument of fit()
    AutoGluon will early stop models using evaluation metric: root_mean_squared_error
    Fitting model: RandomForestRegressorMSE ... Training model for up to 59.95s of the 59.95s of remaining time.
        -11.0011     = Validation root_mean_squared_error score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesRegressorMSE ... Training model for up to 59.32s of the 59.32s of remaining time.
        -11.3388     = Validation root_mean_squared_error score
        0.39s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsRegressorUnif ... Training model for up to 58.8s of the 58.8s of remaining time.
        -14.5706     = Validation root_mean_squared_error score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsRegressorDist ... Training model for up to 58.68s of the 58.68s of remaining time.
        -15.8074     = Validation root_mean_squared_error score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: LightGBMRegressor ... Training model for up to 58.56s of the 58.56s of remaining time.
        -10.9958     = Validation root_mean_squared_error score
        0.15s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: CatboostRegressor ... Training model for up to 58.4s of the 58.4s of remaining time.
        -10.0961     = Validation root_mean_squared_error score
        0.33s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: NeuralNetRegressor ... Training model for up to 58.05s of the 58.05s of remaining time.
        -12.3444     = Validation root_mean_squared_error score
        3.08s    = Training runtime
        0.02s    = Validation runtime
    Fitting model: LightGBMRegressorCustom ... Training model for up to 54.94s of the 54.94s of remaining time.
        -11.3321     = Validation root_mean_squared_error score
        0.27s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: weighted_ensemble_k0_l1 ... Training model for up to 59.95s of the 54.11s of remaining time.
        -10.0633     = Validation root_mean_squared_error score
        0.38s    = Training runtime
        0.0s     = Validation runtime
    AutoGluon training complete, total runtime = 6.3s ...

.. parsed-literal::
    :class: output

    Predictive performance on given dataset: root_mean_squared_error = 10.874239331515662

Note that we didn't need to tell AutoGluon this is a regression problem; it
automatically inferred this from the data and reported the appropriate
performance metric (RMSE by default). To specify a particular evaluation
metric other than the default, set the ``eval_metric`` argument of ``fit()``
and AutoGluon will tailor its models to optimize your metric (e.g.
``eval_metric = 'mean_absolute_error'``). For evaluation metrics where higher
values are worse (like RMSE), AutoGluon may sometimes flip their sign and
print them as negative values during training (because it internally assumes
higher values are better).
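For instance, to optimize mean absolute error instead of RMSE in the
regression example above, you could pass ``eval_metric`` directly. A minimal
sketch, with a hypothetical output folder name:

.. code:: python

    # Sketch: optimize a different evaluation metric for the age-regression task.
    # 'agModels-predictAgeMAE' is a hypothetical folder name.
    predictor_age_mae = task.fit(train_data=train_data, label=age_column,
                                 eval_metric='mean_absolute_error',
                                 output_directory='agModels-predictAgeMAE', time_limits=60)
    performance_mae = predictor_age_mae.evaluate(test_data)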