.. _sec_tabularquick:

Predicting Columns in a Table - Quick Start
===========================================

Via a simple ``fit()`` call, AutoGluon can produce highly accurate models to
predict the values in one column of a data table based on the rest of the
columns' values. Use AutoGluon with tabular data for both classification and
regression problems. This tutorial demonstrates how to use AutoGluon to
produce a classification model that predicts whether or not a person's income
exceeds $50,000.

To start, import AutoGluon and the ``TabularPrediction`` module, aliased here
as ``task``:

.. code:: python

    import autogluon as ag
    from autogluon import TabularPrediction as task

Load training data from a CSV file into an AutoGluon Dataset object. This
object is essentially equivalent to a `Pandas DataFrame
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`__
and the same methods can be applied to both.

.. code:: python

    train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
    train_data = train_data.head(500)  # subsample 500 data points for faster demo
    print(train_data.head())

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073

.. parsed-literal::
    :class: output

       age   workclass  fnlwgt   education  education-num       marital-status  \
    0   25     Private  178478   Bachelors             13        Never-married
    1   23   State-gov   61743     5th-6th               3        Never-married
    2   46     Private  376789     HS-grad               9        Never-married
    3   55           ?  200235     HS-grad               9   Married-civ-spouse
    4   36     Private  224541     7th-8th               4   Married-civ-spouse

               occupation   relationship    race      sex  capital-gain  \
    0        Tech-support      Own-child   White   Female             0
    1    Transport-moving  Not-in-family   White     Male             0
    2       Other-service  Not-in-family   White     Male             0
    3                   ?        Husband   White     Male             0
    4   Handlers-cleaners        Husband   White     Male             0

       capital-loss  hours-per-week  native-country   class
    0             0              40   United-States   <=50K
    1             0              35   United-States   <=50K
    2             0              15   United-States   <=50K
    3             0              50   United-States    >50K
    4             0              40     El-Salvador   <=50K

Note that we loaded data from a CSV file stored in the cloud (an AWS S3
bucket), but you can specify a local file path instead if you have already
downloaded the CSV file to your own machine (e.g., using ``wget``).
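For example, if you had already downloaded the file, loading it looks the same
except for the path. A minimal sketch, assuming a hypothetical local file named
``train.csv`` in the current directory:

.. code:: python

    from autogluon import TabularPrediction as task

    # 'train.csv' is a hypothetical local copy of the training data;
    # point file_path at wherever you saved the downloaded file.
    train_data = task.Dataset(file_path='train.csv')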
Each row in the table ``train_data`` corresponds to a single training example.
In this particular dataset, each row corresponds to an individual person, and
the columns contain various characteristics reported during a census.

Let's first use these features to predict whether the person's income exceeds
$50,000 or not, which is recorded in the ``class`` column of this table.

.. code:: python

    label_column = 'class'
    print("Summary of class variable: \n", train_data[label_column].describe())

.. parsed-literal::
    :class: output

    Summary of class variable:
     count        500
    unique         2
    top        <=50K
    freq         394
    Name: class, dtype: object

Now use AutoGluon to train multiple models:

.. code:: python

    dir = 'agModels-predictClass'  # specifies folder where to store trained models
    predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)

.. parsed-literal::
    :class: output

    Beginning AutoGluon training ...
    AutoGluon will save models to agModels-predictClass/
    AutoGluon Version:  0.0.12b20200713
    Train Data Rows:    500
    Train Data Columns: 15
    Preprocessing data ...
    Here are the 2 unique label values in your data:  [' <=50K', ' >50K']
    AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed).
    If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Selected class <--> label mapping:  class 1 = >50K, class 0 = <=50K
    Train Data Class Count: 2
    Feature Generator processed 500 data points with 14 features
    Original Features (raw dtypes):
        int64 features: 6
        object features: 8
    Original Features (inferred dtypes):
        int features: 6
        object features: 8
    Generated Features (special dtypes):
    Final Features (raw dtypes):
        int features: 6
        category features: 8
    Final Features:
        int features: 6
        category features: 8
    Data preprocessing and feature engineering runtime = 0.06s ...
    AutoGluon will gauge predictive performance using evaluation metric: accuracy
    To change this, specify the eval_metric argument of fit()
    AutoGluon will early stop models using evaluation metric: accuracy
    Fitting model: RandomForestClassifierGini ...
        0.83     = Validation accuracy score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: RandomForestClassifierEntr ...
        0.83     = Validation accuracy score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesClassifierGini ...
        0.83     = Validation accuracy score
        0.4s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesClassifierEntr ...
        0.82     = Validation accuracy score
        0.4s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsClassifierUnif ...
        0.8      = Validation accuracy score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsClassifierDist ...
        0.75     = Validation accuracy score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: LightGBMClassifier ...
        0.86     = Validation accuracy score
        0.14s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: CatboostClassifier ...
        0.85     = Validation accuracy score
        0.44s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: NeuralNetClassifier ...
        0.86     = Validation accuracy score
        3.9s     = Training runtime
        0.02s    = Validation runtime
    Fitting model: LightGBMClassifierCustom ...
        0.84     = Validation accuracy score
        0.41s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: weighted_ensemble_k0_l1 ...
        0.87     = Validation accuracy score
        0.33s    = Training runtime
        0.0s     = Validation runtime
    AutoGluon training complete, total runtime = 8.67s ...

Next, load separate test data to demonstrate how to make predictions on new
examples at inference time:

.. code:: python

    test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
    y_test = test_data[label_column]  # values to predict
    test_data_nolab = test_data.drop(labels=[label_column], axis=1)  # delete label column to prove we're not cheating
    print(test_data_nolab.head())

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
.. parsed-literal::
    :class: output

       age          workclass  fnlwgt     education  education-num  \
    0   31            Private  169085          11th              7
    1   17   Self-emp-not-inc  226203          12th              8
    2   47            Private   54260     Assoc-voc             11
    3   21            Private  176262  Some-college             10
    4   17            Private  241185          12th              8

           marital-status        occupation relationship    race      sex  \
    0   Married-civ-spouse             Sales         Wife   White   Female
    1        Never-married             Sales    Own-child   White     Male
    2   Married-civ-spouse   Exec-managerial      Husband   White     Male
    3        Never-married   Exec-managerial    Own-child   White   Female
    4        Never-married    Prof-specialty    Own-child   White     Male

       capital-gain  capital-loss  hours-per-week  native-country
    0             0             0              20   United-States
    1             0             0              45   United-States
    2             0          1887              60   United-States
    3             0             0              30   United-States
    4             0             0              20   United-States

We use our trained models to make predictions on the new data and then
evaluate performance:

.. code:: python

    predictor = task.load(dir)  # unnecessary, just demonstrates how to load previously-trained predictor from file
    y_pred = predictor.predict(test_data_nolab)
    print("Predictions: ", y_pred)
    perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

.. parsed-literal::
    :class: output

    Evaluation: accuracy on test data: 0.8131845634148838
    Evaluations on test data:
    {
        "accuracy": 0.8131845634148838,
        "accuracy_score": 0.8131845634148838,
        "balanced_accuracy_score": 0.6360612950251102,
        "matthews_corrcoef": 0.4016929570045743,
        "f1_score": 0.8131845634148838
    }

.. parsed-literal::
    :class: output

    Predictions:  [' <=50K' ' <=50K' ' <=50K' ... ' <=50K' ' <=50K' ' <=50K']

.. parsed-literal::
    :class: output

    Detailed (per-class) classification report:
    {
        " <=50K": {
            "precision": 0.8169220369535827,
            "recall": 0.9731579653737753,
            "f1-score": 0.8882219636185459,
            "support": 7451
        },
        " >50K": {
            "precision": 0.7760358342665173,
            "recall": 0.2989646246764452,
            "f1-score": 0.43164123326066645,
            "support": 2318
        },
        "accuracy": 0.8131845634148838,
        "macro avg": {
            "precision": 0.79647893561005,
            "recall": 0.6360612950251102,
            "f1-score": 0.6599315984396061,
            "support": 9769
        },
        "weighted avg": {
            "precision": 0.8072205098956834,
            "recall": 0.8131845634148838,
            "f1-score": 0.779883942022726,
            "support": 9769
        }
    }
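Besides the predicted class labels, you may also want the predicted class
probabilities, for example to apply your own decision threshold. A minimal
sketch, assuming your AutoGluon version exposes the predictor's
``predict_proba()`` method, which for binary problems returns the probability
of the positive class (here ``>50K``):

.. code:: python

    # Sketch: class probabilities instead of hard labels (assumes predict_proba()
    # is available in your AutoGluon version; for binary tasks it returns P(positive class)).
    y_prob = predictor.predict_proba(test_data_nolab)
    print("Estimated P(>50K) for the first 5 rows:", y_prob[:5])

    # Hypothetical stricter rule: only predict >50K when the estimated probability exceeds 0.7.
    y_pred_strict = [' >50K' if p > 0.7 else ' <=50K' for p in y_prob]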
Now you're ready to try AutoGluon on your own tabular datasets! As long as
they're stored in a popular format like CSV, you should be able to achieve
strong predictive performance with just 2 lines of code:

::

    from autogluon import TabularPrediction as task
    predictor = task.fit(train_data=task.Dataset(file_path=<file-name>), label=<variable-name>)

Description of fit():
---------------------

Here we discuss what happened during ``fit()``.

Since there are only two possible values of the ``class`` variable, this was a
binary classification problem, for which an appropriate performance metric is
*accuracy*. AutoGluon automatically infers this as well as the type of each
feature (i.e., which columns contain continuous numbers vs. discrete
categories). AutoGluon can also automatically handle common issues like
missing data and rescaling feature values.

We did not specify separate validation data, so AutoGluon automatically chose
a random training/validation split of the data. The data used for validation
is separated from the training data and is used to determine the models and
hyperparameter values that produce the best results. Rather than just a single
model, AutoGluon trains multiple models and ensembles them together to boost
predictive performance. By default, AutoGluon tries to fit various types of
models, including neural networks and tree ensembles.

Each type of model has various hyperparameters, which the user would
traditionally have to specify. AutoGluon automates this process, automatically
and iteratively testing hyperparameter values to produce the best performance
on the validation data. This involves repeatedly training models under
different hyperparameter settings and evaluating their performance. The
process can be computationally intensive, so ``fit()`` can parallelize it
across multiple threads (and machines, if distributed resources are
available). To control runtimes, you can specify various arguments in
``fit()`` as demonstrated in the subsequent **In-Depth** tutorial.

For tabular problems, ``fit()`` returns a ``Predictor`` object. Besides
inference, this object can also be used to view a summary of what happened
during fit:

.. code:: python

    results = predictor.fit_summary()

.. parsed-literal::
    :class: output

    *** Summary of fit() ***
    Estimated performance of each model:
                              model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer
    0       weighted_ensemble_k0_l1       0.87       0.141810  4.773642                0.000752           0.332977            1       True
    1            LightGBMClassifier       0.86       0.009270  0.141447                0.009270           0.141447            0       True
    2           NeuralNetClassifier       0.86       0.021867  3.903942                0.021867           3.903942            0       True
    3            CatboostClassifier       0.85       0.008350  0.440086                0.008350           0.440086            0       True
    4      LightGBMClassifierCustom       0.84       0.009938  0.408290                0.009938           0.408290            0       True
    5    RandomForestClassifierEntr       0.83       0.109695  0.499359                0.109695           0.499359            0       True
    6      ExtraTreesClassifierGini       0.83       0.109921  0.395275                0.109921           0.395275            0       True
    7    RandomForestClassifierGini       0.83       0.110334  0.500405                0.110334           0.500405            0       True
    8      ExtraTreesClassifierEntr       0.82       0.109875  0.396431                0.109875           0.396431            0       True
    9      KNeighborsClassifierUnif       0.80       0.107783  0.007162                0.107783           0.007162            0       True
    10     KNeighborsClassifierDist       0.75       0.107865  0.006966                0.107865           0.006966            0       True
    Number of models trained: 11
    Types of models trained: {'WeightedEnsembleModel', 'TabularNeuralNetModel', 'RFModel', 'LGBModel', 'CatboostModel', 'KNNModel', 'XTModel'}
    Bagging used: False
    Stack-ensembling used: False
    Hyperparameter-tuning used: False
    User-specified hyperparameters:
    {'default': {'NN': [{}], 'GBM': [{}], 'CAT': [{}], 'RF': [{'criterion': 'gini', 'AG_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'AG_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}], 'XT': [{'criterion': 'gini', 'AG_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'AG_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}], 'KNN': [{'weights': 'uniform', 'AG_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'AG_args': {'name_suffix': 'Dist'}}], 'custom': [{'num_boost_round': 10000, 'num_threads': -1, 'objective': 'binary', 'verbose': -1, 'boosting_type': 'gbdt', 'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 5, 'two_round': True, 'seed_value': 0, 'AG_args': {'model_type': 'GBM', 'name_suffix': 'Custom', 'disable_in_hpo': True}}]}}
    Plot summary of models saved to file: agModels-predictClass/SummaryOfModels.html
    *** End of fit() summary ***

From this summary, we can see that AutoGluon trained many different types of
models as well as an ensemble of the best-performing ones. The summary also
describes the actual models that were trained during fit and how well each
model performed on the held-out validation data.
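Depending on your AutoGluon version, the predictor may also provide a
``leaderboard()`` method that returns this model comparison as a DataFrame;
passing a labeled dataset additionally scores every trained model on that
data. A sketch under that assumption:

.. code:: python

    # Sketch: per-model comparison table (assumes leaderboard() is available in your version).
    # Passing the labeled test data adds test-set scores alongside the validation scores.
    lb = predictor.leaderboard(test_data)
    print(lb)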
We can also view what properties AutoGluon automatically inferred about our
prediction task:

.. code:: python

    print("AutoGluon infers problem type is: ", predictor.problem_type)
    print("AutoGluon categorized the features as: ", predictor.feature_types)

.. parsed-literal::
    :class: output

    AutoGluon infers problem type is:  binary
    AutoGluon categorized the features as:

AutoGluon correctly recognized our prediction problem to be a binary
classification task and decided that variables such as ``age`` should be
represented as integers, whereas variables such as ``workclass`` should be
represented as categorical objects.
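As the training log noted, if this inference were ever wrong for your data,
you can state the problem type yourself by passing the ``problem_type``
argument to ``fit()`` (one of ``'binary'``, ``'multiclass'``, or
``'regression'``). A minimal sketch, with a hypothetical output folder name:

.. code:: python

    # Sketch: declare the problem type explicitly instead of relying on AutoGluon's inference.
    # 'agModels-predictClassBinary' is a hypothetical folder name for this variant.
    predictor_explicit = task.fit(train_data=train_data, label=label_column,
                                  problem_type='binary',
                                  output_directory='agModels-predictClassBinary')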
Regression (predicting numeric table columns):
----------------------------------------------

To demonstrate that ``fit()`` can also automatically handle regression tasks,
we now try to predict the numeric ``age`` variable in the same table based on
the other features:

.. code:: python

    age_column = 'age'
    print("Summary of age variable: \n", train_data[age_column].describe())

.. parsed-literal::
    :class: output

    Summary of age variable:
     count    500.00000
    mean      38.31400
    std       13.85436
    min       17.00000
    25%       27.00000
    50%       37.00000
    75%       47.00000
    max       90.00000
    Name: age, dtype: float64

We again call ``fit()``, this time imposing a time limit (in seconds), and
also demonstrate a shorthand method to evaluate the resulting model on the
test data (which contain labels):

.. code:: python

    predictor_age = task.fit(train_data=train_data, output_directory="agModels-predictAge", label=age_column, time_limits=60)
    performance = predictor_age.evaluate(test_data)

.. parsed-literal::
    :class: output

    Beginning AutoGluon training ... Time limit = 60s
    AutoGluon will save models to agModels-predictAge/
    AutoGluon Version:  0.0.12b20200713
    Train Data Rows:    500
    Train Data Columns: 15
    Preprocessing data ...
    Here are the first 10 unique label values in your data:  [25, 23, 46, 55, 36, 51, 33, 18, 43, 41]
    AutoGluon infers your prediction problem is: regression  (because dtype of label-column == int and many unique label-values observed).
    If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Feature Generator processed 500 data points with 14 features
    Original Features (raw dtypes):
        object features: 9
        int64 features: 5
    Original Features (inferred dtypes):
        object features: 9
        int features: 5
    Generated Features (special dtypes):
    Final Features (raw dtypes):
        int features: 5
        category features: 9
    Final Features:
        int features: 5
        category features: 9
    Data preprocessing and feature engineering runtime = 0.05s ...
    AutoGluon will gauge predictive performance using evaluation metric: root_mean_squared_error
    To change this, specify the eval_metric argument of fit()
    AutoGluon will early stop models using evaluation metric: root_mean_squared_error
    Fitting model: RandomForestRegressorMSE ... Training model for up to 59.95s of the 59.95s of remaining time.
        -11.0011     = Validation root_mean_squared_error score
        0.5s     = Training runtime
        0.11s    = Validation runtime
    Fitting model: ExtraTreesRegressorMSE ... Training model for up to 59.32s of the 59.32s of remaining time.
        -11.3388     = Validation root_mean_squared_error score
        0.39s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsRegressorUnif ... Training model for up to 58.8s of the 58.8s of remaining time.
        -14.5706     = Validation root_mean_squared_error score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: KNeighborsRegressorDist ... Training model for up to 58.68s of the 58.68s of remaining time.
        -15.8074     = Validation root_mean_squared_error score
        0.01s    = Training runtime
        0.11s    = Validation runtime
    Fitting model: LightGBMRegressor ... Training model for up to 58.56s of the 58.56s of remaining time.
        -10.9958     = Validation root_mean_squared_error score
        0.15s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: CatboostRegressor ... Training model for up to 58.4s of the 58.4s of remaining time.
        -10.0961     = Validation root_mean_squared_error score
        0.33s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: NeuralNetRegressor ... Training model for up to 58.05s of the 58.05s of remaining time.
        -12.3444     = Validation root_mean_squared_error score
        3.08s    = Training runtime
        0.02s    = Validation runtime
    Fitting model: LightGBMRegressorCustom ... Training model for up to 54.94s of the 54.94s of remaining time.
        -11.3321     = Validation root_mean_squared_error score
        0.27s    = Training runtime
        0.01s    = Validation runtime
    Fitting model: weighted_ensemble_k0_l1 ... Training model for up to 59.95s of the 54.11s of remaining time.
        -10.0633     = Validation root_mean_squared_error score
        0.38s    = Training runtime
        0.0s     = Validation runtime
    AutoGluon training complete, total runtime = 6.3s ...

.. parsed-literal::
    :class: output

    Predictive performance on given dataset: root_mean_squared_error = 10.874239331515662

Note that we didn't need to tell AutoGluon this is a regression problem; it
automatically inferred this from the data and reported the appropriate
performance metric (RMSE by default). To specify a particular evaluation
metric other than the default, set the ``eval_metric`` argument of ``fit()``
and AutoGluon will tailor its models to optimize your metric (e.g.
``eval_metric = 'mean_absolute_error'``). For evaluation metrics where higher
values are worse (like RMSE), AutoGluon may sometimes flip their sign and
print them as negative values during training (because it internally assumes
higher values are better).
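For instance, to optimize mean absolute error instead of RMSE in the
regression example above, you could pass ``eval_metric`` directly. A minimal
sketch, with a hypothetical output folder name:

.. code:: python

    # Sketch: optimize a different evaluation metric for the age-regression task.
    # 'agModels-predictAgeMAE' is a hypothetical folder name.
    predictor_age_mae = task.fit(train_data=train_data, label=age_column,
                                 eval_metric='mean_absolute_error',
                                 output_directory='agModels-predictAgeMAE', time_limits=60)
    performance_mae = predictor_age_mae.evaluate(test_data)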