.. _sec_forecastingquick:

Forecasting Time-Series - Quick Start
=====================================

Via a simple ``fit()`` call, AutoGluon can train models to produce forecasts for time series data. This tutorial demonstrates how to quickly use AutoGluon to produce forecasts of Covid-19 cases in a country given historical data from each country.

Let's first import AutoGluon's ``ForecastingPredictor`` and ``TabularDataset`` classes, where the latter is used to load time-series data stored in a tabular file format:

.. code:: python

    from autogluon.forecasting import ForecastingPredictor
    from autogluon.forecasting import TabularDataset

.. parsed-literal::
    :class: output

    /var/lib/jenkins/workspace/workspace/autogluon-forecasting-py3-v3/venv/lib/python3.7/site-packages/gluonts/json.py:46: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
      "Using `json`-module for json-handling. "

We load the time-series data to use for training from a CSV file into an AutoGluon ``TabularDataset`` object. This object is essentially equivalent to a Pandas DataFrame and the same methods can be applied to both.

.. code:: python

    train_data = TabularDataset("https://autogluon.s3-us-west-2.amazonaws.com/datasets/CovidTimeSeries/train.csv")
    print(train_data[50:60])

.. parsed-literal::
    :class: output

              Date  ConfirmedCases          name
    50  2020-03-12             7.0  Afghanistan_
    51  2020-03-13             7.0  Afghanistan_
    52  2020-03-14            11.0  Afghanistan_
    53  2020-03-15            16.0  Afghanistan_
    54  2020-03-16            21.0  Afghanistan_
    55  2020-03-17            22.0  Afghanistan_
    56  2020-03-18            22.0  Afghanistan_
    57  2020-03-19            22.0  Afghanistan_
    58  2020-03-20            24.0  Afghanistan_
    59  2020-03-21            24.0  Afghanistan_

Note that we loaded data from a CSV file stored in the cloud (an AWS S3 bucket), but you can specify a local file-path instead if you have already downloaded the CSV file to your own machine (e.g., using ``wget``).

Our goal is to train models on this data that can forecast Covid case counts in each country at future dates. This corresponds to a forecasting problem with many related individual time-series (one per country). Each row in the table ``train_data`` corresponds to one observation of one time-series at a particular time. The dataset you use for ``autogluon.forecasting`` should usually contain three columns: a ``time_column`` with the time information (here "Date"), an ``index_column`` with a categorical index ID that specifies which (out of multiple) time-series is being observed (here "name", where each country corresponds to a different time-series in our example), and a ``target_column`` with the observed value of this particular time-series at this particular time (here "ConfirmedCases").

When forecasting future values of one particular time-series, AutoGluon may rely on historical observations of not only this time-series but also all of the other time-series in the dataset. You can use ``NA`` to represent missing observations in the data. If your data only contains observations of a single time-series, then ``index_column`` can be omitted. Currently, only continuous numeric values are supported in the ``target_column``.
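To make this expected layout concrete, below is a minimal sketch (plain pandas, with made-up country names and a hypothetical file name) of a tiny dataset in the same long format; your own data just needs to follow the same three-column pattern before loading it with ``TabularDataset``:

.. code:: python

    import pandas as pd

    # Hypothetical toy example of the expected long format:
    # one row per (time-series, time-point) observation.
    toy_data = pd.DataFrame({
        "Date": ["2020-03-12", "2020-03-13", "2020-03-12", "2020-03-13"],
        "name": ["CountryA", "CountryA", "CountryB", "CountryB"],
        "ConfirmedCases": [7.0, 9.0, 3.0, None],  # use NA/None for missing observations
    })
    toy_data.to_csv("toy_train.csv", index=False)
    # This CSV could then be loaded with TabularDataset("toy_train.csv"), just like above.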
Now let's use AutoGluon to train some forecasting models. Below we instruct AutoGluon to fit models that can forecast up to 19 time-points into the future (``prediction_length``) and save them in the folder ``save_path``. Because of the inherent uncertainty involved in this prediction problem, these models are trained to probabilistically forecast 3 different quantiles of the "ConfirmedCases" distribution: the central 0.5 quantile (median), a low 0.1 quantile, and a high 0.9 quantile. The first of these can be used as our forecasted value, while the latter two can be used as a prediction interval for this value (we are 80% confident the true value lies within this interval).

.. code:: python

    save_path = "agModels-covidforecast"
    predictor = ForecastingPredictor(path=save_path).fit(train_data,
                                                         prediction_length=19,
                                                         index_column="name",
                                                         target_column="ConfirmedCases",
                                                         time_column="Date",
                                                         quantiles=[0.1, 0.5, 0.9],
                                                         presets="low_quality"  # last argument is just here for quick demo, omit it in real applications!
                                                         )

.. parsed-literal::
    :class: output

    Warning: path already exists! This predictor may overwrite an existing predictor! path="agModels-covidforecast"
    presets is set to be low_quality
    Training with dataset in tabular format...
    Finish rebuilding the data, showing the top five rows.
               name  2020-01-22  2020-01-23  2020-01-24  2020-01-25  2020-01-26  \
    0  Afghanistan_         0.0         0.0         0.0         0.0         0.0
    1      Albania_         0.0         0.0         0.0         0.0         0.0
    2      Algeria_         0.0         0.0         0.0         0.0         0.0
    3      Andorra_         0.0         0.0         0.0         0.0         0.0
    4       Angola_         0.0         0.0         0.0         0.0         0.0

       2020-01-27  2020-01-28  2020-01-29  2020-01-30  ...  2020-03-24  \
    0         0.0         0.0         0.0         0.0  ...        74.0
    1         0.0         0.0         0.0         0.0  ...       123.0
    2         0.0         0.0         0.0         0.0  ...       264.0
    3         0.0         0.0         0.0         0.0  ...       164.0
    4         0.0         0.0         0.0         0.0  ...         3.0

       2020-03-25  2020-03-26  2020-03-27  2020-03-28  2020-03-29  2020-03-30  \
    0        84.0        94.0       110.0       110.0       120.0       170.0
    1       146.0       174.0       186.0       197.0       212.0       223.0
    2       302.0       367.0       409.0       454.0       511.0       584.0
    3       188.0       224.0       267.0       308.0       334.0       370.0
    4         3.0         4.0         4.0         5.0         7.0         7.0

       2020-03-31  2020-04-01  2020-04-02
    0       174.0       237.0       273.0
    1       243.0       259.0       277.0
    2       716.0       847.0       986.0
    3       376.0       390.0       428.0
    4         7.0         8.0         8.0

    [5 rows x 73 columns]
    Validation data is None, will do auto splitting...
    Finished processing data, using 0.3105947971343994s.
    Random seed set to 0
    All models will be trained for quantiles [0.1, 0.5, 0.9].
    Beginning AutoGluon training ...
    AutoGluon will save models to agModels-covidforecast/
    Fitting model: SFF ...
    Training model SFF...
    Start model training
    Epoch[0] Learning rate is 0.001
    ...

Now let's load some more recent test data to examine the forecasting performance of our trained models:

.. code:: python

    test_data = TabularDataset("https://autogluon.s3-us-west-2.amazonaws.com/datasets/CovidTimeSeries/test.csv")

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon.s3-us-west-2.amazonaws.com/datasets/CovidTimeSeries/test.csv | Columns = 3 / 3 | Rows = 28483 -> 28483

The below code is unnecessary here, but is included to demonstrate how to reload a trained Predictor object from file (for example, in a new Python session):

.. code:: python

    predictor = ForecastingPredictor.load(save_path)

.. parsed-literal::
    :class: output

    Loading predictor from path agModels-covidforecast/
We can view the test performance of each model AutoGluon has trained via the ``leaderboard()`` function, where higher scores correspond to better predictive performance (in this case, where the evaluation metric corresponds to a loss, we append a negative sign to the loss to ensure higher = better):

.. code:: python

    predictor.leaderboard(test_data)

.. parsed-literal::
    :class: output

    Generating leaderboard for all models trained...
    Additional data provided, testing on the additional data...
    100%|██████████| 313/313 [00:00<00:00, 4206.30it/s]
    100%|██████████| 313/313 [00:00<00:00, 4018.07it/s]
    Running evaluation: 100%|██████████| 313/313 [00:00<00:00, 1117.67it/s]
    100%|██████████| 313/313 [00:00<00:00, 2562.23it/s]
    100%|██████████| 313/313 [00:00<00:00, 3965.32it/s]
    Running evaluation: 100%|██████████| 313/313 [00:00<00:00, 1111.00it/s]
    100%|██████████| 313/313 [00:00<00:00, 318.32it/s]
    100%|██████████| 313/313 [00:00<00:00, 3907.89it/s]
    Running evaluation: 100%|██████████| 313/313 [00:00<00:00, 1133.81it/s]
.. parsed-literal::
    :class: output

        model  val_score  fit_order  test_score
    0     SFF  -0.676774          1   -0.278595
    1  DeepAR  -0.836025          3   -0.747373
    2   MQCNN  -0.928002          2   -0.878101

Here ``test_score`` quantifies the performance of predictions on the held-out part of the test data (time points after the latest time observed in the original training data), while ``val_score`` quantifies the performance of predictions on an internal validation set that AutoGluon held out during ``fit()``. By default, the validation set is comprised of the latest time-points in ``train_data``, but you can also manually provide your own validation data through the ``fit()`` argument ``val_data``. You can also call ``predictor.leaderboard()`` without any ``test_data`` argument to only display ``val_score``.

By default, AutoGluon scores probabilistic forecasts of multiple time-series via the weighted quantile loss, but you can specify a different ``eval_metric`` in ``fit()`` to instruct AutoGluon to optimize for a different evaluation metric instead (e.g., ``eval_metric="MAPE"``). For more details about the individual time-series models that AutoGluon can train, you can view the GluonTS documentation or the AutoGluon source code folder ``autogluon/forecasting/models/``.
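As an illustrative sketch only (the hypothetical ``my_val.csv`` is assumed to follow the same Date / name / ConfirmedCases layout as ``train.csv``; check the AutoGluon documentation for the exact set of supported ``eval_metric`` strings), such a customized ``fit()`` call might look like:

.. code:: python

    # Illustrative sketch: provide our own validation data and a different metric.
    val_data = TabularDataset("my_val.csv")  # hypothetical local validation file

    predictor = ForecastingPredictor(path="agModels-covidforecast-mape").fit(
        train_data,
        prediction_length=19,
        index_column="name",
        target_column="ConfirmedCases",
        time_column="Date",
        val_data=val_data,    # use our own validation set instead of auto-splitting
        eval_metric="MAPE",   # optimize MAPE rather than the default weighted quantile loss
    )

    predictor.leaderboard()   # without test_data, only val_score is shown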
We can also make forecasts further into the future based on the most recent data. When we call ``predict()``, AutoGluon automatically forecasts with the model that had the best validation performance during training (this is the model at the top of ``leaderboard()`` when called without any data). The predictions returned by ``predict()`` form a dictionary whose keys index each time series (in this example, country) and whose values are DataFrames containing quantile forecasts for each time series (in this example, predicted quantiles of the case counts in each country at dates subsequent to those observed in the ``test_data``).

.. code:: python

    predictions = predictor.predict(test_data)
    print(predictions['Afghanistan_'])  # quantile forecasts for the Afghanistan time-series

.. parsed-literal::
    :class: output

    Does not specify model, will by default use the model with the best validation score for prediction
    Predicting with model SFF

.. parsed-literal::
    :class: output

                        0.1          0.5          0.9
    2020-04-22   238.829971  1182.778687  1932.760254
    2020-04-23   292.640137  1286.876831  2406.811279
    2020-04-24   129.684311  1362.938843  2444.654297
    2020-04-25     1.438057  1273.847778  2290.441895
    2020-04-26   263.676788  1384.483032  2708.584229
    2020-04-27   215.322830  1575.142334  3049.258789
    2020-04-28    73.187408  1584.535278  2793.485840
    2020-04-29  -237.644989  1579.732056  3183.700439
    2020-04-30    84.962097  1472.296021  3427.439941
    2020-05-01  -736.102356  1614.746216  3914.478516
    2020-05-02   134.536484  1722.437500  3610.033447
    2020-05-03  -392.093353  1552.886841  3763.156250
    2020-05-04   284.896149  1357.460815  3524.158203
    2020-05-05  -886.666504  1731.957031  4186.833008
    2020-05-06  -824.011108  1287.756348  3940.486572
    2020-05-07 -1039.792236  1383.604980  4043.714844
    2020-05-08 -1171.404175  1779.110962  3805.483154
    2020-05-09 -1153.900146  1306.392456  3292.706787
    2020-05-10  -945.160034  1354.559814  4442.953125

Instead of forecasting with the model that had the best validation score, you can specify which model to use for prediction, as well as instruct AutoGluon to only predict certain time-series of interest:

.. code:: python

    model_touse = "MQCNN"
    time_series_to_predict = ["Germany_", "Zimbabwe_"]
    predictions = predictor.predict(test_data, model=model_touse, time_series_to_predict=time_series_to_predict)

.. parsed-literal::
    :class: output

    Predicting with model MQCNN
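In either case, each entry of the returned ``predictions`` dictionary is an ordinary DataFrame (one quantile column per entry of ``quantiles``, one row per future date), so standard pandas operations apply. A minimal sketch, assuming the column order follows the ``quantiles`` list passed to ``fit()`` as in the printed output above:

.. code:: python

    # Inspect which time-series were predicted (the dictionary keys).
    print(list(predictions.keys()))           # here: ['Germany_', 'Zimbabwe_']

    germany_forecast = predictions["Germany_"]
    print(germany_forecast.columns.tolist())  # the quantile columns, here 0.1 / 0.5 / 0.9

    # The middle column holds the 0.5-quantile (median) forecast, i.e. the point prediction.
    median_forecast = germany_forecast.iloc[:, 1]
    print(median_forecast.head())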
In ``predict()``, AutoGluon makes predictions for ``prediction_length`` (= 19 in this example) time points into the future, after the **last** time observed in the dataset fed into ``predict()``. In ``evaluate()`` and ``leaderboard()``, AutoGluon makes predictions for the first ``prediction_length`` time points exceeding the last time observed in the ``train_data`` originally fed into ``fit()``, and then scores these predictions against the target values at the corresponding times in the dataset fed into these methods. Because some models base their predictions on lengthy histories, it is important that, in either case, the ``test_data`` you provide contains the ``train_data`` as a subset! You can verify that the ``train_data`` are contained within the ``test_data`` in the example above.

After you no longer need a particular trained Predictor, remember to delete the ``save_path`` folder to free disk space on your machine. This is especially important to avoid running out of space if you train many Predictors in sequence.
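For example, one simple way to remove the saved models from within Python is a short sketch using the standard library (adjust the path if you changed ``save_path``):

.. code:: python

    import shutil

    # Delete the folder holding the trained models once the predictor is no longer needed.
    shutil.rmtree("agModels-covidforecast", ignore_errors=True)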