Anomaly Detection Analysis#


Anomaly detection is a powerful technique used in data analysis and machine learning to identify unusual patterns or behaviors that deviate from the norm. These deviations, known as anomalies or outliers, can be indicative of errors, fraud, system failures, or other exceptional events. By detecting these anomalies early, organizations can take proactive measures to address potential issues, enhance security, optimize processes, and make more informed decisions. In this tutorial, we will introduce the anomaly detection tools available in the AutoGluon EDA package and showcase how to identify these irregularities within your data, even if you’re new to the subject.

!pip install autogluon.eda
!pip install autogluon.tabular[lightgbm]
import pandas as pd
import seaborn as sns

import autogluon.eda.auto as auto

First we will load the data. We will use the Titanic dataset.

df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'

auto.detect_anomalies will automatically preprocess the data, but it doesn’t fill in missing numeric values by default. We’ll take care of filling those in ourselves before feeding the data into the anomaly detector.

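# x and x_test are aliases of df_train and df_test, so the imputed values
# are also visible through the original DataFrames
x = df_train
x_test = df_test
# Fill missing numeric values with training-set means
# (assigning the result back avoids a chained inplace fillna)
x['Age'] = x['Age'].fillna(x['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x['Fare'].mean())
# This parameter specifies how many standard deviations above the mean anomaly score a point must be to count as an anomaly
# (only needed for visualization; it does not affect score calculation)
threshold_stds = 3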
auto.detect_anomalies(
    train_data=x,
    test_data=x_test,
    label=target_col,
    threshold_stds=threshold_stds,
    show_top_n_anomalies=None,
    fig_args={
        'figsize': (6, 4)
    },
    chart_args={
        'normal.color': 'lightgrey',
        'anomaly.color': 'orange',
    }
)

Anomaly Detection Report

When interpreting anomaly scores, consider:

  • Threshold: Determine a suitable threshold to separate normal from anomalous data points, based on domain knowledge or statistical methods.

  • Context: Examine the context of anomalies, including time, location, and surrounding data points, to identify possible causes.

  • False positives/negatives: Be aware of the trade-offs between false positives (normal points classified as anomalies) and false negatives (anomalies missed).

  • Feature relevance: Ensure the features used for anomaly detection are relevant and contribute to the model’s performance.

  • Model performance: Regularly evaluate and update the model to maintain its accuracy and effectiveness.

It’s important to understand the context and domain knowledge before deciding on an appropriate approach to deal with anomalies. The choice of method depends on the data’s nature, the cause of anomalies, and the problem being addressed. Common ways to deal with anomalies:

  • Removal: If an anomaly is a result of an error, noise, or irrelevance to the analysis, it can be removed from the dataset to prevent it from affecting the model’s performance.

  • Imputation: Replace anomalous values with appropriate substitutes, such as the mean, median, or mode of the feature, or by using more advanced techniques like regression or k-nearest neighbors.

  • Transformation: Apply transformations like log, square root, or z-score to normalize the data and reduce the impact of extreme values. Absolute dates might be transformed into relative features like age of the item.

  • Capping: Set upper and lower bounds for a feature, and replace values outside these limits with the bounds themselves. This method is also known as winsorizing.

  • Separate modeling: Treat anomalies as a distinct group and build a separate model for them, or use specialized algorithms designed for handling outliers, such as robust regression or one-class SVM.

  • Incorporate as a feature: Create a new binary feature indicating the presence of an anomaly, which can be useful if anomalies have predictive value.

Use show_help_text=False to hide this information when calling this function.
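
Three of the strategies listed above (capping, an indicator feature, and imputation) can be sketched with plain pandas. This is only an illustration on hypothetical fare values; the 5th and 95th percentile bounds are an arbitrary assumption, not something AutoGluon prescribes.

```python
import pandas as pd

# Hypothetical fare values; the last one is an extreme outlier.
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 512.33], name="Fare")

# Capping (winsorizing): clip values to the 5th and 95th percentiles.
lower, upper = fares.quantile([0.05, 0.95])
capped = fares.clip(lower=lower, upper=upper)

# Indicator feature: flag the values that fall outside the bounds.
is_outlier = (fares < lower) | (fares > upper)

# Imputation: replace the flagged values with the column median.
imputed = fares.mask(is_outlier, fares.median())

print(pd.DataFrame({"Fare": fares, "capped": capped,
                    "is_outlier": is_outlier, "imputed": imputed}))
```

Which of these is appropriate depends on why the value is unusual; a genuine but rare fare is usually better kept and flagged than overwritten.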

train_data anomalies for 3-sigma outlier scores

../../_images/c6fdc8c283955deddf1bd0c43566aa41f99b5b956582a9ff58523c80be53b500.png

test_data anomalies for 3-sigma outlier scores

../../_images/9245d65d09a7a11df840fa18e0c0f1377b80865df4922a6346d9070df6938348.png
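
To make the "3-sigma" wording in the chart titles concrete, here is a minimal sketch of such a cutoff. It assumes the threshold is the mean score plus `threshold_stds` standard deviations, as the code comment above describes; the scores are synthetic, and the exact computation inside AutoGluon may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical anomaly scores: 200 ordinary values plus two injected extremes.
scores = np.concatenate([rng.normal(loc=1.0, scale=0.2, size=200), [3.5, 4.0]])

threshold_stds = 3
# Cutoff: mean score plus threshold_stds standard deviations.
threshold = scores.mean() + threshold_stds * scores.std()
is_anomaly = scores > threshold

print(f"threshold={threshold:.3f}, flagged={int(is_anomaly.sum())}")
```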

The test data chart appears to show increasing anomaly scores as we move through the records. This is not normal; let’s check for a covariate shift.

auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)

We detected a substantial difference between the training and test X distributions, a type of distribution shift.

Test results: We can predict whether a sample is in the test vs. training set with a roc_auc of 0.9999 with a p-value of 0.0010 (smaller than the threshold of 0.0100).

Feature importances: The variables that are the most responsible for this shift are those with high feature importance:

             importance  stddev    p_value   n  p99_high  p99_low
PassengerId  0.480003    0.031567  0.000002  5  0.545000  0.415006
Name         0.000167    0.000091  0.007389  5  0.000355  -0.000020

PassengerId values distribution between datasets; p-value: 0.0000

../../_images/321623d3e29143ba9ea2507b496a3ac6b3801c8bc0eb120fbbd7182f211dbed3.png
ax = sns.lineplot(data=df_train[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Train')
sns.lineplot(ax=ax, data=df_test[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Test');
../../_images/22c653ca670de5156d7c4c4259584197291859dbf9e849117b1e34d1def0fa18.png
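
The impression given by the plot can also be checked programmatically. The sketch below uses small hypothetical frames rather than the Titanic data, and `looks_like_row_id` is an illustrative helper, not part of AutoGluon.

```python
import pandas as pd

# Hypothetical frames mimicking a train/test split of sequentially numbered rows.
train = pd.DataFrame({"PassengerId": [1, 2, 3, 4, 5]})
test = pd.DataFrame({"PassengerId": [6, 7, 8]})


def looks_like_row_id(column: pd.Series) -> bool:
    """Heuristic: unique, ordered values suggest a row identifier."""
    return bool(column.is_unique and column.is_monotonic_increasing)


print(looks_like_row_id(train["PassengerId"]), looks_like_row_id(test["PassengerId"]))
# Non-overlapping ranges make the two sets trivially separable on this column.
print(train["PassengerId"].max() < test["PassengerId"].min())
```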

This feature looks like a monotonically increasing ID and carries no value for our problem; we are going to remove it.

x = x.drop(columns=['PassengerId'], errors='ignore')
x_test = x_test.drop(columns=['PassengerId'], errors='ignore')
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)

We did not detect a substantial difference between the training and test X distributions.
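
The idea behind the reported roc_auc, training a classifier to tell training rows from test rows, can be sketched with scikit-learn. This is not AutoGluon's implementation: the data is synthetic, `row_id` and `noise` are hypothetical columns, the random forest is an arbitrary choice, and the p-value part of the report is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_train, n_test = 200, 100

# Hypothetical data: a sequential id that differs between the two sets,
# plus a feature drawn from the same distribution in both.
features = pd.DataFrame({
    "row_id": np.arange(n_train + n_test),
    "noise": rng.normal(size=n_train + n_test),
})
# Domain label: 0 for training rows, 1 for test rows.
is_test = np.r_[np.zeros(n_train, dtype=int), np.ones(n_test, dtype=int)]


def shift_auc(X: pd.DataFrame) -> float:
    """Cross-validated ROC AUC of a classifier separating train from test rows."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return float(cross_val_score(clf, X, is_test, cv=cv, scoring="roc_auc").mean())


auc_with_id = shift_auc(features)
auc_without_id = shift_auc(features.drop(columns=["row_id"]))
print(f"with row_id: {auc_with_id:.3f}, without row_id: {auc_without_id:.3f}")
```

An AUC close to 1 means the two sets are easy to tell apart, while a value near 0.5 means the classifier does no better than chance, which is what we expect once the ID column is gone.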

Run Anomalies Analysis#

state = auto.detect_anomalies(
    train_data=x,
    test_data=x_test,
    label=target_col,
    threshold_stds=3,
    show_top_n_anomalies=5,
    explain_top_n_anomalies=3,
    return_state=True,
    show_help_text=False,
    fig_args={
        'figsize': (6, 4)
    },
    chart_args={
        'normal.color': 'lightgrey',
        'anomaly.color': 'orange',
    }    
)

Anomaly Detection Report

train_data anomalies for 3-sigma outlier scores

../../_images/eb41cbc12b5dc38c5e57d9a334df4f3f35c3045be8f97ac8703f5bdef05a060c.png

test_data anomalies for 3-sigma outlier scores

../../_images/8ca255a14df4c9636a2bf1d6e263557511418bcd4306a79315017a6fdb5f4db2.png

Top-5 train_data anomalies (total: 15)

     Survived  Pclass  Name                                        Sex     Age        SibSp  Parch  Ticket    Fare      Cabin        Embarked  score
679  1         1       Cardeza, Mr. Thomas Drake Martinez          male    36.000000  0      1      PC 17755  512.3292  B51 B53 B55  C         3.612053
258  1         1       Ward, Miss. Anna                            female  35.000000  0      0      PC 17755  512.3292  NaN          C         2.701831
737  1         1       Lesurer, Mr. Gustave J                      male    35.000000  0      0      PC 17755  512.3292  B101         C         2.693982
732  0         2       Knight, Mr. Robert J                        male    29.699118  0      0      239855    0.0000    NaN          S         2.432819
251  0         3       Strom, Mrs. Wilhelm (Elna Matilda Persson)  female  29.000000  1      1      347054    10.4625   G6           S         2.413583

⚠️ Please note that the feature values shown on the charts below are transformed into an internal representation; they may be encoded or modified based on internal preprocessing. Refer to the original datasets for the actual feature values.

⚠️ The detector has seen this dataset; this may result in overly optimistic estimates. Although the anomaly score in the explanation might not match, the magnitude of the feature scores can still be utilized to evaluate the impact of the feature on the anomaly score.

../../_images/6e2153363f6f28d2a785aaa0571e7b1658d05a75843fca053e3fcc21f83a5ead.png ../../_images/4bdb3170dec3094b5f3a4416a9d39cb6d3fb6bfdb0d9a5091527081452afb00b.png ../../_images/127ee713b7f8ae334b956de7bc10d2d769b600846bb0e244947cc0460f4fb1b6.png

Top-5 test_data anomalies (total: 4)

     Pclass  Name                                               Sex     Age        SibSp  Parch  Ticket    Fare      Cabin        Embarked  score
343  1       Cardeza, Mrs. James Warburton Martinez (Charlo...  female  58.000000  0      1      PC 17755  512.3292  B51 B53 B55  C         3.214611
266  1       Chisholm, Mr. Roderick Robert Crispin              male    29.699118  0      0      112051    0.0000    NaN          S         2.398288
263  3       Klasen, Miss. Gertrud Emilia                       female  1.000000   1      1      350405    12.1833   NaN          S         1.940198
307  3       Aks, Master. Philip Frank                          male    0.830000   0      1      392091    9.3500    NaN          S         1.929314

⚠️ Please note that the feature values shown on the charts below are transformed into an internal representation; they may be encoded or modified based on internal preprocessing. Refer to the original datasets for the actual feature values.

../../_images/ade1961dedddd8d5723a6542d0722d2b4a63eae250bceea0ece316e8a1335c5d.png ../../_images/4cb3568da25108df763e6c75cbbe43b6ec9c83ad7e5cfd60a3a44f11717af205.png ../../_images/b0cc4692c26b0028332953f4a704f86470174e37575436bb3f42f7754ed2f216.png

Visualize Anomalies#

As we can see from the feature impact charts, the anomaly scores are primarily influenced by the Fare and Age features. Let’s take a look at a visual slice of the feature space. We can get the scores from state under anomaly_detection.scores.<dataset> keys:

train_anomaly_scores = state.anomaly_detection.scores.train_data
test_anomaly_scores = state.anomaly_detection.scores.test_data
auto.analyze_interaction(train_data=df_train.join(train_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
../../_images/fe0409b310a2103dd02a59098fc85f0df57cb860b188c75cb2f4faf5447fa31f.png
auto.analyze_interaction(train_data=df_test.join(test_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
../../_images/683e90966c9010d27d29a9658b46c62fdb4d89b72f8e0cd908ecd791506388a5.png
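
The `join` calls above rely on index alignment between the data and the scores. The sketch below shows the same pattern on hypothetical values; it assumes the scores behave like a pandas Series named `score` that shares the DataFrame's index, which is what the `hue="score"` argument suggests.

```python
import pandas as pd

# Hypothetical data and scores sharing the same (non-default) index.
data = pd.DataFrame(
    {"Fare": [7.25, 512.33, 13.0], "Age": [22.0, 36.0, 29.0]},
    index=[10, 11, 12],
)
scores = pd.Series([0.4, 3.6, 0.9], index=[10, 11, 12], name="score")

# join() aligns on the index, so each row receives its own score.
scored = data.join(scores)

# Rank rows by score to list the most anomalous ones first.
top_anomalies = scored.sort_values("score", ascending=False).head(2)
print(top_anomalies)
```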

The data points in the lower left corner don’t appear to be anomalies. However, this is only because we are looking at a two-dimensional slice of the 11-dimensional data. While these points might not look like anomalies in this slice, they are salient in other dimensions.
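
A small synthetic example shows how a point can look ordinary in one slice and still stand out elsewhere. The three-feature data below is hypothetical, and per-feature z-scores are used only as a simple stand-in for an anomaly score.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-feature data; the last row is ordinary in the first two
# features and extreme only in the third.
points = rng.normal(size=(500, 3))
points[-1] = [0.1, -0.2, 8.0]

# Per-feature absolute z-scores of the last row.
z = np.abs((points[-1] - points.mean(axis=0)) / points.std(axis=0))
print(np.round(z, 2))
```

A scatter plot of the first two features would place this row in the middle of the cloud, even though its third feature is several standard deviations from the mean.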

In conclusion, this tutorial walked through the process of using AutoGluon for anomaly detection. We covered how to automatically detect anomalies with just a few lines of code. We also explored finding and visualizing the top detected anomalies, which can help you better understand and address the underlying issues. Lastly, we looked at how to find the main contributing factors that led to a data point being marked as an anomaly, allowing you to pinpoint the root causes and take appropriate action.