Anomaly Detection Analysis#


Anomaly detection is a powerful technique used in data analysis and machine learning to identify unusual patterns or behaviors that deviate from the norm. These deviations, known as anomalies or outliers, can indicate errors, fraud, system failures, or other exceptional events. By detecting anomalies early, organizations can take proactive measures to address potential issues, enhance security, optimize processes, and make more informed decisions. In this tutorial, we will introduce the anomaly detection tools available in the AutoGluon EDA package and show how to identify these irregularities in your data, even if you're new to the subject.

import pandas as pd
import seaborn as sns

import autogluon.eda.auto as auto

Loading and pre-processing the data#

First, we will load the Titanic dataset.

df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'

auto.detect_anomalies will automatically preprocess the data, but it doesn’t fill in missing numeric values by default. We’ll take care of filling those in ourselves before feeding the data into the anomaly detector.

x = df_train
x_test = df_test
# Fill missing values with the training-set means (including for the test set,
# to avoid leaking test statistics into preprocessing).
x['Age'] = x['Age'].fillna(x['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x['Fare'].mean())

Running Initial Anomaly Analysis#

# This parameter specifies how many standard deviations above the mean anomaly score
# a point must be to be flagged as an anomaly. It is only used for visualization and
# does not affect the score calculation.
threshold_stds = 3
auto.detect_anomalies(
    train_data=x,
    test_data=x_test,
    label=target_col,
    threshold_stds=threshold_stds,
    show_top_n_anomalies=None,
    fig_args={
        'figsize': (6, 4)
    },
    chart_args={
        'normal.color': 'lightgrey',
        'anomaly.color': 'orange',
    }
)
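To make `threshold_stds` concrete: a point is flagged when its anomaly score exceeds the mean score by that many standard deviations. Below is a minimal standard-library sketch of this rule, not AutoGluon's internal code; the scores are made up, and a lower cutoff of 2 is used because the toy sample is tiny.

```python
import statistics

def anomaly_threshold(scores, threshold_stds=3.0):
    """Cutoff: mean score plus `threshold_stds` population standard deviations."""
    return statistics.fmean(scores) + threshold_stds * statistics.pstdev(scores)

# Made-up anomaly scores; one row clearly stands out.
scores = [0.1, 0.12, 0.11, 0.13, 0.1, 0.12, 0.11, 0.6]
cut = anomaly_threshold(scores, threshold_stds=2.0)
flagged = [s for s in scores if s > cut]  # [0.6]
```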

Handling Covariate Shift#

The test data chart shows anomaly scores increasing steadily as we move through the records. Scores should not depend on row order, so this pattern suggests the test distribution differs from the training distribution; let's check for covariate shift.

auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)
ax = sns.lineplot(data=df_train[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Train')
sns.lineplot(ax=ax, data=df_test[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Test');
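Covariate shift detection follows the classifier two-sample idea: label training rows 0 and test rows 1, fit a classifier, and check whether it can tell the two apart (ROC AUC near 0.5 means no detectable shift; near 1.0 means strong shift). The sketch below is not AutoGluon's implementation; it scores a single feature directly and computes ROC AUC by hand, with made-up ID and age values.

```python
def roc_auc(scores_neg, scores_pos):
    """Probability a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# A monotonically increasing ID separates train from test perfectly.
train_ids = [1, 2, 3, 4, 5]
test_ids = [6, 7, 8, 9, 10]             # continues where the train split stopped
id_auc = roc_auc(train_ids, test_ids)   # 1.0 -> strong covariate shift

# A genuinely shared feature is essentially indistinguishable.
train_age = [22, 35, 28, 41, 30]
test_age = [23, 34, 29, 40, 31]
age_auc = roc_auc(train_age, test_age)  # ~0.5 -> no detectable shift
```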

This feature looks like a monotonically increasing ID and carries no value for our problem; we are going to remove it.

x = x.drop(columns=['PassengerId'], errors='ignore')
x_test = x_test.drop(columns=['PassengerId'], errors='ignore')
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)

Run Anomaly Analysis on Cleaned Data#

state = auto.detect_anomalies(
    train_data=x,
    test_data=x_test,
    label=target_col,
    threshold_stds=3,
    show_top_n_anomalies=5,
    explain_top_n_anomalies=1,
    return_state=True,
    show_help_text=False,
    fig_args={
        'figsize': (6, 4)
    },
    chart_args={
        'normal.color': 'lightgrey',
        'anomaly.color': 'orange',
    }    
)
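The `show_top_n_anomalies` option simply surfaces the rows with the highest scores. Conceptually it is just a sort, as in this standard-library sketch with hypothetical row-index-to-score pairs:

```python
# Hypothetical anomaly scores keyed by row index (made-up numbers).
scores = {0: 0.41, 1: 2.87, 2: 0.39, 3: 1.10}

def top_n_anomalies(scores, n):
    """Row indices and scores, highest score first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

top2 = top_n_anomalies(scores, 2)  # [(1, 2.87), (3, 1.1)]
```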

Visualize Anomalies#

As we can see from the feature impact charts, the anomaly scores are primarily influenced by the Fare and Age features. Let’s take a look at a visual slice of the feature space. We can get the scores from state under anomaly_detection.scores.<dataset> keys:

train_anomaly_scores = state.anomaly_detection.scores.train_data
test_anomaly_scores = state.anomaly_detection.scores.test_data
auto.analyze_interaction(train_data=df_train.join(train_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
auto.analyze_interaction(train_data=df_test.join(test_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))

Some of the highlighted points in the lower left corner don't appear to be anomalies. However, this is only because we are looking at a two-dimensional slice of the 11-dimensional data: a point that looks normal in this slice can still stand out in other dimensions.
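To see why a point can look ordinary in one slice yet score high overall, here is a toy standard-library example (not AutoGluon's detector; it uses a simple k-nearest-neighbor distance as the anomaly score). Each coordinate of the outlier is within the normal range on its own, but the combination breaks the joint pattern:

```python
import math

# Points lying close to the diagonal y ~ x, plus one point whose x and y are
# each unremarkable individually but whose combination is anomalous.
points = [(i / 10, i / 10 + 0.02 * (-1) ** i) for i in range(10)]
points.append((0.1, 0.9))

def knn_score(p, data, k=2):
    """Mean distance to the k nearest other points (larger = more anomalous)."""
    dists = sorted(math.dist(p, q) for q in data if q is not p)
    return sum(dists[:k]) / k

scores = [knn_score(p, points) for p in points]
most_anomalous = points[scores.index(max(scores))]  # (0.1, 0.9)
```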

In this tutorial, we walked through using AutoGluon for anomaly detection: automatically detecting anomalies with just a few lines of code, surfacing and visualizing the top detected anomalies, and identifying the main factors that led to a data point being flagged, so you can pinpoint root causes and take appropriate action.