Anomaly Detection Analysis
Anomaly detection is a powerful technique used in data analysis and machine learning to identify unusual patterns or behaviors that deviate from the norm. These deviations, known as anomalies or outliers, can indicate errors, fraud, system failures, or other exceptional events. By detecting anomalies early, organizations can take proactive measures to address potential issues, enhance security, optimize processes, and make more informed decisions. In this tutorial, we will introduce the anomaly detection tools available in the AutoGluon EDA package and show how to identify these irregularities within your data, even if you're new to the subject.
import pandas as pd
import seaborn as sns
import autogluon.eda.auto as auto
Loading and pre-processing the data
First, we will load the data. For this tutorial we use the Titanic dataset.
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'
auto.detect_anomalies will automatically preprocess the data, but it doesn’t fill in missing numeric values by default. We’ll take care of filling those in ourselves before feeding the data into the anomaly detector.
x = df_train
x_test = df_test
# Assign the filled columns back; calling fillna(..., inplace=True) on an
# attribute-accessed column is deprecated in recent pandas versions.
x['Age'] = x['Age'].fillna(x['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x['Fare'].mean())
Running Initial Anomaly Analysis
# This parameter specifies how many standard deviations above the mean anomaly
# score a point must be to be flagged as an anomaly (only used for
# visualization; it does not affect score calculation).
threshold_stds = 3
auto.detect_anomalies(
train_data=x,
test_data=x_test,
label=target_col,
threshold_stds=threshold_stds,
show_top_n_anomalies=None,
fig_args={
'figsize': (6, 4)
},
chart_args={
'normal.color': 'lightgrey',
'anomaly.color': 'orange',
}
)
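The threshold_stds parameter only draws the cutoff line for the chart: a point is highlighted as an anomaly when its score exceeds the mean score by the given number of standard deviations. A minimal sketch of that rule, using made-up scores (not output from the detector):

```python
import numpy as np

# Made-up anomaly scores: twenty ordinary points and one extreme one.
scores = np.array([1.0] * 20 + [8.0])

threshold_stds = 3
cutoff = scores.mean() + threshold_stds * scores.std()
flagged = np.where(scores > cutoff)[0]
print(flagged)  # only the extreme point crosses the cutoff
```

Raising threshold_stds highlights fewer, more extreme points; lowering it highlights more. The scores themselves are unchanged either way.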
Handling Covariate Shift
The test data chart shows anomaly scores that grow steadily as we move through the records. That is not expected behavior; let's check for covariate shift.
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)
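The idea behind this kind of check is a classifier two-sample test: train a classifier to tell training rows apart from test rows, and if it performs much better than chance (ROC AUC well above 0.5), the two feature distributions differ. A rough, self-contained sketch of the idea using scikit-learn (shift_auc is a hypothetical helper, not part of AutoGluon):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def shift_auc(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Mean cross-validated ROC AUC of a classifier separating train from test rows."""
    both = pd.concat([train, test], ignore_index=True)
    both = both.select_dtypes('number').fillna(0)  # keep the sketch numeric-only
    labels = np.r_[np.zeros(len(train)), np.ones(len(test))]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, both, labels, cv=3, scoring='roc_auc').mean()

# Simulate a shifted feature: the test distribution is centered elsewhere.
rng = np.random.default_rng(0)
train = pd.DataFrame({'fare': rng.normal(30, 10, 600)})
test = pd.DataFrame({'fare': rng.normal(60, 10, 300)})
print(shift_auc(train, test))  # close to 1.0: strong covariate shift
```

A feature like PassengerId, which increases monotonically across the train/test split, makes the two sets almost perfectly separable and is exactly the kind of thing such a detector flags.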
ax = sns.lineplot(data=df_train[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Train')
sns.lineplot(ax=ax, data=df_test[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Test');
This feature looks like a monotonically increasing ID and carries no predictive value for our problem, so we will remove it.
x = x.drop(columns=['PassengerId'], errors='ignore')
x_test = x_test.drop(columns=['PassengerId'], errors='ignore')
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)
Run Anomaly Analysis on Cleaned Data
state = auto.detect_anomalies(
train_data=x,
test_data=x_test,
label=target_col,
threshold_stds=3,
show_top_n_anomalies=5,
explain_top_n_anomalies=1,
return_state=True,
show_help_text=False,
fig_args={
'figsize': (6, 4)
},
chart_args={
'normal.color': 'lightgrey',
'anomaly.color': 'orange',
}
)
Visualize Anomalies
As we can see from the feature impact charts, the anomaly scores are primarily influenced by the Fare and Age features. Let’s take a look at a visual slice of the feature space. We can get the scores from state under anomaly_detection.scores.<dataset> keys:
train_anomaly_scores = state.anomaly_detection.scores.train_data
test_anomaly_scores = state.anomaly_detection.scores.test_data
auto.analyze_interaction(train_data=df_train.join(train_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
auto.analyze_interaction(train_data=df_test.join(test_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
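Because the scores come back as regular pandas objects, ranking the most anomalous rows is a simple join-and-sort, the same pattern as the joins above. A sketch with stand-in data (the column is named score, as the hue argument above relies on):

```python
import pandas as pd

# Stand-ins for df_train and the detector's per-row scores.
df = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 512.33], 'Age': [22, 38, 35, 35]})
scores = pd.Series([0.4, 0.9, 0.5, 2.7], name='score')

# Highest-scoring rows first; with real data this would be
# df_train.join(train_anomaly_scores).
top = df.join(scores).sort_values('score', ascending=False).head(2)
print(top)
```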
Some of the flagged data points in the lower-left corner don't appear to be anomalies. However, we are only looking at a two-dimensional slice of the 11-dimensional data: a point that looks ordinary in this slice can still stand out in other dimensions.
In conclusion, in this tutorial we’ve guided you through the process of using AutoGluon for anomaly detection. We’ve covered how to automatically detect anomalies with just a few lines of code. We also explored finding and visualizing the top detected anomalies, which can help you better understand and address the underlying issues. Lastly, we explored how to find the main contributing factors that led to a data point being marked as an anomaly, allowing you to pinpoint the root causes and take appropriate action.