Anomaly Detection Analysis
Anomaly detection is a powerful technique used in data analysis and machine learning to identify unusual patterns or behaviors that deviate from the norm. These deviations, known as anomalies or outliers, can indicate errors, fraud, system failures, or other exceptional events. By detecting anomalies early, organizations can take proactive measures to address potential issues, enhance security, optimize processes, and make more informed decisions. In this tutorial, we will introduce the anomaly detection tools available in the AutoGluon EDA package and show how to identify these irregularities within your data, even if you're new to the subject.
import pandas as pd
import seaborn as sns
import autogluon.eda.auto as auto
Loading and pre-processing the data
First, we will load the data. For this tutorial we use the Titanic dataset.
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'
auto.detect_anomalies will automatically preprocess the data, but it doesn’t fill in missing numeric values by default. We’ll take care of filling those in ourselves before feeding the data into the anomaly detector.
x = df_train
x_test = df_test
# Assign the filled columns back; calling fillna(..., inplace=True) on an
# attribute-accessed column is deprecated in recent pandas versions.
x['Age'] = x['Age'].fillna(x['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x['Fare'].mean())
Running Initial Anomaly Analysis
# This parameter specifies how many standard deviations above the mean anomaly
# score a point must be to be flagged as an anomaly (only used for
# visualization; it does not affect score calculation).
threshold_stds = 3
auto.detect_anomalies(
train_data=x,
test_data=x_test,
label=target_col,
threshold_stds=threshold_stds,
show_top_n_anomalies=None,
fig_args={
'figsize': (6, 4)
},
chart_args={
'normal.color': 'lightgrey',
'anomaly.color': 'orange',
}
)
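The threshold_stds parameter only draws the cutoff line for the chart: a point is highlighted as an anomaly when its score exceeds the mean score by the given number of standard deviations. A minimal sketch of that rule, using made-up scores (not output from the detector):

```python
import numpy as np

# Made-up anomaly scores: twenty ordinary points and one extreme one.
scores = np.array([1.0] * 20 + [8.0])

threshold_stds = 3
cutoff = scores.mean() + threshold_stds * scores.std()
flagged = np.where(scores > cutoff)[0]
print(flagged)  # only the extreme point crosses the cutoff
```

Raising threshold_stds highlights fewer, more extreme points; lowering it highlights more. The scores themselves are unchanged either way.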
Handling Covariate Shift
The test data chart shows anomaly scores that grow steadily as we move through the records. That is not expected behavior; let's check for covariate shift.
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)
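The idea behind this kind of check is a classifier two-sample test: train a classifier to tell training rows apart from test rows, and if it performs much better than chance (ROC AUC well above 0.5), the two feature distributions differ. A rough, self-contained sketch of the idea using scikit-learn (shift_auc is a hypothetical helper, not part of AutoGluon):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def shift_auc(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Mean cross-validated ROC AUC of a classifier separating train from test rows."""
    both = pd.concat([train, test], ignore_index=True)
    both = both.select_dtypes('number').fillna(0)  # keep the sketch numeric-only
    labels = np.r_[np.zeros(len(train)), np.ones(len(test))]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, both, labels, cv=3, scoring='roc_auc').mean()

# Simulate a shifted feature: the test distribution is centered elsewhere.
rng = np.random.default_rng(0)
train = pd.DataFrame({'fare': rng.normal(30, 10, 600)})
test = pd.DataFrame({'fare': rng.normal(60, 10, 300)})
print(shift_auc(train, test))  # close to 1.0: strong covariate shift
```

A feature like PassengerId, which increases monotonically across the train/test split, makes the two sets almost perfectly separable and is exactly the kind of thing such a detector flags.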
ax = sns.lineplot(data=df_train[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Train')
sns.lineplot(ax=ax, data=df_test[['PassengerId']].reset_index(), x='index', y='PassengerId', label='Test');
This feature looks like a monotonically increasing ID and carries no predictive value for our problem, so we will remove it.
x = x.drop(columns=['PassengerId'], errors='ignore')
x_test = x_test.drop(columns=['PassengerId'], errors='ignore')
auto.covariate_shift_detection(train_data=x, test_data=x_test, label=target_col)
Run Anomaly Analysis on Cleaned Data
state = auto.detect_anomalies(
train_data=x,
test_data=x_test,
label=target_col,
threshold_stds=3,
show_top_n_anomalies=5,
explain_top_n_anomalies=1,
return_state=True,
show_help_text=False,
fig_args={
'figsize': (6, 4)
},
chart_args={
'normal.color': 'lightgrey',
'anomaly.color': 'orange',
}
)
Visualize Anomalies
As we can see from the feature impact charts, the anomaly scores are primarily influenced by the Fare and Age features. Let’s take a look at a visual slice of the feature space. We can get the scores from state under anomaly_detection.scores.<dataset> keys:
train_anomaly_scores = state.anomaly_detection.scores.train_data
test_anomaly_scores = state.anomaly_detection.scores.test_data
auto.analyze_interaction(train_data=df_train.join(train_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
auto.analyze_interaction(train_data=df_test.join(test_anomaly_scores), x="Fare", y="Age", hue="score", chart_args=dict(palette='viridis'))
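Because the scores come back as regular pandas objects, ranking the most anomalous rows is a simple join-and-sort, the same pattern as the joins above. A sketch with stand-in data (the column is named score, as the hue argument above relies on):

```python
import pandas as pd

# Stand-ins for df_train and the detector's per-row scores.
df = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 512.33], 'Age': [22, 38, 35, 35]})
scores = pd.Series([0.4, 0.9, 0.5, 2.7], name='score')

# Highest-scoring rows first; with real data this would be
# df_train.join(train_anomaly_scores).
top = df.join(scores).sort_values('score', ascending=False).head(2)
print(top)
```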
Some of the flagged data points in the lower-left corner don't appear to be anomalies. However, we are only looking at a two-dimensional slice of the 11-dimensional data: a point that looks ordinary in this slice can still stand out in other dimensions.
In conclusion, in this tutorial we’ve guided you through the process of using AutoGluon for anomaly detection. We’ve covered how to automatically detect anomalies with just a few lines of code. We also explored finding and visualizing the top detected anomalies, which can help you better understand and address the underlying issues. Lastly, we explored how to find the main contributing factors that led to a data point being marked as an anomaly, allowing you to pinpoint the root causes and take appropriate action.