Feature Interaction Charting


This tool provides quick visualization of interactions between variables in a dataset. The user specifies the variables to plot via the x, y and hue (color) parameters. The tool automatically picks a chart type based on the detected variable types and renders 1-, 2- or 3-way interactions.
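As a rough illustration of type-driven chart selection, the sketch below classifies columns as numeric or categorical with pandas and maps the combination to a chart family. The function names `variable_kind` and `suggest_chart`, the cardinality threshold and the chart names are all hypothetical; this is not AutoGluon's actual selection logic.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def variable_kind(series, max_categories=10):
    """Classify a column as 'numeric' or 'category' (hypothetical heuristic)."""
    # Low-cardinality numeric columns (e.g. a class code) are treated as categories.
    if is_numeric_dtype(series) and series.nunique() > max_categories:
        return "numeric"
    return "category"


def suggest_chart(df, x, y=None):
    """Return an illustrative chart family for a 1- or 2-way interaction."""
    kinds = (variable_kind(df[x]), variable_kind(df[y]) if y else None)
    charts = {
        ("numeric", None): "histogram",
        ("category", None): "count plot",
        ("numeric", "numeric"): "scatter plot",
        ("category", "numeric"): "box plot",
        ("numeric", "category"): "box plot",
        ("category", "category"): "grouped count plot",
    }
    return charts[kinds]


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.1, 8.46, 51.86, 21.08, 11.13, 30.07, 16.7, 26.55, 80.0],
    "Embarked": ["S", "C", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "C"],
})
print(suggest_chart(toy, x="Fare"))
print(suggest_chart(toy, x="Embarked", y="Fare"))
```

A hue variable would add a third dimension (color) on top of whichever chart family is chosen for x and y.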

This can be useful for exploring patterns, trends, and outliers, and for identifying potentially good predictors for the task.

Using Interaction Charts for Missing Values Filling

Let’s load the Titanic dataset:

import pandas as pd

df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'

Next we will look at missing data in the variables:

import autogluon.eda.auto as auto

auto.missing_values_analysis(train_data=df_train)

Missing Values Analysis

          missing_count  missing_ratio
Age                 177       0.198653
Cabin               687       0.771044
Embarked              2       0.002245
../../_images/07a0159c7dfeaf7a8104387601c17967a4f8a3508c8e36bf567dc2463ced398a.png
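For reference, the two columns in this table can be reproduced with plain pandas. This is a minimal sketch on a made-up toy frame; the helper name `missing_summary` is hypothetical, and this is not necessarily how `auto.missing_values_analysis` is implemented.

```python
import pandas as pd


def missing_summary(df):
    """Count and ratio of missing values, for columns that have any."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),   # number of NaN cells per column
        "missing_ratio": df.isna().mean(),  # NaN cells divided by row count
    })
    return summary[summary["missing_count"] > 0]


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
print(missing_summary(toy))
```

On the real data the same helper could be called as `missing_summary(df_train)`.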

It looks like there are only two null values in the Embarked feature. Let’s see what those two null values are:

df_train[df_train.Embarked.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

We may be able to fill these by looking at the other variables. Both passengers paid a Fare of $80, travelled in Pclass 1 and are female. Let’s see how Fare is distributed across the Pclass and Embarked values:

auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')
../../_images/01da3cee2cf0671d3bf1ad889086768d3ae17d887c29668587828c434d654e46.png

Among first-class passengers (Pclass 1), the average Fare closest to $80 belongs to those who embarked at C. Let’s fill in the missing values with C.
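A minimal sketch of this step with pandas: compare the mean Fare per Embarked/Pclass group, pick the first-class group closest to $80, then fill the gaps. The toy rows are made up for illustration, and choosing the nearest group mean is a simple heuristic rather than a prescribed method; on the real data the same calls would be applied to `df_train`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 1, 1],
    "Fare":     [85.0, 75.0, 60.0, 50.0, 8.0, 7.0, 80.0, 80.0],
    "Embarked": ["C", "C", "S", "S", "S", "Q", None, None],
})

# Mean Fare per (Embarked, Pclass) group; rows with missing Embarked are skipped.
mean_fare = toy.groupby(["Embarked", "Pclass"])["Fare"].mean()
print(mean_fare)

# First-class group whose mean Fare is closest to the $80 paid by both passengers.
first_class = mean_fare.xs(1, level="Pclass")
closest = (first_class - 80.0).abs().idxmin()
print(closest)

# Fill the missing ports with that value.
toy["Embarked"] = toy["Embarked"].fillna(closest)
```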

Using Interaction Charts To Learn Information About the Data

state = auto.partial_dependence_plots(df_train, label='Survived', return_state=True)
No path specified. Models will be saved in: "AutogluonModels/ag-20230327_183056/"

Partial Dependence Plots

Individual Conditional Expectation (ICE) plots complement Partial Dependence Plots (PDP) by showing the relationship between a feature and the model’s output for each individual instance in the dataset. ICE lines (blue) can be overlaid on PDPs (red) to provide a more detailed view of how the model behaves for specific instances. Here are some points on how to interpret PDPs with ICE lines:

  • Central tendency: The PDP line represents the average prediction for different values of the feature of interest. Look for the overall trend of the PDP line to understand the average effect of the feature on the model’s output.

  • Variability: The ICE lines represent the predicted outcomes for individual instances as the feature of interest changes. Examine the spread of ICE lines around the PDP line to understand the variability in predictions for different instances.

  • Non-linear relationships: Look for any non-linear patterns in the PDP and ICE lines. This may indicate that the model captures a non-linear relationship between the feature and the model’s output.

  • Heterogeneity: Check for instances where ICE lines have widely varying slopes, indicating different relationships between the feature and the model’s output for individual instances. This may suggest interactions between the feature of interest and other features.

  • Outliers: Look for any ICE lines that are very different from the majority of the lines. This may indicate potential outliers or instances that have unique relationships with the feature of interest.

  • Confidence intervals: If available, examine the confidence intervals around the PDP line. Wider intervals may indicate a less certain relationship between the feature and the model’s output, while narrower intervals suggest a more robust relationship.

  • Interactions: By comparing PDPs and ICE plots for different features, you may detect potential interactions between features. If the ICE lines change significantly when comparing two features, this might suggest an interaction effect.

../../_images/be6439cf3970eab19dbb5c7813ad561d5cfd7df962de268519389e0dd398560f.png

The following variable(s) are categorical: Ticket, Cabin, Embarked. They are represented as numbers in the figures above. Mappings are available in state.pdp_id_to_category_mappings. The state can be returned from this call by adding return_state=True.
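The relationship between ICE lines and the PDP line described above can be sketched in a few lines of plain Python: each ICE curve sweeps one feature over a grid for a single instance, and the PDP is the point-wise average of those curves. The names `ice_curves`, `partial_dependence` and `toy_model` are hypothetical and the scoring rule is made up; this is not how `auto.partial_dependence_plots` computes its charts internally.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` is swept over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)        # copy so the original row is untouched
            modified[feature] = value   # override only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """PDP: the point-wise average of the ICE curves."""
    return [sum(points) / len(points) for points in zip(*curves)]


# Hypothetical stand-in model: a made-up scoring rule, not a fitted model.
def toy_model(row):
    score = 0.25
    if row["Sex"] == "female":
        score += 0.5
    if row["Age"] < 10:
        score += 0.125
    return score


rows = [
    {"Sex": "female", "Age": 30},
    {"Sex": "male", "Age": 40},
]
grid = [5, 20, 50]

curves = ice_curves(toy_model, rows, "Age", grid)
pdp = partial_dependence(curves)
print(curves)
print(pdp)
```

In this toy case the two ICE curves are parallel, so the feature acts the same way on every instance; curves with different slopes would point to the heterogeneity described in the list above.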

A few observations can be made from the charts above:

  • The Sex feature has a very strong impact on the prediction result

  • Parch has almost no impact on the outcome except when it is 0 or 1 - this is a candidate for clipping

  • Fare and Age: both have a non-linear relationship with the outcome; Fare has two modes (density of blue lines) - these are good candidates to explore for feature interaction with other properties
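Clipping Parch, as suggested in the observations above, could look like the sketch below. The cap of 2 is an assumption read off the observation that only the values 0 and 1 behave differently, the column name `ParchClipped` is hypothetical, and the toy values are made up.

```python
import pandas as pd

# Toy values, made up for illustration only.
toy = pd.DataFrame({"Parch": [0, 1, 2, 0, 5, 3, 1, 6]})

# Cap Parch at 2 so every value above 1 lands in a single "2 or more" bucket.
# The cap of 2 is an assumption, not a value taken from the tutorial.
toy["ParchClipped"] = toy["Parch"].clip(upper=2)
print(toy["ParchClipped"].tolist())
```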

auto.analyze_interaction(x='Parch', hue='Survived', train_data=df_train)
../../_images/c8a191a2f953ed9ef156a217f7fe970ec1d055321872f9dde1869ca6a7ed652f.png
auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)
../../_images/dfab5b2627cd756049fd5e4d79ce1ce32152eaf3b41bf72b8b00abbbf3a658f6.png

It looks like 63% of first class passengers survived, while 47% of second class and only 24% of third class passengers survived. Similar information is visible via the Fare variable, explored next.
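Rates like these can also be read directly from the data with a group-by. The sketch below uses a made-up toy frame, so its numbers do not match the real ones; on the real data the same call would be applied to `df_train`.

```python
import pandas as pd

# Toy data, made up for illustration only; it does not reproduce the real rates.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean per group is the share of survivors.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```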

Fare and Age features exploration

Because the PDPs hinted at non-linear behavior in these two variables, let’s take a closer look and visualize them individually and jointly.

auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))
../../_images/583b8adfba690410f93a57d22a3608d97a203b368e0898ed7bf2559b7add38a2.png
auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)
../../_images/bea7a13f61efafe9651fc2707723dad882c6d268add6c4d62d173c7621628482.png

The leftmost part of the distribution on this chart hints that children and infants may have been given priority.
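One way to check a hint like this numerically is to bin Age and compare the share of survivors per bin. This is a sketch on made-up toy data; the bin edges, labels and the column name `AgeGroup` are illustrative assumptions.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Age":      [2, 5, 9, 25, 31, 40, 58, 64],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Bin edges and labels are illustrative assumptions, not taken from the tutorial.
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Share of survivors per age group; observed=True skips bins with no rows.
rate_by_age = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate_by_age)
```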

auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)
../../_images/010a64c0ef739322b62b502aa8f626d1586fee1152351ff9f653dd6e1c7fc612.png

This chart highlights three outliers with a Fare of over $500. Let’s take a look at these:

df_train[df_train.Fare > 400]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
258 259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C

As you can see, all three passengers share the same ticket (PC 17755), so the listed Fare appears to be the price of the whole ticket rather than a per-person amount. Dividing it by the number of passengers on the ticket gives a per-person fare; the group size also shows whether passengers travelled in larger groups. Let’s create these two new features and take another look at the Fare-Age relationship.

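# Count passengers per ticket; size() counts rows even where Embarked is missing
ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
data = df_train.copy()
data['GroupSize'] = data.Ticket.map(ticket_to_count)
# Split the ticket fare evenly across the group
data['FarePerPerson'] = data.Fare / data.GroupSize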

auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)
../../_images/f083596b56572c615c7b3c9f26a43c48b1c7fad0c446f63e3da18c755b81ea16.png ../../_images/1befc4e9be45164c79367e830e7e346947d3da823bfb61604305dd223ac88789.png

The charts now show a cleaner separation between FarePerPerson, Pclass and Survived.
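Passengers travelling on the same ticket may be split between the training and test files, so counting over the training data alone can understate the group size. One possible variation, shown here on made-up toy frames, counts ticket holders over both sets. Whether using test rows this way is appropriate depends on your setup, so treat this as a sketch rather than a recommendation.

```python
import pandas as pd

# Toy frames, made up for illustration only.
toy_train = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2"],
    "Fare":   [90.0, 90.0, 10.0],
})
toy_test = pd.DataFrame({
    "Ticket": ["A1", "C3"],
    "Fare":   [90.0, 8.0],
})

# Count ticket holders across both sets; value_counts() counts rows per ticket.
all_tickets = pd.concat([toy_train["Ticket"], toy_test["Ticket"]])
ticket_to_count = all_tickets.value_counts().to_dict()

data = toy_train.copy()
data["GroupSize"] = data["Ticket"].map(ticket_to_count)
# Split the ticket fare evenly across the group.
data["FarePerPerson"] = data["Fare"] / data["GroupSize"]
print(data)
```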