Reference: Auto components#
This section is a reference for the high-level composite components showcased in the sections above.
autogluon.eda.analysis.auto#
dataset_overview | Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).
target_analysis | Target variable composite analysis.
quick_fit | This helper performs quick model fit analysis and then produces a composite report of the results.
missing_values_analysis | Perform quick analysis of missing values across datasets.
covariate_shift_detection | Shortcut for covariate shift detection analysis.
analyze_interaction | This helper performs simple feature interaction analysis.
analyze | This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.
explain_rows | Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature.
partial_dependence_plots | Display Partial Dependence Plots (PDP) with Individual Conditional Expectation (ICE)
dataset_overview#
- autogluon.eda.auto.simple.dataset_overview(train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, label: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = None, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None)[source]#
Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).
- Supported fig_args/chart_args keys:
feature_distance.<property> - feature distance dendrogram chart
chart.<variable>.<property> - near-duplicate groups visualizations chart. If chart is labeled as a relationship <A>/<B>, then <variable> is <B>
- Parameters
train_data (Optional[DataFrame], default = None) – training dataset
test_data (Optional[DataFrame], default = None) – test dataset
val_data (Optional[DataFrame], default = None) – validation dataset
label (Optional[str], default = None) – target variable
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
return_state (bool, default = False) – return state if True
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
fig_args (Optional[Dict[str, Any]], default = None) – figure args for visualizations; key == component; value = dict of kwargs for component figure
chart_args (Optional[Dict[str, Any]], default = None) – chart args for visualizations; key == component; value = dict of kwargs for component chart
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.dataset_overview(
...     train_data=df_train, test_data=df_test, label=target_col,
...     chart_args={'feature_distance.orientation': 'left'},
...     fig_args={'feature_distance.figsize': (6,6)},
... )
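The `sample` semantics described in the parameter list (int means a row count, float means a fraction, None means no sampling) can be illustrated with a small standalone helper. This is only a sketch: `resolve_sample_size` is a hypothetical name, it is not part of AutoGluon, and capping an oversized int at the dataset size is an assumption rather than documented behavior of `Sampler`.

```python
from typing import Optional, Union


def resolve_sample_size(n_rows: int, sample: Union[None, int, float]) -> Optional[int]:
    """Return the number of rows to keep, or None when no sampling is requested."""
    if sample is None:
        return None  # None means no sampling
    if isinstance(sample, bool):
        raise TypeError("sample must be None, int or float")
    if isinstance(sample, int):
        # Assumption: an int larger than the dataset is capped at the dataset size.
        return min(sample, n_rows)
    if not 0.0 <= sample <= 1.0:
        raise ValueError("a float sample must be between 0.0 and 1.0")
    # A float is interpreted as a fraction of the dataset.
    return int(n_rows * sample)


print(resolve_sample_size(1000, None))
print(resolve_sample_size(1000, 200))
print(resolve_sample_size(1000, 0.25))
```

The same interpretation applies to the `sample` argument of the other helpers on this page.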
target_analysis#
- autogluon.eda.auto.simple.target_analysis(train_data: DataFrame, label: str, test_data: Optional[DataFrame] = None, problem_type: str = 'auto', fit_distributions: Union[bool, str, List[str]] = True, sample: Union[None, int, float] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None) Optional[AnalysisState][source]#
Target variable composite analysis.
- Performs the following analysis components of the label field:
basic summary stats
feature values distribution charts; adds fitted distributions for numeric targets
target correlations analysis; with interaction charts of target vs high-correlated features
- Supported fig_args/chart_args keys:
correlation.<property> - properties for correlation heatmap
chart.<variable_name>.<property> - properties for charts rendered during the analysis.
If <variable_name> is matching label value, then this will modify the top chart; all other values will be affecting label/<variable_name> interaction charts
- Parameters
train_data (DataFrame) – training dataset
test_data (Optional[DataFrame], default = None) – test dataset
label (str) – target variable
problem_type (str, default = 'auto') – problem type to use. Valid problem_type values include [‘auto’, ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’, ‘softclass’]; ‘auto’ means it will be auto-detected using AutoGluon methods.
fit_distributions (Union[bool, str, List[str]], default = True) – If True, or a list of distributions is provided, then fit distributions. Performed only if y and hue are not present.
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
return_state (bool, default = False) – return state if True
fig_args (Optional[Dict[str, Any]], default = None) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: ‘a.b.c’. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.figsize will result in setting figsize on the <target>/PassengerId figure).
chart_args (Optional[Dict[str, Any]], default = None) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: ‘a.b.c’. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.fill will result in setting fill on the <target>/PassengerId chart).
- Return type
state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.target_analysis(train_data=..., label=...)
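The nested dot syntax used by fig_args and chart_args keys ('a.b.c') can be read as a path into a nested dictionary. The sketch below shows one way to interpret such keys; `expand_dotted_keys` is a hypothetical helper written for illustration, and the library's internal handling may differ.

```python
from typing import Any, Dict


def expand_dotted_keys(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {'a.b.c': value} into {'a': {'b': {'c': value}}}."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down the path, creating intermediate dicts as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


fig_args = {
    "chart.PassengerId.figsize": (8, 6),  # figure for the <target>/PassengerId chart
    "correlation.figsize": (6, 6),        # figure for the correlation heatmap
}
print(expand_dotted_keys(fig_args))
```

Read this way, the first key segment selects the component, the middle segment (when present) selects the variable, and the last segment is the keyword argument passed to the figure or chart.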
quick_fit#
- autogluon.eda.auto.simple.quick_fit(train_data: DataFrame, label: str, test_data: Optional[DataFrame] = None, path: Optional[str] = None, val_size: float = 0.3, problem_type: str = 'auto', sample: Union[None, int, float] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, save_model_to_state: bool = True, verbosity: int = 0, show_feature_importance_barplots: bool = False, estimator_args: Optional[Dict[str, Dict[str, Any]]] = None, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None, render_analysis: bool = True, **fit_args)[source]#
This helper performs quick model fit analysis and then produces a composite report of the results.
- The analysis is structured in a sequence of operations:
Sample if sample is specified.
Perform train-test split using val_size ratio
Fit AutoGluon estimator given fit_args; if hyperparameters are not present in fit_args, then use the default ones (Random Forest by default, because it is interpretable)
Display report
- The reports include:
confusion matrix for classification problems; predictions vs actual for regression problems
model leaderboard
feature importance
samples with the highest prediction error - candidates for inspection
samples with the least distance from the other class - candidates for labeling
- Supported fig_args/chart_args keys:
confusion_matrix.<property> - confusion matrix chart for classification predictor
regression_eval.<property> - regression predictor results chart
feature_importance.<property> - feature importance barplot chart
- Parameters
train_data (DataFrame) – training dataset
test_data (Optional[DataFrame], default = None) – test dataset
label (str) – target variable
path (Optional[str], default = None) – path for models saving
problem_type (str, default = 'auto') – problem type to use. Valid problem_type values include [‘auto’, ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’, ‘softclass’]; ‘auto’ means it will be auto-detected using AutoGluon methods.
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
val_size (float, default = 0.3) – fraction of training set to be assigned as validation set during the split.
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
return_state (bool, default = False) – return state if True
save_model_to_state (bool, default = True) – save fitted model into state under model key. This functionality might be helpful in cases when the fitted model could be usable for other purposes (e.g. imputers)
verbosity (int, default = 0) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
show_feature_importance_barplots (bool, default = False) – if True, then a barplot chart will be added with feature importance visualization
estimator_args (Optional[Dict[str, Dict[str, Any]]], default = None) – args to pass into the estimator constructor
fit_args – kwargs to pass into TabularPredictor fit.
fig_args (Optional[Dict[str, Any]], default = None) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: ‘a.b.c’.
chart_args (Optional[Dict[str, Any]], default = None) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: ‘a.b.c’.
render_analysis (bool, default = True) – if False, then don’t render any visualizations; this can be used if user just needs to train a model. It is recommended to use this option with save_model_to_state=True and return_state=True options.
- Return type
state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> # Quick fit
>>> state = auto.quick_fit(
...     train_data=..., label=...,
...     return_state=True,  # return state object from call
...     fig_args={"regression_eval.figsize": (8,6)},  # customize regression evaluation `figsize`
...     chart_args={"regression_eval.residuals_plot_mode": "hist"},  # customize regression evaluation `residuals_plot_mode`
...     hyperparameters={'GBM': {}},  # train specific model
... )
>>>
>>> # Using quick fit model
>>> model = state.model
>>> y_pred = model.predict(test_data)
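The split step controlled by val_size can be pictured as a shuffled hold-out split. The following is a minimal standard-library sketch of that idea; `split_train_val` and its `seed` argument are hypothetical, and the actual splitting strategy used by quick_fit (for example, whether it stratifies by label) is not specified here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_val(rows: Sequence[T], val_size: float = 0.3, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle rows and hold out a val_size fraction as the validation set."""
    if not 0.0 < val_size < 1.0:
        raise ValueError("val_size must be strictly between 0.0 and 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = int(round(len(shuffled) * val_size))
    # The first n_val shuffled rows become validation data, the rest training data.
    return shuffled[n_val:], shuffled[:n_val]


train_rows, val_rows = split_train_val(list(range(10)), val_size=0.3)
print(len(train_rows), len(val_rows))
```

With the default val_size of 0.3, roughly 70% of the rows are used for fitting and 30% for the evaluation shown in the report.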
missing_values_analysis#
- autogluon.eda.auto.simple.missing_values_analysis(graph_type: str = 'matrix', train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = None, **chart_args)[source]#
Perform quick analysis of missing values across datasets.
- Parameters
graph_type (str, default = 'matrix') – one of the following visualization types:
- matrix - nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion. This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.
- bar - visualizes how many rows are non-null vs null in the column. Logarithmic scale can be enabled by specifying log=True in kwargs.
- heatmap - correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another. Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does). Entries marked <1 or >-1 have a correlation that is close to being exactly negative or positive but is still not quite perfectly so.
- dendrogram - the dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
train_data (Optional[DataFrame]) – training dataset
test_data (Optional[DataFrame], default = None) – test dataset
val_data (Optional[DataFrame], default = None) – validation dataset
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
return_state (bool, default = False) – return state if True
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
- Return type
state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.missing_values_analysis(train_data=...)
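The nullity correlation shown by the heatmap option can be understood as the Pearson correlation between the presence indicators of two columns. The sketch below computes it with the standard library only; `nullity_correlation` is a hypothetical helper and the toy columns are made up for illustration, so this is not the code path the visualization itself uses.

```python
from math import sqrt
from typing import Optional, Sequence


def nullity_correlation(col_a: Sequence[Optional[object]], col_b: Sequence[Optional[object]]) -> float:
    """Pearson correlation of the 'value is present' indicators of two columns."""
    a = [0.0 if v is None else 1.0 for v in col_a]
    b = [0.0 if v is None else 1.0 for v in col_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A fully present or fully missing column has no nullity variation.
        raise ValueError("nullity correlation is undefined for a constant column")
    return cov / sqrt(var_a * var_b)


age = [22, None, 30, None]
cabin = ["C85", None, "E46", None]   # missing exactly where age is missing
deck = [None, "B", None, "D"]        # present exactly where age is missing

print(nullity_correlation(age, cabin))
print(nullity_correlation(age, deck))
```

Columns that are always missing together score 1, and columns where one is present exactly when the other is missing score -1, matching the range described for the heatmap.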
covariate_shift_detection#
- autogluon.eda.auto.simple.covariate_shift_detection(train_data: DataFrame, test_data: DataFrame, label: str, sample: Union[None, int, float] = None, path: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, verbosity: int = 0, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **fit_args)[source]#
Shortcut for covariate shift detection analysis.
Detects a change in covariate (X) distribution between training and test, which we call XShift. It can tell you if your training set is not representative of your test set distribution. This is done with a Classifier 2 Sample Test.
- Supported fig_args/chart_args keys:
chart.<variable_name>.<property> - properties for charts rendered during the analysis
- Parameters
train_data (DataFrame) – training dataset
test_data (DataFrame) – test dataset
label (str) – target variable
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
path (Optional[str], default = None) – path for models saving
return_state (bool, default = False) – return state if True
verbosity (int, default = 0) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
fit_args – kwargs to pass into TabularPredictor fit
fig_args (Optional[Dict[str, Any]], default = None) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: ‘a.b.c’. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.figsize will result in setting figsize on the PassengerId figure).
chart_args (Optional[Dict[str, Any]], default = None) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: ‘a.b.c’. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.fill will result in setting fill on the PassengerId chart).
- Return type
state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> # use default settings
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=...)
>>>
>>> # customize classifier and verbosity level
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=..., verbosity=2, hyperparameters = {'GBM': {}})
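The idea behind a Classifier 2 Sample Test is to train a classifier to tell training rows from test rows: held-out accuracy near 0.5 suggests the two sets look alike, while accuracy near 1.0 suggests a shift. The sketch below illustrates this on a single numeric feature with a decision stump standing in for the AutoGluon classifier. The function names, the 50/50 fit/evaluation split and the stump itself are assumptions made for illustration, not the library's implementation.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]  # (feature value, True if the row came from the test set)


def stump_accuracy(threshold: float, flip: bool, samples: Sequence[Sample]) -> float:
    """Accuracy of the rule 'value > threshold means test' (inverted when flip is True)."""
    hits = 0
    for value, is_test in samples:
        predicted_test = (value > threshold) != flip
        hits += predicted_test == is_test
    return hits / len(samples)


def c2st_accuracy(train_values: Sequence[float], test_values: Sequence[float], seed: int = 0) -> float:
    """Fit a stump to separate train from test rows and score it on held-out rows."""
    pooled: List[Sample] = [(v, False) for v in train_values] + [(v, True) for v in test_values]
    random.Random(seed).shuffle(pooled)
    half = len(pooled) // 2
    fit_part, eval_part = pooled[:half], pooled[half:]

    # Candidate thresholds: midpoints between consecutive sorted values of the fit half.
    values = sorted(v for v, _ in fit_part)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    # Keep the stump that best separates the two origins on the fit half.
    best_threshold, best_flip = max(
        ((t, flip) for t in candidates for flip in (False, True)),
        key=lambda tf: stump_accuracy(tf[0], tf[1], fit_part),
    )
    return stump_accuracy(best_threshold, best_flip, eval_part)


shifted = c2st_accuracy(list(range(50)), list(range(100, 150)))
print(shifted)
```

In the shifted toy data the stump separates the two sets perfectly on held-out rows, which is the kind of signal the helper reports as XShift; real data needs a proper classifier and a significance test on the resulting score.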
analyze_interaction#
- autogluon.eda.auto.simple.analyze_interaction(x: Optional[str] = None, y: Optional[str] = None, hue: Optional[str] = None, fit_distributions: Union[bool, str, List[str]] = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **analysis_args)[source]#
This helper performs simple feature interaction analysis.
- Parameters
x (Optional[str], default = None) –
y (Optional[str], default = None) –
hue (Optional[str], default = None) –
fit_distributions (Union[bool, str, List[str]], default = False) – If True, or a list of distributions is provided, then fit distributions. Performed only if y and hue are not present.
chart_args (Optional[dict], default = None) – kwargs to pass into visualization component
fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into chart figure
Examples
>>> import pandas as pd
>>> import autogluon.eda.auto as auto
>>>
>>> df_train = pd.DataFrame(...)
>>>
>>> auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, chart_args=dict(headers=True, alpha=0.2))
analyze#
- autogluon.eda.auto.simple.analyze(train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, model=None, label: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, sample: Union[None, int, float] = None, anlz_facets: Optional[List[AbstractAnalysis]] = None, viz_facets: Optional[List[AbstractVisualization]] = None, return_state: bool = False, verbosity: int = 2) Optional[AnalysisState][source]#
This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.
- Parameters
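train_data (Optional[DataFrame], default = None) – training dataset
test_data (Optional[DataFrame], default = None) – test dataset
val_data (Optional[DataFrame], default = None) – validation dataset
model – trained Predictor
label (Optional[str], default = None) – target variable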
state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
anlz_facets (List[AbstractAnalysis]) – analyses to add to this composite analysis
viz_facets (List[AbstractVisualization]) – visualizations to add to this composite analysis
return_state (bool, default = False) – return state if True
verbosity (int, default = 2,) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
- Return type
state after fit call if return_state is True; None otherwise
explain_rows#
- autogluon.eda.auto.simple.explain_rows(train_data: DataFrame, model: TabularPredictor, rows: DataFrame, display_rows: bool = False, plot: Optional[str] = 'force', baseline_sample: int = 100, return_state: bool = False, fit_args: Optional[Dict[str, Any]] = None, **kwargs) Optional[AnalysisState][source]#
Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficients from a local linear regression. This helper performs the analysis for the given rows.
The results are rendered either as force plot or waterfall plot.
- Parameters
train_data (DataFrame) – training dataset
model (TabularPredictor) – trained AutoGluon predictor
rows (DataFrame) – rows to explain
display_rows (bool, default = False) – if True then display the row before the explanation chart
plot (Optional[str], default = 'force') – type of plot to visualize the Shapley values. Supported keys:
- force - visualize the given SHAP values with an additive force layout
- waterfall - visualize the given SHAP values with a waterfall layout
- None - do not use any visualization
baseline_sample (int, default = 100) – The background dataset size to use for integrating out features. To determine the impact of a feature, that feature is set to “missing” and the change in the model output is observed.
return_state (bool, default = False) – return state if True
fit_args (Optional[Dict[str, Any]], default = None) – kwargs for ShapAnalysis.
kwargs –
See also
KernelExplainer, ShapAnalysis, ExplainForcePlot, ExplainWaterfallPlot
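Kernel SHAP approximates Shapley values; for a handful of features they can also be computed exactly, which helps in reading the force and waterfall plots. The sketch below does this by averaging marginal contributions over all feature orderings, treating a "missing" feature as one replaced by a baseline value. The toy linear model, the single baseline row and the helper name `shapley_values` are assumptions for illustration; explain_rows itself relies on a sampled background dataset and the SHAP library.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   row: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(row)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature "missing"
        previous = model(current)
        for i in order:
            current[i] = row[i]       # reveal feature i
            value = model(current)
            totals[i] += value - previous
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(x: Sequence[float]) -> float:
    # Hypothetical stand-in for a trained predictor.
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]


row = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, row, baseline)
print(phi)
```

The values sum to the difference between the prediction for the row and the prediction for the baseline, which is the quantity the force and waterfall layouts decompose.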
partial_dependence_plots#
- autogluon.eda.auto.simple.partial_dependence_plots(train_data: DataFrame, label: str, target: Optional[Any] = None, features: Optional[Union[str, List[str]]] = None, path: Optional[str] = None, max_ice_lines: int = 300, sample: Union[None, int, float] = None, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None, show_help_text: bool = True, return_state: bool = False, col_number_warning: int = 20, **fit_args)[source]#
Display Partial Dependence Plots (PDP) with Individual Conditional Expectation (ICE)
ICE plots complement PDP by showing the relationship between a feature and the model’s output for each individual instance in the dataset. ICE lines (blue) can be overlaid on PDPs (red) to provide a more detailed view of how the model behaves for specific instances. Here are some points on how to interpret PDPs with ICE lines:
- Central tendency
The PDP line represents the average prediction for different values of the feature of interest. Look for the overall trend of the PDP line to understand the average effect of the feature on the model’s output.
- Variability
The ICE lines represent the predicted outcomes for individual instances as the feature of interest changes. Examine the spread of ICE lines around the PDP line to understand the variability in predictions for different instances.
- Non-linear relationships
Look for any non-linear patterns in the PDP and ICE lines. This may indicate that the model captures a non-linear relationship between the feature and the model’s output.
- Heterogeneity
Check for instances where ICE lines have widely varying slopes, indicating different relationships between the feature and the model’s output for individual instances. This may suggest interactions between the feature of interest and other features.
- Outliers
Look for any ICE lines that are very different from the majority of the lines. This may indicate potential outliers or instances that have unique relationships with the feature of interest.
- Confidence intervals
If available, examine the confidence intervals around the PDP line. Wider intervals may indicate a less certain relationship between the feature and the model’s output, while narrower intervals suggest a more robust relationship.
- Interactions
By comparing PDPs and ICE plots for different features, you may detect potential interactions between features. If the ICE lines change significantly when comparing two features, this might suggest an interaction effect.
- Parameters
train_data (DataFrame) – training dataset
label (str) – target variable
target (Optional[Any], default = None) – In a multiclass setting, specifies the class for which the PDPs should be computed. Ignored in binary classification or classical regression settings
features (Optional[Union[str, List[str]]], default = None) – feature subset to display; None means all features will be rendered.
path (Optional[str], default = None) – location to store the model trained for this task
max_ice_lines (int, default = 300) – max number of ice lines to display for each sub-plot
sample (Union[None, int, float], default = None) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()
fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into chart figure
chart_args (Optional[dict], default = None) – kwargs to pass into visualization component
show_help_text (bool, default = True) –
return_state (bool, default = False) – return state if True
col_number_warning (int, default = 20) – number of features to visualize after which the warning will be displayed to warn about rendering time
fit_args – kwargs to pass into TabularPredictor fit.
- Return type
state after fit call if return_state is True; None otherwise
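The relationship between ICE lines and the PDP line described above can be reproduced with a toy model: each ICE line varies one feature for a single row while the other features stay fixed, and the PDP line is the pointwise average of the ICE lines. This is a minimal sketch; `ice_lines`, `pdp_line` and the toy model are hypothetical and do not reflect how partial_dependence_plots is implemented.

```python
from typing import Callable, List, Sequence


def ice_lines(model: Callable[[Sequence[float]], float],
              rows: Sequence[Sequence[float]],
              feature: int,
              grid: Sequence[float]) -> List[List[float]]:
    """One line per row: predictions as the chosen feature sweeps over the grid."""
    lines = []
    for row in rows:
        line = []
        for value in grid:
            modified = list(row)
            modified[feature] = value  # vary only the feature of interest
            line.append(model(modified))
        lines.append(line)
    return lines


def pdp_line(lines: Sequence[Sequence[float]]) -> List[float]:
    """The PDP is the average of the ICE lines at each grid point."""
    return [sum(column) / len(column) for column in zip(*lines)]


def toy_model(x: Sequence[float]) -> float:
    # Hypothetical model with an interaction between feature 0 and feature 1.
    return x[0] * x[1]


rows = [[5.0, 1.0], [5.0, -1.0], [5.0, 3.0]]
grid = [0.0, 1.0, 2.0]
lines = ice_lines(toy_model, rows, feature=0, grid=grid)
pdp = pdp_line(lines)
print(lines)
print(pdp)
```

Here the ICE lines have different slopes (one of them negative) even though the PDP rises steadily, which is the heterogeneity pattern that hints at an interaction with another feature.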