Feature Importance#

Using Regression Methods#

Many regression models expose feature importance through fitted coefficients or built-in importance scores. The techniques covered here are listed below; the last one, permutation importance, is a model-agnostic procedure rather than a regression model itself:

Linear Regression
Ridge Regression
Lasso Regression
Elastic Net Regression
Decision Tree Regression
Random Forest Regression
Gradient Boosting Regression (e.g., XGBoost)
Support Vector Regression (SVR)
Partial Least Squares (PLS) Regression
Permutation Importance (Model Agnostic)

Setup: Generating the Dataset#

We’ll create a simple dataset with scikit-learn’s make_regression function.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Print shape for verification
print("Features shape:", X.shape)
print("Target shape:", y.shape)

Features shape: (100, 5)
Target shape: (100,)
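
When comparing importance methods on synthetic data, it can help to know the coefficients that actually generated the target. A minimal sketch, assuming the same settings as above plus `coef=True`, which asks make_regression to return those coefficients as well (the name `true_coef` is only illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as the setup cell, but also return the generating coefficients
X, y, true_coef = make_regression(
    n_samples=100, n_features=5, noise=0.1, coef=True, random_state=42
)

# One ground-truth coefficient per feature
print("True coefficients:", np.round(true_coef, 2))
```

The estimated coefficients and importance rankings in the sections below can then be checked against `true_coef`.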

1. Linear Regression#

This is Ordinary Least Squares (OLS) regression.

How it works: Minimizes the sum of squared errors to fit a linear model.
Feature Importance: Coefficients of the linear model indicate the influence of each feature.
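Interpretation: Larger absolute coefficient values indicate more importance, provided the features are on comparable scales.

Limitation: Sensitive to multicollinearity; correlated features make individual coefficients unstable.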

from sklearn.linear_model import LinearRegression

# Initialize and fit the model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Display feature importance (coefficients)
print("Linear Regression Coefficients:", linear_model.coef_)
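
Raw coefficients depend on the units of each feature, so they are only comparable as importances when the features share a scale. The sketch below illustrates this under one assumption that is not in the original example: feature 0 is hypothetically re-expressed in units 1000 times smaller. Standardizing inside a pipeline removes the effect.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Hypothetical unit change: multiply feature 0 by 1000
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

# Raw coefficients: the coefficient of feature 0 shrinks by the same factor
raw = LinearRegression().fit(X, y).coef_
raw_rescaled = LinearRegression().fit(X_rescaled, y).coef_

# Standardized coefficients: unaffected by the unit change
std = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)[-1].coef_
std_rescaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_rescaled, y)[-1].coef_

print("Raw coefficients (original units):", np.round(raw, 3))
print("Raw coefficients (feature 0 x1000):", np.round(raw_rescaled, 3))
print("Standardized coefficients (original units):", np.round(std, 3))
print("Standardized coefficients (feature 0 x1000):", np.round(std_rescaled, 3))
```

Ranking features by the absolute value of the standardized coefficients is the safer choice whenever the inputs are measured in different units.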

2. Ridge Regression (L2 Regularization)#

How it works: Adds an L2 penalty to the loss function to shrink coefficients.
Feature Importance: Read from the coefficients as in linear regression; the penalty shrinks them toward zero, but rarely to exactly zero.
Use case: When features are correlated.

Advantage: Helps with multicollinearity and overfitting.

from sklearn.linear_model import Ridge

# Initialize Ridge Regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# Display feature importance (coefficients)
print("Ridge Regression Coefficients:", ridge_model.coef_)
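
To see the shrinkage described above, one option is to refit the model over a range of penalty strengths and watch the size of the coefficient vector. This is a rough sketch; the `alphas` grid is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Arbitrary grid of penalty strengths, from weak to very strong
alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>8}: coef={np.round(coef, 2)}, L2 norm={norms[-1]:.2f}")
```

The L2 norm of the coefficients falls as `alpha` grows, yet the coefficients stay non-zero, which is why Ridge ranks features but does not select them.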

3. Lasso Regression (L1 Regularization)#

How it works: Adds an L1 penalty that encourages sparsity in the coefficients.
Feature Importance: Coefficients of unhelpful features can be set exactly to zero, so the model doubles as a feature selector.
Use case: When you want to identify a small subset of important features.

Advantage: Automatic feature selection by shrinking unimportant coefficients to zero.

from sklearn.linear_model import Lasso

# Initialize Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X, y)

# Display feature importance (coefficients)
print("Lasso Regression Coefficients:", lasso_model.coef_)
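
In the dataset above every feature is informative, so little sparsity should be expected. The sketch below assumes a different, hypothetical dataset (10 features, only 3 informative) and an arbitrary `alphas` grid to show how many coefficients survive at each penalty strength.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical variant of the dataset: only 3 of 10 features drive the target
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=42
)

alphas = [0.1, 1.0, 10.0, 1e4]
n_nonzero = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_nonzero.append(int(np.count_nonzero(lasso.coef_)))
    kept = np.flatnonzero(lasso.coef_)
    print(f"alpha={alpha:>8}: {n_nonzero[-1]} non-zero coefficients, features kept: {kept}")
```

A sufficiently large `alpha` removes every feature, so in practice the penalty is usually tuned by cross-validation (for example with `LassoCV`) rather than fixed by hand.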

4. Elastic Net Regression (L1 + L2 Regularization)#

How it works: Combines both L1 and L2 regularization.
Feature Importance: Balances between feature selection (L1) and coefficient shrinkage (L2).
Use case: When features are highly correlated, and some sparsity is desired.

Advantage: More flexible than Ridge or Lasso alone.

from sklearn.linear_model import ElasticNet

# Initialize Elastic Net Regression
elastic_net_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net_model.fit(X, y)

# Display feature importance (coefficients)
print("Elastic Net Coefficients:", elastic_net_model.coef_)
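
The `l1_ratio` parameter sets the mix between the two penalties. As a small sketch, the loop below refits the model for a few illustrative `l1_ratio` values and checks that `l1_ratio=1.0` reproduces Lasso with the same `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Illustrative mixes: mostly L2, balanced, pure L1
enet_coefs = {}
for l1_ratio in [0.1, 0.5, 1.0]:
    enet = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    enet_coefs[l1_ratio] = enet.coef_
    print(f"l1_ratio={l1_ratio}: {np.round(enet.coef_, 2)}")

# With l1_ratio=1.0 the penalty is pure L1, i.e. the Lasso objective
lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
print("Matches Lasso at l1_ratio=1.0:", np.allclose(enet_coefs[1.0], lasso_coef))
```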

5. Decision Tree Regression#

How it works: Splits data at nodes based on feature values to minimize variance.
Feature Importance: Based on reduction in variance or impurity at each split.
Interpretation: Sum of reductions at each node where the feature was used.

Limitation: Prone to overfitting on small datasets.

from sklearn.tree import DecisionTreeRegressor

# Initialize Decision Tree Regressor (fixed seed so the splits are reproducible)
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X, y)

# Display feature importance
print("Decision Tree Feature Importances:", tree_model.feature_importances_)
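
scikit-learn normalizes these impurity-based scores so that they sum to 1, which makes them easy to read as shares. A short sketch that ranks the features (the seed is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X, y)
importances = tree.feature_importances_

# Rank features from most to least important
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")

print("Sum of importances:", importances.sum())
```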

6. Random Forest Regression#

How it works: Builds an ensemble of decision trees.
Feature Importance: Impurity reduction attributed to each feature, averaged over all trees in the forest.
Use case: When the target depends on the features non-linearly or through interactions.

Advantage: More robust than a single decision tree.

from sklearn.ensemble import RandomForestRegressor

# Initialize Random Forest Regressor (fixed seed so results are reproducible)
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X, y)

# Display feature importance
print("Random Forest Feature Importances:", forest_model.feature_importances_)
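
Because the forest score is an average over trees, the spread across trees gives a rough sense of how stable each importance is. A minimal sketch that collects the per-tree scores from the fitted forest's `estimators_` list:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# One row of importances per tree in the ensemble
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)
std_importance = per_tree.std(axis=0)

for i in range(X.shape[1]):
    print(f"feature {i}: {mean_importance[i]:.3f} +/- {std_importance[i]:.3f}")
```

A large standard deviation relative to the mean suggests the ranking of that feature should not be over-interpreted.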

7. Gradient Boosting Regression (e.g., XGBoost, LightGBM)#

How it works: Sequentially builds trees, each correcting the errors of the previous.
Feature Importance: Based on gain (reduction in loss) or on how often a feature is used for splits, depending on the importance type the library is configured to report.
Use case: Excellent for complex, non-linear relationships.

Advantage: Often more accurate, with detailed feature importance scores.

from xgboost import XGBRegressor

# Initialize XGBoost Regressor
xgb_model = XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_model.fit(X, y)

# Display feature importance
print("XGBoost Feature Importances:", xgb_model.feature_importances_)
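
XGBoost is a separate install. Where it is not available, scikit-learn's own GradientBoostingRegressor can stand in; note that this is a swapped-in estimator, not the XGBoost model above, and its scores are impurity-based. A sketch with arbitrary hyperparameters:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# scikit-learn's gradient boosting, used here in place of XGBoost
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X, y)

gbr_importances = gbr.feature_importances_
for idx in np.argsort(gbr_importances)[::-1]:
    print(f"feature {idx}: {gbr_importances[idx]:.3f}")
```

The scores from the two libraries are not computed identically, so compare rankings rather than raw values.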

8. Support Vector Regression (SVR)#

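How it works: Fits a function that keeps most points within an ε-wide tube around the prediction and penalizes only the errors that fall outside it.
Feature Importance: With a linear kernel, the weight vector (coef_) is recovered from the dual coefficients and the support vectors.

Limitation: coef_ is only available for linear kernels; with non-linear kernels, use a model-agnostic method such as permutation importance.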

from sklearn.svm import SVR

# Initialize Support Vector Regressor
svr_model = SVR(kernel='linear')
svr_model.fit(X, y)

# Display feature importance (coef_ exists only when kernel='linear')
print("SVR Coefficients:", svr_model.coef_)
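
The link between the dual representation and the linear weights can be checked directly. The sketch below rebuilds the weight vector from `dual_coef_` and `support_vectors_`, and shows that an RBF-kernel model exposes no `coef_` at all (the RBF model is added here only for contrast).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Linear kernel: coef_ has shape (1, n_features)
svr_linear = SVR(kernel="linear").fit(X, y)
weights = svr_linear.coef_.ravel()

# Rebuild the same weights from the dual coefficients and support vectors
reconstructed = (svr_linear.dual_coef_ @ svr_linear.support_vectors_).ravel()
print("Weights match dual reconstruction:", np.allclose(weights, reconstructed))

# Non-linear kernel: no primal weight vector is available
svr_rbf = SVR(kernel="rbf").fit(X, y)
rbf_has_coef = hasattr(svr_rbf, "coef_")
print("RBF model exposes coef_:", rbf_has_coef)
```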

9. Partial Least Squares (PLS) Regression#

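How it works: Projects the predictors and the target onto a small number of latent components chosen to maximize the covariance between them.
Feature Importance: Regression coefficients in the original feature space, or the feature weights on the latent components.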

Use case: Suitable for multicollinear and high-dimensional data.

from sklearn.cross_decomposition import PLSRegression

# Initialize PLS Regression with 2 components
pls_model = PLSRegression(n_components=2)
pls_model.fit(X, y)

# Display the coefficients for each feature
print("PLS Regression Coefficients:", pls_model.coef_)

10. Permutation Importance (Model Agnostic)#

How it works: Measures the decrease in model performance when a feature’s values are randomly shuffled.
Feature Importance: Deterioration in performance (e.g., lower R² or higher RMSE) after shuffling.

Advantage: Applicable to any model, providing a global view of feature importance.

from sklearn.inspection import permutation_importance

# Use the previously trained RandomForest model
# (scored here on the training data; a held-out set is preferable in practice)
perm_importance = permutation_importance(forest_model, X, y, n_repeats=30, random_state=42)

# Display permutation importance
print("Permutation Importances:", perm_importance.importances_mean)
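
Scores computed on the training data can flatter features the model has overfit to. A minimal sketch of the same procedure on a held-out split; the 70/30 split and `n_repeats=10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

# Hold out 30% of the data for scoring
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the drop in R²
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

The default score for a regressor is R², so each value is the average drop in test-set R² when that feature is shuffled.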

Choosing a Method#

Linear Data: Linear, Ridge, Lasso.
Non-linear Data: Random Forest, Gradient Boosting.
Feature Selection Focus: Lasso, Elastic Net.
Interpretability: Linear models, Decision Trees.