Feature Importance with Random Forests

Feature Importance in Random Forests measures how much each feature contributes to the model’s prediction accuracy. It helps in identifying the most influential input variables, improving performance, interpretability and computational efficiency.

Enhances model transparency by ranking influential features.
Can improve predictive accuracy by focusing the model on key variables.
Can reduce overfitting by removing less significant features.
Shortens training time and lowers computational cost when fewer features are used.

Importance

Understanding feature importance offers several advantages:

Enhanced Model Performance: By identifying the most influential features, we can prioritize them during model training, which can lead to more accurate predictions.
Faster Training Times: Training on only the most relevant features reduces the amount of data the model has to process, saving time and computational resources.
Reduced Overfitting: Overfitting occurs when a model memorizes the training data instead of learning general patterns. Dropping uninformative features gives the model fewer opportunities to fit noise in the training set.
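
As a rough illustration of the points above, the sketch below keeps only the features whose importance is at or above the median and compares test accuracy before and after. The use of scikit-learn's SelectFromModel, the "median" threshold and the variable names are assumptions made for this example; a real project would tune the cut-off.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y, feature_names = iris.data, iris.target, iris.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a forest inside the selector and keep the features whose importance
# is at or above the median importance (an arbitrary cut-off for this sketch).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
).fit(X_train, y_train)

kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
X_train_small = selector.transform(X_train)
X_test_small = selector.transform(X_test)

# Retrain on the reduced feature set and compare with the full model.
full_acc = selector.estimator_.score(X_test, y_test)
small_clf = RandomForestClassifier(n_estimators=100, random_state=42)
small_clf.fit(X_train_small, y_train)
small_acc = small_clf.score(X_test_small, y_test)

print("Kept features:", kept)
print(f"Accuracy with all features: {full_acc:.3f}")
print(f"Accuracy with selected features: {small_acc:.3f}")
```

On a dataset as small as iris the speed-up is negligible; the pattern matters more when there are hundreds of features.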

Feature Importance in Random Forests

Random Forests, a popular ensemble learning algorithm, consist of multiple decision trees that combine to produce robust predictions. Feature importance for a Random Forest can be measured with several methods:

Built-in Feature Importance: Uses the impurity-based metric computed during training, known as Gini importance or mean decrease in impurity.
Mean Decrease Accuracy (MDA): Measures how much accuracy drops when the values of a feature are shuffled.
Permutation Feature Importance: Repeats the shuffling several times with any scoring metric and averages the change in model performance.
SHAP (SHapley Additive exPlanations) Values: Quantifies the contribution of each feature to individual predictions.

Method 1: Built-in Feature Importance

The iris dataset is used throughout the article to demonstrate each feature importance method.

Step 1: Install dependencies and Libraries

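We install and import the required libraries, then load the data and train the model.

The dataset is split into training and testing sets to evaluate model performance.
We also extract the feature names for later use.
A Random Forest classifier with 100 trees is trained on the training set.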

Python

# Notebook-only shell command; in a terminal run: pip install shap
!pip install shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed by the plots in every method below
import shap
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data and keep the feature names for labelling results
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split dataset into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train a Random Forest with 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

Step 2: Calculate Gini Importance

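Gini importance is computed from the trees themselves, using only the training data.
Each time a feature is used to split a node, the resulting decrease in impurity is recorded, weighted by the share of samples reaching that node.
These decreases are summed per feature, averaged over all trees and normalized so that the importances sum to 1.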
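
To make "reducing impurity" concrete, here is a minimal sketch of the Gini impurity of a node and the impurity decrease produced by one split. The helper names gini and impurity_decrease and the toy label arrays are made up for this illustration; scikit-learn performs the equivalent computation internally for every split.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes perfectly removes all impurity.
clean_split = impurity_decrease(parent, np.array([0, 0, 0]), np.array([1, 1, 1]))

# A split that leaves both children mixed removes very little.
mixed_split = impurity_decrease(parent, np.array([0, 1, 0]), np.array([1, 0, 1]))

print(f"Parent impurity:         {gini(parent):.4f}")
print(f"Decrease, clean split:   {clean_split:.4f}")
print(f"Decrease, mixed split:   {mixed_split:.4f}")
```

A feature that repeatedly produces splits like the first one accumulates a high Gini importance; a feature that only produces splits like the second one accumulates very little.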

Python

importances = clf.feature_importances_
feature_imp_df = pd.DataFrame({'Feature': feature_names, 'Gini Importance': importances}).sort_values(
    'Gini Importance', ascending=False)
print(feature_imp_df)

Output:

Visualization plot for feature importance

Python

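plt.figure(figsize=(8, 4))
# Plot from the sorted DataFrame so the most important feature appears at the top
plt.barh(feature_imp_df['Feature'], feature_imp_df['Gini Importance'], color='skyblue')
plt.xlabel('Gini Importance')
plt.title('Feature Importance - Gini Importance')
plt.gca().invert_yaxis()
plt.show()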

Output:

Figure: Horizontal bar chart of the Gini importance of each feature.
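
As a cross-check on the built-in scores, the sketch below recomputes them by averaging the importances of the individual trees and also reports how much they vary from tree to tree. It retrains the same model so that it runs on its own; the names per_tree, mean_imp and std_imp are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# One row per tree, one column per feature.
# (The forest skips trees that consist of a single node; none occur here.)
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

spread_df = pd.DataFrame({
    'Feature': iris.feature_names,
    'Mean Importance': mean_imp,
    'Std Across Trees': std_imp,
}).sort_values('Mean Importance', ascending=False)
print(spread_df)

# The per-tree average should match the attribute reported by the forest.
print("Matches clf.feature_importances_:",
      np.allclose(mean_imp, clf.feature_importances_))
```

A large standard deviation relative to the mean suggests that the ranking of that feature is not stable across trees.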

Method 2: Mean Decrease Accuracy (MDA)

Mean Decrease Accuracy (MDA) measures the reduction in model accuracy when the values of a single feature are randomly permuted after training but before making predictions.
It is classically computed on each tree's out-of-bag samples; the code below uses the held-out test set instead.
A larger drop in accuracy means the feature is more important.

Python

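rng = np.random.default_rng(42)  # seeded so the shuffles are reproducible
importances = []
initial_accuracy = accuracy_score(y_test, clf.predict(X_test))
for i in range(X_test.shape[1]):
    X_test_copy = X_test.copy()
    # Permute one column to break its link with the target
    X_test_copy[:, i] = rng.permutation(X_test_copy[:, i])
    shuff_accuracy = accuracy_score(y_test, clf.predict(X_test_copy))
    importances.append(initial_accuracy - shuff_accuracy)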

accuracy_df = pd.DataFrame({'Feature': feature_names, 'Decrease in Accuracy': importances}).sort_values(
    'Decrease in Accuracy', ascending=False)
print(accuracy_df)

Output:

Visualization plot for feature importance

Python

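plt.figure(figsize=(8, 4))
# Plot from the sorted DataFrame so the most important feature appears at the top
plt.barh(accuracy_df['Feature'], accuracy_df['Decrease in Accuracy'], color='skyblue')
plt.xlabel('Mean Decrease Accuracy')
plt.title('Feature Importance - Mean Decrease Accuracy')
plt.gca().invert_yaxis()
plt.show()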

Output:

Figure: Horizontal bar chart of the decrease in accuracy for each feature.
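
A single shuffle on a small test set gives a noisy estimate. The sketch below repeats the shuffle several times per feature and averages the drops; the function name mean_decrease_accuracy, the choice of 20 repeats and the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def mean_decrease_accuracy(model, X, y, n_repeats=20, seed=0):
    """Average accuracy drop per feature over several random permutations."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for i in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Permute one column while leaving the others untouched
            X_perm[:, i] = rng.permutation(X_perm[:, i])
            drops[i, r] = baseline - accuracy_score(y, model.predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

mda_mean, mda_std = mean_decrease_accuracy(clf, X_test, y_test)
for name, m, s in zip(iris.feature_names, mda_mean, mda_std):
    print(f"{name:20s} {m:.3f} +/- {s:.3f}")
```

Reporting the standard deviation alongside the mean makes it easier to tell a genuinely important feature from one whose drop is within shuffle-to-shuffle noise.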

Method 3: Permutation Feature Importance

Measures importance by shuffling each feature several times and averaging the resulting change in the model's score.
It is model-agnostic and evaluated on held-out data, so it is less biased toward high-cardinality features than impurity-based Gini importance.
The permutation_importance() function from scikit-learn automates this process.

Python

from sklearn.inspection import permutation_importance
result = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1)
perm_imp_df = pd.DataFrame({'Feature': feature_names, 'Permutation Importance': result.importances_mean}).sort_values(
    'Permutation Importance', ascending=False)
print(perm_imp_df)

Output:

Here's what the permutation importance values suggest in this output:

petal length (cm): has the highest permutation importance (0.210526), meaning that shuffling the values of this feature causes the largest drop in the model's accuracy.
petal width (cm): also has a relatively high permutation importance (0.144737), though lower than petal length (cm).
sepal length (cm) and sepal width (cm): have permutation importance values of 0, meaning that shuffling these features does not change the model's accuracy on this test set. The model may simply be relying on the petal features instead.

Visualization plot for feature importance

Python

plt.figure(figsize=(6, 6))
plt.bar(perm_imp_df['Feature'], perm_imp_df['Permutation Importance'])
plt.xlabel('Feature')
plt.ylabel('Permutation Importance')
plt.title('Permutation Feature Importance')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Output:

Figure: Bar chart of the permutation importance of each feature.
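
The result returned by permutation_importance() also carries the spread across repeats, which the table above does not show. The sketch below, which retrains the same model so that it runs on its own, reports the mean and standard deviation side by side; passing scoring="accuracy" explicitly is a choice made here for clarity.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

result = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=0, scoring="accuracy")

# result.importances holds the raw scores: one row per feature, one column per repeat
summary = pd.DataFrame({
    'Feature': iris.feature_names,
    'Mean': result.importances_mean,
    'Std': result.importances_std,
}).sort_values('Mean', ascending=False)
print(summary)
```

A feature whose mean importance is smaller than its standard deviation should be treated with caution when ranking.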

Method 4: SHAP values

A SHAP (SHapley Additive exPlanations) value measures the contribution of a feature to the prediction for a single instance.

For a given prediction made by a Random Forest model, a SHAP value can be calculated for each feature of that specific instance.
Conceptually, it considers all possible subsets of features and measures the impact of including or excluding each feature on the prediction.

A positive SHAP value means the feature pushes the prediction above the model's average output, while a negative value pushes it below. The magnitude of the SHAP value represents the strength of the contribution. Averaging the absolute SHAP values over many instances gives a global importance score for each feature.
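
To show what "considering all possible subsets" means, here is a brute-force sketch of exact Shapley values for a hypothetical three-feature toy function. The function toy_model, the all-zero baseline and the convention of replacing "missing" features with baseline values are assumptions made for this illustration. The shap library's TreeExplainer uses a much faster tree-specific algorithm rather than this enumeration.

```python
from itertools import combinations
from math import factorial
import numpy as np

def toy_model(x):
    # Hypothetical model: two additive terms plus an interaction between x0 and x2
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

def value(subset, x, baseline):
    """Model output when only the features in `subset` take their real values."""
    z = baseline.copy()
    for j in subset:
        z[j] = x[j]
    return toy_model(z)

def shapley_values(x, baseline):
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                # Weight of a coalition of size k in the Shapley formula
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = (value(subset + (i,), x, baseline)
                        - value(subset, x, baseline))
                phi[i] += weight * gain
    return phi

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(x, baseline)

print("Shapley values:", phi)
print("Sum of values: ", phi.sum())
print("f(x) - f(base):", toy_model(x) - toy_model(baseline))
```

The values sum to the difference between the prediction and the baseline output, and the interaction term is shared equally between the two features involved in it.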

Python

import shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Older SHAP releases return a list with one (n_samples, n_features) array per class;
# newer releases return one array of shape (n_samples, n_features, n_classes).
if isinstance(shap_values, list):
    shap_values = np.stack(shap_values, axis=-1)
shap_values = np.asarray(shap_values)

if shap_values.ndim == 3:
    # Average |SHAP| over samples and classes, leaving one value per feature
    shap_summary = np.abs(shap_values).mean(axis=(0, 2))
else:
    shap_summary = np.abs(shap_values).mean(axis=0)

shap_summary_df = pd.DataFrame({
    'Feature': feature_names,
    'Mean |SHAP Value|': shap_summary
}).sort_values('Mean |SHAP Value|', ascending=False)

print(shap_summary_df)
plt.figure(figsize=(8, 4))
plt.barh(shap_summary_df['Feature'],
         shap_summary_df['Mean |SHAP Value|'], color='skyblue')
plt.xlabel('Mean Absolute SHAP Value')
plt.title('Feature Importance (SHAP Values)')
plt.gca().invert_yaxis()
plt.show()

Output:
