Feature selection is an important step in the machine learning pipeline. By identifying and retaining only the most relevant features, we can build models that generalize better, train faster, and are easier to interpret. Among the various approaches, filter methods are popular due to their simplicity, speed, and independence from specific machine learning models.
What is Feature Selection?
Feature selection is the process of selecting a subset of relevant features (predictor variables) from a larger set. Unlike feature extraction, which creates new features from combinations or transformations of original ones, feature selection retains the original variables.
Let:
\mathbf{X} \in \mathbb{R}^{n \times d} \quad \text{(feature matrix with } n \text{ samples and } d \text{ features)}
\mathbf{y} \in \mathbb{R}^n \quad \text{(target variable)}
Our goal is to select a subset of k ≤ d columns of X that are most informative about y.
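In terms of the notation above, selecting features means keeping a chosen set of columns of X. A minimal sketch, where the toy matrix and the index list `S` are made up purely for illustration:

```python
import numpy as np

# Toy feature matrix: n = 4 samples, d = 5 features (values are arbitrary).
X = np.arange(20, dtype=float).reshape(4, 5)

# Hypothetical subset of feature indices to keep (k = 2 of d = 5).
S = [0, 3]

# Feature selection keeps original columns unchanged; no new features are built.
X_S = X[:, S]

print(X_S.shape)  # (4, 2)
```

Feature extraction, by contrast, would produce columns that do not appear in X at all.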
Feature Selection: Where Do Filter Methods Fit?
| Method | Model Involvement | Speed | Captures Feature Interaction |
|---|---|---|---|
| Filter | No | Fast | No |
| Wrapper | Yes | Slow | Yes |
| Embedded | Yes | Moderate | Yes |
Filter methods are typically used early in the pipeline, especially during exploratory data analysis (EDA) or as a first-pass reduction technique before applying more sophisticated methods.
Filter Methods
Filter methods evaluate the relevance of features by examining their intrinsic properties, independently of any predictive model. This makes them highly scalable and general-purpose.
Key Characteristics
- Model-agnostic: No dependency on specific algorithms.
- Fast and scalable: Ideal for high-dimensional data.
- Mostly univariate: Features are assessed individually.
- Preprocessing step: Often used before model training.
Common Filter Techniques
1. Variance Thresholding
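Features whose values barely change across samples carry little information for telling samples apart and can be removed. Variance depends on a feature's scale, so thresholds are only comparable between features measured on similar scales.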
\text{Var}(X_j) = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2
Remove feature X_j when \text{Var}(X_j) falls below a chosen threshold.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
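To see what the threshold does, the variance formula above can be evaluated directly with NumPy. This is only a sketch on made-up data; the threshold 0.01 simply mirrors the snippet above.

```python
import numpy as np

# Toy data: column 0 is constant, column 1 is nearly constant, column 2 varies.
X = np.array([
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 4.0],
    [1.0, 0.1, 6.0],
    [1.0, 0.0, 8.0],
])

# Population variance (divide by n), matching the formula above.
variances = ((X - X.mean(axis=0)) ** 2).mean(axis=0)

threshold = 0.01  # illustrative value, not a recommendation
keep = variances > threshold

X_selected = X[:, keep]
print(variances)
print(keep)
```

Only the third column survives here. scikit-learn's `VarianceThreshold` applies the same rule, keeping features whose variance is strictly greater than the threshold.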
2. Correlation Coefficient
Measures the linear relationship between a feature and the target. For continuous y, use Pearson correlation:
\rho_{X_j, y} = \frac{\text{Cov}(X_j, y)}{\sigma_{X_j} \sigma_y}
Drop features whose absolute correlation with the target falls below a chosen threshold.
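import pandas as pd
# df is assumed to be a DataFrame of numeric feature columns plus a numeric 'target' column
correlations = df.corr()['target'].drop('target')
# Keep features whose absolute Pearson correlation with the target exceeds 0.1
selected = correlations[correlations.abs() > 0.1].index
X_selected = df[selected]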
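The Pearson formula above can also be computed by hand, which makes it easy to check what a correlation filter is actually measuring. The synthetic data and the helper name `pearson` below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: y depends linearly on x1 only; x2 is unrelated noise.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def pearson(a, b):
    """Pearson correlation computed as Cov(a, b) / (sigma_a * sigma_b)."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).mean() / (a.std() * b.std())

r1, r2 = pearson(x1, y), pearson(x2, y)
print(round(r1, 3), round(r2, 3))
```

With data like this, `x1` should score close to 1 and `x2` close to 0, so a threshold on the absolute value would keep the first and drop the second.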
3. Chi-Squared Test (χ²)
For categorical targets and categorical features, the Chi-squared test assesses whether a feature and the target are dependent:
\chi^2 = \sum \frac{(O - E)^2}{E}
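Where O is the observed frequency and E is the frequency expected if the feature and the target were independent.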
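from sklearn.feature_selection import SelectKBest, chi2
# chi2 requires non-negative feature values (e.g. counts or one-hot encodings);
# k must not exceed the number of features in X
X_new = SelectKBest(score_func=chi2, k=10).fit_transform(X, y)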
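As a worked instance of the formula above, the statistic can be computed by hand from a contingency table of observed counts. The 2x2 table below is hypothetical.

```python
import numpy as np

# Hypothetical 2x2 contingency table of observed counts:
# rows = feature value (0/1), columns = class label (0/1).
observed = np.array([
    [30.0, 10.0],
    [10.0, 30.0],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if feature and class were independent.
expected = row_totals @ col_totals / total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # 20.0
```

Every expected count is 20 here, so each cell contributes 100 / 20 = 5. Note that scikit-learn's `chi2` scorer builds its observed counts from per-class sums of the feature values, so its scores need not equal a statistic computed from a full contingency table like this one.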
4. Mutual Information (MI)
Captures both linear and non-linear dependencies between variables. Mutual information between feature X_j and target y is:
I(X_j; y) = \sum_{x_j} \sum_y p(x_j, y) \log \left( \frac{p(x_j, y)}{p(x_j)p(y)} \right)
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y)
mi_selected = X[:, mi_scores > 0.1]
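For discrete variables the sum above can be evaluated directly from empirical frequencies. The sketch below uses a hypothetical helper, `mutual_information`, and toy data where the target is a non-linear function of the feature; it is a plug-in estimate, not the nearest-neighbour estimator that scikit-learn uses for continuous features.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(x; y) in nats for two discrete sequences."""
    n = len(x)
    count_xy = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (xv, yv), count in count_xy.items():
        joint = count / n
        mi += joint * np.log(joint / ((count_x[xv] / n) * (count_y[yv] / n)))
    return mi

# Toy data: y = x**2 is fully determined by x, but not linearly.
x = [-1, 0, 1] * 100
y = [v ** 2 for v in x]

mi = mutual_information(x, y)
r = np.corrcoef(x, y)[0, 1]
print(round(mi, 4), round(r, 4))
```

The Pearson correlation is essentially zero while the mutual information is about 0.64 nats, which is the kind of dependence a correlation filter would miss.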
5. F-test (ANOVA)
Used for classification problems with continuous features and categorical targets. For each feature, it tests whether the feature's mean differs significantly across the target classes.
F = \frac{\text{Between-group variability}}{\text{Within-group variability}}
from sklearn.feature_selection import f_classif
F_values, p_values = f_classif(X, y)
X_selected = X[:, p_values < 0.05]
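The ratio above can be made concrete with a one-way ANOVA computed by hand for a single feature. The helper name `anova_f` and the toy values are assumptions for illustration.

```python
import numpy as np

def anova_f(values, labels):
    """One-way ANOVA F-statistic for a single feature grouped by class label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = len(values), len(classes)
    grand_mean = values.mean()

    ss_between = 0.0
    ss_within = 0.0
    for c in classes:
        group = values[labels == c]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        ss_within += ((group - group.mean()) ** 2).sum()

    # Mean squares: divide each sum of squares by its degrees of freedom.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy feature whose class means are 2, 3 and 4.
values = [1, 2, 3, 2, 3, 4, 3, 4, 5]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print(anova_f(values, labels))  # 3.0
```

Here the between-group mean square is 6 / 2 = 3 and the within-group mean square is 6 / 6 = 1, giving F = 3. Larger values mean the class means are further apart relative to the spread inside each class.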
Comparison Table
| Method | Target Type | Feature Type | Captures Non-linear | Model-agnostic |
|---|---|---|---|---|
| Variance Threshold | Not used (unsupervised) | Numeric | No | Yes |
| Correlation | Continuous | Continuous | No | Yes |
| Chi-Squared | Categorical | Categorical | No | Yes |
| Mutual Information | Any | Any | Yes | Yes |
| ANOVA F-test | Categorical | Continuous | No | Yes |
Implementation of Filter Methods
Step-by-Step
- Preprocess Data: Handle missing values, encode categorical variables.
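- Normalize/Standardize (when needed): Scale features if a later step is scale-sensitive. Apply variance-based filtering before z-score standardization, since standardizing sets every feature's variance to 1; Pearson correlation is unaffected by scaling.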
- Apply Filter Criteria: Use thresholding based on chosen metrics.
- Evaluate Reduced Feature Set: Optionally use wrapper or embedded methods afterward.
- Train Model: Proceed with model training using the filtered feature set.
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load sample data (150 samples, 4 features)
iris = load_iris()
X = iris.data
y = iris.target
# Remove low-variance features on the original scale
# (after standardization every feature has variance 1, so the threshold would do nothing)
selector_var = VarianceThreshold(threshold=0.1)
X_var = selector_var.fit_transform(X)
# Standardize the remaining features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_var)
# ANOVA F-test for classification; k must not exceed the number of features (iris has 4)
selector_f = SelectKBest(score_func=f_classif, k=2)
X_selected = selector_f.fit_transform(X_scaled, y)
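When the filtered features feed a model that is evaluated with cross-validation, one option is to wrap the filter steps in a scikit-learn `Pipeline` so they are refit on each training fold rather than on the full dataset. The sketch below is one possible arrangement; the classifier, `k=2`, and the step names are illustrative choices that do not come from the steps above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.1)),     # applied on raw feature scales
    ("scale", StandardScaler()),
    ("anova", SelectKBest(score_func=f_classif, k=2)),  # k is an illustrative choice
    ("model", LogisticRegression(max_iter=1000)),       # stand-in classifier
])

# Each fold refits the filters and the model on its own training split.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Keeping selection inside the pipeline means the held-out fold never influences which features are chosen.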
Advantages of Filter Methods
- Simplicity and Speed: They are straightforward to implement and very fast to compute, especially compared to wrapper or embedded methods.
- Model Independence: Since they don't depend on any learning algorithm, you can use them before trying out different models.
- Scalability: Filter methods are ideal for high-dimensional datasets where computational cost is a concern.
- Good for Initial Feature Reduction: They can be used as a first-pass to eliminate obviously irrelevant features before applying more complex techniques.
Limitations of Filter Methods
- Ignores Feature Interaction: Most filter methods consider one feature at a time (univariate). They do not capture interactions between multiple features.
- Threshold Sensitivity: Performance can depend heavily on manually chosen thresholds (e.g., correlation > 0.1).
- Risk of Losing Useful Features: Some features may not appear important on their own but could be valuable in combination with others.
- No Feedback from Model: Since the model is not involved in selection, you might miss features that could have performed well in practice.
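The interaction limitation is easy to reproduce with a small synthetic case: a target that is the XOR of two binary features. This is a constructed example, not data from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two independent binary features; the target is their XOR.
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = x1 ^ x2

# Univariate view: each feature alone is nearly uncorrelated with y.
r1 = np.corrcoef(x1, y)[0, 1]
r2 = np.corrcoef(x2, y)[0, 1]

# Conditional view: once x2 is fixed, x1 determines y exactly.
r_given_0 = np.corrcoef(x1[x2 == 0], y[x2 == 0])[0, 1]  # y equals x1
r_given_1 = np.corrcoef(x1[x2 == 1], y[x2 == 1])[0, 1]  # y equals 1 - x1

print(round(r1, 3), round(r2, 3), round(r_given_0, 3), round(r_given_1, 3))
```

A univariate correlation filter would likely discard both features, even though together they predict the target perfectly.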
When to Use Filter Methods?
Use filter methods:
- As a preprocessing step before applying wrapper or embedded methods.
- When dealing with very high-dimensional data like gene expression, text classification, etc.
- For quick feature elimination in exploratory data analysis.
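For the high-dimensional case, a quick first pass might look like the sketch below. The synthetic dataset stands in for something like gene expression data, and the 5% cutoff is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in for a wide dataset: 200 samples, 2000 features, 20 informative.
X, y = make_classification(
    n_samples=200, n_features=2000, n_informative=20,
    n_redundant=0, random_state=0,
)

# Keep roughly the top 5% of features ranked by ANOVA F-score.
selector = SelectPercentile(score_func=f_classif, percentile=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, X_reduced.shape)
```

The reduced matrix can then be passed to a wrapper or embedded method, which would be far more expensive to run on all 2000 columns.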