Feature Selection | Filter Methods

Feature selection is an important step in the machine learning pipeline. By identifying and retaining only the most relevant features, we can build models that generalize better, train faster, and are easier to interpret. Among the various approaches, filter methods are popular due to their simplicity, speed, and independence from specific machine learning models.

What is Feature Selection?

Feature selection is the process of selecting a subset of relevant features (predictor variables) from a larger set. Unlike feature extraction, which creates new features from combinations or transformations of original ones, feature selection retains the original variables.

Let: \mathbf{X} \in \mathbb{R}^{n \times d} \quad \text{(feature matrix with } n \text{ samples and } d \text{ features)}

\mathbf{y} \in \mathbb{R}^n \quad \text{(target variable)}

Our goal is to select a subset S \subseteq \{1, 2, ..., d\} such that features in S are most predictive of \mathbf{y}.
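
As a minimal sketch of this notation, selecting S amounts to keeping a subset of the columns of \mathbf{X} unchanged. The random data and the index set below are hypothetical and only illustrate the shapes involved; note that NumPy indices are 0-based, whereas the definition above is 1-based.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix with n = 6 samples and d = 4 features
X = rng.normal(size=(6, 4))

# Hypothetical selected subset S (0-based column indices)
S = [0, 2]

# Feature selection keeps the original columns as they are
X_S = X[:, S]
print(X.shape, "->", X_S.shape)
```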

Feature Selection: Where Do Filter Methods Fit?

Method	Model Involvement	Speed	Captures Feature Interaction
Filter	No	Fast	No
Wrapper	Yes	Slow	Yes
Embedded	Yes	Moderate	Yes

Filter methods are typically used early in the pipeline, especially during exploratory data analysis (EDA) or as a first-pass reduction technique before applying more sophisticated methods.

Filter Methods

Filter methods evaluate the relevance of features by examining their intrinsic properties, independently of any predictive model. This makes them highly scalable and general-purpose.

Key Characteristics

Model-agnostic: No dependency on specific algorithms.
Fast and scalable: Ideal for high-dimensional data.
Mostly univariate: Features are assessed individually.
Preprocessing step: Often used before model training.

Common Filter Techniques

1. Variance Thresholding

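Features whose values barely change across samples carry little information for telling samples apart and are candidates for removal. Because variance depends on a feature's units, a single threshold is only meaningful when features are on comparable scales.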

\text{Var}(X_j) = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2

Remove X_j if \text{Var}(X_j) < \theta.

Python

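from sklearn.feature_selection import VarianceThreshold

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features)
selector = VarianceThreshold(threshold=0.01)  # keeps features with variance > 0.01
X_selected = selector.fit_transform(X)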
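
The following sketch checks the variance formula above against what VarianceThreshold stores after fitting. The small matrix is made up for illustration: one constant column, one nearly constant column, and one column with a large spread.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: constant, nearly constant, and widely spread columns
X_demo = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 1.1, 20.0],
    [0.0, 0.9, 30.0],
    [0.0, 1.0, 40.0],
])

# Population variance (divide by n), as in the formula above
manual_var = ((X_demo - X_demo.mean(axis=0)) ** 2).mean(axis=0)

selector_demo = VarianceThreshold(threshold=0.01)
X_kept = selector_demo.fit_transform(X_demo)
kept_mask = selector_demo.get_support()

print("manual variances: ", manual_var)
print("sklearn variances:", selector_demo.variances_)
print("kept columns:     ", kept_mask)
```

Only the third column exceeds the threshold here. Rescaling the second column (for example, multiplying it by 100) would change that outcome, which is why the threshold has to be chosen with the feature units in mind.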

2. Correlation Coefficient

Measures the strength of the linear relationship between a feature and the target. For a continuous y, use the Pearson correlation:

\rho_{X_j, y} = \frac{\text{Cov}(X_j, y)}{\sigma_{X_j} \sigma_y}

Drop features whose absolute correlation with the target is low, i.e. |\rho_{X_j, y}| < \theta.

Python

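import pandas as pd

# df is assumed to be a DataFrame of numeric features plus a numeric 'target' column
correlations = df.corr()['target'].drop('target')

# Keep features whose absolute Pearson correlation with the target exceeds 0.1
selected = correlations[correlations.abs() > 0.1].index
X_selected = df[selected]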
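
A self-contained sketch of the same filter on synthetic data follows. The column names, the data-generating process, and the helper `pearson` are all hypothetical; the helper simply implements the covariance-over-standard-deviations formula above so it can be compared with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic data: 'signal' drives the target, 'noise' is unrelated to it
signal = rng.normal(size=n)
noise = rng.normal(size=n)
df_demo = pd.DataFrame({
    "signal": signal,
    "noise": noise,
    "target": 2.0 * signal + rng.normal(scale=0.5, size=n),
})


def pearson(a, b):
    """Pearson correlation: Cov(a, b) / (sigma_a * sigma_b)."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())


rho_manual = pearson(df_demo["signal"].to_numpy(), df_demo["target"].to_numpy())
rho_pandas = df_demo.corr()["target"].drop("target")

# Same thresholding rule as in the snippet above
kept = rho_pandas[rho_pandas.abs() > 0.1].index.tolist()
print(rho_pandas)
print("kept:", kept)
```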

3. Chi-Squared Test (χ²)

For categorical targets and categorical features, the Chi-squared test assesses dependence:

\chi^2 = \sum \frac{(O - E)^2}{E}

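Here O is the observed frequency in each cell of the feature-by-class contingency table, and E is the frequency expected in that cell if the feature and the target were independent.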

Python

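from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (e.g. counts or one-hot encodings),
# and k must not exceed the number of features in X
X_new = SelectKBest(score_func=chi2, k=10).fit_transform(X, y)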
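
To make the statistic concrete, the sketch below computes χ² by hand for a made-up 2×2 contingency table (rows are feature categories, columns are classes) and compares it with `scipy.stats.chi2_contingency` with the continuity correction turned off. The counts are illustrative only and do not come from any dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed counts: rows = feature category, columns = class
O = np.array([
    [30, 10],
    [20, 40],
])

# Expected counts under independence: (row total * column total) / grand total
row_totals = O.sum(axis=1)
col_totals = O.sum(axis=0)
E = np.outer(row_totals, col_totals) / O.sum()

# Chi-squared statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((O - E) ** 2 / E).sum()

# Reference value from SciPy (no Yates continuity correction)
stat, p_value, dof, expected = chi2_contingency(O, correction=False)

print("manual chi2:", chi2_manual)
print("scipy chi2: ", stat, "p-value:", p_value)
```

A large statistic with a small p-value suggests the feature and the class are dependent, which is the signal a filter would use to keep the feature.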

4. Mutual Information (MI)

Measures any statistical dependency between variables, including non-linear ones. For a discrete feature X_j and target y, the mutual information is:

I(X_j; y) = \sum_{x_j} \sum_y p(x_j, y) \log \left( \frac{p(x_j, y)}{p(x_j)p(y)} \right)

Python

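from sklearn.feature_selection import mutual_info_classif

# For a categorical target; use mutual_info_regression for a continuous one
mi_scores = mutual_info_classif(X, y)
mi_selected = X[:, mi_scores > 0.1]  # X is assumed to be a NumPy array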
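
The sketch below illustrates the kind of dependency mutual information can pick up and a linear measure cannot. The data are synthetic and the variable names are hypothetical: the class depends on the magnitude of a feature, not its sign, so the Pearson correlation should be close to zero while the estimated mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# The class depends on |x_informative|, a symmetric non-linear relationship
x_informative = rng.uniform(-1.0, 1.0, size=n)
x_noise = rng.uniform(-1.0, 1.0, size=n)
y_demo = (np.abs(x_informative) > 0.5).astype(int)

X_demo = np.column_stack([x_informative, x_noise])

# Linear correlation between the informative feature and the class label
pearson_r = np.corrcoef(x_informative, y_demo)[0, 1]

# Mutual information estimates (random_state fixed for reproducibility)
mi_demo = mutual_info_classif(X_demo, y_demo, random_state=0)

print("Pearson r (informative):", pearson_r)
print("MI (informative, noise):", mi_demo)
```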

5. F-test (ANOVA)

Used for classification problems with continuous features and categorical targets. It tests whether a feature's mean differs significantly across the target classes.

F = \frac{\text{Between-group variability}}{\text{Within-group variability}}

Python

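from sklearn.feature_selection import f_classif

# One F statistic and one p-value per feature (column of X)
F_values, p_values = f_classif(X, y)
X_selected = X[:, p_values < 0.05]  # keep features significant at the 5% level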
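
As a sketch of what the ratio above means in practice, the helper below (a hypothetical function named `anova_f`) computes the one-way ANOVA F statistic for a single feature as between-group mean square over within-group mean square, and compares the result with `f_classif` on the iris dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X_iris, y_iris = load_iris(return_X_y=True)


def anova_f(x, y):
    """One-way ANOVA F statistic for one feature x and class labels y."""
    classes = np.unique(y)
    n, k = len(x), len(classes)
    grand_mean = x.mean()
    # Between-group sum of squares: group size times squared mean offset
    ss_between = sum(
        np.sum(y == c) * (x[y == c].mean() - grand_mean) ** 2 for c in classes
    )
    # Within-group sum of squares: spread of each group around its own mean
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


F_manual = np.array(
    [anova_f(X_iris[:, j], y_iris) for j in range(X_iris.shape[1])]
)
F_sklearn, p_sklearn = f_classif(X_iris, y_iris)

print("manual F: ", F_manual)
print("sklearn F:", F_sklearn)
```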

Comparison Table

Method	Target Type	Feature Type	Captures Non-linear Relationships	Model-agnostic
Variance Threshold	Not used (unsupervised)	Numeric	No	Yes
Correlation	Continuous	Continuous	No	Yes
Chi-Squared	Categorical	Categorical	No	Yes
Mutual Information	Any	Any	Yes	Yes
ANOVA F-test	Categorical	Continuous	No	Yes

Implementation of Filter Methods

Step-by-Step

Preprocess Data: Handle missing values, encode categorical variables.
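Normalize/Standardize (when needed): Put features on comparable scales before variance-based filtering, for example with min-max scaling. Avoid z-score standardization before a variance threshold, since it sets the variance of every non-constant feature to 1. Pearson correlation is unaffected by rescaling.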
Apply Filter Criteria: Use thresholding based on chosen metrics.
Evaluate Reduced Feature Set: Optionally use wrapper or embedded methods afterward.
Train Model: Proceed with model training using the filtered feature set.

Python

from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load sample data (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target

# Remove low-variance features on the original scale; standardizing first
# would give every non-constant feature variance 1 and defeat the threshold
selector_var = VarianceThreshold(threshold=0.1)
X_var = selector_var.fit_transform(X)

# Standardize the remaining features
scaler = StandardScaler()
X_sca = scaler.fit_transform(X_var)

# ANOVA F-test for classification; k must not exceed the number of
# features (iris has only 4), so keep the 2 highest-scoring ones
sele_f = SelectKBest(score_func=f_classif, k=2)
X_sele = sele_f.fit_transform(X_sca, y)
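
The steps listed above can also be chained in a scikit-learn Pipeline, so that the filters are re-fitted inside each cross-validation fold and the held-out data never influences which features are kept. This is a sketch under assumptions: the logistic regression model, the 5-fold split, and k=2 are illustrative choices, and the variance filter is applied before scaling so that its threshold remains meaningful.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.1)),     # filter on raw scale
    ("scale", StandardScaler()),                        # then standardize
    ("anova", SelectKBest(score_func=f_classif, k=2)),  # keep 2 best features
    ("model", LogisticRegression(max_iter=1000)),       # illustrative model
])

# Evaluate the whole pipeline; selection is repeated within each fold
scores = cross_val_score(pipe, X_iris, y_iris, cv=5)
print("mean CV accuracy:", scores.mean())

# Fit on all data to inspect which features the ANOVA step keeps
pipe.fit(X_iris, y_iris)
kept_mask = pipe.named_steps["anova"].get_support()
print("kept by ANOVA step:", kept_mask)
```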

Advantages of Filter Methods

Simplicity and Speed: They are straightforward to implement and very fast to compute, especially compared to wrapper or embedded methods.
Model Independence: Since they don’t depend on any learning algorithm, you can use them before trying out different models.
Scalability: Filter methods are ideal for high-dimensional datasets where computational cost is a concern.
Good for Initial Feature Reduction: They can be used as a first pass to eliminate obviously irrelevant features before applying more complex techniques.

Limitations of Filter Methods

Ignores Feature Interaction: Most filter methods consider one feature at a time (univariate). They do not capture interactions between multiple features.
Threshold Sensitivity: Performance can depend heavily on manually chosen thresholds (e.g., correlation > 0.1).
Risk of Losing Useful Features: Some features may not appear important on their own but could be valuable in combination with others.
No Feedback from Model: Since the model is not involved in selection, you might miss features that could have performed well in practice.
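
The interaction limitation can be seen on a small synthetic example. In the sketch below (all names and data are hypothetical), the class is determined by the sign of the product of two features. Each feature on its own has nearly identical means in both classes, so a univariate F-test gives it a low score, while the product of the two scores very highly.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 1000

# The class depends only on the interaction x1 * x2 (an XOR-like pattern)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_xor = (x1 * x2 > 0).astype(int)

# Columns: x1 alone, x2 alone, and their product
X_xor = np.column_stack([x1, x2, x1 * x2])

F_scores, p_vals = f_classif(X_xor, y_xor)
print("F scores (x1, x2, x1*x2):", F_scores)
```

A univariate filter applied to only x1 and x2 would likely discard both, even though together they determine the class exactly. Wrapper or embedded methods, which evaluate features jointly, are better suited to this situation.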

When to Use Filter Methods?

Use filter methods:

As a preprocessing step before applying wrapper or embedded methods.
When dealing with very high-dimensional data, such as gene expression or text classification datasets.
For quick feature elimination in exploratory data analysis.
