Feature Selection | Filter Methods

Last Updated : 23 Jul, 2025

Feature selection is an important step in the machine learning pipeline. By identifying and retaining only the most relevant features, we can build models that generalize better, train faster, and are easier to interpret. Among the various approaches, filter methods are popular due to their simplicity, speed, and independence from specific machine learning models.

What is Feature Selection?

Feature selection is the process of selecting a subset of relevant features (predictor variables) from a larger set. Unlike feature extraction, which creates new features from combinations or transformations of original ones, feature selection retains the original variables.

Let: \mathbf{X} \in \mathbb{R}^{n \times d} \quad \text{(feature matrix with } n \text{ samples and } d \text{ features)}

\mathbf{y} \in \mathbb{R}^n \quad \text{(target variable)}

Our goal is to select a subset S \subseteq \{1, 2, ..., d\} such that features in S are most predictive of \mathbf{y}.

Feature Selection: Where Do Filter Methods Fit?

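| Method   | Model Involvement | Speed    | Captures Feature Interaction |
|----------|-------------------|----------|------------------------------|
| Filter   | No                | Fast     | No                           |
| Wrapper  | Yes               | Slow     | Yes                          |
| Embedded | Yes               | Moderate | Yes                          |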

Filter methods are typically used early in the pipeline, especially during exploratory data analysis (EDA) or as a first-pass reduction technique before applying more sophisticated methods.

Filter Methods

Filter methods evaluate the relevance of features by examining their intrinsic properties, independently of any predictive model. This makes them highly scalable and general-purpose.

Key Characteristics

  • Model-agnostic: No dependency on specific algorithms.
  • Fast and scalable: Ideal for high-dimensional data.
  • Mostly univariate: Features are assessed individually.
  • Preprocessing step: Often used before model training.

Common Filter Techniques

1. Variance Thresholding

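Features whose values barely change across samples carry little information for telling samples apart, so they are candidates for removal. Because variance depends on a feature's scale, a single threshold is only meaningful when features are on comparable scales.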

\text{Var}(X_j) = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2

Remove X_j if \text{Var}(X_j) < \theta.

Python
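from sklearn.feature_selection import VarianceThreshold

# X: numeric feature matrix of shape (n_samples, n_features)
# Drop every feature whose variance does not exceed 0.01
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)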
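
To make the variance formula concrete, the sketch below computes the population variance by hand and checks that `VarianceThreshold` keeps the same columns. The toy matrix and the 0.01 threshold are illustrative assumptions, not values from a real dataset.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (illustrative values): column 1 is constant, column 2 barely varies
X = np.array([
    [1.0, 5.0, 0.10],
    [2.0, 5.0, 0.11],
    [3.0, 5.0, 0.10],
    [4.0, 5.0, 0.09],
])

# Population variance (divide by n), matching the formula above
variances = ((X - X.mean(axis=0)) ** 2).mean(axis=0)

theta = 0.01
manual_mask = variances > theta  # keep features whose variance exceeds theta

selector = VarianceThreshold(threshold=theta)
X_selected = selector.fit_transform(X)

print(variances)               # per-feature variance
print(selector.get_support())  # boolean mask of kept features
```

Only the first column survives here: the constant column has zero variance, and the third column's variance is far below the threshold.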

2. Correlation Coefficient

Measures the strength of the linear relationship between a feature and the target. For continuous 𝑦, use the Pearson correlation:

\rho_{X_j, y} = \frac{\text{Cov}(X_j, y)}{\sigma_{X_j} \sigma_y}

Drop features with low absolute correlation |\rho| < \theta.

Python
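import pandas as pd

# Assumes df is a pandas DataFrame whose numeric columns include 'target'
# numeric_only=True (pandas >= 1.5) skips non-numeric columns
correlations = df.corr(numeric_only=True)['target'].drop('target')
# Keep features whose absolute correlation with the target exceeds 0.1
selected = correlations[correlations.abs() > 0.1].index
X_selected = df[selected]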
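
As a runnable illustration of the correlation filter, the sketch below builds a small synthetic DataFrame, computes the Pearson coefficient by hand from the formula, and compares it with `DataFrame.corr`. The column names `signal` and `noise`, the data-generating process, and the 0.1 threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (assumption): 'signal' drives the target, 'noise' does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = 2.0 * signal + rng.normal(size=n)
df = pd.DataFrame({"signal": signal, "noise": noise, "target": target})


def pearson(x, y):
    """Pearson correlation: Cov(x, y) / (sigma_x * sigma_y)."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c * y_c).mean() / (x.std() * y.std())


manual = pearson(df["signal"].to_numpy(), df["target"].to_numpy())

# Same quantity via pandas, then threshold on the absolute value
correlations = df.corr()["target"].drop("target")
selected = correlations[correlations.abs() > 0.1].index.tolist()

print(manual, correlations["signal"])
print(selected)
```

The hand-computed value agrees with pandas, and only the `signal` column passes the threshold on this synthetic data.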

3. Chi-Squared Test (χ²)

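For categorical targets and categorical features, the Chi-squared test assesses whether a feature and the target are statistically dependent. Note that scikit-learn's chi2 scorer requires non-negative feature values, such as counts or one-hot encodings.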

\chi^2 = \sum \frac{(O - E)^2}{E}

Where O = observed frequency, E = expected frequency.

Python
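from sklearn.feature_selection import SelectKBest, chi2

# X must contain non-negative values (e.g., counts or one-hot encodings)
# Keep the 10 features with the highest chi-squared scores
X_new = SelectKBest(score_func=chi2, k=10).fit_transform(X, y)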
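
The statistic itself can be worked through by hand on a small contingency table. The sketch below uses hypothetical counts (rows are the two categories of a feature, columns are the two classes) and applies the formula directly. It illustrates the classical table-based statistic; scikit-learn's `chi2` builds its scores from per-class feature totals, so its numbers may not match this value.

```python
import numpy as np

# Hypothetical observed counts: rows = feature category, columns = class
observed = np.array([
    [30, 10],
    [20, 40],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total
expected = row_totals * col_totals / total

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_stat, 3))
```

For these counts the expected table is [[20, 20], [30, 30]] and the statistic is 5 + 5 + 3.333 + 3.333 ≈ 16.667, which indicates that the observed counts deviate strongly from independence.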

4. Mutual Information (MI)

Measures how much knowing a feature reduces uncertainty about the target, which captures non-linear as well as linear dependencies. For a discrete feature 𝑋𝑗 and target 𝑦, the mutual information is:

I(X_j; y) = \sum_{x_j} \sum_y p(x_j, y) \log \left( \frac{p(x_j, y)}{p(x_j)p(y)} \right)

Python
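from sklearn.feature_selection import mutual_info_classif

# X: NumPy array of features; fix random_state for reproducible estimates
mi_scores = mutual_info_classif(X, y, random_state=0)
# Keep features whose estimated MI exceeds 0.1
mi_selected = X[:, mi_scores > 0.1]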
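
The following sketch shows why mutual information is useful alongside correlation. It uses synthetic data (an assumption for illustration) in which the class depends on |x|, so the relationship is strong but symmetric, and the Pearson correlation is essentially zero.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data (assumption): the class depends on |x|, a symmetric,
# non-linear relationship; the second column is unrelated noise
x = np.linspace(-1.0, 1.0, 1000)
y = (np.abs(x) > 0.5).astype(int)
noise = rng.normal(size=x.size)
X = np.column_stack([x, noise])

# Pearson correlation is ~0 because the relationship is symmetric around 0
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information still detects the dependence on the first column
mi_scores = mutual_info_classif(X, y, random_state=0)

print(pearson)
print(mi_scores)
```

A correlation filter would discard the first column here, while the mutual information score ranks it well above the noise column.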

5. F-test (ANOVA)

Used for classification problems with continuous features and categorical targets. It tests whether the mean of a feature differs significantly across the target classes.

F = \frac{\text{Between-group variability}}{\text{Within-group variability}}

Python
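from sklearn.feature_selection import f_classif

# One F-statistic and one p-value per feature
F_values, p_values = f_classif(X, y)
# Keep features that are significant at the 5% level (X is a NumPy array)
X_selected = X[:, p_values < 0.05]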
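
The ratio above can be unpacked into sums of squares divided by their degrees of freedom. The sketch below computes the one-way ANOVA F-statistic by hand for the first feature of the iris dataset (chosen here only as a convenient bundled example) and compares it with `f_classif`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
x = X[:, 0]  # first feature (sepal length)

classes = np.unique(y)
n, k = x.size, classes.size
grand_mean = x.mean()

# Between-group sum of squares: class size * (class mean - grand mean)^2
ss_between = sum(
    (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes
)
# Within-group sum of squares: deviations from each class's own mean
ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)

# Each sum of squares is divided by its degrees of freedom
F_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

F_values, p_values = f_classif(X, y)
print(F_manual, F_values[0])
```

The two values agree up to floating-point error, which confirms that `f_classif` scores each feature with a one-way ANOVA across the classes.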

Comparison Table

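| Method             | Target Type | Feature Type | Captures Non-linear | Model-agnostic |
|--------------------|-------------|--------------|---------------------|----------------|
| Variance Threshold | Any         | Any          | No                  | Yes            |
| Correlation        | Continuous  | Continuous   | No                  | Yes            |
| Chi-Squared        | Categorical | Categorical  | No                  | Yes            |
| Mutual Information | Any         | Any          | Yes                 | Yes            |
| ANOVA F-test       | Categorical | Continuous   | No                  | Yes            |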

Implementation of Filter Methods

Step-by-Step

  1. Preprocess Data: Handle missing values, encode categorical variables.
  2. Normalize/Standardize (when needed): Put features on comparable scales for scale-sensitive steps. Apply variance thresholding before standardizing, because standardization sets every feature's variance to 1.
  3. Apply Filter Criteria: Use thresholding based on chosen metrics.
  4. Evaluate Reduced Feature Set: Optionally use wrapper or embedded methods afterward.
  5. Train Model: Proceed with model training using the filtered feature set.
Python
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load sample data (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target

# Remove low-variance features on the raw data; standardizing first would
# set every variance to 1 and make the threshold meaningless
selector_var = VarianceThreshold(threshold=0.1)
X_var = selector_var.fit_transform(X)

# Standardize the remaining features
scaler = StandardScaler()
X_sca = scaler.fit_transform(X_var)

# ANOVA F-test for classification; k must not exceed the number of
# remaining features (iris has only 4), so keep the top 2
sele_f = SelectKBest(score_func=f_classif, k=2)
X_sele = sele_f.fit_transform(X_sca, y)
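
When the filtered features feed a model that is evaluated with cross-validation, the same steps can be wrapped in a scikit-learn `Pipeline` so that each selector is fitted only on the training folds. The sketch below is one possible arrangement. The choice of `LogisticRegression`, `k=2`, and 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    # Variance filter runs before scaling, since standardized features
    # all have unit variance
    ("variance", VarianceThreshold(threshold=0.1)),
    ("scale", StandardScaler()),
    # Keep the 2 features with the highest ANOVA F-scores
    ("anova", SelectKBest(score_func=f_classif, k=2)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every step is re-fitted inside each training fold, so the held-out fold
# never influences which features are selected
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Selecting features on the full dataset before splitting would let information from the test folds leak into the selection step, which tends to make the reported scores optimistic.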

Advantages of Filter Methods

  • Simplicity and Speed: They are straightforward to implement and very fast to compute, especially compared to wrapper or embedded methods.
  • Model Independence: Since they don't depend on any learning algorithm, you can use them before trying out different models.
  • Scalability: Filter methods are ideal for high-dimensional datasets where computational cost is a concern.
  • Good for Initial Feature Reduction: They can be used as a first-pass to eliminate obviously irrelevant features before applying more complex techniques.

Limitations of Filter Methods

  • Ignores Feature Interaction: Most filter methods consider one feature at a time (univariate). They do not capture interactions between multiple features.
  • Threshold Sensitivity: Performance can depend heavily on manually chosen thresholds (e.g., correlation > 0.1).
  • Risk of Losing Useful Features: Some features may not appear important on their own but could be valuable in combination with others.
  • No Feedback from Model: Since the model is not involved in selection, you might miss features that could have performed well in practice.
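
The interaction limitation can be demonstrated with a small synthetic XOR example (an assumption chosen for illustration): the target is fully determined by two binary features together, yet each feature on its own carries almost no information about it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic XOR data (assumption): y depends on x1 and x2 only jointly
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = x1 ^ x2

# Univariate scores: each feature is evaluated on its own
X = np.column_stack([x1, x2])
single_scores = mutual_info_classif(
    X, y, discrete_features=True, random_state=0
)

# Encode the pair (x1, x2) as one categorical feature with 4 levels
joint = (2 * x1 + x2).reshape(-1, 1)
joint_score = mutual_info_classif(
    joint, y, discrete_features=True, random_state=0
)

print(single_scores)  # near 0 for each feature alone
print(joint_score)    # near ln(2), about 0.69, for the pair
```

A univariate filter would rank both features as irrelevant here, which is the situation where a wrapper or embedded method is worth the extra cost.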

When to Use Filter Methods?

Use filter methods:

  • As a preprocessing step before applying wrapper or embedded methods.
  • When dealing with very high-dimensional data, such as gene expression profiles or text classification features.
  • For quick feature elimination in exploratory data analysis.