Non - Parametric Methods in Statistics

Last Updated : 23 Jul, 2025

Non-parametric methods in statistics are techniques that do not assume a specific probability distribution for the data. Unlike parametric methods, which rely on fixed parameters (e.g., mean, variance), non-parametric methods are more flexible and useful when dealing with unknown or complex distributions. These methods are widely applied in hypothesis testing, regression, density estimation and classification.

Common Non-Parametric Statistical Tests

Wilcoxon Rank-Sum Test (Mann-Whitney U Test)

Used to compare two independent groups when normality assumptions do not hold.

U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1

where:

  • U is the Mann-Whitney statistic,
  • n1, n2 are the sample sizes,
  • R1 is the sum of ranks for group 1.
Python
from scipy.stats import mannwhitneyu
x = [3, 5, 7, 9]
y = [2, 4, 6, 8]
stat, p = mannwhitneyu(x, y)
print("Mann-Whitney U test statistic:", stat, "p-value:", p)

Output

Mann-Whitney U test statistic: 10.0 p-value: 0.6857142857142857

Kruskal-Wallis Test

A non-parametric alternative to ANOVA for comparing more than two groups.

H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)

where:

  • H is the Kruskal-Wallis statistic,
  • Ri is the rank sum for group i,
  • ni is the sample size of group i,
  • N is the total sample size.
Python
from scipy.stats import kruskal
stat, p = kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9])
print("Kruskal-Wallis test statistic:", stat, "p-value:", p)

Output

Kruskal-Wallis test statistic: 7.200000000000003 p-value: 0.02732372244729252

Non-Parametric Regression

1. Kernel Density Estimation (KDE)

KDE is a technique to estimate the probability density function (PDF) of a dataset.

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)

where:

  • K(.) is the kernel function (e.g., Gaussian kernel),
  • h is the bandwidth parameter,
  • xi are sample points.
Python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.randn(100)
sns.kdeplot(data, bw_adjust=0.5)
plt.show()

Output

Density

2. k-Nearest Neighbors (k-NN) Regression

k-NN is a simple, non-parametric regression method that predicts the target variable based on the mean (or median) of the nearest k neighbors.

\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i

where yi are the values of the k nearest neighbors.

Implementation of K-Nearest Neighbors Regression

Python
from sklearn.neighbors import KNeighborsRegressor
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
print(knn.predict([[3.5]]))

Output

[7.]

3. Bootstrap Methods

Bootstrap methods are resampling techniques used to estimate the sampling distribution of a statistic.

Algorithm:

  • Randomly sample with replacement from the original dataset.
  • Compute the statistic of interest (e.g., mean, median) on the resampled dataset.
  • Repeat this process many times (e.g., 1000 iterations).
  • Use the empirical distribution of the computed statistic for inference.
Python
from sklearn.utils import resample
import numpy as np

sample = np.array([3, 5, 7, 9, 11])
bootstrap_samples = [resample(sample, replace=True, n_samples=len(sample)) for _ in range(1000)]
bootstrap_means = [np.mean(s) for s in bootstrap_samples]
print("Bootstrap Mean Estimate:", np.mean(bootstrap_means))

Output

Bootstrap Mean Estimate: 6.9883999999999995

Advantages

  • No need for strict assumptions about data distribution.
  • More flexible in handling real-world data.
  • Useful for small datasets where parametric assumptions fail.

Disadvantages

  • Less efficient for large datasets compared to parametric methods.
  • Higher computational cost due to resampling or rank calculations.
  • May require larger sample sizes to achieve reliable results.
Comment