Non - Parametric Methods in Statistics

Non-parametric methods in statistics are techniques that do not assume a specific probability distribution for the data. Unlike parametric methods, which rely on fixed parameters (e.g., mean, variance), non-parametric methods are more flexible and useful when dealing with unknown or complex distributions. These methods are widely applied in hypothesis testing, regression, density estimation and classification.

Common Non-Parametric Statistical Tests

Wilcoxon Rank-Sum Test (Mann-Whitney U Test)

Used to compare two independent groups when normality assumptions do not hold.

U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1

where:

U is the Mann-Whitney statistic,
n_1,n₂are the sample sizes,
R₁is the sum of ranks for group 1.

Python

from scipy.stats import mannwhitneyu
x = [3, 5, 7, 9]
y = [2, 4, 6, 8]
stat, p = mannwhitneyu(x, y)
print("Mann-Whitney U test statistic:", stat, "p-value:", p)

Output

Mann-Whitney U test statistic: 10.0 p-value: 0.6857142857142857

Kruskal-Wallis Test

A non-parametric alternative to ANOVA for comparing more than two groups.

H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)

where:

H is the Kruskal-Wallis statistic,
R_iis the rank sum for group i,
n_iis the sample size of group i,
N is the total sample size.

Python

from scipy.stats import kruskal
stat, p = kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9])
print("Kruskal-Wallis test statistic:", stat, "p-value:", p)

Output

Kruskal-Wallis test statistic: 7.200000000000003 p-value: 0.02732372244729252

Non-Parametric Regression

1. Kernel Density Estimation (KDE)

KDE is a technique to estimate the probability density function (PDF) of a dataset.

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x - x_i}{h} \right)

where:

K(.) is the kernel function (e.g., Gaussian kernel),
h is the bandwidth parameter,
x_iare sample points.

Python

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.randn(100)
sns.kdeplot(data, bw_adjust=0.5)
plt.show()

Output

2. k-Nearest Neighbors (k-NN) Regression

k-NN is a simple, non-parametric regression method that predicts the target variable based on the mean (or median) of the nearest k neighbors.

\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i

where y_iare the values of the k nearest neighbors.

Implementation of K-Nearest Neighbors Regression

Python

from sklearn.neighbors import KNeighborsRegressor
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
print(knn.predict([[3.5]]))

Output

[7.]

3. Bootstrap Methods

Bootstrap methods are resampling techniques used to estimate the sampling distribution of a statistic.

Algorithm:

Randomly sample with replacement from the original dataset.
Compute the statistic of interest (e.g., mean, median) on the resampled dataset.
Repeat this process many times (e.g., 1000 iterations).
Use the empirical distribution of the computed statistic for inference.

Python

from sklearn.utils import resample
import numpy as np

sample = np.array([3, 5, 7, 9, 11])
bootstrap_samples = [resample(sample, replace=True, n_samples=len(sample)) for _ in range(1000)]
bootstrap_means = [np.mean(s) for s in bootstrap_samples]
print("Bootstrap Mean Estimate:", np.mean(bootstrap_means))

Output

Bootstrap Mean Estimate: 6.9883999999999995

Advantages

No need for strict assumptions about data distribution.
More flexible in handling real-world data.
Useful for small datasets where parametric assumptions fail.

Disadvantages

Less efficient for large datasets compared to parametric methods.
Higher computational cost due to resampling or rank calculations.
May require larger sample sizes to achieve reliable results.

Non - Parametric Methods in Statistics

Common Non-Parametric Statistical Tests

Wilcoxon Rank-Sum Test (Mann-Whitney U Test)

Kruskal-Wallis Test

Non-Parametric Regression

1. Kernel Density Estimation (KDE)

2. k-Nearest Neighbors (k-NN) Regression

3. Bootstrap Methods

Algorithm:

Advantages

Disadvantages

Explore