Yeo-Johnson Transform

The Yeo-Johnson transform is a power transformation used to stabilize variance and make data more closely resemble a normal distribution. It extends the Box-Cox transform and, unlike Box-Cox, accepts zero and negative values as well as positive ones. This makes it useful in data preprocessing for machine learning methods that tend to work better with roughly symmetric, approximately normal data, such as linear regression or principal component analysis (PCA).

The logarithm is defined only for positive data, whereas the Yeo-Johnson transform can be applied to any real-valued dataset. By reducing skewness, it can improve the performance of algorithms that work better when features are roughly symmetric.

Definition

The Yeo-Johnson transformation is defined piecewise depending on whether the input value y is non-negative or negative, and it is parameterized by a transformation parameter \lambda. The transform T(y; \lambda) is given by:

For y \geq 0:

T(y; \lambda) = \begin{cases}\frac{(y + 1)^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\\log(y + 1) & \text{if } \lambda = 0\end{cases}

For y < 0:

T(y; \lambda) = \begin{cases}-\frac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \ne 2 \\-\log(-y + 1) & \text{if } \lambda = 2\end{cases}
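The piecewise definition above can be sketched directly in code. This is a minimal scalar version using only the standard library; the function name `yeo_johnson` and the sample values are illustrative assumptions, not part of any library.

```python
import math

def yeo_johnson(y: float, lam: float) -> float:
    """Apply the Yeo-Johnson transform T(y; lam) to a single real value."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)                      # log(y + 1)
        return ((y + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-y)                        # -log(-y + 1)
    return -((-y + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)

# Evaluate a few sample points for several values of lambda
for lam in (0.0, 0.5, 1.0, 2.0):
    values = [round(yeo_johnson(y, lam), 4) for y in (-3.0, -1.0, 0.0, 1.0, 3.0)]
    print(f"lambda = {lam}: {values}")
```

Each branch is selected by the sign of the input, and the special cases at 0 and 2 avoid division by zero.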

Properties

Handles Zero and Negative Values: Unlike the Box-Cox transform, the Yeo-Johnson transform works for all real-valued inputs.
Continuity and Smoothness: The function is continuous and differentiable with respect to both 𝑦 and 𝜆.
Reversible: The transformation is invertible, allowing you to recover the original data.
Parameter Optimization: The optimal 𝜆 can be estimated from data using maximum likelihood or other methods.
Normalizing Effect: Helps in transforming skewed data into approximately Gaussian distributions.
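To illustrate the reversibility property listed above, the sketch below inverts each branch algebraically and checks the round trip against SciPy's forward transform. The helper `inverse_yeo_johnson` is a hypothetical name written for this example; it assumes the input values were produced by the forward transform with the same 𝜆.

```python
import numpy as np
from scipy import stats

def inverse_yeo_johnson(x, lam):
    """Invert the Yeo-Johnson transform for transformed values x and parameter lam."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    pos = x >= 0  # the transform preserves sign, so x >= 0 corresponds to y >= 0
    if abs(lam) < 1e-12:
        y[pos] = np.expm1(x[pos])
    else:
        y[pos] = np.power(lam * x[pos] + 1.0, 1.0 / lam) - 1.0
    if abs(lam - 2.0) < 1e-12:
        y[~pos] = -np.expm1(-x[~pos])
    else:
        y[~pos] = 1.0 - np.power(1.0 - (2.0 - lam) * x[~pos], 1.0 / (2.0 - lam))
    return y

original = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
round_trip_ok = all(
    np.allclose(inverse_yeo_johnson(stats.yeojohnson(original, lmbda=lam), lam), original)
    for lam in (-1.0, 0.0, 0.5, 2.0, 3.0)
)
print("round trip recovers the original data:", round_trip_ok)
```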

Visualization

Figure: Yeo-Johnson transformation for different values of 𝜆

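The plot of the transformation for various values of 𝜆 shows how the shape of the curve changes with the parameter. For instance:

𝜆 = 1: Identity transformation; the data are left unchanged.
𝜆 = 0: Logarithmic transformation, log(y + 1), for non-negative values.
𝜆 = 2: Logarithmic compression, -log(-y + 1), for negative values; non-negative values are stretched quadratically.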

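Visualizing the transformation helps in understanding its effect on skew: values of 𝜆 below 1 give a concave curve that compresses the right tail and reduces right skew, while values above 1 give a convex curve that compresses the left tail and reduces left skew.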
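A plot of this kind could be reproduced roughly as follows. This is a sketch that assumes Matplotlib and SciPy are installed; the grid range, the set of 𝜆 values, and the output file name are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

y = np.linspace(-3.0, 3.0, 301)
lambdas = (0.0, 0.5, 1.0, 1.5, 2.0)

# Evaluate the transform on the same grid for each fixed lambda
curves = {lam: stats.yeojohnson(y, lmbda=lam) for lam in lambdas}

fig, ax = plt.subplots(figsize=(6, 4))
for lam, t in curves.items():
    ax.plot(y, t, label=f"lambda = {lam}")
ax.axhline(0.0, color="gray", linewidth=0.5)
ax.axvline(0.0, color="gray", linewidth=0.5)
ax.set_xlabel("y")
ax.set_ylabel("T(y; lambda)")
ax.set_title("Yeo-Johnson transformation for different lambda")
ax.legend()
fig.savefig("yeo_johnson_curves.png", dpi=150)  # hypothetical output file name
```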

Estimating λ

The optimal value of λ is typically chosen to maximize the log-likelihood of the transformed data under the assumption of normality. This involves solving:

\lambda^* = \arg\max_\lambda \, \log L(T(y; \lambda))

Where:

\lambda^*: The value of the parameter 𝜆 that maximizes the log-likelihood function after transformation.
\arg\max_\lambda: The value of 𝜆 that yields the maximum of the given expression.
\log L(\cdot): The natural logarithm of the likelihood function under a normal model for the transformed data, including the Jacobian term of the transformation.
T(y; \lambda): A transformation of the data y that depends on the parameter 𝜆.
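For reference, the objective is commonly written in profile form, with the mean and variance of the normal model replaced by their estimates. The expression below follows that standard formulation, up to an additive constant; the symbols n, \bar{T}_\lambda and \hat{\sigma}^2_\lambda are notation introduced here and do not appear elsewhere in this article.

```latex
\ell(\lambda) = -\frac{n}{2} \log \hat{\sigma}^2_\lambda
              + (\lambda - 1) \sum_{i=1}^{n} \operatorname{sgn}(y_i) \log(|y_i| + 1),
\qquad
\hat{\sigma}^2_\lambda = \frac{1}{n} \sum_{i=1}^{n} \left( T(y_i; \lambda) - \bar{T}_\lambda \right)^2
```

The second term is the log-Jacobian of the transformation: the derivative of T(y; \lambda) with respect to y is (y + 1)^{\lambda - 1} for y \geq 0 and (-y + 1)^{1 - \lambda} for y < 0.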

Scikit-learn (PowerTransformer), SciPy (scipy.stats.yeojohnson), and several R packages provide functions that estimate 𝜆 in this way.
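As an illustration with SciPy, the sketch below estimates 𝜆 on a synthetic right-skewed sample and compares the log-likelihood and skewness before and after. The gamma-distributed data and the seed are assumptions made for the example only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample, shifted so that it contains negative values
y = rng.gamma(shape=2.0, size=500) - 1.0

lam_hat = stats.yeojohnson_normmax(y)        # maximum-likelihood estimate of lambda
y_t = stats.yeojohnson(y, lmbda=lam_hat)     # transform with the fitted lambda

llf_identity = stats.yeojohnson_llf(1.0, y)  # lambda = 1 leaves the data unchanged
llf_fitted = stats.yeojohnson_llf(lam_hat, y)

print("estimated lambda:", lam_hat)
print("log-likelihood at lambda = 1:", llf_identity)
print("log-likelihood at fitted lambda:", llf_fitted)
print("skewness before and after:", stats.skew(y), stats.skew(y_t))
```

Calling `stats.yeojohnson(y)` without `lmbda` performs the estimation and the transformation in one step and returns both the transformed data and the fitted value.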

Use in Data Science Pipelines

The Yeo-Johnson transform is especially useful in preprocessing pipelines in machine learning workflows. It is commonly used for:

Stabilizing variance
Reducing skewness
Improving model accuracy for algorithms sensitive to feature distributions

It is often applied using scikit-learn’s PowerTransformer class:

Python

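import numpy as np
from sklearn.preprocessing import PowerTransformer

# Example feature matrix with negative, zero, and positive values
X = np.array([[-2.0], [0.0], [0.5], [3.0], [40.0]])
transformer = PowerTransformer(method='yeo-johnson')  # one λ is fitted per column
X_transformed = transformer.fit_transform(X)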
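In a full workflow the transformer is usually placed inside a pipeline so that 𝜆 is learned from the training split only and then reused on new data. The following is a sketch under assumed synthetic data; the feature construction and the choice of LinearRegression are illustrative, not prescribed by scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Synthetic data: two right-skewed features, shifted to include negative values
X = rng.gamma(shape=2.0, size=(300, 2)) - 1.0
y = np.log(X[:, 0] + 2.0) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Lambda is fitted on the training split and applied unchanged to the test split
model = make_pipeline(PowerTransformer(method="yeo-johnson"), LinearRegression())
model.fit(X_train, y_train)

pt = model.named_steps["powertransformer"]
print("fitted lambdas per feature:", pt.lambdas_)
print("test R^2:", model.score(X_test, y_test))

# The fitted transformer can map transformed values back to the original scale
X_back = pt.inverse_transform(pt.transform(X_test))
```

By default PowerTransformer also standardizes its output to zero mean and unit variance (`standardize=True`), and `inverse_transform` undoes both steps.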

Comparison with Other Transforms

Transform	Handles Zero and Negative Values	Non-linear	Tunable λ	Common Use
Log Transform	No	Yes	No	Right-skewed positive data
Box-Cox	No	Yes	Yes	Normalizing strictly positive data
Yeo-Johnson	Yes	Yes	Yes	Normalizing data of any sign
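The difference in the "handles negatives" column can be seen on a small sample that mixes signs. This is a sketch using NumPy and SciPy; the data values are made up for the example.

```python
import numpy as np
from scipy import stats

data = np.array([-2.0, -0.5, 0.0, 1.0, 4.0, 20.0])

# The logarithm is undefined for non-positive values (NaN or -inf under NumPy)
with np.errstate(invalid="ignore", divide="ignore"):
    log_values = np.log(data)
print("log:", log_values)

# Box-Cox requires strictly positive input and raises an error otherwise
try:
    stats.boxcox(data)
    boxcox_ok = True
except ValueError as exc:
    boxcox_ok = False
    print("Box-Cox raised ValueError:", exc)

# Yeo-Johnson accepts the full real line and also returns the fitted lambda
yj_values, yj_lambda = stats.yeojohnson(data)
print("Yeo-Johnson:", yj_values, "lambda =", yj_lambda)
```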

Mathematical Derivation

The Yeo-Johnson transform can be viewed as two Box-Cox-style transformations joined at 𝑦 = 0:

Case 1: 𝑦 ≥ 0

For non-negative inputs, we apply the Box-Cox transform with parameter 𝜆 to 𝑦 + 1. The shift by 1 ensures that zero is handled safely.

Case 2: 𝑦 < 0

For negative inputs, the function reflects the value (via −𝑦 + 1), applies the Box-Cox transform with parameter 2 − 𝜆, and negates the result. This mirrors the non-negative case while avoiding the logarithm or a fractional power of a negative number.

Both branches equal 0 and have slope 1 at 𝑦 = 0, so the two pieces join smoothly and a single 𝜆 can be optimized for the whole dataset.
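The two cases can be checked numerically against a Box-Cox implementation. The sketch below uses `scipy.special.boxcox` for the Box-Cox formula; the value of 𝜆 and the sample points are arbitrary choices for illustration.

```python
import numpy as np
from scipy import special, stats

lam = 0.3  # arbitrary illustrative value
y_nonneg = np.array([0.0, 0.5, 2.0, 10.0])
y_neg = np.array([-10.0, -2.0, -0.5])

# Case 1: for y >= 0, Yeo-Johnson equals Box-Cox with parameter lam applied to y + 1
case1_matches = np.allclose(
    stats.yeojohnson(y_nonneg, lmbda=lam),
    special.boxcox(y_nonneg + 1.0, lam),
)

# Case 2: for y < 0, Yeo-Johnson equals minus Box-Cox with parameter 2 - lam applied to -y + 1
case2_matches = np.allclose(
    stats.yeojohnson(y_neg, lmbda=lam),
    -special.boxcox(-y_neg + 1.0, 2.0 - lam),
)

print("case 1 matches:", case1_matches)
print("case 2 matches:", case2_matches)
```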

Applications

1. Regression Modeling

Many regression models (like linear regression or Lasso) are commonly analyzed under the assumption of normally distributed residuals.
Applying Yeo-Johnson to the response variable, and to skewed features where helpful, can bring the residuals closer to this assumption.

2. Principal Component Analysis (PCA)

PCA is sensitive to the scale and distribution of data.
Transforming data with Yeo-Johnson, followed by standardization, can improve the effectiveness of PCA by reducing the influence of heavily skewed features on the principal components.

3. Robust Classification

Algorithms like k-NN or SVMs may perform better when features are closer to a normal distribution.
Yeo-Johnson preprocessing reduces the risk of poor classification boundaries due to skew.

4. Time Series Forecasting

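Stabilizing the variance of a time series helps with the constant-variance part of stationarity, which benefits models like ARIMA. Trends and seasonality still need to be handled separately, for example by differencing.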
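For the regression use case above, one way to transform the response variable is scikit-learn's TransformedTargetRegressor, which applies the transform before fitting and inverts it at prediction time. This is a minimal sketch on assumed synthetic data; the data-generating formula is made up for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 1))
# Synthetic response: squaring a noisy linear signal gives a right-skewed target
y = (3.0 + X[:, 0] + rng.normal(scale=0.3, size=400)) ** 2 - 5.0

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=PowerTransformer(method="yeo-johnson"),
)
model.fit(X, y)

# Predictions are mapped back to the original scale of y automatically
y_pred = model.predict(X)
print("lambda fitted for the response:", model.transformer_.lambdas_)
```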

Limitations

Monotonic Transformation Only: The transform preserves the ordering of values, so it may not help when a non-monotonic transformation is needed.
Unstable Estimation: The estimate of λ can be unstable for small sample sizes or heavy-tailed data.
Computational Cost: Estimating the best λ requires numerical optimization, which is more expensive than applying a fixed transformation.
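The sensitivity of the estimate to sample size can be explored with a small simulation. The sketch below repeatedly draws synthetic samples and measures how much the fitted λ varies; the gamma distribution, the sample sizes, and the number of repeats are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def lambda_spread(n, repeats=100):
    """Standard deviation of the fitted lambda over repeated synthetic samples of size n."""
    estimates = [
        stats.yeojohnson_normmax(rng.gamma(shape=2.0, size=n) - 1.0)
        for _ in range(repeats)
    ]
    return float(np.std(estimates))

spread_small = lambda_spread(20)
spread_large = lambda_spread(2000)
print("std of lambda estimates, n = 20:", spread_small)
print("std of lambda estimates, n = 2000:", spread_large)
```

A larger spread for the small samples indicates that λ fitted on little data should be treated with caution, for example by checking it on a held-out split.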