The Yeo-Johnson transform is a power transformation used to stabilize variance and make data more closely resemble a normal distribution. It extends the Box-Cox transform: whereas Box-Cox requires strictly positive inputs, Yeo-Johnson also accepts zero and negative values. This makes it useful in data preprocessing for machine learning methods that tend to work better with roughly symmetric, approximately normal data, such as linear regression or principal component analysis (PCA).
Traditional transformations like logarithms only work for positive data, but the Yeo-Johnson transform can be applied to a wider range of datasets, which can improve model performance by reducing the impact of skewed data. Essentially, it helps normalize data for algorithms that perform better when features are close to symmetric.
Definition
The Yeo-Johnson transformation is defined piecewise depending on whether the input value y is non-negative or negative.
For y \geq 0 :
T(y; \lambda) = \begin{cases}\frac{(y + 1)^\lambda - 1}{\lambda} & \text{if } \lambda \ne 0 \\\log(y + 1) & \text{if } \lambda = 0\end{cases}
For y < 0 :
T(y; \lambda) = \begin{cases}-\frac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \ne 2 \\-\log(-y + 1) & \text{if } \lambda = 2\end{cases}
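The piecewise definition above translates fairly directly into NumPy. The following is a minimal sketch rather than a library implementation: the function name `yeo_johnson` is made up for illustration, and using `np.isclose` to detect the two limiting cases (λ = 0 and λ = 2) is just one possible choice.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transform T(y; lam) elementwise (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0  # mask selecting the non-negative branch

    # Branch for y >= 0: power of (y + 1), or log(y + 1) when lam == 0.
    if np.isclose(lam, 0.0):
        out[pos] = np.log1p(y[pos])
    else:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam

    # Branch for y < 0: power of (-y + 1) with exponent 2 - lam, then negated.
    if np.isclose(lam, 2.0):
        out[~pos] = -np.log1p(-y[~pos])
    else:
        out[~pos] = -((-y[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
identity = yeo_johnson(values, 1.0)  # lam = 1 leaves the data unchanged
log_like = yeo_johnson(values, 0.0)  # log(y + 1) on the non-negative side
```

With λ = 1 both branches reduce to T(y) = y, which is a quick sanity check for any implementation.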
Properties
- Handles Zero and Negative Values: Unlike the Box-Cox transform, the Yeo-Johnson transform works for all real-valued inputs.
- Continuity and Smoothness: The function is continuous and differentiable with respect to both 𝑦 and 𝜆.
- Reversible: The transformation is invertible, allowing you to recover the original data.
- Parameter Optimization: The optimal 𝜆 can be estimated from data using maximum likelihood or other methods.
- Normalizing Effect: Helps in transforming skewed data into approximately Gaussian distributions.
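The reversibility property can be illustrated by solving each branch of the definition for 𝑦. The sketch below assumes a known 𝜆 and uses `scipy.stats.yeojohnson` for the forward direction; the helper `yeo_johnson_inverse` is a hypothetical name, not a SciPy function.

```python
import numpy as np
from scipy.stats import yeojohnson

def yeo_johnson_inverse(t, lam):
    """Invert the Yeo-Johnson transform for a known lam (illustrative sketch)."""
    t = np.asarray(t, dtype=float)
    y = np.empty_like(t)
    pos = t >= 0  # the transform preserves sign, so t >= 0 exactly when y >= 0

    # Inverse of the y >= 0 branch.
    if np.isclose(lam, 0.0):
        y[pos] = np.expm1(t[pos])
    else:
        y[pos] = (lam * t[pos] + 1.0) ** (1.0 / lam) - 1.0

    # Inverse of the y < 0 branch.
    if np.isclose(lam, 2.0):
        y[~pos] = -np.expm1(-t[~pos])
    else:
        y[~pos] = 1.0 - (1.0 - (2.0 - lam) * t[~pos]) ** (1.0 / (2.0 - lam))
    return y

original = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
lam = 0.7
transformed = yeojohnson(original, lmbda=lam)      # forward transform from SciPy
recovered = yeo_johnson_inverse(transformed, lam)  # should match `original`
```

One caveat: for 𝜆 < 0 or 𝜆 > 2 the transform has a bounded range on one side, so the inverse is only defined for values that a forward transform could actually have produced.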
Visualization

The plot of the transformation for various values of 𝜆 shows how the curve adjusts data with different distributions. For instance:
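- 𝜆 = 1: Identity transformation; the data are left unchanged.
- 𝜆 = 0: Log-like compression of large positive values, while negative values are stretched.
- 𝜆 = 2: The mirror image of 𝜆 = 0: large positive values are stretched quadratically, while negative values are compressed logarithmically.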
Visualizing the transformation helps in understanding the impact on data distribution, particularly the left or right skew.
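As a numerical stand-in for such a plot, the curves can be evaluated on a small grid. This sketch uses `scipy.stats.yeojohnson` with a fixed `lmbda`; the grid points are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import yeojohnson

grid = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # sample inputs on both sides of zero

# Evaluate the transform on the same grid for three values of lambda.
curves = {lam: yeojohnson(grid, lmbda=lam) for lam in (0.0, 1.0, 2.0)}

for lam, values in curves.items():
    print(f"lambda={lam}: {np.round(values, 3)}")
```

At 𝜆 = 2 the input 4 maps to 12 while −4 maps to about −1.609, and at 𝜆 = 0 the roles are reversed. Passing the same arrays to a plotting library would reproduce the kind of figure described here.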
Estimating λ
The optimal value of λ is typically chosen to maximize the log-likelihood of the transformed data under the assumption of normality. This involves solving:
\lambda^* = \arg\max_\lambda \, \log L(T(y; \lambda))
Where:
- \lambda^* : The value of the parameter 𝜆 that maximizes the log-likelihood function after transformation.
- \arg\max_\lambda : The value of 𝜆 that yields the maximum of the given expression.
- \log L(\cdot) : The natural logarithm of the likelihood function, representing model fit to the data.
- T(y; \lambda) : A transformation of the data 𝑦 that depends on the parameter 𝜆.
Scikit-learn, SciPy, and R provide built-in methods to compute this.
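To make the optimization concrete, the sketch below writes out the normal profile log-likelihood (with the mean and variance replaced by their maximum likelihood estimates, plus the log-Jacobian of the transform) and maximizes it numerically. The function name `yj_log_likelihood`, the synthetic gamma data, and the search interval (−3, 3) are assumptions made for illustration; the result is compared with SciPy's `yeojohnson_normmax`.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import yeojohnson, yeojohnson_normmax

def yj_log_likelihood(lam, y):
    """Profile log-likelihood of lam under a normal model, up to an additive constant."""
    t = yeojohnson(y, lmbda=lam)
    n = y.shape[0]
    # Normal log-likelihood with mean and variance replaced by their MLEs...
    fit_term = -0.5 * n * np.log(t.var())
    # ...plus the log-Jacobian of the transformation.
    jacobian_term = (lam - 1.0) * np.sum(np.sign(y) * np.log1p(np.abs(y)))
    return fit_term + jacobian_term

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=500) - 1.0  # right-skewed, some negatives

# Maximize the log-likelihood by minimizing its negative over a bounded interval.
result = minimize_scalar(lambda lam: -yj_log_likelihood(lam, y),
                         bounds=(-3.0, 3.0), method="bounded")
lam_manual = result.x
lam_scipy = yeojohnson_normmax(y)  # SciPy's built-in maximum likelihood estimate
```

The two estimates should agree closely. The Jacobian term matters: without it, the optimizer could shrink the variance of the transformed data simply by flattening the data.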
Use in Data Science Pipelines
The Yeo-Johnson transform is especially useful in preprocessing pipelines in machine learning workflows. It is commonly used for:
- Stabilizing variance
- Reducing skewness
- Improving model accuracy for algorithms sensitive to feature distributions
It is often applied using scikit-learn’s PowerTransformer class:
```python
from sklearn.preprocessing import PowerTransformer

# X is assumed to be a numeric array of shape (n_samples, n_features)
transformer = PowerTransformer(method='yeo-johnson')
X_transformed = transformer.fit_transform(X)
```
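A slightly fuller sketch of the same call inside a train/test workflow is shown below. The data are synthetic and made up for illustration; the point is that 𝜆 is estimated on the training split only and then reused on new data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Synthetic, made-up data: two right-skewed features, shifted to include negatives.
X = rng.lognormal(mean=0.0, sigma=0.8, size=(400, 2)) - 1.5

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# standardize=True (the default) also rescales the output to zero mean, unit variance.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
X_train_t = transformer.fit_transform(X_train)  # estimate lambda on training data only
X_test_t = transformer.transform(X_test)        # reuse the fitted lambdas on new data

fitted_lambdas = transformer.lambdas_                    # one estimated lambda per feature
X_train_back = transformer.inverse_transform(X_train_t)  # recover the original values
```

Fitting on the training split alone avoids leaking information from the test data into the estimated 𝜆 values.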
Comparison with Other Transforms
| Transform | Handles Negatives | Non-linear | Estimated λ | Common Use |
|---|---|---|---|---|
| Log Transform | No | Yes | No | Skewed Positives |
| Box-Cox | No | Yes | Yes | Normalization |
| Yeo-Johnson | Yes | Yes | Yes | Robust Normalization |
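The "Handles Negatives" column can be checked on a small mixed-sign sample. This is a sketch on made-up data: the logarithm yields NaN for negative inputs, `scipy.stats.boxcox` raises a `ValueError` when asked to estimate λ on data that are not strictly positive, and `scipy.stats.yeojohnson` returns finite values throughout.

```python
import numpy as np
from scipy.stats import boxcox, yeojohnson

rng = np.random.default_rng(3)
data = rng.lognormal(size=100) - 1.0  # made-up sample containing both signs

# Log transform: undefined for negative inputs (NaN).
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log(data)

# Box-Cox: SciPy rejects data that are not strictly positive when estimating lambda.
try:
    boxcox(data)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

# Yeo-Johnson: defined for every real input; lambda is estimated from the data.
yj_values, yj_lambda = yeojohnson(data)
```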
Mathematical Derivation
The Yeo-Johnson transform can be viewed as two Box-Cox-style transformations joined at zero, one for each sign of the input:
Case 1: 𝑦 ≥ 0
For non-negative inputs, we apply the Box-Cox transform to 𝑦 + 1. The shift ensures that zero is handled safely, since the argument of the power or logarithm is always at least 1.
Case 2: 𝑦 < 0
For negative inputs, the function reflects the value (via −𝑦 + 1), applies the Box-Cox transform with parameter 2 − 𝜆, and negates the result. This mirrors the behavior of the non-negative branch while avoiding the logarithm of negative numbers.
This design maintains differentiability and allows a unified optimization of 𝜆.
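Both cases, and the smooth join at zero, can be checked numerically. The sketch below relies on `scipy.special.boxcox` for the plain Box-Cox formula; the value 𝜆 = 0.4 and the test points are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import boxcox
from scipy.stats import yeojohnson

lam = 0.4
y_pos = np.array([0.0, 0.5, 2.0, 10.0])
y_neg = np.array([-10.0, -2.0, -0.5])

# Case 1: Yeo-Johnson on y >= 0 is Box-Cox applied to y + 1 with the same lambda.
case1_matches = np.allclose(yeojohnson(y_pos, lmbda=lam), boxcox(y_pos + 1.0, lam))

# Case 2: on y < 0 it is the negated Box-Cox of (-y + 1) with parameter 2 - lambda.
case2_matches = np.allclose(yeojohnson(y_neg, lmbda=lam),
                            -boxcox(-y_neg + 1.0, 2.0 - lam))

# Finite-difference slopes on either side of zero; both should be close to 1.
eps = 1e-6
slope_right = (yeojohnson(np.array([eps]), lmbda=lam)[0] - 0.0) / eps
slope_left = (0.0 - yeojohnson(np.array([-eps]), lmbda=lam)[0]) / eps
```

The derivative of each branch at 𝑦 = 0 equals 1 regardless of 𝜆, which is why the two pieces meet without a kink.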
Applications
1. Regression Modeling
- Many regression models (like linear regression or Lasso) assume normally distributed residuals.
- Applying Yeo-Johnson to features or the response variable helps meet this assumption.
2. Principal Component Analysis (PCA)
- PCA is sensitive to the scale and distribution of data.
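- Transforming data with Yeo-Johnson, typically followed by standardization, reduces skew and puts features on comparable scales, so that a few extreme values are less likely to dominate the principal components.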
3. Robust Classification
- Algorithms like k-NN or SVMs may perform better when features are closer to a normal distribution.
- Yeo-Johnson preprocessing reduces the risk of poor classification boundaries due to skew.
4. Time Series Forecasting
- Stabilizing variance in time series improves stationarity, which benefits models like ARIMA.
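For the regression use case, one possible setup is to transform the response variable and let scikit-learn map predictions back to the original scale. The sketch below uses `TransformedTargetRegressor` with synthetic data invented for illustration; it is one way to wire the pieces together, not a recommended recipe.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Made-up data: the target is a squared linear signal, so it is right-skewed
# and dips below zero for small signal values.
X = rng.normal(size=(300, 1))
signal = 3.0 + X[:, 0] + 0.3 * rng.normal(size=300)
y = signal ** 2 - 1.0

# Fit the regression on the Yeo-Johnson-transformed target; predictions are
# mapped back to the original scale automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=PowerTransformer(method="yeo-johnson"),
)
model.fit(X, y)
predictions = model.predict(X)                  # on the original scale of y
target_lambda = model.transformer_.lambdas_[0]  # lambda estimated for the target
```

The same `PowerTransformer` step can be placed in front of PCA, k-NN, or an SVM inside a scikit-learn `Pipeline` when the features, rather than the target, are skewed.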
Limitations
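- Monotonic, Single-Parameter Family: The transform preserves the ordering of values and mainly corrects skew, so it may not help when a non-monotonic transformation is needed or when the data are multimodal.
- Unstable Estimates: Estimation of λ can be unstable for small sample sizes or heavy-tailed data.
- Computational Cost: Estimating the best λ requires a numerical optimization, which is more expensive than applying a fixed transformation such as a logarithm.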
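The sensitivity of the estimate to sample size can be illustrated with a small simulation. This is a sketch on synthetic lognormal data, with the helper name `lambda_spread`, the sample sizes, and the number of repeats all chosen arbitrarily; it uses `scipy.stats.yeojohnson_normmax` to estimate λ on each sample.

```python
import numpy as np
from scipy.stats import yeojohnson_normmax

rng = np.random.default_rng(7)

def lambda_spread(n_samples, n_repeats=100):
    """Standard deviation of the estimated lambda across repeated synthetic samples."""
    estimates = [
        yeojohnson_normmax(rng.lognormal(sigma=0.5, size=n_samples) - 1.0)
        for _ in range(n_repeats)
    ]
    return float(np.std(estimates))

spread_small = lambda_spread(n_samples=20)    # few observations per estimate
spread_large = lambda_spread(n_samples=2000)  # many observations per estimate
```

Comparing `spread_small` with `spread_large` shows how much more the estimated λ varies from sample to sample when only a handful of observations are available.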