Quantile-Quantile Plots

The Quantile-Quantile (Q–Q) plot is a graphical method used to determine whether a dataset follows a specific probability distribution or whether two datasets come from the same population. It is particularly useful for assessing whether data is normally distributed or follows another known distribution.

Plots sample quantiles against theoretical quantiles (or another dataset’s quantiles)
A roughly straight line indicates a good match between distributions
Deviations from the line reveal skewness, heavy or light tails or outliers
Widely used to check distributional assumptions in statistical modeling
Useful in data analysis, hypothesis testing and quality control

Quantiles And Percentiles

Quantiles are points in a dataset that divide the data into intervals containing equal probabilities or proportions of the total distribution. They are often used to describe the spread or distribution of a dataset. The most common quantiles are:

Median (50th percentile): The median is the middle value of a dataset when it is ordered from smallest to largest. It divides the dataset into two equal halves.
Quartiles (25th, 50th and 75th percentiles): Quartiles divide the dataset into four equal parts. The first quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the median and the third quartile (Q3) is the value below which 75% of the data falls.
Percentiles: Percentiles are similar to quartiles but divide the dataset into 100 equal parts. For example, the 90th percentile is the value below which 90% of the data falls.
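
As a small illustration of these definitions, the sketch below computes the median, the quartiles and the 90th percentile with NumPy. The nine-value dataset is made up for the example, and NumPy's default linear interpolation between data points is only one of several conventions, so other tools may give slightly different values.

```python
import numpy as np

# Hypothetical example data: the integers 1 to 9
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Median: the middle value of the ordered data (50th percentile)
median = np.median(data)

# Quartiles: values below which 25%, 50% and 75% of the data fall
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# 90th percentile: value below which 90% of the data falls
# (interpolated between the two nearest data points)
p90 = np.percentile(data, 90)

print("median:", median)
print("quartiles:", q1, q2, q3)
print("90th percentile:", p90)
```

Note that the second quartile and the median are the same number, as described above.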

Note:

A Q-Q plot is a plot of the quantiles of the first dataset against the quantiles of the second dataset.
For reference, a 45-degree line is plotted. If the samples come from the same distribution, the points will lie approximately along this line.

How to Draw Q-Q plot

To draw a Quantile-Quantile (Q-Q) plot, you can follow these steps:

1. Collect the Data: Gather the dataset for which you want to create the Q-Q plot. Ensure that the data are numerical and represent a random sample from the population of interest.

2. Sort the Data: Arrange the data in ascending order, so that the i-th sorted value serves as the i-th sample quantile. This step is essential for pairing each observation with the correct theoretical quantile.

3. Choose a Theoretical Distribution: Determine the theoretical distribution against which you want to compare your dataset. Common choices include the normal distribution, the exponential distribution or any other distribution you hypothesize the data may follow.

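4. Calculate Theoretical Quantiles: Compute the quantiles of the chosen theoretical distribution at the probability levels that correspond to the sorted data points, for example (i - 0.5)/n for the i-th smallest of n values. If you're comparing against a normal distribution, you would apply the inverse of its cumulative distribution function (CDF), also called the quantile function, to those probabilities to find the expected quantiles.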

5. Plotting:

Plot the theoretical quantiles on the x-axis.
Plot the corresponding sorted dataset values on the y-axis.
Each data point (x, y) represents a pair of expected and observed values.
Add a straight reference line, rather than connecting the points, to visually inspect the relationship between the dataset and the theoretical distribution.
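
The steps above can be sketched by hand, without a dedicated Q-Q plotting function. This is a minimal sketch under a few assumptions: the helper names `normal_qq_points` and `plot_qq` are hypothetical, the probability levels (i - 0.5)/n are one common convention among several, and theoretical quantiles are placed on the x-axis to match the axis labels used in the Python implementation section of this article.

```python
import numpy as np
from scipy import stats

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    # Step 2: sort the data in ascending order
    observed = np.sort(np.asarray(sample, dtype=float))
    n = observed.size
    # Probability level for the i-th smallest value (assumed convention)
    probs = (np.arange(1, n + 1) - 0.5) / n
    # Step 4: inverse CDF of the standard normal at those probabilities
    theoretical = stats.norm.ppf(probs)
    return theoretical, observed

def plot_qq(theoretical, observed):
    """Step 5: scatter the pairs and add a least-squares reference line."""
    import matplotlib.pyplot as plt
    slope, intercept = np.polyfit(theoretical, observed, 1)
    plt.scatter(theoretical, observed, s=10)
    plt.plot(theoretical, slope * theoretical + intercept, color="red")
    plt.xlabel("Theoretical quantiles")
    plt.ylabel("Sample quantiles")
    plt.show()

# Step 1: example data (synthetic, normal with mean 5 and standard deviation 2)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

theoretical, observed = normal_qq_points(sample)

# For normal data, the fitted line's slope and intercept roughly
# recover the sample's standard deviation and mean
slope, intercept = np.polyfit(theoretical, observed, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Calling `plot_qq(theoretical, observed)` draws the figure. Because the comparison is against the standard normal, the points fall along a line whose slope and intercept approximate the data's scale and location rather than along the 45-degree line.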

Interpretation of Q-Q plot

If the points on the plot fall approximately along a straight line, it suggests that your dataset follows the assumed distribution.
Deviations from the straight line indicate departures from the assumed distribution, requiring further investigation.
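
One rough way to put a number on "approximately along a straight line" is the correlation coefficient that `scipy.stats.probplot` reports for its fitted line. The sketch below uses synthetic samples and is only an illustration: there is no universal cutoff for this value, and it should supplement, not replace, a look at the plot itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=1000)        # should look straight
skewed_sample = rng.exponential(size=1000)   # should bend away from the line

# Without a `plot` argument, probplot only computes the plot's points and a
# least-squares line: ((theoretical, ordered), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the normal sample:      {r_normal:.4f}")
print(f"r for the exponential sample: {r_skewed:.4f}")
```

A value of r very close to 1 indicates that the points hug the fitted line, while a noticeably lower value signals curvature in the plot.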

Exploring Distribution Similarity with Q-Q Plots

Q-Q plots visually assess whether two datasets follow the same distribution by comparing their quantiles.

Comparing datasets helps determine if they can be merged to improve estimation of parameters like location and scale.
A Q-Q plot is constructed by plotting quantiles of one dataset against corresponding quantiles of another.
Points lying close to a diagonal line indicate that the two distributions are similar.
Deviations from the diagonal suggest differences in shape, spread or tail behavior.
Tests such as chi-square and Kolmogorov–Smirnov assess overall distribution differences statistically.
These tests may not reveal where specific distributional differences occur.
Q-Q plots provide a detailed visual comparison by examining quantile-by-quantile differences.
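
A two-sample comparison along these lines can be sketched as follows. The helper `two_sample_qq` is hypothetical, and evaluating both samples at 99 evenly spaced probability levels is an assumption made so that datasets of different sizes can be paired; the synthetic samples are only for illustration.

```python
import numpy as np
from scipy import stats

def two_sample_qq(a, b, num=99):
    """Quantiles of two samples at the same probability levels (1% to 99%)."""
    probs = np.linspace(0.01, 0.99, num)
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 400)
y = rng.normal(0, 1, 700)   # same distribution as x, different sample size
z = rng.normal(0, 3, 700)   # same location as x, three times the spread

qx, qy = two_sample_qq(x, y)
_, qz = two_sample_qq(x, z)

# Slope of the quantile pairs: near 1 when the distributions match,
# near the ratio of the spreads when only the scale differs
slope_same = np.polyfit(qx, qy, 1)[0]
slope_wide = np.polyfit(qx, qz, 1)[0]
print(f"x vs y slope: {slope_same:.2f}")
print(f"x vs z slope: {slope_wide:.2f}")

# The Kolmogorov-Smirnov test flags a difference between x and z,
# but does not say where in the distribution it occurs
ks = stats.ks_2samp(x, z)
print(f"KS p-value for x vs z: {ks.pvalue:.3g}")
```

Plotting `qx` against `qy` (or `qz`) with a 45-degree line gives the Q-Q plot itself. The steeper slope for `z` shows that the difference is one of spread, which the test's single p-value cannot convey.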

Python Implementation Of Q-Q Plot

Here we create a Q-Q plot to compare a sample dataset with a theoretical normal distribution. It helps visually assess whether the data follows a normal distribution.

Python

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate example data
np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=1000)

# Create Q-Q plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('Normal Q-Q plot')
plt.xlabel('Theoretical quantiles')
plt.ylabel('Ordered Values')
plt.grid(True)
plt.show()

Output:

Because the data points approximately follow a straight line in the Q-Q plot, the dataset is consistent with the assumed theoretical distribution, which in this case is the normal distribution.

Types of Q-Q plots

Q-Q plots against a normal reference take on several characteristic shapes, each pointing to a different kind of distribution. With theoretical quantiles on the x-axis and sample quantiles on the y-axis, as in the Python example above, the common patterns are:

Normal Distribution: A symmetric distribution where the Q-Q plot shows points approximately along a diagonal line if the data adheres to a normal distribution.
Right-skewed Distribution: A distribution where the points curve upward away from the straight line, most visibly at the upper end, indicating a longer tail on the right side.
Left-skewed Distribution: A distribution where the points curve downward away from the straight line, most visibly at the lower end, indicating a longer tail on the left side.
Under-dispersed Distribution: A distribution with lighter tails than the theoretical one. The points form an S-shape that falls below the line at the upper end and above it at the lower end, because the extreme observed quantiles are less extreme than expected.
Over-dispersed Distribution: A distribution with heavier tails than the theoretical one. The points form the opposite S-shape, rising above the line at the upper end and dropping below it at the lower end, because the extreme observed quantiles are more extreme than expected.
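
How these patterns arise can be checked numerically. The sketch below standardizes a few synthetic samples and measures how far their 1% and 99% quantiles sit from those of the standard normal, which corresponds to the gap between the plotted points and the 45-degree line at the two ends (with theoretical quantiles on the x-axis). The helper `tail_gaps`, the choice of the 1% and 99% levels, and the example distributions are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def tail_gaps(sample, p=0.01):
    """Gaps between standardized sample quantiles and normal quantiles.

    A positive gap means the point lies above the 45-degree line at that
    end of the plot; a negative gap means it lies below.
    """
    z = (sample - np.mean(sample)) / np.std(sample, ddof=1)
    lower = np.quantile(z, p) - stats.norm.ppf(p)
    upper = np.quantile(z, 1 - p) - stats.norm.ppf(1 - p)
    return lower, upper

rng = np.random.default_rng(3)
n = 20_000
samples = {
    "right-skewed (exponential)": rng.exponential(size=n),
    "left-skewed (negated exponential)": -rng.exponential(size=n),
    "heavy-tailed (Laplace)": rng.laplace(size=n),
    "light-tailed (uniform)": rng.uniform(size=n),
}

gaps = {name: tail_gaps(s) for name, s in samples.items()}
for name, (lower, upper) in gaps.items():
    print(f"{name:36s} lower gap {lower:+.2f}, upper gap {upper:+.2f}")
```

The signs of the two gaps distinguish the cases: both positive for right skew, both negative for left skew, negative then positive for heavy tails, and positive then negative for light tails.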