Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

The Normal Distribution for Data Scientists

7 min read · Oct 22, 2019

As a data scientist, you will have to deduce properties or make propositions about a population of interest based on sample data. In order to do this, you will probably use statistical inference to test hypotheses and derive estimates.

A first step in hypothesis testing is to explore your data and figure out which probability distribution it comes from. The normal distribution is widely used because many social and natural phenomena follow it, so it is usually the first distribution studied in the data science curriculum. I dedicate this blog post to the normal distribution, aiming to help data scientists who, like me, do not have a solid statistics background and need some guidance to understand the magic behind the normal distribution and what it has to do with hypothesis testing.

Table of Contents

· Normal Distribution Cornerstone: The Central Limit Theorem

· What is a Probability Distribution?

· The Normal Distribution and its PDF

· The Normal Distribution and the Z-Table

· How can you determine if your probability distribution is Normal?

Normal Distribution Cornerstone: The Central Limit Theorem

Mathematicians from the 18th and 19th centuries were eager to understand the patterns and mathematical models ruling games of chance and other scientific fields like astronomy. This curiosity allowed incredible advances in probability theory and gave birth to our subject of study: the normal distribution.

The timeline below briefly lists the mathematicians who contributed to the discovery of the normal distribution.

As you can see in the timeline, the Central Limit Theorem was published by Laplace in 1810, and it is one of the main reasons why the Normal Distribution is so relevant. A population can follow many different probability distributions (normal, skewed, exponential, etc.). The Central Limit Theorem states that, regardless of the distribution of the population (even if it is unknown to you, as long as its variance is finite), if you repeatedly draw random samples of the same size and calculate the mean of each one, the distribution of those sample means will approximate a normal distribution, and the approximation improves as the sample size grows.

This theorem is crucial for hypothesis testing because several parametric inferential tests, like the t-test or ANOVA, rely on a normality assumption to give reliable results. Thanks to the central limit theorem, if the sample size you are working with is large enough, you can use parametric hypothesis tests of the mean even if the population is not normal. In other words, parametric inferential tests that require normality remain robust to non-normal data as long as the sample size is large enough.
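To make the theorem more concrete, here is a minimal simulation sketch using only the Python standard library. The exponential population, the sample sizes, and the helper names `sample_means` and `skewness` are illustrative assumptions, not part of the original discussion. Skewness is used as a rough symmetry measure: it is 0 for a normal distribution.

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Draw n_samples samples of sample_size values from an exponential
    population (clearly non-normal) and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

def skewness(values):
    """Population skewness: 0 for a symmetric distribution such as the normal."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)

# Same population, same number of samples, different sample sizes.
small = sample_means(n_samples=5000, sample_size=2)
large = sample_means(n_samples=5000, sample_size=100)

print(f"skewness of sample means, sample size 2:   {skewness(small):.2f}")
print(f"skewness of sample means, sample size 100: {skewness(large):.2f}")
```

With the larger sample size, the skewness of the sample means should sit much closer to 0, which is what the theorem leads you to expect.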

What is a Probability Distribution?

In an experiment, each possible value of the random variable has a specific probability of happening. If you design an experiment and draw a number of random samples, the observed values plotted against their probability of happening form your probability distribution; you can estimate the probability of each value by dividing the number of times it occurred by the total number of observations. Note that the outcomes of your experiment will most likely be obtained by some measurement (temperature, age, number of marriages, money on sales) or by chance (tossing a coin).

Fig 1 and Fig 2 below illustrate the probability distribution of the number of orders a company receives per week, as a table and as a histogram.

Figure 1. Probability Distribution Table. Source: http://ci.columbia.edu/ci/premba_test/c0331/s5/s5_3.html
Figure 2. Probability Distribution Plot. Source: http://ci.columbia.edu/ci/premba_test/c0331/s5/s5_3.html
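As a rough sketch of how such a table can be built, the snippet below turns raw observations into relative frequencies. The `weekly_orders` values are made up for illustration (they are not the data behind Fig 1 and Fig 2), and `empirical_distribution` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical observations: orders received in each of 20 weeks.
weekly_orders = [10, 12, 11, 12, 13, 12, 11, 10, 12, 13,
                 11, 12, 14, 12, 11, 13, 12, 11, 12, 13]

def empirical_distribution(observations):
    """Map each observed value to its relative frequency (count / total)."""
    counts = Counter(observations)
    total = len(observations)
    return {value: counts[value] / total for value in sorted(counts)}

distribution = empirical_distribution(weekly_orders)
for value, probability in distribution.items():
    print(f"{value} orders: {probability:.2f}")
```

The relative frequencies always add up to one, which is the property that makes the table a probability distribution.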

The Normal Distribution and its PDF

The Normal Distribution is a continuous probability distribution described by the probability density function (PDF) in Fig 3. The area under the PDF over a range of values gives the probability that an outcome of the experiment falls in that range. The PDF includes a normalizing constant that ensures the total area under the curve equals one (the probabilities of all possible outcomes must sum to one). The mean splits this area into two equal halves.

Figure 3. Probability Density Function of the Normal Distribution

The shape of the Normal Distribution curve is defined by the mean and standard deviation of the distribution; the curve is centered on and symmetric around the mean, and its spread is set by the standard deviation. The PDF curve never touches the x-axis; it is non-zero across the entire real line. This means that the normal distribution assigns a non-zero density to every value, but the farther a value is from the mean, the closer its density gets to zero.
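The properties above can be checked numerically. The sketch below assumes Fig 3 shows the standard normal density formula, f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)); the helper names `normal_pdf` and `area_under_pdf`, the midpoint-rule integration, and the values of `mu` and `sigma` are illustrative choices.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    normalizing_constant = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return normalizing_constant * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def area_under_pdf(lower, upper, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate the area under the PDF on [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(
        normal_pdf(lower + (i + 0.5) * width, mu, sigma) for i in range(steps)
    )

mu, sigma = 170.0, 10.0  # illustrative values only

# Integrating far enough into both tails recovers (almost exactly) an area of 1.
total_area = area_under_pdf(mu - 8 * sigma, mu + 8 * sigma, mu, sigma)
# The mean splits the area in half.
left_half = area_under_pdf(mu - 8 * sigma, mu, mu, sigma)

print(f"total area: {total_area:.6f}")
print(f"area left of the mean: {left_half:.6f}")
print(f"density 8 standard deviations out: {normal_pdf(mu + 8 * sigma, mu, sigma):.3e}")
```

The standard library's `statistics.NormalDist(mu, sigma).pdf(x)` computes the same density, so the hand-written function is only there to show where the normalizing constant enters.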

The Empirical Rule (68–95–99.7% rule) states that, in a normal distribution, about 68% of the data lies within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. This comes in very handy when you are trying to identify outliers in your data, or as a quick check of the distribution’s normality. Fig 4 shows the Empirical Rule and how 99.7% of the data in a Normal Distribution lies within 3 standard deviations.

Figure 4. Empirical Rule. Source: https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution
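The three percentages in the rule can be reproduced from the cumulative distribution function of the standard normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
# prints roughly 68.3%, 95.4% and 99.7%
```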

The Normal Distribution and the Z-Table

Going back to the PDF of a normal distribution and how it describes the probability of a value falling in a particular range, let me briefly mention the z-table. A z-table tabulates the area under the curve (that is, the probability) for a range of z-scores of the standard normal distribution (mean = 0, standard deviation = 1).

Let’s remember that the total area under the normal distribution PDF is equal to 1 and that the mean of the distribution divides the data at the 50% mark. To calculate the probability of a particular event from the PDF, you would integrate up to the z-score of interest to get the area under the curve, and thus the probability of the event (see Fig 5 for a visual example). Since the PDF of a normal distribution never touches the x-axis, this means taking the integral from negative infinity up to that z-score (which is what a cumulative, left-tail z-table lists). Imagine the amount of work it would take if you had to calculate this integral for every z-score you wanted to check.

Figure 5. Area under PDF up to a z-score. Source: https://towardsdatascience.com/how-to-use-and-create-a-z-table-standard-normal-table-240e21f36e53

The Normal Distribution is the underlying probability distribution for many parametric hypothesis tests. Z-scores are usually obtained as a result of hypothesis testing (a z-score tells us how many standard deviations a value is away from the mean). This is why a handy summary of the area under the PDF is so useful.

Remember that there is an infinite number of possible Normal Distributions (one for each combination of mean and standard deviation). The z-table describes the area under the curve for the Standard Normal Distribution only (mean = 0, standard deviation = 1). Make sure to convert your value into a z-score (standard normal distribution terms) by subtracting the mean and dividing by the standard deviation, z = (x − mean) / standard deviation, before looking up your result in the z-table.
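In code, the z-table lookup is replaced by a call to the cumulative distribution function. The sketch below uses a hypothetical example (exam scores with mean 70 and standard deviation 8); the numbers and the helper name `z_score` are assumptions for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma

# Hypothetical example: exam scores with mean 70 and standard deviation 8.
mu, sigma = 70.0, 8.0
x = 82.0

z = z_score(x, mu, sigma)  # (82 - 70) / 8 = 1.5

# Area under the standard normal PDF from negative infinity up to z,
# i.e. the value a cumulative z-table lists for z = 1.5.
p_below = NormalDist().cdf(z)

# The same probability computed directly on the original scale.
p_direct = NormalDist(mu, sigma).cdf(x)

print(f"z-score: {z}")
print(f"P(score <= {x}) via z-score: {p_below:.4f}")
print(f"P(score <= {x}) directly:    {p_direct:.4f}")
```

Both routes give the same probability, which is the reason a single table for the standard normal distribution is enough for every normal distribution.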

How can you determine if your Probability Distribution is Normal?

Histogram

When you get a sample of outcomes from an experiment, a common first step is to plot the number of occurrences against sample values to get the distribution curve (histogram). In many cases, the resulting curve will expose the type of probability distribution your data follows (or at least it will give you a good sense of it).

When working with the Normal Distribution, you should look for a bell-shaped curve. If you see something that roughly resembles a bell, you can proceed with other checks to be more confident that your samples come from a normal distribution.
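As a quick illustration of the idea, the sketch below bins simulated data and prints a text histogram. In practice you would more likely use a plotting library; the simulated sample, the bin width, and the helper name `histogram_bins` are assumptions made only for this example.

```python
import math
import random
from collections import Counter

def histogram_bins(values, bin_width):
    """Return (bin_start, count) pairs for equal-width bins, in ascending order."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    # Counter returns 0 for empty bins, so gaps in the data are kept visible.
    return [(b * bin_width, counts[b]) for b in range(min(counts), max(counts) + 1)]

rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(1000)]  # simulated, roughly normal data

bins = histogram_bins(sample, bin_width=5)
for start, count in bins:
    print(f"{start:>5.0f} | {'#' * (count // 5)}")
```

For data like this, the bars should rise toward the middle bins and fall off on both sides, which is the rough bell shape to look for.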

Q-Q plot

This plot helps you determine whether your data comes from a normal distribution. Q-Q plots take theoretical normal distribution quantiles (x-axis) and compare them against your sample data quantiles (y-axis). If your sample comes from a normal distribution, the scatter plot will roughly form a straight line (at a 45 degree angle when both axes are on the same standardized scale; see Fig 6 for an example). Keep in mind that, just like the histogram, the Q-Q plot is a visual check, and what counts as a good-enough straight line is subjective.

Figure 6. Q-Q Plot. Source: http://www.sthda.com/english/wiki/qq-plots-quantile-quantile-plots-r-base-graphs
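The points of a Q-Q plot can be computed by hand, which also allows a rough numeric summary of how straight the line is. This is a minimal sketch using only the standard library; the plotting positions (i + 0.5) / n, the simulated samples, and the helper names `qq_points` and `pearson_r` are illustrative assumptions, and the correlation is an informal summary rather than a formal test.

```python
import random
from statistics import NormalDist, fmean, pstdev

def qq_points(sample):
    """Pair each sorted sample value (y) with the standard normal quantile (x)
    at plotting position (i + 0.5) / n."""
    n = len(sample)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))

def pearson_r(pairs):
    """Pearson correlation of the Q-Q points: close to 1 means close to a straight line."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    covariance = fmean((x - mx) * (y - my) for x, y in pairs)
    return covariance / (pstdev(xs) * pstdev(ys))

rng = random.Random(1)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]        # simulated normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]   # simulated skewed data

r_normal = pearson_r(qq_points(normal_sample))
r_skewed = pearson_r(qq_points(skewed_sample))

print(f"straightness (normal sample): {r_normal:.4f}")
print(f"straightness (skewed sample): {r_skewed:.4f}")
```

Plotting the pairs returned by `qq_points` as a scatter plot gives the usual Q-Q plot; the skewed sample should bend visibly away from a straight line.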

Additional Statistical Tests

After visual inspection of your data, you can run a formal statistical test of normality. A common choice is the Shapiro-Wilk test. Its null hypothesis is that the data comes from a normal distribution: if the p-value is below the alpha level you have set, you reject normality; otherwise the test found no evidence against it (which is not the same as proving that the data is normal).
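A minimal sketch of the test with SciPy follows, assuming NumPy and SciPy are installed. `scipy.stats.shapiro` returns the test statistic and the p-value; the simulated samples, the alpha of 0.05, and the helper name `looks_normal` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)   # simulated normal data
skewed_sample = rng.exponential(scale=1.0, size=200)       # simulated skewed data

def looks_normal(sample, alpha=0.05):
    """Run Shapiro-Wilk and report whether normality is NOT rejected at level alpha."""
    statistic, p_value = stats.shapiro(sample)
    return p_value >= alpha, statistic, p_value

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    not_rejected, statistic, p_value = looks_normal(sample)
    verdict = "no evidence against normality" if not_rejected else "normality rejected"
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.4f} -> {verdict}")
```

A strongly skewed sample of this size should be rejected, while a truly normal sample will still be rejected about 5% of the time at alpha = 0.05, so it is worth reading the result together with the histogram and the Q-Q plot.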


Published in Analytics Vidhya
