Z-Score in Statistics

Tanwir Khan
Jun 18, 2020
Determining outliers using Z-Score

In this post we look at the Z-score in detail and at its applications to the Gaussian (normal) distribution. We will also implement it and see how a distribution is divided into quantiles.

In layman's terms, a Z-score shows how far a data point is from the mean.

More technically, it states how many standard deviations above or below the mean a particular value lies.

Gaussian or Normal Distribution : Image by Author

The curve shown above is a Gaussian or Normal Distribution curve. The center of the curve is the mean.

The region within one standard deviation of the mean, on both the left and the right, covers 68.27% of the area under the curve. Similarly, the region within two standard deviations covers 95.45% of the area, and the region within three standard deviations covers 99.73%. This is the empirical rule (also called the 68-95-99.7 rule) for the normal distribution.
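The percentages above can be checked numerically. Here is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `coverage_within` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k):.2%}")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```

Because the coverage depends only on the number of standard deviations, the same percentages hold for any normal distribution, not just the standard one.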

Standard Normal Distribution : Image by Author

Now let’s take a Standard Normal distribution as shown above, which has a mean of zero and a standard deviation of 1. In that case, a Z-score of +1 says that we are 1 standard deviation above the mean. If it is +2, then we are 2 standard deviations above the mean, and so on.

Similarly, a Z-score of -1 says that we are 1 standard deviation below the mean.

A Z-score of -2 says that we are 2 standard deviations below the mean, and so on.

Z-Score formula

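The Z-score formula for a single value is as follows:

$$z = \frac{x - \mu}{\sigma}$$

Where $x$ is the observed value, $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.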

Now, let’s take an example to understand this concept better. Suppose we are considering the heights of students in a class. Let’s say the mean height of the students is 150 cm and the standard deviation is 10 cm, and we have to find the probability that a student is taller than 165 cm, that is, P(height > 165 cm).

Let’s see how the distribution looks in the below normal distribution curve:

Normal Distribution Curve : Image by Author

If you map the above curve onto a standard normal distribution curve, the value of 165 cm falls 1.5 standard deviations above the mean.

The reason is that 150 is the mean, so 1 SD above the mean is 160 and 2 SD above the mean is 170, which puts 165 at 1.5 SD above the mean.
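The same mapping can be written as a one-line calculation. The function below is a small illustrative sketch (the name `z_score` and its parameters are chosen for this example), applied to the height figures from the text.

```python
def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations by which value lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - mean) / std_dev


print(z_score(165, mean=150, std_dev=10))  # 1.5
print(z_score(140, mean=150, std_dev=10))  # -1.0
```

A positive result means the value is above the mean and a negative result means it is below.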

Looking up z = 1.5 in the Z-score table below gives 0.9332 (circled in red), which means that 93.32% of the area under the curve lies below 165 cm, as shown below.

As the complete standard normal distribution covers 100% of the area, P(height > 165 cm) is 100 − 93.32 = 6.68%.

So the probability that a student in the class is taller than 165 cm is around 6.68%.
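Instead of reading the value off a printed Z-table, the same area can be computed from the cumulative distribution function. This is a minimal sketch using the standard library's `statistics.NormalDist`, with the mean and standard deviation taken from the example; the variable names are illustrative.

```python
from statistics import NormalDist

# Mean and standard deviation from the example, in cm.
heights = NormalDist(mu=150, sigma=10)

p_below = heights.cdf(165)   # area to the left of 165 cm
p_above = 1 - p_below        # area to the right of 165 cm

print(f"P(height < 165 cm) = {p_below:.4f}")  # 0.9332
print(f"P(height > 165 cm) = {p_above:.4f}")  # 0.0668
```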

Z-score for One Sample

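In the above example, we considered the complete population.

We can calculate the z-score for one sample as well. The formula remains the same, with the sample statistics taking the place of the population parameters:

$$z = \frac{x - \bar{x}}{s}$$

Where $x$ is the observed value, $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

The process for solving the z-score remains the same for samples.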
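As a rough sketch of the one-sample case, the snippet below computes z-scores for every value in a small sample using its own mean and standard deviation. The height values are made up purely for illustration and do not come from the article.

```python
from statistics import mean, stdev

# Hypothetical sample of heights in cm (made-up values for illustration only).
sample_heights = [142, 148, 150, 151, 155, 158, 163]

x_bar = mean(sample_heights)   # sample mean
s = stdev(sample_heights)      # sample standard deviation (n - 1 in the denominator)

z_scores = [(x - x_bar) / s for x in sample_heights]

for x, z in zip(sample_heights, z_scores):
    print(f"{x} cm -> z = {z:+.2f}")
```

By construction, the resulting z-scores have a mean of 0 and a sample standard deviation of 1, which is why this transformation is often called standardization.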

Now that we have seen the formula for calculating the Z-score for one sample, let’s go ahead and understand how we could do this when we have multiple samples.

Z-score for Multiple Samples

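The formula below gives the z-score when we have multiple samples:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation and $n$ is the sample size. The denominator $\sigma / \sqrt{n}$ is the standard error of the mean.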

Let’s take an example: the mean weight of students in a class is 150 lbs with a standard deviation of 3.0 lbs. What is the probability that a random sample of 60 students has a mean weight of 170 lbs or more, assuming weight is normally distributed?

As we are dealing with the sampling distribution of means, we have to include the standard error in the formula while calculating the z-score. The standard error is 3.0/√60 ≈ 0.387, so z = (170 − 150)/0.387 ≈ 51.6. (Rounding the standard error to 0.39 first gives 51.28.)

Another thing to keep in mind here is the empirical rule that we discussed earlier.

According to the empirical rule, 99.73% of the values fall within 3 standard deviations of the mean in a normal distribution. Our z-score of about 51.6 is more than 51 standard deviations away from the mean, so the probability that a random sample of 60 students has a mean weight of 170 lbs or more is far below 1%; for practical purposes it is zero.
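The calculation for this example can be sketched as follows, using the figures from the text. Without rounding the standard error, z comes out near 51.64; rounding the standard error to 0.39 first gives 51.28. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 150.0   # lbs
population_sd = 3.0       # lbs
sample_size = 60
sample_mean = 170.0       # lbs

# Standard error of the mean: sigma / sqrt(n).
standard_error = population_sd / sqrt(sample_size)
z = (sample_mean - population_mean) / standard_error

# Upper-tail probability of a sample mean at least this large.
p_upper = 1 - NormalDist().cdf(z)

print(f"standard error = {standard_error:.4f}")  # 0.3873
print(f"z = {z:.2f}")                            # 51.64
print(f"P(sample mean >= 170 lbs) = {p_upper}")
```

At a z-score this large the tail probability is smaller than double-precision arithmetic can distinguish from zero, so the printed probability should not be read as an exact value.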

Now that we have understood the concept of the Z-score, let’s go ahead and see how we could use it to detect outliers.

We will also see another method of detecting outliers, which uses quantiles.
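To make the two approaches concrete, here is a minimal standard-library sketch. It is not the code from the GitHub repository; the data, the function names, the z-score threshold of 3 and the 1.5 × IQR fences are common conventions assumed here for illustration.

```python
from statistics import mean, pstdev, quantiles

# Made-up data for illustration; 102 is the planted outlier.
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 11, 10, 13, 102]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD of the data itself
    return [v for v in values if abs((v - mu) / sigma) > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(zscore_outliers(data))  # [102]
print(iqr_outliers(data))     # [102]
```

One design point worth noting: the outlier itself inflates the mean and standard deviation used by the z-score method, so with very small samples no point can reach |z| > 3. The quantile-based fences are less sensitive to this because quartiles barely move when a single extreme value is added.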

Here’s the GitHub link for the code which is being explained in the video below

Z Score and Quantiles used to determine Outliers

Reference

  1. https://en.wikipedia.org/wiki/Quantile
  2. https://en.wikipedia.org/wiki/Standard_score

Originally published at https://ai-ml-analytics.com on June 18, 2020.
