Working with Confidence Intervals

Learn the basics of how confidence intervals are used in data science and statistics.




 

In data science and statistics, confidence intervals are a useful way to quantify uncertainty in a dataset. For approximately normally distributed data, about 68% of values fall within one standard deviation of the mean, and about 95% fall within two standard deviations of the mean; these ranges are referred to here as the 68% and 95% confidence intervals. The spread of the data can also be estimated with the interquartile range, which covers the values between the 25th percentile and the 75th percentile (the middle 50% of the data). The 50th percentile is the median, which coincides with the mean when the distribution is symmetric.
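
For a normal distribution, the share of values lying within k standard deviations of the mean is erf(k/√2). The short sketch below evaluates this for k = 1 and k = 2 using only the Python standard library; the helper name `fraction_within` is introduced here for illustration and is not part of the original analysis.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

one_sigma = fraction_within(1)   # about 0.6827
two_sigma = fraction_within(2)   # about 0.9545

print(round(one_sigma, 4), round(two_sigma, 4))
```

Two standard deviations therefore cover slightly more than 95% of a normal distribution, which is why "two standard deviations" is a common rule of thumb for a 95% interval.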

In this article, we illustrate how the confidence interval can be calculated using the heights dataset. The heights dataset contains male and female height data.

 

Visualization of Probability Distribution of Heights

 

First, we plot the estimated probability distribution (a kernel density estimate) of the male and female heights.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# obtain dataset
df = pd.read_csv('https://raw.githubusercontent.com/bot13956/Bayes_theorem/master/heights.csv')

# plot probability distribution of heights
sns.kdeplot(df[df.sex=='Female']['height'], label='Female')
sns.kdeplot(df[df.sex=='Male']['height'], label='Male')
plt.xlabel('height (inch)')
plt.title('probability distribution of Male and Female heights')
plt.legend()
plt.show()

 

Probability distribution of male and female heights | Image by Author.

 

From the figure above, we observe that males are on average taller than females.

 

Calculation of Confidence Intervals

 

The code below illustrates how the 95% confidence intervals for the male and female heights can be calculated as the mean plus or minus two standard deviations.

# calculate confidence intervals for male heights
mu_male = np.mean(df[df.sex=='Male']['height'])
mu_male

>>> 69.31475494143555

std_male = np.std(df[df.sex=='Male']['height'])
std_male

>>> 3.608799452913512

conf_int_male = [mu_male - 2*std_male, mu_male + 2*std_male]
conf_int_male

>>> [62.0972, 76.5324]  # rounded to 4 decimal places

# calculate confidence intervals for female heights
mu_female = np.mean(df[df.sex=='Female']['height'])
mu_female

>>> 64.93942425064515

std_female = np.std(df[df.sex=='Female']['height'])
std_female

>>> 3.752747269853828

conf_int_female = [mu_female - 2*std_female, mu_female + 2*std_female]
conf_int_female

>>> [57.43392971093749, 72.4449187903528]
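
One way to sanity-check an interval of this kind is to count how much of the data actually falls inside it. The sketch below does this on synthetic, normally distributed values (the mean of 69.3 and standard deviation of 3.6 are stand-ins chosen to resemble the male heights, not the real dataset); the function names `two_sigma_interval` and `coverage` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def two_sigma_interval(values):
    """Return (mean - 2*std, mean + 2*std) for a 1-D collection of numbers."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0), np.std's default
    return mu - 2 * sigma, mu + 2 * sigma

def coverage(values, low, high):
    """Fraction of values lying inside the closed interval [low, high]."""
    values = np.asarray(values, dtype=float)
    return np.mean((values >= low) & (values <= high))

# Synthetic stand-in data (assumption): NOT the heights dataset.
rng = np.random.default_rng(0)
sample = rng.normal(loc=69.3, scale=3.6, size=10_000)

low, high = two_sigma_interval(sample)
frac = coverage(sample, low, high)
print(f"interval: [{low:.2f}, {high:.2f}], coverage: {frac:.3f}")
```

With the real data, the same helpers could be applied to `df[df.sex=='Male']['height']`; a coverage far from 0.95 would suggest the heights are not well described by a normal distribution.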

 

 

Confidence Interval Using Boxplot

 

Another method to estimate the confidence interval is to use the interquartile range, which spans the middle 50% of the data (from the 25th to the 75th percentile). A boxplot can be used to visualize the interquartile range, as illustrated below.
 

# generate boxplot
data = [df[df.sex=='Male']['height'],
        df[df.sex=='Female']['height']]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_ylabel('height (inch)')
xticklabels=['Male', 'Female']
ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()

 

 

Box plot showing the interquartile range.| Image by Author.

 

The box shows the interquartile range, and the whiskers extend to the minimum and maximum values of the data, excluding outliers. The circles mark the outliers, and the orange line marks the median. Reading from the figure, the interquartile range for male heights is approximately [67 inches, 72 inches], and the interquartile range for female heights is approximately [63 inches, 67 inches]. The median height is about 68 inches for males and about 65 inches for females.
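
The quantities a boxplot displays can also be computed directly rather than read off the figure. The sketch below uses `np.percentile` on a small made-up array (not the heights dataset) and assumes the common 1.5 × IQR rule for outliers, which is matplotlib's default whisker setting; `box_stats` is a hypothetical helper name.

```python
import numpy as np

def box_stats(values):
    """Compute quartiles, whisker ends, and outlier count for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Points outside these fences are treated as outliers (1.5 * IQR rule).
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),    # most extreme non-outlier below the box
        "whisker_high": inside.max(),   # most extreme non-outlier above the box
        "n_outliers": values.size - inside.size,
    }

# Toy data (assumption): illustrative values only.
heights = np.array([60, 62, 63, 64, 65, 66, 67, 68, 80], dtype=float)
stats = box_stats(heights)
print(stats)
```

For this toy array the quartiles are 63, 65, and 67, the whiskers end at 60 and 68, and the value 80 is flagged as the single outlier.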

 

Summary

 

In summary, confidence intervals are very useful for quantifying uncertainty in a dataset. For approximately normally distributed data, the 95% confidence interval covers the values that lie within two standard deviations of the mean. The confidence interval can also be estimated as the interquartile range, which covers the values between the 25th percentile and the 75th percentile, with the 50th percentile being the median (equal to the mean for a symmetric distribution).
 
 
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.