Statistics 101 for Data Science : Part 4

Sumaiya Sande
5 min read · Dec 5, 2023

I am writing a series on statistics for data science. This part continues from Part 3, in which I explained the process of hypothesis testing in detail.

In Part 2, I explained basic concepts in inferential statistics such as estimation, the central limit theorem and hypothesis testing. The link to Part 2 : https://medium.com/@sumaiyasande/statistics-101-for-data-science-part-2-c30e28384ad5

In Part 1, I explained population, sample, parameter, statistic, estimate, descriptive vs inferential statistics, random variables, distributions etc. The link to Part 1 : https://sumaiyasande.medium.com/statistics-101-for-data-science-part-i-7868c30774e2

In this part, I will explain one-sample tests in detail.

When you test a claim about a single population parameter, it is called a one-sample test, and when you compare the parameters of two populations, it is called a two-sample test. I will go through these one by one, starting with the one-sample tests : the one-sample t-test and the test for a proportion.

One sample t-test : Let’s say you want to know the average income of a person in a country (let’s call this parameter µ), and you want to test whether this average income (in thousands) is 10 or not. If we don’t have access to the income data of everyone in the country, we collect a sample (denoted by X) of a fixed size, say 100 (the sample size is usually denoted by n), which we assume is approximately normally distributed. So we have the income data of 100 people, in thousands, as 5, 10, 2, 11, 15, 150 … and so on. We then calculate the sample mean (usually denoted by X̅), which is our estimate of the population parameter µ, and use it to judge whether µ is 10 or not. Formally, we are testing the following hypotheses :

H0 : µ = 10 versus H1 : µ ≠ 10

Now that we have defined the hypotheses, it is time to define the test statistic. The test statistic should be close to 0 when the null hypothesis is true and should move away from 0 when the alternative hypothesis is true. In this case, the test statistic is this :

T = (X̅ - μ0) / (s / √n)

where μ0 is the value of μ under the null hypothesis, which is 10 in this case, and s is the sample standard deviation (spread) of the data X. If the null hypothesis is true, then T follows a t-distribution with (n-1) degrees of freedom.

Now, let’s say that from the sample X we calculate X̅ = 15 and s = 20. Then T = (15 - 10) / (20 / √100) = 2.5, which means that the observed mean X̅ is 2.5 standard errors above the null hypothesis value of 10.
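To make the arithmetic concrete, here is a minimal sketch that computes the test statistic from the summary values above. The function name `one_sample_t_statistic` is my own choice for illustration, and it assumes the statistic is the difference between the sample mean and the null value divided by the standard error s / √n.

```python
import math


def one_sample_t_statistic(sample_mean, mu0, sample_sd, n):
    """Return (sample_mean - mu0) divided by the standard error sample_sd / sqrt(n)."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# Summary values from the income example (in thousands).
t_value = one_sample_t_statistic(sample_mean=15, mu0=10, sample_sd=20, n=100)
print(t_value)  # 2.5
```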

Now, it is time to calculate the P-value.

Figure : t-distribution with 99 degrees of freedom

The graph above shows the t-distribution with 99 degrees of freedom, which is the distribution of T under the null hypothesis. The p-value is the probability that the test statistic is greater than 2.5 or less than -2.5, given that the null hypothesis is true. In this case, the p-value is 0.0141. If we set the confidence level at 95%, the significance level is 5%, split as 2.5% in each tail. Since the p-value of 0.0141 is less than 0.05, we reject the null hypothesis.
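If you want to check the p-value yourself, the sketch below integrates the t-distribution density numerically using only the Python standard library. The helper names `t_pdf` and `two_sided_t_p_value` and the choice of Simpson's rule with 2000 steps are my own assumptions for illustration. In practice you would more likely use a statistics library (for example, `scipy.stats.t.sf` returns the upper tail probability directly).

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom at x."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_t_p_value(t_value, df, steps=2000):
    """P(|T| >= |t_value|) using Simpson's rule on [0, |t_value|]; steps must be even."""
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = h / 3 * total
    # The density is symmetric, so the area between -|t| and |t| is twice this.
    return 1.0 - 2.0 * area_zero_to_t


p_value = two_sided_t_p_value(2.5, df=99)
print(round(p_value, 4))  # about 0.0141
print(p_value < 0.05)     # True, so the null hypothesis is rejected at the 5% level
```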

Now, what does this mean? This means all of the following :

  1. If the null hypothesis is correct, there is a 1.41% chance of obtaining a result at least as extreme as the one observed.
  2. There is strong evidence against the null hypothesis, and the observed difference is unlikely to be due to random sampling variation alone.
  3. If the average income is actually 10k and we drew 100 such samples, we would expect only 1–2 of them (~1.41) to give a result at least as extreme as a sample mean of 15k.
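The third interpretation can be checked with a small simulation. This is only a sketch under my own assumptions: incomes are drawn from a normal distribution with mean 10 and standard deviation 20 (so the null hypothesis is true by construction), and 5000 repeated samples are used instead of 100 so that the fraction is more stable. The names `t_statistic`, `assumed_sd` and `n_repeats` are illustrative.

```python
import math
import random


def t_statistic(sample, mu0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / math.sqrt(variance / n)


random.seed(0)  # fixed seed so the run is repeatable
mu0, assumed_sd, n, n_repeats = 10, 20, 100, 5000  # assumed_sd is an illustrative choice

extreme = 0
for _ in range(n_repeats):
    # Draw one sample of size n from a population where the null hypothesis holds.
    sample = [random.gauss(mu0, assumed_sd) for _ in range(n)]
    if abs(t_statistic(sample, mu0)) >= 2.5:
        extreme += 1

fraction_extreme = extreme / n_repeats
print(fraction_extreme)  # should land near the p-value, i.e. roughly 1 to 2 in 100
```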

Now that we understand the one-sample t-test, let’s discuss the test for a population proportion.

Test for population proportion : Now suppose the variable we are interested in is categorical in nature and not continuous. Then we can’t apply the one-sample t-test. Suppose you want to test whether the proportion of the population whose income is greater than 10k is 0.5 (or 50%) or is greater than 0.5. Then you will need the test for proportion. The important assumption for this test is that the sample is large, so that the sample proportion is approximately normally distributed. Now, let’s say the population proportion is denoted by p and the sample proportion is denoted by p_samp, with sample size 10000. We are testing the following hypotheses :

H0 : p = 0.5 versus H1 : p > 0.5

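When the null hypothesis is true, the sampling distribution of p_samp is approximately normal with mean 0.5 and standard deviation √(0.5 (1 - 0.5) / 10000) = 0.005 (since the number of individuals in the sample with income above 10k is a binomial random variable). Hence, the test statistic is

Z = (p_samp - p0) / √(p0 (1 - p0) / n)

where p0 = 0.5 is the value of p under the null hypothesis and n = 10000 is the sample size.

Suppose that from the sample you obtain 0.52 as the proportion of individuals with an income greater than 10k. Then Z = (0.52 - 0.5) / 0.005 = 4.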

This means, the sample proportion of 0.52 is 4 standard errors away from the null hypothesis value of 0.5.
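Here is a minimal sketch of the same calculation in Python. The function name `proportion_z_statistic` and the symbol `p0` (the proportion under the null hypothesis) are my own labels, and the standard error is computed as the square root of p0 (1 - p0) / n, as described above.

```python
import math


def proportion_z_statistic(p_samp, p0, n):
    """Return (p_samp - p0) divided by the standard error under the null hypothesis."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_samp - p0) / standard_error


z_value = proportion_z_statistic(p_samp=0.52, p0=0.5, n=10000)
print(round(z_value, 6))  # 4.0
```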

The p-value in this case is 0.00003167 (the probability that Z is greater than 4 when the null hypothesis is true), which is much smaller than 0.05 (the significance level). Hence, the null hypothesis is rejected. In other words, there is strong evidence against the null hypothesis: it is very unlikely to obtain a sample proportion as high as 52% of individuals having income more than 10k if the actual proportion is 50%.
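This p-value is just the upper tail area of the standard normal distribution beyond Z = 4. A short sketch using only the standard library, where the helper name `upper_tail_normal_p_value` is my own:

```python
import math


def upper_tail_normal_p_value(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p_value = upper_tail_normal_p_value(4.0)
print(p_value)         # roughly 3.167e-05
print(p_value < 0.05)  # True, so the null hypothesis is rejected at the 5% level
```

I use `math.erfc` here rather than subtracting a CDF value from 1, since it keeps more precision for very small tail probabilities.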

Always remember that a hypothesis test only decides whether or not to reject the null hypothesis. It is not performed to accept the alternative hypothesis.

I conclude the one-sample tests here.

In the next part, I will discuss two-sample tests in detail.

Keep reading folks!!!

Written by Sumaiya Sande

PhD in Statistics from National University of Singapore. ML and AI Enthusiast. Follow me on LinkedIn:https://www.linkedin.com/in/sumaiya-sande/
