Statistics 101 for Data Science: Part 1

Sumaiya Sande
Nov 22, 2023

Coming from an academic background in statistics and having worked in the data science industry for the last 3+ years, I often see data science folks who are less curious about the data and more curious about which machine learning model to use or how to tune it to increase accuracy. Eventually, they forget the word ‘Data’ in data science and keep doing fit-predict things. When you understand the data well, half of your problems are already solved. And to understand the data, you need to know basic statistics. So here I am, trying my best to simplify the statistical jargon.

Let’s first understand what data is and why it is collected. When we frame a problem, it is a general problem about the desired “population”. The aspect of the population we want to know about (for example, its average) is called a population ‘parameter’. But usually the population is infinite, or it contains subjects we cannot reach or whose values we cannot record. This is where a ‘sample’, or ‘data’ in data science, comes in. The sample is collected in a way that represents the desired population. There are different sampling methods to avoid selection bias, such as stratified sampling and cluster sampling, but they are not the objective of this blog. Let’s assume that we have a sample which represents the population. Then, to estimate the population parameter, we compute a ‘sample statistic’. The value that the statistic takes on a particular sample is called an ‘estimate’.

Population vs Sample
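To make the three words concrete, here is a minimal sketch. The population below is simulated (hypothetical heights in cm, not real data), which is the only reason we can compute the parameter at all; in practice we only ever see the sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical finite "population": 100,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(100_000)]

# Parameter: a fixed number that describes the whole population.
population_mean = statistics.fmean(population)

# Sample: the small subset we can actually observe (simple random sampling).
sample = random.sample(population, 200)

# Statistic: the rule "take the sample mean"; estimate: the value it produces.
estimate = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"estimate  (sample mean):     {estimate:.2f}")
```

The two numbers will be close but not identical, and a different sample would give a slightly different estimate.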

Now that the three important words (parameter, statistic and estimate) are defined, let’s discuss what descriptive statistics and inferential statistics are:

—Descriptive Statistics: Methods for organizing and summarizing information, e.g. graphs and charts.

—Inferential Statistics: Methods for drawing conclusions about a population, and measuring the reliability of those conclusions, based on information obtained from a sample of that population, e.g. point estimation, interval estimation and hypothesis testing.

Descriptive vs Inferential statistics
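The difference can be sketched in a few lines. The sample is simulated (hypothetical delivery times in minutes), and the interval uses the normal approximation, which is an assumption made for this illustration rather than the only way to build an interval.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(7)

# Hypothetical sample: 50 simulated delivery times in minutes (not real data).
sample = [random.gauss(30, 5) for _ in range(50)]

# Descriptive: summarize the sample we actually have.
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)

# Inferential: make a statement about the population mean, with a measure
# of reliability. Normal-approximation 95% interval: mean +/- z * sd / sqrt(n).
z = NormalDist().inv_cdf(0.975)  # roughly 1.96
margin = z * sample_sd / math.sqrt(len(sample))
low, high = sample_mean - margin, sample_mean + margin

print(f"descriptive: mean={sample_mean:.2f}, sd={sample_sd:.2f}")
print(f"inferential: 95% interval for the population mean = ({low:.2f}, {high:.2f})")
```

The first print only describes the 50 values in hand; the second makes a claim about the population they came from.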

Let’s discuss a few descriptive statistics that capture the central tendency and the spread of the data:

—Central tendency of the data: Mean, median and mode are the basic statistics that measure central tendency.

—Spread of the data: Range, interquartile range (IQR) and standard deviation are some of the ways to measure the spread of the data. Range is defined as the difference between the maximum and the minimum. The interquartile range (IQR) is the difference between the 75th percentile (third quartile, Q3) and the 25th percentile (first quartile, Q1). Standard deviation is defined below.

1. Range = Max - Min

2. IQR = Q(0.75) - Q(0.25) = Q3 - Q1

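3. Standard Deviation: Square root of the average of the squared deviations of the observed values from the mean of the variable in question. (When computed from a sample, the sum of squared deviations is usually divided by n - 1 rather than n.)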
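Here is a small sketch of these measures using only Python's built-in `statistics` module. The ticket counts are made up for illustration, with one deliberately extreme day so you can see which measures react to it.

```python
import statistics

# Hypothetical data: daily support tickets, with one unusually busy day.
tickets = [2, 3, 3, 4, 5, 5, 5, 7, 40]

# Central tendency
mean = statistics.mean(tickets)      # pulled upward by the extreme value 40
median = statistics.median(tickets)  # middle value of the sorted data
mode = statistics.mode(tickets)      # most frequent value

# Spread
data_range = max(tickets) - min(tickets)
# Quartile conventions differ between tools; this uses the module's default method.
q1, _, q3 = statistics.quantiles(tickets, n=4)
iqr = q3 - q1
pop_sd = statistics.pstdev(tickets)    # divides by n
sample_sd = statistics.stdev(tickets)  # divides by n - 1

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={data_range} IQR={iqr} sd(pop)={pop_sd:.2f} sd(sample)={sample_sd:.2f}")
```

Notice that the mean, range and standard deviation are all stretched by the single busy day, while the median and IQR barely move. That is one reason to look at more than one summary.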

A random variable is a characteristic that varies from one subject to another. Variables are primarily of two types: quantitative (numerical) and qualitative (categorical). A quantitative variable can be discrete or continuous. Common discrete distributions include the Bernoulli and the binomial, while common continuous distributions include the exponential and the normal (Gaussian).
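As a rough illustration, the snippet below draws one value from each of the four distributions just mentioned, using Python's `random` module. The parameter values (0.3, 10, 1.5, 0 and 1) are arbitrary choices for this sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Discrete: the possible values can be counted.
bernoulli = 1 if random.random() < 0.3 else 0             # one trial, success probability 0.3
binomial = sum(random.random() < 0.3 for _ in range(10))  # number of successes in 10 trials

# Continuous: the possible values lie on a continuum.
exponential = random.expovariate(1.5)  # rate 1.5, never negative
normal = random.gauss(0, 1)            # mean 0, standard deviation 1

print(bernoulli, binomial, exponential, normal)
```

The binomial draw is just the sum of ten independent Bernoulli draws, which is exactly how the two distributions are related.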

Now, an important concept that data scientists often forget to look at is the distribution of the variables, which is extremely important for answering the problem statement. In statistics, discrete variables have a probability mass function (PMF), which gives the probability of each possible value, and continuous variables have a probability density function (PDF), where probabilities are areas under the curve.
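A quick sketch of the difference, using the standard textbook formulas for the binomial PMF and the exponential PDF. The helper names `binomial_pmf` and `exponential_pdf` are my own, written for this illustration.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial(n, p) variable."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def exponential_pdf(x: float, rate: float) -> float:
    """Density of an exponential(rate) variable at x (zero for negative x)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


# PMF: each value is a genuine probability, and they sum to 1.
n, p = 10, 0.3
pmf_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# PDF: a density is not a probability (it can even exceed 1).
density_at_zero = exponential_pdf(0.0, 1.5)

# Probabilities come from areas: approximate P(0 <= X <= 2) with a midpoint sum.
rate, upper, steps = 1.5, 2.0, 10_000
width = upper / steps
area = sum(exponential_pdf((i + 0.5) * width, rate) * width for i in range(steps))
exact = 1 - math.exp(-rate * upper)  # closed-form exponential CDF at `upper`

print(f"sum of PMF values: {pmf_total:.6f}")
print(f"density at 0: {density_at_zero}")
print(f"area under PDF on [0, 2]: {area:.6f} (closed form: {exact:.6f})")
```

So for a discrete variable you read probabilities straight off the PMF, while for a continuous variable you always ask about an interval.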

You must have seen the bell curve, which is nothing but the PDF of a normal random variable.
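If you want to poke at the bell curve yourself, Python's `statistics.NormalDist` is enough. The sketch below uses the standard normal (mean 0, standard deviation 1) and checks the shape numerically instead of plotting it.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# The curve peaks at the mean and is symmetric around it.
peak = standard_normal.pdf(0)
left, right = standard_normal.pdf(-1.5), standard_normal.pdf(1.5)

# Area under the curve within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

print(f"height at the mean: {peak:.4f}")
print(f"height at -1.5 and +1.5: {left:.4f}, {right:.4f}")
for k, prob in within.items():
    print(f"P(within {k} sd of the mean) = {prob:.4f}")
```

The three areas come out at roughly 68%, 95% and 99.7%, the familiar rule of thumb for normally distributed data.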

Stay tuned for Part II of this blog, where we talk about inferential statistics: estimation, the central limit theorem and hypothesis testing.

I will keep adding further parts on topics like regression analysis and advanced hypothesis testing. So keep reading and growing…

Sumaiya Sande

PhD in Statistics from National University of Singapore. ML and AI Enthusiast. Follow me on LinkedIn: https://www.linkedin.com/in/sumaiya-sande/