Fundamentals of Statistics and Machine Learning — Theory (Part 1)
Broader classification of Statistics
- Descriptive Statistics is an approach where we summarize the data: in terms of mean and standard deviation when the data is continuous (e.g., height), and in terms of frequency and percentage when the data is categorical (e.g., season of the year: spring, fall, summer); a short pandas sketch follows this list.
- Inferential Statistics is an approach where we take a sample out of the complete population or dataset (typically when the dataset is huge) and try to draw conclusions about the entire dataset from that sample. Inferences are carried out using methods such as hypothesis testing, correlation between data elements, etc.
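To make descriptive statistics concrete, here is a minimal sketch using pandas; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: a continuous variable (height) and a categorical one (season)
df = pd.DataFrame({
    "height_cm": [160, 172, 168, 181, 175, 169],
    "season": ["spring", "fall", "summer", "spring", "spring", "fall"],
})

# Continuous data: summarize with mean and standard deviation
print(df["height_cm"].mean())  # 170.83...
print(df["height_cm"].std())   # sample standard deviation

# Categorical data: summarize with frequency and percentage
print(df["season"].value_counts())                # counts per category
print(df["season"].value_counts(normalize=True))  # proportions per category
```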
On the other hand, Machine Learning can be broadly classified into 3 categories:
1. Supervised Learning: This is a machine learning technique where we train the model to learn the relationship between the independent (feature) variables and the dependent (target) variable. Supervised learning is classified into the following:
i. Classification Problem
ii. Regression Problem
2. Unsupervised Learning: Unlike supervised learning, unsupervised learning algorithms do not need labeled training data; they learn on their own to discover patterns and relationships among the data elements. Unsupervised learning is classified into the following:
i. Clustering
ii. Dimensionality Reduction.
3. Reinforcement Learning: This category of machine learning follows a different approach, where the machine, also called the agent, learns to take actions based on feedback from the environment. The agent starts taking actions without any prior training and is rewarded (for example, +1 or -1) based on the actions taken. Based on the reward, the agent re-evaluates its actions and adjusts its next steps. A short sketch contrasting the first two categories follows.
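To make the supervised vs. unsupervised distinction concrete, below is a minimal sketch using scikit-learn; the toy data and model choices here are arbitrary, not part of the original article:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised learning: features X come WITH labels y, and the model is
# trained to learn the mapping from X to y (a classification problem).
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2, 1.9]]))  # [0]: the point resembles the class-0 examples

# Unsupervised learning: the same features X but NO labels; the model
# discovers structure (here, 2 clusters) on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each point
```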
Machine Learning Life Cycle
The development and deployment of a machine learning model typically involves the following steps: data collection, data preparation and cleaning, exploratory data analysis, model building and training, model evaluation, and deployment with ongoing monitoring.
A few preliminary terms and concepts of Statistics required for model building
We always create statistical models under some assumptions, which should hold for all the data elements of the dataset. If they do not, the model's performance will suffer.
Below are a few terms which are important to know and understand for building models:
1. Population: We use this term to refer to the dataset as a whole. This covers the complete set of data elements for the subject under study.
2. Sample: We use this term to refer to a subset of the Population. As noted under Inferential Statistics, we work with a sample when the full dataset is too large, and use it to draw conclusions about the entire population.
3. Parameter: We use the term parameter in cases where we are computing something on the whole population of the data.
4. Statistic: We use the term statistic in cases where we are computing something on a sample of data rather than the whole population.
5. Mean: The mean is calculated by taking the sum of all the data points divided by the count of the data points. The mean is, however, sensitive to outliers, where outliers are data points that deviate markedly from the rest of the data. Outliers can have very high or very low values.
6. Median: The median is the middle value of the data points after arranging them in ascending or descending order; when the count is even, it is the average of the two middle values.
7. Mode: The mode is the data point that occurs the most times in the dataset. The mode comes in handy for missing categorical data: we take the mode of the categorical data element and fill it in all the places that have missing values (see the sketch below).
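A quick sketch of these three measures using Python's built-in statistics module; the sample numbers are arbitrary:

```python
import statistics
import pandas as pd

data = [3, 7, 7, 2, 9, 7, 4]

print(statistics.mean(data))    # (3+7+7+2+9+7+4)/7 = 39/7 ≈ 5.57
print(statistics.median(data))  # sorted: 2,3,4,7,7,7,9 -> middle value is 7
print(statistics.mode(data))    # 7 occurs the most times (3 times)

# Mode-based imputation for missing categorical data:
s = pd.Series(["spring", "fall", None, "spring", None])
print(s.fillna(s.mode()[0]))  # missing entries are filled with "spring"
```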
8. Dispersion: Dispersion describes how spread out the data points are.
9. Range: Range is the difference between the maximum and the minimum value of a data element.
10. Variance: Variance is the average of the squared differences from the mean. When calculating the variance of the whole population we divide by N, where N is the count of the whole population, whereas when calculating the variance of a sample we divide by N − 1. The formulas for the two cases are:

Variance for Population: σ^2 = Σ(x_i − μ)^2 / N

Variance for Sample: s^2 = Σ(x_i − x̄)^2 / (N − 1)
Let’s take an example to understand how variance can be calculated:
Given numbers: 16, 11, 9, 8, 1
Step 1: Find the mean (μ): (16+11+9+8+1)/5 = 9
Step 2: Subtract the mean from each data point and square the result:
(16–9)^2 = 49
(11–9)^2 = 4
(9–9)^2 = 0
(8–9)^2 = 1
(1–9)^2 = 64.
Step 3: Add all the squared differences:
49+4+0+1+64 = 118
Step 4: Divide the result of Step 3 by the number of data points:
118/5 = 23.6 is the population variance.
11. Standard Deviation: Standard deviation shows how dispersed a set of values is. A low standard deviation means the values are close to the mean, while a high standard deviation indicates the values are spread out more widely.
If you are calculating the standard deviation for the whole population, the formula is:

σ = √( Σ(x_i − μ)^2 / N )

While if we're calculating the standard deviation for just a sample, the formula is:

s = √( Σ(x_i − x̄)^2 / (N − 1) )
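We can verify the worked example above in Python. This is a minimal sketch assuming NumPy; the ddof argument switches between the population formula (divide by N) and the sample formula (divide by N − 1):

```python
import numpy as np

data = np.array([16, 11, 9, 8, 1])

print(data.max() - data.min())  # range: 16 - 1 = 15

# Population variance and standard deviation: divide by N (ddof=0, the default)
print(np.var(data))   # 118 / 5 = 23.6
print(np.std(data))   # sqrt(23.6) ≈ 4.86

# Sample variance and standard deviation: divide by N - 1 (ddof=1)
print(np.var(data, ddof=1))  # 118 / 4 = 29.5
print(np.std(data, ddof=1))  # sqrt(29.5) ≈ 5.43
```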
12. Quantiles: Quantiles are cut points that divide the data into equal-sized segments. They include percentiles, quartiles, deciles, and so on, and are calculated after arranging the data in ascending order (see the sketch after these sub-points).
i. Percentile: A percentile is the value below which a given percentage of the data falls. For example, if someone scores in the 90th percentile, then 90 percent of the students scored below them.
ii. Decile: Deciles divide the data into ten equal parts; the first decile is the 10th percentile, meaning 10 percent of the whole dataset lies below it.
iii. Quartile: Quartiles divide the data into four equal parts using three cut points: the 1st quartile (Q1) has 25 percent of the data below it, the 2nd quartile (Q2) has 50 percent, and the 3rd quartile (Q3) has 75 percent. The 2nd quartile is also called the median, the 50th percentile, or the 5th decile.
iv. Interquartile Range (IQR): The interquartile range is the difference between the third quartile and the first quartile, and is effective for flagging outliers in the dataset.
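As a quick illustration of percentiles, deciles, and quartiles, here is a minimal sketch assuming NumPy; the data is just the numbers 1 to 100 so the results are easy to read:

```python
import numpy as np

scores = np.arange(1, 101)  # 1, 2, ..., 100

# 90th percentile: roughly 90% of the values fall below it
print(np.percentile(scores, 90))  # 90.1 (NumPy interpolates between data points)

# 1st decile = 10th percentile
print(np.percentile(scores, 10))  # 10.9

# Quartiles Q1, Q2 (the median), and Q3
print(np.percentile(scores, [25, 50, 75]))  # [25.75 50.5  75.25]
```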
Let’s understand what is meant by an outlier:
An outlier is basically an extremely high or extremely low value in our dataset.
A data point is considered an outlier if it is greater than Q3 (third quartile) + 1.5(IQR) or lower than Q1 (first quartile) − 1.5(IQR).
Now that we have understood what all of these mean, let's pull it all together and calculate them using an example:
Ex: 4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11
Rearrange the numbers in order:
3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18
Calculate the median (Q2): (10+11)/2 = 10.5
Split the data into 2 halves, with the first half less than 10.5 and the 2nd half greater than 10.5:
1st half = 3, 4, 4, 4, 7, 10
2nd half = 11, 12, 14, 16, 17, 18
Q1 = calculate the median of the 1st half = (4+4)/2 = 4
Q3 = calculate the median of the 2nd half = (14+16)/2 = 15
Also:
The lowest value is: 3
The highest value is: 18
The interquartile range is: Q3 − Q1 = 15 − 4 = 11
Let's implement this in Python:
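Below is a minimal sketch that mirrors the hand calculation above, using Python's built-in statistics module (the author's full code is in the gist linked below). Note that library functions such as numpy.percentile interpolate between data points, so they may give slightly different quartiles than the split-the-halves method used here:

```python
import statistics

data = sorted([4, 17, 7, 14, 18, 12, 3, 16, 10, 4, 4, 11])
# -> [3, 4, 4, 4, 7, 10, 11, 12, 14, 16, 17, 18]

n = len(data)
q2 = statistics.median(data)           # (10 + 11) / 2 = 10.5
q1 = statistics.median(data[:n // 2])  # median of the lower half = 4.0
q3 = statistics.median(data[n // 2:])  # median of the upper half = 15.0
iqr = q3 - q1                          # 15 - 4 = 11.0

# Tukey's fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(q1, q2, q3, iqr)  # 4.0 10.5 15.0 11.0
print(outliers)         # [] -- no outliers in this dataset
```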
The GitHub code for the above examples is below:
https://gist.github.com/tkhan0/c0a2c7f04409c8eb667351098b5ecae0