I have two data sets consisting of 7 observations each:
Set#1: 10, 2, 3, 2, 4, 2, 5
Set#2: 20, 12, 13, 12, 14, 12, 15
For each, I will compute measures of central tendency and variation.
These measures are fundamental statistical concepts that I need to master on my journey to understand advanced analytics.
A little bit of knowledge first
Central tendency is a fundamental statistical concept that helps describe where a data set’s “center” or “typical” value lies. It provides a way to summarize and understand a group of data points. There are three primary measures of central tendency:
Mean, Median, and Mode.
Variation measures in statistics help you understand how data points in a dataset are spread out or dispersed from the central tendency (mean, median, or mode). They provide insights into the variability or consistency of your data. Here are some fundamental variation measures for beginners in statistics: Range, Variance, Standard Deviation, Interquartile Range, and Coefficient of Variation.
For my sample datasets, I will demonstrate how to obtain those measures discussed above (I will not include the coefficient of variation at this time).
First, I will convert those data sets into vectors, ‘set1’ and ‘set2’:
> set1 <- c(10, 2, 3, 2, 4, 2, 5)
> set2 <- c(20, 12, 13, 12, 14, 12, 15)
> ### Central tendency measures and results
> # mean() calculates the arithmetic average
> mean(set1)
[1] 4
> mean(set2)
[1] 14
> # median() returns the middle value of the data set sorted in ascending order
> median(set1)
[1] 3
> median(set2)
[1] 13
> # The statistical mode is the value that appears most frequently in a data set.
> # Note: R's mode() does not compute it; it returns the storage mode (type) of the object.
> mode(set1)
[1] "numeric"
> # "numeric" means that set1 is stored as a numeric vector. Counting by hand, the most frequent value in set1 is 2 (it appears three times).
> mode(set2)
[1] "numeric"
> # Same as above. The most frequent value in set2 is 12 (it appears three times).
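Base R's mode() reports how an object is stored, not its most frequent value, and as far as I know base R has no built-in function for the statistical mode. A minimal sketch of one way to get it is to count the values with table(). The helper name `stat_mode` is hypothetical, something I made up for this post.

```r
# Hypothetical helper: returns the most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  as.numeric(top)                              # table() keeps the values as character names
}

set1 <- c(10, 2, 3, 2, 4, 2, 5)
set2 <- c(20, 12, 13, 12, 14, 12, 15)

stat_mode(set1)  # 2
stat_mode(set2)  # 12
```

If several values tie for the highest count, this sketch returns all of them.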
> ### Review samples
> summary(set1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     2.0     3.0     4.0     4.5    10.0
> summary(set2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   12.0    12.0    13.0    14.0    14.5    20.0
> ### Variation measures and results
> # range() returns the minimum and maximum values. The range as a single number is the max minus the min (10 - 2 = 8 for set1).
> range(set1)
[1]  2 10
> range(set2)
[1] 12 20
> # IQR() is the difference between the third quartile (Q3) and the first quartile (Q1).
> IQR(set1)
[1] 2.5
> IQR(set2)
[1] 2.5
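Since range() prints the min and max rather than their difference, here is a short sketch of how to get the range as one number, and how the IQR comes out of the quartiles. I am assuming the default quantile method (type 7), which is what IQR() uses unless told otherwise.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)

# Range as a single number: max minus min
diff(range(set1))                     # 10 - 2 = 8

# Quartiles behind the IQR (default quantile method, type 7)
q <- quantile(set1, c(0.25, 0.75))    # Q1 = 2, Q3 = 4.5
unname(q[2] - q[1])                   # 2.5, same as IQR(set1)
```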
> # var() is the sample variance: the sum of squared deviations from the mean, divided by n - 1.
> var(set1)
[1] 8.333333
> var(set2)
[1] 8.333333
> # sd() is the standard deviation, the square root of the variance.
> sd(set1)
[1] 2.886751
> sd(set2)
[1] 2.886751
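To check my understanding of var() and sd(), here is a sketch that rebuilds both by hand for set1. The variable names (`deviations`, `sum_sq`, `manual_var`, `manual_sd`) are mine. Note that R divides by n - 1, not n, because var() is the sample variance.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)

deviations <- set1 - mean(set1)             # 6 -2 -1 -2 0 -2 1
sum_sq <- sum(deviations^2)                 # 36 + 4 + 1 + 4 + 0 + 4 + 1 = 50
manual_var <- sum_sq / (length(set1) - 1)   # 50 / 6 = 8.333333
manual_sd <- sqrt(manual_var)               # 2.886751

all.equal(manual_var, var(set1))            # TRUE
all.equal(manual_sd, sd(set1))              # TRUE
```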
My two cents:
The standard deviations of both data sets are the same (both are about 2.89), so the two sets have the same spread. This is no coincidence: every value in set 2 is the corresponding value in set 1 plus 10, and adding a constant shifts the data without changing its spread. The mean, median, and the other location statistics from summary() are each 10 higher in set 2.
The standard deviation measures the spread or dispersion of data points around the mean. Because the standard deviations are equal, the variability in both sets is the same. The different means and medians show that the two data sets are centered at different values.
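The reason the spread matches while the center differs can be checked directly: set2 is just set1 with 10 added to every value. A quick sketch:

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)
set2 <- c(20, 12, 13, 12, 14, 12, 15)

all(set2 == set1 + 10)            # TRUE: set2 is set1 shifted by 10
mean(set2) - mean(set1)           # 10
median(set2) - median(set1)       # 10
all.equal(sd(set1), sd(set2))     # TRUE
all.equal(IQR(set1), IQR(set2))   # TRUE
```

Adding a constant moves every measure of center by that constant and leaves every measure of spread alone.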
In a normal distribution, the mean, median, and standard deviation are related in a specific way:
- The mean and median are equal.
- The standard deviation determines the spread or width of the distribution.
In this case, both data sets have the same standard deviation but different means and medians, so they cannot both come from one normal distribution with the same parameters. Also, within each set the mean (4 and 14) is larger than the median (3 and 13), which points to right skew rather than the symmetry of a normal distribution.
To assess whether they closely follow a normal distribution, I would typically use graphical methods such as histograms and Q-Q plots, along with formal statistical tests.
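Here is a rough sketch of what those checks could look like in base R for set1. The Shapiro-Wilk test is only one option, and picking it is my assumption, since I have not settled on a specific test. With just 7 observations, none of these checks can say much, so I would treat the output as a first look only.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)

hist(set1, main = "set1")      # histogram: look for symmetry around the center
qqnorm(set1)                   # Q-Q plot: points near the line suggest normality
qqline(set1)                   # reference line for the Q-Q plot

result <- shapiro.test(set1)   # Shapiro-Wilk test (my choice; needs 3 to 5000 values)
result$p.value                 # a small p-value is evidence against normality
```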