Calculating For Data Normality Jibarra
Normally Distributed Data: bell-curve shape with tails of the same length on both sides. This data
is analyzed using parametric tests. Parametric tests assume that the data follow a specific
distribution, most often the normal distribution.
Not Normally Distributed: often a skewed shape, where one tail is longer than the other and the
shape is not symmetric. This data is analyzed using non-parametric tests.
Skew: A positive skew value means the tail is longer on the right side, and the mean is typically
higher than the median value. A negative skew value means the tail is longer on the left side, and
the mean is typically lower than the median value. A skew of 0 means the data is symmetrical.
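Kurtosis: How much of the data sits in the peak and tails compared with a normal distribution
(reported here as excess kurtosis). A positive kurtosis value means that there is a sharp, narrow
peak with heavy tails. A negative kurtosis value means that there is a low/flat peak with light
tails. A kurtosis of 0 means the peak and tails match those of a normal distribution; it says
nothing about symmetry, which is what skew measures.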
P-Value: The probability of seeing results at least as extreme as the observed ones if the null
hypothesis were true. It is used in statistical analysis to decide whether to reject the null
hypothesis. Typically the p-value is compared to a significance (alpha) level of 0.05.
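● For data normality using a Shapiro-Wilk test, the null hypothesis is that the data is
normally distributed. If the p-value is less than 0.05, the null hypothesis is rejected
and the data is treated as NOT normally distributed.
● For data normality using a Shapiro-Wilk test, if the p-value is greater than 0.05,
the test finds no evidence against normality, and the data is treated as normally
distributed.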
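The decision rule in the two bullets above can be sketched in a few lines of Python. This is only an illustration: the function name `shapiro_wilk_decision` and the two example p-values are made up for the sketch, and the function does not run the test itself, it only reads a p-value that a tool has already reported.

```python
def shapiro_wilk_decision(p_value, alpha=0.05):
    """Read a normality-test p-value against a significance level.

    Hypothetical helper for illustration; it does not compute the test.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        # Null hypothesis (data is normal) is rejected.
        return "reject normality"
    # Failing to reject is not proof of normality, only a lack of evidence against it.
    return "no evidence against normality"


# Made-up p-values, one on each side of alpha = 0.05.
print(shapiro_wilk_decision(0.03))
print(shapiro_wilk_decision(0.40))
```

A result of "no evidence against normality" is what the exercise treats as "normally distributed"; with small samples the test can miss real departures from normality.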
The following exercises are designed to reinforce the above concepts and to inspire you to think
about the different ways in which data can be presented and interpreted. In a post below, report
on the following activities/questions. Attach your spreadsheet to your post with your work AND
include a Histogram of your data within your spreadsheet.
Practice Example (Google Sheets):
Imagine you are comparing the size of hardwood trees in two different study watersheds to assess
the successional stage of each forest. You choose to measure the DBH (diameter at breast height;
a commonly used metric in forestry) of a sample of trees in each watershed. This results in the
following data:
a. DBH of trees in "Watershed A" (in cm): 84, 78, 51, 62, 55, 72, 34, 45, 89
b. DBH of trees in "Watershed B" (in cm): 54, 118, 142, 5, 36, 115, 12, 14
c. Enter the above data into Google Sheets and insert a Histogram chart for your
data. Put the watershed name ("Watershed A" or "Watershed B") in one column and the
DBH value for that tree in the second column, so both watersheds share the same two
columns. Select Insert -> Chart and scroll down to select Histogram. You can customize
the bins/buckets for the data range in the histogram to give you a better idea of the
shape of the dataset. Use spreadsheet functions (AVERAGE, MEDIAN, SKEW, KURT) to
calculate the average (mean), median, skew, and kurtosis for the combined dataset.
● Mean: 62.706
● Median: 55
● Kurtosis: -0.332
● Skew: 0.432
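As a cross-check on the four values above, here is a minimal Python sketch that recomputes them for the 17 combined DBH values. It assumes the spreadsheet's SKEW and KURT use the bias-adjusted sample formulas documented for Excel; the function names `sample_skew` and `sample_excess_kurtosis` are invented for this sketch.

```python
import statistics

watershed_a = [84, 78, 51, 62, 55, 72, 34, 45, 89]
watershed_b = [54, 118, 142, 5, 36, 115, 12, 14]
dbh = watershed_a + watershed_b  # combined dataset, 17 trees


def sample_skew(values):
    """Bias-adjusted sample skewness (assumed to match the spreadsheet SKEW)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (assumed to match the spreadsheet KURT)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction


print(f"mean     {statistics.mean(dbh):.3f}")
print(f"median   {statistics.median(dbh)}")
print(f"skew     {sample_skew(dbh):.3f}")
print(f"kurtosis {sample_excess_kurtosis(dbh):.3f}")
```

Under these assumptions the results come out close to the reported values: a skew of about 0.43 and a kurtosis of about -0.33. Other tools may use population formulas instead and give slightly different numbers.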
1. Are the mean and median values the same or different?
The mean and median values are different.
2. Are the kurtosis and skew values 0 or different from 0?
Both values are different from 0 (the kurtosis value is -0.332 and the skew value is 0.432).
3. Based on this information, and the shape of your histogram, would you assume that your
data is normally distributed or not normally distributed?
Based on the above information and the shape of my histogram, I would assume that my
data is not normally distributed because the tail of the histogram is longer on the positive
side (right side), which means the dataset is positively skewed.
4. Copy and paste your histogram graph below.
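For readers without the spreadsheet at hand, the bucket counts behind such a histogram can be sketched in Python. The 25 cm bucket width is an assumption chosen for this sketch, not the setting used in the original chart, and `bucket_counts` is a hypothetical helper name.

```python
from collections import Counter

# Combined DBH values (cm) for Watershed A and Watershed B.
dbh = [84, 78, 51, 62, 55, 72, 34, 45, 89,
       54, 118, 142, 5, 36, 115, 12, 14]

BUCKET_WIDTH = 25  # cm; assumed for illustration only


def bucket_counts(values, width):
    """Count values per bucket [start, start + width), keyed by bucket start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


counts = bucket_counts(dbh, BUCKET_WIDTH)
for start, count in counts.items():
    # One '#' per tree gives a rough text version of the histogram bars.
    print(f"[{start:>3}, {start + BUCKET_WIDTH:>3}) {'#' * count}")
```

Changing `BUCKET_WIDTH` changes the apparent shape, which is why the exercise suggests experimenting with the bucket settings before judging normality by eye.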
Watershed A & B Dataset (Stats Kingdom)
Go to the Stats Kingdom page for testing normality with a Shapiro-Wilk test. Clear the example
data shown there, and enter the watershed dataset. Make sure the alpha (significance) level is
set to 0.05, and keep the outliers included. Select Calculate.
● P-value: 0.840008
5. What is the outcome of the Shapiro-Wilk test? In other words, is the data normally or not
normally distributed?
The Shapiro-Wilk test does not reject normality because the p-value (0.840008) is greater
than 0.05, so the data is treated as normally distributed.
Finding a Dataset Online:
In the same spreadsheet, add a new sheet at the bottom of your Google Sheets file, and report the
results of a quick "Internet Investigation" study. For example, you might investigate which is more
popular, country music or hip hop, by looking at the number of views on YouTube for the top ten
videos for each. Be creative. Whatever you choose, compare at least two groups, include a brief
description of what you looked at, and report the mean, median, range, and standard deviation for
each dataset. Also include this data in the spreadsheet you attach to your post.
d. Google Dataset Search
e. Kaggle
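The per-group summary the task asks for (mean, median, range, and standard deviation) can be sketched as follows. Since the raw view counts for the internet dataset are not listed in this document, the two watershed samples from the practice example stand in as the two groups; `summarize` is a hypothetical helper, and the sample (n - 1) standard deviation is an assumption about which version is wanted.

```python
import statistics

# Stand-in groups: the two watershed samples from the practice example (DBH in cm).
groups = {
    "Watershed A": [84, 78, 51, 62, 55, 72, 34, 45, 89],
    "Watershed B": [54, 118, 142, 5, 36, 115, 12, 14],
}


def summarize(values):
    """Return mean, median, range, and sample standard deviation for one group."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


for name, values in groups.items():
    s = summarize(values)
    print(f"{name}: mean={s['mean']:.1f} median={s['median']:.1f} "
          f"range={s['range']} stdev={s['stdev']:.1f}")
```

Reporting each group separately, rather than only the combined values, makes differences between the groups visible: here the two watersheds have nearly the same mean but very different ranges.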
● What is your identified dataset?
Dataset: Top 10 Pop Songs of 2019 and Top 10 Iconic Classical Pieces of 2019
● Mean: 643,966,732
● Median: 57,926,820
● Kurtosis: 2.028
● Skew: 1.757
● P-Value (Stats Kingdom): 0.0000231228
6. Are the mean and median values the same or different?
The mean and median values are drastically different.
7. Are the kurtosis and skew values 0 or different from 0?
Both values are different from 0 (the kurtosis value is 2.028 and the skew value is 1.757).
8. Based on this information, and the shape of your histogram, would you assume that your
data is normally distributed or not normally distributed?
Based on the above information and the shape of my histogram, I would assume that the
data is not normally distributed because it is strongly positively skewed (skew of 1.757) and
the mean is far higher than the median.
9. Copy and paste your histogram graph below.
10. Based on the outcome of the Shapiro-Wilk test in Stats Kingdom, is your identified
dataset normally or not normally distributed?
Based on the outcome of the Shapiro-Wilk test in Stats Kingdom, my identified dataset is
not normally distributed because the p-value (0.0000231228) is less than 0.05.