Introduction To Statistics


Introduction to Statistical Measures
Mean
• The mean is the average value in a collection of numbers.

• In statistics, it is a measure of the central tendency of a
probability distribution, along with the median and mode. It is also
referred to as the expected value.
How to Calculate Mean?

When should you not use Mean?

The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical, such as when your data is normally
distributed. It is sensitive to outliers, so it is a weaker choice for skewed data.
However, it all depends on what you are trying to show from your data.
Median
• Median is a statistical measure that determines the middle value of a dataset listed in
ascending order (i.e., from smallest to largest value).

• When should you be using or not using Median?


The median is the most informative measure of central tendency for skewed
distributions or distributions with outliers. For example, the median is often used as a
measure of central tendency for income distributions, which are generally highly skewed.
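
A minimal sketch of why the median is preferred for skewed data. The income figures are hypothetical and chosen only so that one value is a clear outlier:

```python
import statistics

# Hypothetical monthly incomes; the last value is a deliberate outlier.
incomes = [2100, 2300, 2400, 2500, 2600, 2800, 25000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # middle value, unaffected by it

print("mean:", round(mean_income, 2))
print("median:", median_income)
```

Six of the seven values lie below the mean here, while the median stays in the middle of the typical incomes.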

• What Is a Skewed Distribution?


A distribution is said to be skewed when the data points cluster more toward one
side of the scale than the other, creating a curve that is not symmetrical. In other
words, the right and the left side of the distribution are shaped differently from each other.
How to Find Median?
The median can be easily found. In some cases, it does not require any
calculations at all. The general steps of finding the median include:
1. Arrange the data in ascending order (from the smallest to the largest value).
2. Determine whether there is an even or an odd number of values in the
dataset.
3. Depending on the result of the previous step, follow one of two
scenarios:
• If the dataset contains an odd number of values, the median is the central
value that splits the dataset into halves.
• If the dataset contains an even number of values, find the two central
values that split the dataset into halves. Then, calculate the mean of the
two central values. That mean is the median of the dataset.
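
The steps above can be sketched in plain Python. The function name find_median and the two short input lists are illustrative assumptions, not part of the original slides:

```python
def find_median(values):
    """Return the median by following the sort / odd-or-even steps."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single central value
        return ordered[middle]
    # even count: mean of the two central values
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_result = find_median([7, 1, 5])       # sorted: 1, 5, 7
even_result = find_median([7, 1, 5, 3])   # sorted: 1, 3, 5, 7
print(odd_result, even_result)
```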
Mode
• The mode is the most frequent score in our data set. On a bar chart
or histogram it is represented by the highest bar. You can,
therefore, sometimes consider the mode as being the most popular
option. An example of a mode is presented below:
Example
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

Find mean, median and mode accordingly


• Mean: The mean value is the average value.
• To calculate the mean, find the sum of all values, and divide the sum
by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
• You can also use the mean() function of the NumPy library to calculate
the mean.
Python Code to calculate mean
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Arithmetic mean: sum of the values divided by their count
x = numpy.mean(speed)

print(x)
Median example
• The median value is the value in the middle, after you have sorted
all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
The middle (7th) value is 87.
• You can also use the median() function of the NumPy library to calculate the median.
Python Code for Median
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Middle value of the sorted data
x = numpy.median(speed)

print(x)
Mode
• The mode is the value that appears the greatest number of times.
• The SciPy stats module has a mode() function for this.
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Returns the most frequent value together with its count
x = stats.mode(speed)

print(x)
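
As a cross-check that needs no third-party library, the same mode can be found by counting occurrences. This is a sketch using the standard library's collections.Counter, not the method shown on the slide:

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(speed)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```

If several values tie for the highest count, most_common(1) returns only one of them, so a multimodal dataset needs a separate check.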
Standard Deviation
• Standard deviation is a measure of
dispersion in statistics.
• “Dispersion” tells you how
much your data is spread out.
• Specifically, it shows you how
much your data is spread out
around the mean or average.
Standard Deviation

The square root of the variance is the standard
deviation. So what is variance?
Variance
• Variance measures how far a data set is spread out. For a whole population it is
mathematically defined as the average of the squared differences from the mean.
• The variance for a sample is calculated by:
1. Finding the mean (the average).
2. Subtracting the mean from each number in the data set and then
squaring the result. The results are squared to make the negatives
positive. Otherwise negative numbers would cancel out the positives in
the next step. It’s the distance from the mean that’s important, not
positive or negative numbers.
3. Adding up the squared differences.
4. Dividing the sum by the sample size minus 1 (n - 1). Dividing by n instead
gives the population variance.
Example
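• Speed = [32,111,138,28,59,77,97]
Step 1: Find the mean:
(32+111+138+28+59+77+97)/7 = 77.43
Step 2: For each value, find the difference from the mean:
-45.43, 33.57, 60.57, -49.43, -18.43, -0.43, 19.57
Step 3: For each difference, find the square value:
2063.76, 1127.04, 3668.90, 2443.18, 339.61, 0.18, 383.04
Step 4: The (population) variance is the average of these squared differences:
10025.71 / 7 = 1432.24

import numpy

speed = [32,111,138,28,59,77,97]

# numpy.var divides by n by default (population variance);
# pass ddof=1 to divide by n - 1 (sample variance)
x = numpy.var(speed)

print(x)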
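
The variance steps, and the standard deviation as its square root, can also be sketched without NumPy. The variable names are illustrative; the data is the speed list from the example:

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]
n = len(speed)

mean = sum(speed) / n                              # step 1
squared_diffs = [(x - mean) ** 2 for x in speed]   # steps 2 and 3

population_variance = sum(squared_diffs) / n       # divide by n
sample_variance = sum(squared_diffs) / (n - 1)     # divide by n - 1

# Standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2))
print(round(population_std, 2), round(sample_std, 2))
```

The sample figures are always somewhat larger than the population figures because the same sum is divided by a smaller number.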
Z-Score
• Simply put, a z-score (also called a standard score) gives you an
idea of how far from the mean a data point is. But more technically
it’s a measure of how many standard deviations below or above the
mean a score is.
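
A minimal sketch of the z-score calculation, z = (value - mean) / standard deviation. It reuses the speed data from the variance example and assumes the population standard deviation; the helper name z_score is hypothetical:

```python
import statistics

speed = [32, 111, 138, 28, 59, 77, 97]

mean = statistics.fmean(speed)
std = statistics.pstdev(speed)  # population standard deviation (assumed here)

def z_score(value):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std

z_fastest = z_score(138)
z_slowest = z_score(28)
print(round(z_fastest, 2), round(z_slowest, 2))
```

A positive z-score means the value lies above the mean, a negative one below it.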
Correlation
• Correlation describes the relationship between two variables.
• Correlation coefficients are used to measure how strong the
relationship is. The formulas return a value between -1
and 1, where:
• 1 indicates a perfect positive linear relationship.
• -1 indicates a perfect negative linear relationship.
• A result of zero indicates no linear relationship at all.
Correlation Coefficients
• Pearson’s
Practice Question

Find the value of the correlation coefficient from the following table :

SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
Answers
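
One way to work the practice question is to compute Pearson's coefficient directly from its definition. This is a sketch in plain Python; the function name pearson_r is an illustrative choice:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r = pearson_r(age, glucose)
print(round(r, 4))
```

For this table the coefficient comes out at roughly 0.53, a moderate positive relationship between age and glucose level.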
Sampling Distributions
• The distribution of sample means is defined as the set of means from all the
possible random samples of a specific size (n) selected from a specific population.
• A sampling distribution is the distribution of a statistic computed from repeated
samples of the data. While, technically, you could choose any statistic, some common ones
you’ll come across are:
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Range
Example of Population Distributions
• Let’s take a sample of size n and calculate a statistic
(it could be the mean, median, range, standard deviation or variance)
to estimate the value of the population parameter.
• Let’s take the mean of a population consisting of three values: 1, 2 and 3.
• The population mean is then (1 + 2 + 3) / 3 = 2.
• We can also check the means of different
samples in order to get an estimated value of the parameter.
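
A small sketch of a sampling distribution of the mean, assuming the population consists of the three values 1, 2 and 3 and that samples of size 2 are drawn without replacement (the sample size is an assumption made for illustration):

```python
from itertools import combinations
from statistics import mean

population = [1, 2, 3]
sample_size = 2  # assumed for illustration

# Every possible sample of the chosen size, and the mean of each one.
samples = list(combinations(population, sample_size))
sample_means = [mean(s) for s in samples]

print(samples)
print(sample_means)
print(mean(sample_means), mean(population))
```

The individual sample means vary, but their average equals the population mean, which is why a sample mean is used to estimate it.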
Regression
• Linear Regression

• Multiple Linear Regression

• Polynomial Regression
Linear Regression
• A linear regression is where the relationship between your variables
can be described with a straight line. Non-linear regressions produce
curved lines.
• It is one of the most widely used statistical techniques; it is a way to model a
relationship between two variables. The result is a linear
regression equation that can be used to make predictions about data.
• y’ = a + bx, where a is the intercept and b is the slope of the line
(also denoted b0 and b1, respectively).
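
The intercept a and slope b can be estimated by ordinary least squares. The sketch below reuses the age and glucose values from the correlation practice question as example data; the function name fit_line is illustrative:

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y' = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

a, b = fit_line(age, glucose)
predicted = a + b * 50   # predicted glucose level at age 50

print(round(a, 2), round(b, 4), round(predicted, 2))
```

With only six data points and a moderate correlation, such a prediction is a rough estimate at best.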
Linear Regressions
• A linear regression plot graphs the linear relationship
between an independent and a dependent variable. It is typically
used to visually show the strength of the relationship and the
dispersion of results, all for the purpose of explaining the behavior
of the dependent variable.

• Example
• Answers
Multiple Linear Regression
• Multiple regression is like linear regression but with more than one
independent variable, meaning that we try to predict a value based on
two or more variables.
• From the sklearn module we will use LinearRegression() to
create a regression object.
• This object has a method called fit() that takes the independent and
dependent values as parameters and fills the regression object with
data that describes the relationship.
• We then have a regression object that is ready to predict CO2
values based on a car's weight and volume:
Python Code
import pandas
from sklearn import linear_model

# Raw string so the backslashes in the Windows path are not read as escapes
df = pandas.read_csv(r"D:\python files\cars.csv")

# Independent variables (features) and dependent variable (target)
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3:
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)

• We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of
CO2 for every kilometer it drives.
Polynomial Regression
• If your data points clearly will not fit a linear regression (a straight
line through all data points), it might be ideal for polynomial
regression.
• Polynomial regression, like linear regression, uses the relationship
between the variables x and y to find the best way to draw a line
through the data points.
• For example, we have registered 18 cars as they were passing a certain
tollbooth.
• We have registered each car's speed and the time of day (hour) the
passing occurred.
• The x-axis represents the hours of the day and the y-axis represents
the speed.
• First, let's draw a scatter plot:
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

plt.scatter(x, y)
plt.show()
• Let's then draw the line of polynomial regression by importing
NumPy and Matplotlib together.
Contd.
• NumPy has a method that lets us make a polynomial model:
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
• Then specify how the line will display; we start at position 1 and end
at position 22:
myline = numpy.linspace(1, 22, 100)
• Draw the line of polynomial regression:
plt.plot(myline, mymodel(myline))
• Display the diagram:
plt.show()
Python Code in its entirety
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Goodness of Fit
• The goodness of fit test is used to test whether sample data fits a distribution
from a certain population.
• In other words, it tells you if your sample data represents the data you
would expect to find in the actual population. Goodness of fit tests
commonly used in statistics are:
• Chi-Squared Test.
• Kolmogorov-Smirnov.
• Anderson-Darling.
• Shapiro-Wilk.
Chi-Squared Test

$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$

The subscript “c” is the degrees of freedom. “O” is your observed value and
“E” is your expected value.

To interpret the test, you’ll need to choose an alpha level (1%, 5% and 10%
are common). The chi-square test will return a p-value. If the p-value is small
(less than the significance level), you can reject the null hypothesis that the
data comes from the specified distribution. 
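
A minimal sketch of the chi-squared statistic, the sum of (O - E)^2 / E over all categories. The observed counts are hypothetical (100 rolls of a die assumed to be fair), and the critical value 11.07 is the standard table value for 5 degrees of freedom at an alpha level of 5%:

```python
# Hypothetical observed counts for faces 1..6 over 100 rolls of a die.
observed = [16, 18, 16, 14, 12, 24]

# Under the null hypothesis (a fair die) every face is equally likely.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

CRITICAL_VALUE_5PCT_DF5 = 11.07  # table value for df = 5, alpha = 0.05
reject_null = chi_squared > CRITICAL_VALUE_5PCT_DF5

print(round(chi_squared, 2), degrees_of_freedom, reject_null)
```

Comparing the statistic with a critical value is equivalent to comparing the p-value with alpha. Here the statistic stays below the critical value, so these counts give no grounds to reject the fair-die hypothesis.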
Goodness of fit
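• Since p < 0.05 is enough to reject the null hypothesis, a p-value of
0.002 associated with the chi-square statistic is very strong evidence
against it. In a goodness-of-fit test the null hypothesis is that the
data comes from the specified distribution, so rejecting it means the
data does not fit that distribution well. In a chi-square test of
association, where the null hypothesis is no association, the same
p-value would indicate that an association exists.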
