Introduction To Statistics
Measures
Mean
• The mean is the average value in a collection of numbers.
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
Median
• The median value is the value in the middle, after you have sorted
all the values:
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
Mode
• The mode is the value that appears most often:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
Standard Deviation
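• Standard deviation is a measure of dispersion in statistics: it describes how spread out the values are around the mean. It is the square root of the variance.
import numpy
speed = [32,111,138,28,59,77,97]
# numpy.std() returns the standard deviation; numpy.var() would return the variance
x = numpy.std(speed)
print(x)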
Z-Score
• Simply put, a z-score (also called a standard score) gives you an
idea of how far from the mean a data point is. But more technically
it’s a measure of how many standard deviations below or above the
mean a score is.
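The definition above corresponds to the formula z = (x - mean) / standard deviation. As a minimal sketch, assuming the population standard deviation (NumPy's default) and reusing the speed list from the Mean example, the z-score of every value can be computed like this:

```python
import numpy

# Speed data reused from the Mean example
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

values = numpy.array(speed)
mean = values.mean()
std = values.std()  # population standard deviation (ddof=0), an assumption

# z-score: how many standard deviations each value lies from the mean
z_scores = (values - mean) / std
print(z_scores)
```

Values above the mean get a positive z-score and values below it get a negative one; the z-scores themselves always have mean 0 and standard deviation 1.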
Correlation
• Correlation describes the relationship between two variables.
• A correlation coefficient measures how strong that relationship is. Correlation coefficient formulas return a value between -1 and 1, where:
• 1 indicates a perfect positive linear relationship.
• -1 indicates a perfect negative linear relationship.
• 0 indicates no linear relationship at all.
Correlation Coefficients
• Pearson’s
Practice Question
Find the value of the correlation coefficient from the following table:
SUBJECT   AGE (X)   GLUCOSE LEVEL (Y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Answers
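One way to check the practice question is to let NumPy compute Pearson's coefficient for the age and glucose columns of the table. This is a sketch rather than the worked solution from the slides; the variable names are chosen here for illustration.

```python
import numpy

# Columns of the practice table
age = [43, 21, 25, 42, 57, 59]      # X
glucose = [99, 65, 79, 75, 87, 81]  # Y

# corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = numpy.corrcoef(age, glucose)[0, 1]
print(round(r, 4))
```

The result is roughly 0.53, a moderate positive relationship between age and glucose level.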
Sampling Distributions
• The distribution of sample means is defined as the set of means from all the possible random samples of a specific size (n) selected from a specific population.
• More generally, a sampling distribution is the distribution of a statistic computed from repeated samples of your data. Technically you could choose any statistic, but some common ones you'll come across are:
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Range
Example of Population Distributions
• Let's take a sample of size n and calculate a statistic (it could be the mean, median, range, standard deviation, or variance) to estimate the value of the corresponding population parameter.
• Suppose the population consists of the three values 1, 2 and 3.
• The population mean is then (1 + 2 + 3) / 3 = 6 / 3 = 2.
• We can then compute the means of different samples drawn from this population to obtain estimates of the parameter.
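The idea can be sketched in code by listing every possible sample from the population 1, 2, 3 and computing each sample's mean. The sample size n = 2 and sampling with replacement are assumptions made here for illustration; the slides do not specify them.

```python
from itertools import product
from statistics import mean

population = [1, 2, 3]
n = 2  # sample size, an assumption for this sketch

# Every possible ordered sample of size n, drawn with replacement
samples = list(product(population, repeat=n))

# The sampling distribution of the mean: one mean per sample
sample_means = [mean(s) for s in samples]
mean_of_sample_means = mean(sample_means)

print(sample_means)
print(mean_of_sample_means)
```

The mean of all the sample means equals the population mean of 2, which is why the sample mean is a reasonable estimate of that parameter.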
Regression
• Linear Regression
• Polynomial Regression
Linear Regression
• A linear regression is one where the relationship between your variables can be described with a straight line. Non-linear regressions produce curved lines.
• It is one of the most widely used statistical techniques; it is a way to model the relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.
• y' = a + bx, where a is the intercept and b is the slope of the line (also denoted b0 and b1, respectively).
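As a minimal sketch of fitting y' = a + bx, the age and glucose data from the practice question can be passed to numpy.polyfit with degree 1. The helper name predict is hypothetical and only used here for illustration.

```python
import numpy

# Age (X) and glucose level (Y) from the practice table
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

# A degree-1 polynomial fit returns the slope first, then the intercept
b, a = numpy.polyfit(age, glucose, 1)

def predict(x):
    """Predicted glucose level for age x using y' = a + bx."""
    return a + b * x

print(a, b)
```

The fitted line is approximately y' = 65.14 + 0.385x, so each additional year of age is associated with about 0.385 more units of glucose level in this small sample.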
Linear Regressions
• A linear regression plot graphs the linear relationship between an independent and a dependent variable. It is typically used to show visually the strength of the relationship and the dispersion of the results, in order to explain the behavior of the dependent variable.
• Example
• Answers
Multiple Linear Regression
• Multiple regression is like linear regression but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
• From the sklearn module we will use LinearRegression() to create a regression object.
• This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship.
• Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume:
Python Code
import pandas
from sklearn import linear_model
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pandas.read_csv(r"D:\python files\cars.csv")
# Independent variables (X) and dependent variable (y)
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
# Predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
• We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.
Polynomial Regression
• If your data points clearly will not fit a linear regression (a straight
line through all data points), it might be ideal for polynomial
regression.
• Polynomial regression, like linear regression, uses the relationship
between the variables x and y to find the best way to draw a line
through the data points.
• For example, we have registered 18 cars as they were passing a certain tollbooth.
• We have registered each car's speed and the time of day (hour) the passing occurred.
• The x-axis represents the hours of the day and the y-axis represents the speed.
• First, let's draw a scatter plot:
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
plt.scatter(x, y)
plt.show()
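• Let's then draw the polynomial regression line, using NumPy and Matplotlib together.
• NumPy has a method that lets us make a polynomial model: numpy.polyfit() fits the coefficients and numpy.poly1d() wraps them in a callable model.
• Then specify how the line will display; we start at position 1 and end at position 22:
import numpy
import matplotlib.pyplot as plt
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))  # degree 3 is an assumption; the slides do not state it
myline = numpy.linspace(1, 22, 100)  # 100 evenly spaced points from 1 to 22
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()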
Goodness of Fit
• A goodness of fit test is used to test whether sample data fits a distribution from a certain population.
• In other words, it tells you whether your sample data represents the data you would expect to find in the actual population. Goodness of fit tests commonly used in statistics are:
• Chi-Squared Test
• Kolmogorov-Smirnov
• Anderson-Darling
• Shapiro-Wilk
Chi-Squared Test
To interpret the test, you’ll need to choose an alpha level (1%, 5% and 10%
are common). The chi-square test will return a p-value. If the p-value is small
(less than the significance level), you can reject the null hypothesis that the
data comes from the specified distribution.
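The interpretation above can be sketched with scipy.stats.chisquare. The observed counts below are hypothetical (imagined rolls of a six-sided die), and the null hypothesis is that all six faces are equally likely, which is what chisquare assumes when no expected frequencies are given.

```python
from scipy import stats

# Hypothetical observed counts for the six faces of a die (illustration only)
observed = [16, 18, 16, 14, 12, 12]

# With no expected frequencies given, chisquare tests against a uniform distribution
result = stats.chisquare(observed)
statistic, p_value = result.statistic, result.pvalue

alpha = 0.05  # chosen significance level
reject_null = p_value < alpha

print(statistic, p_value, reject_null)
```

Here the p-value is well above 0.05, so the null hypothesis is not rejected: these counts are consistent with a fair die.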
Goodness of fit
• A p-value below 0.05 is enough to reject the null hypothesis at the 5% level, so p = 0.002 only reinforces that rejection. If the p-value associated with the chi-square statistic is 0.002, there is very strong evidence against the null hypothesis that the data comes from the specified distribution. In other words, the data does not fit that distribution well.