QUANTITATIVE REASONING

CHAPTER-I

QUANTITATIVE REASONING
BS-III SEMESTER (Final Term Notes)
Mathematical models
A mathematical model is a description of a system using mathematical concepts (for example algebra, graphs,
equations and functions) and mathematical language (for example arithmetic signs). Mathematical modelling is
the process of developing mathematical models.
Deterministic models and statistical models (Probabilistic Model)
Mathematical models can be classified as either deterministic models or statistical models.
A deterministic model is a mathematical model in which the output is determined only by the specified
values of the input data and the initial conditions. This means that a given set of input data will always generate
the same output.
A statistical model is a mathematical model in which some or all of the input data have some
randomness, for example as expressed by a probability distribution, so that for a given set of input data the
output is not reproducible but is described by a probability distribution. The output ensemble is obtained by
running the model a large number of times with a new input value sampled from the probability distribution
each time the model is run. Statistical models can be run by using Monte Carlo simulation.
So, another definition of a statistical model is a mathematical description of a system that accounts for
uncertainty in the system. Statistical modelling is the process of forming a hypothesis for a statistical model on
a set of data, developing a model and then testing it on the data to see if the hypothesis is true.
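The contrast between the two kinds of model can be sketched in a few lines of Python. This is only an illustration under assumed values: the rule 3x - 2, the normal noise term with mean 0 and standard deviation 1, and the function names are all chosen for the example and do not come from the notes.

```python
import random
import statistics

def deterministic_model(x):
    """Same input always gives the same output."""
    return 3 * x - 2

def statistical_model(x, rng):
    """Same rule plus a random term drawn from a normal distribution (assumed noise)."""
    noise = rng.gauss(0.0, 1.0)  # assumption: mean 0, standard deviation 1
    return 3 * x - 2 + noise

def monte_carlo(x, runs, seed=0):
    """Run the statistical model many times and collect the output ensemble."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [statistical_model(x, rng) for _ in range(runs)]

ensemble = monte_carlo(x=4, runs=10_000)
print(deterministic_model(4))        # the single deterministic output
print(statistics.mean(ensemble))     # centre of the output distribution
print(statistics.stdev(ensemble))    # spread of the output distribution
```

The deterministic model returns one number, while the Monte Carlo run returns an ensemble that is summarised here by its mean and standard deviation.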
Definition of Deterministic Models
Deterministic mathematical models are models in which the final outcome is entirely determined by the initial
conditions and parameters of the system. In these models, there is no random or uncertain component involved,
and they always produce the same output for the same input.
Characteristics of Deterministic Models
Deterministic models have the following characteristics.
• They produce consistent results for the same input.
• They do not involve any random component.
• The future state of the system can be precisely predicted.
Examples of deterministic models include classical mechanics, geometric optics, and some deterministic
optimization problems.

Definition of Probabilistic Models


Probabilistic mathematical models, on the other hand, involve randomness and uncertainty in the system. In
these models, the final outcome is expressed as a probability distribution instead of a fixed value, and individual
outcomes may vary even if the initial conditions and parameters are the same.
Characteristics of Probabilistic Models
Probabilistic models have the following characteristics.
• They involve randomness or uncertainty.
• The outcome is expressed as a probability distribution.
• The future state of the system cannot be precisely predicted, but its probability can be estimated.
Examples of probabilistic models include quantum mechanics, genetics, and many statistical models
such as regression or classification models.
Comparison of Deterministic and Probabilistic Models
In summary, the main differences between deterministic and probabilistic models are:
• Deterministic models produce consistent results for the same input and do not involve any random
component, while probabilistic models involve randomness or uncertainty and provide probability
distributions as the outcome.
• Deterministic models allow predicting the future state of the system precisely, while probabilistic
models estimate the probable future state without providing an exact prediction.
• Examples of deterministic models are classical mechanics and geometric optics, while examples of
probabilistic models are quantum mechanics and statistical models.
By understanding these differences, you can recognize when to apply deterministic or probabilistic models in
various contexts and problems.

Linear Function

A linear function is a function that represents a straight line on the coordinate plane. For example, y = 3x - 2
represents a straight line on a coordinate plane and hence it represents a linear function. Since y can be replaced
with f(x), this function can be written as f(x) = 3x - 2.
What is a Linear Function?
A linear function is of the form f(x) = mx + b, where 'm' and 'b' are real numbers. This looks like the
slope-intercept form of a line, y = mx + b, because a linear function represents a line, i.e., its graph
is a line. Here,
'm' is the slope of the line
'b' is the y-intercept of the line
'x' is the independent variable
'y' (or f(x)) is the dependent variable
A linear function is an algebraic function. This is because it involves only algebraic operations.
Linear Function Equation
The parent linear function is f(x) = x, which is a line passing through the origin. In general, a linear function
equation is f(x) = mx + b and here are some examples.
f(x) = 3x - 2
f(x) = -5x - 0.5
f(x) = 3
Real Life Example of Linear Function
Here are some real-life applications of the linear function.
• A movie streaming service charges a monthly fee of $4.50 and an additional fee of $0.35 for every
movie downloaded. The total monthly fee is represented by the linear function f(x) = 0.35x + 4.50,
where x is the number of movies downloaded in a month.
• A T-shirt company charges a one-time fee of $50 and $7 per T-shirt to print logos on T-shirts. So, the
total fee is expressed by the linear function f(x) = 7x + 50, where x is the number of T-shirts.
The linear function is used to represent an objective function in linear programming problems, to help minimize
the cost or maximize the profit.
How to Find a Linear Function?
We use the slope-intercept form or the point-slope form to find a linear function. The process of finding a linear
function is the same as the process of finding the equation of a line and is explained with an example.
Example: Find the linear function that has two points (-1, 15) and (2, 27) on it.
Solution:
The given points are (x₁, y₁) = (-1, 15) and (x₂, y₂) = (2, 27).
Step 1: Find the slope of the function using the slope formula:
m = (y₂ - y₁) / (x₂ - x₁) = (27 - 15) / (2 - (-1)) = 12/3 = 4.
Step 2: Find the equation of the linear function using the point-slope form.
y - y₁ = m (x - x₁)
y - 15 = 4 (x - (-1))
y - 15 = 4 (x + 1)
y - 15 = 4x + 4
y = 4x + 19
Therefore, the equation of the linear function is, f(x) = 4x + 19.
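The two steps above can be sketched in Python. The function name linear_from_points is hypothetical, and exact fractions are used so that slopes such as 12/3 are not rounded; the sketch assumes the coordinates are integers.

```python
from fractions import Fraction

def linear_from_points(p1, p2):
    """Return (m, b) for the line f(x) = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # A vertical line has no slope and is not a function.
        raise ValueError("the two points must have different x-values")
    m = Fraction(y2 - y1, x2 - x1)   # Step 1: slope formula
    b = y1 - m * x1                  # Step 2: solve y1 = m*x1 + b for b
    return m, b

m, b = linear_from_points((-1, 15), (2, 27))
print(f"f(x) = {m}x + {b}")   # f(x) = 4x + 19
```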
Identifying a Linear Function
If the information about a function is given as a graph, then it is linear if the graph is a line. If the information
about the function is given in the algebraic form, then it is linear if it is of the form f(x) = mx + b. But to see
whether the given data in a table format represents a linear function:
Compute the differences in x-values.
Compute the differences in y-values
Check whether the ratio of the difference in y-values to the difference in x-values is always constant.
Example: Determine whether the following data from the following table represents a linear function.
x y
3 15
5 23
7 31
11 47
13 55
Solution:
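We will compute the differences in x-values, the differences in y-values, and the ratio (difference in
y)/(difference in x) for each pair of consecutive rows and see whether this ratio is a constant.
Difference in x    Difference in y    Ratio
5 - 3 = 2          23 - 15 = 8        8/2 = 4
7 - 5 = 2          31 - 23 = 8        8/2 = 4
11 - 7 = 4         47 - 31 = 16       16/4 = 4
13 - 11 = 2        55 - 47 = 8        8/2 = 4
Since all numbers in the last column are equal to the same constant (4), the data in the given table represents
a linear function.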
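The same check can be written as a short Python sketch. The helper name is_linear is hypothetical, and the sketch assumes integer table entries with distinct consecutive x-values.

```python
from fractions import Fraction

def is_linear(points):
    """True if (difference in y)/(difference in x) is the same for all consecutive rows."""
    ratios = set()
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        ratios.add(Fraction(y2 - y1, x2 - x1))
    return len(ratios) == 1

table = [(3, 15), (5, 23), (7, 31), (11, 47), (13, 55)]
print(is_linear(table))   # True: every ratio equals 4
```

Note that the x-values do not need to be evenly spaced; only the ratio has to stay constant.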
Graphing a Linear Function
We know that to graph a line, we just need any two points on it. If we find two points, then we can just join
them by a line and extend it on both sides. The graph of a linear function f(x) = mx + b is
an increasing line when m > 0
a decreasing line when m < 0
a horizontal line when m = 0
There are two ways to graph a linear function.
By finding two points on it.
By using its slope and y-intercept.
Graphing a Linear Function by Finding Two Points
To find any two points on a linear function (line) f(x) = mx + b, we just assume some random values for 'x' and
substitute these values in the function to find the corresponding values for y. The process is explained with an
example where we are going to graph the function f(x) = 3x + 5.
Step 1: Find two points on the line by taking some random values.
We will assume that x = -1 and x = 0.
Step 2: Substitute each of these values in the function to find the corresponding y-values.
Here is the table of the linear function y = 3x + 5.
x y
-1 3(-1)+5 = 2
0 3(0)+5 = 5
Therefore, two points on the line are (-1, 2) and (0, 5).
Step 3: Plot the points on the graph and join them by a line. Also, extend the line on both sides.
Graphing a Linear Function Using Slope and y-Intercept
To graph a linear function, f(x) = mx + b, we can use its slope 'm' and the y-intercept 'b'. The process is
explained again by graphing the same linear function f(x) = 3x + 5. Its slope is, m = 3 and its y-intercept is (0,
b) = (0, 5).
Step 1: Plot the y-intercept (0, b).
Here, we plot the point (0, 5).
Step 2: Write the slope as the fraction rise/run and identify the "rise" and the "run".
Here, the slope = 3 = 3/1 = rise/run.
So rise = 3 and run = 1.
Step 3: Move up from the y-intercept vertically by "rise" and then move horizontally by "run". This results in a
new point. (Note that if "rise" is positive, we go up and if "rise" is negative, we go down. Also, if "run" is
positive, we go right and if "run" is negative, we go left.)
Here, we go up by 3 units from the y-intercept and then go right by 1 unit, which gives the point (1, 8).
Step 4: Join the points from Step 1 and Step 3 by a line and extend the line on both sides.

Important Notes on Linear Functions:

• A linear function is of the form f(x) = mx + b and hence its graph is a line.
• A linear function f(x) = mx + b is a horizontal line when its slope is 0 and in this case, it is known as a
constant function.
• The domain and range of a linear function f(x) = mx + b with m ≠ 0 are both R (all real numbers), whereas
the range of a constant function f(x) = b is {b}.
• These linear functions are useful to represent the objective function in linear programming.
• A constant function has no inverse as it is NOT a one-one function.
• Two linear functions are parallel if their slopes are equal.
• Two linear functions are perpendicular if the product of their slopes is -1.
• A vertical line is NOT a linear function as it fails the vertical line test.

Linear Growth & Decay


Linear Growth
Consider the relationship represented by the table shown below.
In the table above, a constant change of +1 in x corresponds to a constant change +2 in y.
Therefore, the relationship given in the table above represents linear growth, because each y-value is 2 more
than the value before it.
Linear growth can be modeled by a straight line with a positive slope.
For example, if James has a piggybank with 75 dollars already in it, and he adds 10 dollars every month, the
total amount in the piggybank can be modeled by
A = 10t + 75
Where A is the total amount, t is the number of months, and 75 (the y-intercept) is the initial amount.
The diagram shown below illustrates the above example.

Unlike exponential growth, linear growth doesn't have moments when it slows down or speeds up. Here, growth
is constant and it goes up by the same amount each time.

Linear Decay
Consider the relationship represented by the table shown below.

In the table above, a constant change of +1 in x corresponds to a constant change -3 in y.


Therefore, the relationship given in the table above represents linear decay, because each y-value is 3 less than
the value before it.
Linear decay can be modeled by a straight line with a negative slope.
For example, imagine James now takes 10 dollars every month out of his piggybank, which initially contained
100 dollars. Then, the amount left in the piggybank can be modeled by
A = 100 - 10t
Where A is the total amount, t is the number of months, and 100 (the y-intercept) is the initial amount.
The diagram shown below illustrates the above example.

Here, the decrease is at a constant rate and the slope is negative.
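As a small illustration, the two piggybank models A = 10t + 75 and A = 100 - 10t can be tabulated in Python. The function names are hypothetical; the point of the sketch is that consecutive values differ by the same amount each month.

```python
def piggybank_growth(t):
    """Linear growth: 75 dollars to start, plus 10 dollars per month."""
    return 10 * t + 75

def piggybank_decay(t):
    """Linear decay: 100 dollars to start, minus 10 dollars per month."""
    return 100 - 10 * t

months = range(5)
growth = [piggybank_growth(t) for t in months]
decay = [piggybank_decay(t) for t in months]

# Month-to-month changes are constant: +10 for growth, -10 for decay.
growth_steps = [b - a for a, b in zip(growth, growth[1:])]
decay_steps = [b - a for a, b in zip(decay, decay[1:])]

print(growth, growth_steps)
print(decay, decay_steps)
```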


Positive Association and Negative Association
Positive Association:
Both exponential growth and linear growth are examples of a positive association between two things. A
positive association happens, when two variables move in the same direction.

That is, an increase on the part of one variable introduces an increase on the part of the other variable or a
decrease on the part of one variable introduces a decrease on the part of the other variable.
For example, the number of hours spent on studying and final exam scores:

When the data points are close to forming a smooth line or graph that shows the positive relationship, we can
say there is a strong positive association.
The graph above shows a positive association that is quite strong.
Negative Association:
Both exponential decay and linear decay are examples of a negative association between two things.
A negative association happens, when two variables move in the opposite directions.
That is, an increase on the part of one variable results in a decrease on the part of the other variable, or a
decrease on the part of one variable results in an increase on the part of the other variable. For example, the
number of absences over the semester and final exam scores:
The graph above shows a negative association that is quite strong.
Exponential Growth and Decay
Exponential growth and decay apply to physical quantities which change by a constant factor, rather than by a
constant amount, in each time step. The change can be measured using the concept of exponential growth and
exponential decay, and the new quantity can be obtained from the existing quantity. The formulas of exponential
growth and decay are f(x) = a(1 + r)^t and f(x) = a(1 - r)^t respectively.
What Is Exponential Growth And Decay?
Exponential growth and decay apply to quantities which change rapidly. Exponential growth and decay have
been derived from the concept of geometric progression. Quantities that do not change by a constant amount but
by a constant factor in each time step can be described as having exponential growth or exponential decay.
The simplest representation of exponential growth and decay is the formula ab^x, where 'a' is the initial
quantity, 'b' is the growth factor, which is similar to the common ratio of the geometric progression, and 'x' is
the number of time steps over which the growth factor is multiplied. For exponential growth, the value of b is
greater than 1 (b > 1), and for exponential decay, the value of b lies between 0 and 1 (0 < b < 1).
Exponential growth finds applications in studying bacterial growth, population increase, and money growth
schemes. Exponential decay refers to a rapid decrease in a quantity over a period of time. Exponential decay
can be used to model food decay, half-life, and radioactive decay. The formulas of exponential growth and decay
are f(x) = a(1 + r)^t and f(x) = a(1 - r)^t respectively.

Exponential growth uses a factor 'r', which is the rate of growth. Here the r-value lies between 0 and 1
(0 < r < 1). The term (1 + r) can be taken as the growth factor, and 't' is the number of time steps, that is,
the number of times the growth factor is multiplied. The value of 't' can be a whole number or a decimal number.
For exponential decay, the factor is (1 - r), which has a value less than 1.
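A minimal Python sketch of the two formulas follows. The function names and the sample values (an initial quantity of 1000, rates of 5% and 50%) are assumptions made up for the illustration.

```python
def exponential_growth(a, r, t):
    """a(1 + r)^t: initial quantity a, growth rate r, t time steps."""
    return a * (1 + r) ** t

def exponential_decay(a, r, t):
    """a(1 - r)^t: initial quantity a, decay rate r, t time steps."""
    return a * (1 - r) ** t

# Assumed sample values: 1000 units growing at 5% per step for 2 steps,
# and 1000 units losing 50% per step for 3 steps.
grown = exponential_growth(1000, 0.05, 2)
decayed = exponential_decay(1000, 0.5, 3)
print(round(grown, 2), decayed)
```

Because the quantity is multiplied by the same factor at every step, the change per step gets larger for growth and smaller for decay, unlike the constant steps of a linear model.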
Formulas of Exponential Growth and Decay
Exponential growth and decay can be written in several interrelated forms. The table below shows three
different formulas for exponential growth and decay.

Exponential Growth       Exponential Decay
f(x) = ab^x              f(x) = ab^(-x)
f(x) = a(1 + r)^t        f(x) = a(1 - r)^t
P = P0 e^(kt)            P = P0 e^(-kt)

In the above formulas, 'a' or P0 is the initial quantity of the substance. Further, for exponential growth the
growth factor is b = 1 + r = e^k, and for exponential decay the decay factor is 1 - r = e^(-k), which plays the
role of 1/b in the form f(x) = ab^(-x).
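The link between the forms can be checked numerically. In this sketch the values a = 200, r = 0.08 and t = 5 are arbitrary assumptions; the continuous rate k is obtained from 1 + r = e^k for growth and from 1 - r = e^(-k) for decay.

```python
import math

a, r, t = 200.0, 0.08, 5   # assumed sample values

# Growth: 1 + r = e^k, so k = ln(1 + r).
k_growth = math.log(1 + r)
growth_step = a * (1 + r) ** t
growth_cont = a * math.exp(k_growth * t)

# Decay: 1 - r = e^(-k), so k = -ln(1 - r).
k_decay = -math.log(1 - r)
decay_step = a * (1 - r) ** t
decay_cont = a * math.exp(-k_decay * t)

print(growth_step, growth_cont)   # the two growth forms agree
print(decay_step, decay_cont)     # the two decay forms agree
```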
Applications of Exponential Growth and Decay
The concept of exponential growth and decay can be observed in numerous day-to-day scientific and industrial
activities. Let us check a few important applications of exponential growth and decay.
• Bacterial Growth: Exponential growth can be observed in the early spread of many communicable diseases.
The recent Covid-19 pandemic is a quick example, where the disease is highly communicable and is transmitted
from one person to many, and then to many more people. Bacteria and viruses can multiply in an exponential
manner. For example, a bacterial population that doubles in every time step would grow in numbers as
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, ... With this, we can observe the manner of exponential growth in
bacteria.
• Nuclear Chain Reactions: Nuclear reactions can be broadly classified as nuclear fission and nuclear
fusion reactions. Nuclear fusion is a reaction in which two or more light nuclei combine to form a larger
nucleus, and this kind of reaction can be observed in the core of the sun. In a fission chain reaction, each
fission releases neutrons that can trigger further fissions, so the number of reactions can grow
exponentially. Exponential decay can be observed in radioactive material, in which the initial quantity decays
and we have a smaller quantity by the end of the observed time period.
• Feedback: Customer feedback, especially negative feedback, can spread in an exponential manner. A customer
who experiences a badly functioning product shares this with another person, who in turn shares it with two or
more people, and each of those people again shares it with two or more people. The internet makes this sharing
easy, so feedback flows exponentially to a larger customer segment and is conveyed at a rapid pace.
• Processing Power of Computers: Processing and storage capacity have increased exponentially. Earlier,
data storage was measured only in MB, which has now grown to GB and TB. The slow growth in computer hardware
and storage systems in the 1970s and 1980s has since become exponential. The early floppy disks and hard disk
drives have now been replaced with cloud servers, which can be easily accessed by the user through an
internet-connected mobile device.
• Food Degradation: Food degradation can also be understood as a case of exponential decay. The food
remains good for a certain amount of time and then degrades rapidly. It is slightly stale initially, and then
it stales quickly until we discover that it has spoilt completely. Once the decay process starts, it proceeds
at a rapid pace until the entire food quantity is completely stale.
• Aging of Human Beings: The aging process of humans, or of any living being, toward the end of the
lifetime can be described as following an exponential decay process. A person typically remains hale and
healthy for about 60 years of a normal course of life, which is one reason the retirement age for many jobs is
set at 60 years. After this, the aging process can be so rapid that it affects the body at an exponential rate.
The decline in the quality of life seen initially becomes much more drastic in the later years. This could be
because of the advancement of certain existing diseases or the malfunctioning of certain organs.
• Internet Content: The internet is exploding with information. In the initial stages of the internet, Google
had to collect useful information and add it to the net. With time, internet users started adding information
themselves, and the amount of information now available on the internet is mind-boggling. The application of
artificial intelligence algorithms now also helps build content in an exponential manner, generating it
thousands or millions of times over in a short span of time.

CHAPTER-II
What is Probabilistic?
A probabilistic method or model is based on the theory of probability or the fact that randomness plays a role in
predicting future events. The opposite is deterministic, which is the opposite of random — it tells us something
can be predicted exactly, without the added complication of randomness.
What is a Probabilistic Model?
Probabilistic models incorporate random variables and probability distributions into the model of an event or
phenomenon. While a deterministic model gives a single possible outcome for an event, a probabilistic model
gives a probability distribution as a solution. These models take into account the fact that we can rarely know
everything about a situation. There’s nearly always an element of randomness to take into account. For
example, life insurance is based on the fact we know with certainty that we will die, but we don’t know when.
These models can be part deterministic and part random or wholly random.
Random variables from the normal distribution, binomial distribution and Bernoulli distribution form the
foundation for this type of modeling.

A normal distribution curve, sometimes called a bell curve, is one of the building blocks of a probabilistic
model.
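To make the three distributions concrete, here is a hedged Python sketch that draws random values from each using only the standard library. The helper names, the parameter values (p = 0.3, n = 10, mean 0, standard deviation 1) and the seed are assumptions chosen for the illustration.

```python
import random

rng = random.Random(42)   # fixed seed only so the sketch is repeatable

def bernoulli(p):
    """One trial: 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(n))

def normal(mu, sigma):
    """One draw from a normal distribution with mean mu and standard deviation sigma."""
    return rng.gauss(mu, sigma)

N = 10_000
bern_mean = sum(bernoulli(0.3) for _ in range(N)) / N        # expected value p = 0.3
binom_mean = sum(binomial(10, 0.3) for _ in range(N)) / N    # expected value n*p = 3
norm_mean = sum(normal(0.0, 1.0) for _ in range(N)) / N      # expected value mu = 0

print(bern_mean, binom_mean, norm_mean)
```

Each individual draw is unpredictable, but the averages over many draws settle near the expected values, which is the sense in which a probabilistic model gives a distribution rather than a single outcome.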
What is the Probabilistic Method?
The probabilistic method, pioneered by Paul Erdős, is a way to prove the existence of a structure with
certain properties in combinatorics. The idea is that you create a probability space and show that an element
chosen at random from the space has the properties sought after with positive probability, so at least one such
element must exist. The method is widely used in a variety of disciplines, including statistical physics,
quantum mechanics, and theoretical computer science.

Bivariate Analysis Definition & Example


What is Bivariate Data?
Data in statistics is sometimes classified according to how many variables are in a particular study. For
example, “height” might be one variable and “weight” might be another variable. Depending on the number of
variables being looked at, the data might be univariate, or it might be bivariate.
When you conduct a study that looks at a single variable, that study involves univariate data. For example, you
might study a group of college students to find out their average SAT scores or you might study a group of
diabetic patients to find their weights. Bivariate data is when you are studying two variables. For example, if
you are studying a group of college students to find out their average SAT score and their age, you have two
pieces of the puzzle to find (SAT score and age). Or if you want to find out the weights and heights of diabetic
patients, then you also have bivariate data. Bivariate data could also be two sets of items that are dependent on
each other. For example:
• Ice cream sales compared to the temperature that day.
• Traffic accidents along with the weather on a particular day.
Bivariate data has many practical uses in real life. For example, it is pretty useful to be able to predict when a
natural event might occur. One tool in the statistician’s toolbox is bivariate data analysis. Sometimes, something
as simple as plotting one variable against another on a Cartesian plane can give you a clear picture of what the
data is trying to tell you. For example, the scatterplot below shows the relationship between the time between
eruptions at Old Faithful vs. the duration of the eruption.

Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone
National Park, Wyoming, USA. This scatterplot suggests there are generally two “types” of eruptions: short-
wait-short-duration, and long-wait-long-duration.
What is Bivariate Analysis?
Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, used
to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
• Univariate analysis is the analysis of one (“uni”) variable.
• Bivariate analysis is the analysis of exactly two variables.
• Multivariate analysis is the analysis of more than two variables.
The results from bivariate analysis can be stored in a two-column data table. For example, you might want to
find out the relationship between caloric intake and weight (of course, there is a pretty strong relationship
between the two). Caloric intake would be your independent variable, X, and weight would be your dependent
variable, Y.

Bivariate analysis is not the same as two sample data analysis. With two sample data analysis (like a two
sample z test in Excel), the X and Y are not directly related. You can also have a different number of data
values in each sample; with bivariate analysis, there is a Y value for each X. Let’s say you had a caloric intake
of 3,000 calories per day and a weight of 300lbs. You would write that with the x-variable followed by the y-
variable: (3000,300).
Two sample data analysis
Sample 1: 100,45,88,99
Sample 2: 44,33,101
Bivariate analysis
(X,Y)=(100,56),(23,84),(398,63),(56,42)
Types of Bivariate Analysis
Common types of bivariate analysis include:
1. Scatter plots
These give you a visual idea of the pattern that your variables follow.

A simple scatterplot.
2. Regression Analysis
Regression analysis is a catch all term for a wide variety of tools that you can use to determine how your data
points might be related. In the image above, the points look like they could follow an exponential curve (as
opposed to a straight line). Regression analysis can give you the equation for that curve or line. It can also give
you the correlation coefficient.
3. Correlation Coefficients
Values for correlation coefficients are usually calculated on a computer, although the correlation coefficient
can also be found by hand. This coefficient tells you if the variables are related. Basically, a zero means
they aren’t linearly correlated, while a 1 (either positive or negative) means that the variables are perfectly
correlated (i.e. they are perfectly in sync with each other).
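As an illustration, the Pearson correlation coefficient can be computed directly from its definition. The function name pearson_r is hypothetical; the first data set reuses the linear table from Chapter I, and the second is the linear decay model A = 100 - 10t, so both are expected to be perfectly correlated.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points lying exactly on the increasing line y = 4x + 3.
r_up = pearson_r([3, 5, 7, 11, 13], [15, 23, 31, 47, 55])

# Points lying exactly on the decreasing line A = 100 - 10t.
r_down = pearson_r([0, 1, 2, 3, 4], [100, 90, 80, 70, 60])

print(r_up, r_down)   # approximately 1 and -1
```

Real data rarely lie exactly on a line, so values strictly between -1 and 1 are the usual case.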
Importance of bivariate analysis
Bivariate analysis is an important statistical method because it lets researchers look at the relationship between
two variables and determine their relationship. This can be helpful in many different kinds of research, such as
social science, medicine, marketing, and more.
Here are some reasons why bivariate analysis is important:
• Bivariate analysis helps identify trends and patterns: It can reveal hidden data trends and patterns by
evaluating the relationship between two variables.
• Bivariate analysis helps investigate cause and effect relationships: It can assess whether two variables are
statistically associated, a first step for researchers trying to establish whether one variable influences
the other (association alone does not prove causation).
• It helps researchers make predictions: It allows researchers to predict future results by modeling the
link between two variables.
• It helps inform decision-making: Business, public policy, and healthcare decision-making can benefit
from bivariate analysis.
The ability to analyze the correlation between two variables is crucial for making sound judgments, and this
analysis serves this purpose admirably.

Scatterplots: Using, Examples, and Interpreting


Use scatterplots to show relationships between pairs of continuous variables. These graphs display symbols at
the X, Y coordinates of the data points for the paired variables. Scatterplots are also known as scattergrams and
scatter charts.
The pattern of dots on a scatterplot allows you to determine whether a relationship or correlation exists between
two continuous variables. If a relationship exists, the scatterplot indicates its direction and whether it is a linear
or curved relationship.
Fitted line plots are a special type of scatterplot that displays the data points along with a fitted line for a
simple regression model. This graph allows you to evaluate how well the model fits the data.
Trend line
When a scatter plot is used to look at a predictive or correlational relationship between variables, it is common
to add a trend line to the plot showing the mathematically best fit to the data. This can provide an additional
signal as to how strong the relationship between the two variables is, and if there are any unusual points that are
affecting the computation of the trend line.

Use scatterplots to assess the following features of your dataset:


o Examine the relationship between two variables.
o Check for outliers and unusual observations.
o Create a time series plot with irregular time-dependent data.
o Evaluate the fit of a regression model.
At a minimum, scatterplots require two continuous variables.
Example Scatterplot
During an experiment, I measured the Body Mass Index (BMI) and body fat percentage of adolescent girls. I
graphed these two variables in a scatterplot to assess the relationship between them.
Scatterplots typically contain the following elements:
o X-axis representing values of a continuous variable. By custom, this is the independent variable when
you can classify one of the variables as such.
o Y-axis representing values of a continuous variable. Traditionally, this is the dependent variable.
o Symbols plotted at the (X, Y) coordinates of your data. Optionally, the graph can use different
colored/shaped symbols to represent separate groups on the same chart.
o Optionally, you can overlay fit lines to determine how well a model fits the data.
For the BMI and the body fat data, the scatterplot displays a moderately strong, positive relationship. As BMI
increases, the body fat percentage also tends to increase. The relationship appears to curve slightly because it
flattens out for higher BMI values. To model the curvature, the analysts include a squared term in the model.
The fitted line follows the curvature of the data, indicating a good fit.
Simple Linear Regression & The Correlation Coefficient
Simple linear regression is used to estimate the relationship between two quantitative variables. You can use
simple linear regression when you want to know:
1. How strong the relationship is between two variables (e.g., the relationship between rainfall and soil
erosion).
2. The value of the dependent variable at a certain value of the independent variable (e.g., the amount of
soil erosion at a certain level of rainfall).
Regression models describe the relationship between variables by fitting a line to the observed data. Linear
regression models use a straight line, while logistic and nonlinear regression models use a curved line.
Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.
Simple linear regression example: You are a social researcher interested in the relationship between income
and happiness. You survey 500 people whose incomes range from 15k to 75k and ask them to rank their
happiness on a scale from 1 to 10.
Your independent variable (income) and dependent variable (happiness) are both quantitative, so you can do a
regression analysis to see if there is a linear relationship between them.
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These
assumptions are:
1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change
significantly across the values of the independent variable.
2. Independence of observations: the observations in the dataset were collected using statistically
valid sampling methods, and there are no hidden relationships among observations.
3. Normality: The data follows a normal distribution.
4. The relationship between the independent and dependent variable is linear: the line of best fit through
the data points is a straight line (rather than a curve or some sort of grouping factor).
If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use
a nonparametric test instead, such as the Spearman rank test.
Example: Data that doesn’t meet the assumptions
You think there is a linear relationship between cured meat consumption and the incidence of colorectal cancer
in the U.S. However, you find that much more data has been collected at high rates of meat consumption than at
low rates of meat consumption, with the result that there is much more variation in the estimate of cancer rates
at the low range than at the high range. Because the data violate the assumption of homoscedasticity, simple
linear regression is not appropriate, so you perform a Spearman rank test instead.
If your data violate the assumption of independence of observations (e.g., if observations are repeated over
time), you may be able to fit a linear mixed-effects model that accounts for the additional structure in the
data.
How to perform a simple linear regression
Simple linear regression formula
The formula for a simple linear regression is:
$$y = B_0 + B_1 x + e$$
 y is the predicted value of the dependent variable (y) for any given value of the independent variable (x).
 B0 is the intercept, the predicted value of y when x is 0.
 B1 is the regression coefficient, which tells us how much we expect y to change as x increases.
 x is the independent variable (the variable we expect is influencing y).
 e is the error of the estimate, or how much variation there is in our estimate of the regression
coefficient.
Linear regression finds the line of best fit through your data by searching for the regression coefficient (B1)
that minimizes the total error (e) of the model.
While you can perform a linear regression by hand, this is a tedious process, so most people use statistical
programs to help them quickly analyze the data.
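As a minimal sketch of what such a program does, the Python snippet below fits a line by ordinary least squares. The income and happiness numbers are made up for illustration (they are not the survey data described above), and the function name fit_simple_linear_regression is a hypothetical helper, not part of any library.

```python
# Hypothetical data: income (in thousands) and happiness score (1 to 10).
incomes = [15, 25, 35, 45, 55, 65, 75]
happiness = [2.0, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]


def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # regression coefficient (slope)
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


b0, b1 = fit_simple_linear_regression(incomes, happiness)
predicted_at_50 = b0 + b1 * 50   # predicted happiness at an income of 50k
print(f"intercept B0 = {b0:.3f}, slope B1 = {b1:.3f}")
print(f"predicted happiness at 50k = {predicted_at_50:.2f}")
```

The slope answers the question "how much does y change per unit of x", and the fitted line gives the predicted value of y at any chosen x.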
The Correlation Coefficient r
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good
predictor? Use the correlation coefficient as another indicator (besides the scatterplot) of the strength of the
relationship between x and y.
The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the
strength of association between the independent variable x and the dependent variable y.
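The correlation coefficient is calculated as:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where n = the number of data points.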
If you suspect a linear relationship between x and y, then r can measure how strong the linear relationship is.
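A short Python sketch of this calculation is given here. It assumes the usual computational form of Pearson's formula, r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²]); the data lists and the function name pearson_r are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient from the sums in the computational formula."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data sets showing the range of r.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # points lie exactly on a rising line
y_down = [10, 8, 6, 4, 2]    # points lie exactly on a falling line
y_mixed = [2, 1, 4, 3, 5]    # rising trend with some scatter

for label, y in (("up", y_up), ("down", y_down), ("mixed", y_mixed)):
    print(label, round(pearson_r(x, y), 3))
```

The first two data sets lie exactly on a line, so they give the extreme values of r; the third gives a value strictly between 0 and 1.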
What the VALUE of r tells us:
 The value of r is always between -1 and +1: -1 ≤ r ≤ 1.
 The size of the correlation r indicates the strength of the linear relationship between x and y. Values
of r close to -1 or to +1 indicate a stronger linear relationship between x and y.
 If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).
 If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both
these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not
generally happen.
What the SIGN of r tells us:
 A positive value of r means that when x increases, y tends to increase and when x decreases, y tends
to decrease (positive correlation).
 A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends
to increase (negative correlation).
 The sign of r is the same as the sign of the slope, m, of the best fit line.
We can see this in Figure 10.

Figure 10
NOTE: Strong correlation does not suggest that x causes y or y causes x. We say “correlation does not imply
causation.” For example, every person who learned math in the 17th century is dead. However, learning math
does not necessarily cause death!
Difference between Correlation and Regression
Correlation and regression are both used as statistical measurements to get a good understanding of the
relationship between variables. If the correlation coefficient is negative (or positive) then the slope of the
regression line will also be negative (or positive). The table given below highlights the key difference between
correlation and regression.

Correlation | Regression
Correlation is used to determine whether variables are related or not. | Regression is used to numerically describe how a dependent variable changes with a change in an independent variable.
Correlation tries to establish a linear relationship between variables. | It finds the best-fitted regression line to estimate an unknown variable on the basis of the known variable.
The variables can be used interchangeably. | The variables cannot be interchanged.
Correlation uses a signed numerical value to estimate the strength of the relationship between the variables. | Regression is used to show the impact of a unit change in the independent variable on the dependent variable.
The Pearson's coefficient is the best measure of correlation. | The least-squares method is the best technique to determine the regression line.

Important Notes on Correlation and Regression


 Correlation and regression are statistical measurements that are used to quantify the strength of the
linear relationship between two variables.
 Correlation determines if two variables have a linear relationship while regression describes how the
dependent variable changes with the independent variable. Neither by itself establishes cause and effect.
 Pearson's correlation coefficient and ordinary least squares method are used to perform correlation and
regression analysis.
Confidence Interval
A confidence interval, in statistics, is a range of values that is expected to contain a population parameter a
certain proportion of the time (the confidence level) when the sampling is repeated. Analysts often use
confidence levels of 95% or 99%. Thus, if a statistical model generates a point estimate of 10.00 with a 95%
confidence interval of 9.50 to 10.50, it means one is 95% confident that the true value falls within that range.
What exactly is a confidence interval?
A confidence interval is the mean of your estimate plus and minus the variation in that estimate. This is the
range of values you expect your estimate to fall between if you redo your test, within a certain level of
confidence.
Confidence, in statistics, is another way to describe probability. For example, if you construct a confidence
interval with a 95% confidence level, you are confident that 95 out of 100 times the estimate will fall between
the upper and lower values specified by the confidence interval.
Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:
Confidence level = 1 − α
So if you use an alpha value of 0.05 (p < 0.05) for statistical significance, then your confidence level would be
1 − 0.05 = 0.95, or 95%.
When do you use confidence intervals?
You can calculate confidence intervals for many kinds of statistical estimates, including:
 Proportions
 Population means
 Differences between population means or proportions
 Estimates of variation among groups
These are all point estimates, and don’t give any information about the variation around the number. Confidence
intervals are useful for communicating the variation around a point estimate.

Finding the critical value


Critical values tell you how many standard deviations away from the mean you need to go in order to reach the
desired confidence level for your confidence interval.
There are three steps to find the critical value.
1. Choose your alpha (α) value.
The alpha value is the probability threshold for statistical significance. The most common alpha value is p =
0.05, but 0.1, 0.01, and even 0.001 are sometimes used. It’s best to look at the research papers published in your
field to decide which alpha value to use.
2. Decide if you need a one-tailed interval or a two-tailed interval.
You will most likely use a two-tailed interval unless you are doing a one-tailed t test.
For a two-tailed interval, divide your alpha by two to get the alpha value for the upper and lower tails.
3. Look up the critical value that corresponds with the alpha value.
If your data follows a normal distribution, or if you have a large sample size (n > 30) that is approximately
normally distributed, you can use the z distribution to find your critical values.
For a z statistic, some of the most common values are shown in this table:

Confidence level | 90% | 95% | 99%
alpha for one-tailed CI | 0.1 | 0.05 | 0.01
alpha for two-tailed CI | 0.05 | 0.025 | 0.005
z statistic (two-tailed) | 1.645 | 1.96 | 2.576

If you are using a small dataset (n ≤ 30) that is approximately normally distributed, use
the t distribution instead.
The t distribution follows the same shape as the z distribution, but corrects for small sample sizes. For
the t distribution, you need to know your degrees of freedom (sample size minus 1).
Use a t table to find your t statistic: look up the row for your degrees of freedom and the column for your
confidence level (or the matching one-tailed or two-tailed alpha value).
For symmetric distributions, like the t distribution and z distribution, the critical value is the same distance
on either side of the mean.
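The three steps above can be sketched in Python with the standard library's statistics.NormalDist class. The helper name z_critical is an assumption made for this example. It covers only the z distribution; the standard library has no t distribution, so for small samples you would still need a t table or a third-party package such as SciPy.

```python
from statistics import NormalDist


def z_critical(confidence_level, two_tailed=True):
    """Critical value of the standard normal distribution for a confidence level."""
    alpha = 1 - confidence_level                        # step 1: alpha
    tail_area = alpha / 2 if two_tailed else alpha      # step 2: one or two tails
    return NormalDist().inv_cdf(1 - tail_area)          # step 3: look up the value


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {z_critical(level):.3f}")
```

Passing two_tailed=False gives the one-tailed critical value for the same confidence level, which is smaller because all of alpha sits in a single tail.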
Confidence interval for the mean of normally-distributed data
Normally-distributed data forms a bell shape when plotted on a graph, with the sample mean in the middle and
the rest of the data distributed fairly evenly on either side of the mean.
The confidence interval for data which follows a standard normal distribution is:
$$CI = \bar{X} \pm Z^* \frac{\sigma}{\sqrt{n}}$$
Where:
 CI = the confidence interval
 X̄ = the population mean
 Z* = the critical value of the z distribution
 σ = the population standard deviation
 √n = the square root of the sample size
The confidence interval for the t distribution follows the same formula, but replaces the Z* with the t*.
In real life, you never know the true values for the population (unless you can do a complete census). Instead,
we replace the population values with the values from our sample data, so the formula becomes:
$$CI = \hat{x} \pm Z^* \frac{s}{\sqrt{n}}$$
Where:
 x̂ = the sample mean
 s = the sample standard deviation
Example: Calculating the confidence interval
In the survey of Americans’ and Brits’ television watching habits, we can use the sample mean and sample
standard deviation in place of the population mean and population standard deviation, together with the sample
size n.
To calculate the 95% confidence interval, we can simply plug the values into the formula.
For the USA:
$$CI = 35 \pm 0.98$$
So for the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98.
For GB:
$$CI = 35 \pm 1.96$$
So for GB, the lower and upper bounds of the 95% confidence interval are 33.04 and 36.96.
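The calculation can be sketched in Python as follows. The inputs (sample mean 35, sample standard deviations 5 and 10, sample size 100) are assumptions chosen because they reproduce the bounds quoted above; the survey's actual figures are not given in this section. The function name confidence_interval is likewise hypothetical.

```python
import math


def confidence_interval(sample_mean, sample_sd, n, z_star=1.96):
    """Return (lower, upper) bounds: sample_mean +/- z_star * sample_sd / sqrt(n)."""
    margin = z_star * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Assumed inputs, not taken from the text: mean 35, n = 100, s = 5 (USA) or 10 (GB).
usa_lower, usa_upper = confidence_interval(35, 5, 100)
gb_lower, gb_upper = confidence_interval(35, 10, 100)
print(f"USA: {usa_lower:.2f} to {usa_upper:.2f}")
print(f"GB:  {gb_lower:.2f} to {gb_upper:.2f}")
```

With the same mean and sample size, the larger standard deviation gives the wider interval, which is why the GB interval is twice as wide as the USA one.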
Hypothesis Testing
Hypothesis Testing Example
The best way to solve a problem on hypothesis testing is by applying the 5 steps mentioned in the previous
section. Suppose a researcher claims that the mean weight of men is greater than 100 kg, with a standard
deviation of 15 kg. A sample of 30 men is chosen, with an average weight of 112.5 kg. Using hypothesis testing,
check if there is enough evidence to support the researcher's claim. The confidence level is given as 95%.
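One way to work this example is a right-tailed z test, sketched below in Python. This is an assumed setup rather than a solution given in the notes: it takes the null hypothesis as a mean of 100 kg, the alternative as a mean greater than 100 kg, and treats the 15 kg figure as the known population standard deviation.

```python
import math
from statistics import NormalDist

mu_0 = 100.0     # mean weight under the null hypothesis (kg)
sigma = 15.0     # standard deviation, assumed to be the population value (kg)
n = 30           # sample size
x_bar = 112.5    # sample mean (kg)
alpha = 0.05     # 1 - 0.95 confidence level, all in the right tail

# Test statistic: how many standard errors the sample mean lies above mu_0.
z_stat = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Right-tailed critical value and p-value from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z_stat)

reject_null = z_stat > z_crit
print(f"z = {z_stat:.2f}, critical value = {z_crit:.3f}, p = {p_value:.2e}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

Under these assumptions the test statistic is well above the critical value, so the sample supports the claim that the mean weight is greater than 100 kg.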
Hypothesis Testing and Confidence Intervals
Confidence intervals form an important part of hypothesis testing. This is because the alpha level can be
determined from a given confidence level. Suppose a confidence level is given as 95%. Subtract the
confidence level from 100%. This gives 100 - 95 = 5% or 0.05. This is the alpha value of a one-tailed
hypothesis test. To obtain the alpha value in each tail of a two-tailed hypothesis test, divide this value by 2.
This gives 0.05 / 2 = 0.025.
Important Notes on Hypothesis Testing
 Hypothesis testing is a technique that is used to verify whether the results of an experiment are
statistically significant.
 It involves the setting up of a null hypothesis and an alternate hypothesis.
 There are three types of tests that can be conducted under hypothesis testing - z test, t test, and chi
square test.
 Hypothesis testing can be classified as right tail, left tail, and two tail tests.
