Define statistics. What are the branches of statistics?


Statistics is a branch of applied mathematics that involves the collection, description,
analysis, and inference of conclusions from quantitative data. The mathematical
theories behind statistics rely heavily on differential and integral calculus, linear
algebra, and probability theory.
Statisticians, people who do statistics, are particularly concerned with determining
how to draw reliable conclusions about large groups and general events from the
behavior and other observable characteristics of small samples. These small samples
represent a portion of the large group or a limited number of instances of a general
phenomenon.
Branches of Statistics
Descriptive Statistics
Descriptive statistics mostly focus on the central tendency, variability, and
distribution of sample data. Central tendency means the estimate of the
characteristics, a typical element of a sample or population, and includes descriptive
statistics such as mean, median, and mode. Variability refers to a set of statistics that
show how much difference there is among the elements of a sample or population
along the characteristics measured, and includes metrics such as range, variance,
and standard deviation.
The distribution refers to the overall "shape" of the data, which can be depicted on a
chart such as a histogram or dot plot, and includes properties such as the probability
distribution function, skewness, and kurtosis. Descriptive statistics can also describe
differences between observed characteristics of the elements of a data set.
Descriptive statistics help us understand the collective properties of the elements of
a data sample and form the basis for testing hypotheses and making predictions
using inferential statistics.
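As a concrete illustration, the descriptive measures named above can be computed with Python's standard-library statistics module; the data set here is made up:

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]               # a made-up sample

print(statistics.mean(data))                   # central tendency: arithmetic mean -> 5
print(statistics.median(data))                 # central tendency: middle value -> 4.5
print(statistics.mode(data))                   # central tendency: most frequent value -> 4
print(max(data) - min(data))                   # variability: range -> 7
print(statistics.pvariance(data))              # variability: population variance -> 4
print(statistics.pstdev(data))                 # variability: population standard deviation -> 2.0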
Inferential Statistics
Inferential statistics are tools that statisticians use to draw conclusions about the
characteristics of a population, drawn from the characteristics of a sample, and to
decide how certain they can be of the reliability of those conclusions. Based on the
sample size and distribution statisticians can calculate the probability that statistics,
which measure the central tendency, variability, distribution, and relationships
between characteristics within a data sample, provide an accurate picture of the
corresponding parameters of the whole population from which the sample is drawn.
Inferential statistics are used to make generalizations about large groups, such as
estimating average demand for a product by surveying a sample of consumers'
buying habits or to attempt to predict future events, such as projecting the future
return of a security or asset class based on returns in a sample period.
Regression analysis is a widely used technique of statistical inference used to
determine the strength and nature of the relationship (i.e., the correlation) between
a dependent variable and one or more explanatory (independent) variables. The
output of a regression model is often analyzed for statistical significance, which
refers to the claim that a result from findings generated by testing or
experimentation is not likely to have occurred randomly or by chance but is likely to
be attributable to a specific cause elucidated by the data. Having statistical
significance is important for academic disciplines or practitioners that rely heavily on
analyzing data and research.
The Branches of Statistics
Two branches, descriptive statistics and inferential statistics, comprise the field of
statistics.
Descriptive Statistics
CONCEPT The branch of statistics that focuses on collecting, summarizing, and
presenting a set of data.
EXAMPLES The average age of citizens who voted for the winning candidate in the
last presidential election, the average length of all books about statistics, the
variation in the weight of 100 boxes of cereal selected from a factory's production
line.
Inferential Statistics
CONCEPT The branch of statistics that analyzes sample data to draw conclusions
about a population.
EXAMPLE A survey that sampled 2,001 full- or part-time workers ages 50 to 70,
conducted by the American Association of Retired Persons (AARP), discovered that
70% of those polled planned to work past the traditional mid-60s retirement age. By
using methods discussed in Section 6.4, this statistic could be used to draw
conclusions about the population of all workers ages 50 to 70.

What is the importance of statistics in economics?


The field of Statistics deals with collection, organisation, analysis,
interpretation and presentation of data. Statistics plays a vital role in
understanding economic data such as the relationship between the quantity
and price, supply and demand, economic output, GDP, per capita income of
nations etc. The government and the policymakers use statistical data to
formulate suitable policies of economic development. No analysis of a problem
would be possible without the availability of data on various factors underlying
an economic problem. For example, if the government wants to make policy to
solve the problem of unemployment and poverty, reliable data are required
for it.
What are the Advantages and disadvantages of mean, median and mode?
Advantages and disadvantages of mean, median and mode.
 Mean is the most commonly used measures of central tendency. It
represents the average of the given collection of data.
 Median is the middle value among the observed set of values and is
calculated by arranging the values in ascending order or in descending
order and then choosing the middle value.
 The most frequent number occurring in the data set is known as the
mode.
The advantages and disadvantages of mean, median, and mode are as follows:

Mean
Advantage: Takes account of all values to calculate the average.
Disadvantage: A very small or very large value can affect the mean.

Median
Advantage: The median is not affected by very large or very small values.
Disadvantage: Since the median is an average of position, arranging the data in
ascending or descending order of magnitude is time-consuming in the case of a
large number of observations.

Mode
Advantage: The only average that can be used if the data set is not in numbers.
Disadvantage: There can be more than one mode, and there can also be no mode,
which means the mode is not always representative of the data.
What are the Advantages and disadvantages of the geometric mean?


The geometric mean is an average that gives the central tendency or typical
value of a series of numbers by taking the nth root of the product of the numbers.
The main advantages of the geometric mean are:
1. The calculation is based on all the terms of the sequence.
2. Suitable for further mathematical analysis.
3. Fluctuations in the sample do not affect the geometric mean.
4. It gives more weight to small observations.
The disadvantages of the geometric mean are:
1. One of the main drawbacks of the geometric mean is that if one of the
observations is negative, the geometric mean will be imaginary, regardless
of the values of the other observations.
2. Due to the complexity of the numbers, it is not easy for anyone other
than a mathematician to understand and calculate.
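A small Python sketch of the geometric mean (the observations are made up); note that statistics.geometric_mean() raises an error if any value is zero or negative, which mirrors the drawback noted above:

import math
import statistics

growth_factors = [1.05, 1.10, 0.98, 1.20]               # made-up observations
print(statistics.geometric_mean(growth_factors))         # nth root of the product

# The same value computed directly from the definition:
print(math.prod(growth_factors) ** (1 / len(growth_factors)))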
What are the Advantages and disadvantages of mode?
Mode:
Mode is one of the measures of central tendency. It is the value with the
highest frequency in the given data.
The following are the advantages and disadvantages of mode:
Advantage:
 Mode is simple to understand and easy to calculate.
 It can be located graphically, unlike mean and median.
 It can be used for qualitative analysis.
 The extremities in the values of the data do not affect the mode.
Disadvantage
 The mode does not consider all the values in the data.
 There can be more than one mode or no mode for the data.
 It is not well defined.
What Is Variance?
The variance is a measure of variability. It is calculated by taking the average of
squared deviations from the mean.
Variance tells you the degree of spread in your data set. The more spread the
data, the larger the variance is in relation to the mean.
Variance = (Standard deviation)² = σ²
Population variance
When you have collected data from every member of the population that
you’re interested in, you can get an exact value for population variance.
The population variance formula looks like this:

σ² = Σ(X − μ)² / N

where:
 σ² = population variance
 Σ = sum of…
 X = each value
 μ = population mean
 N = number of values in the population

Sample variance
When you collect data from a sample, the sample variance is used to make
estimates or inferences about the population variance.
The sample variance formula looks like this:

s² = Σ(X − x̄)² / (n − 1)

where:
 s² = sample variance
 Σ = sum of…
 X = each value
 x̄ = sample mean
 n = number of values in the sample
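Both versions are available in Python's standard library; a minimal sketch with made-up data:

import statistics

values = [98, 102, 95, 104, 101]          # made-up data

print(statistics.pvariance(values))        # population variance: divide by N
print(statistics.variance(values))         # sample variance: divide by n - 1
print(statistics.stdev(values) ** 2)       # squaring the sample standard deviation gives the same value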
What is Quartile Deviation?


Quartile deviation depends on the difference between the first quartile and
the third quartile in the frequency distribution. The difference is also known as
the interquartile range. The difference divided by two is known as quartile
deviation or semi-interquartile range.
Half of the difference between the 3rd quartile and the 1st quartile of a simple
or frequency distribution is the quartile deviation.
Formula
A Quartile Deviation (Q.D.) formula is used in statistics to measure spread or, in
other words, to measure dispersion. It can also be called a semi-interquartile
range.
Q.D. = (Q3 – Q1) / 2

 The formula uses Q3 and Q1, which mark off the top 25% and the lower 25%
of the data, respectively. Half of the difference between these two values gives
the measure of spread or dispersion.
 So, to calculate the quartile deviation, first find Q1, then find Q3, then take
the difference between the two, and finally divide it by 2.
 It is one of the best methods of dispersion for open-ended data.
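A quick Python sketch with made-up data (note that different textbooks and packages use slightly different quartile conventions, so the exact Q1 and Q3 can vary):

import statistics

data = [12, 15, 17, 19, 22, 24, 25, 29, 31, 33, 38]    # made-up observations

q1, q2, q3 = statistics.quantiles(data, n=4)            # the three quartile cut points
interquartile_range = q3 - q1
quartile_deviation = (q3 - q1) / 2                      # semi-interquartile range
print(q1, q3, interquartile_range, quartile_deviation)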
Skewness is a measure of the asymmetry of a distribution. A distribution is
asymmetrical when its left and right sides are not mirror images.
A distribution can have right (or positive), left (or negative), or zero skewness.
A right-skewed distribution is longer on the right side of its peak, and a left-
skewed distribution is longer on the left side of its peak:

You might want to calculate the skewness of a distribution to:


 Describe the distribution of a variable alongside other descriptive
statistics
 Determine if a variable is normally distributed. A normal distribution has
zero skew and is an assumption of many statistical procedures.
What is zero skew?
When a distribution has zero skew, it is symmetrical. Its left and right sides are
mirror images.
Normal distributions have zero skew, but they’re not the only distributions
with zero skew. Any symmetrical distribution, such as a uniform distribution or
some bimodal (two-peak) distributions, will also have zero skew.
Zero skew: mean = median
For example, the mean chick weight is 261.3 g, and the median is 258 g. The
mean and median are almost equal. They aren’t perfectly equal because the
sample distribution has a very small skew.
What is right skew (positive skew)?
A right-skewed distribution is longer on the right side of its peak than on its
left. Right skew is also referred to as positive skew.
You can think of skewness in terms of tails. A tail is a long, tapering end of a
distribution. It indicates that there are observations at one of the extreme
ends of the distribution, but that they’re relatively infrequent. A right-skewed
distribution has a long tail on its right side.
Right skew: mean > median
For example, the mean number of sunspots observed per year was 48.6, which
is greater than the median of 39.
What is left skew (negative skew)?
A left-skewed distribution is longer on the left side of its peak than on its right.
In other words, a left-skewed distribution has a long tail on its left side. Left
skew is also referred to as negative skew.
Left skew: mean < median
For example, the mean zoology test score was 53.7, which is less than the
median of 55.
How to calculate skewness
There are several formulas to measure skewness. One of the simplest is
Pearson’s median skewness. It takes advantage of the fact that the mean and
median are unequal in a skewed distribution.

Pearson’s median skewness = 3 × (mean − median) / standard deviation

Example: Calculating Pearson’s median skewness of the number of sunspots
observed per year:
 Mean = 48.6
 Median = 39
 Standard deviation = 39.5
Calculation:
Pearson’s median skewness = 3 × (48.6 − 39) / 39.5 = 28.8 / 39.5 ≈ 0.73
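The same calculation in Python, reusing the values quoted above:

mean, median, sd = 48.6, 39, 39.5            # the sunspot values quoted above

skewness = 3 * (mean - median) / sd
print(round(skewness, 2))                     # about 0.73: a moderate right (positive) skew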
Events in Probability
Events in probability can be defined as a set of outcomes of a random
experiment. The sample space indicates all possible outcomes of an
experiment. Thus, events in probability can also be described as subsets of the
sample space.
There are many different types of events in probability. Each type of event has
its own individual properties. This classification of events in probability helps to
simplify mathematical calculations. In this article, we will learn more about
events in probability, their types and see certain associated examples.
Events in Probability Example
Suppose a fair die is rolled. The total number of possible outcomes will form
the sample space and are given by {1, 2, 3, 4, 5, 6}. Let an event, E, be defined
as getting an even number on the die. Then E = {2, 4, 6}. Thus, it can be seen
that E is a subset of the sample space and is an outcome of the rolling of a die.
Types of Events in Probability
There are several different types of events in probability. There can only be
one sample space for a random experiment; however, there can be many
different types of events. Some of the important events in probability are listed
below.
Independent and Dependent Events
Independent events in probability are those events whose outcome does not
depend on some previous outcome. No matter how many times an experiment
has been conducted the probability of occurrence of independent events will
be the same. For example, tossing a coin is an independent event in
probability.
Dependent events in probability are events whose outcome depends on a
previous outcome. This implies that the probability of occurrence of a
dependent event will be affected by some previous outcome. For example,
drawing two balls one after another from a bag without replacement.
Impossible and Sure Events
An event that can never happen is known as an impossible event. As
impossible events in probability will never take place thus, the chance that
they will occur is always 0. For example, the sun revolving around the earth is
an impossible event.
A sure event is one that will always happen. The probability of occurrence of a
sure event will always be 1. For example, the earth revolving around the sun is
a sure event.
Simple and Compound Events
If an event consists of a single point or a single result from the sample space, it
is termed a simple event. The event of getting less than 2 on rolling a fair die,
denoted as E = {1}, is an example of a simple event.
If an event consists of more than a single result from the sample space, it is
called a compound event. An example of a compound event in probability is
rolling a fair die and getting an odd number. E = {1, 3, 5}.
Complementary Events
When there are two events such that one event can occur if and only if the
other does not take place then such events are known as complementary
events in probability. The sum of the probability of complementary events will
always be equal to 1. For example, on tossing a coin let E be defined as getting
a head. Then the complement of E is E' which will be the event of getting a tail.
Thus, E and E' together make up complementary events. Such events are
mutually exclusive and exhaustive.
Mutually Exclusive Events
Events that cannot occur at the same time are known as mutually exclusive
events. Thus, mutually exclusive events in probability do not have any common
outcomes. For example, S = {10, 9, 8, 7, 6, 5, 4}, A = {4, 6, 7} and B = {10, 9, 8}.
As there is nothing common between sets A and B, they are mutually
exclusive events.
Exhaustive Events
Exhaustive events in probability are those events that, taken together, form
the sample space of a random experiment. In other words, a set of events out
of which at least one is sure to occur when the experiment is performed are
exhaustive events. For example, the outcome of an exam is either passing or
failing.
Equally Likely Events
Equally likely events in probability are those events in which the outcomes are
equally possible. For example, on tossing a coin, getting a head or getting a tail,
are equally likely events.
Intersection of Events in Probability
The intersection of events in probability corresponds to the AND event. If two
events are associated with the "AND" operator, it implies that the common
outcomes of both events will be the result. It is denoted by the intersection
symbol "∩". For example, A = {1, 2, 3, 4}, B = {2, 3, 5, 6} then A ∩ B = {2, 3}.

Union of Events in Probability


The union of events in probability is the same as the OR event. If there are two
events that belong to this group then the outcomes of either event or both will
be the result. The union symbol (∪) is used to denote the OR event. For
example, A = {1, 2, 3, 4}, B = {2, 3, 5, 6} then A ∪ B = {1, 2, 3, 4, 5, 6}.
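These event operations map directly onto Python's built-in set type; a small sketch using the sets from the examples above:

S = {1, 2, 3, 4, 5, 6}        # sample space for one roll of a fair die
A = {1, 2, 3, 4}
B = {2, 3, 5, 6}

print(A & B)                   # intersection (AND event): {2, 3}
print(A | B)                   # union (OR event): {1, 2, 3, 4, 5, 6}
print(S - A)                   # complement of A within S: {5, 6}
print(A.isdisjoint({5, 6}))    # True: A and {5, 6} are mutually exclusive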
What Is Sampling?
Sampling is a process in statistical analysis where researchers take a
predetermined number of observations from a larger population. The method
of sampling depends on the type of analysis being performed, but it may
include simple random sampling or systematic sampling.
KEY TAKEAWAYS
 Certified Public Accountants use sampling during audits to determine
the accuracy and completeness of account balances.1
 Types of sampling include random sampling, block sampling, judgement
sampling, and systematic sampling.
 Companies use sampling as a marketing tool to identify the needs and
wants of their target market.
Sampling may be defined as the procedure in which a sample is selected from
an individual or a group of people of a certain kind for research purposes. In
sampling, the population is divided into a number of parts called sampling
units.

Sampling merits and demerits?


There are several methods by which we can obtain samples; the merits and
demerits of sampling on the whole are given below.
Merits:
1. Economical:
It is economical, because we do not have to collect all the data. Instead of
getting data from 5,000 farmers, we get it from only 50-100.

2. Less Time Consuming:


As the number of units is only a fraction of the total universe, the time
consumed is also a fraction of the total time. The number of units is
considerably smaller, and hence so is the time required.

3. Reliable:
If the sample is taken judiciously, the results are very reliable and
accurate.

4. Organisational Convenience:
As samples are taken and the number of units is smaller, better (trained)
enumerators can be employed by the organisation.

5. More Scientific:
According to Prof R.A. Fisher, “The sample technique has four
important advantages over census technique of data collection.
They are Speed, Economy, Adaptability and Scientific approach.”

It is based on certain laws such as:

(a) Law of Statistical Regularity

(b) Law of Inertia of Large numbers

(c) Law of Persistence

(d) Law of Validity.

6. Detailed Enquiry:
A detailed study can be undertaken in case of the units included in
the sample. Size of sample can be taken according to time and
money available with the investigator.

7. Indispensable Method:
If the universe is very big, there remains no option but to proceed with this
method. It is especially used for infinite, hypothetical and perishable
universes.

Demerits:

1. Absence of Being Representative:


Methods such as purposive sampling may not provide a sample
that is representative.

2. Wrong Conclusion:
If the sample is not representative, the results will not be correct.
These will lead to the wrong conclusions.

3. Small Universe:
Sometimes the universe is so small that proper samples cannot be taken
from it, because the number of units is too small.

4. Specialised Knowledge:
Sampling is a scientific method. Therefore, to get a good and representative
sample, one should have specialised knowledge of how to obtain a good sample
and how to perform proper analysis so that reliable results may be achieved.

5. Inherent defects:
The results achieved through the analysis of sample data may not be accurate,
as this method has inherent defects. There is not a single method of sampling
which has no demerit.

6. Sampling Error:
This method of sampling has many errors.

7. Personal Bias:
When the investigator chooses the sample, as in the convenience method,
there is a chance of personal bias creeping in.
What is a Sampling Distribution?


A sampling distribution of a statistic is a type of probability distribution created
by drawing many random samples of a given size from the same population.
These distributions help you understand how a sample statistic varies
from sample to sample.
Sampling distributions are essential for inferential statistics because they allow
you to understand a specific sample statistic in the broader context of other
possible values. Crucially, they let you calculate probabilities associated with
your sample.
Sampling distributions describe the assortment of values for all manner
of sample statistics. While the sampling distribution of the mean is the most
common type, they can characterize other statistics, such as the median,
standard deviation, range, correlation, and test statistics in hypothesis tests. I
focus on the mean in this post.
For this post, I’ll show you sampling distributions for both normal and
nonnormal data and demonstrate how they change with the sample size. I
conclude with a brief explanation of how hypothesis tests use them.
Let’s start with a simple example and move on from there!
Sampling Distribution of the Mean Example
For starters, I want you to fully understand the concept of a sampling
distribution. So, here’s a simple example!
Imagine you draw a random sample of 10 apples. Then you calculate the mean
of that sample as 103 grams. That’s one sample mean from one sample.
However, you realize that if you were to draw another sample, you’d obtain a
different mean. A third sample would produce yet another mean. And so on.
With this in mind, suppose you decide to collect 50 random samples of the
same apple population. Each sample contains 10 apples, and you calculate the
mean for each sample.
Repeated Apple Samples
At this point, you have 50 sample means for apple weights. You plot
these sample means in the histogram below to display your sampling
distribution of the mean.
This histogram shows us that our initial sample mean of 103 falls near the
center of the sampling distribution. Means occur in this range the most
frequently—18 of the 50 samples (36%) fall within the middle bar. However,
other samples from the same population have higher and lower means. The
frequency of means is highest in the sampling distribution center and tapers
off in both directions. None of our 50 sample means fall outside the range of
85-118. Consequently, it is very unusual to obtain sample means outside this
range.
Typically, you don’t know the population parameters. Instead, you use samples
to estimate them. However, we know the parameters for this simulation
because I’ve set the population to follow a normal distribution with a mean (µ)
weight of 100 grams and a standard deviation (σ) of 15 grams. Those are
the parameters of the apple population from which we’ve been sampling.
Notice how the histogram centers on the population mean of 100,
and sample means become rarer further away. It’s also a reasonably
symmetric distribution. Those are features of many sampling distributions. This
distribution isn’t particularly smooth because 50 samples is a small number for
this purpose, as you’ll see.
I used Excel to create this example. I had it randomly draw 50 samples with
a sample size of 10 from a population with µ = 100 and σ = 15.
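The same simulation is easy to reproduce in Python instead of Excel; a rough sketch using only the standard library (the exact numbers will differ from the Excel run):

import random
import statistics

random.seed(1)
mu, sigma = 100, 15                     # population parameters used above
sample_means = []
for _ in range(50):                     # 50 random samples
    sample = [random.gauss(mu, sigma) for _ in range(10)]   # each of size 10
    sample_means.append(statistics.mean(sample))

# The sample means cluster around mu, with spread close to sigma / sqrt(10)
print(statistics.mean(sample_means), statistics.stdev(sample_means))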
What is the Standard Error of the Sampling Distribution of a Sample Mean?


Standard Error: The standard error of the sampling distribution of a sample
mean is an estimate of how far a sample mean typically is from the population
mean. The standard error is equal to the standard deviation of the population
divided by the square root of the sample size.
We can use this definition and formula to calculate the standard error of the
sampling distribution of a sample mean, as in the example below.
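A minimal sketch, reusing the apple-population values from the previous section (σ = 15 g, n = 10):

import math

sigma = 15        # population standard deviation (apple example above)
n = 10            # sample size

standard_error = sigma / math.sqrt(n)
print(round(standard_error, 2))    # about 4.74 grams

# When sigma is unknown, the sample standard deviation s is used instead: s / sqrt(n).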

What is Estimation? Differentiate Between Point Estimation and Interval Estimation?
Estimation, in statistics, is any of numerous procedures used to calculate the
value of some property of a population from observations of a sample drawn
from the population. A point estimate, for example, is the single number most
likely to express the value of the property. An interval estimate defines a range
within which the value of the property can be expected (with a specified
degree of confidence) to fall. The 18th-century English theologian and
mathematician Thomas Bayes was instrumental in the development of Bayesian
estimation to facilitate revision of estimates on the basis of further
information.
Estimation in Statistics
In statistics, estimation refers to the process by which one makes inferences
about a population, based on information obtained from a sample.
Point Estimate vs. Interval Estimate
Statisticians use sample statistics to estimate population parameters. For
example, sample means are used to estimate population means; sample
proportions, to estimate population proportions.
An estimate of a population parameter may be expressed in two ways:
 Point estimate. A point estimate of a population parameter is a single
value of a statistic. For example, the sample mean x is a point estimate
of the population mean μ. Similarly, the sample proportion p is a point
estimate of the population proportion P.
 Interval estimate. An interval estimate is defined by two numbers,
between which a population parameter is said to lie. For
example, a < x < b is an interval estimate of the population mean μ. It
indicates that the population mean is greater than a but less than b.
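A short Python sketch of the two kinds of estimate, using a made-up sample and a normal-approximation 95% interval (a t critical value would be more precise for so small a sample):

import statistics
from statistics import NormalDist

sample = [52, 55, 57, 54, 58, 60, 53, 56, 59, 55]   # made-up heights in inches

x_bar = statistics.mean(sample)                      # point estimate of the population mean
s = statistics.stdev(sample)
n = len(sample)

z = NormalDist().inv_cdf(0.975)                      # about 1.96 for 95% confidence
margin = z * s / n ** 0.5
a, b = x_bar - margin, x_bar + margin                # interval estimate: a < mu < b
print(round(a, 2), round(b, 2))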
Explain Coefficient of Determination and Coefficient of Correlation?
Coefficient of Determination (R²) | Calculation & Interpretation
The coefficient of determination is a number between 0 and 1 that measures
how well a statistical model predicts an outcome.

Interpreting the coefficient of determination

Coefficient of determination (R²) Interpretation

0 The model does not predict the outcome.

Between 0 and 1 The model partially predicts the outcome.

1 The model perfectly predicts the outcome.

The coefficient of determination is often written as R², which is pronounced as
“r squared.” For simple linear regressions, a lowercase r is usually used instead
(r²).
Calculating the coefficient of determination
You can choose between two formulas to calculate the coefficient of
determination (R²) of a simple linear regression. The first formula is specific
to simple linear regressions, and the second formula can be used to calculate
the R² of many types of statistical models.
Formula 1: Using the correlation coefficient

R² = r²

where r = Pearson correlation coefficient


Example: Calculating R² using the correlation coefficient
You are studying the relationship between heart rate and age in children, and you
find that the two variables have a negative Pearson correlation. Squaring this
correlation coefficient gives the coefficient of determination (R²) by Formula 1.
Formula 2: Using the regression outputs

R² = 1 − (RSS / TSS)

where:
 RSS = sum of squared residuals
 TSS = total sum of squares

Example: Calculating R² using regression outputs
As part of performing a simple linear regression that predicts students’ exam scores
(dependent variable) from their study time (independent variable), you calculate the
RSS and the TSS. Substituting these values into Formula 2 gives the coefficient of
determination (R²).
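A short Python sketch showing that the two formulas give the same R² for a simple linear regression (requires Python 3.10+ for statistics.correlation and statistics.linear_regression; the study-time and exam-score values are made up):

import statistics

x = [1, 2, 3, 4, 5, 6]                 # made-up study time (hours)
y = [52, 55, 61, 64, 70, 72]           # made-up exam scores

# Formula 1: square the Pearson correlation coefficient
r = statistics.correlation(x, y)
r_squared_1 = r ** 2

# Formula 2: 1 - RSS / TSS from a fitted simple linear regression
slope, intercept = statistics.linear_regression(x, y)
predicted = [intercept + slope * xi for xi in x]
rss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
tss = sum((yi - statistics.mean(y)) ** 2 for yi in y)
r_squared_2 = 1 - rss / tss

print(round(r_squared_1, 4), round(r_squared_2, 4))   # the two formulas agree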

coefficient of correlation
Correlation Coefficient is a statistical concept which helps in establishing a
relation between predicted and actual values obtained in a statistical
experiment. The calculated value of the correlation coefficient explains the
exactness between the predicted and actual values.
The correlation coefficient always lies between -1 and +1. If the correlation
coefficient is positive, there is a similar, direct relation between the two
variables; if it is negative, it indicates a dissimilar, inverse relation between
the two variables.
The covariance of two variables divided by the product of their standard
deviations gives Pearson’s correlation coefficient. It is usually represented by ρ
(rho).
ρ(X,Y) = cov(X,Y) / (σX · σY)
Here cov is the covariance. σX is the standard deviation of X and σY is the
standard deviation of Y. The given equation for correlation coefficient can be
expressed in terms of means and expectations.
ρ(X,Y) = E[(X − μx)(Y − μy)] / (σx · σy)
μx and μy are mean of x and mean of y respectively. E is the expectation.
Pearson Correlation Coefficient Formula
The linear correlation coefficient defines the degree of relation between two
variables and is denoted by “r”. It is also called the cross-correlation coefficient,
as it predicts the relation between two quantities. Now let us proceed to a
statistical way of calculating the correlation coefficient.

If x and y are the two variables of discussion, then the correlation coefficient can
be calculated using the formula

r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²] [n∑y² − (∑y)²]}
Here,
n = Number of values or elements
∑x = Sum of 1st values list
∑y = Sum of 2nd values list
∑xy = Sum of the product of 1st and 2nd values
∑x² = Sum of squares of 1st values
∑y² = Sum of squares of 2nd values
How to find the Correlation Coefficient
Correlation is used almost everywhere in statistics. Correlation illustrates the
relationship between two or more variables. It is expressed in the form of a
number that is known as the correlation coefficient. There are mainly two
types of correlations:
 Positive Correlation
 Negative Correlation

Positive Correlation: The value of one variable increases linearly with an
increase in the other variable. This indicates a similar relation between both
the variables, so the correlation coefficient would be positive, or +1 in this case.

Negative Correlation: When there is a decrease in the values of one variable
with an increase in the values of the other variable, the correlation coefficient
would be negative.

Zero Correlation or No Correlation: There is one more situation, when there is
no specific relation between the two variables.
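A direct Python translation of the summation formula given above, with made-up paired data:

import math

x = [43, 21, 25, 42, 57, 59]      # made-up paired data
y = [99, 65, 79, 75, 87, 81]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 4))                 # a value between -1 and +1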
What Is Multiple Linear Regression (MLR)?


Multiple linear regression (MLR), also known simply as multiple regression, is a
statistical technique that uses several explanatory variables to predict the
outcome of a response variable. The goal of multiple linear regression is to
model the linear relationship between the explanatory (independent) variables
and response (dependent) variables. In essence, multiple regression is the
extension of ordinary least-squares (OLS) regression because it involves more
than one explanatory variable.

Assumptions for multiple regression analysis


 The variables considered for the model should be relevant and the
model should be reliable.
 The relationship modelled should be linear, not non-linear.
 The variables should be normally distributed.
 The variance should be constant for all levels of the predicted variable.
Benefits of multiple regression analysis
 Multiple regression analysis helps us to better study the various
predictor variables at hand.
 It increases reliability by avoiding dependence on just one variable, since
more than one independent variable supports the outcome.
 Multiple regression analysis permits you to study more elaborate
hypotheses.
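A minimal sketch of a multiple regression fitted by ordinary least squares with NumPy (assumes numpy is installed; the data are made up):

import numpy as np

# Made-up data: two explanatory variables (x1, x2) and one response (y)
X = np.array([[1, 2.0, 3.0],
              [1, 1.0, 4.0],
              [1, 3.5, 2.0],
              [1, 2.5, 5.0],
              [1, 4.0, 1.5]])          # leading column of ones gives the intercept
y = np.array([12.0, 11.5, 14.0, 15.5, 13.0])

# Ordinary least-squares fit of y = b0 + b1*x1 + b2*x2
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefficients)                    # [b0, b1, b2]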
What Is Stratified Random Sampling?
Stratified random sampling is a method of sampling that involves the division
of a population into smaller subgroups known as strata. In stratified random
sampling, or stratification, the strata are formed based on members’ shared
attributes or characteristics, such as income or educational attainment.
Stratified random sampling has numerous applications and benefits, such as
studying population demographics and life expectancy.
Stratified random sampling is also called proportional random sampling or
quota random sampling.
Example of Stratified Random Sampling
Suppose a research team wants to determine the grade point average (GPA) of
college students across the United States. The research team has difficulty
collecting data from all 21 million college students; it decides to take a random
sample of the population by using 4,000 students.
Now assume that the team looks at the different attributes of the sample
participants and wonders if there are any differences in GPAs and students’
majors. Suppose it finds that 560 students are English majors, 1,135 are
science majors, 800 are computer science majors, 1,090 are engineering
majors, and 415 are math majors. The team wants to use a proportional
stratified random sample where the stratum of the sample is proportional to
the random sample in the population.
Assume the team researches the demographics of college students in the U.S.
and finds the percentage of what students major in: 12% major in English, 28%
major in science, 24% major in computer science, 21% major in engineering,
and 15% major in mathematics. Thus, five strata are created from the stratified
random sampling process.
The team then needs to confirm that the stratum of the population is in
proportion to the stratum in the sample; however, they find the proportions
are not equal. The team then needs to resample 4,000 students from the
population and randomly select 480 English, 1,120 science, 960 computer
science, 840 engineering, and 600 mathematics students.
With those groups, the team has a proportionate stratified random sample of college
students, which provides a better representation of students’ college majors in
the U.S. The researchers can then highlight a specific stratum, observe the
varying types of studies of U.S. college students and observe the various GPAs.
Types of Allocation of Sample Sizes
In stratified sampling, the size of the sample from each stratum is chosen by
the sampler, or to put it another way, given a total sample size n = n1 + n2 + …
+ nh + … + nk, a choice can be made on how to allocate the sample among the k
strata. There are rules governing how a sample from a given stratum should be
taken. Sample size should be larger in strata that are larger, with greater
variability and where sampling has lower cost. If the strata are of the same size
and there is no information about the variability of the population, a
reasonable choice would be to assign equal sample sizes to all strata.
4.5.1 Proportional allocation
Let n be the total size of the sample to be taken.
If the strata sizes are different, proportional allocation could be used to
maintain a steady sampling fraction throughout the population. The total
sample size, n, should be allocated to the strata proportionally to their sizes:

nh = n × Nh / N

where Nh is the size of stratum h and N is the total size of the population.
4.5.2 Optimum allocation
Optimum allocation takes into consideration both the sizes of the strata and
the variability inside the strata. In order to obtain the minimum sampling
variance the total sample size should be allocated to the strata proportionally
to their sizes and also to the standard deviation of their values, i.e. to the
square root of the variances.
nh = constant × Nh sh

Given that the stratum sample sizes must add up to n, the constant is
n / Σ(Nh sh), so that

nh = n × (Nh sh) / Σ(Nh sh)
where n is total sample size, nh is the sample size in stratum h, Nh is the size
of stratum h and sh is the square root of the variance in stratum h.
4.5.3 Optimum allocation with variable cost
In some sampling situations, the cost of sampling in terms of time or money is
composed of a fixed part and of a variable part depending on the stratum.
The sampling cost function is thus of the form:

C = c0 + Σ(ch nh)
where C is the total cost of the sampling, c0 is an overhead cost and ch is the
cost per sampling unit in stratum h, which may vary from stratum to stratum.
The optimum allocation of the sample to the strata in this situation is
allocating sample size to the strata proportional to the size, and the standard
error, and inversely proportional to the cost of sampling in each stratum. This
gives the following sample size for stratum h:

nh = n × (Nh sh / √ch) / Σ(Nh sh / √ch)
Very often, it is the total cost of the sampling, rather than the total sample
size, that is fixed. This is usually the case with research vessel surveys, in which
the number of days is fixed beforehand. In this case, the optimum allocation of
sample size among strata is

nh = (C − c0) × (Nh sh / √ch) / Σ(Nh sh √ch)
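A small Python sketch comparing proportional and optimum (Neyman) allocation for made-up strata:

strata_sizes = [5000, 3000, 2000]      # made-up stratum sizes N_h
strata_sds = [12.0, 20.0, 8.0]         # made-up stratum standard deviations s_h
n = 400                                # total sample size

# Proportional allocation: n_h = n * N_h / N
N = sum(strata_sizes)
proportional = [round(n * Nh / N) for Nh in strata_sizes]

# Optimum (Neyman) allocation: n_h proportional to N_h * s_h
weights = [Nh * sh for Nh, sh in zip(strata_sizes, strata_sds)]
optimum = [round(n * w / sum(weights)) for w in weights]

print(proportional, optimum)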

systematic and cluster sampling?


Systematic Sampling
Systematic sampling is a random probability sampling method. It's one of the
most popular and common methods used by researchers and analysts. This
method involves selecting samples from a larger group. While the starting
point may be random, the sampling involves the use of fixed intervals between
each member.
Example of Systematic Sampling
The goal of systematic sampling is to obtain an unbiased sample. This is
achieved by assigning a number to every participant in the population and then
selecting members at the same designated interval throughout the population
to create the sample.
For example, you could choose every 5th participant or every 20th participant,
but you must use the same interval throughout the population. The process of
selecting every nth member is systematic sampling.
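A minimal Python sketch of the procedure (hypothetical population of 1,000 numbered participants):

import random

population = list(range(1, 1001))              # 1,000 numbered participants (made up)
sample_size = 50
interval = len(population) // sample_size      # fixed interval k = 20

start = random.randrange(interval)             # random start within the first interval
sample = population[start::interval]           # every k-th member thereafter
print(len(sample), sample[:5])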
Cluster Sampling
Cluster sampling is another type of random statistical measure. This method is
used when there are different subsets of groups present in a larger population.
These groups are known as clusters. Cluster sampling is commonly used
by marketing groups and professionals.
Cluster sampling is a two-step procedure. First, the entire population is
selected and separated into different clusters. Random samples are then
chosen from these subgroups. For example, a researcher may find it difficult to
construct the entire population of customers of a grocery store to interview.
However, they may be able to create a random subset of stores; this
represents the first step in the process. The second step is to interview a
random sample of the customers of those stores.
Example of Cluster Sampling
For example, say an academic study is being conducted to determine how
many employees at investment banks hold MBAs, and of those MBAs, how
many are from Ivy League schools. It would be difficult for the statistician to go
to every investment bank and ask every single employee their educational
background. To achieve the goal, a statistician can employ cluster sampling.
The first step would be to form a cluster of investment banks. Rather than
study every investment bank, the statistician can choose to study the top three
largest investment banks based on revenue, forming the first cluster. From
there, rather than interviewing every employee in all three investment banks,
a statistician could form another cluster, which would include employees from
only certain departments, for example, sales and trading or mergers and
acquisitions.
What is a Rejection Region?

[Figure: the rejection regions in a two-tailed t-distribution. Image: ETSU.edu]

A rejection region (also called a critical region) is an area of a graph where you
would reject the null hypothesis if your test results fall into that area. In other
words, if your results fall into that area then they are statistically significant.
The main purpose of statistics is to test theories or results from experiments.
For example, you might have invented a new fertilizer that you think makes
plants grow 50% faster. In order to prove your theory is true, your experiment
must:
1. Be repeatable.
2. Be compared to a known fact about plants (in this example, probably the
average growth rate of plants without the fertilizer).
Acceptance Region:
In hypothesis testing, the test procedure partitions all the possible sample
outcomes into two subsets (on the basis of whether the observed value of the
test statistic is smaller than a threshold value or not). The subset that is
considered to be consistent with the null hypothesis is called the "acceptance
region"; another subset is called the "rejection region" (or "critical region").
If the sample outcome falls into the acceptance region, then the null
hypothesis is accepted. If the sample outcome falls into the rejection region,
then the null hypothesis is rejected (i.e. the alternative hypothesis is accepted).
We call this type of statistical testing a hypothesis test. The rejection region is
a part of the testing process. Specifically, it is an area of probability that tells
you if your theory (your “hypothesis”) is probably true.
All possible values which a test-statistic may assume can be divided into two
mutually exclusive groups: one group consisting of values which appear to be
consistent with the null hypothesis and the other having values which are
unlikely to occur if Ho is true. The first group is called the acceptance region
and the second set of values is known as the rejection region for a test. The
rejection region is also called the critical region. The value(s) that separates the
critical region from the acceptance region is called the critical value(s). The
critical value, which can be in the same units as the parameter or in the
standardized units, is to be decided by the experimenter keeping in view the
degree of confidence they are willing to have in the null hypothesis.
Hypothesis Testing
Hypothesis testing is a formal procedure for investigating our ideas about the
world using statistics. It is most often used by scientists to test specific
predictions, called hypotheses, that arise from theories.
There are 5 main steps in hypothesis testing:
1. State your research hypothesis as a null hypothesis (H0) and an alternate
hypothesis (Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
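A minimal sketch of steps 3 and 4 as a one-sample, two-tailed z-test with made-up numbers (standard library only):

from statistics import NormalDist

# Made-up one-sample, two-tailed z-test: H0: mu = 50 versus Ha: mu != 50
mu_0, sigma, n = 50, 10, 36
sample_mean = 53.2

z = (sample_mean - mu_0) / (sigma / n ** 0.5)   # test statistic
critical = NormalDist().inv_cdf(0.975)          # critical value for alpha = 0.05, two-tailed

if abs(z) > critical:
    print(f"z = {z:.2f} lies in the rejection region: reject H0")
else:
    print(f"z = {z:.2f} lies in the acceptance region: fail to reject H0")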
What is an Estimator?
An estimator is a statistic that estimates some fact about the population. You
can also think of an estimator as the rule that creates an estimate. For
example, the sample mean(x̄) is an estimator for the population mean, μ.
The quantity that is being estimated (i.e. the one you want to know) is called
the estimand. For example, let’s say you wanted to know the average height of
children in a certain school with a population of 1000 students. You take a
sample of 30 children, measure them and find that the mean height is 56
inches. This is your sample mean, the estimator. You use the sample mean to
estimate that the population mean (your estimand) is about 56 inches.
Point vs. Interval
Estimators can be a range of values (like a confidence interval) or a single value
(like the standard deviation). When an estimator is a range of values, it’s called
an interval estimate. For the height example above, you might add on a
confidence interval of a couple of inches either way, say 54 to 58 inches. When
it is a single value — like 56 inches — it’s called a point estimate.
Types
Estimators can be described in several ways (click on the bold word for the
main article on that term):
Biased: a statistic that is either an overestimate or an underestimate.
Efficient: a statistic with small variances (the one with the smallest possible
variance is also called the “best”). Inefficient estimators can give you good
results as well, but they usually require much larger samples.
Invariant: statistics that are not easily changed by transformations, like simple
data shifts.
Shrinkage: a raw estimate that’s improved by combining it with other
information. See also: The James-Stein estimator.
Sufficient: a statistic that estimates the population parameter as well as if you
knew all of the data in all possible samples.
Unbiased: an accurate statistic that neither underestimates nor overestimates.
What is a Point Estimate?
In simple terms, any statistic can be a point estimate. A statistic is
an estimator of some parameter in a population. For example:
 The sample standard deviation (s) is a point estimate of the
population standard deviation (σ).
 The sample mean (x̄) is a point estimate of the population mean, μ.
 The sample variance (s2) is a point estimate of the population
variance (σ2).
In more formal terms, the estimate occurs as a result of point estimation
applied to a set of sample data. Points are single values, in comparison
to interval estimates, which are a range of values. For example, a confidence
interval is one example of an interval estimate.
Finding the Estimates
Four of the most common ways to find an estimate:
 The Method of Moments: based on the law of large numbers, it uses
relatively simple equations to find point estimates. It is often not very
accurate and has a tendency to be biased.
 Maximum Likelihood: uses a model (for example, the normal
distribution) and the values in the model to maximize a likelihood
function. This results in the most likely parameter values for the inputs
selected.
 Bayes Estimators: minimize the average risk (an expectation of random
variables).
 Best Unbiased Estimators: several unbiased estimators can be used to
approximate a parameter. Which one is “best” depends on what
parameter you are trying to find. For example, with variance,
the estimator with the smallest variance is “best”.
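As a small illustration of maximum likelihood: for a normal model the maximum-likelihood estimates are the sample mean and the divide-by-n variance (the method of moments gives the same values for this model). A sketch with made-up data:

import statistics

data = [4.1, 5.0, 4.6, 5.3, 4.8, 5.1]        # made-up sample

mu_hat = statistics.mean(data)                # MLE of the mean
sigma2_hat = statistics.pvariance(data, mu_hat)   # MLE of the variance (divide by n)
print(mu_hat, sigma2_hat)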
Define and elaborate the scope of statistics, clearly bringing out its
relationship with economics?
What is Statistics?
Statistics may be defined as the collection, presentation, analysis and
interpretation of numerical data.
Statistics is a set of decision-making techniques which helps businessmen in
making suitable policies from the available data. In fact, every businessman
needs a sound background of statistics as well as of mathematics.
The purpose of statistics and mathematics is to manipulate, summarize and
investigate data so that the useful decision-making results can be executed.
Uses of Statistics in Business Decision Making
Uses of Statistics in Business
The following are the main uses of statistics in various business activities:
 With the help of statistical methods, quantitative information about
production, sale, purchase, finance, etc. can be obtained. This type of
information helps businessmen in formulating suitable policies.
 By using the techniques of time series analysis, which are based on
statistical methods, the businessman can predict the effect of a large
number of variables with a fair degree of accuracy.
 In business decision theory, most statistical techniques are used in
taking business decisions, which helps in doing business with less
uncertainty.
 Nowadays, a large part of modern business is being organised around
systems of statistical analysis and control.
 By using ‘Bayesian Decision Theory’, businessmen can select the
optimal decision through direct evaluation of the payoff for each
alternative course of action.
Uses of Mathematics for Decision Making
 The number of defects in a roll of paper, a bale of cloth, or a sheet of
photographic film can be judged by means of a control chart based on the
normal distribution.
 In statistical quality control, we analyse data on the basis of the
principles involved in the normal curve.
Uses of Statistics in Economics
Statistics is the basis of economics. The consumer’s maximum satisfaction can
be determined on the basis of data pertaining to income and expenditure. The
various laws of demand depend on the data concerning price and quantity. The
price of a commodity is well determined on the basis of data relating to its
buyers, sellers, etc.

Define relative merits and demerits of mean, median and mode?


MERITS AND DEMERITS OF MEAN, MEDIAN AND MODE


MEAN

The arithmetic mean (or simply "mean") of a sample is the sum of the sampled
values divided by the number of items in the sample.

MERITS OF ARITHEMETIC MEAN


• The arithmetic mean is rigidly defined by an algebraic formula.
• It is easy to calculate and simple to understand.
• It is based on all observations and can be regarded as representative of the
given data.
• It is capable of being treated mathematically and hence is widely used in
statistical analysis.
• The arithmetic mean can be computed even if the detailed distribution is not
known, provided the sum of the observations and the number of observations
are known.
• It is least affected by fluctuations of sampling.

DEMERITS OF ARITHMETIC MEAN


• It can neither be determined by inspection nor by graphical location.
• The arithmetic mean cannot be computed for qualitative data, such as data on
intelligence, honesty or smoking habits.
• It is too much affected by extreme observations and hence does not
adequately represent data containing some extreme values.
• The arithmetic mean cannot be computed when class intervals have open ends.
Median:

The median is that value of the series which divides the group into two equal
parts, one part comprising all values greater than the median value and the
other part comprising all the values smaller than the median value.

Merits of median

(1) Simplicity: It is a very simple measure of the central tendency of the series. In
the case of a simple statistical series, just a glance at the data is enough to locate
the median value.

(2) Free from the effect of extreme values: - Unlike arithmetic mean, median
value is not destroyed by the extreme values of the series.

(3) Certainty: Certainty is another merit of the median. The median is
always a certain specific value in the series.

(4) Real value: The median is a real value and is a better representative value
of the series compared to the arithmetic mean, whose value may
not exist in the series at all.

(5) Graphic presentation: - Besides algebraic approach, the median value can
be estimated also through the graphic presentation of data.

(6) Possible even when data is incomplete: - Median can be estimated even in
the case of certain incomplete series. It is enough if one knows the number of
items and the middle item of the series.

Demerits of median:

Following are the various demerits of median:

(1) Lack of representative character: The median fails to be a representative
measure in the case of series whose different values are wide apart
from each other. Also, the median is of limited representative character as it is not
based on all the items in the series.

(2) Unrealistic:- When the median is located somewhere between the two
middle values, it remains only an approximate measure, not a precise value.

(3) Lack of algebraic treatment: The arithmetic mean is capable of further
algebraic treatment, but the median is not. For example, multiplying the median
by the number of items in the series will not give us the sum total of the
values of the series.

However, the median is quite a simple method of finding an average of a series.
It is quite a commonly used measure in the case of series related to
qualitative observations, such as the health of students.

Mode:

The value of the variable which occurs most frequently in a distribution is
called the mode.

Merits of mode:

Following are the various merits of mode:

(1) Simple and popular: The mode is a very simple measure of central tendency.
Sometimes, just a glance at the series is enough to locate the modal value. Because of
its simplicity, it is a very popular measure of central tendency.

(2) Less effect of marginal values: Compared to the mean, the mode is less affected
by marginal values in the series. The mode is determined only by the value with
the highest frequency.

(3) Graphic presentation: The mode can be located graphically, with the help of
a histogram.

(4) Best representative: The mode is the value which occurs most frequently in
the series. Accordingly, the mode is the best representative value of the series.

(5) No need of knowing all the items or frequencies: - The calculation of mode
does not require knowledge of all the items and frequencies of a distribution.
In simple series, it is enough if one knows the items with highest frequencies in
the distribution.

Demerits of mode:

Following are the various demerits of mode:

(1) Uncertain and vague: - Mode is an uncertain and vague measure of the
central tendency.

(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable of
further algebraic treatment.

(3) Difficult: When the frequencies of all items are identical, it is difficult to identify
the modal value.

(4) Complex procedure of grouping: Calculation of the mode involves a
cumbersome procedure of grouping the data. If the extent of grouping
changes, there will be a change in the modal value.

(5) Ignores extreme marginal frequencies: It ignores extreme marginal
frequencies. To that extent the modal value is not a representative value of all the
items in a series. Besides, one can question the representative character of the
modal value as its calculation does not involve all items of the series.
Consumer price index numbers and the steps in their construction?


Consumer Price Index Numbers
Consumer price index numbers measure the changes in the prices paid by
consumers for a special “basket” of goods and services during the current year as
compared to the base year. The basket of goods and services will contain items
like (1) Food (2) Rent (3) Clothing (4) Fuel and Lighting (5) Education (6)
Miscellaneous like cleaning, transport, newspapers, etc. Consumer price index
numbers are also called cost of living index numbers or retail price index
numbers.
Construction of Consumer Price Index Numbers
The following steps are involved in the construction of consumer price index
numbers.
(1) Class of People
The first step in the construction of the consumer price index (CPI) is that the class
of people should be defined clearly. It should be decided whether the cost of
living index number is being prepared for industrial workers, or middle or lower
class salaried people living in a particular area. It is therefore necessary to specify
the class of people and locality where they reside.
(2) Family Budget Inquiry
The next step in the construction of a consumer price index number is that some
families should be selected randomly. These families provide information about
the cost of food, clothing, rent, miscellaneous, etc. The inquiry includes questions
on family size, income, the quality and quantity of resources consumed and the
money spent on them, and the weights are assigned in proportions to the
expenditure on different items.
(3) Price Data
The next step is to collect data on the retail prices of the selected commodities for
the current period and the base period. These prices should be obtained
from the shops situated in the locality for which the index numbers are prepared.
(4) Selection of Commodities
The next step is the selection of the commodities to be included. We should select
those commodities which are most often used by that class of people.
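Once the weights and the price data have been collected, the index itself is a weighted average of price relatives. A small illustrative sketch in Python; the items, weights and prices are entirely made up:

# Made-up family-budget data for one class of consumers:
# item: (weight, base-year price, current-year price)
items = {
    "Food":              (40, 250, 300),
    "Rent":              (25, 500, 550),
    "Clothing":          (15, 100, 120),
    "Fuel and Lighting": (10, 60, 75),
    "Miscellaneous":     (10, 80, 88),
}

# Price relative for each item I = (p1 / p0) * 100, then the weighted average
weighted_sum = sum(w * (p1 / p0) * 100 for w, p0, p1 in items.values())
total_weight = sum(w for w, _, _ in items.values())

cpi = weighted_sum / total_weight
print(round(cpi, 1))       # consumer price index for the current year (base year = 100)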
Explain the Concept of Absolute Measure of Dispersion and Relative Measure of Dispersion?
Absolute Measure of Dispersion
An absolute measure of dispersion contains the same unit as the original data
set. The absolute dispersion method expresses the variations in terms of the
average of deviations of observations, like the standard or mean deviation. It
includes range, standard deviation, quartile deviation, etc.
The types of absolute measures of dispersion are:
1. Range: It is simply the difference between the maximum value and the
minimum value given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Deduct the mean from each value in the set, square each
deviation, add the squares, and finally divide by the total number of values
in the data set to get the variance. Variance: σ² = ∑(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the
standard deviation, i.e. S.D. = √σ² = σ.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a
list of numbers into quarters. The quartile deviation is half of the
distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the
mean and the arithmetic mean of the absolute deviations of the
observations from a measure of central tendency is known as the mean
deviation (also called mean absolute deviation).
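
The short sketch below illustrates these absolute measures on a small made-up data set; the quartile convention used is only one of several common textbook conventions.

```python
# Sketch of the absolute measures listed above, for a small made-up data set.
data = [1, 3, 5, 6, 7]
n = len(data)
data_sorted = sorted(data)

value_range = data_sorted[-1] - data_sorted[0]            # Range

mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n         # sigma^2 = sum((X - mu)^2) / N
std_dev = variance ** 0.5                                 # S.D. = sqrt(variance)

mean_deviation = sum(abs(x - mean) for x in data) / n     # mean absolute deviation

def quantile(sorted_vals, q):
    # Linear-interpolation quantile, one of several common conventions.
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

q1 = quantile(data_sorted, 0.25)
q3 = quantile(data_sorted, 0.75)
quartile_deviation = (q3 - q1) / 2                        # half the interquartile range

print(value_range, variance, std_dev, mean_deviation, quartile_deviation)
```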

Relative Measure of Dispersion


The relative measures of dispersion are used to compare the distribution of
two or more data sets. This measure compares values without units. Common
relative dispersion methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
3. Co-efficient of Standard Deviation

4. Co-efficient of Quartile Deviation


5. Co-efficient of Mean Deviation
Co-efficient of Dispersion
The coefficients of dispersion are calculated (along with the measure of
dispersion) when two series that differ widely in their averages are compared.
The dispersion coefficient is also used when two series with different
measurement units are compared. It is denoted as C.D.
The common coefficients of dispersion are:

C.D. in terms of | Coefficient of dispersion (C.D.)
Range | C.D. = (Xmax – Xmin) ⁄ (Xmax + Xmin)
Quartile Deviation | C.D. = (Q3 – Q1) ⁄ (Q3 + Q1)
Standard Deviation (S.D.) | C.D. = S.D. ⁄ Mean
Mean Deviation | C.D. = Mean Deviation ⁄ Average


Difference between binomial and poisson distribution?



Binomial vs. Poisson Distribution: Similarities & Differences

Two distributions that are similar in statistics are the Binomial distribution and
the Poisson distribution.
This tutorial provides a brief explanation of each distribution along with the
similarities and differences between the two.
The Binomial Distribution
The Binomial distribution describes the probability of obtaining k successes
in n binomial experiments.
If a random variable X follows a binomial distribution, then the probability
that X = k successes can be found by the following formula:
P(X = k) = nCk · p^k · (1 − p)^(n − k)
where:
 n: number of trials
 k: number of successes
 p: probability of success on a given trial
 nCk: the number of ways to obtain k successes in n trials

For example, suppose we flip a coin 3 times. We can use the formula above to
determine the probability of obtaining 0 heads during these 3 flips:
P(X = 0) = 3C0 · (0.5)^0 · (1 − 0.5)^(3 − 0) = 1 · 1 · (0.5)^3 = 0.125
The Poisson Distribution
The Poisson distribution describes the probability of experiencing k events
during a fixed time interval.
If a random variable X follows a Poisson distribution, then the probability
that X = k events can be found by the following formula:
P(X = k) = (λ^k · e^(−λ)) / k!
where:
 λ: mean number of successes that occur during a specific interval
 k: number of successes

 e: a constant equal to approximately 2.71828


For example, suppose a particular hospital experiences an average of 2 births
per hour. We can use the formula above to determine the probability of
experiencing 3 births in a given hour:
P(X = 3) = (2^3 · e^(−2)) / 3! = 0.18045
Similarities & Differences
The Binomial and Poisson distribution share the following similarities:
 Both distributions can be used to model the number of occurrences of
some event.
 In both distributions, events are assumed to be independent.
The distributions share the following key difference:
 In a Binomial distribution, there is a fixed number of trials (e.g. flip a coin
3 times)
 In a Poisson distribution, there could be any number of events that occur
during a certain time interval (e.g. how many customers will arrive at a
store in a given hour?)
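
As a quick check of the two worked examples above, the sketch below evaluates the same probabilities with scipy.stats (assuming SciPy is available); the results should match the hand calculations up to rounding.

```python
# Sketch reproducing the two examples above with scipy.stats.
from scipy.stats import binom, poisson

p_binomial = binom.pmf(k=0, n=3, p=0.5)   # P(0 heads in 3 fair coin flips) = 0.125
p_poisson = poisson.pmf(k=3, mu=2)        # P(3 births | mean of 2 per hour), about 0.1804

print(p_binomial, p_poisson)
```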
What Is Subjective Probability?
Subjective probability is a type of probability derived from an individual's
personal judgment or own experience about whether a specific outcome is
likely to occur. It contains no formal calculations and only reflects the subject's
opinions and past experience. An example of subjective probability is a "gut
instinct" when making a trade.

KEY TAKEAWAYS
Subjective probability is a type of probability derived from an individual's
personal judgment or own experience about whether a specific outcome is
likely to occur.
It contains no formal calculations and only reflects the subject's opinions and
past experience rather than on data or computation.

Subjective probabilities differ from person to person and contain a high degree
of personal bias.
Example of Subjective Probability
An example of subjective probability is asking New York Yankees fans, before
the baseball season starts, about the chances of New York winning the World
Series. While there is no absolute mathematical proof behind the answer to
the example, fans might still reply in actual percentage terms, such as the
Yankees having a 25% chance of winning the World Series.
What Is Conditional Probability?
Conditional probability is defined as the likelihood of an event or outcome
occurring, based on the occurrence of a previous event or outcome.
The joint probability of both events is calculated by multiplying the probability of the
preceding event by the updated (conditional) probability of the succeeding
event.
Conditional probability can be contrasted with unconditional probability.
Unconditional probability refers to the likelihood that an event will take place
irrespective of whether any other events have taken place or any other
conditions are present.
KEY TAKEAWAYS
 Conditional probability refers to the chances that some outcome occurs
given that another event has also occurred.
 It is often stated as the probability of B given A and is written as P(B|A),
where the probability of B depends on that of A happening.
 Conditional probability can be contrasted with unconditional probability.
 Probabilities are classified as either conditional, marginal, or joint.
 Bayes' theorem is a mathematical formula used in calculating conditional
probability.

Example of Conditional Probability



For example, suppose a student is applying for admission to a


university and hopes to receive an academic scholarship. The school to which
they are applying accepts 100 of every 1,000 applicants (10%) and awards
academic scholarships to 10 of every 500 students who are accepted (2%).
Of the scholarship recipients, 50% of them also receive university stipends for
books, meals, and housing. For the students, the chance of being
accepted and then receiving a scholarship is 0.2% (0.1 × 0.02). The chance of being
accepted, receiving the scholarship, and then also receiving a stipend for
books, etc. is 0.1% (0.1 × 0.02 × 0.5).
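
The admission example can be reproduced as a short chain of multiplications, as in the sketch below; the figures are the ones quoted above.

```python
# Sketch of the admission example using the multiplication rule
# P(A and B) = P(A) * P(B | A), with the figures quoted above.
p_accepted = 100 / 1000                        # P(accepted) = 10%
p_scholarship_given_accepted = 10 / 500        # P(scholarship | accepted) = 2%
p_stipend_given_scholarship = 0.5              # P(stipend | scholarship) = 50%

p_accepted_and_scholarship = p_accepted * p_scholarship_given_accepted
p_all_three = p_accepted_and_scholarship * p_stipend_given_scholarship

print(p_accepted_and_scholarship)   # 0.002, i.e. 0.2%
print(p_all_three)                  # 0.001, i.e. 0.1%
```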
Definition of Type I Error
In statistics, type I error is defined as an error that occurs when the sample
results cause the rejection of the null hypothesis, in spite of the fact that it is
true. In simple terms, the error of agreeing to the alternative hypothesis, when
the results can be ascribed to chance.
Also known as the alpha error, it leads the researcher to infer that there is a
variation between two observations when they are identical. The likelihood of
a Type I error is equal to the level of significance that the researcher sets for the
test. Here the level of significance refers to the chance of making a Type I error.
E.g. Suppose on the basis of data, the research team of a firm concluded that
more than 50% of the total customers like the new service started by the
company, which is, in fact, less than 50%.
Definition of Type II Error
When on the basis of data, the null hypothesis is accepted, when it is actually
false, then this kind of error is known as Type II Error. It arises when the
researcher fails to deny the false null hypothesis. It is denoted by Greek letter
‘beta (β)’ and often known as beta error.
Type II error is the failure of the researcher in agreeing to an alternative
hypothesis, although it is true. It validates a proposition; that ought to be
refused. The researcher concludes that the two observances are identical
when in fact they are not.
The likelihood of making such an error (β) is inversely related to the power of the
test: β equals one minus the power. Here, the power of the test refers to the
probability of rejecting a null hypothesis which is false and needs to be rejected.
As the sample size increases, the power of the test also increases, which results
in a reduction in the risk of making a Type II error.
E.g. Suppose on the basis of sample results, the research team of an
organisation claims that less than 50% of the total customers like the new
service started by the company, which is, in fact, greater than 50%.
Key Differences Between Type I and Type II Error
The points given below are substantial so far as the differences between Type I
and Type II errors are concerned:
1. Type I error is an error that takes place when the outcome is a rejection
of null hypothesis which is, in fact, true. Type II error occurs when the
sample results in the acceptance of null hypothesis, which is actually
false.
2. A Type I error is otherwise known as a false positive: in essence, a
positive result is equivalent to the rejection of the null hypothesis. In
contrast, a Type II error is known as a false negative: a negative
result leads to the acceptance of the null hypothesis.
3. When the null hypothesis is true but mistakenly rejected, it is type I
error. As against this, when the null hypothesis is false but erroneously
accepted, it is type II error.
4. Type I error tends to assert something that is not really present, i.e. it is
a false hit. On the contrary, type II error fails in identifying something,
that is present, i.e. it is a miss.
5. The probability of committing a Type I error is the same as the level of
significance. Conversely, the likelihood of committing a Type II error (β)
equals one minus the power of the test.
6. The Greek letter ‘α’ indicates a Type I error, unlike a Type II error, which is
denoted by the Greek letter ‘β’.

BASIS FOR COMPARISON | TYPE I ERROR | TYPE II ERROR
Meaning | Type I error refers to non-acceptance of a hypothesis which ought to be accepted. | Type II error is the acceptance of a hypothesis which ought to be rejected.
Equivalent to | False positive | False negative
What is it? | It is incorrect rejection of a true null hypothesis. | It is incorrect acceptance of a false null hypothesis.
Represents | A false hit | A miss
Probability of committing error | Equals the level of significance. | Equals β, i.e. one minus the power of the test.
Indicated by | Greek letter ‘α’ | Greek letter ‘β’
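
A small simulation can make the Type I error idea concrete: when the null hypothesis is true, the long-run proportion of rejections should be close to the chosen significance level α. The sketch below is only an illustration with made-up parameters and assumes NumPy and SciPy are available.

```python
# Made-up simulation: when H0 (equal means) is true, the fraction of rejections
# approaches the significance level alpha, i.e. the Type I error probability.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, rejections = 0.05, 5000, 0

for _ in range(trials):
    a = rng.normal(loc=0, scale=1, size=30)   # both samples come from the same population,
    b = rng.normal(loc=0, scale=1, size=30)   # so the null hypothesis is actually true
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        rejections += 1                       # rejecting a true H0: a Type I error

print(rejections / trials)                    # should be close to 0.05
```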



What is the difference between Simple Linear Regression and Multi Linear
Regression?

Simple Linear Regression


Simple Linear Regression establishes the relationship between two variables
using a straight line. It attempts to draw a line that comes closest to the data
by finding the slope and intercept which define the line and minimize
regression errors. Simple linear regression has only one x and one y variable.
Multiple Linear Regression
Multiple linear regression is based on the assumption that there is a linear
relationship between the dependent variable (target) and the independent
variables (predictors). It also assumes that there is no major
correlation between the independent variables. The modelled relationships can
be linear or, with transformed predictors, nonlinear. It has one y and two or
more x variables, i.e. one dependent variable and two or more independent variables.
What is the difference between simple linear and multiple linear regression?
Simple linear regression has only one x and one y variable.
Multiple linear regression has one y and two or more x variables.
For instance, when we predict rent based on square feet alone that is simple
linear regression.
When we predict rent based on square feet and age of the building that is an
example of multiple linear regression.
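
A minimal sketch of this rent example, assuming scikit-learn is available (all data values below are invented), might look like this:

```python
# Sketch of the rent example with scikit-learn; all data values are invented.
from sklearn.linear_model import LinearRegression

square_feet = [[500], [750], [1000], [1250], [1500]]
age_years = [10, 5, 20, 2, 8]
rent = [800, 1150, 1300, 1900, 2100]

# Simple linear regression: one x (square feet) and one y (rent).
simple = LinearRegression().fit(square_feet, rent)

# Multiple linear regression: two x variables (square feet, building age) and one y.
X_multi = [[sf[0], age] for sf, age in zip(square_feet, age_years)]
multi = LinearRegression().fit(X_multi, rent)

print(simple.predict([[1100]]))      # rent predicted from area alone
print(multi.predict([[1100, 15]]))   # rent predicted from area and age
```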
Define Probability?
Probability Definition in Math
Probability is a measure of the likelihood of an event occurring. Many events
cannot be predicted with total certainty; we can only predict the chance of an
event occurring, i.e. how likely it is to happen. Probability can range from 0 to 1,
where 0 means the event is impossible and 1 indicates a certain event. The
probability of all the events in a sample space adds up to 1.
For example, when we toss a coin, we get either Head or Tail; only two
outcomes are possible (H, T). But when two coins are tossed there
will be four possible outcomes, i.e. {(H, H), (H, T), (T, H), (T, T)}.
Formula for Probability

The probability formula is defined as follows: the probability of an event
is equal to the ratio of the number of favourable outcomes to the total number
of outcomes.

Probability of an event: P(E) = Number of favourable outcomes / Total number of outcomes

Sometimes students mistake a “favourable outcome” for a “desirable
outcome”. This is the basic formula, but there are further formulas for
different situations or events.
Probability Tree
The tree diagram helps to organize and visualize the different possible
outcomes. Branches and ends of the tree are two main positions. Probability of
each branch is written on the branch, whereas the ends are containing the final
outcome. Tree diagrams are used to figure out when to multiply and when to
add. For a single coin toss, the tree has two branches, Head and Tail, each
carrying a probability of ½.

Types of Probability
There are three major types of probabilities:
 Theoretical Probability
 Experimental Probability
 Axiomatic Probability
Theoretical Probability
It is based on the possible chances of something to happen. The theoretical
probability is mainly based on the reasoning behind probability. For example, if
a coin is tossed, the theoretical probability of getting a head will be ½.

Experimental Probability
It is based on the observations of an experiment. The experimental
probability is calculated as the number of times an event occurs divided by
the total number of trials. For example, if a coin is tossed 10 times and a head is
recorded 6 times, then the experimental probability for heads is 6/10, or 3/5.
Axiomatic Probability
In axiomatic probability, a set of rules or axioms are set which applies to all
types. These axioms are set by Kolmogorov and are known as Kolmogorov’s
three axioms. With the axiomatic approach to probability, the chances of
occurrence or non-occurrence of the events can be quantified. The axiomatic
probability lesson covers this concept in detail with Kolmogorov’s three rules
(axioms) along with various examples.
Probability of an Event
Assume an event E can occur in r ways out of a total of n equally likely
ways. Then the probability of the event happening, or its success, is expressed as:
P(E) = r/n
The probability that the event will not occur, known as its failure, is
expressed as:
P(E’) = (n-r)/n = 1-(r/n)
E’ represents that the event will not occur.
Therefore, now we can say;
P(E) + P(E’) = 1
This means that the total of all the probabilities in any random test or
experiment is equal to 1.
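
A small sketch contrasting theoretical and experimental probability for a coin toss (the trial count is chosen arbitrarily):

```python
# Sketch contrasting theoretical and experimental probability for a coin toss.
import random

p_theoretical = 1 / 2                      # favourable outcomes / total outcomes

trials = 10_000
heads = sum(random.choice("HT") == "H" for _ in range(trials))
p_experimental = heads / trials            # observed relative frequency of heads

print(p_theoretical, p_experimental)       # and P(E) + P(E') = 1 in both cases
```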
What Is a Confidence Interval?
A confidence interval, in statistics, refers to the probability that
a population parameter will fall between a set of values for a certain
proportion of times. Analysts often use confidence intervals that contain
either 95% or 99% of expected observations. Thus, if a point estimate of 10.00 is
generated from a statistical model with a 95% confidence interval of
9.50 - 10.50, it can be inferred that there is a 95% probability that the true
value falls within that range.
Statisticians and other analysts use confidence intervals to understand
the statistical significance of their estimations, inferences, or predictions. If a
confidence interval contains the value of zero (or some other null hypothesis),
then one cannot satisfactorily claim that a result from data generated by
testing or experimentation is to be attributable to a specific cause rather than
chance.
KEY TAKEAWAYS
 A confidence interval displays the probability that a parameter will fall
between a pair of values around the mean.
 Confidence intervals measure the degree of uncertainty or certainty in a
sampling method.
 They are also used in hypothesis testing and regression analysis.
 Statisticians often use p-values in conjunction with confidence intervals
to gauge statistical significance.
 They are most often constructed using confidence levels of 95% or 99%.
Why Are Confidence Intervals Used?
Statisticians use confidence intervals to measure uncertainty in a sample
variable. For example, a researcher selects different samples randomly from
the same population and computes a confidence interval for each sample to
see how it may represent the true value of the population variable. The
resulting datasets are all different where some intervals include the true
population parameter and others do not.
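
As a rough sketch, a 95% confidence interval for a mean can be computed from a single sample using the t distribution; the sample values below are made up and SciPy is assumed to be available.

```python
# Sketch: a 95% confidence interval for a population mean from one sample,
# using the t distribution. The sample values are made up.
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.2, 10.4, 9.6, 10.1, 10.3, 9.9, 10.0])
mean = sample.mean()
sem = stats.sem(sample)            # standard error of the mean

low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: {low:.2f} to {high:.2f}")
```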

Definition of statistical data


Statistical data are the outcomes or the observations which
occur in scientific experiments or investigations. To
conduct any analysis it is necessary to have some data. Without
data we cannot think about research or statistical analysis.


In statistics, data play a vital role in all fields and in all
theories and measurements. Measures of central
tendency (mean, median, mode) and measures of dispersion
(variance, mean deviation, standard deviation, etc.) are
some statistical measures by which we find the different
characteristics of the data.
For example, in a garments factory we want to find the
female workers’ height and weight. If we measure the height
in feet and the weight in kilograms, then we get some
numerical values, which are numerical data.
Types of statistical data
All statistical data may be classified into two categories.
Qualitative: Gender, Education status, Marital status, etc.
Quantitative: Age, height, weight, etc.
Depending on how the data are collected, we can further divide data into two categories:
Primary data: Primary data are those which are
collected from the units or individuals directly, and
these data have never been used for any purpose earlier.
Secondary data: Data which have already been collected by
some individual or agency and statistically treated to
draw certain conclusions. When the same data are
used and analyzed again to extract some other information,
they are termed secondary data.
Methods of data collection. The following are the methods of collection of data:
direct personal inquiry method; indirect oral investigation;
filling of schedules; mailed questionnaires;
information from local agents; old records; and
the direct observational method.
Requirements of reliable statistical data: it should be complete,
it should be consistent, it should be accurate, and
it should be homogeneous with respect to the unit of information.
Statistical data are defined under some random variable,
where one random variable contains the same characteristic
of the data. For example, height is a random variable under which we
include all the data which represent height. We collect
data from a certain area, which is called the study area. All the
data present in the study area are called the population.
Generally the population size is very large; that is why we
collect a representative part of the population, which is called a sample.
Difference Between Sampling and Non-Sampling Error?

Sampling error is one which occurs due to the unrepresentativeness of the sample
selected for observation. Conversely, non-sampling error is an error arising from
human error, such as an error in problem identification or in the method or
procedure used, etc.
An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.
Below, you can find the important differences between
sampling and non-sampling error in detail.
Comparison Chart

BASIS FOR COMPARISON | SAMPLING ERROR | NON-SAMPLING ERROR
Meaning | A type of error that occurs because the sample selected does not perfectly represent the population of interest. | An error that occurs due to sources other than sampling while conducting survey activities.
Cause | Deviation between sample mean and population mean | Deficiency in and inappropriate analysis of data
Type | Random | Random or non-random
Occurs | Only when a sample is selected. | Both in a sample and in a census.
Sample size | Possibility of error is reduced with an increase in sample size. | It has nothing to do with the sample size.

Definition of Sampling Error


Sampling Error denotes a statistical error arising out of a certain sample
selected being unrepresentative of the population of interest. In simple terms,
it is an error which occurs when the sample selected does not contain the true
characteristics, qualities or figures of the whole population.
The main reason behind sampling error is that the sampler draws various
sampling units from the same population but, the units may have individual
variances. Moreover, they can also arise out of defective sample design, faulty
demarcation of units, wrong choice of statistic, substitution of sampling unit
done by the enumerator for their convenience. Therefore, it is considered as
the deviation between true mean value for the original sample and the
population.
Definition of Non-Sampling Error

Non-Sampling Error is an umbrella term which comprises all the errors
other than the sampling error. They arise due to a number of reasons, i.e. errors
in problem definition, questionnaire design, approach, coverage, information
provided by respondents, data preparation, collection, tabulation, and analysis.
There are two types of non-sampling error:
 Response Error: Error arising when inaccurate answers are given by
respondents, or their answers are misinterpreted or recorded wrongly. It
consists of researcher error, respondent error and interviewer error,
which are further classified as under.
o Researcher Error
 Surrogate Error
 Sampling Error
 Measurement Error
 Data Analysis Error
 Population Definition Error
o Respondent Error
 Inability Error
 Unwillingness Error
o Interviewer Error
 Questioning Error
 Recording Error
 Respondent Selection Error
 Cheating Error
 Non-Response Error: Error arising when some respondents who are
part of the sample do not respond.
Key Differences Between Sampling and Non-Sampling Error
The significant differences between sampling and non-sampling error are
mentioned in the following points:

1. Sampling error is a statistical error that happens because the sample selected
does not perfectly represent the population of interest. Non-sampling
error occurs due to sources other than sampling while conducting survey
activities.
2. Sampling error arises because of the variation between the true mean
value for the sample and the population. On the other hand, the non-
sampling error arises because of deficiency and inappropriate analysis of
data.
3. Non-sampling error can be random or non-random whereas sampling
error occurs in the random sample only.
4. Sampling error arises only when the sample is taken as representative of
a population, as opposed to non-sampling error, which arises both in
sampling and in complete enumeration.
5. Sampling error is mainly associated with the sample size, i.e. as the
sample size increases the possibility of error decreases. On the contrary,
the non-sampling error is not related to the sample size, so, with the
increase in sample size, it won’t be reduced.
Probability sampling methods
Probability sampling means that every member of the population has a chance
of being selected. It is mainly used in quantitative research. If you want to
produce results that are representative of the whole population, probability
sampling techniques are the most valid choice.
There are four main types of probability sample.
1. Simple random sampling
In a simple random sample, every member of the population has an equal
chance of being selected. Your sampling frame should include the whole
population.
To conduct this type of sampling, you can use tools like random number
generators or other techniques that are based entirely on chance.
Example: Simple random sampling
You want to select a simple random sample of 100 employees of a social
media marketing company. You assign a number to every employee in the
company database from 1 to 1000, and use a random number generator to
select 100 numbers.
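
A minimal sketch of this example in Python (the employee IDs are simply the numbers 1 to 1000):

```python
# Sketch of the simple random sampling example: draw 100 employee IDs at random
# from a database numbered 1 to 1000.
import random

employee_ids = list(range(1, 1001))          # the sampling frame: the whole population
sample = random.sample(employee_ids, k=100)  # every employee has an equal chance
print(sorted(sample)[:10])
```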
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually
slightly easier to conduct. Every member of the population is listed with a
number, but instead of randomly generating numbers, individuals are chosen
at regular intervals.
Example: Systematic sampling
All employees of the company are listed in alphabetical order. From the first 10
numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on),
and you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden
pattern in the list that might skew the sample. For example, if the HR database
groups employees by team, and team members are listed in order of seniority,
there is a risk that your interval might skip over people in junior roles, resulting
in a sample that is skewed towards senior employees.
3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that
may differ in important ways. It allows you to draw more precise conclusions by
ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called
strata) based on the relevant characteristic (e.g., gender identity, age range,
income bracket, job role).
Based on the overall proportions of the population, you calculate how many
people should be sampled from each subgroup. Then you use random
or systematic sampling to select a sample from each subgroup.
Example: Stratified sampling
The company has 800 female employees and 200 male employees. You want
to ensure that the sample reflects the gender balance of the company, so you
sort the population into two strata based on gender. Then you use random
sampling on each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.
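
A rough sketch of proportional stratified sampling for this example (the stratum labels are invented; the sizes follow the figures above):

```python
# Sketch of proportional stratified sampling: 80 women and 20 men are drawn
# from strata of 800 and 200 employees respectively.
import random

women = [f"W{i}" for i in range(800)]
men = [f"M{i}" for i in range(200)]
population_size = len(women) + len(men)
sample_size = 100

sample = []
for stratum in (women, men):
    n_stratum = round(sample_size * len(stratum) / population_size)  # proportional allocation
    sample.extend(random.sample(stratum, n_stratum))

print(len(sample))   # 100 in total: 80 from the female stratum, 20 from the male stratum
```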

4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of
sampling individuals from each subgroup, you randomly select entire
subgroups.
If it is practically possible, you might include every individual from each
sampled cluster. If the clusters themselves are large, you can also sample
individuals from within each cluster using one of the techniques above. This is
called multistage sampling.
This method is good for dealing with large and dispersed populations, but
there is more risk of error in the sample, as there could be substantial
differences between clusters. It’s difficult to guarantee that the sampled
clusters are really representative of the whole population.
Example: Cluster sampling
The company has offices in 10 cities across the country (all with roughly the
same number of employees in similar roles). You don’t have the capacity to
travel to every office to collect your data, so you use random sampling to
select 3 offices – these are your clusters.
Non-probability sampling methods
In a non-probability sample, individuals are selected based on non-random
criteria, and not every individual has a chance of being included.
This type of sample is easier and cheaper to access, but it has a higher risk
of sampling bias. That means the inferences you can make about the
population are weaker than with probability samples, and your conclusions
may be more limited. If you use a non-probability sample, you should still aim
to make it as representative of the population as possible.
Non-probability sampling techniques are often used
in exploratory and qualitative research. In these types of research, the aim is
not to test a hypothesis about a broad population, but to develop an initial
understanding of a small or under-researched population.
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher.

This is an easy and inexpensive way to gather initial data, but there is no way
to tell if the sample is representative of the population, so it can’t
produce generalizable results. Convenience samples are at risk for
both sampling bias and selection bias.
Example: Convenience sampling
You are researching opinions about student support services in your university,
so after each of your classes, you ask your fellow students to complete
a survey on the topic. This is a convenient way to gather data, but as you only
surveyed students taking the same classes as you at the same level, the sample
is not representative of all the students at your university.
2. Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is mainly based
on ease of access. Instead of the researcher choosing participants and directly
contacting them, people volunteer themselves (e.g. by responding to a public
online survey).
Voluntary response samples are always at least somewhat biased, as some
people will inherently be more likely to volunteer than others, leading to self-
selection bias.
Example: Voluntary response sampling
You send out the survey to all students at your university and a lot of students
decide to complete it. This can certainly give you some insight into the topic,
but the people who responded are more likely to be those who have strong
opinions about the student support services, so you can’t be sure that their
opinions are representative of all students.
3. Purposive sampling
This type of sampling, also known as judgement sampling, involves the
researcher using their expertise to select a sample that is most useful to the
purposes of the research.
It is often used in qualitative research, where the researcher wants to gain
detailed knowledge about a specific phenomenon rather than make statistical
inferences, or where the population is very small and specific. An effective
purposive sample must have clear criteria and rationale for inclusion. Always

make sure to describe your inclusion and exclusion criteria and beware
of observer bias affecting your arguments.

Example: Purposive sampling


You want to know more about the opinions and experiences of disabled
students at your university, so you purposefully select a number of students
with different support needs in order to gather a varied range of data on their
experiences with student services.
4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit
participants via other participants. The number of people you have access to
“snowballs” as you get in contact with more people. The downside here is also
representativeness, as you have no way of knowing how representative your
sample is due to the reliance on participants recruiting others. This can lead
to sampling bias.
Example: Snowball sampling
You are researching experiences of homelessness in your city. Since there is no
list of all homeless people in the city, probability sampling isn’t possible. You
meet one person who agrees to participate in the research, and she puts you in
contact with other homeless people that she knows in the area.
What is a research problem?
A research problem is a statement about an area of concern, a condition to be
improved, a difficulty to be eliminated, or a troubling question that exists in
scholarly literature, in theory, or in practice that points to the need for
meaningful understanding and deliberate investigation. In some social science
disciplines the research problem is typically posed in the form of a question. A
research problem does not state how to do something, offer a vague or broad
proposition, or present a value question.
A research problem is a specific issue or gap in existing knowledge that you aim
to address in your research. You may choose to look for practical problems
aimed at contributing to change, or theoretical problems aimed at expanding
knowledge.

Some research will do both of these things, but usually the research problem
focuses on one or the other. The type of research problem you choose
depends on your broad topic of interest and the type of research you think will
fit best.
This article helps you identify and refine a research problem. When writing
your research proposal or introduction, formulate it as a problem statement
and/or research questions.
Why is the research problem important?
Having an interesting topic isn’t a strong enough basis for academic research.
Without a well-defined research problem, you are likely to end up with an
unfocused and unmanageable project.
You might end up repeating what other people have already said, trying to say
too much, or doing research without a clear purpose and justification. You
need a clear problem in order to do research that contributes new and
relevant insights.
Whether you’re planning your thesis, starting a research paper, or writing a
research proposal, the research problem is the first step towards knowing
exactly what you’ll do and why.
Step 1: Identify a broad problem area
As you read about your topic, look for under-explored aspects or areas of
concern, conflict, or controversy. Your goal is to find a gap that your research
project can fill.
Practical research problems
If you are doing practical research, you can identify a problem by reading
reports, following up on previous research, or talking to people who work in
the relevant field or organization. You might look for:
Issues with performance or efficiency
Processes that could be improved
Areas of concern among practitioners
Difficulties faced by specific groups of people
Examples of practical research problems

Voter turnout in New England has been decreasing, in contrast to the rest of
the country.
The HR department of a local chain of restaurants has a high staff turnover
rate.
A non-profit organization faces a funding gap that means some of its programs
will have to be cut.
Theoretical research problems
If you are doing theoretical research, you can identify a research problem by
reading existing research, theory, and debates on your topic to find a gap in
what is currently known about it. You might look for:
A phenomenon or context that has not been closely studied
A contradiction between two or more perspectives
A situation or relationship that is not well understood
A troubling question that has yet to be resolved
Examples of theoretical research problems
The effects of long-term Vitamin D deficiency on cardiovascular health are not
well understood.
The relationship between gender, race, and income inequality has yet to be
closely studied in the context of the millennial gig economy.
Historians of Scottish nationalism disagree about the role of the British Empire
in the development of Scotland’s national identity.
Step 2: Learn more about the problem
Next, you have to find out what is already known about the problem, and
pinpoint the exact aspect that your research will address.
Context and background
 Who does the problem affect?
 Is it a newly-discovered problem, or a well-established one?
 What research has already been done?
 What, if any, solutions have been proposed

Research proposal purpose


Academics often have to write research proposals to get funding for their
projects. As a student, you might have to write a research proposal as part of
a grad school application, or prior to starting your thesis or dissertation.
In addition to helping you figure out what your research can look like, a
proposal can also serve to demonstrate why your project is worth pursuing to
a funder, educational institution, or supervisor.
Research proposal aims
Relevance: Show your reader why your project is interesting, original, and important.
Context: Demonstrate your comfort and familiarity with your field. Show that you
understand the current state of research on your topic.
Approach: Make a case for your methodology. Demonstrate that you have carefully
thought about the data, tools, and procedures necessary to conduct your research.
Achievability: Confirm that your project is feasible within the timeline of your
program or funding deadline.

Research phase and objectives

1. Background research and literature review
 Meet with supervisor for initial discussion
 Read and analyze relevant literature
 Use new knowledge to refine research questions
 Develop theoretical framework

2. Research design planning
 Design questionnaires
 Identify channels for recruiting participants
 Finalize sampling methods and data analysis methods

3. Data collection and preparation
 Recruit participants and send out questionnaires
 Conduct semi-structured interviews with selected participants
 Transcribe and code interviews
 Clean data

4. Data analysis
 Statistically analyze survey data
 Conduct thematic analysis of interview transcripts
 Draft results and discussion chapters

5. Writing
 Complete a full thesis draft
 Meet with supervisor to discuss feedback and revisions

6. Revision
 Complete 2nd draft based on feedback
 Get supervisor approval for final draft
 Proofread
 Print and bind final work
 Submit

Weighted Index Number


In general, all the commodities cannot be given equal importance, so we can
assign weights to each commodity according to its importance, and the index
number computed from these weights is called a weighted index number.
The weights can be production or consumption values. If ‘w’ is the weight
attached to a commodity, then the price index is given by:

Price Index (P01) = ( Σ(p1 · w) / Σ(p0 · w) ) × 100
Let us consider the following notations,


p1 - current year price

p0 - base year price


q1 - current year quantity
q0 - base year quantity
where suffix ‘0’ represents base year and ‘1’ represents current year.
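
Using these notations, two commonly used weighted aggregative forms are the Laspeyres index (base-year quantities as weights) and the Paasche index (current-year quantities as weights). The sketch below computes both for a made-up basket; it is only an illustration of the general idea, not the document's own worked example.

```python
# Rough illustration: Laspeyres and Paasche weighted aggregative price indices,
# using base-year or current-year quantities as weights. All figures are made up.
commodities = {
    # item: (p0, p1, q0, q1)
    "Wheat": (20, 25, 10, 12),
    "Rice":  (30, 33, 8, 8),
    "Milk":  (40, 50, 5, 6),
}

sum_p1q0 = sum(p1 * q0 for p0, p1, q0, q1 in commodities.values())
sum_p0q0 = sum(p0 * q0 for p0, p1, q0, q1 in commodities.values())
sum_p1q1 = sum(p1 * q1 for p0, p1, q0, q1 in commodities.values())
sum_p0q1 = sum(p0 * q1 for p0, p1, q0, q1 in commodities.values())

laspeyres = 100 * sum_p1q0 / sum_p0q0   # base-year quantities as weights
paasche = 100 * sum_p1q1 / sum_p0q1     # current-year quantities as weights
print(round(laspeyres, 2), round(paasche, 2))
```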

Simple Index Number


A simple index number is the ratio of two values representing the
same variable, measured in two different situations or in two different periods.
For example, a simple index number of price will give the relative variation of
the price between the current period and a reference period. The most
commonly used simple index numbers are those of price, quantity, and value.
MATHEMATICAL ASPECTS
The index number In/0, which is representative of a variable G in
situation n with respect to the same variable in situation 0 (reference
situation), is defined by:

In/0 = Gn / G0,

where Gn is the value of variable G in situation n and G0 is the value of
variable G in situation 0.
Generally, a simple index number is expressed with base 100 in the reference
situation 0:

In/0 = (Gn / G0) × 100
Properties of Simple Index Numbers
 Identity: If two compared situations (or two periods) are identical, the
value of the index number is...

Some of the uses of index numbers are discussed below:


Index numbers possess much practical importance in measuring changes in the
cost of living, production trends, trade, income variations, etc.
1. In Measuring Changes in the Value of Money:
Index numbers are used to measure changes in the value of money. A study of
the rise or fall in the value of money is essential for determining the direction
of production and employment to facilitate future payments and to know
changes in the real income of different groups of people at different places and
times. As pointed out by Crowther, “By using the technical device of an index
number, it is thus possible to measure changes in different aspects of the value
of money, each particular aspect being relevant to a different purpose.”
2. In Cost of Living:
Cost of living index numbers in the case of different groups of workers throw
light on the rise or fall in the real income of workers. It is on the basis of the
study of the cost of living index that money wages are determined and
dearness and other allowances are granted to workers. The cost of living index
is also the basis of wage negotiations and wage contracts.
3. In Analysing Markets for Goods and Services:
Consumer price index numbers are used in analysing markets for particular
kinds of goods and services. The weights assigned to different commodities like

food, clothing, fuel, and lighting, house rent, etc., govern the market for such
goods and services.
4. In Measuring Changes in Industrial Production:
Index numbers of industrial production measure increase or decrease in
industrial production in a given year as compared to the base year. We can
know from such an index number the actual condition of different industries,
whether production is increasing or decreasing in them, for an industrial index
number measures changes in the quantity of production.
5. In Internal Trade:
The study of indices of the wholesale prices of consumer and industrial goods
and of industrial production helps commerce and industry in expanding or
decreasing internal trade.
6. In External Trade:
The foreign trade position of a country can be assessed on the basis of its
export and import indices. These indices reveal whether the external trade of
the country is increasing or decreasing.
7. In Economic Policies:
Index numbers are helpful to the state in formulating and adopting
appropriate economic policies. Index numbers measure changes in such
magnitudes as prices, incomes, wages, production, employment, products,
exports, imports, etc. By comparing the index numbers of these magnitudes
for different periods, the government can know the present trend of economic
activity and accordingly adopt price policy, foreign trade policy and general
economic policies.
8. In Determining the Foreign Exchange Rate:
Index numbers of wholesale price of two countries are used to determine their
rate of foreign exchange. They are the basis of the purchasing power parity
theory which determines the exchange rate between two countries on
inconvertible paper standard.

Quartile, Decile and Percentile?


All of us are aware of the concept of the median in Statistics, the middle value
or the mean of the two middle values, of an array. We have learned that the
median divides a set of data into two equal parts. In the same way, there are
also certain other values which divide a set of data into four, ten or hundred
equal parts. Such values are referred as quartiles, deciles, and percentiles
respectively.
Collectively, the quartiles, deciles, percentiles and other values obtained by
equal sub-division of the data are called quantiles.
Quartiles:
The values which divide an array (a set of data arranged in ascending or
descending order) into four equal parts are called Quartiles. The first, second
and third quartiles are denoted by Q1, Q2,Q3 respectively. The first and third
quartiles are also called the lower and upper quartiles respectively. The second
quartile represents the median, the middle value.
Quartiles for Ungrouped Data:
Quartiles for ungrouped data (arranged in ascending order) are calculated by the
following formulae:
Q1 = value of the ((n + 1)/4)th item, Q2 = value of the ((n + 1)/2)th item,
Q3 = value of the (3(n + 1)/4)th item.

Quartiles for Grouped Data:


The quartiles may be determined from grouped data in the same way as the
median except that in place of n/2 we will use n/4. For calculating quartiles
from grouped data we will form cumulative frequency column. Quartiles for
grouped data will be calculated from the following formulae;
Qi = l + (h / f) × (i·n/4 − C.F), for i = 1, 2, 3; Q2 = Median.
Where,
l = lower class boundary of the class containing the quartile, i.e. the class
corresponding to the cumulative frequency in which n/4 or 3n/4 lies
h = class interval size of the class containing the quartile.
f = frequency of the class containing the quartile.
n = number of values, or the total frequency.
C.F = cumulative frequency of the class preceding the class containing the quartile.
Deciles:
The values which divide an array into ten equal parts are called deciles. The
first, second, ..., ninth deciles are denoted by D1, D2, ..., D9 respectively. The fifth
decile (D5) corresponds to the median. The second, fourth, sixth and eighth deciles,
which collectively divide the data into five equal parts, are called quintiles.
Deciles for Ungrouped Data:
Deciles for ungrouped data (arranged in ascending order) are calculated from the
following formula:
Di = value of the (i(n + 1)/10)th item, for i = 1, 2, ..., 9.

Decile for Grouped Data


Deciles for grouped data can be calculated from the following formula:
Di = l + (h / f) × (i·n/10 − C.F), for i = 1, 2, ..., 9.
Where,
l = lower class boundary of the class containing the decile, i.e. the class
corresponding to the cumulative frequency in which 2n/10 or 9n/10 lies
h = class interval size of the class containing the decile.
f = frequency of the class containing the decile.
n = number of values, or the total frequency.
C.F = cumulative frequency of the class preceding the class containing the decile.
Percentiles:
The values which divide an array into one hundred equal parts are called
percentiles. The first, second, ..., ninety-ninth percentiles are denoted
by P1, P2, ..., P99. The 50th percentile (P50) corresponds to the median. The
25th percentile (P25) corresponds to the first quartile and the
75th percentile (P75) corresponds to the third quartile.
Percentiles for Ungrouped Data:
Percentiles for ungrouped data (arranged in ascending order) can be calculated
from the following formula:
Pi = value of the (i(n + 1)/100)th item, for i = 1, 2, ..., 99.

Percentiles for Grouped Data:


Percentiles can also be calculated for grouped data with the help of the
following formula:
Pi = l + (h / f) × (i·n/100 − C.F), for i = 1, 2, ..., 99.
Where,
l = lower class boundary of the class containing the percentile, i.e. the class
corresponding to the cumulative frequency in which 35n/100 or 99n/100 lies
h = class interval size of the class containing the percentile.
f = frequency of the class containing the percentile.
n = number of values, or the total frequency.
C.F = cumulative frequency of the class preceding the class containing the percentile.
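
For ungrouped data, these positional values are easy to obtain with NumPy, as in the sketch below; note that NumPy's default interpolation rule is just one of several textbook conventions, so results may differ slightly from the formulae above.

```python
# Sketch: quartiles, deciles, and percentiles for ungrouped data with NumPy.
import numpy as np

data = np.array([12, 15, 17, 19, 22, 24, 25, 28, 30, 33, 35, 40])

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles; Q2 is the median
d4 = np.percentile(data, 40)                     # 4th decile = 40th percentile
p90 = np.percentile(data, 90)                    # 90th percentile

print(q1, q2, q3, d4, p90)
```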

An Introduction to t Tests | Definitions, Formula and Examples?


A t test is a statistical test that is used to compare the means of two groups. It
is often used in hypothesis testing to determine whether a process or
treatment actually has an effect on the population of interest, or whether two
groups are different from one another.
t test example
You want to know whether the mean petal length of iris flowers differs
according to their species. You find two different species of irises growing in a

garden and measure 25 petals of each species. You can test the difference
between these two groups using a t test and null and alternative hypotheses.
 The null hypothesis (H0) is that the true difference between these group
means is zero.
 The alternate hypothesis (Ha) is that the true difference is different from
zero.
What type of t test should I use?
When choosing a t test, you will need to consider two things: whether the
groups being compared come from a single population or two different
populations, and whether you want to test the difference in a specific
direction.
One-sample, two-sample, or paired t test?
 If the groups come from a single population (e.g., measuring before and
after an experimental treatment), perform a paired t test. This is
a within-subjects design.
 If the groups come from two different populations (e.g., two different
species, or people from two separate cities), perform a two-
sample t test (a.k.a. independent t test). This is a between-subjects
design.
 If there is one group being compared against a standard value (e.g.,
comparing the acidity of a liquid to a neutral pH of 7), perform a one-
sample t test.
One-tailed or two-tailed t test?
 If you only care whether the two populations are different from one
another, perform a two-tailed t test.
 If you want to know whether one population mean is greater than or
less than the other, perform a one-tailed t test.
t test example
In your test of whether petal length differs by species:
 Your observations come from two separate populations (separate
species), so you perform a two-sample t test.

 You don’t care about the direction of the difference, only whether there
is a difference, so you choose to use a two-tailed t test.

T test formula
The formula for the two-sample t test (a.k.a. the Student’s t-test) is:

t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )

In this formula, t is the t value, x̄1 and x̄2 are the means of the two groups being
compared, s² is the pooled standard error of the two groups, and n1 and n2 are
the number of observations in each of the groups.
A larger t value shows that the difference between group means is greater
than the pooled standard error, indicating a more significant difference
between the groups.
You can compare your calculated t value against the values in a critical value
chart (e.g., Student’s t table) to determine whether your t value is greater than
what would be expected by chance. If so, you can reject the null hypothesis
and conclude that the two groups are in fact different.
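
A minimal sketch of the iris example as a two-sample, two-tailed t test with SciPy (the petal lengths below are invented for illustration):

```python
# Sketch: two-sample, two-tailed t test with SciPy. Petal lengths are made up.
from scipy import stats

species_a = [1.4, 1.5, 1.3, 1.6, 1.4, 1.5, 1.7, 1.4]
species_b = [4.7, 4.5, 4.9, 4.0, 4.6, 4.4, 4.7, 4.5]

t_value, p_value = stats.ttest_ind(species_a, species_b)  # two-tailed by default
print(t_value, p_value)   # a small p-value lets us reject H0 of equal means
```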
What Is Analysis of Variance (ANOVA)?
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an
observed aggregate variability found inside a data set into two parts:
systematic factors and random factors. The systematic factors have a statistical
influence on the given data set, while the random factors do not. Analysts use
the ANOVA test to determine the influence that independent variables have on
the dependent variable in a regression study.
The t- and z-test methods developed in the 20th century were used for
statistical analysis until 1918, when Ronald Fisher created the analysis of
variance method. ANOVA is also called the Fisher analysis of variance, and it
is the extension of the t- and z-tests. The term became well-known in 1925,
after appearing in Fisher's book, "Statistical Methods for Research
Workers." It was employed in experimental psychology and later expanded to
subjects that were more complex.
KEY TAKEAWAYS

 Analysis of variance, or ANOVA, is a statistical method that separates


observed variance data into different components to use for additional
tests.
 A one-way ANOVA is used for three or more groups of data, to gain
information about the relationship between the dependent and
independent variables.
 If no true variance exists between the groups, the ANOVA's F-ratio
should be close to 1.

Example of How to Use ANOVA


A researcher might, for example, test students from multiple colleges to see if
students from one of the colleges consistently outperform students from the
other colleges. In a business application, an R&D researcher might test two
different processes of creating a product to see if one process is better than
the other in terms of cost efficiency.
The type of ANOVA test used depends on a number of factors. It is applied
when the data need to be experimental. Analysis of variance can also be
computed by hand when there is no access to statistical software. It is
simple to use and best suited for small samples. With many experimental
designs, the sample sizes have to be the same for the various factor level
combinations.
ANOVA is helpful for testing three or more variables. It is similar to multiple
two-sample t-tests. However, it results in fewer type I errors and is appropriate
for a range of issues. ANOVA groups differences by comparing the means of
each group and includes spreading out the variance into diverse sources. It is
employed with subjects, test groups, between groups and within groups.

One-Way ANOVA Versus Two-Way ANOVA


There are two main types of ANOVA: one-way (or unidirectional) and two-way.
There are also variations of ANOVA. For example, MANOVA (multivariate ANOVA)
differs from ANOVA as the former tests for multiple dependent variables
simultaneously while the latter assesses only one dependent variable at a time.
One-way or two-way refers to the number of independent variables in your
analysis of variance test. A one-way ANOVA evaluates the impact of a sole
factor on a sole response variable. It determines whether all the samples are
the same. The one-way ANOVA is used to determine whether there are any
statistically significant differences between the means of three or more
independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way,
you have one independent variable affecting a dependent variable. With a
two-way ANOVA, there are two independent variables. For example, a two-way ANOVA
allows a company to compare worker productivity based on two independent
variables, such as salary and skill set. It is utilized to observe the interaction
between the two factors and tests the effect of two factors at the same time.
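
As a rough illustration, a one-way ANOVA comparing exam scores from three colleges (made-up data) can be run with SciPy:

```python
# Sketch: one-way ANOVA on made-up exam scores from three colleges.
from scipy import stats

college_a = [85, 88, 90, 79, 84]
college_b = [78, 74, 80, 82, 77]
college_c = [91, 89, 94, 88, 90]

f_ratio, p_value = stats.f_oneway(college_a, college_b, college_c)
print(f_ratio, p_value)   # an F-ratio near 1 would suggest no true variance between groups
```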
----------------------------------------------------------------------------------------------
Interpolation Meaning
Interpolation is a method of deriving a simple function from a given discrete
data set such that the function passes through the provided data points. This
helps to determine values in between the given data points. The
method is needed to compute the value of a function for an
intermediate value of the independent variable. In short, interpolation is a
process of determining the unknown values that lie in between the known
data points. It is mostly used to predict the unknown values for any
geographical related data points such as noise level, rainfall, elevation, and so
on.
Interpolation is a method of fitting the data points to represent the value of a
function. It has a various number of applications in engineering and science,
that are used to construct new data points within the range of a discrete data
set of known data points or can be used for determining a formula of the
function that will pass from the given set of points (x,y). In this article, we are
going to discuss the meaning of interpolation in Statistics, its formulas, and
uses in detail.

Interpolation Formula
The unknown value on the data points can be found using the linear
interpolation and Lagrange’s interpolation formula.
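
For two known points (x1, y1) and (x2, y2), the linear interpolation formula is y = y1 + (x − x1)(y2 − y1)/(x2 − x1). The sketch below implements this directly; the sample points are arbitrary.

```python
# Sketch of linear interpolation between two known points (x1, y1) and (x2, y2).
def linear_interpolate(x, x1, y1, x2, y2):
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# Estimate the value at x = 3 when (2, 4) and (5, 10) are known data points.
print(linear_interpolate(3, 2, 4, 5, 10))   # 6.0
```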

Interpolation Methods
There are different types of interpolation methods. They are:
Linear Interpolation Method – This method applies a distinct linear polynomial
between each pair of data points for curves, or within the sets of three points
for surfaces.
Nearest Neighbour Method – This method sets the value of an interpolated
point to the value of the nearest data point. Therefore, this method
does not produce any new data points.
Cubic Spline Interpolation Method – This method fits a different cubic
polynomial between each pair of data points for curves, or between sets of
three points for surfaces.
Shape-Preservation Method – This method is also known as Piecewise Cubic
Hermite Interpolation (PCHIP). It preserves the monotonicity and the shape of
the data. It is for curves only.
Thin-plate Spline Method – This method produces smooth surfaces that also
extrapolate well. It is for surfaces only.
Biharmonic Interpolation Method – This method is applied to the surfaces
only.

ASSUMPTIONS FOR INTERPOLATION


1. There are no sudden jumps or falls in the values during the period under
consideration.

2. The rise and fall in the values should be uniform. For example, if we are
given data regarding rainfall in various years and some of the observations are
for the years in which El-Nino occurred, then interpolation methods are not
applicable.
3. When we apply calculus of finite differences, we assume that the given set
of observations is capable of being expressed in a polynomial form.
---------------------------------------------------------------------------------------
What is Research?
Research is the careful consideration of study regarding a particular concern or
problem using scientific methods. According to the American sociologist Earl
Robert Babbie, “research is a systematic inquiry to describe, explain, predict,
and control the observed phenomenon. It involves inductive and deductive
methods.”
Inductive methods analyze an observed event, while deductive methods verify
the observed event. Inductive approaches are associated with qualitative
research, and deductive methods are more commonly associated with
quantitative analysis.
Research is conducted with a purpose to:
 Identify potential and new customers
 Understand existing customers
 Set pragmatic goals
 Develop productive market strategies
 Address business challenges
 Put together a business expansion plan
 Identify new business opportunities
What are the characteristics of research?
1. Good research follows a systematic approach to capture accurate data.
Researchers need to practice ethics and a code of conduct while making
observations or drawing conclusions.
2. The analysis is based on logical reasoning and involves both inductive
and deductive methods.
3. Real-time data and knowledge are derived from actual observations in
natural settings.
4. There is an in-depth analysis of all data collected so that there are no
anomalies associated with it.
5. It creates a path for generating new questions. Existing data helps create
more research opportunities.
6. It is analytical and uses all the available data so that there is no
ambiguity in inference.
7. Accuracy is one of the most critical aspects of research. The information
must be accurate and correct. For example, laboratories provide a
controlled environment to collect data. Accuracy is measured in the
instruments used, the calibrations of instruments or tools, and the
experiment’s final result.
What is the purpose of research?
There are three main purposes:
1. Exploratory: As the name suggests, researchers conduct exploratory
studies to explore a group of questions. The answers and analytics may
not offer a conclusion to the perceived problem. It is undertaken to
handle new problem areas that haven’t been explored before. This
exploratory process lays the foundation for more conclusive data
collection and analysis.
2. Descriptive: It focuses on expanding knowledge on current issues
through a process of data collection. Descriptive research describes the
behavior of a sample population. Only one variable is required to
conduct the study. The three primary purposes of descriptive studies are
describing, explaining, and validating the findings. An example is a study
conducted to determine whether top-level management leaders in the 21st
century possess the moral right to receive a considerable sum of money from
the company's profit.
3. Explanatory: Causal or explanatory research is conducted to understand
the impact of specific changes in existing standard procedures. Running
experiments is the most popular form. For example, a study that is
conducted to understand the effect of rebranding on customer loyalty.
Here is a comparative analysis chart for better understanding:

                    Exploratory Research        Descriptive Research        Explanatory Research
Approach used       Unstructured                Structured                  Highly structured
Conducted through   Asking questions            Asking questions            By using hypotheses
Time                Early stages of             Later stages of             Later stages of
                    decision making             decision making             decision making
Types of Research
1. Descriptive Research

Descriptive Research is a form of research that incorporates surveys as well as
different varieties of fact-finding investigations. This form of research is
focused on describing the prevailing state of affairs as they are. Descriptive
Research is also termed Ex post facto research.

This research form emphasises factual reporting; the researcher cannot
control the involved variables and can only report the details as they took
place or as they are taking place.

Researchers mainly use a descriptive research approach when the research is
aimed at identifying characteristics, frequencies or trends.
Ex post facto studies also include attempts by researchers to discover causes
even when they cannot control the variables. The main descriptive research
methods are observations, surveys and case studies.
2. Analytical Research

Analytical Research is a form of research in which the researcher works with
the data and factual information already available and interprets it to
undertake a critical evaluation of the data.

This form of research is often undertaken by researchers to uncover evidence
that supports their present research and makes it more authentic. It is also
undertaken to generate fresh ideas relating to the topic on which the research
is based.

There are many methods through which this research is done, from conducting
meta-analyses, literature reviews or scientific trials to studying public opinion.

3. Applied Research

When a business, or society at large, is faced with an issue that needs an
immediate solution or resolution, Applied Research is the research type that
comes to the rescue.

We primarily make use of Applied Research when it comes to resolving the
issues plaguing our daily lives, impacting our work, health or welfare. This
research type is undertaken to uncover solutions for issues in varying sectors
like education, engineering, psychology or business.

For instance, a company might employ an applied researcher to determine the
best possible approach for selecting employees who would be the best fit for
specific positions in the company.
The crux of Applied Research is to figure out the solution to a certain growing
practical issue.

The 3 types of Applied Research are:

A. Evaluation Research - Research where prevailing data regarding the topic is
interpreted to arrive at proper decisions.

B. Research and Development - Where the focus is on developing fresh products
or services that meet the target market's requirements.

C. Action Research - Which aims at offering practical solutions to certain
business issues by giving them proper direction.

4. Fundamental Research

This is a research type that is primarily concerned with formulating a theory or
understanding a particular natural phenomenon. Fundamental Research aims
to discover information with an extensive application base, supplementing the
existing concepts in a certain field or industry.

Research on pure mathematics, or research on the generalisation of human
behaviour, are also examples of Fundamental Research. This form of research is
mainly carried out in sectors like Education, Psychology and Science.
For instance, in Psychology, fundamental research assists the individual or the
company in gaining better insights into certain behaviors, such as deciphering
how consumption of caffeine can impact the attention span of a student, or
how cultural stereotypes can trigger depression.

5. Quantitative Research
Quantitative Research, as the name suggests, is based on the measurement of
a particular amount or quantity of a particular phenomenon. It focuses on
gathering and interpreting numerical data and can be adopted for discovering
averages or patterns or for making predictions.

This form of research is number based and is one of the two main research
types. It makes use of tables, data and graphs to reach a conclusion. The
outcomes generated from this research are measurable and repeatable, unlike
the outcomes of qualitative research. This research type is mainly adopted for
scientific and field-based research.

Quantitative research generally involves a large number of people and a large
volume of data, and offers considerable scope for accuracy.

These research methods can be adopted for approaches like descriptive,
correlational or experimental research.

Descriptive research - The study variables are analyzed and a summary of them
is sought.

Correlational Research - The relationship between the study variables is
analyzed.

Experimental Research - It is conducted to analyse whether a cause-and-effect
relationship between the variables exists.
Quantitative research methods

 Experimental research - This method controls or manages independent
variables to calculate the effect they have on dependent variables.
 Survey - Surveys involve asking questions of a specified set of people,
either online, face to face or over the phone.
 (Systematic) observation - This method involves detecting an occurrence
and monitoring it in a natural setting.
 Secondary research - This research focuses on making use of data which
has been previously collected for other purposes, such as, say, a national
survey.

6. Qualitative Research

As the name suggests, this form of research is more concerned with the
quality of a certain phenomenon; it dives into the “why” alongside the “what”.
For instance, let’s consider a gender-neutral clothing store which has more
women visiting it than men.

Qualitative research would determine why men are not visiting the store by
carrying out in-depth interviews with some potential customers in this
category.

This form of research is interested in getting to the bottom of the reasons for
human behaviour, i.e., understanding why certain actions are taken by people
or why they think certain thoughts.
Through this research, the factors that influence people to behave in a certain
way, or that control their preferences towards a certain thing, can be
identified.

An example of Qualitative Research would be Motivation Research. This
research focuses on uncovering deep-rooted motives or desires through
intricate methods like in-depth interviews. It involves several tests like story
completion or word association.

Another example would be Opinion Research. This type of research is carried
out to discover the opinions and perspectives of people regarding a certain
subject or phenomenon.

This is a theory-based form of research and it works by describing an issue by
taking into account prior concepts, ideas and studies. The experience of the
researcher plays an integral role here.

The types of Qualitative Research include the following methods:

Qualitative research methods

 Observations: In this method, what the researcher sees, hears or
encounters is recorded in detail.
 Interviews: Personally asking people questions in one-on-one
conversations.
 Focus groups: This involves asking questions of and holding discussions
among a group of people to generate conclusions from the same.
 Surveys: In these surveys, unlike quantitative research surveys, the
questionnaires involve extensive open-ended questions that require
elaborate answers.
 Secondary research: Gathering existing data such as images, texts or
audio or video recordings. This can involve a text analysis, a case study,
or an in-depth interview.

7. Conceptual Research

This research is related to an abstract idea or a theory. It is adopted by
thinkers and philosophers with the aim of developing a new concept or
re-examining existing concepts.

Conceptual Research is mainly defined as a methodology in which the research
is conducted by observing and interpreting the information already available on
a given topic. It does not include carrying out any practical experiments.

This methodology has often been adopted by famous philosophers like
Aristotle, Copernicus, Einstein and Newton for developing fresh theories and
insights regarding the workings of the world and for examining existing ones
from a different perspective.

Such concepts were set up by philosophers to observe their environment and
to sort, study, and summarise the information available.

8. Empirical Research
This is a research method that focuses solely on aspects like observation and
experience, rather than on theory or system. It is based on data and can yield
conclusions that can be confirmed or verified through observation and
experiment. Empirical Research is mainly undertaken to determine proof that
certain variables are affecting others in a particular way.

This kind of research can also be termed Experimental Research. In this
research it is essential that all the facts are obtained firsthand, directly from
the source, so that the researcher can actively carry out the required actions
and manipulate the concerned materials to gain the information he requires.

In this research a hypothesis is generated and then a procedure is undertaken
to confirm or invalidate this hypothesis. The control that the researcher holds
over the involved variables defines this research. The researcher can
manipulate one of these variables to examine its effect.

Social Research: Definition


Social Research is a method used by social scientists and researchers to learn
about people and societies so that they can design products/services that cater
to various needs of the people. Different socio-economic groups belonging to
different parts of a country think differently. Various aspects of human behavior
need to be addressed to understand their thoughts and feedback about the
social world, which can be done using Social Research. Any topic can trigger
social research: a new feature, a new market trend or an upgrade to old
technology.
Social Research is conducted by following a systematic plan of action which
includes qualitative and quantitative observation methods.
 Qualitative methods rely on direct communication with members of a
market, observation, text analysis. The results of this method are
focused more on being accurate rather than generalizing to the entire
population.
 Quantitative methods use statistical analysis techniques to evaluate data
collected via surveys, polls or questionnaires.
Social Research contains elements of both these methods to analyze a range of
social occurrences such as an investigation of historical sites, census of the
country, detailed analysis of research conducted to understand reasons for
increased reports of molestation in the country etc.
A survey to monitor happiness in a respondent population is one of the most
widely used applications of social research. A happiness survey template can
be used by researchers and organizations to gauge how happy a respondent is
and what can be done to increase that respondent's happiness.
For example, a survey can be conducted to understand climate change
awareness among the general population. Such a survey will give in-depth
information about people's perceptions of climate change and the behaviors
that influence it. Such a questionnaire will enable the researcher to understand
what needs to be done to create more awareness among the public.
What is a frequency distribution?


The frequency of a value is the number of times it occurs in a dataset.
A frequency distribution is the pattern of frequencies of a variable. It’s the
number of times each possible value of a variable occurs in a dataset.
Types of frequency distributions
There are four types of frequency distributions:
 Ungrouped frequency distributions: The number of observations of
each value of a variable.
o You can use this type of frequency distribution for categorical
variables.
 Grouped frequency distributions: The number of observations of
each class interval of a variable. Class intervals are ordered groupings of
a variable’s values.
o You can use this type of frequency distribution for quantitative
variables.
 Relative frequency distributions: The proportion of observations of each
value or class interval of a variable.
o You can use this type of frequency distribution for any type of
variable when you’re more interested in comparing
frequencies than the actual number of observations.
 Cumulative frequency distributions: The sum of the frequencies less
than or equal to each value or class interval of a variable.
o You can use this type of frequency distribution for ordinal or
quantitative variables when you want to understand how often
observations fall below certain values.
How to make a frequency table
Frequency distributions are often displayed using frequency tables. A
frequency table is an effective way to summarize or organize a dataset. It’s
usually composed of two columns:
 The values or class intervals
 Their frequencies
The method for making a frequency table differs between the four types of
frequency distributions. You can follow the guides below or use software such
as Excel, SPSS, or R to make a frequency table.
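As an illustration of building such a table without spreadsheet software, here is
a minimal Python sketch using only the standard library (the sample data are
invented for illustration):

from collections import Counter

# Invented sample data: number of pets owned by 12 survey respondents.
observations = [0, 1, 1, 2, 0, 3, 1, 2, 2, 1, 0, 4]

counts = Counter(observations)          # ungrouped frequencies per value
total = sum(counts.values())

print(f"{'Value':>5} {'Freq':>5} {'Rel. freq':>10} {'Cum. freq':>10}")
cumulative = 0
for value in sorted(counts):
    freq = counts[value]
    cumulative += freq                  # cumulative frequency up to this value
    relative = freq / total             # relative frequency (proportion)
    print(f"{value:>5} {freq:>5} {relative:>10.2f} {cumulative:>10}")

The same idea extends to grouped frequency distributions by first assigning each
observation to a class interval and then counting the intervals instead of the
raw values.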
