3 Introduction To Probabilities

Introduction to Probability
INTRODUCTION TO PROBABILITY
• Random Experiment
A random experiment is an experiment in which the outcome is not known with certainty. That is, the
outcome of a random experiment cannot be predicted with certainty.

• Sample Space
The sample space is the universal set that consists of all possible outcomes of an experiment. It
is usually represented by the letter ‘S’, and the individual outcomes are called elementary events.

• Example: outcome of a football match
Sample Space = S = {Win, Draw, Lose}
• Event
An event (E) is a subset of a sample space, and probability is usually calculated with respect to an event.
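A sample space and an event map naturally onto sets. The short sketch below uses the football match sample space from the slide; the event `not_lose` is a hypothetical example chosen for illustration and is not taken from the source.

```python
# Sample space for the football match example
sample_space = {"Win", "Draw", "Lose"}

# Hypothetical event (an assumption for illustration): the team does not lose
not_lose = {"Win", "Draw"}

# An event is a subset of the sample space
is_event = not_lose <= sample_space
print(is_event)  # True

# Elementary events are the individual outcomes of the sample space
elementary_events = [{outcome} for outcome in sorted(sample_space)]
print(elementary_events)
```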
• Probability Estimation using Relative Frequency
• The classical approach to probability estimation of an event is based on the relative frequency of the
occurrence of that event. According to frequency estimation, the probability of an event X, P(X), is
given by

P(X) = n(X) / N = (number of observations favourable to event X) / (total number of observations)

• For example, say a company has 1000 employees and every year about 200 employees leave the job. Then
the probability of attrition of an employee per annum is 200/1000 = 0.2.
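The attrition example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `relative_frequency` is a hypothetical helper introduced here for illustration.

```python
def relative_frequency(favourable: int, total: int) -> float:
    """Estimate P(X) as favourable observations divided by total observations."""
    if total <= 0:
        raise ValueError("total number of observations must be positive")
    if not 0 <= favourable <= total:
        raise ValueError("favourable count must lie between 0 and total")
    return favourable / total

# 200 of 1000 employees leave the job every year
p_attrition = relative_frequency(200, 1000)
print(p_attrition)  # 0.2
```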
• FUNDAMENTAL CONCEPTS IN PROBABILITY – AXIOMS OF PROBABILITY
• In 1933, Andrey Kolmogorov, a Russian mathematician, laid the foundation of the axiomatic theory of
probability (Kolmogorov, 1956). According to the axiomatic theory of probability, the probability of
an event E satisfies the following axioms:
1. The probability of event E always lies between 0 and 1. That is, 0 ≤ P(E) ≤ 1.
2. The probability of the universal set S is 1. That is, P(S) = 1.
3. P(X ∪ Y) = P(X) + P(Y), where X and Y are two mutually exclusive events.
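The three axioms can be checked on a small example. The sketch below assumes a fair six-sided die with equally likely outcomes; the die, the events X and Y, and the helper `prob` are assumptions made for illustration and do not come from the slides.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die
S = frozenset(range(1, 7))

def prob(event):
    """Probability of an event, assuming all outcomes in S are equally likely."""
    return Fraction(len(set(event) & S), len(S))

X = {1, 2}  # hypothetical event: roll is 1 or 2
Y = {5, 6}  # hypothetical event: roll is 5 or 6

axiom_1 = 0 <= prob(X) <= 1                                       # 0 <= P(E) <= 1
axiom_2 = prob(S) == 1                                            # P(S) = 1
axiom_3 = X.isdisjoint(Y) and prob(X | Y) == prob(X) + prob(Y)    # additivity

print(axiom_1, axiom_2, axiom_3)  # True True True
print(prob(X | Y))                # 2/3
```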
• Joint Probability
Let A and B be two events in a sample space. Then the joint probability of the two events, written as
P(A ∩ B), is given by

P(A ∩ B) = (number of observations in A ∩ B) / (total number of observations)
• Independent Events
Two events A and B are said to be independent when the occurrence of one event (say event A) does not
affect the probability of occurrence of the other event (event B). Equivalently, P(A ∩ B) = P(A) × P(B).

• Conditional Probability
If A and B are events in a sample space, then the conditional probability of the event B given that the
event A has already occurred, denoted by P(B|A), is defined as

P(B|A) = P(A ∩ B) / P(A), where P(A) > 0
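Joint probability, conditional probability, and independence can be worked through together on one example. The sketch below assumes a fair six-sided die, so each probability is a ratio of counts; the events A and B are hypothetical and chosen so that they happen to be independent.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die
S = set(range(1, 7))
A = {2, 4, 6}     # hypothetical event: the roll is even
B = {1, 2, 3, 4}  # hypothetical event: the roll is at most 4

def prob(event):
    """Ratio of outcomes in the event to all outcomes (equally likely outcomes assumed)."""
    return Fraction(len(event), len(S))

p_joint = prob(A & B)                # P(A and B) = 2/6 = 1/3
p_b_given_a = p_joint / prob(A)      # P(B|A) = (1/3) / (1/2) = 2/3
independent = p_joint == prob(A) * prob(B)

print(p_joint, p_b_given_a, independent)  # 1/3 2/3 True
```

Because A and B are independent here, P(B|A) equals P(B): knowing that the roll is even does not change the probability that it is at most 4.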
• PROBABILITY DENSITY FUNCTION (PDF) AND CUMULATIVE DISTRIBUTION FUNCTION (CDF) OF A CONTINUOUS
RANDOM VARIABLE

• The cumulative distribution function F(a) is the area under the probability density function up
to X = a, that is, F(a) = P(X ≤ a).
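The "area under the density" reading of the CDF can be verified numerically. The sketch below assumes a standard normal density purely as an example, approximates the area with the trapezoidal rule, and compares it with the closed form based on `math.erf`. The lower bound of -10 stands in for minus infinity and is an assumption of this sketch.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cdf_by_integration(a, lower=-10.0, steps=20_000):
    """Approximate F(a) as the area under the standard normal PDF from lower to a."""
    if a <= lower:
        return 0.0
    h = (a - lower) / steps
    total = 0.5 * (normal_pdf(lower) + normal_pdf(a))
    for i in range(1, steps):
        total += normal_pdf(lower + i * h)
    return total * h

def normal_cdf_closed_form(a):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

a = 1.0
area = cdf_by_integration(a)
exact = normal_cdf_closed_form(a)
print(round(area, 4), round(exact, 4))
```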
• NORMAL DISTRIBUTION
The normal distribution, also known as the Gaussian distribution, is one of the most popular continuous
distributions in the field of analytics, especially due to its use in multiple contexts. The normal
distribution is observed across many naturally occurring measures such as birth weight, height, and intelligence.
• Properties of Normal Distribution
1. Theoretical normal density functions are defined between -∞ and +∞.
2. It is a two-parameter distribution, where the parameter μ is the mean (location parameter) and
the parameter σ is the standard deviation (scale parameter).
3. All normal distributions have a symmetrical bell shape around the mean μ (thus it is also the median).
μ is also the mode of the normal distribution, that is, μ is the mean, median as well as the
mode.
4. For any normal distribution, the areas between specific values measured in terms of μ and σ are
given by:
   μ - σ to μ + σ: about 68.27% of the area
   μ - 2σ to μ + 2σ: about 95.45% of the area
   μ - 3σ to μ + 3σ: about 99.73% of the area
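The areas within one, two, and three standard deviations of the mean can be computed with `statistics.NormalDist` from the Python standard library. The mean of 50 and standard deviation of 10 below are hypothetical values; the result is the same for any choice of the two parameters.

```python
from statistics import NormalDist

def area_within(k, mu, sigma):
    """Area under a normal density between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Hypothetical parameters chosen only for illustration
mu, sigma = 50.0, 10.0
areas = {k: area_within(k, mu, sigma) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s) of the mean: {area:.4f}")
```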
Sampling
INTRODUCTION TO SAMPLING
• The population of India was close to 1.32 billion in July 2016 according to the United
Nations. The Census of India collects information such as demography, literacy, housing,
and economic activity of individuals. Thousands of people are employed to collect
data on such a huge population, which is very expensive and is thus carried
out only once in 10 years. Many organizations cannot afford to collect data on
the entire population, since doing so is expensive and time-consuming. In many
cases, not all members of the population are even known for sampling purposes.
SAMPLING
• The process of identifying a subset from a population of elements (aka
observations or cases) is called the sampling process or simply sampling. The
following steps are used in any sampling process:
1. Identification of the target population that is important for the problem
under study. For example, assume that we are interested in studying attrition
among young professionals in India. The definition ‘young professionals’ in India
is vague; we need a clear identification of the target population. A better definition
of the population in this case would be to study the attrition among IT
professionals in the age group 25–35 years in India. It is important to
clearly define the target population for correct inference.
2. Decide the sampling frame. The sampling frame defines the source (or method/procedure) used
for identifying the elements of the target population.

3. Determine the sample size. Determining the sample size for data collection is important since
collecting data can be expensive and, at the same time, an insufficient sample results in a lack of
precision in the estimation of the parameters.

4. Choose the sampling method. The sampling method is the technique used for selecting individual
cases in the sample from the target population using the sampling frame. At a higher level, sampling
methods are classified into two major categories: probabilistic sampling and non-probabilistic
sampling.
PROBABILISTIC SAMPLING
• Random sampling is one of the most popular and frequently used sampling methods. Shewhart
(1931) defines a random sample as a ‘sample drawn under conditions such that the law of large
numbers applies’. That is, in random sampling, every case in the population has an equal probability
of getting selected in a sample. Random sampling is usually carried out without replacement, that
is, an observation which is selected in the sample is removed from the population for subsequent
selection. However, random samples can also be created with replacement, that is, an observation
which is selected for inclusion in the sample can be selected again since it is replaced (not
removed) in the population.
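The difference between sampling without and with replacement can be seen with the standard library `random` module. This is a minimal sketch: the population of 100 numbered cases, the sample size of 10, and the seed are hypothetical choices for illustration.

```python
import random

rng = random.Random(42)           # seeded only so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 cases

# Without replacement: a selected case is removed from later draws,
# so no case can appear twice in the sample.
without_replacement = rng.sample(population, k=10)

# With replacement: a selected case stays in the population,
# so the same case may be drawn more than once.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```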
• Stratified Sampling
The population can be divided into mutually exclusive groups using some factor (for example, age,
gender, marital status, income, geographical region, etc.). The groups thus formed are called
strata. It is important that the groups are mutually exclusive and exhaustive of the population.

• Random sampling can be used within each stratum to select individual cases and generate a
sample from each group.
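Stratified sampling can be sketched as grouping the cases by stratum and then drawing a random sample within each group. The helper `stratified_sample`, the employee records, the regions, and the 10% sampling fraction are all hypothetical and introduced only for illustration. Sampling the same fraction from every stratum (proportional allocation) is one possible choice, not something the slides prescribe.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Randomly sample the given fraction of cases from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)   # strata are mutually exclusive and exhaustive
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one case per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)
regions = ["North"] * 60 + ["South"] * 30 + ["East"] * 10
employees = [{"id": i, "region": region} for i, region in enumerate(regions)]

sample = stratified_sample(employees, key=lambda r: r["region"], fraction=0.1, rng=rng)

# Count sampled cases per stratum: every stratum is represented
counts = {}
for record in sample:
    counts[record["region"]] = counts.get(record["region"], 0) + 1
print(counts)
```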
• Cluster Sampling
In cluster sampling, the population is divided into mutually exclusive clusters. For example,
assume that a researcher is interested in analysing the life of smartphone batteries from a specific
manufacturer. The manufacturer may have different models (each model in this case will be a
cluster). The clusters are randomly selected and then all units within the selected clusters are
included in the sample (or random sampling is used for selecting subjects within the cluster if the
cluster size is too large). The following steps are used in cluster sampling:

• Note that stratified sampling and cluster sampling are similar; the major difference is that in a
stratified sample, all strata will be represented in the sample, whereas in a cluster sampling, not all
clusters will be represented.
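The procedure described above (randomly select clusters, then take all units in them, or a random subset when a cluster is too large) can be sketched as follows. The helper `cluster_sample`, the model names, and the unit labels are hypothetical and used only for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng, max_per_cluster=None):
    """Randomly pick clusters, then include their units in the sample."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        units = clusters[name]
        # If a cluster is too large, sample randomly within it
        if max_per_cluster is not None and len(units) > max_per_cluster:
            units = rng.sample(units, max_per_cluster)
        sample.extend((name, unit) for unit in units)
    return chosen, sample

rng = random.Random(1)
# Hypothetical clusters: four phone models with five battery units each
clusters = {f"model_{m}": [f"{m}-{i}" for i in range(5)] for m in "ABCD"}

chosen, sample = cluster_sample(clusters, n_clusters=2, rng=rng)
print(chosen)
print(len(sample))  # 10
```

Only the selected clusters contribute units, which is the contrast with stratified sampling, where every stratum is represented.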
• Cluster sampling is used when the clusters are large in number. For example, assume that
we are interested in the impact of demonetization on Indian industry. There are a large number of
industrial sectors. Analysing the impact on all clusters would be expensive and time-consuming, so in
such cases a few clusters (such as healthcare and manufacturing) may be used for the study.
• Bootstrap Aggregating (Bagging)

• Bootstrap Aggregating (known as Bagging) is sampling with replacement used in
machine learning algorithms, especially the random forest algorithm (Breiman,
1996). In Bagging, several samples (with replacement) are generated from the
population and analytical models are developed using each sample.

• For example, in random forest, several hundred samples are generated from the
population and classification trees are generated using each sample. The final
classification of a new case is decided based on the majority voting.
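The bagging idea can be sketched in a few lines: draw bootstrap samples with replacement, build one model per sample, and classify a new case by majority vote. To keep the sketch short, a 1-nearest-neighbour rule on a single feature stands in for the classification trees of a random forest; that substitution, the toy data, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) cases from data with replacement."""
    return [rng.choice(data) for _ in data]

def predict_1nn(sample, x):
    """Toy stand-in model: label of the sampled case whose feature is closest to x."""
    nearest = min(sample, key=lambda point: abs(point[0] - x))
    return nearest[1]

def bagging_predict(data, x, n_models, rng):
    """Classify x by majority vote over models built on bootstrap samples."""
    votes = Counter()
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        votes[predict_1nn(sample, x)] += 1
    return votes.most_common(1)[0][0]

rng = random.Random(7)
# Hypothetical one-feature data set with two well-separated classes
data = [(float(v), "low") for v in range(0, 10)] + \
       [(float(v), "high") for v in range(20, 30)]

pred_low = bagging_predict(data, 3.0, n_models=25, rng=rng)
pred_high = bagging_predict(data, 27.0, n_models=25, rng=rng)
print(pred_low, pred_high)
```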
NON-PROBABILITY SAMPLING

• In non-probability sampling, the selection of sample units from the population
does not follow any probability distribution. Sample units are selected based on
convenience and/or on a voluntary basis.
Assume that a data scientist is interested in studying attrition and the factors
influencing attrition. For this study, he/she may collect data from friends and
colleagues, which may not be a true representation of the population. Such
sampling procedures come under the category of non-probability sampling since
the sample cases are not chosen probabilistically. Estimates based on
non-probability sampling may be biased.
• Convenience Sampling
Convenience sampling is a non-probability sampling technique in which the sample units are not
selected according to a probability distribution. For example, a researcher may collect data from
his/her school or workplace and from friends, since the cost of data collection in such cases is minimal.
Convenience sampling is not recommended since it is likely to result in biased estimates.

• Voluntary Sampling
Under voluntary sampling, the data is collected from people who volunteer for such data collection. For
example, customer feedback in many contexts falls under this sampling procedure. There could be bias
in the case of voluntary sampling. Many organizations, such as Amazon and Trip Advisor, collect customer
feedback. Many times the feedback is provided by customers who had a bad experience with the
product/service; many customers who were happy with the product/service may not give feedback.
ESTIMATION OF POPULATION PARAMETERS

• Estimation is a process used for making inferences about population parameters based on samples. For
example, we may like to estimate population parameters such as the mean and standard deviation, and
probability distribution parameters such as the scale, shape, and location parameters. The following are
the two types of estimates:
1. Point Estimate: A point estimate of a population parameter is a single value (or specific value)
calculated from a sample (and thus called a statistic). The sample mean and variance are estimates of the
population mean and variance. Similarly, the sample proportion is an estimate of the population proportion.
2. Interval Estimate: Instead of a specific value of the parameter, in an interval estimate the parameter is
said to lie in an interval (say between points a and b) with a certain probability (or confidence).

• The confidence level, usually written as (1 - α)100%, on the interval estimate of a population
parameter is the probability that the interval estimate will contain the true population
parameter. When α = 0.05, 95% is the confidence level and 0.95 is the probability that the
interval estimate will contain the population parameter.
• 95% is the most frequently used confidence level, although 90% and 99% are also used frequently. The
choice of α depends on the context of the problem. When higher confidence in the estimate is required,
a lower value of α is chosen.
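A point estimate and an interval estimate for a population mean can be sketched as below. The sketch assumes the normal approximation for the sample mean (reasonable for large samples; small samples would typically use the t-distribution instead), and the function name and data values are hypothetical, introduced only for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, alpha=0.05):
    """Return the sample mean and a (1 - alpha)100% interval for the population mean.

    Assumes the normal approximation for the sampling distribution of the mean.
    """
    n = len(sample)
    point_estimate = mean(sample)                 # point estimate of the population mean
    std_error = stdev(sample) / math.sqrt(n)      # standard error of the sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96 when alpha = 0.05
    return point_estimate, (point_estimate - z * std_error, point_estimate + z * std_error)

# Hypothetical sample of 12 measurements
sample = [52.1, 48.3, 50.7, 49.9, 51.4, 47.8, 50.2, 53.0, 49.1, 50.6, 48.9, 51.7]

estimate, interval_95 = mean_confidence_interval(sample, alpha=0.05)
_, interval_99 = mean_confidence_interval(sample, alpha=0.01)
print(round(estimate, 3), interval_95, interval_99)
```

A lower α gives a higher confidence level and therefore a wider interval, which is why the 99% interval contains the 95% interval.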
