Random Variable

UNIT II

A population is the entire group that you want to draw conclusions about. A sample is the
specific group that you will collect data from. The size of the sample is always less than the
total size of the population.

Measures of Location

These functions calculate an average from a population or sample. These functions do not require the
data given to them to be sorted.

mean() Arithmetic mean (“average”) of data.
geometric_mean() Geometric mean of data.
harmonic_mean() Harmonic mean of data.
median() Median (middle value) of data.
median_low() Low median of data.
median_high() High median of data.
median_grouped() Median, or 50th percentile, of grouped data.
mode() Single mode (most common value) of discrete or nominal data.
multimode() List of modes (most common values) of discrete or nominal data.

Python 3 provides the statistics module, which comes with useful functions such as mean(),
median(), and mode().

import statistics as st

statistics.mean(data) - Return the sample arithmetic mean of data, which can be a
sequence or iterable. If the input dataset is empty, a StatisticsError is raised.

The sample mean gives an unbiased estimate of the true population mean, so that when
taken on average over all the possible samples, mean(sample) converges on the true
mean of the entire population. If data represents the entire population rather than a sample,
then mean(data) is equivalent to calculating the true population mean μ.

st.mean([1, 2, 3, 4, 4])

2.8

statistics.geometric_mean(data)

Convert data to floats and compute the geometric mean.

The geometric mean indicates the central tendency or typical value of the data using the
product of the values (as opposed to the arithmetic mean which uses their sum). The
geometric mean is defined as the nth root of the product of n numbers.

Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it
contains a negative value.

from scipy.stats.mstats import gmean

a = [1.2, 2.3, 3.5]
print(gmean(a))

2.129
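
Python's own statistics module (3.8+) computes the same quantity directly, so the scipy import is not required:

import statistics as st

a = [1.2, 2.3, 3.5]
# nth root of the product: (1.2 * 2.3 * 3.5) ** (1/3) ≈ 2.1297, matching gmean above
print(st.geometric_mean(a))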

statistics.harmonic_mean(data, weights=None)
Returns the harmonic mean of data. If weights is omitted or None, then equal weighting is
assumed.

The harmonic mean is the reciprocal of the arithmetic mean() of the reciprocals of the data.
For example, the harmonic mean of three values a, b and c will be equivalent
to 3/(1/a + 1/b + 1/c).

It is often appropriate when averaging ratios or rates, for example speeds.

import statistics as st

a = [1.2, 2.3, 3.5]

print(st.harmonic_mean(a))

1.93

A StatisticsError is raised if data is empty, any element is less than zero, or the weighted sum
isn’t positive.
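
The weights parameter (added in Python 3.10) handles the weighted case. For example, the average speed of a trip driven 5 km at 40 km/hr and then 30 km at 60 km/hr is the distance-weighted harmonic mean:

import statistics as st

# 5 km at 40 km/hr, then 30 km at 60 km/hr
print(st.harmonic_mean([40, 60], weights=[5, 30]))

56.0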

statistics.median(data)

Return the median (middle value) of numeric data, using the common “mean of middle two” method.
If data is empty, StatisticsError is raised. The median is a robust measure of central location.
When the number of data points is odd, the middle data point is returned:

st.median([1, 3, 5])

3

When the number of data points is even, the median is interpolated by taking the average of the two
middle values:

st.median([1, 3, 5, 7])

4.0

If the data is ordinal (supports order operations) but not numeric (doesn’t support addition), consider
using median_low() or median_high() instead.

statistics.median_low(data)

Return the low median of numeric data. If data is empty, StatisticsError is raised. When the
number of data points is odd, the middle value is returned. When it is even, the smaller of the two
middle values is returned.

st.median_low([1, 3, 5])

3

st.median_low([1, 3, 5, 7])

3

statistics.median_high(data)

Return the high median of data. If data is empty, StatisticsError is raised. The high median is
always a member of the data set. When the number of data points is odd, the middle value is
returned. When it is even, the larger of the two middle values is returned.

st.median_high([1, 3, 5])

3

st.median_high([1, 3, 5, 7])

5

statistics.median_grouped(data, interval=1)

Return the median of grouped continuous data. If data is empty, StatisticsError is raised.

The mathematical formula for Grouped Median is: GMedian = L + interval * (N / 2 - CF) / F.

• L = The lower limit of the median interval
• interval = The interval width
• N = The total number of data points
• CF = The number of data points below the median interval
• F = The number of data points in the median interval

st.median_grouped([52, 52, 53, 54])

52.5

st.median_grouped([ 1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

3.7
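
The formula can be traced by hand with a short sketch (grouped_median here is a hypothetical helper, assuming unit-width intervals centred on each distinct value, which is how median_grouped treats the data):

import statistics as st

def grouped_median(data, interval=1):
    # GMedian = L + interval * (N / 2 - CF) / F
    data = sorted(data)
    n = len(data)
    x = data[n // 2]            # a value inside the median interval
    L = x - interval / 2        # lower limit of the median interval
    cf = data.index(x)          # data points below the median interval
    f = data.count(x)           # data points in the median interval
    return L + interval * (n / 2 - cf) / f

print(grouped_median([52, 52, 53, 54]))      # 52.5
print(st.median_grouped([52, 52, 53, 54]))   # 52.5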

statistics.mode(data)

Return the single most common data point from discrete or nominal data. The mode (when it exists)
is the most typical value and serves as a measure of central location. If there are multiple modes
with the same frequency, returns the first one encountered in the data. If the input data is
empty, StatisticsError is raised.

st.mode([1, 1, 2, 3, 3, 3, 3, 4])

3

statistics.multimode(data)

Return a list of the most frequently occurring values in the order they were first encountered in
the data. Will return more than one result if there are multiple modes or an empty list if the data is
empty:

from statistics import multimode

x = multimode("aabbbbccddddeeffffgg")

print(x)

['b', 'd', 'f']

Measures of Spread

These functions calculate a measure of how much the population or sample tends to deviate from the
typical or average values.

variance() Sample variance of data.
stdev() Sample standard deviation of data.
pstdev() Population standard deviation of data.
pvariance() Population variance of data.

statistics.variance(data, xbar=None)
Return the sample variance of data. Variance is a measure of the variability from the mean (spread
or dispersion) of data. A large variance indicates that the data is spread out; a small variance
indicates it is clustered closely around the mean. If the optional second argument xbar is given, it
should be the mean of data. If it is missing or None (the default), the mean is automatically
calculated. Raises StatisticsError if data has fewer than two values.

Sample variance is calculated as the sum of the squared differences from the mean, divided by n - 1.

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

st.variance(data)

1.3720238095238095
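
If the mean is already known, it can be passed as xbar to avoid recomputing it; a minimal sketch:

import statistics as st

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
m = st.mean(data)            # compute the mean once
print(st.variance(data, m))  # same result: 1.3720238095238095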

statistics.stdev(data, xbar=None)
Return the standard deviation (the square root of the sample variance).

st.stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])

1.08108741

statistics.pvariance(data, mu=None)
Return the population variance of data. Variance is a measure of the variability (spread or
dispersion) of data. A large variance indicates that the data is spread out; a small variance
indicates it is clustered closely around the mean. If the optional second argument mu is
given, it is typically the mean of the data. If it is missing or None (the default), the arithmetic
mean is automatically calculated.

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]

st.pvariance(data)

1.25

statistics.pstdev(data, mu=None)
Return the population standard deviation (the square root of the population variance).

st.pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])

0.98689327

Measures of association of two or more variables

These functions calculate statistics regarding relations between two inputs.

covariance() Sample covariance for two variables.
correlation() Correlation coefficient for two variables.
statistics.covariance(x, y, /)

Covariance is a measure of how much two random variables vary together. It’s similar to variance,
but where variance tells you how a single variable varies, covariance tells you how two variables
vary together.

Cov(X, Y) = Σ (X – μ)(Y – ν) / (n – 1), where:

• X and Y are random variables
• E(X) = μ is the expected value (the mean) of the random variable X
• E(Y) = ν is the expected value (the mean) of the random variable Y
• n = the number of items in the data set
• Σ is summation notation

Return the sample covariance of two inputs x and y. Both inputs must be of the same length,
otherwise a StatisticsError is raised.
import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

print(np.cov(x, y))

[[7.5  0.75]
 [0.75 0.75]]
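
For comparison, statistics.covariance (Python 3.10+) returns just the scalar sample covariance, matching the off-diagonal entry of the numpy matrix above:

import statistics as st

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

print(st.covariance(x, y))

0.75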

statistics.correlation(x, y, /)

Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool for
describing simple relationships without making a statement about cause and effect.

Correlations are useful for describing simple relationships among data. For example,
imagine that you are looking at a dataset of campsites in a mountain park. You want to
know whether there is a relationship between the elevation of the campsite (how high up
the mountain it is), and the average high temperature in the summer.

A scatter plot of the two variables typically shows one of three forms of correlation:

1. Negative correlation (r close to -1): the y values tend to decrease as the x values
increase. This is strong negative correlation, which occurs when large values of one
feature correspond to small values of the other, and vice versa.
2. Weak or no correlation (r close to 0): no obvious trend. This is weak correlation,
which occurs when an association between two features is not obvious or is hardly
observable.
3. Positive correlation (r close to +1): the y values tend to increase as the x values
increase. This is strong positive correlation, which occurs when large values of one
feature correspond to large values of the other, and vice versa.

import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

print(np.corrcoef(x, y))

[[1.         0.31622777]
 [0.31622777 1.        ]]
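
Likewise, statistics.correlation (Python 3.10+) returns the scalar Pearson coefficient directly, matching the off-diagonal entry above:

import statistics as st

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

print(st.correlation(x, y))

0.31622776601683794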

Probability

Probability is the measure of the likelihood that an event will occur. Probability is quantified
as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. The
higher the probability of an event, the more likely it is that the event will occur.

Example: A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the
two outcomes (“heads” and “tails”) are both equally probable; the probability of “heads”
equals the probability of “tails”; and since no other outcomes are possible, the probability of
either “heads” or “tails” is 1/2 (which could also be written as 0.5 or 50%).

Conditional Probability is a measure of the probability of an event given that (by
assumption, presumption, assertion or evidence) another event has already occurred. If the
event of interest is A and the event B is known or assumed to have occurred, “the
conditional probability of A given B” is usually written as P(A|B).

Independent Events - Two events are said to be independent of each other if the
probability that one event occurs in no way affects the probability of the other event
occurring; in other words, observing one event tells us nothing about the probability of
the other.

Simple examples of independent events:

• Owning a dog and growing your own herb garden.
• Paying off your mortgage early and owning a Chevy Cavalier.
• Taking a cab home and finding your favourite movie on cable.

For independent events A and B:

P(A|B) = P(A)
P(B|A) = P(B)
P(A∩B) = P(A) · P(B)
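
As a sanity check, the product rule can be verified by enumerating a small sample space; this sketch assumes two fair dice as the experiment:

from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice
space = list(product(range(1, 7), repeat=2))

A = {o for o in space if o[0] % 2 == 0}   # event A: first die is even
B = {o for o in space if o[1] == 6}       # event B: second die shows 6

def p(event):
    return Fraction(len(event), len(space))

print(p(A & B) == p(A) * p(B))  # True: 1/12 = 1/2 * 1/6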
Joint Probability refers to the probability that two events will both occur. In other words,
joint probability is the likelihood of two events occurring together. For the product formula
below to work, the events must be independent; that is, they must not be able to influence
each other:

P(A ∩ B) = P(A) × P(B)

where:

• P(A ∩ B) is the notation for the joint probability of events “A” and “B”.
• P(A) is the probability of event “A” occurring.
• P(B) is the probability of event “B” occurring.
• For example, the probability of drawing a red card from a deck of cards is 26/52 = 0.5.
This means that there is an equal chance of drawing a red card and drawing a
black card; since there are 52 cards in a deck, of which 26 are red and 26 are black,
there is a 50-50 probability of drawing a red card versus a black card.
• Joint probability is a measure of two events happening at the same time, and can
only be applied to situations where more than one observation can occur at the
same time. For example, from a deck of 52 cards, the joint probability of picking up
a card that is both red and 6 is P(6 ∩ red) = 2/52 = 1/26, since a deck of cards has
two red sixes—the six of hearts and the six of diamonds. Because the events "6"
and "red" are independent in this example, you can also use the following formula to
calculate the joint probability:

P(6∩red)=P(6)×P(red)=4/52×26/52=1/26
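
A quick enumeration over a standard deck (an illustrative sketch) confirms the arithmetic:

from fractions import Fraction
from itertools import product

ranks = [2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K', 'A']
suits = ['hearts', 'diamonds', 'clubs', 'spades']   # hearts and diamonds are red
deck = list(product(ranks, suits))                  # 52 cards

six = {c for c in deck if c[0] == 6}
red = {c for c in deck if c[1] in ('hearts', 'diamonds')}

n = len(deck)
print(Fraction(len(six & red), n))                    # 1/26: P(6 ∩ red)
print(Fraction(len(six), n) * Fraction(len(red), n))  # 1/26: P(6) × P(red)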

Bayes’ Theorem

Bayes' theorem, named after 18th-century British mathematician Thomas Bayes, is a
mathematical formula for determining conditional probability. Bayes' theorem provides a
way to revise existing predictions or theories (update probabilities) given new or additional
evidence. In finance, Bayes' theorem can be used to rate the risk of lending money to
potential borrowers.

Bayes' theorem is also called Bayes' Rule or Bayes' Law and is the foundation of the field
of Bayesian statistics. It allows you to update predicted probabilities of an event by
incorporating new information. It is often employed in finance in updating risk evaluation.

Starting from the definition of conditional probability:

P(A|B) = P(A∩B) / P(B)

P(B|A) = P(B∩A) / P(A)

Since P(B∩A) = P(A∩B), the second equation gives P(A∩B) = P(B|A) * P(A).

Therefore, we get:
P(A|B) = (P(B|A) * P(A)) / P(B)
This is Bayes’ theorem.

It is used in scenarios where one conditional probability, say P(B|A), is easy to calculate but
the one we actually want, P(A|B), is difficult to obtain directly. Bayes’ theorem is very helpful
in such situations: it converts one conditional probability into the other with ease.

Example:
What is the probability of two girls (2G), given at least one girl (1G), in a family of two children?

P(2G | 1G) = P(1G | 2G) * P(2G) / P(1G)

P(1G | 2G) = 1, since a family with two girls certainly has at least one girl.

There are 4 equally likely possibilities - GB, GG, BG, BB.

P(2G) = 1/4

P(1G) = 3/4 (every outcome except BB)

P(2G | 1G) = (1 * 1/4) / (3/4) = 1/3
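
An enumeration of the four equally likely families (an illustrative sketch) confirms the result:

from fractions import Fraction
from itertools import product

# The four equally likely two-child families: GG, GB, BG, BB
families = list(product('GB', repeat=2))

at_least_one_girl = [f for f in families if 'G' in f]          # GG, GB, BG
two_girls = [f for f in at_least_one_girl if f == ('G', 'G')]  # GG

# P(2G | 1G) = |two girls| / |at least one girl|
print(Fraction(len(two_girls), len(at_least_one_girl)))  # 1/3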

Prior probability, in Bayesian statistical inference, is the probability of an event before
new data is collected. This is the best rational assessment of the probability of an outcome
based on the current knowledge before an experiment is performed.

Posterior probability is the revised probability of an event occurring after taking into
consideration new information. Posterior probability is calculated by updating the prior
probability by using Bayes' theorem. In statistical terms, the posterior probability is the
probability of event A occurring given that event B has occurred.

Likelihood refers to the probability of observing the data that has been observed assuming
that the data came from a specific scenario.

Marginal probability is the probability of an event occurring on its own, P(A), irrespective of the outcome of any other event.
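
Putting these four terms together, here is a minimal sketch of a Bayesian update. The numbers are assumptions chosen only for illustration (a test with 99% sensitivity, a 5% false-positive rate, and a 1% prior):

prior = 0.01              # P(A): prior probability of the condition (assumed)
likelihood = 0.99         # P(B|A): probability of a positive test given the condition (assumed)
false_positive = 0.05     # P(B|not A): probability of a positive test without it (assumed)

# Marginal probability P(B), via the law of total probability
marginal = likelihood * prior + false_positive * (1 - prior)

# Posterior P(A|B), via Bayes' theorem
posterior = likelihood * prior / marginal
print(round(posterior, 4))  # 0.1667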

Differences between Conditional Probability & Bayes Theorem

Conditional Probability: the probability of occurrence of a certain event, say A, given that
some other event B has occurred. Its equation is P(A|B) = P(A∩B) / P(B). It is used to
compute a conditional probability directly when the events A and B are relatively simple,
and it suits relatively simple problems.

Bayes Theorem: relates two conditional probabilities for the events, say A and B. Its
equation is P(A|B) = P(B|A) × P(A) / P(B). It is used in Bayesian inference and in models
where we are interested in the distribution up to a normalizing factor P(B), and it gives a
structured formula for solving more complex problems.
