Random Variable
A population is the entire group that you want to draw conclusions about. A sample is the
specific group that you will collect data from. The size of the sample is always less than the
total size of the population.
Measures of Location
These functions calculate an average or typical value from a population or sample. They do not require the data given to them to be sorted.
Python 3 provides the statistics module, which comes with very useful functions such as mean(), median(), and mode().
import statistics as st
The sample mean gives an unbiased estimate of the true population mean, so that when
taken on average over all the possible samples, mean(sample) converges on the true
mean of the entire population. If data represents the entire population rather than a sample,
then mean(data) is equivalent to calculating the true population mean μ.
st.mean([1, 2, 3, 4, 4])
2.8
statistics.geometric_mean(data)
The geometric mean indicates the central tendency or typical value of the data using the product of the values (as opposed to the arithmetic mean, which uses their sum). The geometric mean is defined as the nth root of the product of n numbers.
Raises a StatisticsError if the input dataset is empty, if it contains a zero, or if it contains a negative value.
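For example, the geometric mean of 54, 24 and 36 is the cube root of their product (46656), which is 36:

```python
import statistics as st

# nth root of the product of n numbers: (54 * 24 * 36) ** (1/3) = 36
print(st.geometric_mean([54, 24, 36]))  # approximately 36.0
```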
statistics.harmonic_mean(data, weights=None)
Returns the harmonic mean of data. If weights is omitted or None, then equal weighting is
assumed.
The harmonic mean is the reciprocal of the arithmetic mean() of the reciprocals of the data.
For example, the harmonic mean of three values a, b and c will be equivalent
to 3/(1/a + 1/b + 1/c).
import statistics as st
print(st.harmonic_mean([40, 60]))
48.0
StatisticsError is raised if data is empty, if any element is less than zero, or if the weighted sum isn't positive.
statistics.median(data)
Return the median (middle value) of numeric data, using the common “mean of middle two” method.
If data is empty, StatisticsError is raised. The median is a robust measure of central location.
When the number of data points is odd, the middle data point is returned:
st.median([1, 3, 5])
3
When the number of data points is even, the median is interpolated by taking the average of the two
middle values:
st.median([1, 3, 5, 7])
4.0
If the data is ordinal (supports order operations) but not numeric (doesn’t support addition), consider
using median_low() or median_high() instead.
statistics.median_low(data)
Return the low median of numeric data. If data is empty, StatisticsError is raised. When the
number of data points is odd, the middle value is returned. When it is even, the smaller of the two
middle values is returned.
st.median_low([1, 3, 5])
3
st.median_low([1, 3, 5, 7])
3
statistics.median_high(data)
Return the high median of data. If data is empty, StatisticsError is raised. The high median is
always a member of the data set. When the number of data points is odd, the middle value is
returned. When it is even, the larger of the two middle values is returned.
st.median_high([1, 3, 5])
3
st.median_high([1, 3, 5, 7])
5
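As noted above, median_low() and median_high() also work on ordinal, non-numeric data, because they pick an existing element rather than averaging. A small sketch with letter grades (assuming the lexicographic ordering 'A' < 'B' < 'C' < 'D'):

```python
import statistics as st

grades = ["A", "B", "C", "D"]  # ordinal: supports < comparisons, not addition
print(st.median_low(grades))   # B
print(st.median_high(grades))  # C
```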
statistics.median_grouped(data, interval=1)
Return the median of grouped continuous data. If data is empty, StatisticsError is raised.
The formula for the grouped median is GMedian = L + interval * (N / 2 - CF) / F, where L is the lower limit of the median interval, N is the total number of data points, CF is the cumulative frequency below the median interval, and F is the frequency within the interval.
st.median_grouped([52, 52, 53, 54])
52.5
st.median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
3.7
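The formula can be checked by hand for this data set: the median class is the interval around 4 (3.5 to 4.5), so L = 3.5, CF = 4 values lie below the class, F = 5 values lie inside it, and N = 10:

```python
import statistics as st

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
L, CF, F, N = 3.5, 4, 5, 10        # median class is 3.5-4.5, interval = 1
by_hand = L + 1 * (N / 2 - CF) / F
print(by_hand)                     # 3.7
print(st.median_grouped(data))     # 3.7
```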
statistics.mode(data)
Return the single most common data point from discrete or nominal data. The mode (when it exists)
is the most typical value and serves as a measure of central location. If there are multiple modes
with the same frequency, returns the first one encountered in the data. If the input data is
empty, StatisticsError is raised.
st.mode([1, 1, 2, 3, 3, 3, 3, 4])
3
statistics.multimode(data)
Return a list of the most frequently occurring values in the order they were first encountered in
the data. Will return more than one result if there are multiple modes or an empty list if the data is
empty:
x = st.multimode('aabbbbccddddeeffffgg')
print(x)
['b', 'd', 'f']
Measures of Spread
These functions calculate a measure of how much the population or sample tends to deviate from the
typical or average values.
statistics.variance(data, xbar=None)
Return the sample variance of data. Variance is a measure of the variability from the mean (spread
or dispersion) of data. A large variance indicates that the data is spread out; a small variance
indicates it is clustered closely around the mean. If the optional second argument xbar is given, it
should be the mean of data. If it is missing or None (the default), the mean is automatically
calculated. Raises StatisticsError if data has fewer than two values.
Variance is calculated as the average of the squared differences from the Mean.
st.variance([2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5])
1.3720238095238095
statistics.stdev(data, xbar=None)
Return the sample standard deviation (the square root of the sample variance).
st.stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
1.0810874155219827
statistics.pvariance(data, mu=None)
Return the population variance of data. Variance is a measure of the variability (spread or
dispersion) of data. A large variance indicates that the data is spread out; a small variance
indicates it is clustered closely around the mean. If the optional second argument mu is
given, it is typically the mean of the data. If it is missing or None (the default), the arithmetic
mean is automatically calculated.
st.pvariance([0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25])
1.25
statistics.pstdev(data, mu=None)
Return the population standard deviation (the square root of the population variance).
st.pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
0.986893273527251
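The only difference between variance() and pvariance() is the divisor: n - 1 for a sample, n for a full population. A quick comparison on the same small data set:

```python
import statistics as st

data = [1, 2, 3, 4, 4]
# The sum of squared deviations from the mean (2.8) is 6.8.
print(st.variance(data))   # 1.7   -> 6.8 / (n - 1) = 6.8 / 4
print(st.pvariance(data))  # 1.36  -> 6.8 / n       = 6.8 / 5
```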
statistics.covariance(x, y, /)
Covariance is a measure of how much two random variables vary together. It's similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
The sample covariance is cov(x, y) = Σ (xi - μ)(yi - ν) / (n - 1), where:
• X is a random variable
• E(X) = μ is the expected value (the mean) of the random variable X
• E(Y) = ν is the expected value (the mean) of the random variable Y
• n is the number of items in the data set
• Σ is summation notation
Return the sample covariance of two inputs x and y. Both inputs must be of the same length, else StatisticsError is raised.
import numpy as np
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(np.cov(x, y))
[[7.5  0.75]
 [0.75 0.75]]
statistics.correlation(x, y, /)
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool for
describing simple relationships without making a statement about cause and effect.
Correlations are useful for describing simple relationships among data. For example,
imagine that you are looking at a dataset of campsites in a mountain park. You want to
know whether there is a relationship between the elevation of the campsite (how high up
the mountain it is), and the average high temperature in the summer.
1. Negative correlation (-1): the y values tend to decrease as the x values increase. Strong negative correlation occurs when large values of one feature correspond to small values of the other, and vice versa.
2. Weak or no correlation (0): there is no obvious trend. Weak correlation occurs when an association between two features is not obvious or is hardly observable.
3. Positive correlation (+1): the y values tend to increase as the x values increase. Strong positive correlation occurs when large values of one feature correspond to large values of the other, and vice versa.
import numpy as np
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(np.corrcoef(x, y))
[[1.         0.31622777]
 [0.31622777 1.        ]]
Probability
Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur.
Example: A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the
two outcomes (“heads” and “tails”) are both equally probable; the probability of “heads”
equals the probability of “tails”; and since no other outcomes are possible, the probability of
either “heads” or “tails” is 1/2 (which could also be written as 0.5 or 50%).
Independent Events - Two events are said to be independent of each other if the probability that one event occurs in no way affects the probability of the other event occurring; in other words, observing one event does not change the probability of the other.
P(A|B) = P(A)
P(B|A) = P(B)
P(A∩B) = P(A) · P(B)
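These identities can be verified by brute force over a finite sample space. A sketch with two fair dice, taking A = "the first die is even" and B = "the second die shows more than 4" (an assumed pair of events for illustration):

```python
from fractions import Fraction
from itertools import product

# Enumerate the full sample space of two fair dice (36 outcomes).
space = list(product(range(1, 7), repeat=2))
p_a  = Fraction(sum(1 for d1, d2 in space if d1 % 2 == 0), len(space))
p_b  = Fraction(sum(1 for d1, d2 in space if d2 > 4), len(space))
p_ab = Fraction(sum(1 for d1, d2 in space if d1 % 2 == 0 and d2 > 4), len(space))
print(p_ab == p_a * p_b)  # True: P(A ∩ B) = P(A) · P(B)
```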
Joint Probability refers to the probability that two events will both occur; in other words, it is the likelihood of two events occurring together. For the simple product rule below to hold, the events must be independent, i.e. they must not be able to influence each other:
P(A ∩ B) = P(A) · P(B)
where:
• P(A ∩ B) is the notation for the joint probability of events "A" and "B".
• P(A) is the probability of event "A" occurring.
• P(B) is the probability of event "B" occurring.
• For example, the probability of drawing a red card from a deck of cards is 1/2 = 0.5.
This means that there is an equal chance of drawing a red and drawing a
black; since there are 52 cards in a deck, of which 26 are red and 26 are black,
there is a 50-50 probability of drawing a red card versus a black card.
• Joint probability is a measure of two events happening at the same time, and can
only be applied to situations where more than one observation can occur at the
same time. For example, from a deck of 52 cards, the joint probability of picking up
a card that is both red and 6 is P(6 ∩ red) = 2/52 = 1/26, since a deck of cards has
two red sixes—the six of hearts and the six of diamonds. Because the events "6"
and "red" are independent in this example, you can also use the following formula to
calculate the joint probability:
P(6∩red)=P(6)×P(red)=4/52×26/52=1/26
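The card example works out exactly with fractions:

```python
from fractions import Fraction

# A standard 52-card deck has 4 sixes, 26 red cards and 2 red sixes.
p_six         = Fraction(4, 52)
p_red         = Fraction(26, 52)
p_six_and_red = Fraction(2, 52)
print(p_six_and_red)                   # 1/26
print(p_six_and_red == p_six * p_red)  # True: "6" and "red" are independent
```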
Bayes’ Theorem
Bayes' theorem is also called Bayes' Rule or Bayes' Law and is the foundation of the field of Bayesian statistics. It allows you to update the predicted probability of an event by incorporating new information. It is often employed in finance to update risk evaluations.
Given this,
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)
Therefore, we get,
P(A|B) = (P(B|A) * P(A)) / P(B)
This is Bayes' theorem.
It is used in scenarios where P(A|B) is difficult to calculate but P(B|A) is easy to calculate. Bayes' theorem is very helpful in such situations: it converts one conditional probability into the other with ease.
Example:
What is the probability that a family with two children has two girls, given that at least one child is a girl?
There are 4 equally likely possibilities - GG, GB, BG, BB.
P(2G) = 1/4
P(at least 1G) = 3/4
P(at least 1G | 2G) = 1
P(2G | at least 1G) = P(at least 1G | 2G) · P(2G) / P(at least 1G) = (1 · 1/4) / (3/4) = 1/3
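The arithmetic of this example can be reproduced exactly with fractions:

```python
from fractions import Fraction

# Two children: GG, GB, BG, BB are equally likely.
p_two_girls    = Fraction(1, 4)  # P(2G)
p_at_least_one = Fraction(3, 4)  # P(at least one girl)
p_obs_given_2g = Fraction(1)     # P(at least one girl | 2G) = 1
posterior = p_obs_given_2g * p_two_girls / p_at_least_one
print(posterior)  # 1/3
```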
Posterior probability is the revised probability of an event occurring after taking into
consideration new information. Posterior probability is calculated by updating the prior
probability by using Bayes' theorem. In statistical terms, the posterior probability is the
probability of event A occurring given that event B has occurred.
Likelihood refers to the probability of observing the data that has been observed assuming
that the data came from a specific scenario.
Differences between Conditional Probability & Bayes Theorem