A Case Study With Conditional Probability - Kaggle

8/13/2021 Notebook
https://www.kaggle.com/lakshya91/a-case-study-with-conditional-probability
Overview
In this post, I'll give a gentle introduction to conditional probability with the help of a real-life
example. Then, I'll extend the idea of calculating conditional probabilities to Bayes Theorem. I
know excellent resources on this already exist; this post is in no way a replacement for them, just
a supplement showing how these concepts apply to some real-life data.
In probability theory, conditional probability is the probability of an event occurring given
that another event has already occurred. To understand it a little better, we first need to set the
stage by defining a few terms from set theory.
Events and Sample Space
In the simplest terms, an event is a result (or a set of results) of a random experiment. For example,
getting a head when we toss a coin is one event; drawing a red ball at random from a bag containing
3 black and 5 red balls is another. As we can see, we can easily associate the concept of probability
with events.
The collection of all possible outcomes of an experiment is called the sample space. For tossing a coin
there are just two outcomes, a head (H) or a tail (T), so the sample space is {H, T}. Similarly, rolling
a fair die always results in a number from 1 to 6, hence the sample space is {1, 2, 3, 4, 5, 6}.
Union of events
Consider again the rolling of a fair die, where we define two events:
Event A: Getting a number which is divisible by 2
Event B: Getting a number which is divisible by 3
The set of outcomes for event A is {2, 4, 6}, whereas for event B it is {3, 6}. Now if we define
another event C, getting a number which is divisible by either 2 OR 3, its set of outcomes is the
combination of all the unique elements of A and B: {2, 3, 4, 6}. These events can be shown in a
Venn diagram:
(https://imgur.com/7o5B3u9)
In terms of probabilities, we can calculate the probability of each event as follows:

$$P(A) = \frac{\text{Number of cases where the output is divisible by 2}}{\text{Total possible outcomes}} = \frac{3}{6} = 0.5$$

$$P(B) = \frac{\text{Number of cases where the output is divisible by 3}}{\text{Total possible outcomes}} = \frac{2}{6} \approx 0.333$$

$$P(C) = P(A \cup B) = \frac{\text{Number of cases where the output is divisible by either 2 OR 3}}{\text{Total possible outcomes}} = \frac{4}{6} \approx 0.667$$
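The same numbers can be reproduced with Python sets. This is only a minimal sketch: the helper name `prob` and the variable names are my own choices for illustration, and it assumes every face of the die is equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a fair die
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space
A = {x for x in sample_space if x % 2 == 0}  # divisible by 2 -> {2, 4, 6}
B = {x for x in sample_space if x % 3 == 0}  # divisible by 3 -> {3, 6}


def prob(event, space=sample_space):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(space))


# Union: outcomes divisible by 2 OR 3
C = A | B

print(sorted(C))                   # [2, 3, 4, 6]
print(prob(A), prob(B), prob(C))   # 1/2 1/3 2/3
```

Using `Fraction` keeps the results exact, so 4/6 shows up as 2/3 rather than a rounded decimal.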
Intersection of events
Following the events defined previously, we can also define an event D, getting a number which is
divisible by both 2 AND 3, meaning the outcomes common to both events. Here that is just {6}. In a
Venn diagram, it can be shown as:
(https://imgur.com/dwBjCji)
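Again, in terms of probabilities:

$$P(A) = \frac{\text{Number of cases where the output is divisible by 2}}{\text{Total possible outcomes}} = \frac{3}{6} = 0.5$$

$$P(B) = \frac{\text{Number of cases where the output is divisible by 3}}{\text{Total possible outcomes}} = \frac{2}{6} \approx 0.333$$

$$P(D) = P(A \cap B) = \frac{\text{Number of cases where the output is divisible by both 2 AND 3}}{\text{Total possible outcomes}} = \frac{1}{6} \approx 0.167$$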
Disjoint events
Consider these two events:
Event A: Getting a number which is divisible by 2
Event B: Getting a number which is divisible by 5
The set of outcomes for event A is {2, 4, 6}, whereas for event B it is {5}. These events cannot
occur together in a single roll of a die, so they are called disjoint events. The Venn diagram for
this type of event can be shown as:
(https://imgur.com/KqXin9c)
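Intersections and disjointness can be checked the same way with set operations. Again, this is just an illustrative sketch with variable names of my own choosing, assuming a fair six-sided die.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

div_by_2 = {x for x in sample_space if x % 2 == 0}  # {2, 4, 6}
div_by_3 = {x for x in sample_space if x % 3 == 0}  # {3, 6}
div_by_5 = {x for x in sample_space if x % 5 == 0}  # {5}

# Intersection: outcomes divisible by both 2 AND 3
both = div_by_2 & div_by_3
p_both = Fraction(len(both), len(sample_space))
print(both, p_both)  # {6} 1/6

# Disjoint events share no outcomes, so their intersection is empty
print(div_by_2.isdisjoint(div_by_5))  # True
print(div_by_2 & div_by_5)            # set()
```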
Dependent and Independent events
If the occurrence of one event does not affect the occurrence of another event, the events are
termed independent events. A few examples of independent events:
Getting a head when a coin is tossed AND getting a 5 when rolling a fair die
Getting rain in the month of August AND snow in December
For independent events, P(A ∩ B) = P(A) * P(B) , that is, the probability of both events occurring
is just the product of the individual probabilities. To make this concrete, suppose we have a bag
containing 3 blue and 5 green balls and we draw two balls at random with replacement (meaning we put
the first ball back in the bag after the first trial). We define the two events as:
Event A: Getting a blue ball
Event B: Getting a green ball
We are interested in the probability of getting a blue ball in the first trial AND a green ball in
the second. Writing the probabilities for both events: P(A) = 3/8 ; P(B) = 5/8 . Since the first ball
is put back, getting a green ball in the second trial does not depend on what was drawn in the first,
so these are independent events. So we can write:
P(C) = P(A ∩ B) = P(A) P(B) = P(B) P(A) = 15 / 64
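To double-check the 15/64, we can list every possible ordered pair of draws with replacement and count the favourable ones. This is a small sketch of my own (the names `bag` and `draws` are illustrative), assuming each ball is equally likely to be picked.

```python
from fractions import Fraction
from itertools import product

# 3 blue and 5 green balls
bag = ["blue"] * 3 + ["green"] * 5

# With replacement: the second draw sees the full bag again,
# so there are 8 * 8 = 64 equally likely ordered pairs
draws = list(product(bag, repeat=2))

# Blue in the first trial AND green in the second
favourable = [d for d in draws if d == ("blue", "green")]

p_blue_then_green = Fraction(len(favourable), len(draws))
print(len(draws), p_blue_then_green)  # 64 15/64

# Same value as the product of the individual probabilities
print(Fraction(3, 8) * Fraction(5, 8))  # 15/64
```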
Now for the case of dependent events, we can use the same example with one big difference: we do
not put the ball drawn in the first trial back in the bag! In this case, P(A) = 3/8 , and once a blue
ball has been removed, P(B) becomes 5/7 . Notice the denominator of P(B) : since the drawn ball is not
replaced, the second draw is made from a smaller sample space of 7 balls. This also gives a greater
chance of getting a green ball in the second draw.
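The without-replacement case can be enumerated too. In this sketch (names are my own, and each remaining ball is assumed equally likely), `itertools.permutations` picks two different balls by position, which is exactly what drawing without replacement means.

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue"] * 3 + ["green"] * 5

# Without replacement: 8 * 7 = 56 equally likely ordered pairs of distinct balls
draws = list(permutations(bag, 2))

blue_first = [d for d in draws if d[0] == "blue"]
blue_then_green = [d for d in blue_first if d[1] == "green"]

# Chance of green on the second draw once a blue ball has been removed
p_green_after_blue = Fraction(len(blue_then_green), len(blue_first))
print(p_green_after_blue)  # 5/7

# Chance of blue first AND green second
p_blue_then_green = Fraction(len(blue_then_green), len(draws))
print(p_blue_then_green)   # 15/56
```

Compared with the 15/64 from the with-replacement case, 15/56 is slightly larger, which matches the intuition that removing a blue ball makes green more likely on the second draw.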
This discussion of dependent events naturally extends to the idea of conditional probability: we
calculate the probability of an event A given that another event B has already happened. It is
denoted by P(A|B) . To get a feel for it, let's see some examples:
Probability of drawing a diamond from a well-shuffled deck of cards given the drawn card is red
Probability of rain on a given day given that the month is July
In both examples, the two events are dependent on each other.
Conditional probability is defined as follows (for P(B) > 0):

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

where P(A|B) is the probability of event A given that event B has already occurred. For independent
events, substituting P(A ∩ B) = P(A)*P(B) reduces this to P(A) .
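As a quick check of the definition, here is the card example worked out by counting. It is only a sketch under the usual assumption of a standard 52-card deck; the `deck` construction and the helper `prob` are my own illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = list("A23456789") + ["10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

red = {card for card in deck if card[1] in ("hearts", "diamonds")}
diamond = {card for card in deck if card[1] == "diamonds"}


def prob(event):
    """Probability of an event for one card drawn uniformly at random."""
    return Fraction(len(event), len(deck))


# P(diamond | red) = P(diamond AND red) / P(red)
p_diamond_given_red = prob(diamond & red) / prob(red)

print(prob(diamond))        # 1/4
print(p_diamond_given_red)  # 1/2
```

Knowing the card is red doubles the chance of a diamond (1/2 instead of 1/4), which is what it means for the two events to be dependent.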
Now, in an attempt to make this post a little less boring, let's get our hands dirty and use Python
to understand the concept of conditional probability.
In [1]:
# Import libraries
import numpy as np
import pandas as pd
About the Dataset
The dataset contains monthly rainfall data from 1901 to 2018 for the Indian state of Kerala.
Kerala is one of the few states that are badly hit by the monsoon almost every year. You can read more
about it in this excellent kernel (https://www.kaggle.com/biphili/india-rainfall-kerala-flood).
In [2]:
# Read the data
df = pd.read_csv("/kaggle/input/kerela-flood/kerala.csv")
df.head()
Out[2]:
SUBDIVISION YEAR JAN FEB MAR APR MAY JUN JUL AUG SE
0 KERALA 1901 28.7 44.7 51.6 160.0 174.7 824.6 743.0 357.5 19
1 KERALA 1902 6.7 2.6 57.3 83.9 134.5 390.9 1205.0 315.8 49
2 KERALA 1903 3.2 18.6 3.1 83.6 249.7 558.6 1022.5 420.2 34
3 KERALA 1904 23.7 3.0 32.2 71.5 235.7 1098.2 725.5 351.8 22
4 KERALA 1905 1.2 22.3 9.4 105.9 263.3 850.2 520.5 293.6 21
In [3]:
# Changing the target column to numeric values
df["FLOODS"] = df["FLOODS"].map({"YES": 1, "NO": 0})
We only need the columns JUN , JUL , YEAR and FLOODS , since we are interested in the probability of
flooding in a year given that it rained more than a certain threshold (500 mm) in these months. We
will create a couple more columns based on these.
In [4]:
# Creating binary data for the months of June and July using the rainfall threshold
df["JUN_GT_500"] = (df["JUN"] > 500).astype("int")
df["JUL_GT_500"] = (df["JUL"] > 500).astype("int")
df_small = df.loc[:, ["YEAR", "JUN_GT_500", "JUL_GT_500", "FLOODS"]]
df_small["COUNT"] = 1
df_small.head()
Out[4]:
YEAR JUN_GT_500 JUL_GT_500 FLOODS COUNT
0 1901 1 1 1 1
1 1902 0 1 1 1
2 1903 1 1 1 1
3 1904 1 1 1 1
4 1905 1 1 0 1
In [5]:
df_small.shape
Out[5]:
(118, 5)
In [6]:
# Creating the tabular data based on the counts
pd.crosstab(df_small["FLOODS"], df_small["JUN_GT_500"])
Out[6]:
JUN_GT_500 0 1
FLOODS
0 19 39
1 6 54
Defining some variables:
P(F) : Probability of flooding

P(J) : Probability of having more than 500 mm rain in June
P(F ∩ J) : Probability of flooding and having more than 500 mm rain in June
P(F|J) : Probability of flooding given it rained more than 500 mm in June
Based on the above table we can easily find these probabilities.
In [7]:
P_F = (6 + 54) / (6 + 54 + 19 + 39)
P_J = (39 + 54) / (6 + 54 + 19 + 39)
P_F_intersect_J = 54 / (6 + 54 + 19 + 39)
print(f"P(F): {P_F}")
print(f"P(J): {P_J}")
print(f"P(F AND J): {P_F_intersect_J}")
P(F): 0.5084745762711864
P(J): 0.788135593220339
P(F AND J): 0.4576271186440678
Using the formula - P(A|B) = P(A ∩ B) / P(B) we can easily calculate the conditional probability:
In [8]:
# Now calculate probability of flood given it rained more than 500 mm in June (P(A|B))
P_F_J = P_F_intersect_J / P_J
print(f"P(F|J): {P_F_J}")
P(F|J): 0.5806451612903226
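Instead of typing the counts by hand, the conditional probability can also be read straight off a normalized crosstab. The sketch below does not load the Kaggle CSV; it rebuilds a small stand-in frame (`demo`, a hypothetical name) from the four counts in the table above, so it runs on its own. With the real data, `df_small` would take the place of `demo`.

```python
import pandas as pd

# (FLOODS, JUN_GT_500) -> number of years, copied from the crosstab above
counts = {(0, 0): 19, (0, 1): 39, (1, 0): 6, (1, 1): 54}

# Stand-in for df_small: one row per year
rows = [
    {"FLOODS": flood, "JUN_GT_500": june}
    for (flood, june), n in counts.items()
    for _ in range(n)
]
demo = pd.DataFrame(rows)

# normalize="columns" divides each column by its total, so the column
# JUN_GT_500 == 1 holds P(FLOODS | more than 500 mm of rain in June)
cond = pd.crosstab(demo["FLOODS"], demo["JUN_GT_500"], normalize="columns")

p_f_given_j = cond.loc[1, 1]
print(round(p_f_given_j, 4))  # 0.5806
```

The value matches the P(F|J) computed from the formula, since 54 / (39 + 54) is the same ratio.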
Now, we can also ask ourselves this: *given that it flooded in Kerala in a given year, what is the
probability that it rained more than 500 mm in June (or, separately, in July)?* This is where Bayes
Theorem comes into action. Some other examples where Bayes Theorem applies:
The probability of a woman having breast cancer given that she tested positive
The probability that a given email is actually spam given that it contains certain flagged words
Bayes Theorem can be easily derived using the relationship between conditional probability and
intersection of events. Given two events, we already know:
$$P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$$

so,

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
In Bayesian inference, `P(B)` is called the **Prior Probability**. In our case, `P(J)` is the prior
probability: the probability of more than 500 mm of rain in June (or July) without knowing whether it
flooded that year. In other words, the prior is the probability of the event we are interested in
before any new information arrives.
Okay, enough chatter, let's try to code this in Python. Actually, we have already done most of the
work; it's just a matter of plugging the numbers into the above equation.
In [9]:
# Probability of rain more than 500 mm in June given it flooded that year (P(B|A))
P_J_F = (P_F_J * P_J) / P_F
print(f"P(J|F): {P_J_F}")
P(J|F): 0.9000000000000001
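The trailing `...0001` is just floating-point rounding. A small sketch with exact fractions shows that the result is exactly 9/10; the function name `bayes` is my own, and the counts are the June ones from the table above.

```python
from fractions import Fraction


def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a


total = 6 + 54 + 19 + 39            # 118 years

p_f = Fraction(6 + 54, total)        # P(F): flooded
p_j = Fraction(39 + 54, total)       # P(J): more than 500 mm in June
p_f_given_j = Fraction(54, 39 + 54)  # P(F|J)

p_j_given_f = bayes(p_f_given_j, p_j, p_f)
print(p_j_given_f, float(p_j_given_f))  # 9/10 0.9
```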
In [10]:
# We can similarly do it for July
pd.crosstab(df_small["FLOODS"], df_small["JUL_GT_500"])
Out[10]:
JUL_GT_500 0 1
FLOODS
0 19 39
1 3 57
Defining similar quantities for July:
P(F) : Probability of flooding

P(J) : Probability of having more than 500 mm rain in July
P(F ∩ J) : Probability of flooding and having more than 500 mm rain in July
P(F|J) : Probability of flooding given it rained more than 500 mm in July
In [11]:
P_F = (3 + 57) / (3 + 57 + 19 + 39)
P_J = (39 + 57) / (3 + 57 + 19 + 39)
P_F_intersect_J = 57 / (3 + 57 + 19 + 39)
print(f"P(F): {P_F}")
print(f"P(J): {P_J}")
print(f"P(F AND J): {P_F_intersect_J}")
P(F): 0.5084745762711864
P(J): 0.8135593220338984
P(F AND J): 0.4830508474576271
In [12]:
# Now calculate probability of flood given it rained more than 500 mm in July
P_F_J = P_F_intersect_J / P_J
print(f"P(F|J): {P_F_J}")
P(F|J): 0.59375
In [13]:
# Probability of rain more than 500 mm in July given it flooded that year (P(B|A))
P_J_F = (P_F_J * P_J) / P_F
print(f"P(J|F): {P_J_F}")
P(J|F): 0.9500000000000002
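As a sanity check, P(J|F) can also be counted directly: of the 60 flood years in the July table, 57 had more than 500 mm of rain. The sketch below (variable names are mine) confirms that the Bayes route and the direct count agree up to floating-point rounding.

```python
import math

total = 3 + 57 + 19 + 39   # 118 years
flood_years = 3 + 57       # years with FLOODS == 1
wet_july_years = 39 + 57   # years with JUL_GT_500 == 1
flood_and_wet = 57         # both at once

# Bayes route: P(J|F) = P(F|J) * P(J) / P(F)
p_f = flood_years / total
p_j = wet_july_years / total
p_f_given_j = flood_and_wet / wet_july_years
via_bayes = p_f_given_j * p_j / p_f

# Direct route: share of flood years with a wet July
direct = flood_and_wet / flood_years

print(direct)                           # 0.95
print(math.isclose(via_bayes, direct))  # True
```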
Important Takeaways
1. Based on the probability outputs above, Kerala flooded in about 59% of the years with more than
500 mm of rain in July, and in about 58% of the years with more than 500 mm of rain in June. This
means rainfall in June and July alone is not completely responsible for the flooding in Kerala. This
actually makes sense, since in both 2018 and 2020 the flooding happened in August. Perhaps
including August in the analysis would provide more insight.
2. Using Bayes Theorem we found that whenever it flooded in Kerala, both June and July had a very
high probability (90% and 95% respectively) of more than 500 mm of rain. This also makes sense,
since June and July are the peak months of rainfall because of the monsoon.
Thanks for reading my kernel! I hope it helped you to understand this concept as much as it helped me.
References and Resources:
https://www.statisticshowto.com/bayes-theorem-problems/
https://www.analyticsvidhya.com/blog/2017/03/conditional-probability-bayes-theorem/
https://towardsdatascience.com/bayes-theorem-the-holy-grail-of-data-science-55d93315defb