
Statistics for Data Science II

Professor Andrew Thangaraj


Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.1
Experiment, Outcome, Sample Space, Event, Probability

(Refer Slide Time: 00:12)

Hello, everyone. Welcome to Week 1 of the second course in statistics in this program. We will begin by looking at the basic concepts of probability and how to model random phenomena using probability. Everything starts with a few very basic and important ideas.

And these are the ideas: there is something called an experiment, there is an outcome of the experiment, there is a sample space, there is an event, and finally we get to probability. It is important to go in that sequence, and we will do so slowly in this lecture.
(Refer Slide Time: 00:51)

So, before I start, let me talk briefly about textbooks for the course. There is one textbook that we will follow reasonably closely, which is the first one I have put down here: Probability and Statistics by Siva Athreya, Deepayan Sarkar and Steve Tanner. It is available from Professor Athreya’s website; you can click on the link there and go to the website. As of now it is available, but it is a work in progress: the authors have not completed the book, and we are using it even before it is complete. We will follow this book fairly closely in terms of what topics we cover, what examples we use, the sequence in which we do them, and so on.

A couple of other books are also useful. One is the second one I have put down here, Think Stats 2. It is a very applied book: it talks about doing statistics with Python and shows lots of ways to do computer simulation. In this course we will also do quite a bit of computer simulation, though mostly it will be me showing you how the simulation is done; you do not have to do any programming on your own.

What computer simulations allow you to do is to see in action what we are writing down as equations and theory. Quite often it is useful to imagine what is going to happen when you do a computer simulation, and we will do that quite often in this class. So, a book like Think Stats 2 is quite useful. It is also available for free download, so you can use it without any cost.
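To give a flavour of the kind of simulation the lecture refers to, here is a minimal sketch in Python; it is my own illustration, not code from the course or from Think Stats 2, and it uses only the standard random module.

```python
import random

# Toss a fair coin 10000 times and look at the fraction of heads.
random.seed(0)  # fix the seed so the run is reproducible
tosses = [random.choice(["H", "T"]) for _ in range(10000)]
fraction_heads = tosses.count("H") / len(tosses)
print(fraction_heads)  # close to 0.5 for a fair coin
```

Runs like this let you see the statistical regularity, the fraction of heads settling near one half, that the theory developed in these lectures will explain.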

The third book is a little bit more theoretical: A Modern Introduction to Probability and Statistics. It is also a little more expensive. If you want more ideas in this area, you can use it as a reference; when you want to look at related ideas or more details, you can look at that book. In fact, there are so many books out there; there are even websites which list good books on probability and statistics.

There are so many books and so many different options. At some level, almost any book is good enough. You may like some books and not others; the approach and the way they handle things can be very different. But you are free to browse through any book that you like; I am not against it.

So, none of these books is strictly necessary, in the sense that I will cover most of the material myself. But having a book is very useful: you can go back, sit down and browse through the pages, and you will get new ideas when you read again. So I strongly advise that you have access to some of these books, and keep reading them while you attend these lectures. That is the first slide, on the textbooks.

(Refer Slide Time: 03:44)

So, we are ready to jump into the very first concept in this course. There are two concepts, or rather definitions, that we are going to start with: the experiment and the outcome of the experiment. When I use the word experiment, you may be thinking of a chemistry lab, a scientist in a lab coat and all that. It need not be exactly that.

In statistics, experiment is generally used to denote any process or phenomenon that we wish to study statistically. The experiment can have various different outcomes, and we would like to study it in a statistical way. I described in the first lecture what it means to study something statistically: it is basically to observe patterns and to have some theory about what those patterns are.

So, I have put three familiar examples here; we discussed them even in the Week 0 lectures. Tossing a coin is a good example of an experiment that one can study statistically. Throwing a die is another: the result varies, there are different possible results, and you cannot really say precisely what the result will be. So that is a good experiment. And we also saw a very complicated experiment, so to speak, which is the Indian Premier League tournament. We looked at that as an experiment as well; we wanted to study it statistically.

So, all these are experiments. In fact, it looks like anything can be an experiment. The point is, when you want to study something statistically, to see if there are statistical patterns in the data or in the phenomenon, you use the word experiment to denote it. You will see quickly enough, about three or four slides down from here, that I will hardly use the word experiment again. But technically, that is the word that is used; the experiment is what you model.

Now, the outcome is very important. What is the outcome? It is the result of the experiment: what happened, what you observed in the experiment. You can even call it data in some sense. The outcome is what matters most, and you will see that the theory focuses much more on the outcomes. As for the experiment itself, sometimes you can only imagine what the experiment must be. I have listed below the outcomes corresponding to each of these experiments.

For the coin toss experiment there are two outcomes: assuming it is a standard coin, it is either heads or tails. For throwing one die, it is 1, 2, 3, 4, 5 or 6; those are the possible outcomes. And if you go to a complex experiment like the Indian Premier League, you are not going to be able to write down the outcome so easily. It could, for instance, be available as a YAML-format file; we discussed this format in Week 0. To capture the result, the outcome, of an experiment as complex as an IPL match, you need a lot of text and typing; you need to capture what happens on every ball.

Even then you may not capture everything. You might think the only things you want to study statistically are the number of runs scored, the wickets and so on for each ball. But people who do research in cricket want to know where the ball pitched, what the pace of the ball was, whether it turned, whether it swung, and so on; all those are also patterns, and you may learn something from them. So, technically, I would say an experiment like an Indian Premier League match is very, very rich in the kind of outcomes it can generate. This probably gives you a glimpse of what it is that you can study.

So, for instance, suppose you are studying rainfall. You want to see statistically what the rainfall pattern is. What is the experiment, and who is doing it? It is as if nature is doing the experiment. There is this big system of clouds, wind, sunlight and oceans, and all these things together create the rainfall that happens for you in India. Then you study it, and the outcomes are whatever you measure: how much it rained, what the pressure was here, what the pressure was there; people even send balloons up in the air.

So, you can measure whatever outcomes you want and collect as much detail as you want, whenever you want to study something statistically, and the experiment is the phenomenon that you want to observe. Hopefully you get the sense of it. I also want to tell you that we will look at a lot of toy experiments. Simple toy experiments to study are tossing a coin and throwing a die.

Now, why are these simple experiments important? If you cannot do something in the simple experimental world, you really cannot do it in the complicated real-life world. A lot of people look down upon theory and say it deals with small toy examples, but if you cannot do the toy examples properly, it is very likely that you will make mistakes in the bigger examples as well.

So, if you have any ideas you want to try, try them on the small toy examples first, make sure they work, and then adapt them to the complicated scenario. Do not look down upon the simple experiments. They are nice: they are good to illustrate with, good to teach out of, and good to ask exam questions about. All these small experiments are very nice that way.

Of course, in real life you want to study complicated experiments with a very rich set of outcomes. And you will see that, quite often, we will use ideas from the simple experiments in the complicated studies as well. So, hopefully, this gives you a feel for what experiments are. There are so many of them; anything you study statistically becomes an experiment in some sense. That is something nice to know.

So much for experiments. Like I said, the outcomes are very, very important. The outcomes of the experiment are really what you study statistically, really where you look for patterns; that is why they are so important. But the experiment is equally important: its physical aspects, what you know about what happens in it. When you know cricket really, really well, maybe you know what patterns to look for. So your understanding of the experiment is also very important. Nevertheless, for the theory, the definitions and the equations, the outcomes are central.

(Refer Slide Time: 10:19)

The next important definition is that of a sample space. I have put the definition down here in this little box: the sample space is a set that contains all outcomes of an experiment. The first thing to see in a definition is what kind of thing is being defined. When you define something mathematically in a precise way, you have to say it is a certain mathematical object; only then does the definition become precise and substantive. You cannot be vague about a definition.

So, in this case, you saw that experiment and outcome were not very precisely defined in some sense, but the sample space is a precisely defined mathematical object: it is a set. What does it contain? What are its elements? The elements of the set are all the outcomes that can possibly happen in the experiment. Like I said, the sample space is a set, and we will typically denote it in this class by S; this capital S will denote the sample space for us. Let us look at examples. Any time you have a definition, the first thing you should do is write down a few examples.

So, look at the examples we saw before: the three experiments I listed out. I will show you a few more experiments in this lecture itself; some may be more complicated, some may be very simple. Let us look at tossing a coin. What is the sample space of this experiment? It is a set with just two elements: heads and tails. Easy enough. Throw a die, and again the sample space is very easy: there are six outcomes, you collect them all together into a set, and that becomes your sample space.
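These two sample spaces are small enough to write out directly. As a sketch of my own (with H and T as shorthand for heads and tails), in Python they are just sets:

```python
# Sample spaces of the two toy experiments, written as Python sets.
coin_sample_space = {"H", "T"}           # tossing a coin
die_sample_space = {1, 2, 3, 4, 5, 6}    # throwing a die

print(len(coin_sample_space))  # 2 outcomes
print(len(die_sample_space))   # 6 outcomes
```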

Now, if you jump from the small toy experiments to the big experiment of the IPL, you see that the sample space is really difficult to write down. It is basically the set of all possible things that can happen in an IPL match. Depending on what level of detail you want to look at, the sample space can be very, very complex. For instance, if you were only concerned with the runs scored in one delivery, you can have a very simple sample space: 0, 1, 2, 3, 4, 5, 6, 7. Can 7 runs be scored? Yes, it can happen occasionally. So all of this you may want to write down, and that is easy to do.

Or maybe you are interested only in the winner of IPL 2021. In that case, if that is the only detail you care about, your outcome is very simple. I read recently that there are nine teams that are going to play in IPL 2021, and you might want to list out all the teams in your sample space.

So, depending on the level of detail you want, even in a big experiment the sample space can end up being small. Nevertheless, an experiment like the IPL is so rich, its set of outcomes is so rich, that you can do a lot more. But maybe this gives you a hint of what to look for when you think of sample spaces in real life.

So, having said that, having defined the sample space, I also have to tell you that quite often in practice, once you get some comfort with these notions of probability, you will hardly ever write down the sample space for a problem. You will just imagine that there is a sample space; there is always one, let us say. And once you understand that there is a sample space, you will be able to work with probabilities directly. You do not have to explicitly write it down.

But quite often, what I have seen in practice is that when you get really confused about what some definition has to be, when you are not clear about it and do not know how to think of something, writing down the sample space can give you a lot of clarity. In that way the sample space is quite useful. It is a very important theoretical construct; the entire probability theory builds on it.

But nevertheless, in practice, once you see a problem you usually just write down the probabilities and start working with them, as opposed to thinking of sample spaces. Still, it is important: when somebody describes an experiment and talks about its outcomes, you should be able to identify the sample space of interest to you. That is very, very important.
(Refer Slide Time: 15:05)

So, here are more examples. I gave you two or three simple examples; we are now going to start talking about more and more complicated ones. These examples are nice because they will show up again and again: in your activity questions, graded assignment questions, practice questions, quiz questions, exam questions, and even in real life.

These kinds of experiments, outcomes, events and sample spaces will come up again and again, so it is good to get used to what we mean by them. Statisticians love what is called the urn. They would not say pot or jar or box; they will say urn. For some reason they like the urn. An urn is just a pot.

The first experiment you see there is drawing a marble from an urn. Marbles, balls: people put different things inside the urn and start drawing them one after the other in different ways. It is a very nice illustrative experiment, in which you can write things down and do calculations. We will keep revisiting it in various different contexts. So, you have an urn, this pot. Start opening up your imagination: think of a pot that has marbles of many colors.

So, for example, it could have three marbles of red color, three marbles of white color and three marbles of blue color, all inside the pot. You are going to draw a marble from it. What do I mean by that? You do not see what is inside; you just close your eyes, put your hand in and pick up a marble.

Now, what is the outcome of this little experiment of drawing a marble from an urn which contains marbles of many colors? It is the color of the marble; that is what we are interested in. So, if you want to write down a sample space for this little experiment, red, white, blue would be a very good answer. That is a good sample space.

So, now, what happens when you draw two marbles? You see immediately that when you complicate the experiment a little, the sample space starts to get bigger, more interesting, and richer in structure, so that you can do more calculations. You immediately have to ask: when you draw two marbles, how do you draw them? Do you draw them simultaneously, or one after the other? If one after the other, do you put the first marble back before picking the second, or do you keep the first marble out?

It turns out these give you different experiments. Even though the outcomes and the sample space might be the same, the experiment changes depending on whether you draw with replacement or without replacement. What do we mean by that, once again? You have an urn with a lot of marbles, and you are going to pick two marbles, one after the other.

The nature of the experiment changes depending on whether you put the first marble back before picking the second, or kept the first marble out and then picked the second. Those are two different things. So people talk about drawing two marbles from an urn with replacement, and drawing two marbles from an urn without replacement.

We will see people distinguishing these two experiments, and the calculations do change accordingly. But I hope to convince you that the sample space starts to get more and more complicated. Suppose you want to keep writing down the sample space: what are the possible outcomes? Let us say you start with an urn containing three marbles each of red, white and blue color, and you draw five marbles.

As we keep increasing the numbers, you have to think about what to put in the outcome, what is possible, and so on. Without replacement you cannot have red, red, red, red, red, because there are only three red marbles. You have to think of all that, and it starts to get confusing in your head. That is one of the reasons people do not tend to write down the sample space in all its glory.

They just think, and they understand that there is some set of outcomes. I understand the outcomes: I draw the marbles one after the other, I note down the color of each marble, and I know whether there is replacement or not, so I know what could have happened. Let me illustrate replacement and no replacement once again.

If you have three marbles of each of these colors and you draw with replacement, one of your outcomes is red, red, red, red, red: you can have five red marbles. But if you are drawing without replacement, you cannot have red, red, red, red, red, can you, because there are only three red marbles. If you are not replacing, you can have at most three reds in any outcome. Anyway, I want to drive home the point that these nice simple experiments can quickly give you more complicated sample spaces to think about.
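The contrast between with and without replacement can be checked by brute-force enumeration. This is a sketch of my own (colors abbreviated R, W, B), not something from the lecture slides:

```python
from itertools import product, permutations

# An urn with three marbles each of red (R), white (W) and blue (B).
urn = ["R"] * 3 + ["W"] * 3 + ["B"] * 3

# Drawing 5 marbles WITH replacement: every length-5 color sequence is possible.
with_repl = set(product("RWB", repeat=5))

# Drawing 5 marbles WITHOUT replacement: ordered draws of 5 distinct marbles,
# of which we record only the colors.
without_repl = set(permutations(urn, 5))

print(("R",) * 5 in with_repl)     # True: five reds in a row can happen
print(("R",) * 5 in without_repl)  # False: there are only three red marbles
```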

Another very common example is a pack of cards. Now, many of you are maybe very sincere students and have never played with or even seen a pack of cards, but I am assuming you are at least familiar with what a pack of cards is, if only because you have to solve questions about it in the problems here. A pack of cards has 52 cards. There are four suits, called in English spades, hearts, diamonds and clubs.

In common parlance people use different names for these four suits, but we will use spades, hearts, diamonds and clubs. In every suit you have 13 cards of different values. The values start with 2, 3, 4, all the way up to 10, and then you have the jack, queen, king and ace. I will put the ace at the end; some people tend to put the ace at the beginning, but we will have it at the end.

So, one can conveniently write the sample space as the Cartesian product of these two sets: the set of suits Cartesian product the set of values. Recall the Cartesian product from Math 1: you have two sets, and the Cartesian product consists of pairs, one element from the first set and one from the second.

So, among the 52 cards there is the 2 of spades, 3 of spades, 4 of spades, and so on up to the ace of spades; the 2 of hearts, 3 of hearts, 4 of hearts all the way to the ace of hearts; and so on. You have these 52 cards. When you draw a card from this pack, let us say a shuffled pack, the sample space is this set of 52 cards.
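The Cartesian-product description of the pack translates directly into code. A small sketch of my own, with the suit and value names as in the lecture:

```python
from itertools import product

suits = ["spades", "hearts", "diamonds", "clubs"]
values = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
          "jack", "queen", "king", "ace"]

# The sample space for drawing one card: set of suits x set of values.
deck = list(product(suits, values))

print(len(deck))  # 52
print(deck[0])    # ('spades', '2')
```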

The outcome is which suit and which value you get. Now, what happens when you draw multiple cards one after the other, or, as in a usual card game, when you deal or distribute the cards to the players? What could happen if you start distributing cards from a shuffled pack? Say four players are playing and you distribute the cards; each player would get 13 cards.

Now, look at the possibilities. The sample space of what one player would get is the set of all subsets of 13 cards from the pack. And how many such subsets are there? A huge number: 52 choose 13. There are a lot of possibilities. Anyway, I will come back to this experiment. It is a slightly more complicated experiment, but it brings out a lot of the statistical patterns that people have observed.
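The count 52 choose 13 is easy to compute, for instance with Python's standard math.comb:

```python
from math import comb

# Number of distinct 13-card hands one player can receive from a 52-card pack.
hands = comb(52, 13)
print(hands)  # 635013559600
```

A sample space with over 635 billion elements, which is exactly why nobody writes it out explicitly.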

For instance, you might see that some collections of cards are likely and some card selections from a random deal are unlikely, so you may develop a sense of what can and cannot happen when you distribute a pack of cards. It is a very useful experiment to think of when you want to solve problems. And hopefully you agree with me that the sample space is slowly becoming more and more complicated when you want to model, say, distributing cards around a table statistically.

So, the last point I want to mention: we have discussed quite a few experiments now. We saw tossing a coin, throwing a die, the complicated IPL experiment, drawing a marble or a ball from an urn, and drawing a card from a shuffled pack, and we have talked about their sample spaces. But I have to warn you: as you learn more probability, once a problem is given to you and you have to compute the chance of something, the likelihood of something, the probability of something, you will usually not write down the sample space.

It will be clear in your mind how to do it. So I want to emphasize once again: the sample space is still very useful. When some question confuses you about what can possibly happen, you can always go back to writing down the sample space, writing down the outcomes in very clear notation, and then, invariably, at least in my experience, that helps get rid of the confusions and paradoxes that show up in probability calculations quite often. So, hopefully, you have gotten your feet wet with sample spaces. Let us dig deeper into the theory.

Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.2
Events and Probabilities: Events Part I

(Refer Slide Time: 00:13)

We are now going to move on to one of the very central objects in probability theory. We discussed experiment and outcome, which are sort of the physical things that happen, and then we talked about the sample space, which is the first mathematical object in the theory: the set of outcomes of the experiment.

We also saw that, even though the sample space is an important theoretical notion, we are not going to use it that much in problem solving. The event, though, is a very, very central notion. You have to be very comfortable with events: manipulating them, using them, thinking of them, understanding them, and so on, if you are going to be good at working in probability and statistics. So, the event is a very, very important notion.

To give a precise definition: it is very surprising that, as you build up the theory, already at the second or third definition you make I have to attach a technical caveat. You see in the definition I have said that an event is a subset of the sample space. That seems easy enough; we know subsets, or at least we think we know subsets.

You have a set, the sample space, which is the collection of all outcomes, and then you think of a subset of that sample space. It turns out this is not entirely a correct definition, because there is a technical restriction on which subsets can be events. In probability theory you do not always want to count every subset as an event.

This will not impact most of our calculations; it shows up only in some special cases. So for now I am going to ignore this restriction, and your book does the same thing. Ignore the restriction for now and continue to do the calculations; nothing much will affect you, and later on we will point out when it becomes necessary.

If you want an analogy from physics, I will give you one. You learn about Newton's laws and Newton's theory of gravity, and then you see that in some special situations it does not really work. It needs a correction: for instance, for very heavy objects like black holes, or at very high speeds, close to the speed of light, Newton's theory does not quite work.

What do you do then? You correct it with the general theory of relativity, something that Einstein did; you bring that in and correct it. This is sort of like that. Saying an event is a subset of the sample space is going to be okay for most practical purposes, but there is one case, still within practical purposes, that will show up where you cannot really do this so easily.

We will point it out later on, so do not bother about it too much. For now, we will simply take the definition to be: an event is a subset of the sample space. Very good. So now we have seen the definition; let us start looking at examples.

The first example you saw was from the coin-tossing experiment; we know the sample space here: heads, tails. There are only four events here, and all four are fine; you can take all four to be possible events. You have the empty set. Notice that the empty set is an event, right? Why is that? Because the empty set is a subset of any set.

So, the empty set is an event. The empty set is something like "nothing happened". You toss a coin; what happened? Nothing. It is maybe not very interesting, but you will see it is good to have it, just for the completeness of our theory. It is good to have the empty set as a possible event. You can also have the event "heads"; now that is an interesting event, isn't it?

You toss a coin, the sample space contains all outcomes, and one of the events is just heads. I am interested in that event: whether it is a head. Say you are out at a cricket match, somebody is tossing and you call heads; you are really interested in the event of whether it is heads or not.

Likewise you have "tails", and then you have the final event, which is the sample space itself: heads, tails. Once again, it is maybe a dull sort of event, not so interesting, but those are the four events. Now let us move on to a slightly more complicated example, which is throwing a die. Now the sample space has six different outcomes, 1, 2, 3, 4, 5, 6, and you can start writing out events.

Now, you may or may not know this result, but if you have a set with six elements, how many subsets does it have? It turns out the answer is 2 to the power 6. This is true in general for any finite set.

The number of subsets of a finite set is 2 to the power of the number of elements. A very quick way to see this is to imagine you are constructing a subset: you go one element at a time, and you have two choices for every element, either you put it into your subset or you do not. So 2 into 2 into 2, and so on: 2 to the power of the size of the set gives you the number of subsets you can possibly pick from the set.
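You can verify the 2 to the power 6 count by enumerating all subsets. Here is one way, a sketch of my own using itertools:

```python
from itertools import chain, combinations

sample_space = [1, 2, 3, 4, 5, 6]

# All subsets (events): take combinations of every size from 0 to 6.
events = list(chain.from_iterable(
    combinations(sample_space, k) for k in range(len(sample_space) + 1)))

print(len(events))  # 64, which is 2 ** 6
```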

You can write them all down. The empty set appears again: you did not pick anything as you went through. But notice the first six events, the single-outcome ones: 1 or 2 or 3 or 4 or 5 or 6. Those are very interesting; those are the individual things that could happen. So now you are already picking up the notion of what an event is, right?

An event is a collection of outcomes of the experiment, isn't it? It is a subset of the sample space. The sample space contains outcomes, so any event is a collection of outcomes. You can always ask, when the experiment is done, whether an event occurred or not. That is a question you will see people asking again and again; in the next slide you will see more details.

So, think of that when you write down an event. An event is not just a dull set or subset; it is something of interest. It contains outcomes of the experiment, outcomes that are of interest to you, so that you want to think about them: you want to see if there are patterns, what the chance of those outcomes appearing is, and so on.

When you throw a die, maybe you are playing, as my kids do, a lot of Snakes and Ladders. Say you get up to square 90 or so, and if you throw a 2 or a 6 you are dead: a big snake is going to bite you and you come all the way down, but any other number is okay.

At that point, when you throw the die, you are interested in that event, right? Not because you like it, but because you do not want it. So 2, 6 is an event that is of interest to you. Quite often you are genuinely interested in subsets of outcomes, not just one particular outcome, though the single outcomes are also very important.

Single outcomes give you a simple way to understand what happened as a result of the experiment, but subsets of outcomes are interesting in different ways. So how many possible subsets can you have? 64 possible subsets; you can just write them down and you get all these events.

So, notice the third point I put down there. Quite often, you can describe an event in words, in English; you do not have to mathematically write down the subset. You describe it in English, and it implies a certain precise subset. For instance, when you throw a die I might say I am interested in the event that I get an even number. That is an event: I have said something in English. How does it become an event?

Because getting an even number means that I am interested in the set of outcomes 2, 4, 6. That is
clearly a subset of the sample space, and that becomes an event. Anything like that is also
important. So, here is another event I put down: getting a multiple of three could be an event; in
that case it is just 3, 6 when you throw a die. When you throw two dice and look at the total you
get from the two dice, then these things become more interesting, but with one die it is a little
simpler.
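These English descriptions can be turned into sets mechanically; here is a small Python sketch of that translation (the helper name `event` is my own, not course notation):

```python
# Translate English descriptions of events into subsets of the sample space.
sample_space = {1, 2, 3, 4, 5, 6}

def event(condition):
    """Collect the outcomes of the sample space satisfying a condition."""
    return {x for x in sample_space if condition(x)}

even = event(lambda x: x % 2 == 0)   # "I get an even number"
mult3 = event(lambda x: x % 3 == 0)  # "I get a multiple of three"

print(even)   # {2, 4, 6}
print(mult3)  # {3, 6}
```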

Maybe you are interested in the event of getting an even number or getting a prime number so, all
these can be interesting events that you can describe in words but it actually corresponds to an
actual subset of the sample space. So, look at the third experiment here.
It is an experiment that happens every day in Chennai, for instance. There is the sea nearby, and
fishermen go out into the sea to fish. So, that could be an experiment. A fisherman goes out to
fish; maybe you are imagining a small boat, these days people have big powerful boats, but say a
small boat: throw the net out, wait, and then pull in some catch, and a bunch of fish come in. So,
what is the sample space? Depending on your culinary interest you may or may not care about it,
but writing down the sample space, I am sure you will agree, is a bit complicated.

There is quite a lot of fish in the sea and you would have to describe everything, and not just that.
What is the fisherman interested in? He wants to know how many kilos of, say, Pomfret he has
got. That is very important to him, because he may be getting a lot of money for Pomfret. So, the
catch of the fisherman when he goes out to sea has a complicated sample space.

It is something that is statistically very interesting to study. It makes a difference to the
fishermen's economics and to the taste buds of the people, depending once again on whether you
eat or like fish. But while the sample space itself is a bit complicated to write down, quite often
you can define events without worrying so much about the sample space, right?

So, for instance, an event which could be of great interest to the fisherman is: is the catch more
than 100 kilos? If he catches more than 100 kilos of fish, maybe he makes a lot of money. And not
only that, he may be interested in a specific type of fish. In Chennai, for instance, there is
something called the Seer fish; it is very popular. So, did I catch Seer fish, and how much of it did
I catch?

So, it can make a difference to the fisherman. While the sample space has not even been written
out, and writing it out might be a complicated exercise, I can still define events, and events of
interest. This notion will come back to us again and again: quite often somebody will describe an
experiment, describe an event, and ask you to estimate or compute or precisely write down the
chance of that event.

So, this is a very typical statistical study and in those cases you do not typically always have to
write down the entire sample space, particularly if it is very complicated and you may be interested
only in some particular aspect of it and that you can get away with directly without writing the
sample space.
So, hopefully this gave you a very basic idea of what an event is: a subset of the sample space.
Let us see a few more examples. In the next few slides I am going to show you more examples of
events and how to think of them, and I want to emphasize once again that events are really very
important in probability theory. You have to be completely comfortable dealing with events:
working with them, doing things with them, manipulating them, understanding them, etc. So, let
us go ahead and see examples in the next few slides.

(Refer Slide Time: 12:29)

So, like I mentioned a little while ago, events are central objects in probability theory; they are
very, very important. I have already mentioned this: an event is a subset of the set of outcomes, a
subset of the sample space. You did the experiment and it produced the actual outcome; I can ask
whether that outcome belongs to the event, right? When you have a set, you can say whether or
not an element belongs to that set, right?

So, my event is a set, a set of outcomes. The experiment produces an outcome, and I can
legitimately ask whether or not this outcome belongs to the event, the subset. If indeed it belongs,
then that event is said to have occurred. I think I hinted at this even while I defined the event, but
it is a very important notion in probability theory: we say an event occurs when the outcome
belongs to the event, the subset which is the event.

So, once again I want to remind you that events are just sets; whatever you can do with sets you
can do with events also. And interestingly, because we are talking about an experiment that we
are studying statistically, many set-theoretic operations, what you do in set theory with events as
sets, have a corresponding physical meaning: what actually happens in the experiment, and how
the event changes based on the operations you do to the sets.

So, I am going to start looking at examples like that, illustrating and talking about it in more
detail. The first notion when you think of sets is containment. You can have one set being a
subset of another set: 𝐴 ⊆ 𝐵 means every element of A is also in B. In that case you say A is
contained in B.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.3
Events and Probabilities Events Part II

(Refer Slide Time: 00:13)

Hello. We have been looking at the basic concepts in probability theory. We looked at the
definitions of experiment and outcome, and then we looked at the first precise definition, the
sample space, which is basically a set of outcomes. Then we saw a definition for event, and I was
mentioning that events are very, very important: everything in probability theory, all the
calculations, revolves around events, and events are the things that are of interest to us.

So, let us continue looking at how to think of events, how to work with them, how to manipulate
them, etc. You must all have studied some mathematics before, and you have also studied some
basic mathematics in this course. Whenever you study a mathematical object, you want to
manipulate it, not just study it and keep it aside. You want to take it and play with it. When you
learn about numbers, you want to add, subtract, divide; you want to do various things. Only when
you do some operations on it does it become interesting, and it is the same way with events.
So, when you have events as subsets of the sample space it all seems fine, but then what is
interesting about them is you can work with them, you can do operations on them. They are just
sets, they are subsets. So, whatever operation you can do on a subset you can do on events and that
tends to have a natural meaning, a physical meaning so, if you have an event and you do an
operation on it, it will have a physical meaning. So, let us just look at some examples. You will
see how it works.

So, for instance, the first thing that can happen is one event can be contained in another: a
smaller event inside a bigger event. Remember, always remind yourself, events are subsets of the
sample space, so you can have a subset of a subset. Here I am showing you an example
immediately.

When you throw a die the sample space is 1, 2, 3, 4, 5, 6, and you can have two events. One could
be that the number seen is an even number, which is the event 2, 4, 6, the three-element subset.
Now here is another event A, which is just 2, 6. It is definitely another valid event, another subset
of the entire sample space. Maybe it does not have a nice English description like "an even
number"; it is just 2, 6. But clearly it is a subset of the bigger event, the even number event.

So, notice what happens physically: when one event is contained in another, if the smaller event
occurs, the bigger event has also occurred. If I say A occurred, A happened, as in the experiment
was performed, the outcome showed up and the outcome is in A. And if the outcome is in A, then
clearly, since A is a subset of B, the outcome is also in B, so B also occurred.

So, these are the kind of physical implications to the mathematical operations that you do with
events as subsets. So, one subset is contained in another and when one, that smaller event happens
the bigger event also happens. So, this is a nice interesting thing to know, but notice if B occurred,
if what you saw was an even number you cannot be entirely sure whether A occurred or not, A
could have occurred, A may not have occurred. Both are possible, because A is a smaller set which
is than B.
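This containment in the die example can be checked directly with Python's `issubset`; a small illustration of mine:

```python
# Containment of events: A = {2, 6} sits inside B = {2, 4, 6} (the "even
# number" event), so whenever A occurs, B has occurred too.
B = {2, 4, 6}   # even number
A = {2, 6}      # the smaller event

print(A.issubset(B))  # True: A is contained in B

# Every outcome that makes A occur also makes B occur.
for outcome in A:
    assert outcome in B
```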

So, you immediately see that the mathematical operations that you do with sets end up being
related to the physical reality so to speak of the experiment and that is always good and your theory
should have such things. So, such things should happen, only then it is interesting. The next
interesting operation that a lot of people do with sets is the complement.

So, what is a complement? I have defined it there for you. If you have an event A, the
complement of A is all the outcomes that are not in A. You have a sample space, and you have a
subset A which contains the collection of outcomes you are interested in; that is your event. What
is the complement of this event? All the outcomes outside of this subset. That is the notation
there for you: S followed by a backslash, and that slash denotes set minus or set difference, so in
this case the complement becomes 𝑆 − 𝐴.

All the elements of the sample space which are not in A. This is called set minus or set difference.
So, here is an example for you; once again throwing a die is the simple example I have used. If
you are interested in the event A which is "even number", which is 2, 4, 6, then 𝐴′ clearly is 1, 3,
5, which is the odd number event.
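Python's set difference plays the role of 𝑆 − 𝐴 here; a one-line check of the example (my code, not from the lecture):

```python
# The complement as a set difference, S \ A, for the die example.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # even number

A_complement = S - A   # Python's "-" on sets is set difference
print(A_complement)    # {1, 3, 5}, the "odd number" event
```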

So, you see again this is of interest to us, and notice the connection between this mathematical
operation of complement and what happens physically in the experiment and outcome. If you
have an event A and A occurs, definitely A complement did not occur, because the outcomes in A
and the outcomes in 𝐴′ are disjoint; they do not have anything in common.

So, if A occurred, 𝐴′ could not have occurred. Notice the connection here. All these things will
end up playing an important role as we do computations with events and probabilities, but it is
nice to know now. The same thing is true for A complement, because (𝐴′ )′ is actually A itself: if
𝐴′ occurred, then A did not occur.

So, you see, we are starting to think of events not as dry things, some subset of a sample space,
but as interesting things with respect to the experiment. When you do a mathematical operation
on your event, it may reflect something interesting in practice. So, that is the first thing to
understand.

Slowly we will complicate the kind of operation we are going to do, we are going to do slightly
more involved thinking about events, combining them in more interesting way, so, let us look at
that next.
(Refer Slide Time: 06:42)

So, you can combine events to create new events. The two standard operations which are central
in set theory are union and intersection, and both of these can be done with events as well. Events
are subsets, so one can do these two operations on events, but let us see what each means in the
experiment with respect to the outcomes. Union is like an OR operation.

It is denoted with that U shape. For instance, let us start with an example and then look at what it
means. Supposing you throw a die and let us say there are two events. One event is even number,
so, let us say event A is even number, event B is a multiple of 3. So, I have said event A is even
number shows up which could be 2, 4, 6, event B is that the number is a multiple of 3. If you look
at multiple of 3 there are only two numbers there 3, 6.

So, event A is 2, 4, 6 and event B is 3, 6. What is the meaning of the union of these two events? If
I take (𝐴 ∪ 𝐵), I am doing OR: the number that shows up is either an even number or a multiple
of 3. If you actually take the union of 2, 4, 6 and 3, 6, which I have asked you to work out and
should be easy enough to do, you will see 2, 3, 4, 6. That is the union event: either an even
number or a multiple of 3, and both are good for you.

So, union sort of represents the OR operation: what you think of as OR in English ends up being
union when you deal with subsets and events. I have given another example here: an even
number or a prime number. The even number event is just 2, 4, 6, and the prime number event is
2, 3, 5; those are the primes in 1, 2, 3, 4, 5, 6. So, take the union of 2, 3, 5 and 2, 4, 6 and see
what numbers you get: 2, 3, 4, 5, 6, quite a few numbers.
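Both unions can be verified in Python with the `|` operator; a quick check of these numbers (my addition):

```python
# Union as the OR of two events, for the die examples above.
A = {2, 4, 6}   # even number
B = {3, 6}      # multiple of 3
P = {2, 3, 5}   # prime number (within 1..6)

print(A | B)    # {2, 3, 4, 6}: even OR a multiple of 3
print(A | P)    # {2, 3, 4, 5, 6}: even OR prime
```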

So, either an even number or a prime number. Now let us look at the other example. In the last
lecture I gave you the example of the fisherman's catch and how you can model what the
fisherman catches in a probabilistic sense. You may want to ask: was the fisherman's day a bit
unusual today? What is unusual? Maybe he catches between 50 and 200 kg most days; an
unusual, rare event is that he either caught more than 200 kg or less than 50 kg of fish.

So, that is unusual, and you see how the OR naturally works to give you something meaningful
with respect to the experiment. You want an event which measures, models or represents the
fisherman having had an unusual catch, and you see how OR, or union, naturally comes into the
picture.

You can have a smaller event which is a catch was more than 200, another smaller event which is
catch was less than 50, and now you can take union of these two and then you have this new event
which is either more than 200 or less than 50 which sort of maybe represents an unusual day for
the fisherman.

So, these are the kinds of things you can do. When you want more meaningful events, ones you
can relate to physically or are genuinely interested in, these union and complement operations
start entering the picture very naturally.

The next operation is intersection. Intersection is sort of like an AND operation: something AND
something else. It is denoted by the upside-down version of the U; you take U and flip it upside
down, so you get something like a cap, and that symbol, ∩, is the intersection operation.

So, I have given you an example with the same two sets as before. When you throw a die, one
event is "even number" and another is "multiple of 3". If you do intersection, what will happen?
You want an even number and a multiple of 3, and that is just 6; the event is represented by the
set containing 6 alone. There is just one element here, so it is easy to work out.
Same thing with an even number and a prime number: this will end up being just the one element
2. So, you see, intersection sort of shrinks the event, makes it smaller in some sense. You put
more constraints; you make it more strict. Union, on the other hand, relaxes constraints, making
the event bigger, bringing more into the picture.
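The two intersections can likewise be checked with Python's `&` operator (my addition, same sets as above):

```python
# Intersection as the AND of two events, for the die examples above.
A = {2, 4, 6}   # even number
B = {3, 6}      # multiple of 3
P = {2, 3, 5}   # prime number (within 1..6)

print(A & B)    # {6}: even AND a multiple of 3
print(A & P)    # {2}: even AND prime
```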

Now, I gave you the practical example of the fisherman's catch being unusual. The next event you
may naturally be interested in is something more specific: what is the chance that the fisherman
had a very average day, say 100 to 150 kg, exactly very close to average? There you see the AND
operation naturally comes in. Why?

You can have an event which says he caught more than 100 kg, and another event which says he
caught less than 150 kg. With numbers like these, in my mind at least, I visualize a number line;
we will do this later in the course as well. "More than 100 kg" is from 100 to the right on that
number line.

And what is less than 150 kg? From 150 to the left, isn't it? So, now when you say AND, what
happens? You go from 100 to 150; that is it. That is the fisherman's catch, and that is intersection
for you. So, it is useful to have these kinds of operations; they give you very meaningful events
which you can relate to in practice.
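The number-line picture can be mimicked with sets; treating catches as whole kilograms from 0 to 300 is my simplification, not from the lecture:

```python
# The fisherman's "very average day" as an intersection of two interval
# events over possible catch weights (modelled here as integers 0..300 kg).
catches = range(0, 301)

more_than_100 = {w for w in catches if w > 100}   # from 100 to the right
less_than_150 = {w for w in catches if w < 150}   # from 150 to the left

average_day = more_than_100 & less_than_150       # the overlap: 101..149
print(min(average_day), max(average_day))         # 101 149
```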

I have made a third point here: many interesting events can be expressed as intersections and
unions. You already saw that with the fisherman's catch, and here is another example. As you go
through the course, this will become second nature to you; you will not even think too much
about it. Somebody gives you an event, and you will naturally break it down into smaller events
and start thinking of unions, intersections, AND, OR, how to divide it up, etc.

So, that will hopefully come naturally to you, but here is one of the first examples we are seeing
in this class. Let us say you have a pack of cards and you are drawing five cards without
replacement, one after the other. The event I am interested in is that I did not draw a single ace:
none of the five cards are aces. So, "no aces" is an interesting event to think about. Maybe you are
playing a game where having an ace makes you a winner, and you want to find out the chance of
not getting an ace.

You can see this is easily written as the intersection of many smaller events: the first card is not
an ace, the second is not an ace, the third is not an ace, the fourth is not an ace, the fifth is not an
ace. Quite easy, but these kinds of things are very interesting: most events can be written as
intersections and unions of smaller events, and this divide-and-conquer type of technique is very
central in math. It comes up again and again.
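The chance of this intersection can be computed as a product of conditional chances; the lecture has not introduced probability calculations yet, so treat this Python sketch as a preview rather than course material:

```python
# "No aces in five draws without replacement": the intersection of five
# smaller events, whose chance is a product of conditional chances.
from fractions import Fraction

p_no_ace = Fraction(1)
non_aces, cards = 48, 52
for i in range(5):
    # chance that draw i+1 avoids the aces, given the first i draws did
    p_no_ace *= Fraction(non_aces - i, cards - i)

print(p_no_ace)         # 35673/54145
print(float(p_no_ace))  # roughly 0.659
```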

Interestingly, I have given you an IPL example. Let us say you want 5 runs off the bat in 2 legal
deliveries, as in there are no extras or anything like that, just 5 runs to score off the bat in 2 legal
deliveries. You can split this into multiple cases. The two deliveries could have been (1 + 4), or
(2 + 3); things like overthrows I am not looking at, so do not take it too seriously. So, (2 + 3), and
then (3 + 2), or it could be (4 + 1). I am just saying runs off the bat, no extras, no overthrows,
nothing like that.

In those kinds of scenarios these are the only things that can happen; of course, if you add
overthrows it could also be (5 + 0) or (0 + 5), because you can get 5 runs in one delivery with
overthrows. It just becomes more interesting. So, that is another way of looking at how 5 runs
could be got off the bat in two legal deliveries, and you see how union shows up here as opposed
to intersection in the previous example of aces.

No aces was an intersection; here, 5 runs off the bat could be (1 + 4) or (2 + 3) or (3 + 2) or
(4 + 1), and you have the union of these four events. So, what am I trying to say? Hopefully you
are convinced now that events are everything in probability. You want to work with events,
calculate things with events, combine events, do things with events, and decompose events,
maybe into smaller events. Having this comfort is very important, and you will see many more
examples in your practice assignments, graded assignments and other assignments.
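The union in the runs example can be enumerated; the set of possible per-delivery scores off the bat is my assumption, not stated in the lecture:

```python
# "5 runs off the bat in 2 legal deliveries" as a union of smaller events,
# assuming a single delivery can score 0, 1, 2, 3, 4 or 6 off the bat
# (no overthrows, so 5 in one delivery is impossible).
legal_scores = {0, 1, 2, 3, 4, 6}

splits = {(a, b) for a in legal_scores for b in legal_scores if a + b == 5}
print(sorted(splits))  # [(1, 4), (2, 3), (3, 2), (4, 1)]
```

Adding overthrows would put 5 into `legal_scores` and the pairs (5, 0) and (0, 5) would appear, exactly as the lecture says.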
(Refer Slide Time: 16:24)

The next important thing to consider when you look at events is what happens when two events
are disjoint: what becomes interesting in that case. Quite often you can find two subsets of the
sample space which have no intersection; those kinds of events are called disjoint events.

So, here is an example: when you throw a die you can have two different sets, even number and
odd number, 2, 4, 6 and 1, 3, 5. These are disjoint events. Same with the fisherman's catch: more
than 200 kg and less than 50 kg are disjoint events. So, quite often disjoint events show up
naturally. And what is the big deal when A and B are disjoint?

Again, it is like A and 𝐴′. If A and B are disjoint, B should be inside 𝐴′, isn't it? So, if A occurred,
B did not occur; that is for sure. And if B occurred, A did not occur. That is the nice thing about
disjoint events, and I have given you an event and its complement as an example.

If A and 𝐴′ are there, they are disjoint: (𝐴 ∩ 𝐴′ ) is the empty set. Everything that is not in A is in
𝐴′. And there is one more interesting thing about A and 𝐴′: not only are they an example of
disjoint events, they are actually an example of what is called a partition. Why is it a partition?

They are disjoint; A and 𝐴′ do not have any intersection, and if you take the union of A and 𝐴′
you get the entire sample space. So, two or more events like this, which do not have any
intersections between them, are all disjoint, but if together they make up the whole sample space,
then you actually have a partition.

Partitions are quite interesting and sometimes we will use them when we do computations. But
you see why it is important to understand how to deal with disjoint events. Not just two events:
you can have multiple events which are disjoint. What do multiple disjoint events mean? You
take any two subsets in the group. Supposing you have 𝐸1 , 𝐸2 , 𝐸3 , three different subsets, when
do I say the three are disjoint? There should be no intersection between any pair of them.

𝐸1 , 𝐸2 should not have an intersection; 𝐸2 , 𝐸3 should not have an intersection; 𝐸1 , 𝐸3 should not
have an intersection. If no pair has an intersection, then that collection of sets is called disjoint. So,
here is an example. I gave you the example of A and 𝐴′; here is a similar example. If you consider
the experiment of picking a card from a pack, I can have these four events. What are the four
events?

Spades, hearts, diamonds, clubs: the card is a spade, the card is a heart, the card is a diamond, the
card is a club. Those are the four events. Clearly these four are disjoint, and interestingly, together
they make up the whole sample space. So, in fact they are a partition of the sample space. In this
manner you can take a large sample space, partition it into disjoint events, and study it further.
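The suit partition can be verified in Python; representing cards as (rank, suit) pairs is my choice of encoding:

```python
# Check that the four suit events partition the deck: pairwise disjoint,
# and their union is the whole sample space.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'hearts', 'diamonds', 'clubs']

deck = {(rank, suit) for rank in ranks for suit in suits}
events = {s: {card for card in deck if card[1] == s} for s in suits}

# Pairwise disjoint: no two suit events share a card.
for s1 in suits:
    for s2 in suits:
        if s1 != s2:
            assert events[s1] & events[s2] == set()

# Together they cover the whole deck.
union = set().union(*events.values())
print(len(deck), union == deck)  # 52 True
```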

This will help you do things very efficiently. For instance, a big country like India is first divided
into multiple states, every state is divided into multiple districts, every district is divided into
multiple taluks, etc., for easy governance. Anything you want to administer or compute, you start
with some plan, and then suddenly everyone in every district is activated; they do something
locally there and it goes further.

So, the dividing up into non-overlapping pieces is something very important to do and even in
probability theory that will play a critical role as well. All right, so I have spoken about quite a few
things with events. Hopefully these are making sense in your mind. We are going to see a couple
of examples to give you some practice with using these ideas, but let me just quickly recap.
We looked at complements, we looked at how one event can be a subset of another event, we
looked at union, intersection, then we looked at disjoint events. So, all these interesting things can
happen with events and we have to have our theory deal with all of this in a very nice way.

(Refer Slide Time: 20:44)

So, quite often people like to visualize sets in what are called Venn diagrams. These diagrams are
very useful. Even though we keep talking about mathematics as a dry, somewhat abstract subject,
visualizations are very important as well. If you have a good visualization in your mind when you
are thinking of a mathematical object, it usually clears up a lot of things in your thinking.
A Venn diagram is one such very useful visualization for sets. What do we do? We take the
sample space to be one big egg-like shape, an ellipse or circle or whatever you want. That is the
universe, the sample space, and every subset I denote as a smaller circle inside the bigger one, or
an ellipse or any other shape that you like.

So, I have shown you here three events A, B, C and you can start thinking of various things here,
like for instance you can start thinking of union of A and B. If you want to think of union of A and
B you are going to get that. So, let me go back, so, there was A and B. What is union of A and B?

The entire region that is covered by A and B together. In some sense the Venn diagram is
basically the set S. Imagine it has a lot of small dots, each dot representing an outcome, and when
I put a circle like this around some dots, that subset is collected into the event A. Likewise, if I
put a circle like this, this red circle, to denote an event B, whatever is inside the circle, those
outcomes belong to B.

So, visualize it like that; C is also like that. Now, naturally, what happens when you want a
union? You get a shape like this: whatever was in A plus whatever was in B, jointly together,
becomes the union. So, this is a nice way to visualize union. How do you visualize intersection?

(Refer Slide Time: 22:42)


That is intersection, is it not? What was common to both A and B is the intersection. So, this is a
graphical visualization of sets, and it is useful to have this picture in your head a little bit. You can
see disjoint events: A and C have no overlap, and B and C are also disjoint, with no overlap. So, if
we had taken A union C, what would happen? It would be everything inside C and everything
inside A; both of them would be there.

So, graphically, using the Venn diagram, you can do interesting things like A intersect B
complement. What is (𝐴 ∩ 𝐵′)? What is a complement in the Venn diagram? If I do B
complement, I get everything outside of B. Let me just take one of these and show you: you have
this set B; what would be 𝐵′? Everything outside of that set B would be 𝐵′, isn't it? So, that is 𝐵′.
(Refer Slide Time: 23:55)

And now I want to do A intersected with 𝐵′: whatever is in A and outside of B. You can quickly
visualize this; it is not difficult. You will see you get (𝐴 ∩ 𝐵′) as that part alone, isn't it? So,
hopefully you see why these kinds of operations are nice, and quite often, even for events in
probability theory, this kind of picture will help you understand some of the results. This is
useful.

So, we saw A intersect B complement; you can do more complicated operations of that nature
and you will get different-looking results. For instance, here is an example: if you throw a die, I
could say it has to be even but not a multiple of 3. That is an intersection with a complement.
Notice how English and math are getting converted into one another; we will come back and
revisit that soon. So, I want the number that shows up when I throw a die to be even but not a
multiple of 3.

So, one event is "even", 2, 4, 6, and the other event is "multiple of 3", which is 3, 6. I want "not a
multiple of 3", so I take the complement of 3, 6 and intersect it with 2, 4, 6. What will you get?
You will get 2, 4. That will be the answer.
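The same computation in Python, using set difference to take the complement inside the sample space (my addition):

```python
# "Even but not a multiple of 3": A intersect B-complement, where the
# complement is taken inside the same sample space S.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {3, 6}      # multiple of 3

result = A & (S - B)    # A intersect B-complement
print(result)           # {2, 4}
assert result == A - B  # set difference A \ B gives the same event
```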

So, these are nice things to visualize and think about when we do events, and you see that events
you may be interested in can naturally be decomposed into unions, intersections and
complements of smaller events which you can deal with. Anyway, that is the Venn diagram. We
will not talk too much about it in this class; it is just a visualization tool for you to think of sets
and events.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.4
Events and Probabilities Events Part III

(Refer Slide Time: 00:13)

All right. So, one of the useful skills you will need in probability theory is translating English
into the language of events. We just saw how "even but not a multiple of 3" got translated into
events as (𝐴 ∩ 𝐵′). Here is another example of how an experiment and an event are described in
English, and I am going to get you to think about how to write complements, how things can be
written in terms of complements, and what the operations are. It is interesting.

So, here is the experiment. There are 5 persons who have identical hats, and their hats get mixed
up. This could be people in a cricket team or something like that: they go somewhere by bus, they
put their caps into some box, and the caps get mixed up. This can happen.

And then each person picks a hat at random. That is the experiment. Now we are interested in
events. What is the outcome? Who got what hat; that is the outcome. Initially people come in with
their own hats, they put them into a box, and then each picks up a hat at random; which hat each
person got is the outcome. Those are all the outcomes, and we will have interesting events
working with these outcomes.

So, notice we do not try to write the sample space here. We do not try to define the outcome very
precisely, come up with a representation for it, and write the sample space out in full glory. You
could do that: you could say ℎ1 , ℎ2 , ℎ3 , ℎ4 , ℎ5 are the 5 hats and 𝑝1 , 𝑝2 , 𝑝3 , 𝑝4 , 𝑝5 are the 5
persons. Initially they come with 𝑝1 wearing ℎ1 , 𝑝2 wearing ℎ2 , etc. And then what can happen?
𝑝1 can get any of the hats, and every possible rearrangement of 1, 2, 3, 4, 5 is what you are
looking at as an outcome.

So, you would have to represent all that and think it all through; you could do it. But maybe we
should just try to do the problem without going through all that trouble. That is also something
interesting. Here are the events. The first event A is that no person gets their own hat. This event
has a name in mathematics: it is called a derangement. The arrangement got perfectly deranged in
some sense; no one got their own hat, no one was lucky. That is event A. Now look at event B;
event B is a very specific, simple event: every person gets their own hat. All of them got their own
hats; that is B.

Event C says at least one person does not get their own hat; so, hopefully that is clear to you: at least one person, not everybody, got their own hat. And then D is what? At least one person gets their own hat. So, look at C and D; C is at least one person does not get their own hat, and D is at least one person gets their own hat.

So, I want to ask first about complements. What is A complement, B complement? Now in English
when you think of complement maybe you have a different way of thinking of complement, but
this is more precise set theoretic sort of notion. So, you have to be very careful when you think of
complements.

So, what is A complement? A is: no person gets their own hat. If you think of a complement, it is sometimes tempting to say that every person getting their own hat is the complement of this. But remember, this is not an English-language complement; it is a very precise set-theoretic definition. The complement of "no person gets their own hat", if you think about it, will end up being "at least one person gets their own hat".
So, no person gets their own hat: when will you say A did not occur? A occurring means no person gets their own hat; person 1 is not wearing ℎ1, he has got something else, and so on for everyone. When will you say event A did not occur? At least one person must have got their own hat. So, that is D. So, 𝐴′ ends up being D, and similarly 𝐵′ will end up being C.

So, think about it. Every person gets their own hat is B, and C is at least one person does not get their own hat. C will end up occurring exactly when B does not occur: when B occurs, every person gets their own hat; when B does not occur, there must be at least one person who did not get their own hat.

So, you see, 𝐵′ is equal to C and 𝐴′ is equal to D. If you actually take the trouble of writing down the sample space for this, then you will understand a bit more clearly how this works out, but you can also just think about it and argue it out.

The next question is: are A and B disjoint? Think about this question a little bit. Look at the two events: no person gets their own hat, and every person gets their own hat; are A and B disjoint? Yes, isn't it? So, A and B are disjoint. If A occurred, no person got their own hat, so definitely B did not occur. So, B should be disjoint from A. There is no outcome for which both A and B occur at the same time, so (𝐴 ∩ 𝐵) is the null set. There is nothing in the intersection of A and B.

So, if A occurred, B did not occur; if B occurred, A did not occur. We know that for sure. So, A and B are indeed disjoint, yes. What is A intersect B complement? When I ask what it is, I mean what is it in English. You can of course say A intersect B complement is (𝐴 ∩ 𝐵′), and there is nothing more to say. But we are asking what it means. So, 𝐵′ is what? It is going to be C: at least one person does not get their own hat. And A is what? No person gets their own hat.

Now B is disjoint from A. So, if you take B complement and intersect it with A, think of the Venn diagram: A and B are disjoint, and you intersect B complement with A. What will you get? You will get A itself, isn't it? So, A itself is the answer. So, (𝐴 ∩ 𝐵′) is nothing but no person gets their own hat.
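If you do write the sample space out on a computer, all of these identities can be checked directly. Here is a minimal sketch (my own illustration, not from the lecture) that enumerates all 5! = 120 outcomes as permutations and verifies 𝐴′ = D, 𝐵′ = C, that A and B are disjoint, and that 𝐴 ∩ 𝐵′ = A.

```python
from itertools import permutations

# Outcome: a permutation p of (0, 1, 2, 3, 4), where p[i] is the
# hat that person i ends up with (person i came in wearing hat i).
S = set(permutations(range(5)))  # sample space, 5! = 120 outcomes

A = {p for p in S if all(p[i] != i for i in range(5))}  # no one gets their own hat
B = {p for p in S if all(p[i] == i for i in range(5))}  # everyone gets their own hat
C = {p for p in S if any(p[i] != i for i in range(5))}  # at least one person misses
D = {p for p in S if any(p[i] == i for i in range(5))}  # at least one gets their own

assert S - A == D        # A' = D
assert S - B == C        # B' = C
assert A & B == set()    # A and B are disjoint
assert A & (S - B) == A  # A intersect B' is A itself
```

As a bonus, `len(A)` counts the derangements of 5 objects, which turns out to be 44 out of 120 outcomes.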

So, think about these kinds of things a little bit and you will get a better sense of how this notion of events and the English interpretation, the physical interpretation, of what happens in the experiment are tied together: how, when you complement, something happens, and when you take an intersection, something else happens. All these things are very important. This kind of skill matters: every problem you look at in probability theory, at least the toy problems, and even the real problems, will have something like this.

(Refer Slide Time: 07:57)

Take a look at this, for instance; here is an IPL example. Just for fun I came up with one simple IPL example. Here again you will see how all these unions and intersections start to sound like something, and how you translate that into what is actually happening in the experiment.

So, here is the example, here is the experiment. There is one over, maybe in an IPL match; it has 6 deliveries, and in each delivery, let us say 0, 1, 2, 3, 4 or 6 runs may be scored. We do not allow other possibilities. Of course, that is not the only thing that can happen; you can get 5 runs, or even 7 runs if you allow for extras, overthrows and things like that. We do not allow for all that; we just say 0, 1, 2, 3, 4 or 6 runs may be scored. Here are the events I am interested in: event A is no fours, event B is no sixes, and event C is exactly 20 runs scored in this over.

So, those are events A, B and C, and one may ask: what is A union B? (𝐴 ∪ 𝐵) is "the over had no fours or the over had no sixes". That is something interesting; think about what that means.
So, supposing in an over a four was scored but no sixes were scored; A union B would have still occurred. If a four is scored, A did not occur, but no sixes were there, only fours; say all six deliveries were hit for four. In that case A did not occur but B occurred, and because B occurred, and A union B is A or B, we can say A union B still occurred.

What about the intersection? No fours and no sixes: the over should not have any hit that went across the rope, that is one thing. Can you have (𝐴 ∩ 𝐵 ∩ 𝐶)? If you are allowed to score only 0, 1, 2, 3, 4 or 6 runs, what is A intersect B intersect C? There should be no fours, there should be no sixes, but you should score 20 runs from 6 deliveries.

And you know that in this situation it is not possible; you can have a maximum of 18 runs. So, 20 runs is not possible, and A intersect B intersect C is in fact the null set. There is no outcome for which we can say A intersect B intersect C has occurred.
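This impossibility is easy to confirm by brute force. The sketch below (my own, not from the lecture) enumerates all 6^6 = 46656 possible overs and checks that no outcome lies in all three events at once.

```python
from itertools import product

RUNS = (0, 1, 2, 3, 4, 6)           # allowed runs per delivery
S = list(product(RUNS, repeat=6))   # all 6^6 = 46656 possible overs

A = {o for o in S if 4 not in o}    # no fours
B = {o for o in S if 6 not in o}    # no sixes
C = {o for o in S if sum(o) == 20}  # exactly 20 runs in the over

assert A & B & C == set()  # no fours, no sixes, yet 20 runs: impossible
# Without fours and sixes, the best you can do is 3+3+3+3+3+3 = 18 runs:
assert max(sum(o) for o in A & B) == 18
```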

So, hopefully this gives you a feel for how the interpretation of an event or an occurrence in the experiment relates to the way in which we think of them as subsets, unions and intersections. I have put down here one more result which might be useful for you; it is called De Morgan's Law.

We saw (𝐴 ∪ 𝐵); what about (𝐴 ∪ 𝐵)′? If you were to take (𝐴 ∪ 𝐵)′, or the complement of A intersect B, (𝐴 ∩ 𝐵)′, in set theory there is De Morgan's law. These laws are very, very useful; quite often they will come to your rescue when you want to interpret an event given in English. (𝐴 ∪ 𝐵)′ is (𝐴′ ∩ 𝐵′) and (𝐴 ∩ 𝐵)′ is (𝐴′ ∪ 𝐵′). There are proofs for these results; I am not going to go into details here, but they are quite useful.

So, for instance, "no fours and no sixes in the over" is A intersect B. If you want to take the complement of that, (𝐴 ∩ 𝐵)′, what should have happened? At least one four or at least one six must have been scored; at least one hit across the boundary ropes must be there for (𝐴 ∩ 𝐵)′.

Look at (𝐴′ ∪ 𝐵′). 𝐴′ is what? There was at least one four. 𝐵′ is what? There was at least one six. So, 𝐴′ ∪ 𝐵′ is the same as (𝐴 ∩ 𝐵)′. So, I gave you a proof by illustration, just an example of how that works. It is possible to prove this properly if you are interested.
So, once again, let us quickly summarize what we did in the last two slides. We took an experiment with a very realistic physical meaning, where the outcomes are something physical you can relate to. Then we started defining events as sets, started combining these events in interesting ways, and asked what these combinations mean in the physical reality of the experiment.

And we got this translation between English and events, and for you to be able to solve problems successfully in this class you have to be comfortable doing this translation between events and English. In fact, one can say a lot of statistics is done that way. So, that sort of concludes what we looked at as far as events are concerned; that is the end of this part of the course. In the next part, which is very important, we are going to move towards probability.

So far, we have not really used that word probability. We are talking about probability theory but
we have not really used the word probability yet. We have been only talking about experiment,
outcome, sample space and events. The next lecture will be about probability. So, let us see you
there.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.5
Events and Probabilities: Introduction to Probability
Hello and welcome. In this series of lectures in week 1 we have been looking at the basic concepts
of probability theory and so far you have seen sample space, outcomes, events and now we are
going to see probability, what is called probability itself, that will enter the picture now as part of
this whole theory.
(Refer Slide Time: 00:38)

So, before that, let us just reflect a little bit on what all this is. We want to model something that happens, some phenomenon or event in reality, in a statistical way. We know that we cannot do deterministic models; we want a statistical approach, so we are working towards it, and we came across this notion of events, which seems interesting and very useful to understand.
Now, we want to associate a mathematical value of chance with that event occurring. Quite often, once I describe an experiment to you and define an event in it, as a subset of that experiment's outcomes, you have a sense in your mind of whether or not it will occur. What is the chance that it will occur?
You may get a sense that there is a very low chance of it occurring, or a very high chance; intuitively you may know this. So, here are a few examples to show you how this works. Suppose you toss a fair coin; over several tosses, heads appears, you would think, maybe half the time, so this is sort of intuitive. That is why people use a coin toss before every test match or cricket match to decide who is going to bat first; they expect it to be fair in some sense.
So, maybe you think every time there is a 50 % chance of heads and a 50 % chance of tails; that is what people associate with it. And you might have played cards; it is very popular to have four players playing cards, and you distribute the pack to everybody. I am sure, collectively among us, unless somebody was specifically cheating, we would have never seen a case where somebody got dealt all 13 cards from the same suit.

It would be a huge surprise if that happens. Of course, you should shuffle the cards properly and then deal; if you do not shuffle then anything can happen, but if you shuffle the cards properly and then deal them out, there is really a very, very low chance that one player gets all 13 cards of a suit, all 13 spades say. What is the chance of anybody getting that?
Oh! Here is another event, maybe a little bit different, not a toy experiment, something that happens in real life: the football world cup. What is the chance that India wins the football world cup? You have a sense of that; I think it is definitely not very high. This is just based on our collective experience. It could happen, but it would be a big surprise if it happens.
On the other hand, if you look at the cricket world cup, chance of India winning, I mean one can
never say it is like 90 % or anything, but it is still high. I mean there are several good teams but it
is relatively higher than the chance that India would win the football world cup like for instance,
just to give you a comparison. So, what am I trying to say here?
So, given an experiment and an event defined in it, we have a sense of the chance of that event occurring. The entire goal of probability theory is to give you a reasonable way to assign proper values of chance to these events and experiments. You have a sample space, you have events; how do you assign a probability, or chance, to the occurrence of an event? How do you give it a number, and how do you work with numbers which represent chance?
So, that is how we build up probability theory. We will see that it has a lot of interesting ingredients and it is central to the study of data science and statistics. But before that, I want to take a quick detour and point you to what a mathematical theory is. Probability theory is a mathematically developed theory, so what are the ingredients of a mathematical theory? I will go through this real quick; it is just a little bit of philosophy to get you used to how we will see things.
So, typically in a mathematical theory you want to define basic objects of interest and this will
come from other prior fields like for instance in probability it comes primarily from set theory, so
you have the set which is a sample space and you have subsets which are events, is it not? So, we
will soon define probability and that will also be an object which you are familiar with, and then
once you define the objects of interest precisely, you will assume a few things.
You will put down a few conditions. When I say assume a few things to hold, I am saying those conditions are valid. So, we will put down a few axioms, which you can think of either as conditions that have to always be satisfied or as assumptions that you make about the objects that we have.
And once you do that, that is it; you cannot do anything else purely intuitively. Of course your intuition plays a role, but you have to back up everything else with logical proof. That is how a mathematical theory works, and admittedly many people get turned off by the logical-proof part; it can be a bit scary to get into if you are not used to it. But mathematical theories have had tremendous impact in the world today.
Some of these theories have been very successful, particularly when the axioms are natural and respect your intuition of what you are trying to model, and they end up being very widely applied. For instance, if you look at analysis or algebra, there are applications across so many different areas. All the mathematical theories are built the same way: a few objects of interest, a few axioms or conditions that have to hold in the beginning, and after that everything else is logical proof.
So, in this class we will not do the theory in a fully mathematical way; we are going to do it in a very applied fashion. But we will not disrespect the mathematics; we will respect it, and when possible I will show you some proofs, or at least short arguments of one or two lines, just to show how these things work. We will not be that rigorous about it, but at least hand-waving arguments for why they work like this. It is sort of interesting to do this.
So, this is how any mathematical theory is. Probability theory is also like that. What are the objects
of interest in probability theory? I gave you two or three objects already, two objects, sample space
and events, these are very important objects. The third important object which makes up
probability theory is probability. So, it is called probability itself.
(Refer Slide Time: 07:34)

Probability is a function, usually denoted P. And what does it do? What is a function? A function takes an argument and assigns a value to it; you can think of it as 𝑓(𝑥) or something like that. So, 𝑥 is the variable that the function operates on and 𝑓(𝑥) is the function's value. That is a function: it takes you from one set to another set.
You can call it domain and range if you like. So, it goes from domain to range, so, you have two
sets, the function tells you how to go from, like how to map one point from this set to another point
in that set. Now what is, what kind of function is probability? The argument is an event, probability
takes event as input in some sense, so, that is the probability function. The input to this function is
an event, you give this probability function an event. What will it tell you, what is the output, what
is the value of the function?
It is some real number between 0 and 1. What kind of function is it? Once again, the function takes
as input subsets of the sample space which are events and its output is a number between 0 and 1.
So, these three things together, probability function, events, sample space, sample space, events,
probability function these three together make up something that people call generally as a
probability space. So, it is called probability space.
Of course, there is an experiment and an outcome, but those are not mathematically precise objects. The mathematically precise objects are these three: sample space, events, and probability function; these three make up the probability space. And these three together should satisfy two very important axioms. I will specify them in the next slide or two; you will quickly get it. Without satisfying those axioms we will not continue; at that point we do not have a valid probability space.
A probability space will be valid only when it satisfies two axioms that I am yet to describe; I will describe them soon enough. But before that, let us dwell on this probability a little bit. What is it? As a mathematical object, it is just a function from the events to the line segment between 0 and 1 on the real number line; some value between 0 and 1. That is the function definition, mathematically, but what do we want it to be? We want the value assigned by the probability function to be something like the chance of the event occurring; that is our intuition. If you have an event A and you give it to the probability function, the value it spits out is, you hope, the chance of that event occurring. So, that is how you want to build the theory; that is how you want to do it.
That is how you want to use the theory; let me be precise here, building the theory is different from using it, but that is how you want to use this theory. In general, the values are between 0 and 1, and you can express any such value as a percentage: if it is 0.9, you are going to say it is 90 %, like that.
Generally, what is the assumption here? Higher value means higher chance. So, if the probability
of an event is 0.9 then that has a higher chance of occurring than say a probability of an event
which is 0.1. So, that event may not occur that often.
The values 0 and 1 are special: if an event has probability of occurrence 0, then it will never occur, and if an event has probability of occurrence 1, then it always occurs. I have put a fourth point there, which is a warning: this kind of associated meaning comes in only when you want to apply the theory to some particular case.
The theory itself does not care. As far as the theory is concerned, you have a set which is a sample space, a bunch of subsets which are events, and a function from the set of events to the real line between 0 and 1; that is it. That is probability for me. I do not care whether it connects to an IPL game or the throwing of a die or whatever. The theory will give you enough tools to work with, and the person who comes in to apply it will attach the meaning to all these things.
So, keep this dissociation in mind; I will give you some examples to drive home this point later on. At this point we have identified all the ingredients of the theory. Everything has been defined: sample space, events, probability; together they are called the probability space. Once again, I want to point out that experiment and outcome are not precisely defined mathematical objects; they do not enter into the theory itself. We work with the sample space, and once you have the sample space, everything about the experiment and outcome is known. The experiment and outcome, the way you set them up, is where the physical part comes in: someone who comes in to apply the theory will start by setting up an experiment, thinking about outcomes, and then grouping them into a set which is the sample space, etc.
So, that is probability, but I have not told you the axioms yet, and those are the important things; the axioms make or break everything. If you do not have the axioms, you do not have a theory. So, I have to tell you what the axioms are, but before we jump into them, let us try to motivate them. Where will the axioms come from? Why do we need them; why do they make sense?
(Refer Slide Time: 13:14)

So, like I said before, the probability function simply assigns a value to an event, and you want that value to represent something like the chance that the event occurs. What these axioms will do is place conditions on the probability function. They will say: not every function is a valid probability function; for your function to be a valid probability function, here are the two conditions it needs to satisfy. That is what the axioms will say. Now, why do you need those conditions?
You need those conditions, and I have sort of put it out here in one way; there are various other ways in which people think of these things. Intuitively, the axioms should be natural in some sense. When there is a certain relationship between events, the way you assign probability to those events should respect that relationship. When you combine events, it should be easy to work with the probabilities that the probability function gives. All these things should be ingredients of the theory; otherwise, it makes no sense.
Here are a few basic scenarios. For instance, if 𝐴 ⊂ 𝐵, should 𝑃(𝐵), the value assigned by the probability function P to the event B, be higher than 𝑃(𝐴) or lower? What does your intuition tell you? If A is a subset of B, and P associates the value 𝑃(𝐴) with event A as the probability of A, and 𝑃(𝐵) with B, then intuitively you want 𝑃(𝐵) to be at least as high as 𝑃(𝐴). So, this might be a condition that you desire; the probability function should be like that. Likewise, there are more complicated things: if A and B are two events and 𝑃(𝐴) and 𝑃(𝐵) are known, that may place some restrictions on 𝑃(𝐴 ∪ 𝐵). So, you can imagine why these kinds of things have to be respected by the axioms.
That is, the axioms should enforce conditions on the probability function so that these things naturally apply to anything you come up with. And here are a few more basic questions. What is going to be the probability of the empty set? For any probability function, what should the probability of the empty set be?
We know empty set is an event. So, what probability should be assigned to the empty set?
Similarly, what probability should be assigned to the entire sample space? So, you know entire
sample space together is an event, it is a valid event. So, what probability should you assign to
that?
We also saw that A and B being disjoint is an important thing in probability; when the events you look at are disjoint, it gives you strong conditions. So, if they are disjoint, what can we say about 𝑃(𝐴), 𝑃(𝐵) and 𝑃(𝐴 ∪ 𝐵)? It is a special case. We know 𝑃(𝐴 ∩ 𝐵) has to be 0 if they are disjoint, since 𝐴 ∩ 𝐵 is the empty set and there is nothing in it.
So, what about the union? Are there other interesting conditions like this, and since there can be so many conditions, how do you put all that into two axioms? That is the power of the theory. People have worked over the years, refined them again and again, and got it down to just two axioms; there are just two conditions that you have to satisfy to be a valid probability function. And once you satisfy them, all the natural intuition about how events combine and the relationships between events will be respected. That is the power of the axioms.
I will state them, and they will seem very simple and intuitive to you, but their real power is that they enforce conditions which ensure that the theory supports the practice in a nice way, or reflects the practice in a very obvious, easy way. Let me put it this way, maybe this is a better wording: the axioms ensure that it is easy to make the theory comply with the practice, or reflect the practice closely. So, that is the power of these axioms. Let me give them to you; you will see that they are very simple and at the same time very powerful. Over the rest of the lectures, we will see how to use these axioms in interesting ways.
(Refer Slide Time: 17:42)
Here are the axioms; these are called the probability space axioms. I am sort of repeating the definition of probability for you, and this is the proper definition, because last time I did not state the conditions that it needs to satisfy. Probability is a function P that assigns to each event a real number between 0 and 1 and satisfies the following two axioms. First, 𝑃(𝑆), the probability of the entire sample space, is 1. That is the first axiom: you have to ensure that the entire sample space is associated with probability 1; otherwise it is not a valid probability function, not a valid probability space.
The second axiom is the intriguing one, and it is perhaps the most central axiom in probability; it took time, it was not easy to come up with. Here is the axiom. Suppose you have a collection of disjoint events, which I am going to write as 𝐸1 , 𝐸2 , 𝐸3 , … How many events, you may ask? Is it just two events 𝐸1 , 𝐸2 or should it go on and on? It turns out you could have either a finite number of events or infinitely many; they can keep coming one after the other, 𝐸1 , 𝐸2 , 𝐸3 , like that.
And you can have infinitely many events like this. The second axiom is what is most interesting: take a collection of disjoint events 𝐸1 , 𝐸2 , 𝐸3 , … What is this "so on"? How many can you have? You could have a finite number, it could go up to 10 and stop, or it could keep going: 10, 11, 12, 13, 100, 1000, 10000, a never-ending sequence. But it is a sequence, so it can be finite or go on and on, and the events all have to be disjoint; there is no question of any intersection between any of them. If that is the case, then the probability function should assign values to events such that

𝑃(𝐸1 ∪ 𝐸2 ∪ 𝐸3 ∪ … ) = 𝑃(𝐸1 ) + 𝑃(𝐸2 ) + 𝑃(𝐸3 ) + ⋯
So, what is inside the P here? Let me just show you with the pointer. What is inside the P here? It
is the union of all these disjoint events that you have. Either 𝐸1 occurred or 𝐸2 occurred or 𝐸3
occurred. Union is like OR, you have a bunch of disjoint events, disjoint is a very key word here.
And you have 𝐸1 ∪ 𝐸2 ∪ 𝐸3 on the left side. So, what is P of that? The probability value assigned
to the union of all those events. So, we saw before how you can have disjoint events and how you
can take unions of events.
They are all events again, you take union of two events, you get another event. It all becomes
events again and again and again. Now your probability function assigns probability to all events.
So, if you have a bunch of disjoint events, it will assign probability to each of these events. It will
also assign a probability to the union of all these events.
Now the second axiom is saying the probability function should assign values such that, the values
that it assigns to 𝐸1 , 𝐸2 , 𝐸3 , … etc., if you add them all up, you can notice what is happening on the
right hand side; on the right hand side I am taking each of the individual probabilities 𝑃(𝐸1 ),
𝑃(𝐸2 ), 𝑃(𝐸3 ), etc, and adding them all up, that sum should be the exact same value as what? As
what you have on the left hand side here. What is, what do you have on the left hand side here? It
is the probability assigned to the union of all these events.
So, if you want a very simple version of this axiom, take just 𝐸1 ∪ 𝐸2 and drop everything else; forget about 𝐸3 onwards. It says that if 𝐸1 and 𝐸2 are disjoint, 𝑃(𝐸1 ∪ 𝐸2 ) = 𝑃(𝐸1 ) + 𝑃(𝐸2 ). It has to be that way; the probability function has no other way of doing it.
So, the first thing I want to say is that these are very, very intuitive and simple conditions, and you can see why. The probability of the whole sample space being 1 is sort of obvious: if you define the event to be the entire sample space, any outcome is okay with you. So, what is the chance that something will happen? Something will happen for sure; that is how your experiment works.
So, you should assign the value 1 to the entire sample space; that is very clear. Remember, when I say 𝑃(𝑆), what do I mean? I am seeing S as an event: the entire sample space is actually a subset of itself, and that subset should have a probability of 1. That is the first condition.
What does the second condition tell you? Say you have two events which are disjoint. You put a probability on one and a probability on the other; if you look at the union of the two, the probabilities should clearly add. If there is a 10 % chance that something happens, and a 20 % chance of something else happening which has no intersection with it, then either this or that will happen with a 30 % chance. The function has to respect that, is it not?
So, it is a natural law that is coded into mathematical language here as axioms. Most people, when I describe the natural law, seem comfortable: something will happen, so the probability of the entire sample space should be 1, and when we have disjoint events, the chances add. That is clear, but sometimes the precise mathematical language is a bit troubling.
But do not be put off by it; try to understand it. It is actually very comforting once you get over the initial difficulty, because mathematics is very precise, and precise things are much easier to understand than imprecise things. So, look at how this language captures the intuition.
So, the axioms are intuitive, and they put down conditions that have to be satisfied by every probability function. If a condition is not satisfied, it is not a valid probability function, and the theory will not do meaningful things after that. So, we will see how to put them to use and it is
very useful.
(Refer Slide Time: 24:17)

So, I am going to show you next a few examples and sort of argue that, well the theory has just
three objects and two conditions, but it is not very easy to get started with. So, we need simpler ways to start using this theory, and for that we will start making some deductions. Let us see a few examples of where the theory becomes difficult to use.
So, let us take the first experiment to be tossing a coin, 𝑆 = {𝐻, 𝑇}, I have been writing heads and
tails so far, but we can write {𝐻, 𝑇}. H stands for heads, T stands for tails, so, that is your sample
space. The first probability function, so, what I am going to do is, I am going to show you multiple
probability functions.
For the same sample space, I am going to show you multiple probability functions and then ask: does this function satisfy the axioms? There are two conditions in the axioms. What is the first function? The first function assigns a value of 0 to the empty set and a value of 0.5 to heads, which is the subset just {H}.
Remember for {𝐻, 𝑇}, there are only four subsets, there are only four events. What are the four
events? ∅, {𝐻, 𝑇}, {𝐻}, {𝑇}. So, you have to just assign four probability values and they have to be
consistent with the axioms. So, to the empty set you can assign 0, to the entire sample space you
can assign 1, you have to assign 1 there.
If you do not assign 1 there you immediately violate it and then you put 0.5 to {𝐻}, 0.5 to {𝑇}. So,
how do you go about checking the second axiom? You have to find disjoint sets, is it not? You have to find disjoint subsets. What are disjoint subsets? {H} alone and {T} alone, let us say, are disjoint subsets.
So, what is the union of the two disjoint subsets? The entire set {𝐻, 𝑇}, so you see it adds up. So,
𝑃{𝐻} + 𝑃{𝑇} = 𝑃{𝐻, 𝑇}. It works out, it is correct. Now I want to generalize this a little bit; maybe I should use my pointer here.
The second bullet point shows you a more general case than this, I take a value p, which is some
value between 0 and 1, in the previous thing, I took 0.5. I am going to take let us say any other
value could be 0.2, 0.3, 0.9, 0.8, any value and I am going to assign probability of heads to be p
and probability of tails to be (1-p), and 𝑃(∅), I will keep as 0, 𝑃{𝐻, 𝑇}, I will keep as 1.
This is another probability function on the same sample space with the same four events, and it is also a valid probability function. You can
check, you can do this quick check 𝑃{𝐻} + 𝑃{𝑇} should be equal to 𝑃{𝐻, 𝑇}. So, that we know
from the second axiom. The second axiom is satisfied. 𝑝 + (1 − 𝑝) = 1. Of course it has to assign
a value between 0 and 1 that is also happening.
So, I have given you here another function which is invalid, just to show you that not every function is valid; you cannot just take any function. It assigns 0 to the empty set and 1 to the entire sample space, and let us say we put 0.5 to {H} and 0.6 to {T}. This is not a valid probability function, why is that?
You do 𝑃{𝐻} + 𝑃{𝑇}, you get like 1.1, which is clearly wrong. It should be 1 because 𝑃{𝐻, 𝑇} =
1. So, hopefully this gave you a quick glimpse at how these probability functions look, they are
just values between 0 and 1 assigned to the subsets of the sample space. However many subsets you have, you have to assign a value to all of them, and then you have to check all these conditions.
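Since this course will use Python for computer simulation, here is a small sketch of that brute-force check for a finite sample space. The function names (`all_events`, `is_valid_prob`) are my own, not from the lecture; the idea is just to list every subset and test the two axioms directly, using exact fractions to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations

def all_events(sample_space):
    """List every subset (event) of a finite sample space."""
    s = list(sample_space)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def is_valid_prob(sample_space, P):
    """Brute-force check of the two axioms.
    P maps each frozenset event to its probability."""
    events = all_events(sample_space)
    if P[frozenset(sample_space)] != 1:          # axiom 1: P(S) = 1
        return False
    for e1 in events:                            # axiom 2: additivity on
        for e2 in events:                        # every disjoint pair
            if not (e1 & e2) and P[e1 | e2] != P[e1] + P[e2]:
                return False
    return True

S = {'H', 'T'}
P_fair = {frozenset(): Fraction(0), frozenset('H'): Fraction(1, 2),
          frozenset('T'): Fraction(1, 2), frozenset('HT'): Fraction(1)}
P_bad = {frozenset(): Fraction(0), frozenset('H'): Fraction(1, 2),
         frozenset('T'): Fraction(3, 5), frozenset('HT'): Fraction(1)}
print(is_valid_prob(S, P_fair))  # True: 0.5 + 0.5 = 1
print(is_valid_prob(S, P_bad))   # False: 0.5 + 0.6 != 1
```

This mirrors the check in the lecture: the only non-trivial disjoint pair for the coin is {H} and {T}, and the invalid function fails exactly there.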
The first axiom is very easy to check; the second axiom is tough to check, because there are so many disjoint subsets, and how do you go about checking everything? For {𝐻, 𝑇} it was very easy, there is really only one check you have to do, but in general how do you do it? It does not seem that clear. Let us look at a slightly more complicated example; you will see the problem immediately.
You need not even go that big, let us just look at throwing a die, it is {1, 2, 3, 4, 5, 6}. There are six
elements in the sample space, there are 64 events. Forget about even checking whether the axioms
are satisfied, how do you even specify the probability function? What do I mean by specify? So,
in the previous case I could write the probability for every event.
How do I write the probability for every event? There are 64 events. I have to make a big table of
all the subsets and then put one value against each of them. It is going to be really tough after that to ensure the axioms are satisfied; there are so many disjoint subsets here, and I have to make sure I do not violate any of them.
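To see the size of the problem concretely, here is a quick sketch, my own illustration rather than anything from the slides, that enumerates every event of a six-sided die with Python's `itertools`:

```python
from itertools import combinations

S = [1, 2, 3, 4, 5, 6]
# every subset of S, from the empty set up to S itself
events = [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]
print(len(events))  # 64 = 2**6 events, each needing a probability value
```

Even for this tiny experiment there are 64 values to write down, which is why a simpler way of specifying a probability function is needed.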
So, all this is not giving me an easy way out; the theory seems hopelessly complicated at this point if you do not make use of the axioms to simplify your work. So, we will try and address these
questions in the ensuing lectures of this week. We will come up with the easy way to specify a
probability function and ensure that it is valid.
So, I will give you a very easy way to do it. It is not that hard. I mean I will just ask you a simple
question. You will see it is a very easy way of answering it and we will see several types of sample
spaces and all that and we will use only the axioms, we will not do anything else, everything else
will be deduced from this point on. That is the power of the theory, that is how the theory works.
So, you have your basic objects: the sample space, the events and the probability function. You have your two axioms, which place conditions on the probability function, and from now on we will start deducing things. The first thing we have to deduce is how to easily specify a probability function that will definitely be valid. The answer is different based on what type of sample space you have. The first type we will consider is a finite sample space, that is, a finite number of elements in the sample space: how can you go about specifying a valid probability function there? We will see that in the ensuing lectures, thanks a lot for
your attention.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.6
Events and Probabilities: Basic Properties of Probability
Hello and welcome to this lecture. We have been looking at the basic concepts of probability throughout. Once again, these are very important rudimentary ideas. You have to grasp them really well
and you have to learn how to use them. That is what is most important, most people just listen to
something and think I know this, I know this, I understand this, I understand this but till you put it
to use, till you solve problems with it really, you have not understood or you do not know how to
apply it.
So, it is as good as not understanding it. So, there is no understanding without the associated action
that comes along with it. So, it is very important that you practice and there are enough practice
problems that will come out for every lecture. So, I hope you enjoy the practice as much as listening
to the lectures.
So, we have defined all the basic quantities involved in the probability space. We are studying
about the probability space. There is an experiment and it has outcomes and after that we have the
mathematical theory which starts with the sample space, which is the set of all outcomes and then
there are these events. So, events are subsets of the sample space and then there is this probability
function which associates a value between 0 and 1, for every event that you have in your
probability space.
So, this is the group of three things, a triplet if you will, that makes up a probability space: the sample space, the collection of events and the probability function. And there are restrictions, conditions or axioms, that we impose on the probability function. They are relatively simple axioms; we saw them before: the entire sample space itself should be assigned probability 1, and if there
are two disjoint events and you take their union, the probability of the union should be the sum of the probabilities of the individual events.
These are the two axioms, and we were trying to use them, and it seemed a little bit confusing to use. So, what we are going to do is start with the axioms and make some very
useful deductions and study some important properties of the probability space, so that is what we
will do in this lecture.
I will put out four or maybe five properties of the probability space, and they make working with probability spaces so much easier. These are all deductions, not axioms; the axioms are only two, and we will logically prove these results. We will see they are very
simple results in some sense. So, let us get started with the properties. We will only see the
properties in this lecture, in the next lecture we will see applications of the properties. So, let us
get started.
(Refer Slide Time: 02:57)
I just did the full recap; I am not going to recap again. So, this is the recap slide of where we are
and we are going to look at the basic properties of the probability space and you will see it is very,
very useful when you work with the probability space to know all these basic properties.
(Refer Slide Time: 03:16)

So, here is the first property; I will call it the empty set property. It concerns the probability of the empty set, which we will denote as ∅, the symbol with a circle and a line through it, usually read as the Greek letter phi; it denotes the empty set. So, a lot of people ask why mathematics is considered
very difficult and I will say one of the reasons is they choose to use Greek alphabets. If you say
things like a, b, c it seems quite easy. If you say, start saying alpha, beta, epsilon and all that,
hurriedly it seems like it is confusing and the Greek thing could have been avoided.
If you think this phi is very confusing, simply use empty set all the time, it is perfectly fine. It is
just a notation for the empty set. Now, remember this is not an axiom; it is a property, and there are only two axioms. We will prove this property using the axioms. It is a very simple proof; you will see how easy it is. The property says the probability of the empty set equals 0.
So, the probability function assigns a probability of 0 to the empty set. So, if you remember the
axiom said, probability functions assigns a value 1 to the entire sample space. Now to the empty
set it assigns the value 0. So, let us see how to prove it. I will quickly go through the proof.
The proofs are not so important in this class. You will never be asked to prove anything; I promise
you that. You will only have to apply it but I think for those of you who are interested you can sort
of see where the proof comes from and it is very simple actually. It is very easy use of axioms.
Remember we were only going to use the definition and axiom.
Intuitively it is clear that the empty set has nothing in it, so the probability should be 0, but that is not a proof. A proof has to use logical deductions starting with the axioms and arrive at the exact statement that you want to show. So, this is how a proof would
look.
You look at the complement of the empty set, that is the entire sample space S. So, that is the first
observation, the next observation maybe I should show you with this pointer here. So, the first
observation I have is, ∅𝑐 = 𝑆. The next observation is that ∅ and 𝑆 are disjoint. It is very easy to see: the empty set is disjoint from every set, is it not?
So, clearly ∅ and 𝑆 are disjoint and you also observe that ∅ ∪ 𝑆 = 𝑆. So, these are three critical
observations involving the empty set. ∅𝑐 = 𝑆, so that clearly means that the empty set and the
sample space are disjoint and interestingly if you take the union of the empty set with the sample
space you will get the sample space. So, these are easy things to think of.
And now we are ready to use axiom 2. So, you have two disjoint events and we know their union,
so let us use axiom 2. What does axiom 2 tell us? 𝑃(∅ ∪ 𝑆) = 𝑃(∅) + 𝑃(𝑆), so these are disjoint.
So, what is ∅ ∪ 𝑆 that is S itself, so, 𝑃(𝑆) = 𝑃(∅) + 𝑃(𝑆).
So, that tells you, 𝑃(∅) = 0. So, notice how the axiom was used and I did not appeal to any
intuition or something. I used the axiom and then just algebraically proved the result for you. Once
again, I am not going to spend too much time on the proofs in this course but let us once again go
back to the intuitive field for this property.
It is very clear that any probability function had better assign the value 0 to the empty set; there is really nothing in the set, so there is no way that event can occur. So, it should have a probability of 0. Let us move on to the next property.
(Refer Slide Time: 07:08)
This is what we call the complement property. Once again, these Venn diagrams are quite useful for illustration; at least you can picture what this means. You have the sample space, yes? The entire sample space has a probability of 1, and then you have an event E. The event E is a subset of the sample space.
And it has some probability 𝑃(𝐸), and then you have 𝐸 𝑐 . What is the complement of E? It is everything other than E. It is sort of logical that the complement should also be an event: if the event E occurred, then 𝐸 𝑐 did not occur, and if E did not occur, then 𝐸 𝑐 occurred.
So, doing complements, doing unions, doing intersections are natural processes in a probability
space. So, if an event E exists, 𝐸 𝑐 will also be a proper event. It has to be in some sense. So, those
are the things we will assume about events, otherwise you will just get stuck. So, the basic
operations that we do with events should be all possible.
So, that is just something we will take as true. So, you have an event E, and you have its complement, 𝐸 𝑐 . If you evaluate the probability function P on 𝐸 𝑐 , you get 𝑃(𝐸 𝑐 ) = 1 − 𝑃(𝐸). So,
notice that this places a restriction, this property gives you something very useful about what the
probability function does.
If you put some probability to the event E, let us say for instance if you say event E is 30 % chance
of occurring. Then what should be 𝑃(𝐸 𝑐 ), the probability that E did not occur? It should be 70 %. So, 1 − 𝑃(𝐸) is a perfectly intuitive result. Nobody is going to be very surprised if you state this result, and one can prove it very rigorously from the axioms.
I will just quickly go through the proofs, everything starts with E and 𝐸 𝑐 being disjoint and 𝐸 ∪ 𝐸 𝑐
being the sample space S and then you use axiom 2, you get that 𝑃(𝑆) = 𝑃(𝐸) + 𝑃(𝐸 𝑐 ) and then
you use axiom 1 to get 𝑃(𝑆) = 1 and you are done.
So, once again I want to point out a couple of quick things. One is, once you have an event E why
should 𝐸 𝑐 also be an event? These are things we will assume, we will assume the events are like
that, you know natural operations on the events give you other events things like unions,
intersections, complements all of that give you other valid events, you cannot say that those are
not events.
Because you know if one occurs the other event also occurs. If you have two events A, B, if you
say 𝐴 ∪ 𝐵, either A occurs or B occurs so, then 𝐴 ∪ 𝐵 will occur. So, it is clearly a valid event in
the sense of in a probability space. So, you should have those kind of things possible and so this
property gives you something very useful.
So, this is the second property; the first property was the empty set property. The probability function always assigns the value 0 to the empty set, and if it assigns a value 𝑃(𝐸) to an event E, then the probability of 𝐸 𝑐 is 1 − 𝑃(𝐸). So, it is a very clear and very simple property. So, it will all sound very simple and
as you keep listening to it you will be nodding your head saying yeah this one is easy, this one is
easy.
But once again, when we want to apply these in a problem, you will see it is easy to get confused about what to apply and what not to apply: is this property useful, is that property useful? So, pay attention to the intuition behind each property; when we use them, we will see how well you have understood these properties.
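As a small illustration in Python, here is the complement property checked numerically for a die. The uniform per-outcome assignment below anticipates later lectures and is my own choice for the sketch, not something defined in this one.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
p = {x: Fraction(1, 6) for x in S}   # assumed fair die: 1/6 per outcome

def prob(event):
    """P(event) as the sum of per-outcome probabilities."""
    return sum(p[x] for x in event)

E = {2, 4, 6}                        # "an even number shows up"
print(prob(E))                       # 1/2
print(prob(S - E) == 1 - prob(E))    # True: the complement property
```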
(Refer Slide Time: 10:38)

Next is what I am going to call the subset property. Suppose you have two events and one is a subset of the other. What does it mean to say one is a subset of the other? If E occurred, then F definitely occurred; E is inside F. But F can also occur without E occurring.
So, this picture here, the Venn diagram, sort of shows how the events look. You have the entire sample space S, the big oval here, and then you have a smaller
oval which represents the event F and then you have an even smaller oval inside of F which is the
event E.
So, this clearly shows 𝐸 ⊂ 𝐹. So, if you have something like this, one can define something called
𝐹\𝐸. So, this is something you might have seen before in set theory. If you have not seen let me
define it for you: 𝐹\𝐸 is basically the elements of F which are not in E. So, if you look at this picture, everything inside the small oval is E; what is 𝐹\𝐸?
The things in F, all these guys which are not in E. So, that is 𝐹\𝐸 and believe me if F is an event,
E is an event, 𝐹\𝐸 will also be an event, why is that? Because it is the same as 𝐹 ∩ 𝐸 𝑐 . So, 𝐸 𝑐 is
everything outside of E and if you take 𝐹 ∩ 𝐸 𝑐 you get a valid event.
So, this 𝐹\𝐸 is also a proper event, so it is all reasonable to assume 𝐹\𝐸 is an event. So, this
probability function will assign probabilities to F, to E, and 𝐹\𝐸. So, what does this property tell
you? This property relates these three probabilities: the probability of the entire F, the probability of E, which is inside this small circle, and the probability of what is in F that is not in E.
It is sort of intuitively clear that 𝑃(𝐹) should be equal to 𝑃(𝐸) + 𝑃(𝐹\𝐸). In a reasonable probability space this should be true, and that is the property that holds here. So, we can prove this
property. It is not very hard to write down and prove it. I have a proof here, it just relies on E and
𝐹\𝐸 being disjoint and the union being F and then you use axiom 2 and you are done.
So, in particular, notice what this implies: 𝑃(𝐹) ≥ 𝑃(𝐸). Now this is quite important, and I want you to think about why. You have 𝑃(𝐹) = 𝑃(𝐸) plus something else, and that something else is a nonnegative quantity; it is a probability, and probabilities lie between 0 and 1. So, you take P(E) and add a nonnegative quantity to get P(F), which means P(F) is greater than or equal to P(E), or P(E) is less than or equal to P(F). So, these two
properties are very useful when you have a subset. The events will sum up like this and that is a
property that the probability function satisfies.
More than anything, quite often you will have a complicated problem with an event whose probability is very difficult to calculate. A very useful trick people use is to find a bigger event which contains this event and for which the probability is easy to compute. Then you do not know the exact probability of E, but at least you have an upper bound: if, instead of finding P(E), you find the probability of F, where F is a bigger set than E and easy to compute, then even though you do not know P(E), you know that P(E) is at most this.
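Here is a quick numeric sketch of the subset property and the upper bound it gives, again assuming a fair die with per-outcome probabilities of 1/6; that assignment and the particular events are my own picks for illustration.

```python
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}  # assumed fair die

def prob(event):
    return sum(p[x] for x in event)

E = {6}            # "a six shows up"
F = {2, 4, 6}      # "an even number shows up"; E is a subset of F
print(prob(F) == prob(E) + prob(F - E))  # True: P(F) = P(E) + P(F \ E)
print(prob(E) <= prob(F))                # True: the upper-bound corollary
```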
So, usually that is a good thing to have: if you do not know the exact probability, say you do not know whether it is 10 % or 12 % but you know it is less than 15 %, that is good for you. That is the sort of thing people use these kinds of results for. Once again, let me summarize the three properties that we have seen so far.
The empty set has to be assigned a probability of zero by the probability function; that is the first property. The second property was the complement property: 𝑃(𝐸 𝑐 ) = 1 − 𝑃(𝐸), very intuitive once
again. A third property we are seeing is a subset property, if you have two events one is contained
inside the other as in if one event happens the other event is definitely supposed to happen. E is
inside F, then 𝑃(𝐹) = 𝑃(𝐸) + 𝑃(𝐹\𝐸), F\E is sort of another event, we saw why that should be
an event, it is 𝐹 ∩ 𝐸 𝑐 .
So, that is subset property, hopefully this is clear enough to you. The proofs you can see are very
simple. They are just two or three lines; they repeatedly use axiom 2 and axiom 1, and you will see how powerful that is for getting these results. Let us go to property 3a.
(Refer Slide Time: 15:37)

So, it is very similar to 3 but it is 3a because, now we are going to look at two events but one may
not be contained in the other, so here is a picture here, so you can contrast it with the previous
picture. You have your sample space S and then there is this event F and then there is another event
E and these two events sort of overlap but one is not inside the other.
So, these are two different events, so this is how they look. So, now what you can do here is you
can look at 𝐸 ∩ 𝐹, notice this 𝐸 ∩ 𝐹, it is here. Even though E is not a subset of F, (𝐸 ∩ 𝐹) ⊂ 𝐹
and likewise 𝐸 ∩ 𝐹 is also a subset of E but at least you know E intersect F is a subset of F.
So, that is the starting point for this proof, and you can see quite easily that 𝑃(𝐸) = 𝑃(𝐸 ∩ 𝐹) + 𝑃(𝐸\𝐹). Here 𝐸\𝐹 is whatever is in E that is not in F; all of those outcomes will be in 𝐸\𝐹.
So, just the subset property applied to 𝐸 ∩ 𝐹 and E gives you this result, and you can think about why it is true; it is sort of intuitive as well. The probability of E is the probability of 𝐸 ∩ 𝐹, both E and F happening, plus the probability of E happening alone without F. Either both E and F happen, or E happens without F.
So, this is all, these are all very important properties. I mean we will use these sort of intuitively
to compute probabilities. I will try and show you some examples where we will end up using these
things intuitively. As we calculate more and more complicated probabilities and more and more
complicated scenarios you will use results like this.
So, if you want to find the probability of an event F, you can find some other event E, find the probability of their intersection, and then go to F without E. You will keep doing this again and again; you can do these things to simplify your work. So, that is property 3a.
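Here is the same kind of sketch for property 3a, with two overlapping die events where neither contains the other. The fair-die assignment and the events are assumptions of mine for the demo.

```python
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}  # assumed fair die

def prob(event):
    return sum(p[x] for x in event)

E = {1, 2, 3, 4}   # "at most 4"
F = {2, 4, 6}      # "even"; E and F overlap, neither contains the other
print(prob(E) == prob(E & F) + prob(E - F))  # True: P(E) = P(E∩F) + P(E\F)
```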
(Refer Slide Time: 18:00)

Finally, we come to union and intersection. So, we have been talking a lot about union and
intersection and here is a very simple and yet powerful, elegant property which connects the
probability of union and intersection to probability of the original events. So, this is a very
important property. Supposing once again you have two events E, F inside a sample space S.
The probability of 𝐸 ∪ 𝐹, E or F, what is E union F once again? It is E or F, either E occurred or
F occurred. If you want the probability of that, 𝑃(𝐸 ∪ 𝐹) = 𝑃(𝐸) + 𝑃(𝐹) − 𝑃(𝐸 ∩ 𝐹). So, how
do you understand this result? There is a proof here. I will encourage you to look through the proof.
I am not going to discuss the proof in more detail but how do you understand this result. So, here
is 𝐸 ∩ 𝐹. We can do like a proof by Venn diagram. 𝐸 ∪ 𝐹 is this whole area, is it not? Now what
is P(E)? It is this area and what is P(F)? It is this area. If you do P(E) + P(F) and you want 𝑃(𝐸 ∪ 𝐹),
what would have happened?
This intersecting area, 𝐸 ∩ 𝐹, would have gotten added twice: once in P(E) and once in P(F). So, you have to add P(E) plus P(F) and then subtract 𝑃(𝐸 ∩ 𝐹),
which is what I am doing here. Then you will get 𝑃(𝐸 ∪ 𝐹). So, it is sort of intuitive and I sort of
gave you a rough proof by Venn diagram but if you want an exact proof here is an exact proof.
And you can go through this, this is not very hard to prove. You can basically write 𝐸 ∪ 𝐹 as a
disjoint union of three things and then you will get the result quite easily. So, hopefully the results
are interesting to you and the properties are intuitive, not very surprising. This lecture is mostly notation, a somewhat dry lecture talking you through the properties.
In the next lecture we are going to apply it in multiple problems. So, you will see you can test
whether you grasp the properties properly or not, you will be able to use it or not, we will do that
in the next lecture but hopefully these properties are clear.
So, let me once again quickly recap. You had the two axioms, which are always true for any probability function: the entire sample space is assigned probability 1, and if you have two disjoint events, the probability of their union is the sum of the two probabilities. So, this union and intersection result is sort of an extension of axiom 2 to the general case where E and F are not disjoint.
If E and F are disjoint, 𝑃(𝐸 ∩ 𝐹) = 0, why is that? 𝐸 ∩ 𝐹 is the empty set, so this term goes to 0 and you recover the original axiom 2. So, this is a generalization of axiom 2, is it not? If E and F are general events, then 𝑃(𝐸 ∪ 𝐹) = 𝑃(𝐸) + 𝑃(𝐹) − 𝑃(𝐸 ∩ 𝐹).
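The union–intersection formula can be checked the same way; the events below are my own picks on an assumed fair die, not from the lecture.

```python
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}  # assumed fair die

def prob(event):
    return sum(p[x] for x in event)

E = {1, 2, 3, 4}
F = {2, 4, 6}
lhs = prob(E | F)                      # P(E ∪ F) over {1, 2, 3, 4, 6}
rhs = prob(E) + prob(F) - prob(E & F)  # 4/6 + 3/6 - 2/6
print(lhs == rhs)  # True: both sides equal 5/6
```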
So, this is union and intersection, we had the empty set property, we had the complement property,
probability of complement of an event is 1 minus probability of an event, then we had the subset
property, then we had the difference property, difference and intersection property, then we had
the union and intersection property.
All right, so that is, I believe, the end of all the properties. In the next lecture we will look at how to use these properties and start computing the probabilities of different events in probability spaces. How do you start working with probability spaces? That we will do in the next lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.7
Events and Probabilities: Working with Probability Spaces
Hello and welcome to this lecture, in the previous lecture we saw some very interesting and simple
properties of probability spaces, so notice how the theory is slowly developing. So, we had the
basic ingredients in the theory first and then we just gave it a couple of axioms and then we looked
at events and how events can be manipulated and look at how all the properties are sort of entering
naturally into the game.
So, what happens when you do unions, intersections, complements, what happens when one event
is inside another event, it is sort of implied by another event and all that. So, it is just quickly
becomes very interesting and that is one of the features of many of these mathematical theories,
you start with something very simple and suddenly you will start getting very interesting answers
and probability is certainly like that. And what we are going to do next is take some simple
examples, some of them will be very simple examples some of them will be slightly more
complicated examples.
And we will start using these properties, so we will start working with the events in these
probability spaces, start understanding what kind of probabilities they can have, given the
information how do we use it, how do we use the properties and deduce something more interesting
etc. So, let us get started.
(Refer Slide Time: 01:26)

So, we are going to start with something very simple, we have been talking about it quite a bit and
let us see how probabilities work in this very simple probability space. So, this space is basically
tossing a coin, the experiment is tossing a coin and we know there are two outcomes: heads and
tails. I am going to denote them as H and T. So, the sample space is something we have seen
several times in this course already. It has got just two entries in it, set with two elements {𝐻, 𝑇},
that is it.
Now, what are the events in this sample space? For instance, how many events are possible? One can see that there are four events, and it is interesting to consider all of them: ∅, 𝐸 = {𝐻}, 𝐹 = {𝑇}, 𝑆 = {𝐻, 𝑇}.
So, there are four events and now let us see what kind of values the probability function can assign
to these four events. So, for the entire sample space you have to assign 1, 𝑃(𝑆) = 1. From the
empty set property you know 𝑃(∅) = 0, already two probabilities are taken care of. Notice how
the property has just entered the picture and simplified the work for you.
Now, the complement property is again interesting. You notice quickly that 𝐹 = 𝐸 𝑐 , is it not? You defined 𝐸 = {𝐻}; whatever is not in E but is in S, that is the complement, and that is exactly F. So, 𝑃(𝐹) = 1 − 𝑃(𝐸). So, the probability function, when you toss a coin, that
experiment, if you start thinking of four events, then you want to assign probabilities for these four
events, the probability function has to look like the following.
You can assign whatever number you want for the probability of heads; maybe we call it p. Right now I do not know what value to put for it, so let us keep it as some unknown p. It is between 0 and 1, that I know for sure, and once you assign that value everything else gets taken care of. Why is that? Anyway, 𝑃(𝑆) and 𝑃(∅) were known before, 1 and 0, no problem; the only thing
that was left was 𝑃(𝐸) and 𝑃(𝐹). If you put a value p to 𝑃(𝐸), then what should be 𝑃(𝐹)? (1 −
𝑝). So, that comes from the complement property, that is it.
So, if you had to actually apply this probability space to study the tossing of a coin experiment
somewhere, you have to choose a value p. How do you do it? One possibility is that you know ahead of time that the coin is fair; if the coin is fair then 𝑃(𝐸) should be equal to 𝑃(𝐹). Now how do you make 𝑃(𝐸) = 𝑃(𝐹)? You have to equate 𝑝 = 1 − 𝑝, which gives you 𝑝 = 0.5.
Okay so, assigning the probability of 0.5 to heads and 0.5 to tails takes care of the probability space
when you have a fair coin. So, a very simple example, I mean I did not do anything very different
or unusual here, just took a very simple sample space, looked at all the events that you can have.
There are only four events in this very simple space and looked at how is it that the probability
function will work.
And we got this very simple and very interesting idea: the only thing that matters, the only thing that is in your hands so to speak, is this 𝑝, the probability of heads. Once you fix that, everything else gets fixed in the probability space. You do not have any room to play with anything else, and we saw that if you want a fair coin you put 𝑝 = 0.5.
So, that is a very simple space. Hopefully this was clear, and you can see how theory and practice meet somewhere: when you want to assign actual probabilities, you have to use some ideas from the practice or reality of the situation, some pattern that you may know, and then you get a nice probability space to work with.
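Since the course promises computer simulation, here is a tiny Monte Carlo sketch of the fair coin: toss it many times and watch the relative frequency of heads settle near p = 0.5. The seed and tolerance are arbitrary choices of mine for a reproducible demo.

```python
import random

random.seed(0)      # fixed seed, just to make the demo reproducible
p = 0.5             # fair coin
n = 100_000
heads = sum(1 for _ in range(n) if random.random() < p)
print(abs(heads / n - p) < 0.02)  # True: the frequency is close to p
```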
(Refer Slide Time: 05:45)
Now the next one is a bit interesting. By the way, all these examples are taken from the book by Deepayan Sarkar, Siva Athreya and Steve Tanner, the book that is in preparation; you can go and read what they have written about these problems. They are good for getting used to the probability space and different ways of thinking about it.
So, here is a completely different sort of example, very different from the coin-tossing example. Here is the situation, the experiment so to speak. A restaurant wants to hire a waiter and a cashier. They put out an advertisement and four people have applied: two people I am calling David and Megha are from Delhi, and two others, Rajesh and Veronica, are from Mumbai. So, there are four people, two from Delhi and two from Mumbai, and the restaurant is going to do a random experiment of sorts in the hiring. All of them are equally qualified, and any of them could be waiter or cashier in this restaurant, so the restaurant just decides to choose at random.
So, they are going to hire one person at random first as the waiter, and then from the remaining three persons they will choose another person at random and make them the cashier. So, for this experiment and scenario, what are the outcomes that interest us? Let me just start writing them out.
Okay, the experiment is this random choice, and the outcome is an ordered pair (𝑤𝑎𝑖𝑡𝑒𝑟, 𝑐𝑎𝑠ℎ𝑖𝑒𝑟): the name of the person hired as waiter followed by the name of the person hired as cashier. That is a very easy way to represent the outcome. For example, you may have David selected as the waiter and Megha selected as the cashier, so (𝐷𝑎𝑣𝑖𝑑, 𝑀𝑒𝑔ℎ𝑎) might be one of the outcomes of this experiment. Now, what would be the sample space? The sample space is going to be the set of all possible outcomes, is it not? So, it could be (𝐷𝑎𝑣𝑖𝑑, 𝑀𝑒𝑔ℎ𝑎), it could be (𝐷𝑎𝑣𝑖𝑑, 𝑅𝑎𝑗𝑒𝑠ℎ), it could be (𝐷𝑎𝑣𝑖𝑑, 𝑉𝑒𝑟𝑜𝑛𝑖𝑐𝑎), and to save space I will only write the first initials from here on; they are all different initials, so there is no confusion.
So, David got hired as waiter and then one of the other three got hired as cashier. It can also happen the other way around, so we will be careful to list both orders. You could have Megha hired as waiter and David hired as cashier, Megha hired as waiter and Rajesh hired as cashier, or Megha hired as waiter and Veronica hired as cashier.
So, this is how it is going to look, and then you have the same pattern again: you could have (𝑅, 𝐷), (𝑅, 𝑀), (𝑅, 𝑉), and then (𝑉, 𝐷), (𝑉, 𝑀), (𝑉, 𝑅). So, the sample space is {(𝐷, 𝑀), (𝐷, 𝑅), (𝐷, 𝑉), (𝑀, 𝐷), (𝑀, 𝑅), (𝑀, 𝑉), (𝑅, 𝐷), (𝑅, 𝑀), (𝑅, 𝑉), (𝑉, 𝐷), (𝑉, 𝑀), (𝑉, 𝑅)}. All 12 possible outcomes of who gets hired as waiter and who gets hired as cashier are written down here; this is the sample space.
Now the question asks for events. The first event is 𝐴: 𝑡ℎ𝑒 𝑐𝑎𝑠ℎ𝑖𝑒𝑟 𝑖𝑠 𝑓𝑟𝑜𝑚 𝐷𝑒𝑙ℎ𝑖. If the cashier is from Delhi, the cashier needs to be either David or Megha, is it not? So, the event A contains (𝐷, 𝑀), since Megha is from Delhi; (𝑀, 𝐷) also qualifies, since David is from Delhi; and then (𝑅, 𝐷), (𝑅, 𝑀), (𝑉, 𝐷), (𝑉, 𝑀). So, the cashier is either Megha or David, and we do not care who the waiter is; the waiter can be anybody else. That gives you six possibilities: 𝐴 = {(𝐷, 𝑀), (𝑀, 𝐷), (𝑅, 𝐷), (𝑅, 𝑀), (𝑉, 𝐷), (𝑉, 𝑀)} is the event that the cashier is from Delhi.
Next, 𝐸𝑣𝑒𝑛𝑡 𝐵: 𝑒𝑥𝑎𝑐𝑡𝑙𝑦 𝑜𝑛𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑖𝑠 𝑓𝑖𝑙𝑙𝑒𝑑 𝑏𝑦 𝑎 𝐷𝑒𝑙ℎ𝑖𝑖𝑡𝑒. These are all problems where you have to write out a set given a description of it in English. Exactly one position should be filled by a Delhiite, so the outcome cannot be (𝐷, 𝑀); it can be (𝐷, 𝑅) or (𝐷, 𝑉); it cannot be (𝑀, 𝐷), since then both positions are filled by Delhiites; it can be (𝑀, 𝑅), (𝑀, 𝑉), and then (𝑅, 𝐷), (𝑅, 𝑀), and then (𝑉, 𝐷), (𝑉, 𝑀). So, 𝐵 = {(𝐷, 𝑅), (𝐷, 𝑉), (𝑀, 𝑅), (𝑀, 𝑉), (𝑅, 𝐷), (𝑅, 𝑀), (𝑉, 𝐷), (𝑉, 𝑀)}. Event A said the cashier is from Delhi, not caring about the waiter position; event B says exactly one of the two positions is filled by a Delhiite.
𝐸𝑣𝑒𝑛𝑡 𝐶: 𝑛𝑒𝑖𝑡ℎ𝑒𝑟 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑖𝑠 𝑓𝑖𝑙𝑙𝑒𝑑 𝑏𝑦 𝑎 𝐷𝑒𝑙ℎ𝑖𝑖𝑡𝑒, so neither David nor Megha should appear anywhere. Both positions have to go to Rajesh or Veronica, so 𝐶 = {(𝑅, 𝑉), (𝑉, 𝑅)}.
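The enumeration above can be checked with a short Python sketch (my own, using first initials as in the lecture): list the 12 ordered (waiter, cashier) pairs and filter them for each event.

```python
# Sketch: sample space and events for the hiring experiment.
from itertools import permutations

applicants = ["D", "M", "R", "V"]   # David, Megha (Delhi); Rajesh, Veronica (Mumbai)
delhi = {"D", "M"}

# All ordered (waiter, cashier) pairs of distinct applicants: 12 outcomes.
S = list(permutations(applicants, 2))

A = [(w, c) for (w, c) in S if c in delhi]                         # cashier from Delhi
B = [(w, c) for (w, c) in S if (w in delhi) != (c in delhi)]       # exactly one Delhiite
C = [(w, c) for (w, c) in S if w not in delhi and c not in delhi]  # no Delhiite

print(len(S), len(A), len(B), len(C))  # 12 6 8 2
```

The counts (6, 8 and 2 outcomes) match the sets written out by hand above.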
So, hopefully this gives you one more view of how one works with probability spaces. It is a small example but hopefully illustrative: you had an experiment where something happened at random, you had to write down all the possible outcomes into a sample space, and then you had to take some events described in English and write them down precisely as subsets of the sample space. This example does not use the properties of the probability function, because we did not try to assign any probabilities; maybe in one of your practice or graded questions you can take up the probability function for this probability space and see what properties it has to satisfy.
(Refer Slide Time: 13:59)
Here is another example; okay, here we will actually get to use a property, I promise. This is an example about a fishing town. In this example we will not worry so much about the sample space or about writing down the outcomes; we will just jump into events and work with them.
We have an idea of what an event should be, and we will see how the probability of an event works and how one can estimate probabilities of events in intuitive ways. So, here is a fishing town where a few fishing boats go out to fish every day, usually going out at night and coming back early in the morning. Over the years, people who are in the fishing business have seen that the chance of catching more than 400 kilograms of fish in one day is maybe 35 percent.

How would you get estimates of things like this? Maybe you keep track of how many kilograms of fish are caught, and over the last year or so, more than 400 kilograms were caught on about 35 percent of the days. That sounds like a reasonable way to estimate these kinds of probabilities or chances; we will see later on whether it really is, but it seems reasonable. So, that is 35 percent.
Maybe there is one more event here, which is catching more than 500 kilograms of fish. So, let us define 𝐸𝑣𝑒𝑛𝑡 𝐴: > 400 𝑘𝑔 𝑜𝑓 𝑓𝑖𝑠ℎ and 𝐸𝑣𝑒𝑛𝑡 𝐵: > 500 𝑘𝑔 𝑜𝑓 𝑓𝑖𝑠ℎ. What the statements above are telling you is 𝑃(𝐴) = 0.35 and 𝑃(𝐵) = 0.10. Notice how in a problem like this, there is no sample space that is properly defined; the events just show up. They do not come from first defining the sample space and talking about its subsets. In practice you just see the situation and feel that these are reasonable events for which you need to have probabilities.
So, you have an event A which is greater than 400 kilograms, event B which is greater than 500
kilograms. Now the question is asking you what is the chance of catching between 400 and 500
kilograms? So, what I want to know is the probability for the catch being less than or equal to 500
kilograms and greater than or equal to 400 kilograms. So, this is the question that is being asked,
this is the event that we are interested in.
I am pausing here so that you get a chance to think about this for a little while; you can of course pause the video and think as long as you want. Now, notice something very interesting here: the event we are interested in is between 400 and 500, A is greater than 400, and B is greater than 500, so you will notice that 𝐵 ⊂ 𝐴, clearly. If you got more than 500, you also got more than 400.
But you can get more than 400 without getting more than 500, so B is contained in A. And what is the event we want? A must have occurred, but B must not have occurred, is it not? I got more than 400, so A happened, but I did not get more than 500; the catch is less than or equal to 500. (To be careful, I will not fuss over whether the endpoints are strict or not; it is not significant here.)
So, B did not happen, so I put 𝐴\𝐵, that is this event, now B is contained in A and you are doing
𝐴\𝐵, so, what is 𝑃(𝐴\𝐵)? This is 𝑃(𝐴) − 𝑃(𝐵) = 0.35 − 0.10 = 0.25. So, here was a simple
example where we had two events and we had to compute the probability of a third event which
was actually made by some complement, intersection, etc. 𝐴\𝐵 in this case.
So, we see that B is contained in A, and then we can simply subtract: probability of A minus probability of B. This condition 𝐵 ⊂ 𝐴 is crucial; if it is not true, then you will not get the simple subtraction 𝑃(𝐴) − 𝑃(𝐵), something else will happen.
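As a quick numeric check (my own sketch), the subtraction can be done with exact fractions to avoid floating-point noise:

```python
# Sketch: P(A \ B) = P(A) - P(B), valid because B is a subset of A.
from fractions import Fraction

P_A = Fraction(35, 100)  # P(catch > 400 kg) = 0.35
P_B = Fraction(10, 100)  # P(catch > 500 kg) = 0.10; this event is contained in A

P_between = P_A - P_B    # P(400 < catch <= 500)
print(P_between)         # 1/4, i.e. 0.25
```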
So, hopefully one more example which brought out the point of how you work with probability
spaces and events sort of directly using the properties without worrying so much about the sample
space and all that, so this kind of example is very critical, we will use these kinds of calculations
over and over again.
(Refer Slide Time: 19:42)

Here is the next example; it is about weather forecasts. Now, this is something that you hear every day, is it not? Weather forecasters always come up with these kinds of numbers. Let us say one fine day you hear this forecast for rain and temperature: the chance of rain tomorrow is 60 percent, the chance of the maximum temperature being above 30 degrees is 70 percent, and additionally one more piece of information is given, that the chance of rain and maximum temperature above 30 degrees together is 40 percent.
So, let us start writing this in terms of events. Let us say 𝐸𝑣𝑒𝑛𝑡 𝐴: 𝑟𝑎𝑖𝑛, 𝐸𝑣𝑒𝑛𝑡 𝐵: > 30°, now
𝑃(𝐴) = 0.60, 𝑃(𝐵) = 0.70, and what is this third piece of information? Rain and maximum
temperature above 30 degrees, that is 𝑃(𝐴 ∩ 𝐵) = 0.40.
And the question being asked is to compute the chance that there will be no rain and a below-30-degree maximum temperature tomorrow. So, given these details, what can you say about the probability of no rain and maximum temperature below 30 degrees? What is the desired event here? No rain is the complement of A, so 𝐴ᶜ: 𝑛𝑜 𝑟𝑎𝑖𝑛, and maximum temperature below 30 degrees is the complement of B, so 𝐵ᶜ: < 30° (we will not fuss over whether the boundary is strict or not, so that there is no confusion). So, the question asks you to find 𝑃(𝐴ᶜ ∩ 𝐵ᶜ): there should be no rain, so 𝐴ᶜ should occur, and the maximum temperature should be below 30°, so 𝐵ᶜ should also occur.
So, now here is an interesting fact about 𝐴ᶜ ∩ 𝐵ᶜ, called De Morgan's Law: 𝐴ᶜ ∩ 𝐵ᶜ = (𝐴 ∪ 𝐵)ᶜ. Saying that A does not occur and B does not occur is the same as saying that it is not the case that A or B occurs; it is the complement of "A occurs or B occurs". It is a logical statement, and you can prove it; it is called De Morgan's Law.

So, the desired probability is 𝑃((𝐴 ∪ 𝐵)ᶜ), and what do we know from the complement property? We know that 𝑃((𝐴 ∪ 𝐵)ᶜ) = 1 − 𝑃(𝐴 ∪ 𝐵), is it not? I am given 𝑃(𝐴), 𝑃(𝐵) and 𝑃(𝐴 ∩ 𝐵); if I can find 𝑃(𝐴 ∪ 𝐵), I am done, the problem is solved.
Now that is easy. What is 𝑃(𝐴 ∪ 𝐵) in terms of 𝑃(𝐴), 𝑃(𝐵) and 𝑃(𝐴 ∩ 𝐵)? I know the union and intersection property: 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵), and you can plug in all the values: 𝑃(𝐴 ∪ 𝐵) = 0.6 + 0.7 − 0.4 = 0.9, so 𝑃(𝐴ᶜ ∩ 𝐵ᶜ) = 1 − 0.9 = 0.1. So, once again a very simple example; I have been talking about how we have to go back and forth between events and English, and this is a simple example of that.
So, you have a weather forecast which gives you the chance of rain, the chance of the temperature being above a certain number, and also the chance of the intersection, both happening together, and from these you are able to compute interesting things like 𝑃(𝐴ᶜ ∩ 𝐵ᶜ). Hopefully that gave you a nice illustration of how the properties are being used.
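The chain of steps above (De Morgan's law, the complement property, the union/intersection property) can be checked with a small Python sketch of my own, using exact fractions:

```python
# Sketch: P(no rain and temp below 30) = 1 - P(A ∪ B).
from fractions import Fraction

P_A = Fraction(6, 10)        # P(rain) = 0.6
P_B = Fraction(7, 10)        # P(max temp > 30) = 0.7
P_A_and_B = Fraction(4, 10)  # P(rain and max temp > 30) = 0.4

P_A_or_B = P_A + P_B - P_A_and_B  # union/intersection property: 9/10
answer = 1 - P_A_or_B             # complement of the union, by De Morgan's law
print(answer)                     # 1/10
```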
So, notice how the properties we derived are being used very cleanly, even though we did not define the probability space very properly. Where is the sample space? How do you represent outcomes? All of that is okay; weather happens every day, so let us assume all of that is implicitly defined somehow. Just knowing the properties is enough to work with events directly.
(Refer Slide Time: 24:53)
So, this is the last example I want to present in this lecture. In this problem you are given a sample space 𝑆 = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒} and two events 𝐸 = {𝑎, 𝑏, 𝑒} and 𝐹 = {𝑏, 𝑐}. This is an abstract sample space, just a representation, and the question is what all you can build from E and F.

So, let us try to do this systematically. You have 𝐸 = {𝑎, 𝑏, 𝑒} and 𝐸ᶜ = {𝑐, 𝑑}, and you have 𝐹 = {𝑏, 𝑐} and 𝐹ᶜ = {𝑎, 𝑑, 𝑒}, and now we start taking unions and intersections.
So, you can take the interesting unions and intersections (the union of a set with itself is not interesting; 𝐸 ∪ 𝐹 is). You have 𝐸 ∪ 𝐹 = {𝑎, 𝑏, 𝑐, 𝑒} and 𝐸 ∩ 𝐹 = {𝑏}; then 𝐸 ∪ 𝐹ᶜ = {𝑎, 𝑏, 𝑑, 𝑒} and 𝐸 ∩ 𝐹ᶜ = {𝑎, 𝑒}. Now what else can we do? We can do the combinations involving 𝐸ᶜ: 𝐸ᶜ ∪ 𝐹 = {𝑏, 𝑐, 𝑑} and 𝐸ᶜ ∩ 𝐹 = {𝑐}.
Finally, 𝐸ᶜ ∪ 𝐹ᶜ = {𝑎, 𝑐, 𝑑, 𝑒} and 𝐸ᶜ ∩ 𝐹ᶜ = {𝑑}. I think this is all you will get from these simple pairwise combinations of E and F. Further complements just bring you back into this collection, and you can keep taking unions of these events and check for yourself whether anything new turns up; that is a good exercise.
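All these combinations can be verified mechanically with Python's built-in set operations (a sketch of my own):

```python
# Sketch: checking the event combinations for S = {a, b, c, d, e}.
S = {"a", "b", "c", "d", "e"}
E = {"a", "b", "e"}
F = {"b", "c"}

print(sorted(S - E))              # ['c', 'd']            E complement
print(sorted(S - F))              # ['a', 'd', 'e']       F complement
print(sorted(E | F))              # ['a', 'b', 'c', 'e']  E union F
print(sorted(E & F))              # ['b']                 E intersect F
print(sorted(E | (S - F)))        # ['a', 'b', 'd', 'e']  E union F-complement
print(sorted(E & (S - F)))        # ['a', 'e']            E intersect F-complement
print(sorted((S - E) | F))        # ['b', 'c', 'd']
print(sorted((S - E) & F))        # ['c']
print(sorted((S - E) | (S - F)))  # ['a', 'c', 'd', 'e']
print(sorted((S - E) & (S - F)))  # ['d']
```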
So, that is this example. We saw some four or five examples.
I urge you to look at the book and at your practice and graded assignments; this skill is very important. What skill am I talking about? The skill of looking at a problem which describes a probability space, maybe events in that space, maybe some of their probabilities, maybe some manipulation of the events, with everything typically given in English. It will say "chance of rain tomorrow", etc. Can we do something interesting in that probability space? Can we write down those events, think of other events obtained from them, and find their probabilities? All of these are very interesting, simple problems, and it is good to practice them as much as possible. I encourage you to look at the book and also the practice and graded assignments. All the best.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.8
Distributions

Hello everyone, welcome to this lecture. We have been talking about basic concepts once again; a lot of it hopefully is familiar to you from your previous Statistics I course, but hopefully you are getting a good opportunity to brush up the ideas once again. We have seen what a probability space is, we have started working with it, we have started computing probabilities of events, and we have started using properties of these probabilities: what happens when you take a complement, a disjoint union, a union, an intersection, etcetera.

So, we will proceed with this study, but before that there is a very important notion called a distribution. If you look at a sample space, the probability of the entire sample space is 1; that is what the first axiom tells you. And then there are outcomes and events, and this probability of the sample space gets distributed over the outcomes in some sense. That is the distribution: it gives you a sense of how the probability is distributed over the outcomes, and it is a way to describe the probability function.

So, we saw how when you toss a coin, the only thing the probability function has to do is assign a probability to heads; everything else takes care of itself. Probability of heads is something, probability of tails is 1 minus that, and that is it. There is nothing else that can happen. Now, there were only two outcomes there, so it was easy to do. What happens if there are more? How is this notion of distribution going to work when there are more outcomes? That is the topic of this lecture, so let us get on with it.
(Refer Slide Time: 01:49)

So, I am going to start with an example, which is throwing a die. When you throw a die, the sample space has six possible outcomes: 1, 2, 3, 4, 5, 6. We will assume that all the individual outcomes are events: {1} is an event, {2} is an event, {3} is an event, and so on. We have to have all those events, otherwise it would not make sense, and if you think about it, once you have all the individual outcomes as events, you can take unions and whatever else, and that makes every subset an event. So, that is the sample space we are dealing with here.

Now once you have every individual outcome as an event, every event has a probability. The
probability function has to put some value on each of these events so, let us say it puts the value
𝑝1 on the event 1, 𝑝2 on the event 2, 𝑝3 on the event 3, so on. So, you have 𝑝1 , 𝑝2, 𝑝3 , 𝑝4 , 𝑝5 , 𝑝6 .
There are six values, all of them are between 0 and 1 and that is what the probability function
would assign to each of these outcomes, each outcome is an event and to each of these events
the probability function is assigned a probability, 𝑝1 to 𝑝6 .

Now these individual events {1}, {2}, {3}, {4}, {5}, {6} are all disjoint, and their union gives you 𝑆 itself. That is a very simple thing to observe: each individual outcome is an event, they are all disjoint, and you take their union to get the entire sample space. So, now axiom 2 kicks in, and it tells you 𝑝1 + 𝑝2 + 𝑝3 + 𝑝4 + 𝑝5 + 𝑝6 = 1. And that is it; you do not need any other condition. Each of the 𝑝𝑖 has to be between 0 and 1, and they have to add up to 1. So, this is the notion of a distribution.
You have a probability of 1 on the entire sample space. You have a whole bunch of individual
outcomes, you simply distribute that 1 over all of these that probability of 1, the total probability
of 1 over each of these individual outcomes, each guy gets a fraction, they all have to add up
to 1. That is it. That is a complete description of the probability space of the probability
function.

Now a very simple example is when the die is fair. If the die is fair, all the 𝑝𝑖 are equal; if they all have to be equal and they have to add up to 1, each one is 1/6. That situation is called equally likely outcomes, or a uniform distribution: you have distributed the total probability of 1 uniformly over all the outcomes. So, that is a simple example to give you an idea of how distributions work.

(Refer Slide Time: 04:33)

So, let us look at the continuation of that. I told you the distribution is completely sufficient to specify the probability function, but I have not told you how to find the probability of events. You have a probability for each individual outcome; what if I give you a more complicated event? Let us say the event is {1, 3, 5}. How do you go about finding its probability? It is actually very easy, because any event can be split up into a disjoint union of individual outcomes, is it not? {1, 3, 5} = {1} ∪ {3} ∪ {5}, and for a disjoint union of events, you know that axiom 2 kicks in; you simply have to add the three probabilities, so you have 𝑝1 + 𝑝3 + 𝑝5.

So, once you assign your distribution, once you take the total probability of 1 for the sample space and distribute it over the individual outcomes, you are done specifying the probability function. Any event you give me, I can tell you its probability. The probability function becomes fully specified when you specify the distribution. You will see this word "distribution" used over and over again when people describe probability spaces; in fact, often nobody will say "probability space", they will just say "distribution".

I am working with this distribution, I am working with a normal distribution, I am working with a uniform distribution, some such distribution they would say, and that completely defines the entire probability space for you by defining the probability function properly. So,
this notion is very, very crucial and important and it makes it very simple to satisfy the axioms
correctly without any problem. And it also gives you a tool or a simple way to compute
probability for any event.

Now let us specialize to the particular case where you have equally likely outcomes, a fair die. The die is fair, so each of the 𝑝𝑖 is 1/6. Now what happens to the probability of the event {1, 3, 5}? It is just 1/6 + 1/6 + 1/6; how many times do you have to add? As many times as there are outcomes in the event.

So, 𝑃(𝐸), the probability of any event under the uniform distribution with equally likely outcomes, is simply the number of outcomes in 𝐸, the size of 𝐸, divided by the size of 𝑆, that is it. Probability calculations for events become automatic in this uniform distribution. When you have a finite set of outcomes and a uniform distribution on them, the probability of an event 𝐸 is simply

𝑃(𝐸) = (number of outcomes in 𝐸) / (number of outcomes in 𝑆).

So, it is very simple in the fair die case, and all of this can be very easily generalized.
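The recipe above, assign a probability to each outcome and then sum over the outcomes of an event, takes only a few lines of Python (my own sketch):

```python
# Sketch: a distribution on the die's sample space and event probabilities.
from fractions import Fraction

S = [1, 2, 3, 4, 5, 6]
dist = {s: Fraction(1, 6) for s in S}  # fair die: uniform distribution
assert sum(dist.values()) == 1         # the probabilities must add up to 1

def prob(event):
    """P(E) = sum of the probabilities of the outcomes in E (axiom 2)."""
    return sum(dist[s] for s in event)

print(prob({1, 3, 5}))  # 1/2
print(prob(set(S)))     # 1, the whole sample space
```

Replacing `dist` with any non-uniform assignment that still sums to 1 leaves `prob` unchanged; that is exactly the sense in which the distribution fully specifies the probability function.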
(Refer Slide Time: 07:00)

So, let me just quickly summarize. The idea of distributions is to assign probabilities to each of the individual outcomes in your sample space. Of course, there is the question of when this is possible. It is possible only when you can identify and count the outcomes one after the other; only then can you go through them and give each one some probability. If you cannot count them 1, 2, 3, etcetera, then you cannot do this.

So, this distribution idea will not work in such an easy way for other sample spaces; sample spaces where it does work are called countable sample spaces. It sounds very complicated, but you can just think of a finite sample space: any finite sample space is definitely countable, you can go first, second, third, etcetera, give each outcome some probability, and finish it off. So, that is the idea of a distribution. A particular example is the uniform distribution on a finite sample space: you have a finite number of outcomes, whatever they may be, and you simply say you have a uniform distribution on those outcomes.

That defines the probability space for you, that defines the probability function for you. You
have equally likely outcomes, every outcome is equally likely, so probability of any one
outcome is 1 divided by the size of the sample space, probability of any event is number of
outcomes in the event divided by the total number of outcomes in the sample space, a very
simple probability space to work with. Of course the problems themselves can be complicated,
so the sample space can get hopelessly complicated.
We need more tricks to simplify such problems, but still, this notion of describing a distribution is very important. Of course, the distribution need not always be uniform; you may have a non-uniform distribution, in which case things are more painful, but at least with the uniform distribution this is very easy to do. So, this is something to remember.

(Refer Slide Time: 08:59)

So, we are going to see a bunch of problems next, as is usual in these lectures: we describe an idea and then a few problems to drive the idea home. You will have more problems in your practice and graded assignments, and I do not have to repeat myself for the 10th or 20th or 30th time: doing problems is the only way you learn. So, let us get started.

So, the problem here is very simple. There are marbles in an urn, 5 red and 8 blue, and you pick a marble from the urn at random. The marbles are 𝑅1, 𝑅2, 𝑅3, 𝑅4, 𝑅5 and 𝐵1, 𝐵2, 𝐵3, 𝐵4, 𝐵5, 𝐵6, 𝐵7, 𝐵8, and you are, let us say, closing your eyes and picking one at random. How many outcomes are there? The sample space consists of all these marbles: it could be any one of 𝑅1 to 𝑅5 and 𝐵1 to 𝐵8.

And what is the distribution? We are going to assume the distribution is uniform. How is that conveyed? What in the problem conveys the uniform distribution to you? Typically, when somebody describes the experiment in English, they will not say "it is uniformly distributed"; they will use a phrase like "pick a marble from the urn at random". This phrase "at random" suggests to you that the distribution is uniform.

So, lookout for these phrases. They are very common. They are slightly nonmathematical as in
it in English, but you should just read it and understand that when somebody says I am doing
something at random, they imply that there is a finite set of possibilities and I am going to pick
one uniformly at random. So, the uniform distribution is implied. Some of them, some
questions they may emphasize the word “uniformly” but sometimes even if they say “at
random” then not given anything else, you would assume uniform, is not it? There is no other
thing.

So, how many possibilities are there? 13 in total, and you may be interested in an event like getting a red marble. The probability of that event has, in the denominator, the total number of outcomes, the size of 𝑆, which is 13, and in the numerator the number of favorable outcomes, which is 5; so 𝑃(𝑟𝑒𝑑) = 5/13. It is quite easy to compute these kinds of probabilities with a uniform distribution: just count, that is all. So, what is the probability of blue? It is going to be 8/13. This is the ease with which one can work with a uniform distribution; notice again how the phrase "at random" suggested the uniform distribution to you. Hopefully this was simple.
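The counting can be handed to Python (a sketch of my own):

```python
# Sketch: uniform distribution over 13 marbles; P(red) and P(blue) by counting.
from fractions import Fraction

marbles = [f"R{i}" for i in range(1, 6)] + [f"B{i}" for i in range(1, 9)]  # R1..R5, B1..B8

p_red = Fraction(sum(m.startswith("R") for m in marbles), len(marbles))
p_blue = Fraction(sum(m.startswith("B") for m in marbles), len(marbles))

print(p_red, p_blue)  # 5/13 8/13
```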

(Refer Slide Time: 12:09)

Let us move on to the next problem. This one is slightly more complicated, just because the sample space is larger. You are throwing two dice; each shows a number from 1 to 6, and the question asks for the probability that the sum of the two numbers is 8. The problem does not spell it out, but let us try to write down the outcomes. The best way to write an outcome is as a pair, so you can have (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), then (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), and I am not going to write everything out for you; you go all the way to (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6).

So, you have a lot of outcomes; the size of 𝑆 is 36, the 36 possible outcomes when you throw two dice. And we are going to assume, once again, the uniform distribution. What in this problem suggests the uniform distribution? The problem says two fair dice are thrown; since the dice are fair, you expect the distribution to be uniform.

Now, what is the probability that the sum of the two numbers is 8? The event 𝐸 I am interested in is that the sum of the two numbers is 8. (1, 1) does not qualify, and nothing up to (1, 6) does. The first pair that qualifies is (2, 6); the next is (3, 5), then (4, 4), then (5, 3), and then (6, 2). Is that okay?

So, what is the probability of 𝐸? You have five outcomes that are favorable to you in this event, and there are 36 total outcomes. So, 𝑃(𝐸) = 5/36; that is the probability that the total is 8. So, you can calculate

like this for anything else. Instead of 8 if I give you 10, you know what to say. If I give you 20
you know what to say. If I give you 1, you know what to say. If I give you 2, you know what
to say.

So, just enumerate the number of cases that are favorable to you and divide by 36, you will get
the answer for this problem. Simple enough problem, but hopefully it gives you an illustration
of this uniform distribution.
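The enumeration can be automated (my own sketch), which makes it easy to redo the count for any target sum:

```python
# Sketch: 36 equally likely outcomes of two fair dice; count those with a given sum.
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))  # all ordered pairs (first die, second die)

def prob_sum(target):
    """Probability that the two numbers add up to `target` (uniform distribution)."""
    favorable = [pair for pair in S if sum(pair) == target]
    return Fraction(len(favorable), len(S))

print(prob_sum(8))   # 5/36
print(prob_sum(2))   # 1/36, only (1, 1)
print(prob_sum(20))  # 0, impossible with two dice
```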
(Refer Slide Time: 15:01)

So, here is another problem. This is from your textbook, the book by Siva Athreya, Deepayan Sarkar, and Steve Tanner. They describe a situation where someone living in an apartment has lost their key. They go to the security person in the apartment complex and say, "I have lost my key, what can you do?" The security person has 50 keys, maybe one for each of the 50 apartments in the complex, and is not able to distinguish between them; he does not know which key fits which apartment.

So, the person of interest is trying out one key after another in a sequence. He takes the first
key, tries it. If it works, what is he going to do? He is going to stop. If it does not work, he is
going to go to the next one. If it does not work, he is going to go to the next one, next one, next
one until he finds the key that works, and he is going to do that in order. Now, any of these keys
could be the one that fits this person's apartment, and that is where the randomness comes in.

So, if you want to look at outcomes, how would you write one down? The outcome could be a
tick (✓), meaning success at the very first try: the first key he tried was the right key. Or it could
be a cross and then a tick (×, ✓), or (×, ×, ✓), or (×, ×, ×, ✓), and so on; the symbols within each
outcome are separated by commas. Each of these is an outcome, so in your sample space you
have the first outcome, the second outcome, the third outcome, the fourth outcome, and the list
goes on and on.
What will be the last one? The last one will have the × repeated 49 times before the tick. If I
were to own that apartment, I would joke that this last outcome is guaranteed to happen with
probability one: I would try the keys in sequence and, I am sure, the very last key would turn out
to be the right one. But that is just my own experience with how lucky I think I am.

That, of course, is not a reasonable assumption to put on this probability space. Is a uniform
assumption reasonable here? I would say yes. Any one of these outcomes can happen; all of
them are equally plausible in some sense. The person does not know where the right key is; it
could be anywhere, and you are trying the keys from one side, one after the other, so the
uniform distribution is the best model we can come up with.

So, hopefully this gives you an idea of how a very different looking sample space still has a
uniform distribution and it is something nice to look at. You may want to contrast this with so
many other cases but this is something interesting. It is in your textbook as well.
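One way to convince yourself that the uniform assumption is sensible: because the keys are tried in a fixed order, the attempt on which the search ends is exactly the position of the correct key, which is equally likely to be any of the 50 positions. A quick simulation sketch (the function name and trial count are my own choices):

```python
import random

def attempt_found(n_keys=50):
    """The right key is equally likely to be in any position; since keys
    are tried in a fixed order, the attempt on which it is found is just
    its (uniformly random) position, from 1 to n_keys."""
    return random.randrange(n_keys) + 1

counts = [0] * 50
for _ in range(100_000):
    counts[attempt_found() - 1] += 1

# each of the 50 attempt numbers should occur roughly 100000/50 = 2000 times
print(min(counts), max(counts))
```

The minimum and maximum counts should both come out close to 2000, which is what a uniform distribution over the 50 outcomes predicts.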

(Refer Slide Time: 18:06)

So, the last question is again something I picked up from your textbook. It is very interesting,
and it has a different way of thinking about it. There are three people, let us say three people
from a cricket team, and there are three hats, and the three hats get mixed up. They are identical
hats; there is nothing to distinguish between them.

And so notice what happens. There are three people, 𝑃1, 𝑃2, 𝑃3. These are the three persons,
and their hats get mixed up, and various possibilities can happen now. What are the possibilities? It
could be that the hats get mixed up and each person picks up a hat at random. Maybe 𝑃1 picks up
𝐻1, 𝑃2 picks up 𝐻2, 𝑃3 picks up 𝐻3; that can happen. Or the first person still gets 𝐻1 but the
other two swap, giving 𝐻1, 𝐻3, 𝐻2. Or the first person gets 𝐻2, and then the others get 𝐻1, 𝐻3,
or 𝐻3, 𝐻1. Or the first person picks up 𝐻3 and then the other two pick up 𝐻1, 𝐻2, or the first
person picks up 𝐻3 and the others pick up 𝐻2, 𝐻1. So, that is it; those are all the possibilities.
If you write each of these down as an outcome, you get all six outcomes. This is your sample
space.

So, the three hats 𝐻1, 𝐻2, 𝐻3 got mixed up and each person picked up a hat. The first person
could pick up 𝐻1, the second person 𝐻2, the third person 𝐻3, or any permutation of 𝐻1, 𝐻2, 𝐻3
is possible. So, a lot of people, when they write down the sample space, simply drop the 𝐻 as
well.

What is the 𝐻 doing here? It is just additional ink that is wasted. What is really important is
the 1, 2, 3 and the sequence in which they come. So, quite commonly people just use the
permutations: the first outcome is the permutation 1, 2, 3 in that order, the second is the
permutation 1, 3, 2, and so on.

So, these are the outcomes, and I think a uniform distribution is very reasonable here. We have
said each person picks a hat at random, so it is very reasonable to expect a uniform distribution
over each one of these outcomes. That is okay. So, now the question that is asked is: what is the
probability that none of the persons gets their own hat? That is the event 𝐸 I am interested in.

None of them should get their own hats. So, if 𝐻1 shows up in the first place, that outcome is
ruled out, because the first person got his own hat. If 𝐻2 shows up in the second place, that
outcome is also ruled out. If 𝐻3 shows up in the third place, that is also not interesting.

So, it looks like you have very few cases, is it not? The first one is ruled out, the second one is
ruled out, and the third one is also ruled out. The fourth one, 𝐻2, 𝐻3, 𝐻1, is okay. What about
the next one, 𝐻3, 𝐻1, 𝐻2? That is also okay. What about the last one, 𝐻3, 𝐻2, 𝐻1? No, because
𝐻2 is in the same place. So, that is it: there are only two outcomes favorable to this event, so
𝑃(𝐸) = 2/6 = 1/3. With probability 1/3, none of them are going to get their own hat.

Now, this problem is quite simple; the way I did it, with just three persons, it seems very, very
simple. I want you to think about 30 persons. When would 30 persons have identical hats? I do
not know; it could be some reunion or something where all of them were given a hat, were asked
to put it in some place, and then have to go back and pick one up, and they realize nobody
knows which their original hat was.

So, they start picking them up at random. If you have 30 persons, unfortunately the number of
permutations of 30 is really, really large; 30 factorial is a very, very large number. But I might
still be interested in this probability: what is the probability that none of them gets their own
hat? This could be something of interest to me. Do you think there is a reasonable way to
compute it?

You cannot, of course, write down all the possibilities; it would probably take the rest of the
universe's lifetime to finish that way. So, we need something smarter, and maybe it is possible.
I urge you to think about it; that is an open problem for you: how can you compute the same
probability when there are 30 persons instead of 3?

Such permutations, where no element is in its original place, are called derangements, and this
is the probability of a derangement. There are lots of solutions, I am sure; you can search on
Stack Exchange and elsewhere, where a lot of people have written about it. Try to understand
and look up how to find the probability of derangements. It is an
interesting problem.
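For the curious, the standard inclusion-exclusion answer is 𝐷ₙ/𝑛! = Σₖ (−1)ᵏ/𝑘! for 𝑘 from 0 to 𝑛, which tends to 1/e ≈ 0.37 as 𝑛 grows. A short check, assuming that formula (which you can verify from the sources you look up):

```python
from fractions import Fraction
from math import factorial

def derangement_prob(n):
    """P(no one gets their own hat) among n people, via the
    inclusion-exclusion series sum_{k=0}^{n} (-1)^k / k!."""
    return sum(Fraction((-1) ** k, factorial(k)) for k in range(n + 1))

print(derangement_prob(3))          # 1/3, matching the three-hat answer
print(float(derangement_prob(30)))  # about 0.3679, very close to 1/e
```

So for 30 persons the answer is essentially 1/e, and no enumeration of 30! permutations is needed.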

That is the end of this lecture. We hope you saw the notion of a distribution; in particular, we
saw the example of a uniform distribution and how to start working with it: write down the
outcomes, write down the favorable outcomes, and the number of favorable outcomes divided
by the total number of outcomes is the probability of an event under the uniform distribution.
Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.9
Definition of Conditional Probability

(Refer Slide Time: 00:13)

So, here is the example. In this example, we are going to go from the original distribution to
the conditional distribution. We will use the formula and compute; this is just to illustrate
the definition, just to give you an example. So, throw a die. You have the uniform distribution
on the set {1, 2, 3, 4, 5, 6}, and we have assigned probability 1/6 to each outcome, and you have
the event 𝐸. Event 𝐸 is {2, 4, 6}; let us say it is the event of getting an even number. So, 𝑃(𝐸)
is easy to write down. It is 1/2, you know that.

So, if we move to the conditional probability space given 𝐸, you have {2, 4, 6} as your sample
space and then you have all the events inside it. You could take {2} given 𝐸: what is the
probability 𝑃({2} |𝐸)? It is 𝑃({2} ∩ 𝐸)/𝑃(𝐸), and {2} ∩ 𝐸 is {2} itself, so it is
𝑃({2})/𝑃(𝐸) = (1/6)/(1/2) = 1/3. So, these are just examples to illustrate the formula
𝑃(𝐴 |𝐵) = 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵). How do you use that
formula?
So, for 𝑃({4} |𝐸) you do the same calculation. And if you were to ask for 𝑃({1} |𝐸), note that
{1} is not even a subset of the conditional sample space; what happens is that when you take the
intersection with 𝐸, you get the null set, and then the probability automatically goes to 0.

So, the conditional probability space, the definition, interacts very nicely with the original
probability space, even if you go wrong in the event, you pick an event and the conditional
space is not even there, it will deal with it very smoothly because the intersection with 𝐸 will
just kill it. It will make a null space. So, you do not have to change any formula or anything.
So, it is a very nice neat interaction between the conditional space and the original space.

Similarly, {3} given 𝐸 and {5} given 𝐸 are also 0; {4} given 𝐸 and {6} given 𝐸 are 1/3. And you
can also calculate larger events.

So, what about {2, 5} given 𝐸? You might say {2, 5} is not a valid conditional event, but that
does not matter: the 𝐴 in the formula is any event in the original space, so you can always
define a conditional probability for it. You take {2, 5} ∩ {2, 4, 6} = {2}, so you get
𝑃({2, 5} ∩ {2, 4, 6})/𝑃(𝐸) = 𝑃({2})/𝑃(𝐸) = (1/6)/(1/2) = 1/3.

So, you get the same answer. So, what about {2, 3, 4}? When you intersect {2, 3, 4} with
{2, 4, 6}, you have a two-element intersection, {2, 4}, whose probability in the original space is
2/6 = 1/3, so the conditional probability is (1/3)/(1/2) = 2/3. So, that is also fine. This
is how you calculate with a definition like that: you are given an original space, you are
given an event 𝐸, you are conditioning on that event 𝐸, and this is how you calculate the probabilities.
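The whole example can be reproduced mechanically from the formula 𝑃(𝐴 |𝐵) = 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵); here is a sketch in Python (`prob` and `cond_prob` are names I am choosing):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = {x: Fraction(1, 6) for x in omega}   # uniform: 1/6 to each outcome

def prob(event):
    """Probability of an event (a subset of omega)."""
    return sum(P[x] for x in event)

def cond_prob(A, B):
    """P(A | B) = P(A ∩ B) / P(B); intersecting with B automatically
    kills the part of A outside the conditional sample space."""
    return prob(A & B) / prob(B)

E = {2, 4, 6}  # the conditioning event: an even number
print(cond_prob({2}, E))        # 1/3
print(cond_prob({1}, E))        # 0, since {1} ∩ E is empty
print(cond_prob({2, 5}, E))     # 1/3
print(cond_prob({2, 3, 4}, E))  # 2/3
```

Note how the "smooth interaction" described above shows up in code: events like {1} or {2, 5} that stick out of 𝐸 need no special handling, because the intersection takes care of them.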

In practice, it is very unlikely that you will use the definition this directly; only in quiz problems
and practice problems may you do this. In reality, the way conditional probability occurs is in
a sequence of steps: you condition on the first step and go to the second step, condition on the
first and second steps and go to the third step, and so on. That is how you will condition over
and over again. So, let us see a simple example of that.
(Refer Slide Time: 03:53)

A very nice example that one can give, a very simple two-step process is these two urns with
colored marbles. So, instead of just blindly writing down, I thought I will make a picture here
with two urns and a bunch of red color marbles and blue color marbles. What is my experiment
now? I am going to pick an urn at random. Notice there are two steps here now, so you first
pick an urn at random and then you pick a marble at random from the chosen urn. It is not that
all of them are put together and I pick one at random. Then I have only one action. Here I do it
in two steps.

In the first step I pick an urn; it could be urn 1 or urn 2, and that I do at random. Then, once I
pick an urn, I go in and pick a marble from that urn at random. So, that is what I do here. And
notice how easily the conditionals work out. You do not have to write out the sample space;
you can try to do all that, but it is quite easy to see directly. When you condition on urn 1, what
are you conditioning on? The urn you picked is urn 1, so you have gone into urn 1. So, the
conditional sample space is very clear: it is just picking a marble at random from urn 1.

That is your conditional sample space, is it not? You do not need to worry so much about the
entire initial sample space; it is enough to write down the conditional sample space, and you
can directly find the probability of red given urn 1. There are seven red marbles there, out of a
total of 13 marbles, so the probability of red given urn 1 is 7/13. The probability of red given
urn 2 is, again, 5/13. Notice how natural the conditioning is to do: you move naturally to the
conditional space, and the conditional space is very, very easily laid out for you.
The probability of blue given urn 1 is 6/13, and the probability of blue given urn 2 is 8/13. So,
notice a curious thing: 7/13 and 6/13 add to one, and 5/13 and 8/13 add to one. Is this an
accident, or is it guaranteed by the axioms of probability? I would say it is guaranteed, is it not?
When you pick in the conditional probability space, you should get either a red marble or a
blue marble.

So, those two events together cover the entire conditional sample space of outcomes, and they
are disjoint, so their probabilities together should give you 1. Likewise, the probabilities of the
other two together should give you 1. These two are complements in the conditional probability
space; let me write that a bit more clearly: complements in the conditional space.

So, the probabilities had better add up to 1, because the conditional space is still a probability
space, is it not? It has to respect the axioms; it has to respect all the properties you have for
the original probability space. Complements have to be 1 minus each other, disjoint unions have
to add, 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵); all of those properties have to be satisfied even
in the conditional space, and you can use all of that.

And you can see how naturally the first step and the second step come, and how naturally you
can use conditional probability without worrying so much about the sample space and all the
theory and equations. This is the skill I was talking about; this is the comfort you need to pick
up when you work with conditional probability.
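A simulation sketch of the two-step experiment, with the marble counts read off the picture as described above (7 red and 6 blue in urn 1, 5 red and 8 blue in urn 2):

```python
import random

urns = {1: ["red"] * 7 + ["blue"] * 6,    # urn 1: 7 red, 6 blue
        2: ["red"] * 5 + ["blue"] * 8}    # urn 2: 5 red, 8 blue

def draw():
    urn = random.choice([1, 2])           # step 1: pick an urn at random
    return urn, random.choice(urns[urn])  # step 2: pick a marble from it

n = 100_000
samples = [draw() for _ in range(n)]
# conditioning on urn 1 = keeping only the runs where urn 1 was picked
in_urn1 = [marble for urn, marble in samples if urn == 1]
print(in_urn1.count("red") / len(in_urn1))  # should be near 7/13 = 0.538...
```

The conditional probability shows up in the simulation as a relative frequency within the restricted set of runs, which is exactly the "smaller sample space" idea.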
(Refer Slide Time: 07:28)

So, let us move on and see a few more examples, maybe slightly more complex than the
previous ones. These kinds of examples also illustrate the power of conditioning: you will see
that without conditioning you would have to do all sorts of things and keep track of everything,
whereas conditioning one step after the other makes things so much easier. So, here is a class
with 15 students from multiple states, three states in particular: 4 from State 1, 8 from State 2,
and 3 from State 3.

So, three different students are picked at random, usually one after another, and that is
something you can assume here: you pick three students at random, one after another. The
question asks: what is the probability that the selected three are from State 1, State 3, and State
1 again, in that order? So, the first one you select is from State 1, the next one is from State 3,
and the next one is from State 1 again.

Now if you were to do it without conditioning, you will have to do all sorts of things and you
can see there are multiple steps here, it is not just two steps. There is the first step where you
picked someone and then something would have happened and then conditioning on that you
have to move to the next step where you will pick one more person and then conditioning on
that, you will go to the next one.

So, the way you think of this is to write it as three events: 𝐴1 is "the first student is from State 1",
𝐴2 is "the second student is from State 3", and 𝐴3 is "the third student is from State 1 again".
The question is asking for 𝑃(𝐴1 ∩ (𝐴2 ∩ 𝐴3 )). So, let me put a bracket here and think of
𝐴2 ∩ 𝐴3 as one event.

So, taking 𝐴1 as my 𝐵 and 𝐴2 ∩ 𝐴3 as my 𝐴, I want to write this as 𝑃(𝐵)𝑃(𝐴|𝐵), which is
𝑃(𝐴1 )𝑃(𝐴2 ∩ 𝐴3 |𝐴1 ). Notice how I wrote this out: the second event is the bigger combination
𝐴2 ∩ 𝐴3 ; it is still big, not just one event, but an intersection of two events.

Now this 𝑃(𝐴1 ), notice, is very easy to do. What is 𝐴1 ? The first student is from State 1; it is
just the first student you are picking. There are 4 from State 1, 8 from State 2, 3 from State 3,
so what is this probability? It is 4/15; anybody can tell you that. Now the question is what to
do with 𝑃(𝐴2 ∩ 𝐴3 |𝐴1 ). This looks a little complicated, but remember it lives in the
conditional probability space given 𝐴1 , which is also a probability space, so I can apply the
same conditioning trick that I used before.

So, let me write down how I do this. Within the conditional space given 𝐴1 , treat 𝐴2 as my
new 𝐵 and 𝐴3 as my new 𝐴. I can write the first factor as 𝑃(𝐵), but notice it is not just 𝑃(𝐵).
Why? Because I am conditioning on 𝐴1 , I have gone to the smaller space, so I have to write it
as 𝑃(𝐵|𝐴1 ), that is, 𝑃(𝐴2 |𝐴1 ).

Next, what do I do? I write the probability of 𝐴3 conditioned on both 𝐴1 and 𝐴2 having
already happened, so what you condition on there is 𝐴1 ∩ 𝐴2 . This is a little bit of a trick.
Notice what I am doing: I am in the initial probability space, and I did the first round of
conditioning. But then I found I still have an intersection going on, 𝐴2 ∩ 𝐴3 ; I have a step two
and then a step three.

So, once I move to the conditional space, I can do a further conditioning. I take 𝐴2 first and
split 𝐴2 ∩ 𝐴3 , so I have 𝑃(𝐴2 |𝐴1 )𝑃(𝐴3 |𝐴1 ∩ 𝐴2 ). This repeated conditioning is a formal
result; if you look at your book, it is stated as a formal theorem and all that. I did not want to
do it in a formal way here; I just wanted to introduce it to you informally.

So, notice how easy it is to keep conditioning. 𝐴1 ∩ 𝐴2 ∩ 𝐴3 seems like a complicated event;
writing down the whole sample space is a possibility, but I can proceed by conditioning, the
divide and conquer approach: 𝑃(𝐴1 ), then 𝐴2 ∩ 𝐴3 given 𝐴1 , which is then conditioned
further. And notice how easy each piece is to do. What is 𝑃(𝐴2 |𝐴1 )? 𝐴1 has already occurred,
which means the first student has been chosen and that person is from State 1.

So, if 𝐴1 has already happened, what are the remaining students? What is the sample space
given 𝐴1 ? There are 3 from State 1, 8 from State 2, and 3 from State 3, is it not? So, in the
conditional space given 𝐴1 , you have 3 from State 1, 8 from State 2, and 3 from State 3, which
is 14 students, and 𝐴2 is the event that the second student is from State 3. So, that is going to
be 3 out of 14. Notice how easy this calculation is: you do not have to write down all possible
combinations of picking 3 from 15 and then choosing them out one after the other. You can
just go step by step, and it is very easy in every step. So far we have (4/15) · (3/14).

And then notice the last one. 𝐴1 ∩ 𝐴2 has happened already; what does that mean? The first
student chosen was from State 1 and the second student chosen was from State 3. So, given
𝐴1 ∩ 𝐴2 , you have 3 from State 1, 8 from State 2, and 2 from State 3, which is 13 students in
total. And what is the probability of 𝐴3 , picking someone from State 1, given that the first two
picked were from State 1 and State 3? It is 3 out of 13, is it not?

So, that is the answer. If you had to enumerate all the possibilities of picking three students
from 15, imagine: that is a lot of possibilities. We are not saying it is difficult to do, but it takes
much longer, and notice how conditioning has simplified the work for you. So, it is
𝑃(𝐴1 )𝑃(𝐴2 ∩ 𝐴3 |𝐴1 ), with a little bit of thinking about how to condition further: I went into
a smaller space and then conditioned further from there. Nothing stops me from doing that; the
conditional space is also a valid probability space.

If I have an intersection of two events there, I can further condition. Any time I have an
intersection, I can condition; that is the idea. But I condition within the conditional space, and
in the further conditioning, what I condition on becomes an intersection, because both of the
earlier events have happened. That is the slightly tricky idea; think about it and it will become
clear to you. So, this probability, after breaking the whole thing down, works out to
(4/15) · (3/14) · (3/13). You can simplify this if you like and you will get the answer.

So, hopefully this showed you the power of conditioning. To me this is a very nice example of
how powerful it is to condition. If you did not condition, you would have to do so much work;
here, conditioning lets you go step by step, and in every step you are dealing with a very simple
conditional sample space, whereas the overall space seems much more complicated. Hopefully
this was useful, and hopefully you will see more problems like this in your assignments.
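The step-by-step conditioning can be written out as a small computation; each `pick` below returns one conditional probability and the shrunken class that the next step conditions on (a sketch, with names of my own choosing):

```python
from fractions import Fraction

# class composition: 4 from State 1, 8 from State 2, 3 from State 3
composition = {1: 4, 2: 8, 3: 3}

def pick(counts, state):
    """Probability that the next student picked (uniformly) is from
    `state`, plus the updated composition for the next conditional step."""
    total = sum(counts.values())
    p = Fraction(counts[state], total)
    remaining = dict(counts)
    remaining[state] -= 1
    return p, remaining

p1, after1 = pick(composition, 1)  # P(A1)             = 4/15
p2, after2 = pick(after1, 3)       # P(A2 | A1)        = 3/14
p3, _      = pick(after2, 1)       # P(A3 | A1 and A2) = 3/13
print(p1 * p2 * p3)                # (4/15)·(3/14)·(3/13)
```

The product simplifies to 6/455; the point is that each factor is computed in a tiny conditional space rather than by enumerating all ways of choosing 3 students from 15.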

(Refer Slide Time: 16:41)

So, the last example I want to do in conditional probability is a sort of deceptive problem.
People give a lot of answers that they are intuitively very happy with, but the word "conditional"
and the phrase "probability given something" mean certain specific things, and you will see
that the answer here is a bit surprising. So, let us look at the question.

There is a family with two children, and the question asked is: what is the probability that both
children are girls, given that at least one is a girl? Notice the condition. The first thing I always
notice when somebody asks this question is that the word "conditional" is dropped.

From a mathematical point of view that is unsettling, but you can see that it is "given
something", so obviously the probability they mean here is a conditional probability: the
conditional probability that both are girls, given that at least one is a girl. Notice the subtle
thing: "at least one is a girl" includes girl-boy, boy-girl, and girl-girl. Sometimes people just
miss this; the condition does not say "one particular child is a girl", and it does not say "exactly
one is a girl".

If you were told that one particular child is a girl, it would make sense that the other is a girl
with probability half or something like that. But here, what you are conditioning on is not that,
and it is not "exactly one is a girl" either; if exactly one were a girl, the other could not be a
girl, so that conditional probability would be 0. So, the conditioning here is a bit tricky; think
about why that is.

So, the sample space here is basically (girl, girl), both are girls; (girl, boy), the first is a girl
and the second a boy; (boy, girl), the first is a boy and the second a girl; and (boy, boy), both
are boys. These are the four possibilities in this problem, and since the problem does not
mention any distribution, we will assume the uniform distribution. And what is this event 𝐵
that we are interested in? At least one is a girl. There are three outcomes that qualify under "at
least one is a girl": (girl, girl), (girl, boy), and (boy, girl). So, that is the event.

So, now the question is: what is the probability that both are girls? The question asked is the
probability of the event {(girl, girl)} given 𝐵, which is 𝑃({(girl, girl)} ∩ 𝐵)/𝑃(𝐵). Now,
{(girl, girl)} ∩ 𝐵 is {(girl, girl)} itself, whose probability is 1/4, and 𝑃(𝐵) is 3/4, so you get
(1/4)/(3/4) = 1/3. So, out of the three cases where at least one is a girl, both are girls with
probability 1/3.

So, this is slightly counterintuitive. A lot of people, depending on the wording, want this answer
to be half, as if the question were "what is the probability that the other one is a girl?". It is not
the same; it is a subtly different question, and a lot of probability questions become confusing,
particularly with conditional probability, for exactly this reason.

So, when you have this kind of complexity, it is better to write down the sample space clearly.
Do not try to work without the sample space and without the conditioning event; you will get
into trouble. When you have problems of this nature, where the wording is a bit confusing, it
is better to write down the sample space, and you will always be okay.
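Writing down the sample space is easy to mechanize; here the four equally likely outcomes are enumerated and the conditional probability is read off by counting (a sketch):

```python
from fractions import Fraction
from itertools import product

omega = list(product("GB", repeat=2))  # (G,G), (G,B), (B,G), (B,B)
B = [w for w in omega if "G" in w]     # at least one girl: 3 outcomes
favorable = [w for w in B if w == ("G", "G")]  # both girls, inside B

# uniform distribution, so conditional probability is just a ratio of counts
p = Fraction(len(favorable), len(B))
print(p)  # 1/3
```

Counting within 𝐵 instead of within the whole sample space is precisely what makes the answer 1/3 rather than the intuitive 1/2.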

So, that is the end of this lecture. This was a very important lecture: it introduced the slightly
confusing and often misused idea of conditional probability, but hopefully it also showed you
its power. If you use it correctly, it can give you fantastic results in a wonderful way. It is a
very, very powerful idea, a crucial idea for the entire program, I would say. Thank you very
much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.10
Examples on Conditional Probability

Hello, and welcome to this lecture. This lecture is on conditional probability. So far we have
been talking about the probability space: we had an experiment and an outcome, then the
sample space, which was very important, then the collection of events, and then the probability
function, which needed to satisfy the axioms. We were able to do calculations with all of this,
and we have been looking at more and more interesting problems.

We are seeing more complicated sample spaces; we saw the derangement example last time,
which is a little more complicated than what we are used to, and slowly we are gaining expertise
in handling more and more complicated probability spaces. Now, conditional probability is a
hugely powerful tool. It will give you the power to take a very large probability space, break it
down into smaller pieces in various different ways, and still compute probabilities in a correct,
consistent manner.

So, it is an extremely powerful tool, easily one of the most powerful ideas in probability; a lot
of people call it the heart and soul of probability. It is very, very central to a lot of things we
will do in this class. Even the whole data science program relies a lot on this notion of
conditional probability for its theoretical backing. So, let us get started.
(Refer Slide Time: 01:41)

So, the motivation: there are various ways to motivate conditional probability, and I am going
to provide this one for you, which is a good way to think about how conditional probability
enters the picture. Quite often, an experiment is complex because there is a sequence of actions;
it is not just one action happening once and you are done.
So, I have given you examples here. You may toss a coin three times, repeating the toss; you
may throw a die two times; or you can look at the IPL example, the Indian Premier League
cricket tournament example we have been using throughout. If you think of one over in IPL,
there is the first delivery, then the second delivery, then the third delivery, and so on. Every
time you think of a complicated experiment, it appears that things repeat quite often.

So, before anything starts, before even the first action, you start with a very big probability
space. After the first action, it appears that the number of possibilities, the number of outcomes,
has shrunk a little bit. Of course it depends on what happened in the first step, but nevertheless,
things seem to have shrunk. Take an example: you are tossing a coin three times. After you
have tossed it once, the possibilities have reduced; you are only tossing two more times.

The same thing with the die: once you have thrown it once, you are only throwing it once more.
So, things have shrunk a little bit; the number of possibilities, the number of outcomes,
everything is shrinking. So, you start with an initial probability space and you observe the first
step. Now you have observed a part of your outcome in the probability space, so it is not like
before; something has changed. The question in conditional probability, in this idea of
conditioning, is: can you now work with a smaller probability space for the rest of the outcome
and still meaningfully compute probabilities in the original probability space?

So, that is the critical idea. It is sort of a divide and conquer type of approach and all these
divide and conquer type of approaches are really, really powerful. They are powerful in real
life, they are powerful in theory also. So, let us see a specific example. I have been talking
about this in a slightly more abstract way. Let us take a simple example. Toss a coin three
times. So, if you look at the initial sample space, it has got eight different possibilities. I have
written it down here, 𝐻𝐻𝐻 to 𝑇𝑇𝑇, so, all the eight different possibilities are there.

Now, if you focus on the first toss, let us say we define an event 𝐵: the first toss results in tails.
And let us say we observed event 𝐵, the first step of the experiment; event 𝐵 has occurred,
somebody has told you
that. So, the question, as we have been asking, is: can we divide now? Can we say, okay, I have
observed this, can I make my probability space smaller? We have seen that when a probability
space is smaller, it is easier to deal with and we can do computations more easily; when it is
unwieldy and big, it seems very confusing, but maybe we can break it down into steps, which
is easier to work with.

So, the answer to that question is yes, because once you have observed 𝐵, 𝐵 itself becomes the
sample space. What is 𝐵? 𝐵 is {𝑇𝐻𝐻, 𝑇𝐻𝑇, 𝑇𝑇𝐻, 𝑇𝑇𝑇}, and that itself becomes a sample
space. Why is that? Because once 𝐵 has occurred, you are inside 𝐵: one of these four things is
going to happen from now on. I know the first toss resulted in a tail, so only these four outcomes
are possible, and that set is exactly 𝐵. So, when an event occurs and you want to proceed after
that, or think about what is going to happen next, you can restrict yourself to 𝐵 itself.

When an event occurs, the event itself becomes a sample space for the subsequent steps of the
experiment. Now, what about events in the smaller sample space, and what about a probability
function for it? We have to do some redefinition, so all that is not yet clear; I am just trying to
motivate the idea. That is where conditional probability comes in: it gives you a way to deal
with this smaller sample space in a clean, meaningful way, so that it ties up nicely with the
larger sample space you started with initially.
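For the coin example, the shrinking is easy to see by listing the sets (a quick sketch):

```python
from itertools import product

# the initial sample space: all 8 sequences of three tosses
omega = ["".join(t) for t in product("HT", repeat=3)]

# observing B = "first toss is tails" shrinks it to 4 outcomes
B = [w for w in omega if w[0] == "T"]
print(len(omega), B)  # 8 ['THH', 'THT', 'TTH', 'TTT']
```

The list `B` is exactly the set named in the text, and it is what will serve as the sample space for the remaining two tosses.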

So, if you want to understand at a high level what we will do in conditional probability, this
last statement is probably a good indicator. You have an initial probability space and then you
observe an event. When you observe an event, any event, you can split the initial probability
space into the event that occurred and a sort of conditional probability space given that the
event occurred.

So, this conditional probability space we are yet to define. We have not defined it, it seems to
be a bit of a challenge. Maybe the sample space is easy to identify, but what about events, what
about probability function in that conditional sample space, conditional probability space, it is
not clear how to do that.

We will do that in the subsequent slides in this lecture. So, this is the high-level idea. We are
going to do some sort of a divide and conquer approach, and everything hinges on the fact that
we observed an event. And there is this notion that there is a sequence of actions, so even after
observing an event, you have not fully found the outcome. If observing the event had fully
determined the outcome, then there would be nothing much to happen after that.

So, your experiment should be complicated enough that after observing one event, there is still
something left to discover or some more actions that have to happen. In that case you can
shrink to the conditional probability space. This conditional probability space is a different
probability space from the original one, but the probability you calculate in it can be tied
up to the original probability. So, that is the crucial idea. Let us see how to do that. We will see
the theoretical way of doing it properly with equations and all that and then we will see a few
examples to drive home the point of how conditional probability works.
(Refer Slide Time: 07:59)

So, this is sort of a formal definition of a conditional probability space. We always have this
picture of the Venn diagram; keep that in mind, the image and the picture is very important.
It gives you an intuitive feel for what it is we are doing. So, let us start with the probability
space, what is our probability space? The familiar world of a sample space, yes, which is the
set of all outcomes, then your collection of events which could be the set of all subsets of the
sample space for now and then the probability function 𝑃.

What does the probability function do, 𝑃? It assigns to every event in your original probability
space some number between 0 and 1, and this has to be consistent with the two axioms. So, all
that is assumed when we say that. So, that is a nice thing about this definition. We saw all this
and now when I say sample space, events, probability function, you know what it all means.

Now let us say there is an event 𝐵 and it has to happen with some positive probability. If there
is no chance that the event 𝐵 occurs, if 𝐵 is a null set or something, then there is no
meaning in observing it, because you will never observe 𝐵 if it had no chance of happening.
So, 𝐵 has to be an event with probability greater than 0.

So, the picture here is good to keep in mind. You have the sample space 𝑆 which is this big
oval and then you have the event 𝐵 which is this smaller oval here. I have shown a few other
events here; I will come to that soon enough, but this is where 𝐵 is, and 𝑃(𝐵), the probability
function, assigns a non-zero probability to the event 𝐵.
So, here is the definition of a conditional probability space given 𝐵. So, from this picture we
can sort of invent or create another probability space and what is this probability space, we are
going to give it a special name, it is called the conditional probability space given 𝐵. So, you
have to imagine that this experiment happened and part of it you have observed and that part
of it is somehow connected with this event 𝐵. So, that is the idea. Now what is the sample space
of this conditional probability space given 𝐵? It is 𝐵 itself.

So, that is simple idea, 𝐵 occurred, so you have moved into this sample space, your whole
sample space has become 𝐵 itself. Now what are the events now? What are the events in this
smaller sample space? You have a whole bunch of events in the original sample
space, 𝐴1 , 𝐴2 , 𝐴3 , let us say all the possible events, we will list all of them one after the other.
The events in the new sample space are simply the intersections of the original events in the
original space with 𝐵.

If you have an event 𝐴 in the original space, you simply take 𝐴 ∩ 𝐵, so that gives you all the
events in the new probability space we are constructing. We are constructing a new
probability space, given 𝐵; we took our sample space to be 𝐵 itself, and now what happens
to our events? You take the original events and then you intersect with 𝐵. Now comes the all-
important probability function.

What probabilities will you assign to the events in this conditional probability space and what
name will you give it? So, that is what is important. The probability function that we will use
in this new probability space is 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵). So, this is a very, very crucial and important
and interesting definition. To every event in this new conditional probability space we have to
assign a probability, and this is the probability we will assign to it.

Notice the subtle change here. It is not just 𝑃(𝐴 ∩ 𝐵), why is it not 𝑃(𝐴 ∩ 𝐵)? 𝑃(𝐴 ∩ 𝐵) is the
probability in the original space. That does not really translate into the probability function in
the conditional space. You have to divide by 𝑃(𝐵). This 𝑃(𝐵) division is very, very crucial.
Otherwise notice what will happen. Your axioms will not be satisfied for this new probability
space.

What is the axiom? Look at axiom one: in the conditional probability space, the new
probability function should assign a probability of 1 to 𝐵. So, if instead of 𝐴 I put 𝐵 here,
what is 𝑃(𝐵 ∩ 𝐵)? It is 𝑃(𝐵) itself, so 𝑃(𝐵)/𝑃(𝐵) gives 1. So, this division by 𝑃(𝐵) is simply
doing an adjustment to make sure that axiom one is satisfied.

Notice why this definition is motivated very cleanly, notice quite a few things. This is a tricky
definition. You can take it as a definition and not worry about it anymore, but I want you to
think about it a little bit more. We have the original probability function 𝑃 which was assigning
probabilities to all these events, 𝐴 ∩ 𝐵, 𝐵, etcetera.

Now we define these new events inside my conditional probability space, and to every event in
this conditional probability space I am going to assign a new probability, and that one is
𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵). Notice how this becomes consistent with what a new probability space should
be. So, if instead of 𝐴 you put 𝐵, you get 𝑃(𝐵)/𝑃(𝐵), that is 1. So, with the new probability
function you have a proper probability space and it satisfies all the axioms.

So, this ratio, 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵), is denoted 𝑃(𝐴|𝐵); this bar is read as "given". So, 𝑃(𝐴|𝐵), and
it is called the conditional probability of 𝐴 given 𝐵. A lot of people will drop the word
conditional. Instead of saying the conditional probability of 𝐴 given 𝐵, they will just say "𝐴
given 𝐵". 𝑃(𝐴|𝐵) is simply the conditional probability of 𝐴 given 𝐵 and it is defined as
𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵). So, once again, this definition is also rewritten a little bit as
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐵)𝑃(𝐴|𝐵).

So, notice this subtle change in notation and the way in which we confuse people a lot. This 𝑃
was the original probability function. Now the same 𝑃 is used but instead of 𝐴 alone, we put
this 𝐴|𝐵. So, this complicated argument which is like some 𝐴|𝐵 enters this 𝑃 and that makes
it a conditional probability. So, this 𝑃(𝐴|𝐵) is called a conditional probability. It is defined as
𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵), which also is simply given by this nice little succinct formula here. So, let us
go and look at what this means in terms of the Venn diagram.
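The definition can also be checked by direct counting in the three-toss example. In the sketch below, 𝐵 is the event from the lecture (first toss is a tail), while the event 𝐴 (exactly two heads) is my own illustrative choice:

```python
from itertools import product

# Sample space of three fair coin tosses, all 8 outcomes equally likely.
S = [''.join(t) for t in product('HT', repeat=3)]

def prob(event):
    # Probability of an event (a set of outcomes) under the uniform measure.
    return len(event) / len(S)

B = {s for s in S if s[0] == 'T'}         # first toss is a tail (from the lecture)
A = {s for s in S if s.count('H') == 2}   # exactly two heads (my own choice of A)

# The definition: P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)   # 0.25: only THH has two heads among the four outcomes in B
```

Here 𝑃(𝐴 ∩ 𝐵) = 1/8 and 𝑃(𝐵) = 1/2, so the ratio is 1/4, matching a direct count inside 𝐵.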


(Refer Slide Time: 15:22)

We have this event 𝐵 which is inside the sample space 𝑆 and we have a whole bunch of other
events, 𝐴1 , 𝐴2 , 𝐴3, etcetera. What is my sample space, the conditional sample space so, to
speak? It is 𝐵 itself and what are my events inside this new conditional probability space? All
these intersections that we had, we had these intersections with 𝐴1 , 𝐴2 , 𝐴3, etcetera.

And what are the probabilities for each of these things? Let me just make sure I get
𝐴1, 𝐴2, 𝐴3 right. In the new conditional probability space given 𝐵, this one is
𝑃(𝐴1 ∩ 𝐵)/𝑃(𝐵), this guy is 𝑃(𝐴2 ∩ 𝐵)/𝑃(𝐵) (notice this 𝑃 is from the original space), and
this one is 𝑃(𝐴3 ∩ 𝐵)/𝑃(𝐵), which is 𝑃(𝐴3|𝐵).
So, if you think about it, the way I introduced it, I went from the initial space to the conditional
space. Quite often in problems you will be surprised to see that the conditional space is much,
much easier to deal with than the original space. So, you will want an event in the original
space and it will be complicated to deal with, but you will use the conditional space and the
conditional space is very obvious to deal with. You know what to do with 𝑃(𝐴|𝐵).

So, both ways are used interchangeably. Right now, the way I write it, I write 𝑃(𝐴3|𝐵) as
𝑃(𝐴3 ∩ 𝐵)/𝑃(𝐵); this is if the original probability space is easy to deal with. You know all
these things in the original space, so you can take the ratio and find the conditional
probability. Quite often what will happen is, the conditional space is very, very easy to deal
with and 𝑃(𝐵) is easy to deal with, but 𝑃(𝐴 ∩ 𝐵) is very tough in the original space.

So, people compute it like this. So, both ways it is used, and I know this is sounding a bit
abstract; when I do problems and examples, I will point out how you can move to the
conditional space, know how to calculate the conditional probability directly without doing
this, and get the original events back in a very interesting way.

So, that is how we will use conditional probability. That is the divide and conquer approach, is
not it? I want the probability in a complicated probability space, pretty big, multiple actions,
etcetera, so I simply condition, go to a smaller one, find the probability there and then come
back to the original probability space and how do I do it, how do I go back and forth between
these things, is the key.

So, initially it may sound like a lot of things going on at the same time. Once you get enough
practice, you will do this smoothly in your head. All of this conditioning, going to the
conditional space, doing the calculation there, coming back to the original one, adding it up,
etcetera; there are a lot of laws that help you here and you will do it in your head
automatically, very fast.

When you do that, you do not have to know all this theory too much. This theory is just written
down for putting it in a clear context. Once you start doing it, you will hardly keep looking at
this theory at all. You will just do it in your head and go forward. So, as we work it out, pay
attention to this; you will see how easy it is to work with conditional probability.
Statistics for Data Science II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 1.11
Bayes’ Theorem and independence

Hello and welcome to this lecture. We have been looking at the basic concepts of probability,
sort of a quick round-up of very important ideas that play a crucial role in the entire theory,
and we have been doing it in week 1. We are going to continue that study and look at Bayes
theorem and independence. Bayes' Theorem is usually just called Bayes Theorem; maybe
technically one needs to say Bayes' theorem, which is too much to say. So we just say Bayes
theorem, at least in this class.

(Refer Slide Time: 00:46)

So let us begin with a quick recap. I want to point out that many of these ideas you probably saw
in Statistics 1. So this is meant to be a repetition so that you quickly come up to speed. And
this is also week 1 of the class, so we set up the notation and everything gets settled. That is
one of the main reasons why I am sort of repeating some of the things which you probably
already know.

Anyway quick recap, we have been talking about probability space which consists of the sample
space, events and the probability function. And there are two axioms that we start with, that we
assume. And then we derived the whole bunch of properties based on the axioms and that helped
us solve problems, is not it?

So when you look at a problem, when you have a probability space and some of the
probabilities are available to you, some events you know about, how can you compute the
probabilities of other events, combinations, things like that? A few problems I showed you in
the lecture itself. Maybe you will see more problems in practice.

So that part is, that skill is very important as well. And then we saw this very, very crucial notion
of conditional probability. Oftentimes the sample space is very complex. Lots of things happen
one after the other. And given that one event occurs, you can sort of move to this conditional
probability space and do all your calculations given the occurrence of an event B. And that you
can repeat, you can keep repeating.

And that ends up simplifying your calculation in such a fundamental way that conditional
probability is hugely important in probability theory. The basic idea is the probability of
an event given another event, P(A|B). We saw a definition for it when P(B)>0, and then
there is this basic relationship P(A∩B) = P(B)P(A|B), right?

So that is the important relationship here. Oftentimes we found this was very useful to find the
probability of A intersect B. You find the probability of A and then B given A, or the
probability of B and then A given B, right? So that was easy. So next there is a set of lectures,
the few lectures that remain in this topic.

The first will be something called the Law of Total Probability, which is a law that involves
conditional probability and gives you additional ammunition to attack bigger problems,
more complex probability spaces, how to compute probability in subspaces. That is the Law of
Total Probability. And then Bayes theorem; Bayes theorem is hugely important and that again
deals with conditional probability.

So how do you use conditional probability as a tool and go back and forth between very
interesting, very many conditional spaces and the original space and what are the relationships
between these things. So the Law of Total Probability and Bayes theorem together give you a
complete set of tools to work with many interesting cases and I will also show you some
interesting examples of problems that we can tackle with this idea.

And finally there is this crucial notion of independence. We will spend quite a bit of
time studying independence and why it is very crucial, and you will see how independence
really makes problem solving so much easier, right? Dependence is hard
to deal with in practice and independence is so great, right?

So maybe I guess that is philosophically also true; in probability it is certainly true. And
finally, the last set of lectures for this week, the theory lectures, is on repeated trials of an
experiment, independent repeated trials of an experiment, and how that gives us some very nice
distributions. So that is the recap and what is to come in the module.

(Refer Slide Time: 04:41)


So let us get into the law of Total Probability which builds on what we know from conditional
probability to give us some additional tools. So that is law of Total Probability. The law itself
can be stated in this very simple fashion. Let me start with the Venn diagram. I find it always
easy to talk about the Venn diagram. Given the sample space S, let us say you have an event B,
right? So this event B sort of partitions the sample space into two halves, right?

There is B and there is B complement. So the picture hopefully makes sense to you. In the
sample space there is a set B; B and Bc are disjoint and together they make up the whole
sample space, and such things are called partitions. So B and Bc partition the sample space.
That picture is sort of shown here.

And anytime you have a partition like this, you have a situation in the probability space
where either this has to happen or that has to happen, right? That is what it means to say
B and Bc partition S: in any outcome either B should have occurred or Bc should have
occurred. So that is sort of like divide and conquer already, right? The strategy is extremely
useful in solving problems.

It comes again and again in probability, and it shows up like this. It shows up as the
partition of B and Bc. In any situation, any probability space that you are working with, if you
ever see a situation where either this has to happen or that has to happen, you can
use this law. This law is very, very powerful. You will see how.
So now I have this event B. But B is not the event that I am primarily interested in; maybe
I am interested in an event A, right? And how does that event A look? So now this event A is this
other little oval that I have drawn there. And this A will have two parts to it. Naturally this A will
split into two parts. What are the two parts that A naturally splits into? A∩B, the part of A that
is inside B, and A∩Bc, the part of A that is inside Bc.

It is quite easy to see, right? So you can see, if I get my pointer here, let me show you: you
have A∩B here, is not it? A∩B is this part. And then you have A∩Bc which is this
part. Is that okay? And these two are disjoint. So notice what has happened.
This A and B are just any two events.

It does not matter what A is, what B is, right? Anytime you have an event B, either B has to
happen or Bc has to happen. There is this natural partitioning of the sample space. And anytime
you have the partitioning of the sample space like this, any other event A also gets partitioned. A
gets partitioned into the part of A that is in B and the part of A that is in Bc.

So what do we know about the disjoint sets A∩B and A∩Bc together making up the whole set A? We
have our wonderful axiom 2 of the probability space. So P(A) = P(A∩B) + P(A∩Bc). That
comes from axiom 2 of the probability space. Now we have our intersections.

And once we have our intersections we know we can go into this conditional probability idea and
write down the probability of an intersection using conditional probability, right? So,
P(A∩B) = P(A|B)P(B). Notice this P(A|B) is actually a probability function in the conditional
probability space. So it is a sort of a different space. Keep that in the back of your mind.

Even though we keep writing P and we keep saying the same event A in this space and that
space, keep at the back of your mind that the space is different. That part is important to
know. But eventually you will get comfortable with the idea and you will go between the
conditional space and the original space back and forth very comfortably. You have to get to
that comfort level; only then do you become good at problem solving.
So P(A∩B) one can write as P(A|B)P(B). And what about P(A∩Bc)?
P(A∩Bc) = P(A|Bc)P(Bc). So P(A) = P(A|B)P(B) + P(A|Bc)P(Bc). A law like
this is called the law of Total Probability. It is a very simple law.

It looks deceptively simple, and it is extremely powerful. So many, many calculations of events
where it seems like you do not even know where to start; you can always start like this. You pick
some other event B which makes the conditioning very easy to calculate. So this is the law of
Total Probability. We will see a few examples as we go along.

(Refer Slide Time: 09:31)

So here is the first example. So in the first example I have 2 urns. I think we have seen this
before. We saw this in the context of conditional probability. And 2 urns with a mixture of red
color marbles and blue color marbles. And my experiment is to pick an urn at random and then
pick a marble inside the chosen urn at random again. So this is the experiment we have seen and
we saw that the conditional probabilities are easy to write down, right.

So these two conditional probabilities are easy to write down; red given urn 1, red given urn 2,
blue given urn 1, blue given urn 2. It is very easy to write. Now what the law of Total Probability
allows you to do is to find the probability of red. What is the probability that red is drawn, red
ball is drawn? So we use the law of Total Probability in this direct fashion. Probability of red, so
either you choose urn 1 or urn 2, right?
So this is your B and Bc, right? You either choose urn 1 or urn 2. That is what
happens in this experiment. So P(red) = P(red|urn 1)P(urn 1) + P(red|urn 2)P(urn 2),
that is it, as simple as that. So you plug it in. You get your answer, 6/13. Is that okay? So very
simple, right?

So it seems deceptively simple, but in a complex or at least a two-step experiment of this
nature, where you choose the urn first and then you choose the red or blue marble, if you want to
find the probability of the color of the marble finally chosen, you always condition on the first
step. You condition on all possibilities that can happen in the first step, then you move to the
next step, and so on. Alright, we saw this little artificial sort of problem with the 2 urns and
the colored marbles.

(Refer Slide Time: 11:16)

Let us see a slightly different sort of probability space, one which maybe occurs in real life, and
you want to look at a phenomenon which is more real life, maybe, right? So here is the problem. This
is again from your textbook, the stats textbook that we have been using for this course. This is
about an economic model and some computation of probability inside that economic model.

Let us read it. So you have an economic model which predicts that if interest rates rise there is a
60% chance that unemployment will increase. But if interest rates do not rise then there is a 30%
chance that unemployment will increase. Is that okay? So once again you can see already, you
can see the way the law of Total Probability is going to come in, the way there are two events, B
and Bc.

So there are two situations here. Either the interest rates rise or the interest rates do not rise, right?
If the interest rates rise, something happens to unemployment. If the interest rates do not rise,
the chance of unemployment increasing changes. These are things that the economist gives
you. Now you have to calculate, given this setting in the probability space, sort of roughly,
right?

So notice once again, in these kinds of problems you do not try to think of the sample space and all. You
just work with events directly. What should one calculate? The probability that unemployment
will increase. So let us see. It is quite easy to solve this with the tools that we have now, the
law of Total Probability. P(unemployment increase), I will just call this P(u.i), right, just as
a shortcut. Probability of u.i given interest rates rise, is not it?

So I will say this is my event B (interest rates rise). Interest rates do not rise, this is my event Bc,
is not it? So probability of unemployment increase given B times probability of B, plus
probability of unemployment increase given B complement times probability of B
complement: P(ui) = P(ui|B)P(B) + P(ui|Bc)P(Bc). So now the question gives you
P(ui|B) = 60%.

And what about probability of B itself? The economist believes that there is a 40 percent chance
that interest rates will rise. So that is P(B) = 0.40. Plus, what is the probability of unemployment
increase given B complement, which is interest rates do not rise? That is
P(ui|Bc) = 0.30 as given in the problem. What is P(Bc)?

Again you know from the properties: 0.4 is P(B), so 0.6 is the probability of B complement. After
this it is just arithmetic: 0.6 × 0.4 + 0.3 × 0.6 = 0.24 + 0.18 = 0.42. So that is a 42% chance
overall that unemployment will increase, whatever happens to the interest rate. So that is a
simple calculation.
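The same arithmetic, written out as a few lines of Python with the numbers from the problem:

```python
# Numbers from the economic model in the problem statement.
p_rise = 0.40              # P(B): interest rates rise
p_ui_given_rise = 0.60     # P(u.i. | B)
p_ui_given_no_rise = 0.30  # P(u.i. | Bc)

# Law of total probability over the partition {rates rise, rates do not rise}.
p_ui = p_ui_given_rise * p_rise + p_ui_given_no_rise * (1 - p_rise)
print(round(p_ui, 2))   # 0.42
```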

So you can see how the law of Total Probability is very useful, very simple. It gives you just one
more easy way to keep increasing the events for which you can calculate probabilities. In any
problem setting, any probability space that you come into, you will only be able to assign
probabilities to some events, right? Some of the events will be measurable, easy for you to set
up a survey about or ask a few people and find out.

And then once you assign probabilities to a few events you should always go off and
calculate all the other probabilities that these things imply, right? So that kind of a setup and
that kind of comfort with assigning probabilities to events and calculating probabilities for
newer and newer events is very, very useful. From whatever it is that you can actually set up in an
experiment or actually survey and find, you should be able to extend to other events of interest.

(Refer Slide Time: 15:43)

So let us get a little bit more general here. Previously we had just one event B and we said B
and Bc partition S. This can be generalized. How do we generalize? Maybe we have a more
general partition of S: S is partitioned by B1, B2, B3, B4. This is just an illustration. In
general there could be more than B4; it can go B5, B6, and keep on going.

But let us say there is a partition of S which is B1, B2, B3, B4. So what does it mean to say
partition? Seems like a very independence-time statement, right, partition of the
country or something. It is not so bad. Partition is very good, actually. So you can do divide and
conquer. It is like one of these alternatives has to happen. In any experiment, in any outcome,
either B1 has to happen or B2 has to happen or B3 has to happen or B4 has to happen. If you find
any situation like this then these events together will partition the sample space.
Additionally, to say partition, these events should be mutually exclusive. They should be
disjoint. If B1 happened, B2 cannot happen, B3 cannot happen, B4 cannot happen. None of the
other events can happen. So partition implies two things. Together, one of these things should
happen, so the union of all the subsets is the entire sample space S. And no two of
them can happen together; they are mutually exclusive, disjoint events. So if one
happens none of the others could have happened.

For instance here, if B1 happens in any outcome, if an outcome is favorable to B1, it cannot
be favorable to any other event that you have in the partition. So those are the conditions. This
is just a generalization of B and Bc. And you can see quite readily that any event A is going to
be naturally split into a disjoint partition once again, right?

So the partition once again is A∩B1, A∩B2, A∩B3, A∩B4. This is the partition. So the
probabilities will add according to axiom 2:
P(A) = P(A∩B1) + P(A∩B2) + P(A∩B3) + P(A∩B4). And then once you have these
intersections you can move into the conditional probability spaces and write them in terms of
conditional probabilities: P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3) + P(A|B4)P(B4).
Is that okay? So this is the law of Total Probability in its full generality.
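The general law is easy to write as a small helper function. The three-way partition used to exercise it below (parts from three suppliers, with A the event "part is defective") is a hypothetical example of my own, not from the lecture:

```python
def total_probability(cond, prior):
    # P(A) = sum over the partition of P(A|B_i) * P(B_i).
    assert abs(sum(prior.values()) - 1.0) < 1e-9, "P(B_i) must sum to 1"
    return sum(cond[b] * prior[b] for b in prior)

# Hypothetical numbers purely for illustration: a part comes from one of
# three suppliers (the partition B1, B2, B3), A = "part is defective".
prior = {'B1': 0.5, 'B2': 0.3, 'B3': 0.2}      # P(B_i)
cond  = {'B1': 0.01, 'B2': 0.02, 'B3': 0.05}   # P(A | B_i)

p_defective = total_probability(cond, prior)
print(round(p_defective, 3))   # 0.005 + 0.006 + 0.010 = 0.021
```

The assertion inside the helper checks the partition condition that the P(Bi) must sum to 1; the disjointness condition cannot be checked from the numbers alone and has to come from the modelling.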
(Refer Slide Time: 18:18)

So now we are going to use the law of Total Probability in a slightly more general situation. Let
us do that. So here is an interesting little question; it is again a toy question, maybe not too realistic
in practice but still interesting. There is a person who has 5 coins and 2 of
these coins are double-headed. So what is double-headed? It is like one of the coins in Sholay,
right? So have you seen this movie Sholay? You must have seen it.

There is a coin which has heads on both sides. Whatever you toss, you get heads. So those are
the 2 double-headed coins. One is double-tailed, as in both sides are tails; somebody made a fake
coin with both sides being tails. Two others are normal, normal meaning one side is heads, the
other side is tails. So there are 5 of these coins this person picks from.

It is a two-step experiment, right. So what does he do? He picks a coin at random, and then he
tosses it. So the probability that you have to calculate is: what is the chance that he will see a head?
Again you can see the law of Total Probability naturally comes into the picture, right? For P
of head, I am going to condition on what could have happened in the first step, in the choice of the
coin. What are the three things that could have possibly happened?

When you pick a coin, it should either be double-headed or double-tailed or normal. That is
it. Those are the three possibilities for the coin. So you can condition on that. And then the law of
Total Probability will give you what you want.
Head is H, double-headed is D-H, double-tailed is D-T, and N is a normal coin. The probability
of head is P(H) = P(H|D-H)P(D-H) + P(H|D-T)P(D-T) + P(H|N)P(N). So, probability of head
given double-headed times probability of D-H, plus probability of head given double-tailed
times probability of D-T, plus probability of head given a normal coin times probability of N.

So now what is the probability? So now let us just start writing values to this. What is the
probability of H given that you picked up a double-headed coin? That is going to be 1, right? So
in that conditional sample space once you have gone into the conditional probability space, once
you have gone into the double-headed world whatever you toss you will get heads with
probability 1. What about this guy, probability of heads given double tail?

That is going to be 0, is not it? So you have gone into the, the coin you picked is double-tailed.
You toss it. You will never going to get heads. What about this world, a normal coin? You are
going to get half, right? So half, fair coin, it is not stated here, we will assume it is fair. Now
what about this one? What is the probability that when you randomly choose from this
combination you will get a double-headed coin?

That is 2/5, right? So 2/5. Again a uniform sort of distribution. This is 1 over 5. This is 2 over 5.
You are done, is it?

P(H) = P(H|D-H)P(D-H) + P(H|D-T)P(D-T) + P(H|N)P(N) = 1×(2/5) + 0×(1/5) + (1/2)×(2/5)
= 2/5 + 0 + 1/5 = 3/5.

Now it is just arithmetic, right? You just multiply and get to the answer. So it is 2/5 + 1/5 = 3/5.
A simple calculation, to show you a situation where the partition was with multiple events, more
than one of them. Alright, so that is the last part of the short lecture on the law of Total Probability.
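A quick Monte Carlo check of this answer (a simulation sketch, in the spirit of the course's computer demonstrations):

```python
import random

random.seed(42)

# The five coins from the example: two double-headed, one double-tailed,
# two normal; each coin is represented by its two faces.
coins = [('H', 'H'), ('H', 'H'), ('T', 'T'), ('H', 'T'), ('H', 'T')]

trials = 100_000
heads = 0
for _ in range(trials):
    coin = random.choice(coins)   # pick a coin at random (1/5 each)
    face = random.choice(coin)    # toss it: either face with probability 1/2
    heads += (face == 'H')

print(heads / trials)   # close to 3/5 = 0.6
```

The empirical frequency of heads settles near 0.6, matching the 3/5 from the law of Total Probability.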

Hopefully it is clear to you. The crucial idea is captured in this picture here. You have a partition
of the sample space, a bunch of alternatives one of which has to have happened. You
condition on them, move on to the next step, and then you can easily calculate an overall
probability for any other event by breaking it down into how it intersects with each of the
partitioning events. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Bayes’ Theorem
(Refer Slide Time: 00:13)

Hello, and welcome to this lecture. In this lecture we will see Bayes’ Theorem. It is a very
nice, elegant and simple theorem; you already know enough that when I write
it down, you will see a proof is not needed. It is very, very simple. But it is extremely
powerful. Quite often in problems, we will use it in very simple ways. And it gives you a
very nice, neat way to go from one conditional space to another conditional space
through the original space.

So that is one way of thinking about Bayes’ Theorem. And it is very convenient. It
shows up in problems all the time, and many intuitive questions that you will have
when studying something statistically, questions involving conditioning, can be
conveniently answered using Bayes’ Theorem. So let us get
started.
(Refer Slide Time: 01:01)

So, I am going to start with an example and show you the nature of the problem that you
will have. So why is such a theorem very interesting and important? Let us go back to
this toy example of two urns with colored marbles. We have seen this again and again, I
guess this picture is going to haunt you for the rest of your life, this ugly little picture of
two urns and a bunch of marbles inside them.

So you are going to pick an urn at random and then pick a marble at random from that
urn. So far, we have been doing the calculation in one direction, right? How did we do it? We
set P(urn1) = P(urn2) = 1/2; that was my B and Bc in some sense. And then, given that I
chose a particular urn, I was looking at the probability of red and the probability of blue under that
conditioning, et cetera. We saw that this worked out quite easily as well.

Now, what is interesting is that maybe there is someone who did not observe
the first step. They did not see which urn you picked the ball from; the only thing they are
shown is the ball. So they see only the end of the experiment, in some sense. This happens
quite a bit: when you set up a complicated experiment, or look at a complicated situation, you only get to
observe some things, and those may not be the most important root cause of the whole
thing. A lot of things might happen, and you observe only one thing.
When you observe things like that, you may want to go in and figure out what could
have happened inside, internally. You may or may not be able to know, but at least
you can ask probabilistically, statistically, what could have happened on the inside.
The question down below is something like that. This toy example seems really
silly, but you will see quickly enough that, in realistic practical situations, these kinds of
questions are extremely crucial.

So I might observe, after the experiment is over, without seeing the first step in any
detail, only that a red marble was selected. And I may be interested in
finding out the probability that it came from urn one. So this is a conditional
probability: I want the conditional probability that the ball came from urn one, given that
I got a red ball.

Notice, this experiment involves two steps: I pick an urn at random, and then I pick a ball.
I only observe the color of the ball at the end. Then I ask the question: what is the
probability of urn one given that I got a red ball? This seems a little complicated,
right? It is sort of like a law of total probability situation, but it
seems different; the nature of the conditioning is different, right?

Given that you are going to pick from urn one, finding the probability of a red ball is very
easy: it is a very natural space, a uniform distribution, and you quickly write down the
answer. But the other way around, it is not clear what the distribution is. The probability of urn one given
that you saw red is not immediately clear, at least. The same goes for blue:
urn one given blue, urn two given red, urn two given blue. So you may be
interested in questions like this; this is one such example.
(Refer Slide Time: 04:11)

So here is a more interesting, maybe more realistic, practical sort of situation. This
question comes up all the time; we are going through a major flu epidemic, and these kinds
of questions are very critical in practice. So here is a question
surrounding a flu test.

So, let us say you have a city where you think about 1% of the people have swine flu. Of
course, you do not know who this 1% are; if you knew exactly who they were, you would be done. So you
have to have some statistical framework here, right? You do not know who these 1
percent are, but you know, or at least you believe, that 1 percent of the people
have swine flu.

So someone says: I have devised a flu test. And what property does this person say
the flu test has? If a person actually has swine flu, then with 95% chance that person will
test positive in this test. That is their claim. Sounds pretty good, right? With very high
probability, if someone actually has swine flu, they will be caught.

And then look at the next one: in about 2% of the cases, a very small number of cases,
even if the person does not have swine flu, that person will test positive. So this is the
test that somebody is giving you; some pharmaceutical company or something has
devised a test and they tell you: this is its property, and we have done some trials of it
elsewhere, maybe in other cities, under controlled circumstances, whatever. And we know
that this is how the test behaves.

You have a test, and you have a city where you suspect about 1% of the people have swine
flu; what do you do? You start testing, right? So how do you test? Let us say you do this.
Notice the situation here: you pick a person randomly from the
city, and you test them. And the test comes out positive.

So notice, in this situation, you never got to observe the first step. What is the first
step in this experiment? Whether the person contracted swine flu or not. You can never
really observe that. They would have got swine flu through various mechanisms:
maybe they travelled, maybe they went close to someone else. It is very hard to
observe that; you can never really observe that.

That is what I meant when I said that in many experiments there is a two-step sort of
process, but you never get to observe the first step. The second step you can
observe: you pick this person and do the test. But you can never directly find out
whether they actually have swine flu or not.

So that first part is just a statistical number for you, 1%; you do not know exactly whether it
is true or not, and you have to rely on the test. You do the test, and you know whether the test
came out positive or not. So this is the sort of realistic scenario which mirrors the urn
experiment, the toy experiment, in some sense.

So in this case, again, the probability that a person actually has
swine flu is 0.01, the probability that the person tests positive given they have swine flu is
0.95, and the probability that they test positive given they do not have swine flu is 0.02.

Now, the crucial question we are interested in in practice is clearly the reverse, is it not:
the probability that the person has swine flu, given that they tested positive? And you want
this number to be very, very large, do you not? You want it to be 90-plus percent, so that you can
be reasonably sure that, yes, this is a good test, and I can use it to go
around and find the people who were infected with swine flu.
So that seems like an interesting question to ask.

So here, again, you have a situation where you are interested in this reversal of the
probability. The probability of testing positive given swine flu is easy for you; it is given to
you, since the company which markets the test has to give it to you. But the reverse
probability is not something that you have, is it? So how do you do this problem? So
again, two problems with the same flavor.

(Refer Slide Time: 08:14)

So let me just put it out abstractly for you. You have two events A and B. P(A) and P(B) are
known, let us say. And P(A|B) and P(A|Bc) are either known or easy to
find.

I want the reverse conditioning: I want to find P(B|A). P(A|B) is given to me,
or I know it; P(A|Bc) is also known to me. What about P(B|A)? Can I find the probability of B
given A? Is there a way to move between these conditional spaces?

Notice, conditioning on A gives a different probability space, does it not? It is a different
conditional probability space; conditioning on B gives yet another probability space. Are these
probabilities related? If you think about it, they are, and that relationship is
exploited and specified as Bayes’ Theorem. Bayes’ Theorem is stated as follows.
(Refer Slide Time: 09:12)

It is easy enough to state. I have drawn a picture here to highlight what is going
on: there is a sample space, you have two events A and B, and there is an intersection A ∩ B.
We will assume P(A) is positive and P(B) is positive. So everything
starts from this A ∩ B. What is A ∩ B? A ∩ B is inside A; it is also inside B.
So it is an event which is inside both the conditional spaces. You condition on A,
A ∩ B is there; you condition on B, A ∩ B is also there.

So you can write P(A ∩ B) in two ways, using either of the conditional spaces:
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). That is Bayes’ Theorem; it
is very, very simple, right? You know this equality is definitely true. From here, you can
write P(B|A) in terms of P(B), P(A|B), and P(A), and the relationship is given there:
P(B|A) = P(A|B)P(B) / P(A).

So it is as simple as that; there is nothing major going on here. It is just A ∩ B,
which lies in both these conditional spaces and gives you a very natural connection
between P(A|B) and P(B|A).

Now, these two laws, they seem so simple, but together, they can really open the door for
you for so many problems. So, this law of total probability and Bayes’ Theorem are at the
heart of so many applications in probability. And we will see them over and over and
over again, as we do statistics.

(Refer Slide Time: 11:06)

Let us go back to our urn problem. My event B is "urn 1 is chosen", Bc is "urn 2 is chosen",
and A is "a red marble is picked". I know P(B) and P(Bc); I know P(A|B) and P(A|Bc);
all of these we have seen before. By the law of total probability, you can very easily compute
P(A) = 6/30.

So you see, these quantities are all easy to find: P(B), P(A|B), P(A|Bc), P(A), all of
them easily worked out. Now Bayes’ Theorem will give you the reverse. What is
the reverse that I wanted? P(urn1|red), that is, P(B|A). It is just a direct application of
Bayes’ Theorem, P(B|A) = P(A|B)P(B)/P(A); plug it in and you will get the answer, 7/12. As easy as
they come: it is a very simple method, and you can just directly apply the formula to
get the answer.

(Refer Slide Time: 12:06)


Let us go back to our flu test; the same scenario is put down here.
Let the event B be "the person has swine flu" and the event A be "the person tests positive".
So P(B) is 0.01 and P(Bc) is 0.99; that we know ahead of time. P(A|B) and P(A|Bc)
are known to you. Using these and the law of total probability, we can compute P(A).
So notice what is going on here; this is very important. Maybe the pen is good enough
for me.

So notice what is going on here with P(A): whenever you use Bayes’ theorem, just
before that you will end up using the law of total probability. Quite often it is stated in
that fashion also. So P(A) = P(A|B)P(B) + P(A|Bc)P(Bc) = 0.95 × 0.01 + 0.02 × 0.99 = 0.0293.

I have just plugged it in here and got P(A). So here is a situation where I have P(A),
P(A|B), and P(B). I know I can use Bayes’ Theorem to find P(B|A); just plug in,
and look at the surprising answer. Are you surprised by this answer? 32.42%: given that you have
tested positive, and given all these other statistics about
the test, there is only about a 32 percent chance that you actually have swine flu.

Think about what went wrong here, what is going on. P(A),
the event that you test positive, can happen in two ways: you may
have swine flu and test positive, or you may not have swine flu and still test positive. Look at
the first term: that probability is 0.0095. And the second term? It
is 0.0198. These two are roughly in the ratio of one
to two, a little more than one to two.

So the case you are interested in happens with about one-third probability, a
little less than one-third. This is the crucial thing: what matters is the
product of the two factors, not just P(A|B) by itself. Even though,
individually, 95% is nearly 50 times 2%, once you multiply by 0.01 and 0.99 they
end up in just a 1:2 ratio. So the false-positive contribution ends up being the larger one.

So this is very interesting and instructive: given these kinds of numbers, one
needs to be very, very careful. And look at how Bayes’ Theorem has helped you
understand something very critical about tests. Something similar happens over
and over again in practice.
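The whole flu-test computation fits in a few lines. A sketch using exact fractions, with the numbers taken from the example above:

```python
from fractions import Fraction

p_flu = Fraction(1, 100)                # P(B): person has swine flu
p_pos_given_flu = Fraction(95, 100)     # P(A|B): test positive given flu
p_pos_given_healthy = Fraction(2, 100)  # P(A|Bc): false positive rate

# Law of total probability: P(A)
p_pos = p_pos_given_flu * p_flu + p_pos_given_healthy * (1 - p_flu)

# Bayes' theorem: P(B|A) = P(A|B)P(B)/P(A)
p_flu_given_pos = p_pos_given_flu * p_flu / p_pos
print(float(p_flu_given_pos))  # about 0.3242, the surprising 32% figure
```

Working with `Fraction` keeps every intermediate value exact, so the 0.0095 and 0.0198 terms discussed above appear with no rounding.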

(Refer Slide Time: 15:04)

So what we are going to do next is a few more problems, just to get our feet wet,
to understand how to use this and the different situations in which it may arise. Here is
another problem; let me quickly work it out with you. Here is an MCQ: in
the quizzes in this program, you have multiple choice questions with
four choices, one of which is correct. And maybe this is a good model for most people: for
any question, there is a probability 3/4 that you actually know the answer.
If you do not know it, you are going to guess, randomly
picking one of the choices. Supposing the question was answered correctly, what is the
conditional probability that the student actually knew the answer? So, what is given here?
P(correct|knows) = 1.

If you know the answer, you are going to be correct. P(correct|doesn’t know) = 1/4; I am
writing everything in English just to keep it very simple. This is going to be 1/4, is it not?
And P(knows) = 3/4, which implies, first thing, that P(doesn’t know) = 1 − 3/4 = 1/4.
That is it.

This is a classic case of the law of total probability, right? You find P(correct) using the law
of total probability: 1 × 3/4 + 1/4 × 1/4 = 3/4 + 1/16 = 13/16, is it not? So you
know P(correct), P(correct|knows), and P(knows), and you can find P(knows|correct). I wrote
all these things out in gory detail.

Usually you can just directly write the answer; once you get used to these kinds of
problems, you know exactly how to do this. So this would be
P(knows|correct) = P(knows) × P(correct|knows) / P(correct) = (3/4 × 1)/(13/16),
which works out to 12/13.

So that is good: the probability that the student knew the answer, given that they got it
right, is overwhelming; it is a pretty large probability. You can see how these things work out in
quite a simple way: given the probabilities of some events, you can use the results
you know to move towards the probability you want.
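The same two-step recipe, law of total probability and then Bayes, can be written out exactly. A sketch with the quiz numbers:

```python
from fractions import Fraction

p_knows = Fraction(3, 4)                # P(knows)
p_correct_given_knows = Fraction(1)     # P(correct | knows)
p_correct_given_guess = Fraction(1, 4)  # P(correct | doesn't know): one of four choices

# Law of total probability
p_correct = p_correct_given_knows * p_knows + p_correct_given_guess * (1 - p_knows)

# Bayes' theorem
p_knows_given_correct = p_correct_given_knows * p_knows / p_correct
print(p_correct, p_knows_given_correct)  # 13/16 and 12/13
```

The two printed fractions are exactly the 13/16 and 12/13 derived above.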
(Refer Slide Time: 18:34)

So here is a slightly more complicated situation; maybe the computations are going to be
a bit more painful. Let us look at it: there are two steps once again. You roll a fair
die, getting a number from 1 to 6, and then you toss as many fair coins as the number that
showed on the die.

So if the die shows 2, you toss 2 fair coins and count the number of
heads you got; that is one possible run of the experiment. Now, given that 5 heads were obtained,
what is the probability that the die showed 5? Once again, this is not too difficult to
work out.

So P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. What the
die showed is your partition: 1 or 2 or 3 or 4 or 5 or 6. Now, the
probability of 5 heads, given that the die showed i, denoted P(5 heads|i): the first thing I am
going to say is that P(5 heads|i) = 0 if i = 1, 2, 3, 4.

If the die showed only 4 and you tossed only 4 coins, there is no way you are going to
see 5 heads; that is not possible. Next, P(5 heads|5): if the die showed 5, you
are going to toss 5 coins, there are 32 possibilities for heads and tails, and only 1 of them is
favorable to you, which is 5 heads. So P(5 heads|5) = 1/32. This is a uniform
distribution calculation: you just count the favorable outcomes among the 32 possibilities
and you get the answer.

What about the next one, P(5 heads|6)? Here you have 64 possibilities, and
to get exactly 5 heads you must have exactly 1 tail. That 1 tail can occur on
the first coin, second coin, third, fourth, fifth, or sixth coin. So you
have 6 different outcomes with 5 heads, and P(5 heads|6) = 6/64 = 3/32.
So you have done your partitioning, and now I am ready to find the probability of 5 heads.

All the i = 1 to 4 cases go away; only these two remain. So I have
P(5 heads) = P(i=5)P(5 heads|i=5) + P(i=6)P(5 heads|i=6) = 1/6 × 1/32 + 1/6 × 3/32. Since 32 times 6 is
192 and 1 plus 3 is 4, this is 4/192 = 1/48. So
P(5 heads) = 1/48. Now we have P(5 heads), P(5 heads|5), and P(5 heads|6); strictly, you do not
have to bother about the other cases, but anyway, we should be careful about
them.

So how do you find the reverse probability, P(5|5 heads)? This is what you have to do:
it is going to be P(5 heads|5) × P(5) / P(5 heads) = (1/32 × 1/6)/(1/48) = 48/192 = 1/4.
So P(5|5 heads) = 1/4.

So it is just basic calculation; you see how the same idea is used over and over again,
except that the computations may be a bit different and tricky in places, but the idea is
the same: you go in one direction using the law of total probability, and then you
reverse using Bayes’ Theorem.
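The 1/4 answer can also be sanity-checked by brute force: run the two-step experiment many times and, among the runs that produced 5 heads, count how often the die showed 5. A minimal simulation sketch:

```python
import random

def run_experiment():
    # Step 1: roll a fair die
    n = random.randint(1, 6)
    # Step 2: toss n fair coins and count heads
    heads = sum(random.random() < 0.5 for _ in range(n))
    return n, heads

trials = 200_000
five_heads = 0
die_showed_5 = 0
for _ in range(trials):
    n, heads = run_experiment()
    if heads == 5:
        five_heads += 1
        if n == 5:
            die_showed_5 += 1

print(die_showed_5 / five_heads)  # should be close to 1/4
```

Five heads is a rare event (probability 1/48), so a large number of trials is needed before the conditional frequency settles near 0.25.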

So that is the last example I am doing for Bayes’ Theorem. Hopefully the previous two
lectures give you a good sample of how to move from one conditioning to another,
do multiple conditionings, use the partitioning very smartly, and keep computing
probabilities for more and more interesting events; and, in particular, how to reverse the
conditioning with Bayes’ Theorem and find some very interesting conditional
probabilities. That is the end of this lecture. Thank you.
Statistics for Data Science II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Independence

Hello and welcome. In this lecture we are going to look at independence. When do we say two
events are independent? When do we say multiple events are independent? What is the
meaning of independence? There are lots of meanings of the word "independence" in English,
but in probability theory independence has a very specific, well-defined meaning.

And quite often people make the mistake of relying only on their intuition: somebody asks
"are these two independent?" as a probability question, and people just use their intuition and say
"yeah, they should be independent". It is good to have that kind of
intuition, but the way independence is defined in probability is very precise.

It is a very precise, exact definition, and you should also learn the discipline: when somebody
asks you a question about independence, take a piece of paper, write down a couple of
statements, check whether the precise definition holds and is meaningful, and only
then accept or reject your intuition about independence.

This is important, and I want to stress it once again: because of the everyday meaning of the word
"independence", a lot of people get misled and jump to the wrong conclusion very quickly.
Remember, there is a precise definition, and you always have to check that definition. Let us go ahead
and see what it is.
(Refer Slide Time: 01:34)

Here is the definition; start with the definition. Independence
of two events: there are two events A and B in a sample space, and when do we say they are
independent? Here is the definition, and this is the only way you can say two events are independent:
P(A ∩ B) = P(A)P(B). That is it.

There is no other intuitive way in which you can define independence; there is no other
definition. This is the definition: two events A and B are independent if P(A ∩ B) =
P(A)P(B). The product of the two probabilities should equal the probability of the intersection.

So what is the motivation? Why does it make sense, and why is it useful? I have tried
to capture that in the points written below the definition. The motivation,
and where the word "independence" comes from, is the connection to
conditional probability. Suppose P(B) is positive, right?

So if it is positive, then what is P(A|B) when A and B are independent? Notice, if
P(A ∩ B) = P(A)P(B), then
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A).

So the occurrence of B does not change the probability of A in this sense: if you are in the original
space and you find P(A), and then you go into the conditional space conditioned on B and find
the probability of A given B, you get the same answer. It is as if the occurrence
of B did not influence A in any way. That is the idea here; you think of this as an
independence condition. P(A) is unaffected by B.

Now notice what has happened here. Go back to the way we
started with the axioms and began playing around with events; we saw that union and
intersection are very crucial operations, right? Union is like an OR operation, A or B;
intersection is like an AND operation. We saw in one of the axioms, the second point here, that
if two events are mutually exclusive, if A and B
are disjoint, then the probability of the union of the two events is easy to calculate.

P(A ∪ B) = P(A) + P(B). Some people call this the Addition Rule of
probabilities: when you have two events that are mutually exclusive
and you want to find the probability that either one or the other happened, you can simply add. There
is no intersection between them, so there is no need to worry about subtracting the
intersection.

Now, what independence tells you is how to handle the probability of the intersection of two events:
the probability of two events happening at the same time, of one outcome being favorable to both
events A and B. You have two events A and B; what is the probability of A and B, event 1 and
event 2? You have a multiplicative rule only when the two events are independent. This is
extremely important.

Just as the Addition Rule works only when the two events are mutually exclusive, i.e.
disjoint, the Multiplication Rule for AND works only when the two events are independent. A lot
of people put it like this: the rule of multiplication and the rule of addition are very useful in computing
probabilities. So once again, the definition; I want to remind you, do not assume all events are
independent.

A lot of people assume events are independent when they are not. Events A and B are independent if
the probability of their intersection equals the product of the probabilities:
P(A ∩ B) = P(A)P(B). That is the definition. When I state it as a Multiplicative Rule it seems
like something you can always use, but you have to know A and B are independent ahead of time. Only
when you know A and B are independent can you write P(A ∩ B) = P(A)P(B). Alright, so much
for the definition and its description.
(Refer Slide Time: 05:49)

So we are going to see a few examples, and I will point out some pitfalls, places where I have
seen students struggle to understand independence. Let us begin with a very simple example:
you toss a coin thrice. There is a coin; pick it up, toss it once and note what shows up, toss a
second time, toss a third time, and look at the three outcomes.

So intuitively, physically, from a very high level you see immediately that the first toss and the
second toss have to be independent, right? If you have an event involving the first toss and an
event involving the second toss, they had better be independent; otherwise the natural physical
sense of connecting to the experiment would be lost. You tossed once,
then you tossed a second time.

What happened in the first toss and what happened in the second toss had better be independent in
some sense. Let us see whether that intuition is carried over into the definition; we look at it
from a definition point of view. When we toss a coin thrice, we have a
probability space with a uniform distribution. The uniform distribution is over what?
Over all 8 possibilities that can happen: HHH, HHT, THH, and so on.

So I am going to define two events A and B, one for the first toss and another for the second toss:
A is "first toss is heads", B is "second toss is heads". Let us see what the event A is as a
subset of the sample space; all events are subsets of the sample space. It is everything
with H in the first place: {HHH, HHT, HTH, HTT}.
What is B? Second toss is heads: you go through the 8 outcomes and pick out the ones which have H
in the second position: {HHH, HHT, THH, THT}.

Now, what is the definition? The definition involves P(A ∩ B) = P(A)P(B). Let us see if
it is satisfied. What is A ∩ B? The first toss should be heads and the second toss should be heads, so
A ∩ B = {HHH, HHT}. You can go through the events here and see that this is
right. Now check the condition for independence.

P(A ∩ B) = 1/4, is it not? Why is it 1/4? There are 2 favorable outcomes out of 8: 2/8 =
1/4. And that is equal to 1/2 × 1/2, which is P(A) × P(B), is it not? So we see that A and B are
indeed independent. So it is a good definition in that sense: where you think there
has to be independence, it ends up being independent; no problem.

Now what about A and Bc? That should also be independent, right? Why should that be
independent? Think about it: A involves the first toss; B involves the second toss; Bc also
involves only the second toss ("second toss is tails"). The first and second tosses should really be independent. I
will leave you to check that A and Bc do end up being independent; it is
true.

What about "first toss is heads" and "third toss is heads"? Those should also end up being
independent; I will leave you to check those two. So once again, let me
step back and give that warning: you have an intuition about which events
are going to be independent, but always check, always check with the definition. Make sure that
the actual definition holds.
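Because the sample space here has only 8 equally likely outcomes, the definition can be checked exhaustively rather than trusted to intuition. A sketch:

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))  # the 8 equally likely outcomes of three tosses

def prob(event):
    # event is a predicate on outcomes; uniform distribution over omega
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"  # first toss is heads
B = lambda w: w[1] == "H"  # second toss is heads

# Check the definition P(A and B) = P(A) P(B)
print(prob(lambda w: A(w) and B(w)) == prob(A) * prob(B))            # True
# A and B-complement should also pass the check
print(prob(lambda w: A(w) and not B(w)) == prob(A) * (1 - prob(B)))  # True
```

The same `prob` helper lets you check any other pair of events, such as first and third toss, in one line.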
(Refer Slide Time: 09:02)

Let us look at another example where quite often people get misled; I put this example right next
as an interesting one. Let us say you throw a die, and I define the event A as the number
being even and the event B as the number being odd. People sometimes have the wrong intuition
that when two events are disjoint they become independent, for some reason. That is absolutely
not true, and you can quickly check it.

Even and odd are disjoint: if you take A ∩ B, you get the empty set, and what is the probability of
the empty set? It is 0. That is definitely not the product of the two probabilities:
the probability that you get an even number is half, the probability that you get an odd number is
half, and half times half is 1/4. That is the product P(A) × P(B), the
product of the two probabilities.

On the other hand, the probability of the intersection, the number being both even and odd, is 0;
it does not happen. So here is the point to remember, do not forget it: if two events are
disjoint, then they are actually not independent at all. Why is that? Because if they are disjoint
and I tell you B happened, then you know for sure that A did not happen. So the
occurrence of B definitely affects the conditional probability of A. That is important to know.

This last point is a bit counter-intuitive, maybe a little surprising,
but it is a crucial part of independence: independence does not mean disjointness. Two
events, if they are independent, have to have a non-trivial intersection;
A ∩ B must be possible for A and B to be independent. A small
idea, but important to keep in mind.

Here is another example. I am going to modify the event B slightly: A is still the even numbers 2,
4, 6, and B is now the event that the number that showed up is a
multiple of 3. Go through the mechanics of the definition: A ∩ B is just {6} alone, so
P(A ∩ B) = 1/6. That is clearly
1/2 × 1/3, which is P(A) × P(B).

So the definition holds, and A and B are independent: the event that you get an even
number is independent of the event that you get a multiple of 3. I have seen students get
confused by this. The first confusion: A and B have an intersection, how can they be
independent? This is what I am trying to dispel here: an intersection is mandatory for
independence. Without an intersection there is no independence.

It seems to go against our normal English understanding of independence, where two
independent people are away from each other. Here it shows that there is an
intersection and still there is independence; in fact, only when there is an intersection can there be
independence. So it is a rather different understanding of what independence is.

The other slightly interesting doubt students have asked me over a long period of time is this: the die
is thrown only once, there is only one number; what is the meaning of two events
defined from that, and of the two events being independent, et cetera? I want you to think about that on
your own and be convinced that there is no problem there. The outcome is only one.

The experiment is done once and there is one outcome, but there are many events defined
on that sample space, and your outcome can belong to multiple events; it can be favorable to
multiple events. All of that is perfectly fine; it is not wrong. Do not be confused by
that. That is one more example to clear the air.
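Both die checks above, the disjoint pair that fails the definition and the overlapping pair that passes it, can be verified by direct enumeration. A sketch:

```python
from fractions import Fraction

omega = range(1, 7)  # outcomes of one fair die throw

def prob(event):
    # event is a predicate on outcomes; uniform distribution over 6 faces
    return Fraction(sum(1 for w in omega if event(w)), 6)

even = lambda w: w % 2 == 0
odd = lambda w: w % 2 == 1
mult3 = lambda w: w % 3 == 0

# Disjoint events are NOT independent: P(even and odd) = 0, but P(even)P(odd) = 1/4
print(prob(lambda w: even(w) and odd(w)), prob(even) * prob(odd))
# Even and multiple-of-3 intersect in {6} and ARE independent: 1/6 = 1/2 * 1/3
print(prob(lambda w: even(w) and mult3(w)) == prob(even) * prob(mult3))  # True
```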
(Refer Slide Time: 12:52)

To drive home this point I am going to repeat it once more in another setting,
because it is important enough that you should have a good, deep understanding of it.
Suppose you draw a card from a pack. I define the event A as "the card is a spade" and
the event B as "the card is a king". And what is A ∩ B? There is an intersection:
the king of spades.

So P(A ∩ B) = 1/52 = P(A) × P(B). The probability that the card is a spade is 1/4: there are
four suits, all of them equally likely in some sense. The probability that the card is a king is
1/13: there are four kings out of 52 cards. And 1/4 × 1/13 = 1/52, so you see that
A and B are independent.

So when you have a pack of cards and you draw a card, the event that the card is a spade has
no influence on the probability of the card being a king: the probability of a king before you knew
it was a spade and after you knew it was a spade is exactly the same; it does not
change. Once again, I have repeated the two points of confusion.

There is an intersection, the king of spades; how can the events be independent? In fact, because there
is a king of spades, they can be independent; if there were not, they would not have been independent.
And there is only one card drawn, so what is the meaning of two events? Remember
that this is how probability spaces and events are defined: there is only one outcome, and the
outcome can belong to, and be favorable to, multiple events.
(Refer Slide Time: 14:23)

So we have seen independence of two events. It turns out one can define independence for
multiple events. Now this definition has a lot of other motivations and important reasons why,
but I am just going to make this definition and sort of drive home the important peculiarity about
this definition. So if you have three events A, B, C you have these conditions for A and B being
independent, A and C being independent, B and C being independent.

Those three are easy to write down; I have written them down first. Let me pick up this laser pointer, that is a little bit better. You need P(A ∩ B) = P(A) P(B), P(A ∩ C) = P(A) P(C), and P(B ∩ C) = P(B) P(C). But mutual independence of three events does not stop there.

There is one more condition that is required. Here is this important condition: P(A ∩ B ∩ C) = P(A) P(B) P(C). Only then do you say that all three are mutually independent. So the first three conditions are called pair-wise independence: only two at a time are independent. The fourth condition makes them really mutually independent, so that the three events are mutually independent.

Now it is possible, and I have given an example here of three events in a very simple sample space (tossing a coin twice), to be pair-wise independent but not mutually independent. So here are A, B, C; I have just defined them as subsets. A, B and C are independent pair-wise; you can check that out, I will leave that to you. But the four-fold condition is not satisfied: P(A ∩ B ∩ C) works out to a value that is not the product of the 3 probabilities.
So it may happen that pair-wise independence is true but mutual independence is not. At this point we will leave it as a definition. But the way I usually think of mutual independence of a bunch of events is this: once you have A, B, C being mutually independent, any intersection that you can think of always becomes a product. So the multiplication rule is sort of the motivational point behind this, right. Any intersection you take becomes a product, and that is nice to know.
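This pairwise-but-not-mutual phenomenon is easy to verify by enumeration. The slide's exact subsets are not reproduced in the transcript, so the sketch below uses the classic choice of A, B, C for two fair coin tosses:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))   # two fair coin tosses, 4 outcomes

def P(event):
    # Uniform distribution on the 4 outcomes.
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}    # first toss is heads
B = {w for w in omega if w[1] == "H"}    # second toss is heads
C = {w for w in omega if w[0] == w[1]}   # both tosses agree

# Pairwise independence holds...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ...but the four-fold condition fails, so they are not mutually independent.
print(P(A & B & C))           # 1/4
print(P(A) * P(B) * P(C))     # 1/8
```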

(Refer Slide Time: 16:42)

You can push this to multiple events as well. Why stop at three? For two events we have defined independence, and for three events we have defined independence; you can define it for any number of events. If you have n events A1 to An, they are said to be mutually independent if the multiplication rule holds for every sub-collection: you pick indices i1 to ik and look at Ai1, Ai2, ..., Aik.

You can pick any sub-collection: A1, A2, A3; A2, A10, A15; A1, A100, A11; whatever, of size 2, 3, 4, 5 and so on. For every such sub-collection, the probability of the intersection has to be equal to the product of the probabilities of the individual events involved: P(Ai1 ∩ Ai2 ∩ ... ∩ Aik) = P(Ai1) P(Ai2) ⋯ P(Aik). This is the condition. When the events satisfy this condition, they are said to be mutually independent.
So there are a lot of constraints here, one for every sub-collection of size at least two, so there is a whole bunch of things to check. But here is an interesting result I want to point out. I will not go into the depth of proving these things; there is a proof, but I will not describe it.

So if A and B are independent, it turns out A and Bᶜ are also independent. This you can prove; you do not have to check it separately. If somebody tells you A and B are independent, it implies that A and Bᶜ are independent. That is sort of intuitive in some sense; think about why it is true. B does not affect A, so Bᶜ should not affect A as well.

So basically this result says: if two events are independent, then the complement of one event together with the other event are independent. You can use this rule twice and get to the point where A and B being independent implies Aᶜ and Bᶜ are also independent. So notice how powerful this independence is. It is a really powerful definition; it seems very simple but implies so many other things, right.
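The complement rule can be checked numerically on a small sample space. The die events below are my own illustrative choice, not from the slide; the identity P(A ∩ Bᶜ) = P(A) − P(A ∩ B) = P(A)(1 − P(B)) = P(A) P(Bᶜ) is what makes it work:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # one throw of a fair die

def P(event):
    return Fraction(len(event), len(omega))

def independent(E, F):
    # The defining multiplication-rule check for two events.
    return P(E & F) == P(E) * P(F)

A = {2, 4, 6}       # the die shows an even number
B = {1, 2, 3, 4}    # the die shows at most 4
Ac, Bc = omega - A, omega - B

assert independent(A, B)     # A and B are independent
assert independent(A, Bc)    # so A and B-complement are too
assert independent(Ac, Bc)   # applying the rule twice
print("all checks passed")
```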

Whether you complement or not, they are all independent. This in fact extends to n mutually independent events as well. Take any sub-collection of these events, and complement or do not complement each one, whatever you want; the resulting events are all independent. So, for instance, if A1 to A100 are mutually independent, then A1 and A80ᶜ are independent.

A1, A80ᶜ and A50 are mutually independent. Any sub-collection, with complements or without, is mutually independent. That is the power of this result; it is very, very powerful.
So once you know they are mutually independent you can just multiply. The rule of
multiplication works for any intersection of these events. That is the power of this definition, and
it is put to use quite a bit everywhere.

So physically you may get independence quite easily. Events involving different things are going to be independent, and then you can just multiply; the rule of multiplication really, really helps you. That is how independence is usually used in practice: you know the events come from physically independent experiments, independent realities in some sense, so you can multiply the probabilities.
(Refer Slide Time: 20:16)

So let me show you a problem which seems simple but which independence lets you solve very easily. Here is the problem; there will be more problems of this nature in your assignments, but let me do one for you in some detail. Here is the situation. There are three cities, let us say cities or towns, A, B and C. There are two parallel roads that connect A and B: one road connects A and B, and another road also connects A and B. The same way, between B and C there are two parallel roads, right? Hopefully the situation is clear.

What people have observed is that each of these roads can get blocked with a probability p, independently of all the other roads. Here is the mutually independent scenario. What are the events? There are 4 events here: the first road from A to B gets blocked; the second road from A to B gets blocked; the first road from B to C gets blocked; the second road from B to C gets blocked. These are the four events, and the problem says these four are mutually independent and each happens with probability p.

So now, for any sort of intersection of these events or their complements, what can you do? You just multiply. That is the beauty of this, right. And this is reasonable, I think: each of these roads has its own physical characteristics, and something that happens on one road may not affect what happens on another road. So independence is reasonable to assume.
Now you have been asked to calculate a certain probability here. Let us see where we go with this problem. In this probability space we need to compute a particular probability. The question is asking for a conditional probability, even though the word conditional is not there: what is the probability that there is an open route from A to B, given that there is no open route from A to C?

So let us see if we can calculate this probability. What is an open route from A to B? An open route from A to B means either the first road or the second road should be open, is it not? So let us see what calculation is possible here. Let us first calculate the probability of an open route from A to B. I will call the roads 1, 2, 3, 4; this is the probability that 1 is open or 2 is open, is it not?

So p is the probability that a road gets blocked. What is the probability that a road is open? That is 1 minus p, is it not? Now, "1 is open or 2 is open" seems a little bit difficult to handle directly, so what one can do here is go the other way around. You will see there are various ways to do this; I will do it like this. The probability that there is no open route from A to B is actually easy to calculate, right?

So "1 is open or 2 is open" seems very hard, but this is easy to calculate: 1 is blocked and 2 is blocked, right. Notice how this works out. This is a trick that you should use in every probability calculation: you see a probability that is hard to evaluate, so what do you do? You go try to find its complement. For the probability of an open route from A to B, I tried to find the complement, the probability of no open route from A to B, and you notice that this becomes very easy.

What is this? 1 is blocked and 2 is blocked. These two are independent events and I am doing an AND. If you have independent events and you are doing an AND, you can multiply: that is just p into p and you get p². Notice how this worked out very, very easily. So the probability of an open route from A to B becomes 1 − p².

So I have to find the conditional probability here, which seems interesting enough. What is the probability of having no open route from A to C? That quantity appears in the question: the event that you do not have an open route from A to C. To compute it, you can condition on whether or not there is an open route from A to B.

So let us do this conditioning: the probability of no open route from A to C given an open route from A to B, times the probability of an open route from A to B, plus the other possibility, right? Let me call the event of no open route from A to C as E, just to simplify matters, and the event of an open route from A to B as F. Then, by the law of total probability, P(E) = P(E | F) P(F) + P(E | Fᶜ) P(Fᶜ).

So now, if there is no open route from A to B, what is the probability that there is no open route from A to C? That is 1, is it not? You agree? If there is already no open route from A to B, the probability that there is no open route from A to C is just 1. You do not care what happens between B and C; there is no open route from A to C. It is finished, right. So that is 1.

And this quantity we know already, right? What is the probability that there is no open route from A to B? That is p². And what is the probability that there is an open route from A to B? This we also calculated: 1 − p². And what is the remaining term, the probability that there is no open route from A to C given that there is an open route from A to B?

For that, there should be no open route between B and C, and that is the same situation as no open route between A and B, so it will end up being p². So you multiply and add: P(E) = 1 × p² + p² × (1 − p²) = 2p² − p⁴. That is the probability that there is no open route from A to C, which is P(E). What you have to find is P(F | E), is it not?

That is going to be, by Bayes' rule, P(F | E) = P(E | F) P(F) / P(E). And you know all these quantities, right? What is P(E | F)? That is just p². What is P(F)? That is 1 − p². And here P(E) = 2p² − p⁴. You substitute all of this; I am running out of room, so let me see if we can squeeze it in:

P(F | E) = P(E | F) P(F) / P(E) = p² (1 − p²) / (2p² − p⁴) = (1 − p²) / (2 − p²),

so you get the final answer, right? So this is your final answer. You can see how we argued this out. Maybe I wrote the mechanics of it in a bit of a painful way; hopefully you can see.

You can write events very carefully and argue about them very carefully, and notice that independence was helping you, right. When you want to AND two things with independence, you can just multiply, and that is very smooth and easy to do. You just repeat this calculation, multiply things together, and then use Bayes' rule and total probability over and over again, and you will get your answer.
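Since this course also uses computer simulation, the final answer (1 − p²)/(2 − p²) can be cross-checked with a quick Monte Carlo run. This is only a sketch; the function name, trial count and the value p = 0.3 are my own choices:

```python
import random

def simulate(p, trials=200_000, seed=1):
    """Monte Carlo estimate of P(open route A-B | no open route A-C)."""
    rng = random.Random(seed)
    favourable = given = 0
    for _ in range(trials):
        # Four mutually independent road blockages, each with probability p.
        ab1, ab2, bc1, bc2 = (rng.random() < p for _ in range(4))
        open_ab = not (ab1 and ab2)          # at least one A-B road is open
        open_bc = not (bc1 and bc2)          # at least one B-C road is open
        if not (open_ab and open_bc):        # condition on E: no open route A to C
            given += 1
            if open_ab:                      # event F: open route A to B
                favourable += 1
    return favourable / given

p = 0.3
print(simulate(p))                    # empirical conditional probability
print((1 - p**2) / (2 - p**2))        # the value derived above
```

The two printed numbers should agree to a couple of decimal places for a few hundred thousand trials.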

To me this problem is very interesting and simple, and at the same time it conveys a lot of ideas. It brings together all that we discussed in this area: conditional probability, the law of total probability, Bayes' theorem and independence. It is a very nice problem, and you want to think about it and get your comfort level with it up quite a bit. It uses just 4 simple basic events.

Each of those roads getting blocked is an event, they are all connected to the question, and they are all independent, right, each with probability p. And then there are these questions about open routes, conditional probabilities of open routes, and no open routes. Look at the way we had to calculate: you have to think of how to organize an event. And notice this very subtle thing: when a union is difficult to calculate directly, you go to the other rules, right, P(A ∪ B) = P(A) + P(B) − P(A ∩ B), or the other way round.

There are so many ways to do these problems. If you do not like the probability of an event, you can find the probability of its complement and take 1 minus it. All of these are very interesting, important tricks, and as you go along you have to pick up these tricks and be comfortable with solving problems. So this problem is very interesting. Hopefully there will be more problems like this in the graded assignments and practice assignments, and you can improve with practice. That is the end of this lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 1.14

Hello and welcome to this lecture. We will consider 3 different things in this lecture. We will begin with the notion of repeated independent trials, which results in something called Bernoulli trials. These are very important and often used in practice when you want to collect some statistics and make some inferences, et cetera.

Then we will look at two distributions which are closely associated with independent Bernoulli trials: one is called the binomial distribution, the other is called the geometric distribution. So, let us get into these 3 things.

(Refer Slide Time: 00:47)


So, we will begin with Bernoulli trials, and let me start with some sort of an application or setting where something like this arises very naturally. Quite often the government or the administration of a city is interested in finding out the incidence of a disease, maybe flu or any other disease, among the population. They may have a sense of it, they may want to collect some statistics about it, and they may want to know what fraction of the population is infected.

So, this can drive policy, right: they can do something for such people, devise some strategies for that disease, figure out if it is an epidemic or if it is endemic, and so on; there are lots of things you can do once you know things like this. One thing they may do, for instance, to generally assess how bad the spread of the disease is, is to pick a random person from the population. Notice, what I am saying is random.

So, maybe they are just walking around, find someone, and administer a test for that disease, and they may repeat this in different places, at random times, in various locations; hopefully the person is actually in the city and not just passing through. Each of these experiments one can call a trial. So it is a trial, and maybe the trials are repeated n times.

And maybe one can assume that each of these trials is independent, in the sense that the person you pick from the population is random enough; hopefully they are from different places and do not have a connection, things like that. So you assume these are independent, and the number that you get at the end is how many of these tested positive.

So there is a test, and let us say it is a very good test: if somebody has the disease, it will turn out positive; otherwise, it will turn out negative. In that case, at the end of n trials you have this number with you. A natural question that arises is: what is the distribution of this number? You need to know the distribution of this number as a function of the actual incidence of the disease.

You expect this number to be high if a lot of people in the city are infected, and low if only a few people are infected, and you want to get a sense of its distribution: you want to know the PMF, the probability that this number takes a certain value, all these things.

More interestingly, maybe you want to infer from this number what fraction of the total population actually has the disease, and this might help you with budgets for how much you have to buy, et cetera. So this is just a setting, and it brings out this notion of a trial and its importance in practice. How do you mirror this in theory?

In the theory that we are studying, how do you understand, or how do you have a framework for, something like this?

(Refer Slide Time: 04:06)


So, let us ask that question, and from there you get this notion of a single Bernoulli trial. Here is the setting: you have an experiment, the sample space associated with it (the set of all outcomes), and an event A inside the sample space, and we want to consider the occurrence of this event as a success for us. Going back to the disease, what is this event? This event is the fact that the test was positive.

So quite often, when you do trials, there is going to be an event which is a success. When I say success, I do not mean it in a positive sense; it is just something that you want to count, a 1, as opposed to a 0, which corresponds to a failure you are not interested in. Somebody did not test positive, that is like a 0; somebody tested positive, that is a 1; and you add all of those ones together to get the total number that tested positive.

Something like that is what the setting is, and let us say small p is the probability of this event A. So, once again, there is an experiment and there is a sample space, and there may be many events associated with that sample space. I am not interested in all of them; there is only one particular event A which I am going to think of as being critical for my trial, and I am interested only in that.

And if that event A occurred when you did the experiment, I am going to say I have a success, a 1, as opposed to a 0 when event A does not occur. So, for the Bernoulli trial, even though your random experiment may have lots of other outcomes, I am not interested in all that. I am only interested in success or failure, in whether event A occurred or A complement occurred, and one of these will definitely occur every time: either A occurs or A does not occur.

So, in this Bernoulli trial I am interested only in a sample space with two entries; my outcomes are just 2, success or failure, whatever my experiment may be. My experiment may involve a complicated blood collection, and then numerous tests, and numerous results may come, and finally they may say: yes, this person has the disease, or this person does not have the disease.

My interest in a Bernoulli trial is only in that success or failure; I am not interested in anything else that you might have done in the actual test, and my probability of success, the probability that the event of interest occurred, is p. So, instead of saying success, failure and all these things, one can simplify the sample space to just {0, 1}, where the probability of 1 is p and the probability of 0 is 1 − p.

So, this particular distribution, with sample space {0, 1}, probability p associated with 1, and probability 1 − p associated with 0, is called the Bernoulli(p) distribution. You can see quite easily how this becomes a very nice framework for things like the incidence of a disease, picking a random person and testing. So this is a single Bernoulli trial, and as you can expect, we are going to repeat it multiple times independently and see what happens. Let us look at that.
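A single Bernoulli(p) trial is a one-liner to simulate; here is a small sketch (the value p = 0.2, the seed and the sample count are my own choices for illustration):

```python
import random

def bernoulli_trial(p, rng):
    """One Bernoulli(p) trial: returns 1 (success) with probability p, else 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(7)
p = 0.2
samples = [bernoulli_trial(p, rng) for _ in range(100_000)]
# The empirical frequency of 1s should be close to p.
print(sum(samples) / len(samples))
```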

(Refer Slide Time: 07:06)


So, we are now going to consider repeated Bernoulli trials, and the setting is similar to the previous one: you have an experiment which you do, there is an outcome of interest, and you have just success and failure in mind in your sample space. You do it once, and you have success with probability p; that much you know.

Now, you can repeat this multiple times, say n times. Any time you have these independent trials, there are two parameters: one is the probability of success p, and the other is the number of times the trial is repeated independently, which is n. So n times you repeat the single Bernoulli(p) trial independently; that is the setting, so you keep repeating it.

And now your outcomes are many more. If you did it only once, your outcome was either 0 or 1; now you have n trials, and in each trial you could get 0 or 1. There are two possibilities for each trial and there are n of them, so 2 × 2 × ⋯ × 2, n times, which is 2ⁿ outcomes. So your outcomes are much larger in number; let us look at some simple examples first.

Supposing you look at n equals 3. If you did it only 3 times, your sample space will have outcomes 0 0 0 (3 failures), 0 0 1 (two failures and a success), 0 1 0 (failure, success, failure), and so on, till 1 1 1 (success, success, success). You could have n equals 4, in which case your sample space keeps growing, from 0 0 0 0 to 1 1 1 1; those possibilities are listed here.
To compute the probability of a particular outcome, you can use independence and write down the expression quite easily. I have shown you some examples here to illustrate how it is done in general, and you can quite easily extrapolate. Let us look at n equals 3. If you want to find the probability of 0 0 0, that is the probability that I got a 0 in trial 1 and a 0 in trial 2 and a 0 in trial 3, and, importantly, these 3 trials are independent.

So, you can just multiply the probability of each of these things: you get (1 − p) × (1 − p) × (1 − p), that is (1 − p)³. So, given the value for p, you can find the probability of 0 0 0. And not just 0 0 0: for any other outcome, say 1 0 1, I have shown you here it works out as p² × (1 − p), and you can take anything else.

So, let us say I take this on the pen, and maybe I should pick up blue here. Say I want the probability of 1 1 0 1: that is going to be p × p × (1 − p) × p, a 1 in the first trial, 1 in the second trial, 0 in the third trial and 1 in the fourth trial, so that is just p³ × (1 − p). Like that, given any value for n and any value for p, I can find the probability of every single outcome in my sample space, is it not, just by using the rules of probability, the rules for combining events.
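The per-outcome products above are easy to code up; a small sketch (the value p = 1/4 is only an assumed illustration, not from the lecture):

```python
from fractions import Fraction

def outcome_probability(bits, p):
    """Probability of one particular outcome string of independent
    Bernoulli(p) trials: a factor p per 1 and (1 - p) per 0."""
    prob = Fraction(1)
    for b in bits:
        prob *= p if b == "1" else 1 - p
    return prob

p = Fraction(1, 4)
print(outcome_probability("000", p))    # (1 - p)^3
print(outcome_probability("101", p))    # p^2 (1 - p)
print(outcome_probability("1101", p))   # p^3 (1 - p)
```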
(Refer Slide Time: 10:13)

So here is a very elaborate experiment, putting out everything in good detail for you. Supposing you toss a fair coin five times. Now, this fair coin actually makes things very easy for you. You have to say what success is and what failure is; let us say, for my fair coin, heads is failure and tails is success, so T is 1 and H is 0.

Let us say I say that; you can also reverse it, H is 1 and T is 0, that is up to you, it does not matter much. And the coin is fair, so my sample space will end up having 32 equally likely outcomes; the fair coin is very, very important here. You can go back and look at the examples: if you have H H H H H, that is 1/32, half into half into half into half into half, since these are all independent trials, so you get 1/32.

Any other combination will also give you 1/32 each, since it is 1/2 × 1/2 × 1/2 × 1/2 × 1/2. So the sample space has 32 equally likely outcomes, and you can go ahead and start computing various interesting
probabilities. So notice, instead of just looking at every outcome, I can ask interesting questions like: what is the probability that when I toss a fair coin 5 times, when I do 5 independent Bernoulli(1/2) trials, I will get 0 tails?
No tail at all, that would be H H H H H, is 1/32, and I might be interested in 1 tail. Remember, I am interested in the number of successes; that is something of interest to me. So maybe I am interested in the probability that I have 1 tail. If I have 1 tail, then there are just 5 outcomes that are favourable to me: the tail could occur in the first toss, or in the second toss, and so on; everything else has to be heads.

Exactly 1 tail is what I am asking, for example H H T H H. Out of 5 tosses, exactly 1 needs to be a tail, and that can happen in 5 different ways, so you have 5/32 as the probability of 1 tail. Notice these events involve all 5 tosses, and I can come up with very interesting questions like 1 tail, 0 tails, 2 tails, 3 tails, et cetera. I have listed out all the possibilities here; you notice that for the probability of 2 tails you have 10/32.

There are 10 outcomes that are favourable among the 32. 3 tails is 10/32, 4 tails is again 5/32, and 5 tails is 1/32. So these are the various values that the number of tails can take when you toss a fair coin 5 times. That was just a little elaborate example.

(Refer Slide Time: 12:44)

Now notice, particularly for the uniform distribution, you have this very simple way of counting how many outcomes have a given number of tails: it is just 5C0, 5C1, 5C2, et cetera. For example, out of 5 tosses exactly 2 being tails can happen in 5C2 ways, and since I have a uniform distribution, I can very easily compute with this. And notice how 0 tails corresponds to five heads, and all that.

So all this you can do very easily, and then you can look at a lot of interesting events. I may want the probability of at least 4 tails. What is at least 4 tails? 4 tails or 5 tails, so that is the probability of 4 tails plus the probability of 5 tails. And what rule am I using here? I am using the OR rule, which is the disjoint events rule. So you add the probabilities here, and you get 6/32. You may also be asked questions like the probability of at most 3 heads.

So, what is at most 3 heads? 0 or 1 or 2 or 3 heads; you add up the probability of each of them, and you get 26/32. So you notice here that, particularly when you have a fair coin and you are asking all these interesting questions, working with the distribution is just like calculating probability in a uniform distribution: favourable outcomes divided by the total number of outcomes.

And then you use your familiar AND and OR rules: the intersection, using independence, and the OR, which is for disjoint events, where you add them up, and you get your answer. So that is all about the calculations with the uniform distribution when you toss a fair coin 5 times.
calculations with uniform distribution when you toss a fair coin 5 times.

(Refer Slide Time: 14:15)


Now, in general the value of p may be different. If the value of p is different and you perform n independent trials, you can do a similar calculation, but the calculation quickly becomes a little different. Let me show you a few examples. Here is an example of n independent Bernoulli(p) trials with n equals 10.

You are given a particular sequence 0110001011; then you just multiply (1 − p), p, p, (1 − p), (1 − p), (1 − p), p, (1 − p), p, p, which equals p⁵ (1 − p)⁵. Put in the value of p and you get your answer. So even though the number of trials may be large, and even though the value of p might be different from half, finding the probability of a particular outcome, as long as you have it listed out, is easy to do; it is not very hard.

So, you put it in, you get the answer. In general, I have given the expression here. If you have b₁ to bₙ as that string of 0s and 1s, all you need to do is find w, the number of ones in b₁ to bₙ. I do not care about anything else, only the number of ones; the actual sequence does not matter for the probability because the trials are all independent.

I am going to multiply the factors together, and in whichever order I multiply them I will get the same answer at the end. So the probability of a particular sequence depends only on the number of ones in the sequence, and that is pʷ (1 − p)ⁿ⁻ʷ. I have given you a more elaborate example: if you have 200 repetitions, it does not matter, as long as you know how many ones there are. If you have 36 ones, then it is p³⁶ (1 − p)¹⁶⁴. So computing the probability of a particular outcome when you repeat Bernoulli trials is easy even in the general case.
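The claim that only the number of ones matters can be sanity-checked by comparing the direct product against the closed form pʷ (1 − p)ⁿ⁻ʷ; a sketch with an assumed p = 0.1:

```python
def direct_product(bits, p):
    """Multiply out the per-trial factors in sequence order."""
    prob = 1.0
    for b in bits:
        prob *= p if b == "1" else 1 - p
    return prob

def closed_form(bits, p):
    """p^w (1 - p)^(n - w), where w is the number of ones."""
    n, w = len(bits), bits.count("1")
    return p**w * (1 - p) ** (n - w)

bits = "0110001011"   # the n = 10 example above, with w = 5 ones
p = 0.1
print(direct_product(bits, p))
print(closed_form(bits, p))   # same value: the order of 0s and 1s is irrelevant
```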

(Refer Slide Time: 16:05)


Here is a more elaborate example of the same thing, but I want to drive home the point that it is not very hard; you just have to write it down carefully, make sure you account for all the cases, and you will get the right answer. So here is a case where I am tossing a biased coin 5 times; previously we looked at a fair coin.

There, everything was half-half, everything was uniform, and it seemed easy to compute. What if it is biased, what if it is not half; what do you do? Say the probability of heads is 1/3 and the probability of tails is 2/3. Even then my calculation is not very hard. Why is that? If you look at the probability of H H H H H, that is simply (1/3)⁵.

T T T T T is (2/3)⁵. What about the probability of 1 tail? I have 5 different outcomes, and every outcome has a probability of (2/3) × (1/3)⁴. Why is that? Supposing you compute P(T H H H H): that is simply (2/3) × (1/3) × (1/3) × (1/3) × (1/3), really simple, so we have (2/3) × (1/3)⁴, and this guy is also going to give you the same number.

This guy is also going to give you the same number, and this one, and this one, and all of these 5 guys are disjoint. Notice, if you have a tail followed by 4 heads, it is definitely not the same outcome as any of the other cases. So you can add the probabilities of these things, and you will get 5 × (2/3) × (1/3)⁴.
Notice how the answer is slightly different from the uniform case. In the uniform case you simply got 5/32; it was very easy to just add up the outcomes. Here you have to add the probability of each of the outcomes, and luckily, as long as you have 1 tail, the probability of a sequence with a single tail is always the same, (2/3) × (1/3)⁴, so you do not really have to change the probability for each of these outcomes.

It is just (2/3) × (1/3)⁴; you add up 5 of them, so you get 5 × (2/3) × (1/3)⁴. Hopefully it is clear how I got the probability of 1 tail in the biased case to be something like this. Now what about 2 tails? If you look at 2 tails, you will see that the probability of each of these guys is actually (2/3)² × (1/3)³.

Why? Because each has 2 tails and 3 heads. And each of these outcomes is disjoint from the others, in the sense that if the first sequence happened, the second sequence did not happen. So you can add up all the probabilities (note the count here has got to be 10, not 5) and you will get 10 × (2/3)² × (1/3)³. Notice there are 10 of these cases which have 2 tails out of the 5 tosses that you have.
× (3)3. Notice there are 10 of these cases which have 2 tails, out of the 5 tosses that you have.

So, notice what has changed between the fair coin and the biased coin: the counts 5 and 10 and so on all remain the same; what changes is the probability of an individual outcome with a fixed number of tails. Previously it was all 1/32, which was very easy; here the probabilities change to something else depending on p. So it is easy enough, except that you have to pay attention to how it is changing. You can repeat the calculation in the same way.
So, you can repeat the calculation in the same way.
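Since the slide's list of sequences is not reproduced here, the whole biased-coin calculation can be checked by brute force in Python (the language used for simulations in this course). This is a sketch I am adding, assuming P(tails) = 2/3 and P(heads) = 1/3 as in the example:

```python
from itertools import product
from fractions import Fraction

p_tail = Fraction(2, 3)  # biased coin from the example
p_head = Fraction(1, 3)

def prob_of_sequence(seq):
    """Probability of one outcome sequence, multiplying because tosses are independent."""
    prob = Fraction(1)
    for toss in seq:
        prob *= p_tail if toss == "T" else p_head
    return prob

def prob_k_tails(k, n=5):
    """Add the probabilities of the disjoint outcomes with exactly k tails."""
    return sum(prob_of_sequence(seq)
               for seq in product("TH", repeat=n)
               if seq.count("T") == k)

# Matches the hand calculation: 5 x (2/3) x (1/3)^4 and 10 x (2/3)^2 x (1/3)^3.
assert prob_k_tails(1) == 5 * p_tail * p_head**4
assert prob_k_tails(2) == 10 * p_tail**2 * p_head**3
print(prob_k_tails(1), prob_k_tails(2))
```

Using exact fractions avoids floating-point noise, so the enumeration and the formula can be compared with `==`.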

So, if you can generalise this. So, we will sort of generalise this soon enough, we will look at the
general case of Bernoulli p repeated n times and then we will ask this question of how many
successes you have and all that, so we will do that, so this kind of calculation is central to the
whole thing. And you can see how this calculation is coming about it just uses independence and
disjoint event situation, so you just add it up you get the probability.

(Refer Slide Time: 20:01)


So, I want to go back to the example I started this little lecture with. I started with this incidence
of a disease, so let us redefine that a little bit. So, let us say a fraction of the people in a city have a
disease and I am doing this trial where I select a person at random and test for the disease. So, maybe
we could make this assumption: we can assume that the probability that a person tests positive is
p.

Can you assume p is the same as the fraction of the people with the disease? When do you think it is a good
assumption? Think about it. It is not too bad if you pick the people at random, do not cluster
yourself up anywhere, you pick them up at random in different places; then it is probably a pretty
good assumption. Now we repeat the trial n times, and if you assume each trial is independent, you
have repeated independent Bernoulli p trials.

So, now we are closer to modelling the incidence of a disease in this framework. The outcome
is the number of times the test was positive, and we are interested in how to think of the number of
times the test was positive in terms of n and p. Just like in the previous example, where we were
interested in the number of tails for whatever reason, you see that naturally this question of the
number of successes in n independent Bernoulli p trials arises, and that is what is called the
binomial distribution. We will go to that next.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture – 1.15
Events and Probabilities: Binomial Distribution

Hello and welcome. We have been looking at repeated independent Bernoulli trials, n independent
trials of Bernoulli p. So now we are going to look at what is called the binomial
distribution, and this is very important as well.

(Refer Slide Time: 00:27)

So, here is the definition of binomial(n, p), which is called the binomial distribution. There are 2
parameters in the binomial distribution: n is the number of independent Bernoulli trials
you performed, and p is the probability that a Bernoulli trial results in a success.

And the outcome we are interested in is B(n, p), which we will use to denote the binomial
distribution in some sense. This B(n, p) is the number of successes. I am not interested in anything
else; I have done these n independent Bernoulli p trials, and the only outcome I am interested in now
is the number of successes I got.

So, if you think of the sample space now, what are the various different possible outcomes? You
could have 0 successes, 1 success, 2 successes, 3 successes, up to n successes. So my sample space is just
{0, 1, ..., n}, and here is an example. So, let us immediately begin with an example, and an interesting
simple example is to look at n equals 3.

And B could be 0 or 1 or 2 or 3. And what is the probability that B equals 0, what is the
event B equals 0? There are 0 successes when I did 3 Bernoulli p trials. So, the trials should result in
000, and that has probability (1 − p)^3. We saw this before. Now, what is the probability that B
equals 1? Now, the trials can result in either 001 or 010 or 100.

And these are 3 disjoint possibilities; find the probability of each and add, and you get
3p × (1 − p)^2. So likewise, probability of B equals 2 gives you 3p^2 × (1 − p), and probability
of B equals 3 gives you p^3. So, there is a bug here on the slide: for B equals 3 this should be all 1s, it is not
000. So, that is the binomial distribution; for small n it seems very simple.

(Refer Slide Time: 02:28)

Let us look at one more example, maybe binomial(5, p). So, here I have done 5 trials of
Bernoulli p and I am interested in B, which is the number of successes in 5 Bernoulli p trials. So,
probability of B equals 0 is just 1 outcome, 00000, and that is (1 − p)^5. Now, B equals 1 is
5p × (1 − p)^4.

So, I have so many bugs in these slides; hopefully I am correcting all of them. So, (1 − p)^4, so
probability of B equals 1 is that. B equals 2 will be 10p^2 × (1 − p)^3, 10 p squared into 1 minus p
power 3. So, you can add up all the different possibilities with which you can have 2 successes
in 5 trials.

Same way, 3 successes in 5 trials: the trials result in 3 ones, and that is the number of favourable
outcomes into p^3 × (1 − p)^2. Why is that? Because each of these favourable outcomes
has the same probability p^3 × (1 − p)^2, so I can simply add up all of them, so I get just
5 choose 3, that is 10.

So, you notice how the 5 choose 2 and 5 choose 3 enter the picture. Because I am only interested
in the number of ones that are there in the sequence, and any of these sequences with that many
ones has the same probability; that is the nice property of these repeated Bernoulli trials,
and I can use that, and I get this 10p^3 × (1 − p)^2 quite easily. And then for 4
successes in 5 trials I get 5C4, which is 5, times p^4 × (1 − p), and then p^5 for 5 successes. Very nice, is it not?

So, you can check, for instance, that the probability of B equals 0, plus the probability B
equals 1, plus the probability B equals 2, and so on, should result in 1, should it not? It is just a sample
space with a probability distribution, and if I add up the probabilities of all the values I should get 1, and you can
check that all these numbers add up to 1. It may not be a very easy thing to check if you have not
seen it before, but you can add all these guys and it will add up to 1. It is an interesting little property also.
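That these six probabilities add up to 1 is easy to verify with exact arithmetic. A small check I am adding, not something from the slides:

```python
from fractions import Fraction
from math import comb

# P(B(5, p) = k) = 5Ck * p^k * (1-p)^(5-k); the six values must sum to 1.
for p in (Fraction(1, 2), Fraction(1, 3), Fraction(9, 10)):
    total = sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(6))
    assert total == 1, p
print("checked: binomial(5, p) probabilities sum to 1")
```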
(Refer Slide Time: 04:30)

So, let us see a few more of these cases; we saw just 1 or 2 cases. So, let us look a little bit
more elaborately at n equals 1, 2, 3, 4, 5, 6, so that you sort of see a pattern and get used to
how this is going to work out. So, if you look at B equals 0 to B equals 6: for instance, n equals 5
we just saw, I just wrote down all that you would get.

For n equals 6 you can see that (1 − p)^6 will be the probability of B equals 0. Probability of B
equals 1 would be 6 × p × (1 − p)^5. B equals 1 means 1 success in 6 trials; that can happen in
6 different ways, and the probability of each such thing is p × (1 − p)^5 because it has 1 success
and 5 failures.

Whatever order they occur in, each of those outcomes has probability p × (1 − p)^5, because you
can multiply all of them, and then how many of them do you have? 6, so you add 6 of them. Likewise, for
B equals 2, 6 choose 2 ends up being 15, so you get 15 × p^2 × (1 − p)^4. Likewise, for
B equals 3 you have 20, for B equals 4 you have 15 again, for 5 you have 6, and then finally you have
p^6.
So, you notice the pattern, so I can easily do n equals 7 too. So, you should be able to go for small
n very easily with a table like this. So, that gives you a picture of what happens in a binomial
distribution for small n and p.
(Refer Slide Time: 05:54)

What happens when n goes large? When n goes large you have to work with the general expression.
And the general expression is what I have put down here: the probability that B(n, p) equals k is
nCk × p^k × (1 − p)^(n−k). So, this comes out just because every outcome with
exactly k ones has the same probability.

If you have an outcome sequence which has k ones, you know, k successes out of n, the probability
is p^k × (1 − p)^(n−k). Wherever the ones occur, since you multiply all of them together, you have the
same probability. So, it only matters how many of these sequences have k ones, and that is n choose
k.

You choose the k positions and make them all 1, so that is n!/(k! × (n−k)!). So, that is nCk, and you see
the probability of B(n, p) equals k is this very, very famous formula for the binomial distribution,
nCk × p^k × (1 − p)^(n−k). We have a formula; how does it look, how do you visualize it?
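The formula drops straight into code; Python's built-in `math.comb` gives nCk. As a sketch, here it reproduces the 1, 6, 15, 20, 15, 6, 1 pattern of the n = 6 row from the table above:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(B(n, p) = k) = nCk * p^k * (1-p)^(n-k), the formula from the lecture."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The n = 6, p = 1/2 row: coefficients 1, 6, 15, 20, 15, 6, 1 over 2^6.
row = [binomial_pmf(6, 0.5, k) for k in range(7)]
print([round(x * 64) for x in row])  # → [1, 6, 15, 20, 15, 6, 1]
```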
(Refer Slide Time: 06:52)

You can make these plots; I used some software to create these plots, and here are plots for binomial(n, 0.5).
So, this is the plot of that for n equals 5, for n equals 10, for n equals 20 and
for n equals 100. So, here are plots of binomial(n, 0.5), so you can see how I
have done this.

So, you can see what I have done here: I have just plotted, for
k equals 0, k equals 1, k equals 2, 3, 4, 5, the probability that B equals that k. So, if this is 2,
this is the probability that a binomial with 5 trials and probability half
of success will give you 2 successes, and similarly for 3 successes and 4 and 0.

Then you go to 10, and you see 5 successes, 4 successes, 6 successes, and so on. So, you see that
this is how it works. And as you go to n equals 20 you can see there is a peak at 10: it increases and
then decreases again. And when you go to n equals 100, it also has that similar behaviour.

So, even though the plot seems to start at 0, it is not really 0; it is (1/2)^100.
So, (1/2)^100 is really, really, really tiny, so you do not see it here. It is not exactly 0, it is just a very
small value, and it goes up to a peak which is attained at 50. So, this is how a binomial distribution
would look.
As you keep increasing n, and this is meaningful because in practice you are going to do hundreds
of trials, these kinds of things matter, and you should get a good grasp of how this
distribution looks.
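If you do not have the plotting software at hand, a crude text bar chart shows the same shape; the `sketch` helper and the bar scaling here are my own choices, not from the lecture:

```python
from math import comb

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sketch(n, p):
    """Print a rough text bar chart of the binomial(n, p) pmf."""
    print(f"binomial({n}, {p})")
    for k in range(n + 1):
        print(f"{k:3d} " + "#" * round(40 * binomial_pmf(n, p, k)))

sketch(5, 0.5)

# The peak sits in the middle, as in the lecture's n = 20 plot:
peak = max(range(21), key=lambda k: binomial_pmf(20, 0.5, k))
print("peak of binomial(20, 0.5) is at", peak)  # → 10
```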

(Refer Slide Time: 08:29)

So here is a biased distribution: instead of picking p as 0.5 all the time, let us say we pick
p as 0.25, and again, maybe n is not listed here, so let me just write that
down. So, this is again n equals 5, I have put that down; this is n equals 10, this is n equals 20 and
this is n equals 100.

So, the n equals 100 case maybe is a bit interesting, if you want to look at it: it starts at a very
small value, then goes up to a peak around 25, and then falls down again. So,
it is interesting that the shape sort of shifts depending on p, and it seems to have interesting
behaviour.

So, you can plot these things using mathematical tools even though the computation
might look very complicated. So, if you look at 100C25 × p^25 × (1 − p)^75, it is a very complicated
calculation, but it can be done, and it comes to about 0.09 or something like that. So, it is good to see
how this calculation works. So, this is how the binomial distribution looks.
(Refer Slide Time: 09:40)

Here are some observations: it starts at (1 − p)^n, then increases, reaches a peak, and falls to p^n
again. It starts at (1 − p)^n, which as n increases is going to be a very low value; it goes to a peak and
then falls again to p^n, which is also going to be very low.

Where is the peak? It is roughly around np; it sort of makes sense why the peak should be
around np. There is an exact characterization of where exactly the peak will be, and I will leave it
for you as an exercise if you are interested. Now, the number of successes has to be either
0 or 1 or 2 or 3 or up to n, and that has to happen with probability 1.

So, the probability of that is equal to 1. Now these are all disjoint events, so you add up all their
probabilities and you should get 1, and that gives you this wonderful binomial identity; it is a very well-
known identity: (1 − p)^n + nC1 × p × (1 − p)^(n−1) + nC2 × p^2 × (1 − p)^(n−2) + ...

Likewise, if you keep adding all the way up to p^n, you will get 1. So, you can check it for small
values of n; it is an interesting identity to know, and it is quite useful when you want
to calculate things around the binomial distribution. Those are some quick observations about the binomial
distribution.
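Both observations can be checked numerically. For the peak, a commonly stated exact characterization (the exercise hinted at above) is that the mode is floor((n+1)p) when (n+1)p is not an integer; the code below treats that as a claim to verify, not something given in the lecture:

```python
from math import comb, floor

def pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

for n, p in [(20, 0.5), (100, 0.25), (50, 0.3), (10, 0.1)]:
    # Binomial identity: the probabilities over k = 0..n sum to 1.
    assert abs(sum(pmf(n, p, k) for k in range(n + 1)) - 1) < 1e-9
    # Peak location: here the argmax of the pmf equals floor((n+1)*p).
    mode = max(range(n + 1), key=lambda k: pmf(n, p, k))
    assert mode == floor((n + 1) * p), (n, p, mode)
    print(f"n={n}, p={p}: peak at {mode}, close to n*p = {n * p}")
```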
(Refer Slide Time: 10:52)

So, here is a problem; let us try to tackle that. Each person has a disease with probability
0.1, independently, in a big population, let us say. And you pick 100 random persons from this
population and test them for the disease: what is the probability that 20 percent will test positive? Now, there
is a disease and there is a test, and a test will sometimes also throw up some false negatives and things like
that.

For the purposes of this problem we will assume there are no false positives, no false
negatives, nothing like that; this test is just a true test. If somebody has the disease it will come out
positive, and if someone does not have the disease it will come out negative. Believe me, there are no
tests like that, but anyway, we can make that assumption for solving problems.

So, this is a binomial. You see that each test is a Bernoulli 0.1 trial, so maybe I
should just write what it is. So, this is n equals 100 repetitions of Bernoulli 0.1, is it not? So, n
equals 100 and p equals 0.1. And this question is asking, what is the probability that 20 percent will
test positive? That is the probability that B(100, 0.1) equals 20, and that is 100C20 × 0.1^20 × (1 −
0.1)^80. What this number will be is maybe sort of difficult to compute.
But the computer can give you some answers; there are powerful tools, and I will point you
to some tools which can calculate these things, so you can get a number. So, I will leave the
numerical value as an exercise; you can calculate it. So, this is the binomial
distribution, but maybe what is interesting is something for you to think about.
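One way to get the numerical value left as an exercise, using exact rational arithmetic rather than any particular statistics package:

```python
from fractions import Fraction
from math import comb

# P(B(100, 0.1) = 20) = 100C20 * 0.1^20 * 0.9^80, evaluated exactly.
p = Fraction(1, 10)
prob = comb(100, 20) * p**20 * (1 - p)**80
print(float(prob))  # a small number, on the order of 1e-3
```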

(Refer Slide Time: 12:58)

So, instead of knowing 0.1, I may not know the 0.1. So, there is some probability p with
which I am going to assume that a person has this disease; this is
sort of the incidence of the disease in the population, and I am going to pick some hundred people
and test them. And maybe it will turn out that 20 percent are positive; this can happen, is it not?

So, now as a policy planner I am interested in: what is p? What is the value of p? So, assuming
that the test is very accurate, you do not have to bother about the test being inaccurate, what is the
value of p? So, this is an interesting question; I want you to just think about it. You are not ready to
answer this question yet, but these kinds of questions are what is interesting in statistical practice, is it
not?

So, you want to know, how will you decide on p, what makes sense? And it turns out there is
no one answer to this question; you can come up with multiple ways to answer it. You
will get moderately different answers with different properties, and one studies all that in statistical
questions. So, it is an interesting question for you to think about, and it is sort of tightly
related to the binomial distribution, as we will see later in this course.
(Refer Slide Time: 14:13)

Let us come back to the familiar and simple world of repeated Bernoulli trials
and probabilities computed for them. Here is a problem: you have a fair coin and it gets tossed 10 times;
what is the probability that the number of heads is a
multiple of 3?

So, here I have a situation which is a B(10, 0.5), and I am interested in the probability of B(10, 0.5)
belonging to the multiples of 3. So, the multiples of 3 here: you have to remember that 0 is
always there, then 3, 6, 9, and that is it. It cannot go outside of that. So, that is the probability that B equals
0.

So, notice I am going to drop this: this B is actually B(10, 0.5), but I do not need to keep
carrying that around all the time, so I will just drop it; plus B equals 3, plus B equals 6, plus B equals
9. And then, it is a fair coin, so that is just (1/2)^10 + 10C3 × (1/2)^10 + 10C6 × (1/2)^10 + 10C9 × (1/2)^10.

So, if you want, since 2^10 is 1024, you can rewrite this as (1 + 10C3 + 10C6 + 10C9)/2^10; you can write that
down and you will get an answer. So, that is the probability for the (a) part. The (b) part asks you for the
probability that the number of heads is even.

So, you can do it like this if you like, but maybe there is a simpler possibility, I do
not know. So, this belongs to {0, 2, 4, 6, 8, 10}, is it not? So, this is the probability that the
number of heads is even. So, the uniform distribution will come
to our help here: (1 + 10C2 + 10C4 + 10C6 + 10C8 + 10C10)/2^10.

So, that will be the answer here; you can quickly evaluate it if you like, you can simplify this and
write out an answer. So, that is how you use these Bernoulli trials: you cook up
events based on the outcomes, the number of heads, the number of successes, etcetera, and then you
will get the answer.
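Both parts reduce to adding binomial coefficients and dividing by 2^10 = 1024, so they are quick to evaluate:

```python
from math import comb

# Part (a): number of heads in {0, 3, 6, 9}.
mult3 = sum(comb(10, k) for k in (0, 3, 6, 9)) / 2**10
print(mult3)  # (1 + 120 + 210 + 10) / 1024 = 341/1024

# Part (b): number of heads in {0, 2, 4, 6, 8, 10}.
even = sum(comb(10, k) for k in (0, 2, 4, 6, 8, 10)) / 2**10
print(even)  # the even coefficients sum to 2^9, so this is exactly 1/2
```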

(Refer Slide Time: 17:00)

Here is another question; this is an interesting little question on communicating bits. Let us say Alice
and Bob are talking to each other, and Alice will send a bit to Bob; a bit is either 0 or 1.
So, Alice may send a 0, but it gets flipped when Bob receives it; maybe something went wrong in
the transmission, whatever.
So, that probability is 0.1; every bit may get flipped with probability 0.1. So Alice sends a 0, but
Bob may receive it as 1 with probability 0.1. Of course, he receives it as 0 itself with probability
0.9. So, that is the picture to keep in mind, and Alice is sending 5 bits independently.

What is the probability that at most 2 bits get flipped? So, this is a question which involves a B(5, 0.1),
and it is asking for the probability that this is at most 2, meaning 0, 1 or 2, is it not? So, this is the
probability that B(5, 0.1) is 0, 1 or 2, and that is (1 − 0.1)^5 + 5C1 × 0.1 × (1 − 0.1)^4 +
5C2 × 0.1^2 × (1 − 0.1)^3.

You can go ahead and simplify this if you like; you will get an answer, and this is the answer for this
question. Notice how I came up with the B(5, 0.1): this is a situation where
there are 5 repeated Bernoulli trials and each trial has a success probability of 0.1. What is success for me?

See, whatever I am counting is my success. I am counting here that the bit gets flipped; in reality
this might actually be a failure, I do not want the bit to be flipped, unless Alice
and Bob are fighting or something. But even though you do not want the bits to get flipped,
nevertheless, whatever you are counting is a success for you in this Bernoulli trial, binomial
distribution case.

So, you have to put the success probability as 0.1, and B(5, 0.1) is the binomial under
question. So, the first step usually in these problems is to identify the binomial distribution in
question. The next thing is, what are the values taken by the binomial distribution? Once you do that,
you can just write it down as different possibilities and you will get the answer.

So, the next question is the same thing, but instead of 5 it is 10, so I am going to leave this as an
exercise for you. It is very easy: again, it is the probability that B(10, 0.1) belongs to {0, 1, 2}.
So, you will get here 0.9^10 + 10C1, which is 10, × 0.1 × 0.9^9 + 45 × 0.1^2 × 0.9^8.

So, this is the value; you can compute these 2 things and you can compare them. So, you will see
that if you do 10 trials and the probability of success is 0.1, then {0, 1, 2} has a much
lesser probability of occurring than {0, 1, 2} with 5 trials. You can compute that and see
why that is true, or maybe I am wrong, so check that out and see which is greater; maybe
you can answer that question.

So, that is the binomial distribution, and simple problems involving the binomial distribution.
Like I said, there are always 2 steps: identify the correct parameters for the binomial distribution,
identify the values that the problem is asking you to compute probabilities for, and just add it up to
get the answer. So, that is the binomial for you.
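The two-step recipe, identify the binomial, then add up the probabilities of the values asked for, fits in a few lines; the comparison at the end answers the "which is greater" question posed above:

```python
from math import comb

def prob_at_most_2_flips(n, p=0.1):
    """P(B(n, p) <= 2): at most 2 of n independently sent bits get flipped."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))

p5 = prob_at_most_2_flips(5)    # Alice sends 5 bits
p10 = prob_at_most_2_flips(10)  # Alice sends 10 bits
print(round(p5, 5), round(p10, 5))
assert p5 > p10  # sending more bits makes "at most 2 flips" less likely
```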
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 1.16
Geometric distribution
(Refer Slide Time: 00:13)

So, in the third part of this lecture, we are going to look at what is called the geometric distribution.
This also arises naturally in the context of repeated Bernoulli trials. Let us see how
this works.

(Refer Slide Time: 00:25)


So, here are a couple of simple examples to get you started thinking about this. You
are tossing a coin, let us say a fair coin. You may want to toss it till you get a head. So, head is a
success for you, and you want to keep tossing the coin repeatedly till you get a head. When you get a
head, you stop tossing.

So, it is a slight twist on the Bernoulli trials: you are doing repeated Bernoulli trials, but you do
not know ahead of time how many you will do. So, it is a slightly different setting. But even
though the setting is slightly different, that independent Bernoulli flavour is still there. So, you
can use similar methods, even though the situation is slightly different.

In the binomial case, you had n independent trials, p was the probability, and you knew n and
p beforehand. Here you only know p, and you do not know how many trials you are going to do, but you
are going to repeat till you get a success. It seems like a situation that is pretty interesting, and I am
interested in figuring out how many times I will have to toss this coin. That is my unknown number
of interest; that is my outcome of interest.

My outcome is, how many times will I toss the coin? So, maybe I will have to toss it only once.
What is the probability I will have to toss it only once? I should get heads in the first
toss. That is 1/2; I can get really lucky, I toss, I get heads. Or I might need to toss it twice. If
I have to toss it exactly twice to get heads for the first time, what should happen? Tails in the first
toss and heads in the second toss. These two are independent Bernoulli trials, so it is going to be
(1/2) × (1/2). That is 1/4.

What about 3? When will the number of times you have to toss the coin to get heads for the
very first time be 3? That is P of 3: tails in the first toss, tails in the second toss, and heads
in the third toss. That is 1/8. So likewise, you can even ask for a general k. What is the probability
that you will have to toss it k times to get heads for the first time in the k-th toss?

So, it should be tails in the first one, tails in the second one, and so on, all the way till tails in the (k − 1)-th,
and then heads in the k-th. So, that is (1/2)^k. And does it stop with k? Not really. This k can keep going
and going and going. So, you may get really, really, really unlucky and never get a head;
it can go on and on and on, like a million times you might have to go before you get heads. It can
happen, right?

This is possible, it can happen. It may not happen in practice, you may never see
it, but it can happen; it is not impossible, it is just highly improbable, but we cannot say it
is impossible. So, this is a good example that illustrates this so-called geometric
distribution.
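A quick simulation makes the "it can keep going" point concrete: each run tosses a simulated fair coin until the first head. This is a sketch I am adding; the sample fractions should sit close to 1/2, 1/4, 1/8:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def tosses_until_first_head():
    """Toss a fair coin repeatedly; return how many tosses the first head took."""
    count = 1
    while random.random() < 0.5:  # call this side tails
        count += 1
    return count

samples = [tosses_until_first_head() for _ in range(100_000)]
for k in (1, 2, 3):
    print(k, round(samples.count(k) / len(samples), 3))
print("longest run seen:", max(samples))  # long runs are rare but do occur
```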

(Refer Slide Time: 03:05)


Let me give you one more example. You might have seen games like Ludo and other
dice-throwing games, where a player cannot start till they throw a 1. So, this
is a very common occurrence in many of these board games: you will keep
throwing a die till you get a 1, and until then you cannot start.

So, I am of course interested in finding out how many times I will have to throw
the die till I get a 1. So, here is a very readymade case where a geometric distribution immediately
enters the picture. Here again, you will see there is a slight twist in the calculation also, right? If
you look at P of 1, the probability is 1/6. In the first throw, I should get a 1. That is 1 out of 6 possibilities,
is it not? 1/6.

What about 2? When will you get it in the second throw? In the first throw you should not get a 1, that is 5/6.
And in the second throw, you should get a 1, that is 1/6, so it is (5/6) × (1/6). When will I need 3 throws? What
if I have to find the probability that I will need 3 throws of the die, right? You should not get a 1 in
the first throw, you should not get a 1 in the second throw, and you should get a 1 in the third
throw.

I do not care what else I get; it is just failure for me, (5/6) × (5/6) × (1/6). So, it is (5/6)^2 × (1/6). Likewise, you can go
all the way up to k and ask, what is the probability that I will need k throws, getting a 1
in the k-th throw and no 1 before that? The first time I ever get a 1 is in the k-th throw.

So, it will be (5/6)^(k−1) × (1/6): you should not get a 1, you should not get a 1, you should not get a 1,
and so on, k − 1 times, and then you get a 1. So, it is (5/6)^(k−1) × (1/6). And this
can go on and on and on forever. So, here is another illustration of what is called the geometric
distribution. Having seen two illustrations, let us see the actual definition.

(Refer Slide Time: 04:48)


Now the geometric distribution, unlike the binomial, has only 1 parameter, p. There is no n: the binomial
has n and p, the geometric has just 1 parameter p. And you keep on performing Bernoulli p trials. And
what is the outcome? This is my experiment: I keep repeating this Bernoulli p trial
over and over and over again; I do not stop at all. Of course, I will stop, but I will tell you when I
will stop.

The outcome that I am interested in is the number of trials needed for the first success. So, at that
point, I can stop. And I am interested only in that outcome; whatever tosses come after that, I
am not interested in them. So, I can stop when I get the first success. So, the number of trials needed
for the first success, I will denote as G(p), or G, in short.

So, my sample space is {1, 2, 3, 4, 5, 6, ...}, on and on and on. I have to
warn you, some people start the geometric distribution at 0. They will say, if you do it the first time
and you get it, the value is actually 0. Instead of 1, they will start at 0, and that
is also fine; it is just a convention. You can start at 0 or 1. We are going to start at 1 as far as this
distribution is concerned, because if I define it as the number of trials needed for the first success,
1 is a more natural starting point than 0.

And the probabilities are easy to calculate: if it is a Bernoulli p trial, G equals 1 has probability p, G
equals 2 has probability (1 − p) × p, and G equals 3 is (1 − p)^2 × p, by the same logic. And so G
equals k is (1 − p)^(k−1) × p. So, this is the geometric distribution. And you can see it occurs in
the context of repeated Bernoulli trials, but it is sort of different from the binomial, as you keep doing
it indefinitely till you get the first success.
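The definition translates into a one-line pmf. A sanity check I am adding, assuming nothing beyond the formula above: the probabilities never hit 0, but their running total approaches 1:

```python
def geometric_pmf(p, k):
    """P(G(p) = k) = (1-p)^(k-1) * p, with k = 1, 2, 3, ... as in the lecture."""
    return (1 - p)**(k - 1) * p

# The fair-coin example: P(1) = 1/2, P(2) = 1/4, P(3) = 1/8.
assert [geometric_pmf(0.5, k) for k in (1, 2, 3)] == [0.5, 0.25, 0.125]

# The running total of the first 100 probabilities is almost, but not quite, 1;
# the missing mass is (0.7)^100, which is within floating-point error of 0.
partial = sum(geometric_pmf(0.3, k) for k in range(1, 101))
print(partial)
assert abs(1 - partial) < 1e-12
```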

(Refer Slide Time: 06:35)

How does it look? The problem with the geometric is that, if you want to show something, you
have to stop somewhere, because it never quite goes to 0, does it? It keeps going on and
on and on. So, I have to stop somewhere, and in these plots I am stopping at 50. You can choose to
stop at 500 if you like; you will get a longer plot. But I am stopping at 50, and this is how it looks.

And as you change p, as the probability of success improves, it looks like you need far fewer
trials to get a success, is it not? That is not very unexpected. You can see that for 0.001 it seems like
you will definitely need a lot of trials, but for 0.99, in the first few trials you are going to
get the success without any problem. So, that sort of thing is reflected by this picture.

(Refer Slide Time: 07:22)


So, here are some observations about the geometric distribution. It starts at p and keeps falling.
This is unlike the binomial, where you had up-and-down behaviour; this does not have the up-and-
down behaviour, it starts at p and keeps falling. It keeps on decreasing, but does not really go all the
way to 0 if p is less than 1. Even if p is only a little bit less than 1, it never really reaches 0; it keeps
going to smaller and smaller and smaller values.

And you can find this little probability, which is sometimes very important. Most people ask this
question: what is the probability that G is less than or equal to k? What is the probability that I
will get my first success within the first k trials? So here, if G has to be less than
or equal to k, G could be 1, or 2, or 3, all the way up to k.

So, you can add up all these things, G equals 1, plus G equals 2, and so on; these are disjoint outcomes, so
you add up all of them. And you get the probability p + (1 − p)p + (1 − p)^2 p + ... +
(1 − p)^(k−1) p. For this sum, there is a very nice, important identity; I want you to note it
down, we will use it quite often in calculations. This sum ends up being 1 − (1 − p)^k. So, this
is an identity which you can sort of remember by heart. It comes from summing a geometric
series; it is quite a standard identity. You can sum it up and you will get 1 − (1 − p)^k.

So, another way of recalling this identity: what is the probability of G greater than k? The probability that
G is greater than k is 1 minus the probability that G is less than or equal to k, which
is 1 − (1 − (1 − p)^k) = (1 − p)^k. So, that is a nice little
thing to remember, is it not? To repeat: the probability
that G is less than or equal to k is just the probability that G is equal to 1, or G is equal to 2, and so
on till G equals k.

Now, you can add up all these things; these are disjoint events in my sample space, the way I do
my experiment. So, the probability of G less than or equal to k is p + (1 − p)p + (1 − p)^2 p, and so on
till (1 − p)^(k−1) p. So, this is a very important identity: summing the geometric sequence,
you will get this to be equal to 1 − (1 − p)^k. And you can
ask the very related question, what is the probability that G is going to be greater than k, and
that is 1 minus the probability that G is less than or equal to k, as you can see.

(Refer Slide Time: 10:09)

So, one interesting question you can ask is, what is the probability that G is greater than k? It is a sort
of related question: it is 1 minus the probability that G is less than or equal to k, and you can just plug
this in and you will get (1 − p)^k. So, it is an interesting question. For instance,
for k equals 1, what is the probability that G is greater than 1? It is 1 − p, is it not?
In the first toss you should not get a success, so it is 1 − p; then G has to be
greater than 1. So likewise, it is an interesting identity to remember: the probability of G less than
or equal to k is 1 − (1 − p)^k, and the probability of G greater than k is (1 − p)^k. So, this is the geometric
distribution.
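Both identities are easy to confirm by summing the pmf directly; a small check I am adding, with p = 1/6 as in the die example:

```python
def geometric_pmf(p, k):
    return (1 - p)**(k - 1) * p

p = 1 / 6
for k in (1, 5, 20):
    cdf = sum(geometric_pmf(p, i) for i in range(1, k + 1))
    # P(G <= k) = 1 - (1-p)^k  and  P(G > k) = (1-p)^k.
    assert abs(cdf - (1 - (1 - p)**k)) < 1e-12
    assert abs((1 - cdf) - (1 - p)**k) < 1e-12
print("identities check out for k = 1, 5, 20")
```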
(Refer Slide Time: 10:50)

So, here is an example to calculate this. I introduced this Ludo question. A player needs to
repeatedly throw a die till they get 1. And what is the probability they will need lesser than 6
throws? They will need lesser than 11 throws? They will need lesser than 21 throws and so on. If
you calculate that, you see you get the answer as probability of needing lesser than 6 throws is
probability that G is less than or equal to 5.

This is the probability that G is less than or equal to 10, and this is the probability that G is less than or equal to 20. "Fewer than" is strictly less than; that is how we interpret it. So, you see, the probability of needing fewer than 21 throws works out as 1 − (5/6)^20. So, if you are really, really unlucky, you will need 21 throws or even more; the chance of that happening, (5/6)^20, is around 2 to 3 percent.

And if you are really unlucky someday, it can go on and on, and you will need so many throws before you get a 1. So, that is the die question with geometric probability. Hopefully, this illustrates the kind of computations that you do with the geometric distribution.
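The Ludo numbers can be computed directly from the identity; a short sketch:

```python
# "Fewer than n throws" means G <= n - 1, so with p = 1/6 the
# probability is 1 - (5/6)^(n-1)
for n in (6, 11, 21):
    print(n, "throws:", 1 - (5/6)**(n - 1))

# The complement for 21 throws, (5/6)^20, is about 0.026 -- the
# 2-3 percent chance of a really unlucky streak
print((5/6)**20)
```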
(Refer Slide Time: 12:04)

Here is a small modification of the geometric distribution. This problem is a little different: there are two players playing basketball, taking free throws, that is, shots at the basketball ring. They are throwing balls into the ring trying to get them in. The first person is a 40 percent shooter, as in, his probability of getting it into the basket in one throw is 0.4. For player 2, it is 0.7. Each throw is assumed independent of all the other throws. The two players keep alternating the shooting, and player 1 starts. That is important: player 1 starts.

So, I am going to ask this question: what is the probability that player 1 wins before the third round? That is, either in the first round (first round meaning player 1 shoots, then player 2 shoots), player 1 wins; or in the second round: in the first round both players miss, then player 1 shoots and wins in the second round. Something like that has to happen.

So, if you look at the first question, what are the favorable outcomes? Either the first shot is a 1 for player 1, in which case we just stop; or you can have 0 for player 1, 0 for player 2, and then 1 for player 1. Isn't it? These are the only two possible outcomes. So, this is the chance for player 1 to win before the third round.

So, once player 1 gets a 1, he wins. Anything else we do not have to consider: if player 1 misses and the other player scores, all those result in failures for player 1. So, if player 1 has to win, these are the only two possibilities. What is the probability? The first outcome has probability 0.4, and the second has probability 0.6 × 0.3 × 0.4.

So, the probability is the sum of these two: 0.6 × 0.3 × 0.4 is 0.18 × 0.4 = 0.072, so the total is 0.4 + 0.072 = 0.472. That is the probability that player 1 wins before the third round. Now, look at the second question: what is the probability that player 1 wins at all? Again, let us write down the favorable outcomes.

So, I like this method of writing down the favorable outcomes in problems like this. This is sort of like the geometric distribution, but it is not easy to fit this directly into a geometric distribution and work with it, because there are two players involved and they go back and forth. It is a little more complicated. So, it is better to write down all the favorable outcomes.

If it works out like a geometric series, great, you can easily do it. Otherwise also, you can do it properly. So, what is the chance that player 1 wins? It could be 1 for player 1; or 0 for player 1, 0 for player 2, and then 1 for player 1; or 0 for player 1, 0 for player 2, 0 for player 1, 0 for player 2, and then 1 for player 1. And this can now repeat: the (0 P1, 0 P2) block, instead of repeating just twice, can repeat 3 times, and then you get 1 for player 1.

You understand this notation, right? When I put a 3 here, it means (0 P1, 0 P2) repeats 3 times. It can repeat four times, and then the first player can make the shot, and so on and so forth. This can go on forever. So, the probability here is 0.4, plus 0.6 × 0.3 × 0.4 (the two misses together contribute 0.18, so this term is 0.18 × 0.4), plus 0.18² × 0.4, plus 0.18³ × 0.4, plus 0.18⁴ × 0.4, and so on, going on forever.

Now you may use your familiar GP formula, summing all the way to infinity: you will get 0.4/(1 − 0.18) = 0.4/0.82, which works out to about 0.48 or 0.49. So, the answer keeps improving beyond 0.472 as more rounds are allowed.
So, this is the way to do these problems. You notice, I did something like the geometric distribution, but I have listed out the outcomes and evaluated the probability explicitly in this problem. It is a slight twist; it is not quite geometric. Why is it not geometric? For it to be geometric, instead of 0.18 here, you should have only 0.6, isn't it? You do not have 0.6 here, you have 0.18; it is a slightly different scenario. You can imagine why this has come up: there are two players taking shots, one after the other. So, you have to write down the outcomes carefully and do it. Hopefully, this problem gave you something.
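One way to gain confidence in the 0.4/0.82 answer is to simulate the alternating free throws directly; a minimal sketch (the function and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def player1_wins(p1=0.4, p2=0.7):
    # Players alternate shots until someone scores; player 1 shoots first
    while True:
        if rng.random() < p1:   # player 1 scores and wins
            return True
        if rng.random() < p2:   # player 2 scores and wins
            return False

n = 100_000
estimate = sum(player1_wins() for _ in range(n)) / n
print(estimate)   # should hover near 0.4 / 0.82, about 0.488
```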

This is a small twist to the idea behind geometric problems. If you understood how the geometric distribution was developed, you can also do this problem. It goes a little deeper than just asking you, given a geometric distribution, to compute a probability, which is an easy problem. This is a slightly more difficult problem, but if you carefully write down the outcomes, you will get to the answer as well. Thank you very much. That is the end of this lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 1.17
Events and Probabilities: Probability in Python

Hello, and welcome to this lecture. We are going to do something slightly different: instead of a regular lecture with just computations and derivations, we are going to do some computer simulation of probabilistic experiments. Now, this is a bit different, and maybe a bit challenging for some of you if you have never seen a computer program before.

But hopefully, you have seen some computational thinking and have some experience with the steps of numerical computation there. You are also doing a Python course on the side, so hopefully these two together will help you get over this. We are not going to use any major Python structures or anything like that; we will just use some simple things. And most of the code I will give you; you just have to look at the code and learn how to run it.

We have also been talking about activities in this class, and this will be part of your activity. You will see it is easy enough to do; I will expect some very simple modifications from you, and your instructors will help you with this as well. So, let us get started.

(Refer Slide Time: 01:21)


So, like I said, one can define an experiment. An experiment is some sort of an action, isn't it? You do something and some outcome results. And one can presumably do that in a computer: you can write a computer program that will simulate the experiment, maybe under some realistic conditions. So, this is possible, and what I will give you are simple Python routines for simulating experiments. And interestingly, estimating probabilities from such simulations, using something called Monte Carlo simulation, is a very standard technique today.

You can verify, or actually find by simulation, the probability of an event. So, you have an experiment, you define an event, and you do a computation to find the probability of that event. It turns out you can also simulate it and verify whether the two agree. That is something of very great interest to us. You toss a coin; we have always been saying a fair coin gives heads with probability 0.5. Maybe you should have a way of verifying that this actually has a meaning in real life. That is something we will see later on, both in theory in this class, and it is true in practice also.

So, this Monte Carlo simulation is very important in practice. A lot of quick experiments can be done like this; for complicated experiments, you can actually do simulations and check whether your probabilities of events work out or not. All this code will be available to you as a Python notebook in something called Google Colab. These are things that are very important for you to learn in this program. Knowing a good Python notebook environment where you can go in and do some work is extremely important for a data science program.
Today, whatever sort of learning work you do, you will need to be able to write Python notebooks to explain your work. This has become part and parcel of everything today, not just data science; even in other areas, this is very useful. So, it is a good skill to have, and as part of this course, you can take some first steps towards it. As you go further in this program, you will become more comfortable with using Python in a notebook, in Google Colab or any other notebook environment that you like.

So, with that introduction, let us jump into this Python notebook. I have created a notebook for you in Google Colab; I will show it to you, and it will also be shared with you, so you will have it available. This is basically for simulating these experiments. So, let us see that.
(Refer Slide Time: 03:49)

So, this is my Google Colab Python notebook. You just have to go to the URL that you see here, colab.research.google.com, and this will be like a Drive file for you. You can use the online degree login that you have, go into colab.research.google.com, and create any notebook that you want. This is a notebook that I have created like that, okay?

So, any notebook is made up of what are called cells. You can see the first cell is highlighted; I can go to the second cell, the third cell, cell after cell; everything is a cell. These blocks that you have here, each is a cell. And every cell can have either code or text. You can see here "Code" and "Text", and you can keep adding cells: whenever you go here, you see "+ Code" and "+ Text". You can either add a text cell or a code cell.

So, for instance, in the very beginning, if I want some text here, I can put it in. It is written in a language called Markdown; you can learn that later if you like. Let me just call this "Probability in Python". So, what I type here is in something called Markdown, and it is good for you to learn Markdown as well. You can see some sort of title came up here: "Probability in Python".

So, this is a way to type text. You can see this is also a text cell; I can edit it if I want to, and if I do not want to, I can go back here. And then these are code cells. The ones that look like this, with a slightly different background and foreground, are code. This is all Python code. You can see Python code and text get mixed together: you write some text describing something, and then you write Python code for doing it. That is called a notebook.

So, notebooks are very nice. Any work that you do, you can describe in a notebook very conveniently: you write the text and then you write the code. The code can be in many other languages as well; Python is just convenient, and in most cases people use Python, but you can also have other languages. Now, Google Colab has this "Connect" button. Colab will connect to some virtual machine in the background; when you run the first command, it will do that connection on its own. You can see it is allocating, then it is connecting.

And once it is connected, it will run this first piece of code. So, it is connected, it has given me some RAM and disk, etcetera, and it has run this. Python works with something called packages; you will learn about modules and packages later on. I have imported an important module called NumPy as np. You may not really know the meaning of this yet, but let us go on and see how I am using it.

So first, I have a function for generating uniform outcomes. Suppose you have n outcomes in the sample space and you want to select m outcomes independently, one after the other, uniformly. You need a function for that, and I am defining that function. You can run this. So, you may notice that this uniform(n, m) is something that generates outcomes uniformly: there are n outcomes in the sample space, and it will generate m outcomes for you independently. Supposing, here is an example. Let us see an example; an example is always good.
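The exact body of this routine is not shown in the transcript, but a minimal version with the behaviour described might look like this (assuming NumPy's randint):

```python
import numpy as np

def uniform(n, m):
    # m independent outcomes, each uniformly distributed over {1, ..., n}
    return np.random.randint(1, n + 1, size=m)

print("One toss:", uniform(2, 1))
print("Ten tosses:", uniform(2, 10))
```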

(Refer Slide Time: 7:25)

Suppose you want to toss a coin. There are two outcomes; I will denote heads by number one and tails by number two. I could toss the coin once, I could toss it 10 times, I could toss it 100 times. So, this is a command which prints one toss: uniform(2, 1). Now, what is the 2? The 2 tells me that there are two outcomes in the sample space: the first outcome is number one, the second outcome is number two. And the 1 says I want the experiment to be done once.

And what about this command here, uniform(2, 10)? Again, two outcomes are there, and I want to do the experiment 10 times: toss the coin 10 times and tell me what happened. Here, once again, the 2 denotes that it is a coin, so my outcomes are only two in number, one and two, and I want to toss it 100 times: uniform(2, 100).

So, let us just see what happens if I run this. The first toss resulted in tails. The next 10 tosses resulted in so many ones and so many twos, and the 100 tosses likewise. Once again, number one is my code for heads, and two is my code for tails. It is easier to use one and two in computer programs as opposed to H and T; the translation is not very hard, and this is easy enough.

So, let me just run it once again. I can run it multiple times; you can keep running it and see the outcomes change. When I tossed once, the outcome changed here, and when I tossed 10 times, the outcomes changed as well. Every time I run, I will get different random outcomes.

So, this is what I meant by simulating an experiment. The tossing of a coin is getting simulated inside Python in some way, and you are getting the output here. Hopefully that was clear. Once you have this notebook with you, you can hit this run button and keep getting multiple iterations going.

(Refer Slide Time: 09:25)

The next thing I want to do is throw a die. We have been doing throwing-a-die calculations for so long. So, here is throwing a die: you could throw once, or 10 times, or 100 times, exactly as before. Now notice, all I have changed here is uniform(6, 1): in a die, there are six possible outcomes, all of them equally likely.

So, all I need to do is call uniform(6, 1) and I have my die. It is as simple as that. I can throw the die once, 10 times, 100 times, and off you go; you get the answer. Here the number is the same as the outcome, so I do not need to do any translation. You see the outcomes here: 1, 2, 3, 2, 1, 6, 6, that is possible. Different times you run it, you get different outcomes. So, that is throwing a die, alright? For simple experiments with uniform outcomes, this is very easy.
(Refer Slide Time: 10:25)

Now, the crucial thing we will learn is this Monte Carlo technique: estimating probabilities of events by simulation. I have given here a brief description of what it is. The basic idea is this estimate; I put an equality here, but it is probably not an equality, so maybe I should say "approximately equal", just to get that right.

So, you can see here: if you have an event A and you repeat the experiment N times independently, you keep on repeating it and simply count the number of times the event A occurred, which I will call N_A. Then, if you take the ratio N_A/N, you can show using theory, and also observe in practice, that this is a good approximation for the probability of the event A.

So, this is the basis of Monte Carlo; it is this equation right here. I repeat the experiment N times independently and count the number of times the event A occurred; then N_A/N is roughly the probability of that event A. One can also show that as N becomes larger and larger, this becomes a better and better approximation. And this is a method that is very useful.

So, we are going to do a very basic simulation. In the first simulation, we will evaluate the probability of heads in a coin toss. We know it is 0.5, right? I am expecting my coin to be uniform, so it is 0.5. Notice what I am doing here: I am counting. This is where I count the number of heads, starting with number of heads equal to 0. And "for i in range" is basically repeating a loop 1000 times. Every time I repeat this loop, I generate a coin toss and check if it is equal to heads; that is the part which checks if the coin toss equals heads. So, I repeat the coin toss 1000 times, or however many times I choose.

Then I count the number of heads, divide by 1000, and I get a probability estimate by Monte Carlo. Let us run this: 0.492, 0.513, 0.48, 0.492; it is close to 0.5. You would not get exactly 0.5. What the theory tells you is that as you keep going to 1000, 10,000, 100,000, and so on, this will become closer and closer to 0.5. Of course, in computer simulations there are other things to consider, but anyway, 0.5 is what you have in your head in theory, and you got that in practice using Monte Carlo. This is a very nice result to have.
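Written out, the loop just described might look like this (a sketch in the notebook's style, with 1 coded as heads and 2 as tails):

```python
import numpy as np

number_of_heads = 0                  # variable for storing event occurrences
for i in range(1000):                # repeat the experiment 1000 times
    toss = np.random.randint(1, 3)   # one coin toss: 1 = heads, 2 = tails
    if toss == 1:                    # check whether the event (heads) occurred
        number_of_heads += 1
print(number_of_heads / 1000)        # Monte Carlo estimate, close to 0.5
```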
(Refer Slide Time: 13:17)

So, let us once again do the probability of a die showing a number; maybe you thought the coin toss was too easy. We can modify the Monte Carlo simulation. What I would do for any Monte Carlo is copy this, paste it, and rename "number of heads" to just "number", so that it is neutral enough. Number equal to 0 is the variable for storing the event occurrences, and then my repeat loop runs however many times I want to repeat.

And I do my experiment, check if uniform(6, 1) is equal to 1, and if so, I do number equals number plus 1. So, I expect this fraction to be one by six; I got 0.157, then 0.16, which is pretty good. Now, maybe instead of just the number being 1, I want to find the probability that the number is either 1 or 3. What you can do is store the value of the die, and then simply check: if die == 1 or die == 3, then you increment. Okay? Is that okay?

So, what will the event be here, with die == 1 or die == 3? The event A is {1, 3}, isn't it? So, the probability of A should be 1/3; you should get something close to 0.33. I got 0.333, which I think was just a fluke. Do it once again and you will get something slightly different: 0.321, 0.31, 0.355, all around 0.33. That is what you will keep getting.
So, you see, you can make these small modifications. You always have this structure for Monte Carlo: the repetitions, then you do the experiment. Do not do anything more clever; just do the experiment, and then check the event. That is it. That is Monte Carlo for you. You can copy and paste this for anything else you like: a variable for storing the count, the repetitions, do the experiment, have some logic for checking whether the event happened or not, and then increment the number. This is the way to do it, right?

So, you can change it: I am claiming that if you change this from 1,000 to 10,000 repetitions, you will get smaller fluctuations; it will take a little longer to run. Of course, I then have to divide by 10,000 instead of 1,000, otherwise you will get something wrong. So, you see, you are getting much closer, much more consistent numbers: 0.33, 0.33, it is not changing. You are getting something much closer to 0.33, etcetera. These are things you can verify.

So, this is the essence of Monte Carlo. You can always copy this and paste it somewhere, and then make the changes that you want: change your experiment, change your event condition, and you have got your Monte Carlo going. Just keep doing it and you have Monte Carlo experiments. This is the basic idea.

(Refer Slide Time: 16:45)

Now, let us consider slightly more complicated experiments. I am going to consider three different experiments, which are actually very classic probability experiments. We did not cover the theory for them, so I have put in a small description of the theory; given your background in probability, you will understand it. The first classic problem we will look at is something called the birthday problem.

So, the birthday problem is a very interesting problem. Most people give a wrong answer to this question, and as it turns out, the answer is very surprising, and also very nice and simple. So, let us just see what it is. You have a group of N persons, and the birthday problem asks: what is the chance that some two of them have the same birthday?

So, there are N persons, a lot of people. What is the probability that some two of them share a birthday, that is, have the same birthday? All right. Now, if you think of distributions and probability: if you just pick a random person, it is not unreasonable to think that their birthday is uniformly distributed from 1 to 365.

Of course, in different cultures, people get married only at certain times and have kids only at certain times, and all these things can affect the uniform distribution. But generally, I think it is a reasonable approximation. And can you assume two people's birthdays are independent? Yes, of course; if you pick them randomly enough, far enough apart, it is not a bad assumption. So, these are reasonable assumptions you can make.

So, most people imagine that you should have at least 100 people before you start seeing common birthdays, because the uniform independent assumption seems reasonable. Only when you have a collection of 100 people do you have even a chance of having common birthdays; this is something people naturally assume.

But it turns out, if you have just 23 persons of this sort, with uniformly distributed, independent birthdays, then you have more than a 50 percent chance of two people sharing a birthday. Does it sound believable? It seems a little surprising, but it is actually true; you can calculate this probability.

For that, let us say event A is that some two have the same birthday, and event A complement is that no two have the same birthday. In fact, in this problem, A complement turns out to be the easier thing to calculate. Once again, how do you calculate probabilities of events? You just write them out one after the other and then use the AND rule and the OR rule and combine; there is no other way to do it.

So, A complement is: birthday one is on any date B1. The first person's birthday can be on any date B1; there is no collision if there is only one person. And birthday two is on any date other than B1: the first person had a birthday on B1, and the second person's birthday can be anything other than B1.

What about the third person? Any date other than B1 and B2, and so on, till the last person: birthday N on any date other than B1, B2, up to B(N−1). So, what is the probability for the first one, birthday one on any date? That is just one: 365 by 365, everything is okay.

The next one is any date other than the first person's, so that will be 1 − 1/365. Then 1 − 2/365, and so on, till 1 − (N − 1)/365. These are all conditional probabilities, one after the other; you can multiply all of them together to get the probability of A complement. Is that okay? Hopefully you can convince yourself of this; this is the probability that you will get.

So, it looks like a pretty long expression; even for N equal to 10, there is a lot of multiplication to do. Maybe you can multiply it out by hand, but I am going to be a bit lazy: I am going to write a computer program to do this and verify, both by calculation and by Monte Carlo simulation, that the probability comes out correct.

So, here I am taking 10 persons, N equal to 10, and there is some logic here. The first number I print is the probability from the expression. Why is it this expression? You can check that: I do 1 minus the product of the terms (1 − i/365) for i from 0 to N − 1. This is a short piece of Python code in NumPy for the expression above. Let us compute it.
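Here is a sketch of both the direct calculation and the Monte Carlo check; the event-checking logic is my reconstruction of what the notebook does, not the exact code:

```python
import numpy as np

N = 23  # number of persons

# Theoretical: P(some shared birthday) = 1 - prod_{i=0}^{N-1} (1 - i/365)
p_theory = 1 - np.prod(1 - np.arange(N) / 365)
print(p_theory)              # about 0.507 for N = 23

# Monte Carlo: draw N uniform birthdays and check for a repeat
trials, count = 5000, 0
for _ in range(trials):
    bdays = np.random.randint(1, 366, size=N)
    if len(set(bdays)) < N:  # some two share a birthday
        count += 1
print(count / trials)        # close to the theoretical value
```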

And then I have my Monte Carlo. Remember: repetitions, experiment, event. How I do this here is a little more complicated; I do not want to go into the details of the Python code, but believe me, this is the birthday problem. And if you simulate, you see you get answers which are very, very close.
So, the theoretical calculation that we did and the Monte Carlo experiment that we are running give you the same value: the count divided by 1000 and the theoretical calculation agree roughly. You can change the number of persons; for instance, if you make it 23, you will get my promised 50 percent.

So, you see the theoretical calculation is 50 percent, and Monte Carlo, sure enough, gives you 0.523; you can repeat it if you like. In fact, you may say 50 percent is still luck. But if the number goes to something like 40, you are practically guaranteed a shared birthday: the probability is about 90 percent.

If you have a class of 40 people, two people are going to share a birthday with nearly 90 percent probability. You have to be really unlucky not to have a common birthday even in a class of 40. If you go to 60, this number quickly goes very close to 100: it is about 99 percent. With 60 people, you are virtually guaranteed, unless you are really, really unlucky, or everybody came from a very dependent population. It is very likely that two people share a birthday among 60. So, this is just telling you how Monte Carlo agrees with the calculation done here. All right. So, that is the birthday problem.
(Refer Slide Time: 22:41)

And the next problem is something called the Monty Hall problem; let us look at that next. I have 3 or 4 more problems; I will probably not spend too much time on everything and will let you read them up on your own. But this Monty Hall problem is a classic. It is like a game show: there are three doors, one car, and two goats. There is a car behind one of the doors and goats behind the two other doors, and you do not see what is behind the doors. You are a contestant; you just see the 3 doors.

And you have to pick a door at random; whatever is behind the door you open, you get. For instance, if you got a goat and you like goats, that is a great gift for you, but most people would like the car, isn't it? So, if you are lucky, you will get the car. It looks like, if nothing else happens, the probability of getting the car is one third: one third is the probability that you get the car, two thirds is the probability that you get a goat; just the luck of the draw.

Now, this contest is a little different. You make a choice, but the host will not show you what is behind your door. Instead, he will open another door which has a goat behind it. So, you pick a door; there is something behind it, you do not know what, it could be a car or a goat. But since there are two goats, whatever you pick, there will be some other door with a goat behind it, and the host is going to open that door and show you: here is a goat.

Now you have the choice of switching: you could either stick to your original door or you could switch. So, the question is, what should you do? That is the Monty Hall problem. Should you switch or should you stick with your original door? Most people who do not do a detailed calculation will say it does not matter: it is one third this way or that way, switch or do not switch, it should not matter.

But the fact that the host opened a door and showed you a goat ends up having a significant impact on what you should do. And the surprising thing is, if you choose to switch, your probability of winning goes up to 2/3. You can do a simple calculation; it is not that complicated. But a lot of people have been confused by this, and many people do not believe it. So, it makes sense to do an experiment and verify it.
(Refer Slide Time: 25:22)
So, that is what I have done here. I have put here a little bit of code; I am not going to go into the details, but you can see I am picking a car location and fixing the goat 1 and goat 2 locations, and then the contestant's original choice, again uniformly. Then I am checking whether it is a goat or not, and finally I am counting whether the other closed door equals the car location: that is the case of success for switching, right?

So, you can check this, and you will get 0.669; sure enough, the simulation bears it out, you can see that. Even if you did not believe it, by theory you can show this. There is a wiki page, and I have given a link to it here; you can go there and see why the 2/3 is justified. There are lots of good reasons for it. And if you do not believe it, if you like computer simulations more, here you go: you can simulate, and you see the answer is 2/3. So, that is the Monty Hall problem. It is a nice problem.
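A condensed simulation of the switching strategy (my own version, not the exact notebook code) uses the observation that switching wins exactly when the first pick was a goat, because the host then reveals the other goat and the remaining closed door hides the car:

```python
import numpy as np

trials, wins = 10_000, 0
for _ in range(trials):
    car = np.random.randint(1, 4)     # door hiding the car
    choice = np.random.randint(1, 4)  # contestant's first pick
    # Host opens a goat door; if the first pick was a goat, the
    # remaining closed door must be the car, so switching wins
    if choice != car:
        wins += 1
print(wins / trials)                  # close to 2/3 for the switcher
```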


(Refer Slide Time: 26:18)
Next is something called Polya's urn scheme. I am going to skip this; here again, you will see that it works out, okay?
(Refer Slide Time: 26:24)

And the last experiment I have done is something called a simple random walk, or gambler's ruin. It is a slightly more complicated experiment than what you may be used to, so let us walk through it real quick. A gambler has K units of money: K rupees, or K dollars if you like dollars, whatever the unit. He is going to a casino to play. Do not ask me why there is a casino in India; Goa has lots of casinos, in case you did not know. Anyway.

So, this is the game he plays. If he has any money at all, at least one unit, he is going to toss a coin. If the coin results in heads, the casino pays him one unit. You can imagine the casino has an infinite amount of money; it is not going to go bankrupt, and it can keep paying him as much as he wins.

If he gets a tails, then he loses one unit; he has to give one unit to the casino. Now, of course, if he loses all his money, runs out of cash, he is bankrupt and he stops, right? And if he gets N units of money, he will also stop. So, N is like his goal. Once he gets N units of money, he feels rich enough, he thinks he can go buy something with that money, so he stops. So, this is called the gambler's ruin problem.

In fact, it is a special case of a problem called a simple random walk with two absorbing barriers. There is a barrier at 0, there is a barrier at N. He starts at K and he will keep drifting. He will lose some money, gain some money, moving around here and there. If he ever hits 0, he stops. That is an absorbing barrier, as it is called.

And if he ever hits N, also he stops. This is also an absorbing barrier. So, there are two barriers
and the person keeps drifting around. You can imagine, right? So, there is a toss of coin going on
and it keeps on moving. And the interesting probability to calculate is this probability of
bankruptcy. Probability of bankruptcy is interesting, right?

I want to find out, if I start with K units of money and if I play this game, what is the probability
that I will go bankrupt? Now, you can say a small P is the probability of heads. So, I am allowing
the coin to be biased. So, most casino games, it is not going to be a coin toss which is fair, right?
Most casino games are loaded in favour of the casino. It might be slight loading, but even then, it
is loaded in favour of the casino, right?

So, say P is the probability of heads and Q = 1 − P is the probability of tails, isn't it? So, you can show, I am not showing the details, actually the calculations are briefly shown below and not in great detail, that the probability of bankruptcy if your coin is fair is 1 − K/N. If it is not fair, then there is a more complicated formula involving Q/P: it is ((Q/P)^K − (Q/P)^N) / (1 − (Q/P)^N). This is the formula.

And how do you derive it? There is a wiki page which has some derivations, but actually it is a very simple derivation here; sometimes these can get very complicated. Let us say 𝑋𝑘 is the probability of bankruptcy starting with K units. The main idea is to condition on the first toss; you have to condition on the first toss.

Now, 𝑋𝑘 is the probability of bankruptcy given the first toss is heads, times P, plus the probability of bankruptcy given the first toss is tails, times Q, right? The first toss could be either heads or tails: you either have heads and then go bankrupt, or tails and then go bankrupt, right? So, the conditioning is just an and-or combination.

So, you do the conditioning and you write it out. Now notice, what is this probability of bankruptcy given the first toss is heads? After a first toss of heads, you have K + 1 units; other than the fact that you have K + 1, everything else remains the same. So, what you have is just 𝑋𝑘+1, right? This 𝑋𝑘 is like a nice little function which takes you from K to the probability of bankruptcy, so you just put 𝑋𝑘+1 here.

In the same way, look at the probability of bankruptcy given the first toss is tails. When the first toss is tails, you lost one unit, you went to K − 1, so you have 𝑋𝑘−1. Now, how do you solve these kinds of equations? You need boundary conditions. The boundary conditions are 𝑋0 = 1, right? If you got to 0, then you are definitely bankrupt already, so 𝑋0 is 1. And 𝑋𝑁 = 0, right? If you got to N, then you stopped, so 𝑋𝑁 is 0.

So, if you solve this equation, you will get this. I am not going to show you how the equation is solved; you can take a look at it. And if you plug it in, you will see this is a valid solution, both for the case P = Q = 1/2 and for P not equal to Q, alright? So, there is an expression for the probability of bankruptcy. All right?
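To make the formula concrete, here is a small Python sketch (the function name is my own) that implements the closed-form probability of bankruptcy and checks that it really satisfies the recursion 𝑋𝑘 = P·𝑋𝑘+1 + Q·𝑋𝑘−1 together with the boundary conditions 𝑋0 = 1 and 𝑋𝑁 = 0.

```python
def ruin_prob(K, N, p):
    """Closed-form probability of bankruptcy starting with K units, target N."""
    q = 1 - p
    if abs(p - 0.5) < 1e-12:          # fair coin: 1 - K/N
        return 1 - K / N
    r = q / p
    return (r**K - r**N) / (1 - r**N)

p, N = 0.45, 10
assert ruin_prob(0, N, p) == 1 and ruin_prob(N, N, p) == 0   # boundary conditions
for k in range(1, N):                 # the recursion X_k = p*X_{k+1} + q*X_{k-1}
    lhs = ruin_prob(k, N, p)
    rhs = p * ruin_prob(k + 1, N, p) + (1 - p) * ruin_prob(k - 1, N, p)
    assert abs(lhs - rhs) < 1e-9

print(ruin_prob(5, 10, 0.5))          # fair coin, K = 5, N = 10 -> 0.5
```

Plugging in p = 0.45 gives about 0.73, and p = 0.4 gives about 0.88, which are the numbers we will see again below.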

So, now I have done a Monte Carlo simulation here. I put K = 5, N = 10 and I am printing 1 − K/N. And I am repeating 1000 times. I start with K = 5, and then, while K is greater than 0 and less than N, right, this is where I toss a coin. If it is heads, my K goes up; if it is tails, my K goes down. And that is how I keep repeating it, right?

So, if K is 0 when it ends, I increase the count, or if K is N when it ends, I do not increase the count. This is how this works. Now, some of you who are paying really close attention may ask: what if it never hits 0 or N? What if it keeps on walking in the middle and never hits 0 or N? It turns out that is not possible; you can prove that. In this case, it will either hit 0 or hit N. Given the way it works, it is a more complicated proof, but you can prove that.

But you can see that 1 − K/N and this Monte Carlo probability agree, right? So, I printed 1 − K/N first; I got 0.5. This number also more or less agrees. So, this probably gives you a hint of how, for complicated events also, you can do Monte Carlo. Monte Carlo is so easy to do; you do not need any recursive equation, etcetera. In real life, a lot of applications will have a very complicated sequence of things happening, and to compute the probability you just have to write a Monte Carlo simulation. And for simulating, the only thing we needed was a uniform toss, isn't it? We just wanted a uniform toss. You got it, great.
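The loop just described can be sketched as follows. This is my own reconstruction of what the notebook does, not the lecture's exact code; the function name, seed, and trial count are assumptions.

```python
import random

def ruin_monte_carlo(K=5, N=10, p=0.5, trials=10_000, seed=1):
    """Estimate the probability of bankruptcy by playing the game many times."""
    rng = random.Random(seed)
    bankrupt = 0
    for _ in range(trials):
        k = K
        while 0 < k < N:             # keep playing between the absorbing barriers
            if rng.random() < p:     # heads: the casino pays one unit
                k += 1
            else:                    # tails: give one unit to the casino
                k -= 1
        bankrupt += (k == 0)         # hit the barrier at 0: ruined
    return bankrupt / trials

print(1 - 5 / 10, ruin_monte_carlo())   # theory 1 - K/N = 0.5 vs simulation
```

The same function with a different p handles the biased-coin case discussed next.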
(Refer Slide Time: 32:31)

Now, what about the biased coin? What if P is not half, right? Can you do simulations? So, I put here something that will simulate a biased coin for you. We need a method for tossing a biased coin; so far we had only uniform outcomes. So, I put here biased(P, M). You put in some value for P, and it will toss a biased coin for you M times and give you the answer, alright? That is what biased(P, M) does. Let us run it, get it into our computer, and then you can run it.

So, if P is 0.25, I am printing P, and then I am repeating the biased coin 1000 times and counting the number of heads. Sure enough, 0.25 results both in theory and in practice. So, a biased coin is also something you have. Once you have a biased coin, you can repeat the same gambler's ruin experiment with the biased coin, with 0.45, and you can print the theoretical value and the value that you got from the Monte Carlo simulation. And you can see, sure enough, they agree closely.
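A biased-coin helper along these lines can be sketched like this. It is an illustrative stand-in for the notebook's biased(P, M), not its actual code; the 'H'/'T' return format and the seed are my assumptions.

```python
import random

def biased(p, m, seed=3):
    """Toss a coin with P(heads) = p, m times; return a list of 'H'/'T'."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it falls below p with probability p
    return ['H' if rng.random() < p else 'T' for _ in range(m)]

tosses = biased(0.25, 10_000)
print(0.25, tosses.count('H') / len(tosses))   # theory vs empirical fraction
```

The empirical fraction of heads settles near p as m grows, which is exactly the check done in the lecture.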

So, even if the probability of winning goes down only to 0.45, right, which is just a 5 percent reduction, notice the casino is likely to win, or you are likely to go bankrupt, with 75 percent probability. So, let us push it to 0.4 and see what happens. That is already about 88 percent probability, right? 0.4.

If you go to 0.35, I think it should be really, really close to 100 percent; you have very little chance of getting out alive. So, 0.35, a 15 percent advantage, is literally good enough to wipe you out of all your money in a casino. So once again, take a look at this Python sheet. Maybe you will not understand every line in the program. But eventually, as you progress in this program and the BSc, do the other courses and learn Python, you will see that this is a very useful resource for you to have.

(Refer Slide Time: 34:30)

And what I have for you beyond this is an activity, okay? What is this activity? Let us see. This is the activity for week one. There is only one activity for week one, but hopefully it is complicated enough that it will take time for you. You remember the week 0 activity where you created a Google Sites page for yourself for this course, pulled a data set, described the data set, gave a link to it, etcetera. Hopefully you did that. Now you have an activity for week one.

So, you go through the notebook and understand how to set the various parameters. You do not have to rewrite the code, just learn how to set parameters. And like I told you, take a template and repeat the Monte Carlo. Then, what I want you to do is to pick any experiment that interested you, right? We had so many experiments for which we did calculations of probability; pick something very simple. If you think it is too complicated to make changes in the Python program, you can just throw a die and compute the probability of some event.
So, you can even do something that easy if you like; it is up to you. Let me not tell you what to do. You could try the binomial; you can do so many things, right? Whatever you choose, you modify the Monte Carlo to repeat the same thing. So, you compute the probability of that event in theory, print that value, and then repeat that same experiment in Monte Carlo and verify by simulation whether the probability value matches. Do that for any experiment of your choice, take your own version of the notebook, add that at the bottom, and then put it out on your Google Sites page.

Keep everything private, shared only within the organization; do not keep it public, and show it in your Google Sites page. So, do that, and do create a Colab notebook as well. You create a Colab notebook, copy all these things, add your own little Python Monte Carlo experiment, and verify that the theory matches the practice of computer simulation.

So, that is your activity. Hopefully, you enjoy it. Hopefully, you have seen a quick recap of all that you learned in Stats 1; all of this theory, at least, is something you have already learned in Stats 1. Maybe some of it was a bit different, but mostly it is recap. So, week one is almost entirely recap. From week 2 onwards, we will push ahead with something new; maybe you have seen a little bit of random variables and all that before. We will start with all of that and keep proceeding as we go into week two. Thank you very much.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 2.1
Random variable and events
Hello, and welcome to week 2 of Statistics 2. The first week was largely a recap of basic probability: the foundations of the theory, definitions, and some interesting relationships between data, statistics, probability and things like that. Hopefully, you had a good time reading up things around the week one content. Hopefully, you are now ready to jump on.

And the second week also is sort of a recap. We are going to look at discrete random variables. This is again a topic you have seen before in Statistics 1. So, we will go through it at a reasonably fast pace. But at the same time, hopefully this is a good recap for you, so that all the notation and everything is reasonably well set in this course as well. So, let us get started.

(Refer Slide Time: 01:04)

So, I am going to motivate the idea of a random variable and why you need it and all that. And let us go back to this experimental setting of the IPL powerplay over. So, we have been thinking of one over of an IPL powerplay innings as our experiment. And the outcome itself, if you look at what happens in a powerplay over, there are so many things that can happen.
So, here in this little table, I have shown you the first over of a few games and a few teams that played it. You can see there are several deliveries. Sometimes there are 6 deliveries in the over; sometimes there are 7 deliveries if there is a no ball or a wide or something. And on every delivery, there is a run or there is no run. And when there is no run, there could even be a wicket. So, all that is not necessarily captured.

And you know, the outcomes are of various types; you may want to know how well the batsman hit the ball, this and that. So, there are the actual things that happen, for instance, the speed of the delivery, who is the bowler, who is the batsman. There is much you can ask about what happens during the powerplay over. It is a complex experiment. There are lots of things happening.

And to even write it down in a reasonable way, if you remember from last week, you need something like a YAML file and so on. It is a very complicated thing to write down. What I have put down here is a simplified spreadsheet version of it. But nevertheless, you can see that a lot of things happen during a powerplay over. So, when you are looking at a complicated experiment like this, one always tries to see if you can simplify things a little bit.

So, do not worry about the entire outcome; ask if there is some particular aspect of the outcome you are interested in. For instance, several numerical values you are noticing here are of interest. And it is always good to think of numbers, or some smaller outcome set, as opposed to a much larger, complicated, detailed outcome. Random variables basically allow you to do that. So, these are numerical values which are computed from the outcomes of the experiment.

Typically, people will say numerical values; think of random variables as always taking a number. But you can also have some other finite set: if it is only heads or tails, you might as well have heads and tails as opposed to 0 and 1 or something like that. But one can always associate a numerical value with any such set anyway. So, that way, numerical values are interesting to consider.

But naturally, in experiments like this, several numerical values show up. You see that the number of runs in the over may be of interest to you, the number of actual deliveries in the over may be of interest to you, the number of wickets, the number of boundaries, the number of dot balls. There are some connections between these numbers, but nevertheless, they are all numbers, they are objects of interest, and they can be more readily and easily understood than the entire detail of what happened in the over, which is a more complicated occurrence.

So, keep this picture in mind as we go forward. We will simply describe random variables; we will say, you know, this is a random variable, this is the distribution, this is the PMF, etcetera. But do not lose track of the fact that this random variable is bringing out one aspect of maybe a very complex experiment; you may then want to define different random variables, understand them, etcetera.

So, this is sort of the big picture behind how and why random variables are very, very useful in probability. And you will see, as we go along, most of the time we will be dealing with random variables. I know we spent a lot of time putting down the theory and notions of sample space, events, probability function, etcetera. But when it comes to applications, or even to working things out, most of the real action is with random variables; everybody uses random variables.

So, random variables are an extremely powerful tool. They come in as one aspect of the basic theory, but they end up being very, very powerful. So, it is important to have comfort in dealing with random variables. How to express the distribution of a random variable? What is the range of the random variable? What is the PMF, particularly for discrete random variables? How to work with random variables? How to compute probabilities of events with random variables? How to manipulate random variables? All these are very, very important skills to pick up this week.

So, let us get started. Hopefully, this motivation was clear to you. Let us jump into the definition of a random variable.
(Refer Slide Time: 05:29)

So, here is the definition of a random variable. So, if you remember, we have a probability
space, which is this is an experiment and there are outcomes. There is a sample space, which
contains all the outcomes, the set of all possible outcomes. And typically, this is the definition
of a random variable. A random variable is a function with domain as the sample space of the
experiment and the range is the set of real numbers. So, it takes the outcome to a number.

So, I am saying number; some definitions you will hear in some books will not necessarily ask for a number, they will let it go to some other set also. But in this course, mostly, and most of the time, the random variables of interest always take you to a number. So, we will stick to this definition. There can be a slightly more general definition where the range is not a number but some other set, but I think having it as a number is good enough for most cases, so we will stick to that.

So, random variable is a function from the sample space to the numbers. So, random variable
is going to take an outcome and map it to a number. So, just like the previous example we
saw. So, that is a random variable. There is a technical condition, I will describe that soon
enough, the function needs to satisfy some property, but it is mostly not very important to us.

So, here is an example. I have an experiment, I am tossing a coin, the sample space is H, T,
heads and tails. Now I can have a variety of random variables here, I can have a random
variable X, which maps H to 0 and T to 1. That seems like a natural and simple way to define
the random variable. And that is not necessarily the only random variable, I could come up
with a random variable Y, which takes heads to -10, and tails to 200. Now, why you may do
that? May or may not be relevant here.
As far as the definition is concerned, it is a valid random variable. It is not wrong. Or I can get more fancy and use some really fancy numbers: I can take a random variable Z mapping H to √2, and T to some other fancy number that you can think of. Or you can have a very trivial, silly random variable which maps everything to 0. I can do that as well. So, all of these are random variables. So you see, I am defining various types of random variables from the same sample space.

So, usually, like I said, people would think of meaningful functions; one would not normally use √2 and all that just for the heck of it. So, usually, some meaningful functions are considered. If you ask me, the most meaningful-looking function in this whole thing is probably X, isn't it? One more thing I should warn you: while a random variable is a function from the sample space to the numbers, usually we will forget all that. And we will focus only on the values the random variable takes.

So, this X(H), X(T) and all, people will drop the (H), (T) notation; we will simply write X. So, quite often, the notation will say X = 0, X = 1. You may even forget in your mind that the random variable is actually a function from the sample space to numbers. So, keep that in your mind. It is something that we will also quickly do, but the definition is that it is a function from the sample space to the real line, and you have seen a few examples here.

(Refer Slide Time: 08:42)

Let us continue to a slightly more complicated and maybe more illustrative example, which is throwing a die. So, I am going to throw a die, and the sample space is 1, 2, 3, 4, 5, 6. This is the sample space we have defined here. How do we want to define a random variable from this sample space to the real line? I am going to give you a generic way to do it. What is the generic way to do it?

You pick 6 real numbers of your choice, whatever they are. Let us call them x1, x2, x3, x4, x5, x6. I can simply say, my X will take 1 to x1, 2 to x2, 3 to x3, and so on. That gives me a random variable, isn't it? This is like the most generic random variable mapping to the real line that you can think of. So, what values do you pick for the xi? What values are interesting?

So, technically, you could do anything. From a definition point of view, you do not violate the definition; you can pick any values. But maybe some types of values are interesting to consider. One type of choice is that the xi's could be distinct. What do we mean by distinct? No two of them are the same; they are all different.

So, in that case, this random variable is simply like a one-to-one function from 1, 2, 3, 4, 5, 6 to these other numbers. Essentially, this is the same as the sample space in this case: 1, 2, 3, 4, 5, 6 map to x1, x2, x3, x4, x5, x6. So, you can have a random variable which sort of mirrors the sample space. You may have outcomes which are not numbers, maybe something else, but you can sort of mirror the sample space and map each outcome to its own number. So, you could do that; that could be a choice that you make.

In that case, the random variable sort of represents everything that can happen. So usually, that is not done. Usually, what is done is something like the second example here: the values need not be distinct. I might want to have a random variable which just tells me whether the number is even or odd. In that case, what will I do? I only need two numbers, isn't it? If the outcome was 1 or 3 or 5, I will say this random variable E takes the value 0. If the outcome is 2, or 4, or 6, I will say that this random variable E takes the value 1.

Now this random variable E has a different sort of meaning. It tells me whether the number that came out is even or odd. So, this is not mirroring the sample space fully; it is a proper function of it, a different type of function. Hopefully you see this example; think about it. The way the definition goes, it seems like anything is allowed, but usually people pick something of meaning and assign the values suitably to the outcomes. Hopefully, that was a reasonable example.
(Refer Slide Time: 11:18)

So, now what is crucial is that, using random variables, you can define events. What are events? We saw before: events are subsets of the sample space, subsets to which we associate probability. That is the crucial idea: for every event, you have to be able to associate a probability. So, we define events in that fashion. Events generally are subsets of the sample space, meaningful subsets, subsets that bring out some occurrence of something that is interesting to you. Those are events.

Now, if I have a random variable, I am mapping every outcome to a number. Using that, I can define some interesting events. So notice, we give an example here: I can now look at the event that the random variable X < x. That is an event which is of interest; let me see if I can find my laser pointer.

So, notice this event here. This is the event I am talking about. Notice how it becomes an event. At this point, you may even wonder why this is an event, just having something which says the random variable X < x. The reason is, this is basically representing a set of outcomes.

Remember, a random variable is a function from the sample space S to the real line. If I say now that the random variable X has to take a value less than small x, then the outcome s that occurred must be such that X(s) < x, isn't it? One way to visualize this, if you want a visualization, is the following; maybe I will change my color to blue.

So, let us say you have the real line. This is the real line. And in the example here, you could have the random variable value X(1) here, X(2), and so on; you may have X(6) here. And I may look at an x here. If I say X < x, I am talking about everything to the left of x. And whatever outcomes are captured here are the ones that belong to this event, X < x. So, that is exactly what is given here.

The set X < x, the event X < x, is actually a subset of the sample space. It contains all outcomes which result in the value of the random variable X being less than that x. And this has to be an event. You remember there was a condition I mentioned that the random variable, the function, should satisfy; this is sort of like that technical condition.

So, let me not go into more detail than that; this is enough. This is a very nice and important connection. I kept telling you that, eventually, we will come to a point where we will only worry about the random variable; we would not go back to the sample space and too much of that. Usually, it is not done. This is the reason why people do not bother too much about doing other things.

The reason is, once you define a random variable which is meaningful, using that random variable you can define events. And using those, you can take care of everything that you need to take care of in the probability space. Any calculation you can do, you can do in the language of the random variable. So, it is a very powerful notion.

So, notice now, if X < x is an event, then the complement of that is also an event. What is the complement of that? X ≥ x, isn't it? And then you can think of unions, intersections, etcetera, and all these things will also be events. These are not very difficult to see, and you will notice that all these are basically different ways of defining events using random variables.

So, you take a random variable, you pick a small x, and you can do this. In fact, you can do a little bit more; you can even have events of this type: you could have things like x1 < X < x2. This is an event. So, you have so many other possibilities here for events. Once you have a random variable, limiting the range of values of the random variable to a particular interval or a subset or anything else gives you an event. That is what is nice about random variables and the connection they give to events.

So, once you understand the distribution of the random variable well, you can keep finding probabilities of events. Keep that in mind. So, here is an example. Once again, I have given you an example where the random variable is sort of like the identity function in throwing a die: X is simply the number that shows up. You can see X = 1 is the event that the die shows 1, X < 4 is the event {1, 2, 3}. Any event can be expressed in terms of X. Notice this is a crucial property.

Because this X is like an identity, it is a mirror of the sample space. Any event that you possibly have in the sample space, you can define using X in this fashion. For instance, here is a trivial way to do it. It is sort of a trivial thing to say, but it is important. That is why you can worry about random variables and little else; this is really all that is needed most of the time. So, this is a simple example.
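For a fair die, this connection between a random variable and its events can be sketched directly in Python; this is my own illustration, not code from the lecture.

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]
X = {s: s for s in sample_space}     # the identity random variable

# The event {X < 4} is the set of outcomes s with X(s) < 4
event = sorted(s for s in sample_space if X[s] < 4)

# For a fair die every outcome is equally likely, so probability = |event| / |S|
prob = Fraction(len(event), len(sample_space))

print(event, prob)   # [1, 2, 3] 1/2
```

Swapping the condition `X[s] < 4` for any other condition on X gives the corresponding event and its probability in the same way.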

(Refer Slide Time: 17:18)

Here is a slightly different example, where we use this random variable E that we defined on the throwing of the die. On 1, 2, 3, 4, 5, 6, I had this random variable E which, for 2, 4, and 6, takes the value 1, and for 1, 3, 5, takes the value 0, isn't it?

So, if you want to picture the real line, once again, it is good to picture it on the real line.
Picturing random variables on the real line will give you some idea. So, there is 0 and there is
1. This is E(2), E(4), E(6). And this is E(1), E(3), E(5). So, I am writing all this out in detail.
But you know, eventually you do not have to write it, you will sort of see how things work
out.

So, E = 0 is the event, I got it wrong. Did I mix it up? Let me just redo this. So, this is E(1), E(3), E(5), and this is E(2), E(4), E(6). If you look at the event E = 0, this event is simply {1, 3, 5}. The event E = 1 is {2, 4, 6}. Now notice this event E < 0: E strictly less than 0, equal to 0 is not included. That is the null event; nothing is there.

On the other hand, if you look at, say, E less than, I do not know, 1.2, that says E ≤ 1. Let us say E < 1.2; then I have everything captured. Everything is possible; whatever happens, E will be less than 1.2. But notice, if I have an event like {2, 5}, what happens to this event? It cannot be expressed purely in terms of E, isn't it?

So, think about this point. The reason is, E is not like a mirror of the original sample space; it loses something, isn't it? It only represents whether the outcome is even or odd. So, if you put {2, 5} as an event, just the value of E is not going to tell you anything, so you cannot be sure about these kinds of events. So, you can have random variables like this.

So, you may have random variables that capture the entire sample space, and you may have random variables that capture only some aspect of the sample space. They are still useful; that is how we work with experiments in practice most of the time. So, hopefully these few examples are giving you a rough idea of how random variables are defined. I agree these are simple examples, but you have to start small and slowly build up to more and more complicated examples.
(Refer Slide Time: 20:13)

So, I will leave you with this thought on why random variables. This was the first part, defining random variables. You may wonder why you need random variables at all; we already know the sample space, etcetera. If you look at the IPL powerplay over, when you look at complicated outcomes, there are lots of things happening. You usually get lost trying to assign probabilities to outcomes.

So, remember, when you want to do probability space calculations, you want to assign probabilities to outcomes. And when the outcome itself is so complex, it is very hard to assign a probability to that whole outcome, and then you are thinking of all sorts of complications, so it does not work. What is easier is to start defining a bunch of random variables that you can study: the number of runs, in fact, the number of runs scored on the first ball, the number of runs scored on the second ball. You can have so many random variables defined like that.

And then you focus only on finding out the distribution of the random variable, the events around the random variable; it is the distribution of the random variable, that is what I mean, and how to calculate probabilities for events defined using that random variable. Just focus on that. Do not worry so much about the entire outcome; it is too big. You may not have enough data in practice to do anything reliable with that. So, that is also a problem.

So, you think of a smaller aspect of that whole outcome, which maybe does not capture everything that happened but gives you the important information that you want. And more importantly, in practice, you have enough data to meaningfully say something about that distribution. This is why people work with random variables: in practice, they are easier to assign probabilities to and easier to work with, because you can look at the data and then, sort of motivated by that, meaningfully assign probabilities for random variables.

We will look at all these kinds of aspects going forward. It is definitely a more complicated notion than what you were used to in Statistics 1, but keep this aspect in mind. In practice, most of the time when you apply probability, you will be working with random variables, right? You are directly working with random variables, but you will have this big experiment that is going on to generate the random variable somewhere in the back of your mind.

The more you have that connection in your mind, the better you will be at translating things
to practice. But the facility to use random variables, the skill that you need to develop in
manipulating random variables, is extremely important, and we will focus on that a lot in this
course as well. So, this is the end of the first module. We will go to discrete random
variables and their distributions in the next one.
Statistics for Data Science II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology Madras
Discrete Random Variables and their Distributions

Hello and welcome to this lecture. We looked at the definition of discrete random variables in the
previous lecture. In this lecture, we are going to start describing the distributions of discrete
random variables. You will see that we will slowly start de-emphasizing the sample space,
experiment, outcome and all that, and emphasize the random variables more and more; that is a
very natural thing to happen in statistics.

(Refer Slide Time: 00:38)

So, the first thing you have to ask when somebody gives you a random variable is: what is the
range of the random variable? The range is the set of values taken by it. Usually in this class,
at least, we are thinking of it as a subset of the real line, but I have always mentioned this
caveat: there are texts which define random variables in a slightly more general way.

And that is also useful sometimes, but most of the time the range is taken to be a subset of the
real line, so in this class we will generally restrict to subsets of the real line. So, whenever
somebody defines a random variable, or gives you a problem that involves a random variable,
always think of the range first. This is a lesson that I have seen students forget
repeatedly.

They will start working away without having thought of the range first. Then they will get some
meaningless answer and try to reconcile it with everything else. So always worry about the range
of the random variable. It can be a bit subtle; as we go along you will see some examples where
I will bring out this aspect a little more clearly.

So, let us again do the simple examples we did before. When you throw a die and you say X is
the number shown on the die, then the range clearly is {1, 2, 3, 4, 5, 6}, the same as the sample space.
On the other hand, suppose I define the random variable to indicate whether the number shown on
the die is odd or even: if it is odd, I say 0, and if it is even, I say 1. Then the range is just
{0, 1}.

So, once again range of the random variable is very important. Once you start thinking of a random
variable in an experiment, you first want to think of the range that you want to associate with that
random variable. It gives you a lot of information about what you have to worry about when you
start thinking of probabilities.
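To make this concrete, here is a tiny Python sketch (the variable names X and Y are mine, not from the lecture) that computes the range of each random variable by applying it to every outcome of the die-throw experiment:

```python
# Outcomes of the fair-die experiment (the sample space).
outcomes = [1, 2, 3, 4, 5, 6]

# X: the number shown on the die.
range_X = sorted({x for x in outcomes})

# Y: 0 if the number shown is odd, 1 if it is even.
range_Y = sorted({1 if x % 2 == 0 else 0 for x in outcomes})

print(range_X)  # [1, 2, 3, 4, 5, 6]
print(range_Y)  # [0, 1]
```

The set comprehension mirrors the definition of the range: collect the value of the random variable over every outcome, discarding repeats.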

(Refer Slide Time: 02:24)


So, just to show you that it can be non-trivial to think of the range of a random variable in a
complicated setting, I am giving you the example of this IPL powerplay over now. I will also
mention a few others offhand. Let us start looking at some of the random variables we defined in
the previous lecture.

Number of actual deliveries in the over: we know that the range here will start at 6, is it not? In
an over, in cricket, you cannot have fewer than 6 deliveries. So, it starts with 6 but it may not
always equal 6; it can be 6, it can be 7, it can be 8, it can be 9. How large can it possibly be? I
do not know. I actually went and looked at the ICC rules of cricket, and they are not very clear on
this point.

It looks like they do not actually specify any maximum number of deliveries allowed. What do I
mean by maximum number of deliveries? Somebody can keep on bowling no-balls, is it not? Then
the over will never end. I mean, in cricket, one does not foresee such possibilities; one expects
that eventually people will bowl legal deliveries and the over will end.

So, as far as I could check in the ICC rules of cricket, there is no actual limit on the maximum
number of deliveries. But I think if people get frustrated, they will just stop the over at some point,
or find another bowler or something like that. So, eventually it will stop, but as far as theory
is concerned, maybe we just have to allow 6, 7, 8, 9, et cetera, or you go look at the data and
maybe you want to stop somewhere. But the range starts at 6; that is important.

The next random variable we looked at is the number of runs in the over. Even here: 0, 1, 2, 3, 4,
5, 6. Can you stop at 6? Really? Think about it. There are possibilities of getting more than 6 runs
in one delivery. I do not mean the free hit that you get, I am not talking about that; I am
talking about just the one delivery. You can get overthrows, right?

So, for instance, you may run for 3 runs and somebody may overthrow a 4 for you; that is
seven runs. Or there may be a wide or a no-ball with runs; even for that delivery, you get more
runs. So, it is not clear what the maximum number of runs in a delivery is, and for the over it is
really not clear what the maximum should be. So, you may ask, should it just go 0, 1, 2, 3, 4 on
and on; maybe you should do that.
But what about wickets, the number of wickets in the over? Here again, you can start with 0, but
what should the possible maximum be? I do not know. 10 looks very unlikely; maybe you want to
assign probability 0 to it, but still, theoretically it is possible to go up to 10 wickets: you can
keep on bowling no-balls and people can keep on getting run out. It is possible. So, you can go
from 0 to 10.

Number of boundaries in the over: 0, 1, 2, 3, it can keep going; it depends on how many
no-balls somebody bowls. Maybe there is some limit to it, but it can go like that. What about the
number of dot balls? Notice, the number of dot balls can be 0 and at most 6. You cannot have more
than 6 dot balls in any over, because if you had 6 dot balls, the over is complete, finished. So, a
7th dot ball is not possible.

So, the point of this example was just to illustrate that in a real scenario, when you want to define
a random variable, you really have to think about its range. I mean, it is not a very trivial problem.
So, you have to think about its range and then worry about what part is likely or part is unlikely,
all that you have to worry about when you want to think of random variables in practice.

So, one can think of other examples, like the stock market. If you want to think of the stock market
as the experiment and the stock prices as random variables, do you want to put a range on them?
What kind of range do you want to use? What kind of probabilities do you want to use? All these are
important questions that people grapple with in practical applications of statistics. So,
this is all quite important stuff. The range is not some dull thing; it is a very crucial aspect that
you have to think about.
(Refer Slide Time: 06:40)

So, now we are ready to think of a discrete random variable. A random variable, we will say, is
discrete if its range is a discrete set. So, what is discrete? It turns out the strict definition of
discrete is a bit complicated, so we will use it in a somewhat vague way. Finite subsets
are discrete: if you have a finite number of elements in the set, it is always discrete. The integers
are discrete, all integer multiples of a number are discrete, any subset of a discrete set is
discrete, and so on.

So, generally I will let you think about what a discrete set is; think of just a finite set. If a set
does not contain an interval, it is discrete. That is probably a decent working definition and we
will leave it at that; we will not bother too much more about defining discrete. If there is an
interval (a, b) with a not equal to b, like (0, 1) or (1, 2), inside your set, then it is not
discrete.

But if there is no such interval, then it is discrete, usually. So that is a pretty decent
understanding of a discrete random variable. If you are interested, you can look up discreteness of
sets; it is a bit of a complicated notion in mathematics. We will not define it very precisely, but
I will assume you have an intuitive feel for what discrete is: there is no continuous interval in
the set, you just jump from one place to the other all the time.

So, that is discrete for you. In this week we will consider discrete random variables. We will also
consider continuous random variables later on, where the random variable can possibly take
continuous values in an interval. That case is a bit more complicated; it is easier to start with
discrete random variables, so we will start with them.

(Refer Slide Time: 08:29)

So, the most crucial thing, like I said, the first one, is the range of the random variable. You
have a random variable; you first find its range. That is the first skill you need: you need
to nail down the range. The next thing you have to nail down, or the next thing somebody has to tell
you to fully specify a discrete random variable, is the probability mass function (PMF).
This is enough, and it is very easy to write down and describe.

And once you give the PMF and the range (the PMF also includes the range in some sense), you have
completely specified the random variable. So, here is the definition: the probability mass function of
a discrete random variable X with range set T is the function f_X from T to the interval [0, 1]
defined as f_X(t) = P(X = t) for t in T. That is the definition. It is maybe a bit of a long
definition to unpack.

So, notice what a random variable is: a random variable takes you from the sample space
to some range T, a subset of the real line. And the PMF, the probability mass function, is nothing
but the probability that X takes a particular value in the range.
We will give you a much simpler way of thinking of this: you can think of it like a table,
particularly for discrete random variables. But the proper definition is f_X(t) = P(X = t). So now,
what is "X = t"? Remember I talked about how to define events with a random variable: "X = t" is
an event. So, P(X = t) is the probability of all outcomes in the sample space that result in X
taking the value t.

So, keep this definition in mind; it is very important. If somebody describes an experiment to
you and then defines a random variable from it, to compute the PMF of the random variable
you have to use this definition. The PMF of the random variable at a value t is the probability
that the random variable takes the value t, which is the probability of all outcomes of the
experiment that result in the random variable taking the value t.

So, this is an important point to keep in mind. Now notice this other subtle little point, which
is also very important. Let us take a subset B of the range set T. Instead of taking a single
value t in T, I am going to take an entire subset of T, and instead of just looking at the event
X = t, I can look at the slightly more complicated event where X belongs to this subset B.

So, it turns out that using the PMF you can find the probability of X belonging to B; you do not
need anything more. And the definition goes in a very simple way: the probability that X
belongs to B is the probability of all outcomes that result in X taking values in B, and that is
nothing but a summation:

P(X ∈ B) = Σ_{t ∈ B} P(X = t) = Σ_{t ∈ B} f_X(t).

Notice that the summation here is not over the whole of T; it is only over the t belonging to B.
So, the PMF, which I have defined for values in the range, summed over all values in B, gives you
the probability that X belongs to B.

So, what am I trying to convey here? The point is that the probability mass function is the
probability that the random variable takes one particular value. That is all, as simple as that.
You have one value that the random variable can take; the probability that the random variable
actually takes that value is the PMF. You go through all possible values, and you get the PMF.
Now, once you know the PMF, it is enough to compute the probability of any event defined using X.
That is the main point: the probability of any event defined using X can be computed using the
PMF. This is the main story, and this is why the PMF is enough to completely specify a random
variable. You do not need anything more if you have the PMF.

You have a random variable and any event defined using that random variable; given the PMF, you
can compute the probability of that event. So, once you have the random variable and the PMF,
you are done in some sense: any event you want to define using it, you can find its probability.
Is that okay? That is the usefulness of the PMF.
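The summation formula above is short enough to sketch directly in Python. In this sketch the PMF values and the function name `prob_event` are my own illustrative choices, not from the lecture:

```python
# A PMF stored as a {value: probability} dictionary.
# These particular numbers are illustrative only.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def prob_event(pmf, B):
    """P(X in B): sum the PMF over the values of B that lie in the range."""
    return sum(p for t, p in pmf.items() if t in B)

print(prob_event(pmf, {1, 2}))    # 0.75
print(prob_event(pmf, set(pmf)))  # 1.0 (the whole range)
```

Any event involving X alone reduces to a subset B of the range, so this one function covers all of them.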

(Refer Slide Time: 13:37)

So, let us do a few problems. The problems that we will do first will be of the following type:
an experiment will be defined and a random variable will be defined, and we will compute the
PMF of that random variable. So, let me go ahead and do some of these calculations. In this
experiment, a fair coin is flipped three times. That is the experiment.

So, this is a problem where the experiment is given and the random variables are defined in terms
of the outcomes of the experiment. A fair coin is flipped three times, and I am defining a random
variable X, the number of heads that appear. My second random variable Y is the first flip, if
any, that shows heads. Notice how these two are random variables. Now I will make a short table
here. This is a method you can always use: suppose somebody gives you an experiment, you can
make a table which lists all the outcomes of the experiment and the corresponding value taken by
each random variable.

This is a very useful thing to do; quite often I do this, and it is a very useful tool for solving
problems. So how do I write it down? I will write down the outcome, the probability of the
outcome, the value of X and the value of Y. These are simple experiments where all this can
really be written down.

Outcome   P(outcome)   X   Y
HHH       1/8          3   1
HHT       1/8          2   1
HTH       1/8          2   1
HTT       1/8          1   1
THH       1/8          2   2
THT       1/8          1   2
TTH       1/8          1   3
TTT       1/8          0   none

Those are the 8 possible outcomes. It is a fair coin, so it is okay to assign probability 1/8 to
each of them. X is the number of heads that appear, and Y is the first flip, if any, that shows
heads; for TTT there is none. So, now this table completely describes my two random variables.
Wherever possible, if the experiment is very small, simply make this table; it will simplify your
life a lot and you can easily do calculations.

But sometimes it may be very laborious to make this table. Maybe you keep such a table in
your head and do the calculations by hand quickly; you can do that also, but if you can make a
table, make a table. It does not hurt. From here, you can easily find the PMF. First of
all, before anything, I have to find the range of X.

The range of X is {0, 1, 2, 3}. And f_X(0) = P(TTT) = 1/8. f_X(1) = P({HTT, THT, TTH}) = 3/8.
f_X(2) = P({HHT, HTH, THH}), which is again 3/8. And f_X(3) = P(HHH) = 1/8. So that is the PMF of X.

So, you can also find Y and its PMF. Y could be none, or 1, or 2, or 3. Notice how this "none" is
not actually a real number, so maybe you want to replace it with a value: you could use 0 here.
For none you can say 0, the "zeroth toss", which never really happens. So, the range of Y is
{0, 1, 2, 3}. f_Y(0) is just P(TTT) = 1/8. f_Y(1) is the probability of the first four outcomes,
4/8. And f_Y(2) = 2/8 and f_Y(3) = 1/8. So, this is a way to fully specify these random variables.

These kinds of problems are very important. The experiment is given, a random variable is
described in English, in some sense, and you have to find its range and PMF; this
is how you go about doing it.

Hopefully this was illustrative. It was a very simple illustration, a very easy experiment to get
your head around, and one can write down everything that is possible.
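The table-building method above is easy to mechanize. Here is a short Python sketch (my own, following the lecture's setup) that enumerates the 8 equally likely outcomes and tallies the PMFs of X and Y, using exact fractions:

```python
from itertools import product
from fractions import Fraction

pmf_X, pmf_Y = {}, {}
for flips in product("HT", repeat=3):      # all 8 equally likely outcomes
    p = Fraction(1, 8)
    x = flips.count("H")                   # X: number of heads
    # Y: 1-based index of the first head, or 0 if there is none
    y = next((i + 1 for i, f in enumerate(flips) if f == "H"), 0)
    pmf_X[x] = pmf_X.get(x, 0) + p
    pmf_Y[y] = pmf_Y.get(y, 0) + p

print(pmf_X)  # f_X: 0 -> 1/8, 1 -> 3/8, 2 -> 3/8, 3 -> 1/8
print(pmf_Y)  # f_Y: 0 -> 1/8, 1 -> 4/8, 2 -> 2/8, 3 -> 1/8
```

The loop is exactly the table: one row per outcome, probability 1/8 each, accumulated into the event "X = x" or "Y = y".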

(Refer Slide Time: 19:43)

The next one is a little more complicated. You cannot make a table here, and you will see why:
the numbers are too large. Nevertheless, you can do the calculation without too much trouble.
This is about a three-digit lottery. There is a certain lottery where a three-digit number is
randomly selected from 000 to 999; there are a thousand such numbers.

So, you get tickets ahead of time, and your ticket will have a three-digit number on it. Somebody
generates the lottery number randomly, maybe it is 123. If your ticket matches the number
exactly, your ticket is worth two lacs: you can go hand it in somewhere and they will give you
two lacs. If the ticket matches two of the three digits, it is worth Rs. 20,000. Think about what
that means.

When you say two of the three digits: suppose you hold the ticket 123. When will you get 20,000?
You should get 1, 2 and something other than 3; it could be 120, 121, 122, 124, 125,
126, 127, 128, 129, and there are other cases too. So let me just work through the example.
Rupees two lacs is an exact match; that is not very difficult to see.

So, when will you get Rs. 20,000? Here is an example. Suppose the ticket is 123. You will get
Rs. 20,000 if the lottery number is 120, 121, 122, 124, and so on till 129; there are 9 such
numbers. It can also be 023, 223, and so on until 923, is it not? It can also be 103, 113, 133,
and so on till 193. So, there are 27 possibilities in all.

So, the ticket is 123; you will get two lacs if the lottery number is exactly 123, one
possibility. Otherwise, the ticket is worth nothing; it is 0 for any other lottery number. So,
what is the distribution of X, the worth of the ticket? You have a ticket and the ticket number
is fixed. For every ticket number, whatever it may be, X has three possible values: 0, 20,000 and
2 lacs. That is the range.

So, what is f_X(t) for t in {0, 20,000, 2 lacs}? What is the probability of 2 lacs? It is 1/1000,
is it not? One possibility out of a thousand. What is the probability of 20,000? It is 27/1000.
What is the probability of 0? We can go ahead and do a lot of calculations, but one easy way is
to say (1000 − 28)/1000, is it not? Those are the only 28 winning numbers; if the lottery number
is anything other than those 28, you are not going to get anything.

Any other number will differ from your ticket in more than one place. So, 1000 − 28 = 972, and
the probability is 972/1000. So, you saw in this problem that making a big table with all the
possible lottery outcomes is impractical; you would have a thousand lines.

And there is no point in doing that either; you can take a simple example, work it out, and
quickly get the number of possibilities. But is this correct in general? It is correct for 123:
if the ticket is 123, this is correct. What if the ticket is something else? Let us say the
ticket is 100. In this case, what will happen?

If two of the digits in your ticket have to match, and the ticket is 100 instead of 123, then
2 lacs is still 1/1000. What about 20,000? You will need a different set of lottery numbers: you
will have 000, 200, all the way till 900; then 110 all the way till 190; and 101 all the way to
109. So, in this case also you see 27 possibilities. And Rs. 2 lacs does not change: one
possibility.

So, it does not really matter what your ticket is. You may want to check that a little bit: try
different possibilities for tickets and convince yourself logically that whatever the ticket may
be, there are exactly 27 other three-digit numbers which match it in exactly two places, and they
are all distinct. You fix two positions to match and vary the third one.

Can the same number be counted twice? I do not think so; think about it. It cannot happen for a
match in two out of three places. So, this looks okay to me for any ticket, not just for ticket
123. For any ticket, you have one possibility for 2 lacs, 27 possibilities for 20,000 and the
remaining 972 possibilities for 0. So, this is true for any ticket.

So, this needs a bit of calculation. You see that these kinds of problems are very interesting.
The experiment is defined for you, you understand the nature of the random choice in the
experiment, and then a random variable is defined. You now have to find the distribution of the
random variable, using the probabilities assigned to the outcomes of the experiment. You
carefully collect the outcomes which are favorable to the random variable taking a particular
value, and then compute the probability. These kinds of problems can sometimes be a bit
difficult, because they go back to computational probability, but this is an important skill to
have.

I will tell you why this is an important skill to have. In an actual complex experiment, you may
be interested in some random variable, but you may also want to understand some outcomes in the
experiment, how they work, what kinds of properties you can assume there, and how that influences
the random variable. That understanding will quite often help you.

Only when you solve problems like this does that understanding build up. So, it is a good type of
problem to solve.
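The lottery counting argument above is also easy to verify by brute force. Here is a Python sketch (my own check; the function name is mine) that, for any fixed ticket, counts the lottery numbers by how many digit positions match:

```python
def match_counts(ticket):
    """Tally the 1000 lottery numbers by how many digit positions match the ticket."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for n in range(1000):
        number = f"{n:03d}"
        matches = sum(a == b for a, b in zip(ticket, number))
        counts[matches] += 1
    return counts

for ticket in ["123", "100", "777"]:
    c = match_counts(ticket)
    # exactly 3 matches -> 2 lacs, exactly 2 -> Rs. 20,000, anything else -> 0
    print(ticket, c[3], c[2], c[0] + c[1])   # always 1, 27, 972
```

Running this for several tickets confirms that the counts 1, 27 and 972 do not depend on the ticket.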
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 2.3
Properties of probability mass function
(Refer Slide Time: 0:13)

Let me summarize this whole random variable and its distribution story for you in one slide. Any
time you have a random variable, you have to worry about its range and its PMF. We are thinking
of a discrete random variable, which is why we have a PMF and all that. So, we have a random
variable 𝑋 with its range being a discrete set, typically a finite one like {𝑡1 , 𝑡2 , … , 𝑡𝑘 }.
If the range is not finite, the list does not end at 𝑡𝑘 ; you keep going on and on. And then you
have a PMF.
So, you can always think of a discrete random variable in terms of a table like this: the first
row is the values that the random variable takes, and the second row is the probability with
which the random variable takes each value, that is, the PMF. Range and PMF; that is it. That
fully captures what the random variable is. There are 2 crucial properties of the PMF. The value
taken by the PMF is a probability, so it should be between 0 and 1.
And the total probability, the sum of all the values in the PMF, should be 1. Why is that?
Because the random variable 𝑋 takes values only in that range; if you sum the probabilities that
the random variable takes the first value, the second value, the third value, and so on, you
exhaust all the outcomes in your experiment, and that gives you a probability of 1. So, these 2
properties are the defining properties of a PMF.
So, now what will happen? One very typical problem that you will see is: somebody gives you a
range and a candidate PMF, a function from the range to [0, 1] or anything else, and asks you
when it is a valid PMF. For that you check these 2 conditions.
i) You check whether every value taken by the PMF is between 0 and 1.
ii) You check whether the values add up to 1.
If both hold, you have a valid PMF.
This point I am making at the end is very important. You have to build up the skill of dealing
with random variables, PMFs and distributions, moving them around without worrying about the
sample space, experiment and all that. I talked about 2 skills here. One: if somebody defines a
random variable from a sample space and an experiment, you want to know how to find its
distribution using the definitions of the probability space.
The other skill is from the PMF onwards: once you know the PMF, how do you keep manipulating it
and working with it? Both of these you have to develop to be good at probability and statistics.
(Refer Slide Time: 2:44)

So, let me give you a couple of examples of working with the PMF. Here is a table with the PMF
of a random variable which is partially specified. I am telling you that the range is
{−1, 1, 2, 4}, the PMF takes the values 0.5, 0.25, 0.125, and the last entry, 𝑓𝑋 (4), is left
blank. So, what is the property I will use?
It is quite easy to see what the property is going to be: every value is between 0 and 1, and
they have to add up to 1. So, you know that 0.5 + 0.25 + 0.125 + 𝑓𝑋 (4) = 1, and you get
𝑓𝑋 (4) = 0.125. That is the first step, and it is quick to fill up. Now, find the range of 𝑋.
Why is the range asked for? For instance, 𝑓𝑋 (4) may have worked out to 0; if the probability is
0, usually you do not put that value in the range, and the range would be smaller. But in this
case every value is possible, so the range is {−1, 1, 2, 4}. 𝑃(𝑋 > 3) is basically 𝑓𝑋 (4),
since if 𝑋 > 3, then 4 is the only possibility; that is 0.125. And 𝑃(𝑋 < 2) is
𝑓𝑋 (−1) + 𝑓𝑋 (1), is it not? That is 0.75.

So, notice how easy it is to work with PMFs: very simple quantities and very basic algebra. This
is the first example of working with a PMF.
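The whole example can be redone in a few lines of Python, using exact fractions so that 0.5, 0.25 and 0.125 stay exact (the variable names are mine):

```python
from fractions import Fraction

# Partially specified PMF on the range {-1, 1, 2, 4}; f_X(4) is the blank entry.
pmf = {-1: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8)}

# The PMF values must add up to 1, which pins down the missing entry.
pmf[4] = 1 - sum(pmf.values())

p_gt_3 = sum(p for t, p in pmf.items() if t > 3)  # only t = 4 qualifies
p_lt_2 = sum(p for t, p in pmf.items() if t < 2)  # t = -1 and t = 1

print(pmf[4], p_gt_3, p_lt_2)  # 1/8 1/8 3/4
```

The event probabilities are just the PMF summed over the part of the range satisfying the condition, exactly as in the worked calculation.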
(Refer Slide Time: 4:41)

Here is one more example where things become slightly more complicated. These kinds of cases will
appear over and over again, and one needs to be comfortable with them. Here the PMF is given, but
notice the range: it is not finite. This is what is called a countable case; the range keeps
going 1, 2, 3, 4, …, you can just count it out, and the PMF is given as 𝑓𝑋 (𝑘) = c/3^k.

So, first, what is this 𝑐? It is, once again, a partially specified PMF. Now, what do I use to
find the value of 𝑐? I know that

c/3 + c/3^2 + c/3^3 + ⋯ = 1.

Why is that? This is the condition on the PMF, and you use it to find 𝑐. So, here is an infinite
summation, c(1/3 + 1/3^2 + ⋯) = 1, and from the geometric series formula the sum in the brackets
is (1/3)/(1 − 1/3) = (1/3)/(2/3) = 1/2. So c × (1/2) = 1, which gives 𝑐 = 2.

So, whenever you have a partially specified PMF like this, the first task is to complete its
specification using its properties: each value has to be between 0 and 1, and they have to add up
to 1. You may have to use some skills in summing; there is a formula for adding up a geometric
progression to infinity, and you use it.
So, remember this formula:

a + a^2 + a^3 + ⋯ = a/(1 − a), if 0 < a < 1.

This is the formula I am using here; there are more general geometric series formulas available.
Now there is the question of 𝑃(𝑋 > 10). If 𝑋 > 10, then k = 11, 12, 13, …, so you have

c/3^11 + c/3^12 + c/3^13 + ⋯.

What will this add up to? Again you use the same idea, but maybe you need a slightly different
formula here:

a + ar + ar^2 + ⋯ = a/(1 − r), if 0 < r < 1.

Here, we know c = 2 now, so

c/3^11 + c/3^12 + c/3^13 + ⋯ = 2/3^11 + 2/3^12 + 2/3^13 + ⋯.

This sum is 𝑃(𝑋 > 10): the term c/3^11 is 𝑃(𝑋 = 11), c/3^12 is 𝑃(𝑋 = 12), and so on, so I am
adding up all those probabilities. Now a = 2/3^11 and r = 1/3, so the denominator in the formula
is 1 − 1/3 = 2/3. If you go ahead and do this, the 2 cancels and one factor of 3 cancels, and you
are left with 1/3^10. So, 𝑃(𝑋 > 10) = 1/3^10. Now, what about this next one, the conditional
probability?
How do you do the conditioning for 𝑃(𝑋 > 10 | 𝑋 > 5)? Use

𝑃(𝐴|𝐵) = 𝑃(𝐴 ∩ 𝐵)/𝑃(𝐵), is it not?

So, what is 𝑃(𝐴 ∩ 𝐵)? It is 𝑃[(𝑋 > 10) ∩ (𝑋 > 5)], and that is nothing but 𝑃(𝑋 > 10), which
we already know. What about 𝑃(𝑋 > 5)? Use the same method: 2/3^6 + 2/3^7 + ⋯ = 1/3^5, so
𝑃(𝑋 > 5) = 1/3^5. So you will see

𝑃(𝑋 > 10 | 𝑋 > 5) = (1/3^10)/(1/3^5) = 1/3^5.

Check this calculation; that is the answer. So, you can see how, once you have the PMF, any
event, even with conditioning and all that, defined using the random variable 𝑋 and limiting its
range, gives a very easy calculation.
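These geometric-series values are easy to sanity-check numerically. Here is a quick Python sketch of my own (not part of the lecture) that truncates the infinite sums at a large k, where the tail is negligibly small:

```python
# PMF: f_X(k) = c / 3**k for k = 1, 2, 3, ...; truncate at K for a numerical check.
K = 200
c = 2                                         # the value forced by the PMF summing to 1
total = sum(c / 3**k for k in range(1, K))
p_gt_10 = sum(c / 3**k for k in range(11, K))
p_gt_5 = sum(c / 3**k for k in range(6, K))

print(total)                     # ~1.0, confirming c = 2
print(p_gt_10, 3**-10)           # both ~1/3^10
print(p_gt_10 / p_gt_5, 3**-5)   # conditional P(X > 10 | X > 5) ~ 1/3^5
```

The truncation error after k = 200 is of the order 3^(−200), far below floating-point precision, so the agreement is essentially exact.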
But quite often the PMF may be partially specified, and you may have to use its properties to get
the full specification. I think that is the end of this lecture. In the next lecture we will
start looking at some of the common distributions that you usually come across in statistical
applications. We will stop here for this lecture, thank you.
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 2.4
Common distributions
Hello and welcome to this lecture. In this lecture we will look at some very common discrete
distributions, that is, some discrete random variables that show up again and again in many
applications and are very useful to get to know and be able to describe. And once again, like I
mentioned before, we will soon start talking primarily about random variables, and everything
else will be defined using random variables.

Events will be defined using random variables, probabilities for events will be calculated
using PMFs, etcetera; we will get used to that idea slowly. These common distributions will
appear again and again, and you have to learn to identify situations where they are reasonable
to employ. So, let us first get to know them, and then eventually we will see some situations
where they can be used and get comfortable with them. Let us get started.

(Refer Slide Time: 1:06)

So, the first and most important random variable, in my opinion at least, is the uniform random
variable. It is a very simple one, but still it is very-very useful. So, my strategy for describing
these would be to simply specify the random variable. How do I specify the random variable?
I specify the range, specify the PMF, that is it and we will keep proceeding. I will give a few
typical scenarios where simple toy examples or maybe even more complicated examples where
these occur.
So, the uniform random variable is basically, if
you say, and this is a notation that we will use, so pay attention to the notation as well. So, this
notation that we have here, so this 𝑋 is 𝑢𝑛𝑖𝑓𝑜𝑟𝑚(𝑇), so where 𝑇 is some finite set, whenever
I put a finite set here and I say I have a random variable, which is uniform in that finite set,
what I mean is this random variable 𝑋 has range equal to that finite set and its PMF is uniform,
as in it takes the same value for every entry in that finite set, every element in that finite set, so
it is 1/|𝑇|.

So, here are 2 examples, one is tossing a fair coin, clearly 𝑋 you can take as 𝑢𝑛𝑖𝑓𝑜𝑟𝑚({0, 1}),
and 0 I am using to represent heads and 1 is for tails and the next example is throwing a fair
die where 𝑋 would be 𝑢𝑛𝑖𝑓𝑜𝑟𝑚({1, 2, 3, 4, 5, 6}), so now instead of {0, 1} or {1, 2, 3, 4, 5, 6},
I could have any other set here, as long as I say uniform it is clear what that random variable
is, so this is a very-very simple and powerful, often used random variable, so often people make
this assumption that random variables are uniform in a set.

So, how do you sort of think of scenarios? So, if you are looking at a random variable and you
think all its outcomes are equally likely, then that is where you start putting in the uniform
assumption. So, that is one way to think of it. So, this is a simple random variable and like I
said it shows up again and again.
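As an illustration (a Python sketch, not part of the lecture), you can sample from 𝑢𝑛𝑖𝑓𝑜𝑟𝑚({1, 2, 3, 4, 5, 6}) and check that the empirical fractions come out close to 1/|𝑇| = 1/6:

```python
import random

T = [1, 2, 3, 4, 5, 6]            # a fair die: X ~ uniform(T)
pmf = {t: 1 / len(T) for t in T}  # f_X(t) = 1/|T| for every t in T

random.seed(0)  # fixed seed so the run is reproducible
samples = [random.choice(T) for _ in range(100_000)]
for t in T:
    # empirical fraction of each face versus the uniform PMF value
    print(t, samples.count(t) / len(samples), pmf[t])
```

With 100,000 samples, each empirical fraction lands within about one percent of 1/6.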

(Refer Slide Time: 3:03)

The next one, which is again very simple and again shows up all the time in applications is this
𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝), Bernoulli random variable and once again we will use this notation. So, you
will see you will get used to these what are called parameters of random variables. So, here this
random variable 𝑋 is this also notice this notation tilde. So, this tilde notation basically says
random variable 𝑋 is distributed as or you can read it like.

That it is basically distributed as 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝), so this random variable has that distribution.
What is that distribution? The range is {0, 1} and then this 𝑝 parameter, this is a single
parameter 𝑝, which is a value between 0 and 1, a real number between 0 and 1, it basically
gives you 𝑓𝑋 (1), the range is {0, 1}, then 𝑓𝑋 (1) is 𝑝. So, now these 2 have to add
up to 1, so 𝑓𝑋 (0) is 1 − 𝑝. So, this fully specifies the 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝), Bernoulli random variable
with parameter 𝑝.

So, notice these parameters, so when we start describing random variables, specific types of
random variables, common distributions that you see, these parameters will keep appearing
again and again. In the uniform case the parameter was a set, the set over which it is uniform,
in the Bernoulli random variable case the parameter is a number between 0 and 1, it is basically
the probability that the random variable equals 1. The range is {0, 1}, the probability that it
equals 1 is this parameter 𝑝.

So, a very typical example is this Bernoulli trials. We have seen these Bernoulli trials before,
where you know, you do the repeat experiment repeatedly or you basically do the experiment,
then there is an event and 𝑝 is the probability of success or probability that event happens, in
that case you have a very nice Bernoulli distribution. So, 2 very, 2 random variables, 2
distributions, we have seen already one is the uniform distribution; the other is this Bernoulli
random variable, Bernoulli distribution for the binary random variable. It just takes 2 values 0
and 1, that is why it is called binary in some sense.
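A minimal Bernoulli(𝑝) sampler, as an illustrative Python sketch (the value 𝑝 = 0.3 is an arbitrary choice of mine):

```python
import random

def bernoulli(p, rng):
    """One Bernoulli(p) trial: returns 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)  # fixed seed for reproducibility
p = 0.3
samples = [bernoulli(p, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # empirical f_X(1), close to p
```

The empirical fraction of 1s approaches the parameter 𝑝, which is exactly what 𝑓𝑋 (1) = 𝑝 says.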

(Refer Slide Time: 5:01)


The next one is a binomial random variable. We have seen this before. Now, the parameters
become slightly larger, so here is the notation again, 𝑋 ~𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑛, 𝑝), there are 2
parameters 𝑛 and 𝑝, 𝑛 is a positive integer and 𝑝 is the number between 0 and 1. So, for this
random variable, a random variable which has a binomial distribution with parameter 𝑛 and 𝑝,
the range is from 0 to 𝑛, the integers 0 to 𝑛, 0, 1, 2, 3, so on till 𝑛.

And the PMF is 𝑓𝑋 (𝑘), 𝑘 of course, goes from 0 to 𝑛, 𝑓𝑋 (𝑘) = 𝑛𝐶𝑘 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘), so this is the
probability that this random variable 𝑋 takes value 𝑘, is that okay? So, this is the PMF, this is
the range and this is the binomial random variable. See how easy it is to describe random
variables, so you have to just say some, come up with some notation for it, fix the parameters
and then define the range and the PMF using the parameters, that is it, so that gives you random
variable after random variable, is not it?

So, here is an example, you can see that number of successes in 𝑛 independent Bernoulli trials
will be a binomial random variable. So, we have seen this before in the earlier lectures, so this
clearly is a binomial random variable. So, as an aside, you have to check that this is a valid PMF;
that you can do, each of these numbers is bigger than or equal to 0.

How do you check that they add up to 1? I mean, I described this when describing the
binomial distribution; there is a formula which shows that the sum is equal to 1. Anyway, I
will leave it aside, I am not going to prove it for you, but you can show that this is a valid PMF.
So, this is binomial random variable. You have seen 3 already. Slowly we are becoming more
complex, uniform was just the same value assigned to every outcome, very easy probability
distribution.

Bernoulli already was a bit biased 0 and 1, you put a biased probability, you assigned an
arbitrary parameter 𝑝 to the probability of 1, it could be 0.9, 0.1, that happens quite often, some
things happen with very low probability, you cannot say everything is uniform, some events
happen with very low probability and you want to be able to do a biased thing. So, here is
binomial, again this when some repeated trials happen, number of successes end up being
binomial, so this is a distribution that is very common as well.
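That validity check can also be done numerically. A Python sketch (illustrative; 𝑛 = 10 and 𝑝 = 0.3 are arbitrary choices) that adds up 𝑛𝐶𝑘 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘):

```python
from math import comb

n, p = 10, 0.3
# f_X(k) = nCk * p^k * (1-p)^(n-k) for k = 0, 1, ..., n
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(sum(pmf))  # equals 1 by the binomial theorem, up to float round-off
```

The sum is 1 because (𝑝 + (1 − 𝑝))^𝑛 = 1 by the binomial theorem.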
(Refer Slide Time: 7:31)

The next distribution is called, next random variable of interest is called geometric random
variable. This has only one parameter 𝑝, 𝐺𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐(𝑝) and 𝑝 is between 0 and 1 and the range
is a little bit more complicated, range goes from 1, 2, 3, so on and on and on and on forever.
And the PMF 𝑓𝑋 (𝑘) is (1 − 𝑝)^(𝑘−1) 𝑝, so this is a PMF. You can check that it is valid. You can
just sum up over all possible 𝑘.

You will see it is 𝑝 + (1 − 𝑝)𝑝 + (1 − 𝑝)^2 𝑝 + ⋯; you use the geometric summation formula, you will
get 𝑝/(1 − (1 − 𝑝)) and that is equal to 1. So, that is for 𝑝 strictly between 0 and 1. If 𝑝 is 1, you
can quite evidently see that things will work out; of course, 𝑝 cannot be 0, I think, so maybe that possibility is wrongly mentioned here, so I guess 𝑝 has to be strictly positive.

So, this is the geometric random variable. So, once again, so if we say 𝑋 is distributed as a
geometric random variable with parameter 𝑝, 𝑝 will be between 0 and 1 and range is
{1, 2, 3, … 𝑠𝑜 𝑜𝑛} and the value of the PMF, PMF is defined as (1 − 𝑝)𝑘−1 𝑝. So, once again
you see the same principle is being used, you define the range, define the PMF, you fully define
the random variable.

And from what we have seen before, if you do repeated independent 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) trials and
you look at the random variable which is the number of trials needed for the
first success, you get a geometric random variable, is it not? So, this is clearly the probability
for that. And you can check it is a valid PMF. And I want to point out one thing, quite often
people will describe the geometric PMF by starting at 0 instead of 1.
So, I have started at 1, quite often people will start at 0. It is just a convention, some people
start at 0 and then in that case the PMF will be slightly different, that is all. It is just a
convention, but in our course we will start at 1. So, that is the geometric random variable for
you. So, we have seen 4 so far uniform, Bernoulli, binomial, geometric.
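Here is an illustrative Python sketch (not from the lecture; 𝑝 = 0.25 is arbitrary) that simulates repeated independent 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) trials, counts the trials up to the first success, and compares the empirical fractions with (1 − 𝑝)^(𝑘−1) 𝑝:

```python
import random

def trials_to_first_success(p, rng):
    """Number of independent Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:  # failure, try again
        k += 1
    return k

rng = random.Random(1)
p, N = 0.25, 100_000
counts = {}
for _ in range(N):
    k = trials_to_first_success(p, rng)
    counts[k] = counts.get(k, 0) + 1

for k in range(1, 6):
    # empirical fraction versus the geometric PMF (1-p)^(k-1) * p
    print(k, counts.get(k, 0) / N, (1 - p)**(k - 1) * p)
```

Note the range starts at 1 here, matching the convention used in this course.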

So, all these 4 we have seen before in different forms, maybe we did not use the notion of a
random variable definition, but we have seen these distributions before, they occurred in
sample spaces also and we described them in terms of just pure theoretical sample space,
events, outcomes, etcetera, we describe this distribution. So, now we are studying it once again.
We are just calling them random variables and using them.

(Refer Slide Time: 10:17)

So, the next couple of distributions we are going to see are new; we have
not seen them so far. The first one is what is called a negative binomial random variable. Once
again there are 2 parameters. If you say 𝑋~ 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝑟, 𝑝), where 𝑟 is a positive
integer and 𝑝 is between 0 and 1, the range goes from {𝑟, 𝑟 + 1, 𝑟 + 2, … }, and the PMF is
given by that formula, 𝑓𝑋 (𝑘) = (𝑘−1)𝐶(𝑟−1) (1 − 𝑝)^(𝑘−𝑟) 𝑝^𝑟.

So, this may not be immediately apparent to you, the trick is to look at the case when 𝑟 is 1.
So, it turns out when 𝑟 is 1 this becomes a geometric random variable. You can go back and
see that if 𝑟 is 1, range is {1, 2, 3, … }, and if 𝑟 is 1, the 𝑘, the combination goes to 1, because
you have (𝑘−1)𝐶0 (1 − 𝑝)^(𝑘−1) 𝑝, so that is exactly the geometric distribution for 𝑟 equal to 1.

So, it turns out the negative binomial is a generalization of the geometric distribution.
Remember geometric distribution arose as in repeated independent Bernoulli trials, the number
of trials for first success. So, the negative binomial is number of trials for 𝑟 successes in
repeated independent 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) trials. So, how long will it take for you to get 𝑟 successes,
and this is the formula for that.

So, you can check that this is a valid PMF. It is a little bit more complicated but not that hard,
you can try it, it will give you some comfort in summing up series. So, that is the negative
binomial random variable. I have shown you how it arises, so sometimes you may want 2
successes when you do Bernoulli trials. So, what is the number of trials needed for 2 successes?
That will be a negative binomial random variable with parameter (2, 𝑝). So, that is the negative
binomial random variables, new one, we are seeing it for the first time in this lecture.
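The 𝑟 = 2 case just mentioned can be simulated directly; a Python sketch (illustrative, with arbitrary 𝑟 = 2 and 𝑝 = 0.5) counting the trials needed for 𝑟 successes and comparing with (𝑘−1)𝐶(𝑟−1) (1 − 𝑝)^(𝑘−𝑟) 𝑝^𝑟:

```python
import random
from math import comb

def trials_for_r_successes(r, p, rng):
    """Number of independent Bernoulli(p) trials needed to see r successes."""
    successes, k = 0, 0
    while successes < r:
        k += 1
        if rng.random() < p:
            successes += 1
    return k

rng = random.Random(7)
r, p, N = 2, 0.5, 100_000
counts = {}
for _ in range(N):
    k = trials_for_r_successes(r, p, rng)
    counts[k] = counts.get(k, 0) + 1

for k in range(r, r + 5):
    # empirical fraction versus the negative binomial PMF
    theory = comb(k - 1, r - 1) * (1 - p)**(k - r) * p**r
    print(k, counts.get(k, 0) / N, theory)
```

Setting 𝑟 = 1 in the same function reduces to the geometric simulation, mirroring the observation above.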

(Refer Slide Time: 12:12)

The next random variable, which has a huge number of applications in practice, is what is called
the Poisson random variable. So, the name Poisson is after a French mathematician; the word
poisson in French actually means fish, but anyway that is also the name of a person and this random
variable is named after that person. So, what is a Poisson random variable? It has one parameter
𝜆, which is greater than 0.

So, it is a real parameter, not an integer parameter; it is a real
parameter taken strictly greater than 0. So, this is a very popular and important random variable
for us to understand. It has one parameter 𝜆, I have been talking about it, 𝜆 > 0, it is a
positive real number. So, so far we have been using integer parameters. So, this is a real
number, a positive real.
And the range is {0, 1, 2, 3, … }, again the non-negative integers are the range, and
the PMF is given here; the exponential function is involved in the PMF. It is 𝑒^(−𝜆) 𝜆^𝑘/𝑘!, or if you want
me to write it a bit differently, 𝑓𝑋 (𝑘) = 𝑒^(−𝜆) 𝜆^𝑘/𝑘!. So, this is a very famous and popular PMF; it is
the Poisson random variable.

So, this table shows you the PMF for small values of 𝑘: for 𝑘 = 0 it is 𝑒^(−𝜆), for 𝑘 = 1 it is 𝑒^(−𝜆) 𝜆, for 𝑘 = 2 it is 𝑒^(−𝜆) 𝜆^2/2!, then 𝑒^(−𝜆) 𝜆^3/3!, 𝑒^(−𝜆) 𝜆^4/4!, and so on. You can check that this is a valid PMF, it is very easy. You just use
the series expansion for 𝑒^𝜆, so if you add it all up, you will get 𝑒^(−𝜆) 𝑒^𝜆 = 𝑒^(−𝜆+𝜆) = 𝑒^0 = 1, so
this is a very, definitely a valid PMF.
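You can do this check numerically too; a quick Python sketch (illustrative, with an arbitrary 𝜆 = 3) that sums the Poisson PMF over a truncated range:

```python
from math import exp, factorial

lam = 3.0

def pmf(k):
    # Poisson PMF: e^(-lam) * lam^k / k!
    return exp(-lam) * lam**k / factorial(k)

# truncated series; by the expansion of e^lam this converges to 1
total = sum(pmf(k) for k in range(100))
print(total)  # 1.0 up to the (tiny) truncation and round-off error
```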

(Refer Slide Time: 14:15)

So, if you want to see how it looks, here are a few sketches; you can sketch this
Poisson PMF, I have used a simple application to sketch it. For different values of 𝜆 I have
plotted the PMF from 0 to 20 and you can see as 𝜆 increases the peak of the PMF seems to
keep shifting. So, when 𝜆 is 0.5, the peak is close to 0 and 𝜆 is 5, you see the peak is sort of
close to 5.

When 𝜆 is 10 the peak is sort of close to 10, the 𝜆 is 15 peak is close to 15 and it has moved off
to the right, so this is true in general, if you keep increasing 𝜆 you will see that the Poisson
PMF keeps shifting. So, this is a very significant PMF. It has got lots of practical applications,
we will see it later on even in this week's lecture one application of the Poisson PMF.
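This peak-shifting behaviour can be checked numerically as well. A small Python sketch (not from the lecture; the helper name is mine) that finds where the Poisson PMF is largest for the same values of 𝜆 as the plots:

```python
from math import exp, factorial

def poisson_mode(lam, kmax=60):
    """Index k at which the Poisson(lam) PMF is largest (first index if tied)."""
    pmf = [exp(-lam) * lam**k / factorial(k) for k in range(kmax)]
    return max(range(kmax), key=lambda k: pmf[k])

for lam in [0.5, 5, 10, 15]:
    print(lam, poisson_mode(lam))  # the peak tracks lam
```

The peak sits at or just below 𝜆, which is exactly the shifting seen in the sketches.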

(Refer Slide Time: 14:59)


The final random variable, common distribution, that we will see is what is called the
hypergeometric random variable. It has got 3 parameters, all of them positive integers: 𝑁, 𝑟, 𝑚. So,
if I say a random variable 𝑋 is hypergeometric with parameters 𝑁, 𝑟, 𝑚, the meaning is a little
bit more complicated. So, to understand the meaning let us look at the primary example where
this is used.

So, in this hypergeometric distribution typically, this is the scenario where it is used. You
consider a population of 𝑁 persons, there are 𝑁 people in the population, a group of people and
𝑟 of them are type 1, 𝑁 − 𝑟 are type 2; maybe male, female, you can think of it like that if you
like, if you have a 2 gender group, then you can have 𝑟 of them being male, 𝑁 − 𝑟 being female.

And then, somebody selecting 𝑚 persons from this group uniformly at random without
replacement, so you pick 𝑚 persons uniformly at random without replacement and then my
random variable 𝑋, if I say it is the number of type 1 people selected, that is a
ℎ𝑦𝑝𝑒𝑟𝑔𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐(𝑁, 𝑟, 𝑚) random variable, so it is a distribution. It arises in sampling
like this, so it seems a bit complicated, but you can see why it is very-very useful.

So, you imagine the disease incidence example that we saw. Type 1 could be people with a
disease, type 2 is people without the disease, and when you pick 𝑚 samples uniformly at random
one after the other, without replacement, that is what you do when you are sampling,
you select people and test them for the disease and this 𝑋 is the number of type 1 people
selected, so this distribution is very important in practice when you sample.

So, to understand the range of 𝑋, you notice it is a bit complicated, it is going to depend on 𝑁,
𝑟, and 𝑚 in a funny fashion. So, here, the dependence is finally given here, the answer is finally
given here: it goes from max(0, 𝑚 − (𝑁 − 𝑟)) to min(𝑟, 𝑚). So, a few illustrative cases are
given here, suppose 𝑁 is 100, we fix 𝑁 100 and let us say 𝑟 is 50.

So, there are 50 people of type 1, 50 people of type 2, and I am picking 20 people, so in that
case my random variable 𝑋 could have 0 to 20, I could have 0 people of type 1 up to 20 people
of type 1. Now on the other hand if my 𝑟 is only 10, there are only 10 people of type 1, and I
am picking 20, there is no way I can have 20 people of type 1. So, my range will go from 0 to
10 only.

On the other hand, if my 𝑟 is 90 and I am picking 𝑚 of 20, there are only 10 people of type 2,
so I should definitely have at least 10 people of type 1, whatever I pick, so my range will start
from 10 and go to 20. So, you see these 3 cases are 3 different cases that can happen depending
on the value of 𝑁, 𝑟, and 𝑚, so those are illustrative and all this is condensed together in this
range from max(0, 𝑚 − (𝑁 − 𝑟)) to min(𝑟, 𝑚).

Think about it, prove why this is true; that is the range of 𝑋. Like I told you range can become
complicated, it seems like a simple thing, but it involves some logic in putting things together.
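The three illustrative cases can be checked with a one-line Python helper (a sketch; the function name is mine):

```python
def hypergeom_range(N, r, m):
    """Possible values of X = number of type-1 persons in a sample of m, without replacement."""
    return list(range(max(0, m - (N - r)), min(r, m) + 1))

print(hypergeom_range(100, 50, 20))  # 0 to 20
print(hypergeom_range(100, 10, 20))  # 0 to 10: only 10 type-1 people exist
print(hypergeom_range(100, 90, 20))  # 10 to 20: only 10 type-2 people exist
```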
(Refer Slide Time: 18:05)

So, what about the PMF? So, the range was given to you. What is the PMF? Once again I will give
you the formula; I am not going to prove this formula for you, but you can see why it is true.
You have a population of 𝑁, 𝑟 of type 1, 𝑁 − 𝑟 of type 2, I am picking 𝑚,
and I want 𝑘 of type 1. So, what is the favorable case for me? 𝑓𝑋 (𝑘) = 𝑟𝐶𝑘 (𝑁−𝑟)𝐶(𝑚−𝑘) / 𝑁𝐶𝑚.

So, that is the PMF for you. So, this is the hypergeometric distribution, the hypergeometric random
variable; I have specified the PMF for you. So, I think this is the last slide that I have here. So,
that summarizes the common distributions that we are likely to see in a course like this. So, we
had the uniform, we had the Bernoulli, we had the binomial, we had the geometric. We have
seen these before.
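A Python sketch of this PMF using `math.comb` (illustrative; the parameter values 𝑁 = 100, 𝑟 = 10, 𝑚 = 20 are one of the cases above), checking that it sums to 1 over the range:

```python
from math import comb

def hypergeom_pmf(N, r, m, k):
    """P(X = k) = rCk * (N-r)C(m-k) / NCm."""
    return comb(r, k) * comb(N - r, m - k) / comb(N, m)

N, r, m = 100, 10, 20
lo, hi = max(0, m - (N - r)), min(r, m)  # range of X
total = sum(hypergeom_pmf(N, r, m, k) for k in range(lo, hi + 1))
print(total)  # 1.0 up to float round-off (Vandermonde's identity)
```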

We had the negative binomial, which is a nice generalization of geometric number of trials for
𝑟 successes, then we had the Poisson random variable, we have not yet seen an application, we
will see it soon enough, it occurs quite often, surprisingly often, and then we had the
hypergeometric random variable, which arises in sampling. So, with that we will conclude this
lecture and move on to the next one.
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 2.5
Poisson random variable: Queues and meteorites
Hello and welcome to this lecture. In this short lecture we will look at some applications of
the Poisson random variable in the sense of where it shows up in some, in totally different
unusual circumstances, you will suddenly come across the Poisson random variable as it is
very common and it applies in a certain nice scenario. And I thought it is a nice opportunity
to go back to a very practical sort of lecture where we talk about some data, look at a couple
of types of data and show how this Poisson random variable seems to be a very natural fit.

If you go back and rewind to one of the earlier lectures, we spoke about how statistics is a
way of explaining random-like phenomena that appear in various natural settings. And we
will look at two such settings and we will try and explain one such phenomenon, and the
Poisson random variable is sort of at the heart of it. So, let us get started.

(Refer Slide Time: 1:13)

So, quite often we have this phenomenon that we have events occurring over a period of time
and they seem to occur at random. So, suddenly something will happen and then something
else will happen like that, so the same event so to speak, so it keeps happening over a period
of time at the exact time it appears is probably random, but one can say a lot about the
patterns, one pattern you can say a lot about is the average number of people who arrived at a
certain time or average number of events that happened over a certain time.
So, I am giving you a few examples here, let us firm up our mind with a few examples and
then talk more specifics. So, let us say you have a booking counter, like a movie theater or
railway booking counter or bank some, the teller in the bank waiting for a queue, et cetera,
any place where you have to queue up and you want to sort of think how often, you want to
say something about the queue length.

So, you want to see how the queue picks up when people come and this arrival of a person
into a queue wherever that queue is sort of like a random occurrence. So, we do not know
ahead of time when exactly somebody is going to walk into a queue. Now this setting shows
up in so many places, so supposing, so many unusual ways in which the same thing happens.
So, few more things have been put up there.

You are running a website and you want to see when people come and access the content of
your website. So, now again there is some randomness there in the arrival, and that seems to
be, there are so many different people who can decide to check out your website on a
particular day, you do not know ahead of time who is going to come when, there is a certain
random phenomenon going on here.

There is a few more things about emission of a particle by radioactive decay, so there is lots
of randomness in that, it is difficult to know exactly where every particles, every state is and
all that and which one is going to come off and then and the rate at which it comes and all
that is an interesting phenomenon to measure. So, radioactive decay is very important.

And another interesting thing is meteorite entering earth's atmosphere. There are so many of
these small objects floating around in space, when exactly do they cross paths with earth and
how do they enter etcetera, etcetera, it seems like this one after the other it would come and it
is not clear when that would come in, and you can push this to so many other things; I mean it
is the queueing of random arrivals into a queue, and the queue may be the atmosphere or
something different in different places.

But this phenomenon of there is a queue in some sense and then people are arriving into that
at random times is very-very common. You have a traffic signal, cars come at random times
in some sense, everything else, so many applications. So, quite often in many of these
examples a couple of things are common. So, these are two points here, these are very crucial
and quite often these conditions are satisfied.
One is you can assume a certain arrival rate, maybe that arrival rate will vary over time, so
sometimes arrival rate may be larger, some other times it may be smaller, but generally over
large periods of time you may be able to assume that the arrival rate is fixed. Like, for instance,
meteorites entering the earth's atmosphere: for the number of meteorites per
certain unit of time, let us say per month or per day or per year or something, it is
easy to think of or imagine some arrival rate.

Same thing with other type of phenomenon that I spoke about just now, radioactive decay,
there is like an average emission rate, you can think of people coming into a queue, now that
depends on the time of the day and whether it is, various scenarios, but maybe over a long
enough period of time you can assume the arrival rate is fixed.

Now, here comes the next assumption or next assumption that you want to hold true that one
is that supposing one person has come one arrival has happened, one event has occurred, the
time you have to wait for the next event could be assumed to be independent of what has
happened in the past. In some sense the phenomenon of arrival of a new event or occurrence
of a new event and that you are tracking in this way is not dependent on the past of what you
observe.

That is quite reasonable in many cases. I mean if you are running a website, you want to see
how, what is the time the next person comes in, it is not like the same person is coming back
again and again or people are waiting in a queue to come into your website, they have access
to the machine from all over the place, so the scenario is similar even after so many people
have come, maybe, so these are interesting assumptions.

So, it turns out under these two assumptions if you fix a certain period of time and then look
at how many arrivals will happen that will be a Poisson random variable and it is a very
interesting fact and we will not prove or anything like that, but this hopefully gives you a
sense of why the Poisson random variable shows up all over the place, so this is the reason.

So, if you have, if you are tracking arrivals or any such event, your tracking events over a
period of time and they seem to occur randomly, but their arrival rate is fixed and once
something, once that arrival happens the time you have to wait for the next arrival is
independent of all that happened in the past, under these scenarios the number of arrivals over
a period of time ends up being a Poisson random variable.
So, now why is number of arrivals over a period of time important? You may be managing a
queue, I mean, as in your bank; you may be running a bank, let us say, and then you have to
decide how long the queue could be, or maybe you open a counter if the queue goes too long,
so you really want to know over a certain time how much the queue can build up, so you
want to build up theories like this and we are famous for queues.

We are a country where queues are all over the place; you only have to go to this temple called
Tirupati nearby to know how complicated a queue they maintain, so maintaining queues in
this country is a big problem. So, this Poisson random variable is going to show up again and
again when you think of queues and this is the reason why. So, I am going to give you a
couple of examples.

I gathered some data on radioactive decay and the meteorites entering the atmosphere, two
very different sort of events and Poisson will show up in both places. So, let us see how to do
things like this. Take data and try and fit some random variable into it. We will look at this in
more detail later, but I will just show you some preliminary examples of how such things
look.

(Refer Slide Time: 7:54)

So, first you need to look at the data, and here is data on radioactive decay, I
mean, alpha particles being emitted by radioactive decay. So, there are three things I have put
down here. First of all, there is a time interval of 7.5 seconds over which you are counting how
many particles came, there are 2608 such time intervals over which somebody has counted
the number of particles and they saw that out of this 2608, 57 time intervals, in during 57
time intervals no particles were emitted, 0 particles were emitted.
And 203 other time intervals exactly 1 particle was emitted, so once again remember there is
this time slot of 7.5 seconds and you have 2608 such different time slots and somebody
measured how many alpha particles are coming out using some radiation counter or
something. So, then they saw that out of these 2608, in 57 time intervals no particle came,
in 203 intervals 1 particle came, in 383 intervals 2 particles came, all the way up to 16 such time intervals when
10 particles were emitted.

So, now we can make a fraction out of this number of times. So, I take 57/2608, I get a
fraction 0.022, this is almost as if this is like the probability like 0.022 is the probability that
no particles are emitted in a 7.5 second interval, is not it? Then 203/2608, this is sort of like
the probability 0.078 is the probability that one particle is emitted in a time interval of 7.5
seconds.

This is a reasonable way to do probability and then and this is how we come up with these
fractions. So, emission rate, the average number of particles that are emitted over, we would
keep a large enough window, the total number of particles divided by 2608, is not it, that is
the total number of, that is sort of like the average and that works out to some 3.8673, so that
is the emission rate per unit time in some sense, you expect 3.8673 emissions to occur. So,
this is again based on the data, I just added up all of them and divided by 2608.

So, it is sort of also reasonable to assume the time of the next emission is independent of the past;
you have a big enough sample maybe and too many things happen there, so you do not have to
worry about the past. So, now what is the Poisson model? You remember the Poisson
PMF: it takes integer values 𝑘 with probability 𝑒^(−𝜆) 𝜆^𝑘/𝑘!.

So, now we are going to suppose that this fraction that I have here obeys the Poisson PMF; it
sort of looks like a Poisson PMF, it will not be exact but maybe close enough to it. So, how
do I fit it in some sense? I find this 𝜆 from the emission rate and then I plug in this 𝑘 there:
𝑃(𝑘 emissions in a 7.5 second interval) = 𝑒^(−𝜆) 𝜆^𝑘/𝑘!. That is the Poisson model for emission.

So, if the fraction that I have seen from the data is close enough to this Poisson PMF, then we
can sort of say that maybe this follows the Poisson random variable; that is sort of like a
rough way to see it. But are these two things close? How do you assess all that? All that is,
there is lots of statistics in that kind of work.
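A first-order check in Python (a sketch; it uses only the counts quoted above, namely 57, 203 and 383 intervals for 𝑘 = 0, 1, 2, and the emission rate 𝜆 = 3.8673 from the lecture):

```python
from math import exp, factorial

lam = 3.8673                      # emission rate estimated from the data
counts = {0: 57, 1: 203, 2: 383}  # time intervals with k emissions (from the slide)
n_intervals = 2608

# compare the observed fraction with the Poisson model e^(-lam) * lam^k / k!
for k, c in counts.items():
    observed = c / n_intervals
    model = exp(-lam) * lam**k / factorial(k)
    print(k, round(observed, 4), round(model, 4))
```

The observed fractions and the model values land close to each other, which is the visual match you see in the plot on the next slide.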

(Refer Slide Time: 11:29)


But let us just do like first order, we will just plot it and see how it is, so that is one way of
quickly visually checking that is this a viable model and you see it seems quite viable, is not
it? You see that the blue lines are actually the fractions that I have plotted and the red crosses
are the Poisson PMF values, the PMF value you got from 𝑒^(−𝜆) 𝜆^𝑘/𝑘!; that is the red value and the blue is the actual fraction.

So, look at the matches, it is quite good, is not it? So, this is one simple example where I have
shown you this Poisson fit to a data that comes from emissions of a radioactive object, so that
is radioactive decay.

(Refer Slide Time: 12:16)

You can repeat the same for what are called fireballs or meteorites entering the atmosphere. If
you look at data for 276 months from a period 1994 to 2016. Once again this data says, in 24
of these months no fireballs entered the atmosphere, in 54 of these months, of these 276
months, exactly one fireball showed up in the atmosphere and so on till in one month 10
fireballs showed up.

Once again I can do fractions, sort of to see the probability: what is the
probability that no fireballs will show up in a month? For that I just divide 24/276. So, I do
that, I get this fraction. Once again we can compute the arrival rate; we get 2.5217 as the arrival rate
here, and you can think of the time of the next arrival,
so it is also very reasonable to say the time of the next arrival is independent of the past;
it does not seem totally unreasonable.

Once again the Poisson fit, the Poisson model, is going to be 𝑃(number of fireballs in a month = 𝑘) = 𝑒^(−𝜆) 𝜆^𝑘/𝑘! with 𝜆 = 2.5217,
and we expect the fraction to be close to the Poisson PMF; then you have a good fit. So, once
again I want you to notice the contrast between the two things: one is radioactive decay
from some element.

Here is meteorites entering the atmosphere, and there is this common thing of the arrival rate
and the time of the next arrival being independent, and magically that gives you the same fit, the
same kind of PMF for this number. Once again, remember this is the probability that the number
of fireballs you see in a month is equal to 𝑘; I am expecting it to be a Poisson
fit, and let us see how good this fit is.
(Refer Slide Time: 14:09)

Once again we make this plot of the fraction and the Poisson PMF. I mean, it is maybe off
here and there a little bit, but generally the fit is quite reasonable, and so people talk about
Newton's explanation of gravity and how it fits the apple falling down from the tree and the
planetary motion, so here is an example where you have this statistical phenomenon
explained with the Poisson random variable and it fits to a radioactive element on earth and
as well as meteorites that enter the atmosphere.

So, maybe statistics is not too bad after all for explaining random-like phenomena that
happen in the universe. Great! So, that is the conclusion of this lecture. Hopefully, we
will see more such examples; hopefully for many other examples also we will find nice
models like this and see how we can use them in making predictions and other inferences. So
that concludes the lecture, thank you very much.
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 2.6
Functions of one random variable
Hello and welcome to this lecture. We saw the definition of a random variable, a discrete random
variable, we saw how in various situations the most crucial thing is finding the range and PMF of
the random variable and then you can work with it. And we saw several different distributions,
commonly occurring distributions and how to work with them.

In this short lecture we are going to look at functions of one discrete random variable. You have a
discrete random variable and then you look at a function of it. Now, how do you deal with that
scenario? What is that object? Function of a discrete random variable, what kind of an object is it
and how do you deal with it that is the objective of this lecture. It is a relatively short lecture, let
us get into it.

(Refer Slide Time: 0:56)

So, functions of a random variable. So, we have a random variable X, which we now know quite
well: it has a certain range, it has a certain PMF, and that is all I care about. Once I know
that, I understand the random variable quite well. So, we know that the random variable itself is a
function; it is a function from the sample space of the probability space you are dealing with
to its range T.
And we are going to take the range as a subset of the real line, all of this we are going to do. Let
us say f is some function from the real line to the real line, some other function it could be any
function you want to, 𝑋 2 , 𝑎 𝑋 , log(𝑋), whatever function you want to think of you can have f as
that function. Can we think of 𝑓(𝑋)? What is 𝑓(𝑋)? So, now random variable takes values on the
real line.

Of course, I can hit f on those values and I will get something, but what kind of an object is it? The
random variable itself takes values with some probability, so if you look at f of the random variable,
it is very natural that it is another random variable, and it is a random variable in that same probability
space. You can simply see 𝑓(𝑋) as a composition of two functions: X is a function from the sample space
to T, and f is a function from T to something else, so 𝑓(𝑋) will simply be a function from the sample
space to some other range. So, 𝑓(𝑋) is also a random variable. I guess it is not very surprising. I wrote
down a small little argument here, but if you have X as a random variable and f as a function, 𝑓(𝑋) is
another random variable, so it is reasonable to think of it this way. So, let us see a couple of examples.

(Refer Slide Time: 2:50)

The best way to understand what to do with functions of a random variable is to look at a few
examples and then talk about the general case. So, I am going to take a couple of examples.
All right, here is a situation where X is a uniform random variable in a set. You know this. So,
what is the PMF? The range of X is this guy, this guy is the range. And what is the PMF?

The PMF, if you want to write f_X(−2) and all of these guys, maybe I should write it in a table;
a table is better. So let us look at t and then f_X(t):

t:       -2    -1    0    1    2
f_X(t):  1/5   1/5   1/5  1/5  1/5

A table is always very, very useful here, and this is the random variable.

Now, the function f I am looking at is f(x) = x², and I want to find the range and PMF of
𝑓(𝑋), which is the new random variable. My f(x) is just x², so maybe it is interesting to
also look at t², which is 4, 1, 0, 1, 4. So, that is what t² is. Now, notice,
when I say f_X(t) is 1/5, I mean the probability that X equals −2 is 1/5, and so on.

So, that is what this means. So, now what is the meaning of this 𝑓(𝑋)? Now, when I do f of the
random variable X, the random variable X takes values in the range -2, -1, 0, 1, 2, now if I hit f
with this function what will I get, if I act with this f here, okay, I am going to get these values, is
not it, 𝑡 2 , so 4, 1 and 0.

So, we quickly see that if X belongs to {−2, −1, 0, 1, 2}, f of X, which is 𝑋², will belong to {4, 1, 0};
it is not typical to write it like this, it is typical to write it in increasing order, so {0, 1, 4}, so it is easy
to see that. So, X took a discrete set of values, and 𝑋² will take 0, 1 and 4 as its values: take any
of these values and square them, you will only get 0, 1, 4. So, this settles the range of 𝑓(X).

So, you see, the range of a function of a random variable is found very trivially: simply take the range of X,
act on it with f, and see what set you get; that is your range of 𝑓(𝑋), so it is quite trivial.
So, now the PMF of 𝑓(X) = 𝑋². What is the PMF now? I know 𝑋² takes values 0, 1, 4; now for
the PMF all I have to do is find the probability that 𝑓(X) equals 0, is it not?

So, 𝑓(X) = 0 is an event, and that event is actually equal to X also being
equal to 0, is it not? When will 𝑓(X) be 0? We see from the table that 𝑓(X) is 0 only when X is
0, so that is the event X = 0, and this probability is the same as the probability that X equals 0,
which is 1/5; it is very easy to see. Now, what about the probability that 𝑓(X) equals 1? This is the event that
X belongs to {1, −1}, is it not?


So, you see here, if X is −1 you get 𝑋² as 1, and if X is 1 you also get 𝑋² as 1. So, for 𝑓(X) = 1,
this event and that event are identical. That is the crucial factor here: the event f of
X equals something can be written as X belonging to something. If you look at a function 𝑓(X)
and say I want to look at the event defined by 𝑓(X), I can in turn define it with X, because
𝑓(X) is after all a function of X.

So, this is X belonging to {1, −1}, and that is 2/5. In a very similar way you can show that the
probability that 𝑓(X) equals 4 is also 2/5. So, what is the PMF of 𝑓(X)? If you think of a value a
and the PMF of 𝑓(X) at a, in this case with 𝑓(X) = 𝑋², it is 1/5 at a = 0, 2/5 at a = 1 and 2/5 at a = 4;
or you can even write it as the PMF of 𝑋² at a, so these are various ways to write it. Once again the
crucial aspect is that the event that 𝑓(X) takes a particular value can be written in terms of X taking
a particular value or a set of values.

So, you saw here that 𝑋² being 0 means X is only 0, that is the only possibility, but if 𝑋² is 1, you
could have X being 1 or −1. That is the only thing you have to pay attention to: this detail of
finding the inverse values. Given a value of 𝑓(X), when you want to find the values of X you may get
multiple possibilities, and you have to account for that carefully; f may not be a one-to-one
function, and then you have to pay attention to detail.

So, that is the simple way to do it. I did not do any great theory for it, I just wrote down the
problem and showed you how to do it; hopefully this was clear.
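The calculation in this example is easy to check with a few lines of Python; the dictionary-based bookkeeping below is just one way to organize it, not anything prescribed in the lecture.

```python
from collections import defaultdict

# PMF of X, uniform on {-2, -1, 0, 1, 2}
pmf_X = {t: 1 / 5 for t in [-2, -1, 0, 1, 2]}

# PMF of f(X) = X^2: group the probability mass by the value of t^2,
# which handles f not being one-to-one (both -1 and 1 map to 1)
pmf_X2 = defaultdict(float)
for t, p in pmf_X.items():
    pmf_X2[t ** 2] += p

print(dict(pmf_X2))   # {4: 0.4, 1: 0.4, 0: 0.2}
```

The range of 𝑓(X) is simply the set of keys, {0, 1, 4}, and the probabilities 1/5, 2/5, 2/5 match the values computed above by hand.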

(Refer Slide Time: 9:02)


Let us see one more example, maybe a slightly more complicated random variable. We will take
a geometric random variable with parameter p = 0.5; you remember the geometric random variable.
So, f_X(k) = (1 − p)^(k−1) × p, for k = 1, 2, and so on. In this case 1 − p = 1/2 and p = 1/2,
so you see it is just (1/2)^k, a very easy formula for a geometric random variable.

Now, what is my function f? It is an interesting little function. If you were to sketch 𝑓(x)
versus x, as x goes through 0, 1, 2, 3, all the way to 5, 𝑓(x) is equal to x; y = x is just
the y = x line, this is 5 and this is 5, this is 2 and this is 2, like that. So it is x up to 5,
and then after 5 what happens? It stays at 5. This function does not keep continuing to increase.

It sort of limits it to 5: it goes up to 5 and then clips at 5; beyond 5 there is no
increase possible. So, now we want the range and distribution of 𝑓(X). The geometric random
variable starts at k = 1 and takes values 2, 3, 4, 5, and so on, it keeps on going. Here f maps
1 to 1, 2 to 2, and so on: if you think of k for the geometric as 1, 2, 3, 4, 5, 6, 7, and so on,
𝑓(k) is going to be 1, 2, 3, 4, 5, and then after that it is just 5, 5, and so on.

So, what is going to be the range of 𝑓(X)? It is {1, 2, 3, 4, 5}, that is it; it does not go
beyond 5, that is the nature of this function. You see what I did here: I took the range of X,
simply applied f on each value, and saw what set I got, and that becomes the range; it is a very
easy way to find the range. Now, for the PMF, I simply have to find the probability that 𝑓(X) equals 1, 𝑓(X)
equals 2, and so on.

That just comes from the PMF itself, is it not? Let us just evaluate it: the probability that
𝑓(X) equals 1 is the same as the probability that X equals 1, and that is just
half. This goes on till the probability that 𝑓(X) equals 4, which is the probability that X equals 4,
and that is (1/2)^4. Now, the probability that 𝑓(X) equals 5: this happens whenever X is greater
than or equal to 5, is it not?

This is the event: when 𝑓(X) equals 5, X could be 5, 6, 7, anything greater than or equal to 5.
So that would come out as, maybe I will write it like this, (1/2)^5 + (1/2)^6 + ⋯ and so on. Now, you know
the GP formula; you can use the summation of the geometric progression,

(1/2)^5 / (1 − 1/2),

so you just get (1/2)^4.

So, if you want to write the PMF here, in this case it is 1/2, 1/4, 1/8, 1/16 at 1, 2, 3, 4, and at 5 it is
also 1/16. Of course, you do not have to go to great lengths and compute the probability of 𝑓(X) equal
to 5; you can also do this as 1 minus the probability that 𝑓(X) belongs to {1, 2, 3, 4}. That is also
possible, and it is a very quick way of doing it; you will see that it ends up being 1/16. I just did it in
two different ways to show you, and you can check that this has to work out correctly.

So, I showed you two examples where there were two different scenarios, two different random
variables and we hit it with the function and we saw how to find the range of that function of that
random variable and how to find the PMF of that. So, it is quite a simple thing to do.
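The clipped-geometric calculation can also be sketched in a few lines of Python. This is just an illustration of the second method from the lecture: compute the mass at 1 through 4 directly, and get the mass at 5 as 1 minus the rest.

```python
p = 0.5

def geom_pmf(k, p=0.5):
    """PMF of a Geometric(p) random variable on k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

# PMF of f(X) = min(X, 5): values 1..4 keep their mass,
# and all of k >= 5 collapses onto the single value 5
pmf_f = {k: geom_pmf(k, p) for k in range(1, 5)}
pmf_f[5] = 1 - sum(pmf_f.values())   # tail probability P(X >= 5)

print(pmf_f)   # {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625, 5: 0.0625}
```

Note that the mass at 5, 1/16, agrees with the geometric-series computation (1/2)^5 + (1/2)^6 + ⋯ = (1/2)^4.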

(Refer Slide Time: 13:38)

I will show you one last slide where this idea is written out in more detail. You have a random
variable X with PMF f_X(t), and 𝑓(X) is a random variable, a function of X. The PMF of 𝑓(X) at a,
which is the probability that 𝑓(X) equals a, is basically the probability that X belongs to the set
of all t such that 𝑓(t) = a. This is like the inversion I am doing: for 𝑓(X) = a, this set is like
the f-inverse of a, all the values of t for which 𝑓(t) becomes equal to a. Now, this is a probability
I can evaluate using the PMF: sum over all t such that 𝑓(t) = a of f_X(t). So, whatever I did in those
two examples I have written out in detail for you. You can find the PMF of 𝑓(X) using the PMF of X.
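The inversion on this slide translates directly into a small helper; a minimal Python sketch (the name `pmf_of_function` is my own, not from the lecture):

```python
from collections import defaultdict

def pmf_of_function(pmf_X, f):
    """PMF of f(X): the mass at a is the sum of f_X(t) over all t with f(t) = a."""
    pmf = defaultdict(float)
    for t, prob in pmf_X.items():
        pmf[f(t)] += prob
    return dict(pmf)

# Reproduces the first example of the lecture: X uniform on {-2,...,2}, f(x) = x^2
print(pmf_of_function({t: 1 / 5 for t in range(-2, 3)}, lambda x: x * x))
```

Accumulating into a dictionary keyed by f(t) is exactly the "sum over all t such that f(t) = a" in the formula, and it automatically handles f not being one-to-one.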

So this is one useful idea; we will keep revisiting functions of random variables later as well,
but when you have one discrete random variable, finding the PMF of any function of it is a very
easy process, and hopefully you are convinced of that. That is the end of this lecture. Thank you
very much!
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 3.1
Multiple Random Variables
Hello and welcome to week 3 of this course Statistics 2. And this is the week from which
things are going to become new in this class. The first 2 weeks we did a quick review of the
basics of probability theory and then in the second week we saw discrete random variables,
PMFs and all that. Up until this point, it has mostly been things you have already seen in
Statistics 1, and it was a quick review in some sense.

Hopefully, it was also interesting; we looked at the connection to data a little more
strongly, with enough examples here and there, and maybe that was a bit new. But from now
on things will really get a little bit different from what you have seen before. For instance,
the first thing that has changed is the format of the slides. I was using another type of
slides before, but now we have a distinctly different format for the slides.

It is also maybe to symbolize that things are changing from this week onwards. So, basically
we will start looking at multiple random variables. If you remember before we saw that
random variables are very useful ways to deal with the actual probability space, which may
be very complex. So, you look at functions from the outcome to a specific number and then
only look at that. And the distribution of that has something to say about what the probability
space is, so this is very useful.

And now we are going to start looking at multiple random variables. It turns out, as you saw
before, that the usual outcomes and experiments that you study in practice are quite
complex: there are a lot of different outcomes that are of interest to you, and a lot of
different random variables defined on the same probability space that are of interest to you.
And these random variables could be connected, because they are all defined on the same
space.

So, how do we deal with that? What are the tools and what are the definitions of
distributions? How do you deal with the distribution there? How do you do computations
with that and how do you generally think about such scenarios is the main focus of week 3.
Once again we will see toy examples, we will see slightly more complicated examples and
maybe not too much of python coding this week also, but let us get started with how to deal
with multiple random variables in a probability space.
(Refer Slide Time: 2:35)

So, let us begin with a very simple example. Let us say you have a fair coin and
you toss it 3 times: toss it, pick it up, toss it again, pick it up, toss it again. So you
will have 3 different outcomes, and naturally you can define 3 random variables. So,
here is a very natural definition for 3 random variables. I am going to call X1 as something
that indicates whether the first toss was heads or tails.

So, this notion of indicating will be very useful. We will often define these type of random
variables. They will take the value 1 if some event happens and 0 if it does not;
it is sort of like the Bernoulli trial random variable. These are very popular in probability;
they appear again and again and again. So, this X1 is an indicator for the first toss being
heads, if it is 1, if the first toss is heads, 0 if the first toss is tails.

And now you can do X2 and X3, is it not? X2 is for the second toss, X3 is for the third toss.
Now, these 3 random variables are defined on the same space, and here is a ready-made
example: this quickly points you to a very typical scenario where you will use multiple
random variables, basically repeated Bernoulli trials. When you have repeated trials of any
sort, you are naturally going to have multiple random variables, and that is something that
will show up all the time.

And also, you notice here something very interesting about the way these random variables
are defined: if you take all 3 of them together, these things completely describe the outcome
of the experiment. The experiment is simple enough, it is a toss experiment, so you do not
have a very, very complex outcome, and you define 3 random variables that together completely
specify the outcome.
And another interesting observation we can make about this particular case is events that you
define with X1, supposing I define the event X1 equals 1. We saw before the random variable
taking a particular set of values or taking one value is actually an event. Now, if you define
an event with X1 alone, it is going to be independent of any event that I define with X2 alone.

There are really only very few events we can define here, it is just 0 and 1, but still, X1 = 1
is independent of X2 = 1, and it is independent of X3 = 1. So, even though all of these random
variables live in the same probability space, they are talking about independent occurrences
in the experiment, and so if you define events with each of them separately, those events are
going to be independent.

So, notice, what I did there is like a short line in which a lot of things were going on.
If you define event 1 with X1 and event 2 with X2, whichever way you define them,
these 2 events are going to be independent. That is another observation you can make
about these things. So this is a very typical setting, once again, where we will use multiple
random variables to describe the outcome.
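This independence is easy to illustrate with a small simulation, in the spirit of the computer simulations mentioned earlier in the course; the sample size and seed below are arbitrary choices of mine.

```python
import random

random.seed(0)
n = 100_000

# Indicators for "first toss is heads" and "second toss is heads"
count_1 = count_2 = count_both = 0
for _ in range(n):
    x1 = random.randint(0, 1)   # X1: 1 if the first toss is heads
    x2 = random.randint(0, 1)   # X2: 1 if the second toss is heads
    count_1 += x1
    count_2 += x2
    count_both += x1 * x2       # both events happened

p1, p2, p_both = count_1 / n, count_2 / n, count_both / n
# For independent events, P(X1=1, X2=1) should be close to P(X1=1) * P(X2=1)
print(p_both, p1 * p2)
```

Both printed numbers should be close to 1/4, reflecting that the event {X1 = 1} is independent of {X2 = 1}.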

(Refer Slide Time: 5:50)


So, let us move on to something that is slightly more complicated, maybe a little bit more
confusing. So, this is like a lottery, I mean, so you have a 2 digit lottery. We have seen this in
one such problem before, but let us go back and revisit it again. So, you select a 2 digit
number 00 to 99 uniformly at random. So, this is what you selected.

So, this is already complex; maybe not that complex, but still a little bit more
complex than the previous one. In the previous one there are really only 8 different outcomes,
which is easy to deal with; here there are 100 different outcomes, there are 2 digits, and
there is a number that comes out. You can do various things with it, extract a lot of partial
information about the outcome that comes out of this experiment, and define random variables.

Like, for instance, I have defined here X to be the digit in the units place. So, you have 00 to 99;
the units place is the first one from the right, and then you have the tens place, so those are the
2 digits. The first random variable I am defining here, this X, is simply the number in
the units place. To take an example, supposing the random number is 23, X will take
the value 3. Now, I am defining one more random variable.

Notice what I am doing. Y is the remainder (there is a spelling mistake on the slide,
"reminder", which I will underline so that you can correct it later) that is obtained when the
number itself is divided by 4. So, let us see a few examples. Suppose the outcome is 23;
then X is 3, and what will be Y?
So, if you divide 23 by 4 you will get remainder 3, so Y also becomes 3. Let us take
another case, say 48: for 48 you are going to get X = 8 and Y = 0, is it not?
Likewise, for any number you take, you will get X taking values from 0 to 9. And you can show,
in fact, that X is actually uniformly distributed in 0 to 9.

We will maybe see some calculations later on, but this is not too hard to imagine. What is the
probability that X = 0? It is the same as the 2 digit number being 00 or 10 or 20 or 30 and so
on; there are 10 favorable outcomes out of 100 total outcomes, so 10/100, and it becomes 1/10.
It is the same thing for 1 also: for the probability that X = 1, again there are 10 favorable
outcomes, 01, 11, 21, all the way till 91, so 10/100, which is 1/10. So all of them are equally
likely, and X is uniform.

So, likewise maybe this is a little bit more complicated for you to see, but Y will also be
uniform in 0, 1, 2, 3. I will leave this as an exercise for you, but you can again check, so what
is probability that Y = 0, what are all the favorable outcomes for you that result in Y being 0.
It could be 00 or 04 or 08 or 12 or 16 like that, so all the way up to 96, all the multiples of 4
that are between 00 and 99 and you will see that is exactly 25 in number, so 25/100, you will
get 1/4 for the probability that Y = 0.

Now, for Y = 1 maybe you have to think a little bit more, but you can see what are all the
numbers that will give you a remainder 1 when divided by 4, the first choice is 01. Next is 05,
next is 09, 13, 17, so on, so you will see that again is also exactly 25, you take all the
multiples of 4 and add 1 to them you will get these numbers, that is again 25. Likewise you
will see for 2 and 3 also, there is 25 favorable outcomes out of the 100, so you will get 1/4, so
X and Y are uniform in their respective ranges and their ranges are different.
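The exercise left here is easy to settle by brute-force enumeration of the 100 equally likely outcomes; a short sketch:

```python
from collections import Counter

# Enumerate the 100 equally likely outcomes 00..99
x_counts = Counter(n % 10 for n in range(100))   # X: the units digit
y_counts = Counter(n % 4 for n in range(100))    # Y: the number modulo 4

print(x_counts)   # each of 0..9 occurs 10 times, so P(X = t) = 10/100 = 1/10
print(y_counts)   # each of 0..3 occurs 25 times, so P(Y = t) = 25/100 = 1/4
```

This confirms that X is uniform on {0, …, 9} and Y is uniform on {0, 1, 2, 3}, as argued above.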

So, here are 2 different random variables that I defined on a common probability space. If you
look at dependence or independence: previously, in the coin toss experiment, if I
defined an event with X1, we saw that it was clearly independent of anything I defined with
X2. But notice what can happen here. Supposing somebody told you the event X = 1 occurred,
so X = 1 has occurred, which means the last digit is 1.

Then if you look at the event Y = 0, what do you think will happen here? Can Y
be 0 if X is 1? That is not possible: any number that ends in 1 cannot be a multiple of 4; it
cannot give you a remainder of 0 when you divide by 4. So, you see that for this X and Y, if
you define an event with X and an event with Y, these 2 events are not going to be
independent.

The occurrence of one affects the occurrence of the other. That clearly shows that these 2
events are not independent, so this is a different type of pair of random variables we have
defined here, and one seems to affect the value of the other. So, this can also happen. We
already saw two examples: one was a simple coin toss example, where you define one random
variable with the first toss and another with the second toss, and those two do not affect
each other.

And here we define something slightly more intricate with a two digit number and we are
seeing that one random variable can influence the other. Now, this is important in practice,
and we will go to the next example to see how things like this can play a role
in modeling and other things in practice.
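The dependence claimed here can also be checked by enumeration; a short sketch:

```python
# Outcomes where the units digit is 1 AND the number is a multiple of 4
both = [n for n in range(100) if n % 10 == 1 and n % 4 == 0]
print(both)   # [] -- the two events never occur together

# P(X = 1, Y = 0) is 0, while P(X = 1) * P(Y = 0) = (1/10) * (1/4) = 1/40,
# so the product rule fails and the events are not independent
p_joint = len(both) / 100
print(p_joint)
```

Since numbers ending in 1 are odd and multiples of 4 are even, the intersection is empty, which is exactly the dependence described in the lecture.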

(Refer Slide Time: 11:57)

We go back to our favorite IPL powerplay over data. This is something that is of interest to
us, we have been keeping this as one common theme throughout the course. So, here is once
again what is my experiment, it is one over of IPL powerplay, maybe you take it to be the
first over for instance, I mean, if you want a specific example or maybe the third over or
fourth or any particular over you pick or maybe any over, all the possible overs.

So, here is, here are two random variables that I have defined in this IPL powerplay, one over
of the IPL powerplay, the first random variable I will call X, this is the number of runs scored
in the over, the total number of runs scored in all the deliveries that were part of the over. So,
that is X for you. Then you can have a number Y, the random variable Y which is the number
of wickets that fell in the over.

So, once again notice, there are two different random variables, X is the number of runs in the
over, Y is the number of wickets in the over. Now, if you think of once again the IPL
outcome is, the out over is quite a complex outcome, so many things happen in the over and
you are not going to be able to model everything properly, but let us see if we can model X
and Y.

This is of specific interest, is it not? X and Y are very interesting random variables as far as
the entire outcome is concerned in this experiment. So, if you look at the events Y = 0, Y =1,
Y = 2, Y = 0 is probably the most common, there were no wickets that fell and Y = 1, one
wicket fell in the over and Y = 2, two wickets fell in the over. So, this is something that can
happen. It can happen in the powerplay if you have good bowling.

So, now notice what we can say, so there is this connection between X and Y that you may
imagine in your mind as a possibility when you model, like for instance supposing given Y is
0, we expect X to take larger values than when given Y is 1 and given Y is 2 we generally
expect X to take significantly lower values.

Now, it may or may not be true; maybe the last 2 wickets fell in the last 2 balls
or something like that, it could happen that way, maybe all the runs were scored before that.
But you do not imagine that to be likely: if wickets are falling, then the runs are going
to be slightly lower. So notice how this kind of thinking and modeling from
our side is helping in the understanding of data that comes from the IPL powerplay.

Supposing you want to build a model for the IPL over and you want to think of these
random variables, maybe you have to take these kinds of things into account. Given the
number of wickets, maybe you want to modify the distribution for how many runs were scored,
things like this. So you already see that in complex experiments these kinds of relationships
matter; again, everything is random, there is no deterministic relationship.

I cannot write an equation of any sort, but still, in the distribution there is some understanding
of what might happen if a wicket fell. These kinds of relationships are very important
when you try to model complex experiments, and this is something that is routinely done
when we want to write down distributions; particularly when you want to understand data,
this kind of thinking is very, very useful.
So, hopefully, I have convinced you that there are experiments, both small ones and big
ones, where defining multiple random variables, understanding relationships between
them, and trying to model distributions of these random variables is of great use. So,
what are the main objects that we can focus on when we want to
jointly study multiple random variables? That is what we are going to see next; that is the
first part.

(Refer Slide Time: 15:55)

So, first we will begin with two random variables, just two, we will go to more than two,
soon enough, you will see there are many more you can define, but we will begin with two
random variables. In two random variables it turns out, so you remember we had PMF for
one discrete random variable, when you have two discrete random variables defined in the
same probability space you can have many types of PMFs.

One is something called the joint PMF, then there are objects called marginal PMFs,
and thirdly you will have objects called conditional PMFs. All three of these, one needs to
understand well and be able to manipulate. Previously, we saw that the single PMF
had some simple conditions that it needed to satisfy, and you could play around with it and
use it for various things.

Similarly, when we go to joint PMFs, marginal PMFs and conditional PMFs, you
will again have properties that they satisfy; there will be structure, conditions, and relationships
between them, all of which you should be able to write equations for, manipulate, and solve.
All that will happen in this section. So, let us get into it.
(Refer Slide Time: 17:08)

So, first we will define what a joint PMF of two discrete random variables is. So, here is the
definition - X and Y are discrete random variables that are defined in the same probability
space. You can go back to the examples I gave you and think of specific examples if this is
confusing you. So, let us say the range of X and range of Y are known, I mean TX and TY.

So, we saw in the examples that it is easy to think of the range of a random variable once
it is defined. Now, notice how the joint PMF of X and Y is defined. We will first denote it as
fXY. Do not think of it as the product of X and Y. There is that possible confusion, since it may
be the product XY, but usually that is not the case, so when I write fXY, I invariably mean the
joint distribution of X and Y, the joint PMF of X and Y.

But it can happen that I am interested in the product also; we will come back to it later, but
live with this ambiguity for now and it will be clear. Usually people will say "the joint PMF
fXY", so it will be clear to you what it is. Now, what is this fXY? It is a function from the
Cartesian product TX × TY to [0, 1]. And how do we define this function?

It is defined, so fXY, the arguments are two now because it is a Cartesian product, there is a t1
that will come from TX and there is a t2 that will come from TY, for every t1 and t2 this fXY
will assign a number. So, you give me t1, give me t2, give me a value in the range of X, give
me a value in the range of Y, and this joint PMF will assign a probability value between 0 and 1;
it assigns a probability to the pair (t1, t2).

Now, contrast this with the PMF itself. PMF was one random variable, there was a range for
that random variable, for every value in the range the PMF of that point is simply the P(X =
t). Now, in the joint PMF we have two random variables, so clearly we need t1 and t2, one
coming from the range of 1, the other coming from the range of 2 and the joint PMF simply
assigns the probability of X = t1 and Y = t2.

I put t1 here, t1 here, this is wrong, so please note this correction, this is going to become t2,
probability of X = t1 and Y = t2, that is what happens here. So, let me just make sure I make
this correction, a little typo. So, this is defined as the joint PMF. It is sort of clear what it is
and we will see quickly some examples on computation to be sure.

We understand what the joint PMF is. Usually one can write it as a table or a matrix: the
single-variable PMF is written in a table with one row for the value of x and one for the value
of the PMF, and here you have 2 axes, t1 and t2, so you can write it as a table or a matrix.
And notice this popular notation; this notation is very important. Instead of writing "and"
everywhere and complicating things, we will simply put a comma: (X = t1, Y = t2).

So, do not interpret this comma as "or" or anything else; it means "and", both of these together,
and the probability of that is the joint PMF. So, let us look at some examples; it will be
clear with some examples.

(Refer Slide Time: 20:40)

We are going to look at a situation where you tossed a fair coin twice and you defined
indicators X1 and X2 for the first toss being heads and the second toss being heads. So, this is
the definition of that. Now, I know that the range of X1 is {0, 1}, the range of X2 is {0, 1}, and
if I were to evaluate the joint PMF of X1 and X2 at (0, 0), that is
P(X1 = 0 and X2 = 0), is it not?
So, now remember, this is a fair coin tossed twice, and these are independent tosses, physically
independent tosses, so it is very natural to assume these two events are independent. X1 = 0
and X2 = 0 are independent, so the probability of the intersection, the "and", is basically the
product because of the independence. So, you have ½ × ½ and you get 1/4. Same thing with (0, 1).

So, the joint PMF seems easy to evaluate in this case, is it not? You just have to write down
all the possibilities and simply multiply the probabilities, and that is because of the
independence. Independence usually makes things very, very easy. Now we can make our little
table here; notice how this table works. The values of t1 are here, 0 and 1, and the
values of t2 are here, 0 and 1, and this matrix or table is basically

          t1 = 0   t1 = 1
t2 = 0     1/4      1/4
t2 = 1     1/4      1/4

So, for instance, this 1/4 here is the entry at (1, 1). Easy enough; this is all there is to joint
PMFs. Once again, imagine how this will generalize: supposing X1 and X2 are 2 general
random variables, we simply put all the values taken by X1 in the first row, all the
values taken by X2 in the first column, and in each position of the matrix you look at X1
taking the corresponding value and X2 taking the corresponding value, and then you write down
what the probability of that is.

So, that is all it is. You can now build the PMF, and these are very useful tables to
write. Notice a few interesting and important properties; we will look at them later as well,
but this is easy: every entry of the joint PMF takes values between 0 and 1, that is very clear,
and if you add up all the values in this table, you are going to get 1.

So, these are two properties that will show up again and again: each entry is between 0 and 1, and the sum of all entries equals 1. These are very important properties of a joint PMF; any joint PMF will satisfy these two properties. So note it down: each entry in the joint PMF table, once you make it, is going to be between 0 and 1, and if you add up all of them you should get 1. This is very important.

Why should you get 1 if you add up all of them? That exhausts all the possible values that X1
can take and X2 can take and clearly that should be 1. I mean, in any outcome X1 is going to
belong to its range, X2 is going to belong to its range and if you add up everything that
basically tells you X1 belongs to the range and X2 belongs to its range, so that is going to be
1. So, that happens in general.
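Since the sample space here is tiny, the whole joint PMF can be built by brute-force enumeration. Here is a minimal Python sketch (the names `joint`, `half`, etc. are mine, not from the lecture) that constructs the table for the two heads-indicators and checks both properties:

```python
from fractions import Fraction
from itertools import product

# Two independent tosses of a fair coin: each of the 4 outcomes has
# probability (1/2)(1/2) = 1/4.  X1, X2 indicate heads on toss 1, toss 2.
half = Fraction(1, 2)
joint = {}  # joint[(t1, t2)] = f_{X1 X2}(t1, t2)
for toss1, toss2 in product("HT", repeat=2):
    x1 = 1 if toss1 == "H" else 0
    x2 = 1 if toss2 == "H" else 0
    joint[(x1, x2)] = joint.get((x1, x2), Fraction(0)) + half * half

# Property 1: every entry lies between 0 and 1.
assert all(0 <= p <= 1 for p in joint.values())
# Property 2: the entries sum to 1.
assert sum(joint.values()) == 1
print(joint[(0, 0)])  # 1/4
```

Each of the four cells comes out to 1/4, exactly the product of the marginal probabilities, because the tosses are independent.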

(Refer Slide Time: 24:44)


So, let us look at the next example. We looked at this random two-digit number example; you remember X was the units place and Y was the remainder that you got when you divided by 4. That is also called the number modulo 4; the number modulo 4 is basically the remainder when the number is divided by 4. So now we can do a simple little calculation if we want to look at fXY(0, 0).

This is the probability that X = 0 and Y = 0: the number ends in 0 and should be a multiple of 4, that is what this means. What are the favorable cases? You see 10 will not work. Why? 10 has X = 0, but Y is not 0; it is not divisible by 4, you get a remainder of 2 when you divide by 4, so 10 will not count. But 00, 20, 40, 60, 80 will all satisfy both conditions; that is 5 out of 100, or 1/20.
Let us look at fXY(1, 0), you want the number to end in 1, X = 1 and you want it to be a
multiple of 4, Y = 0. Now, this is not going to happen. There is no outcome that will give you
this and you get a probability of 0 for this. So, notice how this has gone, some value is 1/20,
some value is 0. So, let us look at some other value, now (4, 2), number ends in 2.

So, I think I got that wrong — the number ends in 4; it seems to be a day of typos. The number ends in 4 and the number is 2 modulo 4. What do I mean by 2 modulo 4? Remainder 2 when divided by 4. This is the same as, let me just write it in terms of Y: X = 4 and Y = 2. So now you can sort of look at all the numbers. If you want X = 4 and Y = 2, 14 will satisfy that; 04 clearly will not, since X is 4 but Y is 0.

If you go to 14 you will get that: it ends in 4 and you get a remainder of 2 when divided by 4. 34 will satisfy that, 54 will satisfy, 74 and 94 will satisfy that; again you get 5 out of 100. It seems like every value is either 1/20 or 0, and that is true. So you can check out all the possibilities, and if you write down fXY(t1, t2) — now notice how this is more general than the previous picture, pay attention to this — you want to make a table of fXY(t1, t2): you put the values of t1, which are 0, 1, 2, all the way to 9, along the first row, and the values of t2, which are 0, 1, 2, 3, along the first column.

And (0, 0) is 1/20, (1, 0) is 0, (2, 0) is again 1/20, (3, 0) is 0, and so on; notice how 1/20 and 0 alternate. Now, if you go to t2 being 1, (0, 1) will be 0. Why is (0, 1) equal to 0? You can check that: if the number ends in a 0, you cannot get 1 mod 4. A number ending in 0 can only be 0 mod 4 or 2 mod 4, nothing else, so that entry will be 0.

But (1, 1) is possible: the number can end in 1 and be 1 mod 4, and that happens with numbers like 01, 21, 41 and so on, so you get five numbers once again. So you keep getting either 1/20 or 0. You have to check this — maybe it is a bit of a pain to check everything — but you can argue in general, writing it down with expressions, that this has to be the joint PMF.

Once again notice every entry in the joint PMF is between 0 and 1, and if you add up all of them you will get 1. You can check it out: count the 1/20 entries — five here, ten, fifteen, twenty in all — and 20 × (1/20) is 1. So both the properties are satisfied for the joint PMF, and it is also a reasonable joint PMF for this problem.
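The alternating 1/20-and-0 pattern can be confirmed by enumerating all hundred numbers. A small sketch, assuming the numbers 00 to 99 are equally likely as in the lecture (variable names are mine):

```python
from fractions import Fraction

# Uniform random two-digit number n in 00..99; X = units digit, Y = n mod 4.
p = Fraction(1, 100)
fXY = {}
for n in range(100):
    key = (n % 10, n % 4)  # (X, Y)
    fXY[key] = fXY.get(key, Fraction(0)) + p

# Exactly 20 cells carry probability 1/20 each; every other cell is 0.
assert set(fXY.values()) == {Fraction(1, 20)}
assert len(fXY) == 20
assert sum(fXY.values()) == 1
```

For each units digit, only two of the four remainders modulo 4 are reachable, which is why half the table is 0.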
So, notice how things are getting slightly more complex; as you go to more and more complex outcomes this kind of dependency is very crucial, and it will become more and more complicated to write down. So let me stop here with the first part of the lecture. We will pick up with marginal PMFs in the second part.
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 3.1
Multiple Random Variables
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. We are in week 3, looking at the joint PMF of two random variables. We are studying two random variables together in a probability space and asking what the different types of distributions of interest are. So, we saw the joint PMF. Now we are going to look at what are called marginal PMFs when you have multiple random variables. This word marginal will show up a lot.

And it is very important to understand what marginal is, what marginalization is, so there is also
this word called marginalization, you will hear a lot when you study probability and statistics, it is
important to know where it comes from. So, let us first define marginal PMF. It is not actually a
major definition, but you will see it is easy to think of. So, I am going to start with two jointly
distributed random variables, let us say 𝑋 and 𝑌 have a joint PMF 𝑓𝑋𝑌 .

So, consider the PMFs of the individual random variables. You are given a joint PMF — you do not start with the marginal PMFs, you do not start with the individual random variables; you start with the joint distribution of the two random variables, somebody gives you the table. How do you go from there to thinking of X as an individual random variable?
So, X is a random variable on its own, Y is a random variable on its own. Each of them has its own PMF; when you have a joint distribution, those individual PMFs are called marginal PMFs. That is just a notation or definition in some sense, a very simple definition, but the important things are these two equations, the marginalization equations. What is marginalization?

You have the joint PMF; how do you go from the joint PMF to the marginal PMFs? It turns out the marginals are uniquely determined by these specific equations: f_X(t) is the sum of f_XY(t, t') over all t' in the range of Y, and similarly for f_Y. Remember, once again, from the joint PMF you have a clear equation which will take you to one marginal PMF; nothing else is possible, and that is given by this. Actually it is very easy to prove. Let us look at the first statement, f_X(t). What is f_X(t)? It is the probability that X = t, is it not?

Now, let us go down to the proof here and look at the event X = t. Let us say the range of Y is y1, y2, up to yk; some k values are there. Then one can always write this event as (X = t and Y = y1) or (X = t and Y = y2), and so on, till (X = t and Y = yk).

This is a way for me to decompose X = t as a disjoint union of several events; this is something we have done several times before. Now, these are all 'or's, but even though they are 'or's they are all disjoint, is it not? Y takes different values, so they are all disjoint. So when you write P(X = t), it is P(X = t, Y = y1) + P(X = t, Y = y2), and so on, till P(X = t, Y = yk), and that is what is written here.

So, notice how this goes to this — the exact same thing: a sum over all the values t' in the range T_Y of Y of f_XY(t, t'). You fix the value of X — the value of X is fixed to t throughout — and then you take all possible values for Y and simply add up the joint PMF at all those values of Y, with the same value of X. That process is called marginalization. So here, in the first equation, you are marginalizing Y.

So, you are marginalizing Y out through the summation, and you are left with only X; that is what happens. Very similarly you can do f_Y(t); the same proof will go through, but in this case you have to marginalize out X: sum the joint PMF f_XY(t', t) over all t' in T_X. You keep Y fixed at t, go through all possible values of t' in T_X, and add up the joint PMF; you will get the marginal PMF. Now, the marginal PMF is uniquely defined once you define the joint PMF; it is simply a PMF, it has its own range, and you can calculate it from the joint PMF through this marginalization process. So, let us see a few examples of how marginalization works.

(Refer Slide Time: 4:56)

It is easiest to think of the table and do the marginalization with respect to the table. So, in the toss of a fair coin twice example that we saw just a little while ago, we had this table for the joint PMF: 1/4, 1/4, 1/4, 1/4. If you want the marginal PMF of X1 you simply add over the columns of the joint PMF table. So, if you want f_{X1}(t1), t1 is 0, 1.

So, this is f_{X1}(0), this is f_{X1}(1): you simply add along the columns — you add these two you get this one, you add these two you get this one — this is how you marginalize. Then likewise you add along the rows and you will get f_{X2}(t2): 1/4 + 1/4 is 1/2, 1/4 + 1/4 is 1/2, like that. So, once you write down a table — the joint PMF written as a table — how do you marginalize?

You just sum over either the rows or the columns. The values that vary as you go along a row are the values of t1, is it not? t1 comes from X1, so as you add up everything in the row you will get the marginal PMF of X2. If you add up everything in a column you will get the marginal PMF of X1: whatever does not change, whatever remains the same, is the variable whose marginal PMF you get.
So, simple, is it not? Just write down the table, add up along the rows, add up along the columns, and you get the marginal PMFs. So, once again remember: given the joint PMF, the marginal is unique; there is nothing you can do to change the marginal PMF. You will get just one marginal PMF for each variable, given the joint PMF.

(Refer Slide Time: 6:43)

So, let us take a slightly different example; let me just show this to you. First, if anybody gives you a table, you can verify that it is a valid PMF. How do I verify that? You can see every entry is between 0 and 1 — that is easy enough — and you can add up all the entries and check that the sum is equal to 1. So it is a valid PMF, no problem. Once it is a valid PMF, it is just a question of adding along the rows and adding along the columns.

You are going to get here 0.40, 0.60, and you are going to get here 0.30, 0.70, so you have got your PMF for X2 here and your PMF for X1 here. Remember each of these has to be a valid PMF: you can check that this is a valid joint PMF, you can check that f_{X2} is a valid PMF since 0.4 + 0.6 is 1, and you can check that f_{X1}(t1) is a valid PMF since 0.3 + 0.7 is 1. Everything works out very cleanly; you get the marginal PMFs so easily here. So you see, given any table like this you can just add things up and find out; it is no big deal, it is very easy.
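Row and column sums are one-liners in code. The entries below are illustrative — the transcript gives only the marginals 0.30/0.70 and 0.40/0.60, so I have picked a table consistent with them:

```python
from fractions import Fraction as F

# Assumed 2x2 joint PMF table; rows are values of X2, columns values of X1.
table = [
    [F(1, 10), F(3, 10)],  # X2 = 0 row
    [F(2, 10), F(4, 10)],  # X2 = 1 row
]

# Marginal of X1: add down each column (X2 is summed out).
f_X1 = [sum(row[j] for row in table) for j in range(2)]
# Marginal of X2: add along each row (X1 is summed out).
f_X2 = [sum(row) for row in table]

assert f_X1 == [F(3, 10), F(7, 10)]   # 0.30, 0.70
assert f_X2 == [F(4, 10), F(6, 10)]   # 0.40, 0.60
assert sum(f_X1) == 1 and sum(f_X2) == 1
```

Exact fractions are used instead of floats so that the sums check out exactly.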
(Refer Slide Time: 8:07)

I want to quickly point out that the same marginal PMFs can result from different joint PMFs; in fact, you can have any number of joint PMFs giving you the same marginal PMFs. This is a mistake that a lot of people commonly make: given the marginal PMFs, they will simply assume that the joint PMF is the product of the marginal PMFs, and get some value for the joint PMF.

That is a valid answer, given the marginal PMF that is one way to come up with the joint PMF,
but that is not the only joint PMF which will give you the marginal. So, I have shown you here
case one and case two. In case one you see the 1/4, 1/4, 1/4, 1/4 distribution is there, and you get the marginals 1/2, 1/2 — very easy. Now look at case two: the joint PMF is something else.

I put a variable x here, between 0 and 1/2. Maybe you want to take that variable to be — do not take it as 1/4, take it as something else, like 1/10, say 0.1. So you have your 0.1 here, and 0.5 − 0.1, which is just 0.4, here. Here again you have 0.4, and here again you have 0.1. So clearly this is different from the first joint PMF: that one is all 0.25, while here you have 0.1, 0.4, 0.4, 0.1.

But you find the marginals and you will get the same marginals, 1/2, 1/2 and 1/2, 1/2. So notice what has happened here, and remember this rule, it is very important: you can go from the joint PMF to the marginals in a unique way; you cannot go from the marginals to the joint PMF in a unique way. Usually it never really happens — you have to cook up a very strange, very simple, trivial sort of situation for the marginals to give you a unique joint PMF.

Otherwise, given the marginals alone, there can be any number — usually an infinite number — of joint PMFs giving you the same marginal PMFs. So this is important to understand: one way is definitely unique, the other way is not unique at all; you can get any number of joint PMFs if you only fix a certain marginal PMF. So keep this in mind.

I know this is a slightly different idea to keep in mind, but just by the way in which we added, you can see that it can all work out. So notice this is a valid PMF: for any x between 0 and 1/2 this is a valid PMF; add up all of this and you will get 1/2 + 1/2, which is 1. So there are all sorts of interesting cases you can have.

In fact, you can push it to the extreme: you can set x to 0. If you set x = 0 you will get 0, 1/2, 1/2, 0, and that also has the marginals 1/2, 1/2. Marginally it will all look the same, but look at how different the joint is: it is 1/2 here and 0, 0 here, while there you have 1/4, 1/4, 1/4, 1/4 — very different joint PMFs that give you the same marginals. So this is also important in practice: whether you have something tending towards the independent case, which is 1/4, 1/4, 1/4, 1/4, or something else is something you have to look at very carefully.
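The whole one-parameter family from case two can be checked in code: every choice of x between 0 and 1/2 gives a different joint PMF with the identical uniform marginals. A sketch (function names are mine):

```python
from fractions import Fraction as F

def joint_with(x):
    """2x2 joint PMF with entries x, 1/2 - x, 1/2 - x, x (0 <= x <= 1/2)."""
    return [[x, F(1, 2) - x], [F(1, 2) - x, x]]

def marginals(table):
    cols = [sum(row[j] for row in table) for j in range(2)]
    rows = [sum(row) for row in table]
    return cols, rows

# x = 1/4 is the independent case; x = 1/10 and the extreme x = 0 are very
# different joints, yet all three have the same marginals (1/2, 1/2).
for x in (F(1, 4), F(1, 10), F(0)):
    cols, rows = marginals(joint_with(x))
    assert cols == [F(1, 2), F(1, 2)]
    assert rows == [F(1, 2), F(1, 2)]
```

This makes the one-way uniqueness concrete: the map from joint to marginals collapses infinitely many joints onto one pair of marginals.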

(Refer Slide Time: 11:23)


So, let us move on; let us go back to the random two-digit number example. You remember the joint PMF that we had, with 1/20 and 0 occurring alternately. So here again you can add to get f_Y(t2) and f_X(t1), and you see you get what is expected: you get 1/4, 1/4, 1/4, 1/4 and 1/10, 1/10, ..., 1/10. We knew this before: we knew that X is uniform on 0, 1, ..., 9 and Y is uniform on 0, 1, 2, 3.

We knew these two before, and we knew that X and Y can sort of influence each other. So notice what is happening here: if you look at the probability that X = 0 and Y = 0 — I want you to think about this — that is the joint value, and it works out to 1/20. Now, what is the probability that X = 0 times the probability that Y = 0? The probability that X = 0 is 1/10, is it not? The probability that Y = 0 is 1/4, is it not? So the product is 1/40. So these two are not equal. Clearly X = 0 and Y = 0 are two events that we have defined using the random variables X and Y, and they are not independent: the product of the two probabilities is not the same as the probability of the intersection of the two events.

So, the product is not the same, so these are not independent. Notice how from the joint PMF you can find the marginals and you can determine independence of events defined using the random variables. This is an important process: quite often you may define a joint PMF and then ask, is this event independent of that event? You can calculate that from the joint PMF: find the marginals, find the probabilities of the events defined with X and Y.

So, basically find the individual probabilities of the two events and find the intersection of the two
events, see if the product is satisfied. So, these kinds of things you can do with the joint PMF. So,
the joint PMF is a powerful tool, it gives you all the data that you need to work out any problem
with the two random variables.
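The independence check in this example reduces to comparing one joint entry with a product of marginals. A quick sketch by counting:

```python
from fractions import Fraction as F

# Uniform two-digit number 00..99; check {X = 0} vs {Y = 0} for independence.
p_joint = F(sum(1 for n in range(100) if n % 10 == 0 and n % 4 == 0), 100)
p_x0 = F(sum(1 for n in range(100) if n % 10 == 0), 100)
p_y0 = F(sum(1 for n in range(100) if n % 4 == 0), 100)

assert p_joint == F(1, 20)            # joint PMF entry f_XY(0, 0)
assert p_x0 * p_y0 == F(1, 40)        # product of marginals
assert p_joint != p_x0 * p_y0         # hence the events are not independent
```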
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 3.3
(Refer Slide Time: 0:13)

So, next we move on to one more type of PMF that you can have once you have random variables and events; this is called a conditional distribution. First we will look at just one random variable. This is a very simple case: supposing you have a random variable X defined on a probability space, and let us say there is an event A that is defined; one can define something called the conditional PMF of X|A.

So, this is just a PMF, which you can denote Q(t), or anything else if you want. Basically it is the probability that X = t given A. Previously the PMF of X without any conditioning was simply the probability that X = t. Now, if there is an event A on which you want to condition, then the probability that X equals t given A becomes the conditional PMF, conditioned on the occurrence of the event.

So, this makes sense, and you can see why. We will use the notation X given A for the conditional random variable: you have an unconditioned random variable X, and if you want to condition on the event A, I will simply say X given A, written (X|A). That is the conditional random variable, and it has a distribution which can be different from that of the original random variable.
So, how do you compute this f_{X|A}(t)? It is basically P(X = t | A), which is just the probability of the intersection of those two events divided by P(A) — standard conditioning; nothing changes in the actual equation except that the conditioning has come in. This is general conditioning on an event, but even here pay attention to the fact that the range of X given A can be different from the range of X.

Because once you condition on A, X may not be able to take many values outside of A, so the
range can change. So, this conditioning makes sense, sometimes it is used, people do condition on
events, but it is most popular to condition on events defined by another random variable and that
is where the conditional distribution of one random variable given another random variable’s value
enters the picture.

(Refer Slide Time: 2:19)

So, it is sort of very similar to the previous picture except that this event A is now defined using
another random variable. So, let us say you have two random variables X and Y, jointly distributed; they have a joint PMF f_XY, and the table is given to you. So, remember the table is given to you. I am going to define something called the conditional PMF of Y|X = t. So, X = t has
occurred.

The random variable X has taken a value t, so the event X = t has occurred. Now, of course, I can define the conditional random variable Y conditioned on X = t, and the conditional PMF is exactly the PMF of that: basically it is the probability that Y = t' given X = t. So you can just do this calculation again; this is just an ordinary conditional probability now.

P(Y = t' | X = t) is P(Y = t', X = t) — remember our comma is 'and', so it is the intersection — P(Y = t' and X = t), divided by P(X = t). And notice: what is the numerator? The numerator is the joint PMF, f_XY(t, t'). And what is the denominator? The denominator is the marginal PMF, f_X(t). Now, we will use this little notation here.

Hopefully, this notation is clear enough. I will say Y|X = t is the conditional random variable; I will think of it as a random variable in itself. So what is this Q(t')? It is nothing but the distribution of this conditional random variable, and we will write it as f_{Y|X=t}(t'). That is a very common notation for the conditional PMF.

So, what is most important about the conditional random variable is this equation. So, this equation
is something that we will use quite a bit when we write down joint PMFs and all that, so the joint
PMF is the same thing, is the same equation that is used in the definition, I have written it a bit
differently.

So, the joint PMF evaluated at (t, t') — that is, at X = t, Y = t' — is simply the product f_{Y|X=t}(t') × f_X(t). Do you see that? It is the same equation; f_X(t) comes in. This is the same product rule that we used when we defined the probability of A and B: what is P(A and B)? It is P(A|B) × P(B).

It is the same thing, except written in the language of conditional PMFs and joint PMFs. This equation is very useful: quite often, some of the objects here are easier to find than others. It is very common that the marginal is easy to find and the conditional is easy to find, and then you multiply the two to get the joint — or maybe the other way around — but usually the marginal and the conditional are slightly easier to define than the joint directly. So this is a very common trick used to get to the joint PMF. Hopefully the definition was clear; now let us check our understanding with some examples.
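The division in the definition is easy to mechanize: fix X = t, take that slice of the joint table, and divide by the marginal f_X(t). A sketch with a helper function of my own naming, tried on the fair-coin table from earlier:

```python
from fractions import Fraction as F

def cond_Y_given_X(fXY, t):
    """f_{Y|X=t}(t') = f_XY(t, t') / f_X(t); fXY maps (x, y) -> probability."""
    fX_t = sum(p for (x, _), p in fXY.items() if x == t)
    return {y: p / fX_t for (x, y), p in fXY.items() if x == t}

# Two fair coin tosses: all four cells of the joint PMF are 1/4.
fXY = {(a, b): F(1, 4) for a in (0, 1) for b in (0, 1)}
cond = cond_Y_given_X(fXY, 0)

assert cond == {0: F(1, 2), 1: F(1, 2)}  # independence: conditioning changes nothing
assert sum(cond.values()) == 1           # a valid PMF in its own right
```

When the two random variables are independent, the conditional equals the marginal, as the assertion shows.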

(Refer Slide Time: 5:27)


So, here is a very simple example: a joint PMF, and notice how the joint PMF and the marginals are given to us. From the joint PMF you can identify that the range of X is 0, 1, 2 — a better way to write it is X belongs to {0, 1, 2} and Y belongs to {0, 1} — and the joint PMF is given. You can quickly check that it is all valid: 1/8 appears 4 times, that is a half; 1/4 appears twice, that is another half.

It all adds up to 1, and you can do the marginals here: you just add up everything in the row, you get 1/2; you add up everything here, you get 1/2; and here you get 3/8, 1/4, 3/8. All of these are valid as well. So, everything is fine. Now supposing I want to find a conditional PMF, let us say for Y|X = 0; let us take that.

So, Y takes values 0, 1; I want Y given X = 0, and this random variable takes values 0, 1. What is f_{Y|X=0}(0)? It is f_XY(0, 0)/f_X(0). What is f_XY(0, 0)? It is 1/4, and what is f_X(0)? That is 3/8. So you get 2/3. What is f_{Y|X=0}(1)? It is f_XY(0, 1)/f_X(0), basically 1/8 — the (0, 1) entry is 1/8 here — divided by 3/8, and you get 1/3.

So, notice what has happened here. To get the conditional PMF you simply divide this column by this number: you divide the column, 1/4 and 1/8, by 3/8 to get the conditional PMF, 2/3 and 1/3. So now, if you want to look at the conditional PMF f_{X|Y}, let me go to X|Y = 1. Y is 1, so once Y is 1 you look at — maybe I should write it in another color; let me pick this dark green. So this is Y = 1 for you, and this is the probability that Y equals 1; you divide this row by this, and you are going to get the distribution here.

So, X given Y = 1 is going to take values 0, 1, 2, and the probabilities are going to be — let me just do it in green — 1/8 divided by 1/2, I will write it like that, then 1/8 divided by 1/2, and then 1/4 divided by 1/2, and that is just 1/4, 1/4 and 1/2. You do not have to write it in such great detail; you can just directly write 1/4, 1/4, 1/2 if you like. So this way you can quickly see how to find the conditional distribution.

Any conditional distribution I give, you identify the row or the column and simply take that,
divided by the marginal that comes below it, that is it, that gives you the conditional PMF, very-
very easy and straightforward way to find conditional PMF. So, let us do one more example to
convince ourselves that we have understood this.

(Refer Slide Time: 9:37)

Here is the example I want to do. A joint PMF has been given to us, and you can check very quickly that it is a valid PMF: just add up all the entries — they are twelfths, and they total 12/12 — so it adds up to 1. So this is a valid PMF. Here the marginals have not been given, and you have to compute them and some of the conditionals. So the first step is to compute the marginals: if you add up these two you get 4/12, which is 1/3.

Here you get 3/12, which is 1/4, and here you get 5/12. And if you add the other way you get 1/2, 1/6, 1/3. So those are the marginals; that is okay. Now remember X takes values 0, 1, 2 and Y takes values 0, 1, 2. Let me start with Y given X; let us say we do Y|X = 0. So this column divided by this marginal: the values are going to be 0, 1, 2 and the probabilities are going to be 1/6, 1/3 and 1/2, is it not? Do you see that?

You see X is 0, so I know I have to focus on this column alone; if I take this column of the joint PMF and divide by the marginal, I am going to get the distribution. Now notice how Y given X = 1 is going to change: there is a 0 there, and the moment there is a 0 there, you do not want to include that value in the range with probability 0. So you drop it; you just say 1, 2 with 1/12 and 1/12, and the total probability is 1/6. Notice that when these two entries are equal, I know ahead of time that each conditional probability is going to be a half.

Anyway, you can also divide and check that it works out. Similarly, you can do Y given X = 2. Notice the way I am writing it: I write the values taken below and the probabilities above; you can use any such simple notation. So, given X is 2 — maybe I should use a different color here, let me pick a different color to be very clear about what is going on.

So, supposing you look at Y given X = 2: I am looking at this column, and I am going to divide by this quantity here. The values are going to be 0 and 2 — I do not have to write the value 1 explicitly, because that would take probability 0, so why do that — and you will get a 3/4 and a 1/4. So notice how we are getting different conditional distributions: one joint PMF, and depending on which column of the matrix you take, you get different conditionals. All of it is interesting.

So, let us also do a couple of cases row-wise; let me do X given Y in a different color, maybe blue. Look at X given Y = 1: Y is 1, so you are focusing on this row here and this marginal here. The values are going to be 0 and 1 — it does not take the value 2 because that entry is already 0 — and if you divide by 1/4 you are going to get 2/3 and 1/3. So notice all the different probabilities you are getting, all sorts of distributions: 2/3 and 1/3, 3/4 and 1/4, and before that half and half; all of these end up happening once you go to the conditionals.
And in every case you can check the equality; remember to check this equality. It is always true that — maybe I should start from the left side and write it, this is an important little identity for you to know — f_XY(t1, t2) is always equal to f_{Y|X=t1}(t2) × f_X(t1), and it is also equal to f_{X|Y=t2}(t1) × f_Y(t2). This is always satisfied, so you should remember this identity; it is very important, and you can check it in all our calculations.

You can check that it holds — this is how we just calculated things, so it is sort of trivial in some sense — but it is important to note, because these identities are very useful. Somebody gives you a table with partial information: some conditional PMFs are given, some marginals are given, some actual joint PMF values are given; you should be able to calculate all the missing values using these identities.
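The identity can also be verified numerically. The table below is my reconstruction of the twelfths example from the spoken numbers (the slide itself is not in the transcript), so treat the entries as an assumption; the conditionals it produces match the ones computed in this example:

```python
from fractions import Fraction as F

# Reconstructed joint PMF; keys are (t1, t2) = (X, Y), entries in twelfths.
fXY = {
    (0, 0): F(1, 12), (1, 0): F(0, 12), (2, 0): F(3, 12),
    (0, 1): F(2, 12), (1, 1): F(1, 12), (2, 1): F(0, 12),
    (0, 2): F(3, 12), (1, 2): F(1, 12), (2, 2): F(1, 12),
}
fX = {t: sum(p for (x, y), p in fXY.items() if x == t) for t in (0, 1, 2)}
fY = {t: sum(p for (x, y), p in fXY.items() if y == t) for t in (0, 1, 2)}

# Conditionals obtained by dividing a column (or row) by its marginal.
cond_Y_X0 = {y: fXY[(0, y)] / fX[0] for y in (0, 1, 2)}
cond_X_Y1 = {x: fXY[(x, 1)] / fY[1] for x in (0, 1, 2)}

assert cond_Y_X0 == {0: F(1, 6), 1: F(1, 3), 2: F(1, 2)}
assert cond_X_Y1 == {0: F(2, 3), 1: F(1, 3), 2: F(0)}
# Each conditional PMF sums to 1, and the factorization identity holds.
assert sum(cond_Y_X0.values()) == 1 and sum(cond_X_Y1.values()) == 1
assert cond_X_Y1[0] * fY[1] == fXY[(0, 1)] == cond_Y_X0[1] * fX[0]
```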

One more thing I want to point out, which maybe was not very clear: notice here that Y|X = 0 is actually sort of a random variable, and it has a PMF. If you keep X fixed and vary the values of Y, the probabilities add to 1: 1/6 + 1/3 + 1/2 is 1, 1/2 + 1/2 is 1, 3/4 + 1/4 is 1. So every conditional random variable is actually a full-blown random variable, and its PMF is a valid PMF.

So, if you take f_{Y|X=t}(t') and add it over all t' in the range of Y, you will get 1. This is also an important identity to remember: the conditional PMF is a valid PMF on its own. The support may actually be smaller than the range of Y, but let us add over the entire range anyway; some of the values may be 0, it does not matter.

You add up the conditional PMF over the entire range of Y and you will get 1; this property is also very important when you solve problems. The conditional PMF is a valid PMF on its own: it takes probabilities between 0 and 1, and when you add up all the probabilities you will get 1. Remember X = t is fixed here, so you cannot keep changing X.

So, X = t has to be fixed; it has to be one conditional random variable and its PMF. When you keep X fixed, vary Y alone, and look at Y given X, the probabilities will add to 1; that is an important identity to remember. So, I think I have covered most of the
things; you can check all of them, and they are useful to know. I have a few more examples to show you, and maybe this is a good point in the lecture to break; come back and see the next lecture. Thank you very much!
Statistics for Data Science 2
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 3.4
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. So far we have been looking at multiple random variables, in particular two random variables: how to define the joint PMF of two random variables, and then the marginal PMF and the conditional PMF. We went in that sequence — joint to marginal to conditional — but quite often in practice it goes the other way.

So, the marginal PMF and the conditional PMF are often much easier to define in practice, and then you combine them to make the joint PMF; it can go that way as well. So I thought I would show you a few questions where something like that happens: the marginal and the conditional are more naturally specified, and using them you go to the joint PMF.

So, in fact, the basic property that we have is this factorization rule here; this rule is very powerful. Let me just make this a little bit bigger. So this is the basic rule: it says the joint PMF at (t, t') is the marginal times the conditional — the marginal of X at t multiplied by the conditional of Y|X = t. This rule is quite powerful, and one uses it in different ways to create the joint PMF.

So, let us look at a few problems; it is easier to think in terms of problems here. Here is the situation: somebody throws a die, you get a number between 1 and 6, and then depending on what you get you toss coins — you toss a coin as many times as the number shown on the die, so if you get the number 2, you toss the coin twice. X is the number shown on the die, that is one random variable you define, and Y is the number of heads that you get after the coin tosses.

So, the question asks: what is the joint PMF of X and Y? Here again, the marginal of X is easy to define: X is uniform on 1 to 6. Then notice what is given to you: Y is the number of heads, but it is better to write down the conditional random variable Y|X = t. We have thrown the die and X has taken a value t, as in the number t showed up on the die, so you are going to toss the coin t times.

And Y is the number of heads, so Y|X = t is going to be Binomial(t, 1/2); we have seen this before, it is the number of heads when you toss a fair coin t times. So you have the marginal of X and the conditional of Y given X, and now you can use the multiplication rule. What does this mean? The marginal is

f_X(t) = 1/6, for t = 1, 2, ..., 6.

What is the binomial? The conditional is

f_{Y|X=t}(t') = tCt' (1/2)^{t'} (1/2)^{t-t'} = tCt' (1/2)^t,

because the exponents t' and t - t' add up to just t. So now one can do the factoring, f_{XY}(t, t') = f_X(t) f_{Y|X=t}(t'), and notice that t' goes over 0, 1, 2, and so on till t. If you plug everything in, you get

f_{XY}(t, t') = (1/6) tCt' (1/2)^t, for t = 1, ..., 6 and t' = 0, 1, ..., t.

So this is the joint distribution.

And if you want to make a table, let us make at least a partial table: t goes over 1, 2, 3, 4, 5, 6, and t', the number of heads, goes over 0, 1, 2, 3 and so on, up to 6. If t = 1, you toss the coin only once, and the probability of getting heads is half, so it is very easy to write down.

So f_{XY}(1, 0) = 1/6 × 1/2 (and 1C0 is just 1), and in the same way f_{XY}(1, 1) = 1/6 × 1/2; after that, for t' = 2, 3 and so on, you get 0. What about t = 2? Here you get f_{XY}(2, 0) = 1/6 × (1/2)^2, then f_{XY}(2, 1) = 1/6 × 2 × (1/2)^2, and f_{XY}(2, 2) = 1/6 × (1/2)^2, and 0 after that.

So you can see how I am just using the formula for different values of t'. For t = 3 you will have four entries to write down, for t = 4 five entries, and so on till t = 6 where everything is filled in; the table will have a sort of upper-triangular shape if you write it down. It is left as an exercise for you to fill this table fully; go ahead and fill it out.
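As a cross-check on this exercise, here is one way to generate the full table numerically; a small sketch in Python, where `math.comb` computes the binomial coefficient tCt':

```python
from math import comb

# Joint PMF f_XY(t, t') = (1/6) * tCt' * (1/2)^t for t' = 0, ..., t
def f_XY(t, tp):
    if 1 <= t <= 6 and 0 <= tp <= t:
        return (1 / 6) * comb(t, tp) * 0.5 ** t
    return 0.0

# Print the table: rows t = 1..6, columns t' = 0..6 (upper-triangular shape)
for t in range(1, 7):
    print(t, [round(f_XY(t, tp), 4) for tp in range(7)])

# Sanity check: the whole table adds up to 1
total = sum(f_XY(t, tp) for t in range(1, 7) for tp in range(7))
```

Each row t sums to 1/6, since the binomial probabilities for a fixed t sum to 1.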

So, this is the formula; you just plug in t and t' and you get the value at each point, and remember t' goes only from 0 to t. This is interesting: the range of Y|X = t is actually 0 to t, is it not? So depending on t, both the range and the conditional distribution change.

This is a way in which conditionals and marginals are specified in problems. Most often, when you look at a problem in a probability space, you want to look for the marginal and the conditional and make the joint using them; it is much, much easier to work that way than to think of the overall joint PMF directly. Hopefully this example showed you that. We will see a few more examples of this nature just to drive home the point.
(Refer Slide Time: 7:47)

So, this one is a little bit more complicated, in that I am going to use the Poisson distribution. If you remember the Poisson distribution, I am saying here that N is Poisson, so N takes values from 0 to infinity and

f_N(k) = e^{-λ} λ^k / k!, for k = 0, 1, 2, 3, ….

Now we are given that N = n; note that small n is being used here, so let us consistently use what is given in the problem. So, given that the Poisson random variable N took the value n, you toss a coin n times. Previously you just threw a die, got a number from 1 to 6, and tossed the coin that many times.

Now, instead of throwing a die, somebody is giving me a Poisson-distributed random variable. It may not be as easy as throwing a die to generate a Poisson random variable, but in your Colab or Python module you can see how to generate one. So somebody generates a Poisson random variable with parameter λ and gives you a value n.

And then what do you do? You pick up a fair coin and toss it n times. You denote the number of heads obtained by X, and the question asked is: what is the distribution of X? Notice what we can do here. The distribution of N is very easy, and what is also easy is X|N = n; this is what is given to you. Given that the Poisson random variable took the value n, what is X?

X is the number of heads when you toss a fair coin n times, and that is going to be Binomial(n, 1/2), is it not? So I know how that can be written down:

f_{X|N=n}(k) = nCk (1/2)^k (1/2)^{n-k} = nCk (1/2)^n, for k = 0, 1, ..., n,

where the exponents k and n - k add up to just n.

I have the marginal of N and I have the conditional of X|N, so I can find the joint distribution. What I am asked for is actually the distribution of X, that is f_X, but the way to get there is through the joint PMF. So let us find the joint PMF; it is just the product of these two:

f_{NX}(n, k) = (e^{-λ} λ^n / n!) × nCk (1/2)^n,

where n goes over 0, 1, 2, and so on, and, given n, k goes from 0 to n. This is very similar to the previous problem, except that the previous one ended at 6; maybe you are happy with 1 to 6, but it does not matter where you end, just keep on going and you get the binomial there.

And this looks like a pretty complicated little distribution; it is, a little bit, but let us see if we can make some sense out of it. The next problem here is to find f_X. How do you find f_X(k) from the joint PMF? You have to find the marginal from the joint, and for that you add up over what you do not want, is it not? You do not want n, so you add up over all possible values of n:

f_X(k) = Σ_{n=0}^{∞} f_{NX}(n, k).

In this case it is difficult to put this into a table; you can put n there and write a table, but the table does not really help that much, so you just work with the expressions.

So you are going to add (e^{-λ} λ^n / n!) × nCk (1/2)^n over n, but there is one issue to be careful about. k goes only from 0 to n; if you fix the value of X at k, as in you got k heads, then whenever n is below k the joint PMF is 0. There is no way to get k heads if you tossed the coin fewer than k times, so the term is non-zero only when n is greater than or equal to k. So you start the summation at n = k:

f_X(k) = Σ_{n=k}^{∞} (e^{-λ} λ^n / n!) nCk (1/2)^n.

This n = k starting point is a big step here; do this summation and you will get the distribution of X.

If you have the energy and enthusiasm, you can think about how you will add it up and what you will get at the end. Let me try it here; it is not too difficult. First, a note on notation: I am writing the binomial coefficient as nCk, and you may be used to the other notation, the long brackets with n on top of k; both are commonly used and they are the same thing. I will use that bracket notation a lot. So how do you simplify? nCk is n!/(k!(n-k)!), and when you substitute that, the n! in the Poisson PMF cancels with the n! in nCk:

f_X(k) = Σ_{n=k}^{∞} e^{-λ} λ^n (1 / (k!(n-k)!)) (1/2)^n.

Now I am going to do a little trick here; you may or may not recognize it. I am going to pull out everything that does not depend on n: the e^{-λ}, the k!, a λ^k, and even for the halves, a (1/2)^k, keeping only the n - k part inside the summation:

f_X(k) = (e^{-λ} λ^k / (k! 2^k)) Σ_{n=k}^{∞} λ^{n-k} / ((n-k)! 2^{n-k}).

You can try writing it out differently; you will see this is exactly what you get. Why am I doing this? Because the sum now has a familiar form. If you look at it very, very carefully, the summand is (λ/2)^{n-k} / (n-k)!, and Σ_{n=k}^{∞} (λ/2)^{n-k} / (n-k)! is exactly the exponential series, which equals e^{λ/2}.

Once you identify that, you can come back in here: you have e^{-λ} × e^{λ/2}, and that gives you e^{-λ/2}, so f_X(k) becomes this very interesting little expression:

f_X(k) = e^{-λ/2} (λ/2)^k / k!.

Is that not interesting? So notice what happened here: X has got this distribution, which is a very interesting distributional property of the Poisson.

So, notice what all is involved here. It seems like a lot of work with algebra and manipulation, but that is what it is, and you get a very nice result at the end. Let me tell you why this result is very interesting. You fix any k and find f_X(k) by marginalization, and this marginalization is not as easy as the previous marginals we computed, because in this case there is an infinite summation involved.

And you had to use all sorts of trickery to manipulate the expression; even if you did not follow every step that I did, take your time, work your way through, and you will finally get this very nice expression, valid for any k = 0, 1, 2, and so on. So what have you identified here? What is this distribution? It implies that X by itself is actually Poisson(λ/2).

So, look at that; is that not interesting? You started with a Poisson random variable N with parameter λ, and then you did this little experiment: you first sampled the Poisson random variable, took its value, and tossed a coin that many times. If the Poisson random variable was 100, you tossed the coin 100 times; if it was 10, you tossed it 10 times.

Then the number of heads among the tosses that you did became your random variable X. And notice what has happened: this X is itself another Poisson random variable, with parameter λ/2. Where did that 2 come from? It is the probability of success in your toss: the fair coin gives you probability half for heads, and that is where the half came from.

And look at how we did the work; it just went through the basic definitions, but the manipulation is a bit messy; maybe you need one more sheet on which to write it down, and I will leave that as an exercise. But this is an interesting little property of the Poisson distribution, and it is quite useful: it shows up often in calculations with Poisson random variables and makes them very simple in practice.

Hopefully you understood the steps of this problem, but the final result is more important than the way we simplified: first sample a Poisson random variable, then repeat some experiment that many times, and if each trial has some probability of success, the number of successes is a Poisson random variable again. So it is an interesting property of the Poisson distribution. We can generalize this a little bit, but we will see that later.
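Here is a quick numerical check of this thinning property; a sketch in Python with an arbitrary choice λ = 3, where the infinite sum over n is truncated at n = 100 (an approximation, but more than enough for this λ):

```python
from math import comb, exp, factorial

lam = 3.0  # an arbitrary Poisson parameter for the illustration

def poisson_pmf(k, mu):
    return exp(-mu) * mu ** k / factorial(k)

# Marginal of X by (truncated) marginalization:
# f_X(k) = sum over n >= k of  e^{-lam} lam^n / n!  *  nCk (1/2)^n
def f_X(k, cutoff=100):
    return sum(poisson_pmf(n, lam) * comb(n, k) * 0.5 ** n
               for n in range(k, cutoff))

# Compare with the claimed Poisson(lam / 2) PMF
for k in range(10):
    print(k, f_X(k), poisson_pmf(k, lam / 2))
```

The two columns agree to floating-point precision, matching the identity derived above.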

(Refer Slide Time: 20:57)

So, let us come back to a much simpler finite example. We saw that Poisson random variable; any time things go to infinity, it is a bit scary, is it not? It is okay, you will get used to it if you work with it for a while. Let us come back to a finite example, a simple example where you can write down tables and feel happy about it. We come back to our familiar IPL powerplay over.

I have been talking about distributions around the IPL powerplay over a lot, and we will continue that in this course. So, once again I am at the powerplay over, and I am going to let a random variable X denote the number of runs in the over, and Y the number of wickets in the over, the number of people who got out. Let us say the marginal of Y is known; we are going to assume some very typical marginal.
Most of the time there is no wicket, and that has probability 13/16; then there is one wicket with probability 1/8, and 2 wickets with probability 1/16. That is the marginal distribution of Y. Then what is given is the conditional distribution of X given that Y takes a particular value, and I have used some intuition here: if there were no wickets in the powerplay over, maybe the runs are slightly higher; if there was one wicket, maybe the runs are slightly lower; and if there are 2 wickets, maybe the runs are really low. Specifically, each conditional is uniform on 7 values: X|Y = 0 is uniform on {6, 7, ..., 12}, X|Y = 1 is uniform on {2, 3, ..., 8}, and X|Y = 2 is uniform on {0, 1, ..., 6}. This is a good thing to visualize. Here again a table is something you could write, but it is a bit more laborious to write down a big 3 × 13 table; one could do it, it is not very hard, but one can also do something much more interesting.

So, let us see what we can do here. The conditionals are given, and you can write down the joint distribution, is it not? At any (t, t'), it is simply

f_{XY}(t, t') = f_Y(t') f_{X|Y=t'}(t).

Notice I can do the product either way around: marginal of X times conditional of Y|X, or marginal of Y times conditional of X|Y; both are exactly the same, so I can factor whichever way I like. A typical calculation: supposing you want f_{XY}(6, 1), you get f_Y(1), which is 1/8, times the probability of 6 in the conditional distribution of X|Y = 1; remember this conditional is uniform on 7 values, so that is just 1/7, and you get 1/8 × 1/7.

On the other hand, if you were to do f_{XY}(8, 2), you get 1/16 times the probability of 8 under X|Y = 2, which is 0, so the answer is 0. So you get different numbers of this sort. Now, if you want to find f_X(t), how will you do it? You have to marginalize, is it not? And t can take values from 0 all the way up to 12; the values that X takes start from 0 and go all the way up to 12.

And so, one way to represent f_X is to draw a graph: put t on one axis and f_X(t) above it; it is a very common way to represent a PMF. So mark t = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12; those are all the values that t can take.

Now, what is f_X(t)? If t is 0, I have to marginalize over Y, and X = 0 is possible only when Y = 2; otherwise X cannot be 0. So f_X(0) is the same as f_{XY}(0, 2), and for that you get 1/16 × 1/7, the conditional probability being 1/7.

Now, what about f_X(1)? You get the same number: X = 1 also happens only a single way, so f_X(1) = 1/16 × 1/7 as well, and that common height of the first two stems is 1/16 × 1/7. Now, what about 2 onwards?

Now, 2, 3, 4, 5 happen under Y = 1 as well as under Y = 2, is it not? So if you are doing f_X(2), it is going to be f_{XY}(2, 2) + f_{XY}(2, 1). You see the marginalization happening; think about it. X = 2 occurs when Y is 2 and when Y is 1; when Y is 0, X = 2 will not occur, so I do not have to worry about it.

So I am keeping X fixed at 2 and summing over all possible values of the other variable. That is the marginalization rule: you keep the value you want fixed and sum over the values taken by the other random variable, whatever is possible; where it is not possible, the joint PMF is 0 and you do not have to add it, so there is no problem there. So what is each of these values? The first one is 1/16 × 1/7, and the second one has Y = 1, so that is 1/8 × 1/7, is it not? If you add these two you get 3/16 × 1/7, which is 3 times the earlier height; so the stems at 2, 3, 4, 5 are all at the height 3/16 × 1/7.

This kind of a plot is called a stem plot, and it is very useful for picturing a PMF: you draw the PMF as a plot like this on the axis, and sometimes a plot is more useful to look at than just an equation, is it not? So that is useful.

So, after this things get a little bit interesting. The value 6, for instance, shows up in all three conditionals, so if you are going to write f_X(6), you will have 3 terms:

f_X(6) = f_{XY}(6, 2) + f_{XY}(6, 1) + f_{XY}(6, 0) = 1/16 × 1/7 + 1/8 × 1/7 + 13/16 × 1/7 = 1/7.

All 3 of them add up and you get just 1/7. Notice how much higher that stem is: 16 times the first height.

Now let us come back to 7 and 8; they appear under Y = 0 and under Y = 1, so

f_X(7) = f_X(8) = 13/16 × 1/7 + 2/16 × 1/7 = 15/16 × 1/7,

which is also quite a lot of height. And then 9, 10, 11, 12 appear only under Y = 0, so each of those is just 13/16 × 1/7, a little bit lower. So there you go; that is your PMF.

So, notice how I looked at the different ranges of numbers very carefully, depending on the conditionals, and got different values for the marginal distribution of X, the number of runs in the over. And notice how the modeling went: we were first given the marginal of Y, then the conditional of X given the particular value taken by Y, and then we added over all the possible values of the joint PMF to get the distribution of X, and we got this interesting little distribution.
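To double-check the stem heights computed above, here is a small sketch in Python that builds the joint PMF from the marginal of Y and the conditionals of X|Y (taken here as uniform on {6..12}, {2..8} and {0..6} respectively, matching the heights computed above), and then marginalizes out Y using exact fractions:

```python
from fractions import Fraction as F

# Marginal of Y (wickets) and the range of each conditional X | Y = y;
# each conditional is uniform on 7 values
f_Y = {0: F(13, 16), 1: F(1, 8), 2: F(1, 16)}
range_X = {0: range(6, 13), 1: range(2, 9), 2: range(0, 7)}

# Joint PMF via the factorization f_XY(t, t') = f_Y(t') * f_{X|Y=t'}(t)
f_XY = {(t, tp): f_Y[tp] * F(1, 7)
        for tp in f_Y for t in range_X[tp]}

# Marginal of X: keep t fixed, sum over the wickets tp
f_X = {t: sum(p for (x, _), p in f_XY.items() if x == t) for t in range(13)}

print(f_X[6])  # 1/7, the tallest stem
```

Using `Fraction` keeps every height exact, so the values 1/112, 3/112, 1/7, 15/112 and 13/112 come out precisely.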

So, we saw 3 examples of how, in a problem or a given scenario, you have to work with the marginals and the conditionals to go to the joint, and then maybe even go to the other marginal. You are given one marginal and one conditional, so you go to the joint, and from there you go to the other marginal. This kind of loop lets you take partial information about the distribution and complete it to find the entire distribution.

In most real-life scenarios, the real trick is to come up with these distributions. How do you come up with the conditionals in the IPL powerplay over? I just came up with some plausible model for the conditional PMF; why that conditional PMF would ever be true is the non-trivial, non-mathematical problem in this area. Once you give me the distribution, doing all these calculations is a very definite and easy thing to do.

But you should be very skillful in these kinds of computations; this kind of computation should be very clear in your mind, and it is sometimes useful for doing other things as well. So those were three examples; we will go on to the next topic in the next lecture. Thank you very much!
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture No. 3.5
Joint PMF of more than two Discrete Random Variable

Hello and welcome to this lecture. So far we have been looking at two random variables, and things seemed complicated enough even in the two-variable world. We are going to go to more than two variables now: 3, 4, 5, 6. And believe me, when you look at complicated problems, there will be hundreds of variables, so you should never be scared of multiple variables. Let us get used to some of it, even in this course. So let us start thinking of more than two random variables: how do you define, still in the discrete world, the joint PMF, the marginal PMF and the conditional PMF? That is the beginning of this lecture.

(Refer Slide Time: 00:50)

So, actually the joint PMF itself for multiple discrete random variables is very easy to define. Supposing you have n discrete random variables X_1, X_2, ..., X_n, all defined in the same probability space; every random variable has its own range, which I will call T_{X_i}. Then the joint PMF of X_1, X_2, ..., X_n is written f_{X_1 X_2 ... X_n}, is it not? Just like we put f_{XY} when we had two random variables, with n random variables you put all of them in the subscript, and it becomes a function on the Cartesian product of the ranges.

For every possible combination of values that these random variables can take, say X_1 takes the value t_1, X_2 takes t_2, and so on till X_n takes t_n, the joint PMF evaluated at (t_1, ..., t_n) is simply

f_{X_1 ... X_n}(t_1, ..., t_n) = P(X_1 = t_1 and X_2 = t_2 and ... and X_n = t_n),

where each t_i belongs to the corresponding range. So if you have a small enough example, you can write out a table. But as we saw even in the example problems, you quickly get into situations where you cannot write tables, and you have to keep the joint PMF in your head, imagined as a plot or a formula, and work with it in problems; that is important, and we will do the same thing here. Also, instead of repeating the "and"s in P(X_1 = t_1 and ...), we will just put commas, and you have to interpret the commas as "and" when you write it down. So, this is the joint PMF, easy enough to define.

(Refer Slide Time: 02:40)

So, let us see a few examples and make sure we understand it. Let us go to the example of the fair coin toss. We saw earlier the coin tossed twice; now we are going to toss it 3 times and naturally define 3 random variables X_1, X_2, X_3, one per toss, with 0 for tails and 1 for heads. Here the example is small enough that you can write down a table: the triple (t_1, t_2, t_3) can take the 8 values 000, 001, 010, 011, 100, 101, 110, 111.

And notice how I have written these zeros and ones; it is very common to list them in this sequence, starting with all zeros and working your way towards all ones by flipping the last bit from 0 to 1, et cetera. For every possibility the joint PMF is 1/2 × 1/2 × 1/2 = 1/8; at 000, for instance, that is the probability that the first toss is tails, the second toss is tails, and the third toss is tails.

And in any other case it is also 1/8: the coin is fair, so whether it is heads or tails, the probability is half for whatever happens, and you just multiply half 3 times to get 1/8. So there are 8 possibilities, each with probability 1/8; the joint PMF is uniform on its range. A simple, easy example.
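The 8-row table can also be generated directly; a small sketch in Python:

```python
from itertools import product

# Joint PMF of three fair coin tosses: every triple (t1, t2, t3) of
# zeros and ones has probability (1/2)^3 = 1/8
f = {triple: (1 / 2) ** 3 for triple in product((0, 1), repeat=3)}

for triple, p in sorted(f.items()):
    print(triple, p)  # 8 rows, each 0.125
```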

(Refer Slide Time: 04:02)

So, let us go to a slightly more complicated example. I am going to define a 3-digit random number now; previously we had a 2-digit random number, and now we want more than 2 digits. We will use the notation 000 to 999 to denote these 3-digit random numbers, so that we have 1000 equally likely numbers, 000 to 999, as opposed to, say, the 900 numbers from 100 to 999.

I am going to define 3 random variables here, slightly different from what we did before. X is the first digit from the left of the number, the hundreds place; Z is the first digit from the right, the units place; and Y is the number modulo 2. What is the number modulo 2? If the number is even, Y is going to be 0; if the number is odd, Y is going to be 1. The full table is too long here, but you can write down all the possible values. How many different values can X take?

So, X can take the values 0, 1, 2, and so on till 9, if you want to think of its range. What about Y? Y can take the values 0 and 1; and Z can take the values 0, 1, 2, and so on till 9. So if I have to write down a probability now, take the joint distribution f_{XYZ}(0, 0, 0). What will I write down? This is the probability that the number starts with 0, the number is even, and the number ends in 0.

So what are all the favorable cases? 0-anything-0: such a number is definitely going to be even, since it starts with 0 and ends with 0, is it not? How many such possibilities are there? 10 possibilities, so that gives you 10 out of 1000, which is 1/100. Maybe you want to look at some other case which is a bit more interesting, say (1, 1, 1). What will happen for (1, 1, 1)? The number has to start with 1, it has to end with 1, and it has to be odd. It is always going to be odd if it ends with 1, so again this will be 1/100.

What about f_{XYZ}(1, 0, 1)? The number has to start with 1, end with 1, and be even. That is really not going to happen, so this will be 0. So you get these kinds of possibilities, with 1/100 showing up a lot, and you can see why: if Y is 0 and the last digit is even, you get 1/100; if Y is 1 and the last digit is odd, you again get 1/100; if Y is 0 and the last digit is odd, you get 0; and things like that.

So you can see how the joint PMF can be written down; it is not a particularly interesting exercise, but you can see how the joint works through examples like this. If you think one more example will help you: f_{XYZ}(8, 0, 6); what is this going to be? The number has to start with 8, end with 6, and be even. Whatever you put in the middle digit, it is going to be even, because it ends with 6, so it is easy: you again get 10/1000, which is 1/100. But if you do it the other way around, f_{XYZ}(8, 1, 6), you get 0: if it ends in 6, it can never be odd. So it is a very easy joint PMF to write down, and if you want, you can make a general rule about what it will be:

f_{XYZ}(t_1, t_2, t_3) = 0 if t_2 = 0 and t_3 is odd; 0 if t_2 = 1 and t_3 is even; 1/100 otherwise,

with t_1, t_3 taking values in {0, 1, ..., 9} and t_2 in {0, 1}. If t_2 is 0 and t_3 is odd, that cannot happen, so you get 0; if t_2 is 1 and t_3 is even, it cannot happen either; in any other case there are 10 favorable numbers out of 1000, and you get 1/100. So that is a somewhat strange example, but a good enough example anyway.
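Since there are only 1000 equally likely numbers, the general rule can be verified by brute-force enumeration; a small sketch in Python:

```python
from collections import Counter

# X = hundreds digit, Y = number mod 2, Z = units digit, over 000..999
counts = Counter()
for n in range(1000):
    counts[(n // 100, n % 2, n % 10)] += 1

f_XYZ = {triple: c / 1000 for triple, c in counts.items()}

print(f_XYZ[(0, 0, 0)])         # 0.01, i.e. 1/100
print(f_XYZ.get((1, 0, 1), 0))  # 0: ends in 1, so it can never be even
```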

(Refer Slide Time: 09:53)

So, let us go to our favourite third example. I just want to introduce a slightly more complicated situation, where things are going to become a bit harder, and I will make a few comments on how to think of it. Let us take the powerplay over: quite a complicated example where a lot of things can happen. Let us say the over has 6 deliveries; in fact, even that is not certain, is it? If there is a no ball or a wide that is bowled, you can have 7 deliveries. But let us say the over had 6 deliveries, and I am going to say X_i is the number of runs scored off the i-th delivery.

So you now have 6 random variables, X_i for i = 1, 2, ..., 6, and what can we possibly say about them? What kind of values will X_i take; what numbers can be scored off one delivery in cricket? People might say 0, 1, 2, 3 are possible; 4 is possible; 5 is possible (a no ball plus a 4, for instance); 6 is possible; and 7 is possible, since with a no ball you can hit a 6 and get 7. Let us just say 8 is also possible; you can go and look at the history of cricket, and I think there has not really been a serious match in the modern era where more than 8 runs have been scored off one delivery. 8 is very difficult: you need an overthrow four after you run 4 runs; it can happen, but it is very, very rare. If you think more is possible, you can put in 9 also. So every random variable X_1 to X_6 takes one of these 9 possible values, 0 to 8, and what do you think the joint PMF will be?

So, if you think of the joint PMF, let us say you want to think of 𝑓𝑋1𝑋2 𝑋3𝑋4𝑋5𝑋6 ; even writing it down
is laborious, is it not? Let us say you evaluate it at (0, 1, 4, 2, 0, 0). So, what is the probability of this? How does one
even compute something like this? It is not even clear how in a powerplay over one can decide something
like this. What is the probability of a maiden over? That is 𝑓 of all of this, evaluated at (0, 0, 0, 0, 0, 0).

So, it all seems fairly confusing as to what to even do in these kinds of problems. How do you
think of the joint PMF? And it seems quite extensive. If you count the number of possibilities
for all of these guys, it seems like it is 96 . This is too many possibilities, and even to write it down
and think about it is really hard. So, this is very typical of modern data problems: the
vastness of what is possible, versus what you can measure or what you can reasonably say, which is very
limited. And you have to try and do something in that kind of situation; this is very common
in modern problems.
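To get a feel for the scale here, a quick back-of-the-envelope computation of the joint PMF table size, treating 0 through 8 (9 values) as the possible runs per delivery, as discussed above:

```python
# Each of the 6 deliveries takes one of 9 run values (0 through 8), so the
# joint PMF needs one probability for every 6-tuple of values.
values_per_delivery = 9
deliveries = 6
joint_pmf_entries = values_per_delivery ** deliveries
print(joint_pmf_entries)  # 531441, far more entries than there are IPL matches
```

Half a million entries against roughly 1600 recorded matches is exactly the mismatch the lecture is pointing at.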

Sometimes the data is vast, but sometimes data is not vast; so you think about how to deal with it.
But even if the data were there, even if somebody was trying to give you the probabilities, it is just
a huge thing to define. You have to define so many things to define the joint PMF; it is not as simple as
even the previous problem. Anyway, I just wanted to put this out there for you to think about
how much more complicated real scenarios can be.

And to write down precise probabilistic things is very tough. So, we need some tools to think of
such joint PMFs in simple ways; and that is where marginalization and conditioning will help
you. Quite often in practice, when you have big complicated scenarios like this, people
think in terms of marginal distributions and conditional distributions. The joint PMF itself might
be too unwieldy to write down; so you think of how marginalization and conditional PMFs
will work. That is a very important part, and it will come in the next lecture. So, we will see
that in the next lecture.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture No. 3.6
Marginal PMF of Multiple Discrete Random Variable
(Refer Slide Time: 00:13)

Hello and welcome to this lecture. In the previous lecture we looked at multiple random
variables; how to describe the joint distribution of multiple discrete random variables. In
particular I gave some simple examples of joint PMF and we looked at the scenarios and
computed what the joint PMF will be.

Now, we are going to start looking at the marginal PMF. I sort of ended the previous lecture
by saying that when you have a lot of random variables, the joint PMF can become a very
complicated object, which you cannot easily describe. So, you are looking for some simpler
alternatives, and marginalization and conditional distributions are very nice pathways to get
to the joint distribution.

So, let us start by looking at the marginal PMF. In particular, the individual marginal PMFs are
what I will start with, and then we will slowly generalize this. We are going to look at
multiple discrete random variables; we already saw the marginal PMF in the context of
two random variables. Now, we are going to increase the two to multiple, that is all.
So, let us say you have n random variables X1, X2, …, Xn, and these are distributed with some joint
PMF fX1X2…Xn. Then the PMFs of the individual random variables are the
individual marginal PMFs that we are interested in: fX1, fX2, …, fXn.

And we can quite easily see that the PMF of X1 itself, evaluated at some t, is going to be basically
P(X1 = t). And it turns out you can find these marginal PMFs by summing the joint PMF
over the ranges of the other variables. What do you sum over? You basically sum over all that you do not want to
keep. So, if you are looking at fX1, you have to sum over everything other than X1. The joint
PMF of course will depend on X1, X2, all the way to Xn.

So, you sum the joint PMF; you can see this in the first formula:

fX1(t) = P(X1 = t) = Σ over t2' in TX2, t3' in TX3, …, tn' in TXn of fX1X2…Xn(t, t2', t3', …, tn').

I am summing over t2' which is in
the range of X2, t3' which is in the range of X3, all the way to tn' which is in the range of Xn.
What am I summing? The joint PMF evaluated at t, which is the value of X1 which
I want to keep. So, I will not be summing over t, I want to keep that t; and then everything else
runs over all possible values. This has a very simple proof; the proof is very similar to what we did
before.

Basically, I am saying I want the probability that X1 equals t; I write the event X1 = t, X2 = t2',
X3 = t3' and so on, and simply add up over all possible values for t2', …, tn'; that is exactly
what it is. It is a very simple formula, except that when you execute it, sometimes it can become
a bit confusing; but keep this rule in mind. When you marginalize, you sum over everything that you do not
want and keep only what you want; you get the marginal PMFs.

So, when you go to X2, what will you do? You keep X2 alone, put variables for all
the other random variables and sum over all of them; you get the marginal PMF of X2, and so on. So, that
is marginalization for multiple discrete random variables.
(Refer Slide Time: 03:28)

So, let us see a few examples; we have always been seeing examples, so let us look at one.
I am going to toss a fair coin thrice; X1, X2, X3 are the indicators of the first, second and
third toss being heads, taking value 1 for heads and 0 for tails respectively. We have seen the joint PMF
before; every outcome just has probability 1/8. Suppose you want to look at the
marginal fX1(0); notice what I am doing here. I am keeping the 0 fixed for X1 throughout, and
the other coordinates are varied across their ranges; you marginalize them out.

So, you look at these terms:

fX1(0) = fX1X2X3(0, 0, 0) + fX1X2X3(0, 0, 1) + fX1X2X3(0, 1, 0) + fX1X2X3(0, 1, 1) = 4 × (1/8) = 1/2.

The same thing for fX1(1): you keep 1 fixed for X1, vary everything else, and simply
add it all up; you get half:

fX1(1) = fX1X2X3(1, 0, 0) + fX1X2X3(1, 0, 1) + fX1X2X3(1, 1, 0) + fX1X2X3(1, 1, 1) = 1/2.

Hopefully this is easy enough to execute.

So, this is a very simple problem and you see how the marginals work out. If you do X2 you will
get the same answer; X3 will also give the same answer. Of course, we knew the distribution of
X1, X2, X3 before; it is not very difficult to do this.
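These marginals are easy to check numerically; here is a small sketch (the dict-of-tuples representation of the joint PMF is my own choice, not the lecture's notation):

```python
from itertools import product

# Joint PMF of (X1, X2, X3) for three fair coin tosses: each 0/1 triple
# has probability 1/8.
joint = {bits: 1 / 8 for bits in product([0, 1], repeat=3)}

def f_x1(t):
    # Keep X1 = t fixed and sum the joint PMF over all values of X2 and X3.
    return sum(joint[(t, t2, t3)] for t2 in [0, 1] for t3 in [0, 1])

print(f_x1(0), f_x1(1))  # 0.5 0.5
```

The same pattern gives fX2 and fX3 by fixing the second or third coordinate instead.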
(Refer Slide Time: 05:04)

Let us go to our slightly more interesting example of the numbers 000 to 999. X is the first digit from the left,
Y is the number modulo 2, and Z is the first digit from the right. In this case you might say, I
will have to sum over the joint PMF; but really you know the marginals directly and
easily. You can write them down; there is no need to sum over the joint PMF or anything. If you
look at the first digit from the left, what is the probability that the first digit is going to be equal
to 0? What is the probability that it is going to be equal to 1? You will see all of them are 1/10.

There are 100 out of the 1000 cases which are favorable for every particular first
digit from the left. If you want the first digit to be 0, you have 000 to 099; that is a hundred
favorable out of the overall 1000, so you get 1/10. So, it is easy to quickly see that
fX(t) = 1/10 for t = 0, 1, …, 9. So you see, quite often, to find the marginal PMF from an
experiment, you may not have to do the summation and the marginalization very painfully;
you may be able to deal with the marginal directly. That is something important
to remember as well; do not go finding the joint PMF all the time.

You may be able to quickly find the marginal directly. Likewise, Y is the number modulo 2;
so Y is going to be 0 if the number is even, and 1 if the number is odd. Exactly 500 out of
these 1000 numbers are even, the other 500 are odd. So, if you only care about even or odd, it is
going to be uniform on {0, 1}: fY(0) = fY(1) = 500/1000 = 1/2. So, it is uniform on {0, 1};
same thing with Z, the last digit: fZ(t) = 1/10 for t = 0, 1, …, 9. The units place
digit is also uniform from 0 to 9. So, quite often the marginal may actually be quite easy to
find; you may not have to worry too much about adding up over the joint PMF and all that.
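The point that these marginals come out by direct counting, with no summation over a joint PMF, can be checked in a few lines (my own sketch, not from the lecture):

```python
from collections import Counter

# X: first digit from the left, Y: number mod 2, Z: first digit from the
# right, for a number drawn uniformly from 000 to 999.
numbers = range(1000)
f_x = Counter(n // 100 for n in numbers)  # leftmost digit counts
f_y = Counter(n % 2 for n in numbers)     # parity counts
f_z = Counter(n % 10 for n in numbers)    # rightmost digit counts

print(f_x[0] / 1000, f_y[0] / 1000, f_z[0] / 1000)  # 0.1 0.5 0.1
```

Every digit count for X and Z is exactly 100, and the parity counts are exactly 500 each, matching the 1/10 and 1/2 marginals above.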

(Refer Slide Time: 07:11)

So, let us come to this very interesting case. We have been talking about data a lot, and about practical,
more complicated examples where the space is difficult to specify and all that. And
this IPL powerplay over 1, the first over of the powerplay, is maybe a good example of such a
situation. We will assume that this over has 6 deliveries; there are cases where it has 7. So, let
us just forget about the seventh delivery and consider the over to be the first 6
deliveries that are bowled. There will always be 6 deliveries in the over, so we do that. And Xi is
the number of runs scored in the i-th delivery; that is the variable we have been
looking at.

Now, what do I do for the marginal distribution of Xi? If you remember, I gave you this
example and said the joint PMF of X1 to X6 looks formidable to calculate. It is formidable not
just to calculate, but even to specify or write down or think about. It seems like it is all over the place;
you may not be able to really do something very clear. But you will see, interestingly, you can do
something very reasonable for getting the marginal PMFs of X1 to X6. It is not too bad: like I
mentioned, there are about 1500 IPL matches that have already happened, 1598 to be precise. And if you
tabulate the data from there, it is not too unreasonable to think of the marginal distribution of X1.
So, let me show you how this works. We have seen before that you can
take TXi = {0, 1, 2, 3, 4, 5, 6, 7, 8}; it is very unlikely that more than 8 is going to happen. The trick is
how to assign probabilities to these values. How do we do this? We have never done
something like this before, have we? Previously, it has always been some toy experiment; we have
always been able to sort of figure out what the probability should be. And how do you
assign probabilities here? Maybe you can guess from your experience of IPL matches. What is going
to be the most probable run scored?

Let us say in the first ball, mostly the batsman is going to defend or something. So, for X1 in the first
ball, it is very likely that 0 is the dominant number. In fact, you may even guess that 0
would be the most probable number of runs that somebody scores off a delivery. Most cases may be 0;
a boundary can happen. But how do we assign probabilities? What is a meaningful way to
assign the probability? Traditionally, what people do (this is probably not
the best method, but it is a reasonable method, and it is not too bad) is to go
out and collect the data on past occurrences.

So, like I said, there have been 1598 matches where the first over has been bowled so far. And you
go and see what happened in ball 1: in how many matches were 0 runs scored? It turns out
that in 957 matches 0 runs were scored; look at the large fraction. And 1 run was scored in 429
matches; together that covers a significant number. Then you have 2 runs in 57 matches,
3 runs in 5 matches; the 4, the boundary, seems to be more popular, with 138
matches; then 5 runs in 8 matches, 6 runs in 4 matches. That is it: so far it looks like nobody has bowled a
no ball and been hit for a six off the first ball, so you do not have a 7 or 8.

So, one of the ways is to assign probabilities in the same proportion as the data; maybe you
want to say P(X1 = 0) = 957/1598. I am not claiming, once again, that this is the best way or anything
like that; there is nothing like the best in these kinds of things. It seems reasonable, and there are
good reasons why this might be okay; we will see later on why this might be a good way to
assign these kinds of probabilities. But at least intuitively, most of you would say yes, that
seems reasonable, so you can do that. Now, you can repeat this for ball 2.

So, notice how these numbers are large enough; I would have liked to have 15000
matches, then maybe these numbers would all be much better. And somebody might say
there have been only 4 matches where a six was hit. But there have been about 1500
matches and I am looking at only a few possibilities; so it is okay, it seems like I have enough data. It
feels like I have enough data to make this statement; I am making a lot of loose statements here.
But these are important things to just think about. I have seen enough matches to be able to
comment about what happens in the first ball of an over; it is not very unreasonable.
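Turning the quoted ball-1 counts into a PMF in proportion to the data takes one line; the counts below are the ones stated in the lecture:

```python
# Runs scored off ball 1 across the 1598 recorded first overs, as quoted.
ball1_counts = {0: 957, 1: 429, 2: 57, 3: 5, 4: 138, 5: 8, 6: 4}
total = sum(ball1_counts.values())  # 1598
# Empirical marginal PMF of X1: probability proportional to count.
pmf_x1 = {runs: count / total for runs, count in ball1_counts.items()}
print(round(pmf_x1[0], 4), round(pmf_x1[1], 4))  # 0.5989 0.2685
```

The two printed values match the probabilities read off the table on the next slide.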

(Refer Slide Time: 11:50)

So, I have done this now: I have taken all the balls and tabulated in how many
matches 0 runs were scored, in how many matches 1 run was scored, in the first ball, second ball,
third ball, et cetera; and I have put down the distribution in this little table here. So, you can see the
marginals are something one is able to think of in a reasonable way; maybe I want to assign a probability of
0.5989 to X1 being 0, and a probability of 0.2685 to X1 being 1. This is again in proportion to how
many times in the past this has happened. So, maybe going forward this is a reasonable way to
assign the probabilities.

And you can notice some subtle differences. For instance, a six is much more likely,
some 6 to 7 times more likely, to be hit off the third or fourth ball. The fourth ball seems to be
very interesting for hitting sixes, and you can also look at boundaries; again, the fourth ball is much
more likely than anything else. So, in general it seems very interesting, and it looks like 0 is
lowest in the last ball: the proportion of times 0 runs are scored is smallest for the last
ball, but still more than 50 percent of the balls are dot balls. That is an interesting observation as
well; even in the last ball, about half the time across the matches nothing is scored.
So, you notice, I will keep coming back to this example over and over again this week
and in the next week also. This kind of thing is very important for data
science: seeing the connection between where the probability distribution is coming from,
where statistics will eventually come from, and where data enters the picture
in terms of coming up with something. And I also want you to think about how, if you want to
look at the joint PMF for all the 6 balls, 1598 matches is not good enough. There are just too
many cases and many of them will not appear even one time; you cannot really put anything down
there meaningfully.

I will come back and comment on this for you later on as well. But at least this much we are able
to do: the marginals, with this kind of data, we are able to easily write down. So, maybe the
marginal is a good idea; let us see what else we can do with this data going forward. So, what is
the moral of the story? In large problems, when there are a lot of random
variables, marginals are the only way in which you are going to make progress. You
cannot deal with the joint PMF; you have to look for smaller distributions, try to stitch them
together, and how we do that we will see in this lecture.

(Refer Slide Time: 14:28)

So, let me now generalize a little bit; this is also important. Let us say we have 3 random
variables now and they have a joint distribution fX1X2X3; so, I am taking a very simple case. Just
now we saw n random variables and all that; let us start with 3, after 2 we should go to 3, so we
are going to 3. We have discussed individual marginal PMFs, so you have fX1, fX2, fX3. Now,
what about fX1X2? This is the joint PMF of X1 and X2; it is a very reasonable object.

So, just now I mentioned how the joint PMF of everything might be difficult. But what about the
joint PMF of just X1 and X2? It does not seem that bad, it is a very valid object, and we can say
something about that; likewise fX1X3 and fX2X3. So, there is some meaning in this; it seems
reasonable to think of these pair-wise distributions, for instance, when you have 3 random
variables, and not just the entire joint PMF.

So, it turns out this is very much possible, and you can do it exactly like before, the principle
being: when you want to marginalize, keep only what you want and sum over everything you
do not want. So, if I want the joint PMF of X1, X2, and I am given the joint PMF of the whole thing, fX1X2X3,
how do I marginalize and find it?

fX1X2(t1, t2) = P(X1 = t1, X2 = t2) = Σ over t3' in TX3 of fX1X2X3(t1, t2, t3').

So, that is the formula given there: sum over all t3' in the range TX3 of the joint
PMF fX1X2X3. Given the joint PMF, you can always go to the marginals in this fashion; and the
marginals need not be of individual random variables.

They can even be pairs of random variables, X1X2, X1X3. What do you do for X1X3?

fX1X3(t1, t3) = P(X1 = t1, X3 = t3) = Σ over t2' in TX2 of fX1X2X3(t1, t2', t3).

Same thing with X2X3: you sum over all
possibilities for X1; so it is a very simple extension. We started with the entire joint PMF and
then we said maybe the marginals are interesting; we looked at the individual marginals.

Now, we have pair-wise marginals, and you can extend this to other situations; so, the principle
is important to remember. Whenever you want to marginalize, you keep what you want, sum over
everything you do not want, and you get the marginalization.
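The pair-wise formula translates directly into code. Here is a sketch with an illustrative joint PMF of my own (nine equally likely cells of 1/9, chosen so that fX1X2(0, 0) = 1/3 and fX1X2(0, 1) = 2/9 come out as in the worked example that follows; the actual slide table is not reproduced in this transcript):

```python
def marginal_x1x2(joint_pmf):
    """f_{X1X2}(t1, t2): keep (t1, t2) and sum the joint PMF over every t3."""
    out = {}
    for (t1, t2, t3), p in joint_pmf.items():
        out[(t1, t2)] = out.get((t1, t2), 0.0) + p
    return out

# Illustrative joint PMF over (t1, t2, t3): nine cells, each 1/9.
joint = {(0, 0, 0): 1/9, (0, 0, 1): 1/9, (0, 0, 2): 1/9,
         (0, 1, 1): 1/9, (0, 1, 2): 1/9,
         (1, 0, 0): 1/9, (1, 0, 2): 1/9,
         (1, 1, 0): 1/9, (1, 1, 1): 1/9}
m = marginal_x1x2(joint)
print(m[(0, 0)], m[(0, 1)])  # 1/3 and 2/9, up to float rounding
```

Three cells with (t1, t2) = (0, 0) sum to 1/3, and two cells with (0, 1) sum to 2/9, exactly as the keep-and-sum rule says.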
(Refer Slide Time: 17:13)

So, here is an example; this is a simple enough example that I can do it for you. Let us first
marginalize over t3 here to get fX1X2; I am going to have a table over t1 and t2, where t1
takes values 0, 1 and t2 takes values 0, 1. So, suppose you want fX1X2(0, 0). What will
this be? You take the entries of the table with t1 = 0, t2 = 0; there are three of them, and for t3 I have
to sum over all its possibilities. If I do that, I have to add 1/9 + 1/9 + 1/9 = 1/3. Let us look at (0, 1) next:
that is X1 = 0 and X2 = 1. You fix t1 to 0, t2 to 1 and then sum over all possible values for t3;
there are only two such entries, so fX1X2(0, 1) = 1/9 + 1/9 = 2/9. For (1, 0), you do the same thing and
get fX1X2(1, 0); and likewise for (1, 1) you get fX1X2(1, 1). So, this is my joint PMF fX1X2. One can
also do other joint PMFs; I will do one more.

Let us do fX1X3, just for variety; we have t1 here and t3 here. Remember, now t3 can take 3
values; t1 is 0, 1 and t3 is 0, 1, 2, so you have 6 possibilities here. I want X1 = 0 and X3 = 0;
that is the only such entry in the table, so fX1X3(0, 0) = 1/9. What about (0, 2)? You find the
entries with t1 = 0, t3 = 2 and add them up to get fX1X3(0, 2). Then you have X1 = 1 and X3 = 0;
that is two possibilities, so fX1X3(1, 0) = 2/9. Then you have (1, 1) being one possibility, so
fX1X3(1, 1) = 1/9; and the remaining entries are found the same way. Notice
how the marginalization is working and how we are able to do it very easily given the table; it
is a simple enough calculation to do. Of course, when the numbers become bigger and the probability
problem becomes more complex, these things are harder to do. But at least for simple toy
problems in basic probability, one can do this very clearly. Alright, I thought I would put one
sheet here for working, but I really do not need it; it is clear enough what I have to do.

(Refer Slide Time: 21:26)

So, let us go slightly further: we did 2 random variables, 3 random variables; the next logical
thing is 4 random variables. I am trying to show you that you can have all sorts of variety here;
again the same principle, you sum over everything you do not want. If you want the marginal
of X1 from the joint, you sum over t2', t3', t4':

fX1(t1) = P(X1 = t1) = Σ over t2' in TX2, t3' in TX3, t4' in TX4 of fX1X2X3X4(t1, t2', t3', t4').

If you want X1X2, you sum over t3', t4':

fX1X2(t1, t2) = P(X1 = t1, X2 = t2) = Σ over t3' in TX3, t4' in TX4 of fX1X2X3X4(t1, t2, t3', t4').

If you want X2X4, you sum over all possibilities of X1 and
all possibilities of X3.

If you want the joint PMF of three of them, X1, X3, X4, then you sum t2' over the range of X2; that
is it, it is as simple as that.

fX1X3X4(t1, t3, t4) = P(X1 = t1, X3 = t3, X4 = t4) = Σ over t2' in TX2 of fX1X2X3X4(t1, t2', t3, t4).

So, going from the joint PMF to marginals is a very simple one-way process; nothing
complicated happens there, it is very easy. But we already know that from the marginals you cannot
uniquely go back to the joint PMF; and therein lies the variety in data science. So,
I think that is it; I do not think I have more worked out examples for you, and that is
good enough I think.
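The keep-and-sum rule for any subset of variables can be written once and reused; a minimal sketch (the function name and dict representation are my own, not the lecture's notation):

```python
from itertools import product

def marginalize(joint_pmf, keep):
    """Marginal PMF of the variables whose indices are in `keep`:
    sum the joint PMF over every other coordinate."""
    out = {}
    for values, p in joint_pmf.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Four fair coin tosses; marginalize onto (X1, X3, X4) by summing over X2.
joint = {bits: 1 / 16 for bits in product([0, 1], repeat=4)}
m = marginalize(joint, keep=[0, 2, 3])
print(m[(0, 0, 0)])  # 1/16 + 1/16 = 0.125
```

Each output triple collects exactly the two joint cells that differ only in X2, matching the fX1X3X4 formula above.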

(Refer Slide Time: 22:28)

So, in general, here is the general formula; I do not want to belabor it, you can read it
and understand. If you have n different random variables and you want to take any subset of
them, Xi1, Xi2, …, Xik, and marginalize, you simply sum over everything
except i1 to ik: you keep only ti1 to tik and sum over all the other coordinates; you are done.

fXi1…Xik(ti1, …, tik) = Σ over tj' in TXj, for every j not in {i1, …, ik}, of fX1…Xn(t1', …, tn'),

where the coordinates at positions i1, …, ik are fixed at ti1, …, tik and all the other coordinates
run over their ranges.

So, that is the way to go from joint PMF to marginal PMF in the general case. So, I think that
concludes marginal PMF; the next big topic is conditioning with multiple random variables. That
we will do in the next lecture.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture No. 3.7
Conditioning with Multiple Discrete Random Variables

(Refer Slide Time: 00:13)

Hello and welcome to this lecture. We are going to discuss conditioning with multiple discrete
random variables. So, like I said, marginalization is one key idea to reduce the scale of the
problem, to look at a smaller problem from a bigger problem; that is always very nice. But
conditioning is what connects the marginals to give you some picture of the whole in some
sense; so, that is important also. We have seen before how the joint PMF of two random
variables is the product of the marginal times the conditional. So, you can see that the
conditional is sort of like a bridge between marginal and joint; that is an important
principle to keep in mind.

So, when you have multiple random variables, let us consider the example here. I am starting
with an example of 4 random variables X1, X2, X3, X4; you can do all sorts of conditioning. Previously,
we saw that when you had 2 random variables X and Y, you could do X given Y, and Y given X; that is it.
So, now with multiple random variables you can do a variety of conditioning; let me give you the
simplest example.
So, the simplest example we have seen before. You take 2 of them, X1 and X2, and
look at the conditional random variable that I have been talking about:

fX1|X2=t2(t1) = P(X1 = t1 | X2 = t2) = fX1X2(t1, t2) / fX2(t2);

that is how we write it. And we know that this is
just the joint divided by the marginal. Conditional equals joint by marginal; that is a very nice
formula. Remember, joint equals conditional times marginal; that is also a nice formula to
remember.

But, conditional equals joint by marginal is a very nice, simple thing to remember; and notice
how the joint came. You took X1, X2 and simply joined them, put the joint PMF in the numerator; and
the conditioning variable alone you kept in the denominator. This simple idea one can extend, not just
to 2 random variables but even to more than 2. So, let us say I want to define a pair of
conditional random variables, in some sense (X1, X2) given X3 = t3.

So, remember, all these things are random variables on the same space, and I have events
defined using X3; that is what I am doing here. And once I define an event using X3, I can of
course condition the random variables X1, X2 on that event. I am looking at both of them
conditioned on that event; so, it is all valid and clear, and very easy to think of in
some sense. So, I can think of

f(X1,X2)|X3=t3(t1, t2) = P(X1 = t1, X2 = t2 | X3 = t3) = fX1X2X3(t1, t2, t3) / fX3(t3).

It is
maybe a bit confusing, but you can see that the idea is very similar to the conditioning of X1 with
X2.

And the formula is also very similar; it is the same principle, joint by marginal. So, what is
the joint? I have X1, X2 given X3 = t3, evaluated at (t1, t2); I simply join everything and
put fX1X2X3(t1, t2, t3) in the numerator. You divide by whatever you are conditioning on; so
fX3(t3), simple enough. It is just a simple extension, and you can show that these are all
valid PMFs: they satisfy all the conditions of a PMF, their values are between 0 and 1, and they
add up to 1. All of this can be quite reasonably shown; it is not a big problem to do that.

Let us look at the next one; this one is a little bit more interesting. So, I am going to condition on
a slightly more complex event. What is the event? The event is that X2 = t2 and X3 = t3; that is the
event. I am taking that event and I am conditioning the random variable X1 on it: X1
given the event X2 = t2, X3 = t3. How do I write the distribution for this? The notation is
again a little bit cumbersome, but it is easy to write down:

fX1|X2=t2,X3=t3(t1) = fX1X2X3(t1, t2, t3) / fX2X3(t2, t3).

That is simply,
once again, the joint by marginal, nothing more to worry about. What is the joint? Combine
everything together: fX1X2X3(t1, t2, t3). And what is the marginal? What comes in the
denominator is whatever you are conditioning on, so fX2X3(t2, t3).

So, it is very easy to write down these conditional expressions. You can also do some more
interesting variety here. So, here is the joint PMF of X1 and X3, conditioned on the event that
X2 = t2, X4 = t4. Once again one can write a joint conditional PMF:

fX1X3|X2=t2,X4=t4(t1, t3) = P(X1 = t1, X3 = t3 | X2 = t2, X4 = t4) = fX1X2X3X4(t1, t2, t3, t4) / fX2X4(t2, t4).

Unwieldy as these expressions are, I am writing them like this; there may be some simplifications here,
but in general it is unwieldy, so we will just live with it. And once again the principle: how do you
write the conditional PMF? It is joint by marginal. So, you take fX1X2X3X4(t1, t2, t3, t4)
divided by whatever you are conditioning on; simple. The formula itself is so
simple that you can teach it to anybody. But how it connects to practice is where all the
difficulties will come; we will see conditioning with multiple discrete random variables. So, this
is done.
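All the flavors of conditioning above follow one recipe: joint divided by the probability of the conditioning event. A single function can capture that recipe; this is a sketch with my own representation and example, not from the lecture:

```python
def conditional(joint_pmf, keep, given):
    """Conditional PMF of the variables at the indices in `keep`, given
    that the variables at the indices in `given` take the stated values."""
    # Denominator: marginal probability of the conditioning event.
    event_prob = sum(p for vals, p in joint_pmf.items()
                     if all(vals[i] == v for i, v in given.items()))
    out = {}
    for vals, p in joint_pmf.items():
        if all(vals[i] == v for i, v in given.items()):
            key = tuple(vals[i] for i in keep)
            out[key] = out.get(key, 0.0) + p / event_prob
    return out

# Three variables on one space: two fair coin tosses and their XOR.
joint = {(a, b, a ^ b): 0.25 for a in [0, 1] for b in [0, 1]}
c = conditional(joint, keep=[0, 1], given={2: 0})
print(c)  # {(0, 0): 0.5, (1, 1): 0.5}
```

Changing `keep` and `given` gives any of the conditionals written above, e.g. `keep=[0, 2]`, `given={1: t2, 3: t4}` for fX1X3 given X2 = t2, X4 = t4 on a four-variable joint PMF.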

(Refer Slide Time: 05:38)


Let us see an example; this is a slightly more complicated example. I thought I
should do one worked out example, try a few cases, and at least evaluate a few
conditional PMF values to show you how this will work out. You will see it is
a bit confusing sometimes, but it is worth it. So, let us start at
the very beginning, with X1 | X2 = 0. The first thing you have to do is identify
the ranges; let me remind you, when somebody gives you a big table of values and things
like that, the first step is to look at every column and identify the range.

Make sure you understand what values it takes and all that, so that you can start working on the
problem. The first thing is, I have identified here that TX1 = {0, 1}; and you
can see also that TX2 = TX4 = {0, 1}, so three of them have the same range, and the range of X3 is {0, 1, 2}.
These are quick things to identify: whenever you have a random variable problem, start with the
range; the range is easy to write down. So, let us just write that down first. And then I am going
to start looking at conditioning; so let us say I want to look at X1 given X2
equals 0.

So, let us look at X1 | X2 = 0. Given X2 is 0, it is enough to look at the rows of the table where
X2 is 0; I do not have to worry about anything else. So, I know that the PMF of
X1 given X2 = 0, evaluated at say 0, is the joint fX1X2(0, 0) divided by the marginal fX2(0).
So, you look at how many cases you have with X1 = 0 and X2 = 0; and if you look at this
distribution, I know the conditional random variable takes 2 values, 0 and 1. Given that X2 is 0, I am
seeing that X1 sometimes takes the value 0 and sometimes takes the value 1.

So, that is the first thing: you identify the range of the conditional random variable also; you
have identified the overall range, that is okay. Then for a particular conditioning, you identify the range
of that object. How do you identify the range? You just focus on the rows matching what you have conditioned on.
This is X2 = 0; then see what values X1 takes in those particular conditioned
rows. When X2 is 0, you see that X1 can take value 0 and it can also take value 1, so the range is
{0, 1}.

The next step is to find the actual conditional PMF values for 0 and 1. For 0, what do you do?
You look at the cases where you have (0, 0) for both X1 and X2, divided by the total probability
of X2 = 0. There are 7 cases with X2 = 0, so in the denominator you will have 7/12;
in the numerator, you will have 4/12. So, the 0 case is just (4/12) / (7/12) = 4/7. And what about
1? 1 is (3/12) / (7/12) = 3/7. Hopefully you see how I got that: for the conditional random variable to
take the value 0, I have to look at the cases where X1 and X2 both take 0; that is 4 out of the
12, probability 4/12.

And then I have to divide by the probability that X2 is 0; that is 7 out of 12. So, hopefully you
saw that. Same thing with 1: I get 3/12 divided by 7/12. So, let us look at a slightly more
confusing situation: X1 | X3 = 0 and X4 = 1. Where is X3 = 0, X4 = 1 in the table? There are
three such rows; notice how the number of matching cases is sharply decreasing even in this table, but
anyway, they are there. We see that one of these rows corresponds to X1 = 0 and two correspond to
X1 = 1, so the range is {0, 1} again.

And what about the probability? The conditioning event appeared three times, that is 3 out of 12; and
X1 = 0 appeared in 1 out of the 12, so that works out as (1/12) / (3/12) = 1/3. So, once I have
1/3 and there are only two values, the other one should be 2/3; I do not need to
work that out at all. Once I get 1/3, I know the next one is 2/3, because it needs to be a valid PMF;
so that is enough. So, that was one slightly more confusing situation; let
us now look at something a bit more interesting. Let us look at (X3, X4) given X1 = 0; so this is
something I am going to look at.

So, I will have a table of possibilities, so I should put some t4, t3 here and t4 here. Now, given X1
is 0 is all of these guys and if I want to look at X3, X4; I have X3 taking values 0, 1, 2. I have X4
taking values 0, 1; so X3 0, 1, 2 and X4 is 0, 1. And if I look at 0, 0 that is 1 out of the 7 cases; so
you can you can do it more carefully. You can do 7/12 in the denominator, 1/12 in the
numerator; we get the same answer. t3 0, t4 is 1 that is again 1/7, t3 is 1 and t4 is 0; that is again
1/7. 1, 1 again happens 2/7, so it happens twice; it happens once here, one is there and then 2, 0.
That happens twice again that is 2/7, and 2, 1 never happens.

So, you see it works out to be a joint PMF in this way. This is an important, simple, basic skill in
conditioning: given a table, or given the joint PMF, marginalize and find whatever conditional you
want, whatever marginal you want. Hopefully this is easy to do in some sense; it is just a question
of carefully identifying and summing up the probabilities. In the uniform case it is very easy; if it
is not uniform, you have to add up the entries; and if it is a more complicated joint PMF, it can be
really painful. But it is easy enough to do, and the principle is important.
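This marginalize-and-condition recipe is easy to mechanize. Here is a minimal Python sketch; the table on the slide is not reproduced in the text, so the 12 equally likely tuples below are a made-up stand-in with the same flavor:

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical uniform joint PMF of (X1, X2, X3, X4): 12 equally likely tuples.
# (Illustrative only; this is not the table from the slide.)
outcomes = [
    (0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0),
    (0, 2, 0, 0), (0, 2, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0),
    (1, 1, 2, 0), (1, 1, 2, 1), (1, 2, 0, 1), (1, 2, 1, 1),
]
joint = {t: Fraction(1, len(outcomes)) for t in outcomes}

def conditional_pmf(joint, value_idx, given):
    """PMF of coordinate value_idx, given fixed values of other coordinates.

    given: dict mapping coordinate index -> required value."""
    num = defaultdict(Fraction)
    den = Fraction(0)
    for t, p in joint.items():
        if all(t[i] == v for i, v in given.items()):
            den += p                      # probability of the conditioning event
            num[t[value_idx]] += p        # split it by the value of interest
    return {v: p / den for v, p in num.items()}

# Distribution of X1 given X3 = 0 and X4 = 1 (coordinate indices 2 and 3).
cond = conditional_pmf(joint, 0, {2: 0, 3: 1})
print(cond)                               # {0: 1/3, 1: 2/3} for this table
assert sum(cond.values()) == 1            # a valid PMF must sum to 1
```

The same helper gives any marginal too: condition on nothing (an empty `given` dict) and it just sums out the other coordinates.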

(Refer Slide Time: 13:48)

So, the next important thing is that the conditional PMFs and the marginal PMFs are related to
the factoring of the joint PMF. So, what is factoring of anything? If you take a big number, the
number seems very big; but if you keep factoring it, you can write it as a product of various small
numbers. So, even though a number like 1024 seems very large, actually it is 2 into 2 into 2, ten
times; if you multiply that out, you get 1024. A description like "2 multiplied ten times" seems a
bit simpler than 1024 in some sense. In the same way, you may be looking at a complicated big
joint PMF which we are not able to understand, or maybe it is very confusing.

If you can factor it, if you can write it as a product of several terms, and each term is okay at some
level, so that you can deal with it in some reasonable way, then there is some hope that you can
deal with the joint PMF. So, factoring the joint PMF is a hugely important idea in understanding a
stochastic phenomenon with many different random variables. You do not want to look at just the
joint PMF; you want to see if you can factor it, if you can think of the factors in some way. For
that, this marginal and conditioning idea turns out to be very crucial; you can do the factoring in
so many different ways, and I will talk about some very important special cases later.

But where does the fundamental idea come from? It is actually a very simple idea: we use the
marginals and conditioning that we saw so far, and we write the joint PMF as a product of factors.
So, let us look at a simple example. Here are 4 random variables X1, X2, X3, X4, and I have a joint
PMF f_{X1X2X3X4}(t1, t2, t3, t4); there is some range et cetera, all of that is fine. I know that the
joint PMF is

f_{X1X2X3X4}(t1, t2, t3, t4) = P(X1 = t1 and X2 = t2 and X3 = t3 and X4 = t4);

this is easy enough to write down. So, now I will start putting brackets. Notice this "and": it means
everything together, so it does not matter if I first take the "and" of X2 = t2, X3 = t3, X4 = t4 and
then "and" it with X1 = t1; it is the same event.

So, everything has to be true, so it better be true in that way. Now I have an "and" of two things,
and I can do conditioning; notice how I do the conditioning here:

P(X1 = t1 and X2 = t2 and X3 = t3 and X4 = t4) = P(X1 = t1 | X2 = t2, X3 = t3, X4 = t4) × P(X2 = t2, X3 = t3, X4 = t4).

I have achieved one factor; I have written the joint PMF as the product of two things. But next, I
can repeat this: I have P(X2 = t2, X3 = t3, X4 = t4), and I can again do bracketing, conditioning and
multiplication:

P(X2 = t2, X3 = t3, X4 = t4) = P(X2 = t2 | X3 = t3, X4 = t4) × P(X3 = t3, X4 = t4),

and finally

P(X3 = t3, X4 = t4) = P(X3 = t3 | X4 = t4) × P(X4 = t4).

Putting it all together,

f_{X1X2X3X4}(t1, t2, t3, t4) = P(X1 = t1 | X2 = t2, X3 = t3, X4 = t4) × P(X2 = t2 | X3 = t3, X4 = t4) × P(X3 = t3 | X4 = t4) × P(X4 = t4).

So, notice how conditioning is helping me out; I have written the joint PMF as a product of 4
factors. The first factor is X1 = t1 given all the rest; or actually, maybe it is better to read it in
reverse: the marginal of X4 = t4; then X3 = t3 given X4 = t4; then X2 = t2 given X3 = t3, X4 = t4;
and then X1 = t1 given all of them. So, it is a very nice and simple thing. I can now write the joint
PMF in terms of conditional PMFs that I know how to calculate: 1 marginal and 3 conditional
PMFs. So, this kind of factoring is very useful to understand complicated joint PMFs.
You may have a lot of random variables, and the joint PMF may be very difficult to understand.
But you may be able to write it as a product of marginals and conditionals; maybe the conditionals
are still difficult to understand, I am not saying otherwise, but at least it gives you a sense of some
approximation, some idea of the structure. So, we will explore this idea more later on; but let us
focus on the factoring for now. The factoring itself has enough variety here.
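Because the chain-rule factoring is an identity, it can be verified mechanically at every point of any joint PMF. Here is a minimal Python sketch; the joint PMF of (X1, X2, X3, X4) below is made up purely for illustration:

```python
from fractions import Fraction

# A small hypothetical joint PMF of (X1, X2, X3, X4); any valid PMF works,
# since the chain-rule identity holds for every joint PMF.
joint = {
    (0, 0, 0, 0): Fraction(1, 4), (0, 1, 1, 0): Fraction(1, 8),
    (1, 0, 1, 1): Fraction(1, 8), (1, 1, 0, 1): Fraction(1, 4),
    (0, 0, 1, 1): Fraction(1, 8), (1, 1, 1, 0): Fraction(1, 8),
}

def prob(condition):
    """P(X_i = v for every (i, v) in condition), by summing the joint PMF."""
    return sum(p for t, p in joint.items()
               if all(t[i] == v for i, v in condition.items()))

for t, p in joint.items():
    t1, t2, t3, t4 = t
    # f(t1,t2,t3,t4) = f_{X4}(t4) * f_{X3|X4}(t3) * f_{X2|X3,X4}(t2) * f_{X1|X2,X3,X4}(t1)
    f4 = prob({3: t4})
    f3_given_4 = prob({2: t3, 3: t4}) / f4
    f2_given_34 = prob({1: t2, 2: t3, 3: t4}) / prob({2: t3, 3: t4})
    f1_given_234 = p / prob({1: t2, 2: t3, 3: t4})
    assert f4 * f3_given_4 * f2_given_34 * f1_given_234 == p

print("chain-rule factorization verified at all", len(joint), "points")
```

Exact `Fraction` arithmetic is used so the equality check is literal, with no floating-point tolerance needed.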

(Refer Slide Time: 18:01)

So, let me show you some examples of this factoring, and show how it works out in particular
cases. Let us look at the case where I want to evaluate f_{X1X2X3}(0, 0, 0). I know I can write
this as

f_{X1X2X3}(0, 0, 0) = f_{X3}(0) × f_{X2|X3=0}(0) × f_{X1|X2=0,X3=0}(0);

so, let us verify this. We know f_{X1X2X3}(0, 0, 0) = 1/9. What is f_{X3}(0)? Look at X3 = 0:
that is this, this and this, 3 possible cases, so f_{X3}(0) = 3/9. Then I have f_{X2|X3=0}(0);
let us change the color here. Given X3 = 0, X2 = 0 is this guy, and this guy is not it; so
f_{X2X3}(0, 0) = 2/9 and f_{X3}(0) = 3/9, and you get f_{X2|X3=0}(0) = (2/9)/(3/9) = 2/3. So,
let us come to the last factor; maybe I will use my next color here. X2 = 0, X3 = 0 is what I had
in green here, two cases, and X1 = 0 in one of them; so that works out to
f_{X1|X2=0,X3=0}(0) = 1/2. And you can see

f_{X3}(0) × f_{X2|X3=0}(0) × f_{X1|X2=0,X3=0}(0) = (3/9) × (2/3) × (1/2) = 1/9 = f_{X1X2X3}(0, 0, 0).

So, it is easy enough to check this factoring; it has to be true. And I hope I convinced you that you
can do it for any other value: I did it for (0, 0, 0); you can do it for (0, 1, 1) or (0, 0, 2), pick your
favorite value, and try and practice this calculation. You will see that it works; it is easy enough.
(Refer Slide Time: 20:19)

Now, here is something that is more interesting. I did it in one particular sequence, t1, t2, t3, t4;
and it turns out you can do it in any sequence, because the "and" is the same in any order. So,
instead of doing X1, X2, X3, X4, I can just reorder the "and"-ing; I can "and" in any way, since it
is the "and" of everything. If I do that and then repeat the same thing as before, I get

f_{X1X2X3X4}(t1, t2, t3, t4) = P(X4 = t4 | X1 = t1, X2 = t2, X3 = t3) × P(X3 = t3 | X1 = t1, X2 = t2) × P(X2 = t2 | X1 = t1) × P(X1 = t1).

Now I get it in this form: I can first do the marginal of X1, then the conditional of X2 given X1,
then the conditional of X3 given X1, X2, and then the conditional of X4 given X1, X2, X3. That is
also possible; it is the same thing I have written down, just done in another order.

You can do it in one more order; why stick to just 1, 2, 3, 4 or 4, 3, 2, 1? Let me do 3, 2, 1, 4, why
not; I can do it in any order I want:

f_{X1X2X3X4}(t1, t2, t3, t4) = P(X3 = t3) × P(X2 = t2 | X3 = t3) × P(X1 = t1 | X2 = t2, X3 = t3) × P(X4 = t4 | X1 = t1, X2 = t2, X3 = t3).

You can go in any order that you like, and you will get the exact same answer.
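The claim that every ordering works can be checked programmatically. A minimal Python sketch, using a small made-up joint PMF of three variables (illustrative numbers only) and trying all 3! orderings of the chain rule:

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical joint PMF of (X1, X2, X3); the ordering claim holds for any joint PMF.
joint = {
    (0, 0, 0): Fraction(1, 6), (0, 1, 1): Fraction(1, 3),
    (1, 0, 1): Fraction(1, 4), (1, 1, 0): Fraction(1, 4),
}

def marginal_prob(fixed):
    """P(X_i = v for every (i, v) in fixed)."""
    return sum(p for t, p in joint.items()
               if all(t[i] == v for i, v in fixed.items()))

def chain_product(t, order):
    """Factor f(t) along 'order': marginal of the first index, then conditionals."""
    prod, fixed = Fraction(1), {}
    for i in order:
        denom = marginal_prob(fixed) if fixed else Fraction(1)
        fixed[i] = t[i]
        prod *= marginal_prob(fixed) / denom   # one conditional (or marginal) factor
    return prod

for t, p in joint.items():
    for order in permutations(range(3)):
        assert chain_product(t, order) == p    # every ordering reproduces the joint

print("all 6 orderings reproduce the joint PMF")
```

The telescoping inside `chain_product` is exactly the bracket-condition-multiply step from the lecture, applied repeatedly.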
(Refer Slide Time: 21:34)

So, let us try that; previously I went 3, 2, 1, let me go 1, 2, 3, and let me show you how that is
done. Let us say I want to look at the same evaluation I did before,

f_{X1X2X3}(0, 0, 0) = f_{X3}(0) × f_{X2|X3=0}(0) × f_{X1|X2=0,X3=0}(0).

Just to be sure, let me check what I did before: I showed you X3, then X2 given X3, and then X1
given X2, X3. I am going to do it in reverse here: I will do X1, then X2 given X1 = 0, then X3 given
X1 = 0, X2 = 0; the same calculation, just to convince you that you will get the same answer. So,
let us do a little multicolor presentation once again. f_{X1}(0) = 5/9, since X1 = 0 in 5 of the 9
cases. Next, X2 given that: if you want to look at f_{X2|X1=0}(0), you have to look at this
possibility, and you can divide carefully. You divide by 5/9, and in the numerator you will have
f_{X1X2}(0, 0) = 3/9; so

f_{X2|X1=0}(0) = (3/9)/(5/9) = 3/5.

And then let me pick my third color for X3 given X1 = 0, X2 = 0: there are 3 cases with X1 = 0,
X2 = 0, and X3 being 0 is only one of them, the favorable case. So, you will get

f_{X3|X1=0,X2=0}(0) = 1/3,

and you can repeat this calculation more carefully. If you multiply it out,
(5/9) × (3/5) × (1/3) = 1/9, which is what you know is the answer. So, you can check this in any
other way; you can take anything else, do it in any other sequence, and you will get the same
answer. These are identities which are always true. So, alright, I would like to stop the conditional
distributions with this lecture.

Hopefully, you saw the utility of it; we will see soon enough some actual practical uses for these
kinds of ideas. But the factoring of a joint PMF into multiple factors, as marginals and conditionals,
is a very powerful idea. It helps simplify a lot of calculations, formulations, modeling and all that.
So, we will pick up from here in the next class. Thank you.
Statistics for Data Science – II
Professor. Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture No. 3.8
Multiple Random Variables: Independence of two Random Variables

(Refer Slide Time: 00:13)

Hello and welcome to this lecture. In this lecture we are going to start looking at the property of
independence of two random variables. We will soon see independence of multiple random
variables, but we will begin with independence of two. You may remember that we defined
independence of two events before: two events are independent when the probability of the
intersection of the two events is the product of the individual probabilities, P(A and B) =
P(A) × P(B). A and B being independent is the same as saying P(A|B) = P(A); and independence
is very useful in computing probabilities, as we have seen before.

Now, it turns out we can sort of extend this definition to random variables. Random variables take
outcomes to numbers, and you can define events using random variables: if I give you a random
variable X, you can say X = a, or X lies in some range; all of that becomes events. So, we will say
that random variables X and Y are independent if, whenever you define an event with X alone
(X falling in some range, X being some value, whatever, some way using just X) and an event with
Y alone (do not use X in any way, just use Y alone to define the event), those two events are always
independent as events. Then we say X and Y are independent. That is the basic, most important
definition; but it turns out there is an equivalent formulation in terms of the PMF, and that is a
very powerful and easy thing to implement for us, so we will focus mostly on this PMF-type
definition. But remember the basic idea: events defined using one random variable and events
defined using the other random variable; if they are always independent, then the random variables
themselves are independent.

Now, you can show an equivalent condition using the joint PMF f_XY of X and Y. X and Y are
independent if and only if (this goes both ways) the joint PMF is equal to the product of the
marginals: f_XY(t1, t2) = f_X(t1) × f_Y(t2) for all t1, t2; no conditional will appear here, just the
marginals. And it is also true the other way around: if f_XY(t1, t2) = f_X(t1) × f_Y(t2) for all
t1, t2, then the random variables are independent.

So, this product is just like in the events case, where P(A and B) becomes P(A) × P(B); the
intersection becomes the product. Here the joint PMF is the intersection, is it not?
P((X = t1) and (Y = t2)) is the probability that X equals t1 times the probability that Y equals t2,
P(X = t1) × P(Y = t2); a simple translation of the definition. So, notice the factoring in the
independent case versus the general case: in general you always have f_XY(t1, t2) being f_X(t1)
times the conditional.

So, in the independent case, clearly the conditional becomes the same as the marginal;
conditioning does not change the distribution. Any conditional distribution will give you the same
distribution, basically: the conditional random variable has the same distribution as the
original one.

So, the last two points are easy to remember, and you should always keep them in mind: if X and
Y are independent, then the joint PMF equals the product of the individual marginals, and the
conditional PMF equals the marginal PMF. A simple definition; let us start checking it. So, how
do you check it?
(Refer Slide Time: 03:56)

Let us see a few very simple examples. Given the joint PMF, how do you check for independence?
Here is a simple example; we will take a few more examples, and you will see it is a very
interesting problem to check if random variables are independent. So, here I have put down a joint
PMF of random variables X and Y, each taking values 0, 1; the range is very simple, and the table
is 1/4, 1/4, 1/4, 1/4. You can quickly compute the marginals here: you will get 1/2, 1/2 for X and
1/2, 1/2 for Y; and you see the product of the marginals is what you have here.

So, if you look at each entry, it is the product of this marginal and this one: (1/2) × (1/2) = 1/4;
likewise for the other entries. Maybe I should use different colors here to bring this out: notice
the color here, half into half; then pick another color, half into half; and another color here, again
half into half. So, you see how to check that the product of the marginals is equal to the joint
distribution.

So, once again, what am I checking here? I am checking f_XY(t1, t2) = f_X(t1) × f_Y(t2), for all
t1, t2. In this case it holds, so X and Y are independent; that is good.
(Refer Slide Time: 05:52)

So, let us come to the next example; notice what happens here, in contrast with the previous case,
which was independent. Here f_Y(0) is going to be 3/4 and f_Y(1) is 1/4; and f_X(0) is 1/2 plus
1/8, which is 5/8, and f_X(1) is 3/8. So, notice the product is clearly not 1/2: you see here
f_XY(0, 0) equals 1/2, and it is not equal to f_X(0) × f_Y(0), which is (5/8) × (3/4); you can write
that down there. So, clearly this is not independent. Notice how, from the joint PMF, it is easy
enough to do this calculation and check.

(Refer Slide Time: 07:02)


Let us do a more general case. So, here we saw this one was independent, and this one was not
independent. In this more general case, I will always have 1/2 for each marginal: 1/2 here, 1/2
here, 1/2 here, 1/2 here. And you can show by just calculation that if x = 1/4, then they are
independent; if x is not equal to 1/4, they are not independent. So, the simple case seems easy
enough; let us complicate things a little bit more. We will make it a slightly bigger problem: here
is a grid with 1/9 in every cell. Whenever you have a grid like this with all equal values, it looks
very tempting to say that it is going to be independent; and yes, you can check that: the marginals
are 1/3, 1/3, 1/3 and 1/3, 1/3, 1/3, and all products are just (1/3) × (1/3) = 1/9.

So, in this case, yes, independent; easy enough to see. We saw before that this is independent.
Here is another case where the values seem all over the place: 1/6, 1/12, 1/12; 1/4, 1/8, 1/8;
1/12, 1/24, 1/24. Is it independent? How do you check that? It is not 1/9 everywhere, so it is not
so easy; but let us see if we can check this. For the rows you have f_Y = 1/3 here, 1/2 here; and if
you add up this last row, 1/12 + 1/24 + 1/24, you get 1/6; you can see how I got that. Then, for
finding f_X, you have to add along the columns.

Along the columns, you would add the first column: 1/6 + 1/4 + 1/12; 1/6 is 2/12 and 1/4 is
3/12, so the sum is 6/12 = 1/2. Adding up the other columns, you get 1/2, 1/4, 1/4; check this
once again: 1/8 is 3/24, 1/12 is 2/24, and with the 1/24 the column sums to 6/24 = 1/4. And you
can check that, even though it looks very varied, every entry is the product: half into 1/3 gives
the 1/6 here; the 1/8 here is half into 1/4; the 1/12 here is 1/6 times 1/2. So, in this case also it is
independent. Even though it is not 1/9 everywhere and the joint PMF has very different values,
the marginals are also varied, and the product condition is still satisfied; so, this is also
independent.

So, the independent case may look a little different from what you are used to; you have to pay
attention a little and really check whether the product condition is satisfied or not. There are other,
simpler ways of checking in special cases, but we will come back to that later.

(Refer Slide Time: 09:47)

So, we saw before that this was independent, and this was also independent. In this case one can
very quickly check the marginals, 1/4, 3/8, 3/8 and 1/4, 3/8, 3/8, and very easily see that there is
a problem. The 0 is sort of a giveaway: if you have a non-zero value for f_X, a non-zero value for
f_Y, and a 0 in the joint PMF there, you can never get that entry as a product of those two; so,
clearly this is dependent. So, you see the 0 is a bit of a giveaway. We saw here this is independent,
this is independent, and this was dependent. And this one looks dangerously close to the
independent case; but if you add it up and check, this will also be dependent.

So, how do you check this? You have to check one particular case; let us take an easy case to
check. Let me take this one: this marginal is 1/4, this one is 1/2, and this sum is 1/8 plus 1/8 plus
1/24, which is 3/24 + 3/24 + 1/24 = 7/24. So, you see this entry is definitely not this into this; it
is not going to work out, so this also is dependent. So, it seems a bit non-trivial: as the problem
gets bigger, you have to really go in and check every possible value, and see whether the product
holds or not; at least it appears like that for us.

Maybe there are smarter methods that people can think of; but generally, dependence is
determined from the joint PMF in this fashion, with very easy steps: given the joint PMF, you
find the marginals and check whether the product of the marginals gives you the joint PMF. If it
does not, then they are not independent; if it does, then they are independent. There is a giveaway
sign here: if there are two values with non-zero marginal probability but the joint PMF is 0 there,
that is a quick giveaway that they are going to be dependent; so these are some things to look
out for.
(Refer Slide Time: 12:05)

So, in general, to summarize: if you have to show X and Y are independent, you have to verify
that the joint PMF becomes the product of the marginals for all values of t1, t2. However, to show
X and Y are dependent, it is enough to verify the inequality for some particular value of t1 and t2.
That is how independence is defined: the joint PMF has to become the product of the marginals
for all possible values of t1 and t2, all possible values that the random variables can take.

But they are dependent if, even for one value, the product formula is not satisfied. And in
particular, if you have two values in the ranges of the marginal PMFs but the joint PMF there is
0, that is a very easy way to find out that the random variables are dependent; they actually are
dependent. So, that was summarizing independence.
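This summary translates directly into a small checker. A minimal Python sketch, with a joint PMF represented as a dict from (t1, t2) to probability and missing entries meaning probability 0 (exact fractions avoid rounding issues):

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check f_XY(t1, t2) == f_X(t1) * f_Y(t2) for all (t1, t2)."""
    xs = {t1 for t1, _ in joint}
    ys = {t2 for _, t2 in joint}
    fX = {x: sum(joint.get((x, y), Fraction(0)) for y in ys) for x in xs}
    fY = {y: sum(joint.get((x, y), Fraction(0)) for x in xs) for y in ys}
    return all(joint.get((x, y), Fraction(0)) == fX[x] * fY[y]
               for x, y in product(xs, ys))

# Uniform 2x2 table: independent.
indep = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

# A zero entry with non-zero marginals: the quick giveaway for dependence.
dep = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 4), (1, 0): Fraction(1, 4)}

print(is_independent(indep), is_independent(dep))   # prints: True False
```

Note that proving independence needs the product check at every (t1, t2), while proving dependence only needs one failing pair, which is why `all(...)` short-circuits as soon as a counterexample is found.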
(Refer Slide Time: 13:00)

So, let us see a couple of examples. You have this example here, where X is the digit in the units
place, so X is uniform on 0, 1, 2, all the way to 9; X has got that marginal. And Y is the remainder
obtained when the number is divided by 4; so Y takes values, again uniformly, in 0, 1, 2, 3. So,
here is a case where you can quickly find that this is 1/4 and this is 1/10. But what is the joint
probability of X taking value 1 and Y taking value 0?

So, this is the probability that the number ends in 1 but gives you a remainder of 0 when divided
by 4. We have seen this before: this is 0. So, it is not equal to f_X(1), which is 1/10, times f_Y(0),
which is 1/4. This is a very telltale condition, easy to see, that X and Y are dependent. In general,
if the two things seem like they are sort of tied together, it is going to be dependent. You may
guess that, but you have to really prove it in this fashion; so this is a very simple example to
see.
see.
(Refer Slide Time: 14:31)

So, let me give you a more practical example and conclude with that, and let you think about how
you would do this. Come back to the example of the IPL powerplay over; we have been talking
about it over and over again. Once again I define these two random variables: X is the number of
runs in the over, Y is the number of wickets in the over. What are you going to guess: are X and
Y independent? Mostly it looks like no. You may want to look at data and see how to justify it
with data; we can think of that later.

But these are the sort of questions that you will be faced with when you are modeling a real-life
scenario with data. Data will come in, maybe you will have some data; then you will be interested
in some objects, some random variables that you define, and then you have to think about whether
they are independent or not. Because what is so great if they are independent? The marginals are
enough: the joint distribution gets determined by the marginals, and we saw already that in many
cases marginals are much easier to find than the actual joint PMFs.

You may not get all kinds of data for the joint PMF, so being independent is a great thing; it is
very useful in modeling. It is often assumed in models without any great proof, and that is a
dangerous assumption sometimes; so you have to really check whether they are independent. So,
maybe one of the interesting things you can do is take the IPL data and actually try and see if
there is a reasonable way to find this joint distribution of X, Y; and does the data suggest X and
Y are dependent or independent? It is something interesting that one can think of verifying.
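In the spirit of that exercise, here is a sketch of how one might tabulate an empirical joint PMF from over-by-over data. The (runs, wickets) pairs below are made-up stand-ins, not real IPL data; with real data one would compare the empirical joint with the product of the empirical marginals, keeping in mind that sampling noise means exact equality will not hold:

```python
from collections import Counter

# Hypothetical (runs, wickets) pairs for a few powerplay overs; made-up numbers
# standing in for real IPL data, just to show the tabulation recipe.
overs = [(4, 0), (7, 0), (12, 0), (2, 1), (6, 1),
         (4, 0), (9, 0), (1, 2), (4, 1), (7, 0)]
n = len(overs)

counts = Counter(overs)
joint = {xy: c / n for xy, c in counts.items()}   # empirical joint PMF
fX = Counter(x for x, _ in overs)                  # runs marginal (counts)
fY = Counter(y for _, y in overs)                  # wickets marginal (counts)

for (x, y), p in sorted(joint.items()):
    # empirical joint versus product of empirical marginals
    print((x, y), round(p, 2), round((fX[x] / n) * (fY[y] / n), 2))
```

With a real dataset, the decision "dependent or independent" would need a statistical test on these empirical frequencies rather than an exact comparison, which is part of what makes the assignment interesting.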

We will take this up as one of the assignments this week. Thank you very
much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 3.9

(Refer Slide Time: 0:13)

Hello and welcome to this lecture. So far, we have been looking at independence of two random
variables, and we saw quite a few examples and calculations involving that. Now we are going to
naturally move from 2 to 3 and 4 and higher numbers; so, let us look at how to think of
independence for more than 2 random variables. Here is the definition; the definition itself is quite
okay. I am not being very precise, I have just vaguely said what it is; it is sort of an extension of
the independence for 2 random variables.

So, suppose you have n random variables and you want to say when they are going to be
independent: basically, if you define events, each using a different one of the random variables,
then those events have to be independent. An easier way to think of independence in the discrete
case is this formula: you take the joint PMF of all the n random variables (these are all discrete,
so it is easy to write it down like that), and it has to factor into the product of the marginals.

So, it is a simple enough criterion; once you get something like this, then everything else about
independence follows. Hopefully you see how this criterion extends the 2 random variable case:
f_XY = f_X × f_Y simply becomes f_{X1...Xn} = f_{X1} × f_{X2} × ... × f_{Xn}; it is a natural
definition. Once again, the joint PMF equals the product of the marginals, and conditional PMFs
equal unconditional PMFs; so you can forget about the conditioning if you condition on anything
within that set.

If you take any subset of independent random variables, they are also independent. And you will
quickly see how independent random variables are very easy to work with: any time you have
probabilities involving the individual random variables, you can multiply the probabilities
together, and the PMFs multiply like this with no conditioning et cetera. It is very easy to work with.

(Refer Slide Time: 2:09)

So, let us see an example; as usual, we jump immediately into an example. This is our familiar
tossing-a-fair-coin-thrice example; we know the joint PMF has 8 different possibilities, each with
a value of 1/8. You can see that each Xi, if you find the marginal (we have done this before), is
basically uniform on {0, 1}; the marginal is also very easy to write down. And you can very easily
check that, for every t1, t2, t3, the joint PMF, 1/8, equals the product of the marginals,
(1/2) × (1/2) × (1/2). So, clearly everything is independent here; it works out very easily. So,
this is how you determine independence from the joint PMF.
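The same check can be scripted for the three coin tosses; a minimal Python sketch:

```python
from fractions import Fraction
from itertools import product

# Joint PMF of three fair coin tosses: 8 equally likely triples.
joint = {t: Fraction(1, 8) for t in product((0, 1), repeat=3)}

def marginal(i, v):
    """Marginal probability P(X_i = v), obtained by summing out the rest."""
    return sum(p for t, p in joint.items() if t[i] == v)

# Every marginal is uniform on {0, 1}, and the product reproduces the joint.
for t in joint:
    prod = Fraction(1)
    for i, v in enumerate(t):
        prod *= marginal(i, v)          # (1/2) * (1/2) * (1/2)
    assert joint[t] == prod             # equals 1/8 for every triple

print("X1, X2, X3 are independent")
```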


(Refer Slide Time: 2:54)

But of course, not all joint PMFs result in a situation where the random variables end up being
independent. Here is another example that we have seen before: you have a 3-digit number, and
we define these 3 random variables. X is the first digit from the left, the hundreds place digit. Y is
the number modulo 2, so 0 or 1: if the number itself is even, then Y is 0; if the number is odd, Y is
1. And Z is the first digit from the right, the units place digit.

So, we saw that the marginal distributions are all uniform here: X is uniform over 0 to 9, Y is
uniform over 0, 1, and Z is uniform over 0 to 9. But now let us look at independence of pairs of
random variables first. X and Z, you will see, are independent. You can quickly check; I am not
going into the detailed calculation here, I will leave it as an exercise for you: the joint PMF of X
and Z alone will be 1/100.

So, you can check it is easy to calculate the probability that X equals 0 and Z equals 0 you will see
1 1 1
it is 100 and that is exactly the product of the marginal, is not it? 10 × 10 you get that the prob,

the prob, the equality satisfied. So, X and Z are independent. Likewise, you can show X and Y are
also independent. But if you look at Y and Z you will see that they are not independent.

Once again remember how do you check something is not independent? You just have to find one
counter example, one 𝑡1 , 𝑡2 such that the 𝑓𝑋𝑌 of 𝑡1 , 𝑡2 is not equal to the product of the marginals.
So, here is one example I have given; Y = 1 and Z = 2. So, notice what happens here if Y is 1, then
the number has to be odd and if Z is 2, the number is even. So, those two together are never going
to happen. So, 𝑓𝑋𝑌 of 𝑓𝑌𝑍 of 1, 2 is actually equals to 0.

1 1 1
On the other hand the product of the marginals will work out to 2 × 10 be which is 20. So, in this
1
case you have this problem with is not equal to 0 clearly. So, this is not independent. So, you
20

notice how this is working out. 1 and 2 can never happen together but 1 can happen by itself, 2 can
happen by itself. There is no problem there but 1 and 2 cannot happen together. So, this is how
you check Y and Z are not independent. So, if Y and Z are not independent clearly X, Y and Z are
also not independent. So, that is a not independent case. So, these are dependent at random
variables.

(Refer Slide Time: 5:39)

Here is another example, which I am going to call even parity. Here is a joint PMF for you:
(X1, X2, X3) takes 4 possible values together, 0 0 0, 0 1 1, 1 0 1, and 1 1 0 (there is a mistake on
the slide, which shows 1 1 1; it ought to be 1 1 0 for even parity). The reason this is called even
parity is that in every case the number of 1s in (X1, X2, X3) is even. So, it is just a name for the
thing.

But you can see why this is interesting: you can once again check independence in this case. What
will happen here is that if you take any pair, X1 and X2 will be independent, X2 and X3 will be
independent, and X1 and X3 will also be independent. So, you have 3 random variables here;
contrast with the previous case. In the previous case you had X, Y, Z, and we saw Y and Z were
not independent; so clearly all 3 were not independent.

Look at how interesting this case is: X1 and X2 are independent, X2 and X3 are independent, and
X1 and X3 are also independent; you can check each pair individually. If you look at (X1, X2),
you have 0 0, 0 1, 1 0, 1 1 occurring each with probability 1/4; that is independent, is it not, we
know that. For (X1, X3) you again get 0 0, 0 1, 1 1 and 1 0 (I made a typo here on the slide), also
independent. And if you look at (X2, X3) together, you again get 0 0, 1 1, 0 1, 1 0: all 4
possibilities with probability 1/4 each.
1 0. Again all 4 possibilities 4 probability each.

So, all pairs of random variables are independent but the 3 random variables together are not
independent they are dependent, why is that? You can look at the case 0 0 1 which does not appear
in this table which means the probability of those 3 occurring is 0, is not it? 0 0 1 0. But the product
1 1 1 1
of the marginals will work out to 2 × 2 × 2 equals 8. So, this is a violation of the condition that

is needed. So, all 3 together are dependent but pair wise they are independent.

So, this will always happen with the even parity case. Why is that? If you think about it, when you
look at all 3 together, X1 and X2 fully determine what exactly X3 has to be; so in all those cases
you have dependencies. That is what is important. I am just trying to show you examples of how
dependence and independence can occur in so many different cases; you have to be very careful
and check the condition carefully to see whether they are dependent or independent.
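This pairwise-but-not-mutual behaviour can be verified directly; a minimal Python sketch of the even parity example (using the corrected fourth triple 1 1 0):

```python
from fractions import Fraction
from itertools import combinations

# Even-parity joint PMF: four equally likely triples, each with an even number of 1s.
joint = {(0, 0, 0): Fraction(1, 4), (0, 1, 1): Fraction(1, 4),
         (1, 0, 1): Fraction(1, 4), (1, 1, 0): Fraction(1, 4)}

def prob(fixed):
    """P(X_i = v for every (i, v) in fixed)."""
    return sum(p for t, p in joint.items()
               if all(t[i] == v for i, v in fixed.items()))

# Every pair is independent ...
for i, j in combinations(range(3), 2):
    for vi in (0, 1):
        for vj in (0, 1):
            assert prob({i: vi, j: vj}) == prob({i: vi}) * prob({j: vj})

# ... but the three together are not: (0, 0, 1) is impossible, while the
# product of marginals is (1/2) * (1/2) * (1/2) = 1/8.
assert prob({0: 0, 1: 0, 2: 1}) == 0
assert prob({0: 0}) * prob({1: 0}) * prob({2: 1}) == Fraction(1, 8)

print("pairwise independent, but not mutually independent")
```

The single impossible triple is exactly the counterexample that breaks mutual independence, even though no pair ever reveals it.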

(Refer Slide Time: 8:25)


Now, the easiest case is the so-called i.i.d. case; it is very common, it occurs in many situations,
and we often use it: the Independent and Identically Distributed case. You can have any number
of random variables, but if they are i.i.d., then you are in a good situation, because you can do a
lot of calculations easily; I hope to convince you of that in the rest of the slides. But let us look at
the definition first: n random variables are said to be independent and identically distributed if,
first of all, they are independent, and then the marginal PMFs are identical.

So, what will happen to the joint PMF? You can see that if they are independent, the joint PMF is
the product of the marginals, and the marginals are all identical; so you just take one of the
marginals and raise it to the power n, and you get the joint PMF. Everything is easy in the
independent and identical, i.i.d., case. So, why does it occur often in practice? Often, when you
do statistics in real life, you have to imagine some trial getting repeated over and over again, and
you want to assume that the distribution is identical every time it is repeated.

Only when it repeats like that can you hope to gain some information from it; if every time
something completely random and different happens, you cannot do much about it. So, you have
to assume that repeated trials of the same experiment are coming up if you want to pick up
something. That is why this happens all the time; you saw this with the Bernoulli trials before,
where again you assume the same situation gets repeated again and again. So, independent and
identical often happens in practice; I have given you a few examples.
So, we will use a notation for this; it is a very simple notation. I will say X1, X2, ..., Xn ~ i.i.d. X.
What does this mean? First of all, there is a random variable X with PMF f_X; that is the common
distribution here. And the n random variables X1, X2, ..., Xn are all i.i.d. according to the same
distribution as X; they will all have the same marginal, and that marginal will be the marginal of
X. So, that is the way to think of this notation.

So, it is just a convenient notation. Sometimes, when I do not want to use a random variable X, I might
instead say 𝑋1, 𝑋2, … i.i.d. 𝑓𝑋, giving the PMF itself.
So, these are convenient shorthand notations to describe what the distribution of n i.i.d. random
variables is. Once again, this is an important definition; it will show up again and
again when we look at statistical scenarios.
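As a quick illustration of the definition, here is a minimal sketch in plain Python; the PMF values used are hypothetical, not from the lecture. Under the i.i.d. assumption, the joint PMF is just the common marginal multiplied n times:

```python
# Joint PMF of n i.i.d. random variables: product of n identical marginals.
fX = {1: 0.5, 2: 0.3, 3: 0.2}   # hypothetical common PMF f_X

def joint_pmf_iid(values, fX):
    """f(x1,...,xn) = fX(x1) * ... * fX(xn) under the i.i.d. assumption."""
    p = 1.0
    for v in values:
        p *= fX.get(v, 0.0)
    return p

print(joint_pmf_iid((1, 1, 2), fX))   # 0.5 * 0.5 * 0.3, i.e. about 0.075
```

Any value outside the common range gives a factor of 0, so the joint PMF vanishes there, exactly as the definition says.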
(Refer Slide Time: 11:25)

So, let us jump into one such statistical scenario and see if we can do a probability calculation.
I showed you that in the i.i.d. case many calculations are easy: because the random variables are
independent, events defined with them are independent, so you can multiply probabilities together, and
everything becomes easy. So, let us take a few problems, one problem at least to begin with, and see
how we can do it.

So, here there are n random variables 𝑋1 to 𝑋𝑛, i.i.d. with Geometric(p) distribution. You
remember the Geometric(p) distribution? It takes values 1, 2, 3, and so on; I keep forgetting
that it starts with 1, not 0. This random variable takes the value 1 with probability p, the value
2 with probability (1 − 𝑝)𝑝, the value 3 with probability (1 − 𝑝)²𝑝, and so on.

So, the probability that X equals k is (1 − 𝑝)^{𝑘−1} 𝑝. So, X is geometric. The question that is
asked is: what is the probability that all of these random variables are larger than some positive
integer j? That is, what is the probability that 𝑋1 is greater than j, 𝑋2 is greater than j (remember,
the comma here means "and"), all the way till 𝑋𝑛 greater than j?
Normally, you might say you will have to look at the joint PMF, look at all the favourable
cases, and add up all of their probabilities.

But remember, these are i.i.d., so these are all independent random variables. When they are
independent, the events defined with them are also independent, so you simply write it as the
probability that 𝑋1 > j times the probability that 𝑋2 > j, and so on; you can take the product of these probabilities and
work on. Not only that, 𝑋1 to 𝑋𝑛 are all identically distributed, the same as
this X here, which is geometric. So, all of these factors are equal to P(X > j), and if
you multiply n of them, you just get that quantity raised to the power n. Look at how easy it is.

So, I am defining some complicated condition involving n random variables, but just because they are i.i.d.,
it simply reduces to this case. So, what is the probability that X > j? It is

P(X > j) = ∑_{k=j+1}^{∞} (1 − p)^{k−1} p.

This is a geometric series with first term (1 − p)^j p and common ratio (1 − p), so the sum is
(1 − p)^j p / (1 − (1 − p)), and that is just (1 − p)^j. So, using this back here, you will just get

P(𝑋1 > j, …, 𝑋𝑛 > j) = ((1 − p)^j)^n.

So, it is as easy as that. Look at how easily we got to the answer using this property
that they are independent and identically distributed.
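This calculation can be checked numerically; below is a small sketch in plain Python (the values of p, j, and n are arbitrary choices for illustration, not from the lecture):

```python
# Check P(X1 > j, ..., Xn > j) = ((1-p)^j)^n for X1,...,Xn i.i.d. Geometric(p)
# with support {1, 2, 3, ...}.
p, j, n = 0.3, 4, 5

# P(X > j): sum the PMF (1-p)^(k-1) * p over k = j+1, j+2, ... (truncated tail).
tail = sum((1 - p) ** (k - 1) * p for k in range(j + 1, 500))
assert abs(tail - (1 - p) ** j) < 1e-12   # matches the closed form (1-p)^j

# Independence: the joint probability is the product of the n identical tails.
prob_all_large = tail ** n
print(prob_all_large)                     # equals ((1-p)^j)^n
```

Truncating the infinite tail at 500 terms is harmless here because the remaining terms are astronomically small.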

(Refer Slide Time: 15:04)

So, here is another question which would seemingly be slightly complicated, but once again we will
see the independence assumption gives you the answer quite easily; this one involves slightly more
involved calculations. Let us see if we can do this. So, here is a random variable X, and I like to
write the distribution like this; I think it is a very compact way of representing a
distribution in the general case instead of doing tables. So, X takes values 0, 1, 2, 3, 4: the value
0 with probability 1/2, 1 with probability 1/4, 2 with probability 1/8, 3 with probability 1/16,
and 4 with probability 1/16.
And 𝑋1 to 𝑋𝑛 are said to be i.i.d. samples with the distribution of X; that is what "i.i.d. X"
is saying. So, what is the probability that 4 is missing in the sample? The first
question asks for the probability that 𝑋1 is not equal to 4, 𝑋2 is not equal to 4, and so on till 𝑋𝑛 is not equal
to 4. If this were something else, we would be totally confused, but this is i.i.d.,
is it not? With i.i.d., I simply know that this is the probability that X is not equal to 4, raised to the
power n. I do not even have to go through the complicated steps of coming to this; I know it is
always true.

So, what is the probability that X is not equal to 4? It is 1 minus the probability that X = 4, so the answer
just works out to (15/16)^n. See how easy it was to get to this answer. Let us look at the next one: what is the
probability that 4 appears exactly once? Now, this is a little bit more confusing, because 4 could
appear in 𝑋1 and not appear elsewhere, or it could appear in 𝑋2 and not appear elsewhere;
there are multiple possibilities.

So, this "exactly once" needs more cases. This is the probability that 𝑋1 = 4, 𝑋2 ≠ 4, …, 𝑋𝑛 ≠ 4,
all jointly, or 𝑋1 ≠ 4, 𝑋2 = 4, 𝑋3 ≠ 4, …, 𝑋𝑛 ≠ 4, and likewise it can
keep going till the last case, where 𝑋1 ≠ 4, and so on till 𝑋_{𝑛−1} ≠ 4, and then 𝑋𝑛 = 4.

So, luckily, every term here in the summation has the same value, and what is that value?
Every term is P(X = 4) × P(X ≠ 4)^{𝑛−1}, and there are n such terms. Notice
how independent and identically distributed helps you: the answer is

n × (1/16) × (15/16)^{𝑛−1}.

So, the answer works out to something that looks interesting.

There are other ways to do this very quickly if you want to; I mean, you can see that it will be
binomial and do this very easily. So, what is the probability that 3 and 4 appear
at least once in the samples? Now we have to do a little bit more thinking; there are multiple
ways to do this problem. Once again, you can write it as an intersection of two events and then you can
do various things with that. But let us repeat this method: 3 and 4 have to appear at least once, and
this "at least once" complicates matters. If each were to appear exactly once, then you could do a
calculation like the previous one, but this is "at least once".

So, whenever it is "at least once", 3 could appear once, 3 could appear twice, 4 could appear once, 4
could appear twice; it looks like all sorts of possibilities are there. So, it is a little bit more twisted:
the required probability is P(3 at least once ∩ 4 at least once). So, how do we go about doing this? One of the
things that is easy to calculate here is the complement of "3 at least once". Let me call "3 appears
at least once" the event A, and "4 appears at least once" the event B. It
turns out the probability of A complement is easy to find. What is A complement? 3 never
appears. Now, that is easy; we have seen this before: it is (15/16)^n. Likewise, you can also find the

probability of B complement: the complement of "4 at least once" is "4 never appears", so that is also
(15/16)^n. Now, I need A intersect B, and A intersect B is the complement of (A complement union B complement).
So, I need to find the probability of A complement union B complement, for which I
need A complement intersect B complement. So, I will find the probability of A complement
intersect B complement. What is A complement intersect B complement? Think about this. This
will happen if 3 never appears and 4 never appears, and if you think about it, this probability
will be (14/16)^n. So, think about why this is true.

Each 𝑋ᵢ cannot belong to {3, 4}, so it is the probability that X
is not equal to 3 or 4, raised to the power n; X should be 0 or 1 or 2, and that probability is 14/16.

So, this gives you the probability of A complement union B complement: this plus this minus
this, is it not? So, it is 2 × (15/16)^n − (14/16)^n, and then you have the probability of A intersect B, which is
the complement of (A complement union B complement), that is 1 − 2 × (15/16)^n + (14/16)^n.

So, I had quite a few reasons why I put this problem out. It is a bit of a difficult problem; it
is not very easy, particularly the third part. But I wanted to convince you that even though you
might think that once you come to PMFs and CDFs and random variables you can forget all about
basic probability, notice how basic probability comes in and helps you. A good
understanding of the axioms, writing events in terms of letters, translating the English
description of the events into A and B, and then manipulating them with unions, intersections, and
complements really helps you solve problems.
So, we got to a complicated problem like "3 and 4 appear at least once in the samples". It seems
like a complicated calculation, but if you break it down into what it is, write it as A and B,
go to A or B, use the "not"s in the middle, write the intersections and
complements, and go back and forth, magically you get to the correct answer.

So, hopefully, at least parts 1 and 2 are very simple; part 3 is a little bit more complicated. Maybe
there are other ways to do it which are interesting to you; I welcome you to try them, but this is
one way in which you can use the axioms carefully and get to the answer. So, anyway, the
moral of the story is: in most cases when you have i.i.d. scenarios, you can calculate probabilities
like this without too much trouble by using basic ideas.
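All three answers can be verified by brute-force enumeration for a small n; this is a sketch in plain Python, with n = 4 chosen arbitrarily for illustration:

```python
# Verify the three answers by enumerating all n-tuples from the support of X,
# weighting each tuple by the product of marginals (the i.i.d. joint PMF).
from itertools import product

fX = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/16, 4: 1/16}
n = 4

def prob(event):
    """Total probability of tuples (x1, ..., xn) satisfying `event`."""
    total = 0.0
    for xs in product(fX, repeat=n):
        if event(xs):
            p = 1.0
            for x in xs:
                p *= fX[x]
            total += p
    return total

# Part 1: 4 missing; Part 2: 4 exactly once; Part 3: 3 and 4 each at least once.
assert abs(prob(lambda xs: 4 not in xs) - (15/16)**n) < 1e-12
assert abs(prob(lambda xs: xs.count(4) == 1)
           - n * (1/16) * (15/16)**(n - 1)) < 1e-12
assert abs(prob(lambda xs: 3 in xs and 4 in xs)
           - (1 - 2*(15/16)**n + (14/16)**n)) < 1e-12
print("all three formulas check out")
```

Enumeration is exact here (5^n tuples), so any mismatch with the closed-form answers would show up immediately.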

(Refer Slide Time: 23:27)

So, the next problem that I would like to do in this area of independence is this
wonderful property of the geometric PMF, the so-called memoryless property. Let us see it as
a problem and solve it. The first part is something we have seen before: what is the probability
that X > n when X is geometric? It is simply ∑_{k=n+1}^{∞} (1 − p)^{k−1} p. I did it just now, but let
us see it once again. This is (1 − p)^n p + (1 − p)^{n+1} p + ⋯, a geometric
series with common ratio (1 − p) and first term (1 − p)^n p. So, you divide by 1 − (1 − p) = p, and you simply
get (1 − p)^n. So, that is what this is.

Now, the second question asks you to find the probability that X > 𝑚 + 𝑛 given X > 𝑚. This is
a conditional probability, and conditional probability is P(A | B) = P(A ∩ B) / P(B).
What is {X > m + n} intersect {X > m}? Both have to be satisfied,
which means that is just {X > m + n}, is it not?

So, X needs to be greater than m + n, and X needs to be greater than m; but if it is greater
than m + n, it is already greater than m, so you do not need to have that other condition. So,
you can see the intersection is actually the event with the greater of the two thresholds, X > m + n. So,
now you reach this nice formula: (1 − p)^{m+n} divided by (1 − p)^m, which is once again (1 − p)^n.

So, this kind of relationship, let me just box it up: the probability that
X > 𝑚 + 𝑛 given X > m is equal to the probability that X > n. So, this is called the memoryless
property of the geometric PMF. Remember, what is the geometric distribution? You keep, say, tossing
a coin, and it is the number of tosses needed to get the first head; that is the geometric PMF.

What this tells you is: given that there were, say, 1000 tosses in which you did not get heads, and then
you ask what is the probability that you have to wait for another 100, it is exactly the same as starting
at the zeroth toss and asking what is the probability that you will have to wait for 100 tosses to see
the first head. So, that is the sort of memoryless property: whatever happened so far is irrelevant;
given that I am here, given that already so many tosses have passed, you still see the same distribution
for the future. So, that is called the memoryless property of geometric, and it is worth remembering. It
is a pretty famous property.
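The memoryless property can also be seen in simulation; below is a sketch using Python's standard random module (the values of p, m, and n are arbitrary illustrative choices):

```python
# Empirical check of P(X > m+n | X > m) = P(X > n) for X ~ Geometric(p).
import random
random.seed(0)

def geometric(p):
    """Tosses up to and including the first head of a p-coin."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, m, n, trials = 0.25, 7, 3, 200_000
samples = [geometric(p) for _ in range(trials)]

p_gt_n = sum(x > n for x in samples) / trials          # estimate of P(X > n)
past_m = [x for x in samples if x > m]                 # condition on X > m
cond = sum(x > m + n for x in past_m) / len(past_m)    # P(X > m+n | X > m)

print(p_gt_n, cond)   # both close to (1 - p)^n = 0.75^3, about 0.42
```

The two empirical numbers agree up to sampling noise, which is exactly what the boxed identity predicts.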

How do you use these properties? For instance, in a real scenario, you may want to model some
random occurrence, and when would you go to geometric? When you observe a property like this:
no matter how long you have been observing, the distribution of how much longer the event
will take beyond that point is the same as it was at the start. If you see that, then
maybe geometric is a good model. So, we will stop here with independence, independence of
multiple random variables, i.i.d. and all that, and move on to the next topic in the next lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 3.10
Functions of random variables
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. We are going to be talking about functions of random variables.
Quite a while back, we saw functions of one random variable, and we saw that it was very easy to
find the PMF of a function of one random variable. Now, that we know how to deal with multiple
random variables, joint PMFs and all that, we are going to now start looking at functions of
multiple random variables and how to deal with them.
(Refer Slide Time: 0:40)
So, let us once again begin with a very simple example just to show you what these things will look
like. Here is a situation where you have thrown a die twice. So, you have 𝑋1, 𝑋2: 𝑋1 is the first
number and 𝑋2 is the second number that you see. What is being asked is the probability that the sum
of the two numbers seen is 6. We have solved this kind of problem before; we know how to find the
probability. But the next question asks you about the PMF of the sum.
So, think of the sum: what is it? The sum of the two numbers is actually 𝑋1 + 𝑋2;
you can write it like this. Even though in this simple problem you may be able to write
down all the possibilities and argue for the probability of the sum of the numbers being 6,
it is useful to think of the sum as 𝑋1 + 𝑋2. Here, this quantity is a function of 𝑋1 and 𝑋2, and you
are able to think of 𝑋1 + 𝑋2 itself as a random variable; maybe you can let it be 𝑆. So, let us
say 𝑆 = 𝑋1 + 𝑋2.
Now, since this 𝑆 is a function, we can quickly identify the range: each 𝑋ᵢ has range 1 to 6, so 𝑆
will have the range {2, 3, 4, …, 12}. And you can try to find probabilities for 𝑆. If you want to
find the PMF: when will the sum be 2? 𝑋1 has to be 1 and 𝑋2 has to be 1; that is the only
case, so 𝑃(𝑆 = 2) = 1/36. When will the sum be 3? (1, 2) or (2, 1), so 𝑃(𝑆 = 3) = 2/36. When will the
sum be 4? It could be (1, 3), (2, 2), (3, 1), so 𝑃(𝑆 = 4) = 3/36, and so on. So, you can find the
PMF in this fashion.
So, you are going back to the basics of the experiment and finding the PMF, but think of it this
way: I have actually been given the PMF of 𝑋1 and 𝑋2, I know the PMF of 𝑋1 and 𝑋2, and I am thinking
of 𝑆 as a function of 𝑋1 and 𝑋2. So, this is a simple example to see how functions can show up
very naturally.
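The PMF of S can be tabulated mechanically; here is a small sketch in plain Python using exact fractions:

```python
# PMF of S = X1 + X2 for two fair dice, by counting equally likely outcomes.
from fractions import Fraction
from itertools import product

f_S = {}
for a, b in product(range(1, 7), repeat=2):
    f_S[a + b] = f_S.get(a + b, Fraction(0)) + Fraction(1, 36)

print(f_S[6])   # 5/36: outcomes (1,5), (2,4), (3,3), (4,2), (5,1)
```

Using `Fraction` keeps the probabilities exact, so the PMF sums to exactly 1.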
(Refer Slide Time: 3:05)
So, here is another problem: maybe you have a rectangle, and its length and breadth
are random. The distribution is defined conditionally: instead of giving you the
joint PMF, they give you the marginal of 𝐿, which is uniform on {5, 7, 9, 11}, and then you are
given the conditional: given 𝐿 = 𝑙, 𝐵 is uniform on {𝑙 − 1, 𝑙 − 2, 𝑙 − 3}. We know how to deal with
this: given the marginal and the conditional, we know how to find the joint, is it not? There is
an easy way: you multiply the conditional and the marginal to get the joint PMF.

So, we know this can be done; this is okay. So, you can find the joint distribution of 𝐿 and
𝐵. If I have to write it down, let me just write this down: 𝐿 and then 𝐵, is it not?
𝐿 could be 5, and then for 5, 𝐵 can be 4, 3, and 2. And 𝐿 could be 7, and for 7, 𝐵 can be 6, 5, 4; so,
notice 𝐵 = 4 shows up here again. Then for 9, 𝐵 can be 8, 7, 6, and for 11, 𝐵 can be 10, 9, 8, and for
each pair you can write a probability.
So, writing 𝐿 as 𝑡1 and 𝐵 as 𝑡2, the joint PMF is 𝑓_{𝐿𝐵}(𝑡1, 𝑡2). Since 𝐿 is uniform,
you can just multiply the two probabilities, the conditional times the marginal, so it is (1/4) × (1/3). So, you
will just get 1/12 throughout. You can think about it; you will get 1/12 for each of the 12 pairs.
So, now what is the area? Area = 𝐿𝐵, is it not? So, you just multiply; it is just 𝑡1 𝑡2.

So, you multiply 𝑡1 and 𝑡2: for 𝐿 = 5 the areas are 20, 15, 10; for 𝐿 = 7 they are 42, 35, 28; for 𝐿 = 9
they are 72, 63, 54; and for 𝐿 = 11 they are 110, 99, 88. So, there are 12
different values for the area, and the area is going to be uniform on this range:
𝑃(𝐴𝑟𝑒𝑎 = 20) = 1/12, 𝑃(𝐴𝑟𝑒𝑎 = 72) = 1/12, and so on. So, you saw how a function of two random variables
shows up naturally in many examples: if you think of the length and breadth of the rectangle as
random quantities, the area becomes a function of those two random variables. So, it is a simple
example; you can think of other examples.
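The same bookkeeping can be done programmatically; a sketch in plain Python that builds the joint PMF as marginal times conditional, then the PMF of the area:

```python
# Joint PMF of (L, B) from marginal x conditional, then PMF of Area = L * B.
from fractions import Fraction

f_area = {}
for l in (5, 7, 9, 11):                      # L uniform on {5, 7, 9, 11}
    for b in (l - 1, l - 2, l - 3):          # B | L = l uniform on {l-1, l-2, l-3}
        p = Fraction(1, 4) * Fraction(1, 3)  # marginal * conditional = 1/12
        f_area[l * b] = f_area.get(l * b, Fraction(0)) + p

print(sorted(f_area))   # 12 distinct areas, each with probability 1/12
```

Because all 12 area values turn out distinct, the area is uniform over them, matching the conclusion above.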
(Refer Slide Time: 6:23)
So, how do you generally deal with functions? Suppose I have 𝑛 random variables 𝑋1 , 𝑋2 , … , 𝑋𝑛
with a joint PMF 𝑓_{𝑋1 𝑋2 …𝑋𝑛 }, and let us say the range is given to you
for every random variable. Then I define a function from the Cartesian product of these ranges
to the real line; that is my function 𝑔. So, if (𝑋1 , … , 𝑋𝑛 ) can possibly take the values (𝑡1 , 𝑡2 , … , 𝑡𝑛 ), I
can think of 𝑔(𝑡1 , 𝑡2 , … , 𝑡𝑛 ). So, that is possible.
So, I can feed in these random variables as input into this function, so I get a function of random
variables which itself will be a random variable. So, that is the first important thing, I have sort of
assumed it in this definition. You can see why that should be true 𝑋1 , 𝑋2 , … , 𝑋𝑛 are random
variables, if you take a function 𝑔 and then so say 𝑔(𝑋1 , 𝑋2 , … , 𝑋𝑛 ), it is very natural that it becomes
a random variable. So, that is true. So, if you have a random variable like that, you can define a
PMF for it.
And the PMF of the function is evaluated from the joint PMF using
this formula: you simply sum over all tuples (𝑡1 , 𝑡2 , … , 𝑡𝑛 ) that give you 𝑔(𝑡1 , 𝑡2 , … , 𝑡𝑛 ) = 𝑡.
So, you sort of do a 𝑔 inverse of 𝑡; this is like a 𝑔^{−1} (𝑡): all the
points (𝑡1 , 𝑡2 , … , 𝑡𝑛 ) which will give you 𝑡 if you apply 𝑔 on them. And then you just evaluate the
joint PMF at all those points and add it all up together.
So, it is a very simple definition: 𝑃(𝑔(𝑋1 , 𝑋2 , … , 𝑋𝑛 ) = 𝑡) is basically the probability that 𝑋1 , 𝑋2 , … , 𝑋𝑛
take values that give you 𝑔 = 𝑡; that is what it is. So, you just add up the joint PMF over those points.
It is very simple and direct formula. So, you can extend it to joint PMF of two functions 𝑔 and ℎ,
I am not going to write it down for you, I will show you in an example how that is done. So, for
small problems, if there are small tables with joint PMF, you can use this formula directly.
(Refer Slide Time: 8:28)
So, let me show you this method directly with an example for a small table. So,
here is an example. Let us say the table columns are 𝑡1 , 𝑡2 , 𝑡3 , and let us say the first function
I am going to take is 𝑔(𝑋1 , 𝑋2 , 𝑋3 ) = 𝑋1 + 𝑋2 + 𝑋3 ; a simple case first. So, the
way I usually like to do it is: whatever function I have here, I simply add another
column for it. So, for this table the column of 𝑔-values reads 0, 1, 2, 2, 3, 1, 3, 2, 3.
So, once I make that column, I know the probabilities, and I can simply add them up. If you look at
𝑔(𝑋1 , 𝑋2 , 𝑋3 ), it takes values {0, 1, 2, 3}. And how do I get the probability for 0? It appears once
with probability 1/9, so it is just 1/9. For 1, there is a 1/9 and one more 1/9, so it is 2/9. It just becomes a
very easy calculation: 2 appears 3 times, so that is 3/9, and again 3 appears
3 times, so it is 3/9.

So, for any other function that you want to come up with,
one can do the exact same argument. Any other 𝑔 I define, I just simply
add a column for 𝑔(𝑡1 , 𝑡2 , 𝑡3 ); that is the method. So, you just write down all the
possibilities here, and since you know the range, you can write it like this. So, I am not going to
repeat this for any other example.
So, let me just do a joint PMF and show you; any other function can be
done this way. Whatever function you want to define here, you can use the same method. So, let
us say, as the second part, I define two functions and then ask for the joint PMF. Let us
say, we will keep the same 𝑔 so that we do not
have a conflict in notation, and maybe for ℎ I will do ℎ(𝑋1 , 𝑋2 , 𝑋3 ) = 𝑋2 𝑋3 . Just for the fun of it, why
not 𝑋2 𝑋3 ?
So, here again, it is not too difficult: you just write down both of these functions as columns and
then look for common possibilities. So, I need 𝑋2 𝑋3 , so let us write the 𝑡2 𝑡3 column: it reads 0, 0, 0, then
1, then 2, then 0 again, 0 again, 0 again, and then 1. That is 𝑋2 𝑋3 . So, now if you want the joint PMF of 𝑔 and ℎ, let us use
slightly better notation here: maybe I will call 𝑔(𝑋1 , 𝑋2 , 𝑋3 ) = 𝑋 and ℎ(𝑋1 , 𝑋2 , 𝑋3 ) =
𝑌.
So, if I think of the joint PMF of 𝑋 and 𝑌, how do I do this? You can
see 𝑡1 + 𝑡2 + 𝑡3 takes values 0, 1, 2, 3, and 𝑡2 𝑡3 takes values 0, 1, 2, and then I just
have to count; in the pairs below, the first entry is the value of 𝑡2 𝑡3 and the second is the value of
𝑡1 + 𝑡2 + 𝑡3 . So, if you look at (0, 0), it appears exactly once, and that is just 1/9. Let us look at
(1, 0); (1, 0) does not appear at all, so that is 0. Let us look at (2, 0); (2, 0) does not appear at all.
What about (0, 1)? (0, 1) appears exactly twice. What about (1, 1)? (1, 1)
does not appear. What about (2, 1)? (2, 1) does not appear. (0, 2) appears exactly twice again. (1,
2) appears exactly once. (2, 2) does not appear. (0, 3) appears once. (1, 3) appears once, (2, 3)
appears once. I hope I have counted correctly. So, that is it; that gives you the joint PMF, and you
can find the joint PMF of anything else in the same fashion.
So, in small cases, even when you have any number of functions, the simple table method will
always work.
You just write down the table, write down the functions, see what values they take, and
present the joint PMF, the individual PMFs, or anything you like; it can be very
easily done. So, this is the essence of the method for small cases. But as usual, when the function
becomes a bit more complicated, or the random variables take lots of values,
these kinds of calculations are not so straightforward. So, anyway.
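The table method can be written as a short routine; below is a sketch in plain Python. The joint PMF table here is a hypothetical reconstruction (the slide's actual triples are not in the transcript), chosen so that the g-column and h-column match the values read out above:

```python
# Table method: given a joint PMF over triples, compute the PMF of g and the
# joint PMF of (g, h) by adding a "column" for each function and grouping.
from fractions import Fraction
from collections import defaultdict

# Hypothetical table of 9 equally likely triples (reconstructed, not from slide).
triples = [(0,0,0), (1,0,0), (0,2,0), (0,1,1), (0,1,2),
           (0,0,1), (1,2,0), (1,0,1), (1,1,1)]
joint = {t: Fraction(1, 9) for t in triples}

g = lambda t: t[0] + t[1] + t[2]   # g(X1, X2, X3) = X1 + X2 + X3
h = lambda t: t[1] * t[2]          # h(X1, X2, X3) = X2 * X3

f_g = defaultdict(Fraction)        # PMF of g alone
f_gh = defaultdict(Fraction)       # joint PMF of (g, h)
for t, p in joint.items():
    f_g[g(t)] += p
    f_gh[(g(t), h(t))] += p

print(dict(f_g))    # g takes 0, 1, 2, 3 with probabilities 1/9, 2/9, 3/9, 3/9
```

Grouping rows by the function value is exactly the "add a column, then add up probabilities" procedure from the lecture.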
(Refer Slide Time: 13:59)

So, once you start thinking of functions of random variables, you will start getting these elegant, nice
relationships between random variables. So, here is one such relationship; it is a very easy
relationship, and we have seen it several times before. Let 𝑋1 , … , 𝑋𝑛 be the results of 𝑛 i.i.d.
Bernoulli(𝑝) trials: remember, 𝑋ᵢ is 1, a success, if the Bernoulli(𝑝) trial resulted in 1, and a failure,
0, if the Bernoulli(𝑝) trial resulted in 0. So, this is the success-failure setting we have seen before, and the sum of
the 𝑛 random variables, 𝑋1 + ⋯ + 𝑋𝑛 , is simply the number of successes in 𝑛 independent Bernoulli trials, is
it not?
So, it is easy to identify that this has to be 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙 (𝑛, 𝑝). So, it is a nice relationship, and
the way this is usually expressed is: the sum of 𝑛 independent 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖 (𝑝) random variables is
𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙 (𝑛, 𝑝). Notice how a function of some independent distributions becomes some other
distribution. Identities like this are interesting in probability; they give you
nice questions to ask in quizzes, and they also give you nice relationships between the things we
are studying.
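This identity is easy to check by simulation; a sketch using Python's random module (n, p, and the trial count are arbitrary illustrative values):

```python
# Sum of n i.i.d. Bernoulli(p) vs the Binomial(n, p) PMF, empirically.
import random
from math import comb
random.seed(1)

n, p, trials = 10, 0.3, 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    s = sum(random.random() < p for _ in range(n))   # one sum of n Bernoullis
    counts[s] += 1

empirical = [c / trials for c in counts]
binomial = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(max(abs(a - b) for a, b in zip(empirical, binomial)))   # small
```

The largest gap between the empirical histogram and the Binomial(n, p) PMF shrinks as the trial count grows, as the identity predicts.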
(Refer Slide Time: 15:32)

So, here is another example. You have 𝑋 uniform in {0, 1, 2, 3} and 𝑌 uniform in {0, 1, 2, 3},
with 𝑍 = 𝑋 + 𝑌; since 𝑋 and 𝑌 are independent, each of the 16 pairs has probability 1/16. If I
start writing out all the possibilities, I will get a 16-row table; I could do it, it is not
too difficult, but maybe there is a slightly better method one can think of. If you look at 𝑍,
it will take values 0, 1, 2, 3, 4, 5, 6, is it not? So, you can quickly see how 0 will happen
as a sum of 𝑋 and 𝑌: 𝑋 has to be 0 and 𝑌 has to be 0, that is the only case, so this will have
probability 1/16; it is easy to see.

Now, what about 1? 1 can happen if 𝑋 is 0 and 𝑌 is 1, or 𝑌 is 0 and 𝑋 is 1. So, that is 2/16. 2 is
(0, 2), (1, 1), and (2, 0), so that is 3/16. 3 is (0, 3), (1, 2), (2, 1), (3, 0), so that is 4/16. 4 is
what? (1, 3), (2, 2), and (3, 1), so that is 3/16. So, you see how it starts from 1/16, goes to 2/16,
3/16, 4/16, and then climbs down; it starts climbing down because, as you go to larger sums, fewer
cases accumulate. 5 is 2 plus 3 or 3 plus 2, so 2/16. Finally, 6 is just 1/16.
So, usually people like to plot these things. Say you plot 𝑓𝑋 (𝑥): it is just 1/4, 1/4, 1/4, 1/4
at 0, 1, 2, 3, and you can also plot 𝑓𝑌 , which is the same. I was mentioning how this is one good
way of visualizing PMFs. Now, notice what 𝑓𝑍 is: it goes over 0, 1, 2, 3, 4, 5, 6, but it starts
with a very small 1/16, then goes up to 2/16, 3/16, 4/16, and comes down to 3/16, 2/16, and then
1/16. So, these are nice ways to visualize.
So, some people like to say that when you add two uniform PMFs, the distribution you get is some sort
of a triangular one. So, it is a nice way to visualize what happens to the sum of two uniforms.
(Refer Slide Time: 18:59)

So, in general, when you take the sum of two random variables and the random variables take integer values, it
is good to think of it a little bit more precisely. So, I am saying 𝑋 and 𝑌 take integer values:
they take values like 0, 1, 2 or −1, −2, values in the integers, and their joint PMF is 𝑓𝑋𝑌 . Now,
I am going to say 𝑍 = 𝑋 + 𝑌, and it turns out one can write down the mechanics of this
without too much trouble. We know the sum 𝑋 + 𝑌 will be an integer.
The previous example is clearly an example of this. So, 𝑍 will be some integer, and 𝑃(𝑍 = 𝑧) =
𝑃(𝑋 + 𝑌 = 𝑧). Now, 𝑋 + 𝑌 can equal 𝑧 when 𝑋 and 𝑌 take various different possible values. What
are all the values that 𝑋 and 𝑌 can take? 𝑋 could be some 𝑥. If 𝑋 = 𝑥, then what should 𝑌 be?
Because 𝑋 + 𝑌 is 𝑧, 𝑌 is going to be 𝑧 − 𝑥.

So, now what are the possible values of 𝑥? 𝑥 could be any integer. So, you go through all possible
integer values of 𝑥 and simply sum 𝑃(𝑋 = 𝑥, 𝑌 = 𝑧 − 𝑥). You see how these two events are the same:
{𝑋 + 𝑌 = 𝑧} is the union over 𝑥 of {𝑋 = 𝑥, 𝑌 = 𝑧 − 𝑥}. So, hopefully you see that. Now, each term
is just the joint PMF evaluated at (𝑥, 𝑧 − 𝑥), so 𝑃(𝑍 = 𝑧) = ∑ₓ 𝑓𝑋𝑌 (𝑥, 𝑧 − 𝑥).
You can also, instead of summing over 𝑥, sum over 𝑦, why not; it is the same thing, and
it becomes ∑ᵧ 𝑓𝑋𝑌 (𝑧 − 𝑦, 𝑦). So, both of these are easy expressions to remember
when you are thinking of sums of random variables taking integer values. Integer values are sort
of important here because I am summing over 𝑥 and 𝑦 being integers; otherwise it becomes
a little bit rougher.
Now, if 𝑋 and 𝑌 are independent, what will happen to the joint PMF? 𝑓𝑋𝑌 (𝑥, 𝑧 − 𝑥) will factor as
𝑓𝑋 (𝑥)𝑓𝑌 (𝑧 − 𝑥), is it not? So, 𝑓𝑍 (𝑧) = ∑ₓ 𝑓𝑋 (𝑥)𝑓𝑌 (𝑧 − 𝑥), where 𝑥 goes from −∞ to ∞ over the
integers. So, this kind of operation is called a convolution: you have an 𝑓𝑋 , you have an 𝑓𝑌 , and
you convolve the two, and this operation is called a convolution. We convolve the two to get 𝑓_{𝑋+𝑌} (𝑧). So,
the two random variables have to be independent for this convolution formula to hold.
So, this is nothing new; I am just writing it in this fashion to introduce this notion called
convolution. The sum of integer-valued random variables is something very important to
understand, and this is a general formula one can use when the values are integers. Convolution shows
up as an important operation here.
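The convolution formula translates directly into code; here is a sketch in plain Python, checked on the two Uniform{0, 1, 2, 3} example from before:

```python
# Convolution of two integer-supported PMFs: f_Z(z) = sum_x f_X(x) f_Y(z - x).
from fractions import Fraction

def convolve(fX, fY):
    """PMF of X + Y for independent integer-valued X, Y given as dicts."""
    fZ = {}
    for x, px in fX.items():
        for y, py in fY.items():
            fZ[x + y] = fZ.get(x + y, Fraction(0)) + px * py
    return fZ

uniform = {k: Fraction(1, 4) for k in range(4)}     # Uniform{0, 1, 2, 3}
fZ = convolve(uniform, uniform)
print(fZ[0], fZ[3], fZ[6])   # 1/16, 1/4, 1/16: the triangular shape
```

Iterating over the supports of both PMFs sidesteps the limits question entirely: terms outside the support simply never appear in the sum.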
(Refer Slide Time: 22:02)

So, we just saw the formula; let us use it in a particular case. You have two independent
Poissons, 𝑋~𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆1 ), 𝑌~𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆2 ); what is the PMF of 𝑍 = 𝑋 + 𝑌?
So, let us use the previous convolution formula. In general the sum over 𝑥 runs from −∞ to ∞,
but this is a Poisson random variable, and I know a Poisson random variable takes only non-negative
values, so 𝑓𝑍 (𝑧) = ∑_{𝑥=0}^{∞} 𝑓𝑋 (𝑥)𝑓𝑌 (𝑧 − 𝑥).

And 𝑍 also, I know, is going to be non-negative, so 𝑧 ∈ {0, 1, 2, …}. Here

𝑓𝑋 (𝑥) = 𝑒^{−𝜆1 } 𝜆1 ^{𝑥} / 𝑥!

So, if you do not like 𝑥 and all that, maybe you can use 𝑘 and 𝑙 instead, if you think of 𝑥 being an
integer as a bit difficult to imagine. What is 𝑓𝑌 ?

𝑓𝑌 (𝑧 − 𝑥) = 𝑒^{−𝜆2 } 𝜆2 ^{𝑧−𝑥} / (𝑧 − 𝑥)!

So, notice what can happen here. If I blindly use the formula, I write

𝑓𝑍 (𝑧) = ∑_{𝑥=0}^{∞} 𝑓𝑋 (𝑥)𝑓𝑌 (𝑧 − 𝑥) = ∑_{𝑥=0}^{𝑧} (𝑒^{−𝜆1 } 𝜆1 ^{𝑥} / 𝑥!) · (𝑒^{−𝜆2 } 𝜆2 ^{𝑧−𝑥} / (𝑧 − 𝑥)!),

but infinity really is not needed as the upper limit. See,
X is non-negative and Y is non-negative, so if I say the sum is z, what is the maximum possible value that X can
take? It cannot be infinity; it is enough if I go up to z. So, this is a crucial simplification.
Whenever you do convolution, the tricky aspect is always to figure out
the limits.

So, I happily wrote minus infinity to infinity without worrying about anything. It is enough if I go
from 0 to z; why? Because if 𝑥 > 𝑧, 𝑓𝑌 (𝑧 − 𝑥) will be 0, so you do not have to bother about it, and
𝑓𝑋 (𝑥) will be equal to 0 if 𝑥 < 0. So, you fix your lower limit of 0 with the second fact, and you fix your
upper limit with the first. So, that is the idea; that is good to see. So, you stop here: the upper limit
need not be infinity, it can be just z. What do we do with this formula?
It seems like an interesting little calculation for us to do. So, notice what I can do: a little trickery
here. 𝑒^{−𝜆1 } 𝑒^{−𝜆2 } do not depend on 𝑥, so they come out of the sum. The 𝜆1 ^{𝑥} 𝜆2 ^{𝑧−𝑥} I have to
retain; I cannot do much more about that.
And then I have 𝑥! (𝑧 − 𝑥)! in the denominator. So, one of the things to do here, it is a little bit of a trick, is to divide
by 𝑧! and then add the 𝑧! back inside, so I will get

𝑓𝑍 (𝑧) = (𝑒^{−(𝜆1 +𝜆2 )} / 𝑧!) ∑_{𝑥=0}^{𝑧} (𝑧! / (𝑥! (𝑧 − 𝑥)!)) 𝜆1 ^{𝑥} 𝜆2 ^{𝑧−𝑥}.

Now, ∑_{𝑥=0}^{𝑧} (𝑧! / (𝑥! (𝑧 − 𝑥)!)) 𝜆1 ^{𝑥} 𝜆2 ^{𝑧−𝑥} is the familiar binomial formula for (𝑎 + 𝑏)^{𝑛}; think
about why this is true. It becomes (𝜆1 + 𝜆2 )^{𝑧}. So, this is the binomial
expansion formula.
We have seen this before in the context of binomial distributions, how (𝑎 + 𝑏)^{𝑛} becomes
∑ₖ (𝑛! / (𝑘! (𝑛 − 𝑘)!)) 𝑎^{𝑘} 𝑏^{𝑛−𝑘}, that kind of expression. So, you have to identify this summation here. Once you
do that, you get this very nice expression. So, let me just finally write down what this is:

𝑓𝑍 (𝑧) = 𝑒^{−(𝜆1 +𝜆2 )} (𝜆1 + 𝜆2 )^{𝑧} / 𝑧!
What kind of distribution is this? You know this distribution, so this implies 𝑍 is Poisson, the sum
of two Poisson random variables is Poisson with parameter being (𝜆1 + 𝜆2 ). So, it is a very
important and interesting result. Previously we had sum of 𝑛 independent Bernoulli’s is
𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙 (𝑛, 𝑝), we saw that very nice result, here is another result; sum of two independent
Poisson’s is another Poisson. What is the parameter, you just add the 𝜆1 and 𝜆2 . Nice, is not it?
Very nice. So, this is the first problem.
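Since this course also does quite a bit of computer simulation, here is a quick numerical check of this result (my own sketch, not from the lecture; the values λ1 = 2 and λ2 = 3 are chosen arbitrarily): convolving two Poisson pmfs gives exactly the Poisson(λ1 + λ2) pmf.

```python
import math

def poisson_pmf(lam, k):
    # pmf of Poisson(lam): e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2 = 2.0, 3.0
for z in range(12):
    # convolution with the corrected limits: x runs only from 0 to z
    conv = sum(poisson_pmf(lam1, x) * poisson_pmf(lam2, z - x) for x in range(z + 1))
    assert abs(conv - poisson_pmf(lam1 + lam2, z)) < 1e-12
```

The assertion passing for every z confirms that the sum of two independent Poissons is Poisson with the parameters added.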
The second problem asks you to find the conditional distribution of 𝑋 given 𝑍. So, you are given
the 𝑍, given the sum, what is the conditional distribution of 𝑋? So, basically, we have to find
P(X = k | Z = n). So, by definition,

P(X = k | Z = n) = P(X = k, Z = n) / P(Z = n)

We have already seen what P(Z = n) is. What is P(X = k, Z = n)? We could write P(X = k, Z = n)
as P(X = k) P(Z = n | X = k), but that is making it a little more laborious than it should be.

So, if X = k and Z = n, we know X and Y, is it not?

P(X = k | Z = n) = P(X = k, Z = n) / P(Z = n) = P(X = k) P(Y = n − k) / P(Z = n)

So, this you can easily figure out: X is k and Z is n, so Y had better be n − k, is it not? So, that is
the best thing to remember.
So, here now we are ready, we are ready to substitute all that we know:

P(X = k) P(Y = n − k) / P(Z = n)
= [e^(−λ1) λ1^k / k!] · [e^(−λ2) λ2^(n−k) / (n − k)!] / [e^(−(λ1+λ2)) (λ1 + λ2)^n / n!]
= [n! / (k! (n − k)!)] · (λ1 / (λ1 + λ2))^k · (λ2 / (λ1 + λ2))^(n−k)
So, it is just a different way of writing it, but this lets you identify the distribution. What is this
distribution? This is ending up binomial again.
So, this guy X | Z = n ~ Binomial(n, λ1/(λ1 + λ2)), is it not? That is very nice. So, X conditional
on Z, the conditional random variable X | Z = n, is actually Binomial(n, λ1/(λ1 + λ2)). You can very
quickly and very easily see that Y | Z = n will be binomial again, but with parameter λ2/(λ1 + λ2)
instead. So, easy to see these things. So, this is again a nice little boxed formula here for you to
remember.
So, notice how these interesting relationships get built between these distributions, binomial,
Poisson, Bernoulli, so this should hopefully tell you that all these things are related in some basic
way in which the experiments happen. This is a bit of a quick calculation; the calculation itself is
not so important. I just wanted to point out how interesting relationships can appear between
random variables when you start using functions on them. So, this is a good example to see.
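The binomial claim for the conditional distribution can also be checked numerically, in the same spirit as the course's simulations (my own sketch; λ1 = 1.5, λ2 = 2.5 and n = 6 are arbitrary choices):

```python
import math

def poisson_pmf(lam, k):
    # pmf of Poisson(lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2, n = 1.5, 2.5, 6
p = lam1 / (lam1 + lam2)
for k in range(n + 1):
    # P(X = k | Z = n) = P(X = k) P(Y = n - k) / P(Z = n)
    cond = poisson_pmf(lam1, k) * poisson_pmf(lam2, n - k) / poisson_pmf(lam1 + lam2, n)
    # Binomial(n, lam1 / (lam1 + lam2)) pmf
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    assert abs(cond - binom) < 1e-12
```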
(Refer Slide Time: 30:50)
So, a quick word on functions and independence. So, we saw independence of random variables,
now we are seeing functions of random variables. So, one of the important results, a nice result is
if 𝑋 and 𝑌 are independent, 𝑔(𝑋) and ℎ(𝑌), 𝑋 and 𝑌 are already independent, I am taking a function
of 𝑋, function of 𝑌, the functions will also be independent for any two functions 𝑔 and ℎ.
So, if X1, X2, X3, X4 are mutually independent, then as long as I take functions of disjoint sets of
these random variables, I will keep getting independent random variables. Taking a function will
not suddenly introduce dependence if originally the dependence was not there. If the dependence
was originally there, then of course taking functions may retain it; but if the dependence was not
there, then after taking functions you will still be independent. So, functions and independence is
a small result, but worthwhile noting. We will not prove it any more rigorously than this; it is an
intuitive result, and you can see why it is true.
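For a small concrete instance of this, take two independent fair dice and two arbitrary functions; the joint pmf of (g(X), h(Y)) factorizes into the product of the marginal pmfs. An exact enumeration (my own sketch; the particular functions g and h are arbitrary choices):

```python
from itertools import product
from collections import Counter

# X, Y are two independent fair dice; g(X) = X mod 2, h(Y) = min(Y, 3)
pairs = Counter((x % 2, min(y, 3)) for x, y in product(range(1, 7), repeat=2))
joint = {gh: c / 36 for gh, c in pairs.items()}

# marginal counts of g(X) and h(Y) over one die each
pg = Counter(x % 2 for x in range(1, 7))
ph = Counter(min(y, 3) for y in range(1, 7))

# joint pmf equals product of marginals, i.e. g(X) and h(Y) are independent
for (g, h), pr in joint.items():
    assert abs(pr - (pg[g] / 6) * (ph[h] / 6)) < 1e-12
```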
(Refer Slide Time: 31:48)

So, I am going to leave you with a few exercises. These exercises are important. Just like I did the
sum of Poisson random variables, go through and, on your own, try to do these sums. We have seen
this before. You take two binomials, same p but different first parameters, m in one case, n in the
other case; what is the distribution of the sum of those two, if independent? Then: the sum of two
independent geometric distributions, the sum of r i.i.d. geometric distributions, the sum of two
independent negative binomial distributions. These will all turn out to be something nice. So, give
it a shot and then you will see these relationships, and they have a very good connection as well.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 3.11
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. In this short lecture we are going to see one type of function of
random variables which also shows up quite often. So, we saw general functions of random
variables how to deal with them and then we saw sums of random variables particularly sums of
independent random variables. And we saw that that is very nice and it results in so many
interesting relationships, and sums occur all the time. We will see later on how sums keep
occurring when you deal with random variables and probability spaces.

Another function which is very common is the minimum and the maximum. So, here is an example I have
given you supposing you have two random variables X and Y which have a joint distribution, joint
PMF 𝑓𝑋𝑌 . I can define 𝑍 = 𝑚𝑖𝑛(𝑋, 𝑌). So, it is a very valid sort of operation to do, do not think
of it as some strange thing, 𝑋 is some value, 𝑌 is some value I can of course take 𝑚𝑖𝑛(𝑋, 𝑌), there
is nothing wrong with that, it is a function of 𝑋 and 𝑌 clearly.

So, here is a simple example, you throw the die twice, 𝑍 is the minimum of the two numbers seen.
I might be interested in it. So, when a random variable keeps occurring, I may be interested in the
minimum of those or the max of those. So, that is very useful to see, and you can imagine so many
situations where you want to track how low some value can go or how high some value can go, and
this operation will show up quite often in practice.

So, it turns out finding the pmf of the minimum is not too hard given the joint pmf. You can write
down a little expression, maybe not very easy, but still you can write it down; later on we will
see some simplifications, particularly for the minimum and maximum. So, the pmf of Z evaluated at
small z is basically the probability that the minimum equals z. So, if the minimum of two random
variables has to be equal to 𝑧, one of them has to be equal to 𝑧, the other should be greater than or
equal to 𝑧.

If you break this event up into disjoint cases, you will get 3 of them. One is both of
them are equal to 𝑧. The other is 𝑋 is 𝑧, 𝑌 is strictly greater than 𝑧. The third possibility is 𝑋 is
strictly greater than 𝑧 and 𝑌 = 𝑧. So, these are the 3 possibilities you can sum up and find the
probabilities of those. The first event has probability f_XY(z, z). The second one is a summation
over all t2 > z of f_XY(z, t2). And the third event is a summation over all t1 > z of f_XY(t1, z).
So, finding the pmf of the minimum from the joint pmf is sort of a straightforward exercise, except
that you have all these ugly summations; we will see how to deal with them soon. What about the
maximum? I have not written down the maximum explicitly, but you can see clearly how to do it;
you can repeat the same thing for maximum. For max of 𝑧, 𝑋 can be 𝑧, 𝑌 can be 𝑧, that is one
possibility.

The other possibility is 𝑋 = 𝑧, and 𝑌 is strictly lesser than 𝑧. The third disjoint possibility is 𝑋 is
strictly less than 𝑧 and 𝑌 is equal to 𝑧. So, that is maximum equal to 𝑧. then you can write down
the same thing. I am not writing it again, you please go through the exercise and write it for max,
it is a very easy, simple thing to do. Once you know the min, extending for the max is not too hard.
(Refer Slide Time: 3:34)

So, let us try and execute it for a simple experiment here. Throw a die twice, and Z is the
minimum of the two numbers seen. What is the probability that Z = 1? Notice, the minimum of the
two numbers should be 1, so I count all the outcomes in which at least one die shows 1: that is
5 + 6 = 11 outcomes, so P(Z = 1) = 11/36. So, let us do a multicolour presentation; go to my blue
here. Supposing I ask for the probability that Z = 2: the minimum will be 2 in all the cases where
both numbers are at least 2 and one of them equals 2. You see how this is working out: that is
4 + 5 = 9 outcomes, so P(Z = 2) = 9/36. Let us go to the third colour, which is green, and ask what
is the probability that Z = 3. Notice here the minimum of the two has to be equal to 3; counting
those possibilities gives P(Z = 3) = 7/36. Hopefully, I do not have to illustrate any more with my
colours.

So, you can see that P(Z = 4) = 5/36, P(Z = 5) = 3/36 and P(Z = 6) = 1/36. And you can check
that 1 + 3 + 5 + 7 + 9 + 11 = 36, so it is a valid distribution, and you can see how it works. Even
for the max something very similar is true; just think about how the max works out. You have to do
a slightly different summation and you will get it. So, this is a simple illustration of how you can
work with pmfs.
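The counting above is easy to verify by enumerating all 36 outcomes, in the style of the simulations done in this course (my own quick sketch):

```python
from itertools import product
from collections import Counter

# count how often each value is the minimum of two dice
counts = Counter(min(a, b) for a, b in product(range(1, 7), repeat=2))

# matches the lecture: 11, 9, 7, 5, 3, 1 outcomes for Z = 1, ..., 6
assert [counts[z] for z in range(1, 7)] == [11, 9, 7, 5, 3, 1]

pmf = {z: counts[z] / 36 for z in range(1, 7)}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # valid distribution
```

Replacing `min` with `max` in the same snippet gives the pmf of the maximum.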
(Refer Slide Time: 5:45)

But it turns out, particularly if the two random variables are independent, that the cumulative
distribution function of the maximum is very easy to write down. I am not talking about the minimum
here; the maximum is the easy one to write down. So far we have not mentioned the CDF too much. In
your Statistics 1 course, you might have studied the CDF; it is a very useful tool as well, but so
far, in the discrete case, the PMF has been quite good and we have not needed the CDF too much. But
when you want to find the distribution of the maximum, particularly in the independent case, you
will see the CDF will wonderfully simplify things for you.

So, what is the CDF again? We have seen this before: the cumulative distribution function of a
random variable is a function from the real line to [0, 1], defined simply as the probability that
the random variable is less than or equal to x, F_X(x) = P(X ≤ x). So, for the pmf it is the
probability that X = x; for the CDF it is the probability that X ≤ x. It turns out the CDF is very
important. It is much more general than the pmf, but so far we have been looking at discrete random
variables, so we have been happy with the pmf. But the CDF is important; it will make an appearance
soon enough in our study.

Now, the CDF of the maximum of two independent random variables is very easy to find; why is
that? Let us say X and Y are independent, and Z = max(X, Y). Now, when I want to find the
CDF of Z evaluated at small z, what is it by definition? It is the probability that max(X, Y) is less
than or equal to z. Now, it turns out that max(X, Y) ≤ z is exactly the event that X ≤ z and Y ≤ z.
Think about this very carefully.

If X ≤ z and Y ≤ z, clearly max(X, Y) ≤ z; so one way it is true. The other way it is also true:
if max(X, Y) ≤ z, of course X ≤ z and Y ≤ z. So, you see that these two events are exactly the
same. If it is a little bit troubling, think about it for a while; you will see why it is true.

Now, once I write it like this, my independence kicks in: X ≤ z and Y ≤ z are independent events, so
I can simply write it as the probability that X ≤ z times the probability that Y ≤ z. That works
because of independence; remember, that step is by independence. It will not always happen: if they
are not independent, you cannot write it like this, and you have to go to the joint CDF or something
like that, which is a bit more complicated.

But once you do this, you identify that the probability of X ≤ z is simply the CDF of X evaluated at
z, and the probability of Y ≤ z is the CDF of Y evaluated at z. So, we get this very nice
relationship in the independent case: the CDF of the maximum is the product of the individual CDFs.
What about the minimum? The minimum is not quite so easy; you have to do a little bit more, but you
will see you can do the minimum with X ≥ z.

So, for the CDF we did less than or equal to; for the minimum you work with greater than or equal
to. It is not the same, but the greater-than-or-equal-to probability will factor very nicely; it is
called the complementary CDF or something like that. So, I will let you think about the minimum on
your own, but for the max at least it is very easy to write down. If min(X, Y) is greater than or
equal to small z, then X ≥ z and Y ≥ z, and conversely. So, that works out; the
greater-than-or-equal-to direction will work for the minimum.
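Here is a quick check of the product rule for the CDF of the maximum, using two independent fair dice (my own sketch, not from the lecture):

```python
from itertools import product

def cdf_die(z):
    # CDF of one fair die: P(X <= z), clipped to [0, 1]
    return min(max(z, 0), 6) / 6

for z in range(0, 7):
    # exact P(max(X, Y) <= z) by enumerating all 36 equally likely outcomes
    direct = sum(1 for x, y in product(range(1, 7), repeat=2) if max(x, y) <= z) / 36
    # product of individual CDFs, valid because X and Y are independent
    assert abs(direct - cdf_die(z) ** 2) < 1e-12
```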
(Refer Slide Time: 9:33)

So, let us apply this idea to i.i.d. sequences; independent and identically distributed is really
the thing to say. I will do the second part, the max, first: P(max(X1, …, Xn) ≤ z). We saw this just
now; it is P(X1 ≤ z, X2 ≤ z, …, Xn ≤ z), and remember, comma means 'and'. By independence, that is
nothing but the probability of X ≤ z raised to the power n. So, this is the general formula, and it
is just the CDF of X raised to the power n. Very easy formula to write down.

For the first one, for the min, it is easiest to work with greater than or equal to. So, this works
out as P(X1 ≥ z, X2 ≥ z, …, Xn ≥ z). Think about it: if all of them are greater than or equal to z,
then the minimum is greater than or equal to z; and if the minimum is greater than or equal to z,
then all of them are. So, both ways it works; these two events are the same.

So, this is nothing but the probability that X ≥ z, the whole thing raised to the power n. This one
you cannot quite write directly in terms of the CDF, because of the less-than versus
greater-than-or-equal-to issue, so I will leave it in this formulation. So, this is the min and max
of i.i.d. sequences; you see how a probability raised to the power n shows up in these pictures.
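Both formulas can be verified by brute-force enumeration for a small n; a sketch with n = 3 i.i.d. fair dice (my own example, not from the lecture):

```python
from itertools import product

n = 3
outcomes = list(product(range(1, 7), repeat=n))  # all 6^3 equally likely outcomes

for z in range(1, 7):
    p_max = sum(1 for o in outcomes if max(o) <= z) / 6 ** n
    p_min = sum(1 for o in outcomes if min(o) >= z) / 6 ** n
    # P(max <= z) = P(X <= z)^n and P(min >= z) = P(X >= z)^n
    assert abs(p_max - (z / 6) ** n) < 1e-12
    assert abs(p_min - ((7 - z) / 6) ** n) < 1e-12
```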
(Refer Slide Time: 11:20)

I believe this is the last problem that we are doing in this session. Let us work on it. So, here is
another very interesting property of the geometric distribution. We are going to look at the minimum
of two independent geometric random variables. So, we know how to deal with this: for min(X, Y),
the best way is to look at P(min(X, Y) ≥ k); since the variables are discrete, let me just use k.
This is P(X ≥ k) P(Y ≥ k), and both of them have the same p. So, that is very nice.

So, for a geometric, P(X ≥ k) will end up being (1 − p)^(k−1). Now, Y is another independent
geometric, so that is again (1 − p)^(k−1). Together this is just (1 − p)^(2(k−1)). So,
P(min(X, Y) ≥ k) = (1 − p)^(2(k−1)); keep this in mind.

So, what is the probability that min(X, Y) ≥ k + 1? Instead of k you put k + 1, so that will be
(1 − p)^(2k). Is that okay? So, now, what is the probability that min(X, Y) = k? I am just sort of
beating around the bush, so maybe you can see this:

P(min(X, Y) = k) = P(min(X, Y) ≥ k) − P(min(X, Y) ≥ k + 1), is it not?

If I subtract these two, I am going to get exactly the event min = k: the first covers k, k + 1,
k + 2, etc., and the second covers k + 1, k + 2, etc. So, you subtract these two and you are left
with k. And if you look carefully, letting q = (1 − p)^2 to simplify the expression, this is
q^(k−1) − q^k, that is, q^(k−1)(1 − q).
What distribution is q^(k−1)(1 − q) for P(min(X, Y) = k)? That is nothing but min(X, Y) being
geometric: we see from this expression that it is Geometric(1 − q). So, notice the way the
geometric works. Remember, if X is geometric with parameter p, the probability that X = k is
(1 − p)^(k−1) p; matching the two forms, q plays the role of 1 − p, so the parameter of the new
geometric is 1 − q.

So, this is not peculiar to both of them having the same p, actually. I could as well have taken X1
geometric with p1 and X2 geometric with p2, both independent, and then you can show, using the same
sequence of steps, just keeping track of p1 and p2, that the minimum will be geometric with
parameter 1 − (1 − p1)(1 − p2) instead of 1 − (1 − p)^2.
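This parameter claim is easy to check numerically; a small sketch (my own, with p1 = 0.3 and p2 = 0.5 chosen arbitrarily):

```python
p1, p2 = 0.3, 0.5

def tail(p, k):
    # P(X >= k) for X ~ Geometric(p) on {1, 2, 3, ...}
    return (1 - p) ** (k - 1)

q = (1 - p1) * (1 - p2)
for k in range(1, 25):
    # P(min = k) = P(min >= k) - P(min >= k + 1), using independence
    pmf_min = tail(p1, k) * tail(p2, k) - tail(p1, k + 1) * tail(p2, k + 1)
    # Geometric pmf with parameter 1 - (1 - p1)(1 - p2)
    geom = (1 - q) * q ** (k - 1)
    assert abs(pmf_min - geom) < 1e-12
```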

So, if you want, you can write this as p1 + p2 − p1p2; it is the same thing, both forms are the
same. So, this is a very nice result: once again, the minimum of two independent geometrics is a
geometric distribution. Try and repeat the same calculation for the maximum of two geometrics; the
maximum of two geometric distributions, even if they are independent, is not geometric, only the
minimum is. That concludes the lecture, and that also concludes the lectures for this week. Thank
you very much, I will see you next week. Bye.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 4.1
Expected value
(Refer Slide Time: 0:13)

Hello and welcome to week 4. In week 4 we are going to start looking at what is called the
expected value of a random variable. So far we have been looking at discrete random
variables; we described a discrete random variable using the range of values it takes and the
probability with which it takes each value in the range, the so-called PMF, the probability mass
function. Once you have the values it takes and the probabilities with which it takes these values,
we saw that there are so many different types of distributions that show up, and one can study each
of them.

So, oftentimes there is a need to summarize the random variable into one number. Instead of
describing the whole distribution, the whole range, you might want to succinctly describe the
random variable by one number, or describe it in some very small detail as opposed to the big
detail of the distribution. The expected value is at the heart of this, and it has so many
different uses. This notion of an expected value is one of those important connections between data
and probability theory; it is really closely connected to everything. So, let us start looking at
what the expected value of a random variable is and then go ahead and keep making those connections
as we see them.
(Refer Slide Time: 1:43)

We will begin with a brief introduction and the basic goal from a high level before we land on
expressions and derivations. From a high level, what is the expected value about? It is about
summarizing data. If you have a large amount of data and somebody wants to know some summary
information about it, instead of giving every possible value you might say the data is in
such-and-such a range: it has a minimum value, a maximum value, a certain average value, maybe a
certain median value. So, you give some descriptive statistics or summary information about a large
set of data; this is very common, and you might have studied it in Statistics 1 as well.

In particular, let us look at the average value more closely. The average value is quite useful;
you see it all over the place. Instead of giving a lot of data, you just give the average. If you
want a sense of how well a class is doing, you simply ask for the average marks of the class; that
gives you a sense of how well it is doing, instead of looking at the whole distribution and trying
to figure something out. The same thing happens in sports or with any other numbers you want to
look at: people are always talking about, say, the number of runs a batsman can be expected to
score in an innings. Things like that are very common, and they give us a sense of how good the
batsman is.

So, quite often this is very useful, and I could keep talking about the average and why it is so
useful. But it turns out that in probability theory, this notion of the expected value of a random
variable is sort of a theoretical construct; it is introduced with a certain definition. It
represents the average value that the random variable would take: if you were to observe that same
random variable a lot of times and take the average, you would get roughly this expected value.

That is one of the reasons why it has a good connection to practice. I will talk a little bit more
with specific examples later on as we go through this class. But the first thing we have to
understand is the simple mechanics of it: what is the expected value? A random variable is given,
the range is given, the PMF is given; how do you compute the expected value? After you compute the
expected value, what are its basic properties? Can you simplify the computation, can you do
something with it?

So, those are the basic mechanics of using the expected value that we will see in this lecture.
Later on we will slowly connect it to this notion of the average value; why it connects to the
average value we will see then. So, that is the introduction. Let us jump into the definition.

(Refer Slide Time: 4:17)

Before the definition, let me also motivate this notion of why the expected value is something
interesting, with a very simple example. So, let us imagine there is a casino, and somebody wants to
bet on something there. Here is the game: two dice are rolled, and you can place a bet on the sum of
the two numbers you see. This is a very common bet that a lot of people make; a very popular game,
by the way.

So, here is the return: if you bet under 7 and you win, you get your money back; if you bet over 7
and you win, you get your money back; if you say equal to 7 and you win, you get 4 times the money
back. So, that is the setup, and you can do some basic calculation with it. Supposing you bet one
unit of money on under 7, what is going to be your gain and what is the probability of that? So,
you can do this calculation: if you bet under 7 and you actually win, you have a gain of 0.

Whatever money you put in, you got it back, so your gain is 0, and that has probability 5/12. This
needs a bit of calculation which I am not going to do; take it as a homework problem if you like:
you throw a pair of dice, what is the probability that the sum is strictly less than 7? You will
get the answer 5/12. But if you bet under 7 and the number is actually equal to 7 or greater, you
lose, and in that case the gain is −1. That happens with probability 1 − 5/12 = 7/12, a very simple
calculation once again.

The same thing happens with over 7; the probabilities are the same. You bet over 7, your gain is 0
if the number is actually over 7, and the probability that the sum is greater than 7 is 5/12; you
can do this calculation and convince yourself that it is true. Likewise, the gain is −1 with
probability 7/12. Equal to 7, on the other hand, wins with probability 1/6: 7 comes up in 6
possibilities out of 36, which is 1/6. In that case you get +3, but you get −1 with probability 5/6.

So, this is the scenario, and now you can ask a lot of questions about it. How is it that the
casino set up this game, and how do you go about betting? Should you play this game? Is it a
reasonable game? Can you expect to actually gain money in the end, or is it so bad that you are
definitely going to lose if you keep playing for a long time? These are the kind of questions that
are crucial.

So, like I said, why is it that the casino decided to give 4 times the money back if the sum is
equal to 7? And in the long term, if you keep on playing this game again and again and again, what
can you expect to gain? Is it going to be positive, is it going to be negative, is it going to be
0? Such questions are very crucial, are they not? The expected value or average value is tied to
exactly this kind of question.

So, it is important in a problem like this to understand the average value; that is more important
than anything else. You may say the distribution is important, this and that is important; but
really, what matters in this kind of game, and in many such situations, is: in the long term, when
I keep on doing this, what is the expected gain, the average gain that I would have got after all
of this? So, this is something interesting that has a lot of meaning in practice. Hopefully I have
convinced you that the expected value is important for so many different reasons; now let us jump
into the definition.
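As a preview of the computation the coming definition makes precise, the gains and probabilities worked out above can be combined into a long-run average gain per unit bet. A quick sketch (my own, not from the lecture; exact fractions are used to avoid rounding):

```python
from fractions import Fraction

# expected gain per unit bet: sum of (gain * probability) over the outcomes
e_under = 0 * Fraction(5, 12) + (-1) * Fraction(7, 12)
e_over  = 0 * Fraction(5, 12) + (-1) * Fraction(7, 12)
e_seven = 3 * Fraction(1, 6)  + (-1) * Fraction(5, 6)

print(e_under, e_over, e_seven)  # -7/12 -7/12 -1/3
```

All three expected gains are negative, which is exactly why the casino can afford to offer 4 times the money back on "equal to 7".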

(Refer Slide Time: 8:10)

So, here is the definition. It is actually a very simple definition; it is easy to write down and
also quite easy to understand. Let us say you have a discrete random variable X with range T_X and
PMF f_X. We have been talking about how the range and PMF are central in describing a random
variable; that is what is given here.

So, the expected value of X is denoted E[X]; whatever I put inside E[ ] is the thing whose
expectation we take. In this case we have put X, so it is the expectation of X. It is called the
expectation of X or the expected value of X; I will use both sort of interchangeably, but expected
value of X is probably the better terminology, so you might want to stick to that.

The definition is very simple: you sum over every value t in the range T_X that your random
variable can take, and multiply it by f_X(t), the probability with which it takes the value t:

E[X] = Σ_{t ∈ T_X} t f_X(t) = Σ_{t ∈ T_X} t P(X = t)

The second form might be another equivalent way of understanding this definition. So, that is the
expected value of the random variable X.

So, like I said, it has the operational meaning that if you observe this random variable multiple
times, several times independently, and then you add up all the observations and average them, you
will get something close to this E[X]. We will prove statements like that, or see results like
that, later on. But forgetting about all that, the definition itself should be very clear to you:
how to operationally execute this definition to compute the expected value given the range and the
PMF should hopefully be clear. You can just plug in and find the expected value.

So, once the PMF is given, once the distribution is given, the expected value is determined. For
this reason, some people refer to E[X] as a function of the PMF and not of the random variable
itself; but it is very convenient to write it as E[X], and you will see why the expected value of X
is a very useful thing to have.
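The definition translates directly into code: sum each value times its probability. A minimal sketch (the function name `expected_value` is my own):

```python
def expected_value(pmf):
    # pmf: dict mapping each value t in the range T_X to f_X(t) = P(X = t)
    return sum(t * p for t, p in pmf.items())

# example: a fair die, uniform on {1, ..., 6}
die = {t: 1 / 6 for t in range(1, 7)}
assert abs(expected_value(die) - 3.5) < 1e-9
```

Any distribution given as a value-to-probability table can be plugged into this same function.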

(Refer Slide Time: 10:48)

A few remarks on names: a lot of people say the mean of X; if you say the mean of a random
variable, it is exactly the same as the expected value of X. Some people call it the average value
of X; that is not a very nice terminology, and it is better to say mean of X or expected value of X.

So maybe I should put a small strikethrough on 'average value'; what it means is sort of clear, but
it is maybe not good terminology here. We will see in examples that E[X] may or may not belong to
the range of X; it may not be a value that X actually takes, even though it is the average value.
For instance, if X takes the values 1 and 2, the average of 1 and 2 is 1.5, which is neither 1 nor
2; it can happen like that.

So, usually people give units for the expected value: if your random variable X is something you
are measuring in a physical experiment and has units, the expected value of X will also have the
same units. You are taking the same values t and multiplying them by probabilities, which are
unitless fractions, so the units are not lost; you can keep the same units for the expected value
of X. That was a quick definition; hopefully you followed it.

(Refer Slide Time: 11:59)

Let us see a few quick examples of how you can easily do the summation given the PMF. Here
is an X which has a Bernoulli distribution; you know what the Bernoulli distribution is:
Bernoulli(p) is 1 with probability p and 0 with probability 1 − p. It is an easy
distribution and very easy to compute the expected value of 𝑋, if you say expected value of 𝑋, it
is 0(1 − 𝑝) + 1(𝑝) ; the value it takes multiplied by probability with which it takes its value and
it just becomes equal to p.

So, notice that for a Bernoulli random variable this is also equal to P(X = 1). This is true only
for Bernoulli, but for a Bernoulli random variable you see that it works out as a very nice result.
It is very easy to sum, and similar problems for other distributions are also very easy. So, if
X is uniform on 1 to 6, you know that it takes each value with probability 1/6.

So, if you want the expected value of X, it is just
1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6). You just have to
do this calculation, and it comes to 3.5; you can check that this is true. And notice that in both
examples you got an expected value which is clearly not in the range. In the Bernoulli case, you
got the answer p, which could be 0.5, 0.7, 0.2, 0.1 and so on, while the random variable takes only
0 or 1; and the random variable uniform from 1 to 6 took the expected value 3.5.

The expected value will definitely lie between the minimum and the maximum, but it need not be a
value that the random variable actually takes. So, here is another example, just to show the units
aspect of the expected value. Let us say somebody gives you the value of a lottery ticket; it is a
random variable, and its distribution is described like this. You remember the way I write a
distribution: I write each value together with the probability with which the random variable takes
it. This is a very simple way of describing a distribution.

So, here is a random variable which takes the value 200 with probability 1/1000, 20 with
probability 27/1000, and 0 with probability 972/1000. So, you take each value, multiply it by the
probability with which the random variable takes that value, and add it all up:
200·(1/1000) + 20·(27/1000) + 0·(972/1000) = 0.74 as the total expected value of a lottery ticket.
So, most people will look at 200 rupees and say, okay, this lottery might give me 200 rupees; but
look at the way the probabilities work: 200 happens with very low probability.

So, even though there is a possibility that you will get 200 rupees, the expected value of the
random variable, the average you would make over the long term, is only 0.74 rupees. So, unless the
lottery ticket costs less than about 74 paise, you are not going to gain. Even if the ticket is 2
rupees or 5 rupees, you are going to lose, because your expected gain is only 0.74 rupees while you
would have spent so many rupees in getting the lottery ticket. So, these are very important
calculations; before you buy a lottery ticket, you had better do this calculation and know what the
expected value is.
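You can reproduce the lottery computation exactly with fractions; a quick sketch (my own, not from the lecture):

```python
from fractions import Fraction

# value of the ticket -> probability of that value
lottery = {200: Fraction(1, 1000), 20: Fraction(27, 1000), 0: Fraction(972, 1000)}

assert sum(lottery.values()) == 1          # valid pmf
ev = sum(t * p for t, p in lottery.items())  # expected value, in rupees
print(float(ev))
```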

Here is one more random variable that I cooked up, just to show you that once I give you the
distribution of the random variable, you can very easily find the expected value of it just by
doing this: multiplying each value by its probability and adding it all up to get your answer. So,
that is a fine, simple example. But life is not always so simple, is it? It is going to quickly get
a little bit more complicated.
(Refer Slide Time: 15:50)

And let us begin with a slightly more complicated situation where I have a more general uniform case, but uniform on a sequence. So, you have a random variable which is uniform from 𝑎 to 𝑏, which takes the values 𝑎, 𝑎 + 1, 𝑎 + 2, and so on till 𝑏. So, it is uniform on a sequence of values from 𝑎 to 𝑏, both included. If you now do the expected value: since it is uniform and the number of values from 𝑎 to 𝑏 is 𝑏 − 𝑎 + 1, each value occurs with probability 1/(𝑏 − 𝑎 + 1). So,

𝐸[𝑋] = 𝑎 × 1/(𝑏 − 𝑎 + 1) + (𝑎 + 1) × 1/(𝑏 − 𝑎 + 1) + ⋯ + 𝑏 × 1/(𝑏 − 𝑎 + 1).

So, now how do you simplify this summation? The factor 1/(𝑏 − 𝑎 + 1) is common to every term, which is very nice: you can pull it out. But still you will be left with 𝑎 + (𝑎 + 1) + ⋯ + 𝑏. When you want to add a sequence of consecutive integers, there is this very simple formula you can use:

𝑎 + (𝑎 + 1) + ⋯ + 𝑏 = (𝑏 − 𝑎 + 1) × (𝑎 + 𝑏)/2.

There are various other formulae like this; this is a very useful identity, and you can use it to simplify. Once you plug it in, you see that the expected value becomes (𝑎 + 𝑏)/2.

So, it is sort of intuitive why it should be (𝑎 + 𝑏)/2: you take values from 𝑎 to 𝑏, all uniform, so the average value or the expected value should be (𝑎 + 𝑏)/2. And see how it ended up being exactly that in a very simple situation. This example should already give you an idea of what can happen in these expected value computations. If the summations become more and more complicated, you can get into some tough situations in finding the expected value. But if it is small and easy enough, we can do it.
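This identity is easy to sanity-check numerically; a minimal sketch, where 𝑎 = 3 and 𝑏 = 11 are arbitrary illustrative choices:

```python
# Direct check of E[X] = (a + b)/2 for X uniform on {a, a+1, ..., b}.
# a = 3, b = 11 is an arbitrary illustrative choice.
a, b = 3, 11
n_values = b - a + 1                         # number of equally likely values
mean = sum(t / n_values for t in range(a, b + 1))
print(mean)   # close to (3 + 11)/2 = 7
```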

So, what I am going to do in the next 2-3 slides is take up some common distributions we have seen before. We took Bernoulli; we will take binomial, Poisson, geometric and a few other distributions, and try and execute the summation. I will go through it a little bit fast, but I will also try to give you some reason as to why these summations work out. The final answer is what is most important.

The expected value is good to remember by heart: in your mind you should know what the expected value of each random variable is. Even if you do not remember every detail of the proof, it is okay. But the proof is also quite easy and interesting in its own way, and it will help you learn how to do summations, which is always an interesting topic.

(Refer Slide Time: 18:26)


So, let us start looking at more involved summations. The first thing we look at is the geometric random variable. You remember geometric: 𝑋 takes values 1, 2, 3, and so on, all the way to infinity, and the probability with which it takes value 𝑘 is (1 − 𝑝)^(𝑘−1) 𝑝.

So, for the expected value, I have to do 𝐸[𝑋] = ∑_{𝑡=1}^{∞} 𝑡 (1 − 𝑝)^(𝑡−1) 𝑝. How do you sum this? Written out, it looks like 𝑝 + 2(1 − 𝑝)𝑝 + 3(1 − 𝑝)^2 𝑝 + ⋯, with coefficients 2, 3, 4, and so on multiplying 1 − 𝑝, (1 − 𝑝)^2, and so on.

So, this kind of summation, you see, is a little bit non-trivial; it is not immediately clear how to sum these things. Even for the geometric random variable, these kinds of summations can show up. And for the geometric random variable it is good to know the expected value, because what does geometric represent? The number of attempts you have to make till you see the first success. So, if you are looking at this repeated multiple times, it is good to know what the average number of attempts would be. The expected value is good to know for the geometric random variable.

And it seems like we have to do a non-trivial summation to get to it. We will do it soon enough, but I just want to show you how it can become non-trivial. If you go to Poisson, again you have the same sort of situation. The probability with which the Poisson random variable takes the value 𝑡 is 𝑒^(−𝜆) 𝜆^𝑡/𝑡!, so ∑_{𝑡=0}^{∞} 𝑡 𝑒^(−𝜆) 𝜆^𝑡/𝑡! is the summation we have to execute to find the expected value of a Poisson random variable.

I already showed you how the Poisson random variable shows up in practice. It is a very useful random variable to know the expected value for, but to calculate that expected value, it looks like you have to do a complicated summation. We will see it soon enough, but I just want to convince you first that it is a summation slightly more involved than your typical summation.
(Refer Slide Time: 20:45)

Even if you go to the binomial, which seems like one of the easy distributions, the one that shows up from repeated Bernoulli trials, you see the expected value is confusing to calculate. This is once again the summation: when you have Binomial(𝑛, 𝑝), the probability that it takes the value 𝑡 is 𝑛𝐶𝑡 𝑝^𝑡 (1 − 𝑝)^(𝑛−𝑡), so the expected value is ∑_{𝑡=0}^{𝑛} 𝑡 × 𝑛𝐶𝑡 𝑝^𝑡 (1 − 𝑝)^(𝑛−𝑡). Some of us might also write 𝑛𝐶𝑡 as "𝑛 choose 𝑡"; you know the formula for it, it is just 𝑛!/(𝑡!(𝑛 − 𝑡)!). So, you see, even for the common random variables that we saw, geometric, Poisson, binomial, this summation is involved. And if you go to the other ones like negative binomial and hypergeometric, it is going to get even more involved. So, how do you do summations like this?

I am going to quickly show you the proof, show you the method for evaluating these summations, and give you the answer. Like I said, the answer is more important than the proof here, but it is also good to know the proof, and good to know how to do these kinds of summations.
(Refer Slide Time: 21:58)

So, one crucial idea is this kind of difference equation idea. Suppose you have a sequence that you need to sum, ∑_{𝑡=1}^{𝑛} 𝑎_𝑡, and 𝑎_𝑡 satisfies a property of this kind: 𝑎_{𝑡+1} − 𝑟𝑎_𝑡 = 𝑏_𝑡 for some 𝑟 ≠ 1, where 𝑏_𝑡 is another sequence. In general 𝑏_𝑡 could be anything; it could even be the trivial all-zero sequence.

Then it turns out the summation is given by this formula:

∑_{𝑡=1}^{𝑛} 𝑎_𝑡 = (𝑎_1 − 𝑟𝑎_𝑛)/(1 − 𝑟) + 1/(1 − 𝑟) × ∑_{𝑡=1}^{𝑛−1} 𝑏_𝑡.

This is actually very easy to prove: you write down one equation after the other, multiply by 𝑟 and subtract, and everything cancels and lines up nicely. It is not very hard to prove; I am not going to prove it here, but you can write it out yourself.

And this is a very important ingredient in many of these summations; in particular, geometric-like summations, I mean geometric progressions, you can solve using this. So, that is one idea. Now another is the geometric progression itself. In a geometric progression, 𝑏_𝑡 is 0 and 𝑎_{𝑡+1} = 𝑟𝑎_𝑡: every time you go to the next term in the sequence, you keep multiplying by 𝑟, and the summation is just (𝑎_1 − 𝑟𝑎_𝑛)/(1 − 𝑟).

So, if |𝑟| < 1 and 𝑛 grows very, very large, this summation goes to 𝑎_1/(1 − 𝑟). This is very important: when 𝑛 goes to infinity the sum is 𝑎_1/(1 − 𝑟), but the absolute value of 𝑟 should be less than 1 for this formula to hold. That is the geometric progression. We also have the exponential function; this is a useful identity to know: 𝑒^𝜆 = ∑_{𝑡=0}^{∞} 𝜆^𝑡/𝑡!.

So, if you multiply both sides by 𝑒^(−𝜆), you get 1 = ∑_{𝑡=0}^{∞} 𝑒^(−𝜆) 𝜆^𝑡/𝑡!. This is the same property that makes the Poisson PMF a valid PMF: you add up all the probabilities of the Poisson random variable, you should get 1, and that happens because of this identity. So, this exponential identity is very useful, particularly when you want to work with the Poisson distribution. The first two tools are useful when you work with the geometric distribution; this one is useful for the Poisson.

For the binomial, this one is very useful; I think I have referred to it once before. The binomial theorem says (𝑎 + 𝑏)^𝑛 = ∑_{𝑘=0}^{𝑛} 𝑛𝐶𝑘 𝑎^𝑘 𝑏^(𝑛−𝑘). A particular thing to note here: if you set 𝑎 = 𝑝 and 𝑏 = 1 − 𝑝, the left hand side becomes 1, and on the right hand side you get the binomial PMF values. So, ∑_{𝑘=0}^{𝑛} 𝑛𝐶𝑘 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘) = 1: this is the binomial PMF adding up to 1, a very simple way of saying it. So, these kinds of identities are very useful in the summations; once again I am going a little bit quicker, but hopefully you see that.
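These three tools are easy to check numerically by truncating the infinite sums at a large cutoff; a small sketch, where the parameter values are illustrative choices:

```python
import math

# Numerical sanity checks of the three tools, truncating the infinite sums:
# the geometric progression, the exponential series behind the Poisson PMF,
# and the binomial theorem with a = p, b = 1 - p.
r, lam, n, p = 0.6, 4.0, 10, 0.3

# 1) Geometric progression a_1 * r^(t-1): the sum tends to a_1 / (1 - r).
a1 = 1.0
geom_sum = sum(a1 * r**(t - 1) for t in range(1, 2001))

# 2) Poisson PMF sums to 1: sum over t of e^(-lam) * lam^t / t!.
#    The term is built incrementally to avoid huge factorials.
term = math.exp(-lam)            # t = 0 term
poisson_sum = term
for t in range(1, 100):
    term *= lam / t              # now term = e^(-lam) * lam^t / t!
    poisson_sum += term

# 3) Binomial PMF sums to 1: sum over k of C(n,k) p^k (1-p)^(n-k).
binom_sum = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n + 1))

print(geom_sum, poisson_sum, binom_sum)
```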
(Refer Slide Time: 25:57)

So, once you have those tools, the tools that are useful for simplifying the summation, it is easy to work out the geometric expected value. For the geometric summation, 𝐸[𝑋] = ∑_{𝑡=1}^{∞} 𝑡(1 − 𝑝)^(𝑡−1) 𝑝, and you can quickly identify that this follows the difference equation. Let 𝑎_𝑡 = 𝑡(1 − 𝑝)^(𝑡−1) 𝑝, so that 𝑎_{𝑡+1} = (𝑡 + 1)(1 − 𝑝)^𝑡 𝑝. If you do 𝑎_{𝑡+1} − (1 − 𝑝)𝑎_𝑡, you get (1 − 𝑝)^𝑡 𝑝; you can check that this is true. So, this gives you the difference equation picture with 𝑎_1 = 𝑝, 𝑟 = 1 − 𝑝 and 𝑏_𝑡 = 𝑟^𝑡 𝑝.

So, once you plug it in, you use the full formula (𝑎_1 − 𝑟𝑎_𝑛)/(1 − 𝑟) + 1/(1 − 𝑟) × ∑_{𝑡=1}^{𝑛−1} 𝑏_𝑡; you have to do some work here, plugging the difference equation into it. You do that and let 𝑛 tend to infinity: since 𝑟 = 1 − 𝑝 < 1, terms like 𝑟^𝑛 go to 0, and you will get 1/𝑝.

So, this 1/𝑝 is very important: the expected value of Geometric(𝑝) is 1/𝑝. Nice, is it not? In spite of this nasty looking summation, using some simple ideas you are able to simplify and get 1/𝑝. I did not go into great detail here, but that is the expected value. Good to know.
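As a quick numerical check of 𝐸[𝑋] = 1/𝑝, the infinite sum can be truncated: the terms decay geometrically, so a few thousand of them are plenty (𝑝 = 0.25 is an arbitrary illustrative choice):

```python
# Truncated-sum check that E[X] = 1/p for X ~ Geometric(p), X in {1, 2, ...}.
# p = 0.25 is an arbitrary illustrative choice.
p = 0.25
mean = sum(t * (1 - p)**(t - 1) * p for t in range(1, 10001))
print(mean, 1 / p)   # mean ≈ 4.0 = 1/p
```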

(Refer Slide Time: 28:24)


So, now for Poisson you have to do a little bit more trickery. You have ∑_{𝑡=1}^{∞} 𝑡 𝑒^(−𝜆) 𝜆^𝑡/𝑡!. The 𝑡 and the 𝑡! will partially cancel; remember 𝑡! has a 𝑡 in it, is it not? In 2 × 2! you can cancel the 2, in 3 × 3! you can cancel the 3, and in general 𝑡/𝑡! = 1/(𝑡 − 1)!.

And then you pull one 𝜆 out: you are going to get 𝜆𝑒^(−𝜆) ∑_{𝑡=1}^{∞} 𝜆^(𝑡−1)/(𝑡 − 1)!, and what you identify in that summation is just 𝑒^𝜆. So, you get 𝜆𝑒^(−𝜆) × 𝑒^𝜆, which is simply 𝜆. Once again, the most important thing is that the expected value is 𝜆 for the Poisson distribution. Poisson(𝜆), expected value 𝜆. Easy enough to understand.

Now, if you repeat something similar for the binomial, I am really not going to go into detail here; you can just go ahead and simplify, and finally you will get 𝑛𝑝. It comes down once again to cancelling the 𝑡: you cancel the 𝑡 against the 𝑡! to get a (𝑡 − 1)!, and then you pull one 𝑛 out; let me skip the detail. The upshot is once again that 𝐸[𝑋] = 𝑛𝑝.

So, for these 3 distributions, while the calculation looks a bit confusing if you have to do it, the expected values are very, very simple. These things you should just mug up; you should know these answers. For Binomial(𝑛, 𝑝) the expected value is 𝑛𝑝. For Poisson with parameter 𝜆 the expected value is 𝜆. For Geometric(𝑝) the expected value is 1/𝑝. Remember one thing very carefully about the geometric distribution: we are starting at 1. Our geometric distribution starts at 1 and you get expected value 1/𝑝.

A lot of people start the geometric distribution at 0; I think it is not a good idea. Starting at 1 is the correct idea, I think, so 1/𝑝 is good; but if you start at 0, you will get 1/𝑝 − 1, there will be a subtraction of 1. Anyway, these kinds of more involved summations will happen sometimes in expected value computations. If you know the methods, you can go through them, but even if you do not, you can just look it up in Wikipedia or any of the other websites; there will be a clear page on how the calculation is done, how the simplification is done, and how you get the answer for the expected value of a random variable. For common distributions these things are easy to find.
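The Poisson and binomial closed forms can also be checked by executing the summations directly; a small sketch, with 𝜆, 𝑛 and 𝑝 as illustrative choices (the Poisson sum is truncated, the binomial sum is finite and hence exact up to rounding):

```python
import math

# Checking the closed forms by executing the summations:
# E[X] = lam for Poisson(lam), and E[X] = n*p for Binomial(n, p).
lam, n, p = 6.0, 20, 0.3

# Poisson mean: sum over t of t * e^(-lam) * lam^t / t!, truncated at 150.
term = math.exp(-lam)            # PMF at t = 0
poisson_mean = 0.0
for t in range(1, 150):
    term *= lam / t              # PMF at t
    poisson_mean += t * term

# Binomial mean: a finite sum over t = 0, ..., n.
binom_mean = sum(t * math.comb(n, t) * p**t * (1 - p)**(n - t)
                 for t in range(n + 1))

print(poisson_mean, binom_mean)   # both close to 6
```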

(Refer Slide Time: 31:08)

So, I also wanted to show you the picture of the PMF and where the expected value falls within it, so that you can see it. Let us say you look at the Bernoulli(0.3) distribution. It takes value 0 with probability 0.7 and value 1 with probability 0.3; its expected value is 0.3, somewhere in the middle. And if you look at Binomial(20, 0.3), the expected value is 20 × 0.3, which is 6. This is what we saw for the binomial, and you see the expected value is where the peak of the distribution comes, in some sense.

That is not always true: for Bernoulli(0.3) the expected value is 0.3, and where is the peak there? Here is the geometric distribution: once again the expected value is 1/0.3, which is roughly 3.33 or so. That is the expected value, and you can see the Geometric(0.3) PMF decaying. This is Poisson; for Poisson once again the expected value of 𝑋 is just 𝜆, which is 6 here.

Again, this Poisson is looking so similar to the binomial PMF, is it not? There seems to be a good reason for that, and you see that it has a nice picture and the expected value comes somewhat in the middle. So, it is a good thing to see where the expected value will be. The expected value will definitely be inside the range of the random variable, but it need not be a value that the random variable takes itself. It appears to represent the peak of the PMF when there is something like a peak; that is not a bad intuition, but clearly it is not exactly true, is it?

Here is the geometric distribution: the peak happens at 1, but the expected value is 3 point something, a bit away. The expected value sort of represents the whole PMF in one number, so it will be somewhere in the middle. So, hopefully this gave you a sense of the calculations involved in computing the expected value. Given the distribution, it does not seem too hard, except that in some cases the summation may become a little bit complicated. We will stop here and pick up from here in the next lecture.
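The long-run-average reading of the expected value can also be seen by simulation; a rough Monte Carlo sketch (the sample sizes and the seed are arbitrary choices), where each sample mean lands close to the expected value read off the plots:

```python
import random

# Monte Carlo sketch: long-run sample averages sit near the expected values
# from the plots: Bernoulli(0.3) -> 0.3, Binomial(20, 0.3) -> 6,
# Geometric(0.3) -> 1/0.3.
random.seed(0)
N = 100_000

bern = [1 if random.random() < 0.3 else 0 for _ in range(N)]

binom = [sum(1 if random.random() < 0.3 else 0 for _ in range(20))
         for _ in range(5_000)]

def geometric(p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    t = 1
    while random.random() >= p:
        t += 1
    return t

geom = [geometric(0.3) for _ in range(N)]

print(sum(bern) / len(bern))    # close to 0.3
print(sum(binom) / len(binom))  # close to 6
print(sum(geom) / len(geom))    # close to 1/0.3 = 3.33...
```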
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 4.2
Properties of expected value
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. In this lecture we are going to start looking at the properties of expected value. The expected value seemed like a simple definition given the PMF; we will see it has some very surprising, powerful properties, and ultimately, when we connect it up with practice, it has even more surprising properties.

(Refer Slide Time: 0:33)


So, just to motivate a little bit, let me go back and complete this little casino math that we have been doing: expected value of gain, etcetera. Here is the strategy that somebody has come up with: you bet 1 unit of money with probability 𝑝_1 each on "under 7" and "over 7", and with probability 1 − 2𝑝_1 on "equal to 7". So, with probability 𝑝_1 you are going to say under 7, with the same probability 𝑝_1 you are going to say over 7, and with probability 1 − 2𝑝_1 you say equal to 7; the value of 𝑝_1 could be 0.3 or something. Then what is the expected gain?

So, this is an interesting question here. With this betting strategy, the probability 𝑝_1 enters the picture, so the expected gain is going to be

𝑝_1 × (5/12 × 0 + 7/12 × (−1)) + 𝑝_1 × (5/12 × 0 + 7/12 × (−1)) + (1 − 2𝑝_1) × (1/6 × 3 + 5/6 × (−1)),

where the first term is the case where I bet under 7, the second term is the case where I bet over 7, and the third is the equal-to-7 bet, is it not?

So, it is a straightforward calculation here. If you simplify, the 0s will not play any role; each of the first two terms gives −7𝑝_1/12, so together they give −7𝑝_1/6. Taking (1 − 2𝑝_1) common in the last term, 3/6 − 5/6 is −2/6, so it is −(1 − 2𝑝_1) × 2/6. Anyway the news is quite bad: if you add this up, you are going to get

−7𝑝_1/6 + (−2 + 4𝑝_1)/6 = (−2 − 3𝑝_1)/6.

So, whatever you choose your 𝑝_1 to be, you are going to get a negative expected gain; this is negative for all 𝑝_1. You could choose 𝑝_1 to be 1/2 for instance, and you will get −7/12, still negative; you could choose 𝑝_1 to be 0 for instance, always betting equal to 7, and you will get −1/3.

So, whatever you do, the expected value is negative. And this is how casinos and other betting companies play the numbers: they decide on the gains so that the expected gain is negative for you, and when it goes negative, in the long run you are going to lose. You are definitely going to lose in the long run, and this is why these are useful calculations.
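A small sketch of this calculation as code, using the payoffs described above (an under-7 or over-7 bet gains 0 on a win and −1 on a loss; the equal-to-7 bet gains 3 on a win and −1 on a loss):

```python
# Expected gain of the betting strategy as a function of p1.
def expected_gain(p1):
    under = p1 * (5/12 * 0 + 7/12 * (-1))          # bet on "under 7"
    over = p1 * (5/12 * 0 + 7/12 * (-1))           # bet on "over 7"
    equal = (1 - 2*p1) * (1/6 * 3 + 5/6 * (-1))    # bet on "equal to 7"
    return under + over + equal                    # = -(2 + 3*p1)/6

for p1 in [0, 0.1, 0.25, 0.5]:
    print(p1, expected_gain(p1))   # negative for every choice of p1
```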

So, you see this kind of property for the expected value. Why it represents the average value over a long period is something we will see much later; it is much more practical and much more important. But before that, this way of calculating expected value should be made simpler, and you should understand it better. So far we have been looking at only one random variable, but typically you will have multiple random variables in the picture, is it not? In those kinds of scenarios, how do you handle expectation? Expectation ends up being very useful in those scenarios as well.

So, our properties will be focused mostly on those kinds of things: how to simplify expectations, and how to understand expectation from that point of view. We will start with that. There are some very powerful properties of expectation that end up being extremely valuable in practice; we will focus on those primarily, and then later on we will see some other properties of expected value which give you a sense of why the long-term average, when you observe several instances, is the expected value. We will see that maybe not even this week, maybe in one of the later weeks.
(Refer Slide Time: 5:06)

So, the first thing: let us deal with two special types of random variables. We are familiar with this expression, 𝐸[𝑋] = ∑_{𝑡∈𝑇_𝑋} 𝑡 𝑃(𝑋 = 𝑡). We will consider two cases first. A constant random variable: what is a constant random variable? It seems like a bizarre thing, but it is a random variable 𝑋 whose distribution is 𝑃(𝑋 = 𝑐) = 1; that is it, it takes the value 𝑐 with probability 1.

So, if that is the case, then 𝐸[𝑋] = 𝑐, and we write 𝐸[𝑐] = 𝑐. This is a useful identity to remember. Normally you do not think of a constant number as a random variable, but you can: take the constant 𝑐, think of it as a random variable which takes the value 𝑐 with probability 1, and simply say 𝐸[𝑐] = 𝑐. This is something you can do. So, this is the first property, a very simple property.

The next one is when 𝑋 is non-negative. Suppose a random variable is non-negative, that is, greater than or equal to 0. Then you can show 𝐸[𝑋] is also greater than or equal to 0. This is one of those very simple results: every 𝑡 in the range is non-negative, the probability is anyway non-negative, so every term, and hence the whole summation, is greater than or equal to 0. It is a very simple proof. So, these two are easy properties, important enough but very easy to write down.
(Refer Slide Time: 6:39)

This is probably the most central and most important property of the expected value. It is slightly unexpected; a lot of people get surprised by this property the first time they see it, but it is a very powerful property, and it is one of the most crucial reasons why expected value is so popular in the analysis of any probabilistic phenomenon.

So, let me describe what it is. It is a bit of a definition, but the central point is this: there are 𝑛 random variables, say 𝑋_1, 𝑋_2 to 𝑋_𝑛, and I am defining a new random variable 𝑌 as a function of these 𝑛 random variables. This is the most crucial thing to focus on in this whole definition; the other parts are the usual, very typical setup.
I am defining my random variable 𝑌 not directly by giving its distribution; I am saying 𝑌 = 𝑔(𝑋_1, …, 𝑋_𝑛) is a function of 𝑛 other random variables. This can happen in practice: quite often a random variable we see in an experiment depends on so many other random events and choices, and finally this is the result. So, thinking of it as a function of 𝑛 random variables with some joint PMF is very important.

Now, once you give me a function like this, I can go and find out its PMF: I can find the range 𝑇_𝑌 and the PMF 𝑓_𝑌 of 𝑌. Once I know 𝑇_𝑌 and 𝑓_𝑌, I can find the expected value 𝐸[𝑌] = ∑_{𝑡∈𝑇_𝑌} 𝑡 𝑓_𝑌(𝑡); if I want 𝐸[𝑔(𝑋_1, …, 𝑋_𝑛)], it is the same as 𝐸[𝑌], and I might compute it like this. But what is surprising is that you do not need the PMF of 𝑌; you can use only the joint PMF of 𝑋_1 to 𝑋_𝑛, and the same expected value becomes

𝐸[𝑔(𝑋_1, …, 𝑋_𝑛)] = ∑_{(𝑡_1,…,𝑡_𝑛)} 𝑔(𝑡_1, …, 𝑡_𝑛) 𝑓_{𝑋_1,…,𝑋_𝑛}(𝑡_1, …, 𝑡_𝑛).

So, notice the difference between these two expressions: the first one is in terms of the PMF of 𝑌; the next one does not use the PMF of 𝑌 at all. The function is used, and you just sum, over all possible values that the random variables 𝑋_1 to 𝑋_𝑛 can take, the function times the joint PMF of 𝑋_1 to 𝑋_𝑛. So, you do not need to find the distribution of 𝑌 to find the expected value of 𝑌 when it is a function of 𝑛 random variables. This is a very, very important property that this theorem is pointing out.

So, you do not need 𝑓_𝑌. You could find 𝑓_𝑌 and then the expected value from that distribution; it is not wrong, but you do not need it, the joint PMF is good enough. On the face of it this property seems really simple, but it has so many powerful consequences; it is used again and again to derive so many other useful things. I do not want to really prove this property; it is not very hard to prove, and maybe later on we will see it, but you can intuitively accept it.

You have the expected value of a function of several random variables; how do you compute it? You simply add up, over all the possible values that the random variables take, the function times the joint PMF of all those random variables. Simple enough. In this course, we will simply take this as given; I do not want to go into the proof here, but it is a very easy and intuitive result to understand.
(Refer Slide Time: 10:06)

So, instead of proving it, I will just prove by illustration: I will show you a few examples and illuminate how both these expectations are the same. Here is a simple example. 𝑋 is a uniform random variable on −2, −1, 0, 1, 2. Let us look at the function 𝑔(𝑋) = 𝑋^2; it takes the values 0, 1, 4, and you can find the probabilities.

We have seen this before; a few lectures back we saw how to find the PMF of a function of a random variable. You can use the table method here if you like: just write it down as a table and calculate. You get that 𝑋^2 is 0 with probability 1/5, 1 with probability 2/5, and 4 with probability 2/5.

Now, if you want to find 𝐸[𝑔(𝑋)], you can do it in two ways. One way is to find the PMF of 𝑔(𝑋), which is what I found, and multiply like this: 0 × 1/5 + 1 × 2/5 + 4 × 2/5, and you get 2. You can also do it the other way: you simply use 𝑔(𝑋) directly together with the PMF of 𝑋, that is, (−2)^2 × 1/5 + (−1)^2 × 1/5 + 0^2 × 1/5 + 1^2 × 1/5 + 2^2 × 1/5. Both of them lead to the same answer, 2. This is what the theorem was telling you.

Instead of finding the PMF of 𝑔(𝑋) and then finding its expectation using that PMF, you can simply sum over all values that 𝑋 takes, plugging in the function 𝑔 at each value times the probability of 𝑋 taking that value, and you will get the same thing.

So, I have done both here, one directly and one after finding the distribution of 𝑔(𝑋); hopefully you see how both of them give you the same answer. Like I said, I am simply going to illustrate this result instead of proving it. Now I am going to take two random variables; in the previous example I took only one random variable. Notice how I am writing the distribution; this is just some shorthand.

I am going to take (𝑋, 𝑌) uniform over six pairs, so each pair has probability 1/6. How do you read this? It is basically 𝑃(𝑋 = 0, 𝑌 = 0) = 1/6, 𝑃(𝑋 = 1, 𝑌 = 0) = 1/6, and so on. And here is a complicated looking function, 𝑔(𝑋, 𝑌) = 𝑋^2 + 𝑋𝑌 + 𝑌^2; you can compute its distribution using the table method.

How do you write the table? You write 𝑥, 𝑦, then 𝑔(𝑥, 𝑦), and then the probability. The pairs are (0, 0), (1, 0), (0, 1), (1, 1), (−1, 1), (1, −1), and 𝑔 evaluates to 0, 1, 1, 3, 1, 1 respectively, each with probability 1/6. From here you can quickly find that 𝑃(𝑔(𝑋, 𝑌) = 0) = 1/6, 𝑃(𝑔(𝑋, 𝑌) = 1) = 4/6, and 𝑃(𝑔(𝑋, 𝑌) = 3) = 1/6. That is the distribution I am talking about.


(Refer Slide Time: 13:50)

So, you can compute 𝐸[𝑔(𝑋, 𝑌)] once again in two different ways. After you find the distribution of 𝑔(𝑋, 𝑌), you get 0 × 1/6 + 1 × 4/6 + 3 × 1/6 = 7/6. Or you can go back to the original joint distribution and add up 𝑔(𝑡_1, 𝑡_2) × 𝑃(𝑋 = 𝑡_1, 𝑌 = 𝑡_2) over all six pairs, from 𝑔(0, 0) × 𝑃(𝑋 = 0, 𝑌 = 0) to 𝑔(1, −1) × 𝑃(𝑋 = 1, 𝑌 = −1). You get 7/6 here and 7/6 there; both of them are equal.

So, this is just an illustration, not a proof; the proof is a bit messy with some notation, and I do not want to do it here. Hopefully you understand this: it is a very central and important idea in the area of expectation. It is very simple, but you can do the expectation both ways. If you have a function of 𝑛 random variables, you do not need to find the distribution of that function; you can simply use the original joint distribution and add up over all possible values, and you will get the same expected value.
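Here is the same two-variable example as a short sketch, computing the expectation both ways:

```python
# The function property illustrated on g(X, Y) = X^2 + X*Y + Y^2, with
# (X, Y) uniform over the six pairs from the example above.
pairs = [(0, 0), (1, 0), (0, 1), (1, 1), (-1, 1), (1, -1)]
joint = {pair: 1/6 for pair in pairs}

def g(x, y):
    return x*x + x*y + y*y

# Way 1: first find the PMF of g(X, Y), then sum value * probability.
pmf_g = {}
for (x, y), prob in joint.items():
    pmf_g[g(x, y)] = pmf_g.get(g(x, y), 0) + prob
e_via_pmf = sum(v * prob for v, prob in pmf_g.items())

# Way 2: sum g(x, y) * f_{X,Y}(x, y) directly; no PMF of g(X, Y) needed.
e_direct = sum(g(x, y) * prob for (x, y), prob in joint.items())

print(e_via_pmf, e_direct)   # both ≈ 7/6
```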
(Refer Slide Time: 15:01)

So, this function property leads to this wonderful thing called linearity of the expected value: the expectation, or the expected value, is linear in some sense. What is linear? If you take 𝐸[𝑐𝑋], it is the same as taking 𝐸[𝑋] and then multiplying by 𝑐. The proof is actually quite simple; it is written down here, and I do not want to go through it, but this is an important property to know. If you have a random variable multiplied by a constant, so 2𝑋, 3𝑋, 4𝑋, 100𝑋, whatever, you can take that constant out and simply take the expected value of the random variable alone. That is the first property in the linearity.
The next one is much more interesting: for any two random variables 𝑋 and 𝑌, 𝐸[𝑋 + 𝑌] = 𝐸[𝑋] + 𝐸[𝑌]. Once again I want to emphasize this: a lot of people think 𝑋 and 𝑌 have to be independent, that only then is 𝐸[𝑋 + 𝑌] equal to 𝐸[𝑋] + 𝐸[𝑌]. That is not true at all. For any two random variables 𝑋 and 𝑌, dependent or independent, 𝐸[𝑋 + 𝑌] equals 𝐸[𝑋] + 𝐸[𝑌].

The proof is very simple: take the joint PMF and apply the function property to 𝑔(𝑋, 𝑌) = 𝑋 + 𝑌; the summation distributes over the two terms. You need all these summations to exist, which we will assume. The first piece, the sum of 𝑡_1 times the joint PMF, is just 𝐸[𝑋]; the second piece, with 𝑡_2, is just 𝐸[𝑌]. That is it, as simple as that. You do not need 𝑋 + 𝑌 any more; you can do 𝑋 separately and 𝑌 separately, and add them up.

This little property here is extremely useful, one of the most useful properties. In fact, you can combine the two and write 𝐸[𝑎𝑋 + 𝑏𝑌] = 𝑎𝐸[𝑋] + 𝑏𝐸[𝑌]. Once again, this is very useful; it simplifies so many computations around expected value. And the proof, you can see, is very simple; this linearity property is very crucial.
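A tiny sketch of the "no independence needed" point, using a deliberately dependent pair (𝑌 = 7 − 𝑋 on a fair die; a made-up example for illustration):

```python
# Linearity without independence: X is uniform on {1, ..., 6} and Y = 7 - X
# is completely determined by X, yet E[X + Y] = E[X] + E[Y] still holds.
# (Here X + Y is the constant 7, so its expected value must be 7.)
joint = {(t, 7 - t): 1/6 for t in range(1, 7)}

e_sum = sum((x + y) * prob for (x, y), prob in joint.items())
e_x = sum(x * prob for (x, y), prob in joint.items())
e_y = sum(y * prob for (x, y), prob in joint.items())

print(e_sum, e_x + e_y)   # both ≈ 7.0
```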

(Refer Slide Time: 17:28)

So, let us take a few examples. Let me go back to this example with 𝑔(𝑋, 𝑌) = 𝑋^2 + 𝑋𝑌 + 𝑌^2; we saw before that 𝐸[𝑔(𝑋, 𝑌)] = 7/6. One more way to do this expected value is to write 𝐸[𝑋^2 + 𝑋𝑌 + 𝑌^2] and use linearity: for the expected value of the sum of 3 different random variables, the expectation distributes over the summation, and you get 𝐸[𝑋^2] + 𝐸[𝑋𝑌] + 𝐸[𝑌^2]. You can find each of these things, and they will all add up to give you the same 7/6.

So, that is one more way in which the expected value helps you. If you look at this very closely, you see the following: suppose I only have the marginal distributions of 𝑋 and 𝑌. I know the marginal 𝑓_𝑋, I know the marginal 𝑓_𝑌, but I do not know the joint PMF; it is not given to me, maybe I cannot even estimate it. It turns out you can still compute 𝐸[𝑋 + 𝑌]. Why is that? Because 𝐸[𝑋 + 𝑌] is 𝐸[𝑋] + 𝐸[𝑌].

So, I can find each of the expectations separately, and 𝐸[𝑋 + 𝑌] is simply the sum of the two expectations. In fact, this is independent of the joint PMF; it does not depend on the joint PMF at all. For any joint PMF, you can find expected values of this form: if the quantity is something involving 𝑋 alone plus something involving 𝑌 alone, you simply distribute the expected value inside, and you need only the marginals, nothing else. Nice to know, is it not? This is a very powerful property for computing the expected value.

(Refer Slide Time: 19:10)

So, here is a simple little problem where I can use this property. So, you throw a fair die twice,
what is the expected value of the sum of the two numbers seen? Let us 𝑠𝑎𝑦 𝑋 is the first number,
𝑌 is the second number, number on the first die, number on the second die. 𝐸[𝑋 + 𝑌] is simply
𝐸[𝑋] + 𝐸[𝑌]. We know that the 𝐸[𝑋] is 3.5, this is 3.5, it is 7.

I do not need the joint PMF; I do not need to look at the pair of numbers, all 36 possibilities, the probability that the sum is 2, 3, 4 and so on till 12, and then average it out. To find the expected value, it is as simple as that. Powerful, is it not? That is why the expected value is a very, very useful thing in practice.
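As a quick aside (this code is not from the lecture notebook; the variable names are mine), the claim that 𝐸[𝑋 + 𝑌] = 𝐸[𝑋] + 𝐸[𝑌] = 7 for two fair dice can be checked with a few lines of Python:

```python
import random

random.seed(0)
trials = 100_000
total = 0
for _ in range(trials):
    x = random.randint(1, 6)  # number on the first die
    y = random.randint(1, 6)  # number on the second die
    total += x + y
# long-run average of X + Y; should be close to E[X] + E[Y] = 3.5 + 3.5 = 7
average = total / trials
```

Notice that the simulation never needs the joint distribution of the pair, matching the point made above.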

(Refer Slide Time: 20:05)

So, now let us move on to the expected value of the binomial distribution. We saw that when you have a random variable 𝑌 which is binomial(𝑛, 𝑝), you have a complicated summation, and I showed how you can simplify it by cancelling all the extra terms so that finally you end up with 𝑛𝑝. I will show you a much, much simpler method using linearity of expectation. Suppose you have 𝑛 i.i.d. Bernoulli(𝑝) random variables 𝑋1 through 𝑋𝑛. Then 𝐸[𝑋𝑖] is equal to 𝑝, is it not? Each one is just Bernoulli(𝑝): we saw it takes 0 with probability 1 − 𝑝 and 1 with probability 𝑝, so the expected value is just 𝑝.

Now, interestingly, this 𝑌 which is 𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛, we know from before, is binomial(𝑛, 𝑝); notice this interesting little connection here. So, if you add up 𝑛 i.i.d. Bernoulli(𝑝) random variables, you get binomial(𝑛, 𝑝). For this calculation the fact that they are independent is not so important; the fact that they add up is very, very important, because now I can use linearity of the expected value. So, 𝐸[𝑌] is simply the sum of all these expected values, which is 𝑛𝑝.
Notice how easy it became. Suddenly there is nothing left in this problem: where earlier you were struggling with all these factorials, summations, the binomial formula and all that, using linearity of expectation you get something so simple. So, hopefully you see the power of this simple little property that the expected value has.

(Refer Slide Time: 21:33)

So, the next thing: quite often, people working with data would like the average value to be 0. For a lot of reasons it is good to have the average value 0, but in general the distribution may be centered somewhere else, far away; it may be around some very large number. You may want to translate it, move it around, so that the average becomes 0: keep the same shape of distribution, just shifted to sit around 0.

So, is that possible? Can you make the average of a random variable 0 without touching the shape of its distribution? It turns out it is possible, and it is done by something called translation of the random variable. If you have a random variable 𝑋 and you add a constant 𝑐 to it, you get a translated version of it. What is the translated version? The random variable simply becomes 𝑋 + 𝑐, so the range of 𝑋 + 𝑐 is just the range of 𝑋 shifted by 𝑐: every value in the range, you add a 𝑐 to it. And what happens to the probability?

The probability also sort of shifts: instead of 𝑃(𝑋 = 𝑡), it becomes 𝑃(𝑋 + 𝑐 = 𝑡 + 𝑐). So, everything is translated: if you have a distribution far away at very large values, you can simply move it close to 0 and look at it with a 0 mean. This is something very useful. Some people call this a centered random variable, or centering of a random variable, or centering of data. Particularly when you have data whose mean is very large, you may want to just subtract the mean and bring it down to 0; many machine learning and other algorithms like to have the mean be 0.

So, in particular it is interesting to translate by the expected value. Remember, 𝐸[𝑋] is a constant. You have a random variable 𝑋, but 𝐸[𝑋] is just a number; it is not random any more, it is one fixed number. Now what happens if you translate by the expected value of 𝑋? If you look at the random variable 𝑌 = 𝑋 − 𝐸[𝑋], you see that 𝐸[𝑌] becomes 0.

So, this kind of translation, 𝑋 − 𝐸[𝑋], is a sort of centering operation on the data or the random variable. If you are measuring any random phenomenon, you might want to subtract its average from it and only look at the variation around it; that may be what is interesting. Otherwise you just get misled by how large the values are, while really what is important is the variation. So, obtaining 0-mean random variables by translation, and what happens to the expected value when you translate, are very important to understand.

So, in particular, if you look at 𝐸[𝑋 + 𝑐], we know what is going to happen: it is 𝐸[𝑋] + 𝐸[𝑐], is it not? And 𝐸[𝑐] is just 𝑐 itself, so 𝐸[𝑋 + 𝑐] = 𝐸[𝑋] + 𝑐. So, this is what happens when you translate a random variable; that is nice to see. It is just a simple fact I am putting out here because it shows up quite often and nobody really talks about it too much, but it is important to understand as well.
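The centering operation above can be sketched directly on sample data (this snippet is mine, not from the lecture; I use the sample mean as an estimate of 𝐸[𝑋]):

```python
import random

random.seed(2)
# data with a large mean, e.g. values around 105
samples = [random.randint(100, 110) for _ in range(10_000)]
mean = sum(samples) / len(samples)
# translate each value by the mean: the centered data has average essentially 0
centered = [s - mean for s in samples]
centered_mean = sum(centered) / len(centered)
```

The shape of the data is untouched; only its location moves, which is the whole point of centering.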

(Refer Slide Time: 24:26)


So, let me finally close with a simple little problem. We have here 3 bins, call them bin 1, bin 2, bin 3, and 10 balls are thrown at random; each ball can go into any of the bins with probability 1/3. Ball 1, ball 2, and so on till ball 10: one after the other we throw 10 balls into these bins at random. And I want to find the expected number of empty bins; I am interested in empty bins. Maybe I am thinking that if I throw 10 balls, the number of empty bins will be very small; at least 1 ball should go into each bin, or so it looks.

But still, let us compute the expected value; if the expected value is small, maybe that is good. Is it easy to compute? Here is a little trick, and it is very important: the use of what are called indicator random variables. I am going to say 𝑋1 equals 1 if bin 1 is empty and 0 otherwise. This kind of thing is called an indicator random variable. You do the experiment, you throw the 10 balls, and then you see if bin 1 is empty. If bin 1 is empty, then your random variable 𝑋1 = 1; otherwise it is 0.

So, this is how the indicator is defined. Once again, remember what the experiment is; always think in terms of experiment, sample space, outcome, random variable. My experiment is throwing these balls into these bins uniformly at random. After I have thrown 10 balls, I am going to look at the outcome. What is the outcome? Where each ball ended up. But I am not interested in the location of every ball; that is a complicated outcome. I want a function of that outcome, which is the number of empty bins; in particular, I am looking at this 𝑋1, which indicates to me whether bin 1 is empty or not.
Now the same thing I can extend to 𝑋𝑖, for 𝑖 = 1, 2, 3. So, what is 𝑃(𝑋𝑖 = 1), the probability that bin 𝑖 is empty? All the balls should go into the other two bins. That is 2^10 possibilities out of the 3^10 possibilities, so it is 2^10/3^10 = (2/3)^10. Another way to think of it: each ball should not fall into bin 𝑖, which has probability 1 − 1/3, so for all 10 balls together it is (1 − 1/3)^10. So, 𝑃(𝑋𝑖 = 1) is easy.

What is 𝑃(𝑋𝑖 = 0)? I do not really care too much at this point, but anyway it is 1 − 2^10/3^10.

So, now, what is the number of empty bins? The number of empty bins is also a random variable; it is actually 𝑋1 + 𝑋2 + 𝑋3, is it not? Let us call this random variable 𝑌. Finding the distribution 𝑓𝑌 is a little harder. Why is that?

You can think about why that is so; I will leave it as an exercise for you. Think about how you would find the distribution of 𝑌: what is the probability that there are no empty bins, 1 empty bin, 2 empty bins? It cannot be 3 empty bins, since all 3 cannot be empty at once. Notice the difficulty here: 𝑋1, 𝑋2, 𝑋3 are not independent, they are dependent, and it is complicated to find 𝑓𝑌. But I can find 𝐸[𝑌]. What is 𝐸[𝑌]?

It is just 𝐸[𝑋1] + 𝐸[𝑋2] + 𝐸[𝑋3], and that is just 3 × (2/3)^10. You can calculate it; it will be a very small number. So, you can be sure that most of the time all bins will be occupied. This is a powerful little problem to me; it quite often drives home the point and the power of the linearity of expectation. You should try to find 𝑓𝑌 to understand what I mean.

Finding the distribution in this balls-in-bins problem is a little bit hard, but the expected number of empty bins is quite straightforward and works out beautifully because of the linearity of expectation and how easy it makes the calculation. That is the end of this lecture on properties of expectation. We will meet again in the next one. Thank you very much.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 4.3

(Refer Slide Time: 0:15)

Hello and welcome everyone to this lecture. This week we have been looking at expectations; hopefully this is a recap for you, since you have seen expectations before in the Statistics 1 course. We are continuing to look at what expectations are and will proceed further, but I have also been talking about how you can do simulations of probability experiments in Python. I showed you some simple simulations, and I am going to continue that a little bit with expectations as well.

So, if you remember, I was making this claim, which we did not see rigorously, that if you observe a random variable repeatedly, the average value is going to be close to the expected value in some sense. I wanted to show you some simulations to convince you of that.

We also saw some interesting experiments, like the die gambling game, under 7, over 7 and all that, and the balls-in-bins problem. I thought I would do a couple of simulations of those and show you that the expected value and the long-term average are the same thing. So, let us go through and quickly see it.
(Refer Slide Time: 1:24)
Once again, this is not a class in Python coding, so I will not spend too much time on the coding part, just the logic of how the whole thing works. So, let me see if I can zoom in a little bit.

(Refer Slide Time: 1:35)


So, this is a small piece of the notebook which does the simulation of the casino die game that we have been looking at. You throw a pair of dice, and a player bets some units of money on whether the sum of the two numbers is under 7, over 7 or equal to 7. Maybe his strategy is to bet different sums of money on each; the program allows for that. And then the returns: we assumed, if you remember, that for under 7 and over 7 the returns were 1 to 1.

For equal to 7 the returns are 4 to 1, and you can keep that also as a variable here; then you can simulate this with independent betting. That is what this program does. You can go through and look at it; the logic should be simple enough. I could run it, but I will have to go back and make sure I run some of the initial cells first and then run this.

We need to import first because it is going to be connecting and initializing; yeah, it should work now. And we need this uniform distribution; I do not think we need anything else, so we can go back down and run. Once again, this notebook will be with you, and you can run it the exact same way I run it. These values for a and b I have put as 7/5 and 6, so that in this case the predicted return, the expected gain, is 0; and this is the simulated gain.

So, I have set up the returns: you can see the return for a is 7/5, which is 1.4, and b is 6. I have set it up so that the expected gain is 0, and you can see that in an actual simulation the simulated gain also comes close to 0. I am printing the expected gain first, which is 0, and the simulated gain is close to 0. You can change the returns: make a = 1 and b = 4 (it is good to write 1.0 so that everything works out as floating point).

(Refer Slide Time: 3:45)

And once again, if you simulate, you will see it is -0.5, which is pretty good; the simulation works and you can see how it works. A couple of other things: there is this very important and powerful module called scipy.stats. We will not really see all the functions in it, but what is interesting for us is that scipy.stats can generate all these common distributions. If you remember, I generated a uniform random variable for you in earlier code, and a biased coin toss; it turns out this scipy.stats module has all the common distributions, so you can write a simple piece of code to generate binomial, geometric, Poisson and other distributions.

(Refer Slide Time: 4:30)


So, here is binomial(20, 0.3); this is the expected value and this is the average value in simulation, which is what I am computing here. If you actually run it, you will see for binomial(20, 0.3) that the expected value is 6 and the simulated value is close to it. This is the command from scipy.stats which generates 1000 binomial random variables with parameters (20, 0.3). So this is how easy it is to use Python modules for generating random variables: you have all these fancy distributions, and there are powerful Python modules which in one line can generate 1000 instances, samples from the binomial random variable.
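The notebook itself is not reproduced here, so the following is my reconstruction of the kind of scipy.stats call being described; `binom.rvs` is the standard sampling function in scipy.stats:

```python
from scipy.stats import binom

n, p = 20, 0.3
# one line generates 1000 samples from binomial(20, 0.3)
samples = binom.rvs(n, p, size=1000, random_state=0)
sim_mean = samples.sum() / len(samples)  # average value in simulation
expected = n * p                         # expected value, 6
```

The same pattern works with `geom.rvs` and `poisson.rvs` for the geometric and Poisson examples mentioned next.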

(Refer Slide Time: 5:27)


And you can see the help and all that; it is all out there, you just have to take it and use it. I am adding up all the random variables and dividing by how many I have, 1000, so this is the average value in simulation, and that agrees very closely with the expected value. The same thing you can do for a geometric random variable: once again, that is the expected value and this is the average value in simulation.

So, you can run it again and see that with parameter 0.3 the average value and the expected value are very close. And the same thing here for the Poisson distribution: this is the expected value and this is the average value in simulation. This should hopefully convince you that the expected value is very good at predicting the average value in simulation. Later on we will see a theoretical result which confirms this for us, but it is good to see it in simulation.
(Refer Slide Time: 6:26)

So, the final simulation I have done here is the balls and bins experiment. You remember this is one of the problems I solved in the previous lecture: you have m balls thrown independently and uniformly at random into n bins, and we would like to compute the expected number of empty bins by simulation and compare it with the theoretical value. The theoretical value is 𝑛(1 − 1/𝑛)^𝑚; it turns out this is very closely approximated by 𝑛𝑒^(−𝑚/𝑛), which is a good approximation in case you want to see it.

(Refer Slide Time: 6:59)


And then this, once again, is the expected value that we can calculate, 𝑛(1 − 1/𝑛)^𝑚, and this is the average value in simulation. You can see the way I do the simulation: first I repeat the simulation 1000 times (you remember, that is how you do Monte Carlo), and this array keeps track of the number of balls in every bin. Then for every one of the m balls I generate a random bin and increase the count of balls in that bin by 1.

And then finally, for the number of empty bins, I go through every bin, and if a bin is empty I increment the count of empty bins; it is a very simple procedure. The average number of empty bins is the total count of empty bins divided by 1000. Once again, if you run this simulation, you will get numbers which are very close to the theoretical value.
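The procedure just described can be written out as follows (this is my sketch of the logic, not the notebook's exact code; I use 10000 repetitions instead of 1000 to tighten the estimate):

```python
import random

random.seed(3)
m, n = 10, 3              # m balls thrown into n bins
reps = 10_000             # Monte Carlo repetitions
total_empty = 0
for _ in range(reps):
    balls_in_bin = [0] * n                      # track balls in every bin
    for _ in range(m):
        balls_in_bin[random.randrange(n)] += 1  # throw one ball into a random bin
    total_empty += balls_in_bin.count(0)        # empty bins in this repetition
sim_avg = total_empty / reps                    # average number of empty bins
theory = n * (1 - 1 / n) ** m                   # n(1 - 1/n)^m, about 0.052 here
```

For m = 10, n = 3 the simulated average lands very close to the theoretical value from the indicator-variable calculation.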

So, we are keeping this Python simulation thread going. Some of you may be doing the Python course right now and some of you may not, but I hope these commands are simple enough, and from your knowledge of computational thinking you may be able to figure out how to work with this. That is the end of this short lecture introducing these simple simulations to show how the expected value is very useful in practice.

Thank you very much.


Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 4.4

(Refer Slide Time: 0:19)

Hello and welcome to this lecture; this lecture is on variance. We have seen the expected value of a random variable before; variance is a special type of expected value, related to a special kind of calculation involving expected values. We will see a motivation for variance, why it is needed, and its definition. Once again, this is something you have seen before in Stats 1, so the presentation will be quick and we will focus mostly on the new things.
(Refer Slide Time: 0:43)

So, here is the motivation. If you remember, I started motivating the expected value by saying that instead of giving the whole distribution and the entire data, you can give the average value; it represents the distribution in some sense. Here is an example to assess how good the expected value is as a summary.

Here are 3 distributions. First, an 𝑋 which is actually a constant: 𝑋 = 10 with probability 1. There is a random variable 𝑌 which takes the value 9 with probability 1/2 and 11 with probability 1/2. And there is another random variable 𝑍 which takes the value 0 with probability 1/2 and the value 20 with probability 1/2.

Now, interestingly, if you actually summarize and find the expected value for each of these random variables, you get the same answer: 𝐸[𝑋] is 10, 𝐸[𝑌] is 10, 𝐸[𝑍] is 10.

But clearly the 3 random variables are very different: 𝑋 is constant, 𝑌 at least you can say is close enough to 10, at 9 and 11, while 𝑍 looks quite far away, at 0 and 20. So, from the expectation alone you really cannot distinguish between these random variables; it does not seem like you get a full sense of what the random variable is from just the expected value.

So, you need something more, and that something more should measure some sort of spread. You can see here that while the expected value correctly gives you the center of the random variable, the spread is also important, and you need some other way to indicate spread. That is where variance comes in. Once again, you have seen this before; I am just motivating it quickly with a simple example.

(Refer Slide Time: 2:28)

So, here is the definition of what is called variance, along with a very closely associated term called standard deviation. If you have a random variable 𝑋, we will denote the variance by Var(𝑋); that will be our notation for variance. It is simply defined as the expected value of a specific type of function of 𝑋.

What function? Var(𝑋) = 𝐸[(𝑋 − 𝐸[𝑋])²]. Some people get confused with the 𝐸[𝑋] usage here. What is this 𝐸[𝑋]? Remember, 𝐸[𝑋] is some constant; think of it as a constant. You have a random variable 𝑋 and somebody tells you what its expected value is.

So, 𝐸[𝑋], as far as I am concerned, is a number; it could be 10, as in the previous case. Then (𝑋 − 𝐸[𝑋])² is just a function of 𝑋, which is another random variable, and I have to take the expected value of that. We will see some illustrations of how to compute it, but this is the definition; hopefully it is clear.
So, the variance involves squaring 𝑋 in some sense. If 𝑋 has some units, it makes sense that the square root of the variance will carry the same units and have its own meaning, and that is where the standard deviation comes from: the standard deviation is simply the positive square root of the variance.

(Refer Slide Time: 4:08)

So now, as I said, variance is simply the expected value of another random variable, which is a function of 𝑋 given by (𝑋 − 𝐸[𝑋])². One way to write out this formula, since it is the expectation of a function of a random variable, is Var(𝑋) = ∑_{𝑡∈𝑇_𝑋} (𝑡 − 𝐸[𝑋])² 𝑃(𝑋 = 𝑡); I am simply using the formula for the expected value of a function of a random variable, which I know is true.

Now, variance is non-negative. Why? Because (𝑡 − 𝐸[𝑋])² takes only non-negative values: what is inside the bracket could be positive or negative, but once you square it, it becomes non-negative, so clearly the summation is non-negative. Therefore the standard deviation is well defined; it will be a real number, and you never have to take the square root of a negative number.

So, the units of the standard deviation are the same as the units of 𝑋: if 𝑋 has some units, the variance will be in the square of those units, and the standard deviation goes back to the original units. You can also see where this 𝑋 − 𝐸[𝑋] comes from: if you have a lot of values 𝑡 in the range which are far away from 𝐸[𝑋], then the variance is going to be larger and larger.

In the expected value, something far away on the left side can be compensated by something far away on the right side, so the expected value continues to sit in the middle. But in the variance, whether a value is on the left or the right, (𝑋 − 𝐸[𝑋])² makes a positive contribution.

So, every deviation makes a positive contribution, and the spread is therefore well captured by the variance in some sense; it is only one number, but it still captures the spread very nicely. That is the basic motivation for the definition. Let us go back to our example, where we had three random variables with the same expected value but different variances and standard deviations.

You can do this calculation very easily. For Var(𝑋) you have to sum over all values in the range of 𝑋, which is just {10}, so you get (10 − 10)² × 1. Where did I get the second 10 from? It is 𝐸[𝑋]; the first 10 is the value 𝑡 from the range of 𝑋. There is only one value, so you get 0: the variance is 0 and the standard deviation is 0. Variance 0 means the random variable is a constant.

So, any time a random variable is a constant, its variance is 0, and any time the variance is 0, the random variable is a constant; both directions are true.

(Refer Slide Time: 6:35)

So, let us look at Var(𝑌); once again 10 is 𝐸[𝑌]. You go over every value in the range: take (9 − 10)² × 𝑃(𝑌 = 9), which is 1/2, and likewise (11 − 10)² × 1/2. Writing it out, it is the sum over 𝑡 in the range of 𝑌 of (𝑡 − 𝐸[𝑌])² × 𝑃(𝑌 = 𝑡). Add it up, and in this case you get 1, so the variance is 1.

So, even though 𝐸[𝑋] = 𝐸[𝑌], Var(𝑋) is 0, the spread indicator saying it is not spread out at all, just concentrated on one value, while Var(𝑌) ends up being 1 and the standard deviation is 1.

(Refer Slide Time: 7:28)

If you go ahead and compute Var(𝑍), you get 100, and the standard deviation is 10. So, you see how for these three random variables, which have the same expected value, computing the variance summarizes them very nicely: it indicates that 𝑍 has a large spread, 𝑋 is a constant, and 𝑌 is reasonably closely spread around the center. An easy example.
(Refer Slide Time: 7:53)

So, let us go to a slightly more complicated example where we throw a die. Your 𝑋 is uniform on {1, 2, 3, 4, 5, 6}, and 𝐸[𝑋] is 3.5; we have done this calculation before. If you compute the variance, you get

Var(𝑋) = (1 − 3.5)² × 1/6 + (2 − 3.5)² × 1/6 + (3 − 3.5)² × 1/6 + (4 − 3.5)² × 1/6 + (5 − 3.5)² × 1/6 + (6 − 3.5)² × 1/6 = 35/12 ≈ 2.9167.

So, the standard deviation is √(35/12), which ends up being about 1.7078. You see the calculation; it is hopefully really easy, and I am going to make this statement.

(Refer Slide Time: 8:41)


If you are given the PMF of 𝑋 and 𝑋 has a small range, not an infinite one like the geometric where you can get nasty summations you may not know how to deal with, then its variance and standard deviation can be computed quite easily. So, that is great.

We had this interpretation that the expected value is like the average value when you repeatedly observe the random variable. Where does variance come in? Variance is also an expected value: if you compute the average of (𝑋 − 𝐸[𝑋])² over repeated observations, you are going to get something close to the variance. We will see some simulations of this in Python soon enough. So, that is throwing a die: once again, given the PMF, when the range is small it is easy to do the summation, and the variance and standard deviation can be readily computed.
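The die calculation above can be done exactly with rational arithmetic (my own check, not lecture code; `Fraction` keeps 35/12 exact rather than a rounded decimal):

```python
from fractions import Fraction
from math import sqrt

# Fair die: uniform PMF on {1, ..., 6}
pmf = {t: Fraction(1, 6) for t in range(1, 7)}
mean = sum(t * p for t, p in pmf.items())               # E[X] = 7/2
var = sum((t - mean) ** 2 * p for t, p in pmf.items())  # Var(X) = 35/12
sd = sqrt(var)                                          # about 1.7078
```

This confirms the hand calculation: mean 3.5, variance 35/12, standard deviation roughly 1.7078.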

(Refer Slide Time: 9:29)


So, let us go to properties; they are quite easy to establish, and I think I will not do too many proofs. If you remember, for the expected value the linearity of expectation played a huge role: 𝐸[𝑎𝑋 + 𝑏𝑌] became 𝑎𝐸[𝑋] + 𝑏𝐸[𝑌].

In this case things are slightly different, because there is a squaring going on, and anytime you have squaring, things will not be linear. For instance, consider scaling 𝑋 by 𝑎, that is, multiplying it by 𝑎. If variance were linear, you would only get a scaling by 𝑎, but because variance involves a square, Var(𝑎𝑋) = 𝑎² Var(𝑋). This is not very difficult to establish, and I am not going to write down a lengthy proof, but it is important to remember and easy to see where it comes from.

You just write 𝐸[(𝑎𝑋 − 𝐸[𝑎𝑋])²]; the 𝑎 comes out of the bracket, but there is a square, so the 𝑎² comes out of the expectation and you get 𝑎² times Var(𝑋). It is a very easy thing to establish.

So, when you multiply a random variable 𝑋 by 𝑎, the variance gets amplified by 𝑎². When you take the standard deviation, you take the positive square root, so the standard deviation of 𝑎𝑋 is the standard deviation of 𝑋 multiplied by the absolute value of 𝑎. Even if you multiply 𝑋 by a negative number, the standard deviation gets multiplied only by the absolute value; something to remember.
The next thing we saw, very interestingly, towards the end of the expected value lecture is translating a random variable. If you take a random variable 𝑋 and move it around, we saw the expected value also moves around with it: the random variable has some distribution, and if you translate it, its distribution also translates.

But notice what happens to the variance: if you translate the random variable, the variance does not change; the spread about the center does not change, and that behavior is reflected in the third property here, Var(𝑋 + 𝑎) = Var(𝑋). These are easy things to establish; maybe I will just write down the proof for this one alone. You want Var(𝑋 + 𝑎), which is 𝐸[(𝑋 + 𝑎 − 𝐸[𝑋 + 𝑎])²] = 𝐸[(𝑋 + 𝑎 − 𝐸[𝑋] − 𝑎)²].

Remember, the expected value is linear and 𝑎 is a constant, so 𝐸[𝑋 + 𝑎] is 𝐸[𝑋] + 𝑎; the 𝑎 and −𝑎 cancel, and the whole thing becomes Var(𝑋). It is very easy to prove these kinds of statements; the proof itself is not complicated, just write down the whole thing and you will get it. What matters is understanding the meaning of the property: why is the spread not affected?

When you multiply by 𝑎 with |𝑎| greater than 1, the spread increases. If 𝑎 is 10, say, whatever spread you had gets stretched out by a factor of 10, so the variance amplifies by 𝑎² and the standard deviation by |𝑎|.

On the other hand, if you shrink the random variable, say you multiply by 0.5 or 0.1, you expect the variance to decrease, and 𝑎² is exactly how it decreases. And if you do not scale up or down but just shift the random variable, you do not expect the spread about the center to change, and that is reflected in the translation property. So there is a simple meaning to these properties, and the proofs are not too complicated either; scaling and translation are easy to establish.
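Both properties can be checked exactly on the fair-die PMF (this check is my own, not from the lecture; scaling or shifting a PMF just relabels its values):

```python
from fractions import Fraction

# fair die PMF as {value: probability}
pmf = {t: Fraction(1, 6) for t in range(1, 7)}

def var(pmf):
    m = sum(t * p for t, p in pmf.items())
    return sum((t - m) ** 2 * p for t, p in pmf.items())

a, c = 10, 100
scaled = {a * t: p for t, p in pmf.items()}   # distribution of aX
shifted = {t + c: p for t, p in pmf.items()}  # distribution of X + c

ok_scale = var(scaled) == a ** 2 * var(pmf)   # Var(aX) = a^2 Var(X)
ok_shift = var(shifted) == var(pmf)           # Var(X + c) = Var(X)
```

With exact fractions, both identities hold exactly, not just approximately.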

(Refer Slide Time: 13:18)


I want you to contrast this with the similar properties for 𝐸[𝑋]; as I mentioned, they are different because 𝐸[𝑋] is linear, while here you have the squaring term, so things become a bit different.
(Refer Slide Time: 13:33)

So, there is another formula for variance. The formula we have seen so far is 𝐸[(𝑋 − 𝐸[𝑋])²]; it turns out you can do some simplification, and in fact Var(𝑋) = 𝐸[𝑋²] − (𝐸[𝑋])². Notice the difference between the two terms, and hopefully you can see it: in 𝐸[𝑋²] the squaring is inside the expectation, while in (𝐸[𝑋])² it is outside.

Notice the slight difference; the two are not the same, and if you just glance at it you may not see it: in the second term the square has gone outside the bracket, so you take 𝐸[𝑋] first and then square. If you want me to write it down clearly: 𝐸[𝑋²] = ∑_{𝑡∈𝑇_𝑋} 𝑡² 𝑃(𝑋 = 𝑡), while (𝐸[𝑋])² = (∑_{𝑡∈𝑇_𝑋} 𝑡 𝑃(𝑋 = 𝑡))². These two are not the same: when you square the sum there are lots of cross terms, while in 𝐸[𝑋²] there are no cross terms, only the 𝑡² terms. So do not think they are equal.
(Refer Slide Time: 15:00)

So, the proof is here; I do not want to go through the details in this class. You can see it relies on the linearity of expectation and the fact that 𝐸[𝑋] is a constant, and you can write down the proof yourself. There is another set of terms I want you to remember: 𝐸[𝑋] is called the first moment, 𝐸[𝑋²] is called the second moment, and Var(𝑋) is called the second central moment.

You will see this word moment used quite often for these kinds of functions; you can imagine that 𝐸[𝑋³] is called the third moment, and so on. These moments are interesting. Basically, what we are trying to see here, and hopefully you see the picture, is that the moments are the quantities that summarize your random variable. 𝐸[𝑋], the first moment, gives you a good summary connected to the average value. 𝐸[𝑋²] gives you the variance, not directly, but 𝐸[𝑋²] − (𝐸[𝑋])² gives the variance, which is the spread of the random variable. What about the third moment, the fourth moment, you may ask; but usually the first and second moments are the crucial ones, and that is usually good enough for a wide variety of applications.

So, this alternative formula is very important, and it is quite easy to use; you will see in some examples that it is easy.
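The alternative formula can be checked against the definition on the fair die (my own exact check, not lecture code):

```python
from fractions import Fraction

pmf = {t: Fraction(1, 6) for t in range(1, 7)}  # fair die again
mean = sum(t * p for t, p in pmf.items())                   # E[X] = 7/2
second_moment = sum(t ** 2 * p for t, p in pmf.items())     # E[X^2] = 91/6
var_def = sum((t - mean) ** 2 * p for t, p in pmf.items())  # definition form
var_alt = second_moment - mean ** 2                         # E[X^2] - (E[X])^2
```

Both forms give exactly 35/12; in hand calculations the alternative form is often the faster of the two.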

(Refer Slide Time: 16:21)


So, one other property which is very useful concerns sums and products of independent random variables. We know the expected value is linear: for any two random variables 𝑋 and 𝑌, independent or not, 𝐸[𝑋 + 𝑌] = 𝐸[𝑋] + 𝐸[𝑌], and we used this in complicated problems to simplify our calculations.

Now, if in addition you know that 𝑋 and 𝑌 are independent random variables, then you can say more; there are more very interesting relationships in that case. It is important to note that if they are not independent you generally cannot say anything more, but if they are independent you can.

For instance, 𝐸[𝑋𝑌] becomes 𝐸[𝑋] times 𝐸[𝑌] if 𝑋 and 𝑌 are independent; without independence this need not be true, which is important. And similarly 𝑉𝑎𝑟(𝑋 + 𝑌) becomes 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) if 𝑋 and 𝑌 are independent random variables, which is amazing.

So, 𝑋 and 𝑌 being independent is a very strong and powerful condition: it makes the product factor, 𝐸[𝑋𝑌] becomes 𝐸[𝑋] times 𝐸[𝑌], and 𝑉𝑎𝑟(𝑋 + 𝑌) becomes 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌); very powerful results. We will see a quick proof; I just want to go through it quickly to show you why independence is crucial, how it enters the proof and how you cannot do without it. I will do it only for the first one; I will leave the variance part as an exercise.
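Before the proof, here is a quick simulation sketch (my own illustration, not from the slides) checking both facts for two independent fair dice:

```python
import random

random.seed(0)
n = 200_000

# two independent rolls of a fair die, repeated n times
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.randint(1, 6) for _ in range(n)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

e_x, e_y = mean(xs), mean(ys)
e_xy = mean([x * y for x, y in zip(xs, ys)])

v_sum = var([x + y for x, y in zip(xs, ys)])   # Var(X + Y), estimated
v_parts = var(xs) + var(ys)                    # Var(X) + Var(Y), estimated

print(e_xy, e_x * e_y)    # both near 12.25 = 3.5 * 3.5
print(v_sum, v_parts)     # both near 35/6
```

With independent samples, the estimated 𝐸[𝑋𝑌] tracks 𝐸[𝑋]𝐸[𝑌] and the estimated 𝑉𝑎𝑟(𝑋 + 𝑌) tracks 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌), up to sampling noise.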

(Refer Slide Time: 18:07)


So, we know that 𝐸[𝑋𝑌] = ∑𝑡₁∈𝑇𝑋, 𝑡₂∈𝑇𝑌 𝑡₁ 𝑡₂ 𝑃(𝑋 = 𝑡₁, 𝑌 = 𝑡₂). Now, notice that since 𝑋 and 𝑌 are independent, 𝑃(𝑋 = 𝑡₁, 𝑌 = 𝑡₂) becomes 𝑃(𝑋 = 𝑡₁)𝑃(𝑌 = 𝑡₂); this is a crucial simplification.

So, you will get something like this: you can sum over all values, writing the double summation as ∑𝑡₁∈𝑇𝑋 ∑𝑡₂∈𝑇𝑌 𝑡₁ 𝑡₂ 𝑃(𝑋 = 𝑡₁)𝑃(𝑌 = 𝑡₂). It is the same thing: I keep 𝑡₁ fixed, vary 𝑡₂ alone, and add up 𝑡₁ 𝑡₂ 𝑃(𝑋 = 𝑡₁)𝑃(𝑌 = 𝑡₂); then the next value of 𝑡₁ comes and you vary 𝑡₂ again, so the double summation can be written out in this fashion.

Now, what is nice about the inner summation? The inner summation is over 𝑡₂, so 𝑡₁ is constant there, and 𝑡₁ 𝑃(𝑋 = 𝑡₁) can move outside of the inner summation. If you do that you will get ∑𝑡₁∈𝑇𝑋 𝑡₁ 𝑃(𝑋 = 𝑡₁) ∑𝑡₂∈𝑇𝑌 𝑡₂ 𝑃(𝑌 = 𝑡₂); notice this little trick here, it is important.

And what is that inner summation, ∑𝑡₂∈𝑇𝑌 𝑡₂ 𝑃(𝑌 = 𝑡₂)? It is 𝐸[𝑌], which is a constant, so it will come out of the outer summation. So, you get 𝐸[𝑌] times ∑𝑡₁∈𝑇𝑋 𝑡₁ 𝑃(𝑋 = 𝑡₁), and that summation is 𝐸[𝑋], so the whole thing becomes 𝐸[𝑋] times 𝐸[𝑌].

Notice how this factoring is crucial: if this factoring did not happen, if 𝑋 and 𝑌 were not independent, you are not going to get this, at least not in this form. It turns out that even if 𝑋 and 𝑌 are not independent, something like this can happen in some situations, but if they are independent it is definitely going to happen, and it is very easy to establish the result in this fashion; without this factoring I would not have been able to write it so easily.

But who knows, maybe something else will happen; still, you can easily produce two random variables 𝑋 and 𝑌 which are dependent and for which these two relations will not hold. So, remember, this is not always true when 𝑋 and 𝑌 are dependent; that is important to remember. There can be 𝑋 and 𝑌 which are not independent where some other factoring happens, some other coincidence, but you cannot count on it.

So, there can be dependent cases where this will not be true. So, this is a quick proof; I just wanted to write down one proof, since as part of every week you should see at least one proof which is easy to write down, but we will stop here and not do many more proofs as part of this course.

(Refer Slide Time: 21:57)

The second thing is an exercise; I will let you prove it, the proof of the variance result is an exercise here. It is a bit involved, but it uses this result; the product result is useful here. So, you write 𝑉𝑎𝑟(𝑋 + 𝑌) using the usual formula 𝐸[((𝑋 + 𝑌) − 𝐸[𝑋] − 𝐸[𝑌])²], and then you will have a bunch of cross terms.

So, when you square it you will get a bunch of cross terms; you use 𝐸[𝑋𝑌] = 𝐸[𝑋]𝐸[𝑌] and you will magically get the final answer. So, I leave it as an exercise; some of these proofs you have to do and enjoy on your own. These are important; we are seeing very powerful properties for variance. So, let us try and use the property to compute the variance of the sum of two dice.

(Refer Slide Time: 22:48)

So, the variance of one die we saw already: it is 35/12, and the standard deviation is √(35/12); it was easy enough to compute. Now, what about the variance of the sum of two dice? Let us say 𝑋 is the number shown on the first die and 𝑌 is the number shown on the second die; 𝑋 and 𝑌 are independent, and this is the crucial assumption here.

It is not stated in the problem; the problem just asks for the variance of the sum of two dice, and I will assume independence. It is usually not stated very clearly; quite often people just assume this independence, and in this two dice case, when you throw two dice, the two outcomes are independent, so maybe it is reasonable not to state it, but it is good to state it very clearly.

So, you want 𝑉𝑎𝑟(𝑋 + 𝑌), and since they are independent, this is just 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) = 35/12 + 35/12 = 35/6, and the standard deviation is √(35/6).

So, this was easy enough to compute, but notice that if this independence was not there, computing 𝑉𝑎𝑟(𝑋 + 𝑌) would be much harder: you would have to go through the whole formula, and it is messier than you can imagine. So, given independence, it is a very easy calculation to do.
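You can verify both numbers by brute force over the 36 equally likely outcomes. A small sketch with exact fractions (my own check, not from the slides):

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 6)     # probability of each face of a fair die

# variance of a single die
mean1 = sum(p * k for k in range(1, 7))                    # 7/2
var1 = sum(p * (k - mean1) ** 2 for k in range(1, 7))      # 35/12

# variance of the sum of two independent dice, from the joint PMF
sums = [(a + b, p * p) for a, b in product(range(1, 7), repeat=2)]
mean2 = sum(pr * s for s, pr in sums)                      # 7
var2 = sum(pr * (s - mean2) ** 2 for s, pr in sums)        # 35/6

print(var1, var2)   # 35/12 and 35/6
```

The brute-force joint-PMF computation agrees exactly with 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) = 35/6, which is the point of the independence property.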

(Refer Slide Time: 24:25)


So, previously when we discussed expectation we looked at expected value of common
distributions this was a pretty big and important thing that I did, I went through a lot of summations
and complicated stuff I am not going to do that for variance I will just show you maybe one or two
examples very easy examples where I know I can quickly compute, but the other examples I am
going to leave as exercise to you, go ahead and compute.

But here is a table, a standard table; I would suggest you just mark this up, just know this by heart. It is good to know these; most people who work in statistics and probability would know them, or at least know where to look: you can go to the Wikipedia pages, and they will give you details of the expected values, variances, etcetera.

So, for a Bernoulli(𝑝) random variable, the expected value is 𝑝 and the variance is 𝑝(1 − 𝑝). For the Binomial(𝑛, 𝑝) random variable, we know the expected value is 𝑛𝑝 and the variance is 𝑛𝑝(1 − 𝑝); it is very interesting how close these look. For Geometric(𝑝), the expected value is 1/𝑝 and the variance is (1 − 𝑝)/𝑝².

Look at Poisson(𝜆), it is very interesting: the expected value is 𝜆 and the variance is also 𝜆. And if you take the uniform distribution on 1 to 𝑛, the expected value is (𝑛 + 1)/2, which is easy to see, and the variance is (𝑛² − 1)/12, so that is what it is. So, let us do the simple cases first. Let us say 𝑋 is Bernoulli(𝑝), so 𝑋 takes the values 0 and 1 with probabilities 1 − 𝑝 and 𝑝, and we know 𝐸[𝑋] = 𝑝.
The best way to do these kinds of things is to find 𝐸[𝑋²]. So, 𝐸[𝑋²] = 1² × 𝑝 = 𝑝. In fact, any moment, 𝐸[𝑋³], 𝐸[𝑋⁴], whatever you do, you will get 1 × 𝑝. So, now 𝑉𝑎𝑟(𝑋) = 𝐸[𝑋²] − 𝐸[𝑋]² = 𝑝 − 𝑝², that is 𝑝(1 − 𝑝).

So, Bernoulli is easy to do; this is the first case. Notice how this alternate formula is very nice: 𝐸[𝑋]² is very easy to compute. For the second one, Binomial(𝑛, 𝑝), notice what will happen if you start calculating. So, supposing 𝑋 is Binomial(𝑛, 𝑝). We know 𝐸[𝑋] = 𝑛𝑝; we found it by two methods: one was we wrote out the complicated summation and simplified it, and we got 𝑛𝑝.

The other was this wonderful trick where we said 𝑋 is actually 𝑋₁ + ⋯ + 𝑋𝑛, where the 𝑋ᵢ are iid Bernoulli(𝑝); this independence assumption is very important. So, we got a very simple way to compute this, just linearity of expectation: take expectation on both sides, 𝐸[𝑋] = 𝐸[𝑋₁ + ⋯ + 𝑋𝑛].

Now, use linearity: it is 𝑛 times the expected value of each of these guys, and that is 𝑝, so 𝐸[𝑋] = 𝑛𝑝, easy to get. Now, we can use the same idea for variance also. Why? Because the 𝑋ᵢ are iid, hence independent, so 𝑉𝑎𝑟(𝑋) is simply 𝑉𝑎𝑟(𝑋₁) + ⋯ + 𝑉𝑎𝑟(𝑋𝑛). Each of these guys is just Bernoulli with variance 𝑝(1 − 𝑝), so 𝑉𝑎𝑟(𝑋) = 𝑛𝑝(1 − 𝑝); look at how easy it was.

Otherwise, if you want to compute directly, 𝐸[𝑋²] = ∑ₖ₌₀ⁿ 𝑘² (𝑛𝐶𝑘) 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘), which is a slightly more involved summation to calculate. In fact, once you know the variance, you know that 𝐸[𝑋²] is the variance plus the mean squared: since 𝑉𝑎𝑟(𝑋) = 𝐸[𝑋²] − 𝐸[𝑋]², you get 𝐸[𝑋²] = 𝑛𝑝(1 − 𝑝) + (𝑛𝑝)². So, you can use the formula in reverse and get these complicated summations figured out easily if you know the variance.

This is the sort of trick you can use; this is how you compute for complicated distributions. I am not going to go into the details of the other calculations; I will leave them as exercises. For the geometric, Poisson and uniform cases, you can do those summations with the methods I gave you for evaluating summations, and find the variance of the common distributions in this fashion. I have some working here, which I will skip.
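If you want to spot-check a couple of entries in the table, here is a small sketch (my own verification, not part of the lecture) that computes the moments by direct summation over the PMF:

```python
from math import comb

def binom_moments(n, p):
    # direct summation over the Binomial(n, p) PMF
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    e1 = sum(k * pmf[k] for k in range(n + 1))
    e2 = sum(k**2 * pmf[k] for k in range(n + 1))
    return e1, e2 - e1**2        # mean and variance

bmean, bvar = binom_moments(10, 0.3)
print(bmean, bvar)               # close to n*p = 3.0 and n*p*(1-p) = 2.1

# uniform on {1, ..., m}: mean (m+1)/2, variance (m^2 - 1)/12
m = 12
umean = sum(range(1, m + 1)) / m
uvar = sum((k - umean) ** 2 for k in range(1, m + 1)) / m
print(umean, uvar)               # 6.5 and 143/12
```

Both computations agree with the table entries up to floating-point error; the same direct-summation pattern works for the geometric and Poisson cases with a suitably truncated sum.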

(Refer Slide Time: 29:45)

So, we saw for the expected value that by translating you can make the expected value of a random variable 0. Now, here is something a little bit more: we say a random variable is standardized if its expected value is 0 and its variance is 1. So, this is sort of like standardizing the random variable: if your random variable is not around 0, if it is translated somewhere else, you can move it and make its mean 0.

Now, what about the variance? The spread may be a lot; can we standardize the variance? It turns out making the variance 1 is some sort of standardization. And here is this wonderful result: if you have any random variable 𝑋 with an expected value, standard deviation, etcetera, then (𝑋 − 𝐸[𝑋]) / 𝑆𝐷(𝑋) is a standardized random variable. You can try and prove this; it is not very hard.

Supposing you call this random variable 𝑌; you can show 𝐸[𝑌] = (𝐸[𝑋] − 𝐸[𝑋]) / 𝑆𝐷(𝑋): the 1/𝑆𝐷(𝑋) comes out since it is just a constant, and 𝐸[𝑋] − 𝐸[𝑋] goes to 0. What about the variance? 𝑉𝑎𝑟(𝑌) = 𝑉𝑎𝑟((𝑋 − 𝐸[𝑋]) / 𝑆𝐷(𝑋)); subtracting the constant 𝐸[𝑋] does not change the variance, and the constant 1/𝑆𝐷(𝑋) comes out squared, so you will get 𝑉𝑎𝑟(𝑋)/𝑆𝐷(𝑋)², and that is just 1.
So, it is easy proofs, nothing majorly going on, but it sort of gives you something very important,
this will come back to us and we will use this in some interesting ways. So, given any random
variable 𝑋, so supposing you have a bunch of data, lot of values and maybe its mean is somewhere
else, its variance is something else and you want to be able to use it along with other random
variables which have some standardized sort of properties.

So, one of the most useful things you can do is standardize it: you just look at (𝑋 − 𝐸[𝑋]) / 𝑆𝐷(𝑋). And hopefully you see that the PMF of 𝑌 is very similar to the PMF of 𝑋; let me put very similar in quotes, I am not defining it properly, I am just saying this because 𝑌 is just a translated and scaled version of 𝑋.

So, the shape of the PMF is not going to change: it will get scaled and moved around, but the shape and the general relative behavior are the same, and you can go from one PMF to the other. But the expected value is 0 and the variance is 1, so such random variables are said to be standardized.

So, this sort of covers most of what I wanted to say about variance, once again it is a revision you
have seen this before so I am going through it very quickly, hopefully you got an idea of what
variance is, what is the notation we will do and what are the various ideas that one can use in
manipulating these things.
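Standardization is easy to try on data. A minimal sketch (my own illustration; the sample of size 100000 with mean 50 and standard deviation 10 is just an assumption for the demo):

```python
import random

random.seed(1)
data = [random.gauss(50, 10) for _ in range(100_000)]   # arbitrary illustrative sample

m = sum(data) / len(data)
sd = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5

# standardize: subtract the mean, divide by the standard deviation
z = [(x - m) / sd for x in data]

mz = sum(z) / len(z)
vz = sum((x - mz) ** 2 for x in z) / len(z)
print(mz, vz)    # approximately 0 and 1 (up to floating point)
```

Because we standardize with the sample's own mean and standard deviation, the standardized data has mean 0 and variance 1 exactly, up to floating-point rounding; the shape of the histogram is unchanged, only shifted and scaled.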

(Refer Slide Time: 33:04)


I will leave you with one slide which is strictly additional material; I am not saying this is part of the course, but it is something you should know about: the existence of the expected value and the variance. It turns out one can come up with a random variable 𝑋 for which 𝐸[𝑋] goes to infinity. What do you mean by infinity?

The range has got so many values and the probabilities are spread so that when you multiply the
value in the range by the probability and add it all up, you will get something that blows up, so it
will keep on increasing. So, here is a very simple example: the random variable takes the values 1, 2, 4, 8, 16, and so on, the powers of 2, and the probabilities with which it takes them are 1/2, 1/4, 1/8, etcetera.

So, you can compute 𝐸[𝑋] and you will see it will be 1/2 + 1/2 + 1/2 + ⋯: every product of a value and its probability is 1/2, and it goes on forever, so 1/2 added an infinite number of times is infinity; it blows up, so 𝐸[𝑋] blows up. And when 𝐸[𝑋] itself blows up, it turns out 𝐸[𝑋²] will also blow up, and when there are two things blowing up in different ways, subtracting them does not make sense.
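This blow-up is easy to check numerically. A small sketch (my own check, using exact fractions): every term of the expectation sum equals 1/2, so the partial sums grow without bound.

```python
from fractions import Fraction

def partial_expectation(n_terms):
    # partial sum of E[X]: sum over k of 2^k * (1/2)^(k+1); each term is exactly 1/2
    return sum(Fraction(2**k, 2**(k + 1)) for k in range(n_terms))

for n in (10, 100, 1000):
    print(n, partial_expectation(n))   # 5, 50, 500: grows without bound
```

The probabilities themselves are fine (they sum to 1); it is the value-times-probability sum that diverges.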

So, 𝑉𝑎𝑟(𝑋) is not very well defined in this case, so these kind of situations can happen with
random variables but these are exceedingly rare in most practical cases you will not meet these
kind of random variables but somewhere in the corner of your mind just remember for theoretical
purposes any time you use E[𝑋], 𝑉𝑎𝑟(𝑋) you are assuming that they exist.

So far I just wrote down everything; I never made a statement saying assuming this exists. But if you see a proper book on probability and statistics, they always say let 𝑋 be a random variable with finite expected value and variance; they will assume these exist, assume they are finite, and only then proceed and give results.

So, all the results I stated using the expected value, for instance that if 𝑋 and 𝑌 are independent then 𝑉𝑎𝑟(𝑋 + 𝑌) = 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌), of course hold only when these quantities exist; when they do not exist, none of them hold. I never kept saying that, but this slide is about the exceptional situations; we know that these situations exist, but we live in the comfortable world where everything exists.
So, there are more examples here; in fact, there are even more confusing random variables for which even 𝐸[𝑋] is not well defined: the sum will not converge properly to one value. At least in the previous case it was properly blowing up; here the partial sums will not even blow up properly, at one point the sum is positive, at the next point it is negative, and you can make it do a lot of other things, so it is not a well-defined quantity.

So, when 𝐸[𝑋] is not well defined, again 𝐸[𝑋²] will go off to infinity or something else will happen, so the variance becomes ill-defined; these kinds of things can happen. In fact, I am pointing out one more situation in the third bullet: you can have 𝐸[𝑋] being finite but 𝐸[𝑋²] going to infinity.

So, basically, the standard example is 𝑋 taking values 1, 2, 3, … with 𝑃(𝑋 = 𝑘) proportional to 1/𝑘³; I say proportional because you have to divide by the total sum to make the probabilities add up to 1. This is a standard example for the case where 𝐸[𝑋] is finite but 𝐸[𝑋²] goes to infinity, so the variance goes to infinity.
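The two series behave very differently, and you can see it numerically. A small sketch (my own check; the normalizing constant is omitted since it does not affect convergence or divergence):

```python
# P(X = k) is proportional to 1/k^3 (normalizing constant omitted):
# sum of k / k^3 = sum of 1/k^2 converges,
# but sum of k^2 / k^3 = sum of 1/k diverges (harmonic series)
def partial(n, power):
    return sum(k**power / k**3 for k in range(1, n + 1))

for n in (10**2, 10**4, 10**6):
    print(n, partial(n, 1), partial(n, 2))
    # first column settles near pi^2/6, second column keeps growing like log(n)
```

The first partial sum stabilizes (so 𝐸[𝑋] is finite), while the second keeps growing by about ln 10 ≈ 2.3 every time 𝑛 is multiplied by 10 (so 𝐸[𝑋²], and hence the variance, is infinite).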

So, these kinds of situations can happen, but like I said, the last bullet is where we are going to live as far as this course, in fact the entire program, is concerned: a very comfortable world where we look at well-behaved random variables with finite mean and variance.

So, once again, this slide is out of syllabus; I am just putting it out there to tell you that these things exist, and we will be loosely aware of them, but we will not study these kinds of situations very closely. So, that is the end of this lecture. Thank you very much.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 4.5

(Refer Slide Time: 0:16)

Hello and welcome to this lecture. We are going to talk about covariance and correlation; once again this is a topic you have seen in Statistics I, but nevertheless we will quickly go through and discuss these ideas once again, maybe with some new perspective. So, we have seen the expected value of a random variable before, and then we saw variance, and we saw that variance is a measure of how spread out the values are likely to be over a lot of repeated observations of the random variable, while the expected value is sort of the center of one random variable.

Now, what if you have multiple random variables, say two? We have seen one, but what if you have two random variables? Then you might want measures which relate the two. Individually you can have the variance, standard deviation and expected value for each of the random variables, but can we have a measure of the two random variables combined in some way? Those are things like covariance and correlation. So, let us get into that.
(Refer Slide Time: 1:22)

So, a brief motivation: supposing we look at two random variables 𝑋 and 𝑌 which have a joint PMF like this, where 𝑋 takes values 0 and 1, 𝑌 takes values 0 and 1, and here are two possible joint PMFs. In the first case all four entries of the joint PMF are 1/4 (the slide reads 1/2, 1/2, 1/2, 1/2, but that is not a valid joint PMF; it should be 1/4 everywhere, apologies for that).

So, if you look at the marginals 𝑓𝑋 and 𝑓𝑌, you will get 1/2, 1/2 for each, and interestingly, the marginals of the second joint PMF are also 1/2, 1/2. So, notice how the marginals are the same: just looking at 𝑋 and 𝑌 separately, both these joint PMFs seem to give you the same marginals, but actually the distributions are very different. In one case all four possibilities are equally likely; in the other case, if 𝑋 is 0 then 𝑌 can only be 1, so there is no possibility of 𝑌 being 0 along with 𝑋 being 0 at all.

So, these are two completely different joint distributions giving the same marginals, and we would like a measure which can capture the difference between these two situations (and once again, the slide entry is wrong; it has to be 1/4). So, here is the observation: if you just look at the mean and variance of 𝑋 and 𝑌, you are going to get the same answers for both these joint PMFs, and yet the joint PMFs are so different: 𝑋 and 𝑌 are independent in one case, while in the other case the value of 𝑋 determines the value of 𝑌.
If I tell you 𝑋 is 1 then 𝑌 is 0, and likewise if 𝑋 is 0 then 𝑌 is 1; that is true in the second case. In the first case 𝑋 and 𝑌 are independent: whether 𝑋 is 0 or 1, 𝑌 is just uniform 1/2, 1/2. So, how do you quantify something like this? How do you come up with a measure which quickly tells you how two random variables behave together: are they tending towards being independent, or is one going to determine the other, are they very closely connected?

So, is there a nice measure, one number, that can quantify this? We quantified the center of a random variable, we quantified the spread of a random variable; now we are going to quantify this relationship between two random variables. That is the idea behind covariance.
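The two joint PMFs above can be checked with a few lines of code. A minimal sketch (my own illustration of the slide's example, with the joints stored as dictionaries):

```python
# two joint PMFs on {0,1} x {0,1}, stored as {(x, y): probability}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
determined  = {(0, 1): 0.5, (1, 0): 0.5}   # X = 0 forces Y = 1 and vice versa

def marginals(joint):
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p
        fy[y] = fy.get(y, 0) + p
    return fx, fy

print(marginals(independent))   # ({0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5})
print(marginals(determined))    # same marginals, completely different joint
```

Both joints produce identical marginals, which is exactly why a new summary number, beyond the individual means and variances, is needed.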

(Refer Slide Time: 4:31)

So, let us see the definition; it is easy to write down. You have 𝑋 and 𝑌 being random variables on the same space; the covariance of 𝑋 and 𝑌 is denoted 𝐶𝑜𝑣(𝑋, 𝑌), and the definition uses the expected value: 𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[(𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌])]. A very simple definition.

So, let us just look at the definition and see intuitively if we can figure out what it means if 𝐶𝑜𝑣(𝑋, 𝑌) is positive. Remember, 𝑋 and 𝑌 have a distribution, but the covariance is just a number after you evaluate the expectation, and it could be positive, negative or 0.

If it is positive, then the quantity inside the expectation, (𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌]), tends to be positive: its expected value is positive, so we expect that quite often the product is positive, which means 𝑋 − 𝐸[𝑋] and 𝑌 − 𝐸[𝑌] tend to have the same sign. So, roughly, if 𝑋 − 𝐸[𝑋] is positive, we expect 𝑌 − 𝐸[𝑌] to also be positive.

What does that mean in practice, in a rough sense? If 𝑋 is above its average 𝐸[𝑋], then you expect 𝑌 to be above 𝐸[𝑌]; 𝑋 being higher means 𝑌 will also tend to be higher. These are just rough remarks based on the definition, to get you to intuitively appreciate why these things are defined in this fashion.

On the other hand, if it is negative, then we expect (𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌]) to tend to be negative, so if 𝑋 − 𝐸[𝑋] is positive, if 𝑋 is above its average, then 𝑌 − 𝐸[𝑌] will tend to be negative, so 𝑌 will probably tend to be below its average. These are all just probabilistic statements, but hopefully you get the picture.

So, this one number 𝐶𝑜𝑣(𝑋, 𝑌) sort of represents whether 𝑋 and 𝑌 tend to be high together, or whether when one goes high the other goes low; that is the sort of impression that is good to remember. If the covariance is 0, there is a definition here: 𝑋 and 𝑌 are said to be uncorrelated. It is an important definition to keep in mind: if 𝐶𝑜𝑣(𝑋, 𝑌) = 0, we say 𝑋 and 𝑌 are uncorrelated random variables.
(Refer Slide Time: 7:14)

So, that is just the definition; before we evaluate it for some cases, let us look at positive and negative covariance a little bit more, once again from an intuitive point of view. I will throw a few examples at you, and let us see if we can intuitively argue whether we expect the covariance to be positive or negative, or maybe even 0.

So, let us say 𝑋 is the height of a person and 𝑌 is the weight of a person. Generally, if a person is taller, then they also tend to be heavier; of course there are exceptions to this rule, it is just a probabilistic statement, but we generally expect it to be true. So, in this kind of data we expect 𝐶𝑜𝑣(𝑋, 𝑌) to be positive. This is not a rigorous statement, not a precise claim; it is just an observation about how to think of these things when they show up in practice.

Here is another example: rainfall during the monsoon and the debt of farmers. If the rainfall is higher, then hopefully the farmers' debt is going to be lower, so the covariance in this case you expect to be negative in some sense. Here is yet another example where the covariance is negative: I have been looking at IPL overs a lot; if you think of the runs in an over and the wickets in the same over, you probably expect the covariance to be negative: more runs, then maybe the wickets were fewer.

So, this is just to build up your intuition a little bit before we get into the mechanics of evaluating covariance and correlation. But in most cases we will not be dealing with these kinds of high-level problems; we will be dealing with very simple problems.

(Refer Slide Time: 8:46)

What are simple problems? A problem where the joint PMF is given to you: 𝑋 is given, 𝑌 is given, and the PMF is given, like the one here. How do you go about computing the covariance? So, let us go ahead and do this problem. This is the joint PMF of 𝑋 and 𝑌, and we have to evaluate the expected value in 𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[(𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌])]. One very direct way to do it is to actually find 𝐸[𝑋] and 𝐸[𝑌] first.

How do you find 𝐸[𝑋] and 𝐸[𝑌]? You can find the marginals. If you compute 𝑓𝑌 you have to add up the entries, and for the totals to work out, some slide entries need to be 2/15 (once again I messed up this PMF on the slide, apologies for that; with six entries equal to 2/15 and three equal to 1/15, the total is (12 + 3)/15 = 1, so now it is a valid joint PMF).

So, the marginal of 𝑌 is 1/3, 1/3, 1/3, and likewise the marginal distribution of 𝑋 is 1/3, 1/3, 1/3; that is good. Looking at the marginals, you see 𝐸[𝑌] = 0, because 𝑌 is −1 with probability 1/3, 0 with probability 1/3, and 1 with probability 1/3; so 𝐸[𝑌] = 0, and 𝐸[𝑋] is also 0. When 𝐸[𝑋] and 𝐸[𝑌] are 0, the expression becomes really nice to evaluate: it is just 𝐸[𝑋𝑌], so you just take each product 𝑥𝑦 and multiply by its probability.

So, if you do that, all the terms where 𝑋 or 𝑌 is 0 will not play any role, since 0 times anything is 0, so only the four corner values matter: 𝐸[𝑋𝑌] = (−1)(−1)(1/15) + (−1)(1)(2/15) + (1)(−1)(2/15) + (1)(1)(1/15).

So, the two 1/15 terms give +2/15, the two 2/15 terms give −4/15, and overall you get −2/15 as the covariance. This kind of calculation you should be able to do: given a small joint PMF, evaluate the marginals, find 𝐸[𝑋] and 𝐸[𝑌], plug them in, and evaluate the covariance.

And if 𝐸[𝑋] and 𝐸[𝑌] are not 0, then you will get slightly more complicated terms, but it does not matter: for every possible pair of values of 𝑋 and 𝑌, you go through, multiply by the probability, and do the calculation, a straightforward computation of an expected value. Hopefully this was clear. Of course, you can argue that the 0 values dropped out: they do not show up in the summation, since terms of the form 0 × (−1) are already 0 and need not be accounted for.

So, we have got a small negative value. Does that make sense? Does it make sense that if 𝑋 tends to be high then 𝑌 tends to be slightly lower? You can look at these numbers and sort of guess: 𝑋 and 𝑌 are equal with lower probability, and if 𝑋 is 1 then 𝑌 is more likely to be −1 or 0, while if 𝑋 is −1 then 𝑌 is likely to be higher, so they tend to go in opposite directions, and a negative covariance sort of makes sense.

So, this is a bread and butter problem: given a joint PMF, evaluate the covariance; just go through it mechanically. So, this is one problem.
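This bread and butter computation is easy to mechanize. Here is a sketch using exact fractions, with the joint PMF as I read it from the slide (diagonal entries 1/15, all other entries 2/15); treat that exact table as an assumption:

```python
from fractions import Fraction

# joint PMF on {-1, 0, 1} x {-1, 0, 1}: 1/15 on the diagonal, 2/15 elsewhere
vals = (-1, 0, 1)
joint = {(x, y): Fraction(1 if x == y else 2, 15) for x in vals for y in vals}

ex  = sum(p * x for (x, y), p in joint.items())        # E[X] = 0
ey  = sum(p * y for (x, y), p in joint.items())        # E[Y] = 0
exy = sum(p * x * y for (x, y), p in joint.items())    # E[XY]

cov = exy - ex * ey
print(cov)   # -2/15
```

The code just walks the table, exactly mirroring the hand computation: the zero rows and columns contribute nothing, and the four corners give −2/15.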

(Refer Slide Time: 13:03)

So, a few properties: are there some simplifying properties, some easy ways to evaluate covariance? It turns out one can write it in a different form; we will see that. The first thing is that if instead of 𝐶𝑜𝑣(𝑋, 𝑌) I put 𝐶𝑜𝑣(𝑋, 𝑋), I end up getting 𝑉𝑎𝑟(𝑋); that is a simple little result, a good property to remember.

So, in 𝐶𝑜𝑣(𝑋, 𝑌), if you make 𝑌 equal to 𝑋, you get the variance. And while the covariance can be positive or negative, the variance is always non-negative; this comes from this very simple observation.

Here is another expression for covariance: instead of writing 𝐸[(𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌])], you can write 𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌]. This form I really like; this is the form in which you can easily compute: you find 𝐸[𝑋𝑌], 𝐸[𝑋] and 𝐸[𝑌], and the covariance is simply 𝐸[𝑋𝑌] − 𝐸[𝑋] × 𝐸[𝑌], so it is much easier to use this formula.

There is a little proof here you can go through and see; it just goes back to linearity of expectation, like I said. Keep using linearity of expectation and the fact that 𝐸[𝑋] and 𝐸[𝑌] are constants, and everything will work out to this kind of expression. Covariance is also symmetric, which is very easy to see: the product is the same whether you do it this way or that way, so 𝑋, 𝑌 and 𝑌, 𝑋 have the same covariance; simple properties for covariance.

It is also a linear quantity: unlike variance, which involved a square, there is no square here, so if you keep 𝑋 the same and look at 𝑎𝑌 + 𝑏𝑍, then 𝐶𝑜𝑣(𝑋, 𝑎𝑌 + 𝑏𝑍) = 𝑎 𝐶𝑜𝑣(𝑋, 𝑌) + 𝑏 𝐶𝑜𝑣(𝑋, 𝑍); the 𝑎 and 𝑏 come out. This kind of linearity is very useful; it is easy to prove, going back once again to linearity of expectation, and everything follows from there. I am not going to prove this property, but you can use it sometimes to simplify your covariance computations.
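Both the shortcut formula and the linearity property can be sanity-checked on samples. A sketch (my own illustration; the particular way the samples ys and zs are generated is arbitrary, just to make the covariances nonzero):

```python
import random

random.seed(2)
n = 100_000

# make Y and Z both depend on X so the covariances are nonzero
xs = [random.random() for _ in range(n)]
ys = [x + 0.5 * random.random() for x in xs]
zs = [2 * x - random.random() for x in xs]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # shortcut formula: Cov(U, V) = E[UV] - E[U] E[V]
    return mean([a * b for a, b in zip(u, v)]) - mean(u) * mean(v)

a, b = 3, -2
lhs = cov(xs, [a * y + b * z for y, z in zip(ys, zs)])
rhs = a * cov(xs, ys) + b * cov(xs, zs)
print(lhs, rhs)   # equal up to floating-point rounding
```

Linearity holds exactly as an algebraic identity, so the two numbers agree to floating-point precision rather than just approximately.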

(Refer Slide Time: 15:15)

So, now let us look at the connection between correlation and independence. We saw that if the covariance is 0, we say the random variables are uncorrelated. We also saw another notion called independence, which is a much stronger measure of how unconnected two random variables are.

If you remember, the joint PMF has to become the product of the marginals for independent random variables, and not only that, any event you define with one random variable has to be independent of any event you define with the other random variable; it is a strong requirement. But these two notions are connected in one interesting way.
So, the first thing is: if 𝑋 and 𝑌 are independent, then they are uncorrelated; the covariance between 𝑋 and 𝑌 will be equal to 0. This is an important thing to remember: if you are independent, then you will be uncorrelated; your scatter plot will look spread all over the place, and so on.

The proof is actually quite easy: 𝐸[𝑋𝑌] = 𝐸[𝑋]𝐸[𝑌] for independent random variables, so 𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌] = 0. But what about the converse? Suppose 𝑋 and 𝑌 are uncorrelated; do they have to be independent, you may ask? It turns out they do not: even if 𝑋 and 𝑌 are uncorrelated, they may be dependent.

So, the important thing to remember here is that independence is a much stronger notion of being unrelated than being uncorrelated; uncorrelated is a slightly weaker notion. Independence implies being uncorrelated, but you can be uncorrelated and still be dependent, so the converse is not true.

Here is a quick example; you can come up with various other examples as well for uncorrelated random variables, but maybe it is a bit of a surprise to you. If 𝑋 has a range of size 2 and 𝑌 has a range of size 2, it turns out you cannot have an example of this type: uncorrelated implies independent in that case. But if you increase the range of 𝑋 to three values, say −1, 0, 1, and keep the range of 𝑌 as 0, 1, it is very easy to come up with examples where the random variables are uncorrelated but dependent.

So, here is a quick example; let us do the calculation. If you look at 𝑓𝑌 you are going to get 1/2 and 1/2, and if you look at 𝑓𝑋 you get 3/8, 1/4, 3/8. Now, immediately, because of the 0 entry in the joint PMF, you see that 𝑃(𝑋 = 0, 𝑌 = 0) = 0 is clearly not equal to 𝑓𝑋(0)𝑓𝑌(0) = 1/4 × 1/2, so these two are dependent; that is easy to establish.

But you can also see other violations: many other entries fail the product rule, for instance 1/4 is not equal to 3/16, and 1/8 is not equal to 3/16, things like that. So, you can see they are not independent at all. But let us compute the covariance: 𝐸[𝑌] = 1/2, 𝐸[𝑋] = 0, and if you compute 𝐸[𝑋𝑌], all the terms with 𝑋 = 0 or 𝑌 = 0 go away, so only two terms matter, one contributing −1 times its probability and the other +1 times the same probability, and you will get 0 again.
So, you see, 𝐸(𝑋𝑌) equals 𝐸(𝑋) × 𝐸(𝑌), so 𝑋 and 𝑌 are uncorrelated, but they are dependent; here is a simple example that you can work out. And the other property that I mentioned, I think it is true; you can check it out: if you only have two possible values for 𝑋 and two possible values for 𝑌, I think you cannot have this property of being dependent but uncorrelated. Check it out if you do not believe me. But if you just increase the number of values for 𝑋 to 3, you can quite easily come up with these kinds of examples of dependent but uncorrelated random variables.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 4.6

(Refer Slide Time: 0:14)

Next, let us move on to the definition of something called the correlation coefficient. This is actually very closely related to covariance, but generally the covariance can be very large or very small. If it is close to 0, fine, you know it is (nearly) uncorrelated, but how do you make sense of the actual value? Say the covariance is 100, or 10, or 1; it seems to go all over the place, and you do not get a sense of how strong the correlation is from how high the covariance is.

Now, this correlation coefficient is sort of a normalized version of the covariance, and it ends up giving you that real handle on how strong the correlation is based on its value; you will see why. So, this is the definition of the correlation coefficient, or simply correlation, as it is quite often called:

𝜌(𝑋, 𝑌) = 𝐶𝑜𝑣(𝑋, 𝑌) / (𝑆𝐷(𝑋) 𝑆𝐷(𝑌))

So, we sort of normalize it, and we will see why this is a very meaningful normalization soon enough. Once again, we know how to calculate it: given the joint distribution of 𝑋 and 𝑌, we can find 𝑉𝑎𝑟(𝑋) and 𝑉𝑎𝑟(𝑌), from there find 𝑆𝐷(𝑋) and 𝑆𝐷(𝑌), find 𝐶𝑜𝑣(𝑋, 𝑌) as in the examples we just saw, and then divide to get the answer.

So, here is a very important result: it turns out that 𝐶𝑜𝑣(𝑋, 𝑌) lies between −𝑆𝐷(𝑋)𝑆𝐷(𝑌) and +𝑆𝐷(𝑋)𝑆𝐷(𝑌). So, the covariance is bounded in value: given 𝑆𝐷(𝑋) and 𝑆𝐷(𝑌), you know the covariance lies between these two values; it could be positive or negative, but it lies between them.

I am not going to give you a detailed proof of this result, but it is an important result and I have given you a pointer to the proof. You look at a suitable expected value and simplify using linearity of expectation, and you get the lower bound or the upper bound on the covariance depending on which one you take. It is just the fact that the expected value of a non-negative random variable is non-negative; use that and you can get it. I leave this as an exercise for you.

So, now we have normalized. Because of this property of the covariance, you know the correlation coefficient is between −1 and 1, whatever the random variables 𝑋 and 𝑌 may be. The covariance might be very large, but it is still bounded by the product of the standard deviations, and when you divide by that product you get a dimensionless number lying between −1 and 1.

So, if 𝜌(𝑋, 𝑌) = 0 then you know the variables are uncorrelated; if 𝜌(𝑋, 𝑌) is close to 1 they are very positively correlated; if 𝜌(𝑋, 𝑌) is close to −1 they are very heavily negatively correlated. So, the value of 𝜌(𝑋, 𝑌) also tells you the extent of the correlation.
(Refer Slide Time: 3:15)

So, 𝜌(𝑋, 𝑌) is just one number which summarizes the trend; trend is also a good way to express this correlation: if 𝑋 were to increase, 𝑌 would also tend to increase. If 𝜌 is close to 0, they are uncorrelated, there is really no observable trend between 𝑋 and 𝑌; it is sort of all over the place, and maybe they are not even related to each other.

If 𝜌(𝑋, 𝑌) becomes equal to 1 or 𝜌(𝑋, 𝑌) becomes equal to −1, then it turns out, and here is a very surprising and very powerful property, that 𝑌 is a linear function of 𝑋. I am not going to prove it to you in detail; maybe you have seen this result before in an earlier class, but it is something very important to remember.

If 𝜌 becomes equal to 1 or to −1, then 𝑌 has a linear dependence on 𝑋, 𝑌 = 𝑎𝑋 + 𝑏, and that is a great thing to know. Such a strong connection between 𝑌 and 𝑋 means that if 𝑋 is given, 𝑌 is determined, and not in some complicated way: it is a linear function of 𝑋.

So, if the modulus of 𝜌(𝑋, 𝑌) is close to 1, then 𝑋 and 𝑌 end up being very strongly correlated: an increase in 𝑋 comes with a predictable change in 𝑌. This kind of understanding of correlation and covariance is important. Once again, these are just numbers to quickly capture the trend between two random variables.
(Refer Slide Time: 4:45)

So, here is a problem: you have a joint PMF with a parameter 𝑥 lying between −1/4 and 1/4, and we are going to find the correlation coefficient and covariance, just to get more practice with this whole thing. You can compute 𝑓𝑌 and 𝑓𝑋 : both work out to 1/2, 1/2, so 𝐸(𝑌) is 1/2 and 𝐸(𝑋) is 1/2. Then 𝐸(𝑋𝑌): remember, if 𝑋 is 0 a term does not play a role, and if 𝑌 is 0 it also just goes off; only the (1, 1) entry plays a role, and that is just 1/4 − 𝑥. Is that alright?

So, 𝐶𝑜𝑣(𝑋, 𝑌) is (1/4 − 𝑥) − 1/2 × 1/2, which is −𝑥; you see how the covariance nicely works out. Now, to find 𝜌(𝑋, 𝑌) you need 𝐸(𝑋²) and 𝐸(𝑌²): 𝐸(𝑌²) is also 1/2 and 𝐸(𝑋²) is also 1/2, so 𝑉𝑎𝑟(𝑋) is 1/2 − 1/4, that is 1/4, and similarly 𝑉𝑎𝑟(𝑌) is also 1/4.

You can sort of see why all these things end up being the same; hopefully it is clear to you. So, both variances are 1/4, and if you compute 𝜌(𝑋, 𝑌) you get the covariance divided by the product of the two standard deviations, that is 1/2 × 1/2, so you get −4𝑥. So, you can see that as 𝑥 varies between −1/4 and 1/4, 𝜌(𝑋, 𝑌) varies between 1 and −1.

And you can see why, when 𝑥 is positive, you expect a negative correlation: when 𝑥 is positive and 𝑋 is 1, 𝑌 is more likely to be 0, so when 𝑋 is above its expectation, 𝑌 tends to be below its expectation. That negative sign comes in very clearly, and that is why the covariance is negative. So, hopefully this small problem gave you a quick sense of how 𝜌 and 𝐶𝑜𝑣(𝑋, 𝑌) can be computed.
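A quick numerical check of this worked example, assuming the joint PMF f(0,0) = f(1,1) = 1/4 − x and f(0,1) = f(1,0) = 1/4 + x (this assignment is consistent with the covariance −x derived above; the slide itself is not visible in the transcript):

```python
# Sketch: Cov(X, Y) = -x and rho(X, Y) = -4x for the joint PMF
# f(0,0) = f(1,1) = 1/4 - x, f(0,1) = f(1,0) = 1/4 + x.
from fractions import Fraction

def cov_and_rho(x):
    pmf = {(0, 0): Fraction(1, 4) - x, (0, 1): Fraction(1, 4) + x,
           (1, 0): Fraction(1, 4) + x, (1, 1): Fraction(1, 4) - x}
    EX = sum(a * p for (a, b), p in pmf.items())       # 1/2
    EY = sum(b * p for (a, b), p in pmf.items())       # 1/2
    EXY = sum(a * b * p for (a, b), p in pmf.items())  # 1/4 - x
    cov = EXY - EX * EY                                # -x
    sd = Fraction(1, 2)  # SD(X) = SD(Y) = sqrt(1/4), as computed above
    return cov, cov / (sd * sd)

print(cov_and_rho(Fraction(1, 8)))  # (Fraction(-1, 8), Fraction(-1, 2))
```

At x = 1/8 the covariance is −1/8 and the correlation coefficient is −1/2 = −4x, matching the formula derived above.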

(Refer Slide Time: 7:38)

So, now let us look at a slightly modified version of the previous problem. In the previous problem we had 𝑋 = 0, 𝑋 = 1 and 𝑌 = 0, 𝑌 = 1. I am going to translate it: I am going to make 𝑋 take values 𝑐 and 𝑐 + 1, and 𝑌 take values 𝑑 and 𝑑 + 1, and see what happens to the covariance.

So, in some sense we have a feeling that the covariance should not change, and the variances should not change; hopefully that is the case, let us see. So, let us go through and compute 𝑓𝑌 : it is still the same summation, so it does not change, 1/2, 1/2; similarly 𝑓𝑋 is again 1/2, 1/2; that is good.

So, what about 𝐸(𝑌)? It is 1/2 × 𝑑 + 1/2 × (𝑑 + 1), and that is going to become 𝑑 + 1/2; the expected value has shifted along with the values, so instead of 1/2 you get 𝑑 + 1/2. And 𝐸(𝑋) is similarly going to be 𝑐 + 1/2. Now, there are various ways of computing 𝐸(𝑋𝑌); maybe we can just do it directly:

𝐸(𝑋𝑌) = 𝑐𝑑 × (1/4 − 𝑥) + 𝑐 × (𝑑 + 1) × (1/4 + 𝑥) + (𝑐 + 1) × 𝑑 × (1/4 + 𝑥) + (𝑐 + 1) × (𝑑 + 1) × (1/4 − 𝑥)
So, let us combine the terms together. If you look at the 𝑐𝑑 term, there is a 𝑐𝑑 in every product, with weights adding up to 1, so you get 𝑐𝑑 overall. Then look at the 𝑐 term: there is a 𝑐 × 1 here with 1/4 + 𝑥, and likewise a 𝑐 × 1 there with 1/4 − 𝑥; they add up to 1/2, so you get 𝑐/2. Do you see that? Likewise, for the 𝑑 term, again 1/4 + 𝑥 and 1/4 − 𝑥 combine, so you get 𝑑/2. And finally there is the 1 × 1 term, which appears only in the last product, so you just get 1/4 − 𝑥. So,

𝐸(𝑋𝑌) = 𝑐𝑑 + 𝑐/2 + 𝑑/2 + 1/4 − 𝑥.

Hopefully you see how I did the simplification; it is not very complex.

So, now if you compute 𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸(𝑋𝑌) − 𝐸(𝑋) × 𝐸(𝑌), and multiply out (𝑐 + 1/2) × (𝑑 + 1/2), you see that the 𝑐𝑑, 𝑐/2, 𝑑/2 and 1/4 terms all cancel and you are left with −𝑥. That is a pleasing answer: you get the same −𝑥 that you got before.

So, the same thing you can do with 𝜌(𝑋, 𝑌), and you will get 𝜌(𝑋, 𝑌) = −4𝑥 as well; you can check this out, and if you do not believe me, do the calculation and you will get the answer. This sort of shows that covariance is nicely defined: if you keep translating the random variables all over the place, you get the same covariance, since the relationship between the random variables remains the same.
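The translation invariance can also be verified directly; the shifts c = 7, d = −3 below are arbitrary illustrative choices:

```python
# Sketch: shifting X by c and Y by d leaves the covariance at -x.
from fractions import Fraction

def cov_shifted(x, c, d):
    pmf = {(c, d): Fraction(1, 4) - x, (c, d + 1): Fraction(1, 4) + x,
           (c + 1, d): Fraction(1, 4) + x, (c + 1, d + 1): Fraction(1, 4) - x}
    EX = sum(a * p for (a, b), p in pmf.items())       # c + 1/2
    EY = sum(b * p for (a, b), p in pmf.items())       # d + 1/2
    EXY = sum(a * b * p for (a, b), p in pmf.items())  # cd + c/2 + d/2 + 1/4 - x
    return EXY - EX * EY

x = Fraction(1, 8)
print(cov_shifted(x, 0, 0), cov_shifted(x, 7, -3))  # -1/8 -1/8
```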
(Refer Slide Time: 11:18)

So, one very important way of getting a feel for correlation and covariance is by looking at what are called scatter plots. What is a scatter plot? This is an example of a scatter plot. I have two variables; maybe you can think of this as data, so you get a bunch of data for 𝑋 and a bunch of data for 𝑌, two random variables which you think might be correlated, and you do not know whether they are correlated or not.

You want to understand: maybe they are correlated, maybe they have a relationship; how do you figure it out? You can do something called a scatter plot. What is it? In every observation you saw an 𝑋 and a 𝑌 together, and you simply plot the point (𝑋, 𝑌); those are the dots. There are six cases here, case 1 through case 6. Let us focus on case 1.

So, in case 1, every dot represents a data point that you saw. If you just look at this data, it looks all over the place; you cannot conclude from it that there is a trend, that if 𝑋 increases 𝑌 is going to increase, or anything like that. So, let us go to case 3 and contrast it with that.

So, look at the contrast here. Even though the data is somewhat spread out, and it is not like 𝑌 is fully determined by 𝑋 or anything, there is clearly a positive trend, a positive correlation: if 𝑋 increases then 𝑌 also increases, so you see that straight-line positive correlation. In fact, the correlation here is probably very close to +1.

Why is that? Because the points lie within the same small band, and it looks almost like a straight line in some sense. Now let us come back to case 2: there seems to be a positive correlation, in the sense that if 𝑋 is higher 𝑌 is going to be higher, but maybe not that close to +1; there is a positive correlation, but not a very strong one.

So, on the other hand, if you look at cases 4 and 5, these two are very likely to have negative correlation. Do you see why the correlation is negative? The cloud slopes downward at an angle: as 𝑋 increases, 𝑌 decreases.

And likewise, between cases 4 and 5, you can see that 4 is probably close to −1, while 5 may not be that close to −1; there is correlation, but maybe not that strong. And cases 1 and 6 are probably close to uncorrelated, so in those cases 𝜌 is probably close to 0.

So, this kind of picture is very useful, particularly when you deal with real data and large data sets. How do you quickly see whether two random variables, or data representing two random variables, are correlated? You simply do a scatter plot, and just by looking at it you can sort of guess whether the correlation is positive or negative, whether it is close to 1 or −1, or whether it is some middling correlation; you get a general sense.

So, I picked nice scatter plots here; actual scatter plots that you see may not be like this. Later on we will see some scatter plots from the IPL and see if we can conclude something interesting based on them. That is, I think, all I wanted to say about covariance and correlation.

So, once again, this lecture focused on the relationship between two random variables and on coming up with one number to capture the trend between them. If one random variable is above its average, is the other one also going to be above its average, or is it going to be the other way around? That is the question we tried to settle in this lecture; covariance and correlation are very useful metrics for that purpose.
Statistics for Data Science – II
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Lecture 4.7

(Refer Slide Time: 0:15)

Hello and welcome to this lecture. This lecture is on a very interesting topic: bounds, or inequalities, on probabilities using the mean and variance. We saw before that the mean represents some sort of center of the distribution and the variance represents some sort of spread in the distribution. These seem like vague ideas; can we say something more precise? Is there some precise connection to probability? It turns out, not exactly, but at least we can get some good bounds on probabilities using the mean and variance. This short lecture is about two such bounds, and those bounds are very powerful and very useful.
(Refer Slide Time: 0:54)

So, before that, we are going to use a simplified notation for mean and variance. This is very common in probability and statistics classes: if you know what random variable you are talking about, everybody denotes the mean by µ; there is no confusion, no need to write E[𝑋] and all that.

So, in any situation where you see the notation µ being used, it typically represents the mean. Same thing with 𝜎²: 𝜎² will always represent Var(𝑋) if the random variable 𝑋 is clear from the context, and correspondingly σ is the standard deviation. You will see µ and σ again and again in statistics and probability classes.

So, if there are multiple random variables and you want to refer to a particular one, you simply add a subscript: µ𝑋 is the mean of 𝑋, the expected value of 𝑋; 𝜎²𝑋 is Var(𝑋); and 𝜎𝑋 is the standard deviation. We will use this notation from now on; it will simplify our descriptions.
(Refer Slide Time: 1:56)

So, what does the mean say about the distribution? This is a question you can ask, and I am going to lead you through a very simple example, and from there go intuitively towards bounds on probabilities. So, let us say you took a class, marks were given in the final exam or something, and the teacher said the average mark is 50 out of 100. It looks like a reasonable average; maybe there is a spread of students in the class around that.

Now, suppose I ask a probability sort of question: what fraction of students will have marks ≥ 50? What does this mean? If you pick a student at random, what is the probability that their mark is at least 50? That is sort of the question I am asking.

And you can sort of argue that it cannot be 0; it cannot be that nobody scored ≥ 50. If nobody scored ≥ 50, how can the average be 50? The average is everything added up, divided by the number of students, so some numbers have to be above or equal to 50; it cannot be that everything is below 50, for then the average would not be 50, it would be less than 50.

So, you notice that given the average, I can say something about the fraction of people who are above or below some number. Here is a more interesting question; maybe you have not thought about this too much. What fraction of students will have marks ≥ 80? The average is only 50; now, what fraction can be ≥ 80?
Here again you cannot give a precise answer, but you can give a bound. For instance, you can say it cannot be 1; it has to be definitely less than 1. Why is that? If everybody got 80 or above, then the average would be 80 or above; it would not be 50. If the average is 50, clearly a lot of people got less than 80.

In fact, you can be even more precise here: it cannot even be 0.9. Think about why; can it be 0.9? No. I will leave it as an exercise to prove why, but it cannot be 0.9 either; you will see it has to be less than 0.9, and there are some good bounds you can get just from the average. So, the average does say something about the distribution of the marks, and this lecture will give you a brief idea of how to use the average and the variance to say something about where the random variable is likely to lie.
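As a sketch of the reasoning behind the exercise (not the full proof): if a fraction f of students score at least 80 and everyone else scores at least 0, the class average must be at least 80f.

```python
# If a fraction f of students score >= 80 and the rest score >= 0,
# the class average is at least threshold * f.
def min_average(f, threshold=80):
    return threshold * f

print(min_average(0.9))  # 72.0 > 50, so f = 0.9 contradicts an average of 50
print(50 / 80)           # 0.625: the largest fraction consistent with mean 50
```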

(Refer Slide Time: 4:27)

So, this is also related to what are called standard units in probability. Suppose you are observing some random phenomenon and you observe an increase in it. For instance, and I will refer to this later, suppose you are looking at the number of accidents in the country; if you look at these numbers, they go into the lakhs, a big number of accidents happens in the country.

Suppose that number is increasing. When do you get alarmed? When do you think it is a huge increase, and when do you think it is a small one? In absolute terms it may not be easy to see, but if you know the mean and the standard deviation, then you can say something reasonable about it; that is why standard units are very important.

So, suppose you look at a random variable 𝑋 with mean µ and variance 𝜎². You may get values of 𝑋 which are close to the mean, or values far away from the mean. You can always think of 𝑋 − µ: the value you actually got versus the expected value. This 𝑋 − µ is sort of a deviation from the expected value, a distance from the mean, and it could be positive or negative.

You can be above the mean or below the mean; how far above or below, measured as the number of standard deviations you are away from the mean, is what is computed as the standard unit, and there is a good reason why that is very important.

So, what is the standard unit? In some sense you are going to look at (𝑋 − µ)/𝜎. This quantity is the standard unit, and it tells you, if it is large, that you are really far away from the expected value; the 𝜎 plays a very nice role here. You do not expect (𝑋 − µ)/𝜎 to be very large; if it is 10 or something, then you really got an outlier, way beyond what you expect to see.

So, you expect 𝑋 − µ to fall between −𝑐𝜎 and +𝑐𝜎; it has to be within a certain multiple of 𝜎. You might have heard the very famous six sigma terminology: if 𝑋 − µ is above 6𝜎, people generally consider that an extreme event in some sense.

Anyway, what we are going to do in this lecture is make this notion a bit more precise using bounds on probabilities. It may be a rough, vague notion in your head; how do you make it precise, how do you put a number on it? Using a couple of inequalities, or bounds, on probabilities in terms of the mean and variance. We will start with an example.
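As a toy illustration of standard units (the mean and SD below are invented numbers, not real accident statistics):

```python
# Sketch: converting raw observations to standard units (x - mu) / sigma.
mu, sigma = 100_000, 5_000  # hypothetical daily-accident mean and SD

for x in (103_000, 112_000, 160_000):
    z = (x - mu) / sigma
    print(x, "->", z, "standard units")  # 0.6, 2.4 and 12.0
```

An observation at 0.6 standard units is unremarkable, 2.4 is noticeable, and 12.0 would be an extreme outlier, well past the "six sigma" threshold mentioned above.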
(Refer Slide Time: 7:08)

Here is an example: suppose you throw a pair of dice and 𝑋 is the sum of the two numbers. The mean is 7 and 𝜎 is about 2.42; you can compute this, it is easy to see. Now, look at the probability that |𝑋 − µ| ≤ 𝜎: with µ = 7, |𝑋 − 7| ≤ 2.42 means 𝑋 has to lie between 4.58 and 9.42, but remember 𝑋 takes only integer values, so this is the probability that 𝑋 takes values 5, 6, 7, 8, 9, and that probability works out to 2/3; you can go in and compute this.

So, P(|𝑋 − µ| > 𝜎) = 1/3. Now, what about P(|𝑋 − µ| > 2𝜎)? You can do the calculation again; only 𝑋 = 2 and 𝑋 = 12 qualify, so this is 2/36, which is about 0.056. So, you see the drop between 𝜎 and 2𝜎: 𝑋 lies within µ ± 𝜎 with probability 2/3, and within µ ± 2𝜎 with probability 1 − 0.056, so you go outside 2𝜎 only with very low probability.

So, let us look at one more case here: 𝑋 uniform on 1 to 100. In this case, you can compute that the mean is 50.5 and the standard deviation is about 28.9. P(|𝑋 − µ| > 𝜎) works out to 42/100, about 0.42; 𝑋 falls within one 𝜎 of the mean for 58 of the 100 values. And P(|𝑋 − µ| > 2𝜎) is actually 0. So, you see how this probability drops; hopefully this example convinced you. We will also see, like I said, precise bounds that give you upper bounds on how likely it is that the random variable deviates from its mean.
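Both examples can be verified by brute-force enumeration; this sketch recomputes the numbers quoted above:

```python
# Sketch: verifying the dice-sum and uniform examples by enumeration.
from itertools import product
from math import sqrt

# Sum of two fair dice: 36 equally likely outcomes
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
n = len(sums)
mu = sum(sums) / n                                          # 7.0
sigma = sqrt(sum((v - mu) ** 2 for v in sums) / n)
print(round(sigma, 2))                                      # 2.42
print(sum(abs(v - mu) > sigma for v in sums), "/ 36")       # 12 / 36 = 1/3
print(sum(abs(v - mu) > 2 * sigma for v in sums), "/ 36")   # 2 / 36 ~ 0.056

# X uniform on 1..100
vals = list(range(1, 101))
mu2 = sum(vals) / 100                                       # 50.5
sig2 = sqrt(sum((v - mu2) ** 2 for v in vals) / 100)        # ~28.87
print(sum(abs(v - mu2) > sig2 for v in vals) / 100)         # 0.42
print(sum(abs(v - mu2) > 2 * sig2 for v in vals) / 100)     # 0.0
```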

(Refer Slide Time: 9:11)

The first such inequality is called Markov's inequality; it is a very famous inequality. Here is the statement: if 𝑋 is a discrete random variable taking non-negative values with a finite mean (the mean has to be finite; it cannot be one of those infinite, non-existing cases), then 𝑃(𝑋 ≥ 𝑐) ≤ µ/𝑐. So, this is a very simple inequality: if you know the mean, you immediately get P(𝑋 ≥ 𝑐) ≤ µ/𝑐.

But remember, 𝑋 has to be non-negative; this is very important. The range of 𝑋 has to be non-negative, 𝑋 cannot take negative values; this applies only when 𝑋 is non-negative. A very easy example: suppose you take 𝑐 to be 100µ; then P(𝑋 ≥ 100µ) ≤ 1/100, which is 0.01, so already you see a bound here.
you see a bound here.

It may not be a great bound, but you still get a bound: it is less than 1 percent likely that the random variable is 100 times the mean or more, so it cannot be very far above the mean too often. It is a simple inequality that only needs the mean, and it is a very powerful idea; by the way, it is used quite often in so many other derivations.
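A quick check of Markov's inequality on a concrete non-negative random variable; a geometric(1/4) variable is used here because its tail is available in closed form, P(X ≥ c) = (3/4)^(c−1) for support 1, 2, 3, …:

```python
# Sketch: checking P(X >= c) <= mu / c for a geometric(p = 1/4) variable,
# which is non-negative with mean mu = 1/p = 4.
p, mu = 0.25, 4.0

for c in (8, 16, 40, 400):
    tail = (1 - p) ** (c - 1)  # exact tail of the geometric distribution
    bound = mu / c             # Markov bound
    assert tail <= bound
    print(f"c = {c}: P(X >= c) = {tail:.2e} <= {bound:.2e}")
```

As the output shows, the bound always holds, though the actual tail is much smaller; that looseness is typical of Markov's inequality.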
(Refer Slide Time: 10:29)

The proof is here; I am not going to go over it in detail. It is quite simple: write the mean as a sum over the range, restrict the sum to the event {𝑋 ≥ 𝑐}, and observe that for each term with 𝑡 ≥ 𝑐 you can replace 𝑡 by 𝑐. It is quite an easy proof, and an easy inequality to remember as well: 𝑃(𝑋 ≥ 𝑐) ≤ µ/𝑐. But remember, this holds only when 𝑋 is non-negative; if the range of 𝑋 has a negative value, it does not hold. Remember that; it is very important.

(Refer Slide Time: 10:58)


The next inequality is called Chebyshev's inequality. It sounds very fancy, but it is a very simple inequality. For this you need a finite mean and a finite variance: if 𝑋 is a discrete random variable with finite mean and finite variance, then P(|𝑋 − µ| ≥ 𝑘𝜎) ≤ 1⁄𝑘². This is sort of exactly what we were looking for.

I expect 𝑋 − µ to lie between −𝑘𝜎 and +𝑘𝜎, and the probability of going outside should fall as 𝑘 grows. How does it fall? It falls as 1⁄𝑘²: P(|𝑋 − µ| ≥ 𝑘𝜎) ≤ 1⁄𝑘².

The proof just uses Markov's inequality; that is why Markov's inequality is powerful. You apply it to (𝑋 − µ)² and you get this; I am not going to go into the detail, but knowing how to use this inequality is what is important. There are numerous other forms you can manipulate it into: instead of writing 𝑘𝜎 you can write 𝑐; then 𝑘 becomes 𝑐/𝜎 and you get P(|𝑋 − µ| ≥ 𝑐) ≤ 𝜎²⁄𝑐². And instead of writing |𝑋 − µ| ≥ 𝑘𝜎, you can write (𝑋 − µ)² ≥ (𝑘𝜎)².

You can see how Markov's inequality applies: (𝑋 − µ)² is a non-negative random variable whether 𝑋 is negative or not. And I forgot to mention an important thing: Chebyshev's inequality does not need 𝑋 to be non-negative; 𝑋 could be positive or negative, and it always holds, as long as the mean and variance are finite.

So, you can see how the proof goes: instead of 𝑋, you look at (𝑋 − µ)², a non-negative random variable, and you use Markov's inequality; this is just Markov. Why? Because the expected value of (𝑋 − µ)² is 𝜎², so the probability that this non-negative random variable is ≥ (𝑘𝜎)² is ≤ 𝜎²/(𝑘𝜎)² = 1⁄𝑘².

So, Chebyshev's inequality is actually Markov's inequality applied to a slightly different random variable; it is not all that different, and Markov's inequality is really the fundamental one in some sense. Now, people also write it like this: the complement of |𝑋 − µ| ≥ 𝑘𝜎 is 𝑋 lying between µ − 𝑘𝜎 and µ + 𝑘𝜎, so P(|𝑋 − µ| ≤ 𝑘𝜎) ≥ 1 − 1⁄𝑘².

Another way to write it: remember that |𝑋 − µ| ≥ 𝑘𝜎 happens in two cases, 𝑋 ≥ µ + 𝑘𝜎 or 𝑋 ≤ µ − 𝑘𝜎, and these two are non-overlapping, mutually exclusive events.

So, the probability can also be written as P(𝑋 ≥ µ + 𝑘𝜎) + P(𝑋 ≤ µ − 𝑘𝜎) ≤ 1⁄𝑘², and if the sum of these two probabilities is ≤ 1⁄𝑘², each of them individually is ≤ 1⁄𝑘² as well. These are all the different forms in which you can use Chebyshev's inequality; it is the same thing written again and again in different ways.

And hopefully you see why this is just Markov's inequality applied to (𝑋 − µ)². It is very powerful: it tells you how 𝑋 concentrates around its mean, in units of 𝜎.

(Refer Slide Time: 14:55)

So, let us take a few distributions, compute the mean and variance, and look at the actual probability; you will see that Chebyshev's inequality can actually be quite loose. The bound says P(|𝑋 − µ| ≥ 2𝜎) ≤ 1/4; you know that it is 0.25.

But if you take Binomial(10, 1/2) and compute µ = 5 and 𝜎 ≈ 1.58, and actually do the probability calculation, P(|𝑋 − 5| ≥ 2𝜎) is the probability that 𝑋 is in {0, 1, 9, 10}, and that is 0.021, much much smaller than the Chebyshev bound. There are reasons why this is true, but at least the inequality holds; this is definitely less than or equal to 1/4, of course.
(Refer Slide Time: 15:44)

If you take the geometric random variable with parameter 1/4, we know µ = 4 and 𝜎 ≈ 3.46; it is a calculation you can do. And if you compute P(|𝑋 − 4| ≥ 2𝜎), this is the probability that 𝑋 belongs to {11, 12, 13, 14, …}.

That is roughly 0.056, and once again this ends up being less than 1/4. So, you can check that the actual probability really satisfies Chebyshev's inequality; it may be far away from the bound predicted by the inequality, but it does satisfy it.
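The two checks above can be reproduced in a few lines:

```python
# Sketch: Chebyshev's bound P(|X - mu| >= 2*sigma) <= 1/4 versus the actual
# probability for Binomial(10, 1/2) and geometric(1/4).
from math import comb, sqrt

# Binomial(10, 1/2): mu = 5, sigma = sqrt(10 * 1/2 * 1/2)
mu, sigma = 5.0, sqrt(2.5)
pmf = [comb(10, k) * 0.5 ** 10 for k in range(11)]
p_bin = sum(pmf[k] for k in range(11) if abs(k - mu) >= 2 * sigma)
print(round(p_bin, 3))  # 0.021, well under the bound 0.25

# Geometric(1/4): mu = 4, sigma = sqrt(12).
# |X - 4| >= 2*sqrt(12) forces X >= 11, and P(X >= 11) = (3/4)^10.
p_geo = 0.75 ** 10
print(round(p_geo, 3))  # 0.056
```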
(Refer Slide Time: 16:25)

So, let me conclude this little lecture by talking about what mean and variance say about the distribution. Clearly, by Markov's inequality, the mean itself can be used to bound how far a non-negative random variable can go above its mean; that we have seen.

If you are given the mean and the standard deviation, Chebyshev's inequality gives you a very nice bound on the probability that 𝑋 − µ goes above 𝑘𝜎 or below −𝑘𝜎, that is, 𝑋 > µ + 𝑘𝜎 or 𝑋 < µ − 𝑘𝜎; that is bounded by 1⁄𝑘², a very nice expression.

So, mean and variance, through these bounds, give you a very nice characterization of how much a random variable has really deviated from its center. Here is a simple example; look it up and see if you can answer this question. Suppose the number of accidents per day across the country decreases by 10,000. Is that a significant decrease? How do you answer that question?

You have to look at the standard deviation to judge how large the decrease is. In absolute terms, 10,000 may not look that high: the average number of accidents in the country, if I am not wrong, is about a few lakhs, which is very high.
So, 10,000 does not sound that big by itself, but you have to really look at the standard deviation, and then you will see whether it is a sharp decrease. If it is a very large decrease relative to the standard deviation, then you know we are doing something, some good policy to prevent road accidents is working; if it is a minor decrease, you are going to think of it as something that happened naturally. This kind of understanding comes from the mean and variance, and they give you very useful bounds on the probability itself. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Continuous Random Variables

Hello and welcome to this lecture. We are going to talk about continuous random variables. This is the first step we are taking into a new topic in this study of statistics. So far, we have looked at discrete random variables and their simple description using what is called the probability mass function: 𝑋 usually takes values like 1, 2, 3, etcetera, and the PMF gives the probability that the random variable 𝑋 equals some value 𝑥.

And then we work with it; we understand probability calculations, we have looked at multiple random variables, we have looked at so many different ideas. So now we are going to start slowly moving towards the notion of a continuous random variable.

This is very important because, as you will quickly see, the numbers can become very unwieldy and very large if you want to stick to the discrete domain, and the continuous random variable framework gives you very easy ways to deal with these kinds of situations. So, let us get started. I will give you a description of why all this is needed and then describe the main tools for doing computations with continuous random variables.

(Refer Slide Time: 1:29)


Let us begin with an introduction, in which I am going to motivate why one requires continuous random variables and where they usually show up when we look at data. Of course, in theory one can build this in a very different way, but I would like to start with some data and show how this continuous view of random variables helps us in computations and in dealing with data: summarizing it, working with it, etcetera.

(Refer Slide Time: 1:56)

So, let us let us think of our usual discrete random variables and let us say the alphabet in which it
takes values is the script X, so this other X, and when this alphabet grows very large one sees
immediately that it becomes a little bit difficult to work with the discrete random variable. See
remember, how did we deal with discrete random variables, we wrote down a big table of all
possible values that the random variable could take and the probability with which it took the
values.

Now, when the number of values grows larger and larger, we saw even the binomial distribution, for instance; if you have a binomial with n = 1000 and p = 0.6, you are already getting to a point where you cannot make tables. You start writing down expressions, but the expressions involve terms like 1000C500 × p^500 × (1 − p)^500, and it is really difficult to sort of understand what these numbers are, where the probability is, etcetera, and everything becomes very unwieldy to deal with, particularly when the alphabet grows large.
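To make this concrete, here is a small sketch (not from the lecture) that evaluates the exact Binomial(1000, 0.6) PMF using Python's exact integer combinations; a full table would need 1001 rows, and even the most likely value near the mean n·p = 600 carries only a small probability.

```python
from math import comb

def binom_pmf(n, p, k):
    """Exact binomial PMF: P(X = k) = nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 1000, 0.6
# A full PMF table would need n + 1 = 1001 rows; even the most likely
# value, around the mean n*p = 600, has only a small probability.
peak = binom_pmf(n, p, 600)
total = sum(binom_pmf(n, p, k) for k in range(n + 1))
print(peak)   # roughly 0.026 -- no single value carries much mass
print(total)  # the 1001 probabilities still sum to 1
```

Note that Python's arbitrary-precision integers keep comb(1000, 500) exact, which is why this computation works at all; the point of the lecture stands, though, that a 1001-row table gives little insight.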

Let me give you two examples of situations that can happen, the one situation I will show is
meteorite weights, I think there is a spelling mistake here, so let me see if I can fix that, you will
see me making this mistake quite often o and e replaced with e and o, so I will do it here, maybe I
will not do it in every slide, you can treat both of them as the same. So, let us look at meteorite
weight.

So, what are meteorites? Meteorites are objects moving around in space that hit the earth's atmosphere; they are drawn in by earth's gravity, eventually some of them even hit the ground, and some of them completely disintegrate in the atmosphere because of heat, etcetera.

So, there is this data on what is the weight of the different meteorites that have actually landed on earth, maybe hit the earth at some point in time, and the various stages of how much damage they caused, etcetera.

And this distribution is spread over a wide range of weights; there is this 0.01 grams, little tiny
things that make their way in and then there is also on the other side 60 tons. So, the large weight
ones are very rare, if you actually look at the data there is 45,000 + meteorites, you can go to many
websites and download this data.
And if you look at their weights most of the weights are around a few kilograms, maybe even less
than that, 100 grams to 800 grams like that and a few of them will go into the tons and there is one
or two which are in the tens of tons and very large, but out of this 45,000 very few are in that range.

Now, if the meteorite hitting the earth is sort of like a random like phenomenon and if you want to
think of doing statistical study of this phenomenon and you start looking at the data, maybe starting
to fit a model and the distribution to it, so you notice now that the data is really vast, the alphabet
is very large, so you want to think of the meteorite weight as your random variable.

Then you have this big set of data to contend with 45,000 + and look at the range, 0.01 grams to
60 tons, maybe the range is not too scary, we have other methods to deal with that, I will tell you
how to deal with, when you have a large range how to simplify that, but still 45,000 entries in that
and so many different values. Do you really gain any insight from it, did you understand which
weights are more likely, which weights are less likely?

I mean, it looks really difficult to make anything meaningful out of a statistical study of something
like this if you are going to stick with discrete random variables, maybe there is something simpler
we can do. Something very similar happens even with the binomial distribution, so look at
binomial (n, p) if n is growing very large and p is a constant, so this can happen a lot.

So, if you look at Bernoulli trials, like incidence of a disease in a population or something, the
population is very large and p is going to be a constant and let us say maybe there is something
like that. Then you have to really worry about how to do these big combinatorial calculations, the nCk × p^k × (1 − p)^(n−k) terms, when n is of the order of thousands or ten thousands and k is of the order of 500 or 5000.

What do you gain by that? I mean everything is going to be this big number and you do not even
get a sense of where the probability is, is not it? So, maybe for the binomial, I know, a little bit
more but still it will be nice to be able to do calculations with the binomial distribution when n
goes large also. So, how do you, is there a better way to do it, is there a simpler way to do it that
is one of the interesting questions to ask.

So, these are the kind of situations where the notions of continuous random variables sort of
directly, in an easy way, enter the picture. So, when you expect the variable you are analyzing, the
phenomenon you are analyzing, to take a lot of values, maybe in a small range, maybe in a large
range, the range really does not matter.

When you take a lot of values and each of those values have different probabilities and you have
to keep track of all of them, etcetera, you are better off thinking about continuous modeling, so
instead of modeling them as discrete model them as continuous and then there are tools from the
world of continuous random variables which you will need and you can use them and you can do
simpler calculations with that model.

So, this is the core idea of when you have a lot of data and lot of different values that the data takes
within the same range. For instance, you might want to think of weight of an adult person, if you
think of weight of an adult person again the value can range in kilograms, it can range maybe from
45 kgs to 120 kgs or something like that and all the values, if you look at a large enough population,
there is going to be so many weights. Depending on the precision with which you measure, you
can really get a whole bunch of values which are all different.

And listing them out as discrete and keeping track of them, I do not know if you really gain any
insights from it or you can do any calculations from it. Is there a simpler way to describe that
phenomenon of weight of an adult person, meteorite weight that is hitting the planet earth, binomial
distributions and growths large, is there a simpler way to do calculations in those kind of statistical
situations and that is where continuous random variables enter the picture.

Do you give up something? Yes, you do give up something, you give up some precision, if you
keep track of everything you have a lot of precision, but the calculation is really painful, so maybe
this trade-off is worth it. So, this is sort of background, I want you to remember this in mind as we
develop the theory.

So, when we start developing the theory and writing down those equations for distribution functions, density functions, calculations, integrations and all of that later, it is sometimes possible to lose track of these kinds of ideas, but they are very useful in the modeling. How do you go from data to a continuous model? This is the motivation: you have a much simpler description in the continuous one.

(Refer Slide Time: 9:09)


So, let me give a little bit more data on this meteorite situation, this illustrates a few things which
I thought were very interesting at this point to get to know. So, if you actually look at the data, I will
make the data available to you in a suitable format, it is 45,000 + rows of an excel file, there is lots
of data on the meteorites. One of the fields is the weight in grams, and the range of the data is from 0.01 grams to 60 tons, that is, six followed by however many zeros it takes to make 60 tons in grams.

So when you have a wide range like this, one of the very interesting and simple transformations
you can do which you have learnt in your Mathematics I course is taking logarithms, so if you take
logarithms even to base 2 for instance here like I have done, you will get a different range and that
range will suddenly start looking very small.

And not that you reduce the amount of data, you still have the 45,000+ data points, but at least the numbers you are dealing with are small, from −6.6 to 25.8 instead of 0.01 grams to 60 tons. It is difficult to plot that original range, it is just too confusing, but if you just look at −6.6 to 25.8 you are already feeling better.

So, this trick of transforming the data to make it fall into a certain range or a certain type
of values that you like is a very useful trick when you model data. You might want to keep that
somewhere in the back of your mind as you go through your other courses in data science. So, that
is - 6.6 to 25.8, so this is one trick, it has got nothing to do with the continuous random variable
stuff we want to talk about, but I think it is a useful trick to remember, so in this data it makes
sense to take log, so we will take log and look at everything in the log domain.
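The log trick can be sketched as follows; the weights below are made-up stand-ins (the actual 45,000-row meteorite dataset is not reproduced here), chosen to span the same 0.01 gram to 60 ton range mentioned above.

```python
import numpy as np

# Hypothetical stand-in weights in grams, spanning the range quoted
# in the lecture: 0.01 g up to 60 tons (6e7 grams).
weights = np.array([0.01, 2.5, 150.0, 800.0, 4000.0, 60e6])

log_weights = np.log2(weights)
print(weights.min(), weights.max())          # 0.01 ... 60000000.0
print(log_weights.min(), log_weights.max())  # about -6.6 ... 25.8
```

Taking log base 2 compresses seven orders of magnitude into a range of about 32 units, which is far easier to plot and reason about.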

So, here is the main idea that moves from discrete to continuous, it is sort of a bit counterintuitive,
it is bit confusing, people always lose track of these kind of ideas but this is how continuous enters
the picture, this is the simplification or this is how you go from large amounts of data to a short
description of it in the continuous domain. The main idea is to move away from focusing on
individual values that your data is taking, individual values that the random variable is taking but
focus on intervals.

So, here is what I would do: take, for instance, −6.6 to 25.8, and divide it into 100 intervals, so you will roughly get −6.6 to −6.3, −6.3 to −6, and so on till 25.5 to 25.8. Now this hundred is something in my hand; I can change this hundred, I can make it 1000 if I want, I can make it 500, I can make it 25, I can make it whatever else I like.

You can imagine in the continuous world in the theory, you want to make that as large as possible
and make these interval widths as small as possible, but you never get out of the interval stuff, I
mean it is always an interval, it is always a bunch of intervals as opposed to individual values. So,
that move from individual values to intervals is very crucial when you go from discrete to the
continuous world.

So, that is something, so you have got to remember this when we, the main idea and moving
towards the continuous world to have a model in the continuous world is to start thinking of
intervals and when you think of intervals everything becomes very-very easy, so that is the first
trick, and the important trick and the main idea here.

So, once you put bins or intervals, you can simply count how many times my variable of interest
fell in that bin. How many times the log of the weight of the meteorite fell between - 6.6 to - 6.3?
How many times did it fall between 25.5 to 25.8? I will now keep track of these counts. So, this is
called histogram, if you do this, the term that people use is histogram, and these histograms are
hopefully much better for you to describe and think about and visualize and picture and do
calculations with than the individual values.
And anyway, I can make these intervals quite small, I mean this is compared to the overall weight
and these intervals are quite small and they do give me enough precision to work with this data,
so this move from discrete individual values to bins and counts of the values in the bins is a crucial
idea that moves into the modeling in the continuous world.
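The bin-and-count step can be sketched with NumPy's histogram routine; the 45,000 values below are synthetic stand-ins for the real log-weights, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for 45,000 log2-weights (not the real dataset).
log_weights = rng.normal(loc=5.0, scale=3.0, size=45000)

# Divide the observed range into 100 equal-width bins and count how
# many values fall into each bin: that is exactly the histogram.
counts, edges = np.histogram(log_weights, bins=100)
print(counts.sum())  # all 45,000 values land in some bin
print(len(edges))    # 101 edges delimit the 100 bins
```

Changing `bins=100` to 25 or 1000 is the knob the lecture describes: fewer, wider bins give a coarser summary; more, narrower bins give more precision, but you never leave the world of intervals.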

So, once I do this notice what happens. So, this is called the histogram of any data that you look
at, so I am going to look at histogram of log of weights and this is a picture I can draw. So this
picture, many computer programs will let you draw histograms in a very nice way, we will see as
part of our diploma course, we will see a clear way to plot histograms, plot things like this, this is
a part of another course.

But here I have plotted a histogram, and these are the bins. On the y axis you have the count of the number of values that falls in each bin, and the bins are along the x axis. So, for instance, if you look at a bin somewhere around 4 or 5, you see that around 5 is where the maximum count seems to appear; about 1700 to 1800 of the values are around 5.

And as you go towards the right, above 15 or so, the counts are very small, 100 and below; in fact, if you go to −5 the count is also really small. The range goes all the way to 25, but the counts there are tiny compared to the 1800. And look at this picture: you already see some sort of easy picture to remember in your mind, whereas if somebody gives you 45,000 values spread out over a large range, it is very difficult.

You take log and then you bin, you get this very nice and simple picture and it is something nice
to describe. So, instead of focusing on individual values, maybe it is simpler to build a continuous
model where you focus simply on the shape of the histogram, put some modeling on the shape,
maybe some function of the continuous variable x and then you do that and then start doing
calculations with that.

So, remember we went from individual values to intervals, so we will have to be stuck with
intervals, once you come to the interval you cannot go back to the individual value again, so you
have lost that precision of the individual value. So, we will focus on intervals and we will try to
find probabilities that our variables fall inside intervals and these intervals are small enough that
they sort of are close enough to the, maybe the individual values or maybe, that is enough for us.

So, I mean, what do you really gain by knowing the exact weight of the meteorite and I do not
know, that is a different sort of study. For most people it is important that the meteorite is very big
or very small, so intervals are good enough in some sense, so anyway many of our measurements,
even weight if you measure, our ability to measure has a precision, so you can go up to say
milligrams, micrograms, I do not know, whatever grams, very-very small precision but you cannot
really make the precision infinite.

You cannot measure a weight up to as precise as you want, there are fundamental limits, so you
can only have a precision. So, any weight you measure, even though somebody gives you 0.01
grams, it is actually within a certain interval only, so everything in the physical world is only within
intervals, so doing intervals is fine, I mean that is good enough, you do not need anything more.

So yes, we do give up precision, but you know, precision is never there beyond a certain point, so as long as you can make your intervals very small, as long as you can increase your number of intervals, we do not really lose much. So, that is the important thing.

All right, so this was motivation, introduction and I want you to sort of carry these thoughts into
the theory, so once we start building the theory, this will seem very remote because the theory
itself is a little bit more involved, you will have integration, you will have to do calculations which
are a bit painful.

You may feel scared about it, you may feel uncomfortable around it but the modeling is more
crucial, so when you identify data, which kind of data do you want to quickly do a continuous
model for, for what you do not want to do a continuous model, things like that should easily be
assimilated when you look at data.

(Refer Slide Time: 17:22)


So, with that introduction, here is one final piece of motivation. Here is the PMF of the binomial distribution with n = 100. Even here, when you plot the PMF, even though we cannot do the calculations that easily, you see that there is a nice shape. So maybe if you give up the precision of the individual values, just focus on the shape, and come up with some other model even for the binomial, maybe we will gain a lot.

So, that is something, that is very interesting and binomial, like I said, shows up quite often in
Bernoulli trials and all that, so these kind of calculations ability to do easy calculations with
binomial distribution. So, for instance, suppose I asked you, in this PMF what is the probability
that the binomial random variable falls between 50 and 60?

So, if you have to do the accurate calculation, you have to compute 100C50 × p^50 × (1 − p)^50 + 100C51 × p^51 × (1 − p)^49 + ⋯ + 100C60 × p^60 × (1 − p)^40; look at the calculation, it is just so unwieldy. I just need to know what is the probability that the binomial random variable falls between 50 and 60. Is there an easier calculation? Can I do something easy? I am willing to give up some precision; I do not want the exact answer, within something is good enough for me.
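Here is a sketch of just how tedious the exact sum is; the slide does not fix p, so p = 0.5 is assumed below purely for illustration.

```python
from math import comb

def binom_pmf(n, p, k):
    """Exact binomial PMF P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(50 <= X <= 60) for X ~ Binomial(100, p): an 11-term sum of large
# combinatorial terms. The value p = 0.5 is an assumption, not from the slide.
n, p = 100, 0.5
prob = sum(binom_pmf(n, p, k) for k in range(50, 61))
print(prob)  # around 0.52 for p = 0.5
```

The continuous tools developed in the coming sections will let you approximate interval probabilities like this one without any combinatorics at all.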

And is there a simpler method? So, can we do something easier here? All of that comes because
of continuous random variables; this is very nice theory out there which gives you these tools. So,
calculations with PMF in the binomial case is very tough and maybe there are simple alternatives.
So, let us keep this introduction in mind as we jump into the next section which will talk about a
very important thing called the cumulative distribution function.
Statistics for Data Science 2
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 5.2
Cumulative distribution function
(Refer Slide Time: 0:15)

Hello and welcome to this lecture. In this lecture we will talk about the cumulative distribution
function or quite often this word cumulative is dropped, this word is dropped, many people will
simply call it a distribution function which is also fine, I mean, so both of them are good enough,
I think traditionally many people use cumulative distribution function, I will keep saying
cumulative distribution function, but you will see in many books people will simply call this the
distribution function. And it is not a bad name, distribution function is also a very good name one
can use either of these.
(Refer Slide Time: 0:50)

So, here is the definition of the CDF. The reason why I like CDF is that CDF sounds like a much better acronym than DF, so CDF is quite nice. So, the CDF of a random variable: any random variable X has a CDF, and the CDF is defined in this fashion. It is a function from the real line to [0, 1], so 𝐹𝑋(x) is a function; the input, small x, comes from the real line, any real number can be its input, and the output is a value between 0 and 1.

So, it is important to know what it is, how we define it, and how we connect it to the random variable. Notice that the notation 𝐹𝑋(x) sort of succinctly conveys everything: it is the CDF of the random variable X evaluated at small x, and it is simply defined as 𝑃(𝑋 ≤ 𝑥). So, this is the definition of the cumulative distribution function.

It seems simple enough, it is easy enough to do this, so given any random variable one should be
able to, at least discrete random variables we know how to compute the CDF, but the CDF is a
very important bridge between the discrete world and the continuous world. You will see it gives
you that nice interconnect and I will convey that to you in this lecture or maybe the next one, but
before that let us first understand what the CDF is.

Do some simple calculations with it, get used to the idea of how to work with the CDF, because so far we have been working with the PMF; we have only discrete random variables so far, and it is always P(X = x). So now we have the slightly unusual situation of 𝑃(𝑋 ≤ 𝑥), but that should not really worry us too much, because it is just less than or equal to small x. We know how to count and all that, so that is what we have to get used to first, and then we will
see how this nicely generalizes beyond discrete random variables also, so let us get started. A few
quick properties, these properties are very easy to prove, the first one is, if you want 𝑃(𝑎 < 𝑋 ≤
𝑏), notice the way a and b are included here, a is not included, 𝑏 is included probability that x falls
between 𝑎 and 𝑏, 𝑎 not included 𝑏 included that will simply be 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎).

So, the difference of two values of the CDF is equal to the probability that the random variable falls between the two points; it is an important thing to keep in mind. That is a very nice property and it is very easy to prove: 𝐹𝑋(𝑏) is 𝑃(𝑋 ≤ 𝑏), 𝐹𝑋(𝑎) is 𝑃(𝑋 ≤ 𝑎), and if you subtract these two, you are left with exactly the values of X which are less than or equal to 𝑏 but greater than 𝑎; that is exactly the set we want.

So, it is a very easy derivation that is good to see. So, the next important property for 𝐹𝑋 is it takes
non-negative values, clearly it takes non-negative values, it takes values between 0 and 1, so we
have already said that is the property of 𝐹𝑋 and it is anyway a probability, it has to be between 0
and 1, it cannot do anything else, but it has to be a non-decreasing function. What is non-
decreasing?

It cannot slope down; it has to sort of keep increasing (or stay flat). Why is that? Because if small x increases, I am now asking for the probability that X is less than or equal to something larger, so that is a bigger event, and once you look at a bigger event, its probability has to be at least as big. Go back to your basic axioms of probability: when you have a bigger event and a smaller event inside it, the probability of the bigger event is bigger.

So, because of that reason 𝐹𝑋 is a non-decreasing function. So, now the last two properties are sort
of limiting properties in the sense that I want to look at the case where small x becomes smaller
and smaller and smaller and keeps on going in the - ∞ direction, so eventually whatever your
random variable is, if your small x is going towards the left, eventually you will run out of all the
possible values it can take, so your probability will go to 0, so 𝐹𝑋 will go to 0.

So, it starts at 0 at −∞, and notice what happens at +∞: if you keep moving off to the right, just by saying x is less than or equal to ∞ you are including all the possible values that X can take, so that probability is simply 1. So, immediately picture this function: it starts at 0, it has to keep on increasing, it cannot decrease, it takes values between 0 and 1, and at ∞ it goes to 1.

So, it will have always a picture like that, so any CDF will be like that, it will start 0, maybe it will
do some things and then eventually it will go to 1, we do not know what is going to happen in the
middle, but eventually that is what will happen, starts at 0 goes to 1 and the important property
like I said is difference between two values of the CDF is simply the probability that x lies in that
range.

There is some adjustments for the beginning end, etcetera, but that is what it is, simple definition,
distribution function, cumulative distribution function is a very important definition, like I said it
is a bridge quite often between the discrete world and the continuous world.
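A minimal sketch of the definition and the interval property, using a fair die as the discrete random variable (my choice of example here):

```python
# PMF of a fair six-sided die.
pmf = {v: 1/6 for v in range(1, 7)}

def F(x):
    """CDF: F_X(x) = P(X <= x), summing the PMF over values <= x."""
    return sum(p for v, p in pmf.items() if v <= x)

# Property: P(a < X <= b) = F(b) - F(a).
a, b = 2, 5                 # the values 3, 4, 5 lie in (2, 5]
print(F(b) - F(a))          # 0.5, i.e. 3/6 (up to float rounding)
# Limiting behaviour: F -> 0 to the far left, F -> 1 to the far right.
print(F(-100), F(100))      # 0 and 1.0 (up to float rounding)
```

This tiny function already exhibits all four properties on the slide: values in [0, 1], non-decreasing, 0 at −∞, and 1 at +∞.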

(Refer Slide Time: 6:15)

Let us start with examples, so it is good to define things but only when we do simple examples do
we really understand what it is, so here is the simplest random variable that we can think of, the
Bernoulli random variable 𝑥 taking 0 with probability 1 − 𝑝 and 1 with probability 𝑝, and here is
a picture of the CDF, of this, so this is a picture of 𝐹𝑋 (𝑥), and so notice what I have done here,
maybe the expression is easy to deal with also in this case.
Now, 𝐹𝑋(𝑥) will be 0 for 𝑥 < 0. Maybe I will write it down: if you take, for instance, 𝐹𝑋(−1), it is 𝑃(𝑋 ≤ −1), and clearly this is 0, because X takes only the values 0 and 1, so there is no question of it being less than or equal to −1. So, for any 𝑥 < 0, 𝐹𝑋(𝑥) = 0.

Now, if your x happens to be between 0 and 1, 0 inclusive, then you will get 1 − 𝑝. Why is that? If you take 𝐹𝑋(𝑥) with x equal to, let us say, 0.1 or something, then that is 𝑃(𝑋 ≤ 0.1), which is P(X = 0), so you get 1 − 𝑝. This is true for any x such that 0 ≤ 𝑥 < 1, so this < 1 is important.

You cannot make it 1. Why is that? Once x becomes 1, the whole probability will be 1, because 𝑃(𝑋 ≤ 1) includes everything; both 0 and 1 are included, so you get 1.
So, you see you can easily check these three properties and that is how it is. So, I have just showed
you some illustration just to argue why this is true but you can hopefully see why this CDF comes
out to be 0, 1 − 𝑝 and 1.

So here we took a discrete random variable x and then we try to compute the CDF, a much better
way to visualize this is through a graph. So, here is x and here is 𝐹𝑋(𝑥). For 𝑥 < 0 the CDF is 0, as you can see here, and I put this circle here without filling, which means that at 𝑥 = 0 the CDF is not 0; it is actually 1 − 𝑝, so I put a solid circle there to show that, which is, say, some value like 0.4.

So, this is basically 1 - p, and notice what happens between 0 and 1 𝐹𝑋 (𝑥) stays there, it does not
go up or it does not move anywhere, it just stays flat there, till it hits the next point 1 where x takes
some value with some probability, there again at one you have a discontinuity, you have a jump at
1, the actual value of 𝐹𝑋 (𝑥) becomes 1, and then after 1 it just simply stays at 1.

So, this is the plot of this 𝐹𝑋(𝑥); I have just carried over the same values. For 𝑥 < 0 it is 0; for x between 0 and 1 it is 1 − 𝑝, and that 1 − 𝑝 comes here, flat between 0 and 1; and for x greater than or equal to 1, 𝐹𝑋(𝑥) becomes 1, and that is a flat line there.

Now you have this filled circle and open circle here think about why that happens, see at 𝑥 = 0
there seems to be two values here, so I have to tell you which is the correct value for 𝐹𝑋 (𝑥) and
looking at this very closely you say at 0 𝐹𝑋 (𝑥) becomes 1 − 𝑝, so you should put a solid circle
on top and unfilled circle, open circle, sort of in the bottom, same thing at 𝑥 = 1, when 𝑥 is 1 𝐹𝑋 (𝑥)
= 1.

So, it is not 1 − 𝑝, so at 1 − 𝑝 you put a open circle and at 1 you put a solid circle, so it is just a
figure to illustrate how the function 𝐹𝑋 (𝑥) looks, so this picture is very nice to keep in mind, if you
forget about everything else somebody gave you this picture, you learn a lot about the random
variable, if looking at this picture I know that the random variable is discrete, it takes two values,
it takes a value 0, it takes a value 1.

At 0 it takes that value 0 with probability 1 − 𝑝, it takes the value 1 with probability 𝑝, this
distance is 𝑝, this distance is 1 − 𝑝; the amount of jump that you have in the step is equal to the probability with which X takes that value, is it not? So, for discrete random variables the CDF sort of looks like a staircase and keeps going up. This does not happen just for the Bernoulli random variable; I have just shown you in detail, for the Bernoulli case, how this plot comes about.
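The three-piece Bernoulli CDF just described can be written directly as a function; this is a sketch, with p = 0.6 picked arbitrarily for the printed values.

```python
def bernoulli_cdf(x, p):
    """CDF of X ~ Bernoulli(p): 0 for x < 0, 1 - p for 0 <= x < 1, 1 for x >= 1."""
    if x < 0:
        return 0.0
    elif x < 1:
        return 1 - p
    return 1.0

p = 0.6
print(bernoulli_cdf(-1, p))   # 0.0: X never falls at or below -1
print(bernoulli_cdf(0.5, p))  # 0.4: only the outcome 0 is <= 0.5
print(bernoulli_cdf(1, p))    # 1.0: both outcomes are <= 1
```

The two jumps, of sizes 1 − p at 0 and p at 1, are exactly the probabilities of the two outcomes, matching the step picture on the slide.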

(Refer Slide Time: 11:36)

You can quickly generalize this to other random variables, so let us take throwing a die as the next
random variable. Now, 𝑥 is uniform now between 1 to 6, 1, 2, 3, 4, 5, 6, and if you do a CDF for
this. Now, I have written down the whole formula here in gory detail: it is 0 for 𝑥 < 1 and 1 for 𝑥 ≥ 6, so those parts are easy. Then, in between: if small x is between 1 and 2, the probability that capital X is less than or equal to some value between 1 and 2 is the same as the probability that X takes the value 1, which is 1/6. If small x is between 2 and 3, then capital X could be 1 or 2, so that is 2/6. If small x is between 3 and 4, then capital X could be 1, 2, or 3, and that gives you 3/6, and so on. It is easy to see how this works: 𝑃(𝑋 ≤ 𝑥) is what you are computing, so you go from small x to the possible values that capital X could take and add up their probabilities. It is quite simple in that case.

Now you can also sketch this 𝐹𝑋 (𝑥) which is what I have done here and the sketch is very important
I want you to take some time and make sure you understand how the sketch came, it is very easy
at some level, but sometimes if you are seeing it for the first time it can be confusing, it is very
easy, so on the x axis you go 0, 1, 2, 3, etcetera, for x < 1, it is just flat at 0, for x between 1 and 2
it is flat at 1 / 6, so this value is 1 / 6, this value is 2 / 6, this value is 3 / 6, so on.

So, that is what I have plotted here just steps and they all sort of like non-overlapping steps and
you can go from one up, starts at 0 and goes through these steps and reaches 1, so this going from
the PMF. The PMF was given to you, this discrete random variables PMF was given to you, you
went from PMF to the CDF and this is the way you could do it.

So, supposing I give you the CDF, I want you to go back to the PMF, how will you do it, for a
discrete random variable it is very easy, is not it? You go and identify where all the jumps happen,
where all the steps happen and how much did you jump by, so you just have to say this PMF will
take value 1 with probability 1 / 6, value 2 with probability 1 / 6, value 3 with probability 1 / 6, so
on.

Wherever the jump occurred, exactly where the jump occurred the probability of the random
variable taking that value is 1 / 6. Now, notice when there is no jump, if you come here for instance,
let us say it is 4.5, there is no jump here, no jump, so that implies probability that x =4.5 is actually
0, is not it? So, if there is no jump the random variable does not take that value as in some sense it
is probability 0 of taking that value.

So, nice simple plot, in some sense the cumulative distribution function is very nice, it is for the
discrete random variable, but some of you may argue at this point the PMF is fine, PMF is good
enough, why do you need the CDF, what did I gain from looking at the CDF. We will come to it,
we will come to it slowly, but at least for now try to make sure you can go from PMF to CDF,
CDF to PMF for a discrete random variable.
This is an important skill that you need to pick up when you deal with random variables. How do
you go from PMF to CDF? How do you go from CDF to PMF for a discrete random variable? I
have shown you two illustrations in simple cases, I will also show you something slightly more
general, but hopefully you see the picture.
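The round trip described above can be sketched for the fair die: running sums of the PMF give the CDF at the support points, and the jump sizes of the CDF recover the PMF.

```python
# PMF of a fair die.
pmf = {k: 1/6 for k in range(1, 7)}
support = sorted(pmf)

# PMF -> CDF: cumulative sums at the support points.
cdf, running = {}, 0.0
for v in support:
    running += pmf[v]
    cdf[v] = running

# CDF -> PMF: the jump at each support point is the probability there.
recovered, prev = {}, 0.0
for v in support:
    recovered[v] = cdf[v] - prev  # size of the jump at v
    prev = cdf[v]

print(all(abs(recovered[v] - pmf[v]) < 1e-12 for v in support))  # True
```

Between support points the CDF is flat, so the "jump" there is zero, exactly the probability-0 observation made for x = 4.5 above.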

(Refer Slide Time: 15:39)

So, here is a slightly more general PMF, general CDF, and let us see if we can draw some
inferences from here. Here is the picture: suppose somebody tells you that this is the picture of 𝐹𝑋(𝑥) versus x, with values 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5 and probabilities 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5. I have just drawn it like that; let us assume that 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5 are in increasing order, so that is easier for us.

So, supposing you have to now label this axis, how would you go about doing it? If this is the CDF (let us put capital X here), how would we go about labeling it? We know that it starts at 0 and ends at 1, and the random variable takes value 𝑥1 with probability 𝑝1. So, wherever the first jump occurs, it had better be 𝑥1; the next jump occurs at 𝑥2, then 𝑥3, then 𝑥4, and the last one at 𝑥5.

So, maybe I should draw dot here to show where that comes from, so that is easy enough to see,
then what are these jumps, what will this jump be, this jump will have a magnitude of p1, this jump
will have a magnitude of 𝑝2 , this jump will have a magnitude of 𝑝3 and so on, so that is it, so CDF
is very nice, I mean, it is easy enough to draw, some people may say it is a better looking picture
than other things but anyway.

So, you see, this is the picture of the CDF, and this kind of picture is good to retain in your head. If you have a PMF for a discrete random variable, how will the CDF look? It starts at 0 and jumps wherever the random variable takes a value with positive probability; wherever it does not take a value, it stays flat. What is hidden in this is that wherever the plot is flat there is no jump at that point, meaning the random variable takes that value with probability 0; that is what is very important.

Whenever there is a jump, exactly at that point there is a positive probability of the random variable
taking that value; where there is no jump, it will not take that value, so its probability is 0. So, this
is a good picture to keep in mind that hopefully nicely summarizes the CDF for you. It also seems
simple; I have not really added anything of value beyond the PMF here. At one level it is just
another representation of the PMF if you want, but there are nice pictures you can draw which will
give you an impression of where we are going.
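In the other direction, the PMF can be read off from the jump magnitudes of the CDF. Here is a minimal Python sketch (the course uses Python for its simulations); the jump locations and heights are hypothetical, chosen only for illustration:

```python
# Recovering a PMF from a discrete CDF: wherever the CDF jumps, the
# jump magnitude is P(X = that value); where the CDF is flat, the
# probability of that exact value is 0.
# Jump locations and post-jump CDF values are made up for illustration.

jump_at   = [1, 2, 3]        # x-locations of the jumps
cdf_after = [0.2, 0.7, 1.0]  # value of F just after each jump (ends at 1)

pmf = {}
prev = 0.0                   # F is 0 before the first jump
for x, F in zip(jump_at, cdf_after):
    pmf[x] = F - prev        # jump magnitude = P(X = x)
    prev = F
```

The recovered probabilities automatically sum to 1, because the last CDF value is 1 and the running differences telescope.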

(Refer Slide Time: 18:10)

So, before that let us do something slightly more non-trivial: let us go to a larger random
variable. So far we have been dealing with 4 values, 5 values, 2 values; let us go to 100 values.
Here is 𝑋, a random variable which takes values uniformly from 1 to 100. We will come to plots
soon enough, but let us look at the description of the CDF: it takes value 0 when x ≤ 0.

In fact, this is a mistake on the slide: it should not be x ≤ 0, so let me just cut this out and say
x < 1, that is correct. So if x < 1, then the CDF takes value 0; that we can see easily. After that,
if x is between 1 and 2 the CDF takes value 1 / 100, when x is between 2 and 3 it takes value
2 / 100 and so on.

And this is captured in this description: if k is 1, 2, ..., 99 and x falls between k and k + 1,
inclusive of k but not including k + 1, then the CDF equals k / 100, and if x is greater than or
equal to 100, then it goes to 1. So that is the picture: it starts at 0, at 1 it jumps by 0.01, at 2 it
jumps by 0.01, at 3 it jumps by 0.01; 100 times it jumps and gets to 1, so it is easy to picture this.
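The CDF just described can be written down directly; a small Python sketch for 𝑋 uniform on {1, ..., 100}:

```python
import math

# CDF of X ~ Uniform{1, ..., 100}:
#   F(x) = 0       for x < 1,
#   F(x) = k/100   for k <= x < k + 1, with k = 1, ..., 99,
#   F(x) = 1       for x >= 100.
def F(x):
    if x < 1:
        return 0.0
    if x >= 100:
        return 1.0
    return math.floor(x) / 100
```

Each integer point adds a jump of 0.01, and after 100 such jumps F reaches 1.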

(Refer Slide Time: 19:57)

So, now let us start doing this calculation of probability of intervals; there is a reason why I am
doing probability of intervals here. In a few more slides we will see a different way of doing this
calculation, but one can ask for probabilities of intervals, and for a uniform random variable like
this it is easy to do the calculation: the probability that 𝑋 lies between 3 and 10.

Given the way in which I have written down that interval, this is 𝐹𝑋 evaluated at 10 minus 𝐹𝑋 (3),
and you get 7 / 100; it is easy enough to see. Now, what if I change this to 3.2 to 10.6? You might
say, why am I doing this, 𝑋 does not take these values, but you will see, it is illustrative. Again you
can use the same formula, nothing changes: you can put 𝐹𝑋 (10.6) - 𝐹𝑋 (3.2), but 10.6 still comes
down to 10 / 100 and 3.2 still comes down to 3 / 100; that is what is very important.

So, this is 10 / 100 and this is 3 / 100; interestingly, these are also 10 / 100 and 3 / 100. Why is
that? Because over this whole range the CDF is just k / 100. So it does not matter whether you
take 10.6, 10.2, 10.3 or 10.8, you will always get 10 / 100; it does not matter if you take 3.1, 3.2,
3.3 or 3.7, you will always get 3 / 100, so you get the 7 / 100 again.

You can also do other ranges, like X ≤ 17; if you do not put anything on the left side you can take
- ∞ there, and that is just 𝐹𝑋 (17), so you get 17 / 100 from the direct formula. You can also do X
greater than 87; notice what I have done here: 𝑃(𝑋 > 87) is 1 - 𝑃(𝑋 ≤ 87).

Remember the complement: the complement of the event X > 87 is X ≤ 87, and now we are ready
to write 𝑃(𝑋 > 87) as 1 - 𝐹𝑋 (87). So for any interval, even if it is not between two points, only
X greater than 87 or X ≤ 17, you can compute all of those using the CDF.

So, in PMF terms you would have added a bunch of probabilities, a sum over all the values of k;
with the CDF it is just the value of the CDF itself. That is how the interval calculation is done.
Once again I want to highlight that between 17 and 17.3 there is no difference, so given this CDF
it does not change, and between 87 and 87.4 there is no difference; you get the same value in each
case. So, this is computing probability of intervals using the CDF.
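These interval calculations are one-liners once the CDF is in hand; a sketch repeating the lecture's examples for 𝑋 uniform on {1, ..., 100}:

```python
import math

# Interval probabilities from the CDF of X ~ Uniform{1, ..., 100}:
#   P(a < X <= b) = F(b) - F(a),  P(X <= b) = F(b),  P(X > a) = 1 - F(a).
def F(x):
    if x < 1:
        return 0.0
    if x >= 100:
        return 1.0
    return math.floor(x) / 100

p1 = F(10) - F(3)        # P(3 < X <= 10): equals 7/100
p2 = F(10.6) - F(3.2)    # same 7/100, since 10.6 and 3.2 round down
p3 = F(17)               # P(X <= 17): equals 17/100
p4 = 1 - F(87)           # P(X > 87), via the complement: 13/100
```

Note how the NOT rule handles the one-sided interval: P(X > 87) is computed from its complement, exactly as in the lecture.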

This again is a simple skill and you should develop it: given a description of the CDF, how do
you find probabilities of intervals? There are different types of intervals; I may include an
endpoint, I may exclude it, and you should know how to move between these. Maybe the interval
extends only on the left side, only on the right side, or lies between two values; maybe there are
two different intervals. For all of that you use the basic rules of probability.

So the AND rule, OR rule, NOT rule, all of that comes in; there is nothing new about it, but
knowing how to use the CDF for these calculations is very important. We are going to revisit these
calculations later on when we take a continuous look at this, but I want you to pick up the skill;
this is important and you may have some assignments in tutorials on how to do these kinds of
calculations.
So, this is a good point to stop this lecture. In the subsequent lecture we will start looking at this
larger alphabet case more closely, start drawing pictures of the CDF, and see if we can make this
jump into the continuous world in an interesting way. Thank you.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 5.3
Continuous Random Variable: Approximation of CDF from Discrete to Continuous

(Refer Slide Time: 0:14)

Hello and welcome to this lecture. So far we have been looking at the CDF of discrete random
variables: how to plot them, how to go between PMF and CDF, and how to use the CDF to compute
probabilities of the intervals within which random variables can fall. Those are important skills to
pick up.

Now let us start moving on to larger and larger alphabets. To me a picture like this nicely conveys
where we are heading. This is the plot of the CDF of a Binomial(𝑛, 0.6) random variable, and I
am increasing 𝑛 but keeping the scale of the picture the same; notice why I am doing that. You
may think about why I am keeping the scale the same; only then will the picture come out like
this.

If you keep rescaling you will not see this; continuous is always like that, you keep the scale the
same and only then you see the continuous line as opposed to the discrete one. Anyway, think
about that. So here is 𝑛 = 10, 𝑛 = 20, 𝑛 = 50, 𝑛 = 100, and on the same scale notice how the
CDFs start looking like a continuous line. This is actually the CDF; I am actually drawing those
steps.

Remember, when 𝑛 is 10, 𝑋 takes values 0 to 10 and jumps at every one of those points. At 0 it
jumps by a very small number. Why is it very small? Because the probability that 𝑋 takes value 0
is 0.4¹⁰, which is very, very small, so you see a very tiny jump there. Slowly, as it comes close to
5 and 6, you start seeing the big jumps, and then as it goes closer to 10 again you see very tiny
jumps, but at least you can see the jumps.

When 𝑛 = 10 you can see the jumps and the flat segments in the line. Even when I move to
𝑛 = 20 you can already see that the flat intervals are shrinking; they are becoming smaller and
smaller because I have kept the scale the same, viewing the whole thing at the same scale.

When I go to 𝑛 = 50, everything has become very tiny: the flat parts where the CDF stays the
same, the step lengths, have become very small, but you see the jump around 50, 60. The jump
you can notice only there; everything else is bunching up down below, and there is some
significant rise around 50, 60. It gives you a nice picture of how the binomial behaves and looks.

When you go to 𝑛 = 100, most people will say this is almost like a continuous curve, is it not?
Most people would just draw a line, and that is a much simpler description than thinking of it as
discrete with a 100-point PMF, etcetera. I just draw a line, and that line seems fine enough for
most practical purposes; for calculations it should be good enough.

Remember, from a CDF what am I calculating? When I want to find the probability of an interval
I am just subtracting the values of the CDF at two points. Suppose I want to calculate the
probability that 𝑋 lies between 50 and, say, 60. All I need is the difference of the CDF at these
two points; that is the probability that 𝑋 takes values between 50 and 60, is it not?

So, I do not really need to precisely know every value here; it is enough if I have the continuous
approximation. You see, for the difference of two points, even if I draw a continuous line with
some simple description for it, the difference is good enough.
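That difference-of-CDF calculation can be checked directly. A sketch using only the standard library, building the Binomial(100, 0.6) CDF from its PMF:

```python
from math import comb

# CDF of X ~ Binomial(n, p), computed by summing the PMF:
# F(x) = sum over k <= x of C(n, k) * p^k * (1 - p)^(n - k).
n, p = 100, 0.6

def binom_cdf(x):
    if x < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(0, min(int(x), n) + 1))

# P(50 < X <= 60) is just a difference of two CDF values:
prob = binom_cdf(60) - binom_cdf(50)   # roughly 0.51
```

The sum over 61 terms is exactly the kind of computation the continuous approximation later replaces with a single formula evaluation.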

So, this kind of picture, in my mind at least, conveys how the CDF is much more interesting and
comes out in a very natural, nice way. It shifts the focus from individual values and all these small
steps to intervals, and to a continuous function connecting all of these things together, moving
towards the continuous world. Hopefully this makes it easy to picture how continuous random
variables are defined and how they enter the picture. So, this is for the binomial.

(Refer Slide Time: 4:23)

You can also do a similar plot for the meteorite data: you can convert the histogram into a PMF-
type object and then draw the CDF, and here again, from - 5 to 25, look at this nice smooth shape.
This is actually a discrete PMF converted into a CDF; I am drawing small steps, but since the
steps have become so small they just look like one continuous, nice line.

How much easier is it to just describe a continuous line as opposed to all these small values in the
bins? And as far as computing probabilities of intervals is concerned, the continuous line is going
to be good enough: for any range you want, calculating the difference of the two CDF values
should be okay. So this is very interesting: when you look at the picture of the CDF, you see
literally how the continuous model becomes much more attractive in the large alphabet case than
the discrete model that we have had so far.
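The plot described here is an empirical CDF. A minimal sketch on a synthetic sample (standing in for the meteorite data, which is not reproduced here):

```python
import random

# Empirical CDF of a data sample: at x, the fraction of sample points
# <= x. Each point contributes a step of 1/n, so with a large sample
# the staircase looks like one continuous curve.
random.seed(0)
data = [random.gauss(10, 5) for _ in range(1000)]  # synthetic stand-in

def ecdf(x):
    return sum(v <= x for v in data) / len(data)

# An interval probability estimate is again a difference of CDF values:
est = ecdf(15) - ecdf(5)   # estimate of P(5 < X <= 15)
```

With 1000 points the individual 1/1000 steps are invisible at normal plot scales, which is exactly the effect the lecture describes.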
(Refer Slide Time: 5:20)

So this leads us to the definition of a cumulative distribution function. So far we have taken a
random variable 𝑋 and defined the CDF for that random variable. Now I am going to forget about
random variables; there is no random variable at all, I do not care. I am only going to care about
a function, a function that goes from the real line to [0, 1].

Once again I want to repeat this: there is no random variable associated with it. I got rid of this 𝑋;
the subscript 𝑋 should also go, there is no reference to a random variable 𝑋, it is just a function. I
am going to define cumulative distribution functions without reference to random variables, but
keeping all the properties that I want distribution functions to have.

So a function 𝐹 from the real line to [0, 1] will be called a cumulative distribution function, or
simply a distribution function, if it satisfies the following properties: 𝐹 needs to be a non-
decreasing function taking values between 0 and 1, at - ∞ it should start at 0, and at + ∞ it should
saturate at 1. That is how the picture should be, and there is also a technical requirement here that
I will not go into in too much detail.

It needs to be continuous from the right: from the right it has to close itself at each point. Let me
not go into detail here; the first three points are the important ones. That is how our cumulative
distribution functions are defined: they have to be non-decreasing, start at 0, end at 1, and maybe
they jump, maybe they do whatever else they want in the middle.

Now I am not talking about discrete random variables or anything; I am just talking about any
function that starts at 0, ends at 1, and can do whatever it wants in the middle as long as it is non-
decreasing. As long as it is not decreasing, I am going to call it a cumulative distribution function.
These functions, you can see, mirror the properties of the CDF of a random variable.

So, if you had a random variable and defined the CDF as 𝑃(𝑋 ≤ 𝑥), then you would definitely
get a proper, valid CDF. But there can be other functions which satisfy this. So far we only had
discrete random variables, and when we took a discrete random variable and computed its CDF
we saw that it had a very simple stepwise structure.

Now, for my arbitrary cumulative distribution function, I do not need the stepwise structure; it
can be smooth and continuous also. This is the theoretical jump that we are going to make. We
defined CDFs, we looked at discrete random variables and saw that the CDF has a certain jump
structure, but when we went to large alphabets it looked like the jump structure was coming down
to a continuous line.

So, maybe in theory we want to throw away the discrete random variable and define a distribution
function which has all the properties that we want: it starts at 0, ends at 1, is non-decreasing and
slowly gets to 1. What can we do with such a distribution function? We will see soon enough, but
this is the important and crucial jump: we forget about the stepwise structure.

We forget about all the jumps, etcetera; a distribution function can be anything as long as it starts
at 0, ends at 1, is non-decreasing, and satisfies that technical condition of being continuous from
the right. As long as it is like that, we are good.
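These defining properties can be sanity-checked numerically on any candidate function. A sketch using the logistic curve, a smooth valid CDF with no jumps, chosen here just as an example:

```python
import math

# Candidate function: the logistic curve, a continuous valid CDF.
def F(x):
    return 1 / (1 + math.exp(-x))

xs = [k / 10 for k in range(-200, 201)]   # grid from -20 to 20

in_range       = all(0 <= F(x) <= 1 for x in xs)            # values in [0, 1]
non_decreasing = all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))
limits_ok      = F(-50) < 1e-6 and F(50) > 1 - 1e-6         # tails -> 0 and 1
```

Right-continuity holds automatically here, since the function is continuous everywhere; for a staircase CDF that condition is where the jump convention matters.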
(Refer Slide Time: 9:03)

So, let us see some examples, shall we? Here are four different examples of valid cumulative
distribution functions. You can of course take the CDF of any discrete random variable, and that
would be a valid CDF; that is how the whole definition came about. You can see all the properties
are satisfied: it starts at 0 and ends at 1. Look at the first plot, for instance, the top left plot here: it
starts at 0.

That is a tick mark. It ends at 1: another tick mark. It is non-decreasing: yes, also a tick mark. So
it is a valid CDF: it starts at 0, ends at 1 and is non-decreasing. Both of these, you can see, come
from the discrete world, and here is a continuous CDF: it starts at 0, tick mark; it ends at 1, tick
mark; and you can see it is non-decreasing, but there are no steps, there is no step phenomenon.

It is a valid CDF, a cumulative distribution function. Here is another function: again it starts at 0,
ends at 1, and is non-decreasing; no steps, but it is a valid continuous CDF. So we are generalizing
the distribution function to an arbitrary shape: it can take any shape it wants, it can be continuous,
it need not jump, but it has to go from 0 to 1 and be non-decreasing, that is it.

As long as you satisfy that, you have a valid cumulative distribution function. This is an important
step in the theory when we jump from the discrete world to the continuous world. When we want
to move towards continuous random variables we have to have a distribution function which is
continuous: it does not have jumps, it just goes from 0 to 1, non-decreasing. You can see the
connection here; the continuous one on the bottom left mirrors the discrete picture very closely.

You have the discrete jumps, but this continuous one is very, very similar, is it not? These two are
similar. Likewise here, if you drew a picture like this, you see these two are also very similar.
While the top picture needs all these jumps to be specified, the bottom picture is just a line. How
easy it is to specify a line, and how easy it is to do a calculation with a line!

Similarly, on the right side, the top picture has so many tiny jumps, while the bottom picture is
just one continuous curve, and you can describe that curve in many interesting ways. So notice
how the calculations with probabilities of intervals can become much simpler if you have a
continuous model. The theory that needs to develop here is: how do we think of these distribution
functions, how do we work with them, etcetera; all of that will come in the theory.

But hopefully you see the connection between the probability theory and the real-life modeling
that you might have to do with continuous random variables, and how easy it is to work with them
in practice, sometimes much easier than with discrete ones; when you have to do it, you can use
this picture.

(Refer Slide Time: 12:30)


So, let us come back to this problem of using a continuous CDF to do calculations for the uniform
discrete distribution. We saw this example before: when 𝑋 is uniform between 1 and 100, the
small steps are going to be there and you have this complicated looking CDF. One can try to
approximate it. If you want the picture of the CDF here, it starts at 0.

It is going to start at 0 and reach 1 at 100; that is the picture, this is this one. Now, if I change
color, maybe blue, the approximation is simply going to be a straight line like this. So you see
how, for this discrete CDF of the uniform distribution between 1 and 100, the continuous CDF
𝑥 / 100, a simple 𝑥 / 100, is a nice approximation; they are very close, are they not?

Depending on the scale you look at, they are really close. So now let us see if we can do our
wonderful calculation of interval probabilities in both these cases. I am going to ask the same
question as before: 𝑃(3 < 𝑋 ≤ 10). If I use the exact CDF I am going to get 7 / 100; if I use the
approximate continuous CDF, 𝐹𝑋 (10) - 𝐹𝑋 (3), I get the same value.

But notice what goes wrong here. If you go to 3.2 and 10.6, my exact CDF will give me the same
exact 7 / 100, but notice what happens to my continuous CDF: it does not make all those kinds of
distinctions between 3.2 and 10.6, it will simply do 10.6 / 100 - 3.2 / 100, which gives you
7.4 / 100. So it does not quite match, there is a small difference, but some people will say the ease
with which I can work with 𝑥 / 100 is so much better that they will take it any day.

I know there are going to be errors, but there are errors all over the place, so maybe this is not too
big, and maybe instead of 100, if I were to scale it up a bit more, I would get better answers, who
knows. The same thing carries over with X ≤ 17.3: if you do the calculation with the exact CDF
you will get 17 / 100; if you just do it with the approximation you will get 17.3 / 100, good
enough.

So, this is the sort of interesting comparison you can do between the discrete world and a
continuous model for the same thing: what you lose, what you gain, hopefully you can see that in
this picture. Clearly the continuous calculation is simpler, but maybe for the uniform distribution
you do not need it; the calculation is easy enough even with the exact PMF.
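The trade-off just described is easy to see side by side; a sketch comparing the exact staircase CDF with the continuous x/100 approximation:

```python
import math

# Exact CDF of X ~ Uniform{1, ..., 100} versus the continuous
# approximation F(x) = x/100 (clipped to [0, 1]).
def F_exact(x):
    if x < 1:
        return 0.0
    if x >= 100:
        return 1.0
    return math.floor(x) / 100

def F_approx(x):
    return min(max(x / 100, 0.0), 1.0)

# P(3.2 < X <= 10.6): the exact endpoints round down, the approximation
# does not, so the answers differ by a small amount.
exact  = F_exact(10.6) - F_exact(3.2)    # 7/100
approx = F_approx(10.6) - F_approx(3.2)  # 7.4/100, a small error
```

The approximation error here is at most one step width (0.01 per endpoint), which is the price paid for the much simpler formula.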
(Refer Slide Time: 15:31)

Let us go to the binomial, and let me show you a very simple approximation for the binomial
CDF; there are fancier approximations, but let us start with a very simple one. This is the binomial
distribution with 𝑛 = 100, 𝑝 = 0.6. The CDF jumps whenever k is 0, 1, up to 100, and in the
other places it stays flat; that is how the CDF works.

Now this is a nice function, I would urge you to plot it: 1 / (1 + 𝑒 −𝑥 ), where of course the 𝑥 here
stands for (𝑥 − 60) / √24. Do not ask me where all these numbers come from; we will look at
this later on, but for those of you who are keen on it, the 60 is 𝑛𝑝 and the 24 is 𝑛𝑝 × (1 – 𝑝), so
you might wonder where this comes from.

So, the 60 is the mean and the 24 is the variance, etcetera. Look at this function: it is so much
simpler than the exact CDF. There you have terms like 100𝐶50 , and here you do not have any
such thing, it is just an exponential; you just plug it into a calculator and you get the answer. It
turns out this is a reasonably good approximation; it is not a very bad approximation.

So, how do you test this approximation? You calculate probabilities of intervals. If you calculate
them using the exact CDF you are going to get values like 0.0271, 0.51, 0.44, 0.01; look at the
answers with this very simple approximating function and you get fairly close answers. You do
lose something; there is a symmetry on the exact side which does not quite come out here, but it
is okay, you do not lose too much.

In fact, better approximations are possible; even for this 𝑛 = 100, 𝑝 = 0.6 case you can do much
better than this one. So this continuous CDF seems to be a valuable tool in our hands: instead of
modeling some discrete random variable, particularly when it takes a lot of values, using very
many discrete PMF values and complicated CDFs and doing major summations, here you have a
simple formula and it gives you approximate values.
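The slide's simple approximation can be written out and compared against the exact CDF. The (x − 60)/√24 centring and scaling below follow the np and np(1 − p) hints in the lecture, so treat the exact form as reconstructed rather than authoritative:

```python
from math import comb, exp, sqrt

n, p = 100, 0.6

# Exact CDF of Binomial(100, 0.6), summing the PMF.
def F_exact(x):
    if x < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(0, min(int(x), n) + 1))

# The simple continuous approximation: 1 / (1 + e^{-(x - np)/sqrt(np(1-p))}).
def F_logistic(x):
    return 1 / (1 + exp(-(x - n * p) / sqrt(n * p * (1 - p))))

# Same interval probability under both; roughly close, but not exact.
exact  = F_exact(60) - F_exact(50)
approx = F_logistic(60) - F_logistic(50)
```

As the lecture warns, this crude version does lose some accuracy; better continuous approximations (for instance, normal-based ones) tighten the match considerably.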

So you get a sense of where the value lies, and that seems very interesting; this is the power of the
continuous approximation. When you do binomials in particular (for the uniform you might say it
is easy enough, why would I need anything fancier), look at this: it is really powerful, and like I
said, this is a very simple approximation. I picked a very simple one just to show you; one can go
beyond this with relatively good looking expressions.

Maybe with something a little bit longer and slightly uglier than this you can get very good
approximations for the binomial. Hopefully that is another way to convince you that all the effort
we are going to put into learning about continuous distributions, continuous distribution functions
and continuous CDFs is worth it, and that when you really model, you get much simpler models
in the continuous world.

So, that is the end of this little lecture, which showed you how to approximate a CDF with a
continuous one in large alphabet cases and how to do calculations with it. From now on we will
move on and start studying the general theory.
(Refer Slide Time: 18:42)

So, what is the notion of a general random variable? We already saw discrete random variables,
which take a discrete set of values; what are continuous random variables? We will see all that in
the next lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 5.4
Continuous random variable - General random variables and continuous random variables

Hello and welcome to this lecture. We are going to be talking about general random variables and
in particular continuous random variables. Let us do a quick recap of where we are with respect
to random variables. We saw discrete random variables. What are discrete random variables? We
looked at a probability space where the sample space, which contains all the outcomes, is actually
a discrete set. It does not contain an interval of the real line, for instance; it is a discrete set.

And we looked at the definition of events in that sample space, which is typically all possible
subsets, things like that. Then we looked at assigning probabilities to events. How did we assign
probabilities to events? We eventually came to this idea of the PMF for a discrete random variable,
which seemed like a very easy and simple way to assign probabilities to every outcome; that in
turn gave you probabilities for every event, and we are very comfortable with that setup.

This week we have been talking about situations where, even though the actual set is discrete, the
number of outcomes is so large that dealing with the discrete PMF becomes very painful. It is not
very easy to count and to see what the probability is, and it seems like going to the CDF, and in
particular approximating the discrete CDF with a continuous CDF, is a very good idea.

But the discrete CDF came from an actual probability space: there were outcomes, there were
events, there was the PMF, and then we got the CDF. Where does the continuous CDF come from?
Does it come from a probability space? That is a question one might ask. It turns out answering
that question is technically very painful; maybe today it is not that involved, but if you start
reading it on your own when you are starting out, it is a painful technical journey to understand
all of it.
So what we will do is accept the continuous CDF as corresponding to a random variable and start
studying what kind of random variable it is, how to deal with it, etc. That brings us to this
wonderful possibility of general random variables and continuous random variables. We will not
talk about the probability space too much, even though it will be intuitive that there should be
something like that and it is not too difficult to imagine. This kind of thing does throw up
something unusual, and I will point that out as well.

This is a very interesting new type of random variable that we are going to study in this lecture:
not discrete but continuous, a different type of random variable with very interesting new
properties, and it is very useful in modeling; hopefully you are convinced of that based on the
examples I gave you. When you model real scenarios, there are many where you do not want to
be restricted by the discrete world; that seems too painful, and anyway the continuous
approximation is very nice and easy to deal with. Let us explore what that means from the random
variable point of view in this lecture.

(Refer Slide Time: 03:33)

So this theorem basically allows us to define a valid CDF first, in any way you like: you can just
take a pencil and draw a non-decreasing line from 0 to 1, and the theorem assures you that there
is a random variable in some probability space with that CDF. You do not have to worry about it;
just deal with the random variable as if it were already known to you. Notice the connection
between the CDF that you came up with and the random variable.
The value of the CDF at a particular input 𝑥 (small 𝑥), 𝐹(𝑥), is actually 𝑃(𝑋 ≤ 𝑥). That
connection is very, very important, and it allows us to use the CDF directly to compute
probabilities involving the random variable. For any event you define using the random variable
(𝑋 falls inside an interval, 𝑋 is greater than something, 𝑋 is less than something) you can use this
connection to derive the probability.

So what is the main relationship here? The main relationship is given in the theorem itself. One of
the quick properties you can derive from it is that 𝑃(𝑎 < 𝑋 ≤ 𝑏) = 𝐹(𝑏) – 𝐹(𝑎). Note that you
have strict inequality at 𝑎 and less than or equal to at 𝑏; that is important.

You can see why that is true: 𝐹(𝑏) is 𝑃(𝑋 ≤ 𝑏) and 𝐹(𝑎) is 𝑃(𝑋 ≤ 𝑎). If you subtract these two,
what are you going to get? The event 𝑋 ≤ 𝑎 is contained in the event 𝑋 ≤ 𝑏, so when you subtract
the two probabilities you get the probability of the difference of the two events: 𝑋 strictly greater
than 𝑎 but less than or equal to 𝑏.

So it is a nice, simple property. Once you come up with a valid CDF you can find any probability
you want involving the random variable; that is very important. Well, there are a couple of critical
things to worry about. What is 𝑃(𝑋 = 𝑥)? We have intervals, we have less than or equal to, but
how do you find 𝑃(𝑋 = 𝑎), the probability of just one single value?

If you remember, the PMF played an important role there: the PMF was precisely the probability
that 𝑋 equals a particular value. But here we have only been talking about less than or equal to,
less than, greater than; what about the probability that 𝑋 equals a particular value? That is the
point where these continuous random variables, these arbitrary CDFs, differ sharply from discrete
random variables. Let me emphasize that a little bit by talking about that point.
(Refer Slide Time: 06:42)

If you remember, the CDF can have jumps; for discrete random variables it is flat, jump, flat,
jump, flat, jump, that is how it proceeds. For general random variables we allow a continuous
line also: the CDF can be continuous, but it can also have jumps; jumps are not disallowed. The
CDF has to start at 0 and keep increasing; it cannot decrease, and it has to increase all the way to
1. It can increase either by a jump or slowly and steadily, in a continuous manner.

Now, if the CDF that you drew has, at one particular point 𝑥1 , a jump from 𝐹1 to 𝐹2 , then
𝑃(𝑋 = 𝑥1 ), the probability of the value at which the jump occurs, is 𝐹2 - 𝐹1 , just the height of the
jump. I will show you an example soon enough; you will see what I mean.

So if there is a jump in the CDF, then the probability that 𝑋 equals that value is the magnitude of
the jump: how much you jumped by. You can only jump up, you cannot jump down as you go
right; it is a non-decreasing function. That is the first result about the probability of 𝑋 taking a
particular value. For 𝑋 falling between two values, you know it is just a difference; but what about
𝑋 equal to a particular value?

If there is a jump, it is equal to the magnitude of the jump. The next natural question is: what if
there is no jump? If 𝐹(𝑥) is continuous at a point 𝑥0 , there is no jump at 𝑥0 , and then
𝑃(𝑋 = 𝑥0 ) = 0. This is a bit non-intuitive; a lot of people get confused by it. I will tell you where
the confusion comes from soon enough, on the next page, but this is an important point to
remember.

So these three properties you should remember for a CDF. Suppose somebody gives you a CDF
and wants you to identify the probabilities for intervals; what should you do? The probability of
a particular point; what should you do? If there is a jump, the probability that 𝑋 takes the value
at that point is the magnitude of the jump. If there is no jump, the probability that 𝑋 equals that
value is 0, and if you are given an interval, it is simply 𝐹(𝑏) − 𝐹(𝑎).

So now, why is this non-intuitive? When 𝐹 is continuous, why is 𝑃(𝑋 = 𝑥0 ) = 0 non-intuitive?
Here is the reason why; let me take an example and then talk about what happens.

(Refer Slide Time: 09:16)

So here is an example of a random variable 𝑋 with some CDF 𝐹(𝑥), and 𝐹(𝑥) is drawn for you
here. So this is 𝐹(𝑥) versus 𝑥; that is the picture that we have here. You can see it starts at 0: it
stays at 0 till 𝑥 = 0, and at 0 there is a jump. What is the magnitude of the jump? The magnitude of
the jump is 0.5.

You can quickly see, even from the definition, that for 𝑥 strictly less than 0 it is 0, and for 𝑥 equal
to 0 it is 0.5. It has jumped up by 0.5. After that there is no jump; it is just 0.5 + 0.1 𝑥, so at 5,
this point marked here for instance, it reaches 1 and stays 1 afterwards.
So this is just a line, and the line is 0.5 + 0.1 𝑥. You can see why this is a valid CDF; every property
of the CDF is satisfied. There is a jump here though.
So now notice what happens when you start calculating probabilities using this CDF; so there
is a random variable with this CDF. Now I am going to throw some intervals at you, throw
some values at you, and we are going to try and compute probabilities.

(Refer Slide Time: 10:44)

The first value you can compute a probability for is 𝑃(𝑋 = 0), and that is 0.5. I just read off the
magnitude of the jump: you can see that at 0 there is a jump of 0.5, so 𝑃(𝑋 = 0) is 0.5. It is
easy enough to see.
(Refer Slide Time: 11:07)

Now notice what happens when I look at an interval around 2. So 1.99 and 2.01 are somewhere
here; this is the interval. I am going to take small intervals around the point 2. So
what is 𝑃(1.99 < 𝑋 ≤ 2.01)? Notice how I put strictly less than here and less than or equal to
here. This is an interval, so I can just do 𝐹(2.01 ) − 𝐹(1.99). If you plug that in
here you will get 0.1 × (2.01 - 1.99), that is 0.1 × 0.02, that is 0.002.

There is a small probability because this interval has a small width; it is not very high. So this
probability is small, but it is non-zero. 𝑃(1.99 < 𝑋 ≤ 2.01) is non-zero. So that is all right;
you get a small probability.
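To make the shrinking-interval argument concrete, here is a small Python sketch of mine, using this example's CDF 0.5 + 0.1𝑥 on [0, 5]:

```python
# CDF of this example for x >= 0: F(x) = 0.5 + 0.1x, capped at 1.
def F(x):
    if x < 0:
        return 0.0
    return min(0.5 + 0.1 * x, 1.0)

# Probability of ever-narrower intervals around x = 2.
for k in range(2, 8):
    eps = 10.0 ** (-k)
    p = F(2 + eps) - F(2 - eps)            # P(2 - eps < X <= 2 + eps)
    print(f"width {2 * eps:.0e}: probability {p:.2e}")
# The probability shrinks in proportion to the width; in the limit, P(X = 2) = 0.
```

Running this shows the probability falling by a factor of 10 every time the precision is sharpened by a digit, which is exactly the intuition behind 𝑃(𝑋 = 2) = 0.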

Now notice, if you have finite precision for your random variable 𝑋, a value is taken with
positive probability. So that seems fine. But what happens when you increase the precision? Say
𝑋 has to be between 1.9999999 and 2.0000001; you are making the width of the interval smaller,
and the probability keeps falling.

So the probability that it is within an interval is non-zero, but as you keep decreasing the width,
clearly the probability keeps on falling. Now it turns out, if you keep pushing and pushing and
pushing, in the limit, if you ask for 𝑃(𝑋 = 2.0000000) with no limit on precision, then the answer
has to be equal to 0. So this is what is non-intuitive about it. The value 2 is actually in the interval
1.99 to 2.01, is it not?
Now 𝑋 can take a value in that interval with probability 0.002, but the probability that 𝑋 equals
2.000000 with infinite precision is actually 0. We do not associate a positive probability with any
single value in the interval. So this is one confusing thing in the theory. It would have been nice
not to have this, but it turns out that to have a consistent theory you have to do this, with these
kinds of continuous CDFs and the random variables defined from them; in the discrete probability
case we never had such confusion.

There it is very clear what the values taken by the random variable are. In the continuous case,
the random variable takes values which lie in intervals. With some interval, with lesser precision,
you can say something, but with infinite precision you can never really nail down a continuous
random variable, at a place where the CDF is continuous. You cannot nail it down with infinite
precision; that is just not possible. You have to say 0 once you go to infinite precision.

So this confuses people a lot; everybody gets confused by this. So it cannot take any value, you
ask, are you saying it cannot take any value? I am not saying it cannot take any value. It can take
some value, but you cannot be that precise about the value. You are not allowed to go to
arbitrary levels of precision, that is all.

So take it with a pinch of salt; if you do not want to understand it too deeply, you can just
remember the rule. The CDF defines a random variable, and that random variable's
probabilities are calculated like this: where the CDF is continuous, at that point, with infinite
precision, it takes that value with probability 0, but around that point it has a non-zero
probability of falling in an interval.

So that is the best way you can think about it. The foundations of this, why this is so, lie very
deep in the technical nature of the mathematical theory. I will point you to some literature
later on if you are interested. So this is the main result here: you cannot take values with
infinite precision when 𝐹(𝑥) is continuous at that point. Let us just accept it for now, and we
will use this rule very clearly: if 𝐹(𝑥) jumps at a point, then 𝑋 takes that value with non-zero
probability.

If it is not jumping, if it is just smooth and continuous at that point, then 𝑋 takes that value with
probability 0. So for instance, if somebody asks, in this CDF, what is the probability that 𝑋 equals 4,
you should just say 0. What is 𝑃(3.9 < 𝑋 ≤ 4.1)? You have to say 0.02. So that is a little bit unsettling,
but that is what it is. When you are imprecise you can put a non-zero probability.

When you are very, very precise, down to infinite precision, it is just impossible; you
cannot assign positive probabilities there. So I will stop here, but think about this, and
remember this as the method to go from a CDF to probabilities for a random variable.

(Refer Slide Time: 16:25)

So now we come to the definition of a continuous random variable. So far we have talked about
discrete random variables and CDFs for discrete random variables, and then we said that this
notion of a CDF can be generalized: I can think of a function, maybe even a continuous one, which
is a valid CDF, and we said that for every valid CDF there is a random variable. Now we come to a
condition on the CDF which will make the random variable itself continuous.

So that is this definition. A random variable with CDF 𝐹𝑋 (𝑥) is said to be continuous if the CDF
is continuous at every point; there are no jumps. In the example we just saw, there was a jump
at one place and no jump elsewhere. What if the entire CDF has no jumps? Then it becomes a
purely continuous random variable. So that is the idea.

It is a bit bizarre, but that is the nature of it: there are no jumps in the CDF. So what happens?
𝑃(𝑋 = 𝑥) = 0 for every 𝑥. If you specify a value 𝑥 with exact precision and you want 𝑋 to be
exactly that, the probability is 0, but around it you can assign a non-zero probability. The probability
that it falls into an interval is 𝐹(𝑏) – 𝐹(𝑎), but notice, for intervals we had this confusing thing
of having to say 𝑎 strictly less than 𝑋, less than or equal to 𝑏.

I had to keep saying that for a given CDF because there may be jumps. If there are no jumps
in the CDF, notice what happens: 𝑃(𝑋 = 𝑎) = 0 and 𝑃(𝑋 = 𝑏) = 0. So because of that,
all these endpoint distinctions do not matter. Whether the interval uses less than or less than
or equal to does not matter, because 𝑃(𝑋 = 𝑎) = 0.

So I can just drop the distinction between less than or equal to and less than, and simply have
equality here for all possible intervals: whether they are closed on one side, open on one side,
closed on both sides, or open on both sides, all of them have the same probability. So that is a very
nice, comforting thing about a continuous random variable. Of course, the part that makes
everybody uncomfortable is that it never takes any one exact, precise value with positive
probability, but it can take values in an interval. That is how continuous things are; they are
always like that.

So this is one slide which defines continuous random variables; remember the last point. When
you are dealing with continuous random variables, or a range where the random variable is
continuous, you do not have to worry about less than versus less than or equal to; all of them give
the same probability. But if there are jumps, then you do have to worry about the less than or
equal to. All right, so let us now look at a few examples.
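A quick simulation can illustrate the point about endpoints. This is a sketch of mine, assuming 𝑋 uniform on [0, 6] (a continuous random variable, so any single endpoint has probability 0):

```python
import random

random.seed(1)
n = 200_000
samples = [random.uniform(0, 6) for _ in range(n)]  # X uniform on [0, 6]

# Open vs closed endpoints: for a continuous random variable they agree,
# because P(X = 2) = P(X = 4) = 0.
p_open = sum(2 < x < 4 for x in samples) / n
p_closed = sum(2 <= x <= 4 for x in samples) / n
print(p_open, p_closed)  # both close to 1/3, and essentially equal
```

The two empirical frequencies can differ only if some sample lands exactly on 2 or 4, which (up to floating point) essentially never happens, mirroring the theory.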
(Refer Slide Time: 19:01)

So here are a few examples. Given a CDF, one of the skills you need to pick up is this: if somebody
draws a picture of the CDF and shows it to you, you should be able to identify whether it is discrete,
whether it is continuous, or whether it is neither. So I have four examples of CDFs here, all of them
valid CDFs, because the first skill is to identify whether or not it is a valid CDF.

For a valid CDF, you know the conditions: it has to start at 0, and then it can jump or not jump, but it
has to keep increasing, at least be non-decreasing, then finally hit 1 and stay flat. There is also the
right-continuity condition; it is a technical condition in the CDF definition, but according to those
conditions all these four are valid CDFs. So you can see the top left is a discrete random variable.

How do you identify a discrete random variable? It will only have jumps; it will not increase
without a jump. Every increase will be at a jump: it will jump or stay flat, jump or stay flat, jump
or stay flat. If it behaves like that, then it is discrete. How do you identify the values taken by a
discrete random variable?

You look at the CDF: wherever it jumps, those are the values the random variable takes, and with
what probability does it take those values? The magnitude of the jump. So if you look at the top
left, it takes values 1, 2, 3, 4, 5, 6, each with probability, it looks like, 1 / 6. So it is like the throw of
a die.
On the other hand, look at the top right, which is clearly a continuous random variable: the
CDF has no jumps, it starts at 0, goes in a straight line to 1, and is flat after that. So it is a
continuous random variable. The probability that it takes one particular value is always 0, but
for the probability that it falls within an interval, we just subtract one CDF value from the other.

So for instance, for this random variable at the top right, the probability that it falls in the interval
0 to 6 is actually 1; that is it, it falls entirely in that interval. If you want the interval 2 to 4,
you can subtract it out and you will see 1 / 3 is the probability; 0 to 3 is 1 / 2. You can calculate
based on this equation. So that is the top right.
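Those numbers can be checked in a couple of lines. This is my sketch, assuming the straight-line CDF at the top right is 𝐹(𝑥) = 𝑥/6 on [0, 6]:

```python
# Assumed CDF of the top-right example: a straight line x/6 from (0, 0) to (6, 1),
# clamped to 0 below and 1 above.
def F(x):
    return min(max(x / 6, 0.0), 1.0)

print(F(6) - F(0))  # P(0 < X <= 6) = 1
print(F(4) - F(2))  # P(2 < X <= 4) = 1/3
print(F(3) - F(0))  # P(0 < X <= 3) = 1/2
```

Every interval probability is just a difference of two CDF values; no other machinery is needed.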

The bottom left is a mixture. It is neither continuous nor discrete because it does both: it has a
jump, and it also rises without a jump, in a continuous fashion, to 1. So that is a mixture
distribution; it is neither discrete nor continuous. There are cases like that; rarely we will see
such cases also, but mostly we will consider discrete or continuous random variables.

Yet another example, at the bottom right, is a continuous random variable whose CDF is a smooth
curve, not just straight lines; it can be a smooth curve that increases from 0 to 1. That is again a
continuous random variable. There are no jumps here, but the probability of intervals is a bit
different from what we had before. From the way the curve is drawn, it looks like it takes a lot of values.

I mean, we do not have a precise equation here, so we would need the precise equation of the curve
to say anything more, but clearly one can identify that it is a continuous random variable. So this
is a good skill and you should clearly have it: seeing a picture of a CDF, identify whether it
represents a discrete random variable, a continuous random variable, or a mixture which is neither
discrete nor continuous. Hopefully this gives you some idea of how random variables with a
general CDF may look. So why would you use continuous models?
(Refer Slide Time: 22:49)

So I am going to throw a few scenarios at you, where the number of values taken in the random
phenomenon is such that a discrete model may not really be very useful for you. You may want
to use a continuous random variable to approximate the situation.

So here is the first situation. It is a very standard example in many books. You have a dart which
you are throwing at a circular board, and you want to measure the distance from the center of the
board to the dart. It is a very standard example, given to explain why you may need continuous
random variables. The dart can fall anywhere, and there are so many possible distances
depending on how precisely you measure.

You will get so many different possibilities that dealing with a discrete model will be very tough;
continuous might be easy. The weight of a meteorite hitting earth, we already saw that example:
there are so many values that a continuous model is so much easier to deal with. Weight of a
human being, height of a human being: there is such a wide variety of heights that you might as
well deal with a continuous random variable and do calculations.

Coming back to cricket, we have been talking about cricket a lot: the bowling speed, the speed of
a delivery in cricket, is probably a good candidate for a continuous random variable. The price of
a stock, too: because the stock keeps varying over a wide range, taking lots of different values,
keeping track of the discrete nature may not be easy; continuous might be much, much easier.
So this is something you have to look out for. When some data comes to you, which data do you
want to model with a discrete random variable, which do you want to model with a continuous
one, and for which may you even have to use mixtures? You have to identify that; it is good to do,
and these are the sort of rules of thumb.

If it takes a small number of values and you can deal with it as discrete, that is great, and that is
probably the easiest; but if the data seems to be taking a lot of different values and you are just
breaking your head keeping track of all of them, a continuous model might actually be better.
With a continuous model you do histograms and all that; we will see examples of that a bit later.

(Refer Slide Time: 25:06)

So we are going to close this short lecture with a couple of examples. This is an important skill to
have: suppose somebody shows you a CDF, how do you calculate probabilities for events involving
the random variable? It is actually very simple; you do not need to know the deep nuances of these
things. You just use the formula, but sometimes you can slip up, so pay careful attention. I would
say you should try to sketch the CDF and make sure you clearly understand what type of question
is being asked and where it falls, etc.

So I have put down a whole bunch of calculations; I may not do all of them, only what I feel is
representative. Here is a whole bunch of probabilities being asked; maybe one or two I might add
to this. For instance, you may want to calculate two more things which may be interesting.

𝑃(𝑋 > −2), 𝑃(𝑋 ≥ 3), why not; let us just add these. So I like to sketch; I think sketching
is good, and I have a page for working things out. So let us just sketch it. First, it is good to get
the range: what happens for 𝑥 < − 5?

It is going to be 0, and then between - 5 and 0 it is just 0.2, flat at 0.2, and notice this is a solid
dot like this; this is 0.2. Then at 𝑋 = 4 it becomes 1, so maybe this is not a great scale, but anyway,
there is 1 at this point, and between 0 and 4 it has a linear rise, 0.2 + 0.2 𝑥; it is a linear thing.

So I know my sketch is going to look like this. You can test it out: 𝑋 = 0 gives you 0.2,
𝑋 = 4 gives you 1, and that is how it looks. So there is a jump here, and the jump is of height 0.2;
there is no jump other than that, and then this is 1, this is 4. So that is your CDF. Sketching
is quite useful; once you sketch, it is very easy. So 𝑃(𝑋 ≤ −3): let us start with the easiest one.

𝑃(𝑋 ≤ −3) is simply 𝐹(− 3), and that is 0.2. It is very easy. So what about 𝑃(𝑋 = −3)? At
𝑋 = – 3, 𝐹 is continuous, it just stays the same, so this is 0. Once you see that this is 0, you can
quickly find 𝑃(𝑋 < −3), which will also be 0.2: it is 0.2 - 0, which is 0.2.

So this is how you do these calculations; hopefully it is clear to you. The next question asks for
𝑃(−3 < 𝑋 < −1), and this is going to be 𝐹 (− 1) – 𝐹(− 3), that is 0.2 – 0.2, that is 0. Again,
from - 3 to - 1, 𝐹 is continuous and stays flat, so it is just going to be 0. There is no change here.
What about - 1 to 1?

The formula does not change: 𝐹 (1) – 𝐹( − 1). For 𝐹(1), if you put 1 here you are going to get 0.4,
and 𝐹(− 1) is 0.2, so it is 0.4 - 0.2. Notice how the answer comes out very cleanly; we are just using
the formula, but you have to be careful when you do it and not get misled by what is going on.
So let us look at the other cases; maybe I will use the other page for it.

(Refer Slide Time: 29:55)


𝑃(𝑋 > −2): for this it is best to do 1 – 𝑃(𝑋 ≤ −2), and that is just 1 – 𝐹(− 2), that is 1 - 0.2,
that is 0.8. So you see how we got that calculation; it is quite easy. 𝑃(𝑋 > −2) is 1 - 𝑃(𝑋 ≤ −2),
and that is just the CDF, the direct definition of the CDF.

The next question asks for 𝑃(𝑋 ≥ 3). Notice that at 𝑋 = 3 the CDF is continuous, so 𝑃(𝑋 = 3) =
0, and I might as well write it as 𝑃(𝑋 > 3), which is 1 − 𝑃(𝑋 ≤ 3). So that is
1 – 𝐹(3), and for 𝐹(3), if you substitute 3 you get 0.8. So 1 - 0.8, that is 0.2.

So this is how you do it; you just have to look carefully at every point and work it out. There is one
more question being asked: 𝑃(0 ≤ 𝑋 < 3). That is just 𝐹(3) – 𝐹( 0 ), and that works out
to 0.6 if I am not wrong.
All right, so the next question asked here is: is there an 𝑥0 for which 𝑃(𝑋 = 𝑥0 ) > 0? You answer
this by looking for jumps. The only jump happens here, at - 5, so 𝑃(𝑋 = −5) = 0.2; there is
no other value taken with positive probability except - 5.

So is 𝑋 a continuous random variable? The answer is no, it is not a continuous random variable. Is
it a discrete random variable? No, that is also no. It is one of those mixtures, neither discrete nor
continuous. So hopefully this gave you some confidence. Given a CDF like this, you should
have this basic skill of figuring out probabilities for different intervals and values and everything.
It is easy, but you have to do it systematically and correctly.
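All of the answers in this worked example can be verified mechanically. Here is a Python sketch of mine, with the CDF transcribed from the sketch described above (the helper names are my own):

```python
# CDF of the worked example: jump of 0.2 at x = -5, flat at 0.2 on [-5, 0),
# linear 0.2 + 0.2x on [0, 4), and 1 from 4 onwards.
def F(x):
    if x < -5:
        return 0.0
    if x < 0:
        return 0.2
    if x < 4:
        return 0.2 + 0.2 * x
    return 1.0

def jump(x, eps=1e-9):
    """P(X = x): the size of the jump of F at x (0 where F is continuous)."""
    return F(x) - F(x - eps)

print(F(-3))           # P(X <= -3)      = 0.2
print(jump(-3))        # P(X = -3)       = 0   (F continuous at -3)
print(F(-1) - F(-3))   # P(-3 < X <= -1) = 0
print(F(1) - F(-1))    # P(-1 < X <= 1)  = 0.2
print(1 - F(-2))       # P(X > -2)       = 0.8
print(1 - F(3))        # P(X >= 3)       = 0.2 (no jump at 3)
print(F(3) - F(0))     # P(0 <= X < 3)   = 0.6 (F continuous at 0 and 3)
print(jump(-5))        # P(X = -5)       = 0.2, the only jump
```

Every line is one of the two rules from this lecture: difference of CDF values for intervals, size of the jump for single points.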

(Refer Slide Time: 32:19)

So here is another example. I am not going to go into great detail here, except for plotting. Let
me just plot this and maybe answer a few questions, maybe not everything. It is 0 up to - 5, then
from - 5 to 0 it goes as a straight line; at - 5 it is 0. That is important to check.

So if you put - 5 here you get - 0.2, plus 0.2, which goes to 0. At 0 it is 0.2, and it is a straight line;
that is interesting. Then, starting once again from 0.2 at 0, it is continuous, and at 4 it becomes 1,
so it goes up like this and stays at 1; that is the shape. So it is all continuous, it does not jump
at all; there is no jump. It is continuous, so is 𝑋 a continuous random variable? The answer is yes.

There is no 𝑥0 at which 𝑃(𝑋 = 𝑥0 ) > 0; it is always equal to 0. So your calculations will change
a little bit here and there. I am not going to repeat the probabilities of intervals here; I will leave
that as an exercise for you, and hopefully you see how to do it. It is just a matter of identifying
the interval carefully, associating it back to the CDF, and you will be done.
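For the exercise, the same pattern works. Here is a sketch of mine, assuming the pieces read off the plot are 0.04𝑥 + 0.2 on [−5, 0] and 0.2 + 0.2𝑥 on [0, 4]:

```python
# Assumed CDF of this continuous example: 0 below -5, then 0.04x + 0.2 on [-5, 0),
# then 0.2 + 0.2x on [0, 4), and 1 from 4 onwards -- continuous everywhere.
def F(x):
    if x < -5:
        return 0.0
    if x < 0:
        return 0.04 * x + 0.2
    if x < 4:
        return 0.2 + 0.2 * x
    return 1.0

# With no jumps, single points have probability 0, and open vs closed
# endpoints give the same interval probabilities.
print(F(-3))          # P(X <= -3)     = 0.08
print(F(-1) - F(-3))  # P(-3 < X < -1) = 0.08
print(F(3) - F(0))    # P(0 <= X < 3)  = 0.6
```

Compare with the previous example: the only thing that changed is the flat piece on [−5, 0] becoming a gentle slope, and with it the jump at −5 disappearing.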

So that brings me to the end of this lecture. Once again, what did we do here? We looked at the CDF,
and we said that every valid CDF actually corresponds to a random variable, and that probabilities
for events involving that random variable can be calculated using the CDF's definition.

In particular, there is one special type of CDF: when the CDF is continuous everywhere, the
random variable is called a continuous random variable, and this continuous random variable has
the odd property that the probability that it equals any particular value is always 0. The reason is
that when you specify a value with infinite precision like that, it is not possible to associate a
non-zero probability with it, but if you take a small enough interval, then you will get a small
probability.

Intervals have non-zero probability; a particular value, specified with infinite precision, has 0
probability for a continuous random variable. So the continuous and the discrete are two different
worlds. When you have a continuous CDF, you have a continuous random variable, and its behavior
is very different from a discrete random variable's: where there are jumps, there is a positive
probability with which the random variable takes that value.

When there is no jump, it does not take that value with positive probability. I hope this is clear
enough in some way. As you keep practicing, the skills you need to practice are, importantly:
given a picture of the CDF, you should be able to calculate probabilities, you should be able to
check whether it is a valid CDF or not, and you should be able to figure out whether it is a
discrete random variable, a continuous random variable, a mixture, or things like that.

So what we will consider from now on are continuous random variables, in particular how you can
do calculations more easily using continuous random variables, and special types of continuous
random variables. As I showed you, continuous random variables are invaluable for various
modeling scenarios. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 5.5
Continuous random variable - Probability density function

Hello and welcome to this lecture. So this lecture is going to be all about continuous random
variables. In particular, something called the probability density function. So for the discrete
random variable, we had the probability mass function, probability that discrete random variable
takes a particular value and all that.

When you come to the continuous case, it seems a bit odd, but you cannot have the PMF anymore.
The PMF just gives you 0, 0, 0, 0; it does not make sense. So you need something called the
density. Continuous random variables will take values over an interval; they will have a certain
density over that interval, but not a probability at any particular value. So it is sort of a continuum
in that case.

There are a lot of continuous distributions, very interesting distributions, defined by different
density functions. So this density function plays a very, very crucial role, and makes calculations
very easy for continuous random variables. I say very easy; it is a little more involved than the
discrete case, but the density function is sort of the analog of the PMF in the continuous case.

We will see why that is true. You might wonder: we already saw the CDF, why not use the CDF
for everything? It turns out the CDF is not that easy to deal with when you want to do calculations.
The density, on the other hand, is a much more intuitive and natural thing to use.
(Refer Slide Time: 01:36)

A quick refresher on integration. You probably learned integration in the Math II course, if you
are doing it along with this, or you have done it before, or you already have some background in
integration. This is just a quick refresher so that we are on the same page. First of all, before you
come to integration, you should be comfortable with differentiation.

The derivative of a function represents the slope of the function at a particular point, or its rate
of change: when you focus on a very small region around a particular point, how fast the function
is increasing or decreasing is captured by the derivative. There are formulae for derivatives:
given a description of a function, there are tables which tell you what its derivative is, etc.

Now if you have a function small f and there is a capital F whose derivative is small f, notice the
inversion here. Usually, given a function, you differentiate and get the answer. Here we are
doing the reverse: you have a function small f, and you ask which function, when differentiated,
will give you small f. That is called the indefinite integral of the function.

Of course, I am being very imprecise and loose here, just to give you a rough idea of what it is.
People usually write ∫ 𝑓(𝑥)𝑑𝑥 and denote it F(x); it is also called the anti-derivative. So F(x)
and f(x) are related in that fashion.
Now once you have an anti-derivative, you can do something called a definite integral for your
small f. The notation is ∫ₐᵇ 𝑓(𝑥)𝑑𝑥 = 𝐹(𝑏) − 𝐹(𝑎). This definite integral is the area under
the curve. A very common picture: supposing you have a function f(x), this is the function f(x),
and you go from a to b; this area is given by that quantity.

So ∫ₐᵇ 𝑓(𝑥)𝑑𝑥 is just notation for that area, and it equals 𝐹(𝑏) − 𝐹(𝑎). So how do you find the
anti-derivative? There are tables of integrals you can go and look up; this is not the only table out
there, there are many others. Given an f(x) of a certain form, you usually know what the
anti-derivative is, and today many computer packages can calculate the anti-derivative or the
definite integral for you, at least numerically, to whatever extent possible.

So this is just a one-page refresher so that we are clear on notation, and we will use this notation a
lot. You can already see the capital 𝐹(𝑏) − 𝐹(𝑎); that is like the CDF. So the small f is something
inside which you can integrate to get the CDF; you can see where we are going with this and how
the density arises naturally. So this is a quick refresher on integration.

(Refer Slide Time: 04:45)

Now this leads us to the definition of what is called the probability density function. You have a
continuous random variable X; remember, what is a continuous random variable? A random
variable whose CDF is continuous at every point. It is said to have a probability density function,
usually denoted small f; the CDF is denoted with capital F, the probability density with small f.

If, for every 𝑥0 , the CDF evaluated at 𝑥0 is the integral from - ∞ to 𝑥0 of the PDF. So this
is the idea; let me draw a picture here. Maybe I should draw the PDF: you have a function which
looks like this, let us say this is 𝑓𝑋 (𝑥), and let us say this is 𝑥0 . What is the integral from - ∞?
- ∞ is something way, way, way out there.

So the integral from - ∞ to 𝑥0 of 𝑓𝑋 (𝑥) is what? The area under this curve all the way up to here.
This has to be equal to capital 𝐹𝑋 (𝑥0 ). So this is what is happening; the right-hand side is
actually representing this integral. As 𝑥0 moves, the value of the definite integral, the area under
the curve that you are including, will keep changing.

So you can see that as 𝑥0 increases, the area under the curve only increases, so it is a non-decreasing
function. And this 𝐹𝑋 (𝑥0 ) has to be between 0 and 1; it has to be a probability, it has to start at 0
and end at 1, so that puts some conditions on the density. But this is the meaning of the density: if
you can express the CDF as the definite integral from - ∞ to 𝑥0 of some other function, then that
function becomes the PDF, the probability density function, of your continuous random variable.
(Refer Slide Time: 07:05)

So the CDF is the integral of the PDF. Now what do I know from my basics of integration? The
derivative of the CDF is going to be the PDF. So this is important: the derivative of the CDF is
the PDF. If you have the PDF 𝑓𝑋 (𝑥), its anti-derivative is the CDF, capital 𝐹𝑋 (𝑥), and the
derivative of the CDF is directly taken as the PDF.

So if you differentiate the CDF with respect to x, you are going to get the PDF; the anti-derivative
relation is this. When you do the integral from - ∞ to 𝑥0 of 𝑓𝑋 (𝑥), what will happen, from your
relationship for the definite integral? It is going to be 𝐹𝑋 (𝑥0 ) - 𝐹𝑋 ( - ∞). What is 𝐹𝑋 ( - ∞)? This
is 0, so you get what you want.

So this is the standard way of getting the PDF. The definition is sort of done in reverse, but when
a CDF is given to you, you simply differentiate it and take the derivative as your PDF. In most
cases the derivative will exist and you just have to use the table: differentiate the CDF and you
will get your PDF. Just to be technically correct, I am defining it in this fashion, but usually you
just take the CDF and differentiate it to get the PDF. The derivative of the CDF is the PDF; that
is a very good rule to remember.
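That rule, derivative of the CDF equals the PDF, can also be checked numerically. This is a sketch of mine, assuming a simple straight-line CDF F(x) = x/5 on [0, 5]:

```python
# A straight-line CDF: F(x) = x/5 on [0, 5], clamped to [0, 1] outside.
def F(x):
    return min(max(x / 5, 0.0), 1.0)

def pdf_from_cdf(F, x, h=1e-6):
    """Central difference (F(x+h) - F(x-h)) / (2h) approximates the PDF f(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

print(pdf_from_cdf(F, 2.5))  # about 1/5, the constant slope inside (0, 5)
print(pdf_from_cdf(F, 7.0))  # 0 where the CDF is flat
```

Wherever the CDF rises steeply, the numerical derivative (and hence the density) is large; wherever the CDF is flat, it is 0, which is exactly the intuition the next section builds on.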
(Refer Slide Time: 08:56)

So why the PDF, people might ask. I already have a CDF, I am already confused computing
probabilities, why do you want to confuse me more? Well, is it so that we can keep asking you
more and more complicated questions and confuse you? No, no, that is not the reason.

The PDF is really, really useful. Whenever you want to describe a distribution, a distribution of
values, you would like the describing function to be high where the probability of 𝑋 taking a
value is high, and low where that probability is low.

Notice what happens with the CDF: it keeps on increasing, and just because the CDF is higher
somewhere, it does not mean that 𝑋 takes more values there; only the difference matters, and it is
only between 0 and 1. So it is a different sort of description. The density, on the other hand, is not
like that. The way the density behaves, you will see soon enough: if the density is higher, then 𝑋
takes more values around those points.

I will give you very clear calculations to show you how that is true. So the density is sort of
truer: if you see the density, you sort of see the distribution. If you see the CDF, you are not quite
seeing the distribution; you are seeing how the probability accumulates. Maybe some of you can
visualize that, but it is not so easy. The density, on the other hand, you will see is very, very easy
to visualize.
I will show you some plots and you will see it makes a lot of sense. For instance, the CDF will saturate at 1. That is the highest value of the CDF, but once it has saturated, X never takes a value there; 𝑋 will never lie in that region. So it is confusing to deal with the CDF for that reason. You would want the function that represents a distribution to be 0 wherever 𝑋 is never going to take a value.

The PDF satisfies that: where the CDF is flat, differentiating gives 0. That is why the PDF is much more intuitive and easier to deal with in many complex calculations. Believe me when I tell you, the PDF is so much easier to work with than the CDF, and that is why it is used in many, many computations. So it is not just to confuse you. The only catch is the calculus: you need to be able to differentiate, and some integration is involved when you calculate probabilities with the PDF that is not there with the CDF. But that is okay, you can get used to it, and the PDF is very, very useful.

(Refer Slide Time: 11:21)

Yes, examples; examples are always good. These are all uniform distributions, by the way; later on we will see what that means. Look at CDF 1: it starts at 0, increases as a straight line to 1 at 5, and then it is flat again. So that is the CDF, but notice what the PDF is. I have to differentiate.

When you differentiate a constant, flat piece, you get 0. And this rising piece, if you call the CDF 𝐹𝑋(𝑥) as a function of x, is actually 𝑥 / 5: at 0 it is 0 and at 5 it hits 1, so the curve goes from 0 up to 1 as 𝑥 / 5. So when you take the derivative, the flat part stays 0, and for 𝑥 / 5 the derivative is 1 / 5.

So this is 𝑓(𝑥) versus x: the derivative of x / 5 is 1 / 5, and notice what happens after 5, it drops to 0 again. Why? The CDF is flat at 1, a constant, and the derivative of a constant is 0: no slope there. So notice how intuitive the PDF is. The random variable takes values only between 0 and 5, and the PDF is non-zero only between 0 and 5. It is 0 outside.

Where the PDF is 0, you know the random variable does not take values there. That is a very nice, simple property the PDF has. Looking at the PDF I can read off the height, 1 / 5, so I know how the probability will work out. We know how to do probability calculations with the CDF; you can also do probability calculations with the PDF. You have to do integration, which is a little bit different, but you can do that also. We will see enough examples later on.
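As a quick sketch (not from the lecture; the function names here are my own), the derivative-of-the-CDF rule can be checked numerically for the Uniform[0, 5] CDF F(x) = x/5 discussed above:

```python
# Recovering the PDF of Uniform[0, 5] by numerically differentiating
# its CDF F(x) = x/5 on [0, 5]; the true PDF is 1/5 on (0, 5), 0 outside.

def cdf(x):
    """CDF of the Uniform[0, 5] distribution."""
    if x < 0:
        return 0.0
    if x > 5:
        return 1.0
    return x / 5.0

def pdf_from_cdf(x, h=1e-6):
    """Approximate f(x) = F'(x) with a central difference."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)

print(pdf_from_cdf(2.5))   # ≈ 0.2, i.e. 1/5
print(pdf_from_cdf(7.0))   # 0.0: the density vanishes outside [0, 5]
```

The central difference recovers F'(x) well everywhere except exactly at the kinks (x = 0 and x = 5), where the derivative does not exist; the PDF can be assigned any value at those isolated points without changing probabilities.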

Here is another CDF. In this case, this corner point is 0.5 or 1/2, and the rising piece is 2𝑥, going from 0 up to 1; this again is 𝐹𝑋(𝑥) versus x. Now if you look at small 𝑓𝑋(𝑥) versus x: wherever the CDF is flat, at 0 or at 1, the derivative is 0. And for 2𝑥, what is the derivative of 2𝑥?

If you look at your formula you will see the slope is 2, so the derivative is just 2, and the PDF jumps to 2: it is 0, then 2, then 0 again. So let us contrast these two pictures. In the distribution on the left, the random variable takes values between 0 and 5, sort of uniformly: the PDF is flat, the density is flat, so we say it is uniform. In the right-hand picture, the random variable takes values only between 0 and 1/2; it is concentrated, so the density is higher.

So when the random variable takes a smaller range of values, the density is higher. One point where everybody gets confused: the density can go above 1. What we are dealing with is a probability density, not a probability itself, so the probability density can go above 1. The CDF will never go above 1; the CDF is always between 0 and 1. The PDF, on the other hand, can go above 1.

It is like a physical density. What is density? Weight per unit volume, something like that. Likewise here, the density is the amount of probability per unit length; that is the way you have to think of it, and that is why it is called density. So it can go up to 2, but when the density is 2, the length cannot be large; here the length is only up to 1/2, and 2 × 1/2 = 1. So you are still within the rules of probability.

So the density can go high if the range shortens. This is a good picture to remember. Notice once again how much more intuitive the PDF is: just by looking at it you know where the values are going to be, namely wherever the PDF is high. If it is flat then the values are equally likely, uniform over the range; otherwise different regions have different likelihoods. So that is a good, simple CDF and PDF illustration where the derivative is easy. Let us complicate things a little bit more.

(Refer Slide Time: 15:54)

Here are two other distributions, called the exponential and the normal. We will come back and look at the exact formulae for these, but I just want to point out that when the CDF is smooth, the PDF also ends up being smooth, it has nice behavior, and, as always, wherever the PDF is high is where the values of the random variable will be.

Look at the left column here, top left and bottom left. From the top-left plot alone it may be difficult to figure out; people who know enough about functions will see that this function grows faster near 0 and slows its growth as it climbs up to 1. It is difficult to see there, but look at the density.

The density immediately gives you the picture: it is high around 0 and it tapers off, falling sharply as x grows. So you know this random variable tends to take values between 0 and 2 most of the time; above 2 the probability becomes much smaller. The density indicates where the probability per unit length is high. That is the sort of picture you should keep in mind.

The distribution on the right is a very famous one; you would have heard of the bell curve. This is the normal distribution, and we will see later on what the formulae are. Once again I will argue that the CDF does not quite tell you immediately where the random variable is going to lie, but the PDF immediately brings out that most of the time the values are around 0, that other values are possible, and that it is symmetric around 0.

So it can take positive and negative values equally. Look at the PDF on the left side by contrast: that variable is only ever going to be positive, never negative. All these quick observations come straight from the PDF, and it is very useful for doing calculations as well. So hopefully you are convinced that the PDF is a good thing, as we will see soon enough. Now, what are the properties? How do you identify a PDF; what properties does a PDF have to satisfy?

(Refer Slide Time: 17:54)

Here is the definition of a density function. Remember we had a definition for a distribution function, the cumulative distribution function, saying which functions are valid CDFs; likewise, which functions are valid PDFs, which functions are valid densities? Here are the three properties.
The function has to be non-negative, that is, greater than or equal to 0. Why is that? It is the derivative of an increasing function, so it has to be non-negative; and anyway it is a probability density, it cannot go negative. If it went negative somewhere over an interval, a probability would come out negative, which cannot happen.

If you integrate from −∞ to ∞ you should get 1. That is not very difficult to see. Why? Because ∫ from −∞ to ∞ of f(x) dx, written in terms of the CDF, is F(∞) − F(−∞) = 1 − 0 = 1. That is one way to think about it. The other way to think about it is that the total probability has to be 1.

If you integrate over all possible values that x can take, you should get 1. And f(x) has to be what is called piecewise continuous. It is sort of a technical property: the PDF can have jumps, but between the jumps every piece of it has to be continuous, that is all.

Here are more important properties; in fact the last one is probably the most important, and we will come to it. The first is about existence. Just as with the CDF, where I told you that if you come up with a valid CDF there is a random variable with that CDF, the same thing is true for densities: if you come up with a valid density function, there is a continuous random variable whose density is that function.

So cooking up a continuous random variable is like cooking up a density function: come up with any function that is non-negative, integrates to 1, and is piecewise continuous, and you have a continuous random variable. Now there is something called the support of a random variable, a continuous random variable particularly. The support is basically the set of points at which the density is strictly greater than 0.

Those are the possible intervals where x can take values. Remember this is a continuous random variable, so I cannot say X equals a particular value with nonzero probability; but over the intervals where it falls, the PDF has to be strictly positive. The set of points where it is strictly positive is called the support of the random variable. I will use this notion of support over and over again.

You will see, for instance, that writing −∞ to ∞ everywhere looks ugly, but I can write the normalization as an integral over the support of X: ∫ over supp(X) of 𝑓𝑋(𝑥) dx = 1, a definite integral over the region where 𝑓𝑋(𝑥) is non-zero. Where it is 0, of course, it does not contribute to the integral, so that is an easy thing to say. I have this habit of dropping things like dx; it is just extra writing in some sense, but it is good to write it down.

So: the integral over the support, ∫ f(x) dx = 1. Now coming to the last point. I have put it as the last point, but it is the most crucial one: suppose you have an event A defined using the random variable. How do you define events with the random variable? You say 𝑋 is less than or equal to something, X is greater than something, 𝑋 is between this and that, things like that.

How do you find the probabilities of those events using the PDF? We know what to do with the CDF; what do you do with the PDF? You simply integrate the PDF over those intervals. Find the area under the curve over the intervals you are interested in, and that gives you the probability of that particular event. This is how we use the PDF directly to calculate probabilities, so this last property is very important. We will see a few examples that make it clear.
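As a rough numerical illustration of these properties (a sketch with my own helper names, not lecture code), one can check non-negativity and the integrate-to-1 condition with a simple Riemann sum, here using the density e^(−x) on x ≥ 0 as a stand-in:

```python
import math

# Numeric check of the density properties for f(x) = e^(-x), x >= 0
# (an exponential density with lambda = 1): non-negative, integrates to 1.

def f(x):
    return math.exp(-x) if x >= 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# Integrate over [0, 50]; the tail beyond 50 contributes only e^(-50),
# which is negligible.
print(integrate(f, 0, 50))   # ≈ 1.0

# The last property: P(1 < X < 2) is the integral of f over (1, 2).
print(integrate(f, 1, 2))    # ≈ e^(-1) - e^(-2) ≈ 0.2325
```

The same `integrate` helper works for any event probability: integrate the density over the interval(s) defining the event.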

(Refer Slide Time: 22:02)

There you go, immediately we jump into examples. There are two or three examples before we get to common distributions; I will do the three examples and we will stop this lecture at that point. So here is the function: it is 3𝑥² between 0 and 1, and 0 otherwise.
The first thing is to show that f is a valid density function. What are the conditions? First, f(x) ≥ 0: that is true. Second, the integral from −∞ to ∞, which here is just ∫ from 0 to 1 of 3x² dx. What is the antiderivative, i.e., what when differentiated gives 3x²? It is x³. So the integral is x³ evaluated between 0 and 1.

That is how people write it: x³ evaluated at 1 minus x³ evaluated at 0, which is 1 − 0 = 1. And is it piecewise continuous? Yes, that is usually an easy thing to check. So it is a valid density function, no problem. Now let us consider a random variable X with this density. What is the probability that 𝑋 equals 1 / 5?

This is the easiest thing to answer. Any random variable which has a PDF, or is defined using a PDF, is a continuous random variable, and for a continuous random variable the probability that it equals 1 / 5 is 0. What is the probability it equals 2 / 5? Again 0. All easy questions.

Now the next question is a bit more interesting. Maybe we should sketch this f(x): from 0 to 1 it is 3x², so at 1 it reaches 3, and the curve rises like a parabola. That is your f(x).

Now mark 1 / 5 and 2 / 5 on the sketch. If you look at the density there, f(1/5) = 3 × (1/5)² = 3/25, and f(2/5) = 3 × (2/5)² = 12/25. So the density at 2/5 is 4 times the density at 1/5. What I am asking for here is P(1/5 − ε < X < 1/5 + ε); think of ε as something small.

Notice the density at 1/5 is 3/25 and at 2/5 it is 12/25, but to compute this probability what do I need to do? I have to integrate 3x²: ∫ from 1/5 − ε to 1/5 + ε of 3x² dx. That is x³ evaluated between 1/5 − ε and 1/5 + ε, and if you do this you will get (1/5 + ε)³ − (1/5 − ε)³.

You can do this calculation and you will get 6/25 ε + 2ε³. Expanding the cubes, the 3 × (1/5)² × ε terms add up while the ε² terms cancel, so that is the value you get.
So you see, if you think of ε as something very, very small, the ε³ term is much, much smaller than the ε term, and you can sort of ignore it. Think of ε = 0.01: the first term is of the order of 0.01, while the other involves 0.01³ = 0.000001, which is very small. So that term you can safely not worry about.

So this is roughly 6/25 ε: the probability that X lies within a small window around 1/5 is about 6/25 times ε. Now, at 2/5 the density is higher than at 1/5, so the probability that X lies within the same 2ε-wide interval around 2/5 had better be higher. Is it?

Let us check that out. P(2/5 − ε < X < 2/5 + ε) is the same integral again, so let me cut it short: it is (2/5 + ε)³ − (2/5 − ε)³, which is 6 × (4/25) ε, that is 24/25 ε, plus 2ε³. Once again the ε³ term is much, much smaller than the ε term for small ε, and you can see that this is 4 times the previous probability.

And sure enough, we saw that the PDF itself is 4 times higher around 2/5 than around 1/5, and the window probabilities come out in roughly that same ratio. You do have to be slightly careful with the ε terms: the ratio is not exactly 4, because to get the exact probability you have to account for every value the density takes across the window.

To get the exact probability you do the integral: find the antiderivative, substitute the endpoints, and subtract. But the point I want to bring out is that the PDF really does represent the probability that 𝑥 takes values around a point: where the PDF is higher, X takes values around that point with a higher probability than at a point where it is lower.

This skill is very important: given a particular function, identify whether it is a valid PDF or not, and then compute probabilities for events involving that random variable via integration. Integration will keep coming in and you should be comfortable with it. We will use simple functions mostly in this course, and even otherwise, in practice there is no need to be worried about integration: you always have tables and references like Wikipedia to help you, so you can find integrals without too much of a problem.
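The ε-window calculation above can be verified in a few lines; this is a sketch of my own (the function name is mine), using the exact antiderivative x³ of the density 3x²:

```python
# P(a - eps < X < a + eps) for the density f(x) = 3x^2 on [0, 1],
# computed exactly via the antiderivative x^3:
#   (a + eps)^3 - (a - eps)^3.

def window_prob(a, eps):
    return (a + eps) ** 3 - (a - eps) ** 3

eps = 0.01
p1 = window_prob(1 / 5, eps)   # equals 6/25 * eps + 2 * eps^3
p2 = window_prob(2 / 5, eps)   # equals 24/25 * eps + 2 * eps^3

print(p1, p2)
print(p2 / p1)   # close to 4: the density at 2/5 is 4x that at 1/5
```

The ratio is not exactly 4 because of the 2ε³ term, but for small ε it is very close, which is precisely the lecture's point about the PDF measuring probability per unit length.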

(Refer Slide Time: 29:17)


Here is another problem, something very similar, just for you to get comfortable. The density is 2x; if you like sketches you can sketch it. Remember the support is 0 to 1: the density rises to 2 at x = 1 and then, being a density, it cannot stay there; it drops back to 0. You can calculate the area under this curve: (1/2) × base × height = (1/2) × 1 × 2 = 1, so it is a valid density.

So what is P(0.1 < X < 0.3)? Mark 0.1 and 0.3 on the sketch; the densities there are as shown. Whether the inequalities are strict or not does not really matter for a continuous random variable. The probability is ∫ from 0.1 to 0.3 of 2x dx; the antiderivative is x², so evaluating x² between 0.1 and 0.3 gives 0.09 − 0.01 = 0.08.

You can calculate the others, all of them, in the same way; the problem is just teasing you with open brackets here and closed brackets there, and you will get the same answers. Just for comparison, note this is an interval of length 0.2. Take the same length situated near 1, say from 0.8 to 1: you will get 1² − 0.8² = 1 − 0.64 = 0.36.

Notice how much higher this probability is compared to the earlier one. I will leave the remaining parts as an exercise for you; being able to integrate, evaluate the antiderivative at the endpoints, take the difference, and so on is very important. You can also find the CDF here. What will the CDF be? It starts at 0, and between 0 and 1 it is x², reaching 1 at x = 1; after 1 it goes flat at 1. So notice how the antiderivative becomes the CDF, and notice again how the CDF does not directly tell you how the probability is going to behave, while the PDF tells you exactly how the probability behaves.
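As a minimal sketch (my own code, not the lecture's), these calculations can be done with the CDF x², which is the antiderivative of the density 2x on [0, 1]:

```python
# Probabilities for the density f(x) = 2x on [0, 1], via its CDF:
# F(x) = x^2 for 0 <= x <= 1, 0 below, 1 above.

def cdf(x):
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x * x

print(cdf(0.3) - cdf(0.1))   # P(0.1 < X < 0.3) = 0.09 - 0.01 = 0.08
print(cdf(1.0) - cdf(0.8))   # P(0.8 < X < 1)   = 1 - 0.64  = 0.36
```

The same two-line pattern, F(b) − F(a), handles any interval; open versus closed endpoints make no difference for a continuous random variable.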

(Refer Slide Time: 32:04)

Let me close with another simple type of problem, again a very typical one. You are given a function like this with a parameter k that is not specified, and you have to find the k that makes it a valid density function. The first thing to check is that k has to be non-negative. The function is k between 0 and 1/4, then 2k, then 3k. If you want, you can sketch it, marking 1/4, 3/4 and 1 on the axis.

It goes to k here, then 2k here, and then 3k here; that is the function. The only thing to use here is that the entire PDF, the area under the curve, has to integrate to 1. The integral from 0 to 1 splits into three intervals:

∫ from 0 to 1 of f(x) dx = ∫ from 0 to 1/4 of f(x) dx + ∫ from 1/4 to 3/4 of f(x) dx + ∫ from 3/4 to 1 of f(x) dx = ∫ from 0 to 1/4 of k dx + ∫ from 1/4 to 3/4 of 2k dx + ∫ from 3/4 to 1 of 3k dx = 1.

Now f(x) is k from 0 to 1/4, 2k from 1/4 to 3/4, and 3k from 3/4 to 1. Since k is a constant, each piece is just the constant times the length of the interval: k × (1/4) + 2k × (1/2) + 3k × (1/4) = 1, and you can solve for k.

Let me see: k/4 + 3k/4 is k, and there is another k from the middle term, so 2k = 1 and k = 1/2. You just solve for k. This is also a typical example; it helps you identify what makes a valid PDF, and here exactly one value of k gives you a valid PDF. Once again, looking at this problem: if you fix a small ε-width interval, where does X fall with the highest probability? Between 3/4 and 1, is it not? That is the region where the PDF is maximum.
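The normalization step amounts to one line of arithmetic; this sketch (variable names are my own) solves for k under the piecewise layout described above:

```python
# Piecewise density: f = k on [0, 1/4], 2k on (1/4, 3/4], 3k on (3/4, 1].
# The total area must be 1:
#   k*(1/4) + 2k*(1/2) + 3k*(1/4) = 2k = 1  =>  k = 1/2.

coeff = 1 * (1 / 4) + 2 * (1 / 2) + 3 * (1 / 4)   # area when k = 1
k = 1 / coeff
print(k)   # 0.5
```

The pattern generalizes: for any piecewise-constant candidate density, compute the area with the parameter set to 1, then divide.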

You can draw the CDF corresponding to this: it will be continuous, made of three straight-line segments leading up to 1. All right, so this is the definition of the density function. Once again, the density is probability per unit length; over small lengths it can vary across the length itself, so you have to integrate to find the actual probability. That is very important. The PDF is the derivative of the CDF; that is a good way to remember it.

That is how you find the PDF; you can go back and forth by doing differentiation and integration. So thank you very much. In the next lecture we will start looking at common continuous distributions, at their PDFs, CDFs, etc. Thank you.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 5.6
Continuous Random Variable: Common Distributions
(Refer Slide Time: 00:13)

Hello and welcome to this lecture. We have been looking at continuous random variables, in
particular the probability density function for continuous random variables and how one can do
calculations with them. One of the things we did for discrete random variables if you remember is
after we looked at PMF and all that we studied a lot of common distributions particularly you may
remember Bernoulli, binomial, geometric, Poisson, things like that. So, standard distributions that
show up over and over again, when you want to model something in practice.

Likewise in the continuous case, when you want to do modeling there are a few common distributions that show up again and again. They are so common that we have to know them quite well, and we are going to do that in this lecture. We will see only three distributions here; there are in fact many more, for many different types of applications, and as and when we meet other distributions we will introduce them. But these three are really, really common, so let us start looking at them.

The first one is the uniform distribution. We had the discrete uniform distribution: there is a finite discrete set, and the random variable is uniform over that set. That was a very simple distribution to describe. In the continuous case also you can have a uniform distribution. You say that the random variable X is Uniform[𝑎, 𝑏]; the square brackets indicate that it is a continuous random variable taking values between 𝑎 and 𝑏. A random variable that is uniform on [a, b], not just continuous but continuous and uniform, basically has a flat PDF in the range from 𝑎 to 𝑏.

Now from a to b if the PDF has to be flat and it has to integrate to 1, the height of the PDF has to
be 1 / (𝑏 – 𝑎). So that is the picture that you have here for the PDF. This would be a, this would
be b and the height here would be 1 / (𝑏 – 𝑎).

So that is the description of the PDF of a random variable which is uniformly distributed from a
to b. You can see why the uniform comes in. You take any interval between a and b as long as the
width of the interval is the same, the probability of the random variable falling in that interval is
the same.

If you look at any interval, remember the PDF is 0 outside [a, b], and the area under the PDF over that interval is the probability of the random variable falling in it. As long as the width of the interval is the same, wherever it is situated you get the same area. So if this interval has some width w and that one also has width w, the probability of X lying in the first is 𝑤 / (𝑏 – 𝑎), and the probability of X lying in the second is also 𝑤 / (𝑏 – 𝑎).

Wherever you situate a window of width w, you get the same probability. So that is the uniform distribution. But remember this is a continuous random variable, so the probability that X equals any particular value is 0; we do not model it point by point. It is intervals that matter: any interval of the same width, wherever it sits, has the same probability.

Both of those areas are probabilities of intervals. Now, the CDF is basically the integral of the PDF. You can see that for x less than a the integral is all 0; from a onward it increases linearly, and at b it reaches 1. So the formula is 𝐹𝑋(𝑥) = (𝑥 − 𝑎) / (𝑏 – 𝑎) for a ≤ x < b, and it is 1 for x greater than or equal to b. In the plot, this is 0 on the left, this is 1, this is a, this is b, and the rising segment is (𝑥 – 𝑎) / (𝑏 – 𝑎).

That is easy enough to see. So that is the CDF, but the PDF is clearer: it is flat and works out very nicely. Note that as a and b come close together, the height 1 / (𝑏 – 𝑎) keeps increasing, and as a and b separate out, the height 1 / (𝑏 – 𝑎) falls. Keep that in mind. So that is the uniform distribution.

(Refer Slide Time: 04:59)

One can do probability computations here. Take the uniform distribution from −10 to 10. Start by writing out the PDF: 𝑓𝑋(𝑥) = 1 / 20 for −10 < 𝑥 < 10, and 0 otherwise. For any probability, you just have to find the width of the interval and multiply it by 1 / 20, is it not? Let us write down a few of these. P(−3 ≤ X ≤ 2): the width is 5.

So that is 5 / 20, or 1 / 4 if you want. Next, 𝑃(5 < |𝑥| < 7). The condition on |x| means this is the same as 𝑃(5 < 𝑋 < 7) + 𝑃(−7 < 𝑋 < −5): X is between 5 and 7, or between −7 and −5, a disjoint union, so you can just add them up. You get 2 / 20 + 2 / 20 = 4 / 20; you could also calculate it in several other ways. Then there is the window from x₀ − ε to x₀ + ε; all that matters is the width 2ε, so 𝑃(𝑥₀ − ε < 𝑋 < 𝑥₀ + ε) = 2ε / 20 for any 𝑥₀ sufficiently inside the interval, say 𝑥₀ = −9 with ε small.

The reason x₀ has to be sufficiently inside is just so that 𝑥₀ + ε does not go beyond 10 and 𝑥₀ − ε does not go below −10; as long as that holds, this works. In particular the window from 1 − ε to 1 + ε is covered, with probability 2ε / 20. It does not matter what 𝑥₀ is: the answer is independent of 𝑥₀; wherever you situate it you get the same probability.

Look at the next computation; maybe this one is a bit interesting: 𝑃(𝑋 > 7|𝑋 > 3). This is just conditional probability, and what is the conditioning? The event X > 7 intersected with X > 3, divided by 𝑃(𝑋 > 3).

Now the intersection of X > 7 and X > 3 is just X > 7, so this is 𝑃(𝑋 > 7) / 𝑃(𝑋 > 3). The probability of being greater than 7 is 3 / 20 and of being greater than 3 is 7 / 20, so the answer is 3 / 7. For contrast, consider 𝑃(𝑋 > 4|𝑋 > 0). Note that 7 = 3 + 4 and 4 = 0 + 4, so both questions ask the same kind of thing: given that X is greater than 3, what is the probability of being more than 4 beyond 3?

Likewise, given that X is greater than 0, what is the probability of being more than 4 beyond 0? Doing the same calculation here, you get 𝑃(𝑋 > 4) / 𝑃(𝑋 > 0) = (6 / 20) / (10 / 20) = 6 / 10 = 3 / 5. So different probabilities like these you can work out by thinking about the conditioning.

So the uniform, I hope you are convinced, is a very easy distribution to calculate probabilities for: there is no need for complicated integration, no need to remember any formulae; it is just the width times the height, a very easy calculation for the probability.
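These width-over-20 calculations are easy to mechanize. A hedged sketch (the `prob` helper, including its clipping to [−10, 10], is my own) reproducing the numbers above:

```python
# Probabilities for X ~ Uniform[-10, 10]: each probability is the width of
# the interval (clipped to the support) divided by 20.

def prob(lo, hi, a=-10, b=10):
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

print(prob(-3, 2))                  # P(-3 <= X <= 2) = 5/20 = 0.25
print(prob(5, 7) + prob(-7, -5))    # P(5 < |X| < 7)  = 4/20 = 0.2
print(prob(7, 10) / prob(3, 10))    # P(X > 7 | X > 3) = (3/20)/(7/20) = 3/7
```

Note that P(X > 3) is computed as `prob(3, 10)` because the support ends at 10; the clipping in the helper handles that automatically.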
(Refer Slide Time: 09:06)

Let us go to the exponential distribution. This distribution is also very common, and the PDF and CDF are given here for you. The notation X ~ Exponential(𝜆), with a tilde, means X is distributed as an exponential with parameter 𝜆. In the uniform case we had two parameters, a and b, the starting and ending points of the interval; in the exponential case there is only one parameter lambda, and the PDF is 𝜆e^(−𝜆x).

Let us look at this plot once again: the density lives on x positive, starting at 0. At x = 0 you get lambda; that height is 𝜆, is it not? And then it falls off as 𝜆e^(−𝜆x). So is this a valid PDF? It is non-negative and piecewise continuous, and you have to check that it integrates to 1.

If you integrate 𝜆e^(−𝜆x) from 0 to ∞, the antiderivative is −e^(−𝜆x), and evaluating it from 0 to ∞ works out to exactly 1. So this is a valid PDF; you can check that. But notice a couple of things that are different between the uniform case and this one. For Uniform(a, b) the support was finite, only between a and b.

For the exponential, the support is an infinite-length set, 0 to ∞. It keeps on going; the density falls quite rapidly but never hits 0: 𝑓𝑋(x) never becomes 0 for positive x. It keeps decaying forever without ever equalling 0. So the support is very large, but since the density falls rapidly, the variable is much more likely to take values close to 0 than, say, close to 100.

You can see why that is true; the PDF falls like that. If you were to get the CDF you have to integrate this. For x ≤ 0 you get 0: there is no area under the curve there. For x > 0 you integrate the PDF from 0 to x₀: 𝐹𝑋(𝑥₀) = ∫ from 0 to x₀ of 𝜆e^(−𝜆x) dx = [−e^(−𝜆x)] from 0 to x₀ = 1 − e^(−𝜆x₀).

So that is the formula: 𝐹𝑋(𝑥) = 1 − e^(−𝜆x) for x > 0. Does it eventually go to 1? As x grows to 100, 200, etc., e^(−𝜆x) keeps falling further and further towards 0, so 𝐹𝑋(𝑥) approaches 1. It is a valid CDF, as you would expect.

This is called the exponential distribution; you can see why, because the exponential function plays the central role and the density falls exponentially. Quite a few random variables in practice follow this distribution. So it is a good model for many situations.
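As a sketch (my own code and helper names), one can confirm numerically that integrating the PDF 𝜆e^(−𝜆x) from 0 to x₀ reproduces the CDF 1 − e^(−𝜆x₀):

```python
import math

# X ~ Exponential(lam): PDF lam*e^(-lam*x) and CDF 1 - e^(-lam*x) for x > 0.
lam = 2.0

def pdf(x):
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def cdf(x):
    return 1 - math.exp(-lam * x) if x > 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

x0 = 1.5
print(integrate(pdf, 0, x0))   # ≈ 1 - e^(-3)
print(cdf(x0))                 # the closed-form CDF agrees
```

The agreement between the numeric integral and the closed form is just the PDF-CDF relationship of this lecture made concrete.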

(Refer Slide Time: 12:31)

Let us do a few probability computations here; it is very similar to before. X is exponential with parameter 2, so the PDF is 𝑓𝑋(𝑥) = 2e^(−2x) for x > 0, and 0 for x ≤ 0. If you want P(5 < X < 7), it is ∫ from 5 to 7 of 2e^(−2x) dx; the antiderivative is −e^(−2x). Evaluating between 5 and 7, where at 7 the exponent is 2 × 7 = 14, gives −e^(−14) + e^(−10) = e^(−10) − e^(−14); the overall minus sign flips the order of the terms.

You can see this is already very small: e is roughly 2.7, so e^(−10) is even smaller than 2^(−10), which is about 1/1000, and e^(−14) is smaller still. So P(5 < X < 7) is a small probability; the density decays quite rapidly.

Let us look at the other ones. P(1 − ε < X < 1 + ε): you can do the same calculation. It is e^(−2(1 − ε)) − e^(−2(1 + ε)); pulling e^(−2) out as a common factor you get e^(−2)(e^(2ε) − e^(−2ε)). There is also P(9 − ε < X < 9 + ε).

Basically these two questions use the same interval width 2ε, one window located around 1, from 1 − ε to 1 + ε, the other around 9. What is the probability that X takes a value close to 9, and what is the probability that X takes a value close to 1? That is the kind of question we are asking. The first works out to e^(−2)(e^(2ε) − e^(−2ε)); that is one value.

If you go to 9, notice what happens: it is e^{−2(9−ε)} − e^{−2(9+ε)}, and this is going to become e^{−18}(e^{2ε} − e^{−2ε}). So look at this probability; it has become much, much smaller than P(1 − ε < X < 1 + ε). Same width, but notice how non-uniform it is: e^{−18} is just really small. The same width of 2ε around 1 has a much higher probability of occurring than around 9. That is the PDF falling exponentially fast for you.

So let us look at these calculations here; there are two of them. P(X > 4) is going to be 1 − P(X ≤ 4), and you can use the CDF here: it is 1 − (1 − e^{−8}), so you just get e^{−8}. So it is easy to get this guy.

Now look at this other one, P(X > 7 | X > 3). This is going to be P(X > 7)/P(X > 3); just like the earlier calculation, it will be e^{−14}/e^{−6} = e^{−8}. So notice: P(X > 4) and P(X > 7)/P(X > 3) turned out to be the same. In fact, you can show this in general: P(X > s + t)/P(X > s) is actually e^{−λt}, which here, with λ = 2, is e^{−2t}.

It is not very hard to see this. It is e^{−λt}, and in particular it is independent of s; no matter where you keep your s, s could be 1, s could be 100, s could be 1,000. You are looking at the value of X, and given that X is greater than some s, the conditional probability only depends on the gap: P(X > 100 + 1 | X > 100) = P(X > 1000 + 1 | X > 1000). So it does not matter where you are; that gap t is what controls the conditional probability.
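
This identity is easy to sanity-check numerically. A small sketch, again with λ = 2 (the variable names are my own): it checks that P(X > s + t)/P(X > s) equals e^{−λt} for several values of s.

```python
import math

lam = 2.0
def tail(x):
    # P(X > x) = e^{-lambda*x} for Exponential(lambda)
    return math.exp(-lam * x)

t = 4.0
# The ratio P(X > s + t) / P(X > s) should equal e^{-lambda*t}
# no matter what s is (s kept moderate to avoid floating-point underflow)
for s in [1.0, 10.0, 100.0]:
    assert abs(tail(s + t) / tail(s) - math.exp(-lam * t)) < 1e-12
```

Every s gives the same ratio e^{−8}, exactly the independence from s described above.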

The conditioning is important; if it is not conditional, then things change quite a bit. So it is a nice, interesting property, and it is unique to the exponential distribution: no other continuous distribution satisfies this property. It is called the memoryless property. Being told that X is greater than 100, or that X is greater than 1,000, once you are given that, the distribution looks the same from that point on, sort of like having no memory.

So it is very commonly used for modeling waiting times. Supposing you are waiting for a bus at a bus stop, or you are waiting for something to happen, it is usually a random waiting time, and that random waiting time is very commonly modeled as exponential. You will see in many cases that it is a good model: no matter at what time you go to the bus stop, your remaining waiting time has the same distribution. Given that you have already waited so long, it does not matter.

So it is sort of intuitive in practice, this memory-less property and how waiting times are controlled
by that. So this is just a small introduction to how this property comes in very handy when you
model some real life situations. So that is the exponential distribution.

(Refer Slide Time: 18:57)


The next common distribution that we are going to see is the normal distribution, which is also called the Gaussian distribution. If there is one distribution that is most common in almost all models, it is the Gaussian distribution. So it is not wrong to say: when in doubt, if you do not know what the distribution of a random variable is, simply assume it is Gaussian or normal.

It is so prevalent, it is so common, it occurs in so many different situations that one needs to have a lot of comfort in working with the normal distribution. There are two parameters; remember the exponential distribution had only one parameter λ, but in the normal distribution you have two parameters. The first parameter is commonly denoted μ and the other parameter is σ²; you can also think of the second parameter as σ, but it is common to write σ².

This μ belongs to R; it is any real number, it can be positive, it can be negative. σ, on the other hand, is a positive real number. That is the constraint, and σ² is written accordingly.

The PDF is a slightly complicated-looking PDF; it has got some structure, but let us see if we can unwrap it. There is a constant 1/(σ√(2π)) outside multiplying an exponential, so we can ignore that for now, and the exponential is exp(−(x − μ)²/(2σ²)). Now you can see that there is a negative sign here, and you have (x − μ)².

What is inside, (x − μ)²/(2σ²), is always non-negative, so with the minus sign the exponent is negative. So as x becomes larger and larger, much larger than μ, this e-power-minus will go to 0; you can see here, the picture goes like that. On the left-hand side also, if x becomes much, much smaller than μ, (x − μ)²/(2σ²) will become a very large positive quantity because you are squaring it.

Even if x goes negative, you are squaring, so whatever the value of μ is, eventually (x − μ)² will go off to a very large value, and the e-power-minus will drive the PDF to 0. Where will it be maximum? You can see x = μ is where this will be maximum. Why is that? It is exp of minus some non-negative quantity, so what is the way to make this exp-of-minus as large as possible?

You have to make what is inside the exponent very, very small, is it not? So (x − μ)²/(2σ²) has to become very, very small. What is the smallest it can become? It can become 0, and that happens at x = μ. So this point is μ. And then this shape, how do you get this shape? This is the famous bell curve, as they call it. It is shaped like a bell.

And you can see how this bell shape will come; if you want, you can plot it using some numerical calculation. It is going to fall off like that, and you can think about why it should be like that in various ways; it is a property of the exponential function. What is the height? If you look at the height at the peak, that will be 1/(σ√(2π)). In terms of why this is a valid PDF: it is non-negative, so that part is okay.

Why does it integrate to 1? It is a complicated thing to show, so I will have you look it up; it is an exercise to check that ∫_{−∞}^{∞} f_X(x) dx = 1. Another thing to remember is that the support of X is the entire real line. A normal random variable can take any real value from −∞ to ∞; when I say take, I mean it takes values in any interval on the real line. So the support is the entire real line, there is no constraint.
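
Short of doing the exercise analytically, you can convince yourself numerically. A quick sketch (my own check, using the trapezoidal rule): integrate the standard normal PDF over [−10, 10], outside of which the tails are negligible.

```python
import math

def phi(x):
    # standard normal PDF: exp(-x^2/2) / sqrt(2*pi)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a, b, n = -10.0, 10.0, 20000
h = (b - a) / n
# composite trapezoidal rule over [a, b]
total = 0.5 * (phi(a) + phi(b)) + sum(phi(a + i * h) for i in range(1, n))
total *= h
print(total)  # very close to 1
```

The result agrees with 1 to many decimal places, consistent with the PDF being valid.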

Remember, the exponential's support was only the positive part, and the uniform's support is between a and b; for the normal it is the entire real line, so there is no problem. It is sort of centered at μ: it is going to take values close to μ with higher probability, and as you go out on both sides, the density is going to decrease. That is something you can see.


The CDF is F_X(x) = ∫_{−∞}^{x} f_X(u) du, and there is no closed-form expression here. If you remember, for both the uniform and the exponential we had a very easy form for the CDF: (x − a)/(b − a) and 1 − e^{−λx}. In this case you will not have a closed-form expression; e^{−x²} does not have an indefinite integral in closed form. But the shape of the function you can see; it is not too difficult.

At μ the value will be 1/2, eventually it will go to 1, and it will start at 0. These are things you can show. Why should it be 1/2 at μ? Remember, the entire area under the curve is 1, and the PDF is symmetric about μ. The area to the left of μ and the area to the right of μ are the same, so each of them should be 1/2: the probability that X is less than μ is 1/2.

So this is again an exercise; you can show by symmetry that P(X ≥ μ) = 1/2. You can put less than or equal to, or greater than or equal to; it does not matter, it is a continuous distribution, is it not? So both are 1/2, and that is a nice thing to see. This μ is sort of a central point, and you can easily see that this is true.

There is something called the standard normal: if μ is 0 and σ is 1 (so σ² is also 1), that is called the standard normal distribution. It has a special name, Normal(0, 1). This is easily the most crucial distribution, and unfortunately it seems to have a complicated CDF. The CDF is not very easy, but the PDF is nice to draw, and today you can do numerical calculations for the integration and find the probabilities that you want. So it is not too bad. This is the normal or Gaussian distribution.
(Refer Slide Time: 26:13)

So let us see how to do probability calculations with this. There is a bit of a trick that one uses; this is the trick here, and we will come back and justify it a little later on. The CDF of the normal distribution does not have a closed-form expression, as I mentioned. What you can do is this little trick of standardization. Right now I am going to just provide it to you without any proof, but this standardization is a very standard thing; we will see a proof soon enough.

It turns out that if X is normal with mean μ and variance σ², then (X − μ)/σ is actually a standard normal: it is normal with mean 0 and variance 1. Maybe you can call this Z; Z is a new random variable, is it not? It is a function of X, and it is sort of centralized and standardized. You remember we talked about doing this (X − mean)/(standard deviation) to make the mean 0 and the variance 1 in general; we saw it for other distributions.

So this is sort of like that; we will come back and see this in more detail later, but this is one trick that is used for tabulating the value of the CDF. The CDF does not have a closed-form expression: you have to look at tables, or look at scientific computing software, for the value of the CDF, and for that this trick of converting to the standard normal distribution is used.

So once you come to the standard normal, the PDF is (1/√(2π)) exp(−z²/2); that is with σ = 1 and μ = 0. The CDF has this integral expression, which you can use in numerical computation and tabulate. This is available in most places; if you want the CDF of the standard normal distribution, you will get it.

So to compute probabilities for a normal distribution, you first convert the probability computation to the standardized case and then you use normal tables. We will do this eventually; you will see some examples soon enough, but this is the standard procedure.

In the exponential and uniform cases we could get simple closed-form expressions for probability computations; for the normal random variable you have to use tables, or use a computer which has the CDF of the standard normal built in. That is the way you do probability computations.

(Refer Slide Time: 28:49)

So here is an example, soon enough, as I was mentioning. We will do this: X is normal with mean μ = 2 and σ² = 5. So Z = (X − 2)/√5 is going to be Normal(0, 1). We will assume F_Z, which is the CDF of Z, is known.

That is the idea, and then we convert. Take the event X < 5. Since Z = (X − 2)/√5, we have X = √5 Z + 2, is it not? So instead of X you put √5 Z + 2: the event is √5 Z + 2 < 5, which is the same as Z < 3/√5. So P(X < 5) = P(Z < 3/√5), and that is simply F_Z(3/√5).

We are going to assume this function is known, as in some table is given to you, or there is computing software where you plug in the CDF of the standard normal and it gives you the answer. So that is something you can do; this is P(X < 5).

The same thing works for X < 10 or X < −5. I am not going to do each of these cases; you can easily do all the "X less than" events exactly the same way. If X < −10, then √5 Z + 2 < −10, which means Z < −12/√5. Just what you put there changes, and you proceed the same way.

What about X > 5? Again, there is no real major confusion here: X > 5 implies Z > 3/√5, and P(X > 5) = P(Z > 3/√5) = 1 − F_Z(3/√5), as simple as that.

So this little bit of trickery in computing probabilities for a normal distribution you have to pick up. It is a very simple skill, but the skill is very important. When somebody gives you a normal distribution with a certain μ and a certain σ², you should be able to find the probability of any event involving that random variable by converting to the standard normal and then looking up tables or computing software for the CDF. It is a simple example.
(Refer Slide Time: 31:44)

So here is another problem, and I will do this in a little bit of detail; only one or two I will do in detail. Here μ is 3 and σ is 1. So if you do Z = (X − 3)/1, it is going to be Normal(0, 1). If you want to look at 5 < X < 7, since X = Z + 3, is it not, you substitute: 5 < Z + 3 < 7, and that is the same as 2 < Z < 4.

So it is quite easy, is it not? P(5 < X < 7) = P(2 < Z < 4), and you know the CDF of Z. So it is F_Z(4) − F_Z(2). That is it; that is the answer for this problem, and every other range is also similar. I do not want to go into detail here. So let us look at the conditional calculation P(X > 7 | X > 3).

Again it is the same thing; I keep repeating the same calculation. It is P(X > 7) divided by P(X > 3), is it not? That is how conditional probability works out here. Now you convert: X is Z + 3, so X > 7 becomes Z > 4 and X > 3 becomes Z > 0, and you get P(Z > 4)/P(Z > 0).

P(Z > 4) is going to be 1 − P(Z ≤ 4), which is 1 − F_Z(4). And what is P(Z > 0)? Remember, 0 is the mean, the μ parameter for Z, right? That is the center point. So P(Z > 0) is actually 1/2.

So this is (1 − F_Z(4))/(1/2) = 2(1 − F_Z(4)). You can do such calculations, whatever the probability may be, involving the standard normal or normal distribution, in this fashion. You just assume a particular function is known to you; that is not a strange thing to assume. You assume sin(x) is known, cos(x) is known, log(x) is known, e^x; nobody can actually give you the value immediately, but there is a computer which can give you the answer, so we assume that function is known to us. Similarly, this function F_Z we will assume is known to us.

So it is a function that is known, the CDF of the standard normal; many packages and tables give you the value. That is the normal distribution. So let me summarize what we saw. We saw three continuous distributions which are commonly occurring and used quite frequently in modeling: the uniform distribution, the exponential distribution and the normal distribution.

We saw what the PDF is, we saw what the CDF is, and we saw how to do very basic, simple calculations for any event involving random variables that are uniform, exponential or normal. That is a useful skill to pick up: given the PDF and the CDF, you should be able to find the probability of any event involving the random variable. Thank you very much and see you in the next lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 5.7
Continuous random variable - Functions of a continuous random variable

Hello, and welcome to this lecture. In this lecture, we are going to start looking at functions of a continuous random variable. Just like we did a whole bunch of things for discrete random variables, we are going to do the same for the continuous case also. You will see that in most cases it is quite similar: what you did with the PMF in the discrete case will sort of happen with the PDF. But things get a little bit more confusing because of the integration involved.

So people get a little worried and have to think about it very carefully, that is all. Once you have some practice, you will get the hang of it. This fact that you have to use the PDF and keep integrating it to get probabilities takes some getting used to, but once you get used to it, it is very easy. So now let us see functions of a continuous random variable.

(Refer Slide Time: 1:02)

Now, why functions? You might say: I already have a random variable, I already have its PDF, I already have its CDF; why would anyone make a function of that random variable? I have thrown in a couple of very simple situations here; in reality, there are so many other cases where functions show up very naturally. For instance, you may model the side length of a square as a random variable X. Then if you want the area of the square, it is going to be X². You do not need a new random variable for it; it is a function of the previous random variable.

The same thing with, say, weight; the slide says volume occupied, but that is wrong, this is weight. Supposing some liquid has a constant density ρ, and you do not know how much of that liquid is going to come in; if you want to find its weight, it is density times volume. So you see, these are very simple functions; you model something as a random variable.

Then there are some very simple functions of it: just a scaled version, or addition of something to it, or squaring, simple functions like that. And it is good to know how to find the distribution of a function of a random variable. You are given the PDF and CDF of X; how do you find the PDF and CDF of X²? There are very simple, straightforward methods. So this is a skill that you have to pick up. Why is this important?

Because quite often this will happen: you will have one thing that you can model easily as a random variable, but what you are actually interested in is some other function of that. So how do you go from here to there? There is a very standard way to do it, and let us see what that is.

(Refer Slide Time: 2:46)

So, I will begin by giving you simple examples and motivating the method, and then we will see a general enough way of tackling it. I will not give you too many examples that confuse you too much, but you will see that some interesting things can happen when you look at functions of continuous random variables. Let us start with the easy case.

Let us say X is Uniform[0, 1]. Remember what that means: the PDF is 1/(1 − 0), so just 1 between 0 and 1. It is the uniform distribution between 0 and 1. Let us say you want to look at Y = 2X. This is going to take values in the range [0, 2]: [0, 1] was the range of your uniform distribution, and when you take two times X, you go over the range [0, 2]. Now, what is the distribution of Y?

So intuitively, you might guess that Y is going to be uniform, but you would like to have some proof for that. Maybe Y is uniform in [0, 2]; it seems intuitive, but how do you write it down, and how do you prove it? It turns out it is not very hard; it is very easy. You take a small y between 0 and 2 and try to find the CDF of capital Y evaluated at y.

So, basically, F_Y(y). Now this is the crucial idea: you have to go after the CDF of the function first. Once you do that, you will see everything will fall into place. What is the CDF of Y evaluated at small y? It is P(Y ≤ y). Now you can use your function substitution: capital Y is 2X, so Y ≤ y is the same as 2X ≤ y.

Now we sort of invert the function: 2X ≤ y is the same as X ≤ y/2, and that I know how to do. I know the PDF, so I simply integrate the PDF from 0 to y/2, and I get y/2 as the CDF of Y evaluated at small y.

Notice how I did that: remember f_X(x) is just 1, so it is ∫₀^{y/2} 1 dx, and the antiderivative of 1 is x, evaluated between 0 and y/2. So it is just y/2, and that is the CDF of Y. So you always find the CDF first; that is the very basic trick when you want to deal with functions of a continuous random variable in particular.
(Refer Slide Time: 5:42)

So once you have the CDF, you can differentiate it and find the PDF. You take the derivative of the CDF and you get the PDF: the derivative of y/2 is just 1/2. So the PDF of Y is flat again; it is just 1/2 for any value of y in [0, 2]. You see from that that Y is Uniform[0, 2]. That is a simple, basic little derivation to tell you what you intuitively knew: when you go to 2X, with X being uniform, and just scale it, you are going to get another uniform distribution.
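
You can also check this conclusion by simulation, as we often do in this course. A small Monte Carlo sketch (the sample size and seed are my own choices): the empirical CDF of Y = 2X should match the derived F_Y(y) = y/2 on [0, 2].

```python
import random

random.seed(0)
n = 200_000
ys = [2 * random.random() for _ in range(n)]   # Y = 2X, X ~ Uniform[0, 1]

# Compare the empirical CDF with the derived F_Y(y) = y/2
for y in [0.5, 1.0, 1.5]:
    emp = sum(v <= y for v in ys) / n
    assert abs(emp - y / 2) < 0.01
```

The empirical fractions line up with y/2, consistent with Y being Uniform[0, 2].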

So, that is not very hard. And here is a pictorial representation of what happened: X had a PDF from 0 to 1, and when you did 2X, the width went up from 1 to 2 and the height reduced by half. That is what happens when you do this kind of scaling. You can also ask questions like: instead of Y = 2X, what about Y = X + 2, or X + 5, or Y = 2X + 5?

You can guess intuitively, by going through the same method and repeating it, that any such affine transformation Y = aX + b is also going to be uniform in the suitable range. Starting from Uniform[0, 1], it is going to be uniform from b to b + a, assuming a and b are positive, for instance, so that you do not get into trouble. So this is a sort of general result you have for uniform distributions.

So hopefully that tells you that there is an easy way to find the CDF and PDF for functions of a random variable; it does not seem too difficult. You have to just write down the formula for the CDF and convert from y to x; that seems to be the method. Let me just summarize, in general, what you do.

(Refer Slide Time: 7:49)

Supposing you have a random variable X with CDF F_X and PDF f_X, and you want to find the CDF of a function of X, that is g(X). g has to be a reasonable function; most functions that we consider will be reasonable, though it turns out there are very unreasonable functions in mathematics. Anyway, let us not worry about that too much.

So g is a good, nice function; we can take it to be a very simple function if you like. Y = g(X) will also be a random variable; once g is a reasonable function, it will have a CDF F_Y. And how do you determine it? This is what you do: F_Y evaluated at small y is P(Y ≤ y), and that is P(g(X) ≤ y).

Now this g(X) might be some function, a curve like this plotted as x versus g(x), and then you have a value y here on the vertical axis. What is the event g(X) ≤ y? You have to look at the set of all small x such that g(x) ≤ y. You see this part and this part; these pieces together form the set of all x such that g(x) ≤ y.

Now the probability that g(X) is less than or equal to y is the same as the probability that X lies in this set: if X were to lie in the set, g(X) would be at most y. That is the idea. As you keep varying y, the set will also vary; if this y goes up or comes down, where it hits g will change and the set will change, so P(Y ≤ y), the probability that g(X) ≤ y, will keep varying.

And it has now been converted into a problem of probability for X, of an event defined using X. For X you have the PDF and CDF, so you simply integrate the PDF over the intervals which are valid for you. So this is the general idea: you find the CDF, and once you have the CDF, if it is differentiable, you differentiate it and you get the PDF, etcetera.

(Refer Slide Time: 10:32)

So how do you evaluate the above probability? You look at the subset A_y, which is the set of all x such that g(x) ≤ y; it will depend on y. As you keep varying y, remember the picture: the set of all x for which g(x) ≤ y varies. You convert it into intervals in the real line and you find the probability that X falls in those intervals, either using the CDF or using the PDF. So it is just the integral of the PDF over this A_y.

If F_Y has no jumps, is continuous, etcetera, you can differentiate it and find the PDF also. This is the general method; for any function g(X), this method will work, and this is what I showed you in the previous example. It may be a little bit laborious; this inversion process can be a bit painful. You need to know the function g properly and then do the inversion. Once you do the inversion, the calculation is straightforward; it just involves the integral of the PDF. We will see a few examples towards the end of this lecture for this method.

(Refer Slide Time: 11:34)

There is a special case which gives you a direct formula. We all love formulae; if there is a direct plug-in formula, there is nothing like it. That is what I want: I do not want this complicated method of CDF and integration and inversion and all that, I just want a formula. If you want a formula, then it works only for a certain special type of g. In particular, g should be monotonic and differentiable: either increasing or decreasing, so it is invertible, it has a clear inverse.

And this needs to be true only on the support of capital X. In the support of X it has to be monotonic, meaning either increasing or decreasing; it cannot do crazy things. It should be invertible, and it should have a derivative: you can differentiate g(x) and find its derivative g′.

If you do that, then the PDF of Y itself has a direct formula, and that formula is given here. You have to find g⁻¹(y): since g(x) is monotonic, in a certain range you can find g⁻¹(y). How do you find it? You write y = g(x) and solve for x in terms of y; if you can do that, you have the inverse. Once you have the inverse, you plug it into the formula.

So f_X is the PDF of X, and instead of x you put g⁻¹(y). I will show you an example; you will see how easy it is. And notice what is going on here: there is a 1 over the absolute value of the derivative, and in the derivative also, instead of x you put g⁻¹(y). So remember: in f_X(x) you replace x with g⁻¹(y), and in g′(x) you also replace x with g⁻¹(y). Finding g⁻¹(y) is very crucial, and g having a derivative is also crucial. If all these conditions are true, you directly have a formula:

f_Y(y) = f_X(g⁻¹(y)) / |g′(g⁻¹(y))|.

(Refer Slide Time: 14:00)

So I will show you three examples. The first is Y = X + a, so g(x) = x + a. What is g′(x)? The derivative is 1. What is g⁻¹(y)? If y = x + a, then x = y − a, so g⁻¹(y) = y − a. This is how you go about it, how you mechanically use the process. Now you substitute into the formula: in f_X(x), instead of x you put g⁻¹(y) = y − a.

And then there is a division by the derivative, but the derivative is just 1, so it does not matter. You get f_Y(y) = f_X(y − a), and you can see why that is true. If you have a certain distribution for X and you take Y as just X + a, it is just the PDF shifted: you take the original PDF, you shift it to the right by a, and you get the new PDF for Y. So translation is very, very easily dealt with using this formula.
(Refer Slide Time: 15:19)

What about scaling? For scaling, g(x) = ax, so g′(x) = a, and y = ax implies x = y/a, so g⁻¹(y) = y/a. (The slide has x/a here; that is again wrong, it should be y/a, and throughout, the random variable in the support should be capital X.) So the formula gives f_Y(y) = (1/|a|) f_X(y/a).

And notice what happened here: it became 1/|a|, so an absolute value comes in. In the previous case the absolute value did not matter because the derivative was just 1, which is always positive; here a could be positive or negative, so the absolute value actually enters the picture. You need the absolute value of a multiplying f_X(y/a).

So you see, translation and scaling are very basic operations: any random variable you have, you may have to move it around, you may have to scale it. Both of those are easily handled by this formula in the monotonic differentiable case; you do not need the CDF method for that. You can use the CDF method also, and you will get the same answer, but here is a direct formula for what happens to the PDF when scaling or translation happens.

So, for instance, if a is 2, you are getting (1/2) f_X(y/2), so f_X gets spread out: when you scale up, the random variable becomes larger and the original PDF gets spread out over that range. It is also sort of natural in a picture, so maybe I should show this. For translation: if you have a certain f_X, maybe like this, and you move it to the right by a, you get f_Y shifted by a. So you push it to the right by a; that is translation for you.
And notice what happens in scaling; maybe I should draw both in the same picture. If you have f_X and you form (1/a) f_X(y/a) with a greater than 1, the curve goes down and spreads out. For a less than 1, it will sort of shrink in width and go up in height.

I think the picture became a bit messy, but hopefully you can see what happens: there is f_X(x), and then f_Y(y) becomes f_X(y/a) divided by a. If a is greater than 1, it goes down and spreads out; if a is less than 1, it shrinks in and goes up. That is how the picture looks for scaling and translation.

(Refer Slide Time: 18:32)

Let me make my changes here so that the slides come out okay. Now you can combine the two: ax + b, a very common operation called an affine function. You have ax and then add b; you can do one after the other and combine the two formulae to get f_Y(y) = (1/|a|) f_X((y − b)/a). I am not going to illustrate what happens here; things move around and then get shrunk or blown up a little bit. That is what happens when you do an affine transformation.

So given a distribution, you should be able to easily visualize what happens under an affine transformation. Affine transformations are very, very common, and this monotonic differentiable function method directly gives you what happens to the PDF: translation just moves the PDF around; scaling either scales it down and spreads it out, or shrinks it in and increases the value of the PDF.

This is a simple example; there are of course many other monotonic differentiable functions. Affine is just one example, but it is a very popular and important example for you to know, so we have kept it there.

(Refer Slide Time: 19:51)

Now let us come back to affine transformations of normally distributed random variables. Supposing X is normal with μ = 0 and σ² = 1. You remember this is the famous standard normal, with the two parameters μ = 0 and σ = 1. The PDF for this normal distribution is (1/√(2π)) exp(−x²/2).

Now, if you do the transformation Y = σX + μ, this is an affine transformation with a = σ and b = μ. And one can now use the formula for ax + b: instead of x, you should put (y − b)/a, which here is (y − μ)/σ.

Replace x with (y − μ)/σ and you get exp(−(y − μ)²/(2σ²)). And then there is a 1/|a|; let us say σ is positive, so that is where the 1/σ comes in. So, this, you can see, is the PDF of Normal(μ, σ²). So clearly, you start with the standard normal (0, 1); if you do the affine transformation σX + μ, you get another normal distribution (μ, σ²). It is a very nice property.

So, for the uniform also we had this property: you have a uniform PDF, you do an affine transformation, you get another uniform PDF. The normal also satisfies this property: you start with a normal, you do an affine transformation, you get another normal distribution. So, you can also sort of reverse it: supposing you start with X being Normal(μ, σ²), and then you do (X − μ)/σ, you will get Normal(0, 1).

So, here is a normally distributed random variable with parameters μ and σ². And notice what I am doing here: this is also affine, though it looks a little different. But this is Y = (1/σ)X − μ/σ, so it is not any different from before. So, from here, you plug in: instead of x you substitute (y + μ/σ)/(1/σ), and you will see that you get back your Normal(0, 1). So, this I will leave as an exercise to you; make sure you can do that.

So remember, this PDF is f_X(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). And then you are doing the transformation Y = (1/σ)X − μ/σ. So, instead of x you have to replace x with (y + μ/σ)/(1/σ), which is equal to σy + μ.

So, you replace x with σy + μ; you will see that the +μ and −μ cancel, you will get σ²y² in the exponent, and the σ² will cancel. And then you also have the 1/|a|, which equals σ here, and that will cancel with the 1/σ in front, and you will get the Normal(0, 1). So, it is a small exercise; try this and convince yourself that it is true.
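The standardization claim above is easy to check by simulation. This is a quick empirical sketch (my own, assuming NumPy; μ, σ and the sample size are arbitrary choices):

```python
import numpy as np

# If X ~ Normal(mu, sigma^2), then Z = (X - mu)/sigma should be Normal(0, 1).
# We check the first two moments empirically.
rng = np.random.default_rng(1)
mu, sigma = 5.0, 3.0

x = rng.normal(mu, sigma, size=500_000)
z = (x - mu) / sigma

# Sample mean should be near 0 and sample standard deviation near 1.
print(abs(z.mean()) < 0.01, abs(z.std() - 1.0) < 0.01)
```

This does not prove the full distributional statement, of course, but matching the first two moments is the quick sanity check.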

(Refer Slide Time: 23:51)


So, the moral of the story is that an affine transformation of a normal random variable remains normal. The two parameters change, and they change based on how you do the affine transformation, but the result is another normal distribution. We saw the same property for the uniform as well, which is very good.

(Refer Slide Time: 24:17)

So, I am going to do three illustrative examples; there are three problems, and you will get interesting little answers. Maybe I will do one more problem as well. The three problems illustrate how one can do the computation for different types of functions. It is not meant to be exhaustive, just to give you a flavor of how other functions can be handled.

So far, we have seen some simple functions; affine seems like a very simple transformation. What about a slightly non-trivial transformation? Here is one example. You have an exponential random variable, so f_X(x) = λe^(−λx) for x > 0, and 0 otherwise. And you need to find the PDF of X², so g(x) = x². And remember, what is the support? The support is the set of all x such that x > 0.

And if you look at g(x) here, it is monotonic for x > 0, so my formula applies. So, what does my formula need? I need g′(x), which is 2x. And then I need the inverse. So, g(x) = x² = y implies x = √y, the positive square root of y, because x > 0.
So, g⁻¹(y) = √y, and now one can just use the formula. With Y = X², f_Y(y) = λe^(−λ√y) / (2√y), and that is the PDF, for y > 0. So, you see, this is just the formula. How did I get the formula, by the way? It is f_Y(y) = f_X(g⁻¹(y)) / |g′(g⁻¹(y))|.

So, instead of x you put g⁻¹(y), which is √y. Then there is 1 over g′, which is 2x, evaluated at g⁻¹(y) = √y. That is all; it is a very simple formula. In the monotonic case, even if g(x) were a monstrous function, as long as it is monotonic, I know my formula applies. I just have to find the inverse, find the derivative, close my eyes and plug in, and I am done. That is a nice thing to know.
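The derived density for Y = X² can be spot-checked against samples. A sketch (my own illustration; λ and the sample size are arbitrary choices, assuming NumPy):

```python
import numpy as np

# For X ~ Exponential(lam) and Y = X^2, the derived density is
#   f_Y(y) = lam * exp(-lam*sqrt(y)) / (2*sqrt(y)),  y > 0.
rng = np.random.default_rng(2)
lam = 1.0

x = rng.exponential(1 / lam, size=300_000)
y = x ** 2

def f_Y(t):
    return lam * np.exp(-lam * np.sqrt(t)) / (2 * np.sqrt(t))

# Empirical density on bins away from the singularity at 0:
# bin counts divided by (N * bin width).
edges = np.linspace(0.5, 4.0, 36)
counts, _ = np.histogram(y, bins=edges)
emp = counts / (len(y) * np.diff(edges))
centers = 0.5 * (edges[:-1] + edges[1:])
err = float(np.max(np.abs(emp - f_Y(centers))))
print(err < 0.05)
```

The agreement between the histogram and the formula is the kind of quick check worth doing whenever you apply the monotonic-function formula.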

(Refer Slide Time: 27:04)


And now a slightly more twisted setup. X is uniform from −3 to 1. So, the height of the PDF f_X(x) would be 1/4 on the interval [−3, 1]. Now I need to find the PDF of X². The problem now is that the support is −3 to 1, and g(x) = x² is not monotonic on the support of X.

If you were to plot x², it would come down to 0 and go back up, a parabola; it is not very much to scale, do not worry about the scale too much. Clearly, it is not monotonic in that range. In the previous exponential case, the support was only x > 0, so the function was monotonic there, which was very nice.

But here, it is −3 to 1, and g is not monotonic. Once it is not monotonic, you cannot use the formula, and you have to fall back to the traditional CDF method. So, first of all, Y = X² belongs to [0, 9]. Do you agree? X is in [−3, 1], so if you square it, you get values from 0 to 9; it is only positive. So, for y from 0 to 9, F_Y(y) = P(Y ≤ y) = P(X² ≤ y).

So now, interesting cases will come. For instance, if y is between 0 and 1, then X² ≤ y is the same as X falling between −√y and √y. Remember, around 0 the support of X extends to both sides, from −3 up to 1, so for such a y both −√y and √y lie inside the support. So, if you want g(X) ≤ y, then x is between −√y and √y; that is the picture that you have to draw.

Now, on the other hand, notice what happens if y goes above 1, between 1 and 9. My picture was badly out of scale, so let me redraw it; you can see in the video that I have redrawn this picture, getting the scale roughly right: 0, 1 and −3 on the x-axis, with the PDF at its tiny height of 1/4.

And then if I draw g(x) = x², the parabola passes through −1 and 1 at height 1 and climbs to 9 at x = −3. So, if y is between 0 and 1, both √y and −√y occur inside the support. When y goes above 1, this is not going to work out the same way: √y goes to the right of 1, outside the support. So, X² ≤ y still corresponds to −√y ≤ X ≤ √y, but you do not need to go all the way to √y; it is enough if you stop at 1. Why is that? Because beyond 1 the density anyway becomes 0. So, it is enough if you integrate from −√y to 1, and remember the height of the density is 1/4, by the way.

So, for 0 ≤ y ≤ 1, the length is 2√y times the height 1/4, giving F_Y(y) = √y/2. For 1 ≤ y ≤ 9, the length is 1 + √y, not 2√y, so F_Y(y) = (1 + √y)/4. This is the CDF. And if you want the PDF, you can differentiate: you get f_Y(y) = (1/2)·(1/(2√y)) = 1/(4√y) for y between 0 and 1, and (1/4)·(1/(2√y)) = 1/(8√y) for y between 1 and 9. So, this is the PDF of Y.

So, notice: when you had a uniform distribution on [−3, 1] and then did Y = X², you get a strange distribution; the CDF behaves like √y, and the PDF behaves like 1/√y. So, you start getting strange-looking PDFs. You started with the uniform PDF, and because you did this nonlinear transformation X², you have to go through the calculation very carefully; notice the care with which you have to do this calculation.

You carefully look at the value of y in different ranges and how that translates into the set A_y = {x : g(x) ≤ y}; the set changes from one form to the other depending on the range. It needs a little more care, but it is not very tough as such; you have to pay attention to the detail of where y exceeds 1 and where it does not. It is a bit more difficult; when the formula applies, it is so nice and simple.

When the formula does not apply, when g is not monotonic, a little more care is needed. But if you practice a little bit and think through the different cases where such methods are used, you will get to the answer. So, this is the CDF and PDF for the square of a uniform random variable.

(Refer Slide Time: 35:29)

So now, here is that odd-looking function. X is uniform from −3 to 1, and my g(x) is max(x, 0). What is the meaning of max(x, 0)? One way to think of it is: it is 0 if x is negative, i.e., x is between −3 and 0, and it is x if x is between 0 and 1. So, if you restrict to the support of X, which is [−3, 1], notice that if x is negative, the maximum always evaluates to 0.

If x is positive, it evaluates to x. It is a strange-looking function; it is not so comfortable to work with. So, if you say Y = g(X), then Y belongs to what? It belongs to the set [0, 1]: it takes the value 0, and it also takes values up to 1. That is the range of values. Now, notice what happens here: this is clearly not an easy function to deal with. It is not differentiable in the way we know it; at least at 0 it seems to have a problem with differentiation. We do not know how to deal with such cases very easily.

So, maybe you should go back to the CDF method whenever you have these kinds of functions that take a constant value. If you want, I can plot this; actually, it is not very hard to plot g(x): it takes 0 for negative values and then equals x up to 1. So, it is an easy function in some sense. But the fact that it stays 0 for so long and then picks up means it is not differentiable at that point.

It is also not invertible. The formula needs the function to be strictly monotonic, increasing or decreasing, and this one stays constant for a long stretch, which is not very nice for my formula. So, my formula will not work, and I have to go back to my previous method. Now, notice there is something strange about y = 0. When y = 0, F_Y(0) is the probability that Y ≤ 0.

So, we need g(X) ≤ 0. What is the range of x for which this happens? Since g(X) never goes negative, g(X) ≤ 0 is the same as g(X) = 0, which is the same as x lying between −3 and 0; do you agree? If X lies between −3 and 0, then Y will be 0; if not, Y will not be 0. So, the probability that Y ≤ 0 is actually equal to the probability that Y = 0, which I can write as P(g(X) = 0).

And that is the same as the probability of X being between −3 and 0, which comes out as 3/4. Notice that strange thing once again: you are getting F_Y(0), the probability that Y ≤ 0, to be 3/4. Now, if y were less than 0, F_Y(y), the probability that Y is less than or equal to a negative number, is actually 0.

So, keep that in mind: for y < 0 the CDF is 0, and at 0 there is a jump in the CDF. Whenever your function has a flat part inside the support of the random variable, that constant value is taken with non-zero probability. In this case, the function takes the value 0 over the entire set of values from −3 to 0.

So, whenever X falls in the interval −3 to 0, Y becomes equal to 0, and the probability that Y = 0 is actually 3/4. Immediately, this tells you that Y is not a continuous random variable. So, if you take a transformation like max(x, 0), you can expect that the output random variable is not even continuous; this kind of thing can happen.

Now, for y between 0 and 1, you can evaluate F_Y(y); you will get 3/4 + y/4. I leave this as an exercise; you can check that P(Y ≤ y) = P(X ≤ y) = (y + 3)/4 = 3/4 + y/4. So, if you were to plot F_Y(y), it is 0 for negative y, and then there is a jump to 3/4 at 0, after which the y/4 term takes over. It is not very hard, but you need to do some work here.

So, F_Y(y) starts at 3/4, increases up to 1 at y = 1, and then stays at 1; for y > 1, F_Y(y) = 1. So, this is what happens here: you get a mixture, which is not a continuous distribution, since there is a jump; and there is also a part where it is continuously increasing, so it is not a purely discrete distribution either. It is a mixture distribution. So, this can happen; go through these steps and how I did it very carefully. The crucial part is that the function g(x) stays constant over an interval inside the support.

So, if g(x) stays constant over an interval inside the support of your random variable, that particular constant value can be taken with non-zero probability. In this case, y = 0 gets taken with probability 3/4, a non-zero probability, so you do not get a continuous random variable. This can happen; in fact, you can quite easily cook up functions which will give you a discrete PMF as the output.

So, supposing I have a g(x) which is a staircase: constant on, say, the intervals from −3 to −3/2, from −3/2 to 0, and from 0 to 1. If X is uniform from −3 to 1 and g is such a function, then g(X) takes only a few values; say you put them as 1/3, 2/3 and 1.

Now, maybe I will call it g₁(x); if you define Y₁ = g₁(X), this is actually a discrete random variable. Your transformation g₁(X) takes only 3 values: 1/3, 2/3 or 1. And the probability with which it takes each value you can easily calculate by going from each value back to the corresponding interval of x. So, you just do that inversion very carefully, and you will get it.
So, by choosing your function g, you can make a continuous random variable into a discrete random variable, into a mixture, or keep it continuous. This is sometimes important: you will have a function like max(x, 0); taking a maximum happens quite often, where 0 might be a threshold and you only worry about what happens above 0.

So, all these things can happen. What is the moral of the story here? If the transformation of a continuous random variable is simple enough, if it is monotonic, differentiable, and has a unique inverse, it is very easy: there is a formula, and continuous remains continuous.

But in many other cases, the function is flat for a while or jumps here and there, and then you can get all sorts of answers: continuous, discrete, or mixtures. So, transformation of random variables is a very interesting and exciting topic, and the one method that always works is the CDF method.

You try to find the CDF: look at all the values of X for which g(X) ≤ y and find the probability of that set. That always works; it is a bit laborious, you have to pay a lot of attention and be very careful, but it is a method that will always give you the correct answer. So, that is the end of this lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 5.8
Continuous Random Variables - Expected value

Hello everyone and welcome to this lecture. When we studied discrete random variables, we saw that the notion of the expected value of a random variable, or of a function of a random variable, gives useful information about the distribution. For instance, the mean value is sort of like a central measure; the random variable usually takes values around the mean. Of course, it is important to know the variance also, which gives you a measure of the spread around the mean.

So, the mean is sort of like the central value of the distribution, let us put it that way, and the variance gives you the spread around that central value. Quite often, it turns out you may not be able to measure or know the distribution of a random variable. In many situations you will only have data sampled from what you think is a distribution, and quite often the data will not be rich enough for you to learn the entire distribution.

On the other hand, you can usually make very good measurements of expected values; with limited data you can get something reasonable for the expected value, the variance and things like that. So, in terms of what you can learn when you observe a random phenomenon, the expected value is very, very useful.

So, in theory, how the expected value relates to a random variable's distribution and probabilities is important to know. We studied it in the discrete context, and we will extend it to the continuous context; it is a very small adjustment, and we will do that in this lecture.
(Refer Slide Time: 1:58)

So, here is the definition, or theorem; I am stating it as a theorem, since one can actually define the expected value in a different way and prove this as a result, but we will not do that. We will state it as a result and simply use it. Here is the main result. You have a continuous random variable with a certain density, denoted f_X(x) in our usual notation, and let us say g is some function from the real numbers to the real numbers; it could be any reasonable function.

So, I have not been very precise about what "reasonable" is, but just think of anything you can draw or describe in a reasonable way. The expected value of g(X), with the same notation as before, E[g(X)], is given by an integral from −∞ to ∞. Although it says −∞ to ∞, it is enough if you restrict it to the support of X, because outside the support the density is 0.

So, it does not matter; you do not have to integrate there. For convenience I am just writing it as E[g(X)] = ∫ from −∞ to ∞ of g(x) f_X(x) dx. This is called the expected value of the function g(X), and this is the connection, the way you go from the density to the expected value. Like I mentioned, quite often you may not know the density, but you can still know the expected value by other means.

We will look at those kinds of things later. So, this is a very useful definition, a very useful connection. Many of you can go back to the discrete case and make a very simple analogy here. If X is discrete, it has a certain range and a PMF p_X, and the expected value of g(X) is the summation over the range of g(x) times the PMF. So, you can see what is going on here.

So, the integral sort of plays the role of the summation; it is enough if you restrict it to the support of X, and the support is like the range. And f_X(x) dx plays the role of the PMF: the density times a small length is approximately the probability that X falls around x. So, you multiply that by g(x) and sum over the support, which becomes integration in the limit.

So, this is how the discrete goes into the continuous: summation gets replaced by integration, and the PMF gets replaced by the small probability in that region, which is the density times the small length dx. That is the way to intuitively think about it; I am not being precise here. There is a way to write these things down precisely, but this is good enough for most of us. And anyway, for the integration we are going to use tables.

A quick warning: if the support is very large, then depending on the function g and the density f_X, this integral may diverge to +∞ or −∞, or in some cases it may not even exist. So, it might happen that you cannot fix one number for this integration. All those are sort of corner cases for us; maybe I will mention them once or twice if I get a chance.

But it is not so critical for us; most of the time we will be concerned with expected values which are well defined and finite. So, this is the definition. In your examples and assignments you will compute some expected values, and even in this lecture I will compute some expected values given some densities, and you will see it is just an integration exercise in some sense.
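The defining integral E[g(X)] = ∫ g(x) f_X(x) dx can also be approximated numerically, which is a handy way to check an integration exercise. A sketch (my own illustration; λ, g and the truncation point are arbitrary choices):

```python
import numpy as np

# Midpoint Riemann-sum approximation of E[g(X)] over a truncated support,
# with the exponential density and g(x) = x^2.
lam = 2.0
edges = np.linspace(0, 25, 500_001)   # truncate the support at 25
x = 0.5 * (edges[:-1] + edges[1:])    # midpoints of the grid
dx = edges[1] - edges[0]

fx = lam * np.exp(-lam * x)           # exponential density on the support
e_gx = np.sum(x**2 * fx) * dx         # approximates E[X^2] = 2/lam^2

print(round(float(e_gx), 3))          # 2/lam^2 = 0.5
```

The truncation is harmless here because the exponential tail beyond 25 is negligibly small.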
(Refer Slide Time: 5:49)

So, let us come to some specific choices of the function g. Previously we used a general function g and said the expected value of g(X) is calculated in that fashion. Now, for a continuous random variable, two particular choices are of great interest. One gives you the mean value, the expected value of X itself, when g(X) = X.

This is also denoted μ_X, or, if it is the only random variable you are dealing with and this is very clear in the context, you can simply say μ; quite often it is called the mean. Since g(x) = x, the definition is quite clear: E(X) = ∫ x f_X(x) dx over the support of the random variable. I keep saying −∞ to ∞, but remember it is enough if you integrate over those values of x where the density is non-zero, so the support is enough.

So, the mean is the average value, the expected value of X; it is sort of like the central value of the distribution, and it gives you a good idea of where the distribution sits. The next important special case of expected values is what is called the variance of X. In this case, you can see the choice of g here: g(x) = (x − μ_X)². And the definition comes out very cleanly: Var(X) = ∫ from −∞ to ∞ of (x − μ_X)² f_X(x) dx; once again, it is enough if the integration is over the support of the random variable.
So, variance is a measure of the spread. And you can do a simple manipulation here, using the properties of the PDF, to show this very interesting result: Var(X) = E(X²) − E(X)². This is a very useful result; quite often, when you want to find the mean and variance of a random variable, you will find the mean and the expected value of X².

You find both, then you subtract, and you get the variance; that is all you are doing. So, one of the skills that you pick up in a basic statistics and probability course is: given a precise PDF, what is the mean? What is the variance? It is an integration exercise, and it is important to get a hold of it; it is a basic skill that you should pick up when you learn probability and statistics.

And we will see a couple of exercises in that direction. But what is the meaning of the mean and variance? The mean sort of represents the central value of the distribution, and the variance tells us something about the spread. If you remember, we saw some inequalities, and we will talk about those as well. But mean and variance are very, very crucial. Quite often, in complicated cases, you can pretty much only compute the mean of the random variable, and that is actually good enough in many cases. We will see those kinds of examples later on.

(Refer Slide Time: 9:02)

Here are examples. We saw the very common continuous distributions, 3 types: the uniform distribution, the exponential distribution, and the normal distribution. What are the means and variances of these distributions? This is the slide that tells you; it is an integration exercise, and I am not going to go through the integration for this particular case.

But I suggest that you try it and see if you get it; this is a good exercise. You can try it with pen and paper, and you will get a good hold of some of these things. Particularly for the uniform you should be able to do it; it is not very hard. For Uniform[a, b], the PDF is just flat from a to b with height 1/(b − a), and the expected value, very interestingly, happens to be (a + b)/2.

It is not very difficult to guess the expected value; it is the central value. The variance comes out to be (b − a)²/12; you can get that if you do the integration, and once again, like I said, this is an exercise. For the exponential distribution, you remember the density is λe^(−λx) for positive x. For the expected value of X, if you actually plug into the formula and evaluate the integral, you will get 1/λ.

And the variance will be 1/λ². So, this is an interesting little result for the exponential distribution. Most people, once they have seen these for a while, will mug them up and know them by heart: the mean and variance of the uniform distribution, and the mean and variance of the exponential. In particular, like I said, easily the most important distribution in all of statistics is the Gaussian distribution, the normal distribution.

It has two parameters, μ and σ², as we saw before. It has this complicated e-to-the-minus-quadratic PDF; actually, it is a very nice PDF. The mean of a normal distribution is actually equal to μ, and the variance is actually equal to σ². This is something really, really interesting, and it gives you some very interesting identities. If you look at the Gaussian, what does it mean to say the mean of X is μ?

So, ∫ from −∞ to ∞ of x f_X(x) dx, which is ∫ from −∞ to ∞ of x (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)) dx, is actually equal to μ. This is a useful identity, in my opinion, because if you think about it, it is a complicated-looking integral but actually not too bad; one can evaluate this integral, and it ends up being μ. It is actually not very hard to prove.

But anyway, you can give it a shot; if you think you can prove this, that is great. Look at this other integral for the variance: ∫ from −∞ to ∞ of (x − μ)² (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)) dx = σ². Notice this identity; it is slightly more difficult to prove than the previous one, but it is possible, and you can write down a proof for it.

So, these are all slightly more involved integrations, but hopefully you can do them. Once you get used to it, you can mug up these integrations as identities, and that will help you in some integration exercises. But in this course and this program we do not bother too much about such integrations; it is not so critical. I am just writing them down for you to realize that they are a little more complicated. We will not go into such integration exercises in this class, as it is not the main point, but you should remember these identities.

These forms for the distributions, and the fact that the expected value is μ and the variance is σ², are very, very important to know and appreciate. In modeling, this comes up quite often. For instance, here is a simple way in which you can test: suppose you are observing a random phenomenon and it feels like an exponential random variable, so maybe you want to model it as an exponential random variable.

One very quick way of checking is to find the expected value and the variance. Later on, we will see some quick ways of estimating the mean and variance of a random variable from observed samples; it turns out you can do that with pretty reasonable confidence. Once you can do that, if it turns out that the variance is the square of the expectation, then it is sort of an added bonus of evidence for you that this is probably close to exponential. Things like that are quite useful in practice, so it is good to know the means and variances of distributions, even if the actual integration might be a bit more complicated. That is okay.
(Refer Slide Time: 14:36)

So, here is a picture which maybe conveys what we are thinking of here: different uniform distributions, all centered around one half, with different widths. The black one is the uniform distribution from 0 to 1, and you can see the height is 1; this is Uniform[0, 1]. The blue one is uniform with width one half, so it goes from 1/4 to 3/4 and its height is 2.

The green one is uniform with height going up to 4, so I think it goes from (1/2 − 1/8) to (1/2 + 1/8); that is what gives the height of 4. The last one is uniform with height going up to 10, around 0.5, so it goes from 0.45 to 0.55. You can see how this works out: as the width decreases, the height of the uniform distribution increases. All of them have the same mean, 0.5, but the variances are going to be very different.
So, the variance for Uniform[0, 1] is 1/12. As you keep reducing the width, the variance falls. Maybe I should write down what the mean and variance are. For the first one, µ = (a + b)/2 = 1/2 and σ² = (b − a)²/12 = 1/12. For the narrow distribution, if you write the mean and variance, the mean is once again a half; you can see that the central value, (a + b)/2, is a half. For σ² you again compute (b − a)²/12, and now (b − a) is 0.1, so σ² = 0.01/12. So, as you go from a wide, shallow sort of distribution to a narrow, peaked one, even though the mean is the same, the variance falls sharply.

Now, you can also have the mean changing for the uniform distribution; if the mean changes, the whole uniform distribution shifts to some other place. Instead of the mean being 0.5, if the mean were 5, the whole distribution, with the same variance for instance, would simply shift to 5, as you can imagine. So, this is just a picture to show you what mean and variance mean for the uniform distribution.
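The formulas µ = (a + b)/2 and σ² = (b − a)²/12 can be checked against sample estimates for a few of the widths in the picture. This is a small sketch; the seed and sample size are my own arbitrary choices.

```python
import numpy as np

# Check mu = (a + b)/2 and sigma^2 = (b - a)^2 / 12 against sample
# estimates for Uniform[a, b] at a few of the widths discussed above.
rng = np.random.default_rng(1)
for a, b in [(0.0, 1.0), (0.25, 0.75), (0.45, 0.55)]:
    x = rng.uniform(a, b, size=200_000)
    mu, var = (a + b) / 2, (b - a) ** 2 / 12
    print(f"U[{a},{b}]: mean {x.mean():.4f} vs {mu}, "
          f"var {x.var():.6f} vs {var:.6f}")
```

All three report a mean near 0.5, while the variance drops sharply as the width shrinks.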

(Refer Slide Time: 18:01)

Let us do the same with the exponential distribution: vary λ, and you will get a sense of the different PDFs and their means. From this picture you can see the black one, which is Exp(1); it has a mean of 1 and a variance of also 1. The green one is Exp(2). How do I identify that? You are told it is an exponential distribution; how do you know the parameter λ? Remember the exponential density is λe^(−λx), so at x = 0 the value of the density is simply λ, which here equals 2. Once again, remember, this is the density we are plotting. So, this one is Exp(2); the mean will be 1/2 and the variance 1/4. You can sort of see where the mean and variance are: for the black one the mean is 1 and the variance is 1; for the green one the mean has shifted left to 1/2, and you can see it takes lower values with higher probability.

And then its variance is 1/4. The blue one is Exp(1/2): its mean is 2 and its variance is 4. You can see that the blue one has come down and spread out; it takes larger values with higher probability, the mean has shifted to 2, and the variance has become 4. The larger variance shows you the larger spread.

So, this hopefully gives you a feel for how the exponential works as you vary λ, in terms of mean, variance and density. You can see the behaviour is very different from the uniform, but you get a sense of how it is controlled.
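The values quoted above (mean 1/λ, variance 1/λ²) can be queried directly from scipy. One caution, assumed from scipy's documented convention: scipy.stats.expon is parametrized by scale = 1/λ, not by λ itself.

```python
import numpy as np
from scipy import stats

# Exact mean 1/lambda and variance 1/lambda^2 for Exp(lambda), from scipy.
# Note: scipy.stats.expon uses scale = 1/lambda.
for lam in [1.0, 2.0, 0.5]:
    X = stats.expon(scale=1/lam)   # frozen Exp(lambda) distribution
    print(lam, X.mean(), X.var())  # prints 1/lam and 1/lam^2
```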

(Refer Slide Time: 20:15)

So, the last picture I want to show you is for the normal distribution. The mean is fixed at 0 for all three, and σ has been varied. You can sort of guess which curve is which. Let me write it in black: the black one is Normal(0, 1), if I am not wrong, so mean 0 and variance 1. The green one has mean 0 and variance greater than 1: you can see it has spread out more and come down lower; that is the larger variance with the same mean. The blue one, you can sort of guess, has mean 0 again and variance σ² less than 1: it is peaked and comes closer together, taking values around the mean most of the time. So, this is how the different PDFs would look.

So, this kind of shape is a good thing to keep in your mind. When you think of a distribution, think of the shape of its density; that tells you the kind of values it takes, and on top of that, the mean and variance give you something more. Quite often the shape of the density might be difficult to get, because the random variable you are dealing with may be very complicated; but the mean and variance are something you can hope to obtain.
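The peak-height behaviour in the picture has a simple formula: the N(0, σ²) density at its mean is 1/(σ√(2π)), so halving σ doubles the peak. A minimal sketch checking this against scipy's normal PDF:

```python
import numpy as np
from scipy import stats

# The N(0, sigma^2) density peaks at x = 0 with height 1/(sigma*sqrt(2*pi)):
# smaller sigma gives a taller, narrower peak around the same mean.
for sigma in [0.5, 1.0, 2.0]:
    peak = stats.norm.pdf(0, loc=0, scale=sigma)
    print(sigma, peak, 1 / (sigma * np.sqrt(2 * np.pi)))
```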

(Refer Slide Time: 21:51)

So, that is all we wanted to do about mean and variance; we will do a couple of problems on evaluating the mean and variance given a density. But before that, I want to mention something about the Markov and Chebyshev inequalities. If you remember, we used the Markov and Chebyshev inequalities to calculate bounds on the probability that a random variable takes a value far away from its mean.

Far away from the mean by so many standard deviations gave us an upper bound. This, I said, tells you how the mean and variance constrain the distribution itself: you can get probability bounds using just the mean and variance. The same bounds hold for continuous distributions; in fact, they hold for any distribution that has a mean and a variance. That is the point of this slide: the Markov and Chebyshev inequalities carry over to an arbitrary distribution, as long as its mean and variance exist.

So, what is Markov's inequality? First of all, it needs a random variable with non-negative support, and then P(X > c) ≤ µ/c. That is Markov's inequality; it is useful for getting bounds on probabilities. Now, if you have the mean and the variance, Chebyshev's inequality says P(|X − µ| ≥ kσ) ≤ 1/k².

Both of these inequalities are useful for bounding the probability that X deviates too much from its mean, which is often worth quantifying. And they hold in general for any distribution, not just the discrete ones we saw before.
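Here is a small empirical sketch of Chebyshev's bound for a continuous distribution, using Exp(1), which has µ = σ = 1; the seed and sample size are arbitrary choices of mine.

```python
import numpy as np

# Empirical check of Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2,
# for a continuous distribution: Exp(1), with mu = sigma = 1.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=200_000)
mu, sigma = 1.0, 1.0

for k in [2, 3, 4]:
    p_emp = np.mean(np.abs(x - mu) >= k * sigma)   # empirical tail probability
    print(k, p_emp, 1 / k**2)                      # p_emp stays below the bound
```

The bound is loose here (the true tail is exponentially small), which is typical: Chebyshev trades tightness for complete generality.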

(Refer Slide Time: 23:38)

So, just one slide on the mathematical aspects. So far we have introduced continuous random variables in a very convenient, simple way, from the CDF and generalizations of the CDF. We never spoke about the probability space underlying a continuous distribution. In the discrete case, we were able to describe the sample space, which was the set of all outcomes.

And then we were able to describe events as subsets of the sample space, with unions and intersections of events; and the random variable came as a function from the sample space to the real line. That is how random variables were defined. Can you do something similar for the continuous case? That is a question one may ask. It turns out the technical machinery needed to answer it is quite involved.

Let me not say complicated; it is just that the jargon and the language needed there are a bit more involved and very technical. I have skipped that totally here, even though your book has a description of it. You are welcome to read the book and see if you can follow the description. I will just briefly summarize it; this is only for those who are heavily mathematically inclined and would like to know what it is that we missed.

Here is the simplification we did. For the continuous case, the sample space can be taken as an interval of the real line, for instance. Then events are intervals within it, along with their unions and complements. This is how events are defined; it turns out you cannot simply take all subsets, since there are some bizarre subsets which are counterintuitive to our sense of measure, so we do not do that. The probability function can also be reasonably defined.

So, all of this can be done: a proper probability space can be defined for the continuous random variable also, and you can define the continuous random variable as a function from the sample space to the real line and then work with that. In fact, modern probability theory does not need to distinguish between discrete, continuous, mixture and so on; there is a uniform language in which all of these can be described in common, the language of measure theory. And that is way beyond the scope of this introductory course.

I am just putting it out here so that, in case you are interested in studying the theoretical aspects further, you can look up the NPTEL course I have pointed out; it gives you a complete characterization of the measure-theoretic description of probability. It is an advanced course; if you are thinking of graduate studies, you can jump into it. This slide is just a small bow to the axiomatic nature of probability theory, something we have skipped totally, but it is good to know that these things exist.
(Refer Slide Time: 26:37)

So, let us come to a couple of problems. Like I said, a very important skill you need to pick up is: given a PDF, evaluate its CDF, expected value and variance. It is a very basic skill with continuous random variables, and since this is one of the first courses in which you are meeting them, you should do enough problems to be comfortable going from the PDF to the CDF and finding the expected value and the variance.

Many of these are exercises in integration, so this will also help you with looking at a table of integrals and then substituting in the values. We do not expect, at least in this course, any major wizardry in the integration itself; that is not the point of this course. But the basic skill of what is involved should be known to everybody. So let us get started.

So, I will work out two problems, both picked from the book, just to give you some practice. Here is the PDF: f_X(x) = 1 − |x| for −1 ≤ x ≤ 1. I usually like to sketch the PDF; you do not have to sketch if you do not want to. If you can imagine the formula without sketching, that is also good. So, 1 − |x| looks like a triangle peaking at x = 0.

It is a valid density: the total area under the curve is (1/2) × 2 × 1 = 1, which is just the area of the triangle, is it not? So, that is the PDF, and you can see the support is from −1 to 1. So, immediately two regions of the CDF are clear: it is going to be 0 for x < −1 and 1 for x > 1. Both these things are very clear. And given the nature of this PDF, you are going to get one expression for −1 ≤ x ≤ 0 (maybe I will put less than or equal on one side; it does not matter) and another for 0 ≤ x ≤ 1.

You are going to get one expression here and another there. So, this is how the CDF is going to look. How did I do this? For x less than −1 it is going to be 0 all the time, and greater than 1 it is going to be 1 all the time. And then from −1 to 0 you will have one type of integration, and from 0 to 1 the other type. So, what do I do for −1 ≤ x ≤ 0? For x in that range, the CDF is

F_X(x) = ∫ from −1 to x of (1 − |u|) du.

I am just doing it laboriously; there are many easier ways. Now remember u is negative throughout this range, so the absolute value |u| is −u, and the integrand is just (1 + u). Hopefully you see where I got that from. So, if you do this, you get

∫ from −1 to x of (1 + u) du = [u + u²/2] evaluated between −1 and x = (x + x²/2) − (−1 + 1/2) = 1/2 + x + x²/2.

So, that is the formula for this range. And then for the other range, 0 ≤ x ≤ 1, your F_X(x) is going to be, and notice, an integral that splits at 0.

So, if you are going to integrate, you do ∫ from −1 to 0 of (1 − |u|) du plus ∫ from 0 to x of (1 − |u|) du. The first piece is simply something evaluated before: it comes out to just 1/2, which is like putting x = 0 in the previous formula. For the second one you repeat the calculation: here u is positive, so the integrand is (1 − u), and ∫ from 0 to x of (1 − u) du = [u − u²/2] evaluated between 0 and x = x − x²/2. So, this will be F_X(x) = 1/2 + x − x²/2. That is the formula; you can go back and fill in the question marks and you will get the CDF.

Like I said, this is not very hard per se, but you need to have done these integrations and written them down carefully. If tomorrow I change the function, if instead of 1 − |x| I give you some other function, you should be able to repeat this exercise, at least for simple functions like this. We will not give you very complicated functions, maybe step functions, maybe simple things like this, but you should be able to split the integral and write it out over the different regions. It is a basic skill, and a skill needed in a bachelor's program.

So, you need to pick this up; that is the CDF. What about the expected value and variance? It looks like I might run out of room here, so let me see if I can simplify a little. For computing E[X] and Var(X), I will do it the laborious way. But look at this function: it is symmetric about 0, and whenever you have symmetry about a point, that point becomes the expected value.

So, in this case, ahead of time I can sort of guess that the expected value will be 0. Maybe we can do the calculation also and it should come out to be 0; it is not too difficult to check. Let me see if I can do that for you.

(Refer Slide Time: 33:27)

So, E[X], if you have to do it laboriously, is the integral of x f_X(x) over the support. Since there are two different functional forms, you write ∫ from −1 to 0 of x(1 + x) dx plus ∫ from 0 to 1 of x(1 − x) dx. Remember the density is 1 − |x|; when x is negative, |x| is −x, so it becomes (1 + x) dx (think about why that is true), and from 0 to 1, |x| is just x, so the integrand is x(1 − x).

One needs to evaluate this, so let us just go ahead: integrating x and x², the first piece is [x²/2 + x³/3] between −1 and 0, and the second is [x²/2 − x³/3] between 0 and 1. For the first bracket, if you put in 0 you get 0, and subtracting the value at x = −1 gives −1/2 from the x²/2 term.

So, here again you get +1/3 from the x³/3 term; then from the second bracket you get +1/2 and −1/3. The total is 0. So, you can check that if you use the formula and do it carefully, with the correct substitutions, you get the same answer. For the variance: in this case the expected value of X is 0, so the variance of X equals the expected value of X². It is the same computation with x replaced by x².

So, Var(X) = ∫ from −1 to 0 of x²(1 + x) dx plus ∫ from 0 to 1 of x²(1 − x) dx; I am jumping straight to this step. Doing the usual thing, this is [x³/3 + x⁴/4] between −1 and 0, plus [x³/3 − x⁴/4] between 0 and 1. There may be smarter ways to do this, but we should know the laborious, ugly way. The first bracket gives 0 − (−1/3 + 1/4), which is 1/3 − 1/4, and the second bracket likewise gives 1/3 − 1/4.

I hope I got that right; I think that is correct. So, it is (1/3 − 1/4) + (1/3 − 1/4). Now 1/3 − 1/4 is 1/12, and 1/12 + 1/12 is 1/6. So, the variance works out to 1/6. Hopefully I did not make any mistakes here. So, those are the expected value and the variance of X; you see, it is simple integration once you know the formula.

So, the only integration formula we needed was ∫ xⁿ dx = x^(n+1)/(n+1). That is the formula we needed, together with splitting the range: the absolute value |x| is −x if x < 0 and x if x ≥ 0. We used simple facts like these and were able to reason it out. So, that is the first example.
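As a cross-check of the hand calculation, here is a small numerical sketch using scipy's quad; the choice of quad and the tolerances are mine, not from the lecture. The `points=[0]` hint tells quad about the kink of |x| at 0.

```python
import numpy as np
from scipy import integrate

# Numerical check for the triangle density f(x) = 1 - |x| on [-1, 1]:
# total area 1, CDF value at x = 0.5, E[X] = 0, Var(X) = E[X^2] = 1/6.
f = lambda x: 1 - np.abs(x)

area, _ = integrate.quad(f, -1, 1, points=[0])
F_half, _ = integrate.quad(f, -1, 0.5, points=[0])   # CDF at x = 0.5
mean, _ = integrate.quad(lambda x: x * f(x), -1, 1, points=[0])
var, _ = integrate.quad(lambda x: x**2 * f(x), -1, 1, points=[0])

# Closed form derived above: F(0.5) = 1/2 + 0.5 - 0.5^2/2 = 0.875
print(area, F_half, mean, var)
```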
(Refer Slide Time: 37:05)

Hopefully, this was clean and you could follow. So, here is the next example. This one involves slightly more complicated functions, trigonometric functions, but there is no need to be too scared about them; again, tables for the integrals are available. You might want to sketch the density, f_X(x) = (1/2) cos x for −π/2 ≤ x ≤ π/2. If you are really worried about trigonometric functions, I am just showing this as one example.

There is no need to worry too much about it. It is just a shape, and I think my scale did not work out okay for you; it has some such rounded shape, the peak value is 1/2, and the total integral turns out to be 1, all quite standard. The CDF is once again easy to start: looking at the density, the support is between −π/2 and π/2, so the CDF is going to be 0 for x < −π/2 and 1 for x > π/2.

So, between −π/2 and π/2, what is it going to be? I like to do this always when you have to find the CDF from the PDF: first write down what you know clearly from the support itself, then mark what you have to find from the integration; if there are different function definitions in different regions, put a question mark, write down each region clearly, and then go after each region and find the integral.

It is just a step-by-step method. So, here is the region I am going to focus on. Try to find F_X(x): it is going to be ∫ from −π/2 to x of f_X(u) du. Notice how I am changing the variables. Why did I do that? The x has already shown up as the limit; there is no point in putting x inside the integral again, you will get really confused by that, so put some other variable.

And then the integration variable is clearly different from the variable you are using for the limit. Now we go to the table and look at cos: it turns out that ∫ cos x dx = sin x. This is a standard formula one can just use. The half just comes out, and then you have (1/2)[sin u] evaluated between −π/2 and x. So, this works out as (1/2)(sin x − sin(−π/2)), and sin(−π/2) is actually −1; you can look this up, it comes from the standard formulae. So, F_X(x) = (1 + sin x)/2, and that is the answer for the CDF. Now, if you are uncomfortable looking up integral formulae and plugging them back in, or you are worried about trigonometric functions and do not feel comfortable with cos, sin, et cetera,

I apologize; this is a very standard example in the book, just to get you to practice the integration: go look up the table and plug it back in. And how do we know that sin(−π/2) is −1? You can use a calculator in radians; it will give you the value −1. So you get (1 + sin x)/2, and that is the CDF. Let us do the expected value and variance. Even before we compute them, given the symmetry of this density, what is the expected value of X going to be? It had better be 0.

So, that works out quite nicely in this case. For the variance you have to do slightly more complicated integration, and that type of integration is usually a bit worrying, because it may not be a standard integral you have seen before; let me give you the values you need from the table.

So, let us continue. This is probably the only example you are going to see with this much integration, and do not be worried about it, because any integral that you will actually meet in a problem will be given to you; you only have to pick up that integral and substitute the limits. We will not expect you to know the integrals by heart, so do not be too worried. For instance, in this problem you needed the integral of cos x, and that is given to you.
(Refer Slide Time: 41:50)

Going further, to find the mean and variance you need two other integrals, ∫ x cos x dx and ∫ x² cos x dx; both of these are given to you (the first is cos x + x sin x). So, you will just substitute the formulae; do not worry about the integration part. Given this, let us go ahead and compute the expected value of X. It is E[X] = ∫ from −π/2 to π/2 of x · (1/2) cos x dx, the integral over the support of the density of x times the density itself.

So, x cos x appears here, and for ∫ x cos x dx I have given you the formula: you need cos x + x sin x, with the half outside, which is of no consequence, evaluated from −π/2 to π/2. If you start substituting: at π/2, cos(π/2) turns out to be 0, and you have a (π/2) sin(π/2) with sin(π/2) = 1, so you get something there; then put in −π/2: cos(−π/2) is also 0, and you get a (−π/2) times sin(−π/2), and sin(−π/2) turns out to be −1.

So everything will cancel and finally you will see you get 0. This needs checking, so maybe I should do it in some detail: (1/2)[cos(π/2) + (π/2) sin(π/2) − (cos(−π/2) + (−π/2) sin(−π/2))]. All of these need calculation: cos(π/2) is 0, cos(−π/2) is 0, sin(π/2) is 1, sin(−π/2) is −1.

So, it is (1/2)(π/2 − π/2); everything cancels and we get 0. This is something you have to be able to do: plug the formulae in and you will get it. The integration will be given to you; you are not expected to remember it, but you should be able to pull the integral from somewhere, plug it in and get the answer. This is a basic bachelor's-level skill for computing expected values.

So, given a PDF and the integration formula from somewhere, you just pick it out and calculate the mean and variance. For the variance: once again the expected value ended up being 0, so Var(X) is the same as E[X²]. That is ∫ from −π/2 to π/2 of x² · (1/2) cos x dx, once again over the support of the density. Now this half comes out, and then you have the more complicated-looking antiderivative (x² − 2) sin x + 2x cos x; you just mechanically evaluate it between −π/2 and π/2, there is no need for alarm.

You do not need any wizardry, like I said; the integration will not be expected from you, but substituting into the formula will be. So, there is a half outside overall. Inside, plug in π/2: it helps to calculate sin and cos separately, cos(π/2) goes off to 0 and sin(π/2) ends up being +1, so you just get (π/2)² − 2, which is π²/4 − 2. That is what happens when you put in π/2. When you put in −π/2, there is a minus overall: the sin becomes −1, so you get −(π²/4 − 2), that is −π²/4 + 2, with the cos term again going off to 0. Is that okay? So, you see it is (1/2)[(π²/4 − 2) − (−π²/4 + 2)].

This works out as a half times twice (π²/4 − 2); the halves cancel and you get π²/4 − 2. So, that is the answer for the variance. It is not a very high value: π is about 3.14, π squared is about 9.87, divided by 4 is about 2.47, minus 2 is about 0.47. So, it is a small value, but that is the exact value of the variance in this case. So hopefully these two problems were illustrative, and you will get some more practice with problems like this.
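As with the first example, the hand calculation can be cross-checked numerically; this sketch, with scipy's quad and a test point of my own choosing, verifies the CDF (1 + sin x)/2 and the variance π²/4 − 2.

```python
import numpy as np
from scipy import integrate

# Numerical check for f(x) = (1/2) cos(x) on [-pi/2, pi/2]:
# CDF F(x) = (1 + sin x)/2, E[X] = 0, Var(X) = pi^2/4 - 2.
f = lambda x: 0.5 * np.cos(x)

x0 = 0.3   # an arbitrary point inside the support
F, _ = integrate.quad(f, -np.pi / 2, x0)
mean, _ = integrate.quad(lambda x: x * f(x), -np.pi / 2, np.pi / 2)
var, _ = integrate.quad(lambda x: x**2 * f(x), -np.pi / 2, np.pi / 2)

print(F, (1 + np.sin(x0)) / 2)      # integral vs closed-form CDF
print(mean, var, np.pi**2 / 4 - 2)  # mean ~ 0, variance ~ 0.467
```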

Once again, I want to remind you that the integrals will be given to you; just be able to identify the intervals and plug in the formulae. This is a basic skill one should have when dealing with continuous random variables. So, this sort of finishes most of what I wanted to cover as basic skills and ideas in continuous random variables. What I also want to show you, in the next lecture, is how to simulate these continuous random variables in a Colab notebook. So, we will see that in the next lecture.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 5.9
Continuous random variable - Illustration in Colab

(Refer Slide Time: 0:13)

Hello, and welcome to this lecture. In this short lecture, I am going to show you how to simulate and work with continuous random variables in Python on Colab. We have been simulating events and probabilities, doing Monte Carlo to estimate probabilities, comparing them with whatever we derived in theory and seeing that it actually matches; all of that we have been doing so far. And you can do the same for the continuous case also; it is not too far removed.

So, I am going to show you how to do that. We need a few packages, as before: the NumPy package, for instance. I am just running the package imports in case I want to run things and show you how they work. Once again, this notebook is fully available to you; it has been shared and the link has been given to you, so you can copy it to your own drive and then use it. That should be fine.
(Refer Slide Time: 1:03)

So, these are all discrete things we have seen in previous lectures; I am slowly moving towards continuous random variables. You will need the statistics module, scipy.stats, which we will use. Then we have seen balls and bins.
(Refer Slide Time: 1:23)

So, common continuous distributions and histograms: this is the part we want to talk about. Let me zoom in a little bit, so that you can see clearly. We will use the scipy.stats module. And for plotting, here is another very common module called matplotlib.pyplot, which I am importing; you will see I use plt dot something and it gives you plots. The scipy.stats module has lots of commands for generating continuous samples as well, samples from a continuous distribution.

(Refer Slide Time: 2:00)


So, for instance, here I am starting with a uniform distribution. Notice the first command, x = st.uniform.rvs. This is a command that generates random variable samples with a uniform distribution; you can specify the limits, and I have put (0, 3) here, so it specifies between 0 and 3, and I put a size of 10,000, so it will generate 10,000 samples.

If you want to fool around with it and change the limits, you can do that here; note, though, that scipy's uniform takes a location and a scale, and the samples lie in [loc, loc + scale], so Uniform[−3, 3] would be st.uniform.rvs(-3, 6). The result will just be samples, values of x; you will see all sorts of values from 0 to 3, like 1.145 and so on, 10,000 values in all.

Now, what I have done next is to make a histogram of these values. This is a very nice and useful skill to have: when you have 10,000 values, all sorts of real values all over the place, the first skill you should pick up is to do a histogram. If you remember, when we looked at the meteorite weights, I did something like this: I first saw that the range was really vast, so I took logarithms to bring the range under control. Here we do not need that; I am simulating between 0 and 3.

It is a very well-known range, so there is no problem. So, the first thing you might want to do is look at the range and get some basic descriptive statistics, the minimum and the maximum; and once the minimum and maximum are within some reasonable control, next you can do a histogram. The histogram gives you a very good feel for the nature of the distribution of these 10,000 values.

And matplotlib (with NumPy) gives you this hist function. You pass the data first; I will urge you to look at the help and documentation for these functions. You can pick the bins, here bins is 100, and the range given is 0 to 3. So, in the range 0 to 3, I am picking 100 bins of equal width. This is how histograms are done: you make a lot of bins and then count how many of the 10,000 values fall into each bin. That count of how many values fall into each bin is the histogram.
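The steps above can be sketched as follows. Here I use numpy's histogram function, which computes the same counts that plt.hist draws, so the result can be inspected without a plot; the seed is an arbitrary choice of mine.

```python
import numpy as np
from scipy import stats as st

# 10,000 Uniform(0, 3) samples, binned into 50 equal-width bins on [0, 3].
# density=True rescales counts so the histogram approximates the PDF (1/3).
x = st.uniform.rvs(0, 3, size=10_000, random_state=42)
counts, edges = np.histogram(x, bins=50, range=(0, 3), density=True)
print(counts.mean())   # bin heights hover around 1/3
```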
(Refer Slide Time: 4:11)

So, I have given some description here, which I will not go through in great detail. But it turns out you can also pass this parameter density=True to get a density instead of the raw histogram counts: the number of values in each bin is converted into an approximation of the density.

So, here is the description; I will leave you to read it in the notebook that has been shared with you, but you can get the density. You do that, you get the histogram, and then on top of that I am going to make another plot. I do not have a comment here, so maybe I should add one: the orange line is the uniform density. For a random variable uniform on 0 to 3, what will its density be? On 0 to 3 it will be 1/3, that is 0.3333, and 0 outside of that. So, the orange line is actually the theoretical density, and the blue is the histogram; there are 100 bins from 0 to 3, all close together, and most of the bin heights are around 1/3.

So, you can see how the histogram approximates the density, and how the scipy.stats uniform samples are indeed uniform in distribution; this is the best we can expect. A picture like this, to me, illustrates a lot of the ideas you need in modeling: when data comes to you, you do a brief descriptive-statistics analysis and then plot a histogram, and you should develop the sort of instinct that says, maybe, this is a uniform distribution.

It is much simpler to describe the data as a uniform distribution rather than carry around the full histogram and struggle with it; and knowing that it is uniform, you may be able to quickly derive something useful from it. And plt.show is just to render the plot: you add these lines to the plot and then do plt.show, and you will get it. So, this is a good visualization, and I urge you to change the bin size.

(Refer Slide Time: 6:39)


So, maybe make it 50 bins and let me run this again. Notice you get the same thing, but in this case the bins are slightly better behaved; they are not all over the place. 100 bins were too many, 50 bins is better; you get this plot, a very good approximation, and everybody will believe that this is uniform. So, that is the uniform distribution. I have repeated the same with other distributions also.

(Refer Slide Time: 7:11)

So, I am not going to go into great detail here. The exponential distribution has this parameter lambda, and once again, scipy stats has a function called expon, which is an exponential continuous random variable; rvs gives you samples from that random variable. Location and scale are common parameter names for these distributions: the location shifts the distribution, and the scale is the other parameter; for the exponential, scipy's scale parameter is 1/lambda. You should read the help for expon to see exactly what it is.

And so I have generated 10,000 samples, and I am stopping the bins at 10, because I know that above 10 there is very little probability, so I am ignoring that. I am taking 50 bins and plotting here. This orange line is the theoretical density, and this is the blue histogram. So, this is the exponential density with parameter lambda.

So, the orange line here is $e^{-x}$, the density with lambda equal to 1. That is this plot; you will get it like this. So you can see that the histogram approximates the density. Of course, the generated exponential samples should approximate it; but even otherwise, if you see that a histogram falls off very rapidly like this, an exponential fit might be very good for you.
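A similar sketch for the exponential case; note that scipy's expon takes scale = 1/lambda, and the plot above corresponds to lambda = 1 (density e^{-x}). The sample size and bin choices follow the lecture; the rest is an assumption:

```python
import numpy as np
from scipy import stats

lam = 1.0  # rate parameter; the lecture's plot uses the density e^{-x}
# scipy's expon uses scale = 1/lambda, not lambda itself
samples = stats.expon.rvs(scale=1/lam, size=10_000, random_state=0)

# Stop the histogram at 10: P(Y > 10) = e^{-10} is negligible
heights, edges = np.histogram(samples, bins=50, range=(0, 10), density=True)
x = 0.5 * (edges[:-1] + edges[1:])
pdf = stats.expon.pdf(x, scale=1/lam)  # lambda * exp(-lambda * x)

print(np.abs(heights - pdf).max())  # histogram tracks the falling density
```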
(Refer Slide Time: 8:59)

The last example is normal. Once again, the same thing has been done; I do not want to go through the whole thing again. scipy stats has a command for generating normal distribution samples, and then you can plot the histogram along with the normal PDF. And you see that the normal PDF matches the histogram really, really well.
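And a sketch of the normal case; the mean and standard deviation here (0 and 1) are my choice, since the lecture does not specify them:

```python
import numpy as np
from scipy import stats

# 10,000 samples from a standard normal (parameters assumed for illustration)
samples = stats.norm.rvs(loc=0, scale=1, size=10_000, random_state=0)

heights, edges = np.histogram(samples, bins=50, range=(-4, 4), density=True)
x = 0.5 * (edges[:-1] + edges[1:])
pdf = stats.norm.pdf(x, loc=0, scale=1)

print(np.abs(heights - pdf).max())  # the histogram hugs the bell curve
```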

So hopefully, this is interesting, and you see how, even in our Python world, one can deal with continuous random variables: generate them, make histograms, understand them. And if tomorrow you were to model a large data set, then looking at its histogram and seeing how you can fit a density to it might be really something interesting for you to try. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department,
Indian Institute of Technology Madras
Lecture 6.1
Multiple Discrete/Continuous Random Variables

(Refer Slide Time: 00:12)

Hello and welcome. This week we are going to start looking at models where there are multiple random variables involved, some discrete and some continuous: how do you deal with these kinds of situations, and how do you write distributions for them? Some of them may be joint distributions.

You may have conditional distributions, you may have marginal distributions. You may remember, a few weeks back, when we looked at multiple discrete random variables, we had something called a joint PMF, conditional PMFs and marginal PMFs, and all of them had relationships between each other, and those relationships are very important.

Likewise, now that we have this new class of random variables, continuous random variables, we are going to see what happens in a real situation, or any situation, where you have multiple random variables involved in your random phenomenon, some discrete and some continuous, and you have to jointly describe how they are related and what their distribution is. So, this is an important problem, and I will start, in this lecture, by motivating why in real scenarios you will naturally observe things of this nature. So let us begin.
(Refer Slide Time: 01:34)

So, we will begin with a quick motivation. For motivation, I am going to use a very famous data set in statistics, called the iris data set. Iris is the name of a flower, a very pretty flower. There is a lot of classification involved with irises, and of particular interest is this data set, which was first used by the very famous statistician Ronald Fisher.

He is considered a father of statistics. You can go to the Wikipedia page and read about his life and his times and how he played a key role in statistics, and you can see the kinds of phrases that are showered on him there. His work has been very important and foundational in statistics, and this is the data set that he used. So, I think it is apt that we study that data set in this course.
So, this is the iris flower; this is a picture I have taken from a website, and that is how it looks. There are various different types of irises, and this is one particular one. All of them have petals, like the three small ones that are coming up on top, and sepals, the three petal-like parts that are falling down.

So, the three that are falling down are called sepals, and the three that are standing up are called petals. That picture probably gives you a clear idea of what we are talking about. So, this data set is about the sepal length, sepal width, petal length and petal width of irises.

The data set has three classes of irises. We will call them 0, 1 and 2; they actually have names, but we will just call them 0, 1, 2. There are 50 instances in each class, and for each instance there is a record of the length of the sepal, the width of the sepal, the length of the petal and the width of the petal.

Somebody has measured all this, and all of it is given in centimeters. By today's standards, this is a very tiny data set; 150 is nothing today, when many people talk about billions of pieces of data. But for a class like ours, this is a good example, and it gives a good idea of how to approach these kinds of situations.

So, the problem that Fisher tackled was a problem of classification. What is classification? Given that you see an iris, you do not know what class it is. So, you measure its sepal length and sepal width, its petal length and petal width, and then can you find the class of the iris from these? Can you say whether it belongs to class 0, 1 or 2?

So, this is a typical classification problem. Later on, when you study more data science and more machine learning, you will see such problems in great generality. Today's classification problems have millions of data points, where this one has just 150.

So, you use the prior data to build some sort of rough model; it is going to be a statistical model in some sense. But let us at least look at the data and then see if you can do the classification. That is the problem that Fisher looked at. We are not going to go directly into a classification problem today; we are just beginning to learn statistics. But we will look at this data.

We will see this data and see how to look at it, how to model it, how to think of it. How will you describe its distribution? There are so many irises, and each one is going to have its own petal width, petal length and all that, its own class; all of that is there.

So, supposing you see a random iris somewhere: what can you expect its class to be? What can you expect its sepal width and length to be? What can you expect its petal width and length to be? That is the sort of problem we will look at: how do you describe the distribution of all these quantities over all the irises?

So, in short, we will consider how to statistically describe the combination of the five things in this data set. Those five things related to irises are the class, SL (short for sepal length), SW (sepal width), PL (petal length) and PW (petal width). How do you statistically describe them? That is the question.
(Refer Slide Time: 06:14)

So, you always start with the data set: when it is presented to you, you should first see it. You cannot see it in its entirety; maybe 150 itself is too much, or maybe today it is not too much, and you can put it in an Excel sheet and see all of it. But at least see a part of it first. So, this is a small tabulation for you.

I will also show you how to get this data into a Python notebook. This iris data set is so standard that almost all statistical packages have it built in; you do not have to do anything else. I will show you one such mechanism to pull this data set into a Python notebook, for instance. These are the three classes, and that is the data. All of it is in centimeters, and you can see each variable has its range and so on.
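One common way to pull this data set into a Python notebook is scikit-learn's built-in copy; this is one mechanism among several, and not necessarily the one used in the lecture:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # 150 rows x 4 columns: SL, SW, PL, PW, in cm
y = iris.target  # class labels 0, 1, 2 (50 instances each)

print(X.shape)                 # (150, 4)
print(iris.feature_names)      # names of the four measurements
print(sorted(set(y.tolist())))  # [0, 1, 2]
```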

After you look at the data for a little while, the next step that most people take is to come up with descriptive statistics for the data. So, you can do that also; I will call it a summary, just summarizing the statistics, getting descriptive details of the data. What I have done here is put together a summary of sepal length: for each class, the range of values from min to max, the average sepal length, and the standard deviation of the sepal length.

All of these you know very well how to calculate. Given the data, you can calculate them in a spreadsheet program like Excel or Google Sheets with relatively quick commands; if you are in a Python notebook and have this data, it is just a couple of lines. Once again, I will show you later how to calculate this, but here I have summarized it for you.

So, you see the summary of sepal length, for instance, for class 0: the sepal length in the data varies from 4.3 to 5.8 centimeters, with an average of 5 and a standard deviation of about 0.4. It sort of gives you a rough idea of what it is. Likewise, there are summaries of sepal length for class 1 and class 2, and so on for sepal width, petal length and petal width. So, we have done the summary.
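A per-class summary like this really is just a couple of lines; here is a sketch using pandas, again assuming the scikit-learn copy of the data:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["SL", "SW", "PL", "PW"])
df["class"] = iris.target

# min, max, mean and standard deviation of sepal length, per class
summary = df.groupby("class")["SL"].agg(["min", "max", "mean", "std"])
print(summary)
```

For class 0 this reproduces the numbers quoted above: range 4.3 to 5.8 cm, mean about 5, standard deviation about 0.35.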

We are still not close to describing the distribution, but we are going towards it. Looking at the data, it looks like you want to model sepal length, sepal width, petal length and petal width as continuous random variables: it does not make sense to put discrete values there, there are going to be a lot of values, and each is a length, so it makes sense to model them as continuous. But look at the class: the class is just 0, 1, 2, which is very much discrete. So, you have this question of how to describe these things together in a coherent, nice manner. We will see that in this lecture.

(Refer Slide Time: 09:01)

So, the next thing a lot of people do, after summarizing the data, is to plot histograms of the data. Now, these histograms are very useful: they divide the range of values into bins. For instance, the sepal length for class 0 goes from 4.3 to 5.8, so you take that range, divide it up into small bins, and simply count how many irises fell in each bin.

So that is the picture. Here I have done histograms for sepal length, sepal width, petal length and petal width. I will also show you a Python notebook where all these things are done, so that you can look at the commands and get yourself familiar with them.

There are three classes here: the blue histograms are for class 0, the orange (saffron) histograms are for class 1, and the green histograms are for class 2. I have shown them one on top of the other in each plot; you can also plot them separately. You can see there is some overlap between the histograms, but each histogram also sort of suggests a continuous model. The data is very small, only 50 instances per class; if you increase the number of instances, you can hopefully see the shapes more clearly. So, this is giving you a clearer idea about how each of these quantities is going to behave.
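Per-class histograms like these can be computed along the following lines; the bin count and the focus on sepal length are my choices for illustration (in a notebook, plt.hist with an alpha value overlays them in color):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sl = iris.data[:, 0]  # sepal length
cls = iris.target

# One histogram per class over a common range, as in the overlaid plots.
# In a notebook: for k in (0, 1, 2): plt.hist(sl[cls == k], bins=12, alpha=0.5)
edges = np.linspace(sl.min(), sl.max(), 13)  # 12 bins over the full range
counts = {}
for k in (0, 1, 2):
    counts[k], _ = np.histogram(sl[cls == k], bins=edges)

for k in (0, 1, 2):
    print(k, counts[k])  # 50 irises per class spread across the bins
```

You can already see the overlap and the shift discussed above: class 0 sits at the small end of the range, class 2 at the large end.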

Once again, when a data set is presented to you, you want to see the data a little bit, you want to summarize it, and then you want to histogram it, so that you get a sense of how the distribution looks; then you start thinking about interrelationships and what you can see. So, one quick thing you can get by looking at these histograms is that the class is jointly distributed with the lengths.

When the class changes, the distributions of the lengths and the widths change. So, there is a joint distribution here; there is a connection between these variables, it is not as if they are independent. Depending on the class, the distributions of values for the sepal length, sepal width, petal length and petal width all change. So, there is a joint distribution involved here.
(Refer Slide Time: 11:09)

So let us look at this a bit more closely. Supposing I want to look only at the sepal length and the class of the iris: I have taken one of the plots from the previous slide and put it up here, and this is a density histogram plot. Remember, I talked about the density histogram plot in the previous lectures.

It is not just the count; it is the count divided by the bin width and the total number of samples. So, this is the density histogram plot, and you can sort of see that I have tried to show a continuous approximation. It is reasonable, I think, for about 50 data points; it looks like a reasonable approximation.

So, clearly, two or three things are coming out very clearly here. The class and the sepal length have a joint distribution: if you change the class, the distribution of the sepal length changes. Clearly, that is true. A continuous approximation for the sepal length is going to be a good model, so you may want to model the sepal length as continuous; but notice the class is discrete.

So, here we have one of our first situations with two random variables, one being the class of the iris and the other being the sepal length of the iris. These two clearly have a joint distribution, they depend on one another, but one is discrete and the other is continuous. These are the kinds of situations you will have quite often in practice.

It may not be just two variables; maybe there are multiple variables. We will come to that later on. But to begin with, with just two variables, one continuous and one discrete, how will you describe the joint distribution? What kinds of quantities do you have to define? How will you play around with them? That is what we are going to see in the next lecture. Thank you very much.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology Madras
Lecture 6.2
Joint distributions: Discrete and Continuous

(Refer Slide Time: 00:12)

Hello and welcome to this lecture. We are now going to start looking at how to describe joint distributions between two random variables when one of them is discrete and the other is continuous. What is the easiest way to describe this? I will propose one method. There are other methods also, but the method I am proposing is, I think, the most common, most useful and, in practice, simple to use. I motivated this by showing you the iris data set, but in this lecture we will not refer too much to the iris data set; we will just describe a discrete random variable and a continuous random variable together in a joint distribution. How would you go about describing them?

(Refer Slide Time: 00:53)


So, how do you describe the joint distribution of two random variables? Let us say X and Y are jointly distributed. X is discrete: it has a range $T_X$ and a PMF $p_X$. Remember, the PMF is $p_X(x) = P(X = x)$. So that is the PMF of a discrete random variable.

Now, what happens in these scenarios is this: Y is jointly distributed with X, and to understand the joint distribution, for every value small x that X can take in the range $T_X$, you have a continuous random variable, which I am calling $Y_x$: if capital X were to be equal to small x, then Y would be distributed like $Y_x$.

So, you can always say that. You can go back to the iris class and sepal length picture that I drew: if the class is 0, you have a certain distribution; if the class is 1, you have another distribution; if the class is 2, you have yet another distribution. So, there are three different distributions, and depending on the class, you pick one of them. That kind of scenario is what we will assume is always available to us.

So, this $Y_x$ we will give a slightly different notation: we will think of it as the random variable Y given X equals small x, sort of a conditional random variable in some sense, and the usual notation for this is $Y|X = x$. The particular distribution that Y takes when capital X has taken the value small x, we denote as Y|X = x. This is common notation.

Now, the density given that capital X became small x, we will also denote with this notation: $f_{Y|X=x}(y)$. So, that is the density. Remember, X has a PMF $p_X$, and for every value small x, given X = x, Y has a conditional density $f_{Y|X=x}$. It is a conditional density: given X = x, this is the density of Y.

Now, if you want the marginal density of Y itself, the entire density, you have to do a summation over all values of x in the range $T_X$, the PMF of x times the conditional:

$$f_Y(y) = \sum_{x \in T_X} p_X(x)\, f_{Y|X=x}(y).$$

This is easy to see; it is sort of like the total probability law. All of these things have rigorous proofs using the ideas of probability spaces and all that; we are not going into that much detail here, but hopefully you can see that the formula is intuitive and easy to understand. Notice that $p_X(x)$ is the probability that X equals x, and $f_{Y|X=x}(y)$ is the density of Y conditioned on X = x; you multiply, add over all x, and you should get the density of Y.

It is sort of reasonable, except that we are mixing up probabilities and densities, so you might think maybe I am doing something wrong here. No, it is okay: this is a comfortable and correct description for the kind of situation we are looking at. There is a discrete and a continuous random variable, and their joint distribution is described in terms of the PMF and the conditional densities. This picture is important to keep in mind. So, this is how we will describe the joint distribution.

Going back to the iris class and sepal length, for instance: the sepal length given class 0 will be one continuous distribution, the sepal length given class 1 will be another continuous distribution, the sepal length given class 2 will be yet another distribution, and together these describe the joint distribution of class and sepal length.

Simple, right? Starting at this point, we will do some problems, and you will begin to get more and more clarity about the situation as we do more and more problems.

(Refer Slide Time: 05:12)

So, here is a problem. Whenever you see a description, you should also do a problem, so let us do this one. It says X is uniform on {0, 1, 2}. Notice the braces here: they clearly indicate that this is a discrete uniform random variable taking three values 0, 1, 2. Then Y|X = 0 is N(5, 0.4), and so on: these three lines describe the conditional densities.

So, the conditionals are Y|X = 0, Y|X = 1 and Y|X = 2, and the distributions are N(5, 0.4), N(6, 0.5) and N(7, 0.6). They could be any other distributions also; these are just three different distributions. Maybe I should draw the three of them and give you a sense of how they look. Let us try that.

So, let us place the means 5, 6, 7 on the axis: 5 is here, 6 is here, 7 is here. Y|X = 0 is N(5, 0.4); 0.4 is small, so this density is narrow, going a little higher and falling off faster. This is normal with mean 5 and standard deviation 0.4. N is the abbreviation for normal, a very common abbreviation; I will use it quite often, since it is difficult to write normal all the time.

And then you have N(6, 0.5), with slightly higher spread, so it goes a little lower and a bit wider; I am not drawing this to scale, and if you do it on a computer you will get a better picture. Likewise, N(7, 0.6) goes a little lower and wider still. Each of these is the conditional distribution of Y given X = 0, X = 1 and X = 2, sort of like the sepal length if you want. And the axis here should be labeled y, not x: these are the densities of Y given X equals 0, 1 and 2.

So, now, how do you get the marginal? First, $f_{Y|X=0}(y)$ is the normal density

$$f_{Y|X=0}(y) = \frac{1}{\sqrt{2\pi}\,(0.4)}\, e^{-\frac{(y-5)^2}{2(0.4)^2}}.$$

So, this is Y|X = 0; likewise you can also write $f_{Y|X=1}$ and $f_{Y|X=2}$. I leave those as an exercise for you; it is easy: for X = 1, put 6 instead of 5 and 0.5 instead of 0.4; likewise for X = 2, put 7 instead of 5 and 0.6 instead of 0.4.

So, what I will do is write the distribution of Y. It is a sum over every value of x in the range {0, 1, 2}, and the probability that X takes each value is 1/3, since X is uniform:

$$f_Y(y) = \frac{1}{3}\,\frac{1}{\sqrt{2\pi}\,(0.4)}\, e^{-\frac{(y-5)^2}{2(0.4)^2}} + \frac{1}{3}\,\frac{1}{\sqrt{2\pi}\,(0.5)}\, e^{-\frac{(y-6)^2}{2(0.5)^2}} + \frac{1}{3}\,\frac{1}{\sqrt{2\pi}\,(0.6)}\, e^{-\frac{(y-7)^2}{2(0.6)^2}}.$$

So, this is the marginal distribution of Y.

Remember, the conditional densities are all Gaussian, but the marginal is not Gaussian: a Gaussian cannot be an average of densities with three different peaks. This kind of distribution is called a mixture of Gaussians, or a Gaussian mixture. It is actually quite a popular model; in practice, Gaussian mixtures end up being very useful.
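As a quick numerical sanity check of this mixture (reading the second parameter of N(·, ·) as the standard deviation, consistent with the written density), the marginal really does integrate to 1:

```python
import numpy as np
from scipy import stats

means, sds = [5, 6, 7], [0.4, 0.5, 0.6]

def f_Y(y):
    # marginal density: (1/3) times the sum of the three conditional normal densities
    return sum(stats.norm.pdf(y, loc=m, scale=s) for m, s in zip(means, sds)) / 3

# Riemann-sum the density over a range wide enough to capture all three components
y = np.linspace(0, 12, 4801)
integral = float(f_Y(y).sum() * (y[1] - y[0]))
print(integral)  # essentially 1: the mixture is a genuine density
```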

So, notice the second question. The first question was easy enough to answer; like I said, it all starts with small, basic definitions. This seems easy enough, but there is always a next step which is slightly more complicated, and that is where we will go next.

So, what will usually happen is some sort of reverse question. You start with X, then you give the conditional for Y given X, and then you find the marginal of Y; all that seems easy enough to do. What about the reverse? You observe a Y: you observe that your capital Y is around some $y_0$. Remember, Y is continuous, so we are not trying to be precise about equality; we just say Y is around $y_0$. What can we possibly say about X?

So, that is a question of interest here, and it is sort of like the classification problem with irises. Supposing you saw that the sepal length was this much, what can you say about the class of the iris? Is there anything you can say at all? That is the classification problem. So, you can see the second question is going in that direction; that is where the interest comes from.

So, it becomes a sort of reverse problem, and how to do it is the question. We will invoke something like Bayes' theorem in this continuous-discrete combination framework and try to answer that question. I am not going to answer it immediately; I will describe one little method and then come back and think about how to answer it.


So, you can just look at this picture. If Y were to be around 5 or 4 or 3 or much less, you are probably going to say the class was 0. If Y were to be around 8 or 9 or even 7 or 7.5, maybe you are going to say the class is likely to be 2. But what if Y were 6.5? What are you going to say the class is? These kinds of questions are interesting; this is the heart of how data science and statistics work. So, it is important to understand how to answer these kinds of questions. Let us see how they can be answered.

(Refer Slide Time: 13:27)

So, what we are after is some sort of conditional probability for the discrete random variable given that the continuous random variable took some value. You have a discrete and a continuous random variable with a joint distribution, described as before: for every value of the discrete variable, you have a conditional density for the continuous one. Now, I want the inverse problem, sort of the Bayes rule equivalent: what is the conditional probability that the discrete random variable takes a particular value, given that the continuous random variable took a value somewhere around something?


So, here again, one can prove these formulae very carefully from basic theory; we are not going to do that. I am going to basically state it as a definition for you. You will see it is a very intuitive, simple definition, and most of you will immediately accept it, so it is okay that we do not see a rigorous proof. Proofs of such statements can be found in textbooks; I can refer you to them if you are interested. So, let us see what the statement is; the statement is much more important to understand and use.

So, let us say X and Y are jointly distributed, with X being discrete, having range $T_X$ and PMF $p_X$, and the conditional densities $f_{Y|X=x}(y)$ are given to you; each of them is a density. We want the conditional probability of $X|Y = y_0$. Now, this $Y = y_0$ is very troubling; a lot of people get troubled by it, because we know that Y is a continuous random variable, so $P(Y = y_0) = 0$. So, why am I writing $Y = y_0$? It turns out it is okay: in this particular case, you can condition on $Y = y_0$. Again, the theory there is a little bit more complicated.

If you want more intuition, simply think of Y taking values around $y_0$; for convenience, we simply write it as $Y = y_0$. That is also not too bad. So, here is the big formula, and I am stating it as a definition. Actually, you can write this down and prove it; it is not very hard, though it is technical: you have to write down the probability space equations very carefully. But let us just accept it and move on; it is very intuitive.

You will see why it is intuitive. So, here is the formula; let us first understand it:

$$P(X = x \mid Y = y_0) = \frac{p_X(x)\, f_{Y|X=x}(y_0)}{f_Y(y_0)}.$$

So, you have a density in the numerator and a density in the denominator. The numerator density is the conditional density, the denominator density is the total marginal density, and you have $p_X(x)$ multiplying the conditional. Remember, the denominator is actually $f_Y(y_0) = \sum_{x \in T_X} p_X(x)\, f_{Y|X=x}(y_0)$.

So, this should remind you of Bayes rule; this is very much like Bayes rule, the conditional times the prior divided by the total probability, except that you do not have only probabilities: these densities come in. But that is okay. You have a density in the numerator and a density in the denominator; if you go to the limit, things work out, and you get a probability out of it. So, the best way to remember this formula is to think of the analogy to Bayes rule; it is very, very similar to Bayes rule.

Normally, how do you write Bayes rule? You write it as P(A|B)P(B) = P(B|A)P(A). So, notice, here A is the event X = x, and B is, sort of, the event $Y = y_0$. Wherever you have the discrete random variable, you are going to have a probability, $P(X = x|Y = y_0)$, and wherever you have the continuous random variable, you just smoothly go to the density.

So, notice here, the probability of B is of course going to be 0; it does not make sense there, so you just move to the density. Likewise, the probability of B given X = x is also going to be 0, but you can move to the conditional density, and then this equation is valid. So, it is sort of like Bayes rule with densities conveniently replacing probabilities wherever you have a continuous random variable. That is how one evaluates this conditional distribution of the discrete random variable given the continuous random variable it is jointly distributed with.

So, this is a very nice and interesting little formula, very reminiscent of Bayes rule. I would just think of Bayes rule as the basic thing and then use this formula. A lot of you may be thinking Bayes rule is already confusing and this now adds confusion with discrete and continuous, but this is a very commonly occurring situation; many models use this idea, so it is good to get a good grasp of it. It is not much more complicated than Bayes rule; you just have to remember that whenever you have a continuous random variable, you simply use densities instead of probabilities. That is it.
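For the Gaussian mixture example from earlier, this formula can be exercised numerically. This is a sketch, again reading the second parameter of N(·, ·) as the standard deviation:

```python
from scipy import stats

means, sds, p_x = [5, 6, 7], [0.4, 0.5, 0.6], [1/3, 1/3, 1/3]

def posterior(y0):
    # P(X = x | Y = y0) = p_X(x) f_{Y|X=x}(y0) / f_Y(y0)
    num = [p * stats.norm.pdf(y0, loc=m, scale=s)
           for p, m, s in zip(p_x, means, sds)]
    f_y = sum(num)  # marginal density f_Y(y0), by total probability
    return [n / f_y for n in num]

print(posterior(5.0))  # heavily favours class 0
print(posterior(6.5))  # the genuinely ambiguous case from the earlier discussion
```

At $y_0 = 5$ the posterior concentrates on class 0, while at $y_0 = 6.5$ classes 1 and 2 split the probability almost evenly, which matches the intuition from the picture.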

So, this $X|Y = y_0$ you can think of as a conditioned discrete random variable, a conditional random variable: the distribution of X given $Y = y_0$. This can help you in classification and in making decisions. Now, once you have conditionals and all that, you can even think of independence. X and Y are jointly distributed; when are X and Y independent? This happens if the conditional density of Y given X = x does not depend on x, and then the conditional probability for the discrete random variable given the continuous random variable also equals its original probability. All of these things work out when you have independence.

Remember, independence holds if the conditional distribution of Y|X = x is simply independent of x. So, you can define all these things for situations where one variable is discrete and the other is continuous as well. When both are discrete, all of these are very standard: you just have formulae relating all of them. When one is discrete and the other is continuous, one can also write down formulae like this.

So, to really see this formula, you have to see it in action. What I am going to do is solve a few problems, each with one continuous random variable, one discrete random variable and a described joint distribution, and show how you find these kinds of probabilities. Let me do a few problems for you.

(Refer Slide Time: 20:43)


So, we are going to start with a simple problem, where X is uniform on {−1, 1}; once again, the braces indicate that this is discrete, taking −1 with probability half and 1 with probability half. Two conditional distributions are given to you, and you have to find the distribution of X|Y = −1, X|Y = 1 and X|Y = 3. You can see that Y|X = −1 is a continuous random variable with the uniform density on [−2, 2], and Y|X = 1 is an exponential density with parameter 5. I like to plot these things.

I think it sort of helps to plot them, plot the densities, the conditional densities at least. So, you have −2 to 2 uniform. That is one density. You can see I have not drawn it properly, so let us draw it: −2 here, plus 2 there, and 1 by 4 is the height, so this is the conditional uniform on −2 to 2. And then you have the conditional Y|X = 1 and that is Exp(5). So, Exp(5), what is the density? It is 5𝑒^{−5𝑦}; that is 𝜆𝑒^{−𝜆𝑦} with λ = 5. So, this will go down quickly. It will start at 5, somewhere up there, and already at 1 the density has become 5 divided by e power 5; e power 5 is about 148, so that is roughly 0.03. It has gone well below the 1 by 4 line.


So, by the time it hits 2, it will be really far below this. So, it will go like that. So, this is 5𝑒^{−5𝑦}. You can find out where it crosses 1 by 4; it will be somewhere around there, and then it keeps going down. So, these are the two conditional densities. Hopefully, you can picture this now. X takes values −1 and 1. When X is −1, Y is uniform from −2 to 2. When X is plus 1, Y has a conditional density which is like this. So, X is uniform on {−1, 1}, and what is being asked here is X|Y = −1. So, this is what is being asked here.

So, X given Y = −1 might possibly take values −1 and plus 1. For this, I want to compute P(X = −1|Y = −1), and for this, I know I need my marginal of Y. So, let us write down the marginal of Y. It is going to be half, the probability that X = −1, times the uniform density, 1 by 4.

Remember, you have to be very careful here. This 1 by 4 applies only when Y is between −2 and 2, and then there is the exponential piece for Y positive. This kind of thing is a bit difficult to write down, so let me write down the formula first.

So, 𝑓𝑌(𝑦) = (1/2) 𝑓𝑌|𝑋=−1(𝑦) + (1/2) 𝑓𝑌|𝑋=1(𝑦). Now this you can describe piece by piece: the first conditional is 1/4 for y in (−2, 2), and the second one is 5𝑒^{−5𝑦} for y positive. So, how do you combine these two and write it? It is a slightly more complicated density to write down.

It is 0 for y less than −2. It is half into 1 by 4 for y between −2 and 0. It is half into 1 by 4 plus half into 5𝑒^{−5𝑦} for y between 0 and 2. And it is half into 5𝑒^{−5𝑦} for y > 2.

So, notice how we wrote that. The first formula is generic, but these conditional densities have a certain support in y. For y between −2 and 2, the first one is 1 by 4; outside of it, it is 0. Only for y > 0 is the second one 5𝑒^{−5𝑦}; outside of it, it is 0. So, when you write it down in total, you have to say: for y less than −2, it is 0; for y between −2 and 0, it is half into 1 by 4; from 0 to 2, both of them will be active; and for y greater than 2, only the exponential density will be active. So, that is the marginal distribution of Y.
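This piecewise marginal is easy to check numerically. Here is a small sketch (my own code, not from the lecture) that evaluates 𝑓𝑌 and verifies, with a Riemann sum, that it integrates to roughly 1:

```python
import math

# Marginal of Y for the lecture's example: X is -1 or +1 with probability 1/2,
# Y|X=-1 is Uniform[-2, 2] and Y|X=+1 is Exp(5).
def f_Y(y):
    unif = 0.25 if -2 <= y <= 2 else 0.0           # f_{Y|X=-1}(y)
    expo = 5 * math.exp(-5 * y) if y > 0 else 0.0  # f_{Y|X=+1}(y)
    return 0.5 * unif + 0.5 * expo

# Riemann sum from -2 to 10 (the exponential tail beyond 10 is negligible).
dy = 1e-4
total = sum(f_Y(-2 + k * dy) for k in range(int(12 / dy))) * dy
```

The sum comes out very close to 1, confirming that the piecewise expression is a valid density.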

So, now what is 𝑃(𝑋 = −1|𝑌 = −1)? From the formula, it is

𝑃(𝑋 = −1|𝑌 = −1) = 𝑝𝑋(−1) 𝑓𝑌|𝑋=−1(−1) / 𝑓𝑌(−1).

So, now what is 𝑝𝑋(−1)? That is half. What is 𝑓𝑌|𝑋=−1(−1)? The conditional density is just 1 by 4 between −2 and 2, and the y value here is −1, so evaluated at −1 it is 1 by 4. And what is 𝑓𝑌(−1)? You see, it is half into 1 by 4. So, the ratio becomes 1.

So, you can also check the probability that X equals plus 1 given Y equals −1:

𝑃(𝑋 = 1|𝑌 = −1) = 𝑝𝑋(1) 𝑓𝑌|𝑋=1(−1) / 𝑓𝑌(−1).

So, now, notice this: the first factor is half, but the next one is 0. The conditional distribution of Y|X = 1 is the exponential density, and at −1 it is 0. So, it is half into 0, divided by half into 1 by 4, and that is 0.

So, the conditional distribution of X|Y = −1 is actually degenerate; there is no spread here. It is constant, equal to −1 with probability 1. Now that makes a lot of sense. Look at this plot: if Y were to be −1, clearly X has to be −1. There is no other way; if X is plus 1, there is no way Y is going to be negative. So, this works out easily. So, that is clear.


Now, let us look at the slightly more interesting case, which is when Y is equal to plus 1. In this case, I am going to cut short a lot of this formula and write directly what I would get. The numerator is 𝑃(𝑋 = −1), that is half, times the conditional density of 𝑌|𝑋 = −1 evaluated at plus 1. That is the uniform density evaluated at plus 1, so that is going to be 1 by 4.

Now divide by the marginal density evaluated at plus 1. The marginal density evaluated at plus 1 is going to be half into 1 by 4 plus half into 5𝑒^{−5}. So, that is the value of 𝑃(𝑋 = −1|𝑌 = 1). Notice, for 𝑃(𝑋 = 1|𝑌 = 1), you can go through the formula again if you want, but a quicker method will tell you it is just 1 − 𝑃(𝑋 = −1|𝑌 = 1).

Why? Because this is a proper distribution: given Y = 1, X takes only two values, −1 or 1. Once you find the probability of one value, the probability of the other value is just 1 minus that. So, if you do that, or if you use the formula again if you do not like that method, you are going to get half into 5𝑒^{−5}, divided by half into 1 by 4 plus half into 5𝑒^{−5}. So, you see how that formula works to give you the conditional distribution for the discrete random variable, given that the continuous random variable took a particular value.

The last one is X|Y = 3. Just by inspection, if Y takes the value plus 3, X could never be −1, because the uniform conditional lives only on (−2, 2). So, you can quickly see that 𝑃(𝑋 = −1|𝑌 = 3) is actually 0 and 𝑃(𝑋 = 1|𝑌 = 3) is 1. So, hopefully, this example showed you how to apply that Bayes-like formula for conditional probabilities of the discrete given the continuous. The continuous given discrete ends up being a density; for the discrete given continuous, you can use the Bayes-like formula using densities instead of probabilities.
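A quick simulation check of this answer is easy to do (my own sketch, in the spirit of the course's simulations). Since P(Y = 1) is zero for a continuous Y, the simulation conditions on Y falling in a small window around 1:

```python
import math
import random

random.seed(0)
# Exact answer from the formula: P(X=-1 | Y=1) = (1/8) / (1/8 + (5/2)e^{-5}).
exact = 0.125 / (0.125 + 2.5 * math.exp(-5))

N, eps = 400_000, 0.05
hits = minus_ones = 0
for _ in range(N):
    x = random.choice([-1, 1])                  # X uniform on {-1, 1}
    y = random.uniform(-2, 2) if x == -1 else random.expovariate(5)
    if abs(y - 1) < eps:                        # condition on Y being near 1
        hits += 1
        minus_ones += (x == -1)
estimate = minus_ones / hits
```

The estimate lands close to the exact value of about 0.88: once Y = 1 is observed, X = −1 is much more likely.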

(Refer Slide Time: 31:33)


So, this was the first problem. Here is another problem, sort of a word problem, but let us look at the description. There is a country, and in the age group of 45 to 50, 60 percent of the adults are male and 40 percent are female. Not a very good sex ratio in the country, but it is all right; that is not the question. What the question says is that the height in centimetres of adult males in that age group in that country has a normal distribution with mean 160 and σ = 10, and that of females is normal with mean 150 and σ = 5. Now, normally when I write Normal(µ, ·), you take the second parameter as 𝜎². So, one needs to be careful here.


So, I should be careful here. Let me reword this: I do not want 𝜎² to be 10; I will say σ = 10 and change the statement so that it is convenient. I remember now that in one of the previous slides I did something similar; there too, the numbers should be read as values of σ. So, in a question, if I want 𝜎², I will mention 𝜎² explicitly. Normally, a lot of people write 𝜎² there, and 𝜎² is the more standard notation, but I am going to use σ for these problems just for convenience. It is not a big deal; you could work with 𝜎² also.

So, this is the distribution that is given to you. Now, a random person is found to have a height of 155 centimetres. Is that person more likely to be male or female? So, what is the setup here? X is, let us say, the sex of the person. It takes two values, male and female: male with 60% probability and female with 40% probability. That is the distribution of X.

So, now Y is the height. Given X = M, Y is normal with mean 160 and σ = 10, and given X = F, Y is normal with mean 150 and σ = 5. So, you know the conditional density:

𝑓𝑌|𝑋=𝑀(𝑦) = (1/(√(2𝜋) · 10)) 𝑒^{−(𝑦−160)²/(2·10²)}.

That is the distribution, and 𝑓𝑌|𝑋=𝐹(𝑦) you can write likewise: instead of 160 you put 150, and instead of 10 you put 5, and you will get that conditional density. So, both of these are given.

One can also write down 𝑓𝑌 (𝑦). I will write it down below when I write the final answer.

You can see how to write that. It is going to be 0.6 times the first density plus 0.4 times

the second density that is 𝑓𝑌 (𝑦). So, the question is asking for X|Y=155, what is the

distribution here? And in particular, it is asking is that person more likely to be male or

female? So, here, what one can do is try and find probability that X =M|Y=155.
Let us try and evaluate that. The numerator is going to be 0.6 times the conditional density, given X = M, evaluated at 155:

0.6 · (1/(√(2𝜋) · 10)) 𝑒^{−(155−160)²/(2·10²)} = 0.6 · (1/(√(2𝜋) · 10)) 𝑒^{−5²/(2·10²)}.

Divide by 𝑓𝑌 evaluated at 155, which is the same term again plus the female term:

0.6 · (1/(√(2𝜋) · 10)) 𝑒^{−5²/(2·10²)} + 0.4 · (1/(√(2𝜋) · 5)) 𝑒^{−(155−150)²/(2·5²)}.

Notice, once again, this 155 − 150 is 5; it is 5 squared divided by 2 into 𝜎², which is now 5². So, one can simplify this and you will calculate a probability.

So, what will be the probability that X = F given Y = 155? If you want, you can write it down; it is just 1 minus the other probability, since X takes only two values. So, you find this value, and whichever is greater, you pick that. So, that is the problem. Notice how some sort of a classification problem can be solved by assuming a model like this. You are given a model, and in this model the classification problem has a precise calculation that you can do: given the height, you plug it into this formula, see which is the greater probability, and you decide.

In fact, there are only two possibilities. If one of them has a probability greater than 0.5, that is going to be the larger one. So, you can just calculate one of them and check if it is above 0.5 or not. Notice that I have not written down the final answer for you; I leave that as an exercise. Go ahead and compute the final answer. So, you see how, in a sort of semi-realistic situation, these kinds of discrete-continuous models enter the picture. They are very useful to know and understand.
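To finish the exercise numerically, here is a small sketch (my own code; the lecture leaves the number as an exercise) computing P(X = M | Y = 155) with the σ convention used above:

```python
import math

def normal_pdf(y, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at y.
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

y = 155
num_M = 0.6 * normal_pdf(y, 160, 10)   # prior 0.6, male heights N(160, 10^2)
num_F = 0.4 * normal_pdf(y, 150, 5)    # prior 0.4, female heights N(150, 5^2)
p_male = num_M / (num_M + num_F)       # P(X = M | Y = 155)
```

The answer comes out just above 0.5, so under this model a 155 cm person is slightly more likely to be male.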

(Refer Slide Time: 37:57)


So, here is the last problem in this sequence that I am doing for you, slightly different. So,

here, again, you have a discrete random variable and a continuous random variable. But

here, I am not going to directly give you the conditional density. I am going to give you

the conditional density indirectly in terms of a relationship involving the random variables.

So, that is what is done here. I am defining a random variable Y to be equal to X + Z, where X is a discrete random variable, Uniform{−3, −1, 1, 3}, and Z is a normal random variable with mean 0 and variance 𝜎². So, here the second parameter really is the variance 𝜎².

I apologize for not being very consistent here. For the normal distribution, one usually writes µ and 𝜎²; in the previous two problems, I simply took the second number as σ. Forgive me for that; 𝜎² is the more standard notation. So, I am going to define Y as X + Z, where X is Uniform{−3, −1, 1, 3}, Z is N(0, 𝜎²), and these two are independent: X and Z are independent. Now the question asks, what is the distribution of Y, and find the distribution of X|Y = 0.5.

So, the X distribution is given to you, and the trick is to find 𝑓𝑌|𝑋=−3, for instance. What is going to be the density of Y|X = −3? Do you remember how to do this? Given X = −3, Y is the same as −3 + Z. Z is normal with mean 0 and variance 𝜎², so what is −3 + Z?

You remember, you have to use functions of random variables now. It is a translation of a random variable; the density just shifts to be centred at −3, and in particular, for the normal, we know that this is going to be normal with mean −3 and the same variance 𝜎². So, it is just a shift. You remember, I did this example in quite some detail when we did functions of a continuous random variable, so you get this.

So, this is sort of the crucial step here. When you say X is −3, Y simply becomes −3 + Z; that is the random variable. The reason is that X and Z are independent: once X takes its value, the distribution of Z is certainly not going to change. It is just going to be −3 + Z. The same thing you have to see for the other values.

So, think about why this is true. You can convince yourself that Y|X = −1 will be just 𝑁(−1, 𝜎²), Y|X = 1 is 𝑁(1, 𝜎²), and finally Y|X = 3 is 𝑁(3, 𝜎²). Once you do this, you are back to familiar territory for finding the distribution of Y. So, this is the crucial step: understanding the conditional distribution. Once you have got this, the rest is the same as before.
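Putting the pieces together, the conditional distribution of X given Y = 0.5 follows the same Bayes-like formula, with the four shifted normal densities as weights. A sketch (my own code; the lecture keeps σ general, so I pick σ = 1 just for illustration):

```python
import math

def normal_pdf(y, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at y.
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

sigma = 1.0                      # assumed value; the lecture keeps sigma general
support = [-3, -1, 1, 3]
y0 = 0.5
# P(X = x | Y = y0) is proportional to (1/4) * f_{Y|X=x}(y0), with Y|X=x ~ N(x, sigma^2).
weights = {x: 0.25 * normal_pdf(y0, x, sigma) for x in support}
f_Y = sum(weights.values())      # marginal density of Y at y0 (a mixture of normals)
posterior = {x: w / f_Y for x, w in weights.items()}
```

As you would expect, x = 1, the support point closest to the observed 0.5, gets the largest posterior probability.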

So, notice the trick here. Instead of specifying Y|X = −3, Y|X = −1, etc. explicitly, you are given a function involving X and another random variable Z, and you are expected to work out those conditionals. So, here is a little exercise. In this exercise, X is the same Uniform{−3, −1, 1, 3}, Z is N(0, 𝜎²), and these two are independent. And then I am going to say Y = XZ; instead of saying X + Z, I am telling you XZ. So, this is one exercise for you.
For this problem, if Y is XZ, what is the marginal distribution of Y, and what is the conditional distribution of X|Y = 0.5? Here also the problem boils down to the same idea. You will have to find Y given X = −3, and you will see it is just −3 times Z. So, Y|X = −3 is just −3Z, and scaling a normal like that gives 𝑁(0, 9𝜎²): the mean stays 0 and the variance gets multiplied by (−3)² = 9. So, this is important to understand.
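A short simulation makes the scaling in the XZ exercise concrete (my own sketch, with σ = 1 assumed): if X = −3, then Y = −3Z, whose mean stays 0 and whose standard deviation becomes 3σ.

```python
import random
import statistics

random.seed(1)
sigma = 1.0                                   # assumed value for illustration
# Conditional on X = -3, Y = X*Z = -3Z with Z ~ N(0, sigma^2),
# so Y|X=-3 should look like N(0, 9 sigma^2).
ys = [-3 * random.gauss(0, sigma) for _ in range(100_000)]
mean_y = statistics.fmean(ys)                 # close to 0, not -3
std_y = statistics.pstdev(ys)                 # close to 3 * sigma
```

Contrast this with the X + Z case above, where the conditional mean shifts to −3 and the variance stays 𝜎².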

So, this is the last problem that I wanted to do, and you can see the flavour of the problems here. These kinds of problems are very natural in reality. Even this last one is actually a very common model in many communication systems; people use this model in many systems.

So, it is very natural that a discrete random variable and a continuous random variable together have some joint distribution, because of whatever happens in practice. How to describe that and how to deal with that is very important, and this can help you a lot in classification: in particular, in the reverse question, you observe the continuous random variable, and what can you say about the values taken by the discrete random variable? So, we are going to stop here for this lecture. We will pick up from here in the next video. Thank you.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department,
Indian Institute of Technology Madras
Lecture 6.3
Jointly Continuous Random Variables

(Refer Slide Time: 00:12)

Hello, and welcome to this lecture. We are now going to talk about a very interesting model, where
you use two continuous random variables or multiple continuous random variables and think of
them as being jointly distributed and jointly continuous. So, this is a very important model. I will
motivate it with some data from the irises, but you will see that in many typical situations this model
can be very, very useful. Once again, what we will focus on in this course is how to write down
the distribution under these models and how to do computations with them; that skill of computing
with them is very, very important. We will build up that skill in this part of the lecture. So, jointly
continuous random variables is the topic.
(Refer Slide Time: 00:55)

So, to motivate it, let us look at pictures here. I have drawn a couple of 2D histograms. So,
previously, we saw 1D histogram, one dimensional histogram, where a lot of data is available. We
bin them into suitable intervals and make a plot, the bar chart kind of plot and that is how we
moved into the continuous random variable situation. We made a histogram and we observed that
there is a nice smooth shape for this histogram, so we went and used that shape as the density. That
simplified our calculations, and we used that model quite extensively. So far, we have been
using it for computations.

Now, we will push that into multiple dimensions. So, for that the motivation is to look at two
dimensional histograms. What are two dimensional histograms? What I have shown here in the
blue picture on the left is a 2D histogram for the sepal length and the sepal width together in an
iris. So, we had all this data, where every row had a sepal length and a sepal width. So, now we can
do two dimensional binning. What is two dimensional binning? We count how many of these
(sepal length, sepal width) pairs fall into a given rectangle: how many have length between, let us
say, 4.5 and 4.6 and width between 3.7 and 3.8?

I will share some code with you later on, on the collab notebook on how to make these kinds of
plots. These kinds of plots are very interesting. They look like a big city laid out in some detail on
the ground. Of course, our cities are not so planned as this one, but anyway, so this looks like a
very planned city. So, this is how to make these plots, you will see, I will show you some methods
later on. But what do we learn from this plot? So, what we learned from this plot is, there is a joint
distribution between the sepal length and the sepal width. So, clearly, this joint distribution is
brought out here.

Some length and width combinations are common, and some other combinations do not occur. That
is what this is showing you: where you have those peaks, those tall buildings, a lot of (sepal length,
sepal width) pairs fall in that place; and where there are blanks, just a flat blue region on the ground
with nothing rising, no data falls. So, to whatever extent you have data, you can come up with plots
like this.

The same thing I have done on the right, in the green plot, for petal length and petal width. So, for
petal length and petal width, and for sepal length and sepal width, we have both the 2D histograms.
Once you see a 2D histogram, it is clear that the two quantities are jointly distributed in some way.
So, notice what is going on here: both the sepal length and the sepal width I want to model as
continuous random variables; that is what makes more sense from the data. And now they have a
joint distribution. The same story goes for the petal length and petal width also.

So, now we need tools, descriptions of jointly continuous random variables: two random variables
which are continuous and have a joint distribution, and how you will describe that. Just before this,
we saw a discrete and a continuous random variable having a joint distribution and learned how to
deal with that; now we are dealing with two continuous random variables. That is the motivation,
and clearly it is well motivated by random phenomena that you observe in practice. Many more
such phenomena you can see, and as you grow more in data science, you will face these situations
often.
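The two-dimensional binning described above is a one-liner in NumPy. Here is a sketch on synthetic data (my own code; these numbers are made up and are not the actual iris measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for (sepal length, sepal width) pairs, loosely correlated.
length = rng.normal(5.8, 0.8, n)
width = 0.5 * length + rng.normal(0.0, 0.3, n)

# 2D binning: counts[i, j] is how many pairs fall in the rectangle
# xedges[i] <= length < xedges[i+1], yedges[j] <= width < yedges[j+1].
counts, xedges, yedges = np.histogram2d(length, width, bins=10)
```

matplotlib's `hist2d` (or a 3D `bar3d`) renders these counts as the "city" plots shown on the slide.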
(Refer Slide Time: 04:34)

So, how do we deal with it in probability? This is the idea: the idea is to use something called a joint
density. If you remember, go back to the discrete random variables we studied. When we wanted to
study multiple discrete random variables together, we used the joint PMF, and the joint PMF played
a very crucial role in understanding what the distribution is going to be. Likewise here, when you
want to study two continuous random variables, you have to use something called the joint density.
So, the joint density is a function of two variables.

So, a lot of people say a function of one variable is already a headache, and we are giving two
headaches at the same time. Yes, it is a little bit more complicated. You have to deal with both of
these variables, understand them, do calculations with them; differentiation and integration will
all enter the picture. It is a little bit confusing, but if you take your time with it and spend some
time with it, you will see it is good to appreciate situations where two random phenomena, two
variables, are sort of linked. For picturing that mathematically, for modelling that, the joint density
is an indispensable tool.

So, a joint density function is much like a PDF. We had a density function; it had some properties:
it needed to be non-negative and had to integrate to one. Something very similar you want for a
function of two variables, 𝑓(𝑥, 𝑦). It has to be non-negative; that is one condition. The other
condition is that it has to integrate to one over the two-dimensional plane. With two variables, you
integrate in both directions; it is a double integral. When you do that big integral, you should get
one; it should all just integrate to one. There is also a technical continuity requirement, which we
will sort of assume in this class: at least piecewise continuous in each variable. The first two
conditions are the critical ones: it needs to be non-negative, and it should integrate to one. So, that
is the definition of a joint density. You can see how easily this extends from one dimension; in one
dimension also you had something very similar, and we extend it to two dimensions.

So, technically, this point is just to get rid of the probability space and all that. It turns out that once
you have a joint density, there are two random variables with that joint density which live in some
probability space. So, notice this equation; this equation is very important. You identify any region
in the two-dimensional plane and ask what is the probability that 𝑋 and 𝑌 jointly will fall in that
region. Maybe it is a circle, maybe it is a rectangle; on the 2D plane, you just draw some region.
What is the probability that both my random variables take values so that when you plot (𝑥, 𝑦), you
go into that region? This is the question you can ask, a very reasonable question.

What is the probability that both of them will lie together? What is the probability that the sepal
length and sepal width will be within 4, let us say. So, it is a very reasonable question to ask. And
that probability is obtained by integrating the joint density over that area, over that region. So, that
is, this is easily one of the most important formula involving density.

𝑃((𝑋, 𝑌) ∈ 𝐴) = ∬_𝐴 𝑓(𝑥, 𝑦) 𝑑𝑥 𝑑𝑦

Go back to the one-dimensional, one-random-variable situation. When you have one continuous
random variable, how do you find the probability of the random variable falling inside a certain
interval? You simply integrate the density over that interval. The same thing you do here: you
integrate the density over the region and you get the probability. That is why it is a density; it is
like the probability per unit area. This 𝑓(𝑥, 𝑦) is like that for a small little area: you integrate it
over the region, and you get your actual probabilities. So, that is why it is called the probability
density.

So, a quick word on notation. It is very common: in many books, you will find that when 𝑓(𝑥, 𝑦) is
thought of as the joint density of 𝑋 and 𝑌, people will use the notation 𝑓𝑋𝑌 (𝑥, 𝑦). Now, this 𝑋𝑌 is a
little confusing; a lot of people will think it is 𝑋 times 𝑌. Maybe you want to write (𝑋, 𝑌), but
nobody writes that comma; it is sort of implicit, and you have to get used to that notation. When
you write 𝑓𝑋𝑌 (𝑥, 𝑦), it is immediately clear what it is, so you do not have to say it is the joint
density, etc. That is why this notation is very useful.

One very important thing that we had even with one continuous random variable is the support of
that random variable. The support was all the points where the density was strictly greater than 0.
The same thing we can do here, except now the support will be two-dimensional; it is not just one
line. It is the part of the plane where 𝑓(𝑥, 𝑦) is strictly positive, greater than 0. That part is the place
where the random variables jointly take values, so it is called the support of (𝑋, 𝑌). That is a whole
bunch of definitions; in the next slide, we will see some examples and see how to work with them.
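These two defining conditions are easy to verify numerically for any candidate joint density. A sketch (my own example: f(x, y) = x + y on the unit square, a standard textbook density, not one from this slide):

```python
import numpy as np

# Candidate joint density: f(x, y) = x + y on the unit square, 0 elsewhere.
def f(x, y):
    inside = (0 <= x) & (x <= 1) & (0 <= y) & (y <= 1)
    return np.where(inside, x + y, 0.0)

# Midpoint-rule approximation of the double integral over the support.
h = 1e-3
x, y = np.meshgrid(np.arange(h / 2, 1, h), np.arange(h / 2, 1, h))
vals = f(x, y)
total = vals.sum() * h * h          # should be (very close to) 1
nonneg = bool((vals >= 0).all())    # non-negativity on the grid
```

Both conditions check out: the values are non-negative and the double integral comes out as 1, so this f is a valid joint density.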

(Refer Slide Time: 09:45)

So, here is the simplest example of a joint density, a very commonly used joint density, very
popular, so you should get to know it and understand how to use it. Let us picture this; it is good to
plot it. These are functions of two variables, and three-dimensional plots are very difficult, but what
I like to do is plot the 2D plane, show the support, and sort of visualise the 3D thing as coming out
of the 2D plane. So, here is what you can do for the support. You have x here, y here. So, notice
this density,
1 0 < 𝑥 < 1,0 < 𝑦 < 1,
𝑓𝑋𝑌 (𝑥, 𝑦) = {
0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

So, x goes from 0 to 1 here, and y goes from 0 to 1 here. The support is inside this unit square. So,
it is good to draw a region like this and shade it like this; you just draw these shading lines, and
you get the support. So, this is the support of (𝑋, 𝑌).

So, if you want to picture, in your mind, the 3D plot of the joint density, how will it be? It will be
a square block that comes up to a height of 1; it is a cube sitting there on that support. That is how
this function 𝑓𝑋𝑌 (𝑥, 𝑦) will look if you plot the value on the z axis. You need three axes, because
there are already two variables, and then the function value itself will go on the third axis. So, it
will be a picture like that; picture that in your mind. Any time you have a function like this, always
try to make a mental picture of how it will look on the 2D plane and how it will come up, and that
will give you a lot of clarity.

So, now, remember what we said in the previous slide. Supposing I define a region A. But first, let
us test the integration. First of all, this 𝑓𝑋𝑌 (𝑥, 𝑦) is a valid density: it is non-negative, greater than
or equal to 0, that we can see. And when we integrate, we do not have to integrate over the whole
plane; it is enough to integrate over the unit square:

∫₀¹ ∫₀¹ 1 𝑑𝑥 𝑑𝑦 = 1,

so you will simply get 1.

You can check that is what it will be. You first do the x integral: x goes from 0 to 1, and the integral
of 1 is just 1. Then you do the integration over y, and you again get 1. Think about how this
integration works; it is a little bit difficult if you are seeing it for the first time. You do the x
integration first, so maybe I should write that down:

∫₀¹ ∫₀¹ 1 𝑑𝑥 𝑑𝑦 = ∫₀¹ [𝑥]₀¹ 𝑑𝑦 = ∫₀¹ 1 𝑑𝑦 = [𝑦]₀¹ = 1.

So, this is how you do it, sort of laboriously, one step at a time. Multiple integration is just a
sequence of integrations; there is nothing special about it. You first do the integral over x, treating
everything else as a constant. Here there is no "everything else"; the integrand is just 1, so it is no
big deal. Usually, some integrals can have something more fancy here, but just treat it as a single
integral: ∫₀¹ 1 𝑑𝑥 is just x evaluated between 0 and 1, a definite integral, so you get 1. Then
∫₀¹ 1 𝑑𝑦 is easy to do in the same way. These are two simple one-dimensional integrals. So, you
have checked that this is a valid density.
integrals. So, you have checked that this is a valid density.

The next thing was about integrating to find probability. Supposing you have a region A here,
inside the square. Now the probability of (𝑋, 𝑌) falling in this A is equal to

𝑃((𝑋, 𝑌) ∈ 𝐴) = ∬_𝐴 1 𝑑𝑥 𝑑𝑦.

This is nothing but the area of A. So, if you want to find the probability of A, you simply find its
area, whatever shape it may be. For instance, if A is the square with ½ here and ½ here, then

𝑃(0 < 𝑋 < ½, 0 < 𝑌 < ½) = ½ × ½ = ¼,

which is the area of the smaller square.

So, why did it become the area? Because of this 1: the density is just uniform, so the integral
becomes the area. When you have the uniform density on the unit square, the probability of any
region inside the unit square is simply its area. It is very easy to remember. This is one of the
simplest joint densities you can have. Very few practical models will exactly fit this, but it is good
to know that there are easy joint densities of this fashion. As we go along, we will see slightly more
complicated examples of how to compute probabilities, and I will walk you through those
integrations a little more slowly. But hopefully you see the big picture.

So, how do you generalize this? For an arbitrary joint density, there will be some region in the 2D
plane where the density takes positive values. And if you want the probability of a region inside
that support, you simply integrate the density over that region and you will get it. So, this is how
you deal with joint densities.
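"Probability equals area" is also easy to see by simulation. A sketch (my own code): draw uniform points in the unit square and count how often they land in a region.

```python
import random

random.seed(2)
# (X, Y) uniform on the unit square: P((X, Y) in A) should equal the area of A.
N = 200_000
pts = [(random.random(), random.random()) for _ in range(N)]

# Area of the smaller square [0, 1/2] x [0, 1/2] is 1/4.
p_square = sum(1 for x, y in pts if x < 0.5 and y < 0.5) / N
# Area of the triangle where x > y is 1/2.
p_triangle = sum(1 for x, y in pts if x > y) / N
```

Both empirical frequencies match the corresponding areas to within sampling error.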
(Refer Slide Time: 15:56)

So, here are a whole bunch of problems that have been asked around this joint density. So,
whenever I have a problem like this, I always like to picture it in the 2D plane and then try and
solve it. So, notice, look at the first one: what is 𝑃(0 < 𝑋 < 0.1, 0 < 𝑌 < 0.1)? If you want to draw
it, let me see how to draw it as painlessly as possible; I might have to make multiple pictures like
this, excuse me for that. So, here is 1; this is the full support. And this is asking for X between 0
and 0.1 and Y between 0 and 0.1, so this small square is the area we are interested in. That area is
just 0.1 into 0.1, which is 0.01. So, this probability is 0.01.

And next there is the other area 0.5 to 0.6 and 0 to 0.1. So, maybe I will change the color. 0.5, 𝑋
is between 0.5 and 0.6. I do not know how many colors I can do here. So, that is 0.01 and it is
black in color and then 𝑌 is between 0 and 0.1. So, 𝑌 is between 0 and 0.1 so it will lie in this
thing. So, see here 𝑌, this area, this blue area is 𝑋 lying between 0.5 and 0.6 and 𝑌 lying between
0 and 0.1. And what is the area. The area still remains the same. It is 0.1 into 0.1 is 0.01. So, the
probability that 𝑋 and 𝑌 together lie in this area is 0.01.

And the last one, maybe I should change my color to some green here. So, 𝑋 is between 0.9 and 1, and 𝑌 is between 0.9 and 1, so that will come somewhere here, in the top-right corner. Here also, if you do the calculation, it is going to be 0.1 × 0.1 = 0.01.
The next calculation that is being asked here is 𝑋 between 0 and 0.1, that is it, nothing on 𝑌. So, how do we do this? Let me draw one more set of axes for this, that will help, I think: 0, 1 and 1 here. It says 𝑋 has to be between 0 and 0.1, and I do not care what 𝑌 is, so you take the entire range of 𝑌. 𝑌 can be anything, but 𝑋 is between 0 and 0.1. So, what will be this area? It is just the area of this rectangle, length 1 and width 0.1, so it is 1 × 0.1, that is just 0.1.

The next question asks for 𝑌, let us go to the blue color, 𝑌 between 0.5 and 0.6. It is probably much narrower than what I have drawn, but anyway. So, 𝑌 is between 0.5 and 0.6, and I do not really care what 𝑋 is, so it is this whole horizontal strip. This will also come out as 1 × 0.1, which is 0.1. Do you see that? Do you see how, for the uniform density on the unit square, it is just a question of sketching that region and then simply finding its area? So, now we have more involved things. Let me draw one more picture here.

So, notice, it is just a question of specifying this area 𝐴 in more and more complicated fashion. In the first two examples I simply gave ranges of 𝑋 and ranges of 𝑌, so you could simply draw it. Now I am saying 𝑃(𝑋 > 𝑌). So, how do you draw 𝑋 greater than 𝑌? Maybe I should write here 0, 1, 0, 1.

For 𝑋 greater than 𝑌, you should first draw the 𝑋 = 𝑌 line. So, this is the 𝑋 = 𝑌 line, and this is (1, 1), remember. So, what is the range of values where 𝑋 is greater than 𝑌? It is this area below the line, isn't it? So, now we have a triangle. So, 𝑃(𝑋 > 𝑌) is the area of this triangle, and that is ½ × base × height = ½. So, that is easy to do.

Here is the next problem: 𝑃(𝑋 > 2𝑌). Again, we use the same method; for the uniform on the unit square it is very, very easy. We just take the picture here and then do it. So, I need the line 𝑋 = 2𝑌, and it will be a line like this. The sketch is not to scale, but hopefully you see it. So, this is 𝑋 = 2𝑌, and then I need the area below it. That is going to be ½ × base × height, and that is 1/4.

And finally, I have 𝑃(𝑋² + 𝑌² < 0.25). Note that 0.25 = 0.5². So, from your experience, you may know this is actually a circle; it is going to be a circle of radius 0.5. It will be like this; I am not sure I can draw the circle properly. There you go, roughly, okay. So, it crosses the axes at ½ and ½, and the part inside the unit square is 1/4 of the disc. So, the area is (1/4) × 𝜋 × 𝑟², and that will be 𝜋/16. That is the answer for this probability: 𝑃(𝑋² + 𝑌² < 0.25) = 𝜋/16. So, notice we are now dealing with numbers like 𝜋, which is really great.

So, if you are not sure about this formula, if you did not know that this describes a circle and that the area of a circle is 𝜋 × 𝑟², go back and check how you describe circles involving 𝑋 and 𝑌; the radius here is 1/2. So, hopefully, with the uniform on the unit square, you will be very comfortable doing computations of probability: given any area, you will be able to just find that area and write it down. One can also set up an integral formula for this, but I am not doing it right now; we will do that slowly later on when we need it. So, that is the first set of examples.
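The three answers above (1/2, 1/4 and 𝜋/16) can also be checked by simulation, again using the fact that probability equals area for the uniform on the unit square. A possible sketch:

```python
import math
import random

random.seed(1)
N = 400_000
hits_xy = hits_x2y = hits_circle = 0
for _ in range(N):
    x, y = random.random(), random.random()
    hits_xy += x > y                     # triangle below the line x = y
    hits_x2y += x > 2 * y                # triangle below the line x = 2y
    hits_circle += x * x + y * y < 0.25  # quarter disc of radius 0.5

print(hits_xy / N, "vs", 1 / 2)
print(hits_x2y / N, "vs", 1 / 4)
print(hits_circle / N, "vs", math.pi / 16)
```

Each estimate should land within a fraction of a percent of the exact area for this sample size.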

(Refer Slide Time: 22:50)

So, now I am going to talk about the general 2D uniform distribution. Previously, we had the uniform distribution only on the unit square, and we were able to do computations easily. Now we can extend: instead of having the support be the unit square, you can have any reasonable two-dimensional region. I put "reasonable" in brackets; for our purposes, any region that you can plot and sketch is good enough. So, for any such two-dimensional region, you can simply imagine two random variables 𝑋, 𝑌 having the uniform density on that region. That region could be many, many possibilities.
So, here are some possibilities I have drawn here. The first one is a rectangle: 𝑥 taking values in (𝑎, 𝑏) and 𝑦 taking values in (𝑐, 𝑑). I am drawing it on the positive side just for convenience; it can be anywhere. This rectangle, with 𝑥 between 𝑎 and 𝑏 and 𝑦 between 𝑐 and 𝑑, can be the support. It can also be a circle: that is the equation of a circle centered at (𝑥₀, 𝑦₀), and it is a circle of radius 𝑟. Of course, I am drawing it very poorly, out of scale, but this is the circle. And you can even have multiple disjoint regions.

So, what do I mean by that? I might draw one rectangle here, another circle here and some other bizarre shape here and say, my (𝑥, 𝑦) is uniform in this support. You find the total area of 𝐷, which is the total area of all these regions together, and the density will be 1 divided by that total area whenever you are inside the region. Whenever you are outside the region, there is no support; the density is 0. So, this kind of picture is very important.

So, sometimes when you observe two random variables which you think are continuous, you draw their 2D histogram, and you notice it is roughly of the same height in one area, or in multiple areas like that. Then a good model, at least a starting model, is to say the distribution is uniform in that area. So, it is a good thing to think of. This is the 2D uniform distribution: a very easy density, which is just 1 divided by the total area of the support that you have.
(Refer Slide Time: 25:28)

So, the formula for any sub-region is now very easy for the uniform distribution. Instead of the unit square, suppose you have an arbitrary region 𝐷, and you define a sub-region 𝐴. Then 𝑃((𝑋, 𝑌) ∈ 𝐴) = |𝐴|/|𝐷|. I am denoting area with the absolute-value symbol; you can use other notations if you like. So, it is basically the area of 𝐴 divided by the area of 𝐷. Because the density is uniform, this works out. So, if you have a roughly flat histogram without too much variation, then uniform is a very good model to use.
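One way to see the |𝐴|/|𝐷| formula in action is to generate uniform points on a region 𝐷 by rejection sampling and count how often they fall in a sub-region 𝐴. A hedged sketch, with 𝐷 a disc and 𝐴 its right half, so that |𝐴|/|𝐷| = 1/2 (the function name is my own):

```python
import random

random.seed(2)

def sample_uniform_disc(r=1.0):
    """Uniform point on the disc of radius r: draw from the bounding
    square and reject points that fall outside the disc."""
    while True:
        x = random.uniform(-r, r)
        y = random.uniform(-r, r)
        if x * x + y * y <= r * r:
            return x, y

N = 100_000
# Event A: the point lands in the right half of the disc (x > 0).
# By the area formula, P = (pi r^2 / 2) / (pi r^2) = 1/2.
hits = sum(sample_uniform_disc()[0] > 0 for _ in range(N))
print(hits / N)  # ~0.5
```

The same rejection idea works for any region you can test membership in, including multiple disjoint pieces.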

(Refer Slide Time: 26:18)


So, I have a couple more problems. Let us just quickly work through them, and that will conclude this lecture on jointly continuous random variables. Here I have defined a more complicated-looking region 𝐷, and I am saying 𝑋 and 𝑌 are uniform in it. I want you to sketch the support and compute probabilities of two areas: a very basic problem, and an important skill to pick up, at least with uniform distributions.

You should be able to pick up this skill: given a region, describe it mathematically, sketch it and then compute some areas. Maybe you can do it pictorially, but sometimes being able to do the calculation is very important, and that is a very good skill to have in general.

So, let us sketch this. The sketch is: 𝑥 > 0, 𝑦 > 0, so I like to draw it like this, and 𝑥 + 𝑦 < 2. So, I need to draw 𝑥 + 𝑦 = 2, and that will be a line like this, very symmetric. So, that is the line; this is 2, this is 2, and this is your support. This line is 𝑥 + 𝑦 = 2.

So, that is the line. And what is your density?

𝑓𝑋𝑌(𝑥, 𝑦) = 1/Area(𝐷) = 1/2 for (𝑥, 𝑦) ∈ 𝐷, and 0 otherwise.

What is the area? The total area is ½ × base × height = ½ × 2 × 2 = 2. So, the density is going to be 1/2 inside this triangular area and 0 otherwise. So, this is the density function. That is good to know.

So, the total area is 2. Now, 𝑃(𝑋 + 𝑌 < 1): how do you get 𝑥 + 𝑦 < 1? You can draw the line 𝑥 + 𝑦 = 1. The area inside this triangle is ½ × 1 × 1, that is just 1/2. So, 𝑃(𝑋 + 𝑌 < 1) = (1/2)/2 = 1/4: the total area is 2, this area is 1/2, and 1/2 divided by 2 is 1/4. See, that is the nice thing about the uniform distribution. You do not have to do any integration; it is just an area calculation, simple enough that you can do it.

The next one is 𝑃(𝑋 + 2𝑌 > 1). Let me do this calculation in blue; it may get a little bit more complicated. So, where does the line 𝑥 + 2𝑦 = 1 lie? When 𝑌 is 0, 𝑋 is equal to 1, and when 𝑋 is 0, 𝑌 is equal to 1/2. So, the line starts at 1 on the 𝑥-axis and goes up to 1/2 on the 𝑦-axis. And I want 𝑥 + 2𝑦 > 1, so it is going to be all this area above the line.

So, how do you find the area in blue? It is best found as the overall area, which is 2, minus the small triangle area, which is 1/2 × 1 × 1/2 = 1/4. So, it is 2 − 1/4 = 7/4. That is the blue area. So, 𝑃(𝑋 + 2𝑌 > 1) = (7/4)/2 = 7/8: quite a high probability.

So, notice how I did this. Since it is just uniform and the regions are separated only by straight lines, it is easy to simply find the areas and then divide to get the probability of lying inside a smaller region. This works for uniform; it does not work for non-uniform. So, uniform is very, very easy.
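The two answers above (1/4 and 7/8) can be checked the same way, by generating uniform points on the triangular support 𝐷 via rejection sampling. A rough sketch:

```python
import random

random.seed(3)

def sample_triangle():
    """Uniform point on D = {x > 0, y > 0, x + y < 2}: draw from the
    bounding square [0, 2] x [0, 2] and reject points above the line."""
    while True:
        x, y = random.uniform(0, 2), random.uniform(0, 2)
        if x + y < 2:
            return x, y

N = 200_000
pts = [sample_triangle() for _ in range(N)]
p1 = sum(x + y < 1 for x, y in pts) / N      # area 1/2 over total area 2
p2 = sum(x + 2 * y > 1 for x, y in pts) / N  # area 7/4 over total area 2
print(p1, "vs", 1 / 4)
print(p2, "vs", 7 / 8)
```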

(Refer Slide Time: 31:07)

So, the next example we are going to see is a non-uniform density. It is the first time we are seeing a non-uniform density, so let us spend a little bit of time understanding what it is. The function that is given to you is 𝑓𝑋𝑌(𝑥, 𝑦) = 𝑥 + 𝑦 for 𝑥 and 𝑦 lying between 0 and 1. Notice, it is not 1 divided by the area or something. You have to show that the above is a valid density. The first thing you can check is that 𝑓𝑋𝑌 ≥ 0: 𝑥 and 𝑦 are between 0 and 1, so 𝑥 + 𝑦 is definitely greater than or equal to 0. The slightly difficult thing to check is the integration.
So, 𝑥 and 𝑦 are between 0 and 1. I always like to plot the support, so let me just do this. It is easy to check 𝑓𝑋𝑌 ≥ 0, and if you plot the support, it is the unit square, between 0 and 1 in both coordinates. This is the support, but the density is not uniform on it; the density is sort of a three-dimensional picture.

So, if you want to picture the three-dimensional surface 𝑥 + 𝑦: when 𝑥 and 𝑦 are both 0, 𝑥 + 𝑦 = 0; when 𝑥 and 𝑦 are both 1, 𝑥 + 𝑦 = 2. So, it goes like an inclined plane from height 0 at (0, 0) to height 2 at (1, 1). And when 𝑥 is 1 and 𝑦 is 0, it is at 1, and likewise on the 𝑦-axis also, it is at 1. So, it is sort of a strange inclined plane going in a different direction, and it is hard to picture.

You can try plotting it in three dimensions in Python or something, and you can picture it a little bit. But there is a limit to how much you can picture these things in 3D. Still, it is good to have a mental picture in your mind: it starts at 0, and it is inclined, not flat; it goes up. So, 𝑥 and 𝑦 together are much more likely to lie close to (1, 1) than to (0, 0). That is the sort of density this is. Once the density changes, you see everything shift dramatically. So, how do you show it is a valid density? You need to integrate: you integrate 𝑥 from 0 to 1 and 𝑦 also from 0 to 1, so there is no problem in that range, 𝑥 + 𝑦 𝑑𝑥 𝑑𝑦.

∫₀¹ ∫₀¹ (𝑥 + 𝑦) 𝑑𝑥 𝑑𝑦

So, let us look at the inner integration. In that integration, I can treat 𝑦 as a constant; 𝑦 is like 2 or 3 or any number, some constant 𝐶. So, I have to do the integration ∫₀¹ (𝑥 + 𝑦) 𝑑𝑥, first over 𝑥 and then over 𝑦. What is the integral of 𝑥? It is 𝑥²/2, and 𝑦 is a constant, so its integral will be just 𝑦𝑥. So, the integration gives 𝑥²/2 + 𝑦𝑥, and I have to substitute between 0 and 1. If I substitute 1, I get 1/2 + 𝑦; if I substitute 0, I get 0; so it is 1/2 + 𝑦.

So, the entire inside integration becomes 1/2 + 𝑦. Now I compute ∫₀¹ (1/2 + 𝑦) 𝑑𝑦, and this is just a direct integration: the integral of (1/2 + 𝑦) is 𝑦/2 + 𝑦²/2, and between 0 and 1 that is just 1. So, this gives you a check that this is a valid two-dimensional density: over two dimensions, it needs to have a total integral of 1. This is sort of like the volume under the surface 𝑥 + 𝑦 over this region; that is how we get this 1. It is not a cube or anything, so it is not easy to compute the volume by inspection. You have to do the integration carefully.
So, this is how you do it. Notice, there is no major integration involved here: you do the inner integral first and then the outer integral. If you just go step by step, it will come out quite easily. Hopefully, it is not too scary.

So, now, notice the probability here. The first probability asks you to find 𝑃(𝑋 < 1/2, 𝑌 < 1/2). So, let me just block out that region; I am trying to demarcate the area so that I can work on it. If you notice, this is the region, and I have to integrate the same density over that region: ∫₀^(1/2) ∫₀^(1/2) (𝑥 + 𝑦) 𝑑𝑥 𝑑𝑦.

So, I do the same integration. The inside integration goes first, and I retain the outside integration; I am not going to touch it now. For the inside integration of 𝑥 + 𝑦, remember, 𝑦 is treated as a constant, so the integral will just become 𝑥²/2 + 𝑦𝑥, substituted between 0 and 1/2, and this whole thing I will then integrate over 𝑦. You can see how I am writing this: I am collapsing the steps that I had before, treating that whole inner integral and then finishing it off.

So, this integral, if you put in 1/2, gives 1/8 + 𝑦/2, and that I have to integrate between 0 and 1/2. Once again, the integral is going to be 𝑦/8 + 𝑦²/4, between 0 and 1/2, and that will be 1/16 + 1/16 = 2/16 = 1/8. So, notice how 𝑋 and 𝑌 are likely to take higher values together: the probability that they are both between 0 and 1/2 is only 1/8. That is the way the sloping goes; it starts at 0, so lower values are taken with much less probability than higher values. This clearly brings out the non-uniform nature of the density. If it had been uniform, the answer would be just 1/4, isn't it? It has gone down to 1/8: a much lower probability of taking smaller values.

What is the next one? The next one is a bit of a pain. It asks for 𝑃(𝑋 + 𝑌 < 1). So, let us use blue color here. 𝑥 + 𝑦 < 1 is this blue area, and this line is 𝑥 + 𝑦 = 1. Now, if you want to set up 𝑃(𝑋 + 𝑌 < 1) as a double integral, you will see that it needs some care. So, I have 𝑥 here, 𝑦 here, and I have this region 𝑥 + 𝑦 < 1, and I want to integrate only over it.
The way I usually do it is to keep the outside integral as the 𝑦 integral, and the outside integral goes from 𝑦 = 0 to 1. Forget that the 𝑥-axis even exists: on the 𝑦-axis, I go from 0 to 1, so the outer integral can go from 0 to 1. Now fix one particular value of 𝑦. When you enter the inner integral and look at the range of values of 𝑥, you have to fix that particular value of 𝑦, draw a line parallel to the 𝑥-axis, and see where it hits the boundaries.

On one side, it hits the boundary at 𝑥 = 0. Where does it hit the boundary on the other side? That point will be (1 − 𝑦, 𝑦), because that line is 𝑥 + 𝑦 = 1: if 𝑌 is 𝑦, then 𝑥 will be 1 − 𝑦. So, it hits the boundary on one side at 𝑥 = 0 and on the other side at 𝑥 = 1 − 𝑦. This part is one of the trickiest parts of two-dimensional integration: how to set up the limits of the integral for a particular area. In this case, I wanted the area 𝑥 + 𝑦 < 1, so you draw this triangle and then slowly set up the limits, one at a time.

How do you do that? You first start with one of the variables, 𝑦 going from 0 to 1, then you fix a particular value for 𝑦 and go into the inner integral. When you fix a particular value of 𝑦, draw a line parallel to the 𝑥-axis and see where it hits the boundaries; there are two boundaries in this region. It hits the first boundary at 𝑥 = 0 and the second boundary at 𝑥 = 1 − 𝑦, so you simply integrate 𝑥 from 0 to 1 − 𝑦. Everything else remains the same: (𝑥 + 𝑦) 𝑑𝑥 𝑑𝑦.

So, hopefully, you see this. Now, I keep the outer integral from 0 to 1; the inner integral is going to be 𝑥²/2 + 𝑦𝑥, by now you should be familiar with this, substituted between 0 and 1 − 𝑦. So, substituting 1 − 𝑦, it will be (1 − 𝑦)²/2 + 𝑦(1 − 𝑦). One needs to simplify this, so let us do that: you will get ∫₀¹ [(1/2)(1 − 2𝑦 + 𝑦²) + 𝑦 − 𝑦²] 𝑑𝑦. The constant term is 1/2; the −𝑦 and +𝑦 will cancel, so the 𝑦 term disappears; and we will have +𝑦²/2 − 𝑦² = −𝑦²/2. So, the integrand is 1/2 − 𝑦²/2, and now we are in familiar territory. The integral is 𝑦/2 − 𝑦³/6 between 0 and 1, and that is just 1/2 − 1/6, which is 1/3.

So, that is the calculation. It took some time and effort, but look at the nice result it gave you. The support is the unit square, but the density is non-uniform. So, if you divide the area into two halves and take one half of the area, you actually get a probability of 1/3. That again shows you how non-uniform this distribution is. This calculation is considered a basic skill in this area: when you study two-dimensional continuous densities, joint densities, you are given the density description and a sub-area description, and you should be able to quickly set up an integral and do the calculation. As far as this course is concerned, we will not ask you very difficult integration or anything like that.

Even this integration, I would say, ranks as slightly more difficult than what I would like, but at least you should be able to set up a numerical algorithm to compute it. Doing it by hand is a good skill; if you have it, that is great, but if you cannot, you should at least be able to set up a computer to give you the value of the numerical double integration. If you can do that, that is fine. So, hopefully, this problem was useful for you.

So, that concludes the study of joint densities. We saw uniform densities, and we saw one example of a non-uniform density; we will see many more examples as we go along. There are a few skills here to pick up: given a description of the density, how to draw the support and how to do some probability calculations, etc., an important skill. Equally important is, given data and its 2D histogram, how to think about where the support may be and what kind of continuous distribution fits it; that is also a very useful skill to develop. Thank you very much. We will meet you in the next lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department,
Indian Institute of Technology Madras
Lecture 6.4
Marginal Densities and Independence
(Refer Slide Time: 00:12)

Hello, and welcome to this lecture. We are going to be discussing marginal densities and independence for jointly continuous random variables. So far, we have been looking at two random variables which are both continuous and have a joint distribution; we looked at the joint density, how it looks, how to picture it, how to think of it, and how to do computations with it, and we saw some specific examples.
So, we are not going to go into too much detail on this, but the main basic definitions you should be familiar with, and the skills that I emphasize in the problems you should practice and make sure you get.
So, let us go back to the discrete case. We looked at the joint PMF and then we had marginal PMFs
and then we had conditions on independence of random variables and then we had conditional
PMFs. And all of that was very useful. And at least, we had to pick up the skills of working with
those PMFs and doing the computations. Something very similar is also there in the continuous
case. So, what we are going to do in the next couple of lectures is to talk about marginal densities,
independence and conditional densities for continuous random variables. So, let us get started.
(Refer Slide Time: 01:37)
So here is the definition of a marginal density. These are all definitions; I will state them as theorems, but you can take them as definitions. The way we are doing it in this course is just to state what the result is and then get used to calculations with it. That is what we will do.

So let us say you have two random variables, 𝑋 and 𝑌, having a joint density 𝑓𝑋𝑌(𝑥, 𝑦). Then, if you actually go through the definition, you can show that 𝑋 is a continuous random variable with a density of its own, the marginal density as we call it, denoted 𝑓𝑋(𝑥), and it is given by this integral. So maybe I should write out this integral in some detail for you.

So, 𝑓𝑋(𝑥) = ∫_{𝑦=−∞}^{∞} 𝑓𝑋𝑌(𝑥, 𝑦) 𝑑𝑦, and this integral is over 𝑦. So, basically, you take the joint density and integrate all the 𝑦 out, in some sense. Imagine this is how you read these formulae: you fix a particular value of 𝑥, and that 𝑥 is fixed here, like a constant. In this integral, 𝑦 is the variable and 𝑥 is like a constant, isn't it? That is how you think of it. For that particular value of 𝑥, you integrate out 𝑦.

So, if you want a picture, here is a picture. Maybe this is some supp(𝑋, 𝑌). You have a two-dimensional joint density; it is difficult to draw. The support is here, and at every point of it, the density has some value. You have to think of a 3D picture that has a support like this.

So, now you fix a particular 𝑥 and then go over 𝑦: technically, 𝑦 goes from minus infinity to infinity, but only this part matters. You integrate 𝑓𝑋𝑌(𝑥, 𝑦). Imagine a 3D picture, sort of like a big hill or a mountain with this base. At a particular 𝑥, you look at all the possible 𝑦, so 𝑦 goes from here to here.
So, you integrate over this length. What do you integrate? You integrate the joint density. It is like this big mountain here, and you take a slice of it at 𝑥, so you get a one-variable function: 𝑥 is fixed, 𝑦 alone varies, so you only have a single-dimensional function. You integrate that entirely, from wherever it is non-zero, beginning to end, and you get a function of 𝑥.
Now, why is this a function of 𝑥? As you vary 𝑥, you put one 𝑥 here and you get an integral from here to here; you put another 𝑥 there and you get a different integral. You can literally imagine that the slice is moving: as 𝑥 changes, the slice moves, and at every slice you have a different function 𝑓𝑋𝑌(𝑥, 𝑦) of 𝑦. 𝑌 alone varies, and when you integrate over the support of 𝑦, you get a value. That value is your function 𝑓𝑋(𝑥).
So, I will show you this in particular examples. In this abstract setting, maybe it is a bit confusing, but you can imagine it; that is how you read this integral. When we see examples, it will become a little clearer.

So, this 𝑓𝑋(𝑥) is a valid density. I have not really proven it; that is why I have called it a theorem. And it ends up being the marginal density of 𝑋: 𝑋 and 𝑌 have a joint distribution, so 𝑋 will have a distribution of its own if you do not care about 𝑌, and when you integrate the joint density out like this, you get that marginal density.

The same story goes for 𝑌. You just flip 𝑥 and 𝑦: instead of taking slices along one direction, you take slices along the other direction and vary 𝑦, and that gives the marginal density of 𝑌. So, basically, these individual densities of 𝑋 and 𝑌 are called the marginal densities.

And the important point is, once you think of a joint density for 𝑋 and 𝑌, from that you can derive
specific marginal densities for 𝑋 and marginal density for 𝑌. So, both of these you can do if you
start with joint density. So, joint density exactly determines the two marginal densities. So that is
the main thing to remember.
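The "fix 𝑥, integrate out 𝑦" recipe is just as easy to do numerically. A sketch using SciPy's one-dimensional `quad`, applied, for illustration, to the non-uniform density 𝑓𝑋𝑌(𝑥, 𝑦) = 𝑥 + 𝑦 on the unit square from the previous lecture; analytically its marginal is 𝑓𝑋(𝑥) = 𝑥 + 1/2 (assumes SciPy is installed):

```python
from scipy import integrate

# Joint density on the unit square (zero elsewhere).
f = lambda x, y: x + y

def marginal_x(x):
    """f_X(x): fix x and integrate the joint density over all y in the support."""
    val, _ = integrate.quad(lambda y: f(x, y), 0, 1)
    return val

for x in (0.0, 0.25, 0.5, 1.0):
    print(x, marginal_x(x))  # matches x + 1/2
```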
(Refer Slide Time: 06:42)
Let us see some examples. There are two examples here, and I have put a theme here, which is that marginals do not determine the joint. I will come to this later; do not worry about that too much. Let us just start with these two examples and make sure we can do this kind of integration. The easiest one we have seen is uniform on the unit square. How does it look? You have 0, 1, 1, and this is your support, isn't it? So, this is uniform on the unit square.

So, your 𝑓𝑋𝑌(𝑥, 𝑦) has the value 1 in this region, in this unit square. It is like a cube of height 1 that you put on top of the square: at every point inside the square, the density is 1, and at every point outside the square it is 0. So, imagine that, and keep that imagination; it is very important.

So, if I have to find 𝑓𝑋(𝑥), I have to fix a particular 𝑥, let us say 𝑥₀. You fix a particular 𝑥₀ and then draw a line like this. And then imagine plotting the slice of the cube along the 𝑦-axis. How will that look? At 𝑥₀, with 𝑦 going from 0 to 1, it will be 0 here, jump up to 1 here, and then come back down. So, 𝑥 equals 𝑥₀, and I have taken a slice.

So, this is the picture of the slice at 𝑥 = 𝑥₀. I have taken that slice at 𝑥 = 𝑥₀ and drawn it out for you; I have just rotated it and drawn it in this fashion. You can start from 𝑦 being less than 0: there it will all be 0. Between 𝑦 = 0 and 1, it suddenly rises up to 1 and then comes back down. From this picture, the integration is very easy. The integral over 𝑦 technically runs from minus infinity to infinity, but the density is 0 whenever 𝑦 is outside of 0 and 1. Maybe I should write down what this slice is: it is 𝑓𝑋𝑌(𝑥₀, 𝑦).

So that is the plot here. So, 𝑓𝑋𝑌 (𝑥0 , 𝑦). So, that is the slice at 𝑥 = 𝑥0 . So, 𝑓𝑋𝑌 (𝑥, 𝑦) is this big cube
sort of function. I am taking a slice. So, my 𝑥 gets fixed at 𝑥0 , 𝑦 varies from minus infinity to
infinity and I get one nice little plot like this. So, this is 𝑓𝑋𝑌 (𝑥0 , 𝑦).
So, to get 𝑓𝑋(𝑥₀), I have to integrate this from 0 to 1:

𝑓𝑋(𝑥₀) = ∫_{𝑦=0}^{1} 𝑓𝑋𝑌(𝑥₀, 𝑦) 𝑑𝑦

𝑓𝑋𝑌(𝑥₀, 𝑦) is just 1, and 1 integrated from 0 to 1 gives 1. This is true for any 𝑥₀ between 0 and 1. So, what is this density? It is nothing but Uniform(0, 1), isn't it? So, this is the density 𝑓𝑋(𝑥₀). You can see this picture here; I hope you can sort of see this.

So, for every 𝑥0 you get 1 and as you vary 𝑥0 between 0 and 1 so you take another 𝑥0 here, another
𝑥0 here, wherever you take a slice, you will see, you will get the same picture. The picture remains
exactly the same. The slice is the same thing. Wherever you slice it, you get the same thing and
the integral will always be 1. So, the marginal density of 𝑋 is actually uniform from 0 to 1.

You can do the same calculation for 𝑌 and you will get that 𝑌 is also marginally uniform (0, 1).
So, hopefully this example sort of illustrated how you go about doing this marginalization
calculation. So, if you remember before, in the discrete world what did we do, we took the joint
PMF and added over all possible values of 𝑦 to get the marginal of 𝑋. So, you summed over all
possible values of 𝑦 to get the marginal of 𝑋. So, here the summation gets replaced by integration.
So, instead of summation, you get the integration, integrate over all 𝑦, fix 𝑋 at a particular value
𝑥0 , you get the marginal density at 𝑥0 . So, the same story.
So, here is the second example. The first example was okay; the second example is a little bit tricky. (𝑋, 𝑌) is uniform in 𝐷, but 𝐷 is not your standard square. So, what is this 𝐷 area? Maybe I should do this whole thing in blue for you to differentiate between the two examples.

So, this is 𝑋 and 𝑌. The region 𝐷 is going to be [0, 1/2] × [0, 1/2], the first Cartesian product here, together with this other one, [1/2, 1] × [1/2, 1]: 𝑥 is between 1/2 and 1, and 𝑦 is also between 1/2 and 1. So you will get something like this.

I am not drawing this to scale; of course, if you draw it to scale, you will get this. Notice the difference between the support for the first one, uniform on the unit square, and the support for the second: it is [0, 1/2] and then [1/2, 1], two different squares like this. So, what is 𝑓𝑋𝑌(𝑥, 𝑦)? These two squares together are 𝐷. What is the area of 𝐷?

Area(𝐷) = 1/4 + 1/4 = 1/2

So, 𝑓𝑋𝑌(𝑥, 𝑦) is going to be 2 for (𝑥, 𝑦) ∈ 𝐷 and 0 otherwise.


So, it is uniform again, but it is narrower and taller compared to the cube that you had before on the unit square. It is a slightly different joint density; it is not the same as the uniform on the unit square. So, now, let us do the slicing here.
1
So, if you do the slicing, notice what happens. So, if you slice here, 𝑥0 between 0 and a 2, you are
going to get the slice looking like this. So, you go for 𝑦 and then 𝑓𝑋𝑌 (𝑥0 , 𝑦), it will be 0 up to 0,
1
and then it will go up to 2 and then fall at 2 and go down to 0. So, this is what the slice will look
1 1
between 0 and a 2. So, this is for 𝑥0 in [0, 2].

So, if you look at this interval, then your 𝑓𝑋 (𝑥0 ) is going to be integral of this function from 0 to
1 1
that is just the area under this curve, so it is 2 × , it is 1 again. Now, let us go to 𝑥0 here. So, if
2 2
you look at 𝑥0 here or maybe I will call it 𝑥1 here, some 𝑥1 here and the slice now will look a bit
1
different 𝑓𝑋𝑌 (𝑥1 , 𝑦). It will be 0 up to a 2 and then it will climb up, go all the way up to 2 and then
fall at 1. So, this is how the slice will look at that point. But nevertheless, it does not matter for any
1
𝑥1 in [2 , 1].𝑓𝑋 (𝑥1 ) is also 1.

So, notice what happens here. 𝑓𝑋 (𝑥) is actually 1 for 𝑥 between 0 and 1, so 𝑋 is actually Uniform(0, 1). The same sort of calculation will show you 𝑌 is also Uniform(0, 1). It is a very simple example, and I went through it in good detail for you, so hopefully you can see and understand how I got these calculations done. But notice this amazing little result.

You had two different joint densities, clearly very different: one has the entire unit square as its support, while here there are two smaller squares, [0, 1/2] × [0, 1/2] and [1/2, 1] × [1/2, 1], smaller areas, so the density is a bit taller. The joint density is completely different.

For instance, in the first case, you can have values around 𝑥 = 0, 𝑦 = 1; your random variable can take values around that point. In the second case, this is simply not possible: the random variable will not take values around 𝑥 = 0 and 𝑦 = 1; those points are not taken at all. So, it is a very different random variable. But if you only look at the marginal densities, what is the story that you get? They are the same: 𝑋 is Uniform(0, 1), 𝑌 is Uniform(0, 1).
So, this is an important thing to understand and appreciate: even when you do modeling with data, marginal densities alone are not enough to determine the joint density. The joint density can have different shapes and still show you the same marginal densities. You have to be aware of this and worry about it, and make sure that your modeling of the joint is good enough even when the marginals are given. So, this is an important distinction to remember; these kinds of cases can happen.
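One way to see this concretely is a quick computer simulation, of the kind we do in this course. Here is a minimal Python sketch (the sampler and variable names are my own, not from the lecture): sample from the two-squares joint density and check that the marginal of 𝑋 still looks Uniform(0, 1), even though the joint is clearly not uniform on the unit square.

```python
import random

random.seed(0)

def sample_two_squares():
    # The joint puts mass 1/2 on each of the squares [0,1/2]^2 and [1/2,1]^2
    # (each has area 1/4 and density 2): pick a square, then a point in it.
    lo = 0.0 if random.random() < 0.5 else 0.5
    return lo + 0.5 * random.random(), lo + 0.5 * random.random()

n = 100_000
xs = [sample_two_squares()[0] for _ in range(n)]

# If the marginal of X is Uniform(0, 1), then P(X <= t) should be close to t.
for t in (0.25, 0.5, 0.75):
    print(t, round(sum(x <= t for x in xs) / n, 2))
```

The same check on 𝑌 gives the same story, even though the joint here is very different from the uniform on the unit square.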
(Refer Slide Time: 17:44)
So, there are more examples here; I will do maybe one or two of these. There are three different examples; let me see how quickly I can do this. Maybe one I will do in some detail, and the remaining I will go through at a slightly faster pace. So, let us take the first one. The first one has support region [1, 3] × [0, 4]; it is again uniform, and uniform is sort of easier to handle. So, let us just take that.

So, 𝑋 goes from 1 to 3, and 𝑌 from 0 all the way to 4. So, you have this support region. What is the value of 𝑓𝑋𝑌 ? The area here is 2 × 4 = 8, so 𝑓𝑋𝑌 is going to be 1/8 inside the support, is it not? So, this much is the first step: try and plot the support of the joint density and find its value inside the support. It is uniform, so it is easy enough to do.

So, now if you want to do 𝑓𝑋 (𝑥), you have to slice, and the most interesting slices are for 𝑥 between 1 and 3; outside of [1, 3] the marginal goes off to 0. So now you have to sort of mentally picture the slice; I am not going to draw it for you. It has height 1/8 from 𝑦 = 0 to 𝑦 = 4 and falls to 0 beyond. So, the height is 1/8 and the length is from 0 to 4, so the integral is going to be 1/2 for 𝑥 between 1 and 3.

So, I can quickly identify that this is just Uniform[1, 3]. From the support itself, it is sort of clear that this has to be Uniform[1, 3]; once you see the joint density, you will quickly see that 𝑋 will be uniform. The same thing you can do for 𝑌: with a little bit of work, you will show 𝑌 is Uniform[0, 4]. It is easy enough.

So, let us do the second one. The support is [0, 1] × [0, 1] together with [1, 2] × [0, 2]; over [1, 2] it goes all the way up to 2. So, this is the support. Here again, let us find 𝑓𝑋𝑌 . The area is 1 + 2 = 3, so 𝑓𝑋𝑌 is 1/3. Now if you want to do 𝑓𝑋 (𝑥), you can notice quickly that there are two slices that are different: you take a slice here and a slice here.

So, the first slice will just give you a height of 1/3 from 𝑦 = 0 to 1; you get 𝑓𝑋 (𝑥) = 1/3 for 𝑥 between 0 and 1. And for 𝑥 between 1 and 2, the height is again 1/3, but the slice goes all the way to 2, so you get 2/3 between 1 and 2. So, if you want to plot this, you can sketch it: 𝑓𝑋 (𝑥) is 1/3 on (0, 1) and 2/3 on (1, 2). That is how the marginal density looks.

You can do a very similar calculation for 𝑓𝑌 (𝑦): you will get 2/3 for 𝑦 between 0 and 1 and 1/3 for 𝑦 between 1 and 2. So that is how the distribution looks. So, hopefully you see this: as the support varies and the height of the uniform changes, you have to just carefully do the calculation and you will get the answer.
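The slicing argument can also be checked by brute-force numerical integration. A minimal sketch, with helper names of my own choosing:

```python
# Numerically integrate the joint density over y to recover f_X(x), for the
# uniform density on [0,1]x[0,1] together with [1,2]x[0,2], where f_XY = 1/3.

def f_xy(x, y):
    in_first = 0 <= x <= 1 and 0 <= y <= 1
    in_second = 1 < x <= 2 and 0 <= y <= 2
    return 1 / 3 if (in_first or in_second) else 0.0

def marginal_x(x, n=10_000):
    # Midpoint-rule Riemann sum of f_XY(x, y) over y in [0, 2]
    dy = 2 / n
    return sum(f_xy(x, (k + 0.5) * dy) * dy for k in range(n))

print(round(marginal_x(0.5), 3))  # expect 1/3 ≈ 0.333
print(round(marginal_x(1.5), 3))  # expect 2/3 ≈ 0.667
```

The same helper, with the support region edited, reproduces the other uniform examples as well.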
So, the third one I am going to do is again uniform, but in this case it is not a rectangular shape; it is sort of like a triangular shape. So, let us try that: 𝑥 > 0, 𝑦 > 0 and 𝑥 + 𝑦 < 2, so the support is the triangle with corners at 2 on each axis. The area is half × base × height = (1/2) × 2 × 2 = 2, so 𝑓𝑋𝑌 will be 1/2 inside the support.

So, if you are going to take a slice here, now things get a bit interesting. This line, remember, is 𝑥 + 𝑦 = 2. So, if you take a slice at 𝑥, the point where the slice ends in 𝑦 is going to be 2 − 𝑥.

So, you see that it is 2 − 𝑥. If you do 𝑓𝑋 (𝑥), you have to integrate 𝑦 from 0 to 2 − 𝑥. At a particular 𝑥, you integrate from 𝑦 = 0 to 𝑦 = 2 − 𝑥, and the height is just 1/2. So,

𝑓𝑋 (𝑥) = ∫ from 𝑦 = 0 to 2 − 𝑥 of (1/2) 𝑑𝑦 = 1 − 𝑥/2, for 0 < 𝑥 < 2

So, that is your marginal density. You can plot it if you like: from 0 to 2, 1 − 𝑥/2 starts at a height of 1 and falls linearly to 0 at 2. So, it is a non-uniform density on (0, 2).

So, you can do 𝑓𝑌 (𝑦) also. If you look at a particular 𝑦,

𝑓𝑌 (𝑦) = ∫ from 𝑥 = 0 to 2 − 𝑦 of (1/2) 𝑑𝑥 = 1 − 𝑦/2, for 0 < 𝑦 < 2

So, this third calculation may be a little bit unsettling; think about how I did it. The steps are again the same. The first step is to draw the joint density and figure out its height; it is after all uniform, so it is just a question of finding the area. Then fix a particular 𝑥 and take that slice; look at that slice closely.

The only thing you worry about is where that slice starts and where it drops. And you can see that as 𝑥 varies, the slice width also varies; that is the only thing that is different here. In the previous cases, it was easier to see the slices. Here the slice changes because the boundary depends on 𝑥: as you change 𝑥, the boundary changes continuously. It is 𝑥 + 𝑦 = 2, so 𝑦 goes up to 2 − 𝑥. Identify that carefully and integrate it out, and you will get the answer. So, this was uniform densities and their marginals.
(Refer Slide Time: 25:30)

So, here is a general problem where you do not have a uniform density. 𝑥 and 𝑦 are between 0 and 1 and the density is 𝑥 + 𝑦. You remember this joint density; it is a sloping surface, going from height 0 at (0, 0) up to height 2 at (1, 1). Here, now, I am not going to bother with too much drawing and all that. Let us just do 𝑓𝑋 (𝑥). For every 𝑥, 𝑦 is going to go from 0 to 1; there is no problem there, because the support is the entire rectangular area. So we just integrate (𝑥 + 𝑦)𝑑𝑦, with 𝑥 treated like a constant while 𝑦 varies. So,

𝑓𝑋 (𝑥) = ∫ from 𝑦 = 0 to 1 of (𝑥 + 𝑦) 𝑑𝑦 = [𝑥𝑦 + 𝑦²/2] from 0 to 1 = 𝑥 + 1/2

So, how does this look? Let us just plot it to make sure we did not mess up. 𝑥 varies between 0 and 1, and the density is 𝑥 + 1/2, so it starts at 1/2 and goes up to 3/2. (This is not to scale, I am sorry.) So, this is your density function. Is it a valid density function? You can check: the triangle on top has an area of 1/2, and the rectangle at the bottom also has an area of 1/2, so the total area is 1. It is a valid density. So, valid densities are so interesting; so many different ones are there.
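The validity check above can be done numerically as well. A small sketch that recovers 𝑓𝑋 (𝑥) = 𝑥 + 1/2 by a midpoint-rule integral and confirms it integrates to 1 (the helper names are mine):

```python
# Check that the marginal of f_XY(x, y) = x + y on the unit square is
# f_X(x) = x + 1/2, and that this marginal integrates to 1 on (0, 1).

def f_xy(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

def riemann(f, a, b, n=1_000):
    # Midpoint-rule integral of f over [a, b]
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) * h for k in range(n))

def f_x(x):
    # Marginal: integrate the slice at x over y
    return riemann(lambda y: f_xy(x, y), 0, 1)

print(round(f_x(0.25), 3))            # expect 0.75 = 0.25 + 1/2
print(round(riemann(f_x, 0, 1), 3))   # expect 1.0
```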

So, if you do 𝑓𝑌 (𝑦), the whole thing is so symmetric that you are going to get

𝑓𝑌 (𝑦) = ∫ from 𝑥 = 0 to 1 of (𝑥 + 𝑦) 𝑑𝑥 = 𝑦 + 1/2

So, you can check it out; you will get the exact same thing. It is very easy to find the marginals, but once again, remember: the marginals do not determine the joint; the joint can be something else. That is important to remember. To evaluate the marginal, one needs to just do careful integration in some sense, but drawing the picture and getting a sense of it is also very important. That concludes this lecture. Thank you very much.
Statistics for Data Science 2
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology Madras
Lecture 6.5
Independence

(Refer Slide Time: 00:13)

Hello and welcome. In this lecture, we are going to talk about independence of two continuous random variables. We will first see the main result, and then see how to picture independent continuous random variables, how they look, etc. So, here is the main result: suppose 𝑋 and 𝑌 are jointly continuous random variables with joint density 𝑓𝑋𝑌 (𝑥, 𝑦).

They are independent if the joint density is the product of the marginal densities: 𝑓𝑋𝑌 (𝑥, 𝑦) should be equal to 𝑓𝑋 (𝑥)𝑓𝑌 (𝑦). If that condition is satisfied, the two random variables 𝑋 and 𝑌 are said to be statistically independent. So, notice that to determine independence you need the joint density. Why is that? Given the joint density, you can compute the marginals from it.

Once you compute the marginals, you check whether the joint is their product. If that is true, then 𝑋 and 𝑌 are independent. But, more interestingly, if you know ahead of time that 𝑋 and 𝑌 are independent random variables, the marginals are enough: given the marginals, you can find the joint density. Knowing that 𝑋 and 𝑌 are independent is very important; quite often people forget that assumption and happily go ahead, assume 𝑋 and 𝑌 are independent, and multiply the marginals to get the joint distribution.
multiply the marginals to get the joint distribution.
But remember that you are making an assumption there: you are assuming that 𝑋 and 𝑌 are independent, and in some cases that may not be true, and it can come back and cause some difficulty for you. So, if you know they are not independent, do not go ahead and multiply the marginals. Multiplication of marginals is valid only when 𝑋 and 𝑌 are independent random variables.
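To see the product condition fail concretely, compare 𝑓𝑋𝑌 (𝑥, 𝑦) with 𝑓𝑋 (𝑥)𝑓𝑌 (𝑦) at a few points for the two-squares density from the previous lecture, whose marginals are both Uniform(0, 1). A minimal sketch (helper names are mine):

```python
# For the uniform density on [0,1/2]^2 together with [1/2,1]^2, both marginals
# are Uniform(0,1), so f_X(x) * f_Y(y) = 1 everywhere on the unit square. But
# the joint is 2 on the two squares and 0 elsewhere, so f_XY != f_X * f_Y:
# X and Y are not independent, even though each marginal looks "nice".

def f_xy(x, y):
    same_low = 0 <= x <= 0.5 and 0 <= y <= 0.5
    same_high = 0.5 < x <= 1 and 0.5 < y <= 1
    return 2.0 if (same_low or same_high) else 0.0

def product_of_marginals(x, y):
    return 1.0 if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

print(f_xy(0.25, 0.25), product_of_marginals(0.25, 0.25))  # 2.0 vs 1.0
print(f_xy(0.25, 0.75), product_of_marginals(0.25, 0.75))  # 0.0 vs 1.0
```

By contrast, for the uniform on the full unit square the two functions agree everywhere, which is the independent case.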

(Refer Slide Time: 01:57)

So, here are a whole bunch of examples. I am not going to work these examples out in great detail for you; I did the marginal computations for them, and you can go back and check. I will just write down the results. Uniform on the unit square is independent; this one is not independent; [1, 3] × [0, 4] is independent; this one is not independent, this one is not independent, and this guy is also not independent.

So, from the support itself you can sort of guess whether it is independent or not; notice the support here. If you look at uniform on the unit square, as 𝑋 varies, for any given value of 𝑋 the distribution of 𝑌 remains the same; it does not change at all. Whatever value of 𝑋 you give me, the distribution of 𝑌 remains the same. But notice this case: here the support ends up being like this, and depending on the value of 𝑋, the distribution of 𝑌 can change.

So, the conditional distribution changes with 𝑋, so clearly this will not be independent. What about this one, [1, 3] × [0, 4]? It had a support like this. Notice it is just one rectangle; one rectangle is a big giveaway with uniform distributions. As you vary 𝑋, no matter where 𝑋 is, the distribution of 𝑌 remains the same. Of course, you have to worry only within the support; outside of the support it does not matter. So, within the support, as you vary 𝑋, the distribution of 𝑌 does not change. You can go back and check the other ones. Whenever you have 𝑥 + 𝑦 etc. in the density, it is very unlikely things are going to be independent. When you do not have a rectangular support, or when 𝑥 and 𝑦 occur together inside the definition of a non-uniform density, you really have to check whether the joint is the product of the marginals; otherwise it is difficult to see.

So, these were quick examples just to show you how to do it; go through them in some more detail if you think this came a little too fast for you. But this is very important: the structure of the support, the support itself, may give you a hint on whether the random variables are independent or not.

(Refer Slide Time: 04:20)

Here is one problem. I will work this one out because it is a little bit interesting. So, here 𝑋 and 𝑌 are given as exponential random variables. Remember what an exponential random variable is: 𝑓𝑋 (𝑥) = 𝜆1 𝑒^(−𝜆1 𝑥) for 𝑥 > 0, and 𝑓𝑌 (𝑦) = 𝜆2 𝑒^(−𝜆2 𝑦) for 𝑦 > 0. They are independent random variables, and once they are independent, you can find their joint density.

The joint density is easy to write down: because they are independent, you simply multiply the marginals. So it is 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝜆1 𝑒^(−𝜆1 𝑥) 𝜆2 𝑒^(−𝜆2 𝑦) for 𝑥 > 0 and 𝑦 > 0. So, this is very interesting: once you know they are independent, given the marginals you can simply multiply, and you see how the joint density will look.
Now, the probability being asked for here is the probability that 𝑋 is greater than 𝑌. Sometimes it is good to sketch this: draw the 𝑥 = 𝑦 line in the (𝑥, 𝑦) plane; you want the probability of the region below it, a region inside the 2D plane. So, you have to integrate: 𝑥 goes from 0 to ∞, and once you fix a particular value of 𝑥, 𝑦 goes only from 0 to 𝑥. So you take those slices and get

𝑃(𝑋 > 𝑌) = ∫ from 𝑥 = 0 to ∞ ∫ from 𝑦 = 0 to 𝑥 of 𝜆1 𝑒^(−𝜆1 𝑥) 𝜆2 𝑒^(−𝜆2 𝑦) 𝑑𝑦 𝑑𝑥

You do the inner integral first, treating 𝑥 as a constant, so the 𝜆1 𝑒^(−𝜆1 𝑥) factor comes out. The indefinite integral of 𝜆2 𝑒^(−𝜆2 𝑦) is −𝑒^(−𝜆2 𝑦): differentiate −𝑒^(−𝜆2 𝑦) and you get 𝜆2 𝑒^(−𝜆2 𝑦) back. Evaluating between 0 and 𝑥 gives 1 − 𝑒^(−𝜆2 𝑥). I did that a bit fast, but you can do it step by step and you will get it. So, multiplying out,

𝑃(𝑋 > 𝑌) = ∫ from 0 to ∞ of 𝜆1 𝑒^(−𝜆1 𝑥) (1 − 𝑒^(−𝜆2 𝑥)) 𝑑𝑥 = ∫ from 0 to ∞ of 𝜆1 𝑒^(−𝜆1 𝑥) 𝑑𝑥 − ∫ from 0 to ∞ of 𝜆1 𝑒^(−(𝜆1 + 𝜆2)𝑥) 𝑑𝑥

The first integral is just 1: it is a valid pdf integrated over its entire range. The second one needs a little bit of work: think of it as 𝑒 to the power of minus a constant times 𝑥. The indefinite integral of 𝑒^(−(𝜆1 + 𝜆2)𝑥) is −𝑒^(−(𝜆1 + 𝜆2)𝑥)/(𝜆1 + 𝜆2); substituting between 0 and ∞, you get 1/(𝜆1 + 𝜆2). Of course, the 𝜆1 in front is still there, so the second integral is 𝜆1/(𝜆1 + 𝜆2), and

𝑃(𝑋 > 𝑌) = 1 − 𝜆1/(𝜆1 + 𝜆2) = 𝜆2/(𝜆1 + 𝜆2)
It is a nice answer, is it not? I like the final answer. You have to go through the integration a little bit, but the final answer is really, really neat. Two exponentially distributed marginals, independent: you get the joint density, you compute the probability that 𝑋 is greater than 𝑌, and the answer is 𝜆2/(𝜆1 + 𝜆2). So, notice how independence makes things very easy: from the marginals you can go to the joint.
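The answer 𝜆2/(𝜆1 + 𝜆2) is easy to confirm by simulation. A minimal sketch, with rates of my own choosing:

```python
import random

random.seed(2)

lam1, lam2 = 1.0, 3.0   # example rates; my choice, not from the lecture
n = 200_000

# Simulate independent X ~ Exp(lam1), Y ~ Exp(lam2) and estimate P(X > Y).
hits = sum(random.expovariate(lam1) > random.expovariate(lam2) for _ in range(n))

print(round(hits / n, 2))               # Monte Carlo estimate, close to 0.75
print(round(lam2 / (lam1 + lam2), 2))   # exact answer: 3/4 = 0.75
```

Note that independence is baked into the simulation: the two `expovariate` draws use fresh, unrelated randomness, which is exactly the assumption that let us multiply the marginals.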

So, that concludes this lecture on independent random variables. You will see quite often that independence is a very simplifying property, and quite often, when you do not know enough about the random variables, you might as well assume they are independent. In practice, many, many models will assume independence first and check whether something is violated; if something is violated, you go back and try to fix the independence assumption.

But otherwise, people generally try to assume independence as much as possible and hope for the best. It might be a dangerous thing to do, as we have seen before, but it makes for a very easy and nice calculation. Thank you very much.
Statistics for Data Science 2
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 6.6
Multiple continuous random variables: Conditional density

Hello and welcome to this lecture. This lecture is going to be on conditional densities. Once again, we have been working with two jointly continuous random variables, going through all the different types of densities, trying to make sure we have the skills to manipulate these densities carefully. We saw the joint density, then we saw the marginal densities, and we also saw the independence property. Now we are going to see conditional densities. This is very similar to what we did with PMFs in the discrete case; we are doing the same thing in the continuous case, but with densities.

(Refer Slide Time: 0:48)

So, here is the definition. There is a lot of writing in it, but the important point is this: you have to fix an 𝑎 for which the marginal is greater than 0. So, you take a value 𝑎 for 𝑋 such that 𝑓𝑋 (𝑎) is positive. Once it is positive, you can define this conditional density. What is it? It is 𝑓𝑌|𝑋=𝑎 (𝑦), the density of 𝑌 given 𝑋 = 𝑎.

So, I am not going to go into great detail on where this comes from. Many of you may get confused by this conditioning on 𝑋 = 𝑎. Why is it confusing? Because 𝑋 is a continuous random variable, and 𝑃(𝑋 = 𝑎) = 0. So, what is the meaning of conditioning on an event of probability 0? You should interpret it a little loosely: when I say 𝑋 = 𝑎, I really mean 𝑋 around 𝑎, and in the limit I am getting something.

And this is a good enough density; it is a valid density and it works. So, do not worry too much about where this comes from; this is just a definition. The conditional density 𝑓𝑌|𝑋=𝑎 (𝑦) is simply the slice of the joint at 𝑋 = 𝑎, except that you are normalizing it: you are dividing by 𝑓𝑋 (𝑎). That is what makes it a density. So, 𝑓𝑋𝑌 (𝑎, 𝑦) by itself is just a slice, just one function of 𝑦; dividing by the marginal makes it a density:

𝑓𝑌|𝑋=𝑎 (𝑦) = 𝑓𝑋𝑌 (𝑎, 𝑦) / 𝑓𝑋 (𝑎)

Why is this a density? This is actually very easy to check. What is 𝑓𝑋 (𝑎)? It is ∫ from −∞ to ∞ of 𝑓𝑋𝑌 (𝑎, 𝑦) 𝑑𝑦, is it not? The slice integrated from −∞ to ∞ is the marginal. Now I have divided by this. So what does this mean? If you actually do the integral ∫ from −∞ to ∞ of 𝑓𝑌|𝑋=𝑎 (𝑦) 𝑑𝑦, you will simply get 1.

Notice how this 1 has to happen: the denominator is simply the integral of the numerator from −∞ to ∞. I am normalizing by that, so if I integrate again, these two will cancel and you will get 1. So, it is a very simple trick to get a density, and that density is called the conditional density. There are so many densities to work with: the joint density in two dimensions, the marginal densities, and now the conditional densities.

The same thing you did for 𝑋 you can do for 𝑌: you fix a 𝑏 such that 𝑓𝑌 (𝑏) > 0, take the slice at 𝑌 = 𝑏, and divide by the marginal 𝑓𝑌 (𝑏). In many problems, people get mixed up between 𝑋 and 𝑌 in this notation. The notation is, I think, slightly laborious and confusing; there is no way to avoid it, and you have to be very precise when you write it down. As you get more practice this will become clearer, but this is the definition of conditional density.
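The slice-and-normalize recipe can be written out directly. A minimal sketch for the two-squares density, computing 𝑓𝑌|𝑋=𝑎 numerically (the helper names are mine):

```python
# Conditional density by slice-and-normalize: f_{Y|X=a}(y) = f_XY(a, y) / f_X(a),
# where f_X(a) is the slice at x = a integrated over y. The joint here is the
# uniform density on [0,1/2]^2 together with [1/2,1]^2.

def f_xy(x, y):
    same_low = 0 <= x <= 0.5 and 0 <= y <= 0.5
    same_high = 0.5 < x <= 1 and 0.5 < y <= 1
    return 2.0 if (same_low or same_high) else 0.0

def f_y_given_x(a, y, n=10_000):
    h = 1 / n
    f_x_a = sum(f_xy(a, (k + 0.5) * h) * h for k in range(n))  # marginal f_X(a)
    return f_xy(a, y) / f_x_a

# For a = 0.25, the slice is 2 on (0, 1/2); normalizing by f_X(0.25) = 1
# leaves it at 2 there and 0 beyond, i.e. Y | X = 0.25 is Uniform(0, 1/2).
print(f_y_given_x(0.25, 0.3))  # 2.0
print(f_y_given_x(0.25, 0.7))  # 0.0
```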
(Refer Slide Time: 4:02)

There are lots of conditional densities, so here is a quick look at their properties. These are both valid densities, so you can think of conditional continuous random variables 𝑌|𝑋 = 𝑎 and 𝑋|𝑌 = 𝑏 as random variables with those densities. They are all related, and this is the important formula here:

𝑓𝑋𝑌 (𝑎, 𝑏) = 𝑓𝑋 (𝑎) 𝑓𝑌|𝑋=𝑎 (𝑏) = 𝑓𝑌 (𝑏) 𝑓𝑋|𝑌=𝑏 (𝑎)

That is, the joint density is the marginal density multiplied by the conditional density. Whenever the marginal density is nonzero at the point, you can write down this wonderful little formula: the joint density 𝑓𝑋𝑌 (𝑎, 𝑏) is the marginal at 𝑎 multiplied by the conditional of 𝑌 given 𝑋 = 𝑎, evaluated at 𝑏. Equally, you can take the marginal density of 𝑌 and multiply by the conditional density of 𝑋 given 𝑌 = 𝑏.

You see how this mirrors the properties satisfied by probabilities; it is sort of like the intersection property: 𝑃(𝑋 = 𝑎, 𝑌 = 𝑏) factors as 𝑃(𝑋 = 𝑎)𝑃(𝑌 = 𝑏 | 𝑋 = 𝑎). In the discrete case you had probabilities and all these results were true; in the continuous case the same results are true for densities: the conditional density, the marginal density, and the joint density. So, people keep this in mind as a very comfortable formula. These are just quick properties of conditional densities for you.
(Refer Slide Time: 5:50)

So, let us see examples; I have put down a few here. Uniform on the unit square: let us go to that. This is the support, drawn in a big way. If you fix 𝑋 = 𝑎, what is the conditional density going to be? We know that the marginal is 1, so the conditional density will again be uniform: 𝑌|𝑋 = 𝑎 for every 𝑎 will be uniform.

Do you see that? You can go through the calculation, but the picture alone tells you the story: for every 𝑋 = 𝑎, 𝑌 is also going to be uniform. The same thing with 𝑋 given 𝑌 = 𝑏. This is, of course, for 𝑎 between 0 and 1 and 𝑏 between 0 and 1; that is also going to be Uniform(0, 1). So, this is the first one, the uniform on the unit square. Let us do the second one; it will be a little bit more interesting.

Let me draw it a little better: mark 0, 1/2 and 1 on each axis; these two squares are my support. Again, you can quite easily guess 𝑌|𝑋 = 𝑎; remember this has to be a density now. Fix a particular 𝑎 between 0 and 1/2; then, remember that slice, it is constant between 0 and 1/2. What will the density be? It is going to be Uniform(0, 1/2), is it not? You can calculate it and you will get that.

Also from the picture it is quite easy to see this. Now, if 𝑎 were between 1/2 and 1, you will get Uniform(1/2, 1); from the support you can see it is going to be uniform. A similar case you can make for 𝑋|𝑌 = 𝑏; I am not going to write it all down. For 𝑏 between 0 and 1/2, 𝑋|𝑌 = 𝑏 will be Uniform(0, 1/2); for 𝑏 between 1/2 and 1, it will be Uniform(1/2, 1).
So, you see how the conditional densities work out: they are basically slices, handled differently from marginals. When you do the marginalization with respect to 𝑋, at every slice you integrate and find the total probability; you are only worried about the distribution over 𝑋. Here, instead, you fix an 𝑋 and worry about the distribution of 𝑌 given that 𝑋 takes that particular value. As the value of 𝑋 changes, you get different densities for 𝑌; those are the conditional densities. Hopefully you see the distinction here.

(Refer Slide Time: 8:56)

So, here are more examples, just to give you practice; these are again the same examples, and I am going through them a bit fast: [1, 3] × [0, 4], then [0, 1] × [0, 1] together with [1, 2] × [0, 2], and then the third example, the triangle with corners at 2 on each axis; that one is a little bit interesting, so let me spend some time on it.

For the first one, 𝑌|𝑋 = 𝑎: you can make a slice here; it is just uniform and flat. The flat density just corresponds to Uniform(0, 4), for 𝑎 between 1 and 3. And 𝑋|𝑌 = 𝑏: if you do a slice the other way, it is again going to be uniform, Uniform(1, 3), for 𝑏 between 0 and 4. It is easy enough to see. In the second example, things are going to be slightly different; you can take an 𝑎 here and here if you like.

And you will see; I will just do the conditionals of 𝑌. 𝑌|𝑋 = 𝑎 for 𝑎 between 0 and 1 is Uniform(0, 1); 𝑌|𝑋 = 𝑎 for 𝑎 between 1 and 2 is Uniform(0, 2). So, you can see that is how it is; it is all flat. In the third case, notice what happens: if you put an 𝑎 here, this line is 𝑥 + 𝑦 = 2, so this point is going to be 2 − 𝑎, is it not? So, if you look at this slice when you want 𝑌 given 𝑋 = 𝑎, for 𝑎 between 0 and 2, it is actually uniform.

Write it down carefully: it is Uniform(0, 2 − 𝑎). Notice how the 𝑎 shows up in the distribution: 𝑌 given 𝑋 = 𝑎 is actually Uniform(0, 2 − 𝑎). So, the conditional density of 𝑌 is 1/(2 − 𝑎) for 𝑦 between 0 and 2 − 𝑎, and 0 elsewhere. It is quite easy still; you just have to look at the density carefully.

But it is still uniform: at every 𝑎, it stays flat. The density is uniform even after you normalize, but the width of the density keeps changing with 𝑎. As 𝑎 goes from 0 to 2, the width goes from 2 to 0; you can sort of see it from this picture. For every value of 𝑋, 𝑌 has a different distribution: it is uniform between 0 and 2 − 𝑎.
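The Uniform(0, 2 − 𝑎) conditional can be pictured by simulation: keep only samples with 𝑋 near 𝑎, matching the "𝑋 around 𝑎" reading of conditioning on a zero-probability event. A minimal sketch (names and tolerances are my own):

```python
import random

random.seed(3)

def sample_triangle():
    # Uniform on the triangle x > 0, y > 0, x + y < 2, by rejection sampling.
    while True:
        x, y = 2 * random.random(), 2 * random.random()
        if x + y < 2:
            return x, y

# Approximate conditioning on X = a by keeping samples with X within eps of a.
a, eps = 0.5, 0.01
ys = [y for x, y in (sample_triangle() for _ in range(200_000)) if abs(x - a) < eps]

# Y | X = 0.5 should be Uniform(0, 1.5): mean about 0.75, with values staying
# below roughly 1.5 (up to the eps tolerance on x).
print(round(sum(ys) / len(ys), 2))
print(round(max(ys), 2))
```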

(Refer Slide Time: 12:18)

So, the next problem I am doing is for a non-uniform density again. We have seen this joint density before: 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝑥 + 𝑦 for 𝑥 between 0 and 1 and 𝑦 between 0 and 1, and we computed its marginal:

𝑓𝑋 (𝑥) = ∫ from 0 to 1 of (𝑥 + 𝑦) 𝑑𝑦 = [𝑥𝑦 + 𝑦²/2] from 0 to 1 = 𝑥 + 1/2

If you do the marginal of 𝑌, you will again get 𝑦 + 1/2. We saw before that these are valid densities and all that. So, now, what is 𝑓𝑌|𝑋=𝑎 (𝑦)? We can pick 𝑎 anywhere between 0 and 1; the marginal is nonzero there. It is going to be the joint density, which is 𝑥 + 𝑦 with 𝑥 = 𝑎, so 𝑎 + 𝑦, divided by 𝑓𝑋 (𝑎) = 𝑎 + 1/2.

So, this is the density:

𝑓𝑌|𝑋=𝑎 (𝑦) = (𝑎 + 𝑦)/(𝑎 + 1/2), 0 < 𝑦 < 1

And what is the range of values of 𝑦? 𝑦 goes from 0 to 1. Does this make sense? Let us try and sketch it; this is a bit interesting, is it not? If you sketch 𝑓𝑌|𝑋=𝑎 (𝑦), at 𝑦 = 0 it is going to be 𝑎/(𝑎 + 1/2), and it is a linear function: as 𝑦 goes from 0 to 1, it climbs, and at 𝑦 = 1 it becomes (𝑎 + 1)/(𝑎 + 1/2).

So, it is some function like this; this is how it looks. This is the conditional density: if you actually look at the slice of 𝑥 + 𝑦 at 𝑥 = 𝑎, this is how it is going to look. You can check whether this is a valid density or not; for every 𝑎, this needs to be a valid density. Here, the rectangle at the bottom has area 𝑎/(𝑎 + 1/2), and the triangle on top has area half times the height difference. What is that height difference?

It is (𝑎 + 1)/(𝑎 + 1/2) minus 𝑎/(𝑎 + 1/2), which is 1/(𝑎 + 1/2). So, the triangle area is (1/2)/(𝑎 + 1/2), the rectangle area is 𝑎/(𝑎 + 1/2), and if you add them up you get (𝑎 + 1/2)/(𝑎 + 1/2) = 1. So, this is a valid density. So, notice how this pictures the conditional slices: for every 𝑎, how does 𝑌 look? Remember, the support is the unit square, but the joint was not flat.

It is a non-uniform, sloping surface, going from height 0 at (0, 0) up to height 2 at (1, 1). So, it is a different sort of picture. But if you slice it at 𝑥 = 𝑎, the slice starts at 𝑎 and goes up, so the conditional density has a picture like this. So, just by basic integration, finding the marginals, and looking at the structure carefully, you can do the conditional densities as well.
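The validity check for every 𝑎 can be automated. A small sketch confirming that (𝑎 + 𝑦)/(𝑎 + 1/2) integrates to 1 over (0, 1) (helper names are mine):

```python
# Check that f_{Y|X=a}(y) = (a + y) / (a + 1/2) is a valid density on (0, 1)
# for several values of a, via a midpoint-rule integral over y.

def conditional(a, y):
    return (a + y) / (a + 0.5) if 0 <= y <= 1 else 0.0

def integral_over_y(a, n=10_000):
    h = 1 / n
    return sum(conditional(a, (k + 0.5) * h) * h for k in range(n))

for a in (0.0, 0.3, 0.9):
    print(a, round(integral_over_y(a), 3))  # each integral should be 1.0
```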

So, hopefully you got a reasonable idea. I know these last few lectures were a lot about integration and working with conditionals, marginals, et cetera. This will need some practice; hopefully you can do the assignment questions and activity questions and get some practice.

This basic skill is important: for simple functions like these, you should be able to understand the support, understand the conditional, understand the marginal, do some quick calculations, and find out what they are. These are very basic skills and you should have them. Thank you very much.
Statistics for Data Science 2
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology Madras
Lecture number 6.7
Data to distribution

Hello and welcome to this short lecture on Data to Distribution. So far, we have been looking at distributions: the distribution is given to you and you have to do some calculations with the joint density, finding the marginals, finding the conditionals, etc. But in practice, when you look at data, you are only going to have data. It is going to be a lot of data, and you will have a sense of it through histograms and summaries: mean, minimum, maximum, standard deviation.

Maybe you also know that some part of the data you want to model in a discrete way, and some part in a continuous way; I showed you the Iris data as an example just to motivate how these things may look. So, what is the connection? How do we move between data and distribution? What are the things to watch out for? What are the things to be careful about? In the entire data science program you will see a lot of emphasis on this.

There are different ways in which people will emphasize how careful you have to be when you go from data to distribution. Let me say a few words on this as a conclusion for this week's lectures.

(Refer Slide Time: 1:17)

So, we were talking a lot about this Iris data set. This is a very small data set by today's standards; it has only 150 lines of data, but it tells its own story in a very nice way. There is a class variable which is discrete, and then there are sepal length and sepal width, which we want to model as continuous, and petal length and petal width, which we also want to model as continuous. There are all sorts of dependencies between these things, and one needs to be careful about how we are going to model them.

So, how do we model this to statistically describe it? Maybe you want to think of a joint distribution. But how do you write down these things? There is discrete, there is continuous, you have to have conditionals, and you have to worry about a lot of these kinds of things. So, how do we do this?

(Refer Slide Time: 02:05)

We saw the summaries before, but now let me jump to one interesting type of picture: 2D histograms, combined with the classes. What I have done in the left plot is the 2D histogram of SL and SW. We saw this before for only class 0; now, on the same plot, I have done class 1 and class 2 and shown them in different colors.

The plot on the right shows sepal length versus petal length, SL and PL, again for the different classes: class 0, class 1, class 2. So, you see how the histograms look: very interesting, very different 2D histograms. They give you a picture. Maybe you want to think of them as continuous random variables: for every class you have a different pair of jointly distributed continuous random variables, maybe some sort of distribution, etc.
But interestingly you see that this pair SL and PL seems to separate out the classes, the classes
do not seem to overlap too much, there is a little bit of overlap not too much on SL, SW you
see lot of overlap between the classes. Here there is not too much overlap number 1, number 2
point I want to observe want you to observe this is very very important is how many points do
you have in each bin of your 2D histogram.

Look at the numbers on the 𝑍 axis; they have gone down to really low values: 0, 1, 2, 3, quite often 0. The maximum is 5; it does not go beyond 5. Why did that happen? It happened because you have only 50 instances of data for each class. Once you start doing these 2D histograms with multiple variables, it turns out you need a lot of bins: for SL, between 5 and 8, maybe I take 6 bins; for PL, between 2 and 6, maybe again I take another 5 or 6 bins.

So, let us say you already have 5 times 5 = 25 bins in 2 dimensions. With 25 bins and only 50 instances of data, what is going to happen? The number of data points that can fall into each bin is very, very small. So, this multiplication of bin numbers is a very big problem, and this is just 2 dimensions. Supposing you want a joint distribution for more than two random variables; just as with two random variables, you can work with multiple random variables. We will see that in a subsequent lecture.

But the number of bins is going to explode very quickly and you will have very little data. When your bins have very little data you cannot rely on them; that does not give you strong confidence in making any sort of statistical statement about what is going on. So, this kind of thing is very important. Look at the data, look at the summary, look at the histograms, maybe look at joint histograms like this, but be mindful of how much data you have and make sure you have enough to do your modelling in some reasonable fashion.
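The bin-explosion arithmetic above is easy to check in a couple of lines. This is a rough sketch; the 5 bins per dimension and 50 data points are just the numbers from the Iris discussion:

```python
# Average number of data points per bin shrinks geometrically with dimension.
bins_per_dim = 5   # e.g., 5 bins for SL and 5 for PL, as in the discussion above
n_data = 50        # 50 instances per Iris class

for d in range(1, 5):
    n_bins = bins_per_dim ** d
    print(f"{d} dims: {n_bins} bins, {n_data / n_bins:.2f} points per bin on average")
```

Already at 2 dimensions you average only 2 points per bin, and at 3 dimensions most bins are empty.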
(Refer Slide Time: 05:21)

So, this tells you how to deal with one discrete and, say, two continuous random variables. How will you describe this? It is not difficult; it just extends what we did before. How did we describe one discrete and one continuous random variable? We simply gave the PMF for the discrete random variable and gave conditionals for every possible value of the discrete random variable. You can do the same thing here.

Supposing you have 𝑋, 𝑌, 𝑍 where 𝑋 is discrete and 𝑌, 𝑍 are continuous. 𝑋 has a range 𝑇𝑋 and a PMF 𝑝𝑋 . For every value 𝑥 that 𝑋 takes, you simply define a conditional joint density 𝑓𝑌𝑍|𝑋=𝑥 . The overall joint density then becomes the average of these conditionals weighted by the PMF. This is very similar to what we did before, and you can write this kind of model down very cleanly in this fashion if you want.
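Written out, this is the same mixture form we used for one discrete and one continuous variable, now with a joint conditional density:

```latex
f_{YZ}(y,z) \;=\; \sum_{x \in T_X} p_X(x)\, f_{YZ \mid X=x}(y,z)
```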
(Refer Slide Time: 06:13)

So, I want to quickly point out a few pieces of wisdom about going from data to distribution. There should be enough data points; this is the problem, and I pointed out how even for a single Iris class this is difficult. So, in practice, unless the setting is very specific, people will mostly not try to find the distribution directly.

Finding the distribution requires a lot of data, and you may not have that much. Maybe for one or two random variables together you can find the distribution, but when you have a lot of random variables it is difficult in practice. So, mostly people work around it a little. Still, as far as possible, you should have a sense of the overall distribution of the data you are looking at, over the variables that are of interest to you.

Then you are on much firmer ground, because all the calculations come from probability; they come in very cleanly and you can think about them and do them very nicely if you have a sense of the distribution. So, that is something important to pick up as you look at data.
(Refer Slide Time: 07:19)

So, let me also describe one more data set, the diabetes data set. I will show you in the colab notebook how to get this data set and how to access it. Notice that it is bigger than the Iris data set, which was sort of smallish: here you have 442 patients and there are 10 variables. Already you can see the number of variables going up.

If you start doing joint histograms with bins for all of these, you are not going to get anywhere. So, you have to think of these problems in a very different way: do not always start with the distribution. You should have a sense of the distribution, but then think of a few variables at a time and see if you can connect them in some reasonable way.

I do not want to talk too much about this data in this lecture. I will give you the data itself; then you can look at it and try to answer this question of how you would describe these distributions. The data is not going to be sufficient. So, the usual goal is not to find the distribution, but only to find something like the disease progression one year after all these things were measured. The patients were already diabetic to start with; these are the data points measured at that time.

Now, after one year, what is the situation going to be? Can we come up with any sort of prediction? So, you make some assumptions on the distribution, you figure out to whatever extent possible what you can say about the distribution, you see if you can justify those assumptions from the data, and then you go ahead and do some computations.

So, in this lecture we did not spend too much time on these aspects of how to go from data to distribution; as you progress in data science you will learn more about it. But what is important in this course, Statistics II, is the skills part: given a distribution, how do you compute marginals? How do you compute conditionals? How do you play around with it? How do you make statements about it? Those are the skills we have to pick up. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 6.8
Multiple discrete and continuous random variables: summarizing data

(Refer Slide Time: 0:13)

Hello and welcome to this short lecture, in which I want to show you the colab notebook where you can generate these summary statistics for data, make some histograms and all that. Once again, some students expressed concern that they do not know Python and are not able to understand the commands, et cetera. This is not for you to understand every command; eventually, maybe over the next couple of years, you will pick up enough Python and background to know all these things.

You should know all these things when you eventually want to become a data scientist. This is just to show you a little bit of a flavor of the kinds of things you can do today with data and how to represent it, et cetera. So, this is my notebook; let me just run some of these things. It is not extremely important, but let me run it so that I can show you how it works.

All of this you have seen before; I have walked you through some of this code just to show you how one can do this in Python and colab. You remember the many different ideas I have illustrated. The last thing we saw was making these histograms and connecting them to the density: how the density approximates the histogram when you look at random phenomena of different natures, and how that density is important and useful, et cetera.

(Refer Slide Time: 1:38)

So, we are finally coming to the data part of things. You have densities and distributions and things like that, but what about real-life data? When real-life data comes in, how do you handle it? How do you picture it? How do you summarize it? How do you make histograms from it? That is the point of this very short lecture. I have used some of these results in the lecture itself.

This just shows you how I got them. There is this package called sklearn in Python. It is a very, very popular package among machine learning folks for the various types of algorithms you can run with it; we will not do any machine learning here, we will just use its data sets. There is sklearn.datasets, which has the Iris data set that I described in the lecture, and with just one command you can load it.

First you load the package, which makes the Iris data set available in your notebook, and then all you have to do is the iris = load_iris() command, which loads the data for you. Like I said, this Iris data set is a very popular data set in statistics; it is very widely used, and it is available in some form in most statistics packages, even if you use other packages.

There is a description for this data set, which is part of the data set itself, and you can print it. It says 150 instances, as I mentioned. There are four attributes in the data: the sepal length, sepal width, petal length and petal width, and there are three classes, which I call 0, 1 and 2. Here are the names: the first is setosa, the next versicolor, the next virginica. So, class 0, class 1, class 2; in case you like these names, you can retain them. Some summary statistics are also given.
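The loading commands being described look roughly like this; a sketch of what the notebook does:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # 150 instances, 4 attributes
print(iris.feature_names)  # sepal length/width, petal length/width
print(iris.target_names)   # classes 0, 1, 2: setosa, versicolor, virginica
# print(iris.DESCR)        # the description mentioned above
```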

(Refer Slide Time: 3:28)


So, it talks about Fisher and references, et cetera; that is the description. Now, what are the commands one would use to quickly summarize the data? It turns out scipy.stats, which we imported before, provides a very simple command called describe, which takes the data set and gives you all the summary statistics. I am using that to get the summary statistics, and then I am printing the minimum and maximum; this is for all the classes together.

Then I am printing the mean and the variance. You can also do that for class 0 alone. Class 0 stops at row 50, so I am taking the first 50 rows and all the columns; that is what this command means. So, I am taking data here only for the first class, class 0, and again I do the same thing: once I get the summary stats, I print the minimum, maximum, mean and variance. That gives you all the summary statistics.

This kind of summarizing of the data is very, very crucial. For any other data set you have, go ahead and use these kinds of functions to get a quick summary of what the data is. In fact, describe will give you a few more statistical quantities which are usually of interest, but we will just stick with these for now.
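The summary step looks roughly like this; scipy.stats.describe works column by column, and slicing the first 50 rows picks out class 0:

```python
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()

s = stats.describe(iris.data)           # all 150 rows, all 4 columns
print(s.minmax, s.mean, s.variance)

s0 = stats.describe(iris.data[:50, :])  # first 50 rows: class 0 only
print(s0.minmax, s0.mean, s0.variance)
```

describe also returns skewness and kurtosis, the extra quantities mentioned above.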
(Refer Slide Time: 4:45)

So, the next thing is to plot histograms. We have seen plotting of histograms before; here I have taken all four variables and, for class 0, plotted the histograms one after the other, setting the x-axis limits to 0 and 6 so that you can compare all of them together. You see, for instance, that the sepal length is roughly between 4.2 and 5.8, and the sepal width between two point something and four point something.

The petal length is from 1 to 2 or so, and the petal width from 0 to maybe 0.7 or 0.8. So, you can see the histograms and their different shapes; it is a good plot. Once again, I do not want you to understand every single line of this code; the point is just that it is possible to write very simple Python code to generate these kinds of plots. The last type of plot I introduced in the lecture was the 2D histogram, that nice 3D bar chart. It looked like this.
(Refer Slide Time: 5:46)

You remember this sort of buildings laid out in a well-planned layout; that is the 2D histogram. These are the commands that get you that. I am not going to talk about what these lines are; I will let you read them up. Like I said, this is not a Python class: you are not expected to remember all these Python commands, and I will not ask you about them. It is just that plots like this can be generated very easily with a few lines of Python code. It is not very hard.

This is given just as a reference for you to look up and learn on your own. You can keep this as one objective of your Python learning, in this class and the other classes as you go along: learn how to make these plots. This is just to show you that these things are possible. I have also given you the piece of code; play around with it, and search for it on the internet. Half of the coding today happens by first searching on the internet for how to do something and then reproducing that.

That is an important skill as well; you need to pick it up. You can look at the code, see how the plot is obtained, and you will understand how it works. So, that is all the colab material I wanted to show you from this week of lectures. Hopefully this week was interesting for you: we saw some interesting ways of modelling even real-life data, how to think of histograms and plot them, and how to think of multiple continuous random variables.

Continuous random variables are a big part of probabilistic models, and you should be comfortable doing computations with them: expectations, variances, multiple continuous random variables, how to calculate marginals, conditionals, and probabilities involving them. All of those are important skills to pick up. Hopefully you have practiced them enough this week. We will carry on from here in the next week. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.1
Statistics from Samples and Limit Theorems

Welcome to this lecture. We are going to be starting the second half of this course. The first half of the course, so to speak, was largely probability. We were in this comfortable world of probability spaces, where the experiment was very precisely defined, the random variables that come out of the experiment's outcomes were well defined, the distribution was known, and you learned to do calculations with those distributions.

We will continue that; we will not forget it. But we will now introduce the context of statistical problems and procedures: where they come from, how they arise, how probability is connected to them, and how one carries out and analyzes statistical procedures. In this week, we may not fully start that, but we will make some steps towards understanding the setup for a statistical problem and the statistical analysis situation. So, this week is largely about statistics, their connection to samples, and the limit theorems which justify many of these procedures. So, let us get started.

(Refer Slide Time: 01:30)

So, first, we are going to start talking about statistics from iid samples. This notion of iid samples, independent and identically distributed samples, is very crucial, and we will see how iid samples are very informative about the underlying random phenomenon. That is going to be the theme, and we will look at it a bit more closely in this lecture.

(Refer Slide Time: 02:05)

So, it turns out we have seen iid samples before. So far in our study, we have come across iid samples in multiple situations; in fact, in three situations, as I recall. The first one is when we studied Bernoulli trials. You can see I have forgotten the r here, so let me just add it: it is Bernoulli trials.

When we looked at Bernoulli trials, we saw an experiment that was independently and identically repeated, and we saw outcomes or random variables from those experiments. What we got was a series of samples of the same random variable, with the same distribution, but independently repeated. The experiment keeps repeating and we look at the successes. I will describe this in a little more detail.

Another place where we saw it was Monte Carlo simulations. There again we would run the probability experiment over and over again and try to find out something about it. Third, maybe you never thought of it as iid sampling, but we worked with data, samples from the data, and then we computed histograms. Even there, we were using iid samples.

I will come back and talk about these three at the end of this set of lectures. But we have seen iid samples before, and in all three instances we could infer something very useful from them. So, let me go into some more detail on each of these scenarios and show you how iid samples occurred and what information or statistics we gleaned from them; that is what we will look at in the next few slides.

(Refer Slide Time: 03:53)

First, let us start with Bernoulli trials. In Bernoulli trials, we have an experiment, and there is an event of interest. In that experiment, there is only one event we are interested in; we are not interested in other events. We want to know if that event happened or not. So, we think of the occurrence of the event A as a success, and if it did not occur, it is a failure. One of the statistics we want from such an observation or trial is the probability of the event; maybe we are interested in whether it is high or low, or even in its exact value. So, this is the setup.

Bernoulli trials are nothing but independent repetitions of an experiment, say n times. Now we can associate a random variable with each of the Bernoulli trials. Notice how I do this. This is what is usually called an indicator random variable: it indicates whether an event occurred or not, in this case whether A occurred or not. I define Xi as the random variable which equals 1 if A occurred in the ith trial and 0 if A did not occur in the ith trial. That is the definition of Xi.

You can see it has a Bernoulli distribution: it takes value 1 with probability P(A) and value 0 with probability 1 - P(A). For different values of i, these random variables are independent. So, you have X1, X2, and so on till Xn; these are independent and identical Bernoulli samples. Notice how, from the description of Bernoulli trials, we are able to come to an iid sampling situation.

From this we could glean a lot of information. In particular, we are interested in finding the probability of the event A, for instance the prevalence of a disease in a population. If you remember, this is how I introduced Bernoulli trials. So, this is something very useful. Once again, from the iid samples, we want to glean some information or statistic that is hidden about the underlying statistical phenomenon. This is one example.

(Refer Slide Time: 06:17)

A very similar example is Monte Carlo simulations; I will go through this a little fast. There is again an experiment or an event of interest. It is typically too complicated to write equations for, so we ended up writing programs. Even when we did analytical calculations, we wanted to check if they were correct, so we wrote a program, simulated it, and then repeated the simulation multiple times independently, and that is crucial. You have to repeat the same simulation over and over again, and hopefully independently. The computer had some way of generating random numbers, and we assumed they were independent. We repeated the simulation multiple times, figured out how many times a particular event A occurred, and estimated the probability of A as approximately nA/n.

Again, you might wonder why this is true. This is sort of the bedrock of probability theory in some sense. You want to think of the probability of an event A as some sort of frequency with which it occurs. If somebody says the probability of A is 10% and you observe the experiment 100 times, you pretty much expect that A occurred around 10 times, or maybe 9 or 11, something like that. Only then does it become meaningful to talk of probability in that sense.

So, there is this interpretation of probability as the frequency with which the event occurs: if you repeat the same experiment over and over again independently, then the fraction of times the event A occurs should roughly be the probability assigned to it. This is what we hope we are doing in our analyses.

I want to point out that this is very important, sort of foundational in some sense, and from our theory it should turn out to be true. If you repeat the experiment multiple times and observe what fraction of times A occurs, it had better be close to the probability you assign to it. The theory should be complete and closed in that fashion. So, this is something we should verify.

Now, this value of n is very interesting and important. You may want to say: if n were 100, nA/n will be this close to the probability of A; if n were 1000, this much closer; if n were 10,000, this much closer still. I want to make statements of that type so that I know my statistical procedure is working correctly, and things like that.

Once again, in this situation we have iid samples. Maybe it does not look immediately like iid samples to you, but you can again define an indicator random variable exactly as we did for the Bernoulli trials: Xi is the indicator for A, and then you line up X1, X2, ..., Xn; these are iid samples. So notice: most often in statistics, when you want to compute some information about a statistical phenomenon, iid samples naturally get involved.
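A tiny Monte Carlo example in this spirit: estimate the probability that two fair dice sum to 7 (the exact answer is 6/36) by repeating the experiment n times and computing nA/n:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                      # number of independent repetitions

d1 = rng.integers(1, 7, size=n)  # first die, values 1..6
d2 = rng.integers(1, 7, size=n)  # second die
nA = np.sum(d1 + d2 == 7)        # number of times the event A occurred

print(nA / n)                    # close to 6/36 ≈ 0.1667
```

Increasing n brings nA/n closer to the true probability, which is exactly the question about n raised above.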

(Refer Slide Time: 09:16)

So, let us finally come back to computing histograms. This is something we did for continuous models. When you wanted to model a random variable as continuous, you usually looked at some n data points, which I am denoting x1 to xn, and then binned these data points. You defined a bin of suitable width, say the interval [a, b], and then counted how many of your data points fell inside that bin. That is what we counted to make the histogram.

If you want to model this data as coming from a continuous random variable and do calculations with it, then normally you have a density for that continuous random variable. If the event A is that the value falls in the interval [a, b], you know that the density integrated from a to b gives you that probability. We know that much about the histogram and the probability involved.

The data points themselves, once again, are usually thought of as iid samples from the density fX(x). Each xi is continuous with density fX(x); it is a random variable, and the observations of these random variables are the data points. From there, you expect that the probability that X lies between a and b is approximately the number of times your data points fell in the bin [a, b], divided by n. This is your density histogram, is it not? So, you expect this to be roughly true.

So, here again you have iid samples. It turns out that every time we were trying to get some information from data, or doing an experiment to gather some statistic, we were actually using iid samples. This notion of iid samples is very, very important and crucial.

(Refer Slide Time: 11:11)

That is because, in the way we think of the theory and the way we think of data, iid samples hold information about the underlying distribution. This is something very important to be convinced about, and it is sort of easy to see.

Every time, you do not know the distribution, so you hope to repeat and repeat until you get enough data to have some method of getting what you want from the data. This notion of an iid sample is very important. First of all, why should every sample be identical? It should be identical because only then will the statistic you measure stay the same, so that you can hope to measure it.

Why should the samples be independent? If you get the same value over and over again, you do not learn anything about the distribution. So, they should be independent, and they should be identical in some sense; they cannot be all over the place, or you get no meaningful information. There are various ways to think about why iid is so informative, and I will leave you to think about why that is important. But iid samples are really the bedrock of statistical procedures.

Usually, as we saw in the previous three scenarios, you are given one particular sampling from an iid sample model, and you want to get some partial information about the distribution: maybe the probability of an event, maybe the PMF, maybe the mean or variance, whatever you want about the distribution that underlies the samples. This is common to all three procedures we saw before, and one can generalize it into a typical statistical problem and procedure.

So, data that you obtain is usually modeled as observations from iid repetitions of an experiment, and then your goal is to glean some statistic from that data, from an unknown or partially known distribution. You may have some idea about the distribution, or you may not know anything about it; what you can get from the iid samples is what is crucial. We will see quite a few examples.

The first example from the data world I can think of is the Iris data. We saw the Iris data set; there was data about every iris. Supposing, instead of giving you 150 different irises, I gave you the same iris data over and over again. If it were the same data, you would not get any information about the distribution of the sepal length, sepal width, petal length and petal width; you would gain no information at all. So, the data needs to be independent: it needs to be different irises.

Now, instead of focusing on the iris plant, supposing I look at an iris one time, a rose the next time, then name your favorite flower, a lily, a lotus, and keep on changing the flower. Then again there is no hope of getting any reasonable information. It has to be the same distribution, so irises and only irises, so that you get independent information about the same phenomenon.

So, if you want to model the natural process by which the iris flower blooms, the petals form, and the lengths and widths of the petals and sepals take shape, as a random statistical phenomenon, and study it, you should look at enough independent and identical occurrences of it in nature, measure them, get the data, and then try to glean some statistic from it. It is sort of easy if you just think about it. But you now have to build a model and do calculations, and probability will enter the picture. That is the underlying situation.

So, keep this kind of iris example in mind when somebody describes a statistical process to you; the same idea continues. Any statistical problem is like that: there is an underlying natural phenomenon which creates some observations, you hopefully get independent, identical repetitions of that same event, and then you want to get some statistical information from it. Every single data set we look at will be like that. So, that is the idea.

In general, in the way we do our analysis, first of all, you should have metrics for your statistical procedure. For any procedure or process you come up with, there will usually be a metric for how good it is, so you have to decide how to measure goodness; goodness metrics are important. The second interesting problem is: how is the goodness metric connected to the number of samples? These two are very important: the number of samples and the goodness metric of your statistical procedure.

So, keep this simple slide in mind as we study more and more statistical processes. Even whatever learning algorithm you may use is, at the end of the day, like a statistical procedure, so you always want to think of this as the basic underlying setup. So, let us keep proceeding. We are only taking baby steps now; this is a very high-level picture, and we will get into more details as we go along.

(Refer Slide Time: 16:34)


See immediately an example. So, I am going to start with a very, very simple example, where
somebody gives you 20 iid Bernoulli(p) samples and p is unknown to you. So, this is actually a
very simple situation in today's complicated world of statistics. We have been doing statistics for
more than 100 years now, very formally. So, it is very advanced field. So, this is maybe a very,
very basic example. But still you should have enough clarity about how this basic example
comes along.

So, I am going to give you 20 iid Bernoulli(p) samples and p is unknown. And maybe the goal is
to try and find p from the iid samples. So, it is reasonable goal one can think of. So, the first
thing to understand is, I am going to, what I am going to give you is 20 iid Bernoulli(p) samples,
I keep calling them samples, but at different points in time, you may get different samplings as a
result of the sample.

So, today, if I were to give you 20 iid Bernoulli(p) samples, I may give you the sampling 1, 1, 1,
0, 1, 0, 0, 0, I may give you that. Tomorrow, if you were to come and ask me, I may give you
sampling 2. So, because the process is fixed. Every number I give you is Bernoulli(p)
distribution. That is all. But the exact sequence is not fixed at different points in time. The same
samples may actually give you something different.

So, there is a difference between samples and the actual instance of the sample or samplings, as I
am calling it. And this difference is very important for you to keep in mind. The iid samples, you
will describe them as capital X1, X2, up to Xn, distributed like the distribution of X. You will say
something as simple as that. So, that is your distribution. So, the iid samples distribution remains
the same. But at every sampling, I will actually get different data. I will get a different set of
data.

Now, it is important for your statistical procedure to work with any particular instance. Your
statistical procedure, supposing you develop a statistical procedure for finding p from the
samples, it should work for sampling one, it should work for sampling 2, it should also work for
sampling 3 or maybe sampling 4, sampling 5, sampling 6, whatever samples I come up with, as
long as I am faithfully giving you iid Bernoulli(p) samples, your procedure should give you the
exact same answer.
So, the samples are actually random. So, that is the most important thing one needs to assimilate.
And that is why I put out this example explicitly. Maybe it is clear to you, but once again it is
important to see that the samples are random. They are not fixed at any point in time, but p
remains the same for every sampling, but the samples do not remain the same.

So, over time, if you sample today, if you sample tomorrow, you are going to get different
samplings. So, the only thing you are guaranteed is that each sample is an observation from the
distribution Bernoulli(p). At least in this case, that is what is given to you. So, the statistical
procedure should give you, I mean, if you design it well, it should give you the value of p with
some guarantee on, this is how close I will be to the actual p. For every single sampling, it should
work. So, over any sample, it should work in that fashion.

So, often in statistics, as we study more in the coming lectures, I myself will
use the word samples for everything. I will use the word sample for the random sample X1, X2 to
Xn. And every instance of the sample, I will also say simply sample. I will not say sampling or
anything like that. Maybe we will write x1, x2 for the actual samples, but usually people simply
write X1. And there is this understanding that you know all this, sort of internally you know this,
so we do not have to repeat it.

So, this danger of mixing up samples and samplings is there, but it is usually clear from
the context, so people do not make a big deal of it. Later on, you will see, I will also do that
myself. I will be guilty of that same mistake. But keep in your mind, when I say iid samples, the
samples are random and every instance of the sample can be different and for every instance,
your procedure has to work. So, it is important.
(Refer Slide Time: 21:14)

So, now, having said all that, having given you examples, and some settings, etc., I am ready to
sort of define what a typical statistical problem or a procedure will look like. So, you will usually
have a model for the samples. You are observing some samples, and you will usually have a
probabilistic model for them, written in this fashion: the samples X1, X2, ..., Xn are iid according
to the distribution of a random variable X. So, what does that mean? This is iid according to the
distribution of X.

So, let us say X is discrete and has a PMF pX(x); then each Xi has the same PMF pX(x), and
the Xi are independent. That is what this means. So, when I write like this, this is
how you read. So, you have a random variable X and that is defined for you. And if I say iid
samples from the common distribution of X or iid X, if I simply write it like that, I mean, every
Xi has the same distribution.

So, if X were to be continuous, let us say it is Normal(0, 1) then of course, each Xi will be
continuous and it will have the same density as X, and they will all be independent. So, if you
want to do any joint distribution, you can simply multiply the marginals and you will get a joint
distribution. So, that is the model for the sample and the model for the samples is probabilistic.
So, that is where the probability will come in.
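For iid samples, the joint probability of a particular sampling is just the product of the marginals. A minimal sketch for Bernoulli(p), with p = 0.3 and the sequence picked arbitrarily for illustration:

```python
# For iid samples, the joint PMF is the product of the marginals.
# Sketch for X ~ Bernoulli(p): P(X1=x1, ..., Xn=xn) = product of p^xi (1-p)^(1-xi).
p = 0.3                      # arbitrary illustrative value

def pmf(x, p):
    """PMF of a single Bernoulli(p) sample."""
    return p if x == 1 else 1 - p

sequence = [1, 1, 0, 1, 0]   # a hypothetical sampling with 3 ones, 2 zeros

joint = 1.0
for x in sequence:
    joint *= pmf(x, p)       # independence: multiply the marginals

# Closed form for this sequence: p^3 * (1 - p)^2
assert abs(joint - p**3 * (1 - p)**2) < 1e-12
print(joint)
```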

Now, you will be given data. The given data, we will assume, is one sampling instance from
my iid sample model. So, somebody generated this data and gave it to you according to this
model; we will assume that. The model may not exactly match, it might be approximate in many
ways, and in many cases real-life data you may not be able to model exactly. But the model will
usually be close enough, and that is enough for your statistical procedure to be developed and
for it to work.

So, usually, the distribution that we assume for X will be partially known or unknown.
Unknown means some common distribution that we do not know. What is "partially known"?
You may know the nature of the distribution, but not the parameters of the distribution. This is
the most common assumption. So, I might say X is Bernoulli(p) and p is unknown, like we had
in the previous case, or I might say X is normal with mean µ and variance σ², and µ and σ²
are unknown.

So, this is very, very common or X is Poisson with parameter lambda and lambda is unknown.
So, things like that one may assume for the distribution of X and that parameter will be
unknown. So, the goal in a statistical problem is usually to develop procedures to find the
missing information about the distribution of X.

So, if a parameter is missing, how do you find it? Or something else may be of interest to
you: you may not really care about the parameter itself, you may care about the probability of an
event. Of course, if you find the parameters, you can find the probability of the event, but that
may not be the best procedure; if you have to go in two steps to some answer, maybe there is a
direct step which is better. We will have to define all that when we do statistical procedures.

So, in general, this is how statistical problems in this class are going to look. What type of
information might you be interested in? The mean of X, the variance of X maybe. Maybe you are
interested in the probability of an event involving X, maybe in the distribution of X, or maybe in
something more complicated, say the size of the range of X. These are different types of
information one can think of about the distribution that is partially known or unknown.
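As a concrete sketch of the "direct step versus two steps" point above, here are two hypothetical procedures for the same piece of information, P(X > 1), when X is assumed normal; the true parameters and the seed are arbitrary choices for illustration.

```python
import random, statistics

random.seed(1)
mu, sigma = 0.0, 1.0                      # true values, unknown to the procedures
samples = [random.gauss(mu, sigma) for _ in range(10000)]

# Route 1: direct - the empirical fraction of samples exceeding 1
direct = sum(1 for x in samples if x > 1) / len(samples)

# Route 2: two steps - estimate the parameters first, then use the model
mu_hat = statistics.fmean(samples)
sigma_hat = statistics.stdev(samples)     # n - 1 denominator
via_params = 1 - statistics.NormalDist(mu_hat, sigma_hat).cdf(1)

print(direct, via_params)
```

Both routes should land near 1 − Φ(1) ≈ 0.159; which one is preferable depends on how much you trust the normal model.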
So, this is a typical statistical problem, and you can see how probability enters the picture, you
can see how data enter the picture and you can see how your comfort with knowing probability is
going to help you in designing statistical procedures at least in this kind of a scenario. So, this is
a typical statistical problem.

So, that concludes this lecture. I do not have any problems to solve in this lecture. But hopefully,
you get an idea of how statistical problems arise, how iid samples arise and how iid samples hold
information about the underlying distribution. And we will see more and more examples as we
go along maybe not in this lecture, but from the next lecture, next week onwards at least we will
see many more precise specific statistical problems of this nature and we will try and develop
some procedures and provide some guarantees. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.2
Empirical Distribution and Descriptive Statistics

(Refer Slide Time: 00:12)

Hello, everyone, and welcome to this lecture. In the previous lecture, we saw how iid

samples are very nice and informative about the underlying statistics and we were

thinking about how to extract some information out of the samples, and we will make

some more progress in that direction in this lecture.

In particular, we will define empirical distribution which is a distribution that one can easily

compute from the samples, and we will also look at descriptive statistics. Descriptive

statistics are quantities like the mean and variance that you calculate from the samples,

and we will ask what their connection to the distribution is. So, these are the kind

of questions we will ask and answer in this lecture. So, let us get started.

(Refer Slide Time: 01:00)


So, first of all, let us get started with empirical distribution. So, the empirical distribution

has this definition. You have a set of iid samples. We will see this will be a recurring theme

in what we do from now on; every time, we will start with iid samples. So, you have iid

samples 𝑋1 to 𝑋𝑛 distributed with the distribution of a random variable X, common

distribution is the distribution of a random variable X. So, here is a notation: #(Xi = t) is the
number of Xi that are equal to t. So, this is a count you can make. The sample is given to you;
you simply count how many times t occurs in the samples. That is the number of Xi = t.

And then once you do that, you can find a distribution. So, what is this distribution? The
probability of observing t as an outcome in this distribution is #(Xi = t)/n, the number of times
t occurred divided by n. So, this defines a PMF, and the distribution with this PMF is called the
empirical distribution. So, let me give an example; an example will make it amply clear.

So, here is a situation where you have 20 samples and these are the samples given to

you: 1, 1, 0, 1, 0, 0, 0, 1, 0, and so on. What is the empirical distribution of this sample? The
first step when you want to compute the empirical distribution is to compute the range, the set
of values that the samples take, which here is {0, 1}. It takes value 0, it takes value 1.

So, in the empirical distribution, you need p(0) and p(1). How many times did 0 occur? You
have to count: 1, 2, 3, 4, 5, 6, 7, 8. So, it is 8 out of 20, and the probability of 0 is 8/20. And
for 1, you can see, it is 20 − 8 occurrences, so that is 12/20. This is the empirical distribution.

So, the empirical distribution takes two values, there are two outcomes 0 and 1. 0

happens with probability 8/20, 1 happens with probability 12/20 and that came from the

samples. Somebody gives you these samples, you have this empirical distribution. So,

here is another set of samples, where you have 7 0s and 13 1s in 20. So, here now the

empirical distribution will again involve just 0 and 1 occurring in these samples, and 0 occurs
with probability 7/20, 1 occurs with probability 13/20.

Here is another set of samples. Look at this set, 1, 2, 0, 3. Now there is 2, 3 and all

happening. So, now, if you look at the range, it is going to be 0, 1, 2, 3, is not it? That is

going to be the range. And if you want to compute the empirical probabilities, you have to

simply count how many times 0 occurred: 1, 2, 3, 4, 5, 6, so 6/20. How many times 1 occurred:
again, 6/20. How many times 2 occurred: 5/20. How many times 3 occurred: 3/20. That is the
empirical distribution.

As simple as that; it is almost a high-school-level calculation. Given a sequence of numbers,
the number of times a particular number appears in that sequence, divided by the total count of
numbers, gives you a distribution, and that distribution is called the empirical distribution. So,
from any sample, you can compute the empirical distribution. Hopefully this is clear; it is a
simple enough process.
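The counting described above is a few lines in Python. The sampling below is hypothetical, made up to have the same counts as the third example (6, 6, 5 and 3 occurrences of 0, 1, 2, 3):

```python
from collections import Counter
from fractions import Fraction

# A hypothetical n = 20 sampling with counts 6, 6, 5, 3 for the values 0, 1, 2, 3
samples = [1, 2, 0, 3, 1, 0, 2, 3, 0, 1, 2, 0, 1, 3, 2, 0, 1, 2, 0, 1]

n = len(samples)
counts = Counter(samples)                 # #(Xi = t) for each t in the range
empirical = {t: Fraction(c, n) for t, c in sorted(counts.items())}

# Probabilities 6/20, 6/20, 5/20, 3/20 (Fraction prints them in lowest terms)
print(empirical)
```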

(Refer Slide Time: 05:19)

So, a few observations. These are the things where people get a little confused. What is this

empirical distribution? Given a sampling, an instance of the samples, I can compute the

empirical distribution. Now, as the instance changes, my empirical distribution also changes.

So, just by specifying the common distribution of X, the empirical distribution does not get

defined. In fact, the empirical distribution is a random quantity. What do I mean by random?

Suppose the iid samples are generated according to a Bernoulli(1/2) process, or a Bernoulli(p)

process.

So, here is the point: the values t that occur, and p(t), can change from one sampling to

another. If you take 20 Bernoulli(p) samples, we saw before, even in the previous slide, that

there are different cases in which the empirical distribution keeps changing, even though the

underlying distribution is the same. So, supposing these are Bernoulli(p) samples with p equal

to half or something: the first time you sample, you will get 20 samples of 0s and 1s, which

might come to, say, 8/20 and 12/20.

The next time you sample you will get a different number of 0s, different number of 1s.

So, the empirical distribution depends on the actual sample instance, not just the

sampling distribution. Keep this distinction between the two in mind. The empirical

distribution depends on the actual instance of the sampling that you get and not only on

the sampling distribution. So, this is a more complicated quantity. So, it is a random

quantity.

So, descriptive statistics are usually properties of the empirical distribution. So, the

problem with the empirical distribution is, it is a cumbersome quantity to keep track of. You

have to keep track of the values it takes, you have to keep track of the fractions with which

they occur; there are lots of moving parts, and it will change from sampling to sampling, some

value may appear, some value may not appear. It is all difficult to keep track of. So, instead of

keeping track of the entire empirical distribution, maybe you want to keep track of some

statistic of the empirical distribution. That is where descriptive statistics come in.

So, mean of the empirical distribution you could look at, variance of the empirical

distribution you can look at, probability of a particular event occurring according to the

empirical distribution you can look at. So, these are the various descriptive statistics we

look at. We look how they are computed given the samples and how they are related to

the underlying distribution. So, that is what we are going to do next.

So, what one hopes is, so from sampling to sampling, the empirical distribution changes,

its descriptive statistics change. However, if you observe a large number of samples, you

are hoping that the values that you get for the empirical distribution, the empirical

distribution itself, its properties, its descriptive statistics, sort of become close to the

properties of the underlying sampling distribution, the distribution of the X from which the

samples come.

So, if this were to happen, that is a good thing, is not it? So, you have a way of going from

the samples to computing the empirical distribution, computing the descriptive

statistics, and that in some way, if it is connected to the underlying distribution in a very

clear theoretical way, that would be a good thing and that is what we are going to hope

to do.

(Refer Slide Time: 09:14)

So, we will start first with one of the simplest statistics you can compute and that is called

the sample mean. So, here is the definition for the sample mean. It is the mean of the

empirical distribution in some sense. If X1, X2, ..., Xn are iid samples with some common

distribution, the sample mean, denoted X̄, is

X̄ = (X1 + X2 + ... + Xn)/n,

the average of the samples. Sample mean, sample average, that is what the definition is.

Given a particular sampling 𝑥1 to 𝑥𝑛 , the sample mean will take a particular value. The

sample mean is a random variable the way I have defined it: X1 to Xn are random variables,

the random samples. The sample mean itself is a random variable; X̄ is defined as the function

(X1 + X2 + ... + Xn)/n of X1, X2, ..., Xn. So, it is a random variable.

If you are given a particular instance of the sample, a particular sampling x1, x2, ..., xn,

then the sample mean will take a particular value. Quite often when we describe these things,

it is difficult to keep track of the difference between the random variable, an instance, the

particular value taken by it, etc.; people will simply say sample mean, or mean, or average

value. But you should be conscious of the difference between the iid sampling model, the

sampling distribution, and what you got from the actual samples, the instance of the samples,

and you should know the connection between the two, what is connected and what is not.

So, let us see a few examples. Here is the n equals 20 example. There are a set of

samples. Value taken by 𝑋̅ in this case is 12/20. Here is the next case, another sampling,

value taken by X̄ is 13/20. Here is another sampling, with values like 1, 2, 3 from some

distribution; the sample mean is 25/20, the sum of all the values divided by 20. So, depending

on the particular sampling that you have, your sample mean changes from sampling to sampling.

So remember, "sample mean" usually refers both to the random variable and to the value

taken by that random variable. Every time you get a sampling, you will get a sample mean

value. So, remember that the sample mean itself is a random variable, and it takes different

values for different sets of samples.

So, it all seems quite simple. But you will see soon enough, you can get confused: where

does the probability come in, what is the value? The sample average is just one value, so why

am I talking about probabilities? All these kinds of confusions will come.
So, you have to remember the difference between the model for the samples and the data. I

have an iid model for the samples: I have n samples, and I am assuming there is an iid model

from which these n samples got generated; that is my model.

Then I have data, when I have data, data is a particular instance of that model. I have

only particular values. I am not going to have 𝑋1, 𝑋2 as data. I will have 1, 2, 3, 5, 10,

whatever. Those are the values. So, your sample mean computed from data will be an actual

value, while the sample mean itself is a random variable, the function (X1 + X2 + ... + Xn)/n

of the random samples; X̄ is your random variable. That random variable took this particular

value in this sampling. That is what is going on, and this should be clear enough in your mind.

(Refer Slide Time: 12:51)

So, let us see some illustrations, Bernoulli 0.5 samples. I am going to change n, so here

in this case, X1 to Xn are iid according to this distribution: 0 with probability half, 1 with probability half. So, if n

were to be 5, if you are looking at just 5 samples, here is one possibility 0, 0, 1, 1, 1 and

your sample mean will work out as 3/5. Here is another possibility, 1, 1, 1, 0, 1; the sample

mean will work out as 4/5. For 0, 1, 1, 1, 0, the sample mean will be 3/5. So, you just get

different values.
So, you see how the sample mean takes different values when the sampling gives you

different samples. The underlying distribution is always the same, but you get different samples

and different sample means. Let us go to n = 20. There you go, a long set of samples, and the

sample mean changes: 12/20, 13/20, 13/20. So, this is how the sample mean works; you always

add up all the samples and divide by the number of samples, and the sample mean keeps

changing from sampling to sampling.

Now, supposing I go to n = 200, it is difficult to write down all the samples. I did a

few simulations of 200 iid Bernoulli(1/2) samples. In the first sampling, I got 95/200, in the

second sampling I got 102/200, and in the third sampling I got 98/200. So, this is what will

happen: different times you sample, you will keep getting different sample means.

n = 1000. Look at that now. There is a mistake here in the slides; it should be 495/1000,

not /200. Sorry for that; I have corrected the slides I give you. So, it is 495/1000, 490/1000,

504/1000. Notice now that the variation in the sample means is decreasing as we increase n.

Initially it was 3/5, 4/5; the variation was as much as 1/5, which is 0.2. Now look at the

variation here: 0.495, 0.49, 0.504, so the variation has dropped to around 0.01. It has gone

down by more than a factor of 10 when you went from 5 samples to 1000 samples.

And what is 0.5? You can see clearly that 0.5 is the distribution mean: the mean of this

distribution equals half. So, you see, as you increase the number of samples, the sample mean

seems to be going close to the distribution mean. At least in this observation, in my

simulation, it is working out, and that turns out to be an important fact. We will write down

this fact later on, but you can already see it clearly in the observation.
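You can check this shrinking variation yourself with a small simulation; the helper `sample_mean`, the seed and the number of repetitions below are just illustrative choices.

```python
import random, statistics

random.seed(2)

def sample_mean(n, p=0.5):
    """Sample mean of n iid Bernoulli(p) samples."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

# For each n, repeat the sampling many times and look at the spread
for n in [5, 20, 200, 1000]:
    means = [sample_mean(n) for _ in range(500)]
    print(n, round(min(means), 3), round(max(means), 3),
          round(statistics.pstdev(means), 4))
```

The spread of the sample means shrinks steadily as n grows, while all of them stay around the distribution mean 0.5.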
(Refer Slide Time: 15:42)

So, now let us look at a different type of samples. Here is a Normal(0, 1) sample: I have

n samples, iid Normal(0, 1). You remember the normal distribution. The normal distribution

has a density, f_X(x) = (1/√(2π)) e^(−x²/2). You can plot this density if you like; it will look

like the familiar bell curve. The normal distribution tends to take values close to 0, and it can

take larger values also. The variance is 1, and that controls the spread in the probability. So,

that is the shape.

(Refer Slide Time: 16:20)


So, here is an experiment that I did with n = 5. I got 5 samples, and the sample mean

ended up being −0.25. The next time, I got 5 samples with mean 0.17; the next time, again 5

samples, with mean 0.11. You can see the variation can be as high as 0.4 or so, quite a bit of

variation in the sample mean. I repeated it for n = 20. I do not want to write down the samples for you. I

sample mean. I repeated it for n = 20. I do not want to write down the samples for you. I

will show you later on in a colab notebook how to generate these samples and how to

calculate the sample means. It is easy enough.

So, you see, I got 0.08, −0.24, 0.41, still it is varying. Notice what happens when I go to

200. I am getting −0.01, 0.11, 0.12. Multiple times I do it. You can see already it is

shrinking. And if you go to n equals 1000, it really, really shrinks sharply. So, you get 0.04,

−0.04, −0.02. So, once again the distribution mean is 0. So, the distribution mean, 0, and

the sample mean appears to be getting close to the distribution mean as n increases. So,

these are experiments.

I mean, quite often, one can write an expression, but sometimes it is good to see some

actual values that you generated and sort of check that this is actually happening. So, I

would encourage you to keep a colab notebook, which is like your playground. When

somebody presents a statistical result, quickly generate some numbers and check that it is

actually true. I will also show you how to do this number generation in colab notebooks. When

you generate multiple times, you seem to get the same sort of observation. It is a nice little

experimental thing to do.
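In that spirit, here is a small playground sketch (plain Python rather than a colab notebook) that repeats the Normal(0, 1) sample-mean experiment for the same values of n; the seed and the three repetitions per n are arbitrary.

```python
import random, statistics

random.seed(3)

# Repeat the sample-mean experiment for iid Normal(0, 1) samples
for n in [5, 20, 200, 1000]:
    means = [statistics.fmean(random.gauss(0, 1) for _ in range(n))
             for _ in range(3)]
    print(n, [round(m, 2) for m in means])
```

As in the slides, the three sample means per row crowd closer to the distribution mean 0 as n grows.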

(Refer Slide Time: 18:03)

So, here is a theorem. It is an important result, very, very important result. Expected value

and variance of sample mean. Remember, the sample mean is a random variable. 𝑋1 to
Xn are iid samples, and the common distribution of the samples has a finite mean µ and a

finite variance σ². I am going to assume that is the case.

So, these iid samples that I generated come from a certain distribution that is a common

distribution. And that distribution has a finite mean mu and finite variance sigma square.

If it is a PMF, you know how to calculate the mean and the variance. If it is a PDF, you

know how to calculate the mean and the variance. It is just 𝐸[𝑋], 𝐸[𝑋 2 ], mean and

variance are calculated from that.

Now, the sample mean X̄ is a function of all these samples X1, X2, ..., Xn. So, for X̄ also

you can calculate the expected value and the variance. The expected value ends up being µ

and the variance ends up being σ²/n. I am not going to go into detail on the proofs, but the

proof is not very hard. For E[X̄], expectation is linear, so you get

E[X̄] = ( E[X1] + E[X2] + ... + E[Xn] ) / n.

And what is each of these terms? Each of them is µ. So, it is nµ/n, and that is just µ, easy to see.

The variance is a little bit more complicated. You will have to compute E[X̄²]. Expanding

the square,

E[X̄²] = ( E[X1²] + ... + E[Xn²] + 2E[X1X2] + ... + 2E[X(n−1)Xn] ) / n²,

with all the cross terms that occur. Remember, the samples are independent and identically

distributed, so each E[Xi²] is the same E[X²], and each cross term E[XiXj] with i ≠ j equals µ².

How many such products are there? n(n−1)/2, each appearing with a factor of 2, so the 2s

cancel and you get

E[X̄²] = ( n E[X²] + n(n−1) µ² ) / n².

Now, E[X²] = σ² + µ². Substituting, you have (n σ² + n µ² + n² µ² − n µ²) / n²; the n µ²

terms cancel, and you simply get E[X̄²] = σ²/n + µ².

So, the variance of X̄ is just E[X̄²] − (E[X̄])², the µ² and µ² cancel, and you simply get

σ²/n. It is just a quick derivation. The derivation is not so important, but the final

understanding of what this result means is extremely, extremely important. The expected

value of the sample mean equals the distribution mean, the mean of the distribution.

The variance of the sample mean equals the variance of the distribution divided by n;

an extra n ends up in the denominator. Notice that scaling; it is very important. The expected

value is just µ, but the variance becomes σ²/n, the distribution variance divided by n. Bear

that in mind; it is very important.

Once again, remember, sample mean is a random variable. It changes from sampling to

sampling. Even though it changes from sampling to sampling, its expected value is going

to be equal to the expected value of the original distribution used in the sampling. The

variance, though, actually decreases with n, as σ²/n. So, keep that in mind.

So, this is the crucial difference that happens in the sample mean. So, all that I have

written down.

Remember, the distribution mean is actually a number; it is not random. The sample

mean is random, and the sample mean's expected value is equal to the expected value of

the underlying sampling distribution; that is the important result. Even more crucial is this

variance falling with n; it decreases as 1/n. So, as n increases, what is going to happen to

X̄ as a random variable? Irrespective of n, X̄ always has the same mean.
So, if you were to plot the distribution of X̄ around its mean µ: for very small n, like 5,

the variance is only σ²/5. If X̄ were discrete and you plotted its PMF, for small n it may be

spread out around µ. So, this may happen for small n.

What will happen for large n? As n goes higher and higher and higher, it is going to be

very, very close to just this, very close to mu. Remember, when variance goes down to

0, the values taken by the random variable come very close to the expected value. It does

not go too far away. That is, the variance controls the spread of the distribution. High variance

means high spread. Low variance means very low spread around the mean. So, as n

increases, the distribution itself comes closer and closer to the mean and that is the

picture that you have to have in mind.

The actual distribution of X̄ we do not know; it may have some PMF taking some values,

but we know that the spread of that PMF is going to be really small; the variance is σ²/n. So,

one statement you can make very nicely is: as n increases, the values taken by the sample

mean, as the sampling changes, will always be close to the distribution mean. They will not

be too far from the distribution mean; that is what this means.

The upshot of this result is that, as n becomes larger and larger, the values themselves

taken by the sample mean will be close to the distribution mean, not just on average. The

actual values themselves will be close, because the spread is not very large.
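The theorem is easy to check empirically. The sketch below repeats the sampling many times for an assumed Normal(2, 3²) distribution with n = 50 (all arbitrary illustrative choices) and compares the mean and variance of the sample means against µ and σ²/n.

```python
import random, statistics

random.seed(4)

mu, sigma, n = 2.0, 3.0, 50
reps = 4000

# Generate the sample mean many times and check its own mean and variance
xbars = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(xbars))      # should be near mu = 2
print(statistics.pvariance(xbars))  # should be near sigma^2 / n = 9/50 = 0.18
```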

(Refer Slide Time: 25:53)


So, we will see some examples and illustrations, hopefully, soon enough. But before that,

let us just finish off one more descriptive statistic, which is called the sample variance.

So, once again, you have iid samples, and the sample variance, which I will denote as S²,

is a random variable defined in this fashion:

S² = ( (X1 − X̄)² + (X2 − X̄)² + ... + (Xn − X̄)² ) / (n − 1),

where X̄ is the sample mean. So, how do you calculate the sample variance given a particular

sampling x1 to xn? You first find x̄, the sample mean: add up all the xi and divide by n. Then

you take ( (x1 − x̄)² + ... + (xn − x̄)² ) / (n − 1), and you get the sample variance. These are

calculations you have done before in your Statistics I class. This is the sample variance.

Often the actual instance is also called the sample variance, and the random variable is

also called the sample variance, but you should remember the difference when people talk

about these things. Why do we use n−1 instead of n in the denominator? n is also used by

many people. n−1 is used because of the next result; you will see that the next result

becomes very clean if I use n−1 here. So that is why I used n−1.
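As a quick check of the formula, here is a sketch computing the sample variance with the n − 1 denominator by hand and comparing it against `statistics.variance`, which uses the same convention; the Normal(0, 1) samples and the seed are arbitrary.

```python
import random, statistics

random.seed(5)
samples = [random.gauss(0, 1) for _ in range(20)]   # one sampling, n = 20

n = len(samples)
xbar = sum(samples) / n                             # sample mean

# Sample variance with the n - 1 denominator
s2 = sum((x - xbar) ** 2 for x in samples) / (n - 1)

# statistics.variance uses the same n - 1 convention
assert abs(s2 - statistics.variance(samples)) < 1e-9
print(s2)
```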
(Refer Slide Time: 27:24)

So, here is the big theorem. The big theorem says the expected value of the sample

variance is equal to the variance of the distribution. So, supposing you have iid samples

from a distribution which has a finite variance sigma square, the sample variance is a

random variable, but its expected value is equal to the distribution variance. So, we saw

before that the sample mean had expected value equal to the distribution mean, but its

variance was falling with number of samples. As number of samples increased, the

variance kept on falling. It went down to 0. So, the spread became very small.

Now, the sample variance, on the other hand, is going to have expected value 𝜎². What about the variance of the sample variance? We are not doing that kind of analysis. One can show in most reasonable cases, if the distribution is not very crazy and has a few more finite moments, that the variance of the sample variance will also go to 0. So, if you calculate the sample variance, you can expect it to be close to the original distribution variance.

So, these are results; the result is much more important than the proof and all that. So, I am not doing proofs for this result. You can work it out; the proof is not too difficult. So, here is the same thing as before: the values of the sample variance are going to be close to the actual variance of the distribution. So, as n increases

one can expect that the individual samplings will give you a variance which is close to the

variance of the distribution.

So, these two descriptive statistics that we saw, we saw the empirical distribution, we

maybe commented that it could be a complicated object. But if you calculate the sample

mean and sample variance, we are already getting some valuable information about the

underlying distribution. Remember, we were talking about why did we start all this from

the iid samples, we want to extract some information about the underlying distribution.

That was our goal.

Here are two descriptive statistics, which appear to be giving us good information about

the underlying distribution. One was the sample mean, very easy to calculate. Just add

up all the sample values divided by n. But as n increases, that is going to go close to the

distribution mean. That is great to have.

Next is the sample variance. What does that give you? As n increases, you just compute

the variance of the values that you get, and you can expect that to go close to the actual variance of the distribution. Remember the n versus n − 1 point; it is not too critical, but n − 1 gives you maybe slightly better accuracy. So, that is a good thing to have. So,

simple descriptive statistics. So, this gives us promise. So, there can be simple statistical

procedures, easy to execute and compute and you can get information that is valuable

about the distribution.

(Refer Slide Time: 30:28)


Let us do illustrations. So, here are illustrations. I will show you a colab notebook code

for this so that you can also run it and play with it. Here is a Bernoulli half simulation that

I did. I am not showing you the actual values, because this is too many values to simulate.

You know the mean of Bernoulli half is 1/2 and variance is 0.25. So, I calculated sample

variance values for n equals 20, n equals 200 and equals 1000. You can see it comes

very, very close. I show you the colab notebook, you will see how to do this yourself.

So, you can see as n increases, the sample variance comes very, very close to the actual variance. So, if you look at Normal(0, 1), again, a similar simulation, the mean is 0 and the variance
variance. So, if you look at normal 0, 1, again, a similar simulation, mean is 0 and variance

is 1. You can calculate the sample variance values for n equals 20, 200, 1000, etc. And you

can see how as you go to 1000, the sample variance comes very close to 1, which is the

actual variance. So, these are, I mean, this is just to convince you that the result is not

just some abstract result. If you actually do simulation, you do get exactly what is

suggested.
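A minimal version of this kind of check (my own sketch, not the lecture's actual colab notebook) can be written with only the standard library, averaging the sample variance over repeated samplings:

```python
import random
import statistics

random.seed(0)

def avg_sample_variance(draw, n, trials=200):
    """Average the sample variance S^2 over many samplings of size n."""
    return sum(statistics.variance([draw() for _ in range(n)])
               for _ in range(trials)) / trials

bern = lambda: 1 if random.random() < 0.5 else 0  # Bernoulli(1/2), variance 0.25
norm = lambda: random.gauss(0, 1)                 # N(0, 1), variance 1

for n in (20, 200, 1000):
    print(n, round(avg_sample_variance(bern, n), 3),
             round(avg_sample_variance(norm, n), 3))
```

The printed values should hover near 0.25 and 1, the distribution variances, for every n, illustrating that E[S²] = σ² regardless of the sample size.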

(Refer Slide Time: 31:43)


So, the final statistic that we are going to see is something called the sample proportion.

So, this is very important as well. It is a very interesting and simple descriptive statistic to

calculate from the samples, the sample proportion. So, let us say you have 𝑋₁, …, 𝑋ₙ distributed as X, iid. These are iid samples from the distribution of X. Now,

I have an event A which I have defined using X, I could define it as X being greater than

t or I could define it as X being between a and b, whatever I want to do with X I do and

then I define my event A.

So, quite often, I am interested in something called the sample proportion, which is

basically the proportion of 𝑋𝑖 , the fraction of 𝑋𝑖 for which A is true. So, remember, 𝑋1, 𝑋2 to

𝑋ₙ are distributed according to X. What do I mean by A being true? As in, A has occurred: the value taken by 𝑋ᵢ corresponds to A, as in it is inside the event A. An event is a subset

of the values that X can take in some sense. So, the 𝑋𝑖 actually, the occurrence of 𝑋𝑖 , the

value taken by 𝑋𝑖 is inside A. That is what it means. That is the numerator. Out of these

n values maybe that could be true for some fraction of them. And that fraction is S of A.

So, this is the sample proportion of an event A.


Now, once again, the sample proportion is a random variable. It will change from sampling

to sampling. Hopefully, I have an illustration for you. Here is the illustration. So, you have

a sample 0, 1, 1, 0, you can calculate the sample proportion of X = 1; that is 3/5. Here is another sample: −0.2, 1.1, 0.3, −1.2, 0.7, and you can calculate all the sample proportions. What is the proportion of values for which X ≤ 0? S(X ≤ 0) is 2/5, so 2 out of the 5 cases. X between 0 and 1: that is 2 out of the 5 cases again. X > 1: that is 1 out of the 5 cases. Do you see how I am doing this? So, given any event defined using the common distribution, I can find the sample proportion of that event quite easily from the sample. It is a straightforward computation.
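The proportion computations described above can be written directly (a sketch; `sample_proportion` is my own helper name, and I have completed the first sample with a fifth value, 1, so that the proportion works out to 3/5 as in the lecture):

```python
def sample_proportion(xs, event):
    """Fraction of the samples for which the event (a predicate) is true."""
    return sum(1 for x in xs if event(x)) / len(xs)

# First sample from the slide, completed with a fifth value so S(X = 1) = 3/5
print(sample_proportion([0, 1, 1, 0, 1], lambda x: x == 1))   # 0.6 (3/5)

# Second sample from the slide
xs = [-0.2, 1.1, 0.3, -1.2, 0.7]
print(sample_proportion(xs, lambda x: x <= 0))      # 0.4 (2/5)
print(sample_proportion(xs, lambda x: 0 < x <= 1))  # 0.4 (2/5)
print(sample_proportion(xs, lambda x: x > 1))       # 0.2 (1/5)
```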

(Refer Slide Time: 34:09)

Here is the big result. So, this is a very important and interesting result once again that

ties up a simple statistical procedure that we do from the samples. So, that is the common

theme; hopefully you get that. You have iid samples, and I am doing a very
simple computation with the samples be it the sample mean or the sample variance or in

this case, the sample proportion, proportion of values, which satisfy a particular event A.
And then I have this wonderful result which connects a probability computed from the distribution to that simple statistical procedure.

So, here is the connection. So, you have iid samples from the distribution of X. A is an

event defined using X and P(A) is the probability of A. P(A) is the probability of A according

to the distribution of X. So, if A were that X is between a and b, I will integrate the density of X over a to b; or if X is discrete with a PMF, then I will add up all the values of the PMF

inside that. So, that is what I would do to compute P(A).

The sample proportion S(A) is the random variable, its expected value and variance are

given by these two expressions here. The expected value of the sample proportion equals P(A). The variance of the sample proportion is P(A)×(1 − P(A))/n. So, notice how this works

out. The proof is actually very simple. I can convert the samples into Bernoulli(P(A)) samples just by the indicator random variable: 𝑌ᵢ is 1 if A is true for 𝑋ᵢ, and 𝑌ᵢ is 0 otherwise. You now have Bernoulli(P(A)) samples, and S(A) is just the sample mean of 𝑌₁ to 𝑌ₙ. The expected value of the sample mean equals the distribution mean, so you get P(A). The variance of the sample mean equals the variance of the original distribution divided by n, and the Bernoulli(P(A)) distribution has mean equal to P(A) and variance equal to P(A)(1 − P(A)), so the variance of S(A) is P(A)×(1 − P(A))/n.

So, once you see S(A) as the sample mean of iid Bernoulli(P(A)) samples, you can apply the theorem that you already know for the sample mean: its expected value is equal to P(A), and its variance is the Bernoulli variance divided by n. So,
the sample proportion, which is again a simple statistic we can calculate from the iid

sample is connected to the distribution. The probability of A according to the distribution

is connected directly to the expected value of the sample proportion.

So, as n increases, we expect S(A) to be close to P(A). That is it. Mean of S(A) = P(A),

variance of S(A) tends to 0. So, this is the important story here. Once again, we saw three

simple statistical procedures computing three descriptive statistics, one was the sample

mean, the other was the sample variance, the third was the sample proportion. All three

of them are connected to corresponding information about the distribution, one is the

distribution mean, the other is the distribution variance, and the third is the probability of an event according to the distribution.
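The indicator-variable idea behind the proof can be checked numerically. Here is a sketch under my own naming, taking A = {X ≤ 0} with X ~ N(0, 1), so that P(A) = 0.5 and P(A)(1 − P(A))/n = 0.0025 for n = 100:

```python
import random
import statistics

random.seed(1)

def sample_proportions(n, trials=2000):
    """Sample proportion S(A) over many samplings; A = {X <= 0}, X ~ N(0, 1)."""
    out = []
    for _ in range(trials):
        # Indicator variables Y_i: 1 if A occurred for X_i, else 0
        ys = [1 if random.gauss(0, 1) <= 0 else 0 for _ in range(n)]
        out.append(sum(ys) / n)  # S(A) is the sample mean of Y_1..Y_n
    return out

n = 100
props = sample_proportions(n)
print(statistics.mean(props))      # close to P(A) = 0.5
print(statistics.variance(props))  # close to P(A)(1 - P(A))/n = 0.0025
```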

(Refer Slide Time: 38:12)

Once again, we will close with an illustration. I have 𝑋1 to 𝑋𝑛 being N(0, 1). I generate a

different numbers of samples and try to find 𝑃(𝑋 ≤ −1). This comes from the distribution: it will be

𝑃(𝑋 ≤ −1) = ∫_{−∞}^{−1} (1/√(2π)) e^{−x²/2} dx ≈ 0.159.

So, that came from the distribution. I did this numerical integration and I got 0.159. You know how to do this.

We have seen how this value can be obtained.


So, I took n equals 20 samples and actually found the sample proportion values. I did this 5 times, 5 samplings, and for each sampling I computed the sample proportion. So, I took 20 samples, and the first time I got 0.15 as the sample proportion of X less than or equal to minus 1. How did I do it? Out of the 20 values, how many were less than or equal to minus 1? I divided that count by 20 and got 0.15. Then you get 0.2, 0.15, 0.15, 0.15. Each time I took a fresh 20 samples; those are the values I got.

Next I took 200 samples and then again counted the number of samples which were less

than or equal to minus 1. In the first sampling I got 0.170, next one I got 0.14, 0.15, 0.15,

0.16. Next I took n equals 1000. Again the value is very, very close to 0.160, 18, 162,

135, 153. You see some variation here, but I guess that depends on how you do the

sampling and all that.

Here is another value. Same thing repeated. From the distribution, I got 0.683 as the

probability of minus 1 to 1. And then as I do samples and increase them, I get the sample

proportion value. Again, 5 samplings for each of these number of samples. You see the

actual sample proportion comes very, very close to the probability that I want to calculate.

So, hopefully, these procedures are convincing to you.
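This illustration can be reproduced with a short simulation (my sketch; the reference values 0.159 and 0.683 are the ones quoted in the lecture):

```python
import random

random.seed(2)

def estimate(event, n):
    """Sample proportion of the event over n iid N(0, 1) samples."""
    return sum(1 for _ in range(n) if event(random.gauss(0, 1))) / n

for n in (20, 200, 1000):
    # One sampling for each n; rerun to see the variation across samplings
    print(n,
          estimate(lambda x: x <= -1, n),       # approaches 0.159
          estimate(lambda x: -1 <= x <= 1, n))  # approaches 0.683
```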

(Refer Slide Time: 40:41)


So, once again, let us go back to the original question that we started these two lectures

with: where have we seen iid samples? We saw them in Bernoulli trials (missed out the 'r' here as well). And for Bernoulli trials, the sample mean tends to the distribution mean: for Bernoulli(p) samples, the distribution mean is p, and

the sample mean is fraction of successes. So, we know the fraction of successes is going

to go close to p.

Monte Carlo simulations, the sample proportion tends to the actual probability. You

remember, that is what we did: n_A divided by n, the sample proportion in a Monte Carlo simulation, tends to the actual probability. Same with the histograms. How did we do it? The number of values in the bin divided by the number of samples: that is going to tend towards the probability of the

random variable falling inside that bin.

So, hopefully, this couple of lectures here have convinced you that from iid samples, there

are simple statistical procedures that give you information about the distribution and you

can analyze and prove results about those statistical procedures that show you that the

statistical procedure results in an expression or an answer which is close to the

distribution mean, which is close to the distribution variance, which is close to the
probability of an event according to the distribution. So, all three are very interesting

results. And hopefully, we will take this and move on.

So, what we are going to do in the next few lectures is see a little bit more about these

kind of results. Why do these kinds of results hold? How does one analyze them? What

are the theoretical implications, etcetera? We will see them in the next few lectures and

then we will jump into more examples and see how this can be used. Thank you very

much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.3
Illustrations with Data

(Refer Slide Time: 00:12)

Hello, and welcome to this lecture. In the previous lecture, we saw quite a few important
statistics that one can calculate from iid samples. And we are, of course, interested in the
underlying distribution and what we can say about it. But we should start with some basic
statistics first, basic descriptive statistics, like the sample mean, sample variance, sample
proportion. They all seem to mean something very nice. We saw some nice results that
encapsulate what they mean, etcetera, in some sense.

So, what I am going to try and illustrate for you in this lecture is to look at some real life data.
So, now, when you look at real life data, quite often the distribution is not immediately apparent, and you have to rely on these kinds of sample statistics to try and infer something from it. Maybe with enough experience, enough practice and enough knowledge of what the data is about, you may be able to get at the distribution; or quite often you may not even need the distribution at all.

So, all you have is data, and very limited knowledge. You will see that it is a different world: when you see data without knowing the distribution, you can feel really lost. So,
let me show you some simple illustrations with actual data and then hopefully see how these
things work out when you see something in real life. So, let us get started.

(Refer Slide Time: 01:37)

So, let us begin with the iris data. You remember the iris data: there are three different classes of iris, with sepals and petals and their lengths and widths, etcetera. It is a very popular data set in statistics, like I mentioned. There are three classes, 0, 1 and 2. In each class, there are 50
instances of data. So, think of it as in each class there are 50 irises and you have measured the
sepal length, sepal width, petal length, petal width for each of these 50 irises and this data is
available to you.

So, now let us focus on the sepal length of Class 0 irises. So, there is data available to you and
the data I have listed here, and you can actually pull it up in your colab notebook or something
and actually look at the data. That is something that you can do. That is fine. And the data will
look like that 5.1, 4.9, 4.7, so on. So, I am going to think of a model through which this data
could have originated.

So, remember, what is my model? My model is usually an iid sample model, iid samples from a
particular distribution. So, when you see actual data, you have to sort of imagine in your mind
that maybe there is some distribution, which you sample iid, you may get some numbers of this
form. In fact, what should the distribution be letters are very deep difficult questions to answer.
In most cases, we may not be able to answer such questions, but at least you can imagine that
that is the model. Some unknown distribution this data has come out.

Now, you can calculate the mean of this data. Now, remember, the crucial thing is, I will keep
using the word iid samples. But remember, in the previous lecture, I spoke to you about the
model for iid samples and the actual instance of the data, every sampling, as I called it. So, this
data is one particular sampling from my model. My model is iid repetitions of some distribution.
5.1 is one instance, one observation of a value from that distribution. That is all. It is not directly the distribution; I do not know the distribution. So, I have to do the best I can with what I have.

So, I may calculate the sample mean. I will have to add up these 50 numbers and divide by 50.
That is the sample mean. And I would do the sample variance: for every number, take that number minus the sample mean, squared; add that up over the 50 numbers and then divide by 49. That is the small adjustment we do for the sample variance. I would get 0.1242, which is 0.3524 squared. So, 0.3524 is the sample standard deviation, the square root of the sample variance.

I can also do some proportions based on this data. So, for S(sepal length > 5) for Class 0 irises, I count every length which is greater than 5 and divide by 50; I get the sample proportion 22/50. Likewise, I might do a sample proportion of the sepal length being between 4.8 and 5.2; I get 20/50. I did these in a Python notebook, quickly wrote some commands and got the answers. It is difficult to do by hand: there are 50 numbers and you just go count, count, count. You can do it, but it is a bit laborious.
Today, one can pull this data into a Python notebook and do it very quickly.
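One way to redo these computations is with scikit-learn's built-in copy of the iris dataset (a sketch assuming `sklearn` is installed; note the second proportion uses strict inequalities, which is what reproduces the 20/50 figure):

```python
import statistics
from sklearn.datasets import load_iris

iris = load_iris()
# Sepal length (column 0) of the 50 Class 0 irises
sl = [row[0] for row, t in zip(iris.data, iris.target) if t == 0]

x_bar = sum(sl) / len(sl)                        # sample mean
s2 = statistics.variance(sl)                     # sample variance (n - 1 convention)
p_gt5 = sum(1 for x in sl if x > 5) / len(sl)    # S(sepal length > 5) = 22/50
p_mid = sum(1 for x in sl if 4.8 < x < 5.2) / len(sl)  # strictly between = 20/50

print(round(x_bar, 4), round(s2, 4), p_gt5, p_mid)
```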

So, it is possible to compute these things. But nevertheless, this is all you can do. So, you do not
know the underlying distribution. Nobody has given that distribution to you. You may even ask
very foundational, philosophical questions, like does nature have a distribution from which it produces irises; but let us not go there. It is just a statistical tool that we are using to understand these numbers that come up. So, this is the sepal length.

The same thing I can do for petal width. I am not going to go through the detail here. Again,
once I fix a class, I can imagine that there is one unknown distribution from which these petal
widths are being sampled and then one can compute the various statistics that we did. So, these
are just numbers that come out just to give you a sense of how these actual numbers are. Now,
you can ask very intriguing questions. So, how good is the iid samples model? So, once you fix a
particular class, it is not too bad. I mean, if you look at the values and if you look at how they are
distributed and all that, it seems like a reasonable thing to do.

So, it is like you go and find an iris somewhere and then you go and you supposing it is Class 0,
let us say, and then you measure the sepal width and sepal length, or you go to an iris of another class and
measure the petal width. It is going to be something of this sort. And it seems okay that every iris
should be independent of the other irises. And so if we keep measuring, we should
get independence. And I do not know. I would not mind this model. I think this model is
perfectly fine. It seems very reasonable in this case.

So, that is the iris data for you. Like I said, we are looking at actual data and we are imagining
that it comes from an iid sampling sort of model. Then we are proceeding, computing sample
mean, sample variance and proportion and all that. That is the first step.

(Refer Slide Time: 07:17)

So, the next example I am going to give you, I am often accused in the office of my slides being
very dull and dreary, no color, no art, etc. So I thought I should put something here. So, here is
some data on the Taj Mahal and you get an opportunity to put a picture of the Taj Mahal on the
slide. So, this is data of air quality around Taj Mahal. There is a monitor there and you can go to
the Pollution Control Board of India's website. Every day they put out a PDF file from which
you are supposed to pull out these numbers and populate a table, which is what I have done.

This is actually April of 2021, first April to twelfth April. There are about 11 numbers I think in
this. Some dates in the middle are missing. So, I think 11 numbers, observations. There are four
different entities that are counted, I mean, weighed in the air and their numbers are given below;
one is SO2, NO2, two gases which are important to monitor pollution. And the next two are just
particulate matter, any matter at all. 2.5 and 10 just indicate their size. 2.5 is I think relatively
small and 10 is slightly larger. The units are micrograms per cubic meter: if you take a cubic meter of air, how many micrograms of particulate matter of that size (PM10, say) are there in it?

The max, which is the last row is basically the maximum allowable number within a 24 hour
average period. So, 80, 80, 60, 100 these are considered good numbers. I do not know if they are
internationally good or not, but at least in India, the Pollution Control Board these are the
standards: below it is good, above it is bad. And this is the Taj Mahal we are talking about; it definitely symbolizes something about India that is well worth preserving.

And you can see the numbers are a bit disappointing for PM10. It is all way higher than the
maximum allowable limit. And the PM2.5 also quite often seems to be crossing the limit. And on
the SO2, NO2 front we are doing okay. So, here is data, just some data that somebody gave you
and not too much of it. It is small enough that one can try and do calculations by hand or even by
a small computer program or something available with you, you can do a calculation with the
calculator also. So, what can we say about this? So, this data is out here and what can you say?
(Refer Slide Time: 09:51)

You can do the descriptive statistics from this data for each of these quantities. You can find the sample means; this is just a calculation. You can find sample standard deviations or variances; I put the square there, so that is actually the variance, and its square root is the standard deviation. And I may be interested in this proportion. The proportion is
something of interest to me, the proportion that the max was exceeded in the data. I am finding
that for SO2 and NO2 it was 0, as in never exceeded; PM2.5 was exceeded 7 out of 11 times and
PM10 was exceeded 11 out of 11 times. So, this is the situation.
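A sketch of the exceedance computation (the PM10 readings below are hypothetical placeholder values, NOT the actual CPCB numbers from the slide; only the 100 µg/m³ limit is from the lecture's table):

```python
# Hypothetical PM10 readings (micrograms per cubic meter) for 11 days --
# illustrative values only, not the actual numbers from the slide
pm10 = [210, 180, 250, 190, 230, 205, 260, 175, 220, 240, 195]
MAX_PM10 = 100  # 24-hour standard from the lecture's table

n = len(pm10)
mean = sum(pm10) / n
var = sum((x - mean) ** 2 for x in pm10) / (n - 1)      # n - 1 convention
exceeded = sum(1 for x in pm10 if x > MAX_PM10) / n     # proportion over the limit

print(round(mean, 1), round(var, 1), exceeded)  # exceeded = 11/11 = 1.0 here
```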

I mean the phenomenon of how PM10 matter gets generated at the Taj Mahal or how PM2.5
matter gets generated at Taj Mahal might be complicated to model. Statistics might come and
help you; you may want to make some statistical statements about it, and that feels okay. But then, do you really like the iid samples model for this data, the pollution levels day after day that I showed you in the previous table? Do you think iid is the way to go, or do you think what happened one day will influence what happened the next day? One can think about it. For that, you need slightly more knowledge about how this particulate matter gets generated.

You should know a little bit about Agra surrounding areas, where is the Taj Mahal, where is this
environment monitoring station, what does it get influenced by, what happens if there are fires in the fields in Haryana and Punjab and all that, does it change significantly? I mean, there is so much more to this simple question of whether you like iid samples for this data. Maybe you can start a discussion in the Discourse forum and we can debate this; it is an interesting thing to debate. So, at some level, it is beyond what a statistician should do.

So, I mean, maybe; I do not know if it is beyond or not. It is not really a simple mathematical calculation; you cannot just do calculations and conclude that iid samples is a good model. It is more of an art, more of a knowledge base and intuition, and there is so much other reasoning that needs to come into this than just simple statistical questions. So, these kinds of
questions we may or may not be able to answer mostly. Keep that in mind. But it is okay. It is
not too bad. As a first order approximation if you do not know anything, it is good to assume iid
samples. It is not bad.

So, look at the second point. Really, the question we have to worry about is: is the Taj in
trouble. So, really, I mean, from the statistical analysis, I want to be able to definitively say
something about that. Is it dangerous? Is that going to damage Taj Mahal? Is the pollution going
to be high like this forever? Is it going to change over the time? I have given you data only in
April. So, one probably needs a lot more data. Try to look at the data. I could not find the data
conveniently organizes. Just PDF files and extracting numbers from PDF files is very difficult.
So, I have not put all the data out there for you. Maybe that is an interesting exercise for you.

So, look at the data from the Taj pollution numbers and look at it over a year and see if there is
anything there that is worrisome. Is it really going very high? Is there something that one
needs to worry about in that? But really to be able to create a statistical story, analysis story from
this and argue convincingly that yes the Taj is in trouble, we need to do something desperate.
Otherwise, it will just, the beauty of it will disappear in some time. So, it needs a lot more work
than just doing a few sample mean and standard deviation calculations. So, it is a little bit
different from that. That is number one.

The next question, the third question I have asked here is, can we really conclude anything with
11 data points? 11 sounds really small. And is there a sense we have of how many data points we
need before we can say anything strong about these kinds of questions? So, these kinds of
discussions and questions are at the heart of concluding something from statistics or building a
statistical story about a phenomenon.
Now this, the telling of a story involving data is very, very important. You will see so many
articles and reports today about how people sell something with some data and all that. So,
whatever you learn or not learn from this course, you should learn how to read a statistical story
very carefully. I mean, read about number of data points, read about sample assumptions, read
about how reasonable the model is and from there conclude for yourself, how strong is the
confidence in that story and its conclusions.

So, these are questions. We are not yet ready to answer these questions in this course. I just
brought it up, because it is an important question. In every statistical story these questions come
up and one needs to be able to sort of know how to read these things.

(Refer Slide Time: 15:40)

So, let us come to one more piece of data. This we have looked at quite often in this course and
let me get into it once again. So, I am going to look at the IPL and then in particular, I am going
to look at run scored in the deliveries 0.1, 0.2, 0.3. So, you know what 0.1 is, zeroth over, first
over, first delivery, first over second delivery, first over third delivery, that is 0.1 to 0.3. So, I
have data from 1598 innings of past IPL matches.

There is a shared spreadsheet in which this data is there. It turns out cricsheet.org now puts out
the data in csv format also. You remember one of my earlier lectures I was talking about
converting from yaml to csv, etcetera. Now, cricsheet does that automatically for you. You can
go look up the csv data itself.
So, all the calculations were done using the spreadsheets. So, 1598 is a lot of data. So, you
cannot be writing it down in pen and paper. You need a good spreadsheet program or any other
program into which you pull this data and do the calculations by computer. That is important.
What are the sample means? Here are the sample means. For 0.1, the sample mean is 0.73, then 0.87 for 0.2, and then 0.95 for 0.3. What are the sample variances? About 1.5, 1.8 and 2.1. So, those are the variances for 0.1, 0.2, 0.3.

And a few proportions that may be of interest. Proportion of dot balls in 0.1 is 0.5989. So,
basically about 60% of the first delivery were dot balls, about 55% of the second delivery were
dot balls, 53% of the third delivery were dot balls. The next proportion I am interested in is a 4 or
a 6, a boundary hit. I mean the ball reaching the boundary, rolling along the ground for a 4 or going over the ropes for a 6. In 0.1 about 10% of the deliveries were boundaries, 11.45% in 0.2 and 13% in 0.3.
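The spreadsheet computation can be sketched as follows (with a tiny hypothetical set of runs standing in for the 1598 innings; the values and layout here are mine, not from cricsheet):

```python
# runs[d] = runs scored off delivery 0.d across a few hypothetical innings
runs = {
    "0.1": [0, 0, 1, 4, 0, 1, 0, 2],
    "0.2": [0, 1, 0, 6, 1, 0, 4, 0],
    "0.3": [1, 0, 4, 0, 2, 6, 0, 1],
}

for d, xs in runs.items():
    n = len(xs)
    mean = sum(xs) / n                                 # sample mean of runs
    dot = sum(1 for x in xs if x == 0) / n             # proportion of dot balls
    boundary = sum(1 for x in xs if x in (4, 6)) / n   # proportion of 4s and 6s
    print(d, mean, dot, boundary)
```

With the full cricsheet data, the same loop over real per-delivery columns would reproduce the means and proportions quoted above.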

So, notice the story that is coming out. This is a lot of data, 1598 innings. And what is the story here? What is the trend here? Is there a trend? Is it clearly justified? Does it seem solid enough? Yes, right. I
would say yes. There is a clear trend from the sample statistics. Look at the trend that you can
read from this. And I think the story is much more convincing than the Taj Mahal story, is it not? Any story I build about the Taj Mahal with 11 pieces of data is not that convincing.

On the other hand, the simple story that I am building here is very clear: the runs scored increase from 0.1 to 0.3. The runs scored in 0.3 are clearly higher than the runs scored in 0.1; with good confidence we can say that. So, the number
of runs scored is going to be higher in 0.3. And that also makes intuitive sense. On the first delivery, most batsmen would defend. And by the third delivery, it is very likely that they get a loose ball and hit it. So, it is a nice, simple story, backed up by intuition and backed up by data in a very solid manner.

But still, look at that question: do you like the iid samples model? Do you think every delivery that is bowled by one particular bowler in the first over of an IPL innings is independent and identically distributed? Is it a good enough model? Maybe one needs some additional checks; maybe later on we will see if we can do this. But it is not a bad model. It looks okay. I mean, for the first 3 deliveries that are bowled, I think it is okay. It is not too bad.
If you get hit for a 4 or 6 in the first ball, maybe it becomes a bit different, but in most cases I
would say, in the first over, even if you get hit for a 4 off the first ball, most bowlers are going to follow the same plan that they would have had before they came into the match. So, it seems
reasonable. So, here are three different stories that we tried to build. One was from the iris data; there was not much of a story to tell there. It was just data. I did not emphasize the story angle too much.

And then we got the Taj Mahal data. And then we wanted to say some story about is the Taj
Mahal in danger; but we felt like, it looks bad, but I do not know if I can say that strongly based on 11 points of data. It seems a bit unreliable. We do not have any solid way of saying that, but even then, 11 data points, and that too only from April, feels really small and unlikely to capture the whole picture.

On the other hand IPL, if you want to tell a story about how the third ball goes for higher runs
than the first ball, this may or may not be a great story. But anyway, it is a story. And if you want
to say that, it feels much more solid. You are on much more solid ground when you say that. So,
hopefully, this gave you a good feel for how much confidence you can have in the statistical story you are building; it depends on so many other things. And the sample means, sample variances and sample proportions do seem to convey something interesting.

So, what we are going to see in the next lecture is some justifications for why this number of
samples is important, why as the number of samples increases, when you have iid samples, you
can be much more confident about your story. So for that, it turns out one needs to look at sums
of random variables. So, if you remember, the sample mean came from the sum of the iid
samples, of course, divided by n; but that n is just a number, it is not random. So, it is the sum of the random variables that matters.

Even sample proportions, if you think about it, it is actually a Bernoulli sample, is not it, 1 and 0,
and you are actually adding up a bunch of random variables. So, what happens when you add a
bunch of random variables? We have seen this once before. I will quickly remind you of what
happens and then present you a result about sums of random variables which is very, very
famous and popular. So, that will come in the next lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.4
Sum of Independent Random Variables

(Refer Slide Time: 00:12)

Hello, and welcome to this lecture. In this lecture, we are going to look at Sum of Independent
Random Variables. So, how to describe the distribution of the sum of independent random
variables? How to go about calculating it? What are the important salient points to remember? We
will only get started here. We will keep looking at this more and more as we go along in this
lecture and the next lecture also.
(Refer Slide Time: 00:35)

So, first, let us think about expected value and variance for sum of n random variables. There are
𝑛 random variables. I am not saying anything more about their distribution. They could be
independent. They could be identical. I do not know. I am just saying 𝑛 random variables
𝑋1 , 𝑋2 , … 𝑋𝑛 . They have some joint distribution. They live in the same probability space. So, there
are 𝑛 random variables and I am going to define their sum. Sum is easy to define. Just add up all
of them.

I am going to call the sum 𝑆. 𝑆 = 𝑋1 + 𝑋2 + ⋯ +𝑋𝑛 . So, it turns out, you do not have to say
anything more before we can write down a simple equation for the expected value. The expected
value of the sum is equal to the sum of the individual expected values, because expected value is a
linear operator: it distributes over the summation on the right hand side. So, that is always true:
𝐸[𝑆] = 𝐸[𝑋1 ] + 𝐸[𝑋2 ] + ⋯ + 𝐸[𝑋𝑛 ]. It is easy enough.
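Linearity of expectation is easy to check in a Python notebook. Here is a minimal Monte Carlo sketch of my own (not from the lecture; the three distributions are arbitrary choices): the simulated mean of 𝑆 settles near the sum of the three distribution means, 3.5 + 0.3 + 1.0 = 4.8, even though the variables are not identically distributed.

```python
import random

random.seed(0)
N = 200_000  # Monte Carlo trials

# Three differently distributed random variables (arbitrary choices):
# X1 ~ Uniform{1..6} (mean 3.5), X2 ~ Bernoulli(0.3) (mean 0.3),
# X3 ~ Exponential with mean 1.
def sample_S():
    x1 = random.randint(1, 6)
    x2 = 1 if random.random() < 0.3 else 0
    x3 = random.expovariate(1.0)
    return x1 + x2 + x3

mean_S = sum(sample_S() for _ in range(N)) / N
print(round(mean_S, 2))  # close to 3.5 + 0.3 + 1.0 = 4.8
```

Note that nothing here requires the three variables to be identical or even similar; only the additivity of expectation is being exercised.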

So, it turns out, if you put this additional condition that 𝑋1 to 𝑋𝑛 are pair-wise uncorrelated. What
is uncorrelated? I have described it here. 𝐸[𝑋𝑖 𝑋𝑗 ] should factor as 𝐸 [𝑋𝑖 ] × 𝐸[𝑋𝑗 ]. If that were to
happen, then variance of the sum is also going to become sum of the variances, Var(𝑋1 ) +…
Var(𝑋𝑛 ). So, let me write down.

So, for the first statement, I am not going to write down a proof. It is easy enough. I will write
down a proof for the second statement. It has got to do with (𝑆 − 𝐸[𝑆])², is it not? The variance of
𝑆 is the expected value of this quantity. So, let us see how to write what is inside it. That is
(𝑋1 − 𝐸[𝑋1 ] + 𝑋2 − 𝐸[𝑋2 ] + ⋯ + 𝑋𝑛 − 𝐸[𝑋𝑛 ])².

If you are the type who is getting quite upset with these lengthy sort of derivations, you do not like
this derivation, you do not have to pay attention too much to the derivation. The final result is what
is most important. I am just doing this so that people who are interested in this can benefit out of
it. So, when you square a bunch of sums like this, you will have two types of expression.

One type of expression will be the square terms, (𝑋1 − 𝐸[𝑋1 ])² + (𝑋2 − 𝐸[𝑋2 ])² + ⋯: each term
gets squared. And then you will get the cross terms. So, you can write it down and multiply. You
might have done this before in your life. You will get cross terms of this nature, in fact two of each:
2(𝑋1 − 𝐸[𝑋1 ])(𝑋2 − 𝐸[𝑋2 ]) and so on. So, I am not going to write down all
the cross terms.

So, an arbitrary cross term will look like this: (𝑋𝑖 − 𝐸[𝑋𝑖 ])(𝑋𝑗 − 𝐸[𝑋𝑗 ]). It turns
out the expected value of this term is actually equal to 0 if the uncorrelated condition is true. This
is something that I am not going to do in detail. You can check it out. You just have to multiply this
out, distribute the expected value over each term and use the formula 𝐸[𝑋𝑖 𝑋𝑗 ] = 𝐸[𝑋𝑖 ]𝐸[𝑋𝑗 ]. You
will see everything will cancel. You will get 0. So, this is a crucial part.

So, all the cross terms vanish when you take the expected value. Only the square terms remain.
And what is the expected value of the first square term, 𝐸[(𝑋1 − 𝐸[𝑋1 ])²] = 𝐸[𝑋1²] − (𝐸[𝑋1 ])²?
This is the variance of 𝑋1 . So, this is the crucial result: if the variables are pair-wise uncorrelated,
that is, the expected value of the product 𝑋𝑖 𝑋𝑗 is equal to 𝐸[𝑋𝑖 ]𝐸[𝑋𝑗 ], then the variance of the sum
is the sum of the variances.

So, here is the result in words; this version is much more important. For a sum of 𝑛 random
variables, the mean of the sum is always equal to the sum of the means. You do not need any other
condition. Of course, you need some existence conditions; we are assuming those are true. And if
the variables are uncorrelated (the word uncorrelated will be used loosely here, I should say pair-wise
uncorrelated), the variance of the sum is also the sum of the variances. So, that is a good fact to remember.

This pair-wise uncorrelatedness can quite often be checked using data, at least
roughly. It is not too bad. So, it is a very nice fact to remember, and quite a powerful fact
about the expected value and variance of a sum of 𝑛 random variables.
In particular, if it turns out 𝑋1 , …, 𝑋𝑛 are independent, then they are of course pair-wise
uncorrelated. So, if you look at our iid samples situation, our random variables are independent.
When they are independent, they are also uncorrelated. So, the above result holds.

So, independence implies uncorrelated, but uncorrelated may not imply independence. If you
remember, I showed you an example of this long, long ago. You may have a vague
memory of it. But that is important to know. So, independence is good. Independence also means
the variance of the sum is the sum of the variances. Simple result. Hopefully, the result is clear
even if the proof was a little bit long-winded and involved some calculations.
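Both facts can be sketched numerically (my own illustration, not from the lecture): for an independent pair, the variance of the sum matches the sum of the variances, while for a perfectly dependent pair (a variable added to itself), it clearly does not.

```python
import random

random.seed(1)
N = 200_000

def var(xs):
    # plain population-style sample variance
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Independent pair: a fair die roll and a fair coin flip.
die = [random.randint(1, 6) for _ in range(N)]
coin = [random.randint(0, 1) for _ in range(N)]
s = [d + c for d, c in zip(die, coin)]
print(round(var(s), 3), round(var(die) + var(coin), 3))  # both near 35/12 + 1/4

# Perfectly dependent pair: the die added to itself.
t = [d + d for d in die]
print(round(var(t), 3), round(2 * var(die), 3))  # 4*Var(X) vs 2*Var(X): not equal
```

The second print shows why some condition like uncorrelatedness is needed: Var(𝑋 + 𝑋) = 4 Var(𝑋), not 2 Var(𝑋).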
(Refer Slide Time: 06:23)

There are a few extensions, simple extensions, but these are useful, as you will see; they give quick
derivations of the results we saw before about the variance of the sample mean. You can scale the
terms in the sum, instead of just summing: each 𝑋1 you can multiply by 𝑎1 , and 𝑋𝑛 you can multiply
by 𝑎𝑛 . So, this is like a linear combination of these 𝑛 random variables; the 𝑎𝑖 are constants. Again,
the expected value goes inside: 𝐸[𝑎1 𝑋1 + ⋯ + 𝑎𝑛 𝑋𝑛 ] = 𝑎1 𝐸[𝑋1 ] + ⋯ + 𝑎𝑛 𝐸[𝑋𝑛 ], for any 𝑋1 to
𝑋𝑛 . And the variance sums like this: Var(𝑎1 𝑋1 + ⋯ + 𝑎𝑛 𝑋𝑛 ) = 𝑎1²Var(𝑋1 ) + ⋯ + 𝑎𝑛²Var(𝑋𝑛 ), if
pair-wise uncorrelated. So, this is really important: I am assuming pair-wise uncorrelated. If that
is true, this would happen. Let me add it here also.

So, now, let us push this general result a little bit more. Now we are going to look at iid
samples. So, 𝑋1 to 𝑋𝑛 are distributed like 𝑋, identical and independent. Now, if I take a linear
combination 𝑎1 𝑋1 + ⋯ + 𝑎𝑛 𝑋𝑛 and take its expected value, what do I know about the expected
values of 𝑋1 , …, 𝑋𝑛 ? Each of them is identical to 𝑋 in distribution, which means each expected
value is the same as the expected value of 𝑋.

So, you can take 𝐸[𝑋] outside as a common factor. You get (𝑎1 + 𝑎2 + ⋯ + 𝑎𝑛 )𝐸[𝑋]. The same
thing happens with the variance: you get (𝑎1² + 𝑎2² + ⋯ + 𝑎𝑛²)Var(𝑋), since the variances are all
the same constant. And the samples are independent, so of course they are uncorrelated. So, all of
these results are true.

So, now, let us look at the sample mean. 𝑋1 to 𝑋𝑛 are distributed according to 𝑋, iid. This is the
iid samples model, and we are defining the sample mean. Remember, the sample mean 𝑋̅ is the random
variable (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛 )/𝑛, and this is nothing but a linear combination with each 𝑎𝑖
equal to 1/𝑛. You just substitute that here.

You will see the expected value of, oh my god, the thing became a bit messed up, apologies for that.
The expected value of 𝑋̅ is actually the expected value of 𝑋, and the variance of 𝑋̅ is the variance
of 𝑋 divided by 𝑛: this 1/𝑛² added up 𝑛 times gives you the factor 1/𝑛. So, this is the result we
saw before. I am just reiterating that. So, that is something that is important. You see, with this
simple extension of the previous result, we get the mean and variance of the sample mean.
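A small simulation sketch of this (my own, with 𝑋 taken as Uniform[0, 1], so 𝜇 = 0.5 and 𝜎² = 1/12): across repeated samplings, the mean of 𝑋̅ stays near 𝜇 while its variance tracks 𝜎²/𝑛.

```python
import random

random.seed(2)
REPS = 20_000  # number of independent samplings

def sample_mean(n):
    # sample mean of n iid Uniform[0,1] samples (mu = 0.5, sigma^2 = 1/12)
    return sum(random.random() for _ in range(n)) / n

results = {}
for n in (5, 20, 80):
    means = [sample_mean(n) for _ in range(REPS)]
    m = sum(means) / REPS
    v = sum((x - m) ** 2 for x in means) / REPS
    results[n] = (m, v)
    # E[Xbar] stays at 0.5, while Var(Xbar) tracks (1/12)/n
    print(n, round(m, 3), round(v, 5), round(1 / (12 * n), 5))
```

As 𝑛 goes from 5 to 80, the printed variance shrinks by a factor of 16, exactly the 1/𝑛 behavior derived above.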

(Refer Slide Time: 09:13)

Now, we are going to start looking at the sample mean versus the distribution mean. We have
already seen that the expected value of the sample mean is equal to the distribution mean 𝜇, and
that the variance of the sample mean is 𝜎²/𝑛. So, as 𝑛
increases, the expected value remains the same. It is always 𝜇, but the variance is going to start
shrinking. As 𝑛 increases, the variance is shrinking, which means the values that the sample mean
takes are getting less and less spread around 𝜇.

Again, up to this point, it is okay. If all you wanted to say was that, that is okay; it is a fair,
correct statement. But can we put some number to this? Can we say how close 𝑋̅ and 𝜇 are going
to be? Only by saying something about probability can I say something a little better than
that the variance is going to 0. The distribution mean is going to be close to the sample mean, but
by how much? It is a very natural question. You have to quantify it. How do you go about
quantifying how close you expect 𝑋̅ and 𝜇 to be, as a function of 𝑛 and as a function of other
distribution parameters? Important question, is it not? Let us try and answer that.

So, you may ask questions like this. Remember, 𝑋̅ is a random variable. It is not a constant. It
is not one number. The first time you get your samples, you may get one
value for 𝑋̅. The second time you get your samples, even though it is from the same distribution,
you can get a different value for 𝑋̅. This is a different sampling. So, different samplings may give
you different values of 𝑋̅, while 𝜇 stays the same.

So, the question you have to ask is: over a very large number of samplings, how likely are you
to be away from 𝜇 by some 𝛿? Again, this is a probabilistic question, because 𝑋̅ is a random
variable. A lot of people start thinking that the sample mean is fixed. But remember, we are
changing the samples. As you change the samples, different samples will give you different
sample means. You repeat your experiment. You have seen this in your Monte Carlo
simulations when you do them in Python. You repeat the same experiment, you get different
numbers.

So, the numbers do end up changing. The only statement you can reasonably make is about
probabilities. What is the probability that 𝑋̅, the sample mean random variable, is going to be
greater than 𝜇 + 𝛿? Here 𝛿 is the difference, some difference that I have in mind: it could be 0.1,
0.2, 1, 10, I do not know, depends on how large 𝜇 is. What is the probability that my sample mean
is going to be greater than 𝜇 + 𝛿?

Similarly, I can ask: what is the probability that my sample mean is less than 𝜇 − 𝛿? Both of these
I can combine and ask: what is the probability that |𝑋̅ − 𝜇| > 𝛿? For this one, if you think of 𝜇
being here, with 𝜇 + 𝛿 to its right and 𝜇 − 𝛿 to its left, then the region outside [𝜇 − 𝛿, 𝜇 + 𝛿] is
where |𝑋̅ − 𝜇| > 𝛿. That is a simple pictorial illustration of what happens. So, I want to be able to
say something about the probability of this.
So, of course, if you know the distribution of 𝑋̅ exactly, or the distribution of 𝑋 exactly, you can
write down precise answers for this. You can define this event, find all the values of 𝑋1
to 𝑋𝑛 that satisfy this event, find their probability, etc. But that is usually pointless. You may
not know the distribution. And the distribution of 𝑋̅ will be so complicated: you do not want to
find the distribution of the sum of these things divided by 𝑛. It is messy, very messy.

So, what we will usually settle for are good bounds. I say good bounds; bounds by themselves are
great, but good bounds are even better. So, can we get some good, cheap, quick-and-dirty, simple
bounds that will give us a sense of how large these numbers are? I am going to show you one
such bound and one nice, very popular result that comes because of it.

(Refer Slide Time: 13:48)


So, this result is called the weak law of large numbers. Sounds nice; it is a fancy name. I really
like that name, the weak law of large numbers. It is a law about large numbers: in the long run,
eventually, something happens. That is sort of what this law says. So, here
is the setting. 𝑋1 to 𝑋𝑛 are iid according to the distribution of 𝑋, which has distribution mean 𝜇
and variance 𝜎². 𝜇 is the expected value of 𝑋, 𝜎² is the variance of 𝑋: distribution mean,
distribution variance.

Now, from the samples you will get a sample mean 𝑋̅. The expected value of 𝑋̅ is 𝜇, you know
that. The variance of 𝑋̅ is 𝜎²/𝑛, you know that also. All of these things we know. These are not
the law of large numbers. This is the law of large numbers: 𝑃(|𝑋̅ − 𝜇| > 𝛿) ≤ 𝜎²/(𝑛𝛿²). This is a
very simple use of Chebyshev's inequality. What is Chebyshev's inequality?

Chebyshev's inequality says 𝑃((𝑋̅ − 𝜇)² > 𝛿²) ≤ 𝐸[(𝑋̅ − 𝜇)²]/𝛿². (This is actually Markov;
Markov and Chebyshev are all sort of similar.) That is all; this is a simple use of that. The two
events are equivalent: |𝑋̅ − 𝜇| > 𝛿 is the same as (𝑋̅ − 𝜇)² > 𝛿².

Now, you just use Markov's inequality: the probability that this positive random variable exceeds
𝛿² is at most the expected value of that random variable divided by 𝛿², and that is nothing but
𝜎²/(𝑛𝛿²). So, it is a very easy application and you get a bound. Notice how powerful this bound
is. Long back, when we looked at the Markov inequality and the Chebyshev inequality, you might
have said these bounds look bad, they are not very good, etcetera. Look at how powerful it is here.
I do not know anything about the distribution of 𝑋. 𝑋 could be anything.

All I know is 𝜇 and 𝜎², and look at the powerful result I get. In particular, look at this 𝑛 that
comes up here. For any fixed 𝛿, as 𝑛 tends to infinity, this bound tends to 0. 𝛿 could be anything
you like: 𝛿 could be 0.001 or 0.000001. I can always go to an 𝑛 so very large that it will swallow
however large 1/𝛿² is and make the bound go to 0. 𝜎², of course, needs to be
finite for this particular proof. So, that is the result that is meaningful here.

Notice this. Suppose I have a particular 𝜇 and I look at the probability that |𝑋̅ − 𝜇| > 𝛿, where 𝛿
is very small; think of it as 10⁻⁶. The bound is 𝜎²/(𝑛 ⋅ 10⁻¹²). So, if 𝑛 becomes 10²⁰, then this is
going to go to the order of 10⁻⁸ and go to 0. But do not think of 𝛿 being so
small; maybe 𝛿 is 0.1. If 𝛿 is 0.1 and 𝑛 is 1000, then 𝜎²/(1000 × 0.1²) also ends up being very,
very small.

So, this quantifies the probability that the sample mean can deviate from the actual distribution
mean, with a simple application of Chebyshev. There are sharper inequalities than this, but this
inequality itself is quite useful. Another reading of this inequality: with probability more than
1 − 𝜎²/(𝑛𝛿²), the sample mean lies inside [𝜇 − 𝛿, 𝜇 + 𝛿]. The probability that it is outside is
upper bounded by 𝜎²/(𝑛𝛿²), so the probability that it is inside is lower bounded: it is greater than
or equal to 1 − 𝜎²/(𝑛𝛿²). So, this is another way in which people write this down.
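The bound can be watched in action with a rough Monte Carlo sketch (my own illustration; taking 𝑋 as Bernoulli(1/2) and 𝛿 = 0.1 are arbitrary choices): the empirical deviation frequency always sits below the Chebyshev bound 𝜎²/(𝑛𝛿²), and both fall as 𝑛 grows.

```python
import random

random.seed(3)
REPS = 4_000
delta, mu, var_x = 0.1, 0.5, 0.25  # X ~ Bernoulli(1/2)

results = {}
for n in (10, 100, 1000):
    # count samplings whose sample mean deviates from mu by more than delta
    deviated = sum(
        1 for _ in range(REPS)
        if abs(sum(random.randint(0, 1) for _ in range(n)) / n - mu) > delta
    )
    empirical = deviated / REPS
    bound = min(var_x / (n * delta ** 2), 1.0)  # Chebyshev bound, capped at 1
    results[n] = (empirical, bound)
    print(n, round(empirical, 4), round(bound, 4))
```

The gap between the two columns also hints at how loose Chebyshev is, a point the next lecture takes up.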

So, what is the meaning of this probability? This is a question that is often asked. Maybe
when I illustrate it, I will talk about the meaning of this probability. What is the meaning, in the
sense: in what probability space is this probability valid? That is the question.
So, we will ponder this question as we go along.

But remember this question. I am not going to answer this question immediately. You may be able
to interpret it based on what I have described. But later on, when we actually calculate these
probabilities in statistical problems, I will come back and keep asking this question, what is the
meaning of this probability and we will debate it at that point.
So, one thing about the Chebyshev inequality, like the Markov inequality: usually it provides a
weak bound. Today, in the theory of sums of random variables, and in particular for this type of
convergence result, there are very, very strong results, exponential results of different types. So
you can sharpen this inequality. Do not worry about why this inequality comes about. The most
important thing about this inequality is that the bound tends to 0; that much at least I know. How
quickly it tends to 0, maybe it can be speeded up, but it does tend to 0.

So, the weak law of large numbers is often written as 𝑋̅ converges to 𝜇, and people will use the
phrase "in probability". These are all technical terms: converges in probability. In this course, we
are not going to go into the details of these things, but just remember this. Hopefully you
understand what a result like this means.

If you can show that the probability that a random variable deviates from some value by more
than something, even something very small, can be driven to 0, then usually people say that the
random variable converges in probability. That is just a technical definition. It is not so important
for us in this course. If at all you do advanced probability courses, these kinds of results will come
in. So, this is the weak law of large numbers.

The way one can think of the law of large numbers: eventually, if you observe something for a
long, long time, the average that you see will be the distribution average. It is like what justifies
many things about the law of averages. Nobody is going to be much better than their average at
doing something. If you have exceeded your average quite often, it turns out bad
days are ahead. So, that is the law of averages, or the weak law of large numbers.
(Refer Slide Time: 21:08)

So, here are some examples. Let us look at some examples. In every case here, the distribution
is known: 𝑛 iid samples, and I give you the distribution. You just plug in the formula and see what
the law of large numbers is saying based on the Chebyshev inequality application. The Chebyshev
inequality needs the variance; the variance alone you have to calculate for the distribution. First
is Bernoulli(𝑝) samples: the mean is 𝑝 and the variance is just 𝑝(1 − 𝑝). So, with probability more
than 1 − 𝑝(1 − 𝑝)/(𝑛𝛿²), the sample mean lies in [𝑝 − 𝛿, 𝑝 + 𝛿].

Next is the uniform discrete distribution on −𝑀 all the way till +𝑀. Here the mean is 0 and the
variance is 𝑀(𝑀 + 1)/3. This is a calculation; I will let you verify this 𝜎² calculation. Then, for
the normal distribution, again the mean is 0 and the variance is 𝜎², and with probability more than
1 − 𝜎²/(𝑛𝛿²), the sample mean lies in [−𝛿, 𝛿].

So, you can see how I am using it. The 𝜇 comes here: 𝜇 is 0 for the normal case, so it becomes
[−𝛿, 𝛿]; 𝜇 is 𝑝 for the Bernoulli case, so it becomes [𝑝 − 𝛿, 𝑝 + 𝛿]; and the actual 𝜎² comes from
the distribution. For Uniform[−𝐴, 𝐴], a continuous distribution, 𝜇 is 0 and 𝜎² is 𝐴²/3. That is the
formula. Just plug it in and you get
the answers. The distribution is known. Remember, the distribution is known.
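The 𝜎² entries quoted here can be verified exactly with a short computation (a sketch of my own; exact rational arithmetic via `fractions` avoids any floating-point doubt):

```python
from fractions import Fraction

# Uniform discrete on {-M, ..., M}: the mean is 0, so the variance is
# E[X^2] = (sum of k^2 over the 2M+1 equally likely values) / (2M+1).
def var_uniform_discrete(M):
    return Fraction(sum(k * k for k in range(-M, M + 1)), 2 * M + 1)

for M in (1, 4, 10):
    assert var_uniform_discrete(M) == Fraction(M * (M + 1), 3)

# Bernoulli(p): E[X] = p and E[X^2] = p, so Var(X) = p - p^2 = p(1 - p).
p = Fraction(3, 10)
assert p - p * p == p * (1 - p)

print("variance formulas verified")
```

The same kind of check works for any distribution whose moments you can enumerate or integrate.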


So, when the distribution is known, you can make a precise statement about your confidence. The
bound can be improved to make your confidence more precise, but the precision is okay: to
whatever extent, we are able to say precisely where I expect my sample mean to be. My
sample mean is, with some confidence, expected to be within some 𝛿 of the actual distribution
mean. When the distribution is known, this is very nice.

(Refer Slide Time: 23:24)

Let us look at the data case. We had the iris, the Taj Mahal and the IPL, three different cases
we saw just a couple of lectures ago or so. So, let us look at the iris data and the sepal length. We
had 50 instances. The sample mean ended up being around 5, and the sample variance is 0.1242.
So, we can say that with probability more than 1 − 𝜎²/(50𝛿²), the sample mean lies in
[𝜇 − 𝛿, 𝜇 + 𝛿]. But what are we assuming when we say that? We are assuming the iid samples
model: that the iris data that came follows some sort of an iid distribution. And this seems to work
if 𝛿 > 𝜎/√50.

If 𝛿 were to be less than 𝜎/√50, the bound would go negative. If it goes negative, "probability is
greater than 0" is the only thing you can say, but that anybody can say. So, you do not want it to
go negative. So, 𝜎/√50 is sort of like the smallest 𝛿 that you can tolerate when
you look at sepal length. So, it is okay. Well, I do not know the real 𝜎. That is
the other problem.

What is 𝜎? What do we use for 𝜎? I do not know the model. I do not know 𝜎. Maybe one can
simply substitute the sample variance for 𝜎². We do not know it; maybe the sample variance is
close to 𝜎². You put it in, you get something. So, this may not be too bad. So, that is that. The
next thing is the Taj Mahal air quality.

Suppose you take PM2.5. We have only 11 data points, the sample mean is 65.72, and the sample
variance is 15.92. With probability more than 1 − 𝜎²/(11𝛿²), the sample mean lies in
[𝜇 − 𝛿, 𝜇 + 𝛿], is all that we are able to say. And it seems to work for 𝛿 greater than 𝜎/√11.
√11 is something like 3 point something, and 𝜎 is maybe around 15 based on what the sample
variance suggests. So, 15 by 3: you are already looking at a 𝛿 of 5.

So, you are going to be off; the best you can do is around 5 and it looks so shaky. You see why
11 is a problem. 50 was better; 11 is a serious issue. So, when the sample variance is around 15.9,
you expect 𝜎/√11 to be around 5, so you are looking at a 𝛿 of at least
5. And even if you do that, the probability bound you get is really weak; the confidence you get
is very small. So, you have to go to a really large 𝛿 before you can make this probability very
small.

So, suppose I want to make this probability 0.95. Let me just do this calculation for you. Suppose
1 − 𝜎²/(11𝛿²) has to be greater than 0.95; that is more than 95% confidence that you will be
within that 𝛿 region. Move terms from one side to the other: 𝜎²/(11𝛿²) < 0.05, which means
𝛿² > 𝜎²/(11 × 0.05). And 1/0.05 is 20.

So, this is the same as 𝛿² > 20𝜎²/11, or roughly 𝛿 > √(20/11) 𝜎. So, notice that the 𝛿 value you
need is becoming greater than 𝜎: about 1.4 times 𝜎. And this is just for 95% confidence. So, you
need a really large 𝛿 before you are able to say something. It is a multiple of 𝜎. So, it is not very
nice.
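The manipulation above can be wrapped in a small helper (my own sketch): solving 1 − 𝜎²/(𝑛𝛿²) ≥ level for 𝛿 gives 𝛿 ≥ 𝜎/√(𝑛(1 − level)), which reproduces the numbers for both the 11-sample and the 50-sample cases discussed in this example.

```python
import math

def chebyshev_delta(sigma, n, level=0.95):
    # smallest delta (per the Chebyshev bound) at which we can claim, with the
    # given confidence level, that the sample mean lies in [mu - d, mu + d]
    return sigma / math.sqrt(n * (1 - level))

# n = 11: delta must exceed about 1.35 sigma, a very wide window
print(round(chebyshev_delta(1.0, 11), 2))
# n = 50: a delta of about 0.63 sigma suffices
print(round(chebyshev_delta(1.0, 50), 2))
```

Passing `sigma=1.0` expresses 𝛿 as a multiple of 𝜎; plug in a real 𝜎 estimate to get an absolute window.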

So, with 𝜎 at, I do not know, 15.9 or something, 𝛿 is already like 20 or 25. So, you can only put
your sample mean within a window of 20 or 25. So, the distribution mean can be as low as 40 or
45, all the way up to 80 or 85 or so. It is a wide window before you can say anything at some sort
of confidence level in this data, of course, assuming iid samples and all that.

So, you see that 11 is a big issue. On the other hand, if you go to the previous case of 50, you will
see it is already much, much better even for this kind of confidence. You can do the same sort of
calculation here: if you want 1 − 𝜎²/(50𝛿²) > 0.95, you get the same sort of calculation. This
implies 𝛿² > 𝜎²/(50 × 0.05), and again you get a square root: 𝛿 > √(2/5) 𝜎. Look
at this. This goes less than 𝜎.

Even for 95% confidence, you get quite a small 𝛿. If 𝜎 is, I do not know, 0.3
or something, then you really get a very small 𝛿, and you can have 95% confidence about that
data even with a bound as weak as the Chebyshev bound. But look at how large you had to go
with 𝛿 in the 11-sample case before you could have some sort of confidence of the sample mean
lying within a large enough interval.

So, think of this example. This sort of brings out why the number of samples is very, very
important. At least, assuming you are in the iid samples domain, assuming things are okay, you
notice how much easier it is to work with 50 samples as opposed to 11 samples. With 11 samples,
to get 95% confidence, you have a huge window, 40 to 80 or something; only within that window
can you have 95% confidence.

But with 50 samples, you are getting a much smaller window: √(2/5) times whatever 𝜎 is, which
is quite small. So, whatever the 𝜎 is in that problem, you get that. It seems fine enough. So, the
numbers are quite important. Next are the runs scored in delivery 0.3: 𝑛 is 1598, the sample mean
is 0.9524, and you can be really, really, really confident about it. If you want 95% confidence,
you work out what 𝛿 would be, and you get a very small 𝛿. You can really be quite confident
about where your sample mean is.

But still, all this is hinging on some assumptions. This is actual data, and you are assuming an iid
samples type of model; you have to justify that in some way, or at least intuitively it should
make sense to you. And next is the value of 𝜎²: even that you have to estimate
from the data that is available. You can do that. But nevertheless, when the number of samples
increases, your confidence level seems to be quite okay. So, you are able to predict that the sample
mean is reasonably within a small interval of the distribution mean.

So, hopefully, this gave you a sense of where we are heading with this kind of analysis. We are
deriving statistics from samples. We are not stopping there. We are trying to assess how good
our statistics are: how well we predict distribution properties from the statistics we
collect from the samples. And what are the kinds of calculations involved? Hopefully, this lecture
gave you a reasonable idea of how to go about doing it. We will come back to this. This kind of
thing is called a confidence interval.

We will study this in much more detail later, but even now you can already see where we are sort
of headed in terms of the statistical study of phenomena. Thank you very much. I will see you in
the next lecture.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.5
Statistics from samples and limit theorems:
Concentration Phenomenon

(Refer Slide Time: 00:13)

Hello everyone, and welcome to this lecture. We are going to carry on and look more
closely at functions of, or sums of, independent random variables. This is part 2 of this
topic. In the previous lecture, we saw how the weak law of large numbers kicked in:
using the simple Chebyshev inequality, we could come up with some nice bounds for the
sample mean. We had some comparisons, and we looked at how a larger number of
samples means something better about the properties of the distribution that we derive
from iid samples. Something becomes better about them as you get more and more
samples.

So, there was a theme to those kinds of ideas, and we will build on those ideas in this
lecture. We will see sharper results. There are two particular results that we will see. One
is what is called the concentration phenomenon. I mentioned how the Chebyshev result is
weak and could be tightened; we will do something called concentration to do that, and it
will be a quick result. The next result we will see is something called the central limit
theorem, a very celebrated result. Both of these we will do in reasonable detail over the
next few lectures. So, let us get started.
(Refer Slide Time: 01:30)

The first thing we will do is this concentration phenomenon. The basic idea behind
looking at concentration or anything like that is to try and bound the probability that the
sample mean deviates from the distribution mean by more than some 𝛿. This is sort of
the central idea behind what is called the concentration phenomenon. So, let us see once
again what the main setting is.

The setting is: you have 𝑛 iid samples from a certain distribution 𝑋. This is our usual
setting. And then we define our sample mean, which is 𝑋̅ = (𝑋1 + ⋯ + 𝑋𝑛 )/𝑛; I am
keeping the division by 𝑛 here for now. We have seen this Chebyshev bound before,
which says that the probability that the sample mean deviates by more than 𝛿 from the
distribution mean 𝜇 satisfies 𝑃(|𝑋̅ − 𝜇| > 𝛿) ≤ 𝜎²/(𝑛𝛿²). Here 𝜇 and 𝜎² come from the
distribution of 𝑋 itself. I guess that is clear from the context, but anyway, nevertheless,
let me just write it down once again for you.

𝜇 = 𝐸[𝑋] and 𝜎² = Var(𝑋). We will assume both of these are finite in these kinds of
examples. So, the probability that the sample mean deviates by more than 𝛿 from the
distribution mean is bounded by something that falls as 1/𝑛: there is a 𝜎²/𝛿² factor, but
the bound falls as 1/𝑛. And this was enough to show the fantastic weak law of large
numbers, which says that 𝑋̅ generally gets very, very close to 𝜇; it is almost
indistinguishable from 𝜇 as you make 𝑛 very, very, very large. So, that was the result of
the weak law of large numbers.
(Refer Slide Time: 03:04)

So, let us see how weak this Chebyshev inequality is. We will take 𝑋 to be Bernoulli(1/2),
which has 𝜇 = 0.5 and 𝜎² = 0.25. Bernoulli(1/2) takes the values 0 and 1 with probability
half each. 𝐸[𝑋] = 1/2 and 𝐸[𝑋²] = 1/2, so the mean and variance turn out to be 0.5 and
0.25. You can do this calculation; it is not very hard.

So now, for 𝑛 = 10, one can do this calculation: 𝑃(|𝑋̅ − 0.5| > 0.3) = 0.0215. How do you
do this kind of calculation? The key observation is that the sum 𝑋1 + ⋯ + 𝑋𝑛 is
Binomial(𝑛, 1/2). Since 𝑋̅ has a division by 𝑛 in it, the range for 𝑋̅ needs to be multiplied
by 𝑛 to get a range for the sum.

So, there are two ranges here: for 𝑛 = 10, the event |𝑋̅ − 0.5| > 0.3 happens when
𝑋1 + ⋯ + 𝑋10 > 10 × 0.8 = 8, or when 𝑋1 + ⋯ + 𝑋10 < 10 × 0.2 = 2.

So, check this out; this should be correct. This is the probability that a Binomial(10, 1/2)
random variable is either greater than 8 or less than 2. So, I can do an exact computation,
and that is what I have done here to get 0.0215. And you can also plug in the Chebyshev
inequality: put 𝛿 = 0.3, 𝜎² = 0.25 and 𝑛 = 10, and look at the number you get: 0.278. The
actual probability is 0.0215, and the upper bound is 0.278; it is off by an order of
magnitude.
What is worse is what happens as 𝑛 increases. For 𝑛 = 50, consider the same probability,
𝑃(|𝑋̅ − 0.5| > 0.3). Now 𝑛 has gone to 50, so the same probability works out as the
probability that 𝑋1 + ⋯ + 𝑋50 > 50 × 0.8 = 40 or 𝑋1 + ⋯ + 𝑋50 < 50 × 0.2 = 10. This
also can be evaluated. You can put it into a computer program; you can take your Python
notebook, look at the CDF of the binomial, etcetera.

So, this is Binomial(50, 1/2), and look at the number here: it is about 5.6 × 10⁻⁶. It is so
small. It drops dramatically from 𝑛 = 10 to 𝑛 = 50. The probability that the sample mean
deviates from 0.5 by more than 0.3 falls dramatically here.

But notice how the Chebyshev bound behaves. The Chebyshev bound is also falling, but
it is not falling as much: when 𝑛 goes up 5 times, the bound falls only by a factor of 5,
which is what you expect; it is just 1/𝑛 behavior. The actual probability, meanwhile,
shows a very huge fall, an exponential kind of fall: it is falling very, very sharply while
the Chebyshev bound is not keeping up.

So, the Chebyshev bound is weak for this reason. You can repeat this experiment with
other parameters and other distributions; you will see the Chebyshev bound is predicting
something weak here.

(Refer Slide Time: 07:40)

So, the Chebyshev bound falls as 1/n. In many cases, the actual probability will fall as some e^(−cn), so it will be like that. And you may remember from your Maths 1 lectures, when you look at the functions 1/x and e^(−x), we did some comparisons long, long back: how 1/n is really, really slow, while e^(−cn) drops very sharply, so that 1/n can never keep up with it.

So, exponential fall with n is much, much faster. And it looks like, if we do a bit more work, we can improve on the Chebyshev bound. So, you should be able to get a better bound; some exponential bounds may be possible, at least for the binomial case it looks very much possible. And that is sort of the concentration phenomenon. So, how do we improve these kinds of bounds?

Chebyshev is already a bit of a concentration result, but whether we can get better bounds is one line of study in probability, and that is called concentration. So, let us see one very interesting way in which you can make this better.

(Refer Slide Time: 08:40)

So now the title is Markov, Chebyshev and Chernoff, and you see a lot of Russian-sounding names involved. It is like this Russian mafia is against you. I know, I know how you feel. I think the name itself strikes some fear; you feel like it is going to gun you down. It is actually not that difficult. We have already seen Markov and Chebyshev. It is actually the same thing.

So, the same Markov, you use with some small modifications to get the Chebyshev inequality. So, I am trying to just drive home that point here. The Markov inequality applies for a random variable that takes non-negative values, and it gives you that bound: P(X > t) ≤ E[X]/t.

Now for an arbitrary random variable, how do we apply Markov? We do not apply it directly. We take this function X − E[X], and then take its absolute value, or you can think of squaring that value. So, now (X − E[X])² will take only non-negative values. So,

P(|X − E[X]| > t) = P((X − E[X])² > t²) ≤ E[(X − E[X])²]/t² = Var(X)/t².

So, how did I get Var(X)/t²? This step is just Markov, applied to (X − E[X])², and this quantity E[(X − E[X])²] is Var(X). So, the same Markov applied in a slightly different way gives you Chebyshev. So, it is not that difficult, except people have this fear when they see these names, and they already conclude that they cannot understand, so they do not spend the time. Look at it very carefully. It is actually quite an easy application.

Now, squaring is not the only way to get positive values. You can use other functions. So, you took a random variable X. We did (X − E[X])². Can we do something else? Can we use some other function, like the exponential function? We know that e^x is also always positive. You remember the plot of e^x: it is always positive, it never touches 0. So, can we use e^{λX} instead of squaring? Squaring is good, but can we use e^{λX} to get positive values, and then what happens? That gives you the Chernoff method, the Chernoff inequality.

(Refer Slide Time: 10:47)

So, here we will restrict X such that E[X] = 0. Now this is not a very serious restriction. If you remember one of my earlier lectures, I told you how, if you have an arbitrary random variable, you can always centralize it, sort of make its mean 0. You can translate it to make its mean 0. How will you do it? Instead of X, you look at X − E[X].

X − E[X] always has mean 0, so you can do that. So, this is not a serious restriction, restricting X to E[X] = 0. You simply translate to make it 0. So then, I can simply look at P(X > t), because I know E[X] = 0, so subtracting the mean is not needed.

Now, I will pick a λ which is positive, some λ. It could be 1, 2, 3, 4; you can think of it as 1 if it is confusing you, but some λ you can pick, and then use e^(·) on this inequality. So, if X > t, that is the same as e^{λX} > e^{λt}.

So previously, we squared both sides to get a positive random variable. Now, instead of squaring, we apply e^(·) on both sides. Now, once you do e^{λX}, this takes only positive values. Why is that? Because e^x is positive; it does not go below 0 at all. So, now we use Markov. So, notice how we go back and back and back once again to Markov. Markov was really the only inequality.

You transform the random variable using these non-negative functions, like squaring or exponentiating to get e^{λX}, and then you again use Markov. So, P(X > t) = P(e^{λX} > e^{λt}) ≤ E[e^{λX}]/e^{λt}. Now, previously we had the variance being the expected value here, so Chebyshev came in terms of that. Now we have this other quantity, E[e^{λX}]. So, this seems to be a naturally interesting quantity in these kinds of settings, and it gets divided by e^{λt}.

So, there is this λ floating around. You might think, why should this λ be there? Why can we not just set λ = 1? It turns out you have something to gain by keeping that λ there; we will see that later. So, it is fine. Any function which is non-negative I can use, and e^{λX} is perfectly fine. It is a positive function, so I can use it. So, I have put λ > 0 here; we will come back and revisit that range later. So, this is the central idea behind this Chernoff inequality.
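If it helps to see the whole chain in one place, here is a compact summary of the three inequalities in the lecture's notation (my own summary, not a slide from the course):

```latex
\begin{align*}
  \text{Markov } (X \ge 0):\quad
    & P(X > t) \;\le\; \frac{E[X]}{t} \\[4pt]
  \text{Chebyshev (Markov applied to } (X - E[X])^2\text{):}\quad
    & P(|X - E[X]| > t) \;\le\; \frac{\mathrm{Var}(X)}{t^2} \\[4pt]
  \text{Chernoff } (E[X] = 0,\ \lambda > 0\text{; Markov applied to } e^{\lambda X}\text{):}\quad
    & P(X > t) \;\le\; \frac{E[e^{\lambda X}]}{e^{\lambda t}}
\end{align*}
```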

Once you have a Chernoff inequality, you see that, like the variance before, this quantity E[e^{λX}] (with E[X] = 0 here; keep that in mind also) seems to play a central role. In fact, it has a name: this function is called the moment generating function of X, MGF(X); it is a very popular abbreviation. It is a very powerful, useful quantity. I hope to show you how this is very useful; as we study the central limit theorem and all that, this will come up again and again.

And you can see the natural way in which it is showing up. It shows up in this bounding of the deviation of X from its mean. So, that is a good thing to know. So, that is the MGF. And how do you pick λ? You pick λ to get the best possible upper bound, the least possible upper bound. For every value of λ, you will get a different upper bound, and you minimize that over λ. So, you will see, we will do that to get the best possible upper bound.

Now, if you remember from your calculations, finding the expected value, finding the variance and all that involves a lot of computation over the random variable. Now, look at this expectation, E[e^{λX}]. You can expect some very unwieldy expressions to show up when you compute E[e^{λX}]. A lot of the trick is in simplifying those expressions. And one very standard way of simplifying that expression is to use an upper bound instead of the exact expression. You will see this is a very powerful method that is used again to simplify our work.

So, the next few slides are going to be a bit dense in terms of the mathematical methods used,
but I will urge you to just stay with me. Do not worry too much if you do not understand
every step of the mechanics and the mathematics that is going on, but finally we will get to a
result, and that result you have to sort of appreciate and understand. So, stay with me. And
when it comes to the result, I will pause and point out the important things about the result.
So, let us go ahead.
(Refer Slide Time: 15:25)

So, let me begin with an example. We saw the MGF and the bound and everything that is coming; let us see an example to make our ideas clear. First is this centralizing. Supposing you have a random variable X with some mean E[X], how do I make the mean 0? Instead of X, you consider X − E[X]. It is just a translated version of X. It has most of the distribution properties intact, so you translate and you get a mean of 0.

Now this is a very important idea, even when you do data science. When some data set is given to you and you want to analyze how it looks, one of the standard things people do is to first centralize it, make its mean 0, because you do not want to be distracted by the mean. The mean can be anywhere; you pull it down to 0 and then see the distribution around 0. So, that is a very powerful idea. It is used quite popularly when you handle data also.

So, let us look at centralized Bernoulli(1/2). Now, Bernoulli(1/2) takes values 0 and 1 with probability half each, so its mean is half. Now, when you centralize, I am going to take the Bernoulli and subtract the half from it. So, my X is going to be centralized Bernoulli(1/2): it takes values −1/2 and 1/2 with probability half each, so this has got mean 0. Very simple. Nothing to be worried about. The Russian mafia is not yet involved; it is an easy enough exercise to do this centralization.

So, now we are ready to calculate the moment generating function of the centralized Bernoulli random variable: E[e^{λX}]. It is just a function of X; there is nothing to be alarmed about, as far as the definition is concerned. How do you find the expected value for the centralized Bernoulli? E[e^{λX}] = P(X = −1/2)·e^{−λ/2} + P(X = 1/2)·e^{λ/2}. You just put X = 1/2 and X = −1/2, substitute the probabilities, the standard way in which you compute the expectation of any function. E[g(X)], we have seen that before: the summation over the probability that X takes a particular value, times g of that value. That is it, that is all I have done here. Because these two probabilities are half each, it works out to this function:

M_X(λ) = (e^{λ/2} + e^{−λ/2})/2.

You can plot this function. You can see that this function has an interesting U-type shape and all that. You can try to plot it; we have done these kinds of plots in your maths course. Go back and remember what you did, how to plot these kinds of functions. But this is still a bit unwieldy, and what one can do is use this very interesting bound on this function. We are not going to see a proof of this bound, but you can plot the two things and see how one lies above the other.

I have spoken about Desmos being a good plotting engine. There are other plotting engines available; in your Python notebook you can use plotting if you like. But plot it, just take a look at it, and convince yourself that this bound is true. So, here is the bound: the moment generating function of the centralized Bernoulli has a very simple and nice upper bound, M_X(λ) ≤ e^{λ²/4}.

Keep that in mind; it will come back and help us as we study more and more things. A simple example, and you see how this ties in with the previous results that we saw. So, this gives us some hope of being able to use the Chernoff bound with Bernoulli random variables.
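You can also convince yourself of this bound numerically. Here is a quick check in Python (a sketch of mine: it just evaluates both sides on a grid of λ values):

```python
from math import exp

def mgf_centered_bernoulli_half(lam):
    # E[e^{lam*X}] for X = +/- 1/2 with probability 1/2 each
    return 0.5 * exp(lam / 2) + 0.5 * exp(-lam / 2)

# verify  M_X(lam) <= e^{lam^2/4}  on a grid of lam values from -10 to 10
for i in range(-50, 51):
    lam = i / 5.0
    assert mgf_centered_bernoulli_half(lam) <= exp(lam**2 / 4)
print("bound holds at all grid points")
```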
(Refer Slide Time: 18:49)

So, now, another wonderful, wonderful property of the moment generating function is that it plays very well with addition of independent random variables. So, you see the MGF is E[e^{λX}]. Now what I am going to do is take X1, …, Xn to be iid samples, distributed according to the distribution of X, and then I am going to define the sum of these random variables, S = X1 + X2 + ⋯ + Xn, and then ask the question: what is the moment generating function of S?

So, you might say, why is this important? Why do I need to care about the MGF of the sum? It turns out, if you remember, when we look at the sample mean and all that, we are worried about how the sum deviates from its mean. So, it is good to know the moment generating function, because the moment generating function gives us bounds on the deviation of the random variable from its mean; so the moment generating function of the sum is good to have.

Now it turns out sums of independent random variables and moment generating functions were made for each other. They work so well together, and you can see why; it is not very hard to see. E[e^{λS}] = E[e^{λ(X1 + X2 + ⋯ + Xn)}] = E[e^{λX1}·e^{λX2}⋯e^{λXn}]. Why? Because instead of S you put X1 + X2 + ⋯ + Xn, and then you get e to the power λX1 + λX2 + ⋯ + λXn, which you can write as a product.

Now, this is the expected value of a product of independent random variables, and that becomes equal to the product of the individual expected values. So, this expectation and the product can be interchanged if the random variables are independent. If they are not independent, you cannot do it. If they are independent, like in this case, you can push the expected value inside. So that is one thing: E[e^{λS}] = E[e^{λX1}]·E[e^{λX2}]⋯E[e^{λXn}].

And this is very nice, but notice another thing: all these Xi's are identically distributed. So, each of these E[e^{λXi}] is the same as M_X(λ); as a function of λ it is just the same thing. Nothing changes whether you put X1 or X2, because all of them have the identical distribution.

So, M_S(λ) for the sum of iid random variables is simply the individual MGF of the distribution of X raised to the power n: M_S(λ) = (M_X(λ))^n. A very, very simple and tidy result about the MGF. So, the MGF multiplies when you add independent random variables. A very useful, simple and elegant result, and you see how the MGF is so powerful and why it naturally enters the picture when we study samples, and iid samples in particular. A wonderful, powerful result. Let us use it and see how it works.
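Here is a tiny numerical confirmation of this multiplication property (my own sketch: for n = 3 centralized Bernoulli(1/2) samples, enumerate all 2³ equally likely outcomes and compare with M_X(λ)³):

```python
from itertools import product
from math import exp, isclose

def mgf_X(lam):
    # MGF of one centralized Bernoulli(1/2) sample: values +/- 1/2, prob 1/2 each
    return 0.5 * exp(lam / 2) + 0.5 * exp(-lam / 2)

def mgf_sum_enumerated(lam, n):
    # E[e^{lam*(X1+...+Xn)}] by brute-force enumeration of all 2^n outcomes
    return sum(exp(lam * sum(w)) / 2**n
               for w in product([-0.5, 0.5], repeat=n))

for lam in [0.5, 1.0, 2.0]:
    assert isclose(mgf_sum_enumerated(lam, 3), mgf_X(lam)**3)
print("M_S = M_X^n checked for n = 3")
```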

(Refer Slide Time: 21:42)

Let us take the sum of centralized Bernoulli random variables. We have already seen that if X has the centralized Bernoulli(1/2) distribution, taking values −1/2 and 1/2 with probability half each, and you take iid samples distributed according to X, you get iid centralized Bernoulli samples. And if I take the sum S = X1 + ⋯ + Xn, I can now write down M_S(λ). It is exactly going to be ((e^{λ/2} + e^{−λ/2})/2)^n.

Now you see the comfort of the upper bound. Previously, if you just took the summation here and raised it to the power n, you would get so many terms in the expression. But once you put the bound of e^{λ²/4}, the exponentiation by n simply gives you M_S(λ) ≤ e^{nλ²/4}. Look at the powerful way in which things are bounded, and how cleanly the bound works with the MGF.

So, this bound is really, really useful. We will use it in our next step, and we will see that this kind of expression, being able to bound the MGF of a sum by an exponential raised to the power n, is very central in the concentration phenomenon approach.

(Refer Slide Time: 22:51)

Now we are ready to see the Chernoff bound. You remember the idea: it is just the Markov inequality applied to e^{λS} and e^{λt}, and we had the MGF show up. And once you have a good MGF, you have a good Chernoff bound. That is what we are going to see for the binomial.

So, I will start with X1, …, Xn being centralized Bernoulli(1/2), and I look at the sum S = X1 + ⋯ + Xn. Now, this is Chernoff, this is the basic Chernoff method:

P(S > t) ≤ E[e^{λS}]/e^{λt} ≤ e^{nλ²/4 − λt}.

Now, what do I know? I know a bound for the moment generating function of S. Look at how neat this expression is. So, I am able to upper bound the probability that the sum deviates by t by an exponential in n. That is what I wanted. Chebyshev was not that good, it was going as 1/n, while my actual probability was going down exponentially. Can I get an exponential bound? Here is an exponential bound.

The only thing that is left here is how to choose λ. Now, how do we choose λ? You have seen all of this before; it is not very hard. Remember this: what type of a function is in the exponent? It is a quadratic function, is it not? We have seen all this before, except that you might have forgotten; this was done in Maths 1. You might say, "I wrote the exam, can I not forget about it?" No, no, Maths 1 is very important; it will come up throughout your life. So, this exponent nλ²/4 − λt is a quadratic in λ.

So now, I am saying P(S > t) is less than or equal to e to the power of something, and I want to pick λ to have the lowest possible bound. Now, to minimize e to the power of something, what should I do? e to the power is just an increasing function, so to minimize it, I have to minimize whatever is there in that exponent.

So, what is there in that exponent? It is a quadratic function of λ. And you know how to minimize quadratic functions. You have plotted quadratic functions, and you go through and find the λ, and that is exactly what you have to do here. You have to find that value of λ which minimizes this quadratic, and it is easy enough to do; it turns out 2t/n is that best possible value of λ.
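Written out, the minimization over λ is just this one calculus step (my summary, using the exponent from the Chernoff expression above):

```latex
g(\lambda) \;=\; \frac{n\lambda^2}{4} - \lambda t,
\qquad
g'(\lambda) \;=\; \frac{n\lambda}{2} - t \;=\; 0
\;\;\Rightarrow\;\;
\lambda^* \;=\; \frac{2t}{n},
\qquad
g(\lambda^*) \;=\; \frac{t^2}{n} - \frac{2t^2}{n} \;=\; -\frac{t^2}{n}.
```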

You plug it in here, and you get this fantastic, fantastic little bound. And look at how very neatly it gets expressed in terms of n and t, with an exponential involved: P(S > t) ≤ e^{−t²/n}. So, now let us see how to compare it with Chebyshev and all that. For that, I have to go back to Bernoulli.

So, remember the one difference: I am working with centralized Bernoulli. I centralized it for convenience because I did not want to deal with a non-zero mean and all that, but now I have to go back to that non-zero mean, so that we can actually compare with the binomial. Now, instead of X1, if I add a half to X1, I get a Bernoulli: Y1 = X1 + 1/2 is Bernoulli(1/2). Now actually I want a sum of Bernoullis so that I get a binomial. So, I want Y = Y1 + ⋯ + Yn, but that is the same as S + n/2, is it not? It is the same, and that becomes Binomial(n, 1/2).

So now, I want the probability that my binomial is greater than n/2 + δ(n/2). Here n/2 is E[Y], and δ(n/2) is δ·E[Y]. So, I want my binomial Y to be away from E[Y] by δ·E[Y]. This is what we have seen before. If you divide through by n, you see this is the same as P(Y/n − 1/2 > δ/2), in case you are wondering where I am coming from.

So, this is what I am seeing here. This is the same as what we had before: the probability that the sample mean of the Bernoulli samples, Y/n, deviates from 1/2, the actual distribution mean, by more than δ/2. And that is the same as P(S > δn/2). Why? Because Y − n/2 = S.

So, it is just a simple manipulation, and now I can just plug in. I have the Chernoff inequality here; I just use it with t = δn/2, and I get what I want. So, notice this wonderful bound here: P(Y > (n/2)(1 + δ)) ≤ e^{−nδ²/4}.

(Refer Slide Time: 28:57)

So, that is an improvement over the Chebyshev inequality. Compare with Chebyshev: P(|Y/n − 1/2| > δ/2) ≤ 1/(nδ²). Of course, Chebyshev has the absolute value; it does both sides, while this is only one side. That will only change things by a factor of a half or so, so it is okay; this is a good enough comparison.

So, notice the comparison. This goes as 1/n; that goes as e^{−nδ²/4}, which is what we wanted. So once again, if you did not follow every step in the mathematical derivation, it is okay, but the point is this: the Chebyshev inequality told you that the sample mean does not deviate too much from the distribution mean, and the probability that it deviates too much falls as 1/n. In fact, some simple calculations for the binomial distribution (this is for the binomial distribution) show us that an exponential probability bound is possible.

So, this is the so-called concentration phenomenon. You can use it for a specific distribution when you know the distribution very well, or you know the class of distributions very well. And really, the probability that the sample mean deviates from the distribution mean is going to fall very sharply; that is the moral of the story.

(Refer Slide Time: 29:39)

Let us compare Chebyshev and Chernoff for a binomial distribution. I have taken n = 10, 50, 100, 200, 400. Look at the last column: the actual probability falls off exponentially. Chebyshev is like 0.007, and look at the Chernoff bound e^{−nδ²/4}. It is not very close to the actual probability, but at least it gives you the correct sort of behavior: it says that the fall is exponential. So, this sort of convinces you that the Chernoff bound is so much better than the Chebyshev bound.

So, 1/n versus e^{−nδ²/4}: you see the big difference. So, the moral of the story here is that the concentration of the sample mean around the distribution mean is really, really high. Mostly it is going to be around the distribution mean; it can deviate a little bit, but not too much. When you have 400 samples, 800 samples and all that, it is going to be really, really close. When you have 100 samples, or even 50 samples, you can see some differences, but as you keep increasing the samples, you will not see too much of a difference. So, that is the moral of the story.
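If you want to regenerate a table like this in your Python notebook, here is a sketch (the value δ = 0.3 and the exact cutoff convention are my illustrative choices, not necessarily the slide's):

```python
from math import comb, exp

def upper_tail(n, threshold):
    # exact P(Y > threshold) for Y ~ Binomial(n, 1/2)
    lo = int(threshold) + 1          # smallest integer strictly above threshold
    return sum(comb(n, k) for k in range(lo, n + 1)) / 2**n

delta = 0.3   # illustrative deviation parameter
for n in [10, 50, 100, 200, 400]:
    exact = upper_tail(n, (n / 2) * (1 + delta))
    chernoff = exp(-n * delta**2 / 4)    # exponential Chernoff-style bound
    chebyshev = 1 / (n * delta**2)       # 1/n Chebyshev-style bound
    print(f"n={n:4d}: exact={exact:.2e}  "
          f"Chernoff={chernoff:.2e}  Chebyshev={chebyshev:.2e}")
```

Running it, you can watch the exact probability and the exponential bound race ahead of the 1/n bound as n grows.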

(Refer Slide Time: 30:44)


So, I want to make a few remarks on this concentration phenomenon. It is usually considered a slightly advanced subject; mostly in basic courses people do not see it, but I think you should know about this result. There is a lot of philosophy and importance here in what this means.

So, let us start with what the setting is. The setting is, you have n iid samples according to a distribution X, and you are looking at the sum Y of these n iid random variables. The concentration phenomenon generally proceeds as follows. You try to get exponential bounds for the probability that Y deviates from its expected value by more than t. In some sense I am expecting my Y to be close to its expected value, to be concentrated around its expected value.

I am not expecting Y to take values too far from the expected value of Y when Y is summed up like this from iid samples, so I can get sharp bounds. And one of the very powerful methods in the concentration phenomenon is to look at E[e^{λY}] and use the Chernoff method for bounding. It is always a starting point, and people use very clever bounding methods to get better and better results. So, this is very important to know.

So, we saw this when X is Bernoulli. You may say, what if X is not Bernoulli? But before that, what about the other side? We saw Y > E[Y] + t; what about Y < E[Y] − t? It turns out the standard trick here is, instead of looking at Y, to look at −Y, and then you again use the same bounding method. It seems like a bit of trickery, but that is what happens.

What about other distributions which are not Bernoulli? Many, many extensions exist. This is a very strong research area; it is still active, and a lot of people keep improving these bounds. There is something called Hoeffding's inequality, Bennett's inequality. You can see it has gone way beyond the Russian mafia; the whole world is involved now. Everybody is doing work on this.

If your random variable is bounded, then this exponential concentration applies. If it has finite variance, other sharp results apply, etc. So, there are tons of work being done on improving this. Now, here is what is more important to me, why I wanted to introduce concentration even in this first course, with a short lecture describing why it matters: when you deal with data, concentration becomes a very, very powerful aid for you to think about what is happening. This is good intuition. So, notice what has happened here from a very high level.

So, there are n iid random variables according to some distribution, and you added up all of them. Now, you might say, what if you have other functions? So generally, when you look at a complicated enough statistical phenomenon, maybe the rainfall, or a drought, or the production of a crop, or take the IPL score for instance, whatever complicated statistical phenomenon you take, the actual number that you get will be a function of several underlying independent random variables.

You can always think of it like that. There will be so many factors; in data science parlance, they call them factors. There will be so many factors, so many random variables, in your phenomenon, and the final phenomenon of interest to you will be some complicated function of all of these. Now, what does concentration tell you? It is this really, really powerful thing: concentration happens if your function f sort of depends equally on all the variables.

So, all this is very vague; I am being vague here. If you want to study this more deeply, there is lots of deep study you can do, but notice the sum. The sum is a very clear example of one such function. A sum depends equally on every variable; it does not depend unduly on any one variable, it depends equally well on all variables. So, some sort of result like that must be true.

So, if you have a function of several independent random variables, and this function depends sort of equally on all of them, then this function sort of amalgamates: it takes a lot of independent things together, and finally something comes out. Any time something comes out like that, you have concentration. Any time you have a function which depends on a lot of factors equally, then you will have concentration. Concentration meaning: the expected value of that random variable will be very indicative of how the distribution is; the distribution is around the expected value.

Now that is a good insight into data, is it not? When you collect data about an event, what you want to do is think about that particular column of data, and then think about what it is a function of. What are all the factors involved that came up with this value? If it is rainfall, then you are thinking of wind and moisture content and pressure and temperature; so many factors are there. What happened somewhere in the Arabian Sea a few days back, what happened somewhere in the Bay of Bengal a few days back in different areas: the amount of rain that a particular city gets on a particular day is actually a function of all of that.

And if all of these are independent (are they independent? Maybe not, I do not know), but if all of these are independent, and your final number depends sort of equally on all these independent things, it turns out that the average of the rainfall is a good indicator of how the distribution is; the entire distribution concentrates around the average.

So, if you are looking at particular data, and you see that it is actually a function of several different independent random variables, then you can expect it to concentrate around its average, and averages, expectations and all of that make a lot of sense. So, this is good intuition to have.

So finally, even if all the math that we did was not crystal clear to you, it is okay; keep working on it. But this final takeaway is very, very important, because when you look at data, this is a question you want to keep asking all the time. Can I expect that this column of data is going to behave such that the distribution is close to its average, or is it going to be far away? Do I need a lot more samples? Do I need a lot fewer samples? All of these can be answered with this kind of intuition. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.6
Central Limit Theorem

(Refer Slide Time: 00:13)

Hello, and welcome to this lecture. This lecture is about one of the most celebrated results of probability theory; it is called the Central Limit Theorem. You might have heard of it somewhere: the bell-shaped curve, where it comes from, and how the bell is everywhere. Everywhere you look, you have this bell-shaped curve in the distribution, etcetera. The Central Limit Theorem is behind all of that. It is a very classic, celebrated result, and it is important in any probability course to understand it cleanly.

And you will see that the moment generating function we described plays a central role; it ties up with this very well. There is a little bit more analytical wizardry involved there. We will not spend too much time on that, but I will at least point out where this comes from. So, it is a very powerful result.

You will see, finally, that when we wanted to look at the binomial distribution, at the probabilities of Y deviating from its expected value, the binomial random variable deviating from its expected value, we had the Chebyshev bound, we showed the Chernoff bound, and none of them were really close to the actual probability. You will see that the Central Limit Theorem gives you a very good approximation, for the binomial particularly. So, it is a very useful, powerful result. I will make more remarks on what it means, etcetera, but let us jump into it.

(Refer Slide Time: 01:34)

So, let me define this moment generating function. I did refer to it before; we have seen it before, but I want to talk a little bit more about it and get started in a slightly more formal fashion with the moment generating function. So, let us start with a zero mean random variable. Once again, this is not a big issue. Supposing you have a random variable which is not zero mean, what do you do? You simply subtract the expected value and you get zero mean.

So, you go to that, and translating does not change anything; you can easily go back and forth between the original and the translated one. So, the MGF of X is denoted M_X(λ); I will use this notation. It is a function from the real numbers to the real numbers, a map from R to R, and it is defined as M_X(λ) = E[e^{λX}].

Now convince yourself that this is a function from the real numbers to the real numbers. λ is a real number. Given a particular value for λ, be it 1, 2, 1.5, 3.5, π², e², whatever that value of λ is, it is a real number, and I can just compute E[e^{λX}]. X has some distribution, but E[e^{λX}] is going to give me a number. So, this is a mapping from real numbers to real numbers, so it is a function, and it has that nice-looking form E[e^{λX}].

So, now supposing X is discrete. It has a PMF f_X, and you know the PMF; usually, we assume that. So, basically X will take values x1, x2 and so on. What is the probability it takes value x1? f_X(x1). What is the probability it takes value x2? f_X(x2). So, what is M_X(λ)? It is the expected value of e^{λX}. So, it is e^{λx1} times the probability that X takes value x1, which is f_X(x1), plus e^{λx2} times the probability that it takes the value x2, and so on:

M_X(λ) = e^{λx1} f_X(x1) + e^{λx2} f_X(x2) + ⋯

Very simple, is it not? I mean, it looks a bit more complicated in notation, but when you actually break it down and look at what's going on, it is a very, very simple calculation.
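As a concrete sketch of that calculation, the discrete formula is just a one-line sum over the PMF (the function name and dictionary representation are mine):

```python
from math import exp

def mgf_discrete(pmf, lam):
    # M_X(lam) = sum over k of e^{lam*x_k} * f_X(x_k) for a discrete X
    return sum(exp(lam * x) * p for x, p in pmf.items())

# example: the centralized Bernoulli(1/2) from earlier, values +/- 1/2
pmf = {-0.5: 0.5, 0.5: 0.5}
print(mgf_discrete(pmf, 1.0))   # (e^{1/2} + e^{-1/2})/2 = cosh(1/2)
```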

Now it turns out it can also be used for continuous distributions. We have seen these continuous distributions before: you have a density f_X and some support T_X. You can again define M_X(λ); it is just the expectation, except that the expectation is not a summation anymore. You remember, when you go from discrete to continuous, the expectation becomes an integration. We have to integrate over the range:

M_X(λ) = ∫ e^{λx} f_X(x) dx, integrating over x. (I had made a mistake on the slide here; apologies for that.)

So, it is the density times e^{λx}, integrated dx. If you want a picture for this continuous case: this is your f_X(x), and you have the curve e^{λx}; let us say λ is positive, so e^{λx} is increasing. So, you multiply these two and integrate. Multiply and integrate, that is all.

Numerically there is nothing difficult in this. For different λ you will get a different e^{λx}. Of course, if λ is negative, the curve will go the other way, but it does not matter; f_X(x) stays the same, and you just multiply these two functions and integrate.

So, it is a very simple definition in some sense. If you have a numerical tool, you can do it. If you know enough integration, you can also get closed-form expressions for these integrals and all that. We will not spend too much time on this, but I want you to know that the MGF is a simple calculation involving the exponential function; it is just an expectation calculation, so nothing to be very alarmed about.
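For the continuous case, here is a small numerical sketch. The example distribution is my choice, X ~ Uniform(−1/2, 1/2), whose MGF has the known closed form sinh(λ/2)/(λ/2); the midpoint-rule integration just shows that "multiply and integrate" is all there is:

```python
from math import exp, sinh, isclose

def mgf_uniform_numeric(lam, a=-0.5, b=0.5, steps=100_000):
    # E[e^{lam*X}] for X ~ Uniform(a, b): midpoint-rule integral of e^{lam*x} f_X(x)
    density = 1.0 / (b - a)
    dx = (b - a) / steps
    return sum(exp(lam * (a + (i + 0.5) * dx)) * density * dx
               for i in range(steps))

lam = 1.3
closed_form = sinh(lam / 2) / (lam / 2)   # known MGF of Uniform(-1/2, 1/2)
assert isclose(mgf_uniform_numeric(lam), closed_form, rel_tol=1e-6)
print("numerical integration matches the closed form")
```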
(Refer Slide Time: 05:38)

Let us see examples. Examples are always very nice. We will start with the simplest possible example; in this case it is an almost trivial distribution: X equals 0. I am taking a random variable which takes only one value, 0, with probability 1. So, if you look at M_X(λ), it just becomes 1: it is 1 × e^{λ·0}, and λ times 0 is 0, so it is 1. Easy enough, seems very simple. You can also take other constants.

If X is a constant other than 0, then of course the expected value of X is not 0, and you would have to pull it back down by subtracting; that is why 0 is the meaningful case here. So, let us take the next distribution, which is sort of like Bernoulli(p); this is actually the centralized Bernoulli(p). Do you agree? What is Bernoulli(p)? It takes values (0, 1) with probabilities (1 − p, p), so its expected value equals p.

Now I am going to subtract p from this, and we get values (−p, 1 − p); the probabilities remain the same, and that becomes the centralized Bernoulli. So, the expected value is 0. It is just a simple centralization: you get −p with probability 1 − p and 1 − p with probability p. For p equals 1/2 this is what we saw before: −1/2 and 1/2 with probability 1/2 each.

So, if you compute the moment generating function for this you will get (1 − p) e^{−pλ} + p e^{(1−p)λ}. You can put p equals 1/2 and you will recover the previous result that we had for the centralized Bernoulli(1/2).
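To make this concrete, here is a small sketch (my own illustration, not from the lecture slides) of the discrete MGF as a weighted sum of exponentials, checked against the closed form above for the centralized Bernoulli(p):

```python
import math

def mgf_discrete(values, probs, lam):
    """M_X(lambda) = sum over i of e^(lambda * x_i) * f_X(x_i)."""
    return sum(p * math.exp(lam * x) for x, p in zip(values, probs))

# Centralized Bernoulli(p): value -p w.p. 1-p, value 1-p w.p. p.
p, lam = 0.3, 0.7
numeric = mgf_discrete([-p, 1 - p], [1 - p, p], lam)
closed = (1 - p) * math.exp(-p * lam) + p * math.exp((1 - p) * lam)
print(numeric, closed)  # the two expressions agree
```

A quick sanity check on any MGF: at λ = 0 it must equal 1, since E[e^{0·X}] = 1.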
(Refer Slide Time: 07:40)

So, here is a slightly different random variable. This is again a discrete random variable; it takes three different values: −1 with probability 1/2, 0 with probability 1/4 and 2 with probability 1/4. You can check that its expected value is 0; I have centralized it to make sure that it works out.

Now, if you write down M_X(λ), it is going to be 0.5 e^{−λ} + 0.25 + 0.25 e^{2λ}. I want to point out one thing; just pause for a second. If I give you the distribution, you are able to go to M_X(λ). If I give you an M_X(λ) in this form, you can go back to the distribution. Think about this.

So, particularly with this e^{λ}, e^{−λ} form: if I give you the distribution, you go to M_X(λ); if I do not give you the distribution but give you M_X(λ), you can also go back to the distribution. It is very easy. Let me show you how that is done here.

So, here I have given you an M_X(λ): (1/3) e^{3λ/2} + (1/6) e^{−3λ} + (1/8) e^{−λ} + (1/8) e^{λ} + 1/4. From here you can almost read out, one by one, the values X takes. Notice, for instance, the expression (1/6) e^{−3λ}: it gives you part of the distribution directly. From every term here you can read off one particular value that X takes.

So, notice the term (1/3) e^{3λ/2}. It means that X takes value 3/2 with probability 1/3; this term directly gives you this value. Remember, M_X(λ) is E[e^{λX}]. You can go back to the formula; it is just the direct formula: E[e^{λX}] = ⋯ + f_X(x_i) e^{λx_i} + ⋯, is it not? So a term like this implies X takes value x_i with probability f_X(x_i).

So, you can go from the distribution to a particular term of M_X(λ), or you can take a particular term of M_X(λ) and go back to the distribution; that is what I am doing here in reverse. This is an important idea. The distribution fixes the moment generating function, and the moment generating function also fixes the distribution. So, going from distribution to moment generating function is one-to-one.

So, notice this one-to-one relationship: to describe a random variable X you can either give the actual distribution or you can give its moment generating function; both are one and the same, and you can go back and forth. It is an important thing to note. So, here you see 3/2 with probability 1/3. What about the next term?

So, you see the term (1/6) e^{−3λ}: it tells you X takes value −3 with probability 1/6. Likewise, you can read off every term: X takes value −1 with probability 1/8 and value 1 with probability 1/8. Note that the constant term has to be 1/4 for this to be a valid distribution; check that 1/3 + 1/6 + 1/8 + 1/8 + 1/4 = 1.

So, you have −1 with probability 1/8, 1 with probability 1/8, and 0 with probability 1/4. This last value is important to understand: the constant term gives you X taking the value 0 with probability 1/4, because you have to think of the constant as (1/4) e^{λ·0}. So, notice how the MGF directly gives you the distribution as well. You can go from distribution to MGF or from MGF to distribution; it is a one-to-one relationship. It is a useful skill to build up, to understand what is going on.

And finally, if X is Normal(0, σ²): this is a very, very important result. Extremely important. I have just put it as a last example; let me put 5 stars next to it to drive home the point that if there is one MGF you have to mug up by heart and remember, it is the MGF of the normal distribution.
For the normal distribution with mean 0 and variance σ², the MGF is e^{λ²σ²/2}. Just mug it up by heart: if somebody wakes you up in the middle of the night and asks, you should be able to say that the MGF of the normal with mean 0 and variance σ² is e^{λ²σ²/2}. Very, very important.

So, how do you prove this? I will leave it as an exercise. It is basically showing that

M_X(λ) = ∫_{−∞}^{∞} e^{λx} (1/√(2πσ²)) e^{−x²/(2σ²)} dx = e^{λ²σ²/2}.

This is an exercise in integration; it involves completing the square in the exponent, and I do not want to go into the details here.

This is an exercise for you; look it up. If one of you can solve it, please post the solution on the Discourse forum, and everybody will appreciate that. So, those are the examples. Hopefully this conveys to you how to work with MGFs.
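If you want to check the claimed integral numerically rather than by completing the square, a crude midpoint-rule sketch (my own code, not from the lecture) already matches e^{λ²σ²/2} well:

```python
import math

def normal_mgf_numeric(lam, sigma, xmax=40.0, steps=200000):
    """Midpoint-rule estimate of the integral of e^(lam*x) times the N(0, sigma^2) density."""
    h = 2 * xmax / steps
    c = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    total = 0.0
    for i in range(steps):
        x = -xmax + (i + 0.5) * h
        total += math.exp(lam * x) * c * math.exp(-x * x / (2 * sigma ** 2))
    return total * h

lam, sigma = 0.8, 1.5
approx = normal_mgf_numeric(lam, sigma)
exact = math.exp(lam ** 2 * sigma ** 2 / 2)
print(approx, exact)  # the numeric integral matches the closed form closely
```

The window [−40, 40] is wide enough here that the truncated Gaussian tails are negligible.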

Generally, the discrete case, as you see, is very easy, but the continuous case is a little more involved; it involves integration, and I am not going to expect that you know all these integration steps. But if you do know, do share and let others learn as well. What is important is the final result. The final result is really, really important; you should know it by heart.

(Refer Slide Time: 14:08)


So, why "moment generating" function? Where are the "moment" and "generating" coming from? Here is the reason. You can write e^{λX} = 1 + λX + λ²X²/2! + ⋯, and both sides should give you the same function. I know I did not formally define exponentiation of a random variable and all that; we are not going to do that. Intuitively you can see why this has to be true, and that is good enough for us.

So, you see, this is just the exponential e^x = 1 + x + x²/2! + x³/3! + ⋯; that is the expansion I have done here, with x replaced by λX. Now the expected value distributes inside the summation (assuming convergence, etc.), so

E[e^{λX}] = 1 + λ E[X] + λ² E[X²]/2! + λ³ E[X³]/3! + ⋯

So, if you were to write E[e^{λX}] as an expression in terms of λ, λ², λ³ and so on, the coefficients give you the moments. These are called moments: E[X] is the first moment (the mean), E[X²] is the second moment, E[X³] is the third moment, and so on, and E[e^{λX}] is the MGF. So, the MGF is basically an expansion: if you expand the MGF as a power series in λ, then you get the first moment, the second moment, the third moment and all the rest.
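You can see the "generating" part numerically: differentiating M_X at λ = 0 recovers the moments one by one. Here is a sketch (my own, not from the lecture) using finite differences on the three-point distribution from earlier:

```python
import math

def mgf(lam):
    # X takes -1, 0, 2 with probabilities 1/2, 1/4, 1/4 (the earlier example).
    return 0.5 * math.exp(-lam) + 0.25 + 0.25 * math.exp(2 * lam)

h = 1e-3
m1 = (mgf(h) - mgf(-h)) / (2 * h)               # central difference ~ E[X]
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2   # second difference ~ E[X^2]
print(m1, m2)  # close to E[X] = 0 and E[X^2] = 1.5
```

For this distribution, E[X] = −1/2 + 1/2 = 0 and E[X²] = 1/2 + 4/4 = 1.5, which the numerical derivatives reproduce.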

(Refer Slide Time: 15:49)

So, here is an example. Let us take Normal(0, σ²), whose moment generating function is e^{λ²σ²/2}. Expand the left side as 1 + λ E[X] + λ² E[X²]/2! + ⋯, and expand the right side using e^x = 1 + x + x²/2! + x³/3! + ⋯ with x = λ²σ²/2. I am writing the x²/2! term in a different way here: (λ²σ²/2)²/2! equals λ⁴σ⁴/8, and multiplying numerator and denominator by 3 turns it into 3σ⁴ · λ⁴/4!, with λ⁴ over 4 factorial.

Why am I writing λ⁴/4!? Only then does it look like the left-hand side, and that is what I want. Once I have an expression like this, I can read off moments: E[X] = 0, and E[X²] = σ², since the coefficient of λ²/2! is E[X²] on the left-hand side and σ² on the right-hand side. Any time you have an equality like this which holds for every λ, you can read off the corresponding coefficients.

And, of course, questions of convergence and inversion are involved here, but we will assume all that works out. E[X³] = 0 and E[X⁴] = 3σ⁴: a result that is a little more difficult to derive if you were just doing integration, whereas from the MGF you can read it off and keep going on and on. So, this is why it is called the moment generating function. It is not so central for us in this class, but this is the reason for the name.
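The claim E[X⁴] = 3σ⁴ is also easy to check by simulation; here is a quick sketch (my own code, not part of the lecture) sampling from Normal(0, σ²):

```python
import random

random.seed(1)
sigma = 2.0
n = 200000
# Sample fourth moment of Normal(0, sigma^2) versus the MGF prediction 3*sigma^4 = 48.
m4 = sum(random.gauss(0, sigma) ** 4 for _ in range(n)) / n
print(m4, 3 * sigma ** 4)
```

The sample average fluctuates around 48; with 200,000 samples the deviation is typically well under 1.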

(Refer Slide Time: 17:23)


Now, what is more important to us is the connection between the MGF and sums of independent random variables. We already saw this when we did the Chernoff bound for the binomial case, where it was very useful. Now let us generalize and see how to do it. Say we have two random variables X1, X2 which are iid according to some distribution X, and I define their sum Y = X1 + X2; that is it.

So, let me start with one simple example: X being the centralized Bernoulli(p). Here the moment generating function works out, as shown, to (1 − p) e^{−pλ} + p e^{(1−p)λ}. I should be careful here and write "centralized": this is the centralized Bernoulli, taking values −p and 1 − p with probabilities 1 − p and p. That is the expression we got before.

And now, because the two variables are iid, M_Y(λ) is going to be M_X(λ)². So, I can just take that expression and square it; when you square you get a² + b² + 2ab. And from the result you can read out the distribution. So, I went from the distribution to M_X(λ), squared M_X(λ) to get M_Y(λ), and from M_Y(λ) I can read out the distribution of Y.

So, it is (1 − p)² e^{−2pλ} + 2p(1 − p) e^{(1−2p)λ} + p² e^{2(1−p)λ}: Y takes value −2p with probability (1 − p)², value 1 − 2p with probability 2p(1 − p), and value 2(1 − p) with probability p². So, this is a way to find the distribution of the sum of two independent random variables by using the moment generating function; this is very important to understand.
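Squaring the MGF is the same bookkeeping as convolving the PMF with itself. Here is a small sketch (my own illustration, not from the lecture) that computes the distribution of Y = X1 + X2 directly and reproduces the three probabilities above:

```python
from collections import defaultdict

def pmf_of_iid_sum(pmf):
    """PMF of X1 + X2 for iid X1, X2 -- exactly what squaring M_X encodes."""
    out = defaultdict(float)
    for x1, p1 in pmf.items():
        for x2, p2 in pmf.items():
            out[x1 + x2] += p1 * p2
    return dict(out)

p = 0.3
centralized_bernoulli = {-p: 1 - p, 1 - p: p}
sum_pmf = pmf_of_iid_sum(centralized_bernoulli)
print(sum_pmf)  # values -2p, 1-2p, 2(1-p) w.p. (1-p)^2, 2p(1-p), p^2
```

Each cross term p1·p2 lands on the value x1 + x2, which is precisely what multiplying e^{λx1} by e^{λx2} does in the MGF.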
So, we know there is a tie-in between the MGF and the sum; I am just showing you examples of how you can use the MGF to actually find the distribution of the sum of two independent random variables, and it is very simple and direct. Here is another, slightly more twisted example. It is not the centralized Bernoulli anymore: you have the other distribution from before, taking values −1, 0 and 2, M_X(λ) is as shown, and you just square it to get M_Y(λ).

From there, you can read out the distribution. I am not writing it down in full, but from M_Y(λ) you can see that Y takes value −2 with probability 1/4, −1 with probability 1/4, 0 with probability 0.0625, 1 with probability 0.25, and so on. You can read off every value.

So, here is the other distribution, which is a little more complicated. Remember, I had this distribution before, with M_X(λ) = (1/3) e^{3λ/2} + (1/6) e^{−3λ} + ⋯; if you square it, you get this M_Y(λ). It is an ugly-looking expression, but nevertheless you can read out every value.

It takes value −6 with probability 1/36, value −3/2 with probability 1/9, value 0 with probability 3/32, and so on. From the moment generating function you can read out the probabilities very easily. This seems simple enough for small cases, but its real power comes when you use it for large n. How do we use it? Let me show you.

(Refer Slide Time: 20:59)


Let us look at the MGF of the sample mean. Now my samples X1 through Xn are iid X, and X is the centralized Bernoulli(1/2); that is why the moment generating function is (e^{λ/2} + e^{−λ/2})/2. So that is the distribution of X.

Now what is my sample mean? It is X̄ = (X1 + ⋯ + Xn)/n. So far we never saw the division by n; so far we just looked at X1 + ⋯ + Xn and never concerned ourselves with it. Now this division by n is very, very critical.

Now I am going to write this as X1/n + ⋯ + Xn/n. What is X/n? X is the centralized Bernoulli, so X is −1/2 or 1/2 with probability 1/2 each; X/n is −1/(2n) or 1/(2n) with probability 1/2 each. It gets shrunk: as n increases, the values −1/(2n) and 1/(2n) come down and down.

The probabilities 1/2, 1/2 remain the same, but as you increase n the values keep coming down. So, if you look at the sequence X1/n, X2/n, and so on till Xn/n, these are iid according to X/n. That is the story here.

Now, all I need to do is find the moment generating function of X/n; from there I can go to the moment generating function of X̄ simply by raising it to the power n. That is what has happened here: the moment generating function of X/n is (e^{λ/(2n)} + e^{−λ/(2n)})/2.

That is just the definition: X/n takes value 1/(2n) with probability 1/2 and −1/(2n) with probability 1/2, and you add up the two terms. Once you find M_{X/n}, the moment generating function of X̄ is just that of an iid sum of n copies of X/n, so it is this moment generating function raised to the power n: ((e^{λ/(2n)} + e^{−λ/(2n)})/2)^n.

Remember, this is just a function of λ; there is only one variable here, and n is like a parameter. Keep that in mind. So, we are able to find the MGF of the sample mean for the case of the centralized Bernoulli quite easily, and writing it down you get an expression like that.
(Refer Slide Time: 24:23)

So, now you can plot. Here is a plot of the MGF of the sample mean, M_X̄(λ), as a function of λ for different n, and you can see what is going on; it is really interesting. Starting with n equals 1, I get a big cup shape. Notice the y-axis: the values run very close to 1 (roughly 0.9997 to 1.00004); I have just zoomed into the area around 1. Then there is the plot for n equals 10.

Notice what is going on: the curve has shrunk down. Then for n equals 20, 50, 100, as you keep increasing n we observe, at least visually, that the curve keeps coming down and becomes flat. We have not proved this analytically and we will not, though you can if you like. It is a fact that this M_X̄(λ) will tend to 1 as n goes to infinity.

So, this should not be surprising to you. Why not? Because we already know from the weak law of large numbers that there is some sort of convergence of X̄ towards the expected value. What is the expected value of X for the centralized Bernoulli(1/2)? Of course, 0. So, I know X̄ is somehow converging to 0, and the MGF of the constant 0 is the flat function 1.

So, this MGF is actually converging to the flat function 1, the MGF of the constant E[X]. It is not very surprising, but look at the very nice picture you can draw with this. What I am going to do next is one very subtle and very profound change to what I have done here. The one thing that is crucial is this division by n: I divided by n, raised to the power n, and got this wonderful convergence to something flat.
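The flattening in the plot is easy to reproduce numerically. A sketch (my own code, not from the lecture) evaluating M_X̄(λ) = ((e^{λ/(2n)} + e^{−λ/(2n)})/2)^n at a fixed λ for growing n:

```python
import math

def mgf_sample_mean(lam, n):
    """MGF of (X1+...+Xn)/n for iid centralized Bernoulli(1/2) samples."""
    return ((math.exp(lam / (2 * n)) + math.exp(-lam / (2 * n))) / 2) ** n

for n in (1, 10, 100, 10000):
    print(n, mgf_sample_mean(2.0, n))  # decreases towards the flat function 1
```

For λ = 2 the values shrink steadily towards 1, matching the flattening curves in the plot.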

What if I change the scaling? This is a very profound and subtle question. Instead of dividing by n, what if I divide by something else? What is that something else, and why is it crucial? That is a little deep and a little involved; maybe I will point you to something interesting on it later. But let us try it. Let us see what happens in the same case if, instead of n, we divide by √n.

(Refer Slide Time: 26:43)

So, you can ask me, where is the √n coming from? Why is it well-motivated? Maybe I will give you some pointers later on why √n becomes important here, but it is crucial: this √n scaling, instead of the 1/n scaling, changes things fundamentally and in a very interesting way. The setting is the same as before: I have iid samples according to the centralized Bernoulli, the expected value of X is 0 and the variance of X is 1/4. Note that the variance does not change when you translate; it is just 1/4.

Previously I considered (X1 + ⋯ + Xn)/n; that was the sample mean. Now I am going to consider this other scaling, division by √n instead of n. Once again, you may ask why √n, what is so special about it? You will see the result is really special when √n comes in; it is a bit profound and very interesting that √n causes this kind of fundamental change in what happens to the MGF.

Technically, there is nothing different in the derivation, whether you divide by n or by √n. Instead of X/n, I look at the moment generating function of X/√n: instead of 1/(2n) and −1/(2n) I will have 1/(2√n) and −1/(2√n), and nothing else changes. So the moment generating function of Y is simply

M_Y(λ) = ((e^{λ/(2√n)} + e^{−λ/(2√n)})/2)^n.

So, in all these mechanics there is no change, and with a numerical tool you can easily generate this and plot it. That is what I have done here; notice the profound, huge change. Let me walk you through how I plotted it: I have plotted M_Y for different values of n. First is n equals 1, where it is just (e^{λ/2} + e^{−λ/2})/2.

I told you how that looks: like an e^{λ²}-type curve, a U shape. Notice what happens as you keep increasing n; the limiting curve drawn here is e^{λ²/8}.

So, I have the blue line for n equals 1, shaped like that. Then notice what happens when n becomes 10, and then 20: it is not converging to something flat, it is converging to something that looks like an e^{λ²}-type curve; in fact, to e^{λ²/8}.

And if you do a little bit of work here, you can show that this converges to e^{λ²σ²/2} as n increases. And what is σ² here? σ² is 1/4, so you get e^{λ²/8}; that is what this converges to. This is nothing short of magic, it turns out.

So, you may say that is one special case: if you divide by n you get the flat function 1, and now, dividing by √n, you get this e^{λ²σ²/2}. What is so nice about e^{λ²σ²/2}? This MGF corresponds to Normal(0, σ²). Notice this wonderful little property.

Take the centralized Bernoulli, add up n copies and divide by √n: you get the normal distribution. That is what happens as n increases. A fantastic little result: the binomial can be close to a normal distribution in some sense. This is sort of the root of the central limit theorem and how it works.
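Again this is easy to check numerically; a sketch (my own code, not from the lecture) showing ((e^{λ/(2√n)} + e^{−λ/(2√n)})/2)^n approaching e^{λ²/8}:

```python
import math

def mgf_scaled_sum(lam, n):
    """MGF of (X1+...+Xn)/sqrt(n), Xi iid centralized Bernoulli(1/2)."""
    s = 2 * math.sqrt(n)
    return ((math.exp(lam / s) + math.exp(-lam / s)) / 2) ** n

lam = 1.5
limit = math.exp(lam ** 2 / 8)   # e^(lam^2 * sigma^2 / 2) with sigma^2 = 1/4
for n in (1, 10, 100, 10000):
    print(n, mgf_scaled_sum(lam, n), limit)  # approaches the limit as n grows
```

Unlike the 1/n scaling, the values do not flatten to 1; they settle on the normal MGF e^{λ²/8}.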
Now you might argue: this is for the centralized binomial, so maybe it is true only for the binomial distribution. It turns out not, and that is where the real wonder of the CLT comes in, the real specialty of the CLT: you can take some other distribution.

(Refer Slide Time: 30:53)

So, I have picked here one of the other distributions we saw before, and we repeat the same thing. Nothing has changed: I have taken n iid samples, but instead of that simple M_X(λ) I now have a more complicated M_X(λ), with expected value 0 and variance 5/2. So σ² is 5/2 here.

So, this is a different distribution; it is not the centralized Bernoulli anymore, it is that other more complicated distribution, but I define the same scaling: Y = (X1 + ⋯ + Xn)/√n. Now I compute M_{X/√n} and raise it to the power n, and I get a funky-looking formula with complicated terms like e^{(3λ/2)/√n}. And here you go.

I have plotted this for n equals 1. It looks like a rather out-of-shape, asymmetrical plot, but as you keep increasing n (this time it looks like I have to go to larger n, say n equals 100, n equals 200), by the time I get to n equals 200 I can start seeing some convergence.

And look at what it converges to: once again e^{λ²σ²/2}, which here is e^{5λ²/4}. From the formula we mugged up by heart, this is the MGF of Normal(0, σ²). Look at that amazing little result. What did we do?

We went from n to √n: instead of dividing by n we divided by √n, and the distribution in some sense converges to a normal distribution. And it does not seem to matter what the initial distribution is; whatever it is, it seems to converge to normal. It turns out this is an astonishing, surprising fact about iid samples, whatever the distribution.

(Refer Slide Time: 33:10)

And that is the central limit theorem, the celebrated central limit theorem. Take iid samples according to some distribution X, with expected value of X equal to 0 (once again, not a serious limitation: you can simply subtract the mean) and variance σ². Then do the scaling Y = (X1 + ⋯ + Xn)/√n. It can be any distribution; only these two conditions have to be satisfied. Then the MGF M_Y(λ), as you keep increasing n, will tend towards e^{λ²σ²/2}, which is the MGF of the normal distribution. This kind of property is called convergence in distribution: you say Y converges to the normal distribution in this fashion. This is a celebrated result, and look at the astonishing part: it does not matter what the distribution of X is.
You take iid samples, add them up and divide by √n; notice the subtle change. If you divide by n, remember the previous result: by the weak law of large numbers the MGF converges to 1. But with √n you do not go so aggressive on the division (√n is smaller than n).

And it turns out you then get a distribution, the distribution is always normal, and it always has the correct variance σ². What a wonderful result. This is why the normal distribution is so important in probability, statistics and modeling; it shows up in so many different places and so many different ways. This is the central limit theorem.

A few observations. You can contrast this with the weak law of large numbers: there, the sample mean converges in distribution to the constant E[X]. And this leads to a general, very loose meta-statement: "the sum of iid random variables tends to be normal." Do I like it? Not entirely, because the scaling aspect is hidden in it, and the scaling is very important. If you simply scale by 1/n it tends to a constant; the scaling is hidden when you just say the sum of iid random variables tends to be normal.

So, you have to know in what sense; in the actual interpretation the scaling is very important. When you divide by √n it goes to normal; divide by n, and it goes to a constant. That is an interesting little statement about the CLT.

The CLT, like I said, is a very celebrated result, and there are lots of nice experiments; I will let you read up a little on those. You can build a big experimental setup and drop marbles from the top, and the way they scatter will form a bell curve. This is why the bell curve shows up so often in simulations and statistics; there are very nice experiments and demonstrations you can do to illustrate how a lot of independent random variables, when you add them up with suitable scaling, give you a normal distribution. So, this is a wonderful and celebrated result.
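You can run a marble-drop style demonstration yourself in a few lines. This sketch (my own code, not from the lecture) sums Uniform[−1, 1] variables, scales by √n, and checks that the result behaves like Normal(0, 1/3):

```python
import random

random.seed(0)
n, trials = 100, 20000
# Y = (X1+...+Xn)/sqrt(n) with Xi ~ Uniform[-1, 1], so E[X] = 0, Var(X) = 1/3.
ys = [sum(random.uniform(-1, 1) for _ in range(n)) / n ** 0.5 for _ in range(trials)]

var = sum(y * y for y in ys) / trials            # should be near sigma^2 = 1/3
sigma = (1 / 3) ** 0.5
within_1sigma = sum(1 for y in ys if abs(y) <= sigma) / trials  # ~0.683 for a normal
print(var, within_1sigma)
```

A histogram of `ys` would show the bell curve directly; the two printed statistics are a quick numerical stand-in.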

So, we have not seen the proof of this, and I do not think I can give you a proper little proof here. So again, this is a post for the Discourse forum: somebody can start a thread on the proof of the central limit theorem, and those who are interested can participate. The proof is strictly out of syllabus; it is not related to your exams or anything. It is just for those of you who are analytically curious about where the result comes from. Start a post on the Discourse forum and I will also participate.
(Refer Slide Time: 36:56)

So, how do you use the CLT to approximate probability? Now, this is important for this course. Whether or not you followed the discussion of why the CLT holds, you should know how the result is applied when you want to approximate a probability. This is very crucial. Once again you have X1 through Xn iid X.

Let us say the mean of X is μ and σ² is the variance, and I define the sum Y = X1 + ⋯ + Xn. Notice I am not worrying about the scaling here; I will show you how the scaling is used in the calculation, but people will usually just write the sum, and you will see how the probability gets approximated.

I am interested in this question, the question we asked before when we approximated the binomial: what is the probability that Y deviates from its mean? Y has nμ as its mean, E[Y] = nμ. So, what is P(Y − nμ > δnμ)? Here δ is some fraction; we will pick it as 0.6 or whatever. This is the question we have asked before.
(Refer Slide Time: 38:02)

So, how do you use the CLT for the approximation? The crucial idea is this: (Y − nμ)/√n is approximately Normal(0, σ²). In addition, I will divide by σ. Why divide by σ? So that I get rid of the σ² and my distribution becomes Normal(0, 1). This is the crucial, crucial idea: (Y − nμ)/(σ√n) I expect to be approximately Normal(0, 1).

So, any question you ask me about Y, I will convert into an equivalent statement about (Y − nμ)/(σ√n), and then find the probability in terms of the normal distribution's CDF. Remember, the CDF of Normal(0, 1) is known to us: it is given in tables, or by a computer program, whatever. So we can express the answer in terms of that CDF.

This is an approximation; nobody is claiming it is an upper bound or a lower bound, it is just an approximation, but it works very, very well.
(Refer Slide Time: 39:15)

So, let the CDF of Normal(0, 1) be some F(z); this F(z) is known to us. For P(Y − nμ > δnμ), I divide both sides by σ√n to get P((Y − nμ)/(σ√n) > δnμ/(σ√n)). Now, (Y − nμ)/(σ√n) is approximately Normal(0, 1), and the probability that a Normal(0, 1) random variable is greater than something is 1 minus the CDF at that same point.

So, F is known to me, and I approximate: P(Y − nμ > δnμ) ≈ 1 − F(δnμ/(σ√n)). Remember, this is approximate; it is very important that it is not an equality. This is the way you use the CLT for approximating probability. Notice that √n enters the picture, and you should know which random variable you are approximating as Normal(0, 1). Simple, is it not? Even if you do not understand how all those limits worked out, this is very easy to use.
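The recipe fits in a few lines. A sketch (my own helper, not from the lecture, with the standard-normal CDF written via math.erf) of P(Y − nμ > δnμ) ≈ 1 − F(δnμ/(σ√n)):

```python
import math

def std_normal_cdf(z):
    """F(z) for Normal(0, 1), expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def clt_tail(n, mu, sigma, delta):
    """Approximate P(Y - n*mu > delta*n*mu) for Y = X1+...+Xn, Xi iid with mean mu, sd sigma."""
    z = delta * n * mu / (sigma * math.sqrt(n))
    return 1 - std_normal_cdf(z)

# Bernoulli(1/2) sum with delta = 0.6: the argument works out to 0.6*sqrt(n).
print(clt_tail(100, 0.5, 0.5, 0.6))
```

For n = 100 the standardized argument is 0.6·100·0.5/(0.5·10) = 6, so this is 1 − F(6), a very small tail.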
(Refer Slide Time: 40:38)

So, let me show you a particular illustration. I am going to approximate the Binomial(n, 1/2) using the CLT. Here μ and σ are both equal to 1/2. How do I get this? Y = X1 + ⋯ + Xn with Xi iid Bernoulli(1/2), and for Bernoulli(1/2) the mean is 1/2 and the variance is p(1 − p) = 1/4, so σ, the square root of the variance, is 1/2.

Now, remember when we discussed the weak law of large numbers and Chebyshev, I was interested in the event P(Y − n/2 > 0.6n/2); that is the sort of probability I was looking at. Using the CLT, this is approximately 1 − F(0.6√n), where F is the CDF of Normal(0, 1); this just uses the formula from before.

And if we do n = 10, 50, 100, these are the same events we had before, with probabilities like 10⁻⁶, 10⁻¹⁰. Notice how the CLT gives you such a good approximation, very close to what you expect for the probability. The normal approximation is a very good thing to do for the binomial distribution. So, this is how you use the CLT.

Once again: not exact, an approximation, not even a bound. Maybe in this specific case you could prove a bound, but the method is more general than that.
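To see how good the approximation is, here is a sketch (my own code, not from the lecture) comparing the exact Binomial(n, 1/2) tail with 1 − F(0.6√n):

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def exact_tail(n, delta):
    """Exact P(B - n/2 > delta*n/2) for B ~ Binomial(n, 1/2)."""
    cut = n / 2 + delta * n / 2
    return sum(math.comb(n, k) for k in range(n + 1) if k > cut) / 2 ** n

def clt_tail(n, delta):
    return 1 - std_normal_cdf(delta * math.sqrt(n))

for n in (10, 50, 100):
    print(n, exact_tail(n, 0.6), clt_tail(n, 0.6))  # agreement improves with n
```

For small n the two can differ by a factor, but by n = 100 both are of the order 10⁻⁹; tiny tail probabilities come out with the right order of magnitude.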
(Refer Slide Time: 42:28)

So, here is one more example. I have taken our other favorite distribution, Y = X1 + ⋯ + Xn with X having the distribution from before. Here, again, μ is 0 and σ² is 5/2; you can do this calculation. Now the CLT tells me Y/√(5n/2) is roughly Normal(0, 1).

So now, if I want to find P(Y > δn), a reasonable thing to calculate for this distribution, it is the same as P(Y/√(5n/2) > δn/√(5n/2)) = P(Y/√(5n/2) > δ√(2n/5)), and that is approximately 1 − F(δ√(2n/5)).

Now you can put in numbers, and that is the answer for that distribution. Next, let us do a continuous distribution: Uniform[−1, 1]. For the Uniform[−1, 1] continuous distribution, μ is 0 and σ² is 1/3.

So, the CLT tells me Y√(3/n) — remember Y is this summation — is going to be roughly Normal(0, 1).
So supposing I want to find P(Y > 0.1n). Why 0.1n? It is just some deviation; it is okay.
Multiplying both sides by √(3/n), this is the same as P(Y√(3/n) > 0.1√(3n)),
and that is approximately 1 − F(0.1√(3n)).

So, if n equals 10, this probability is roughly 0.2919; if n equals 100, this probability
is roughly 0.0416. Notice — I want you to once again appreciate how non-trivial this
calculation is. Supposing you want to find the exact value for n equals 100, what would you have to do?
You have to find the distribution of Y. How will you find the distribution of Y? It is the sum of a
hundred independent, identical Uniform[−1, 1] random variables.

You have to take the individual PDF and do what is called convolution with itself. So, it is a
slightly more complicated operation: find out the actual distribution for n = 100, and then do
an integration over the tail. So, it is a lot of work. It is not so easy to do and it is not very
pleasing, but notice how easily the normal approximation gives you the formula. So clean, is it
not?
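For the Uniform[−1, 1] case, a quick Monte Carlo check of the normal approximation is easy to sketch (the seed and the trial count of 50,000 are arbitrary choices here):

```python
# Y = X1 + ... + Xn with Xi iid Uniform[-1, 1]; mu = 0, sigma^2 = 1/3.
# CLT: P(Y > 0.1n) is approximately 1 - F(0.1 * sqrt(3n)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 100, 50_000
Y = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)

mc_est = np.mean(Y > 0.1 * n)               # Monte Carlo estimate of P(Y > 0.1n)
clt_approx = norm.sf(0.1 * np.sqrt(3 * n))  # 1 - F(0.1 sqrt(3n))
print(f"Monte Carlo: {mc_est:.4f}, CLT approximation: {clt_approx:.4f}")
```

The simulated frequency and the CLT number agree to a couple of decimal places, with no convolutions anywhere in sight.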

And it is also illustrated in the discrete case. Look at the discrete case: there you have a
complicated-looking discrete PMF. If you raise it to the power 100, you are going to get one big
mega expression, and who is going to go in and calculate this probability? It is really hard to
compute the exact probability, and notice how easy the normal approximation is. I want you to
take a step back and appreciate this result from that point of view.

So, when you have a sum of random variables, the normal approximation is fantastic. It gives you
nice expressions, it is going to be pretty good, and it works out very, very cleanly —
much, much better than trying to do exact calculations. So, this is a very popular method in
statistics for dealing with sums of independent random variables.

So, that is the end of this lecture. I hope you understood and appreciated how this CLT comes out
of nowhere and simplifies your computations tremendously. Thank you very much. I will see you
in the next lecture.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.7
Distribution, properties and connections
(Refer Slide Time: 00:13)

Hello, and welcome to this lecture. This lecture is going to be slightly different from the previous
ones. In the last couple of lectures, we saw concentration, the central limit theorem and various kinds
of properties: when we have i.i.d. samples and we look at descriptive statistics and other things like
that, what can we say about convergence? We were looking at those kinds of
properties, and how useful they are.
Now we will step back a little bit and look at some interesting properties of various types
of distributions. In fact, there are several distributions, and so far we have seen very few.
In the continuous world particularly, we saw the uniform distribution, the
exponential distribution, and the Gaussian distribution, or the normal distribution. But we did
not see several other distributions that occur quite often, at least in statistics. These other
types of distributions occur quite often, and they have a lot of connections between each other.
Now, one of the things about continuous distributions, as you might have already realized, is that a lot
of properties depend on integration and complicated manipulations involving integration. By
themselves, the individual steps are not very hard, but there is this series of integrations and
simplifications, and quite often it is possible that we get lost in the middle of all of those calculations and
lose the big picture.
But on the other hand, if you are only given the big picture and you do not know the details, there is
this problem that you do not have the confidence that you really know anything; you have to just
keep talking without having a solid foundation. So, one needs to strike a balance between the two,
and that is hard in a course like this, in a program like this, where people sometimes do not have a very
rigorous math foundation.
So, what I am going to do is mostly focus on the big results, and provide some hints at why these
things could possibly be true and some direction on how they are proven. I will stop at that, but
it is very important to know these properties. Many of the properties that I will state are
important, and many of the distributions we will study are very important. They often occur in
practice — the shapes and the way they occur — and it is good to know them at least when you
study statistics.
So, this relatively short lecture is going to be just an assortment of facts about various
different distributions. We will study a lot of new distributions and we will see connections
between them. So, let us get started.
(Refer Slide Time: 02:52)

So, here are some histograms. I have generated some data
in Python and I have drawn histograms for it, and this is how they look.
So, there are various types of histograms you can see. Mostly I want you to focus on this: if you were
to approximate these things with a PDF, you are probably going to draw lines like this.
So, you are going to draw something like this for this guy, approximating the PDF. For this guy,
it is probably going to be something like this — the tail is a bit long here. And for this guy it
is probably going to be something of this U sort of shape.
So, basically all of these kinds of things can happen in practice. The shape of the distribution
can be different, the location of the peak may be different and the scale may be different. So,
you will see quite often in many distributions that the parameters are called shape
parameters, location parameters and scale parameters. Shape, location and scale are three words
that are often used to describe the parameters of a probability distribution.
And you can see, for instance, the first one, hist 1, that I have drawn here: it is like a Gaussian,
normal sort of distribution. You can think of the location as being zero. The scale — basically
how much it is spread out — would be the variance parameter, the location is the mean, and I
guess there is no real shape parameter going on here.
For the exponential distribution there is only one parameter. Sometimes people also call it a
rate parameter; sometimes rate is considered a parameter in its own right. If you have
e^{−constant·x}, that constant is usually called a rate. It sort of tells you the rate of fall in
some sense — how quickly it falls. So, if you look at e^{−λx}, as λ increases, it is going to
fall faster and faster. So, that is the sort of thing shape, location and rate capture.
And then notice histogram 3: it rises for a while and then falls. So, this is a different sort
of shape, and look, it is only between 0 and 1. The previous distributions all seem to go for
all x, while hist 3 and hist 4 are only between 0 and 1, and hist 4 has a U shape. So, strange
sorts of shapes. You can have all sorts of shapes that might occur in the phenomenon that you are
interested in, and it is good to know what all these distributions out there are.
What are all the standard distributions and their different shapes? At least mentally, it helps for
you to picture what they are and know them. You do not have to remember all the formulae by heart.
Today, Wikipedia and so many other references would give you all these things; you do not need
to mug up the formulae themselves. But this connection between shapes and how the distribution
looks — it is good to have a sense of that. Is it looking like an exponential, is it looking like a normal,
is it looking like something else? So, what are all these shapes?
So, what we are going to do in this lecture is look at a bunch of standard distributions. I will draw
some histograms, give you some expressions, give you some intuition and describe some
connections between these things. It is useful when you model: when you pick up a histogram
and you see its general shape, you may want to think in the back of your mind, hey, this looks
like this distribution, this looks like that distribution. So, it is good to at least know that
at a high level. At least for this course, we will mostly focus on things at that level. We may
not do intricate calculations using various integration techniques with the distributions, but you
should know the connections and where they come from. Let us get started.
(Refer Slide Time: 06:46)
So, first of all — and we have seen already that the normal distribution occurs so many times in
statistics — it is very, very important, a vital distribution to know very well. And the normal
distribution satisfies this really, really interesting property, which is sort of unique to it
in some sense.
So, supposing you have n independent normal random variables. Let me be careful here: I do not
actually need them to be i.i.d. — I kept saying i.i.d., apologies for that — I only need them to be
independent and normal; I do not need the identical distribution. So, this is a very, very nice
result about independent normal random variables, and this you should know very well. It is a
very standard result in probability, and not too hard to prove.
So, let me first state the result and then describe what it is. You have n independent normal
random variables X1, …, Xn; Xi is normal with mean μi and variance σi². So, X1 will have
mean μ1 and variance σ1², X2 will have mean μ2 and variance σ2². So, different normal
distributions, possibly different means and different variances, but they are all independent —
the independence is important.

And then what do we do? We take a linear combination of them. This linear combination is a
powerful thing that is used repeatedly in so many different places. So for instance, I might
define a random variable Y = a1X1 + a2X2 + ⋯ + anXn, where the ai are some constants — they
could be 1, 2, 3, whatever. So, I am taking a linear combination of X1, …, Xn. Each of them is
normal and they are all independent.
So, it turns out the linear combination is also normal. That is a huge result, and it is very
important to know it by heart: a linear combination of independent normals is normal. So,
Y ~ Normal(μ, σ²). Finding the mean is easy — we do not even need independence for that.
Whatever the distribution may be, whether they are independent or not, the expected value of Y
is always E[Y] = a1μ1 + ⋯ + anμn. The variance is not so easy if they are
correlated, but even if they are merely uncorrelated you can find the variance easily, and it
comes out as a1²σ1² + ⋯ + an²σn². Now on top of it, if the random variables are
independent and they are normal, then it turns out Y itself has a normal distribution. The
distribution of Y itself is very easy to identify, and it is normal.
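This property is easy to spot-check by simulation. Here is a sketch with arbitrary illustrative parameters (two independent normals and coefficients 3 and 2 — nothing about these numbers is special):

```python
# Check: Y = 3*X1 + 2*X2 with X1 ~ N(1, 4), X2 ~ N(-2, 9) independent
# should be N(mu, sigma^2) with mu = 3*1 + 2*(-2) = -1
# and sigma^2 = 9*4 + 4*9 = 72.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
N = 100_000
x1 = rng.normal(1.0, 2.0, N)    # sd 2 -> variance 4
x2 = rng.normal(-2.0, 3.0, N)   # sd 3 -> variance 9
y = 3 * x1 + 2 * x2

print("sample mean:", y.mean(), " sample variance:", y.var())
# Kolmogorov-Smirnov distance to N(-1, 72): small statistic = good fit
stat, p = kstest(y, "norm", args=(-1.0, np.sqrt(72.0)))
print("KS statistic:", stat)
```

The sample mean and variance land near −1 and 72, and the KS statistic is tiny, consistent with Y being exactly normal.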
So, this result is the final bottom line here: linear combinations of independent normals are
again normally distributed. The proof is actually very, very simple. You just try and find the
moment generating function of Y. You are going to find

E[e^{λY}] = E[e^{λa1X1}] ⋯ E[e^{λanXn}]

So, this comes from the independence. And each of these terms is
E[e^{λa1X1}] = e^{λ²σ1²a1²/2}, …, E[e^{λanXn}] = e^{λ²σn²an²/2}.

Now, if you multiply all of this, notice what happens inside the exponent. So, I made one little
assumption here: in the proof, just for simplicity, I will assume μ1 = ⋯ = μn = 0. If it is not
0, I have to centralize and adjust a little extra; I will assume zero for convenience.

E[e^{λY}] = e^{λ²(a1²σ1² + ⋯ + an²σn²)/2}

So, this is nothing but the moment generating function of a normal distribution with mean 0 and
variance equal to a1²σ1² + ⋯ + an²σn². So you see, notice how the fact that the moment
generating function for a normal is e to the power λ² times the variance over 2 plays a role
here. It is a very simple proof: all you have to do is use the moment generating function and go
back and forth between the moment generating function and the linear combination. It plays well
with the linear combination and the independence, and you get the answer. So, this is sort of
unique to the normal distribution. It is a very powerful property.

There are various reasons why this matters. You may be doing Maths 2 now, and you may know about
the value of linear combinations and linear functions. Quite often in practice, when you have a
bunch of variables, pretty much the only thing you can do is take a linear combination if you
want to derive something from them. A lot of linear combinations are used in practice, and
knowing their distributions is very interesting. So, if each of the variables is independent and
they are normal, then their linear combination is also normal. A very nice, simple, and elegant
property of the normal distribution.
(Refer Slide Time: 12:50)
So, now let us move on to another type of distribution, which is called the gamma distribution.
(And I thought I had corrected this slide — looks like that has not worked out, apologies for
that; this parameter has got to be β.) So, we say a random variable X is gamma distributed with
two parameters, α and β. Any time we have a distribution, we will have parameters for the
distribution, so that is the first thing to remember.
So, if you think of a uniform distribution, usually it has two parameters, (a, b), as in it is
uniform from a to b. If you think of a normal distribution, again it has two parameters,
μ and σ², mean and variance. If you think of the exponential distribution, it has one
parameter λ; the PDF is λe^{−λx}.
So, parameters are important in a distribution, and when you vary the parameters, the properties of
the distribution vary: the scale varies, the location varies, the shape varies. Likewise, for
this gamma distribution there are two parameters, one usually called α and the other usually
called β. The PDF, the probability density function, of the gamma distribution is

X ~ Gamma(α, β) if f_X(x) ∝ x^{α−1} e^{−βx}, for x > 0

So, this is the shape of the PDF, and it is for x positive. The gamma distribution does not
work for x negative; it is only for x positive. It is not like the normal, which is on both
sides — this is only on one side, like the exponential distribution in some sense. So, it is
proportional to x^{α−1} e^{−βx}.
Why do we say proportional to, and how do I find the constant of proportionality? If you want to
write the exact formula, it is going to be

f_X(x) = x^{α−1} e^{−βx} / ∫₀^∞ x^{α−1} e^{−βx} dx

This integral is usually expressed via the gamma function, but we are not worried so much
about the integration. What is really important is the shape, and the numerator controls the
shape of the distribution. The denominator merely scales — an overall scaling — so whatever it
may be, it is not very relevant to us. It may give us the exact form, etcetera, but what is
really important is the dependence on x.
These kinds of functions are special functions, and we are not going to go into their details
here. You can go and read other sources if you like: people define this integral through the
gamma function Γ(α), which has many interesting properties, but for us, what is most important
is the shape.
So later on, I will show you some pictures, and you can also draw them yourself. You can open up
Python with Matplotlib, or Desmos, or anything, and try sketching these kinds of functions,
x^{α−1} e^{−βx}. How will this function look? Remember e^{−βx} falls like a decaying
exponential. What does x^{α−1} do? Assuming α > 1, it grows. So, when I multiply these two,
what is going to happen? It is going to start at 0, increase for some time, and then fall down.
That will be the shape of the product x^{α−1} e^{−βx}, at least for α > 1. So, you notice
there will be this increase for a little while and then a fall. As α increases the rise lasts a
bit longer; as α decreases towards 1 there is less of a rise, and at α = 1 there is no rise,
only a fall — it becomes exponential at that stage. So, this is how the shape will look; we will
see later on how this looks in plots.
An interesting thing you can notice: if α = 1, you get the exponential distribution. So, the
gamma distribution is a generalization of the exponential distribution — or, another way to look
at it, the exponential distribution is a special case of the gamma distribution. The gamma
distribution is more general.
Now, there are two parameters here. The first parameter α needs to be positive; it is called
the shape parameter. The second parameter is β, also greater than 0; it is called the rate
parameter. Often the inverse of the rate is used, and that is called a scale parameter. Now,
one can find the mean and variance of the gamma distribution: the mean is α/β and the variance
is α/β². You can also find the moment generating function, even though we will not spend too
much time on it.

So, the first two properties are simple enough — mean and variance. You can try to prove them;
that is a bit complex. Again, a lot of the proofs will involve integration, and we are not going
to go into that kind of detail here, but these kinds of formulae are easy enough to remember:
mean is α/β, variance is α/β².

If needed, you can remember them, but you can always look them up; you do not need to know these
things by heart. You can look up these distributions elsewhere, and if we ask a question about
these distributions, I do not think we are going to emphasize the memorization of these kinds of
formulae too much. So, it is enough to study the different shapes of the distributions and have
a nice description of the shape of the density.
So, here are a couple of interesting properties; these two properties are why the gamma is so
interesting in practice. If you take i.i.d. exponential random variables — let us say n of them —
and add them together, it turns out you get the gamma distribution with α equal to n and β
being the same β. So, that is an interesting property: you add n i.i.d. exponential random
variables, you get the gamma distribution.

Remember, the exponential density is proportional to e^{−βx}. If you add n independent copies
of such random variables, the density becomes proportional to x^{n−1} e^{−βx}. So, that is the
picture you should have in mind: you have exponentials, you keep adding repeated independent
copies of those random variables, and the density becomes x^{n−1} e^{−βx}. That is the first
result. Again, we will not prove this; the proof uses the moment generating function and other
methods.
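Even without the proof, the exponential-to-gamma connection is easy to verify by simulation (n = 5 and β = 2 below are arbitrary choices):

```python
# Sum of n iid Exponential(beta) variables should be Gamma(alpha=n, beta).
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
n, beta, trials = 5, 2.0, 100_000
sums = rng.exponential(scale=1.0 / beta, size=(trials, n)).sum(axis=1)

# Gamma(n, beta) has mean n/beta and variance n/beta^2.
print("sample mean:", sums.mean(), " expected:", n / beta)    # 2.5
print("sample var :", sums.var(), " expected:", n / beta**2)  # 1.25
# Compare a tail probability against the gamma CDF as well:
tail_mc = np.mean(sums > 4)
tail_gamma = gamma.sf(4, a=n, scale=1.0 / beta)
print("P(sum > 4):", tail_mc, " gamma:", tail_gamma)
```

Histogramming `sums` against `gamma.pdf(x, a=n, scale=1/beta)` in matplotlib gives the same picture visually.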
Here is another very, very interesting property. Take a normally distributed random variable with
mean zero and variance σ², and square it. When you square a random variable, it becomes
positive; it cannot go negative. And it turns out the square has a gamma distribution, with
somewhat odd-looking parameters: 1/2 and 1/(2σ²). So, what is this?

PDF of Gamma(1/2, 1/(2σ²)) ∝ x^{−1/2} e^{−x/(2σ²)}

Look at this x^{−1/2}: it behaves like 1/√x, so its value as x gets close to 0 is sort of
blowing up. That is the sort of distribution you get. You take the normal distribution with mean
zero and variance σ² and square it, and you suddenly get something which is very, very likely to
take values very close to 0.
In some sense, relatively speaking, if you take a small window close to 0 and a window of the
same length away from 0, the one close to 0 has a much higher probability of occurring. That is
an interesting fact about this gamma distribution. Try sketching this Gamma(1/2, 1/(2σ²))
somewhere; you can open up a discourse thread, post your sketch of it, and discuss whether it is
reasonable.
Find the histogram: generate random numbers which are Gaussian distributed, square them, find
the histogram, and see if it matches the gamma distribution. That is a nice exercise for you to
try. Each of these relationships, while I have written it down as a theoretical result, makes a
very good exercise: take your Python notebook and create commands that will show this.

So, you take a thousand Gaussian samples generated in Python, square them, histogram them,
plot the gamma PDF on top and see if it matches — it has to match. These are nice things for you
to try to reinforce some of these ideas. That is all I wanted to say about the gamma
distribution. It will occur quite often in statistics and it is useful to know.
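Here is one sketch of that exercise (σ = 2 and the sample size are arbitrary). It checks the squared samples against Gamma(1/2, 1/(2σ²)), which in scipy's shape/scale convention is `a = 0.5`, `scale = 2σ²`:

```python
# The square of a N(0, sigma^2) variable should be Gamma(1/2, 1/(2 sigma^2)).
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2)
sigma, trials = 2.0, 100_000
z2 = rng.normal(0.0, sigma, trials) ** 2

# The gamma mean alpha/beta = (1/2) * 2 sigma^2 = sigma^2 matches E[Z^2].
print("sample mean of Z^2:", z2.mean(), " expected:", sigma**2)
# Compare P(Z^2 <= sigma^2) with the gamma CDF; both equal P(|N(0,1)| <= 1).
emp = np.mean(z2 <= sigma**2)
theo = gamma.cdf(sigma**2, a=0.5, scale=2 * sigma**2)
print("empirical:", emp, " gamma cdf:", theo)
```

Both probabilities come out near 0.683, the familiar P(|N(0,1)| ≤ 1), and a density histogram of `z2` indeed blows up near zero like 1/√x.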
(Refer Slide Time: 21:46)
The next is something called the Cauchy distribution. It is a very interesting and unique sort of
distribution in probability theory. Once again, it has got two parameters: a location
parameter θ, which could be positive or negative, and a scale parameter α, which needs to be
positive. Usually α² is what people write in the notation, but it is α that appears in the density.
The PDF, the density in this case, can be written very neatly:

X ~ Cauchy(θ, α²) if the PDF is f_X(x) = (1/π) · α / (α² + (x − θ)²)
You can try to plot it. We will plot it later, and I encourage you to plot it yourself as well.
Once again, two parameters in this distribution: a location parameter θ and a scale parameter α.
You can imagine why θ is the location, and you can see why that is so.

This distribution is very interesting because the mean is undefined, the variance is undefined,
and the moment generating function is undefined. All these things are undefined: there is no
consistent way to assign a number, or even infinity, to the mean.
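One way to see this behaviour concretely (a sketch — the seed and sizes are arbitrary): the sample median of Cauchy samples settles near the location θ, while scipy reports the mean as undefined (NaN), and running sample means never settle down:

```python
# The Cauchy distribution has no mean, but its median is the location theta.
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(3)
samples = cauchy.rvs(loc=0.0, scale=1.0, size=100_000, random_state=rng)

med = np.median(samples)
print("sample median:", med)                        # close to theta = 0
print("distribution mean:", cauchy.mean(loc=0.0, scale=1.0))  # nan: undefined
# Running sample means, in contrast, keep jumping around as n grows:
for n in [1_000, 10_000, 100_000]:
    print(f"sample mean of first {n}: {samples[:n].mean():.3f}")
```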
And another very, very interesting property of the Cauchy distribution: take two i.i.d.
Normal(0, σ²) random variables X and Y and divide them, taking the quotient X/Y. How do you find
the distribution of X/Y? It turns out there are methods, which involve some integration and all
that, and the answer is that X/Y has a Cauchy(0, 1) distribution: location θ = 0 and α = 1.

Notice this bizarre, interesting connection: you divide two normally distributed random
variables and suddenly you get a Cauchy. Again, a property that you can check with your Python
notebook. Write a small segment where you generate a lot of pairs of normal random variables
independently and divide them to get a lot of samples of X/Y. Then draw a density histogram and
draw the Cauchy PDF on top of it to check that this is true.
That will give you good experience in understanding and dealing with these kinds of results.
Once again, a unique distribution: the density behaves like 1/x² in the tails, and because of
that the mean, the variance and so on are undefined, but it is still a valid distribution. So,
something to think about.
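Here is a sketch of that check (σ = 1.5 is arbitrary — the σ cancels in the ratio). Instead of a histogram, it compares empirical probabilities with the Cauchy(0, 1) CDF:

```python
# X/Y with X, Y iid N(0, sigma^2) should be Cauchy(0, 1), whatever sigma is.
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(4)
sigma, trials = 1.5, 100_000
x = rng.normal(0.0, sigma, trials)
y = rng.normal(0.0, sigma, trials)
ratio = x / y

# Cauchy(0,1) CDF: F(t) = 1/2 + arctan(t)/pi, so F(0) = 0.5 and F(1) = 0.75.
for t in [0.0, 1.0]:
    print(f"P(X/Y <= {t}): empirical {np.mean(ratio <= t):.4f}, "
          f"Cauchy {cauchy.cdf(t):.4f}")
```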
(Refer Slide Time: 24:03)

So, the next distribution — so many distributions are coming; I am hoping this is the last one we
will see — is very common in statistics and is called the beta distribution. Once again, two
parameters: the first is called α and the second is again called β. Maybe I should call it
something else, but it is common to call it β; saying the beta distribution has parameters
(α, β) looks a bit odd, so some people write (a, b) instead.
α and β are both called shape parameters — they dictate the shape — and both have to be
strictly greater than 0. And notice this, which is very different from the previous
distributions we have seen: the beta distribution has finite support. It is, sort of like a
uniform distribution, between 0 and 1. That is interesting, is it not? For instance, if you
put α = 1 and β = 1, you indeed get the uniform distribution.
Think about it: if α = 1 and β = 1, the dependence on x vanishes, and a constant density for x
between 0 and 1 is exactly the uniform distribution. Otherwise, the density behaves as
x^{α−1}(1 − x)^{β−1}. It is a somewhat complicated dependency on α and β, but it turns out it
is a valid PDF, proportional to x^{α−1}(1 − x)^{β−1}. Of course, you have to divide by
something; what you have to divide by is a complicated gamma-function expression, which is not
relevant to us, but the proportionality is important. You will see this in the plots that will
come through.
What is interesting and important about the beta distribution is that the mean and the variance
have very simple, clean formulae. The mean is α/(α + β) — a nice formula, is it not? — and the
variance is the similar formula αβ/((α + β)²(α + β + 1)).

There are some special cases to know. If you let β be equal to 1 and vary α, then your PDF is
just proportional to x^{α−1}; that is called the power function distribution. x^{α−1} is
interesting: as α increases, it takes values closer to 1 with much higher probability. So, that
is how x^{α−1} looks.
And the next property is very interesting. If you have two gamma-distributed random variables,
which are independent — that is not mentioned on the slide but it is needed — this property is
sometimes very useful. If you remember, the ratio of two independent normals was the Cauchy
distribution. In this case, take X ~ Gamma(α, 1/θ) and Y ~ Gamma(β, 1/θ) — the θ has to be the
same for both, but it can be any θ. Then it turns out X/(X + Y) has a beta distribution with
parameters α and β. So, here is a crazy ratio result.
So, it is good to know these things — that all these distributions are connected. If you take
ratios, if you take squares, you go between each other. And it is sort of odd if you look at it:
the gamma distribution takes values from zero to infinity — that is what the two gamma random
variables do — but X/(X + Y) can only take values between 0 and 1. Remember, X/(X + Y) is
always less than 1. So, the whole thing gets squished into 0 to 1, and given the various forms
involved, it becomes a beta distribution, proportional to x^{α−1}(1 − x)^{β−1}. Those are the
sorts of things one can prove.
Once again, we are not proving many of these results — that is beyond the scope of this course —
but it is good to know that these are the various shapes of distributions in the continuous
world, and that they all have these very interesting, very non-trivial interconnections between
each other.
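The gamma-to-beta connection can be checked the same way as the others (α = 2, β = 5, θ = 1 below are arbitrary; numpy's `rng.gamma` takes a shape and a scale, and scale θ corresponds to rate 1/θ):

```python
# X ~ Gamma(alpha, 1/theta), Y ~ Gamma(beta_, 1/theta) independent:
# X/(X+Y) should be Beta(alpha, beta_), for any common theta.
import numpy as np

rng = np.random.default_rng(5)
alpha, beta_, theta, trials = 2.0, 5.0, 1.0, 100_000
x = rng.gamma(shape=alpha, scale=theta, size=trials)
y = rng.gamma(shape=beta_, scale=theta, size=trials)
w = x / (x + y)

# Beta(2, 5) has mean 2/7 and variance 10/(49*8).
print("sample mean:", w.mean(), " expected:", alpha / (alpha + beta_))
print("sample var :", w.var(), " expected:",
      alpha * beta_ / ((alpha + beta_) ** 2 * (alpha + beta_ + 1)))
```

All the simulated values of `w` fall in (0, 1), and the sample moments match the beta formulae stated above.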
(Refer Slide Time: 28:14)

So, let me leave you with a few plots. Many of these distributions had complicated expressions,
and maybe you could not visualize some of them in your mind. You should be able to make plots
like these: you can use Desmos if you like, or other plotting tools — your Python notebook is a
very good plotting tool, and I plotted these on a Python notebook.
The scipy.stats module has all these distributions built into it; you just have to put the
parameters in and you will get the answers. That is what I did to get these plots. So, the
normal and the Cauchy distributions I plotted one on top of the other. If you remember, the
normal density falls like e^{−x²/2} — it falls exponentially — while the Cauchy falls only
like 1/x².

So, you can see the Cauchy sits below the normal in the middle, but the tail is thicker — a
heavier tail; it goes above the normal distribution in the tails. And the gamma distribution:
once again, Gamma(1, 1) is exponential, so the blue curve is just an exponential. As the α
parameter increases in the gamma distribution, you see that the curve moves right — the peak
keeps going towards the right. So, that is the gamma distribution for you; you can visualize
what happens as the parameters change.
The beta distribution has so many varieties; it is always between 0 and 1. Notice
Beta(0.5, 0.5) is that U shape — the U-shaped distribution. If you put Beta(2, 2), it becomes
like an inverted cup shape. If you do (2, 5), it has a nice polynomial-type rise and fall — all
between 0 and 1, by the way. If you do (1, 3) or (5, 1), you make one of the two terms
disappear, and you get a power distribution or a (1 − x)-power distribution.
If α equals 1, the x^{α−1} term disappears and only (1 − x)^{β−1} is left, so the density just
falls; with β equal to 1 it just rises. So, the power or the (1 − x)-power. Different types of
densities — you can plot them and look at them, and I really encourage you to add various
different distributions and densities to the notebook that you have, and see how they look.
That is the end of this lecture. I hope this was useful for you in learning about the various
different types of distributions out there. Many of these things will come back and haunt us
later on in this course. Thank you.
Statistics for Data Science II
Professor. Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 7.8
Descriptive statistics of normal samples
(Refer Slide Time: 00:13)

Hello, and welcome to this lecture. We are going to focus on i.i.d. normal samples. We already
saw how that is very important. And in the previous lecture, we saw quite a few related results
based on these i.i.d. normal samples. We will see a little bit more in this lecture. And going
forward, and when we start discussing statistical procedures in the later part of the course, these
normal samples quite often assumption will be that the data follows some sort of a normal
distribution, and under that there will be some standard methods and there these kind of statistical
descriptions become very important.
So, let us see how descriptive statistics of normal samples behave. You remember, when we had
i.i.d. samples we would consider the sample mean and the sample variance, and we would do some
calculations with those. It turns out that when the distribution is normal, you can say a lot
more about what the sample mean and the sample variance will be. So, let us get started.
(Refer Slide Time: 01:13)
So, what is the setting? The setting is normal samples: X1, X2, …, Xn are i.i.d. normal, with
the same mean μ and variance σ². So, μ is usually called the distribution mean, σ² is called
the distribution variance, and the samples X1, X2, …, Xn are random variables which are i.i.d.,
all with that normal distribution. Now remember, in every sampling you will get actual
instances of the samples — each X1, X2, … will take its value — and those are the samples
that are given to you.
And from there, we usually work with procedures for gleaning some information. So, that is the
sort of picture I want you to have in the back of your mind. So, like I said, this assumption of
normal distribution is quite standard in many situations, and it is a good approximation in many
situations. So, CLT, like Central Limit Theorem we saw before, is often called for and used in this
context. And even otherwise, the normal distribution is a very good assumption to make when you
do not know much about what is going on.
So, what is the sample mean? We have seen this before:

𝑋̅ = (𝑋1 + ⋯ + 𝑋𝑛)/𝑛

The sample variance is denoted

𝑆² = [(𝑋1 − 𝑋̅)² + ⋯ + (𝑋𝑛 − 𝑋̅)²]/(𝑛 − 1)
You remember the (𝑛 − 1) in the definition: to make 𝐸[𝑆²] = 𝜎², we needed that (𝑛 − 1) there.
And the crucial thing, which I told you quite often when we introduced this, is that the sample
mean and sample variance are actually random variables; in this sort of description, they are
clearly random variables. So, every time a different sampling is given to you, you can expect
different numbers to show up for 𝑋̅ and 𝑆², but they will have a distribution. Can we find the
distribution? Characterizing the distribution when the samples are i.i.d. normal is the subject of
this lecture. It turns out it is possible.
Once again, we will not go into lengthy derivations of the results; the textbooks have them. We
will only provide the final result in most cases and simply justify it using some arguments and
reasoning. So, that is the topic of this lecture: a relatively short lecture to describe what the
sample mean and sample variance distributions will look like when the samples are normal.
(Refer Slide Time: 03:40)

So, let us get started. First, the sample mean; we have essentially seen and proven this before.
The sample mean is simply a linear combination of i.i.d. normal random variables, so that is the
first and simple observation. The coefficients of the linear combination are all 1/𝑛: it is
(1/𝑛)𝑋1 + ⋯ + (1/𝑛)𝑋𝑛.

So, we already know this result: a linear combination of i.i.d. normals is again normal. So, the
sample mean clearly has a normal distribution. It is already done; sounds good, does it not? The
sample mean has a normal distribution. And what about the mean and variance of that normal
distribution that the sample mean has? The mean is going to be 𝜇, and the variance is going to
be 𝜎²/𝑛.

Both of these we already knew: the expected value of the sample mean is always 𝜇, and the
variance of the sample mean is always 𝜎²/𝑛. We know that, and that does not need the normal
distribution. So, these two facts, 𝐸[𝑋̅] and 𝑉𝑎𝑟[𝑋̅], are true irrespective of whether or not the
samples are normal; what is special in the normal case is the normal distribution itself.
So, 𝑋̅ not only has mean 𝜇 and variance 𝜎²/𝑛, which are true always, but this normal
distribution is special. 𝑋̅ will be normal only when the individual distributions are normal. When
the individual samples are normally distributed, 𝑋̅ is going to end up being normal as well. So,
that is the special property of this sample mean: the distribution is exactly characterized in the
i.i.d. normal case, and it is simple to describe also. So, that is the first property.
(Refer Slide Time: 05:27)

The next property is, sum of squares of normal samples. So, notice this is a slightly different
operation. We are going towards the variance, but you saw that in the variance, there was sum of
squares involved. And what you are squaring, in each case, also ends up being normal in some
sense.
If you go back and look at the description of 𝑆², you notice it involves terms like (𝑋1 − 𝑋̅)².
So, what is (𝑋1 − 𝑋̅)? 𝑋1 is normal, and 𝑋̅ is actually a linear combination of 𝑋1 , … , 𝑋𝑛. So, if
you take (𝑋1 − 𝑋̅), you get yet another linear combination of 𝑋1 , … , 𝑋𝑛, with slightly different
coefficients, and that also is normal. And the mean of that normal distribution is 0: it is not very
difficult to see that 𝐸[𝑋1 − 𝑋̅] = 0, because 𝐸[𝑋1] = 𝐸[𝑋̅] = 𝜇. So, given that, if you want to
find the distribution of 𝑆², you should first know the distribution of a sum of squares of normal
random variables. So, that is needed.
So, that is why we are going after sum of squares of normal samples. And here, I can assume that
the mean is 0. So, I am going to look at the situation where 𝑋1 , … , 𝑋𝑛 are i.i.d. normal mean 0 and
variance 𝜎 2 . So, that is enough. When you want a distribution of 𝑆 2 , I just have to look at this guy.
So, it turns out this 𝑋𝑖² is actually gamma distributed. We have seen this before, have we not?
𝑋𝑖², the square of a normal random variable with mean zero and variance 𝜎², is
𝐺𝑎𝑚𝑚𝑎(1/2, 1/(2𝜎²)) distributed. We saw this result in the previous lecture; try and refresh
yourself: the square of a normal is gamma distributed.
And here is a new result. Sum of n independent 𝐺𝑎𝑚𝑚𝑎(𝛼, 𝛽) random variables is also another
𝐺𝑎𝑚𝑚𝑎(𝑛𝛼, 𝛽) distribution. So, this is a slightly non-trivial thing, but you can see why this is
true. Gamma itself is like a sum of several exponentials, something like that. So, when you add
up several independent gammas, you can expect the final thing to also be gamma. This is yet
another result about these distributions. We will not prove it, but it is quite intuitive in some
sense.
So, you add n independent gamma distributions, one of the parameters of gamma gets multiplied
by n, the other parameter remains the same: 𝐺𝑎𝑚𝑚𝑎(𝑛𝛼, 𝛽). So, now we are almost done. If
you take 𝑋1² + ⋯ + 𝑋𝑛², you will actually get a 𝐺𝑎𝑚𝑚𝑎(𝑛/2, 1/(2𝜎²)). So, this is the final
result for the sum of squares of normal samples. It actually ends up being gamma distributed.
So, notice the difference: the sum of n i.i.d. normal samples will again be normal. But if you
square each term, remember, when you square, it takes only positive values, right, and the
distribution changes: the square itself becomes gamma. And when you add a bunch of squared
things together, it still remains gamma, but it is a different gamma distribution,
𝐺𝑎𝑚𝑚𝑎(𝑛/2, 1/(2𝜎²)).

Now, the case when 𝜎² = 1: mean 0 and variance 1 is the standard normal, is it not? So, if you
take 𝜎² as 1, you get 𝐺𝑎𝑚𝑚𝑎(𝑛/2, 1/2). This is called the Chi-squared distribution with n
degrees of freedom, and it is denoted with the notation 𝜒𝑛². So, this Chi-squared distribution is
actually a special case of a gamma distribution, where the first parameter is 𝑛/2 and the second
parameter is 1/2.

So, this is the distribution for the sum of squares of normal samples. Some of these results we
sort of have an inkling for, and some we have not proven in detail, but hopefully this is clear
enough and a good result to remember. The sum of n i.i.d. normal samples will be normal; the
sum of squares of n mean-0, variance-𝜎² normal samples will be a scaled Chi-squared
distribution, namely 𝐺𝑎𝑚𝑚𝑎(𝑛/2, 1/(2𝜎²)). Good thing to know.
(Refer Slide Time: 10:14)
So finally, we are ready to state the main result about the distribution of sample mean and variance
for normal samples. So, this is the result, which we will not prove; in fact, we will not prove it
at all, but it is easy for you to sort of justify for yourself why it should be true. We have seen
enough results of a similar flavor. Some of these things may be a little bit confusing, but they
are actually true; it is not very easy to prove this, and it takes a lot of calculation. So, we are
going to skip the proof in this class, but this result we should remember. We should know that the
joint distribution of the sample mean and sample variance is known. Look at this result; the
theorem puts it down very clearly. If 𝑋1 , … , 𝑋𝑛 are i.i.d. 𝑁(𝜇, 𝜎²), then the sample mean 𝑋̅ is a
normally distributed random variable with mean 𝜇 and variance 𝜎²/𝑛.

So, the marginal distribution of 𝑋̅ is 𝑁(𝜇, 𝜎²/𝑛); you know this. And the marginal distribution
of 𝑆² is a scaled Chi-squared with (𝑛 − 1) degrees of freedom. Remember, Chi-squared with n
degrees of freedom is 𝐺𝑎𝑚𝑚𝑎(𝑛/2, 1/2); this is Chi-squared with (𝑛 − 1) degrees of freedom.
So, notice the scaling there: it is (𝑛 − 1)𝑆²/𝜎² that has the Chi-squared distribution.

So, if you remember, 𝑆² = [(𝑋1 − 𝑋̅)² + ⋯ + (𝑋𝑛 − 𝑋̅)²]/(𝑛 − 1). So, if you compute
(𝑛 − 1)𝑆²/𝜎², you are going to get ((𝑋1 − 𝑋̅)/𝜎)² + ⋯ + ((𝑋𝑛 − 𝑋̅)/𝜎)². So, this is the
quantity, and this quantity, we are saying, is Chi-squared distributed with (n − 1) degrees of
freedom.
So, this looks like n, sort of, independent identically distributed random variables, and you are
adding them all up.
Should it not be Chi-squared n? Why is it Chi-squared (n − 1)? There is a subtle difference
between the previous result we showed and the result we have here. If all the terms were
independent, you would have Chi-squared n; but here, all of them are not independent. Why is
that? If you take two at a time, you can show some interesting correlation properties, but all of
them together cannot be independent. It is because of what happens when you add up
(𝑋1 − 𝑋̅) + ⋯ + (𝑋𝑛 − 𝑋̅). If you add up all these terms, what do you think you will get?
∑ᵢ₌₁ⁿ (𝑋𝑖 − 𝑋̅)/𝑛 = 𝑋̅ − 𝑋̅ = 0

So, what will you get here if you do this? The first term will give you ∑ᵢ₌₁ⁿ 𝑋𝑖/𝑛 = 𝑋̅. And the
second term gives −𝑛 (𝑋̅/𝑛) = −𝑋̅, so the whole thing is 0.

So, they add up to give you 0. So, these n terms together are not all independent. Each of them
is normal and identically distributed, but together there is a dependency. If you take (n − 1) of
them, there is no dependency; the last one is just floating around. That is why you have only
(n − 1) degrees of freedom; degrees of freedom is like how many varying things there are in
these n terms. Even though you are squaring and adding n terms, only (n − 1) are really
independent. The nth term is directly determined by the others: given (n − 1) of them, the nth
one is deterministic and does not contribute to the distribution. So, this (n − 1) sort of comes
from there.
Like I told you, we will not write down rigorous proofs; high-level intuitive ideas on why this
result might be true is all I am giving you. The first result, the marginal distribution of 𝑋̅, is
something we have proven very clearly. The marginal distribution of 𝑆² is a scaled Chi-squared
with (n − 1) degrees of freedom.
So, remember that there is a scaling there, (𝑛 − 1)𝑆²/𝜎², to make sure we get rid of all the
extraneous factors and get a sum of squared normals, which leads to Chi-squared; but notice the
(n − 1). Because of the subtraction of 𝑋̅, all these terms add up to 0, so only (𝑛 − 1) things are
independent; the last one becomes dependent. So, good to know.
And then finally, here is the big result, 𝑋̅ and 𝑆 2 are actually independent of each other. So, if you
want to find the joint distribution of 𝑋̅, and 𝑆 2 you can simply multiply the distribution of the
marginal of 𝑋̅ with the marginal of 𝑆², and you will get the joint distribution. It is a nice thing
to know, is it not? These two are independent: the sample mean and the sample variance end up
being independent. The sample mean is, of course, marginally normal; the sample variance is a
scaled Chi-squared with (𝑛 − 1) degrees of freedom. And that is the big-ticket result.
So, we will use this result in the ensuing weeks. Let me pause here. This is the final result I
wanted to do this week, and it is an important turning point in the course. So far, we have
looked at the foundations of probability, various different distributions, their types and
manipulations, and what they mean. All along, I have been carrying some implications for data
and models and some statistical sort of thinking. Now, we will move purely into statistical
procedures.
So from now on, in this course, we will start with data and samples, and we will start developing
statistical procedures: understanding them, what kinds of questions are asked, what types of
answers people give, how to interpret those answers, and how to tell stories based on data and
statistical procedures. Of course, we will do this more from a foundational, theoretical point of
view; there are other courses that will build on this and take it further. Thank you very much.
We will meet again later.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Parameter estimation: Statistical problems in real life
Hello and welcome to this week in the Statistics 2 course. We are going to make a big jump
into sort of the next stage of the course, where we will start looking at statistical procedures.
The first type of procedure that we look at is something called parameter estimation. This is a
very important and critical part of various statistical analysis methods.

(Refer Slide Time: 0:44)

But before we launch into this, I am going to first present a very high-level view of how
statistical problems arise in real life, how one goes about looking at these problems, and just a
big-picture view of what happens. As it turns out, in this particular course in the program you
will only be studying a very small piece of the entire process, but it is a foundational, important
piece.

Once you study this, there will be other courses that come later on in your program where
you will get a good understanding of the overall picture of how statistical problems arise in
real life, how you look at these phenomena or study them, and what are the main things
to keep in mind.

So, this lecture is a bit atypical compared to the usual lectures we have. So, this is going to be
slightly more high level. I will ask some interesting questions for which maybe there are
statistical analysis and methods possible and then talk about various things that are involved.
But I want to remind you once again this course is more of a probability statistics basic
foundation course. We will get into mathematical aspects soon enough. But this lecture is a
little bit more from a high level.

(Refer Slide Time: 1:52)

So, I want to first present a few examples of questions for which maybe we want to look for a
statistical analysis and answer. So, here is the first question. Who is the best captain in the
IPL? So, we have been looking at IPL data throughout, so I thought the first example must be
something related to IPL. Typically, for any question like this, if you want to answer it in a
statistical, data-driven way based on some past data and make some meaningful conclusion, you
always start with understanding the problem and planning how you are going to approach the
solution.

So, to understand the problem you have to ask questions like what are the qualities of good
captain. Some people might say win/loss record, so whoever wins the most is the best captain.
But that is not really a complicated enough question; it is easy to go see the win/loss numbers
and put a tick mark against the top one. The most winning captain is not what I am after; I am
after the vague word called best.

Some people would say under a great captain, players play to their best. So, maybe you want
to see how the other players were performing; did they do much better than their average
during the captaincy of a particular person, things like that. So, you may want to ask very
interesting questions and often domain knowledge is important in the planning phase and the
problem understanding phase. Someone who knows cricket, someone who knows captaincy,
experts' opinions; maybe you want to use their input as part of the statistical decision
process.
So you see, this phase of the statistical analysis process is not so mathematical, analytical and
probability oriented; it is much more about knowledge of the area of the problem and what one
should look for. Typically, many of these problems will not have an easy and unique answer; it
will not be obvious. If the answer is obvious, it is a very well-defined question, and usually
statistical, data-driven methods are not really needed for those kinds of questions.

So, opinions start entering the picture when you ask questions like this: is it backed up by data,
or is it just opinion? For instance, if you ask me who is the best captain in the IPL, I am not
going to go into statistics and all that; I am simply going to give you my answer. I think most of
you listening to this lecture will probably agree with me that Dhoni stands a very good chance of
being one of the best captains in the IPL. Many people would agree with this.

And is it based on facts? Yes, of course, he has won a few championships, but he has also lost
some, and still people would consider him a great captain. So, why is that, and is it backed by
data, or is there opinion in it? We do not want to be driven so much by individual opinion and
just sampling and getting answers; we also want to see data and check whether this idea that one
person is the best captain is justified by data in some way.

So, that takes us to the next part which is data. So, you think of the problem, plan it and
debate it for a little bit with experts, come up with some sort of a plan maybe roughly and
then start looking at what data is available. This is very important, this is where the statistical
nature of the analysis starts entering the picture. You do not want to just go based on opinions;
it is not an opinion-based survey, and it also needs to be backed up by some data in some way.

So, you start looking at what data is available. Score sheets from matches are typically the
best possible data here and how do you collect it, how to consolidate it, we have already seen
how to do that for the IPL for instance. There are other tournaments that you may want to
look at and you could maybe do some sampling of experts and fans. I mean it can be part of
the input and you can design a clever sampling method and pick some experts etcetera to give
them their opinions.

Because at the end of the day, some opinions may matter also. So, you may decide to do a little
bit of sampling and collect some data of that fashion, but the data from the matches is also very
important to collect. So, this is how a question
like this would be statistically looked at and answered. You analyse the problem, understand
the problem, plan about it, think about it, debate about it. Once you have some sort of a
strategy on how to approach, start looking at data and then collect the data and then proceed
further.

The next stage is the analysis stage. So, in the analysis stage you already have the data, you
are going to start looking at descriptive statistics, understanding the data, histograms, scatter
plots, dependencies, this and that; so far, we have already seen quite a few instances where we
did all these.

So, how do you fit a model? Is there a standard probability distribution which would describe
the data very closely? Can we approximate the histogram with a continuous distribution, or do
we do discrete modelling? There are lots of these kinds of models for how the data could have
come about, and my central question influences the model.

So, that is where these kinds of hypotheses come in. You may ask: in the matches in which a
particular person was the captain, what happened in that match? Was there a situation where
captaincy played a role, as opposed to just the individual talent of a player or natural conditions?
Did the captaincy play a role actively in a match? So, that could be a hypothesis.

The hypotheses are: captaincy played a role in the match, versus captaincy did not play a role in
the match. Looking at the data, meaning the scores of the match, how do you decide one way or
the other? So, you can imagine the questions one can start asking. And formulating these
questions itself is hard; it is not that easy.

And then, after you formulate, you have to think of a model where captaincy would have
affected the flow of the match in some way; maybe a change of bowler would have happened
and a wicket would have fallen. You have to start looking at these kinds of things, and that is
where your knowledge of dealing
with statistical procedures within a model will help you. Once you have a model, there are
fixed statistical procedures for finding unknown parameters and testing hypotheses.

In this course, that is what we will study, in a very detailed way. And how to use it? When you
study all of that, you will get ideas on how to use it to answer a bigger question. So, that is what
happens in real life; the analysis part is usually very well-defined in practice. You have a model,
you fit the model, and then you ask some specific questions of estimating parameters or testing
hypotheses within that.

So, many of these words may not make too much sense to you yet; I gave you some examples,
but this is the sort of thing that happens in the analysis phase. If you talk to a lot of people
working in data science, they will say the analysis phase is, in some sense, the simplest of the
phases. What do they mean by simplest? Simplest meaning there are usually only one or two or
ten known ways to do it. It is easy to do: there will usually be a program, you plug the data into
the program, and it will give you some answers, some fit, etcetera; a well-defined process.

But then, you may also have to derive some metrics that measure captaincy in the IPL. Like I
said, how do you measure the influence of captaincy in a match? What is the metric for that?
How do you define it? All these kinds of things also enter the picture. So, it makes things a bit
more complicated: you need to know some domain knowledge as well as the statistics, and put
them together to come up with some clear analysis.

So, that is the analysis part. And like I said, in this course, in the coming weeks, we will focus
on estimating unknown parameters and testing hypotheses within a model. It seems like a very
small part of the big process, but it is a foundational piece, and you have to know it as well. So,
this is the analysis part.

Then comes what, in many people's minds, is the most important part: the final result, the
conclusion that you are going to draw, and how you are going to communicate it to others. Why
am I emphasizing that? Why is that so important? I will give you a specific example later, but
even here, just because you conclude something, everybody is not going to believe you; this is a
question where there can be 20 different answers.

Maybe not 20, I do not know; 8 or 10 different answers. Many people might think differently,
and how you communicate should be convincing enough. So, keep a typical fan in mind, a
thinking fan maybe, and give them some reasons for why your analysis is good and why you
think you have come to the conclusion. That is why the conclusion and communication part of it
is very, very important.

I mean, you might have reached the correct conclusion, but if you do not communicate it
clearly, people will, first of all, not believe you. They will think this is just some opinion from
some guy and not believe it. But if you communicate it cleanly, at least you can get across what
led you to make this kind of a statement about a particular captain in the IPL.

So, hopefully you got the flow of this. This is how a typical data problem or statistical analysis
works; there are many different phases, and like I said, in this course we are going to focus on
the analysis phase. There are other courses that you will do later on in this program which will
talk about the other parts; the business courses particularly, and even the ML courses
sometimes, will work with you on the other parts of the entire process.

But for us, in this foundation course, we are focusing on estimating unknown parameters and
testing hypotheses. That is what we will focus on. I want to give you a couple more examples of
this flavour so that you get an idea of the big picture, and when you get lost in the equations,
you can just close your eyes and think: okay, there is a big picture, eventually I will be happy,
but let me learn this first. So, just to keep you motivated, I am giving you some big-picture
ideas on how statistical analysis works.

(Refer Slide Time: 12:18)

So, the next example is also something that is very close to many of our hearts: tiger
conservation. How many tigers are there in India? This is a question that is very important to
answer environmentally; as many of you know, the number of tigers represents the health of the
environment in one single number. So, we know that that is very important.
In fact, there is a statutory body called the National Tiger Conservation Authority. I would
welcome you to visit their website. They do a lot of activities; one of the big activities is the
census: counting the number of tigers. And again, I am sure they went through a problem
planning phase, a data phase, multiple phases; it is a big process. I will once again urge you to
go to that website and read about how they do the tiger census and their methodologies; there
are lots of interesting statistical aspects here.

Of course, it starts with some sort of a sampling phase. The sampling happens in different ways:
survey, satellite view, other data, landscape characterization; a lot of things happen. A very
important innovation in the past 5 to 10 years is the use of camera traps. Cameras are put out in
different places which sense movement and take pictures; those pictures are accessed by people,
a lot of tigers are captured on those cameras, and that has become a very popular way of
counting tigers.

Now, statistical methods play a big role here. The tigers that are caught on camera give you a
certain number, but what about the tigers that could not be caught on camera? How many would
there be? How do you estimate the number of tigers that were not captured on camera?
Sophisticated statistical methods are used for this. There is a model that is built to estimate
parameters; then a joint-distribution-type model is built across the geography for the density of
tigers, and based on that an estimate is made.

So, we are not going to go into details here, but you can see what a practical problem in real life
looks like and the role statistics plays in solving it. Just to give you numbers: in 2018, according
to the official data, 2461 tigers were captured on camera, and based on statistical analysis
methods, people have estimated that there is a total of 2967 tigers.

So, of course, the statistics tells you that there are more tigers than were seen on camera. Some
of you may say, I am not going to believe it; that is fair, like I said, opinion comes in. But you
should read their brochures and methods, and if you read through them, I think you will be
convinced that this number is close to the truth.

They followed a very rigorous, clean method to estimate, and they do a lot of review; a lot of
international experts come in, so this number is not put out lightly. It is still disappointing; I
think maybe we need to double or quadruple this number to bring our forests to a healthy state,
and maybe it will happen soon. So, let us hope for the best. So, that is tigers.

(Refer Slide Time: 15:31)

The third example is probably very close to our hearts right now. In our own program, we are
thinking of doing a remote-proctored exam. A very pertinent question is: was the
remote-proctored exam successful? You do a remote-proctored exam for thousands of students;
was it successful?

Now, why is this important? The administrators of the exam have to convince authorities in the
universities, and also convince students, that the exam was indeed successful: it happened, it
worked well, it was under control, the performance was as expected; all of these are important.
So, you have to convince people that it is true. How do you do that? How do you statistically
approach this problem?

It is easy enough to think about; all of us are experts in this area, and no extra domain
knowledge is needed, we know what happens. So, how to assess the success of the exam? The
honor code comes in as an important criterion: how many violations of the honor code were
there, and how strongly did people collaborate to answer the questions? Can we take the exam
as a proper measurement of an individual's knowledge, or did they collaborate and do
something which violated the honor code?

So, these are important things, and we can keep that as the metric. If you know that there were
no significant honor code violations, at least the sanctity of the exam was not violated. That is
the part we are interested in when we say successful; of course, how well students learn and
how well they perform is different, but this is just about the success of the exam administration
process.

So, here is a picture; I think it is from one of our exams. I have blanked out the faces for
privacy, but you can see this is how the exam happened, and you can see the number of things
that can go wrong here. The data that one can look at is the scores that we got, some data from
G-Meet, some data from proctors on how the invigilation went: how the student's camera angle
was, how the connection was, whether we could see them all the time, things like that; all that
we can use.

And we can also use scores in previous in-person exams, say in the same subject or related
subjects, written by the student. Why is that important? Because in in-person exams, we believe
honor code violations are difficult: the sanctity is more clear, the rigor is more clear there. So,
we have scores in previous in-person exams. This, I believe, is at least the data that we can use
to figure out and answer the question of whether the remote-proctored exam was successful.

Then you can do analysis. So, you can look at the histogram of the scores: is it similar to
previous in-person exams, or very different? And from there you can figure out whether the
honor code was violated. Did a particular student violate the honor code? This can be a
question, and you can have a hypothesis on it, look at all the data associated with that
student, and have a way of making a call on whether or not the honor code was violated. Can you
estimate the number of violations, can you estimate groups which possibly collaborated, can you
do all that, can you compare?

For instance, we are getting answer sheets scanned by students and then there are people
who are going to look at them. How strongly should we invest time and energy into looking at
answer sheets? Whose answer sheets should we look at? Should we do all that? So, all of those
decisions are also statistics driven, data driven. So, look at the nature of how statistics enters
the picture, how you have to look at the data, how you have to make decisions driven by
data, and how your analysis methods and processes must be statistically strongly motivated.

Finally, of course, conclusion and communication are very important. Again, we have to
communicate to university authorities and tell them, yes, this is a good exam, we can rely on
its marks and methods. You may have heard about so many other universities doing online
exams; they get challenged in court and people have to scramble to justify whether the exam
was good or not, and all sorts of opinions are floating around, because they do not have a
proper way of statistically studying this and being sure about it.

And more importantly, I think you have to communicate to students. Students who take
part in this program should believe that, yes, the exam went well; only then will they have
confidence in the program itself and its fairness. So, I was showing you three different
examples that hopefully all of us have a connection to, and we know what these examples mean
and how statistics, and data-driven decision making, plays a very huge role in each and
every one of them.

In this course, once again, we will not do all of these; we will do only a very small part. But
still, I want you to keep these kinds of examples in the back of your mind. So, when you
become a data scientist, when you start looking at data, these are the kinds of questions that
will come, maybe in a corporate setting, maybe in a company setting, maybe in an industry
setting, maybe in a social setting, maybe in a policy setting, various different settings. But you
have to have this comfort of dealing with people who can do this, of collecting data and
analysing it in a statistically strong manner.

That is all for the examples we are going to do. We will soon jump into other things, but
hopefully these examples give you motivation for why these ideas are important.

(Refer Slide Time: 20:52)

What is the importance of communication? I keep saying communication,
communication, etcetera; why is it so important? Here is a simple example to illustrate
how communication can completely change your perspective of what is being told to
you. So, let us say 1500 students wrote an exam in a course, maybe an in-person exam.
There was one honor code violation; somewhere, somebody was caught doing something
which is not acceptable.

Maybe in a remote-proctored exam, there were 2 honor code violations. Would you say the
remote-proctored exam was successful, at least from the honor code point of view? Was the honor
code maintained? I mean, most people would probably say yes, this is not a big
deal. But notice how communication can make a big difference. I can
communicate this scenario in two different ways. One: I can say there was a 100 percent increase
in honor code violations in remote-proctored exams.

Is that wrong? No, that is correct. Honor code violations went from one to two;
that is a 100 percent increase in honor code violations in remote-proctored exams. A
factually correct statement, but it conveys a completely wrong idea about what happened in the
exam. Wouldn't you agree?

But look at communication 2; maybe it is more real: honor code violations were within 0.15
percent of students under remote-proctoring. It seems more truthful, at least to me; communication 2
seems to be more truthful about what actually happened, as opposed to just being factually correct
but truthfully wrong, not expressing the truth of the situation.
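The two framings above differ only in what the same count is divided by. A quick sketch, using the example's own figures, makes the arithmetic explicit:

```python
# Two factually correct framings of the same honor-code numbers
# (1 violation in person, 2 under remote proctoring, 1500 students).
in_person_violations = 1
remote_violations = 2
students = 1500

# Framing 1: relative increase in the number of violations.
increase_pct = (remote_violations - in_person_violations) / in_person_violations * 100
print(f"{increase_pct:.0f}% increase in violations")  # 100% increase

# Framing 2: violations as a fraction of all students.
rate_pct = remote_violations / students * 100
print(f"violations affected {rate_pct:.2f}% of students")  # about 0.13%
```

Both numbers are computed from exactly the same data; the choice of denominator is what changes the impression conveyed.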

So, I am sure you will understand this. Tomorrow, when you see such communications in
the press or on social media, something which gets you to think in a certain way, be
cautious. Be doubly cautious; do not let your thoughts be controlled by the news and things that
you see. Quite often the person who sends it has their own ideology and is
communicating to you in a factually correct but truthfully wrong way.

So, be very careful about it. A truthful representation of what the data conveyed is very
difficult to find; there are lots of websites out there trying to do this, but it is very hard today.
In today's social-media-driven generation, this is a skill you have to pick up: you have to look
at a news item, look at a communication piece, and first ask, where is the data, let me understand
the data myself. And then I will see if the communication is correct or not.

So, this is important; this is the power of communication. You can do the
greatest analysis in the world and come up with all sorts of sampling and data collection and all
that, but if it is communicated wrongly, a completely wrong impression can get out as well.
Anyway, one slide on communication, that is all we will have in this course before we jump
into proper statistics.

(Refer Slide Time: 23:35)

So, let us quickly summarize. Statistical problems in real life have so many different
dimensions. We are going to look at just one small part in this course, the analysis part. In
particular, we will be looking at unknown parameter estimation: methods for it, how to
analyse them, how to develop them, and how to estimate parameters in relationships
between variables and factors. The next part is testing hypotheses using data.

We will stick to our wonderful iid sample model and do our analysis and development of
methods within that model. Of course, this will ultimately sit inside one part of a big
statistical problem which is analysed in a multi-dimensional, wonderful way, but as far as this
course is concerned, we are doing this small part.
(Refer Slide Time: 24:22)

What about books? So, we will continue to use our standard textbook for the course. But
there are two other books that I would like to draw your attention to, which I have learned a
lot from. One is Mathematical Statistics and Data Analysis by John Rice, published
through Cengage Learning. You can get it cheaply; you can buy a paperback, for instance, at
quite a low price. A digital copy is not widely available yet for this one.

It is a good reference; it has all the theory and equations, sort of modelling driven in some way. It
also has lots of data in simple practical examples, so it is a bit data driven in some sense. It is
a good book; I will pull examples and ideas from it when I teach. And
another book, written for the general public, and really again a fantastic book, is
The Art of Statistics: Learning from Data by David Spiegelhalter.

That is a Pelican book; once again, I think it is not too expensive to buy in India. There is
also a Kindle version, for instance, if you want to get it on your Kindle. It has no equations,
but it talks about the entire statistical problem-solving approach; for instance, the phases I
spoke about are all described wonderfully in this book, and so the entire problem and all the
aspects involved in it are brought out very nicely. But it is for the general public; you will
get a good idea by reading these books. So, these are some books that I would like to use.

So, that is the end of this little motivational lecture where we discussed some general
ideas about statistical problems in real life. We will not come back to these kinds of
discussions much later on in this course; we will jump deep into iid sample models, estimating
unknown parameters, testing hypotheses, etcetera. Later on, when you see other examples, you
will be able to connect it all together. Also, I will try as much as possible to motivate some of the
procedures with examples, maybe on a small scale, while we develop these ideas.

I hope you are excited about learning statistical procedures. It is a big jump from all the
probability, where we have been doing a lot of painful calculations; now we sort of get to apply
it in the realm of statistical procedures. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Parameter estimation: Introduction to parameter estimation

(Refer Slide Time: 0:13)

Hello and welcome to this lecture. In the previous lecture we saw how a statistical problem is
phrased and how one needs to think about the big picture of the problem. And then we said we
will look closely at one small piece of it, which is the analysis part within the probabilistic
statistical setting. One such method, or procedure for statistical analysis, is
something called parameter estimation.

So, this shows up quite often within the realm of a bigger statistical problem. You will think
of the problem, break it down, break it down, and finally you will have to find one parameter which
is missing in your model using data. Data samples have come, you have a model
for them, and there is a parameter missing. So, you have to find that parameter: your model is
parametrized by something, and you have to find that parameter.

So, this is very typical, and what we will see first in the next few lectures is, given a
problem like that, how do you go about finding a parameter from samples. We have already
seen before that iid samples contain a lot of information about the underlying distribution. So,
if there is a missing parameter of the distribution, how do you find that parameter? That
is the problem of parameter estimation.
(Refer Slide Time: 1:32)

So, let us see a few illustrative examples. The first, and I would say the most typical, example,
with which we will again and again introduce new ideas, is Bernoulli(𝑝) trials. It is a very
simple example, but it still has within it a lot of meaning, and a lot of ideas can be explained
very easily using Bernoulli trials. So, I will use this example over and over again to illustrate
the important ideas.

So, what is the setting? There are 𝑛 Bernoulli(𝑝) trials. There is a trial which you repeat 𝑛
times, independently and identically. Every trial is successful with probability 𝑝 and not successful
with probability (1 − 𝑝). And this 𝑝 is the unknown parameter. There is an underlying
distribution, iid samples are coming out, and the parameter 𝑝 is unknown. In this case there
is just one parameter 𝑝; it lies between 0 and 1. It is the probability of success, and it is unknown.

So, here is one set of samples I have pulled out, some 10 samples from this Bernoulli
trial: 1, 0, 0, 1, 0, 1, 1, 1, 0, 0. It could be anything else also; if I repeat it, I may
get another set of samples. So, this is how it will look, just to be very clear. So, I have
put here: can you guess 𝑝? Or can you estimate the value of 𝑝 given these samples?

So, that is the sort of question we are going to look at. Maybe 10 samples is too small; you may
not want to commit to a value of 𝑝 with 10 samples, but maybe 100 samples are given, maybe
500 samples are given. Then, can you say with some more confidence what 𝑝 can be? That
is the essential nature of this kind of parameter estimation problem. There
is a parameter of the unknown underlying distribution, and you are getting iid samples
from the distribution. Can you use the samples and estimate the value of the parameter? So,
that is the example.

So, it is a simple problem. Hopefully, this is clear. This is the typical problem; I will
come back to it again and again whenever I want to illustrate an idea. Whenever I want an
example, the first example I will come back to is Bernoulli(𝑝) trials. So, remember that.

(Refer Slide Time: 3:37)

So, let us look at a few more examples where this notion shows up: iid samples from a distribution
with an unknown parameter, and us trying to find the parameter. It shows up quite a bit
in scientific experiments, particularly in engineering applications. So, here is a scientific
experiment that is very common. If you have a radioactive substance, it puts out some particles;
that is the meaning of radioactivity, some particles come out of it, and one such particle is the
alpha particle, which is 2 neutrons plus 2 protons, just the nucleus of helium, for instance.
So, it is called an alpha particle.

And this is a very common emission, and the theory is that you can model the number of
particles emitted within a fixed time period, for instance a 10-second interval here, as a
Poisson distribution. Several experiments have been done, and there are also
some theoretical reasons for why it can be expected to be a Poisson distribution. We know the
Poisson distribution very well: it could be 0 particles, 1 particle, 2 particles, 3 particles, 4 particles,
and so on, and the probability that 𝑛 particles are put out in a 10-second interval is
𝑒^(−𝜆) 𝜆^𝑛 / 𝑛!.
Now, depending on the actual radioactive substance that you have, this
𝜆 may vary. So, one type of, say, uranium will have one type of 𝜆, and some
other plutonium may have some other 𝜆. So, the 𝜆 may vary from one
radioactive substance to another, but the model remains the same. So, this 𝜆 is an unknown
parameter of a distribution.

So, if you have an unknown radioactive substance, you count the alpha particles using
some counter and then you get the data. You have to go back and find the 𝜆 from the data. So,
that is the problem of parameter estimation once again. Here are samples from one observation;
I have taken this from John Rice's book that I described in the previous lecture. You can
see the data here: in 18 intervals, 0 to 2 particles came out.

So, you observe the number of alpha particles over several intervals; here I think there are some
2700 intervals or so. In 18 of those intervals, 0 to 2 particles were observed; in 5 of those
intervals, 17 or more particles were observed, etcetera. So, that is the data. From
data like this you now have to find the 𝜆. What is 𝜆? That is the parameter estimation problem.

Again, a typical parameter estimation problem; this kind of problem shows up quite a bit in
modelling. So, notice, there is an underlying distribution, in this case Poisson,
there is an unknown parameter, and from the samples that you got in one observation, you
have to find the 𝜆. So, remember again, this is one measurement: 2700
intervals of 10 seconds each, in which you did a measurement.

Tomorrow you come back and repeat it; you may get slightly different data for the same
substance. That is the nature of the problem, that is how iid sampling works. One sampling
will give you one set of data, the next sampling will give you another, but within that you
have to operate and come up with a procedure to find the 𝜆 reliably and correctly. It may not
be too different, it will be somewhat similar, but it will not be the exact same numbers; keep
that in mind. So, this is another example. There are many more examples like this.
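To see how this might look in simulation, here is a hedged sketch: the counts below are simulated, not John Rice's actual table, and the value `true_lam = 3.87` is an arbitrary choice made only for illustration. The sample mean of the interval counts is a natural estimate of 𝜆, since the mean of a Poisson(𝜆) distribution is 𝜆:

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """One Poisson(lam) draw via Knuth's method: multiply uniform
    random numbers until the product drops below e^(-lam)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod < limit:
            return k
        k += 1

true_lam = 3.87  # hidden in a real experiment; fixed here to simulate
counts = [poisson_sample(true_lam) for _ in range(2700)]  # 2700 intervals
lam_hat = sum(counts) / len(counts)  # the sample mean estimates lambda
print(round(lam_hat, 2))
```

Rerunning with a different seed gives slightly different counts and a slightly different estimate, which is exactly the repeat-the-experiment behaviour described above.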
(Refer Slide Time: 7:09)

I just want to give one more example, which is noise in electronic circuits. If you have a
circuit and you want to measure the voltage or current, you will see that over time, even though
nothing else is changing in the circuit and you expect the voltage or current to be the same,
there will be some random fluctuations if you keep measuring with very sensitive measurement
instruments. That is because of various noise processes that happen with electrons and other
things that are there in the circuit.

And so, it is very typical to model the voltage or current as a Gaussian distributed random
variable. It will have a certain mean and a certain variance. So, now you can do iid
measurements of the voltage; you can keep making measurements over and over again, every
measurement independent and identical. The underlying distribution here is Gaussian, or
normal, and it has got two parameters: there is a parameter 𝜇 and a parameter 𝜎².

So, if you do iid repetitions of the measurements, say 10 measurements, you may get
something like this: 1.07, 0.91, 0.88, and so on. From these kinds of measurements, you may ask, what is
𝜇, and you may ask, what is 𝜎? So, notice the small difference from the previous two
examples. In the previous two examples, there was only one parameter; here you have 2
parameters. You can also have 3 parameters, 4 parameters, any
number of parameters for your unknown distribution, depending on what the distribution is.
So, maybe your distribution is 2 Gaussians mixed up; then you will have 4 different
parameters to work with, things like that. So, you can have various types of underlying
distributions, and you may have access to iid samples from the distribution. There is an
unknown parameter for that distribution; can you find the parameter from the samples? That
is essentially the parameter estimation problem. I hope it is clear to you.
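Here is a minimal sketch of this two-parameter case, with hypothetical voltage readings; the values of `true_mu` and `true_sigma` are arbitrary choices made only so the simulation can run. The sample mean estimates 𝜇 and the sample standard deviation estimates 𝜎:

```python
import math
import random

random.seed(2)

# Hypothetical noisy voltage readings, modelled as iid N(mu, sigma^2).
true_mu, true_sigma = 1.0, 0.1  # unknown in practice; fixed here to simulate
readings = [random.gauss(true_mu, true_sigma) for _ in range(1000)]

mu_hat = sum(readings) / len(readings)  # sample mean estimates mu
var_hat = sum((x - mu_hat) ** 2 for x in readings) / len(readings)
sigma_hat = math.sqrt(var_hat)  # sample standard deviation estimates sigma
print(round(mu_hat, 3), round(sigma_hat, 3))
```

The same samples are used to estimate both parameters at once, which is the only real difference from the one-parameter examples before.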

This is one small procedure that may fit into several statistical analysis methods, but it is a
foundational procedure which you should know really, really well. If anybody
comes tomorrow, gives you a distribution and says, hey, here are iid samples from this
distribution, I know for sure it is from this distribution, here are the unknown parameters of
the distribution, can you find out these unknown parameters for me, you should be able to go
ahead and do that. So, that is the parameter estimation problem.

(Refer Slide Time: 9:29)

So, here is the setting; I have put it out based on the previous observations. You have iid
samples 𝑋1 to 𝑋𝑛 from a common unknown distribution 𝑋. The distribution itself is not totally
unknown (it could be, in which case things are a bit more complicated); let us say you know
that it has some distribution which has a few parameters.

In general these parameters are called 𝜃; in most statistics textbooks you will see they are called
𝜃, 𝜃1, 𝜃2, 𝜃3. You may have any number of parameters, and they may sometimes be collected into
one vector called 𝜃, so let us just say 𝜃1, 𝜃2, etcetera. We will assume these
parameters are real valued; each 𝜃 could also be an integer or something, in which case you
have to adjust, but generally we will take it as a real value. The parameter estimation problem
asks you: what is 𝜃1, what is 𝜃2? Can you estimate 𝜃1, estimate 𝜃2? That is the question.

So, we will slowly see some features of this process: what is possible, what is not possible,
and what kinds of assurances you can give about your estimate. You cannot just
randomly say some value for 𝜃; that is not the point of this exercise. It should
depend on the samples, and you should have some intelligent answer which you can also justify;
that is also important.

So, what is an estimator? We have talked about the estimation problem; the solution to an
estimation problem is something called an estimator. The estimator for a parameter 𝜃 of the
underlying distribution is a function of the samples. You have 𝑛 samples 𝑋1 through 𝑋𝑛, and
this function is usually denoted 𝜃̂. You see that hat on top of 𝜃, a small inverted-triangle-type
shape; that is what is called the hat.

So, it is very common to denote the estimator of 𝜃 as 𝜃̂. And what is that 𝜃̂? It is a function of
the samples. You give it the samples, this function will put out some value, some real number,
and that real number is what you desire to be an estimate of 𝜃. So, remember once again, an
estimator is a function of the 𝑛 iid random variables 𝑋1 through 𝑋𝑛. Is that okay?

So, the estimator is a function: you give it the samples, it will put out some real number. The
input is all the samples, the output is a real number, and that number is considered the
estimate. I have put random variables here; remember, 𝑋1, 𝑋2, …, 𝑋𝑛 are
random variables, but in any particular instance you will get one set of sample values, some
numbers, and when you put them in you will get a number. But remember, 𝑋1 to 𝑋𝑛 are random
variables.

So, one needs to know the clear difference between the parameter and the estimator. 𝜃 is a
constant parameter; it is not a random variable, it is some unknown number, that is all. It is a
variable, but not random in some sense. Now, 𝜃̂ is a function of 𝑛 random variables and
therefore it is a random variable. So, it will have a distribution; it will have a PMF or it will
have a PDF.

So, in one sampling, 𝜃̂ will give you one value; in another sampling it will give you some
other value, and in another sampling some other value again, because the value of 𝜃̂ depends
on the actual realization of the samples. Since 𝑋1 through 𝑋𝑛 are in general random variables,
𝜃̂(𝑋1, …, 𝑋𝑛) will also be a random variable. Depending on the actual samples, it will take
particular values with different probabilities, which means it has a distribution of its own.

Even though it has a distribution, we are expecting that this 𝜃̂ will take values around 𝜃. 𝜃 is
some fixed value; 𝜃̂ is a random variable with a distribution, a PDF or PMF.
And you are imagining that if 𝜃 is here, then 𝜃̂ will always be somewhere around 𝜃; you
cannot have 𝜃̂ being some random thing independent of 𝜃. So, you are hoping to design
a 𝜃̂, design a function, so that its distribution is concentrated around 𝜃, or gives you values
around 𝜃 in some predictable way, so that you can give some guarantees on that.

So, that is how we will work; have that picture in mind. You have samples according to
some distribution. The distribution has a parameter 𝜃, which is some number, and you want to
find a function of the samples, which will be a random variable with a distribution, a PDF
or PMF. You are hoping that PDF or PMF gives you values mostly around the actual 𝜃,
which is the unknown parameter.

So, that is how estimation works. This is the only thing we can do: we do not know 𝜃 ahead of
time, so you cannot exactly predict what 𝜃 will be. Estimation will only put out a random
variable; the estimator is a random variable, it will have a certain distribution, and we can only
give those kinds of guarantees. So, hopefully this is clear. You can go back to the previous
examples I have put out; all of them follow this same sort of picture.

So, the question is how you come up with these estimators: how to characterize good
estimators, and how to design good estimators. We will see that slowly in the rest of the
lectures.
(Refer Slide Time: 14:55)

So, here is an example. Like I said, I will keep going back to these Bernoulli(𝑝) trials over and
over again; it is a very simple example, but still very useful for getting a lot of ideas across. 𝑋1
through 𝑋𝑛 are iid Bernoulli(𝑝); the parameter is 𝑝 now. I told you that in the general
case, when it is unknown, I will call it 𝜃; in a specific case like Bernoulli, the parameter is 𝑝,
and there is no need to call it 𝜃 or anything.

Here are 3 different estimators. You may already have a background, or you may already think
one is a good estimator and one is not, etcetera. But let us just see the three
estimators. There are 3 estimators, and all three are valid estimators. Remember, an estimator is
some function from the samples to the reals. As long as it gives you that, it is a valid estimator.
Whether it is a good estimator or not is a different question; we will come to that later. But
these are all valid estimators.

Look at the first one, 𝑝̂1. It just says 1/2: I do not care what the samples are, I will not
even look at the samples, I will simply say 𝑝 is equal to 1/2 always. That does not sound like a good
estimator; a good estimator presumably will use the samples. So, this one does not look like
a good estimator, but like I said, whether it is good or bad, it is a valid estimator. An estimator
is just a function; I defined it as a function from the samples to the numbers. If you choose to
ignore the samples, it is still a valid function in some sense, that is not wrong.

Estimator 2 sounds a little bit better; it is at least using the first two samples, but it is only
using those two samples. So, immediately you should think: there are 𝑛 samples given to you, and you
should probably use all the 𝑛 samples because 𝑝 affects every sample. According to 𝑝, every
sample is being distributed; that is how every sample is being chosen. So, you probably have to
use all the samples, and if you are using only two, maybe you do not get that good an estimator.
But we are expecting 𝑝̂2 to be slightly better than 𝑝̂1 because it is at least using the first two
samples: 𝑝̂2 = (𝑋1 + 𝑋2)/2.

Now, the third estimator looks a little smarter, a little more interesting, a little more
meaningful; maybe you can read multiple meanings into it, but nevertheless, let us just say what
it is: 𝑝̂3 = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛)/𝑛. There is good merit in this 𝑝̂3; it is a very smart estimator, and we will come
back and see why that is a smart estimator later on. But anyway, these three are clearly valid
estimators.
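The three estimators above can be compared by simulation. This is only a sketch, assuming a true 𝑝 of 0.3 and 𝑛 = 100 (both arbitrary choices): each estimator is applied to many fresh sets of samples, and we look at the mean and spread of the values it produces.

```python
import random

random.seed(3)

def trial(p, n):
    """One fresh set of n iid Bernoulli(p) samples."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def p_hat1(samples):
    return 0.5  # ignores the samples entirely

def p_hat2(samples):
    return (samples[0] + samples[1]) / 2  # uses only the first two samples

def p_hat3(samples):
    return sum(samples) / len(samples)  # uses all n samples

p, n, reps = 0.3, 100, 2000
for est in (p_hat1, p_hat2, p_hat3):
    values = [est(trial(p, n)) for _ in range(reps)]
    mean = sum(values) / reps
    spread = (sum((v - mean) ** 2 for v in values) / reps) ** 0.5
    print(est.__name__, round(mean, 3), round(spread, 3))
```

In such a run, 𝑝̂1 always says 0.5 regardless of the data, 𝑝̂2 scatters widely around 𝑝, and 𝑝̂3 concentrates tightly around the true 𝑝, a first hint of why it is the smart one.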

What do I mean by valid estimators? What is an estimator? A function from the samples to the
real line. I can have any function; I am just showing you 3 functions. It turns out you can have
any function; an infinite number of estimators are possible. I can say 2𝑋1 − 𝑋2, divided by
nothing, just 2𝑋1 − 𝑋2, why not, it is a valid estimator. Anything you can do: 𝑋1 + 2𝑋2 + 3𝑋3
divided by, I do not know, 6, who cares.

So, any function that you come up with will be a valid estimator. Validity does not mean
that it is good. To decide what a good estimator is, we have to think about how to characterize
it; we do not even know yet what a good metric is. We know at a high level that whatever
estimator I come up with is going to have a distribution, and I have to characterize whether
or not the distribution is concentrated around the real 𝜃.

That is my goal, but how do we write it down clearly, precisely, mathematically? How do we
develop good metrics for how good an estimator is, and how do we go about designing
estimators which give good values for those metrics? That will be the topic of the
next few lectures. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Parameter estimation: Errors in estimation

(Refer Slide Time: 0:13)

Hello, and welcome to this lecture. In the previous lecture we saw a brief introduction to the
parameter estimation problem. What is the parameter estimation problem? You have iid samples
from an underlying distribution, and that distribution is characterised by a parameter. What does
that mean? For instance, we saw Bernoulli(p) as a possible distribution. So, we are not really
looking at one distribution; we are looking at a class of distributions, the Bernoulli class, with
an unknown parameter p.

So, as you keep changing the parameter, the distribution itself changes within a certain class. In
that kind of situation, how do you find out, or estimate, the value of the parameter from the samples?
That is the parameter estimation problem. We already saw that no one can exactly recover
the parameter just from the samples; and if you already knew the parameter ahead of time, why
would you want to estimate it? You would not.

So, you want to estimate it, and the estimator is going to be a function of the samples. Samples are
random variables, and a function of random variables is a random variable; so your estimator is a
random variable, and it will have a distribution. Meanwhile, what you are trying to estimate is a
constant value; it has no distribution. You are hoping a good estimator will have very little error
and will always predict values close to the actual value 𝜃. So, it is important to study the error
in the estimation process.

So, what is the error, how do you quantify it, and what can we say about it? Can we control
the error, can we bring it down, can we make the probability of high values of error very low?
Things like that are what we will study in this lecture.

(Refer Slide Time: 2:00)

So, let us look closely at this notion of estimation error. You have a parameter 𝜃 and a
random variable which is an estimator, 𝜃̂(𝑋1, …, 𝑋𝑛). We will define the error as the difference
between what the estimator puts out and the parameter: 𝜃̂(𝑋1, …, 𝑋𝑛) − 𝜃. Remember, this is
also a random variable. 𝜃̂ itself is a random variable which we hope is around 𝜃, so we are hoping
the error 𝜃̂(𝑋1, …, 𝑋𝑛) − 𝜃 will be close to 0, is it not? We do not want very large values for
the error.

But nevertheless, it is a random variable with a distribution; it will have a certain PMF or PDF. We
are hoping it is around 0 and does not take high values. So, here is a paraphrase of exactly what
we were expecting: a good estimator will take values close to the actual 𝜃, in which case the error
should be very, very close to 0.
One way of writing this down precisely and mathematically (remember, the error is not just a number,
it is a random variable) is to ask that the probability of the absolute value of the error being
greater than 𝛿 should be very small. So, one part of it seems okay: the probability that the absolute
value of the error is too big should be small. But how do you pick the 𝛿? In an actual problem, what is
a meaningful choice for 𝛿? Those are things we have to think about; we will come to it, but
nevertheless, this seems like a reasonable way to approach the problem.

So, you have an estimator, which is actually a random variable, and you want to characterise its
error. So, you look at the distribution of the error and you want the distribution to be such that
the probability with which it takes very large values is very small. And note that we take absolute
values; that is important too. So, this seems reasonable; let us keep going ahead and see how to think of it.
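This idea can be checked by simulation. Here is a sketch, assuming the sample-mean estimator for Bernoulli(𝑝) and the arbitrary choices 𝑝 = 0.4 and 𝛿 = 0.05: we estimate P(|error| > 𝛿) by repeating the whole experiment many times, for growing 𝑛.

```python
import random

random.seed(4)

def p_hat(p, n):
    """Sample-mean estimate from n fresh Bernoulli(p) samples."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

# Estimate P(|p_hat - p| > delta) by repeated simulation, for growing n.
p, delta, reps = 0.4, 0.05, 2000
results = {}
for n in (20, 100, 500):
    big_errors = sum(abs(p_hat(p, n) - p) > delta for _ in range(reps))
    results[n] = big_errors / reps
    print(n, results[n])
```

As 𝑛 grows, the estimated probability of a large error drops sharply; this is the kind of guarantee we will try to make precise.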

One immediate thought that comes to mind is how to characterise δ: how big should δ be? One
way to answer that is to look at some examples. The first thing to realise is that the parameter 𝜃
will have a certain range, and we will know that range. For instance, take Bernoulli(p): I know that
the value of p is between 0 and 1. So, whatever the p may be, my estimation error must be small
relative to that p.

For instance, you might think an error of 0.1 is small, or maybe 0.01 is small; perhaps my error
should not be greater than 0.01, you may say. So maybe for Bernoulli(p) I may take δ to be 0.01.
But imagine what if your p value is 10⁻⁵? If the actual p that you are likely to see is 10⁻⁵, then
your 0.01 is huge compared to that.

So, it is good to have the magnitude of the error characterised in terms of the parameter you are
estimating. If you are estimating p, you can say maybe I should have only 10% error around p.
So, for instance, the last thing I have put down here may be a good thing to say: the absolute
value of the error should be less than or equal to p/10. That keeps the actual error within 10%.

So, these kinds of ideas you can use to characterise this δ. How big an error can you tolerate? If
you are trying to estimate a p, which is an unknown quantity by the way, then the error should not
be more than 10% of that p. That is a reasonable way to put it. If, on the other hand, you are
working with a normal distribution, the mean of the normal distribution can be anything; there is
no restriction on that. So, how do you think of the error there?

You have to think of the error as a fraction of what you are estimating. If, as a fraction of what you
are estimating, it is only a 10% error or so, then you can be sure that the error is small. If the
mean you are estimating is 1000 and you estimate it as 1001, that is OK, not too bad. If the
mean you are estimating is 0.1 and you estimate it as 1.1, that is a huge error. So, the error is
usually relative to what you are measuring, and that is something to be careful about.

So, usually people keep track of this. You can think of a fixed δ, and generally a small δ is good,
but keep these kinds of things in mind: you really want to characterise the error as a function of
the parameter also. So, this seems reasonable. This kind of condition, that the probability of the
absolute value of the error being greater than δ should be very small, seems like a good way to
characterise error in estimation, and we will take that approach and see where it takes us.
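As a quick illustration of this criterion (a sketch of my own, not from the lecture; the function name and parameter values are assumptions), you can estimate such a tail probability by simulation, here for the sample-mean estimator of Bernoulli(p):

```python
import random

def estimate_tail_prob(p, n, delta, trials=5000, seed=0):
    """Monte Carlo estimate of P(|p_hat - p| > delta), where p_hat is
    the sample mean of n iid Bernoulli(p) samples."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        p_hat = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(p_hat - p) > delta:
            exceed += 1
    return exceed / trials

# With more samples per estimate, large errors become rarer.
print(estimate_tail_prob(0.3, 10, 0.1))    # fairly large
print(estimate_tail_prob(0.3, 1000, 0.1))  # essentially 0
```

The same function can be reused for any estimator by swapping out the line that computes `p_hat`.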

(Refer Slide Time: 7:06)

Let us come back to our example. Here is our favourite example: 𝑋1, …, 𝑋𝑛 iid Bernoulli(p), and I
have 3 estimators, not just 1, for p. Let me start thinking about the errors for these estimators.
What are the errors for these estimators, what are their distributions, what is the probability that
the absolute value of the error becomes greater than some δ; can I characterise it for each of these
estimators? It turns out it is possible.
Before that, let us just get our feet wet a little bit and look at a few samplings from this
distribution and see how our three estimators are doing. So, here are 3 different samples, let us say,
from the same value of p. The first n samples are 1, 0, 0, 1, 0, 1, 1, 1, 0, 0. Your 𝑝̂1 is always
0.5, nothing changes; whatever may be the sampling, 𝑝̂1 is 0.5. What is 𝑝̂2? It is also 0.5. What is
𝑝̂3? It is also 0.5. For that particular sampling, that is what happened.

Here is another sampling: 1, 0, 0, 1, 0, 1, 0, 1, 0, 0. Here again 𝑝̂1 is 0.5, 𝑝̂2 is 0.5, and 𝑝̂3 is 0.4. So,
you see it varies. Why is 𝑝̂2 0.5? It is just (1+0)/2: the first two samples added and divided by 2.
Here are another 10 samples from a third round of sampling: 1, 1, 0, 0, 0, 1, 0, 1,
0, 1. You see 𝑝̂1 is always 0.5, I am not going to change anything; 𝑝̂2 becomes 1 now
and 𝑝̂3 remains 0.5.

These are just three examples; in actual samplings it might change a little bit here and there,
but just from these examples we can say immediately that 𝑝̂1 will not work for all values of p.
Why? If p were 1, for instance, you will just get 1, 1, 1, 1, 1, 1, and even when you see the
samples as 1, 1, 1, 1, 1, 1 you are going to say 𝑝̂1 is 0.5. So, that is going to give you an
error. Only if p is very, very close to 0.5 will it work. If p is away from 0.5, the
first estimator will not work. It is obvious.

But what is the problem with 𝑝̂2? Notice how flaky 𝑝̂2 is. Sometimes it says 0.5, sometimes it
says 1, it can even say 0. If you take one more sampling where the first two samples are 0, which
can happen, 𝑝̂2 will become 0. So, the variation in this estimator is very, very large:
it is going from 0 to 1 to 0.5, jumping all over the place, not staying steady.

We complained about 𝑝̂1 that it always says 0.5 and does not move at all, but at least not moving
at all seems like a good quality when you compare it with 𝑝̂2, which is moving all over the
place. So, 𝑝̂1 is not able to adapt to different values of p, but at least it does not keep
jumping around. On the other hand, 𝑝̂2 maybe can adapt to different values of p, but across
different samplings it seems to give a wide variation. That is also not very nice.

But 𝑝̂3 seems very promising. It is sort of steady; maybe it reacts to the samples a little bit, but it
does not get too wayward and does not seem to be disturbed. So, this 𝑝̂3 really is a promising sort
of estimator. These ideas are all quite important. The estimator value
itself should not keep going all over the place; that is sort of natural. If your estimate over different
samplings goes all over the place, then maybe you do not trust the estimator too much. The
variation in the estimator value should not be too high.

At the same time, it should not be just stuck at one value. If the underlying parameter changes, the
estimator should adapt and change too. So, there is some tension or give and take already that
you can see, but this kind of analysis will help you. The first thing that a lot of people would do,
given an estimation problem and a bunch of estimators, is simulate them. This kind of simulation
you can do easily; you have seen enough Colab notebook experiments, so this kind of experiment is easy to do.

So, you generate these samples, test out the estimators and see how each is varying: is it varying too
much or holding steady, is it working out for different values of the unknown parameter? All of
that you can easily test using a Colab script, and I welcome you to write one yourself as well. So,
that is a look at estimators through this example.
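The kind of Colab experiment described above can be sketched as follows; this is a minimal version of my own, and the helper names are assumptions, not code from the course:

```python
import random

def sample_bernoulli(p, n, rng):
    """n iid Bernoulli(p) samples."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def p_hat_1(xs):   # ignores the samples entirely, always says 0.5
    return 0.5

def p_hat_2(xs):   # uses only the first two samples
    return (xs[0] + xs[1]) / 2

def p_hat_3(xs):   # sample mean, uses all n samples
    return sum(xs) / len(xs)

rng = random.Random(1)
p, n = 0.3, 10
for _ in range(3):   # three rounds of sampling, as on the slide
    xs = sample_bernoulli(p, n, rng)
    print(xs, p_hat_1(xs), p_hat_2(xs), p_hat_3(xs))
```

Running this a few times shows the same behaviour as the slide: 𝑝̂1 never moves, 𝑝̂2 jumps between 0, 0.5 and 1, and 𝑝̂3 stays in a narrow band around p.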

(Refer Slide Time: 11:47)


Let us continue this example. So, I want to measure the probability that the absolute value of the
error is greater than p/10; let us think about that. Look at 𝑝̂1. 𝑝̂1 is 1/2, which means the error is
1/2 − p, and the absolute value of the error is |1/2 − p|. So, that is the error, and here I have
plotted |1/2 − p|; it will look like this.

Hopefully, you can see this. On the same picture, I have also plotted p/10. So, you can see this is
p/10 and that is the absolute value of the error, and the absolute value of the error is, for most
values of p, above p/10, except between two values which you can calculate; I have calculated
them there: this value is 5/11 and this value is 5/9.

Only between 5/11 and 5/9, when the value of p is very close to 0.5, does the absolute value of the
error fall below p/10. For all other values, it is above. That is number 1. Number 2, the error is
actually constant; there is no randomness there. So, the probability that the error is greater than
p/10 is just equal to 1, probability 1, if p is either less than 5/11 or greater than 5/9. Any
value of p here or here is totally bad for you.

So, you see from this analysis that the absolute value of the error being greater than p/10 is true
for a large range of p, except for p around 0.5. So, the first estimator is not so good; you can say
it is not behaving the way we expected it to behave.
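Here is a quick numeric check of those crossing points; it uses nothing beyond the algebra just described:

```python
# Solve |1/2 - p| = p/10:
#   for p < 1/2:  1/2 - p = p/10  =>  11p/10 = 1/2  =>  p = 5/11
#   for p > 1/2:  p - 1/2 = p/10  =>   9p/10 = 1/2  =>  p = 5/9
lo, hi = 5 / 11, 5 / 9
for p in (lo, hi):
    assert abs(abs(0.5 - p) - p / 10) < 1e-12  # the two curves meet here

# Inside (5/11, 5/9) the error is below p/10; outside, it is above.
assert abs(0.5 - 0.5) <= 0.5 / 10
assert abs(0.5 - 0.3) > 0.3 / 10
assert abs(0.5 - 0.7) > 0.7 / 10
print(lo, hi)  # 0.4545..., 0.5555...
```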
Let us go to the second estimator. The second estimator was 𝑝̂2 = (𝑋1 + 𝑋2)/2. Now you can
write a table here: 𝑋1 is the first value, 𝑋2 is the second value, and the error itself is
(𝑋1 + 𝑋2)/2 − 𝑝.

You can look at the absolute value if you want, and then you can find the probability that the error
equals each value e: (1 − 𝑝)², etcetera. So, again you can calculate this; I am not going to go
through the details here. You can plot all of these against p/10 and compare when
they go below or above, etcetera.

So, you can see that for most values of p, namely p less than 5/11, or above 5/9 and less than 10/11,
this probability will be equal to 1. So, the probability of the error being greater than 10% is equal to 1
for a large range of p in the second estimator as well. So, you see: for the first estimator and the
second estimator, we had a bad feeling from the simulations, and it is confirmed in
the analysis also. They are quite bad in performance.
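The table mentioned above can be worked out exactly by enumerating the four outcomes of (𝑋1, 𝑋2). This small sketch (the function name is my own) computes the probability directly:

```python
def tail_prob_p_hat_2(p):
    """Exact P(|(X1 + X2)/2 - p| > p/10) for X1, X2 iid Bernoulli(p)."""
    prob = 0.0
    for x1 in (0, 1):
        for x2 in (0, 1):
            # probability of this (x1, x2) pair
            pr = (p if x1 else 1 - p) * (p if x2 else 1 - p)
            if abs((x1 + x2) / 2 - p) > p / 10:
                prob += pr
    return prob

# For p outside [5/11, 5/9] (and below 10/11), every possible estimate
# (0, 0.5 or 1) is off by more than 10%, so the probability is 1.
print(tail_prob_p_hat_2(0.3))   # 1 (up to float rounding)
print(tail_prob_p_hat_2(0.5))   # 0.5: only the estimate 0.5 is close enough
```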

So, look at this later; I know I did not go into great detail in explaining how this works. Think
about how this distribution came about, but intuitively it is clear: you are using only the first
two samples, so you are not really going to get error performance that is great.

(Refer Slide Time: 14:53)


Look at the third one. The third one is, like I said, a really, really good estimator; there are lots of
good reasons, and we will justify it more and more later on. But for now, let us look at our
calculation of the probability of the absolute value of the error being greater than p/10. Can I
control it for the third estimator? You will see a very surprising and powerful result with the
third estimator, which makes it very desirable with respect to this property.

So, notice 𝑝̂3 = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛)/𝑛, so the error is just (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛)/𝑛 − 𝑝. That is your
error. Now let us recall our wonderful, wonderful Chebyshev bound. Like I have been mentioning,
these bounds are not just idealised tools; they are very, very useful in actual problems. So, the
Chebyshev bound tells you the following. See, the error is a random variable, and I am going to use
the Chebyshev bound on the error random variable.

The probability that the absolute value of (error − expected value of error) is greater than δ is less
than or equal to the variance of the error divided by δ². Remember, the error is a random variable;
you use Chebyshev on that random variable and you get this. Usually we write
𝑃(|𝑋 − 𝐸(𝑋)| > δ) ≤ 𝑉𝑎𝑟(𝑋)/δ²; here 𝑋 is the error, that is all. The error is my random variable
and I have this calculation.

So, what is the expected value of the error? If you calculate it, remember it is all linear, so you use
linearity of expectation: it breaks into 𝐸(𝑋1), …, 𝐸(𝑋𝑛), and each of them is p. What is 𝐸(𝑋1)?
𝑋1 is Bernoulli(p), so 𝐸(𝑋1) is p, and so on; each of these expected values is p. So, you get
np/n − p, and this becomes 0. This is an exercise for you; check it, it is an easy
enough calculation. So, the expected value of the error is 0.

So, this turns out to be a good property. Why is it a good property? For any estimator, on average
it should give you 0 error. The error is a random variable, agreed; its
distribution is going to be around 0, and you want the average value of the error to be close to 0.
It cannot have an error which is always non-zero in some sense; there should not be any
non-zero residue in the error all the time. On average, it should sort of cancel out and give 0 error.

So, this seems like a desirable property, and this estimator has it. In fact, the other estimators also
have it; we will come back to that. The variance of the error is also something that you can
determine. We have determined this variance before: go through and calculate, remembering that
𝑋1, …, 𝑋𝑛 are all independent Bernoulli(p) random variables. You know how to
calculate it, and the variance will come out to be p(1 − p)/n.

Think about why this comes out; we have done this calculation multiple times before, which is why
I am sort of rushing through it. Go back and look at how we calculated the variance of
sums of iid samples and all that. From there you will see that for this Bernoulli distribution, the
variance comes out to be p(1 − p)/n. Once you have this, it is easy to use the above bound and
calculate the probability of the error being greater than p/10. Remember, the expected value of the
error became 0, and the variance is just p(1 − p)/n.

And instead of δ, I have put p/10; p/10 is what I wanted for δ. So, δ² is p²/100, some
cancellations will happen, and you will get this value for the probability that the absolute value of
the error is greater than p/10. This is true for any p: whatever your p may be, the probability that the
absolute value of the error is greater than p/10 can be upper bounded as 100(1 − p)/(np).
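You can set a simulation against the bound 100(1 − p)/(np); this is a sketch with parameter values of my own choosing, and the empirical tail fraction should always sit below the bound:

```python
import random

def empirical_tail(p, n, trials=2000, seed=0):
    """Fraction of samplings where |sample mean - p| > p/10."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < p else 0 for _ in range(n))
        if abs(s / n - p) > p / 10:
            bad += 1
    return bad / trials

p = 0.4
for n in (100, 1000):
    chebyshev = min(1.0, 100 * (1 - p) / (n * p))
    print(n, empirical_tail(p, n), chebyshev)
```

For n = 100 the bound is trivial (above 1), but for n = 1000 it is 0.15 while the observed fraction is far smaller: Chebyshev is valid but loose, which is exactly why a Chernoff-type bound can do better.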

So, what is the big upshot of this result? You may have learnt about limits and other things in
your Mathematics 2 course. What is the limit of this quantity as n becomes larger and larger? It
goes to 0. I can make it as small as I want it to be; whatever may be the values of the other
quantities, I can make this bound as small as I want just by increasing n.

Some of you who understood the Chebyshev bound, Chernoff bound, concentration results,
etcetera, will quickly see that if, instead of using Chebyshev, I were to use Chernoff or something
else which is stronger, I would get an exponential fall with n, not just 1/n. From your study of
limits and the behaviour of functions, you may know that 1/n is a sort of slow fall while 𝑒^(−cn)
is a very fast fall. So, an exponential decrease in n is also possible.

Now, there is a lot of meaning in this slide, so I want to take a little more time and think about
it very carefully. Here is an estimator which I have been arguing is a good estimator, for
multiple reasons. It is using all the samples and, more importantly, its
performance is improving with the number of samples n. I want to repeat that once again; it is
a crucial, crucial idea that you have to grasp.

So, for any estimator that you come up with, you are expecting that it will use the samples. When
you have more samples, the performance of the estimator should be better, should it not? If I give you
10 samples or 20 samples, the estimate that you get with 20 samples should be better than what
you get with 10 samples. If I give you 200 samples, it should be even better, and with 2000
samples, better still.

As you get more and more information about the underlying distribution through the independent
identical samples, your estimation process should use all the samples and put out an estimate
whose accuracy keeps on improving with n. As you get more samples, you should get more
accuracy; or, the other way round, as you get more and more samples, your error should keep
falling. That is what you expect.

But if you look at 𝑝̂1 and 𝑝̂2, n never showed up anywhere; the probability that the error became
greater than p/10 was just 1 in some cases. There is no n there. Here, we are able to make n appear
in these probabilities, and not only appear, it appears in such a way that for any value of the
parameter, this kind of probability goes to 0 with increasing n. So, this gives us a very good feeling
about this estimator, does it not?

Whatever may be the value of p, I can just keep taking more and more samples. Once I take
enough samples, I can be very confident that the absolute value of my error is not too large. You
can put in some values here for p, get a good bound, and be happy about what kind of values
you can expect for the error. So, this is a good property, and it is something very, very important.
You want to build estimators which use all the samples; all right, that is good, but they should also
result in performance which improves with n in some reasonable fashion. That is the important
story here. And notice the power of the Chebyshev bound; it is so useful, is it not, in giving you
this nice behaviour.

(Refer Slide Time: 22:24)

So, here are a few summary observations. Various estimators are possible; of valid estimators, there
are infinitely many. Every estimator will have an error, and that error will be random; it will have
a certain distribution. You are hoping that the error has a distribution around 0: maybe its expected
value is 0, or at least very small, and maybe its variance is small, so the distribution is not
very widespread. With more samples, you are expecting that the distribution will get tighter and
tighter, so that the probability of the error becoming large is negligible, etcetera.

So, really, we saw that the probability of the absolute value of the error being greater than δ is a
very useful way of characterizing an estimator. In a good design, these probabilities will fall with
n; how fast they fall with n depends on how good your estimator is. Even among the good
estimators, you can do different designs to make these probabilities fall faster and faster with n;
things like that are interesting topics for investigation. So, this is the first observation.

The Chebyshev bound is a very useful tool. You can also use the other concentration results; we
will come back and look at those kinds of tools if at all we need them, but even Chebyshev is good
enough. Notice this expression: if you are looking at the absolute value of the error not being too
large, then it seems a very good idea to make the expected value of the error equal to 0. So, it
looks like good design principles can be derived just from these kinds of equations.

The first one: you can say immediately that you want the expected value of the error to be close
to 0. If it is close to 0, then the probability of the absolute value of the error being greater than δ
can be bounded very well. Not only that, the variance of the error, which shows up on the
right-hand side, I want to tend to 0 as well with n. So, I want to design an estimator with an eye
on the expected error, the variance of the error and things like that. You want the expected value
of the error to go to 0, and the variance of the error should hopefully also fall with n.

So, hopefully in this lecture I convinced you that estimators will give you errors, the errors will
have a distribution, and with a larger and larger number of samples you can control the magnitude
of the error in that distribution. The probability with which the error becomes very large can be
controlled through tools like Chebyshev and all that.

So, this gives you good design principles: you want to focus on the expected error, on the variance
of the error, on the expected value of the squared magnitude of the error, and things like that.
Those are the things we will look at in the next lecture, and subsequently we will see good ways of
designing estimators. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture No. 8.4
Parameter estimation - Bias, Variance and Risk of an estimator
(Refer Slide Time: 0:13)

Hello and welcome to this lecture. We have been looking at estimating parameters given iid
samples of a distribution. The distribution can be described using some parameters: if you take
Bernoulli(p), it could be the parameter p; or it could be a gamma or a normal distribution. Any
distribution has parameters associated with it. If you imagine that your samples are known to come
from a certain class of distributions, but you do not know the specific parameters, how do you go
about estimating the parameters of that distribution from the samples?
So, that is the problem we are looking at. This problem is also called the point estimation problem;
some people refer to point estimators and things like that, so in case somebody uses that
terminology, do not be surprised by it. This is called point estimation. We looked at how to
measure the error in estimators, and then we came up with this notion that, since the error itself is
a random variable, you are concerned with the probability that the error takes large values: the
probability that the absolute value of the error is greater than something.
And to control that using, say, Chebyshev's inequality, we were able to see that things like the
expected value of the error and the variance of the error matter. So, this is going to be a largely
equation-oriented lecture. We will define three aspects of a point estimator which are very, very
crucial and often repeated in analysis: the bias, variance and risk of an estimator. These are all
related to expected values of various quantities, and you will see they have very simple, intuitive
definitions and nice relationships between them, which we can look at.
Of course, the moderate skill that you need is: given a description of the estimator, how do you
compute its bias, its variance, its risk? That involves a little bit of manipulation, and I will focus a
little bit on that skill as well. So, let us get started.
(Refer Slide Time: 2:17)

So, let us do a quick recap of what mean and variance are. Take a random variable X taking values
in some set, with a PMF for this random variable; I am assuming the random variable is discrete.
If it is continuous, you will have a PDF; the equations are largely the same, but I am describing
the discrete case for convenience. The mean or expected value of X is a summation, over every
value x that the random variable takes, of x times the PMF. We know this definition; it is called
the average value, and we will also denote it by 𝜇 in many cases.
There is something called the second moment, denoted 𝐸[𝑋²]; it is simply again a summation,
over every value x that the random variable takes, of 𝑥² times the PMF. Now remember, if it were
a continuous random variable with PDF 𝑓𝑋(𝑥), then what happens? The summation becomes an
integral: 𝐸[𝑋] = ∫ 𝑥 𝑓𝑋(𝑥) 𝑑𝑥 and 𝐸[𝑋²] = ∫ 𝑥² 𝑓𝑋(𝑥) 𝑑𝑥.
So, in these two, the summation gets replaced by integration in the continuous case with the
PDF; I am simply writing the discrete case, and you can imagine what the continuous case will be.
A very important quantity is the variance, which measures the spread of the random variable about
the mean: 𝑉𝑎𝑟(𝑋) = 𝐸[(𝑋 − 𝜇)²]. You can also write it as an integral or a summation if you like.
And there is a very interesting relationship between the first moment (the mean), the second
moment and the variance.
So, what is that relationship? It is given by 𝑉𝑎𝑟(𝑋) = 𝐸[𝑋²] − 𝜇², and you can rearrange this:
the second moment is the variance plus the square of the mean. That is a good formula to
remember in your head. If you do not like equations and want it in words: the second moment is
equal to the variance plus the square of the mean. Remember, the mean could be positive or
negative, but both the second moment and the variance are going to be non-negative quantities.
So, this is an important result to know.
So, we know the mean captures, sort of, the central value of the distribution, and the variance
captures how much you expect the values of the random variable to spread about the mean. That
is good intuition to remember: if a random variable has low variance, then most of its values are
around the mean. So, that is one slide; you already know this, but we are doing a quick recap so
that when I make comments on this in the rest of the lecture, you will remember what I mean.
So, this is the mean, variance and second moment of a random variable.
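This recap is easy to verify numerically. The sketch below (representing a PMF as a dict from value to probability is my own choice) computes the mean, second moment and variance of a discrete PMF and checks the identity Var(X) = E[X²] − μ²:

```python
def mean(pmf):
    """E[X] for a discrete PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def second_moment(pmf):
    """E[X^2]."""
    return sum(x * x * p for x, p in pmf.items())

def variance(pmf):
    """E[(X - mu)^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Example: a fair die, faces 1..6 each with probability 1/6.
die = {x: 1 / 6 for x in range(1, 7)}
mu = mean(die)
# Check: second moment = variance + mean squared.
assert abs(second_moment(die) - (variance(die) + mu ** 2)) < 1e-12
print(mu, variance(die))  # 3.5 and 35/12
```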
(Refer Slide Time: 5:42)

So, we are now ready to define the bias of an estimator. Once again, you have n iid samples
according to a distribution X, which is described by some unknown parameter 𝜃. An estimator for
𝜃, we know already, is a function from the samples to the real line, and we have been denoting it
by 𝜃̂: 𝜃̂(𝑋1, …, 𝑋𝑛), or simply 𝜃̂ in short. So, you have a parameter, a distribution in terms of the
parameter, n iid samples, and an estimator.
The bias of the estimator 𝜃̂ for a parameter 𝜃 is simply 𝐸[𝜃̂ − 𝜃]. We will denote it 𝐵𝑖𝑎𝑠(𝜃̂, 𝜃), or
maybe you can just write 𝐵(𝜃̂, 𝜃); it is clear enough what the bias is. It is the difference between
the expected value of the estimator and 𝜃. Notice 𝜃 is a constant, so you can push the expectation
inside the linear term: you have 𝐸[𝜃̂] − 𝐸[𝜃], and since 𝜃 is a constant, you simply get 𝐸[𝜃̂] − 𝜃.
This second form gives you some clarity: it is the difference between the average value of the
estimator (remember, the estimator itself is a random variable; if you do different samplings, you
will get different values for it) and 𝜃, which remains constant.
𝜃 is an unknown constant parameter, and every time you do a sampling, you may get a different
value for 𝜃̂. You of course expect 𝜃̂ to be close to 𝜃 most of the time, but as the samples vary, 𝜃̂
will vary; it is a random quantity. So, when you compute 𝐸[𝜃̂ − 𝜃], what are you saying? You are
looking at the difference, on average, between the value of 𝜃̂ as it varies over samplings and the
value you are expecting 𝜃̂ to measure.
So, the bias is the expected value of the error; that is the other thing to keep in mind. 𝜃̂ − 𝜃 was
the error, the error is something whose distribution we want to control, and the bias is the
expected value of the error. Then there is the definition of an unbiased estimator. When do you
say an estimator is unbiased? If its bias equals 0: if, on average, the expected value of the
estimator is exactly equal to the unknown parameter 𝜃, then you say the estimator is unbiased.
So, these are all new terminologies, and this lecture is going to be full of this kind of terminology:
quite intuitive and simple, but sometimes the equations are a little unsettling, so take your time,
look at the equations closely, and you will see what I mean. The bias is the expected value of the
error; you can also think of it as the expected value of the estimator minus the actual parameter.
And when the bias is 0, we say the estimator is unbiased.
Clearly, from our discussion in the previous lecture, you can see that it is good to have an unbiased
estimator, or at least the bias should be very, very small; it should go to 0 with more and more
samples at least. Estimators should have small bias: if, on average, you are not going to approach
the value you are trying to estimate, then maybe you are not doing something very nice. So, this
is a good property of an estimator to keep track of. This is important: the bias of an estimator.
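The bias can be approximated by averaging 𝜃̂ − 𝜃 over many simulated samplings. Here is a sketch (the names and parameter values are my own), contrasting the sample-mean estimator with the constant estimator from the earlier Bernoulli example:

```python
import random

def empirical_bias(estimator, p, n, trials=20000, seed=0):
    """Approximate E[theta_hat - theta] by averaging over samplings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        total += estimator(xs) - p
    return total / trials

sample_mean = lambda xs: sum(xs) / len(xs)
constant_half = lambda xs: 0.5   # the "always 0.5" estimator

p = 0.2
print(empirical_bias(sample_mean, p, 10))    # close to 0: unbiased
print(empirical_bias(constant_half, p, 10))  # close to 0.5 - 0.2 = 0.3
```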
(Refer Slide Time: 9:26)

The next definition is the risk of an estimator, which is also quite important. In particular, we will
be talking about what is called the squared error risk; it turns out there are other types of risks for
an estimator, but the squared error risk is the most standard and, in some sense, the easiest to
understand. Once again the setting is the same: n iid samples from some distribution,
parametrized by some unknown 𝜃, and our goal is to come up with an estimator, but then also to
characterize that estimator. What are its different features? We saw already that bias is one very
useful feature.
Now we are talking about risk. The squared error risk of the estimator 𝜃̂ is simply 𝐸[(𝜃̂ − 𝜃)²].
Now, the bias can be positive or negative: 𝜃̂ can differ from 𝜃 on the positive side or on the
negative side, and even the expected value of 𝜃̂ may fall to the right or to the left of 𝜃.
So sometimes, positive versus negative is a bit misleading, or perhaps does not capture the entire
picture; you may be interested more in the amount by which you differ from 𝜃. And that is where
the risk is a very interesting quantity: whether you deviate on the left side or the right side, the
penalty is the same, so it does not cancel out. When you look at 𝐸[𝜃̂ − 𝜃], 𝜃̂ may be way off to
the right of 𝜃 some of the time and way off to the left of 𝜃 some of the time, and on average this
will cancel out and you will get 0.
So, even an unbiased estimator may be a bad estimator. A better way to measure it is with
something like the risk, which puts a penalty of (𝜃̂ − 𝜃)². Whether you go to the right of 𝜃 or to
the left of 𝜃, once you square it, you get a large positive quantity, and you take the expected value
of that. So, if the risk is small, you know for sure that your 𝜃̂ is most likely to be around 𝜃.
So, this is the idea behind the risk. Can you see why risk is a very, very useful quantity to measure?
Being around 𝜃 on average is important, but you can be far away from 𝜃 on either side equally
often; that can cancel out and you can still get a low bias. To cover that, one option is the risk.
Risk is clearly a measure of importance here: if the risk is small, you can be sure that 𝜃̂ will
definitely take values close to 𝜃. It cannot be far away too often, because you are squaring, so one
side and the other will not cancel each other out. Risk is a very useful quantity as well.
Now, the error is (𝜃̂ − 𝜃). First, let me talk about the second sentence: the squared error risk is
the second moment of the error. The error is (𝜃̂ − 𝜃); I am squaring the error and taking the
expected value, so the second moment of the error is the squared error risk. That is easy. Another
useful terminology is mean squared error. Why is this? Because you are squaring the error and
then taking its mean, the expected value.
So, the squared error risk is also called the MSE, the mean squared error, in many situations. The
same quantity has many different names. Why? Because it is very, very useful; it is a central
quantity and really captures the essence of how well an estimator is working. That is why it is
used in so many different contexts by so many different people: they all call it different names,
but they mean the same thing.
So, that is risk for you: 𝑅𝑖𝑠𝑘(𝜃̂, 𝜃), the squared error risk, a very, very useful quantity. We have
already seen two very interesting quantities for measuring how good an estimator is. One is the
bias; it seems good, but then you can have large deviations in either direction cancelling out on
average. So, you look at the squared error, and that is going to give you a true picture of whether
or not things are working well.
So, risk is a good quantity, and we will see how to compute these things in various examples. Like I said, a very important skill is: given a description of the estimator, how do you compute its risk? How do you compute its bias? Using the various distributions and results that we know from probability, you should be able to compute them; that is an important skill to have. We will come to it soon enough. But let us do one more definition, something that we know already; we will look at the connections between these three, and then we will proceed to the calculations.
(Refer Slide Time: 14:01)

The third definition is the variance of an estimator. An estimator is a random variable, and any random variable has a variance, which is simply the expected value of the random variable minus its mean, whole squared. Now, what does variance by itself capture? If your estimator has very large variance, what does that mean? It means that if you give it different sets of samples, the value it estimates keeps varying a lot. That is what variance means, and that is not a desirable property for an estimator.
So, if you take one set of samples, and then another set of samples, and then yet another, you want the estimator to give values that are close by. Whether 𝜃̂ is close to 𝜃 or not is a matter of the error; but whether or not you are making an error, you do not want 𝜃̂ itself to vary too much. If 𝜃̂ is varying all over the place, there is no way you can control the error; whatever 𝜃 is, there will be an error. So, the variance of the estimator is also a very important parameter for estimation. You want your estimator to have low variance: as the samples vary, you want it to keep predicting the same thing, as long as the distribution is the same. And there is a small difference here that I want you to look at; it is a subtle difference, very simple, but still pay attention to this detail.
We are defining the variance of the estimator, Var(𝜃̂) = 𝐸[(𝜃̂ − 𝐸[𝜃̂])²]. What about the variance of the error?
So, if you look at Chebyshev’s and all that, variance of the error is important. We looked at first
moment of the error, that is where bias came from, we looked at second moment of the error, that
is the risk. What about variance of the error? Why am I saying suddenly variance of the estimator?
Why not variance of the error?
So, it turns out these two are one and the same. So, error is simply a translated version of the
estimator. So, 𝜃̂ is your estimator, error is 𝜃̂ − 𝜃, I am just translating it by a constant 𝜃, so when
your 𝜃 is a constant, 𝜃̂ − 𝜃 or 𝜃̂ , both of them will have the same variance, because one is simply
a translated version of the other. So, variance of error, Var(𝜃̂ ), both are the same. You do not have
to bother with that subtle difference between the two.
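To see this numerically, here is a minimal sketch (not from the lecture): it simulates the sample-mean estimator for a Bernoulli(𝑝) with made-up values p = 0.3 and n = 20, and checks that Var(𝜃̂) and Var(𝜃̂ − 𝜃) agree.

```python
import random

# Sketch: empirically check Var(theta_hat) == Var(theta_hat - theta).
# theta here is a hypothetical Bernoulli parameter p = 0.3, and theta_hat
# is the sample mean of n = 20 samples; both choices are illustrative.
random.seed(0)
p, n, trials = 0.3, 20, 20000

def sample_mean():
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

est = [sample_mean() for _ in range(trials)]
err = [e - p for e in est]          # error = theta_hat - theta

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Translating by the constant p does not change the variance.
print(round(var(est), 6), round(var(err), 6))
```

Since the error is just the estimator shifted by the constant p, the two printed variances are identical up to floating-point rounding, and both are close to p(1 − p)/n.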
Once again, let me summarize: there are three interesting quantities here. Bias, which is the expected value of the error; risk, which is the second moment of the error; and finally, the variance of the estimator, which is also the variance of the error. You want all of these to be small.
(Refer Slide Time: 16:44)

But there is a very interesting relationship between the three of them, which is famously connected to the bias-variance tradeoff and all these kinds of things. So far, we have not seen where the tradeoff is; forget about the tradeoff for now, but at least there is the bias-variance-risk relationship, which is a very important and easy relationship: risk equals bias squared plus variance.

So, this is the important relationship; you can expand it out if you like, and you will see the expansion. The proof is actually very simple. Risk is the expected value of the error squared, that is, the second moment of the error. And we know already that the second moment of any random variable is the square of its mean plus its variance; the mean of the error is the bias, and the variance of the error and the variance of the estimator are the same. So, you just get this relationship: 𝑅𝑖𝑠𝑘(𝜃̂, 𝜃) = 𝐵𝑖𝑎𝑠(𝜃̂, 𝜃)² + Var(𝜃̂). It is a very simple and easy relationship.
So, this equation is nice and intuitive, because it shows, ultimately you are interested in the mean
squared error of your estimator. If you want to keep the mean squared error small, you have to
keep the bias small, remember bias square is positive, variance is positive. So, the mean squared
error is the sum of two positive quantities, bias square plus variance.
So, if I want to keep the mean squared error small, then my bias has to be small and my variance has to be small. And there could be a tradeoff between these two: I may be able to decrease the bias, but in doing so I might increase the variance, so maybe I want to balance the two out so that I get the minimum. Quite often that tradeoff happens; maybe we will illustrate it with one example, we will see.
So, this is a very nice and interesting relationship; quite often, when you design an estimator, it gives you very useful intuitive ideas. For more complicated settings, when you do not know how to think of an estimator, the first thing you try is to reduce bias. At least, I have used this when I have had to find estimators: you first try to reduce bias. If you reduce bias, it looks logical that the risk may go down. But it turns out that sometimes the things you do to reduce bias can increase variance. If you just keep decreasing bias, your variance may blow up; if you just keep decreasing variance, your bias may blow up. So, you have to balance between these two. In many cases this tradeoff is very useful, but the relationship itself is very nice: you want to keep the bias low and the variance low as much as possible. A useful relationship. We have seen a whole bunch of relationships.
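As a quick numerical check of risk = bias² + variance, here is a small sketch of my own (not from the lecture). The estimator 𝑝̂ = (𝑋1 + ⋯ + 𝑋𝑛 + 1)/(𝑛 + 2) is a made-up choice, used only because it has a nonzero bias.

```python
import random

# Sketch: verify Risk = Bias^2 + Variance by simulation for a deliberately
# biased estimator p_hat = (sum + 1) / (n + 2), a hypothetical choice.
random.seed(1)
p, n, trials = 0.7, 10, 50000

ests = []
for _ in range(trials):
    s = sum(1 if random.random() < p else 0 for _ in range(n))
    ests.append((s + 1) / (n + 2))

mean = sum(ests) / trials
bias = mean - p
variance = sum((e - mean) ** 2 for e in ests) / trials
risk = sum((e - p) ** 2 for e in ests) / trials   # mean squared error

# Risk equals bias^2 + variance (an algebraic identity on the sample).
print(round(risk, 5), round(bias ** 2 + variance, 5))
```

Note the decomposition holds exactly on the simulated sample, not just in expectation, because it is the same algebraic identity applied to the empirical distribution.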
So, what we are going to see in the rest of this lecture is how to calculate these things in simple
settings. So, we will once again focus mostly on simple settings and work out, given the estimator,
how do you go about computing it. So, let us get started with those examples, that is a useful skill
to know, you have to put pen to paper, work it out, get to the answer. Some of you enjoy that, some
of you, I know, get scared of that. But it is an important skill to pick up. So, let us work on it.
(Refer Slide Time: 19:39)

So, first is the simplest of examples. We are going to see three different estimators for the Bernoulli(𝑝) samples. So, we had Bernoulli(𝑝) samples and three different estimators for them, and we will try and compute bias, variance and risk for all three. Not too difficult, but let us start out.

The first estimator was a constant; we did not bother about the samples, we just said 𝑝̂1 = 1/2. So, that was our first estimator. The bias is very easy here, 1/2 − 𝑝, and the variance is 0. Why? Because my estimator is a constant, and the variance of a constant is 0. So, the risk simply becomes bias squared, which is (1/2 − 𝑝)², since the variance is 0. Is that okay? Easy enough calculation, but make sure you understood how I got it. The bias is 𝐸[𝜃̂ − 𝜃]; now 𝜃̂ is a constant here, 𝑝̂1 = 1/2, and the expected value of half is just half, so the bias is 1/2 − 𝑝. And the variance is 0, because Var(1/2) = 0; there is no variance for a constant. The risk is just bias squared plus variance, so it becomes bias squared. Easy enough, but it is good to see this and do it a couple of times to be sure.

Our next estimator is slightly more involved, 𝑝̂2 = (𝑋1 + 𝑋2)/2, and I am saying the bias is 0. How do I get this? I need 𝐸[𝑝̂2] = 𝐸[(𝑋1 + 𝑋2)/2]. Remember, this is just a linear function of 𝑋1 + 𝑋2, so the expected value will go inside and the half will come outside: (1/2)𝐸[𝑋1] + (1/2)𝐸[𝑋2]. What are 𝐸[𝑋1] and 𝐸[𝑋2]? Both of these are just 𝑝; each of them is Bernoulli(𝑝), and whether or not they are independent does not matter here. So, it is 𝑝/2 + 𝑝/2, which is just 𝑝. So 𝐸[𝑝̂2] = 𝑝, and the bias is 𝐸[𝑝̂2 − 𝑝] = 0. Hopefully you got that.

Now for Var(𝑝̂2): the estimator is (𝑋1 + 𝑋2)/2, and 𝑋1 and 𝑋2 are independent. When you have a sum of independent random variables scaled by constants, we have a formula for the variance; go back a few lectures to sums of independent random variables, where I gave you formulas for the mean and the variance. You can just apply that formula when the variables are independent: Var(𝑝̂2) = (1/4)Var(𝑋1) + (1/4)Var(𝑋2). What is Var(𝑋1)? Var(𝑋1) = 𝑝(1 − 𝑝), the variance of a Bernoulli(𝑝) distribution. So, Var(𝑝̂2) = 𝑝(1 − 𝑝)/2. So what is the risk? The bias is 0 and the variance is 𝑝(1 − 𝑝)/2; the risk is bias squared plus variance, which is just 0 plus this, 𝑝(1 − 𝑝)/2. Easy calculations. For most of these results, it is easy to just plug in and do it; just be careful and calmly do it, and you will get to the answer.
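These hand calculations for 𝑝̂2 can be confirmed by simulation; here is a minimal sketch (the value p = 0.4 is my own illustrative choice).

```python
import random

# Sketch: check the hand calculations for p_hat_2 = (X1 + X2)/2 by
# simulation, at an illustrative p = 0.4 (any value would do).
random.seed(2)
p, trials = 0.4, 100000

ests = []
for _ in range(trials):
    x1 = 1 if random.random() < p else 0
    x2 = 1 if random.random() < p else 0
    ests.append((x1 + x2) / 2)

bias = sum(ests) / trials - p                       # should be near 0
risk = sum((e - p) ** 2 for e in ests) / trials     # should be near p(1-p)/2

print(round(bias, 4), round(risk, 4), round(p * (1 - p) / 2, 4))
```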
(Refer Slide Time: 22:54)
So, now comes our famous third estimator, which we hope is a good one. Let us see how to characterize the bias, variance and risk for this estimator. It is 𝑝̂3 = (𝑋1 + ⋯ + 𝑋𝑛)/𝑛, and the bias is going to be 0. Once again, it is the exact same calculation as before, except that now we have to use it for 𝑛 different samples: you can show 𝐸[𝑝̂3] = 𝑝. By linearity, the expected value goes inside the summation; each of those expected values is just 𝑝, and there are 𝑛 of them, so you get 𝑛𝑝/𝑛 = 𝑝. So, the bias ends up being 0.

For the variance, you use the same result for a linear combination of independent random variables: the variance is 1/𝑛² times the sum of the variances, Var(𝑝̂3) = (1/𝑛²)(Var(𝑋1) + ⋯ + Var(𝑋𝑛)) = 𝑛𝑝(1 − 𝑝)/𝑛² = 𝑝(1 − 𝑝)/𝑛. This division by 𝑛 is very important.

Now the risk is bias squared plus variance, so 𝑅𝑖𝑠𝑘 = 0² + 𝑝(1 − 𝑝)/𝑛 = 𝑝(1 − 𝑝)/𝑛. Notice this is falling with 𝑛, and that was an important requirement for us: as you get more and more samples, you expect your risk to go down. But notice what happens with 𝑝̂1 and 𝑝̂2: there is no 𝑛 in the picture. As you get more and more samples, the risk remains the same, so they are not very useful. You can see that 𝑝̂3 already scores way better than 𝑝̂1 and 𝑝̂2.

For those of you worried about the 𝑝(1 − 𝑝) factor, you can argue that 𝑝(1 − 𝑝) ≤ 1/4. How do you prove it? Go back to Maths 1: maximizing a quadratic function. Over 0 < 𝑝 < 1, the function 𝑝(1 − 𝑝) achieves its maximum at 𝑝 = 1/2, and that maximum is 1/4. So you see why Maths 1 is useful; all of you keep asking where you will use it. Use it everywhere. So, 𝑅𝑖𝑠𝑘 ≤ 1/(4𝑛), and as 𝑛 increases the risk keeps falling. That is the nice thing to see. Three different estimators, three different risks, and one risk falls with 𝑛; that is a good quantity to keep in mind.

So, once again, the skill to focus on here is how to compute the bias, the variance and the risk, given an exact description of the estimator and the underlying distribution with the unknown parameter. This is a good skill to pick up in this lecture.
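A short sketch (with a made-up p = 0.3 and a few illustrative values of n) showing the simulated risk of 𝑝̂3 tracking p(1 − p)/n:

```python
import random

# Sketch: the risk of the sample-mean estimator p_hat_3 falls as p(1-p)/n.
# p = 0.3 and the n values below are illustrative choices.
random.seed(3)
p, trials = 0.3, 10000

def mse(n):
    total = 0.0
    for _ in range(trials):
        s = sum(1 if random.random() < p else 0 for _ in range(n))
        total += (s / n - p) ** 2
    return total / trials

for n in (5, 20, 80):
    print(n, round(mse(n), 5), round(p * (1 - p) / n, 5))
```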
(Refer Slide Time: 25:52)
So, what I have done in this plot is just show you the three risk curves; I am not going to interpret the plot too much for you. There is 𝑝̂1, for which, if you remember, 𝑅𝑖𝑠𝑘(𝑝̂1) = (1/2 − 𝑝)²; then 𝑅𝑖𝑠𝑘(𝑝̂2) = 𝑝(1 − 𝑝)/2; and I think I picked 𝑛 = 10 here, so 𝑅𝑖𝑠𝑘(𝑝̂3) = 𝑝(1 − 𝑝)/10. But some of you will argue: look at the risk (1/2 − 𝑝)² of the stupid estimator that just says half; around 𝑝 = 1/2 it beats even the other 𝑝̂, which we were claiming has great performance. Risk-wise, in that part, this stupid estimator seems way better. But that is only over a small interval, and that is an interesting little thing to worry about.

The other thing to worry about is that all of these risks vary with 𝑝; they do not stay constant. As 𝑝 changes, your risk also changes. But we saw how to deal with that: you can think of bounding the probability that the absolute value of the error exceeds 𝑝/10, a 10 percent error, or something like that; you can look at it that way and that will also work.

So, these plots give you a picture of how to compare different estimators over different ranges of 𝑝. Over all the ranges of 𝑝, you want your estimator to be better; just because it is better over a small range, you cannot pick it, because at the other values it will go bad. This is like saying: I am really lucky, my 𝑝 is going to be half, so I will always say half, ignoring the samples. That is not a good idea. You want your estimator to be robust: whatever 𝑝 you are given, the risk should be low. That is why the third estimator is a good one. This is a simple plot, but this kind of plot gives you a good mental picture of what these things are.
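The plotted curves can be reproduced numerically; here is a sketch evaluating the three risk expressions at a few values of p, with n = 10 as in the plot described above:

```python
# Sketch: evaluate the three risk expressions on a grid of p values
# (n = 10 for p_hat_3, matching the plot described in the lecture).
n = 10

def risk1(p):   # constant estimator p_hat_1 = 1/2
    return (0.5 - p) ** 2

def risk2(p):   # two-sample average p_hat_2
    return p * (1 - p) / 2

def risk3(p):   # full sample mean p_hat_3
    return p * (1 - p) / n

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(risk1(p), 4), round(risk2(p), 4), round(risk3(p), 4))
# Near p = 1/2 the constant estimator has tiny risk, but away from 1/2
# it is much worse; the sample mean stays uniformly small.
```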
(Refer Slide Time: 27:46)
So, we are going to do a couple of calculations to close out this lecture, just to get used to the idea: given an estimator, how do we go about calculating bias, variance and risk? Here is an interesting estimator for the Bernoulli distribution:

𝑝̂ = (𝑋1 + ⋯ + 𝑋𝑛 + √𝑛/2)/(𝑛 + √𝑛).

Notice how this estimator is constructed; I have not told you how I came up with it. It is just one estimator; some of you may like it, some of you may not. But anyway, it is a valid estimator, so let us go ahead and calculate bias, variance and risk.

So, notice what I am doing here. The 𝑋1 + ⋯ + 𝑋𝑛 part seems clear, but why am I adding √𝑛/2 in the numerator and √𝑛 in the denominator? It does not seem very reasonable. Maybe it is reasonable, maybe it is not; think about it. It looks like (𝑋1 + ⋯ + 𝑋𝑛)/𝑛, except for the √𝑛/2 added to the numerator and the √𝑛 added to the denominator, maybe just to see how bias, variance and risk behave when such changes are made.

So, let us try bias. The first step is 𝐸[𝑝̂]. All of us know how to do this: the expectation is a linear operator, so it goes inside. 𝐸[𝑋1] = ⋯ = 𝐸[𝑋𝑛] = 𝑝, so 𝐸[𝑋1 + ⋯ + 𝑋𝑛] = 𝑛𝑝, and you get

𝐸[𝑝̂] = (𝑛𝑝 + √𝑛/2)/(𝑛 + √𝑛).

So, the bias is going to be

𝐵𝑖𝑎𝑠 = (𝑛𝑝 + √𝑛/2)/(𝑛 + √𝑛) − 𝑝 = (√𝑛/2 − 𝑝√𝑛)/(𝑛 + √𝑛) = √𝑛(1/2 − 𝑝)/(𝑛 + √𝑛).

The first thing I want to point out is that this is not an unbiased estimator: it has a non-zero bias. But the bias behaves like √𝑛/𝑛 = 1/√𝑛 times (1/2 − 𝑝), and as 𝑛 becomes really, really large, this vanishes to 0. So the bias keeps falling, perhaps a little more slowly than you would want, but it also goes to 0. So, it is okay: the bias is decreasing with 𝑛, but it is not quite 0. That is one observation.

The next thing is variance. The variance is not too bad, because you can ignore the constant √𝑛/2; adding a constant does not change the variance, so you only have to worry about the sum. And this is again a sum of independent variables, so you simply get

Var(𝑝̂) = (1/(𝑛 + √𝑛)²) · 𝑛𝑝(1 − 𝑝).

Once again, remember how I got here: the √𝑛/2 plays no role in the variance, you can just drop it, and instead of dividing by 𝑛 I am dividing by 𝑛 + √𝑛. So you get 1/(𝑛 + √𝑛)² times Var(𝑋1 + ⋯ + 𝑋𝑛); the variance goes inside the sum because 𝑋1, …, 𝑋𝑛 are independent, and each of them contributes 𝑝(1 − 𝑝), so you get 𝑛𝑝(1 − 𝑝). Think about how I got it; this is a good exercise for you to get your hands on.

Now the risk:

𝑅𝑖𝑠𝑘 = (√𝑛(1/2 − 𝑝))²/(𝑛 + √𝑛)² + 𝑛𝑝(1 − 𝑝)/(𝑛 + √𝑛)² = (𝑛/(𝑛 + √𝑛)²)(1/4 − 𝑝 + 𝑝² + 𝑝 − 𝑝²) = 𝑛/(4(𝑛 + √𝑛)²).

So, some magic has happened here: notice how the terms in 𝑝 cancel, and the risk is just 𝑛/(4(𝑛 + √𝑛)²). It seems almost magical; at least the first time I saw this estimator, I felt there was some magic going on. For the previous three estimators, when we plotted the risk, there was some dependence on 𝑝. The risk of this estimator is independent of 𝑝; the 𝑝 has just disappeared. Is that not interesting? I always found it a bit intriguing how this estimator comes about; you might want to think about it.

But anyway, that is not the point of this exercise. The point of this exercise is that given any modified estimator like this, with small changes, the bias may change and the variance may change, but the underlying principles are just the same: iid samples, combined to form the estimator. So, you have to be able to find the bias, the variance and the risk. It will involve some algebra and a careful understanding of the probability; it is important, you have to do it, and when you do it, you get final answers. And this particular estimator is a reasonably famous estimator in this area, and it has this wonderful property that its risk is flat across 𝑝: whatever your 𝑝 may be, the risk is the same.
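The flat-risk property can be checked by simulation; here is a sketch (n = 25 and the p values are my own illustrative choices, not from the lecture):

```python
import math
import random

# Sketch: simulate p_hat = (X1+...+Xn + sqrt(n)/2) / (n + sqrt(n)) and
# check that its mean squared error is n / (4 (n + sqrt(n))^2) for every p.
random.seed(4)
n, trials = 25, 20000
denom = n + math.sqrt(n)
theory = n / (4 * denom ** 2)

def mse(p):
    total = 0.0
    for _ in range(trials):
        s = sum(1 if random.random() < p else 0 for _ in range(n))
        total += ((s + math.sqrt(n) / 2) / denom - p) ** 2
    return total / trials

for p in (0.1, 0.5, 0.9):
    print(p, round(mse(p), 5), round(theory, 5))
```

Whatever p is plugged in, the simulated risk stays near the same constant, matching the cancellation in the algebra above.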
And that may be desirable in some situations: I do not want my risk to change at all, whatever the 𝑝 may be. It is an interesting thing for you to think about. So, that is one problem, and with that, we are going to stop this lecture. Hopefully, you understood the three important characteristics of a point estimator, bias, variance and risk, and the simple relationship between them. And given a description of a point estimator, you should be able to put pen to paper and compute its bias, variance and risk. That is a good skill to pick up. I conclude this lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 8.5
Estimator Design Approach: Method of Moments

(Refer Slide Time: 0:13)

Hello, and welcome to this lecture. In the previous lecture, we saw very important characteristics
of estimators, bias, variance and squared error risk, or mean squared error and we saw how in
simple cases one can try and calculate it using basic knowledge of probability and the properties
of iid samples and the properties of the underlying distributions and all that, and how we can get
very interesting answers for such estimators.

So, in this lecture, we are going to start looking at design of estimators. So, how do you approach
the design of an estimator? A very, very popular and simple method is something called method
of moments and that is the first method that we are going to see.
(Refer Slide Time: 0:52)

So first, let me remind you about what parameters are and what moments are. Suppose you have a random variable 𝑋 which is distributed according to some function 𝑓(𝑥; 𝜃1, 𝜃2); this could be a PMF, a PDF, whatever, and there are parameters 𝜃1, 𝜃2, etc., in this PMF or PDF, which we do not know.

Given the PMF or PDF, one can always compute the moments: the expected value of 𝑋, the expected value of 𝑋², maybe 𝑋³, 𝑋⁴, etc. You can keep computing these moments, and you will get answers in terms of the parameters. The distribution is expressed in terms of the parameters; if you compute moments, you will again get answers in terms of the parameters.

It is not very difficult; here are some examples. If 𝑋 is Bernoulli(𝑝), the expected value of 𝑋 is 𝑝; if you were to find the expected value of 𝑋², you would get something else, but in terms of 𝑝. The Poisson(𝜆) distribution again has a parameter 𝜆, and 𝐸[𝑋] = 𝜆. The Exponential(𝜆) distribution is again parameterized by 𝜆, and the expected value of 𝑋 is 1/𝜆. The Normal(𝜇, 𝜎²) distribution has two parameters, so you can look at the two moments: the expected value of 𝑋 is 𝜇, and 𝐸[𝑋²] = 𝜇² + 𝜎². And then there are some other distributions which we saw very briefly earlier, such as the Gamma(𝛼, 𝛽) distribution; you remember the gamma distribution? I think the PDF is proportional to 𝑥^(𝛼−1) 𝑒^(−𝛽𝑥), if I am not wrong. If you compute the first and second moments for the gamma distribution, here is what you get: the expected value of 𝑋 is 𝛼/𝛽, and 𝐸[𝑋²] = 𝛼(𝛼 + 1)/𝛽². So, what is the point here?

There are parameters for these PMFs and PDFs, and you can find the moments, and the moments will be functions of these parameters. Here is another one, in case you are wondering: suppose 𝑋 is Binomial(𝑁, 𝑝); what are 𝐸[𝑋] and 𝐸[𝑋²]? You may remember 𝐸[𝑋] = 𝑁𝑝 and the variance is 𝑁𝑝(1 − 𝑝), so the expected value of 𝑋², the second moment, is going to be 𝑁𝑝(1 − 𝑝) + 𝑁²𝑝². These are all various different distributions; hopefully you can refresh yourself. And if you look at the beta distribution, there are two parameters 𝑎, 𝑏, and the expected values of 𝑋 and 𝑋² can be expressed in terms of 𝑎 and 𝑏.

So, this is moments and parameters. Your PMF or PDF has unknown parameters, and the distribution moments, the expected values of 𝑋, 𝑋², etc., can be expressed in terms of these parameters using either integration or summation. This is the first thing to understand in the method of moments.
(Refer Slide Time: 3:49)

So, the next thing is moments of samples. Suppose I give you 𝑛 iid samples 𝑋1, …, 𝑋𝑛 from some distribution. You can always compute the sample moments. The 𝑘-th sample moment, for which I will use the notation 𝑀𝑘, is a function of the 𝑛 samples, defined as

𝑀𝑘 = (1/𝑛) ∑ᵢ 𝑋ᵢ^𝑘,

simply the average of the 𝑘-th power of each sample. If you put 𝑘 = 1, you simply get the sample mean; if you put 𝑘 = 2, you get the second sample moment; 𝑘 = 3 gives the third sample moment, and so on. It is just the average of suitable powers of the samples that you observe.

The important thing to note is that the sample moment is a random variable, not a constant. You may have one sampling instance 𝑥1 to 𝑥𝑛, and once you have that, 𝑀𝑘 will take a particular value 𝑚𝑘 for that instance; you will get your first sample moment, 𝑘-th sample moment and all that. If you get another sampling instance, you will get another value.

So, sample moments are random variables; they take different values for different samplings, they have a distribution and all that, and maybe there is some concentration, etc. But sample moments and distribution moments are not the same; these two are different. A distribution moment is a fixed constant, a function of the parameters, 𝑝 or 𝜇 or 𝜆 or some 𝜃, whatever; it is a constant. The sample moments are random variables, and they will keep varying.

The last line here is maybe something for you to think about: we expect that 𝑀𝑘 will take values around 𝐸[𝑋^𝑘]. We have seen various results justifying this, particularly in the earlier lectures about the Central Limit Theorem, the weak law of large numbers, and concentration. All of that was designed to convince you that the sample moments, for larger and larger 𝑛, take values close to the corresponding distribution moments. This is something we have seen before, so we intuitively expect these two to be similar: you expect 𝑀𝑘 to take values around 𝐸[𝑋^𝑘].
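A small sketch of this concentration (the exponential distribution with a made-up rate λ = 2 is just an illustrative choice; for it, E[X] = 1/λ = 0.5 and E[X²] = 2/λ² = 0.5):

```python
import random

# Sketch: sample moments M_k concentrate around distribution moments E[X^k].
# X is Exponential(lambda = 2); the distribution and n are illustrative.
random.seed(5)
lam, n = 2.0, 100000
xs = [random.expovariate(lam) for _ in range(n)]

m1 = sum(xs) / n              # first sample moment, near E[X] = 0.5
m2 = sum(x * x for x in xs) / n   # second sample moment, near E[X^2] = 0.5

print(round(m1, 3), round(m2, 3))
```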

(Refer Slide Time: 6:13)

So, the method of moments exploits this concentration: in some sense, we expect the moments to be equal, that is, the distribution moments and the sample moments are approximately equal. That is what we do in the procedure for the method of moments. We simply equate the sample moments to the expressions for the distribution moments in terms of the unknown parameters, and we solve for the unknown parameters.

Suppose you have just one parameter 𝜃; usually you need only one moment. I say usually, because I am expecting the first moment to be a function of 𝜃. What if it is not a function of 𝜃? What if it is a constant? For instance, you may have a normal with 0 mean and variance 𝜃. In that case, the first moment is 0; it is not a function of 𝜃, so you cannot use it. You have to keep searching until you get a moment which is a function of the 𝜃 that you want; in the normal case, maybe the second moment will work out. So, "usually" just works most of the time, but sometimes, if a moment is not a function of 𝜃, you cannot use it.

So, if you have one unknown parameter 𝜃: from your samples you compute the sample moment value 𝑚1, and the distribution moment 𝐸[𝑋] will be some function of 𝜃; we will see exact examples as we go along. You solve the equation 𝐸[𝑋] = 𝑚1 for 𝜃, and once you find the solution, you simply replace 𝑚1 by the random variable 𝑀1; that gives you the estimator 𝜃̂. Remember, the estimator is also a random variable; you can evaluate it just for one sampling, but you should say what you do in general for samples. So, this is what you do for one parameter; it will be clear enough when I do a precise example, but at a high level, this is what we do.

If we have two parameters 𝜃1 and 𝜃2, you use, say, the first and second moments: if they end up being good functions of 𝜃1 and 𝜃2, then you can invert these functions, find 𝜃1 and 𝜃2 in terms of 𝑚1 and 𝑚2, and then, for the estimators, you simply replace 𝑚1 and 𝑚2 by 𝑀1 and 𝑀2. So, this is the method of moments.
(Refer Slide Time: 8:14)

So, let us see some examples; this is what will help you. 𝑋1 through 𝑋𝑛 are iid Bernoulli(𝑝), and 𝐸[𝑋] = 𝑝. What is the unknown parameter? It is 𝑝. And what is the method of moments equation now? You replace 𝐸[𝑋] by the sample moment 𝑚1, so the equation is simply 𝑝 = 𝑚1. Once again, this is a very, very simple example, but still it is important: 𝐸[𝑋] = 𝑝 means the first moment of the distribution is a function of the parameter, in this case the identity function; it is equal to the parameter itself. So if you compute the sample moment, you expect it to be close to 𝑝; that is the equation. And if you want the estimator 𝑝̂, you simply replace 𝑚1 with 𝑀1, going to the random variable version of it in some sense, and you get 𝑝̂ = 𝑀1 = (𝑋1 + ⋯ + 𝑋𝑛)/𝑛.

So, it is just a simple little equation. Hopefully you see how I got to this; you can then cut short all these steps and directly see that 𝑝̂ will be the sample mean, but hopefully you see the method of moments working in this fashion. You write down an equation; in this case, the equation was really, really simple.
(Refer Slide Time: 9:46)

It can be a little bit different in some cases, but let us proceed slowly. The next example is iid Poisson(𝜆). Suppose you have 𝑛 samples from a Poisson distribution; how do you design a method of moments estimator? Here also, it is very simple: the expected value of 𝑋 is directly equal to 𝜆, and whenever a parameter is directly equal to a moment, there is nothing to it. The method of moments equation is very easy, 𝜆 = 𝑚1, so the estimator 𝜆̂ is simply 𝑀1, which is again the sample mean. If a parameter is equal to a distribution moment, that is the easiest case: you simply replace the distribution moment by the sample moment, and you get your method of moments estimator. So, for both Poisson and Bernoulli, finding the method of moments estimator is trivial; it is simply the sample moment, 𝜆̂ = 𝑀1.
(Refer Slide Time: 10:39)

The first difference we are going to see is with the exponential distribution. Notice: for Exponential(𝜆), if you were to use the method of moments, the expected value of 𝑋 is 1/𝜆. This is the first time we are having a difference here. If you replace 𝐸[𝑋] with 𝑚1, the sample moment, you get 1/𝜆 = 𝑚1. Now you have to solve this equation, and solving it is quite trivial: 𝜆 = 1/𝑚1. In the estimator, you simply replace 𝑚1 by 𝑀1, so you get

𝜆̂ = 1/𝑀1 = 𝑛/(𝑋1 + ⋯ + 𝑋𝑛).

So, this is the first time we are really seeing the method of moments at work: the parameter was not directly one of the moments, and you had to do a little bit of an inversion to get to it. This is part and parcel of the skill you need to pick up for deriving method of moments estimators: express the moments in terms of the parameters, invert the equations to get the parameters from the sample moments, and then replace the sample moments with the random variables 𝑀1, 𝑀2, etc., to get your estimator. This was a simple case; we will see slightly more complicated cases in the rest of this lecture. It is an important skill to pick up, how to design a method of moments estimator.
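A minimal sketch of this estimator in action (the true λ = 1.5 is a made-up value for illustration):

```python
import random

# Sketch: method of moments for Exponential(lambda).
# The estimator is lambda_hat = 1 / M1 = n / (X1 + ... + Xn).
random.seed(6)
true_lam, n = 1.5, 50000
xs = [random.expovariate(true_lam) for _ in range(n)]

m1 = sum(xs) / n          # first sample moment
lam_hat = 1 / m1          # invert E[X] = 1/lambda

print(round(lam_hat, 3))  # close to the true 1.5
```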
(Refer Slide Time: 12:07)

We will keep seeing more and more examples; one more example. I am beating you to death with examples here, but it is important, and hopefully it is clear. Now I am doing the normal distribution. There are two parameters here, 𝜇 and 𝜎²; this is the first time we are seeing two parameters, so clearly we will need at least two moments. The first two moments work out nicely: 𝐸[𝑋] = 𝜇 and 𝐸[𝑋²] = 𝜇² + 𝜎². In these two equations, what do I do? I replace the distribution moments with the sample moments, so I get 𝜇 = 𝑚1 and 𝜇² + 𝜎² = 𝑚2, and I solve. You have two equations in two variables; 𝜇 and 𝜎 are unknown, while 𝑚1 and 𝑚2 are known from your samples.

The solution is direct: 𝜇 = 𝑚1, and taking 𝜇² to the other side and taking square roots, 𝜎 = √(𝑚2 − 𝑚1²). The estimator for 𝜇 is very easy: it is just 𝜇̂ = 𝑀1. The estimator for 𝜎 is a little bit more involved: you need the second sample moment minus the first sample moment squared, 𝜎̂ = √(𝑀2 − 𝑀1²). It is possible to simplify this a little bit; I am not focusing on the simplification here, but you can see that an 𝑛 will come in, and then you will have the 𝑋ᵢ², etc. I am not claiming this is the simplest form for 𝜎̂, but it is the clearest and easiest form we can write down.

So, go through the steps once again and make sure you can reproduce them very clearly. This is how you derive the method of moments estimator for a normal distribution when you do not know 𝜇 and 𝜎²: express the distribution moments in terms of 𝜇 and 𝜎²; when you have enough equations, replace the distribution moments with sample moments; solve in reverse to find 𝜇 and 𝜎² in terms of the sample moments; and then replace with the sample moment random variables to get your estimators. So, hopefully, this is clear enough, a simple enough procedure.
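A sketch of the normal method of moments estimators (μ = 2 and σ = 3 are made-up true values):

```python
import random

# Sketch: method of moments for Normal(mu, sigma^2).
# mu_hat = M1 and sigma2_hat = M2 - M1^2, as derived above.
random.seed(7)
mu, sigma, n = 2.0, 3.0, 100000
xs = [random.gauss(mu, sigma) for _ in range(n)]

m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n

mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print(round(mu_hat, 3), round(sigma2_hat, 3))  # close to 2 and 9
```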

(Refer Slide Time: 14:17)

Maybe the next example is a little bit more complicated. So, let me work it out in detail for you,
this is for the gamma distribution. Now, there are two parameters and and the first two
moments are / and / / . So, what is my method of moments equation /
and / / = .

So, from here I have to solve for α and β. How do you do it? One very crude way of doing
it is, from one of these two equations, to express one unknown in terms of the other
unknown. That is the approach you can take; in many cases it will work. So in the first
equation, if you look at α/β = m₁, you can write α = βm₁.

And in the second equation, you substitute α in terms of β. So basically, from two
equations in two variables, you will get one equation in one variable; you eliminate one of the
variables. So that is the approach and of course there are more fancy methods to solve
simultaneous equations, we will not ask you about that. This course is not about solving
simultaneous equations.

We will ask you simple things, simple equations; that basic skill is good to have. So here,
you just find one variable in terms of the other and substitute back: you are going to get
βm₁(βm₁ + 1)/β² = m₂. One β will cancel, and multiplying out gives m₁² + m₁/β = m₂.

So, the m₁² will go to the other side: you will get m₁/β = m₂ − m₁², or β = m₁/(m₂ − m₁²). So you

see how the equation works. It is slightly more complicated, not as easy as the previous two
cases. But nevertheless, it is good enough.

So, what do you do for α? You just substitute back: α = βm₁ becomes α = m₁²/(m₂ − m₁²). Hope I did
not make mistakes here; let me just make sure, alright, not too bad. So β is m₁/(m₂ − m₁²) and α is
m₁²/(m₂ − m₁²). So now, what are the estimators? The method of moments estimators are

going to be α̂ = M₁²/(M₂ − M₁²) and β̂ = M₁/(M₂ − M₁²).

These are slightly more complicated estimators. You see, for the gamma distribution the moments
are slightly more complicated functions of the parameters, so you get estimators that are also
slightly more complicated functions of the sample moments. A simple skill, you would agree: just write
the equations, express one variable in terms of the other, and solve for the other variable. So let us
see one more problem.
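The gamma inversion above can be sketched in the same way (a minimal illustration; the positive sample values are made up, and the parameterisation is the lecture's, with mean α/β):

```python
def gamma_mom(samples):
    """Method of moments for Gamma(alpha, beta), parameterised so that
    E[X] = alpha/beta and E[X^2] = alpha*(alpha+1)/beta^2."""
    n = len(samples)
    m1 = sum(samples) / n
    m2 = sum(x * x for x in samples) / n
    alpha_hat = m1 ** 2 / (m2 - m1 ** 2)  # alpha = m1^2 / (m2 - m1^2)
    beta_hat = m1 / (m2 - m1 ** 2)        # beta  = m1 / (m2 - m1^2)
    return alpha_hat, beta_hat

# hypothetical positive samples, just to exercise the formulas
alpha_hat, beta_hat = gamma_mom([0.5, 1.2, 0.8, 2.1, 1.4])
# consistency check: the fitted mean alpha_hat/beta_hat equals the sample mean
```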
(Refer Slide Time: 17:44)

Along the same lines, in this case, we have Binomial(N, p), where both N and p are unknown. So I
have n samples from Binomial(N, p), both N and p unknown. Expected value of X is Np, and
expected value of X² is Np(1 − p) + N²p²; you have to solve for these guys here. So, what are
my equations? Np = m₁, Np(1 − p) + N²p² = m₂.

Now, solve for N and p from these equations. What is our strategy for solving? You express one
variable in terms of the other, so the first equation implies N = m₁/p. So you write m₁/p in the
second equation. So, you will get (m₁/p)p(1 − p) + (m₁/p)²p² = m₂.

So, here you have the p's canceling, and you get an equation in p alone. So, you will get
m₁(1 − p) + m₁² = m₂; I am just writing it out laboriously, m₁ − m₁p + m₁² = m₂, and so
m₁p = m₁ + m₁² − m₂. You can write it out if you like: p = [m₁ + m₁² − m₂]/m₁. It is the
same thing.

Now, N would be what? N would be just m₁/p. So, it is going to be N = m₁²/(m₁ + m₁² − m₂). So, this is p and

N. So, notice how I started with these equations and came to these two expressions. So this is the skill;
a simple skill, inverting two equations in two variables, finding N and p in terms of m₁ and m₂; I
told you how to go about doing this.
So, once you do this you can find N̂ and p̂. Writing in terms of the sample moment random variables,
N̂ = M₁²/(M₁ + M₁² − M₂), and then p̂ is [M₁ + M₁² − M₂]/M₁. Easy enough in some sense, but this gives you a readymade simple method.

Otherwise, how will you even start thinking of what the estimators are? This just gives you a very simple
starting point. If you have a distribution with unknown parameters and samples from the
distribution, then to find those parameters from the samples, the Method of Moments is just a plug-and-play
formula, and this skill is important and easy to pick up.
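The Binomial(N, p) inversion can be sketched as follows (a minimal illustration using the lecture's count samples; note that N̂ generally comes out as a fraction and can be rounded to the nearest integer if an integer N is required):

```python
def binomial_mom(samples):
    """Method of moments for Binomial(N, p), both N and p unknown:
    solves N*p = m1 and N*p*(1-p) + N^2*p^2 = m2."""
    k = len(samples)
    m1 = sum(samples) / k
    m2 = sum(x * x for x in samples) / k
    p_hat = (m1 + m1 ** 2 - m2) / m1  # p = [m1 + m1^2 - m2] / m1
    N_hat = m1 / p_hat                # N = m1 / p
    return N_hat, p_hat

N_hat, p_hat = binomial_mom([8, 7, 6, 11, 8, 5, 3, 7, 4, 6, 9])
# sanity check: by construction, N_hat * p_hat reproduces the sample mean
```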

(Refer Slide Time: 20:51)

How do you use these things in practice? Sometimes you get lost in those equations and do not
know what happens when you are handed the samples. How do you go about doing it? So Bernoulli(p),
here are the samples: 1, 0, 0, 1, 0, 1, 1, 1, 0, 0. How do you find p̂? In this case, it was just M₁.
So you find the sample mean: add up all these values and divide by 10, and you get 5/10.

Straightforward, is it not? We looked at these alpha particle emissions earlier. If you look at
the number of alpha particles emitted in a 10-second interval, we know it is Poisson(λ). The number
of particles emitted per second is usually given as an average; it is 0.8392. From here,
you can find the average number of particles emitted in 10 seconds: it is 8.392.

So, sometimes this is also important. When you count particles, you will be counting different
numbers of particles over different 10-second intervals, but really, λ is just the average number of
particles emitted in 10 seconds. So, you can as well count the total number of particles emitted,
divide by the total time to find the average number of particles per second, and multiply by 10.
That also will give you the same number; you will get 8.392.
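The two equivalent ways of computing the 10-second average can be checked with a quick sketch (the interval counts below are hypothetical, not the lecture's data):

```python
# hypothetical counts observed in consecutive 10-second intervals
counts = [9, 7, 10, 8, 6, 11, 9, 7]

# way 1: average count per 10-second interval directly
lam_hat_1 = sum(counts) / len(counts)

# way 2: total particles / total seconds gives the per-second rate; scale by 10
total_seconds = 10 * len(counts)
lam_hat_2 = (sum(counts) / total_seconds) * 10

# both routes give the same estimate of the 10-second average
```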

So, this is a nice simple calculation. Notice how we are calculating the average in different ways;
you can do it in multiple ways, and in some cases the average is easy to compute in this fashion. So,
let us go to Normal(µ, σ²). Here are 10 samples. How do you find µ̂? It is just M₁. And

what is σ̂? It is √(M₂ − M₁²). You calculate, and you will get it.

So this is how you calculate: you just find the equations in terms of the moments, compute the
sample moments, plug them in, and you get the answer. These quantities are just the sample moments
M₁ and M₂ once again, computed from the samples. Finally, Binomial(N, p): if somebody gives you Binomial(N, p),
you do not know N and you do not know p, but you know it is binomial, and those are the samples: 8, 7,
6, 11, 8, 5, 3, 7, 4, 6, 9. You can plug in the formula.

We looked at N̂ and p̂. So there is this M₁²/[M₁ + M₁² − M₂]; it is a

complicated fraction expression, but you write it down. This N̂ will not be exactly 19; it might be
18.9 or 19 point something, but you can approximate it as 19 because you know N has to be an
integer. So you can give the closest possible integer if you like. So N̂ works out as 19 in
this case, and p̂ is 0.371.

So notice, I do not know if you like it or not, but for Binomial(N, p), looking at the samples,
would you say 19 and 0.371? Where does that come from? Maybe it is correct, maybe it is
wrong. So one needs to look at these kinds of methods and be convinced that they are working out
correctly. In this case, the Method of Moments gives you the answer, and it is not unreasonable for
a method like that to work.

So hopefully, this lecture was interesting to you. You saw how to build a point estimator using
the method of moments idea: a very simple method, just write a few equations and solve for the
unknowns. In most cases, that works out very cleanly and gives you a nice estimator;
in some cases, it may not work as well.

But nevertheless, it is a method that works out for us, and we can see how to use it in actual
scenarios where we have samples. So that is the end of this lecture. In the next lecture, we will see
another very nice design idea called maximum likelihood. That is also a very popular idea. We
will see that in the next lecture. Thank you very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Lecture 8.6
Parameter estimation - Estimator design approach - Maximum likelihood

(Refer Slide Time: 0:15)

Hello and welcome to this lecture. In the previous lecture, we saw one very important and
interesting method for designing estimators: the method of moments, where you evaluated the
distribution moments as functions of the parameters, equated those functions to the sample
moments, solved to express the parameters in terms of the moments, and finally substituted
back to get your estimates.

It seemed like a simple procedure, and it was a simple procedure in most cases. It is quite useful:
quite often you may not get a handle on the actual distribution itself; you may not know the distribution
and would have to guess at it. And why guess the distribution when you do not know it? Just
use the moments; the moments can be quite reliably estimated from the samples using the sample
moments, so try to match the moments as much as possible.

So as a principle, trying to get the moments to match between your estimate and what the
distribution actually gives you is a very good principle for designing estimators. So in this lecture,
we are going to jump into another very important and interesting principle called maximum
likelihood. The way we will develop it is to assume a distribution: when you know
the distribution very well, maybe you can do a little bit better; you can come up
with some interesting strategies using the distribution explicitly, not just the moments.

So, maximum likelihood is a very popular method; a lot of people swear by it and
think maximum likelihood is the only method to use, etcetera. But it is a very
good method, founded on good theoretical ideas. And it is also often easy to derive
and to implement; it is not too difficult for many distributions.

(Refer Slide Time: 2:17)

So let us get started looking at maximum likelihood. The first definition is this word likelihood.
What is likelihood? Before you go about maximizing the likelihood, you need to know what the
likelihood is. So in this case, we will be talking about the likelihood of the iid samples that we
observed. So we are observing n iid samples, this has been our picture throughout, there are
parameters that are unknown in the distribution of these samples, theta 1, theta 2, etcetera.

And the PDF or PMF: this 𝑓𝑋(𝑥) represents either the PDF or the PMF. If you have the
continuous case, it is going to be the PDF, the density function; if you have a discrete case, it is
going to be the PMF, the mass function. Keep that distinction in your mind; I will just say
distribution 𝑓𝑋(𝑥), and you can substitute suitably. We know that
the PDF or the PMF depends on these parameters 𝛳1, 𝛳2, etcetera.
So, to bring that into the notation, instead of just writing 𝑓𝑋(𝑥), I will write 𝑓𝑋(𝑥; 𝛳1, 𝛳2), etcetera. So
that notation brings out the fact that the distribution depends on the variable x, and that the value of the
distribution at the point x actually depends on the parameters 𝛳1, 𝛳2 in this fashion. So, here is an
example for you: if you think of the N(µ, 𝜎²) distribution, the PDF evaluated at x with parameters
µ and σ is the function (1/(√(2𝜋)𝜎)) 𝑒^(−(𝑥−µ)²/(2𝜎²)).
√2𝜋𝜎

So, you have mu appearing in the PDF expression, sigma appearing in the PDF expression, x also
appears in the PDF expression. So, I can write this function as 𝑓𝑋 (𝑥; µ, 𝜎), just to bring out that µ
and σ play a role in that function. So, once you do that, you can write down something called the
likelihood of a particular sampling instance.

So, your samples are usually random variables; in a particular instance, those random variables
take some actual values. What are those actual values? I will denote them 𝑥1, 𝑥2, …, 𝑥𝑛, and the likelihood of
that actual sampling I will denote by capital L, L(𝑥1, 𝑥2, …, 𝑥𝑛).

So, sometimes the arguments will be dropped, I will just say capital L. It is simply the product of
the distribution function. It could be PDF for the continuous case or PMF for the discrete case, it
is simply the product of the distribution function, let us say the density function evaluated at each
of the samples.

So, you see here the product ∏ᵢ₌₁ⁿ: I take the density function, let us say, and evaluate it at 𝑥𝑖. The
parameters, remember, are always there for every evaluation, but this 𝑥𝑖 will
keep changing: the first time it will be 𝑥1, the next time it will be 𝑥2, like that.

So, that is the likelihood function. You can see this likelihood sort of represents the probability
of observing the particular sample. Remember, the samples are independent, so their
probabilities can be multiplied. So you evaluate the probability: if it is a PMF, you
evaluate the probability that the particular value occurred, the probability that 𝑋𝑖 = 𝑥𝑖, which
is the PMF evaluated at 𝑥𝑖.

So even the density sort of represents how likely it is that you got that particular value. So
likelihood is a good word for the expression; it represents, in some sense, how likely I was
to observe this particular sample. But remember what is very important: given the
samples, the likelihood is a function of the unknown parameters of the distribution.
So, say that to yourself again; it is a very important idea that is sort of hidden in this expression.
In the likelihood for a particular sampling instance, I have put in the values of 𝑥1, 𝑥2, …, 𝑥𝑛, but
𝛳1 and 𝛳2 still remain unknown. So the likelihood is actually a function of 𝛳1, 𝛳2. I
am not showing that here on the left-hand side; I simply write L(𝑥1, 𝑥2, …, 𝑥𝑛), but the assumption is that
𝛳1, 𝛳2 are also built in, so this is a function of 𝛳1, 𝛳2, etcetera. A very important
observation to have.
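The definition above, the product of the density evaluated at each sample, viewed as a function of the parameters, can be sketched as follows (a minimal illustration; the samples and the two parameter guesses are made up):

```python
import math

def likelihood(samples, pdf, *params):
    """L(x1,...,xn) = product over the samples of the density/PMF,
    viewed as a function of the unknown parameters."""
    L = 1.0
    for x in samples:
        L *= pdf(x, *params)
    return L

def normal_pdf(x, mu, sigma):
    # N(mu, sigma^2) density: (1/(sqrt(2*pi)*sigma)) * exp(-(x-mu)^2/(2*sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# the same samples give different likelihoods for different parameter guesses
samples = [1.07, 0.91, 0.88, 1.07, 1.15]
L_good = likelihood(samples, normal_pdf, 1.0, 0.2)  # guess close to the data
L_bad = likelihood(samples, normal_pdf, 5.0, 0.2)   # guess far from the data
```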

(Refer Slide Time: 6:48)

So, here is an example, so let us begin with the Bernoulli(p) samples, those are the samples given
to you 1, 0, 0, 1, 0, 1, 1, 1, 0, 0. If you want to evaluate the likelihood, it is going to be the PMF of
the Bernoulli distribution. What is the PMF here? 𝑓𝑋(𝑥) = 𝑝 if x = 1, and 1 − p if x =
0. The first sample you observe is 1, so the PMF value is p. The second sample is 0, giving 1 − p;
the third gives 1 − p, the fourth p, the fifth 1 − p; then you have a sequence of p's, and then
1 − p, 1 − p.

All of these are multiplied together. So the PMF evaluated at the first sample, second sample, third
sample, so on. So it is almost like probability that the first sample equals 1, that is p. Probability
that the second sample equals 0, that is 1- p like that, you multiply all of these things, you get the
probability of the entire sample. So in the discrete case, clearly, this is 𝑃(𝑋1 = 1) · 𝑃(𝑋2 = 0) · … ·
𝑃(𝑋10 = 0).
These are all independent, so you can multiply them out, and you get this.
So now, after you multiply them out, what will you get? In this case, you get 𝑝⁵(1 − 𝑝)⁵.
What is the first 5? If you think about it, it is the number of 1s in the sample.
What is this 5? In this case, both happened to be equal, but this is actually the number of 0s in the
sample. Do you see that?

So, if tomorrow you observe some other sampling from the same Bernoulli(p) distribution, the
likelihood will always be p raised to the power of the number of 1s in the samples that you saw,
multiplied by 1 − p raised to the power of the number of 0s that you saw in the samples.
Remember, number of 1s in the samples, plus the number of 0s in the samples is actually equal to
the total number of samples.

So, it is enough to count the number of 1s; the number of 0s is n minus the number
of 1s. That is easy enough for you to see. So that is just Bernoulli:
simple enough, you got the likelihood. Hopefully, you get the hang of it. For any other discrete
distribution you have, look at the sample, substitute the probabilities, multiply them out, and you will
get the likelihood. Easy enough to do.
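The collapse of the product into 𝑝^𝑤(1 − 𝑝)^(𝑛−𝑤) can be checked numerically (a minimal sketch using the lecture's sample; p = 0.3 is an arbitrary test value, not from the lecture):

```python
samples = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
w = sum(samples)   # number of 1s: 5
n = len(samples)   # total samples: 10

def bernoulli_likelihood(p):
    # the product of PMF values collapses to p^w * (1-p)^(n-w)
    return p ** w * (1 - p) ** (n - w)

# cross-check against the term-by-term product for one value of p
p = 0.3
term_by_term = 1.0
for x in samples:
    term_by_term *= p if x == 1 else 1 - p
# term_by_term and bernoulli_likelihood(0.3) agree
```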

(Refer Slide Time: 9:36)

Next is a normal distribution. So under some Normal(µ, σ²) distribution, here is the
sequence of samples I saw: 1.07, 0.91, 0.88, 1.07, 1.15. These are the samples, 10 samples once
again. How will you evaluate the likelihood? You take the first sample, and you know the density
function. So that is the density function: you plug in x = 1.07 and you get the
density function evaluated at that point.

Now, this is a lot of things to write; I have put dots here, and all of those terms get
captured in the dots, down to the last one, the PDF evaluated at 1.08. So now I can multiply all
of these together and write it in one succinct way. The first thing I notice is that 1/(√(2𝜋)𝜎)
occurs in every term; there are 10 terms, so multiplying them all together I get (1/(√(2𝜋)𝜎))¹⁰.

And then I have e-power-minus terms multiplying 10 times; I can add up
all the arguments of the exponentials and write just one exponential: 𝑒^(−[(1.07 − µ)² + ⋯ ]/(2𝜎²)),
with the whole sum finally divided by 2𝜎². Both forms are
the same; the same simplifications apply. So notice, in both cases, the likelihood became a function
of the parameters, this is the Bernoulli(p), so the likelihood as a function of p.

For Normal(µ, σ²), the likelihood ended up being a function of both σ and
µ; σ shows up in multiple places, and µ also shows up in many places, so it is a function
of the parameters. So the likelihood is a function of the unknown parameters; the distribution
itself is known, you are assuming that and then writing down the likelihood.

Now, remember, if the samples change, this function will keep changing
also, but that is not of great interest to you, because you are looking at a particular sample,
evaluating the likelihood, and getting a function of the unknown parameters. Hopefully, this
is clear. For any other distribution I give you, any PMF or PDF, you should be able
to write down the likelihood; it is a basic skill in this area.

(Refer Slide Time: 12:03)


Next, we come to maximizing the likelihood. Maximum likelihood estimator, as the word suggests,
we are going to compute the likelihood and try to maximize it. Why would you maximize the
likelihood? It sort of makes intuitive sense: you are assuming the samples came from a
certain distribution whose parameters are unknown; the parameters could be
anything. You calculate the probability that you actually saw the samples, and then ask, out of all the
parameters, which parameter gave you the maximum likelihood for that particular sample?

And then you put that out. That seems like a very natural way to define an estimator, and that is
the ML estimator. So you have a particular sampling instance (𝑥1, 𝑥2, …, 𝑥𝑛), you
compute the likelihood function, and then you maximize the likelihood
over the parameters 𝛳1 and 𝛳2. And what is this arg doing?

So, when you maximize something, there are two possible outputs that
can come out of the maximization. Imagine a function of one variable with
some picture like this, and you are trying to find a maximum. So there are two
outputs: one output is the maximum value, and the other output is the argument of the maximum.

Where did the maximum occur? That is called the argument, arg. In this case, I am interested in
estimating the parameters, so I really want to know the arg: where did this maximum occur?
Which value of the parameters maximizes the likelihood of observing this particular sample?
Those are my ML estimates. So here is a function of one or more unknowns, and I have to be able to
maximize that function.

This is an important mathematical problem which you need a fair amount of calculus to solve
precisely. A lot of you keep asking questions like: why do you study limits? Why do you study
differentiation? Why do you study integration? This is where they help you.
Given a function of multiple unknown variables, knowing the function, how do you maximize
that function? How do you find the point at which it reaches a maximum?

It is absolutely vital, a very important skill to have in mathematics, and that is what you are
learning in Maths 2, is it not? So we are going to use that here. When you want to find the
unknown parameters of a distribution, you compute the likelihood of the samples and then find
the parameters which maximize that likelihood, and we are going to use calculus for finding this
maximum.
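The distinction between the maximum value and the argument of the maximum can be made concrete with a brute-force grid search (a minimal sketch; the quadratic test function below is made up, standing in for a likelihood):

```python
def grid_argmax(f, lo, hi, steps=100000):
    """Return (argmax, max): where the maximum occurs and its value,
    by brute-force search over a grid -- a stand-in for calculus."""
    best_x, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        v = f(x)
        if v > best_val:
            best_x, best_val = x, v
    return best_x, best_val

# example: f(x) = -(x - 2)^2 + 3 peaks at x = 2 (the arg) with value 3 (the max)
arg, val = grid_argmax(lambda x: -(x - 2) ** 2 + 3, 0.0, 4.0)
```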

(Refer Slide Time: 15:03)

So, now when you use calculus, it can happen that your calculus problem is very simple. So that
is one case, I mean you may remember some functions have very simple derivatives, some
functions have complicated derivatives. And when you work with some functions you can get very
easy derivatives and easy answers and you manipulate them, sometimes it will become very, very
ugly and unwieldy.
So, a closed-form solution is when you can find the maximum explicitly: you can say, for this
function, the maximum occurs at this value. For instance, a quadratic function is a great example;
given a quadratic function, you have an easy exact expression for where it reaches its
maximum value. So, that is a nice closed-form solution, but real life, it turns out, is a little more
complicated than that; for many distributions, you may not be able to find nice closed-form
expressions for where the maximum will occur.

So, in those cases, you have to rely on a numerical routine. There are many optimization packages
today: you give the function as a computer program, and these packages will find
the maximum value of that function. Maybe in your course lab exercises there will be some
examples of such functions for you. There are lots of Python packages which will do this.

So, optimization of functions of several variables is an important problem. It is a very

big mathematical area, but we do not need to get into it at great length. For us in this class,
we will need to maximize some very simple functions. So I will take those functions and go
through the maximization; in some cases we will get very simple answers, and in some cases we will
have to do a lot of manipulation. Some of you enjoy such manipulation; some of you get very
scared of it.

Let me assure you, we will not expect you to be great wizards of algebraic manipulation in this
course; we will give you simple problems, easy enough for the activity questions and
assignments. But if the problem becomes a little more complicated, you should at least know
how to use a computer package in order to solve it. That is an important skill. Soon
enough, when you go to work and meet such problems, you should be able
to solve them.

Even if you cannot do it by algebraic manipulation, you should know how to put it into a program
and get the answer. Usually, if a nice closed-form expression exists by algebraic
simplification, a computer program will also be able to find it. So it is good to learn how
to set it up as well.

So, this is the brief description of maximum likelihood. It is very elegant and simple to describe;
you will see, when we do examples, that it can become a little more complicated than it sounds.
But pay attention to this high-level idea of how the maximum
likelihood estimator is defined. We will jump into examples and problems soon enough.

(Refer Slide Time: 17:55)

The first example, as always the simplest, is the Bernoulli(p) distribution. So, you
have n iid samples distributed according to the Bernoulli(p) distribution, and
in one sampling instance you get 𝑥1, 𝑥2, …, 𝑥𝑛, where each 𝑥𝑖 can be 0 or 1. What is the likelihood? We
saw this before: the likelihood is the product of the PMFs. The PMF is p if 𝑥𝑖 is 1 and 1 − p if 𝑥𝑖 is 0. So, if w
denotes the number of 1s in the sample, the likelihood is 𝑝^𝑤 (1 − 𝑝)^(𝑛−𝑤); we also saw this
expression.

So, now the question is: how do you maximize this? How do you maximize this over p? The
main idea in maximization, the most popular approach, is
to differentiate and equate to 0. This is an
important approach; there are lots of theoretical criteria which assure you as to why
this works, when this works, etcetera.

At our level in this course, we will assume that this will generally work. Of course, there are
lots of special cases we do not go into; we will assume generally that this works. So, once
again, to maximize a function, whether it has one unknown or two unknowns, you
differentiate with respect to each unknown, equate to 0, and try to solve for the unknowns;
this is the idea.
So in this case, the likelihood is a function of p; p is your unknown, and you want to maximize
over p: find the p which gives you the maximum possible likelihood. So you want to
differentiate this L and equate it to 0. Now, here is another trick, and this second trick is very
important in ML estimation: maximizing L is the same as
maximizing log L. Note this; it is extremely important. Why is it?

Because log is an increasing function: if you manage to maximize log L, you would have also
maximized L, because log is monotonic, increasing all the time. So
instead of maximizing L you can maximize log L and you will get the same answer; think
about why that has to be true. This will simplify your work tremendously. Why is that? Notice
what log L is here: the log of a product can be written as the sum of the logarithms of the
individual terms.

So, in this case this will end up being 𝑤 log 𝑝 + (𝑛 − 𝑤) log(1 − 𝑝). Notice how nice this
expression looks when you want to differentiate it. I will call this function
h(p). Now I want to maximize this h(p), so I will differentiate
h(p) with respect to p and equate it to 0. When you differentiate, remember w and n − w are
constants; p is the only unknown here, and I want to maximize over p.

So, I am going to differentiate with respect to p. The w remains as it is, and the derivative of
log p is 1/p, so you get w/p. Then (n − w) remains as such, and for log(1 − p) the
function-of-function rule gives 1/(1 − p) times −1. So the derivative is
h′(p) = w/p − (n − w)/(1 − p), and this I am going to equate to 0.

So, if you simplify this, you can multiply throughout by p(1 − p) and move the second term to the
other side: you will get w(1 − p) = (n − w)p. Multiply it out and simplify:
w − wp = np − wp, so w = np, and you will get p = w/n.

So, it is a simple enough solution. What does that mean? For this function, I took the log
first, then differentiated and equated to 0, and got a value of p. This value of p is the p that
maximizes L. Now, it needs a bit of thinking and careful arguing as to why this has to be a maximum;
according to the theory, such a point can also be a minimum, but in this case it ends up being a maximum. We
will not bother too much with such mathematical care here, but you can show it in this case.

You can show that for p less than w/n, the derivative h′(p) is positive, and for
p greater than w/n, it is negative. So the function increases, reaches a maximum, and
decreases; you can show that. So this is the point at which it reaches its maximum: p = w/n.
This is a nice closed-form solution for this problem. So this is a nice and simple example where we
got an explicit closed-form solution.

You may not get such forms every time, but in this case, for Bernoulli(p), we were able to
get w/n. This w/n is pleasing in so many different ways, so let us carry on and describe
why this is very pleasing.
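The closed-form answer p = w/n can be sanity-checked numerically against a grid of candidate p values (a sketch using the lecture's Bernoulli sample):

```python
import math

samples = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
w, n = sum(samples), len(samples)

def log_likelihood(p):
    # log L = w log p + (n - w) log(1 - p)
    return w * math.log(p) + (n - w) * math.log(1 - p)

p_star = w / n  # the closed-form answer from differentiating

# numerical sanity check: no p on a fine grid does better than w/n
best = max((k / 1000 for k in range(1, 1000)), key=log_likelihood)
```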
(Refer Slide Time: 24:23)

So, ML estimation asks for the argument of the maximum of this likelihood function. How
do you do that? We go ahead, differentiate with respect to p, equate to 0, and solve, and we get p* to
be w/n; I just showed you how that was done. And that is equal to (𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛)/𝑛. How did I get

this? Why is w equal to 𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛? You can see it is exactly the number of 1s in
the sample: because each 𝑥𝑖 is 1 or 0, if you just add everything, you get the number of 1s in
the sample. It is easy enough.

(Refer Slide Time: 25:05)


So, how do you go to the estimator p̂_ML? You replace small x by capital X: you replace
𝑥𝑖 by 𝑋𝑖. Remember, we did the same thing in the method of moments:
we solved with respect to 𝑚𝑖 and then replaced 𝑚𝑖 with 𝑀𝑖 to get the method of moments
estimator. The same thing applies here. Solve with respect to a particular sampling instance,
so that everything is a number and easy for you to deal with.

And then you simply replace those sample instances with the actual sample random variables, and you
get an estimator. Remember, an estimator is a random variable, a function of the sample
random variables, so you should write it like that in a clean fashion. So, that is your ML estimator;
you can congratulate yourself, you have found your first ML estimator, and that is a
significant achievement, at least in my opinion.

So, that is easy enough, and look at how intuitive and nice it is: this is the same as the method of
moments estimator (MME), and it is also the sample mean. So it makes sense that the maximum likelihood
estimator in some sense agrees with our other intuitions about estimators.

So, let me conclude this lecture at this point. What we are going to carry on doing is
solve this ML estimation problem for many more distributions and see what we get. You will
get a wide variety of answers; these answers seemed very simple, so we will start with some
simple cases and then slowly build on them to get more and more complicated cases. You
will see the wide array of possibilities when you do this maximization. I will pick it up in
the next lecture. Thank you.
Statistics for Data Science 2
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Parameter estimation - Evaluation of ML estimators

(Refer Slide Time: 0:12)

Hello and welcome to this lecture. We are in the middle of evaluating ML estimators, and we are
going to continue with the Poisson distribution; we already saw the Bernoulli distribution, so let us
now see the Poisson distribution. All of you remember the Poisson distribution: from the pmf of
the Poisson distribution, if you find the likelihood of a particular sampling, it is going to be the
product of the pmf evaluated at each 𝑥𝑖, so it is going to be ∏ 𝑒^(−𝜆) 𝜆^(𝑥𝑖)/𝑥𝑖!.

Then, you can multiply all these things together, so you will have each of these terms. You see
this 1/(𝑥1! ⋯ 𝑥𝑛!): collect all of those together and keep them on the left. Why am I doing that? Because they
are not functions of my unknown parameter 𝜆; they depend on the sample, but whatever the value
of 𝜆 is, these factors do not change, they are the same. So, this part is sort of irrelevant for me.

Why? Because it does not change anything; for every possible value of 𝜆 it
is the same constant, so it does not depend on 𝜆. So, this part is irrelevant; in most cases people will
end up ignoring such parts, which are irrelevant in your maximization. How did I get
the 𝑒^(−𝑛𝜆)? There is 𝑒^(−𝜆) times 𝑒^(−𝜆), n times you are multiplying, so you get 𝑒^(−𝑛𝜆).
And how did I get 𝜆^(𝑥1+𝑥2+⋯+𝑥𝑛)? You multiply 𝜆^(𝑥1), 𝜆^(𝑥2), and so on together, and you get it. So, the constant part is
irrelevant, I can drop it; then I have 𝜆^(𝑥1+⋯+𝑥𝑛) 𝑒^(−𝑛𝜆). Only the part that is a function of the parameter matters to me, so
only that part is relevant; I have to maximize only that part of the likelihood function. The
remaining part stays the same whatever 𝜆 may be; you do not have to
bother with it.

(Refer Slide Time: 2:08)

So, now I am ready to maximize. Notice, first of all, that I have dropped the constant factor,
which is irrelevant, and taken log. When you take log you get

log L(λ) = (X_1 + X_2 + ⋯ + X_n) log λ − nλ,

since the log of e^{−nλ} is −nλ. Look at how simple the likelihood function finally became. My λ*
is just the arg max over λ of this expression.

So, how do I get λ*? I have to differentiate and equate to 0. If I differentiate, I get
(X_1 + X_2 + ⋯ + X_n)(1/λ) − n, because X_1 + ⋯ + X_n is just a constant, the derivative of log λ
is 1/λ, and the derivative of −nλ is −n. Equate that to 0 and solve:

λ* = (X_1 + X_2 + ⋯ + X_n)/n.

So, we know λ is the mean of the Poisson distribution, and the method of moments estimator also
gives us the same sample mean. So λ̂_ML is the sample mean, which is not very surprising, because λ
is the mean of the Poisson distribution. And it is the same as the method of moments estimator; it
is an interesting observation that for the Poisson it also works out quite simply.

But you notice there was a little bit of algebraic manipulation here. If you are a little confused
by that, it can take some time to understand, but it was actually a relatively simple procedure:
take the distribution, multiply it out, identify the parts that depend on λ, throw away the parts
that do not depend on λ, then take log. You get a very simple function, and you maximize it:
differentiate and equate to 0, take the argument of the maximum, and that gives you the estimator
directly; replace the small x with the capital X and you get your estimator. So, this is the recipe.

Any mathematical problem solving is actually very easy when you have a clear recipe; when you do
not have a recipe, you have to think about what to do, etcetera. But when you have a recipe, just
plug in and solve. So, this recipe is an important skill that you have to pick up in this course:
given a distribution with some unknown parameters, how do you get to the maximum likelihood
estimator? We will see quite a few examples; for interesting simple examples like this, you should
definitely be able to get to a closed form.
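As a sanity check on the Poisson recipe, here is a small stdlib-only Python sketch. The sampler uses Knuth's product-of-uniforms method and an arbitrary seed; both are conveniences for illustration, not anything from the lecture.

```python
import math
import random
from statistics import mean

# A Poisson(lam) sampler using Knuth's product-of-uniforms method
# (an assumption of convenience; any Poisson sampler would do).
def poisson_sample(lam, rng):
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)
true_lam = 4.0
samples = [poisson_sample(true_lam, rng) for _ in range(10000)]

# The ML estimator derived above: the sample mean (X1 + ... + Xn)/n
lam_ml = mean(samples)
print(lam_ml)  # close to the true lambda = 4.0
```

The printed estimate should land close to the true λ = 4, and it gets closer as the number of samples grows.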
(Refer Slide Time: 4:44)

The next thing we will see is the normal distribution. Here we have two parameters; this is the
first time we are going to see two parameters. You will see the method is easy enough, but since
there are two parameters it can be a bit unsettling. Let us get started. Again the recipe is
exactly the same; I am not going to deviate from the recipe at all, the exact same recipe every
single time.

The likelihood is the product of the density function of the normal distribution evaluated at every
sample X_i. Notice the simplification I have done: I have grouped the constant factors together and
raised them to the power n, and I have grouped all the exponents together, so

L(μ, σ) = (1/(√(2π) σ))^n exp(−(1/(2σ²)) Σ_{i=1}^{n} (X_i − μ)²).

Why do I do this? It just gives me a compact expression, and you can see it is the same as the
product; there is no difference. If you have not seen these kinds of manipulations before, it might
look a little new to you, but it is simple enough: e raised to a sum of exponents, it is not too
bad.

What is the next step? It is the maximization. When you do the maximization, you throw away the
irrelevant part and keep only the functions of the unknowns μ and σ in your hand, then take log.
You take log, and you get the function that you have to maximize; that is all I have done here. The
irrelevant part of the function, the (1/√(2π))^n factor, can be ignored, and when I take log I get

−(1/(2σ²)) Σ_{i=1}^{n} (X_i − μ)² − n log σ,

where the second term comes from the 1/σ^n factor.

That is the −n log σ term. So, I have a minus and a minus, and I need to maximize. Maximization of
the negative of something is the same as minimization of the positive version of it. Remember, this
is a negative quantity, so when you want to maximize a negative quantity you can minimize its
absolute value; that is what I am doing here. I have thrown away the minuses: each minus became a
plus, and the maximum became a minimum.

So, instead of maximizing I am going to be minimizing, but all these are minor manipulations;
either way we are going to differentiate and equate to 0. This just makes the work a bit easier. If
you do not care about that, you can keep the negative signs and keep the max, so it is the same as

arg max over μ, σ of −(1/(2σ²)) Σ_{i=1}^{n} (X_i − μ)² − n log σ;

both of these are equal. That is the claim I am making. Again, I hope you agree with me that every
step is quite simple; the recipe is clean enough to follow.

(Refer Slide Time: 8:00)


Now comes the derivative. I am not going to do the derivative in the lecture; I will leave it as an
exercise, but hopefully you see how to do it. First, differentiate with respect to μ alone,
treating σ as a constant. If you treat σ as a constant, the n log σ term goes away, the 1/(2σ²)
factor does not really matter, and you just have to differentiate the sum Σ (X_i − μ)².

If you want, you can expand each term as X_i² − 2μX_i + μ². If you differentiate with respect to μ,
you get −2X_i + 2μ; add it up over all i and equate to 0, and you end up with

μ* = (X_1 + X_2 + ⋯ + X_n)/n.

So, this is an exercise; look at how you will do it.

And for the other parameter, you differentiate with respect to σ and treat μ as a constant. Here I
cannot ignore the n log σ term, but the sum Σ (X_i − μ)² has become a constant, so when I
differentiate it just comes out as a factor. I am not going to go into the details of this
derivative; you can go through and check it yourself. With simple enough differentiation you will
get

(σ²)* = (1/n) Σ_{i=1}^{n} (X_i − μ*)²,

basically this whole sum divided by n; check it out.

So, again we just blindly followed the recipe and got very intuitive, clear answers: μ* is the
sample mean, and (σ²)* is almost the sample variance. The only difference is that the sample
variance was divided by n − 1, while here it is divided by n, a minor difference; but the sample
mean and (essentially) the sample variance are the natural answers we got by maximizing the
likelihood.

And this also agrees with MME, the method of moments estimation, in some sense. The mean part is
the same as MME, and for the variance you can look at the method of moments estimator for unknown
mean and variance; I think it is the same thing that you will get there also.

So, both the mean and the variance will agree with the MME; you can check for yourself that this
also agrees. We have seen that for the Poisson distribution and the normal distribution, the recipe
for calculating the ML estimator is quite simple: you just plug it in, differentiate, equate to 0,
and you get answers that agree with the sample moments. But the situation can get complicated
quickly. I am going to do a few more examples, and you will see that in the other examples things
go a little differently.
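The two-parameter recipe can be checked numerically the same way; here is a minimal sketch, assuming nothing beyond the formulas derived above (the seed and the true parameter values are arbitrary illustrative choices).

```python
import random
from statistics import mean

rng = random.Random(1)
true_mu, true_sigma = 2.0, 3.0
xs = [rng.gauss(true_mu, true_sigma) for _ in range(20000)]

n = len(xs)
mu_ml = mean(xs)                                # ML estimate of mu: the sample mean
var_ml = sum((x - mu_ml) ** 2 for x in xs) / n  # ML estimate of sigma^2: divide by n, not n - 1
print(mu_ml, var_ml)  # close to mu = 2 and sigma^2 = 9
```

Note the divisor n in the variance line; that is exactly the minor difference from the usual sample variance discussed above.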

(Refer Slide Time: 11:25)

So, let me make a few quick observations before proceeding further. The first observation is that
ML estimation is very popular. It is a good idea: in most cases it gives very reasonable results,
and it is appealing in many ways, since it maximizes the probability of seeing the particular
sample you observed.

One negative point people raise is that you need to know the distribution. Maybe that is a
limitation, but one can overcome it, and I want to tell you a couple of things in this regard.
First, even though in many of these cases we started with the distribution and derived the maximum
likelihood estimator, the final estimator is just a function of the samples.

So, whatever distribution you use, you will get some estimator, and after that you can just work
with that estimator. Even though you made one distributional assumption, that assumption does not
show up beyond a simple way in the calculation. Think about what that means: if you change the
distribution, you will get a different estimator.

In some cases the estimator will just be the sample mean, in which case it does not matter what the
distribution is; you get the same thing. All of this makes it very interesting: how does the
estimator depend on the distribution, does it change with the distribution, these are interesting
things that people study.

Another negative some people might raise is that to derive the actual estimator you need some
calculus. That is unavoidable; you need to use it. But hopefully I have convinced you that many of
the examples need only very simple calculus. Maybe the normal distribution needs a little more than
what you might consider simple, but even that, I think, once you get used to it, you will see is
quite simple.

So, there are lots of questions: how do these estimators look in general, are they always going to
be equal to the sample mean, is the ML estimator similar to or different from the MME, and how do
you compare the two of them? These are all very interesting questions; let us try to answer them in
the subsequent lectures.
Statistics for Data Science 2
Professor Andrew Thangaraj
Department of Electrical Engineering
Indian Institute of Technology, Madras
Parameter Estimation: Finding MME and ML Estimators

(Refer Slide Time: 0:12)

Hello and welcome. This lecture is devoted to a bunch of examples where I will derive both the MME
and the ML estimator, the method of moments estimator and the maximum likelihood estimator, for
different problems and show you the similarities and differences, and illustrate how these
estimators look and how they can differ, etcetera.
(Refer Slide Time: 0:32)

Let us start with the exponential distribution. For the exponential distribution, the method of
moments estimator is λ̂_MME = 1/X̄; we have seen and derived this before, so you know it. How do we
do ML? Remember, the pdf is f(x) = λ e^{−λx} for x ≥ 0.

So, the likelihood is going to be L(λ) = ∏_{i=1}^{n} λ e^{−λ x_i}, and let us simplify: multiply
these together and you get

L(λ) = λ^n e^{−λ(x_1 + ⋯ + x_n)};

it is turning out to be easy enough. So, how do we do this? You take log, so

log L(λ) = n log λ − λ(x_1 + ⋯ + x_n).

Now you differentiate and equate to 0: you get n/λ − (x_1 + ⋯ + x_n) = 0. From here you get
λ* = n/(x_1 + ⋯ + x_n), so

λ̂_ML = 1/X̄.

A simple recipe, really simple: the distribution is described simply as λ e^{−λx}, you plug it in,
you do the same calculation, and you get to the answer. So, here again both the method of moments
estimator and the ML estimator agree, which seems like a nice result to have; this is an easy
enough distribution.
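A quick numerical check of λ̂_ML = 1/X̄, in the same stdlib-only style; note that Python's expovariate takes the rate λ directly, and the seed and true rate here are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(2)
true_lam = 0.5
xs = [rng.expovariate(true_lam) for _ in range(20000)]

# lambda_hat_ML = n / (X1 + ... + Xn) = 1 / (sample mean), same as the MME
lam_ml = 1 / mean(xs)
print(lam_ml)  # close to the true lambda = 0.5
```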
(Refer Slide Time: 2:38)

So, we are now going to start seeing some slightly non-trivial cases. The first case is a discrete
one where, instead of having just the 2 values 0 and 1, I am going to have 3 values. What are the 3
values? 1, 2, 3. So the samples are either 1 or 2 or 3: 1 with probability p_1, 2 with probability
p_2, 3 with probability p_3. The first thing I want you to note is that because the 3 of them have
to add up to 1, we have p_3 = 1 − p_1 − p_2, and each of them will be between 0 and 1.

So, let us do MME first; I will do MME on the left side and ML on the right side. If you look at
MME, I am going to have 2 equations. The first sample-moment equation is

M_1 = p_1 + 2p_2 + 3p_3.

The other equation is

M_2 = p_1 + 4p_2 + 9p_3.

Those are the 2 equations. Now, I have to solve for the unknown parameters. Because of the
constraint p_1 + p_2 + p_3 = 1, I will take p_1 and p_2 to be the unknown parameters; I think I did
not say that early enough: p_1 and p_2 are the unknown parameters I have to find. So, here I can
substitute p_3 = 1 − p_1 − p_2, and these 2 equations become: putting p_3 = 1 − p_1 − p_2 in the
first, I get M_1 = p_1 + 2p_2 + 3(1 − p_1 − p_2), which simplifies to

M_1 = 3 − 2p_1 − p_2.

The second equation similarly becomes M_2 = p_1 + 4p_2 + 9(1 − p_1 − p_2), that is,

M_2 = 9 − 8p_1 − 5p_2.

So, that is 1 equation and this is the second equation; you have to solve these 2 equations. For
doing that, you can multiply the first equation by 5 and then subtract the second from it:
5M_1 − M_2 = (15 − 10p_1 − 5p_2) − (9 − 8p_1 − 5p_2) = 6 − 2p_1. So 2p_1 = M_2 − 5M_1 + 6, which
gives

p_1 = (M_2 − 5M_1 + 6)/2.

Then, from the first equation, p_2 = 3 − M_1 − 2p_1 = 3 − M_1 − (M_2 − 5M_1 + 6), so

p_2 = 4M_1 − M_2 − 3.

Let me just check this real quick: if p_1 = p_2 = p_3 = 1/3, then M_1 = 2 and M_2 = 14/3, and the
formulas give p_1 = (14/3 − 10 + 6)/2 = 1/3 and p_2 = 8 − 14/3 − 3 = 1/3; that is good, that
agrees. So, I have p_1 and p_2 in terms of M_1 and M_2, and when I go back and find my estimators,

p̂_1 = (M_2 − 5M_1 + 6)/2 and p̂_2 = 4M_1 − M_2 − 3.

This is my MME estimator. It required some work, but hopefully you are convinced it is not too
difficult: just setting up those equations and painfully solving them.

And what are M_1 and M_2? You remember: M_1 is just the sample mean, X̄ = (1/n) Σ_{i=1}^{n} X_i,
and M_2 is the second sample moment, (1/n) Σ_{i=1}^{n} X_i². So, I have expressed both the MME
estimators as, not complex, just linear functions of M_1 and M_2, and that is good enough.

So, how will ML look for this case? Notice this case is becoming slightly more involved than we
thought, but ML actually happens to be very easy. The likelihood is

L(p_1, p_2) = p_1^{n_1} p_2^{n_2} (1 − p_1 − p_2)^{n_3},

where n_1 is the number of 1s, n_2 is the number of 2s and n_3 is the number of 3s in the sample;
so the likelihood is really easy. If you take log you get

n_1 log p_1 + n_2 log p_2 + n_3 log(1 − p_1 − p_2).

Now, if you differentiate with respect to p_1 and p_2 and equate to 0, you will get these very
elegant and nice answers:

p̂_1 = n_1/n and p̂_2 = n_2/n.

Notice how nice that is: p̂_1 is the number of 1s in the sample divided by n, p̂_2 is the number of
2s in the sample divided by n, such simple, elegant expressions, a very nice extension of
Bernoulli. In Bernoulli, this is exactly what happened: p̂ was simply the number of 1s divided by
n, w/n.

So, we expect something like that when we have the values 1, 2, 3 with probabilities p_1, p_2, p_3:
p_1 should be the number of 1s divided by n, p_2 the number of 2s divided by n, which seems like a
reasonable estimator. But look at what the method of moments gave you: something really convoluted
and maybe confusing, while the ML estimator has given you very nice, intuitively pleasing answers.
Hopefully this gives you a comparison: even though the ML estimator may seem a little more
difficult in the beginning, and the MME maybe a bit easier, the ML estimator gives very interesting
answers in many cases; in this particular case it is very interesting.

You can actually extend this to any number of values: instead of just 1, 2, 3 you can have 1, 2, 3,
4 up to some known value, with each of the probabilities unknown. Given a sample, the ML estimate
of each probability, the probability of a particular letter in the alphabet, is the number of times
that letter occurred divided by n. This is a good principle to remember, and it has an easy proof
in some sense.

I did not show you the differentiation: you can treat p_2 as a constant and differentiate with
respect to p_1, and so on, and you will get these answers. It will need a little bit of work, but
you will get there.
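Both estimators for this three-valued distribution are easy to compare numerically; here is a sketch with arbitrary true probabilities, using the MME and ML formulas derived above.

```python
import random
from statistics import mean

rng = random.Random(3)
p1, p2, p3 = 0.2, 0.5, 0.3
xs = rng.choices([1, 2, 3], weights=[p1, p2, p3], k=20000)
n = len(xs)

# ML estimates: relative frequencies of each value
p1_ml = xs.count(1) / n
p2_ml = xs.count(2) / n

# MME estimates from the first two sample moments
m1 = mean(xs)
m2 = sum(x * x for x in xs) / n
p1_mme = (m2 - 5 * m1 + 6) / 2
p2_mme = 4 * m1 - m2 - 3
print(p1_ml, p2_ml, p1_mme, p2_mme)  # all close to 0.2 and 0.5
```

With this many samples both estimators land close to the true values; the contrast is in how simple and interpretable the ML frequencies are.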

(Refer Slide Time: 13:08)


Next we move on to another very interesting and simple-looking case where you will get different
answers with the method of moments and ML. In this case both are nice and simple-sounding
optimization problems, but you will see the answers are very different; later on we will come back
and compare these two as well.

So, X_1, …, X_n are iid Uniform[0, θ], a continuous random variable, and the density, as you know,
is 1/θ for x between 0 and θ, and 0 otherwise; this "0 otherwise" is very important, keep it in
mind. Let us start with the method of moments estimator; this is usually the easiest. You can
equate M_1 to the expected value of X, which is θ/2, and from here you get θ = 2M_1.

So, θ̂_MME = 2M_1 = 2X̄, 2 times the sample mean, a very easy estimator with the method of moments.
Again, is it intuitive? You are expecting a uniform distribution between 0 and θ, the mean is θ/2,
you compute the mean, you multiply by 2; seems reasonable.

Let us see what ML says; ML will tell you a completely different story. For ML, supposing you
observe samples x_1, …, x_n, what is the likelihood of these samples? It is very interesting; there
are two cases here. It is going to be 1/θ^n if all these x_i lie between 0 and θ, and it is 0
otherwise.

Let me give you a simple example with actual samples. Take n = 3, and suppose you got samples like
1/2, 1/2, 8; we are taking an extreme example. If you compute the MME here,
θ̂_MME = 2 × (1/2 + 1/2 + 8)/3 = 2 × 3 = 6; that is your θ̂_MME.

And notice 8 is bigger than 6, so this actually gave you an estimate for θ with which the sample
itself cannot agree: 8 cannot be observed if θ is 6. That seems a little bit ridiculous, but that
is what the method of moments estimator is giving you. On the other hand, if you were to maximize
the likelihood you have to absolutely avoid this: to maximize L we have to avoid the value 0.

Why should you avoid 0? 0 is the lowest possible value; it cannot be the maximum, since any
non-zero value is bigger than 0. So, you have to pick a θ such that the first condition is
satisfied: θ has to be at least the largest sample you saw. So we will require
θ ≥ max(x_1, …, x_n).

Once you do this, your likelihood will always be 1/θ^n and the 0 will not occur. So, with the
method of moments you are getting this case where you predict something that disagrees with the
samples in some sense; it is not a very nice estimator in that sense.
that sense.

So, the maximum likelihood estimator avoids that: you are going to pick θ greater than or equal to
max(x_1, …, x_n), so that the 0 case does not occur and the likelihood is always 1/θ^n, and you are
maximizing 1/θ^n over such θ.

Now, if I want to maximize 1/θ^n over θ, I have to pick the least possible θ, because if I keep
increasing θ, 1/θ^n keeps decreasing. The least possible θ with θ ≥ max(x_1, …, x_n) is the maximum
itself, so clearly

θ̂_ML = max(X_1, …, X_n).

So, this is the ML estimator for Uniform[0, θ]. Notice how seriously the method of moments
estimator disagrees with the ML estimator: the method of moments tells you 2 times the sample mean.
The ML estimator says: you observe n samples from Uniform[0, θ], you do not know θ, and your
maximum likelihood estimate for θ is the maximum of the observed samples. That maximizes the
likelihood; if you predict any θ below the maximum, your likelihood is 0, so why would you ever
predict that? It does not make sense; you have to predict θ̂_ML to be the maximum.

So, in this example θ̂_ML will be equal to 8; you predict 8, you will not predict 6 like the MME
estimator does. Notice how different the two estimators are; they are different in philosophy, but
the maximum is a very reasonable way to make a prediction for Uniform[0, θ]: you do not know θ, you
are observing a lot of samples, and the best prediction for θ is the maximum of those values, very
reasonable, and it ends up being the maximum likelihood estimate. So, an interesting example; let
us keep proceeding.

I do not know if you are the type who enjoys these kinds of nice mathematical arguments and getting
to the answer, but even if you do not enjoy the argument, look at the final expression: it is a
very intuitive expression for what the estimator should be for Uniform[0, θ], so hopefully that is
a bit more interesting for everybody.
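Here is a small sketch contrasting the two estimators, including the extreme three-sample example from the lecture; the true θ = 10 and the seed are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(4)
theta = 10.0
xs = [rng.uniform(0, theta) for _ in range(1000)]

theta_mme = 2 * mean(xs)   # method of moments: 2 * sample mean
theta_ml = max(xs)         # maximum likelihood: largest observed sample
print(theta_mme, theta_ml)

# The ML estimate can never contradict the data, but the MME estimate can:
# with samples [0.5, 0.5, 8] the MME gives 6, which is below the observed 8.
small = [0.5, 0.5, 8]
print(2 * mean(small), max(small))  # 6.0 versus 8
```

Note that θ̂_ML is always at most the true θ (it slightly underestimates), while θ̂_MME fluctuates around θ and can even fall below an observed sample.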

(Refer Slide Time: 19:50)

So, next is a discrete uniform distribution on {1, 2, …, N}, where I do not know N. Here also you
can do MME and you can do ML. I am going to give you the final answer for ML; you can guess it, it
is not so difficult:

N̂_ML = max(X_1, …, X_n).

I am not going to prove this in great detail, but you can see it: you observe a lot of samples, the
likelihood is 1/N raised to the power of the number of samples, you want N to be at least the
maximum that you observed, and to maximize the likelihood you pick N as the least possible value
satisfying that.

What will MME be? What is the expected value? It is uniform, so E[X] = (N + 1)/2. Set
M_1 = (N + 1)/2 and solve: N = 2M_1 − 1, so

N̂_MME = 2X̄ − 1,

which is related to X̄, and of a similar flavor to before. So, you see, again, in the uniform 1 to
N case the method of moments estimator and the ML estimator are very different.
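The same comparison works for the discrete uniform case; a stdlib-only sketch with an arbitrary true N and seed.

```python
import random
from statistics import mean

rng = random.Random(5)
N = 50
xs = [rng.randint(1, N) for _ in range(2000)]

n_ml = max(xs)               # ML estimate: the largest observed value
n_mme = 2 * mean(xs) - 1     # MME: solve (N + 1)/2 = sample mean
print(n_ml, n_mme)           # both near the true N = 50
```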

(Refer Slide Time: 21:33)

So far we have seen cases where I was able to give a closed form expression for the ML estimator.
Now we will see a couple of cases where we will not get closed form expressions; we will get some
equations, and I will say you have to solve them numerically. The method of moments estimator we
have seen before, so I am not going to reproduce it here; you can go back to one of the previous
lectures and look at the slides, where the method of moments estimator for the gamma distribution
is derived.

Let us look at ML. The likelihood is

L(α, β) = ∏_{i=1}^{n} (β^α / Γ(α)) x_i^{α−1} e^{−β x_i}.

If you want to simplify this, you can take log of each term and keep doing the additions and
subtractions as they work out, and you get this complicated-looking expression:

log L(α, β) = nα log β − n log Γ(α) + (α − 1) Σ_{i=1}^{n} log x_i − β Σ_{i=1}^{n} x_i.

Now I am going to differentiate with respect to β. Remember, when you do that you can treat α as a
constant, so you get

nα/β − Σ_{i=1}^{n} x_i = 0.

Notice that this equation is nothing but α*/β* = X̄, a very nice relationship: we know α/β is the
mean of the gamma distribution, so it has to equal the sample mean as well, a nice intuitive
relationship that must be satisfied by α* and β*.

For the other parameter, it turns out you will not get anything so nice; differentiating with
respect to α gives one very nasty-looking relationship:

n log β − n Γ′(α)/Γ(α) + Σ_{i=1}^{n} log x_i = 0.

Dividing by n to get a form similar to the first equation, you can rewrite this as

log β − Γ′(α)/Γ(α) + (1/n) Σ_{i=1}^{n} log x_i = 0.

Anyway, it does not matter exactly how you simplify it. These two equations together you have to
solve numerically; there is no way to write a clean closed form expression here. There are
computational procedures: you input these equations in a suitable form into a Python program,
loading some packages, and it will solve them and give you the answer.

So, this is the first case where we are not getting a convenient, neat and clean closed form
expression for the ML estimator; on the other hand, the method of moments estimator was very easy,
it gave you a closed form expression in terms of the sample moments. Here we are not getting that:
the function Γ′(α)/Γ(α) is involved, and it is not obvious how it works out at the end, but the ML
is still not very hard. Today, even when a closed form expression is not there, solving these kinds
of equations is not very hard; one can put them into a computer, it will crank a little and give
you the answer. There are many more examples like this, but the gamma is complicated enough for you
to see.

Once again, in this class we will not bother too much about complicated problem solving like this;
even when we do, we will expect you to do only some very simple differentiations, nothing more than
that.
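The two gamma equations can indeed be solved numerically in a few lines of Python. This sketch avoids external packages by approximating Γ′(α)/Γ(α) with a finite difference of math.lgamma and solving by bisection; the bracket [10⁻³, 100], the seed and the true parameters are arbitrary choices for illustration.

```python
import math
import random
from statistics import mean

def digamma(a, h=1e-5):
    # Numerical derivative of log Gamma; good enough for this illustration.
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

rng = random.Random(6)
true_alpha, true_beta = 3.0, 2.0  # shape alpha, rate beta
# Python's gammavariate takes the scale 1/beta as its second argument.
xs = [rng.gammavariate(true_alpha, 1 / true_beta) for _ in range(20000)]

xbar = mean(xs)
mean_log = mean(math.log(x) for x in xs)

# First ML equation gives beta = alpha / xbar; substitute into the second:
# f(alpha) = log(alpha / xbar) + mean_log - digamma(alpha) = 0.
def f(a):
    return math.log(a / xbar) + mean_log - digamma(a)

lo, hi = 1e-3, 100.0  # f is decreasing with f(lo) > 0 > f(hi)
for _ in range(200):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
alpha_ml = (lo + hi) / 2
beta_ml = alpha_ml / xbar
print(alpha_ml, beta_ml)  # close to alpha = 3, beta = 2
```

In practice one would use a library root-finder and an exact digamma function, but the structure is the same: reduce to one equation in α and solve it numerically.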

(Refer Slide Time: 27:09)

So, the next example is Binomial(N, p). You remember the MME; we have seen it before, so I am not
going to redo it, you can go back and check. The ML, when you do not know both N and p, becomes a
little more complicated. Let us see how the likelihood works. The likelihood is going to be

L(N, p) = ∏_{i=1}^{n} C(N, x_i) p^{x_i} (1 − p)^{N − x_i},

where C(N, x) denotes N choose x. This is a bit complicated; it is not that easy. If you start
simplifying, you get

L(N, p) = (∏_{i=1}^{n} C(N, x_i)) p^{x_1 + ⋯ + x_n} (1 − p)^{nN − (x_1 + ⋯ + x_n)}.

So this is the expression you get; it is a bit ugly, agreed.

You can take log now; sometimes it is useful to take log and see how it works out. You get

log L(N, p) = Σ_{i=1}^{n} log C(N, x_i) + (x_1 + ⋯ + x_n) log p + (nN − (x_1 + ⋯ + x_n)) log(1 − p).

You can try to differentiate with respect to p; it is ugly enough: you get a 1/p here and a
1/(1 − p) there, and if you write it down and simplify, you get the equation

N* p* = X̄.

So, I am doing a lot of simplification in my head: I am differentiating, equating to 0, and then
simplifying to get this. This is actually an expected result: the mean of the distribution is N
times p, so it should equal the sample mean. That is 1 equation, which is good. But if you start
differentiating with respect to N, you will get something that is really difficult to handle: how
are you going to differentiate log C(N, x_i) with respect to N? You can think about simplifying it
in some ways; it does simplify a little, but you will get a more complicated equation. So this
becomes very complicated, and I will just leave it at that.

I welcome you to try it, but you will not get anything too simple; there will be some
simplification, but it will not be all that nice. Getting to the answer here is a little more
complicated. I am not even sure how the answer will look at the end of the day; I have not tried
very hard to simplify it, but my guess is it is going to be more difficult than before. Maybe it
will come out to something interesting, but it is at least very complicated.
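One simple numerical way around the difficult N equation, not from the lecture but a common workaround, is a brute-force profile search: for each candidate N ≥ max(x_i), set p = X̄/N (the first ML equation) and keep the pair with the largest log-likelihood. The search cap of 200, the seed and the true parameters are arbitrary choices.

```python
import math
import random
from statistics import mean

rng = random.Random(7)
true_N, true_p = 20, 0.3
# Simulate 500 Binomial(N, p) samples as sums of Bernoulli trials.
xs = [sum(rng.random() < true_p for _ in range(true_N)) for _ in range(500)]

xbar = mean(xs)

def log_likelihood(N, p):
    if p <= 0 or p >= 1:
        return float("-inf")
    return sum(math.log(math.comb(N, x)) + x * math.log(p)
               + (N - x) * math.log(1 - p) for x in xs)

# Profile over N: for each candidate, the best p satisfies N * p = xbar.
best = max(range(max(xs), 200), key=lambda N: log_likelihood(N, xbar / N))
print(best, xbar / best)
```

The estimate of N obtained this way is known to be less stable than the other estimators in this lecture, which is consistent with the remark above that this problem is genuinely harder.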

So, we saw quite a few examples: simple examples like Bernoulli, Poisson, Exponential and Normal,
where sample moments naturally ended up being the answers, which was very nice to see. Then we saw
slightly more complicated examples, uniform distributions both discrete and continuous, and also
the simple discrete distribution on 1, 2, 3 with probabilities p_1, p_2, p_3, where ML and MME
started giving different answers, very different answers: MME usually was not that great, while ML
ended up being very good. And then these final types of examples, gamma and binomial, where
simplification with ML is really hard and MME ends up being much simpler; there you have to solve
numerically, which is a little more difficult.

You will see all these kinds of examples with other types of distributions as well. But the recipe
for ML is always the same: write the likelihood, throw away what is not necessary, take log,
differentiate with respect to the parameters, and try to solve; you get your ML estimate. Thank you
very much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Section 8
Properties of Estimators
Hello, and welcome to this lecture. We have already looked at two important methods for designing
estimators: one was the method of moments way of deriving estimators, the other is the maximum
likelihood approach. Both are interesting approaches, they give you different types of estimators,
and we saw a lot of their properties. So, we are going to continue looking at properties.

(Refer Slide Time: 00:37)

In particular, we will start looking at a property called consistency of estimators. We will not go
too deep into it; I will simply define this property and give a few examples of when consistency
holds and when it does not. This is a little too advanced for this course, but I think the word
consistency is important to know.

So, here is the setting: you have n iid samples from a distribution, there is an unknown parameter
θ, and there is an estimator θ̂ for it. We have seen this error random variable before: the error
is θ̂ − θ, a random variable. Generally we expect that as n increases, as you observe more and more
samples, this error should take only values close to 0; it should take very small values.

So, in some sense, we expect some convergent behavior: the error should converge to 0 in some sense
as n increases. How do you capture this mathematically? It turns out the following requirement
captures it conveniently.

Consider the probability that the absolute value of the error is greater than δ. This probability
is just a number between 0 and 1, and it depends on n; that is the most important part, you do not
want something independent of n. And as n tends to infinity, I want this number to go to 0; it
should become smaller and smaller with n.

So, what does that mean? If you observe more and more samples, as you keep observing more and
more samples, probability that your error goes outside of this – 𝛿, 𝛿 for any 𝛿, 𝛿 can be any small
number, so you can visualize if you like on the axis, so there is 0 and then there is – 𝛿, 𝛿 and I
want error to be only here, error should be here with probability approaching 1. So, probability
that it is outside of this should approach 0, so that is the condition.

So, if an estimator satisfies that condition, it is said to be a consistent estimator; this is
the context in which the word consistent is used. This requirement also has a technical name: it
is called convergence in probability. You will see this phrase in many statistics textbooks, and
this is what it means; if they say the error converges to 0 in probability, they mean exactly
this. So, consistency of an estimator, in the sense of the error converging to 0 in probability,
is a nice requirement to have.
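To make this concrete, here is a minimal Python sketch (my construction; the Uniform(0, 1) distribution, δ = 0.05, and the trial counts are illustrative choices, not from the lecture) that estimates P(|error| > δ) for the sample-mean estimator and shows it shrinking as n grows:

```python
# Monte Carlo estimate of P(|sample mean - mu| > delta) for iid
# Uniform(0, 1) samples (true mean mu = 0.5), at increasing n.
import random

random.seed(0)
mu, delta, trials = 0.5, 0.05, 2000

def prob_error_exceeds(n):
    """Fraction of repetitions where the sample mean misses mu by more than delta."""
    count = 0
    for _ in range(trials):
        m = sum(random.random() for _ in range(n)) / n
        if abs(m - mu) > delta:
            count += 1
    return count / trials

probs = [prob_error_exceeds(n) for n in (10, 100, 1000)]
print(probs)  # the estimated probabilities decrease toward 0
```

The estimated probabilities drop toward 0 as n increases, which is exactly the convergence-in-probability requirement described above.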
(Refer Slide Time: 03:34)

So, let us look at bias and consistency through an example. Here is the situation: I am looking
at n iid samples from some distribution, and the parameter I am interested in is the mean of the
distribution, μ = E[X]. One estimator I can think of is the sample mean,
𝑀1 = (𝑋1 + 𝑋2 + ⋯ + 𝑋𝑛)/n. This estimator is unbiased: we know that E[𝑀1] = μ, and we can
prove that quite easily.

And this estimator is consistent; the proof is simply the weak law of large numbers. Remember,
the weak law of large numbers essentially says that for this 𝑀1, the probability
P(|𝑀1 - μ| > 𝛿) is bounded, via Chebyshev's inequality, by a quantity that tends to 0 as n
tends to infinity.

So, this is the proof we did earlier using Chebyshev's inequality, and it shows consistency. In
fact, the weak law of large numbers says more generally, and this is an important fact to
remember, that sample moments converge to distribution moments, assuming they are all finite and
exist, as n tends to infinity. This convergence is in probability, in the sense we just looked
at.

So, this fact is very nice to remember: sample moments converge to distribution moments in
probability as n tends to infinity, and one can use it very effectively. Now, here is a slightly
modified estimator where, instead of dividing by 𝑛, I divide by 𝑛 − 1. Why? Just for fun, just
to show you that all sorts of combinations are possible.

This estimator is biased, not unbiased. Why? Because if you take the expected value, you get
nμ/(n − 1), which is not equal to μ, so it is biased. But even though it is biased, it is
consistent. So, bias and consistency do not always go together; it is not true that only
unbiased estimators are consistent. Here is a biased estimator which is still consistent.

And in fact, the other type of example is very easy to come by. Take a simple estimator like 𝑋1
alone: I estimate μ as 𝑋1 and throw all the other samples away. This is an unbiased estimator,
but it is not consistent: the error 𝑋1 - μ has some fixed distribution, and there is no
dependence on n at all; n does not even enter the picture, so there is no question of the error
going to 0 with n.

So, this is a very bad estimator: it is unbiased, which seems like a good property, but it is
not consistent. What is the point of this slide? It is to show that with bias and consistency,
all sorts of possibilities exist: there can be unbiased estimators that are consistent, unbiased
estimators that are not consistent, and biased estimators that are consistent.
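A rough simulation, under the same Uniform(0, 1) setup used earlier (an assumption for illustration, so that μ = 0.5), can exhibit all three behaviours at once:

```python
# Sketch: three estimators of the mean mu of Uniform(0, 1) samples.
import random

random.seed(1)
mu, n, trials = 0.5, 1000, 2000

m1_vals, x1_vals, biased_n2 = [], [], []
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    m1_vals.append(sum(xs) / n)   # sample mean: unbiased and consistent
    x1_vals.append(xs[0])         # first sample only: unbiased but NOT consistent
    ys = [random.random(), random.random()]
    biased_n2.append(sum(ys) / (2 - 1))  # divide-by-(n-1) at n = 2: mean is 2*mu, clearly biased

def avg(v):
    return sum(v) / len(v)

print(avg(m1_vals))                        # close to mu = 0.5 (unbiased)
print(avg(biased_n2))                      # close to n*mu/(n-1) = 1.0 (biased)
print(max(abs(v - mu) for v in m1_vals))   # small: sample mean concentrates near mu
print(max(abs(v - mu) for v in x1_vals))   # large: X1 does not concentrate
```

The sample mean concentrates around μ, the divide-by-(n − 1) estimator is visibly biased at small n, and the X1-only estimator stays spread out no matter how many repetitions are averaged.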

So, all these combinations are possible. You will see quite often in statistics textbooks, when
they discuss an estimator, that people worry about whether a biased or an unbiased estimator is
better, and whether or not it is consistent. This is a basic course, so we are not going to go
too deep into things like consistency, but the fact that all these combinations are possible is
something you should be aware of.

(Refer Slide Time: 07:47)

So, here are some general results on method of moments estimators and also on maximum likelihood
estimators, some general properties one can prove. We will not prove any of them in this class,
but it is good to at least know them. Usually, with method of moments estimators, if the
quantity you are estimating is itself a moment, then you get unbiased behaviour: for estimating
the mean, the method of moments gives you an unbiased estimator, since the sample moment is an
unbiased estimate of the corresponding distribution moment.

So, if the parameter is directly a moment, you get this good unbiased behaviour with the method
of moments. But in most of the other interesting cases, where the parameter is not directly a
moment, there will be bias in the method of moments estimator. You can go back and look at our
examples; in most cases we got biased estimators.

But on the other hand, you can typically expect them to be consistent. Why? Because sample
moments converge to distribution moments in probability; we saw before that this comes from the
weak law of large numbers. And how do you find the method of moments estimator? You equate
sample moments to expressions in terms of the parameters, and then invert to get the parameters
in terms of the sample moments.

Now, if that function of the sample moments is continuous, a good smooth continuous function,
then, just because sample moments converge to distribution moments, your MME will also converge
to the quantity you expect; you have convergence, and the error goes to 0.

So, they end up being consistent: you can show consistency for the method of moments estimator
if the function is continuous. That continuity is critical. There are cases where the function
is not continuous, where it goes off to infinity or has a break somewhere in the middle; in that
case you cannot rely on this argument. But in many cases, the estimator will end up being
consistent.
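As a hypothetical illustration of this continuity argument (my example, not one from the lecture): for Exponential(λ) samples, the mean is 1/λ, so the MME is λ̂ = 1/(sample mean), a continuous function of the first sample moment wherever the mean is positive; it is biased (E[λ̂] = nλ/(n − 1)) yet consistent:

```python
# MME for the rate of Exponential(lam): lam_hat = 1 / (sample mean).
# Since x -> 1/x is continuous at the true mean 1/lam > 0, consistency
# of the sample mean carries over to lam_hat.
import random

random.seed(2)
lam = 2.0   # true rate; the distribution mean is 1/lam = 0.5

def lam_hat(n):
    xs = [random.expovariate(lam) for _ in range(n)]
    return n / sum(xs)   # 1 / sample mean

print([round(lam_hat(n), 3) for n in (10, 1000, 100000)])  # settles near 2.0
```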

So, interestingly, even though the method of moments estimator may be biased, it is likely to be
consistent as well. This is possible, but one needs to check it: typically, if the estimator is
a continuous function of the moments, it ends up being consistent, and that is a good thing to
know. Now, maximum likelihood estimators have a rich set of theoretical properties, a lot of
interesting properties that they satisfy: you can show that they are consistent, and that the
bias vanishes in a limiting sense as 𝑛 grows.

So, asymptotically they end up being unbiased and consistent. Of course, many maximum likelihood
estimators are biased; we have seen that they end up being biased for one reason or another, but
the bias goes to 0. They also have a very interesting functional invariance property: if 𝜃̂ is a
maximum likelihood estimator for θ, then for any function g you directly get a maximum
likelihood estimate of g(θ).

So, you can just take g(𝜃̂) and you get the maximum likelihood estimator for g(θ). All these
kinds of powerful properties are why maximum likelihood estimators are very popular in theory;
you can prove a lot of very interesting things about them. I am mentioning these properties so
that you hear these words in these lectures and are not surprised when you see them; we will not
go into great depth or detail on any of them.

What we will do very quickly, to round out this properties section, is look at one popular way
of comparing estimators: computing the squared error risk, which I have also called the mean
squared error. Supposing you have two competing estimators, estimator 1 and estimator 2, you
find the squared error risk of each and pick the one that is better. This is quite common, and
it seems very logical and reasonable; it leads us to study this MSE, or risk, of estimators.

(Refer Slide Time: 11:38)

So, given the distribution, one can compute the risk. Quite often you can compute it
numerically, but it is also possible theoretically if you can do the expectation computation. I
am going to show you one example to highlight how two estimators can have very different risks,
very different mean squared errors.

So, I am going to look at Uniform[0, θ] samples and two different estimators for θ. One is the
method of moments estimator, which was just 2𝑀1. You can see its bias is 0, so the risk equals
the variance, and if you do the calculation, let me not repeat it here, you get θ²/(3n) as the
risk of this estimator.

Now, for the second one, the max of 𝑋1 to 𝑋𝑛, the calculation is a little more involved, but
compare the risk you get for the ML estimator with the risk of the MME: notice the huge factor
of 1/n. The ML estimator is a factor of 1/n better in mean squared error than the MME. Let me
repeat that: for Uniform[0, θ], the ML estimator is a factor of 1/n better than the method of
moments estimator. And n can be 1000; you may have 1000 samples, and 1/1000 is a huge factor of
improvement in the error.

So, this kind of calculation is really interesting when, for two estimators for the same
problem, you want to show that the risk of one is better than the other. Here we did the
calculation with a known distribution and everything worked out very neatly; it may not work out
the same way in general. Quite often the calculation becomes so involved that you cannot do
these things theoretically, cancel terms out, and figure out the answer.

So, what will work in most cases is a Monte Carlo simulation. If you want a quick and dirty
estimate, you vary n, say 𝑛 = 100, 𝑛 = 1000, 𝑛 = 10,000, 𝑛 = 1 lakh, and you simulate over and
over again: repeat the sampling, say, 1000 times, apply one estimator and then the other,
compute the risk from your simulation, plot it versus n, and see which is doing better.
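A sketch of that Monte Carlo recipe for the Uniform(0, θ) example above (θ = 1 and the trial count are illustrative choices):

```python
# Monte Carlo comparison of squared error risk for two estimators of
# theta with Uniform(0, theta) samples: the MME (2 * sample mean) and
# the ML estimator (max of the samples).
import random

random.seed(3)
theta, trials = 1.0, 5000

def risks(n):
    mse_mme = mse_ml = 0.0
    for _ in range(trials):
        xs = [random.uniform(0, theta) for _ in range(n)]
        mme = 2 * sum(xs) / n
        ml = max(xs)
        mse_mme += (mme - theta) ** 2
        mse_ml += (ml - theta) ** 2
    return mse_mme / trials, mse_ml / trials

r_mme, r_ml = risks(100)
# r_mme should track theta^2 / (3n) ~ 0.0033; r_ml is smaller by roughly
# a factor of order 1/n, matching the theoretical comparison above.
print(r_mme, r_ml)
```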

So, that is the Monte Carlo approach, and it makes for an interesting Colab exercise; we will
build a Colab notebook for it if we have time. This is also very important and interesting when
you compare estimators. If you want to show one estimator has better risk than the other, you
pick a risk of choice, usually the squared error risk is a pretty good one, and show that one is
better than the other: either theoretically, if you are well versed with these expectation
calculations, or using simulations.

Some people like the theory better, some people like the simulation better; either is good
enough, and you can compare for yourself and conclude which estimator is the better one. So,
that is the end of this lecture; hopefully you got a flavour of the topics that are coming. We
have seen two different approaches to the design of estimators so far, the method of moments
approach and the maximum likelihood approach, and we will see one more a little later.

But there are all these kinds of interesting properties that you can study for estimators, and
the most interesting, in my opinion, is the squared error risk: it really captures what you need
from an estimator in practice. You may not be able to calculate it in many cases, but then you
can at least do simulations and satisfy yourself that the estimators are working. Thank you very
much.
Statistics for Data Science II
Professor Andrew Thangaraj
Electrical Engineering Department
Indian Institute of Technology, Madras
Section 9
Confidence Intervals
Hello, and welcome to this lecture. In this lecture, we will look at what are called confidence
intervals; this is a slightly different approach to estimation, one can say. So far we saw
problems which are called point estimation problems: you are given an unknown parameter, give me
one value for that unknown parameter.

In some cases, maybe that is not what you are interested in; in some cases you are interested in
an interval. Can you give an interval and say: I do not know the exact value of my unknown
parameter, but I know that it lies inside this interval? That is the sort of thing that is done
with confidence intervals.

(Refer slide Time: 00:56)


Where do you see confidence intervals most of the time? Most of the time, you see them in
surveys and their results. Today the world is full of surveys, so many opinion polls here and
there; people are asked whether they like this or that, and then a conclusion is claimed. Here
is one conclusion you will find in a Gallup survey (Gallup is a pretty famous surveying
organization); you can find it on their website. It says 82 percent of Indians are willing to
take the COVID-19 vaccine.

So, how did they come up with this statement? They claim in their data that they asked 3045
people in the period from November 24, 2020 to January 8, 2021; many languages were used in
querying people, who were called on their mobile phones and asked whether or not they would take
the vaccine. And then there is this 95 percent; this number is what I am most interested in.
What does it mean?

So, this 82 percent they claim from the survey with a 95 percent confidence level, and the
margin of error is 3 percent. What does this mean? If you have not seen it before, it is not
clear, but any good survey will give this information. Take it from me: if this information is
not there, how many people participated, what the demographics of the people were, how they were
asked, the confidence level and the margin of error, then do not trust the survey. Either that
survey has not been done properly, or at least the results are not being communicated to you
properly.
Any proper survey has to share how many people responded, who the people who responded were, and
these two numbers: one is called the confidence level, the other is called the margin of error.
Why are these things important? Because there can be other surveys. Here is another survey,
through an app called LocalCircles; you can go on their website and find this survey as well.

It says 40 percent of Indians are willing to take the vaccine. This was done on January 25,
2021, around the same period, through a mobile app that people have to install if they want to,
and they claim there were 9628 votes. The demographics: people from 299 districts, so many from
tier-one cities, so many from tier-two, so many rural, etc.

That is the information they give. They do not give a confidence level, they do not give a
margin of error; I would have liked it if they had put that down, but they have not. But look at
the huge difference in the results put out by the two surveys: one says 82 percent, the other
says 40 percent. Something has to be wrong, and this is January of 2021; you know how India was
at this time.

So, you can decide for yourself which you trust more, the 82 percent or the 40 percent. But this
is the kind of thing you will see: many people put out survey results like this without any data
on how they got them. They will just say 60 percent of people believe this, 70 percent of people
believe that, and they will not give you the information on how they came to that conclusion.

So, you definitely have to look for that information, and in particular for these two numbers;
good surveys will give you the confidence level and the margin of error. What do they mean? Are
they related to estimation in some way? It turns out, yes. And in this lecture, we will quickly
see what these numbers mean and how they are calculated.
(Refer slide Time: 04:23)

So, all of this is tied to estimation of the mean using the sample mean, and to something called
a confidence interval. Here is a typical problem: 𝑋1 through 𝑋𝑛 are iid according to X, and I am
interested in estimating the parameter 𝜇 = 𝐸[𝑋]. I keep saying parameter, but you have to think
of it as a property of the underlying distribution, namely the mean.

A very common estimator is the sample mean, and we will stick to that; it is a good estimator
(it is the ML estimator, the MME, every estimator you can think of) for the mean. Now, remember
that 𝜇̂ is a random variable and E[𝜇̂] = 𝜇; all these things we know. Suppose somebody tells you
the following. If you want to picture the event, draw the line with 𝜇 on it, and mark 𝜇 - 0.03
and 𝜇 + 0.03 to form an interval.

The probability that 𝜇̂ lies in this interval is 0.95; that is what the statement
𝑃(|𝜇̂ − 𝜇| < 0.03) = 0.95 means. You can twist this around and also say that 𝜇 lies in the
interval [𝜇̂ - 0.03, 𝜇̂ + 0.03] with probability 0.95, where 𝜇̂ is a particular sample value.

So, this is 𝜇̂ in one sampling instance, and you can say it lies within [𝜇 - 0.03, 𝜇 + 0.03]
with 95 percent probability; those are the technical words people use to describe the situation.
Supposing in one sampling instance you got some value of 𝜇̂, 10, 20, whatever, you can put out
that value and say: this is the estimate of the mean with a margin of error of 3 percent at a
confidence level of 95 percent.

How did I get 3 and 95? It is because of the statement 𝑃(|𝜇̂ − 𝜇| < 0.03) ≥ 0.95; that is the
meaning of the confidence interval statement. You can see I have 95 percent confidence that I am
within 0.03 of the number I am putting out: I got some samples, I calculated their sample mean,
and at a 95 percent confidence level the actual answer is within 0.03 of it. That is the sort of
statement one makes with a confidence level and a margin of error.

So, let us generalize this a little. In general, if you can somehow prove that
𝑃(|𝜇̂ − 𝜇| < 𝛼) = 𝛽, where you imagine 𝛼 to be a small fraction and 𝛽 a large fraction, then
when you compute 𝜇̂ in one sampling instance (somebody gives you the actual samples and you
compute 𝜇̂), you say that 𝜇̂ is an estimate of the mean with a margin of error of 100𝛼 percent
at a confidence level of 100𝛽 percent.

So, we multiply β by 100 and call it a percentage, and likewise for α. This is the definition of
margin of error and confidence level. Hopefully it is clear: the confidence level is the
probability with which you are inside the interval, and the margin of error is the width of the
interval, how much it deviates from 𝜇 on each side.

(Refer slide Time: 08:40)


So, how do you find this α and β? X is some distribution given to you with CDF 𝐹𝑋. Supposing
the distribution of X is known, this CDF 𝐹𝑋 is known, so we know that P(X ≤ x) = 𝐹𝑋(x), the
standard definition. And what about 𝑃(|𝑋 – 𝜇| < 𝛼)? It turns out you can compute it using the
CDF; this is not very hard. It is 𝑃(𝜇 − 𝛼 < 𝑋 < 𝜇 + 𝛼); remember, X is continuous, so strict
and non-strict inequalities give the same probability.

So, this is the CDF evaluated at μ + α minus the CDF evaluated at μ - α; that is the actual
expression for this probability. Now, if you want this to equal β, what should α be? Again,
imagine μ on the line and the interval (𝜇 − 𝛼, 𝜇 + 𝛼); there is a probability with which X
falls inside this interval, and as you keep increasing α, that probability only increases.

If α is 0, this probability is 0; it strictly increases with α, and as α goes to infinity it
goes to 1. So, the function goes from 0 to 1 as α goes from 0 to infinity, and for any β you
will get a unique value of α. You keep increasing α, getting different probabilities, until you
hit 0.95, or whatever value of β you want; then you stop, and that is your α.

So, that is what is going on here; this is a pictorial representation of that function.
Sometimes you will have symmetry; then the probability on the left side equals the probability
on the right side, and you can simplify your calculation a little. I am not going to be detailed
there; it is not so important.

(Refer slide Time: 10:52)


So, the most common case is normal samples: you assume your samples are N(μ, 𝜎²) with 𝜎² known,
so you know the variance but not the mean. This is an assumption one can make; we will relax
knowing the variance soon enough, but as of now let us say you have normal samples with mean μ,
which is unknown, and variance 𝜎², which is known.

So, μ is unknown. What is my goal? I want to calculate these confidence levels and margins of
error for normal samples. At what margin of error, at what confidence level? Why normal? Because
normal is a very common distribution, thanks to the CLT and all that; that is why normal is the
first thing we look at.

So, the estimator is the sample mean; the sequence of arguments is actually very simple. What is
the distribution of 𝜇̂, the sample mean? It is again normal, with mean μ and variance 𝜎²/n.
This variance 𝜎²/n is very important; you can prove that 𝜇̂ is normal with mean μ and variance
𝜎²/n, and this is one of the very interesting properties of the Gaussian distribution.

So, what I do here is centre and normalize this distribution. How? I subtract the mean and
divide by the standard deviation, which is √(𝜎²/n) = 𝜎/√𝑛. So, I define
Z = (𝜇̂ − 𝜇)/(𝜎/√𝑛), and this Z has the N(0, 1) distribution, a very standard random variable
whose CDF is well known.
So, there are tables which will give you the CDF of N(0, 1), and computation packages will
definitely give you the CDF. Now, why is this important? Because in 𝑃(|𝜇̂ − 𝜇| < 𝛼) = 𝛽, you
can divide both sides of the inequality by 𝜎/√𝑛: on the left side you have
|𝜇̂ − 𝜇|/(𝜎/√𝑛), and on the right-hand side you have 𝛼/(𝜎/√𝑛). This does not change
anything; I am dividing both sides of an inequality by a positive quantity, so this equation is
the same as the original one.

But on the other hand, the quantity on the left is standard normal, so this is the same as the
probability that an N(0, 1) random variable lies between -𝛼/(𝜎/√𝑛) and 𝛼/(𝜎/√𝑛). We all
know the N(0, 1) density; it is the bell curve, you mark those two points on it, and the area
between them is β. As you vary α, this area keeps increasing.

Now, it turns out that if this quantity 𝛼/(𝜎/√𝑛) is 0.99, then you get β as 0.68; if it is
1.64, you get β as 0.90; if it is 1.96, you get β as 0.95. These values are calculated using the
CDF of the standard normal. So, this is how a typical confidence level calculation goes for
normal samples when the variance is known: it is very easy, you centre and normalize, and then
from this table you look up, say, the 95 percent confidence level, for which the factor
𝛼/(𝜎/√𝑛) becomes 1.96.
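These table values can be checked with the standard normal CDF; the sketch below uses statistics.NormalDist from the Python standard library (3.8+), so no external packages are needed:

```python
# beta = P(-z < Z < z) for Z ~ N(0, 1), at the three z values in the table.
from statistics import NormalDist

phi = NormalDist().cdf   # CDF of the standard normal

for z in (0.99, 1.64, 1.96):
    beta = phi(z) - phi(-z)
    print(z, round(beta, 2))   # 0.99 -> 0.68, 1.64 -> 0.9, 1.96 -> 0.95
```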
(Refer slide Time: 15:04)

So, how do you use this to find a confidence interval? Here is a concrete example. I am given 16
samples drawn from a normal distribution with unknown mean and 𝜎 = 3. We have to find a 95
percent confidence interval for μ. How do you find that? First, n is 16; you find the sample
mean, which is 10.06, and σ is 3. For a 95 percent confidence interval, β is 0.95, and using the
CDF of N(0, 1) we get 𝛼/(𝜎/√𝑛) = 1.96.

So, α is 1.96 times 𝜎/√𝑛, that is, 1.96 × 3/√16 = 1.96 × 3/4 = 1.47, and
𝑃(|𝜇̂ − 𝜇| < 1.47) = 0.95; that is the calculation. The 95 percent confidence interval is
[10.06 - 1.47, 10.06 + 1.47] = [8.59, 11.53], since 𝜇̂ ended up being 10.06.
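The same calculation as a short sketch, with the numbers from the example (n = 16, sample mean 10.06, σ = 3):

```python
# 95% confidence interval for the mean of normal samples with known sigma.
from math import sqrt
from statistics import NormalDist

n, mu_hat, sigma = 16, 10.06, 3.0
z = NormalDist().inv_cdf(0.975)   # two-sided 95% level -> 0.975 quantile, ~1.96
alpha = z * sigma / sqrt(n)       # margin of error
ci = (mu_hat - alpha, mu_hat + alpha)
print(round(alpha, 2), [round(c, 2) for c in ci])  # 1.47, [8.59, 11.53]
```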

So, what does this mean? Let me repeat once again. It means that if you were to repeat the
sampling over and over, one set of 16 samples, then another set of 16 samples, computing 𝜇̂ each
time, then the sample mean you get is more or less going to be within this kind of window with
probability 0.95; 95 percent of the time you will get values within it.

So, that is something you can show to be roughly true. The other way to think about it is that
the actual distribution mean is within this interval most of the time. It is a bit difficult to
imagine what that means, and normally people are not interested in repeated sampling; they ask,
why would I get more and more samples? I have this sample, you tell me what the answer is.

But those are difficult kinds of questions to answer, so this is the sort of guarantee one can
give: if you were to keep doing the sampling over and over again, 95 percent of the time, what
you get will be within this kind of window. That is the reasonably confident statement you can
make about these normal samples. So, this is how you find the confidence interval; this is how
you do that calculation.

(Refer slide Time: 17:37)


So, let us keep pushing ahead. There is this other situation where you may not know σ. So far we
have looked at the situation where the standard deviation, the variance, is known, and we were
able to quickly use standard normal tables to do the calculation. What if it is not known? If it
is not known, you have to use something called the t distribution. It is a slightly subtle
thing, so let me just go through the calculation.

So, you have n random samples with mean μ and variance 𝜎². You have the sample mean
𝑋̅ = (1/n) ∑ᵢ₌₁ⁿ 𝑋ᵢ, and the sample variance 𝑆² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (𝑋ᵢ − 𝑋̅)². This time I
use 1/(n − 1), the compensated version, to get an unbiased estimate, so E[𝑆²] = 𝜎².

We know some things about the distributions of these quantities: 𝑋̅ is normal with mean μ (not
0, sorry about that) and variance 𝜎²/n, and (n − 1)𝑆²/𝜎² is chi-squared with n − 1 degrees of
freedom. These are results we have seen before, without too much emphasis; and 𝑋̅ and S are
independent.

So, here is a very critical result which we have not seen so far, but which is very useful in
the context of confidence intervals. (𝑋̅ − μ)/(𝜎/√𝑛), when you know σ, is N(0, 1); we used
that before. When you do not know σ, what are you going to do? You are going to estimate it: S,
the square root of the sample variance, is going to be your estimate of σ.

Now, it turns out that if you compute (𝑋̅ − μ)/(𝑆/√𝑛) instead of (𝑋̅ − μ)/(𝜎/√𝑛), you do
not quite get a normal distribution; you get a t distribution. Remember, if σ is known and
precise, the ratio is standard normal; if σ is not known and you estimate it from the data by S,
you get the t distribution.

So, how does the t distribution look? Compared to the normal, it is generally a little more
spread out; as n goes up it tends to the normal, but otherwise its tails are heavier. This
t_n is a standard distribution, and you can assume its CDF is known; computer packages will give
it to you, just like the normal CDF, so the CDF of t_n can be assumed known in calculations.

So, notice the subtle difference here: (𝑋̅ − μ)/(𝜎/√𝑛), when σ is a known constant, is
normal; when you do not know σ and estimate it from the sample variance, you get some other
distribution, the t distribution. This is where the t distribution with n − 1 degrees of freedom
comes in.

(Refer slide Time: 21:15)

So, let us not try to understand this too deeply; I do not want to emphasize all the
calculations. But let us see how you go about normal samples with unknown variance, how you do
confidence intervals. This can happen a lot: you have samples, you want to assume they are
normal, but you do not know the variance. What do you do when you do not know the variance?

So, you can find the first moment, the second moment, and so on. Given a particular sample
instance 𝑋1 through 𝑋𝑛, you can estimate the mean and the variance: you have an 𝑋̅ and a 𝜎̂²,
the sample mean and the sample variance.

Now, 𝜇̂, which is just (𝑋1 + ⋯ + 𝑋𝑛)/n, is, I know, normal with mean μ and variance 𝜎²/n. And
the quantity (𝜇̂ − μ)/(𝑆/√𝑛), where S is the sample standard deviation we defined before (I
do not want to write the definition again), is going to be t_{n−1}; I know this.

So, if I have to get 𝑃(|𝜇̂ − 𝜇| < 𝛼) = 𝛽, this is approximately the same as dividing through:
on the left-hand side I divide by 𝑆/√𝑛, and on the right-hand side I divide by 𝜎̂/√𝑛, the
observed value. These two are not the very same quantity: one is a value from a particular
instance, the other is a random variable. But because of the weak law of large numbers and all
that, we can expect the two to be approximately the same.

So, this is the little trick that you use. If you know σ, you divide by 𝜎/√𝑛, and that is
great; if you do not know σ, you divide by the sample standard deviation. When you divide by the
sample standard deviation, you cannot be very sure, it is just approximate, but you can do
something like this.

So, 𝑃( |𝜇̂ − μ| / (S/√𝑛) < α / (𝜎̂/√𝑛) ) = β, and the quantity on the left has the 𝑡𝑛−1 distribution. So, the probability that the absolute value of a 𝑡𝑛−1 random variable is less than α/(𝜎̂/√𝑛) is β. This you can solve using the CDF of 𝑡𝑛−1. What do I mean by solve? Given β, find α/(𝜎̂/√𝑛). You can do this; we did the same thing with Normal(0, 1), and now we are doing it with 𝑡𝑛−1, that is the difference.
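One way to do this inversion numerically is with scipy (a sketch, assuming scipy is available; `scipy.stats.t.ppf` is the inverse CDF). By symmetry of the t distribution, 𝑃(|T| < c) = β means 𝑃(T < c) = (1 + β)/2.

```python
from scipy.stats import t

def t_half_width(beta, n):
    """Find c with P(|T| < c) = beta for T ~ t_{n-1}, via the inverse CDF."""
    return t.ppf((1 + beta) / 2, df=n - 1)

# For beta = 0.95 and n = 16 this is about 2.13, as in the lecture's example.
print(t_half_width(0.95, 16))
```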

(Refer slide Time: 23:41)


So, let us repeat this for the same normal samples we had before; this is the same set of normal samples. μ is unknown, σ is unknown; how do you find a 95 percent confidence interval for μ? How can you give an interval in which, in the repeated sampling sense, your μ will lie 95 percent of the time? So, let us compute: n is 16, and I am computing the sample mean here, adding up all of them and dividing by 16; I got 10.06.

I am computing the sample standard deviation; that comes to 3.297 for these samples. You can take these numbers and repeat the calculation yourself. If β is 0.95, you can use the CDF of 𝑡15 (that is 𝑡𝑛−1 here) and you will get α/(𝜎̂/√𝑛) to be 2.13. So, this is again the same calculation, except that the CDF is different: it is not Normal(0, 1), it is this t distribution. From here you can calculate α to be 1.76, and then you know that 𝑃(|𝜇̂ − μ| < 1.76) is approximately 0.95, so you can put out a 95 percent confidence interval and a margin of error.
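Using just the summary statistics quoted above (n = 16, sample mean 10.06, sample standard deviation 3.297), the whole interval can be reproduced; this is a sketch that assumes scipy is available.

```python
import math
from scipy.stats import t

n, xbar, s, beta = 16, 10.06, 3.297, 0.95
c = t.ppf((1 + beta) / 2, df=n - 1)  # t_{15} quantile, about 2.13
alpha = c * s / math.sqrt(n)         # half-width, about 1.76
print(f"95% CI for mu: [{xbar - alpha:.2f}, {xbar + alpha:.2f}]")
```

This prints an interval of roughly [8.30, 11.82], matching the 10.06 ± 1.76 computed in the lecture.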

(Refer slide Time: 24:53)


So, this is how confidence interval calculations are done. A very natural question is: what if the samples are not normal? You simply assume normality in some sense; the samples may not be normal, but at least the sample mean is going to be approximately normal by the CLT and all that, so there is a good assumption there.

And then, if the specific distribution is known, you can do better bounds. For instance, a very common situation is Bernoulli(p) samples. This is what happens in sampling surveys: you sample and ask a yes-or-no question, are you going to take a vaccine or not, yes or no. So, it is only a Bernoulli(p) answer, yes or no.

So, the variance of the sample mean will be 𝑝(1 − 𝑝)/𝑛, and it turns out you can upper bound 𝑝(1 − 𝑝) by 0.25: it is a quadratic with its maximum at p = 1/2, so you can check that it is less than or equal to 0.25. So, a common assumption made in survey samples is to take 𝜎̂/√𝑛 to be roughly √(0.25/𝑛).

So, σ²/n is roughly 0.25/n; you simply take a square root and use √(0.25/𝑛). For the 95 percent confidence interval, assuming a normal distribution, 1.96 is the quantile corresponding to β = 0.95, so you multiply by 𝜎̂/√𝑛, and your confidence interval is [𝜇̂ − 1.96√(0.25/𝑛), 𝜇̂ + 1.96√(0.25/𝑛)]. So, 1.96√(0.25/𝑛) is your margin of error.
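This worst-case margin of error is simple enough to compute directly; a minimal sketch in plain Python:

```python
import math

def margin_of_error(n, z=1.96):
    # Worst-case Bernoulli margin: z * sqrt(0.25 / n), using p(1 - p) <= 1/4.
    return z * math.sqrt(0.25 / n)

print(margin_of_error(1000))  # about 0.031, i.e. the familiar "3 percent"
```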
So, when somebody reports a 95 percent confidence interval with a 3 percent margin of error, this quantity came out to be 3 percent, that is, 0.03, for them. This is how survey sampling confidence intervals are typically calculated. Of course, there are lots of bells and whistles: some people add additional correction factors here, design factors and so on, to account for various things, but this is the basic idea.
(Refer slide Time: 27:16)

So, let us go back and revisit our survey sampling example: 82 percent willing in one study and 40 percent in another. n is quite large here, so if you go back and compute 1.96√(0.25/𝑛) with n = 3000 and n = 9000, you are going to get a narrower confidence interval for the 40 percent case than for the 82 percent case, but both confidence intervals are going to have a very low margin of error.
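As a quick numerical check on the two surveys (using the worst-case Bernoulli bound and the sample sizes of 3000 and 9000 mentioned above):

```python
import math

for n in [3000, 9000]:
    m = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n}: margin of error about {100 * m:.1f} percentage points")
```

Both margins come out under 2 percentage points, far too small to explain an 82 percent versus 40 percent gap by sampling error alone.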

So, what do you believe? Who do you believe? Here is where, I think, you have to look at these two and conclude that the populations they were surveying must have been different. India is a pretty big country, and you have to be very, very careful about who you pick, which segment, etcetera; they must have been sampling different segments of the population, because back in November to January you could not have had such a wide variation in the number of people willing to take a vaccination. So, any statistical study needs to put out confidence intervals and margins of error, but one also needs to look at the overall picture and see if everything makes sense. Thank you very much.
