Speaker 5

Note: This transcription document is a text version of the upGrad videos present in this
session. It is not meant to be read independently, but can be used to complement your video
watching experience.

Speaker: Himanshu Manroa

Okay. So, now let's come to the all-important, critical aspect of sampling and population. So, what we'll see in
this particular wrap-up topic is the concept of, you know, why do you need to do sampling, vis-a-vis the
population?

What's the relevance and why should we be picking up the right samples to be coming up with the right kind
of estimates or predictions for the data that we are looking into?

We'll also look at the goodness of the samples that we have picked by the central limit theorem. You know, so
what are the central limits within your sample sets that allows you to figure out, you know, is my sample right
enough, accurate enough. Does it represent the data?

And finally, we'll come to the final part of this entire topic, which will be about the accuracy or the strength of
your predictions. You know, so that tells us about, you know, how strong are you about, you know, the
confidence levels that you have in your predictions? So, you know, let's get on with these three final parts of
the topic: sampling and population, the central limit theorem, and confidence intervals.

Speaker: Mirza Rahim

So, before discussing inferential statistics further, there is a key idea we need to discuss: the idea
of a sample versus a population. Now, population refers to the entirety of the data and sample refers
to a small portion of it.

What does that mean? Let's say you wish to get the average weight of students of
mathematics. The population there would be all the students studying mathematics right
now.

And let's say limit it to just the country of India, right? Population would be all of these people,
these millions of people.

Now, if you want to get the average weight for the mathematics student, you would need to
actually study the population. If you want to study the population, you would want to get access
to the weight of every single individual, which is not practical, not feasible.

So, what you would do instead is work with a sample. Maybe sample some people from here,
some people from a different state and so on, and you get a much smaller number, but you
believe that, you know, this group represents the overall population.

And what we learned from this group should hold for the overall population. So, that is the idea of
a sample, which is of course, a small subset of the overall population.

So, in general, in a lot of cases, getting data for the entire population is either
prohibitively expensive or just not possible.

So, in those cases, which is most cases, we resort to taking a sample, deriving our insights
from the sample and trying to generalize them to the population.

So, the sample is a small subset taken from the population, and we typically use the sample to
make inferences about the population.

The hope is that if I've sampled correctly, if my sample represents the population, the things I
learned about the sample or things I learned from the sample hold for the overall population.

And this part of moving from a sample value to a population value, this bridge is filled by the
whole study of inferential statistics.
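
The sample-versus-population idea can be tried out with a small sketch. Everything numeric here is an assumption made up for illustration (a simulated population of 100,000 student weights with mean 60 kg and standard deviation 10 kg, and a sample of 500); none of it comes from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: weights in kg of 100,000 students (made-up numbers).
population = [random.gauss(60, 10) for _ in range(100_000)]

# A simple random sample of 500 students, drawn without replacement.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f} kg")
print(f"sample mean:     {sample_mean:.2f} kg")
```

With a properly random sample, the two printed means should land close to each other even though the sample is only 0.5% of the population.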

Speaker: Himanshu Manroa

Right. So, talking about samples and population, why is it so important when we talk about inferential
statistics? So, we discussed two concepts. The first is descriptive statistics, which is about things that have
happened in the past: your ability to diagnose or decode the data which already resides within your system.

The second and often more important part of statistics is your ability to estimate, predict and
forecast about, you know, the population that you are studying from your datasets. And that's all in the
larger concept of inferential statistics. That's when it becomes important to understand your larger population
and the samples.

Now, you know, talking from a business perspective, let's try to understand. You are a soap
manufacturer. You know, you want to come up with a new brand of soap, a new variant, and you want to
launch it to various markets, spread out all across India.

Now, you know, imagine the total number of soap users in India. It's practically 100%, right?

And now you want to conduct a study to understand their buying preferences.
And what is the likelihood of these, you know, existing soap users switching over to your new brand of soap?

Would it be possible for you to go across the entire country and understand the buying preferences of them
all? 100% of the population, you know, of this 1 billion plus country?

Definitely not possible at all, and it definitely does not, you know, serve the purpose.
That's the reason why, you know, what you would do is come up with a good sample,
which is representative of this entire country. Or you could maybe, you know, decide on a few markets that
you want to first begin your launch with. And then you want to understand the samples within those markets
or possibly states to give you a better read of the consumer insights.
And that's what, you know, the concept of sampling is all about.

Speaker: Himanshu Manroa

Now, having spoken about samples with regard to population, a very basic, fundamental question that comes
up, you know, in your mind when you are deciding the samples is, what is a good sample size?

You know, what's the size that I should go out with, which is best representative of my population, with the
population sizes being as huge as 1 billion in certain cases?

And that's where, you know, the central limit theorem comes to our rescue, with a good formula that helps us
understand, you know, what could be a good sample size.

Now, the principal idea we take from the central limit theorem is that, you know, you have arrived at a good
sample when the mean and the standard deviation of that particular sample are as close as possible to those of
the actual population.

You know, when there are known estimates for your population.
So, if you're covering a larger population, you know this is the kind of household levels that I understand
would be there within my larger population. And if your sample manages to achieve those mean levels,
those median levels, and those standard deviation levels, which means fewer fluctuations, you
know that you have picked up a good sample.

Speaker: Himanshu Manroa

So, to define this particular aspect of the central limit theorem much more scientifically, or in a formula manner, we
have this very simple formula,

which tells you that the standard deviation of your sample means, the standard error, is equal to the standard
deviation of your population divided by the root of your sample size, and that, you know, the mean of your sample
means is equal to the mean of your larger population. When that standard error is small, the mean of your sample
sits close to the mean of your population. That's when you would know that you have picked up a good sample size.

Let me repeat that again. You know, so for you to be able to pick up a good sample size, you know, you take
the square root of that sample size. You know, and the formula is the standard deviation of your population
upon the square root of the sample size that you have picked up.
If this value is small, you know, the mean of your sample will be close to the mean of your larger population, and
it means that you have picked up a good sample size. You can see it, you know, through various examples, which
allow you to determine the right sample size, where the means of the population and the sample are close together.

So, that's the formula that goes into it. Now, where would you use the goodness of these sample sizes, you
know, that you have decided to represent your larger population? Where do you use these kinds of
fundamentals?
You know, you would use this in, say exit polls, in opinion polls of elections. You know, so a good exit poll
sample size or opinion poll sample size would be the one where the mean is closer to the larger population.

Similarly, you know, we could have various other examples of say, you know, median household income levels
of say, for a telecom brand or a soap brand, or even for say an SUV launch that you want to go into. You want
to understand the spend or the disposable income levels.

And that's where, you know, you would apply the formula to understand, you know, whether
the mean of these income levels in the sample that you have picked up is close to what you had in
mind for the larger population. If it is, it means that you have picked up a good sample size for your survey.

Speaker: Siddhartha Roy

Okay, so the next thing we're going to talk about is the central limit theorem. And to understand
the central limit theorem, let's start with a completely random population distribution. Let me
make this larger, right. So you have some probability at x equal to 1 and 2, absolutely zero at 3,
so none of the population is at x equal to 3. There's a fairly large population at x equal to 4 and
then slightly less at x equal to 5. So I hope this part is clear. It's a completely random-looking
population distribution.

Now what we do next is take some samples of sample size n equal to 5. Let me just write
down the first sample, S1. It can take a value of 1, a value of 2, a value of 4, a value of 5, and
a value of 1 again. Let me take another sample, S2. It can take a value of 1, a value of 4
twice, 5 once and 1 again. Now, let's consider another sample,
S3. This time, let's say, there's no 1 at all. This is 3 times 2, 1 time 5 and 1 time 4. So what
I'm going to do next is, now that I have 3 samples, S1, S2, S3, take the
means of each of these.

So x1 bar is 13 by 5. This will come to 2.6. Similarly x2 bar would be 15 by 5, which is 3. Next, we have x3
bar, which is 15 by 5, so again 3, right. So what I've done here is, I've taken three completely random samples, and
I have found the mean for each of these samples and, as you can see, these means could vary. I
mean, if I keep doing this for, you know, another 100 samples, I'll keep getting different means.
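
The arithmetic for the three samples can be checked with a few lines of Python. The sample values are the ones read out above; the variable names `s1`, `s2`, `s3` and `means` are just labels chosen for this sketch.

```python
from statistics import fmean

# The three samples of size n = 5 described by the speaker.
s1 = [1, 2, 4, 5, 1]  # sum 13
s2 = [1, 4, 4, 5, 1]  # sum 15
s3 = [2, 2, 2, 5, 4]  # sum 15

# One mean per sample: x1 bar, x2 bar, x3 bar.
means = [fmean(s) for s in (s1, s2, s3)]
print(means)  # [2.6, 3.0, 3.0]
```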

So right now I have three samples. Let's say this is zero. I have one sample mean at 2.6. This is 2.6.
This is 3. I have 2 sample means at 3. Now, I've only considered 3 samples
of n equal to 5, right. If you keep doing this for, let's say, a thousand samples or even
more, let's say 10,000 samples, what you start observing is that when
you plot the frequency of these sample means, they start being represented
by a normal distribution graph with some mean and some standard deviation.

We said it's a completely random population distribution. We took three random samples, we took
their means and then we said we'll plot these means, or rather we'll plot the frequency of these
means, and we said that would end up being a normal distribution. So this obviously sounds very
weird, right. So, whatever I just told you, let's try and
understand this using a simulation tool.

The simulation tool is at onlinestatbook.com. As you can see, there's a graph here at
the top right. This is a normal distribution graph.
Let's not take that. Let's take a random graph. Okay, so I select custom. You can try this on your
own as well. So I selected some random graph, right. Some random population distribution, similar
to what I just showed you earlier. This is completely random in nature.

Now, when I click on this animated button, what it will do is pick up 5 sample
points, or 5 points, from this distribution. And why 5? Because the sample size, as you can see in the
third graph, is n equal to 5. So I
have set n equal to 5. So it will pick up five random points, you know, make a sample,
find the mean of that sample and show it in the third graph. Okay, so let's see, and I'll
show it once or twice so that it becomes clear.

Click on animated. It picks the five points, and then in the third graph, the blue one, you see the
mean of these five points. Yeah, let's do it again. It picked up some random five values, made
the sample, and the blue one is showing where the mean of these five
points lies. Let's do it again, right: again five random points and the mean of that sample.

Now, if you keep doing this, okay, what you will see in the third graph, where distribution of means
is written, is that you will start getting the frequency distribution of means. It's also called the
frequency distribution of sample means, because the means are of the samples, right. So when I hit,
you know, this 10,000, what is gonna happen is that instead of one sample at a time, it is essentially gonna take ten
thousand samples of n equal to 5.

I'm just gonna click 10,000. Now what you see is exactly what I told you, right. This graph is
now starting to look like a normal distribution. Okay, and let's focus on another important part.
Just look at the left of the first graph: you see the mean is equal to 17.33, right. You look at the mean
of the third graph as well, and you see it as 17.32, which means it is practically the mean of the initial
population, which was absolutely, you know, random in nature.

When I took some random samples of a particular sample size, in
our case 5, and plotted the frequency distribution of those sample means, what I ended up getting is a
normal distribution, right.

Now, let's do it another 100,000 times. You see it's starting to become more and more a normal
distribution. If you keep doing this, right, it will only end up looking more and more like a
normal distribution. Now, let's look at a variation of this. Earlier, we took just one n value, one
sample size, n equal to 5. Now we are considering two sample sizes, n equal to 5 and
n equal to 20, and we're gonna repeat the same thing we did. So just for once, I'm gonna show it as
animated. So five random points. You see the sample mean. Now you're taking 20 random points, and
you see the sample mean being shown at the bottom now, the blue line, right.

So look at the third and the fourth graph. The third graph is the frequency distribution of your
sample means when the sample size is 5 and the fourth graph is the frequency distribution of
sample means when the sample size is equal to 20. Now, let's do it 10,000 or 100,000 times.

Now you see the difference. The difference between the third graph and the fourth graph is the
fourth graph is a better normal distribution right, as compared to the third one. It's more of a
normal distribution than the third one, and that's what happens when you increase the size of n.
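
The same experiment can be approximated without the online tool. The sketch below is not the onlinestatbook.com simulation itself: it assumes a made-up population over x = 1 to 5 with no mass at x = 3 (the weights are hypothetical), draws 10,000 samples for n = 5 and n = 20, and summarises the resulting sample means.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical "random-looking" population distribution (assumed weights).
values = [1, 2, 3, 4, 5]
weights = [0.15, 0.20, 0.00, 0.40, 0.25]

# True population mean of this assumed distribution: 3.4
population_mean = sum(v * w for v, w in zip(values, weights))

def sample_means(n, repetitions=10_000):
    """Draw `repetitions` samples of size n and return the mean of each one."""
    return [statistics.fmean(random.choices(values, weights, k=n))
            for _ in range(repetitions)]

summary = {}
for n in (5, 20):
    means = sample_means(n)
    # (mean of the sample means, spread of the sample means)
    summary[n] = (statistics.fmean(means), statistics.pstdev(means))
    print(f"n={n:2d}  mean of means={summary[n][0]:.3f}  spread={summary[n][1]:.3f}")
```

Both rows should show a mean of means close to the population mean, and the n = 20 row should show a visibly smaller spread, which is the "thinner" curve seen in the fourth graph.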

Now, as you increase the size of n, okay, as you make the sample size larger and larger, you will start
seeing that the sampling distribution of sample means looks more and more like a
normal distribution, and you see that its mean is actually becoming more and more equal to the
mean of the population distribution, as you are increasing the size of n, right.

In the earlier case, it was 17.34. Now you see it as being 17.33. So this is essentially the central
limit theorem, right. There are two things in the central limit theorem. One is the fact that, when you
plot the frequency of the sample means, you get a normal distribution. That's the first part of your
central limit theorem.

And the second part concerns the mean of the population. Let's
say mu and sigma are the mean and standard deviation of the population. We see that the mean of the population
and the mean of the sampling distribution of the sample mean are the same.

So your mean of sampling distribution of sample mean, which is represented by mu x bar, is equal
to the mean of your overall actual population distribution, you started off with, and your standard
deviation of your sampling distribution of sample means, which is represented by sigma x bar, is
equal to sigma divided by root of n. Where sigma is the standard deviation of the entire
population distribution we started off with and n is your sample size, right.
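
Written out in symbols, using the speaker's notation (mu and sigma for the population, n for the sample size), the two results just stated are:

```latex
\mu_{\bar{x}} = \mu,
\qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```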

And think of this intuitively as well. Okay, let's go back to this. As the size of
n increased, right, from 5 to 20, what did you see? You saw that the normal distribution became thinner and
thinner, right.
Now what that essentially means is that your standard deviation, right, your deviation from your
mean, is becoming less and less, which again is shown by this equation, where it's inversely
proportional to the root of n. So, as your n increases, your standard deviation starts decreasing, right, and
your normal distribution starts looking thinner and thinner. So these two are very important
formulas that you'll have to, you know, remember while you're dealing with the central limit theorem.
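
A quick way to convince yourself of the second formula is to compare sigma divided by root n with the spread actually observed in simulated sample means. This is only a sketch under assumptions: the population (50,000 values drawn uniformly between 0 and 100) and the sample sizes 5, 20 and 80 are made up for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population (assumption): 50,000 values uniform on 0..100.
population = [random.uniform(0, 100) for _ in range(50_000)]
sigma = statistics.pstdev(population)  # population standard deviation

def observed_spread(n, repetitions=4_000):
    """Standard deviation of the means of `repetitions` samples of size n."""
    means = [statistics.fmean(random.choices(population, k=n))
             for _ in range(repetitions)]
    return statistics.pstdev(means)

results = {}
for n in (5, 20, 80):
    predicted = sigma / math.sqrt(n)  # the formula: sigma / root n
    results[n] = (predicted, observed_spread(n))
    print(f"n={n:2d}  predicted={results[n][0]:.2f}  observed={results[n][1]:.2f}")
```

Each time n is multiplied by 4, the predicted and observed spread should both roughly halve, which is what "inversely proportional to the root of n" means in practice.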

A slight variation of this is the variance, which is just the square of this: sigma squared divided
by n. It just comes from this. I'm just squaring both sides. So this is your central limit theorem.
I'm just gonna summarize the entire concept from, you know, where we started off,
because there's a lot of things here.

We started off with an absolutely random population. We said its mean is mu and
its standard deviation is sigma. We took three or four samples of a particular sample size, n equal
to 5. We found their means and then we said, let's plot these sample means on a curve, on a plot,
and let's plot their frequencies, and we said that by the central limit theorem, the frequency
distribution of sample means is going to be a normal distribution. And not just that, we also said
that its mean is equal to the mean of the population, and its standard deviation is given by the standard
deviation of the population by root of n. So this, in a nutshell, is the central limit theorem.

Speaker: Siddhartha Roy

Okay, so the next topic we are going to talk about is called the confidence interval. It's often
represented as CI. So let's take an example. Let's say our population is the weight of all
men in India older than 60 years. Now this population would be in millions, right, it's a fairly
large population size, and we are saying that the mean of this
entire population is mu and the standard deviation is sigma.

Now it's impossible to know where the mean of this entire population lies, right,
because it's such a large population. So if you actually have to go and, you know, look at all the
people above 60 years, find their weight and then find the mean, it's an impossible scenario,
right.

So let's say you are asked to find where
the mean lies, right. Rather than looking for the
exact value of the mean mu, what is generally done is we look for a range, or an interval,
okay, where this value of the mean, this mu, could lie. This interval
within which the mean would lie somewhere is called the confidence interval.

So the confidence interval is an interval, right. It's some range of values where the true mean of the
population is expected to lie, and I'll explain how we come to this entire concept of the confidence interval.

Now we know that from the central limit theorem, that the sampling distribution of sample means
will be a normal distribution curve. So let's just draw this normal distribution. Now this curve, as
you can see it's a normal distribution curve. This is the curve of the sampling distribution of
sample means right, and we know that by the central limit theorem, there are two things which
are established.

One is that, let's say the mean of this is mu x bar (this is how it's generally denoted), mu x bar
is equal to mu, which is the mean of the population we started off with, right. And the other is that
the standard deviation of the sampling distribution of sample means, denoted by sigma x bar,
is sigma by root of the sample size, where this sigma is again the standard deviation of the
population.

Now, what we also know by the rule which we had discussed earlier, is that if I take this, so let
me denote this as mu itself, because mu x bar and mu are the same. I'm just going to
denote this by mu plus 2 sigma x bar and this is mu minus 2 sigma x bar.

Now we know from an earlier rule that this area, this blackened area, is 95 percent. Now what is
this again, just think about what this graph is? This graph is a distribution of all the sample means
of this population right, if you take samples of a particular sample size, if you find their means and
if you then plot these means, this is the distribution graph you get and we are saying that 95
percent of these sample means are going to lie between mu plus minus 2 sigma x bar. Is that
clear?

I'm just going to repeat this. What we can say is that 95 percent of all sample means are going to
lie between mu plus and minus 2 sigma x bar, right. It will actually be mu x bar, but we know that mu
x bar is the same as mu, so 95 percent of all sample means lie in the range of mu plus or minus 2
sigma x bar, which is also an interval, right. What is this interval? They lie within this interval.
Let me use a different colour. They lie within this interval.
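
The 95 percent claim can be checked by simulation. The sketch below assumes a hypothetical normal population (mu = 70, sigma = 12) and a sample size of 36, so that sigma x bar works out to exactly 2; these numbers are illustrative only and are not from the lecture.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population parameters and sample size (assumptions).
mu, sigma, n = 70.0, 12.0, 36
sigma_x_bar = sigma / math.sqrt(n)  # 12 / 6 = 2.0

# Means of 5,000 independent samples of size n.
means = [statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
         for _ in range(5_000)]

# Fraction of sample means that fall inside mu +/- 2 * sigma_x_bar.
lower, upper = mu - 2 * sigma_x_bar, mu + 2 * sigma_x_bar
inside = sum(lower <= m <= upper for m in means) / len(means)
print(f"fraction of sample means within 2 standard errors: {inside:.3f}")
```

The printed fraction should come out at roughly 0.95, matching the shaded area under the curve.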

So, let's take a sample mean here. Let's call it x bar, and let's use another terminology here, where
the standard deviation sigma x bar is denoted by SE, which is the standard error. Now what we are saying
here is that your sample mean x bar is at a distance of less than or equal to two standard errors from mu,
and we can say that with 95 percent confidence, right. Again flowing from the entire concept
we discussed, of, you know, the area being 95 percent, the red line and everything, we're saying that
your sample mean will always lie, not always, will lie with 95 percent confidence at a distance
of less than or equal to two standard errors from the population mean.

Now, if you just invert this statement, okay, the population mean will lie at a distance of
less than or equal to two standard errors away from the sample mean. Let's try and draw this
interval of two standard errors away from the sample mean. I'll use the same green line, and, you
know, let's just assume this is the interval. So this green interval, or this green line you see, is the
95 percent confidence interval.

And let's kind of write this down. So we're saying that the 95 percent confidence interval is given by x
bar, which is the sample mean, plus or minus 2 times the standard error. Why plus or minus? Because it
extends on both sides, right. Let's just write the standard error in a format we know: sigma by root n, where
sigma is the standard deviation of the population and n is the sample size.

So this is your 95 percent confidence interval, and what we understand from, you know, this 95
percent confidence interval, let me just draw it, is that for your true population
mean, which was mu, there is a 95 percent chance that this mu will be lying somewhere in this particular
interval.
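
The interval described above, x bar plus or minus 2 times sigma by root n, is easy to compute. The function name `confidence_interval_95` and the input numbers (sample mean 68, population standard deviation 12, sample size 36) are hypothetical choices for this sketch, and the multiplier 2 is the speaker's rule of thumb.

```python
import math

def confidence_interval_95(sample_mean, population_sd, n):
    """Return (low, high) for x_bar +/- 2 * sigma / sqrt(n)."""
    standard_error = population_sd / math.sqrt(n)
    margin = 2 * standard_error  # "2" is the rule-of-thumb multiplier
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: standard error = 12 / sqrt(36) = 2, margin = 4.
low, high = confidence_interval_95(sample_mean=68.0, population_sd=12.0, n=36)
print(low, high)  # 64.0 72.0
```

Note that the formula needs the population standard deviation; when it is unknown, practitioners usually substitute the sample standard deviation, which the lecture does not cover.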

So that is the significance of your confidence interval, because you are eventually saying that from
your sample means, you are evaluating, or you are saying, with a certain percentage of
confidence that your true mean, true population mean is lying within a particular range or a
particular interval, which is called as a confidence interval.

Speaker: Himanshu Manroa

And now, after having understood the concept, let's start talking about business relevance, you
know, examples where you can put all this understanding to practice.

When would you be able to talk about these confidence levels or confidence intervals within your existing data?

So, take for example, biological studies. You know, one particular biological study wants to attempt to
look at the heights and weights of the tiger population in, say, Ranthambore. You know, so your biologists want to
conduct these studies. They are aware of, you know, what would be, as a benchmark, the mean height or
the mean weight for the entire tiger population in the Ranthambore sanctuary.

Now we want to do this, you know, specific biological study for about 50 tigers within Ranthambore. So, that's
done, and now you would analyse the results, you know, of these 50 tigers for their heights and weights.
And, you know, when you look at those results, when you plot those results, you look at the mean values.
You look at the standard deviation, you look at the entire spread, the normal distribution curve. Now if 95% of
your data lies within two sigma levels, you know, the biologists would be able to say with 95% confidence levels,
that I'm sure about my findings.
They represent, you know, with 95% confidence levels, the larger population.

So, that's what you would do. You know, you would continue comparing your results of a particular sample to the
larger population and the fluctuations or the deviations of your results. And you would come up with confidence
interval levels. You know, say 95% or 99%, that's what you would intend to do.

You could do, you know, come up with similar kind of interpretations for, say clinical studies. I have come up with
a new drug as a pharma manufacturing company, which has the potential to increase the hemoglobin levels
within my patients.

Now, how would you test this out and how would you, you know, convey the test results with much more
confidence levels? A similar kind of way, you know, where you have conducted your entire test within a sample
size, you would look at the actual results, the fluctuations, the way they are spread out and that allows you to,
you know, convey with confidence levels, I am 99% confident about my data and the findings coming out of this
data, because, you know, I'm sure about the spread of this particular data.

You know, there are more examples, like say the ad evaluation study. You know, what is my, you know, I'm a
marketing team or an advertising team. I've come up with a new ad and I want to see, you know, how would
people react to this ad, you know, on basic parameters?

Would they like it? Would they remember it? Would they associate with my product when they look at this ad?
So, you know, all of these findings you would do in a particular sample size, and then you would be able to
convey with confidence levels of, say, 95% or 99%, you know, what is the likelihood of people liking this ad on
various parameters.

And I can say this with 99% confidence levels, because I know the fluctuations within my data are within these
ranges of either two sigma or three sigma. So, those are the various kinds of examples that you can derive out of
confidence intervals about your data.
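
For reference, the link between "sigma levels" and confidence levels under a normal distribution can be computed with Python's standard library. This is a side calculation rather than part of the lecture: it shows that two sigma covers about 95.45% and three sigma about 99.73%, while exactly 95% and 99% correspond to roughly 1.96 and 2.58 sigma.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def coverage(k):
    """Probability that a normal value lies within k sigma of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")

# Multipliers that give exactly 95% and 99% two-sided coverage.
z95 = z.inv_cdf(0.975)
z99 = z.inv_cdf(0.995)
print(f"95% multiplier: {z95:.3f}, 99% multiplier: {z99:.3f}")
```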

Speaker: Himanshu Manroa

Okay. So, we have come to the end of this course. And I would say it has been a wonderful journey taking you
through the entire concepts of a basic introduction to statistics. What is the aspect of business statistics and why
is it so relevant for all of us who want to, you know, take these fundamentals of statistics and bring it to practice
within our organisations, within our functions?

So, we understood the concepts of descriptive statistics, things that have happened in the past, the data that
you already have. We have looked at the concept of inferential statistics, a larger part where you want to
conduct new surveys, new understanding to find out, predict, estimate, forecast more about things that are
likely to happen in the future.

We have spoken about measures of central tendency and spread, you know, a refresher of concepts like mean, median, mode,
variance, standard deviation. You know, these statistical concepts that were so far lying within our books of
statistics, why are they so pertinent to the businesses of today?

You know, a basic understanding if we have about these terms can tell us so much about our existing data sets
and the data sets that we are constantly collecting from the market to understand our business much better.

We went to the theory of probability and probability distribution, because think about it, you know, businesses
are all about probabilities. All you want to do at times is, you know, say with the degree of certainty or probability
of things likely to happen or not.

Maybe the churn levels, the attrition levels, your customer conversion rates, multiple applications where what
your management wants to hear from you is with a degree of confidence, what is it that's likely to happen. And
what can you do from a business perspective, you know, to make that happen or to enhance that, or derive most
benefits out of those likelihood in your businesses?

So, that's where the concepts of probability come to us and we can implement it. And finally, we ended up with
the fundamentals of what is a population when you want to understand your consumer and what is the concept
of a sample.

You can't be reaching out to your entire population whenever you want to gather consumer insights or listen to
your consumers. And that's where we have the concepts of sampling. The various kinds of sampling, be it
probability samples or non-probability samples.

What is the central limit theorem, you know, which allows you to pick up the right sample size to
come up with consumer insights that represent the larger population? And finally, we ended with confidence
interval levels, which is about how confident you are about your sample being able to predict results for the
larger population.

So, in a nutshell, the entire gamut of statistics, you know, and what it intends to do is, of course, you know,
allow you to take much more data-driven, informed decisions about your functions or roles. And that
allows you to study your consumer much more effectively. Thank you. It's been a wonderful journey.

Disclaimer: All content and material on the upGrad website is copyrighted, either belonging to
upGrad or its bonafide contributors and is purely for the dissemination of education. You are
permitted to access, print and download extracts from this site purely for your own education
only and on the following basis:
● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disk or to any other storage
medium, may only be used for subsequent, self-viewing purposes or to print an
individual extract or copy for non-commercial personal use only.
● Any further dissemination, distribution, reproduction, copying of the content of the
document herein or the uploading thereof on other websites, or use of the content for
any other commercial/unauthorised purposes in any way which could infringe the
intellectual property rights of upGrad or its contributors, is strictly prohibited.
● No graphics, images or photographs from any accompanying text in this document
will be used separately for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or upGrad content may be reproduced or stored in any
other website or included in any public or private electronic retrieval system or
service without upGrad’s prior written permission.
● Any right not expressly granted in these terms is reserved.
