8614 saba 2nd
ASSIGNMENT NO 2
SEMESTER: 3RD
Central Tendency
In statistics, central tendency is a descriptive summary of a data set. It uses a
single value to reflect the centre of the data distribution. It does not provide
information about individual observations in the dataset; instead, it summarises the
dataset as a whole. The central tendency of a dataset is generally described using
one of several statistical measures.
Definition
Central tendency is a statistical measure that represents an entire distribution or
dataset with a single value. It aims to provide an accurate description of all the
data in the distribution.
• Geometric Mean
• Harmonic Mean
• Weighted Mean
If all the values in the dataset are the same, then the arithmetic, geometric and
harmonic means are all equal. If there is variability in the data, then the three
means differ. Calculating the mean is straightforward.
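The relationship between the three means can be checked with a short Python sketch. The two datasets below are hypothetical illustrations, not taken from the assignment, and the helper name `three_means` is made up for this example.

```python
from statistics import mean, geometric_mean, harmonic_mean

def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return mean(values), geometric_mean(values), harmonic_mean(values)

identical = [4, 4, 4, 4]  # hypothetical data with no variability
varied = [2, 8]           # hypothetical data with variability

print(three_means(identical))  # all three means coincide
print(three_means(varied))     # the three means differ
```

For the varied data the arithmetic mean is the largest and the harmonic mean the smallest, which is the usual ordering for positive values that are not all equal.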
Median
The median is the middle value of a dataset that has been arranged in ascending or
descending order. When the dataset contains an even number of values, the median is
found by taking the mean of the two middle values.
Consider a dataset with an odd number of observations arranged in descending
order: 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2
Here 12 is the middle value, or median: it has 6 values above it and 6 values below
it.
Now, consider another example with an even number of observations arranged in
descending order: 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17
The two middle values in this dataset are 29 and 27. The median is the mean of
these two numbers,
i.e., (27 + 29)/2 = 28
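As a quick check of the two worked examples, the following sketch uses Python's standard `statistics.median`, which sorts the values and averages the two middle ones when the count is even. The variable names are chosen here for illustration.

```python
from statistics import median

# The two datasets from the examples above
odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even_data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

odd_median = median(odd_data)    # middle value of 13 observations
even_median = median(even_data)  # mean of the two middle values, 27 and 29

print(odd_median, even_median)
```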
Mode
The mode is the most frequently occurring value in the dataset. A dataset may
contain multiple modes, and in some cases it does not contain any mode at all.
Since the mode represents the most common value, the most frequently repeated
value in the given dataset, 5, is the mode.
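The three cases (one mode, several modes, no mode) can be sketched as follows. The datasets are hypothetical, the function name `find_modes` is made up for this example, and the code assumes the convention that a dataset in which no value repeats has no mode.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: several distinct values, none repeated, means no mode
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(find_modes([3, 5, 5, 7, 5, 9]))  # one mode
print(find_modes([1, 1, 2, 2, 3]))     # two modes
print(find_modes([1, 2, 3]))           # no mode
```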
The appropriate measure of central tendency is selected based on the properties of
the data.
• If you have a symmetrical distribution of continuous data, all three measures of
central tendency hold good. Most of the time, however, the analyst uses the mean
because it involves all the values in the distribution or dataset.
• If you have a skewed distribution, the best measure of central tendency is the
median.
A measure of central tendency is a number used to represent the center or middle
of a set of data values. The three commonly used measures of central tendency are
the mean, median, and mode.
Measures of central tendency
Definition
A measure of central tendency (also referred to as a measure of centre or central
location) is a summary measure that attempts to describe a whole set of data with a
single value that represents the middle or centre of its distribution.
• mode
• median
• mean
Each of these measures gives a different indication of the typical or central
value in the distribution.
Mode
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age (years)   54   55   56   57   58   60
Frequency      3    1    1    2    2    2
The most commonly occurring value is 54, therefore the mode of this distribution is 54
years.
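The frequency distribution and the mode of the retirement age data can be reproduced with a short sketch using `collections.Counter`. The variable names are chosen here for illustration.

```python
from collections import Counter

# Retirement ages of the 11 people in the example above
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

frequency = Counter(ages)
for age in sorted(frequency):
    print(age, frequency[age])  # one row of the frequency distribution

# most_common(1) returns the value with the highest count and that count
mode_age, mode_count = frequency.most_common(1)[0]
print("mode:", mode_age, "occurs", mode_count, "times")
```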
Median
The median is the middle value in a distribution when the values are arranged in
ascending or descending order.
The median divides the distribution in half (50% of observations lie on either side
of the median value). In a distribution with an odd number of observations, the median
value is the middle value.
Mean
The mean is the sum of the value of each observation in a dataset divided by the number
of observations. This is also known as the arithmetic average.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean can be used for both continuous and discrete numeric data.
Limitations of the mean
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution, it is influenced by outliers
and skewed distributions.
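The sketch below computes the mean of the retirement age data and then illustrates the sensitivity to outliers. The extra value of 95 is a hypothetical outlier added only for this illustration; it is not part of the original data.

```python
from statistics import mean, median

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

ages_mean = mean(ages)      # sum of the values (623) divided by 11
ages_median = median(ages)  # 6th of the 11 ordered values

# Hypothetical outlier: one person retiring at 95
with_outlier = ages + [95]
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

print(round(ages_mean, 2), ages_median)
print(round(outlier_mean, 2), outlier_median)
```

In this sketch the mean shifts upward by about three years when the outlier is added, while the median stays at 57.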
The sample mean is denoted by the symbol x̄ (pronounced X-bar).
Inferential statistics have two main uses:
• making estimates about populations (for example, the mean SAT score of all 11th
graders in the US).
• testing hypotheses to draw conclusions about populations (for example, the
relationship between SAT scores and family income).
Descriptive versus inferential statistics
Descriptive statistics allow you to describe a data set, while inferential statistics allow you to
make inferences based on a data set.
Descriptive statistics
In descriptive statistics, there is no uncertainty – the statistics precisely describe the data
that you collected. If you collect data from an entire population, you can directly compare
these descriptive statistics to those from other populations.
Example: Descriptive statistics
You collect data on the SAT scores of all 11th graders in a school for three years.
Inferential statistics
Most of the time, you can only acquire data from samples, because it is too
difficult or expensive to collect data from the whole population that you are
interested in.
Example: Inferential statistics
You randomly select a sample of 11th graders in your state and collect data on
their SAT scores and other characteristics.
You can use inferential statistics to make estimates and test hypotheses about the whole
population of 11th graders in the state based on your sample data.
Sampling error in inferential statistics
Since the size of a sample is always smaller than the size of the population, some
of the population is not captured by the sample data. This creates sampling error,
which is the difference between the true population values (called parameters) and
the measured sample values (called statistics).
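Sampling error can be illustrated with a small simulation. Everything in the sketch below is an assumption made for illustration: the population is simulated with a hypothetical mean of 500 and standard deviation of 100, and the sample size of 100 is arbitrary.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 test scores (simulated, not real data)
population = [random.gauss(500, 100) for _ in range(10_000)]
sample = random.sample(population, 100)  # simple random sample

parameter = mean(population)  # true population value
statistic = mean(sample)      # measured sample value
sampling_error = statistic - parameter

print(round(parameter, 2), round(statistic, 2), round(sampling_error, 2))
```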
There are two important types of estimates you can make about the population: point
estimates and interval estimates.
Confidence intervals
While a point estimate gives you a precise value for the parameter you are interested in,
a confidence interval tells you the uncertainty of the point estimate.
They are best used in combination with each other.
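A minimal sketch of a point estimate paired with a confidence interval for a mean follows. It assumes a normal approximation (a z-based interval); for small samples a t-based interval is usually more appropriate. The function name and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (point estimate, (lower, upper)) using a normal approximation."""
    n = len(sample)
    point_estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # z-value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)

# Hypothetical sample of ten test scores
scores = [480, 510, 530, 495, 505, 520, 470, 515, 500, 525]
estimate, (lower, upper) = mean_confidence_interval(scores)
print(estimate, round(lower, 1), round(upper, 1))
```

The point estimate gives the single best guess, and the interval around it shows how much that guess could plausibly vary.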
Hypothesis testing
Hypotheses, or predictions, are tested using statistical tests. Statistical tests also
estimate sampling errors so that valid inferences can be made.
Parametric tests make assumptions that include the following:
• the population that the sample comes from follows a normal distribution of scores
• the sample size is large enough to represent the population
• the variances (a measure of variability) of the groups being compared are similar
Comparison tests
Correlation tests
Correlation tests determine the extent to which two variables are associated.
Of the common correlation tests, the chi-square test of independence is the only one
that can be used with nominal variables.
Regression tests
Regression tests assess whether changes in predictor variables are associated with
changes in an outcome variable. You can decide which regression test to use based
on the number and types of variables you have as predictors and outcomes.
Data transformations can help bring your data closer to a normal distribution using
mathematical operations, like taking the square root of each value.
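As a small illustration of a square-root transformation, the sketch below uses a hypothetical right-skewed dataset and compares the gap between mean and median before and after the transformation. A smaller gap is only a rough indication of reduced skew, not a test of normality.

```python
from math import sqrt
from statistics import mean, median

skewed = [1, 4, 9, 16, 100]              # hypothetical right-skewed values
transformed = [sqrt(v) for v in skewed]  # square root of each value

gap_before = mean(skewed) - median(skewed)
gap_after = mean(transformed) - median(transformed)

print(transformed)
print(gap_before, gap_after)
```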
Spurious Relationships
It's important to remember that correlation does not always indicate causation. Two
variables can be correlated without either variable causing the other. For instance, ice
cream sales and drownings might be correlated, but that doesn't mean that ice cream
causes drownings—instead, both ice cream sales and drownings increase when the
weather is hot. Relationships like this are called spurious correlations.
correlational data
There are many different methods you can use in correlational research. In the
social and behavioral sciences, the most common data collection methods for this
type of research include surveys, observations, and secondary data.
Surveys
In survey research, you can use questionnaires to measure your variables of
interest.
You can conduct surveys online, by mail, by phone, or in person.
Surveys are a quick, flexible way to collect standardized data from many
participants, but it is important to ensure that your questions are worded in an
unbiased way and capture relevant insights.
Naturalistic observation
Naturalistic observation is a type of field research where you gather data about a
behavior or phenomenon in its natural environment.
Secondary data
Instead of collecting original data, you can also use data that has already been
collected for a different purpose, such as official records, polls, or previous studies.
Using secondary data is inexpensive and fast, because data collection is complete.
However, the data may be unreliable, incomplete or not entirely relevant, and you
have no control over the reliability or validity of the data collection procedures.
Correlation analysis
Using a correlation analysis, you can summarize the relationship between variables
into a correlation coefficient: a single number that describes the strength and
direction of the relationship between variables. With this number, you can quantify
the degree of the relationship between variables.
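One widely used correlation coefficient is Pearson's r. The text does not say which coefficient is meant, so the sketch below is an assumption, and the function name and data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4]      # hypothetical predictor values
scores = [2, 4, 6, 8]     # hypothetical outcome values
print(pearson_r(hours, scores))        # perfectly positive relationship
print(pearson_r(hours, scores[::-1]))  # perfectly negative relationship
```

The sign of the coefficient gives the direction of the relationship and its absolute value gives the strength.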
Regression analysis
With a regression analysis, you can predict how much a change in one variable will
be associated with a change in the other variable. The result is a regression
equation that describes the line on a graph of your variables.
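A regression equation for a single predictor can be sketched with ordinary least squares, one common fitting method (the text does not name a specific method). The function name and data below are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

x = [1, 2, 3, 4]  # hypothetical predictor values
y = [3, 5, 7, 9]  # hypothetical outcome values
slope, intercept = fit_line(x, y)
print(f"y = {intercept} + {slope} * x")
```

The slope is the predicted change in the outcome for a one-unit change in the predictor.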
Directionality problem
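If two variables are correlated, it could be because one of them is a cause and the
other is an effect. But a correlational research design does not allow you to infer
which is which. To err on the side of caution, researchers do not conclude causality
from correlational studies.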
Having a good grip on statistical distribution makes exploring a new dataset and
finding patterns within a lot easier. It helps us choose the appropriate machine
learning model to fit our data on and speeds up the overall process.
In this blog, we will be going over diverse types of data, the common distributions
for each of them, and compelling examples of where they are applied in real life.
When you roll a die or pick a card from a deck, you have a limited number of
outcomes possible. This type of data is called Discrete Data, which can only take a
specified number of values. For example, in rolling a die, the specified values are
1, 2, 3, 4, 5, and 6.
Types of statistical distributions
Depending on the type of data we use, we have grouped distributions into two
categories, discrete distributions for discrete data (finite outcomes) and continuous
distributions for continuous data (infinite outcomes).
Discrete distributions
• Given multiple trials, each of them is independent of the other. That is, the
outcome of one trial does not affect another one.
• Each trial can lead to just two possible results (e.g., winning or losing), with
probabilities p and (1 – p).
• An event can occur any number of times (within the defined period).
The graph of a Poisson distribution plots the number of times an event occurs in
a standard interval of time against the probability of each count.
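The conditions listed above (independent trials with two outcomes of probability p and 1 – p) describe what is commonly called the binomial distribution, and the Poisson distribution gives the probability of each event count in a fixed interval. The sketch below computes both probability mass functions; the parameter values are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean count per interval is rate."""
    return rate**k * exp(-rate) / factorial(k)

# Hypothetical example: 4 coin flips with p = 0.5
for k in range(5):
    print(k, binomial_pmf(k, 4, 0.5))

# Hypothetical example: on average 2 events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, 2), 4))
```

Plotting the Poisson values against k gives the kind of graph described above.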
Continuous distributions
Conclusion
Data is an essential component of the data exploration and model development
process. The first thing that springs to mind when working with continuous
variables is looking at the data distribution. We can adjust our Machine Learning
models to best match the problem if we can identify the pattern in the data
distribution, which reduces the time to get to an accurate outcome.
• The chi-square goodness of fit test is used to test whether the frequency
distribution of a categorical variable is different from your expectations.
House sparrow 15
House finch 12
American 236 19
Canadian 157 16
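χ² = Σ (O − E)² / E
Where:
• χ² is the chi-square test statistic
• Σ is the summation operator (it means "take the sum of")
• O is the observed frequency
• E is the expected frequency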
The larger the difference between the observations and the expectations (O − E in
the equation), the bigger the chi-square will be.
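The chi-square statistic sums (O − E)² / E over all categories. The sketch below computes it for hypothetical counts: 100 items in five categories that are expected to be equally common. The counts and the function name are made up for illustration, and no p-value is computed.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed counts for five categories (100 items in total)
observed = [22, 18, 20, 25, 15]
# Equal proportions expected: 100 / 5 = 20 per category
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
print(statistic, degrees_of_freedom)
```

If the observed counts matched the expected counts exactly, the statistic would be zero.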
Mathematically, these are actually the same test. However, we often think of them
• Alternative hypothesis (HA): The bird species visit the bird feeder in
different proportions.
• Null hypothesis (H0): The bird species visit the bird feeder in the same
proportions as the average over the past five years.
• Alternative hypothesis (HA): The bird species visit the bird feeder in
different proportions from the average over the past five years.
Like vanilla 47 32
Dislike vanilla 8 13
• Null hypothesis (H0): The proportion of people who like chocolate is the
same as the proportion of people who like vanilla.
Chi-Square Goodness of Fit Test
• You can use the test when you have counts of values for a categorical
variable.
• The Chi-square goodness of fit test checks whether your sample data is
likely to be from a specific theoretical distribution.
What do we need?
For the goodness of fit test, we need one variable. We also need an idea, or
hypothesis, about how that variable is distributed. Here is an example:
We have bags of candy with five flavors in each bag. The bags should
contain an equal number of pieces of each flavor. The idea we'd like to test is
that the proportions of the five flavors in each bag are the same.
Understanding results
A simple bar chart of the data shows the observed counts for the flavors of candy: