STATISTICS (Tanya) PG 1 - 28
Importance
1) Lends credibility to an argument.
2) Provides the tools you need to react intelligently to information you hear
or read.
Descriptive stats:
Numbers used to summarize and describe data
No assumptions/inferences can be made.
Just descriptive in nature.
EXAMPLE:
Average salaries for various occupations in 1999.
$112,760 paediatricians
$106,130 dentists
$100,090 podiatrists
$76,140 physicists
$53,410 architects
$49,720 school, clinical, and counselling psychologists
$47,910 flight attendants
$39,560 elementary school teachers
$38,710 police officers
$18,980 floral designers
Prima facie description:
We pay the people who educate our children and who protect our citizens a great
deal less than we pay people who take care of our feet or our teeth.
No inference should be drawn from this data alone, as various other factors
may be involved.
Inferential stats:
The mathematical procedures whereby we convert information about the sample
into intelligent guesses about the population.
1) Sample: a small subset of a larger set of data, used to draw inferences about the
larger set.
2) Population: the larger set from which the sample is drawn.
The assumption is that sampling is random and covers all varieties of specimens in the
population.
In this sense, simple random sampling chooses a sample by pure
chance: every member of the population has an equal chance of being selected.
Sample size matters. Only a large sample size makes it likely that our sample is
close to representative of the population.
Stratified Sampling:
Used to make the sample more representative of the population.
Used if the population has a number of distinct “strata” or groups.
HOW?
You first identify the members of the population who belong to each group. Then
you randomly sample from each of those subgroups in such a way that the
sizes of the subgroups in the sample are proportional to their sizes in the
population.
(subgroup size in sample : sample size = subgroup size in population : population size)
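The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_sample`, the student population and the fixed seed are all hypothetical and are not part of the notes.

```python
import random

def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a proportional stratified sample (illustrative sketch).

    population  : list of members
    stratum_of  : function mapping a member to its stratum label
    sample_size : total number of members wanted in the sample
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable

    # Step 1: group the members of the population by stratum.
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)

    # Step 2: randomly sample each stratum in proportion to its population share.
    sample = []
    for members in strata.values():
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 first-year and 40 second-year students.
population = [("first-year", i) for i in range(60)] + \
             [("second-year", i) for i in range(40)]
sample = stratified_sample(population, stratum_of=lambda m: m[0], sample_size=10)
counts = {label: sum(1 for m in sample if m[0] == label)
          for label in ("first-year", "second-year")}
print(counts)  # {'first-year': 6, 'second-year': 4}
```

The 60:40 split in the population is reproduced as 6:4 in the sample. When the proportions do not divide evenly, the rounding in step 2 can make the total differ slightly from `sample_size`.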
Variables:
Variables are properties or characteristics of some event, object, or person that can
take on different values or amounts.
Percentiles:(DOUBT)
The percentage of people who scored below you on a test or at a given level.
Shows your performance with respect to your competitors.
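As a small illustration of the "percentage of scores below yours" definition, here is a sketch in Python. The scores are made up, and other definitions of a percentile (for example, counting scores at or below yours) would give different numbers.

```python
def percentile_rank(scores, your_score):
    """Percentage of scores strictly below your_score."""
    below = sum(1 for s in scores if s < your_score)
    return 100 * below / len(scores)

# Hypothetical test scores for ten candidates.
scores = [35, 42, 50, 58, 61, 67, 70, 74, 88, 93]
print(percentile_rank(scores, 74))  # 70.0
```

Seven of the ten scores are below 74, so a score of 74 sits at the 70th percentile under this definition.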
STATISTICS (PurabSoni)Pg 29-52
The first two definitions are self-explanatory, so only the third
definition is discussed here.
For an example, refer to Pg no. 30.
The explanation is given in the audio.
Levels of Measurement-
Types of Scales
Nominal Scales
Ordinal Scales
Interval Scales
Ratio Scales
Nominal: The essential point about nominal scales is that they do not imply
any ordering among the responses. For example, when classifying people
according to their favorite color, there is no sense in which green is placed
“ahead of” blue. Responses are merely categorized. Nominal scales embody
the lowest level of measurement.
Ratio: A ratio scale is a measurement scale that not only
gives the order of values but also makes the differences between
values meaningful and has a true zero point. The ratio scale
provides the most detailed information: researchers and statisticians can
calculate central tendency using the mean,
median and mode, and measures such as the geometric mean, the harmonic mean
and the coefficient of variation can also be used on this scale.
Ratio Scale Example:
The following question falls under the ratio scale category:
What is your daughter’s current height?
Less than 5 feet.
5 feet 1 inch – 5 feet 5 inches
5 feet 6 inches- 6 feet
More than 6 feet
Distributions
Discrete: Fixed frequencies are distributed to different objects/variables.
Continuous: Ranges are made for frequencies for different objects.
Probability Distribution: When the probability is calculated for each
frequency to occur. The chance for the frequency to appear can affect the
ultimate data table.
Example: Discrete
Colour    Frequency
Red       12
Brown     1
Blue      7
Purple    6
[Bar chart: frequency (Y-axis, 0 to 14) by colour (X-axis: Red, Brown, Blue, Purple)]
Probability Distribution
Colour    Probability
Red       0.4
Brown     0.15
Blue      0.25
Purple    0.2
[Bar chart: probability (Y-axis, 0 to 0.5) by colour (X-axis: Red, Brown, Blue, Purple)]
Continuous
A hand gesture and average time men take to respond.
Probability Density
To represent the probability of any given event associated with a continuous measurement like
the one mentioned above, we plot how often each response time occurs across a continuous range
of times. To account for all possible outcomes, we treat the variable as continuous so that no
outcome is missed. The normal, bell-shaped curve is the most common curve used to represent such
measurements.
Here, the probability density is highest at the centre and lowest in the tails, where the
curve approaches the X-axis.
Observations
First, the area under the curve is equal to 1 because the curve covers the sum of all
probabilities for the different possible outcomes.
Second, the probability of any exact value of X is 0: the probability that a
movement takes exactly 698.956432342346576 milliseconds is essentially zero.
Shapes of distributions
Not all distributions look like a bell. Not all distributions have their probability density centred
in the middle; they can be more spread out or lopsided.
Note: A normal bell curve has its mean, median and mode all at the centre. When the
median deviates from the mean, we call the distribution skewed.
A distribution with the longer tail extending in the positive direction is said to have a positive skew.
It is also described as “skewed to the right.”
A distribution with the longer tail extending in the negative direction is said to have a negative skew.
It is also described as “skewed to the left”.
Example:
Take 3 numbers (49, 50, 51); the mean and the median are both 50. If a much larger number is added, say 60,
the mean rises to 52.5 while the median is only 50.5. The mean lies to the right of the median, so the
distribution is positively skewed. On the other hand, if a much smaller number is added, say 40, the mean
falls to 47.5 while the median is 49.5. The mean lies to the left of the median, making the distribution negatively skewed.
Talking in terms of placements at B-schools, the figures are usually positively skewed, as the
mean is usually on the right side of the median.
Distributions also differ from each other in terms of how large or “fat” their tails are. The left
distribution has relatively more scores in its tails; its shape is called leptokurtic. The right
distribution has relatively fewer scores in its tails; its shape is called platykurtic.
STATISTICS (Nivi ) Page –(52-91)
Student     Weight (X)
Harry       60
Ron         54
Hermione    57
$\sum_{i=1}^{3} X_i = 60 + 54 + 57 = 171$
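Many formulas involve squaring numbers before they are summed. This is indicated as $\sum X^2$,
which is different from $(\sum X)^2$, the square of the sum.
Other formulas sum the products of paired values:
X    Y    XY
1    3    3
2    2    4
3    4    12
$\sum XY = 3 + 4 + 12 = 19$
This indicates the summation of cross products.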
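The three kinds of sums in this section map directly onto Python's built-in `sum`. The sketch below reuses the weights and the X, Y values from the tables above; the variable names are only illustrative.

```python
weights = [60, 54, 57]  # Harry, Ron, Hermione

sum_x = sum(weights)                         # sum of X
sum_x_squared = sum(x ** 2 for x in weights) # square each value, then sum
square_of_sum = sum(weights) ** 2            # sum first, then square

X = [1, 2, 3]
Y = [3, 2, 4]
sum_xy = sum(x * y for x, y in zip(X, Y))    # sum of cross products

print(sum_x)          # 171
print(sum_x_squared)  # 9765
print(square_of_sum)  # 29241
print(sum_xy)         # 19
```

Note that the sum of squares (9765) and the square of the sum (29241) are very different quantities, which is why the order of squaring and summing matters in formulas.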
Logarithms –
Pie chart - Pie charts are effective for displaying the relative frequencies of
a small number of categories. In a pie chart, each category is represented by
a slice of the pie. The area of the slice is proportional to the percentage of
responses in the category. This is simply the relative frequency multiplied by
100.
Bar charts - Bar charts can also be used to represent the frequencies of different
categories. Here, frequencies (for example, the number of consumers opting for a
particular ice-cream) are shown on the Y-axis and the categories (the ice-creams)
are shown on the X-axis.
Bar charts are better when there are more than just a few categories and for
comparing two or more distributions.
Some common mistakes to avoid - Unnecessary fanciness can lead to
unacceptable distortion of the information you are trying to convey.
Examples include using 3D charts, setting the baseline to a value other than zero, and
using a line graph for qualitative variables.
Stem and leaf displays - A stem and leaf display is a graphical method of
displaying data. It is particularly useful when your data are not too numerous.
The 'stem' is on the left and displays the first digit or digits. The 'leaf' is on the
right and displays the last digit.
For example if we have a distribution:
3, 6, 9, 10, 10, 11, 14, 17, 19, 20, 22, 22, 27, 28, 29, 31, 31, 33, 33, 33
The numbers 3, 2, 1 and 0 (for single digits) are arranged as stems and the
numbers to the right of the bar are leaves, and they represent the 1’s digits.
3 |11333 (31, 31, 33, 33, 33)
2 |022789 (20, 22, 22, 27, 28, 29)
1 |001479 (10, 10, 11, 14, 17, 19)
0 |369 (3, 6, 9)
Each leaf is repeated as many times as its value occurs in the data set. We can
also make the figure easier to read by splitting each stem into two parts, so that
each row covers a specified interval:
1st row: 30-34   3 |11333
2nd row: 25-29   2 |789
3rd row: 20-24   2 |022
4th row: 15-19   1 |79
5th row: 10-14   1 |0014
6th row: 5-9     0 |69
7th row: 0-4     0 |3
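Both displays above can be generated mechanically. The following is a minimal sketch that assumes non-negative whole numbers with the last digit as the leaf; the function name `stem_and_leaf` and its `split` flag are hypothetical helpers, not a standard library feature.

```python
from collections import defaultdict

def stem_and_leaf(values, split=False):
    """Return the display as a list of lines, largest stem first.

    With split=True each stem gets two rows: leaves 5-9 and leaves 0-4.
    Works for non-negative integers only.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)          # e.g. 27 -> stem 2, leaf 7
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} |{''.join(leaves)}"
            for (stem, half), leaves in sorted(rows.items(), reverse=True)]

data = [3, 6, 9, 10, 10, 11, 14, 17, 19, 20, 22, 22, 27, 28, 29,
        31, 31, 33, 33, 33]

basic = stem_and_leaf(data)
split_rows = stem_and_leaf(data, split=True)
print("\n".join(basic))
print()
print("\n".join(split_rows))
```

The first call reproduces the four-row display and the second reproduces the seven-row display with split stems.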
We can also use stem and leaf displays for comparison. For instance, suppose
girls and boys in a school each read a certain number of books. This can be represented
on a stem and leaf display. From the plot we can infer, for example, that two girls
read 51 books while only one boy read 51 books, and
so on.
Decimal numbers and negative numbers can also be plotted on a stem and
leaf display. We can round off the decimal to nearest whole number and to
represent negative numbers we can use negative stems. For instance,
43.9 can be rounded to 44, 51.2 can be rounded to 51 and so on.
Consider the data set: 43.9, 51.2, -27.4, -15.4, 1.2, -0.2, -6.3, -6.7, -8.8
Now to plot this on stem and leaf display:
Although stem and leaf displays are unwieldy for large data sets, they are
often useful for data sets with up to 200 observations. If we use a
stem and leaf display for large values such as population figures, we need to
round the values to two significant digits. For example, 493,559 can
be rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9.
Whether your data can be suitably represented by a stem and leaf display
depends on whether they can be rounded without loss of important
information.
Histograms - A histogram is a graphical method for displaying the shape of a
distribution. It is particularly useful when there are a large number of
observations. It groups the observations into ranges or intervals. The height
of each bar shows how many observations fall into that interval, known as the class
frequency. Let us consider the height distribution of trees in an orchard. We
can group the heights into intervals 50 cm wide. The first bar would show the number of trees
from 100 cm up to 150 cm, the second bar the number of trees from 150 cm up to
200 cm, and so on.
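The grouping step behind a histogram can be sketched as follows. The tree heights are invented for illustration, and the helper `class_frequencies` is hypothetical. Each interval is treated as including its lower bound and excluding its upper bound, so a tree of exactly 150 cm is counted in the 150-200 interval.

```python
from collections import Counter

def class_frequencies(values, start, width):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter(start + width * ((v - start) // width) for v in values)
    return dict(sorted(counts.items()))

# Hypothetical tree heights in cm.
heights_cm = [104, 118, 131, 149, 150, 162, 175, 188, 199, 203, 221, 240]
frequencies = class_frequencies(heights_cm, start=100, width=50)

# Text version of the histogram: one '#' per tree in the interval.
for lower, count in frequencies.items():
    print(f"{lower}-{lower + 50} cm: {'#' * count}")
```

Choosing a boundary convention matters because intervals written as 100-150 and 150-200 otherwise overlap at 150.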
SUMMARISING DISTRIBUTIONS
4.1 Objectives:
Central Tendency
o Mean, Median and Mode
o Calculation
Variability
o Standard Deviation
o Variance
Shape and Transformations
o Variance Sum Law 1.
1. Balance Scale
The balance point is the point in the given data set at which the distribution is in balance. For
example, for a given data set S = {2, 3, 4, 9, 16}, the following image represents
the balance scale.
The balance point, or fulcrum, varies depending on the type of
distribution, that is, whether it is symmetric or asymmetric. Examples are
shown below.
NOTE: The balance point (the point where the fulcrum is placed) denotes the
centre of the distribution.
For the data set S = {2, 3, 4, 9, 16}, we can also look for the value for which the sum of
absolute deviations is minimum.
Two cases are taken for calculating the absolute deviations of the set, from
A. 10
B. 5
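The two cases can be checked numerically. The sketch below computes the sum of absolute deviations of S from 10 and from 5, and adds two comparisons that are not in the notes: the median (4), which is the value that minimises the sum of absolute deviations, and the mean (6.8), which is the balance point where the signed deviations cancel.

```python
S = [2, 3, 4, 9, 16]

def sum_abs_dev(data, centre):
    """Sum of absolute deviations of the data from a candidate centre."""
    return sum(abs(x - centre) for x in data)

case_a = sum_abs_dev(S, 10)  # 8 + 7 + 6 + 1 + 6
case_b = sum_abs_dev(S, 5)   # 3 + 2 + 1 + 4 + 11
at_median = sum_abs_dev(S, 4)

mean = sum(S) / len(S)
signed_sum = sum(x - mean for x in S)  # deviations above and below cancel

print(case_a)     # 28
print(case_b)     # 21
print(at_median)  # 20
print(mean)       # 6.8
```

Of the two cases, 5 is the better centre under this criterion (21 versus 28), and no value does better than the median.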
1. Arithmetic Mean
The arithmetic mean is the sum of all the numbers in a
given data set divided by the number of numbers in it. The symbol “μ”
is used for the mean of a population. The symbol “M” is used for the mean of a
sample. For a population of N numbers, $\mu = \frac{\sum X}{N}$.
3. Mode
The mode is a measure of central tendency that gives the most frequently
occurring value.
The following continuous data contain ranges and the frequency with which
values fall in each range.
The mode for the given sample data set lies in the range 600-700, as this
range has the highest frequency. Thus, the mode is taken as the middle of the interval,
650.
Note: The mean, median and mode are the same for a symmetric
distribution with a single peak; for an asymmetric distribution the three generally
differ.
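The trimean is a weighted average of the 25th percentile, the 50th percentile (median) and the 75th percentile:
Trimean = (P25 + 2 * P50 + P75) / 4
Trimean = (15 + 2 * 20 + 23) / 4 = 78/4 = 19.5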
2. Geometric Mean
a. The geometric mean is computed by multiplying the n numbers in a given data set
and taking the nth root of the product.
b. The GM is an ideal measure for averaging rates.
E.g.: Consider a stock portfolio that started with a value of
$1000 and had annual returns of 13%, 22%, 12%, -5%, and -13%.
Each return is converted to a multiplier (1.13, 1.22, 1.12, 0.95, 0.87) indicating how much
the value grew or shrank in that year.
Thus the geometric mean of the multipliers is 1.05, an average return of about 5% per year.
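The portfolio example can be verified with a short calculation. This sketch uses only the returns quoted above; the variable names are illustrative and `math.prod` requires Python 3.8 or later.

```python
import math

returns = [0.13, 0.22, 0.12, -0.05, -0.13]   # annual returns from the example
multipliers = [1 + r for r in returns]        # 1.13, 1.22, 1.12, 0.95, 0.87

# Geometric mean: nth root of the product of the n multipliers.
geometric_mean = math.prod(multipliers) ** (1 / len(multipliers))

# Value of the $1000 portfolio after the five years.
final_value = 1000 * math.prod(multipliers)

print(round(geometric_mean, 2))  # 1.05
print(round(final_value, 2))     # 1276.14
```

Growing $1000 by the geometric mean multiplier for five years gives the same final value as applying the five actual returns in turn, which is why the geometric mean is the appropriate average for rates.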
3. Trimmed Mean
A trimmed mean is computed by removing some of the highest and lowest scores and taking the mean of the
remaining scores.
Representation: A mean trimmed 10% is a mean computed with
10% of the scores trimmed off: 5% from the bottom of the data set and
5% from the top of the data set.
S = { 37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19,
19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6}
To calculate the mean of S trimmed 20%, remove 10% of the scores from the
bottom of S and 10% from the top.
I. Total number of elements in S = 31
II. 10% of 31 is 3.1, so about 3 elements are removed from the top and 3 from the bottom
Top 10% of S = {37, 33, 33}
Bottom 10% of S = {12, 9, 6}
So the trimmed set S’ is as follows:
S’ = {32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18,
18, 16, 15, 14, 14, 14, 12}
Therefore, trimmed mean of S = 504 / 25 = 20.16 [n’ = 31 - 6 = 25 terms]
[Note: The mean is computed using the number of terms in S’.]
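The trimming steps can be written as a short function. This is a sketch, not a standard routine: the name `trimmed_mean` is hypothetical, and rounding the number of dropped scores to the nearest whole number is an assumption made here to match the worked example (3 scores from each end of 31).

```python
def trimmed_mean(data, percent):
    """Mean after trimming `percent` of the scores, half from each end.

    The number of scores dropped at each end is rounded to the nearest
    whole number (an assumption made for this sketch).
    """
    ordered = sorted(data)
    k = round(len(ordered) * percent / 200)   # scores to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

S = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19,
     19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

print(len(S))               # 31
print(trimmed_mean(S, 20))  # 20.16
```

With `percent=0` the function returns the ordinary mean, so the trimmed mean can be compared directly with the untrimmed one.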
whole story.
1. Axes of the graphs: Y-axis -> frequency, X-axis -> respective values (here,
quiz scores)
2. 75th and 25th percentile calculation
QUIZ 2:
a. Find the sum of the frequencies for the different scores; total =
20.
b. Find the cumulative-frequency positions of the 75th and 25th percentiles, which would be 15
and 4 respectively.
c. Find the scores at which the cumulative frequencies add up to
15 and 4; these are scores 9 and 5 respectively.
1. Range
This describes the spread of the entire distribution by calculating
the difference between the highest and lowest scores.
2. Interquartile Range
The IQR is the range of the middle 50% of scores in a distribution.
IQR = 75th percentile - 25th percentile
For Quiz 1 the IQR would be,
IQR Quiz 1 = 8 - 6 = 2
For Quiz 2 the IQR would be,
IQR Quiz 2 = 9 - 5 = 4
A related measure of variability is the semi-interquartile range. It is defined as
half of the interquartile range. For a symmetric distribution, the median +/-
the semi-interquartile range contains half the scores in the distribution.
3. Variance
Variability can also be defined in terms of how close the scores in the
distribution are to the middle of the distribution. The variance is defined as the
average squared difference of the scores from the mean.
Thus for a sample the calculation of the variance is as follows:
$s^2 = \frac{\sum (X - M)^2}{N - 1}$
$s^2$ is the estimate of the variance and M is the sample mean. Note that M is the
mean of a sample taken from a population with a mean of μ. Since, in practice,
the variance is usually computed in a sample, this formula is most often used.
4. Standard Deviation
The standard deviation is simply the square root of the variance. The standard
deviation is an especially useful measure of variability when the distribution is
normal or approximately normal because the proportion of the distribution
within a given number of standard deviations from the mean can be calculated.
[Repeat]: Please revisit the standard deviation section when studying the normal
distribution.
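A small worked example may help tie variance and standard deviation together. The sketch below reuses the three weights from the summation section (60, 54, 57). It shows both the population form (divide by N) and the sample estimate (divide by N - 1, the usual sample formula), and cross-checks them against Python's `statistics` module.

```python
import statistics

weights = [60, 54, 57]
N = len(weights)
M = sum(weights) / N                           # mean = 57.0

squared_devs = [(x - M) ** 2 for x in weights]  # 9.0, 9.0, 0.0

population_variance = sum(squared_devs) / N     # average squared deviation
sample_variance = sum(squared_devs) / (N - 1)   # estimate from a sample
sample_sd = sample_variance ** 0.5              # standard deviation

print(population_variance)  # 6.0
print(sample_variance)      # 9.0
print(sample_sd)            # 3.0

# The standard library gives the same results.
print(statistics.pvariance(weights) == population_variance)  # True
print(statistics.variance(weights) == sample_variance)       # True
```

The standard deviation is in the same units as the data (here, units of weight), while the variance is in squared units.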
4.6 Shapes of Distribution
1. Skew
a) Distributions with a large positive skew have means that are larger than their
medians
b) In a highly skewed distribution, the mean can be more than twice the median
c) To calculate the skew index of a distribution, Pearson's formula is:
$\text{skew} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
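Pearson's skew index, 3(mean - median) / standard deviation, is easy to compute directly. The sketch below applies it to the 31-value data set S from the trimmed mean example. Using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption made for this illustration.

```python
import statistics

S = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19,
     19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

mean = statistics.mean(S)
median = statistics.median(S)
sd = statistics.pstdev(S)   # population SD; an assumption for this sketch

pearson_skew = 3 * (mean - median) / sd

print(round(mean, 2))  # 20.45
print(median)          # 20
print(pearson_skew > 0)  # True: the mean exceeds the median
```

Because the mean (about 20.45) is slightly above the median (20), the index is positive, indicating a mild positive skew.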
3. Kurtosis
It is another measure of the shape of a distribution. The value “3” is subtracted to
define “no kurtosis” as the kurtosis of a normal distribution. Otherwise, a
normal distribution would have a kurtosis of 3.
2) Zero variance indicates that all the values in the distribution are equal.
3) If a constant is added to every value in a frequency distribution, the variance of the
distribution does not change, i.e. $\mathrm{Var}(X + c) = \mathrm{Var}(X)$.
4) If all the values of a variable in a distribution are multiplied by a constant, the variance of the
distribution is multiplied by the square of that constant, i.e. $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$.
5) If the variances of several distributions are given, the
variance of their sum or difference can be calculated using the following formulae:
$\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y$ and $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y$
Note: These formulas for the sum and difference of variables given above only
apply when the variables are independent.
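These properties can be checked numerically. In the sketch below, the first part reuses the weights 60, 54, 57. The second part builds two small made-up variables and pairs every value of one with every value of the other, which makes them independent by construction. Population variance (`statistics.pvariance`) is used throughout.

```python
import math
from itertools import product
from statistics import pvariance

X = [60, 54, 57]
var_x = pvariance(X)
var_shifted = pvariance([x + 10 for x in X])  # add a constant
var_scaled = pvariance([2 * x for x in X])    # multiply by a constant

print(var_x, var_shifted, var_scaled)  # 6 6 24

# Independent variables: every x is paired with every y equally often.
xs = [1, 2, 3]   # hypothetical values
ys = [10, 20]    # hypothetical values
pairs = list(product(xs, ys))

var_sum = pvariance([x + y for x, y in pairs])
var_diff = pvariance([x - y for x, y in pairs])
expected = pvariance(xs) + pvariance(ys)

print(math.isclose(var_sum, expected))   # True
print(math.isclose(var_diff, expected))  # True
```

Adding 10 leaves the variance at 6, doubling the values multiplies it by 4, and for the independent pair both the sum and the difference have variance equal to the sum of the two variances.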
Pranay Saha ( Pg : 164- 179)
Introduction to Bivariate Data
What is Bivariate Data?
In statistics, bivariate data is data on each of two variables, where each value of one of the variables is
paired with a value of the other variable. Typically it would be of interest to investigate the possible
association between the two variables.
For Example:
The figure below shows a scatter plot of Arm Strength and Grip Strength for individuals working in
physically demanding jobs, including electricians, construction and maintenance workers, and auto
mechanics.
Not surprisingly, the stronger someone's grip, the stronger their arm tends to be. There is therefore a
positive association between these variables. Although the points cluster along a line, they are not
clustered quite as closely as they are for the scatter plot of spousal age.
However, not all scatter plots show linear relationships. The figure below shows the results of an
experiment conducted by Galileo on projectile motion.
The symbol for Pearson's correlation is “ρ” when it is measured in the population and “r”
when it is measured in a sample. Because we will be dealing almost exclusively with
samples, we will use “r” to represent Pearson's correlation unless otherwise noted.
Figure 3. A scatter plot for which r = 0. Notice that there is no relationship between X and Y.
However, with real data, you would not expect to get values of r of exactly -1, 0, or 1.
For example:
Properties of Pearson's r
i. A basic property of Pearson's r is that its possible range is from -1 to 1. A
correlation of -1 means a perfect negative linear relationship, a correlation of 0
means no linear relationship, and a correlation of 1 means a perfect positive
linear relationship.
ii. Pearson's correlation is symmetric in the sense that the correlation of X with Y
is the same as the correlation of Y with X. For example, the correlation of
Weight with Height is the same as the correlation of Height with Weight.
Step 1: Compute the mean of X and subtract this mean from all values of X. The new
variable is called “x.”
Step 2: The variable “y” is computed similarly from Y.
The variables x and y are said to be deviation scores because each score is a deviation from
the mean. Notice that the means of x and y are both 0.
Next we create a new column by multiplying x and y.
Please Note:
Before proceeding with the calculations, let's consider why the sum of the xy column
reveals the relationship between X and Y. If there were no relationship between X and
Y, then positive values of x would be just as likely to be paired with negative values of y
as with positive values. This would make negative values of xy as likely as positive
values and the sum would be small. On the other hand, consider Table 1 in which high
values of X are associated with high values of Y and low values of X are associated with
low values of Y. You can see that positive values of x are associated with positive values
of y and negative values of x are associated with negative values of y. In all cases, the
product of x and y is positive, resulting in a high total for the xy column. Finally, if there
were a negative relationship then positive values of x would be associated with negative
values of y and negative values of x would be associated with positive values of y. This
would lead to negative values for xy.
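The steps above lead to the standard deviation-score formula for Pearson's correlation, r = Σxy / √(Σx² · Σy²). Here is a minimal sketch of that calculation. The X and Y values are illustrative only and are not claimed to be the notes' Table 1; they were chosen so that high X goes with high Y.

```python
import math

def pearson_r(X, Y):
    """Pearson's r computed from deviation scores."""
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    x = [v - mean_x for v in X]   # Step 1: deviation scores for X
    y = [v - mean_y for v in Y]   # Step 2: deviation scores for Y

    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of the xy column
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Illustrative data (an assumption for this sketch).
X = [1, 3, 5, 5, 6]
Y = [4, 6, 10, 12, 13]

r = pearson_r(X, Y)
print(round(r, 3))  # 0.968
```

For this data every xy product is positive (15, 3, 1, 3, 8), so the sum of the xy column is large and r is close to 1. Swapping the arguments gives the same value, reflecting the symmetry of Pearson's r.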