
Vanessa Suares

11/07/2023
• Hans Rosling video – Problem statement:
When will Asia regain its dominance over the Western world?
Questions on this problem statement:
1. Which parameters will Asia overtake in? – per capita income
2. What part of Asia will be part of the dominance? – India and China vs US and UK
Which variables are causing the variation, and what is the factor controlling this variation?
• Stats can be divided into two parts:
1. Descriptive stats
2. Inferential stats
Eg: 6000 IIM students -> population
500 students from this population -> sample
Stats applied to this sample of 500, with the results generalised to the population ->
inferential stats; eg: lab trials, exit polls
If we collect data for the entire population and then analyse it -> descriptive stats
Eg: class average, census
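A minimal sketch of the population vs sample idea above, in Python. The marks are made-up random numbers (an assumption, not class data); the point is only that a statistic computed on a sample of 500 is used to estimate the same statistic for all 6000.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: marks of 6000 students (invented values)
population = [random.gauss(65, 10) for _ in range(6000)]

# Descriptive stats: measure the whole population directly
population_mean = statistics.mean(population)

# Inferential stats: measure a sample of 500 and generalise to the population
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should come out close but not identical; that gap is what inferential stats has to account for.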
• Educational qualifications, gender, roll no. – Categorical, nominal data; cannot be
subjected to mathematical operations; but we can calculate mode, proportions
• Height, weight, age, temperature, income – numeric data (can do mathematical
operations on it)
• Ratings (agree, strongly agree, neutral, disagree, strongly disagree / weight class,
literacy level ranking) – categorical, ordinal (they can be ranked); can calculate mode
and median

Variables
  Categorical
    Nominal: mode
    Ordinal: mode, median
  Numeric: mode, median, mean
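A small Python sketch of which measure fits which variable type, following the summary above. All data values and the variable names are invented for illustration.

```python
import statistics

# Made-up example data for each variable type
gender = ["F", "M", "F", "F", "M"]                      # nominal
rating_scale = ["strongly disagree", "disagree", "neutral",
                "agree", "strongly agree"]              # order of the ordinal scale
ratings = ["agree", "neutral", "agree", "strongly agree", "disagree"]  # ordinal
heights_cm = [158, 162, 170, 175, 181]                  # numeric

# Nominal: only the mode (and proportions) make sense
nominal_mode = statistics.mode(gender)

# Ordinal: mode and median; rank the categories, then take the middle one
ranks = sorted(rating_scale.index(r) for r in ratings)
ordinal_median = rating_scale[statistics.median_low(ranks)]
ordinal_mode = statistics.mode(ratings)

# Numeric: mode, median and mean are all allowed
numeric_mean = statistics.mean(heights_cm)
numeric_median = statistics.median(heights_cm)

print(nominal_mode, ordinal_median, ordinal_mode, numeric_mean, numeric_median)
```

`median_low` is used for the ordinal case so the result is always an actual category rather than a value halfway between two ranks.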


13/07/2023
• Mean Median Mode – Measures of central tendency
• Even for ordinal (ranked) values, we can take the average only if the classes are uniform,
i.e. the distance between each ranking is the same (eg: Amazon ratings)
• Student Data excel doc -> Value field setting to change the depiction of data
• Sort in pivot table
• Bar graph (column graph) on pivot table
• DATEDIF() function =DATEDIF(E2, TODAY(), "Y")
• Analyse your data and ask the right questions
• Converting your data to ranges: right click row data in the pivot table, select group,
select your interval
• Cross tab – grouping of different types of data in one table to show their relationship
• Percentile, quartile, decile
• Quartile formula: QUARTILE.EXC excludes the median value when working out the quartiles,
QUARTILE.INC includes it
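To see the exclusive vs inclusive difference outside Excel, here is a hedged Python sketch. It assumes that the "exclusive" and "inclusive" methods of `statistics.quantiles` use the same interpolation as QUARTILE.EXC and QUARTILE.INC; treat that mapping as an assumption and cross-check against the spreadsheet. The data is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values

# n=4 splits the data into quartiles and returns the three cut points Q1, Q2, Q3
quartiles_exclusive = statistics.quantiles(data, n=4, method="exclusive")
quartiles_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", quartiles_exclusive)
print("inclusive:", quartiles_inclusive)
```

Both methods agree on the median (Q2); only Q1 and Q3 move, and the exclusive method pushes them further apart.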
18/07/2023 (Prepare for surprise quiz next week)
• Measures of Dispersion : How far the data is spread – range, interquartile range, std
dev
• Box and whisker plot – top of box is 3rd quartile, bottom of box is 1st quartile, the
line that passes through the middle of the box is the median (or 2nd quartile), the cross
denotes average
• The shape of the box does not matter
• 50% of the people lie between Q1 and Q3
• Q3-Q1 = Interquartile range (IQR)
• IQR is another measure of dispersion that gives slightly more information than range
• Formula to find the whiskers:
  upper whisker limit = Q3 + 1.5 × IQR
  lower whisker limit = Q1 - 1.5 × IQR
• The whisker formula gives the cut-off for outliers, but where the whisker is actually
drawn on the graph depends on your dataset
• Anything lying above the upper limit or below the lower limit is an outlier
• Whiskers are meant to find outliers in your dataset
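The whisker rule can be sketched in Python as below. The dataset is invented (60 is deliberately extreme), and the inclusive quartile method is an arbitrary choice for the illustration; a different quartile method shifts the limits slightly.

```python
import statistics

data = [12, 15, 17, 18, 19, 20, 21, 22, 24, 60]  # made-up values

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whisker limits: cut-offs beyond which a point counts as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

# What is drawn on the plot: the most extreme data points still inside the limits
inside = [x for x in data if lower_limit <= x <= upper_limit]
drawn_whiskers = (min(inside), max(inside))

print("IQR:", iqr)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
print("drawn whiskers:", drawn_whiskers)
```

This also shows why the whisker on the graph depends on the dataset: the limits are computed, but the whisker stops at the last actual data point inside them.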
• Demo1 sheet; Histogram – demo1 excel sheet
• Most natural variables (age, height, weight) will follow symmetric (bell curve)
distribution
• Asymmetric distribution with tail towards the right side and max on left side :
positively skewed distribution/right skewed distribution ; mean will be to the right of
median ie. Mean>median
• Asymmetric distribution with tail towards the left side and max on right side :
negatively skewed distribution/ left skewed distribution ; mean will be to the left of
median ie. Mean<median
• In a symmetric, single-peaked distribution the mean, median and mode occur at the same point
• Added – datapacks


• Data – Data analysis (under Analyse) – Descriptive statistics – select your data with
labels, tick the labels and summary statistics options – ok
• Standard deviation and Variance: Variance = (Std. Deviation)², i.e. Std. Deviation = √Variance
• Std dev is a very efficient scaling unit
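A quick Python check of the variance / std dev relationship, plus one way to read "scaling unit": expressing a value as a number of standard deviations from the mean (a z-score). The data is made up, and population formulas (`pvariance`, `pstdev`) are assumed; the sample versions divide by n-1 instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

# Variance is the square of the standard deviation
relationship_holds = math.isclose(variance, std_dev ** 2)

# Std dev as a scaling unit: how many std devs is the value 9 away from the mean?
z_score = (9 - mean) / std_dev

print(mean, variance, std_dev, relationship_holds, z_score)
```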

20/07/2023
• Variable is distributed symmetrically – means that mean=median=mode and it is a
proper bell curve
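A small Python sketch comparing mean, median and mode for a symmetric dataset and for skewed ones. All three datasets are invented for the illustration.

```python
import statistics

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]      # long tail to the right
left_skewed = [1, 7, 11, 12, 13, 13, 14, 14, 14]  # long tail to the left

def summary(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

for name, values in [("symmetric", symmetric),
                     ("right skewed", right_skewed),
                     ("left skewed", left_skewed)]:
    mean, median, mode = summary(values)
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

In the symmetric case all three coincide; the tail drags the mean towards itself in the skewed cases, so mean > median for the right-skewed data and mean < median for the left-skewed data.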



mu = mean, sigma = std dev


The image above has the variable symmetrically distributed
6 sigma plot (to be studied later)
• Interpretation of the above normal distribution: if I pick any value from the data set,
there is about a 99.73% chance it lies between categories 1-6 (μ ± 3σ)
• For a skewed/asymmetric/arbitrary distribution:
o >= 75% of the data will lie within (μ ± 2σ)
o >= 89% of the data will lie between categories 1-6 (μ ± 3σ)
In general,
>= (1 - 1/k²) × 100% of the data lies in (x̄ - kσ, x̄ + kσ)
This is known as CHEBYSHEV’S RULE
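Chebyshev's rule can be checked numerically with a short Python sketch. The skewed dataset is made up, the helper names are hypothetical, and the population standard deviation is assumed.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of data within k std devs of the mean (useful for k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Actual fraction of values lying within k std devs of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    inside = [x for x in values if mu - k * sigma <= x <= mu + k * sigma]
    return len(inside) / len(values)

skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14, 40]  # made-up, heavily right skewed

for k in (2, 3):
    bound = chebyshev_bound(k)
    actual = fraction_within(skewed, k)
    print(f"k={k}: at least {bound:.1%} guaranteed, {actual:.1%} observed")
```

k = 2 gives the 75% figure and k = 3 gives the 89% figure from the notes; the observed fraction can be higher than the bound but never lower.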
