04 - Descriptive Statistics

MATH 621

Mathematical
Methods
Dr. Asim Khwaja
Lecture 04
CASE STUDY

Dr Death!
Harold Shipman

“The Art of Statistics” by David Spiegelhalter
• Questions arise when we want to better understand the world.

• Those questions are answered using statistical science.

• A natural first question is how the killings were spread out over his professional years.
What kind of people did Harold
Shipman murder?
• More data was collected to answer this question and others.
Forensic Statistics
• Interesting case where we searched for patterns in the data.

• These patterns led to more interesting questions.

• There was no mathematics in this case, no models, no theory.

• The data analysis supported a general understanding of how he went about his crimes.
Shipman’s Story Demonstrates
• The great potential of using data

• To help us understand the world.

• And make better judgements.


This is what statistical science is all about!

It lets us see the BIG picture of the data, appreciating the whole scene from a distance.
What is Statistics About?
• How many calories did each of us eat for breakfast?
• How far from home did everyone travel today?
• How big is the place that we call home?
• How many other people call it home?

• To make sense of all of this information, certain tools and ways of thinking are necessary.

• The mathematical science of statistics helps us to deal with this.

• Statistics is the study of numerical information, called data.


The Big Picture of Statistics
• Statistics is all about converting data into useful information.

• It is a process in which we
1. Collect data
2. Summarize & visualize data
3. Interpret data

• The big picture of statistics is a central idea of this course.


The Big Picture of Statistics
• The process of statistics starts when we identify what group we
want to study or learn something about.

• We call this group the population.

• Population here does not refer only to people but, in a statistical sense, also to animals, objects, etc.
The Big Picture of Statistics
• For example, we might be interested in:

• The opinions of the population of Pakistan about the corona vaccine.

• How the population of mice reacts to a certain chemical.

• The average price of the population of all two-bedroom apartments in a certain city.
The Big Picture of Statistics
• Population, then, is the entire group that is the target of our
interest or study:
The Big Picture of Statistics
• In most cases, the population is so large that there is absolutely
no way we can study all of it.

• For example, asking every Pakistani for an opinion about the corona vaccine.

• A more practical approach would be to examine and collect data only from a subgroup of the population, which we call a sample.
The Big Picture of Statistics
• First step: Choosing a sample and collecting data from it – this
is called producing data.
The Big Picture of Statistics
VERY IMPORTANT:

• We are examining only a small sub-group of the population rather than the whole population.

• We need to be VERY careful in choosing a sample such that it represents the population well.
The Big Picture of Statistics
• Once the data have been collected, we have a list of questions
to ask and answer and numbers to collect.

• For this, we need to summarize that data in a meaningful way (because it is too large).

• And visualize it graphically.


The Big Picture of Statistics
• Second step: This summarizing and visualizing the data is
called exploratory data analysis.
The Big Picture of Statistics
• Now we have obtained the sample results and summarized
them.

• But we are not done yet.

• Remember: Our goal is really to study the whole population (and not just the sample).

• So, what we want is to be able to draw conclusions about the population based on the sample results.
The Big Picture of Statistics
• Before we can draw such conclusions, we need to look at how
the sample we are using may differ from the population as a
whole.

• We would need to adjust for those differences in our analysis so that we can draw conclusions that are as accurate as possible.

• To examine this difference, we use probability.


The Big Picture of Statistics
• In essence, probability is the “machinery” that allows us to draw conclusions about the population.

• It does that based on the data collected about the sample.


The Big Picture of Statistics
• Finally, we can use what we’ve discovered about our sample to
draw conclusions about our population.
The Big Picture of Statistics
• Final step: This is called inference.
The Big Picture of Statistics
This is the BIG picture of
Statistics
Types of Data
Types of Statistics
• Descriptive Statistics
• Deals with methods used to describe the data that have been collected
• Organize data into some meaningful form
• Summarizing large data sets with small parameters or indicators

• Inferential Statistics
• These involve methods concerned with finding out something about a
population
• A decision, estimate, prediction, or generalization about a population based
on a sample
• Also known as inductive statistics
• Underlying inferential statistics is probability
Parameters vs Statistics
• What we are typically after in a study is the parameter.

• A parameter is a numerical value that states something about the entire population being studied.

• For example, we may want to know the mean wingspan of the American bald eagle.

• This is a parameter because it is describing all of the population.
Parameters vs Statistics
• Parameters are difficult if not impossible to obtain exactly.

• On the other hand, each parameter has a corresponding statistic that can be measured exactly.

• A statistic is a numerical value that states something about a sample.

• To extend the example above, we could catch 100 bald eagles and
then measure the wingspan of each of these. The mean wingspan of
the 100 eagles that we caught is a statistic.

Parameters vs Statistics
• The value of a parameter is a fixed number.

• In contrast to this, since a statistic depends upon a sample, the value of a statistic can vary from sample to sample.

• Suppose our population parameter has a value, unknown to us, of 10. One sample of size 50 has the corresponding statistic with value 9.5. Another sample of size 50 from the same population has the corresponding statistic with value 11.1.
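
A small simulation can make this sample-to-sample variation concrete. This is only a sketch: the population is synthetic (normally distributed values centred near 10, an assumption made purely for illustration), and the seed is fixed so that the run is repeatable.

```python
import random
import statistics

random.seed(621)  # arbitrary fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values centred near 10 (illustrative assumption).
population = [random.gauss(10, 3) for _ in range(100_000)]
parameter = statistics.mean(population)  # a fixed number for this population

# Two independent samples of size 50 drawn from the same population.
sample_a = random.sample(population, 50)
sample_b = random.sample(population, 50)
statistic_a = statistics.mean(sample_a)
statistic_b = statistics.mean(sample_b)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic from sample A:     {statistic_a:.2f}")
print(f"statistic from sample B:     {statistic_b:.2f}")
```

The parameter stays the same on every run with the same population, while the two sample means differ from it and from each other.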

Parameters vs Statistics
• A parameter measures something in a population.

• A statistic measures something in a sample.

Census or Sample
• Census
• Collecting data for every member of the group we are interested in
(whole population)

• Sample
• Collecting data just for selected members of the group
Census or Sample - Example
• There are 120 people in your local cricket club.

• You can ask everyone (all 120) what their age is – this is a census.

• Or you could just choose the people that are there that particular afternoon – that is a sample.

• What could be a biased sample in this case?

• If you could do a census, would you need statistics?
Census or Sample - Example
• A census is accurate, but hard to do (remember face
recognition)

• A sample is not as accurate, but may be good enough, and is a lot easier
What is Data?
• A collection of facts such as:
• Numbers
• Words
• Measurements
• Observations
• Or even just descriptions of things
Data Types
• It is important to understand the various types of data or
variables.

• It helps guide you to select the correct statistical technique for analyzing your data.

• Two types: Qualitative and Quantitative


Data: Qualitative vs Quantitative
Qualitative data:
• Descriptive or categorical information
(it describes something)
• Math operations are meaningless

Quantitative data:
• Numerical information (numbers)
• Math operations are meaningful
Quantitative Data
Discrete data:
• Can only take certain values (like whole numbers)
• Counted

Continuous data (real-valued):
• Can take any value (within a range)
• Infinitely many possible values
• Uncountable
• Measured (Why are measured values continuous?)
Class Activity
E.g.: What do we know about the
cat?
• Qualitative:
• He is grayish
• He has long hair
• He has lots of energy
• Mean looking
• Quantitative:
• Discrete:
• He has 4 legs
• He has 1 tail etc.
• Continuous
• He weighs 2.3 kg
• He is 10 inches tall etc.
Which of these are Categorical or
Numerical?
• Your friends’ favorite holiday destination
• Height
• Petals on a flower
• The most common given names in your town
• Phone numbers
• How people describe the smell of a new perfume
• Weight
• Postcodes (like 74700)
• Customers in a shop
Four Levels of Measurements
• Nominal data
• Binary & Non binary

• Ordinal data

• Interval data

• Ratio data
Nominal (aka Categories)
• This level is the most primitive, lowest or the most limited type
of measurement

• From Latin nomen meaning name

• Nominal data are items that are just names or categories

• The terms nominal level of measurement and nominally scaled are used for this type of data
Nominal (aka Categories)
• The only thing a nominal scale does is to say that items being
measured have something in common, although this may not
be described.

• Usually categorical – they belong to a definable category such as ‘employees’.

• You cannot do arithmetic on them.

• You cannot order them.


Nominal Examples
• Color like red, blue, etc.

• Religion like Islam, Christianity, etc.

• Gender like male, female.


Nominal Data
• Data can only be classified into
categories
• The information is simply a count
• The categories are considered
mutually exclusive
• An individual or item that by virtue of
being included in one category must
be excluded from another
• Exhaustive
• Each person or item must appear in
at least one category
Nominal Data
• Nominal items may have numbers assigned to them

• This may appear ordinal but is not – numbers are used only to
simplify capturing and referencing

• E.g.: Marital status: single, married, widowed, divorced

• These may be assigned numbers like 1, 2, 3, & 4 but these cannot be manipulated arithmetically

• Other examples: a set of countries etc.


Ordinal Data
• An order operation is defined on the data by their position on
the scale

• This may indicate position or superiority etc.

• The order of items is often defined by assigning numbers to them to show their relative position

• You cannot do arithmetic with ordinal numbers – they show only sequence
Ordinal Data (Examples)
• the first, third and fifth person in a race

• Pay scales in an organization as denoted by A, B, C and D etc.

• Ranking of course outcomes etc.

• Course grades: A, B , C, D etc.


Ordinal Data
• A major difference between a nominal and an ordinal level of
measurement is the “greater than” relationship between ordinal
level categories.

• Otherwise it has the same characteristics as the nominal scale: mutually exclusive and exhaustive.

• The differences between ranks are not meaningful (e.g. the finishing positions of four cars in a racing game).
Interval Data
• Interval data is measured along a scale in which each position
is equidistant from one another

• The differences are meaningful.

• This allows for the distance between two pairs to be equivalent in some way
Interval Data
• The distances make sense.

• Arithmetic operations (addition, subtraction) make sense.


Interval Data (Example)
• E.g.: Temperature, in degrees Celsius or Fahrenheit.

• Time (if measured during the day or using a 12-hour clock)

• The numbers on a wall clock are on an interval scale since they are
equidistant and measurable. For example, the difference between 1
o’clock and 2 o’clock is the same as that between 2 o’clock and 3 o’clock.

• IQ test score. There is no true zero IQ, but scores are otherwise measured along a fixed scale.

• CGPA. The intervals in the CGPA are also equidistant.


Interval Data
• No natural zero.

• Interval data cannot be multiplied or divided.


Ratio Data
• In a ratio scale, numbers can be compared as multiples of one
another

• Thus one person can be twice as tall as another person

• Important also, the number zero has meaning


Ratio Data
• Ratio data can be multiplied and divided because not only is the
difference between 1 and 2 the same as between 3 and 4, but
also that 4 is twice as much as 2

• Interval and ratio data measure quantities and hence are quantitative. Because they can be measured on a scale, they are also called scale data
Ratio Data
• E.g.: A person’s weight
• E.g.: Temperature in Kelvin scale
• E.g.: Money in a bank account
Temperature Celsius and
Fahrenheit are not Ratio Data
• See separate PDF document titled “Ratio versus Interval Data”
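
As a quick illustration of the point (a sketch using the standard conversion formulas, not taken from the separate PDF), the ratio of two temperatures changes with the unit when the scale has no natural zero. The two temperatures below are hypothetical.

```python
def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def c_to_k(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15

warm_c, cool_c = 20.0, 10.0  # hypothetical temperatures in Celsius

ratio_c = warm_c / cool_c                   # 2.0 in Celsius
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)   # 68 / 50 in Fahrenheit
ratio_k = c_to_k(warm_c) / c_to_k(cool_c)   # 293.15 / 283.15 in kelvin

print(f"Celsius ratio:    {ratio_c:.3f}")
print(f"Fahrenheit ratio: {ratio_f:.3f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")
```

The "twice as hot" reading holds only in Celsius and disappears in the other units, which is why only the Kelvin scale, with its meaningful zero, supports ratio statements.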
Fox News – If Bush Tax Cuts
Expire
Calculations for Different Data
Types
                                                       Categorical          Continuous
OK to Compute…                                         Nominal   Ordinal    Interval   Ratio
Frequency distribution                                 Yes       Yes        Yes        Yes
Median and percentiles                                 No        Yes        Yes        Yes
Add or subtract                                        No        No         Yes        Yes
Mean, standard deviation, standard error of the mean   No        No         Yes        Yes
Ratio, or coefficient of variation                     No        No         No         Yes
How to Visualize Data
Nominal Data
• When we are dealing with nominal data, we collect information
through:
• Frequencies
• The rate at which something occurs over a period of time

• Proportion
• Obtained by dividing the frequency by the total number of events (i.e.
how often something happened divided by how often it could happen)

• Percentage
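
The three summaries above can be sketched in a few lines. The colour observations are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

# Hypothetical nominal observations (favourite colours of 10 people).
colours = ["red", "blue", "red", "green", "blue",
           "red", "red", "green", "blue", "red"]

frequency = Counter(colours)            # how often each category occurs
total = sum(frequency.values())         # total number of observations
proportion = {c: n / total for c, n in frequency.items()}
percentage = {c: 100 * p for c, p in proportion.items()}

for c in frequency:
    print(f"{c:6s} frequency={frequency[c]}  "
          f"proportion={proportion[c]:.2f}  percentage={percentage[c]:.0f}%")
```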
Nominal Data
Ordinal Data
• Same visualization methods as for nominal.
Numerical Data
Nature of Statistical Data
Measures of Central Tendency
Summarizing Statistical Data
• There is a need for a single measurement which may describe
the chief characteristics of the entire data set.

• Measures of central tendency.


• They tell us the point about which data tend to cluster.
• Such measurements are generally in the central part of the data
distributions.
• Well, that depends upon how we define the “center”.
Average
• The average is a measure of central tendency
• What kind?

• You see averages everywhere in newspapers and the media. You have even heard of the "average person".
How We Use the Word “Average”
• She is an average student.
• Zaid is average looking.
• He has a high batting average.
• I sleep six hours a day on average.
• What’s the average temperature here?
• On the average, I go to play tennis once a week.
• How much time does an average teenager spend on Facebook?
• His work is below average.
• It’s an average day at work.
• His ability in English is above average.
How We Use the Word “Average”
• Stop thinking of me as just an average person.
• His perfect score brought the class average up.
• On average, how many miles do you walk a day?
• What is the average rainfall for July here?
• What is the average life span in Japan?
• She earns on average ten pounds a week.
• He studies ten hours a day on average.
• My school grades were average.
• Hasan is shorter than average.
• The average life of a dog is ten years.
Average
• The object of an average is to represent a group of data in a simple and
concise manner so that the mind may get an idea of the general size of
other items in the group and thereby to render comparison easy.

• In other words, the main function of an average is to act as the most representative figure for the entire mass of homogeneous data.

• It is obvious that all types of averages cannot be equally representative.

• We then have to choose the average that is best suited to the problem at hand.
Average
• Types of averages (based on how we define the concept of a
center)

• Mean

• Median

• Mode
Mean
• A mean is a number that can be used in place of each number
in a set, for which the NET effect will be the same as that of the
original set of numbers.

• What determines which mean to use is the way in which the numbers act together to produce that net effect.
Mean - Types
• Arithmetic mean.

• Geometric mean.

• Harmonic mean.
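
As a rough sketch of how the three means correspond to different "net effects", the snippet below uses Python's standard `statistics` module. The growth factors and speeds are hypothetical values chosen for illustration; the resistor values are the ones used in the series-resistor example in these slides.

```python
import statistics

# Additive net effect (e.g. resistors in series): arithmetic mean.
series = [60, 40]
arithmetic = statistics.mean(series)            # 50, and 50 + 50 == 60 + 40

# Multiplicative net effect (hypothetical yearly growth factors): geometric mean.
growth = [1.10, 1.20]
geometric = statistics.geometric_mean(growth)   # g such that g * g == 1.10 * 1.20

# Rates over equal distances (hypothetical speeds in km/h): harmonic mean.
speeds = [40, 60]
harmonic = statistics.harmonic_mean(speeds)     # about 48 km/h for the whole trip

print(arithmetic, round(geometric, 4), round(harmonic, 4))
```

In each case the mean is the single value that, used in place of every item, reproduces the same net effect (the same sum, the same product, or the same total travel time).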
Arithmetic Mean
• For example, if you are looking for a mean amount of rainfall,
you note the total amount of rain, which affects crop growth etc.,
by ADDING the daily numbers.

• So, if you add them up and divide by the number of days, the
resulting ARITHMETIC Mean is the amount of rain you could
have had on EACH of those days to get the same total.
Arithmetic Mean - Example
• 10-year monthly rainfall data in inches of rain:

3.2 3.1 2.9 3.7 2.9 4.1 3.5 2.8 2.9 1.7

3.2 + 3.1 + 2.9 + 3.7 + 2.9 + 4.1 + 3.5 + 2.8 + 2.9 + 1.7 = 30.8
Average = 30.8 / 10 = 3.08 ≈ 3.1

3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1 3.1
3.1 + 3.1 + 3.1 + 3.1 + 3.1 + 3.1 + 3.1 + 3.1 + 3.1 + 3.1 = 31 ≈ 30.8
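
The same calculation can be checked with a short sketch. It uses the rainfall values from the example and shows that replacing every value by the unrounded mean (3.08) reproduces the total exactly, apart from floating-point rounding.

```python
rainfall = [3.2, 3.1, 2.9, 3.7, 2.9, 4.1, 3.5, 2.8, 2.9, 1.7]  # inches, from the example

total = sum(rainfall)
mean = total / len(rainfall)

# Replacing every value by the mean gives the same total (up to rounding error).
equalised_total = mean * len(rainfall)

print(f"total = {total:.1f}, mean = {mean:.2f}, equalised total = {equalised_total:.1f}")
```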
Arithmetic Mean - Example
• Money spent (Rs) each day for a week:

10 20 25 15 30 27 31
10 + 20 + 25 + 15 + 30 + 27 + 31 = 158
Average = 158 / 7 = 22.6

22.6 22.6 22.6 22.6 22.6 22.6 22.6

22.6 + 22.6 + 22.6 + 22.6 + 22.6 + 22.6 + 22.6 = 158.2 ≈ 158
Arithmetic Mean
• Another example is that of connecting resistors in series

• One of value 60 ohms and the other of value 40 ohms

• Their net effect is the sum of them: 60 + 40 = 100

• The arithmetic mean resistor value is: (60+40)/2 = 50

• That is, each resistor can be replaced by the mean value to achieve the same net effect of 100
Arithmetic Mean
• It serves as a smoothing operation.

• In the previous examples, the variations in data were smoothed out and equalized.
Arithmetic Mean
• The concept of center is that of the “center of gravity”

• Simple Arithmetic mean

• Weighted Arithmetic mean


Arithmetic Mean
• The weights of adults in a group are given as (in pounds):
• 274 235 223 268 290 285 235
Weighted Arithmetic Mean
• It is the mean of a data set whose entries have varying weights

• It is a mean which is obtained by applying to the items weights as judged by their relative importance
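
A minimal sketch of a weighted arithmetic mean follows. The course components, scores, and weights are hypothetical and are not taken from the slides; they only show how items of different importance contribute differently.

```python
# Hypothetical course components: (score out of 100, weight).
components = {
    "assignments": (80, 0.20),
    "midterm":     (70, 0.30),
    "final":       (90, 0.50),
}

weighted_sum = sum(score * weight for score, weight in components.values())
total_weight = sum(weight for _, weight in components.values())
weighted_mean = weighted_sum / total_weight

# For comparison: the simple mean treats every component as equally important.
simple_mean = sum(score for score, _ in components.values()) / len(components)

print(f"weighted mean = {weighted_mean:.1f}, simple mean = {simple_mean:.1f}")
```

Dividing by the total weight keeps the sketch valid even when the weights do not add up to 1.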
Weighted Arithmetic Mean
• Example:
Mean - Applications
• Human visual system

• Example NICTA image


Example -
Recognize?
Dithering
Example NICTA Image
Color Dithering
• An illustration of color dithering

• Only red and blue are used

• As the red and blue patches are made smaller, the patch appears purple
LCD Monitor Pixels
• Each pixel consists of three lights: red, blue, and green

• These three lights with different intensities are combined by our visual system to generate the perception of different colors

• Check it out by putting a drop of water on a white background on the screen; the drop acts as a magnifying lens and reveals the three colors
Mean - Advantages
• Simplest measure of central tendency – by calculation as well
as intuitively

• Other important statistical measures are calculated based on the mean

• Takes all items in the data set into account – hence it is representative of all items
Mean - Advantages
• Does not require grouping or arrangement of items in any
particular order

• The sum of the deviations of various items from the mean is 0
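
The zero-sum property of deviations can be checked numerically. The sketch below reuses the adult weights from the earlier arithmetic mean slide.

```python
weights = [274, 235, 223, 268, 290, 285, 235]  # pounds, from the earlier slide

mean = sum(weights) / len(weights)
deviations = [w - mean for w in weights]       # signed distance of each item from the mean

# Positive and negative deviations cancel out (zero up to floating-point rounding).
print(f"mean = {mean:.2f}, sum of deviations = {sum(deviations):.10f}")
```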


Mean - Disadvantages
• The mean is not necessarily an item in the data set.

• The mean does not even have to be a possible value of the variable being measured.

• The arithmetic mean can only make sense if the data set is
perfectly homogeneous.

• Heavily affected by outliers.


Mean - Disadvantages
• It is not true that half of every list has to be below the mean.

• Example list: 3, 8, 9, 10 mean = 7.5

• Only ¼ of this list is below the mean.

• That means that if a student’s test score is above average, that student may not necessarily be in the top half of the class
Mean - Problem
List (Rs): 1, 2, 3, 4 Mean = Rs 2.5

Suppose one of the people gets another Rs 100.

What happens to the mean? Does it go up or down?


It goes UP.
By how much?
By Rs 100 / 4 = Rs 25 (Divide the new amount equally among the 4)

The new mean is the old mean plus the change:


Rs 2.5 + Rs 25 = Rs 27.5
Mean - Problem
A class of 30 students has a mean score of 65 on the midterm.
Two students ask for their papers to be regraded. After the
regrading, one student’s score increases by 10 points and the
other comes down by 4 points. What happens to the class
average?

The change to the total score is 10 − 4 = 6.


This change gets split evenly among the 30 students: 6/30 = 0.2.
The class average becomes 65 + 0.2 = 65.2.
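
Both problems use the same shortcut: when the number of items stays fixed, the mean moves by the total change divided by the number of items. A small sketch (the function name `updated_mean` is just an illustrative choice):

```python
def updated_mean(old_mean, n, total_change):
    """New mean after the sum of n values changes by total_change (n unchanged)."""
    return old_mean + total_change / n

# Rs example: list 1, 2, 3, 4 and one person gets another Rs 100.
print(f"{updated_mean(2.5, 4, 100):.1f}")      # 27.5

# Regrading example: one score up by 10, another down by 4, class of 30.
print(f"{updated_mean(65, 30, 10 - 4):.1f}")   # 65.2
```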
When Not to Use the Mean?
• Consider the wages of staff at a factory below:

• The mean salary for these 10 staff is Rs 30.7k.


• But this does not represent well the typical salary of a worker.
• Most of the workers have salaries in the Rs 12k to 18k range.
• The mean is being skewed by the two large salaries.
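
The effect can be reproduced with a sketch. The salary figures below are hypothetical stand-ins, chosen only so that they agree with the stated mean of Rs 30.7k, with most salaries in the Rs 12k to 18k range and two large ones; they are not the values from the slide's table.

```python
import statistics

# Hypothetical salaries in Rs thousands (assumed values, consistent with the
# stated mean of 30.7; not the original table).
salaries = [12, 13, 14, 14, 15, 15, 16, 18, 90, 100]

mean_salary = statistics.mean(salaries)       # pulled up by the two large salaries
median_salary = statistics.median(salaries)   # stays with the typical worker

print(f"mean = Rs {mean_salary:.1f}k, median = Rs {median_salary:.1f}k")
```

With these assumed values the median lands inside the range where most workers are paid, while the mean does not.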
Median
• The median of a data set is the value that lies in the middle of
the data when the data set is ordered

• The center is at the middle count-wise


Median – Example1 (Odd Count)
• Find the median weight of the following data set:
• 274 235 223 268 290 285 235

• First sort the data set into ascending order:


• 223 235 235 268 274 285 290

• The median is the central element (count-wise): 268


Median – Example 2 (Even Count)
• Find the median weight:
• 223 235 235 268 274 290

• Median = (235 + 268) / 2 = 251.5
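
Both median examples can be verified with the standard library. `statistics.median` sorts the data internally and averages the two middle values when the count is even.

```python
import statistics

odd_count = [274, 235, 223, 268, 290, 285, 235]    # Example 1 data
even_count = [223, 235, 235, 268, 274, 290]        # Example 2 data

print(sorted(odd_count))                # [223, 235, 235, 268, 274, 285, 290]
print(statistics.median(odd_count))     # 268
print(statistics.median(even_count))    # 251.5
```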


Median - Advantages
• It is unaffected by items on the extremes

• Very simple calculation and easy to understand

• It is possible to arrange students according to their capability or intellect in a certain subject as judged by someone and find the middle student as representing the class as a whole
Median - Disadvantages
• Requires as a prerequisite the sorting of items which is
cumbersome particularly if the data set is large

• Less stable measure than the mean as it is more subject to chance variations (this depends on the specific problem).
Median – Applications
Removing Noise: Mean Filter vs
Median Filter
• Filtering is often used to remove noise
from images

• Sometimes a median filter works better than a mean filter

• One of the simplest spatial filtering operations we can perform is a smoothing operation
• Simply averaging all of the pixels in a neighborhood around a central value
• The mean filter is a 3×3 kernel in which every weight is 1/9:

1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
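
A minimal pure-Python sketch of the two filters follows. The 5×5 image and its single bright noise pixel are hypothetical, and border pixels are simply left unchanged (one of several possible border conventions).

```python
import statistics

# Hypothetical 5x5 grey-level image: flat value 10 with one bright noise pixel (255).
image = [[10] * 5 for _ in range(5)]
image[2][2] = 255

def filter3x3(img, reduce_fn):
    """Apply reduce_fn to each 3x3 neighbourhood; border pixels are left unchanged."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = reduce_fn(window)
    return out

mean_filtered = filter3x3(image, statistics.mean)      # same as weighting each pixel by 1/9
median_filtered = filter3x3(image, statistics.median)

print(f"mean filter at noise pixel:   {mean_filtered[2][2]:.2f}")   # (8 * 10 + 255) / 9
print(f"median filter at noise pixel: {median_filtered[2][2]}")     # 10
```

The mean filter spreads the noise over the neighbouring pixels, whereas the median filter discards it, because a single extreme value never reaches the middle of the sorted window.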
Image Filtering Example

[Figure: original image; image with noise; image after mean filter; image after median filter]
[Figure: original noisy image compared with the mean filter and median filter results]
Mode
• The mode looks for the most commonly occurring value

• 274 235 223 268 290 285 235

• For example, in the above set the mode is 235


Mode
Consider the mode as the
most popular option.
Mode

Normally used for categorical data.

When we wish to know the most common category.
Mode
• A data set having only one mode is called unimodal.

• A data set can have one mode, several modes, or no mode at all

• A data set having two modes is known as bimodal, and one having more than two modes is known as a multi-modal data set
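
These cases can be explored with `statistics.multimode`, which returns every most-frequent value. Apart from the weights data set used in the earlier slides, the lists below are hypothetical.

```python
import statistics

weights = [274, 235, 223, 268, 290, 285, 235]   # data set from the earlier slides
print(statistics.multimode(weights))            # [235]   one mode: unimodal

bimodal = [1, 2, 2, 3, 4, 4, 5]                 # hypothetical data
print(statistics.multimode(bimodal))            # [2, 4]  two modes: bimodal

colours = ["red", "blue", "red", "green"]       # hypothetical categorical data
print(statistics.mode(colours))                 # red
```

When every value occurs equally often, `multimode` returns all of the values, which corresponds to the "no mode" case.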
Mode
Not unique.

Not so good with continuous data.


Skewed Distributions
• If the data is normally distributed and symmetrical (bell curve)
then the mean, median, and mode are the same.
Skewed Distributions

Median is the preferred measure of center in skewed distributions.
Summary of When to Use the
Mean, Median, and Mode
Question
• While travelling for sight-seeing, you come across a river that
you want to cross to get to the other side. You don’t have
access to any boat or ferry, and you don’t know how to swim.
You start walking along the bank of the river while looking for a
bridge to cross. Soon you spot a sign board that has been put up by
government officials managing that area. The sign board says
the following:
Question
• “The department of statistics has very carefully measured the
depth of the river at intervals of every 3 feet from one edge of
the river to the opposite at this point. The average depth of the
river across this point was accurately found to be 2 feet and the
river is 300 feet wide from here.”

• Would you cross the river from that point without a boat or
a bridge by stepping into the water?
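
One way to think about the question is to build a depth profile that agrees with the sign. The profile below is entirely hypothetical: 101 readings (one every 3 feet across 300 feet), mostly shallow, with a narrow deep channel.

```python
# Hypothetical depth readings in feet, one every 3 feet across a 300-foot river.
depths = [1.0] * 45 + [11.1] * 10 + [1.0] * 46

mean_depth = sum(depths) / len(depths)
max_depth = max(depths)

print(f"readings = {len(depths)}, mean depth = {mean_depth:.1f} ft, "
      f"deepest point = {max_depth:.1f} ft")
```

Under these assumed readings the average is still 2 feet, yet part of the crossing is far deeper than a person who cannot swim could wade through; the average alone says nothing about the spread.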
How Many
Jellybeans are
in the Jar?
The End
