Introduction To Statistics - Chapter 3-5 Notes


Please sign in (SIGNATURES) as you come in to class. Signing in saves my
voice, since I won’t have to take attendance aloud (this is only to settle
the class roster).

Introduction to Statistics
Lecture Notes
Chapters 3-5
What’s up with the PowerPoint?
 I don’t usually use slides, but I am going to try these
to save my voice somewhat.

 Notes: Still working on getting the class roster settled. There has
been some movement on the waitlist; I will keep in touch as
things develop. Be sure you’ve signed in!
 First homework is posted (on our course website), but
isn’t due until next Friday (the 4th). The additional
problem is NOT optional; “additional” just means it is not a
book problem.
Handouts for Today
 There is one handout on graphs/descriptive statistics going
around. Save this to use tomorrow in class.

 There is a second handout – the anonymous survey largely
designed by the class on Monday. Please go ahead and
take a few minutes to fill this out (no names!) and get it
back to me. We’ll take a look at this data next week in lab.

 If you missed class Monday, I have extra course
syllabuses at the front as well.
The “W”’s of a Data Set
 Who – the observations. The population is the set of all objects
for which you want the value of some parameter. Since we usually
can’t observe all objects, we take a sample: a subset of the
population that we actually observe.
 Note: There is NO such thing as a population sample or sample
population.
 What – the variables
 Why – why was the data collected
 How – how was the data collected (related to design/sampling
in chapters 12-13)
 When/Where – more information that could be relevant
Chapters 3-5 Overview
 Covers basic graphs and descriptive statistics for both
categorical and quantitative variables
 This is what you would do as a “preliminary analysis” for
a variable.

 Recall: a data set can have multiple variables in it.

 These chapters focus mostly on univariate (single
variable) analyses. There is one comparative graph – a
side-by-side boxplot in Chapter 5.
3 Rules of Data Analysis
 Rule 1 – Make a picture
 Rule 2 – Make a picture (really, before you do anything
else)
 Rule 3 – Make a picture (really, we mean a well-chosen
picture for your variables)
Categorical Variable Prelim Analysis
 Frequency tables (one variable) – summarize counts by
category
 Contingency tables (2 or more variables) – summarize
counts by category for multiple variables
 Bar charts
 Pie charts
Frequency
 What is frequency?
 Frequency is the number of objects/cases per category
 You can also look at relative frequency.
 Relative frequency is the number of objects/cases per category
divided by the total number of objects.
 Hence it gives proportions for each category out of the total.
 It is often converted to %.
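If it helps to see the arithmetic, here is a minimal Python sketch of frequency and relative frequency. The data (class year for ten students) is hypothetical and made up for illustration only.

```python
from collections import Counter

# Hypothetical categorical data: class year for ten students.
class_years = ["first-year", "sophomore", "sophomore", "junior", "first-year",
               "senior", "sophomore", "junior", "sophomore", "first-year"]

# Frequency: the number of cases in each category.
frequency = Counter(class_years)

# Relative frequency: each count divided by the total number of cases.
total = sum(frequency.values())
relative_frequency = {category: count / total
                      for category, count in frequency.items()}

# Print a small frequency table, with relative frequency shown as a percent.
for category, count in frequency.items():
    print(f"{category:<12} {count:>3} {relative_frequency[category]:>6.0%}")
```

The relative frequencies always add up to 1 (that is, 100%), which is a quick way to check a table you build by hand.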
Bar Charts
 One bar per category – height is determined by frequency
or relative frequency
 Order of categories is arbitrary.
 Does NOT let you talk about the shape of a distribution.

 “Area” principle – the area of each bar should be proportional to
the value it represents. This is often violated when people try to
make graphs “cool” and go 3-D, etc. (see Example passed around).
Pie Charts
 Take 100% of cases and divide up 360 degrees based on
relative frequencies.

 We will favor bar charts over pie charts.

 Note that for bar charts you do not need to create bars for
100% of the cases. You could look at the top three risk
factors for a disease, etc. However, we usually do have
100% of cases shown.
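The pie chart rule above (divide up 360 degrees based on relative frequencies) can be sketched in a few lines of Python. The four categories and their shares are hypothetical.

```python
# Hypothetical relative frequencies for four categories (they sum to 1).
relative_frequency = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}

# Each category gets its share of the full 360-degree circle.
slice_degrees = {category: share * 360
                 for category, share in relative_frequency.items()}

for category, degrees in slice_degrees.items():
    print(f"{category}: {degrees:.0f} degrees")
```

Because the shares cover 100% of the cases, the slices add up to the whole circle. That is also why a pie chart, unlike a bar chart, cannot show only the top few categories.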
Contingency Tables - Example
 See first page of Handout
 Totals for rows/columns give marginal distributions for
each variable.
 You can also look at conditional distributions. Fix a row
or column and work solely within that row or column.

 Concept of independence (will formalize later):
 If the distribution of one variable is the same for all categories
of another variable, then the two variables are independent.
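Here is a minimal Python sketch of marginal and conditional distributions for a two-variable table. The table (housing by car ownership) and its counts are hypothetical; they are not the handout data.

```python
# Hypothetical contingency table: counts of students by housing (rows)
# and car ownership (columns).
table = {
    "on campus":  {"car": 10, "no car": 30},
    "off campus": {"car": 15, "no car": 45},
}
columns = ["car", "no car"]
grand_total = sum(sum(cells.values()) for cells in table.values())

# Marginal distributions: row totals and column totals as proportions.
row_marginal = {r: sum(cells.values()) / grand_total
                for r, cells in table.items()}
col_marginal = {c: sum(table[r][c] for r in table) / grand_total
                for c in columns}

# Conditional distributions: fix a row and work only within that row.
conditional_by_row = {
    r: {c: cells[c] / sum(cells.values()) for c in columns}
    for r, cells in table.items()
}

# Informal independence check: is the conditional distribution of car
# ownership the same for every housing category?
first = next(iter(conditional_by_row.values()))
same_for_all_rows = all(
    abs(dist[c] - first[c]) < 1e-9
    for dist in conditional_by_row.values() for c in columns
)

print("Row marginals:", row_marginal)
print("Column marginals:", col_marginal)
print("Conditional on row:", conditional_by_row)
print("Same distribution in every row?", same_for_all_rows)
```

In this made-up table, 25% of each housing group owns a car, so the conditional distributions match and the table is consistent with independence. With real data the rows will rarely match exactly, which is part of why the concept gets formalized later.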
On Your Own
 Text has some discussion of segmented bar charts and
side-by-side bar charts (feel free to read or skip)
Simpson’s Paradox
 Something that can happen when you aggregate categorical
data
 Looking at overall averages or % can be misleading
 Can get different results looking at breakdown
 Berkeley Discrimination Data Example (see bottom of page
one of the handout)
 Claims of Sexual Discrimination in 1973 Graduate School
Admissions
 Overall, 44.28% of males who applied were admitted, while only
34.58% of females were admitted.
 Look what happens when you break down by the 6 largest
departments though! (try this on your own or with a partner). Is there
evidence of discrimination against females at the dept. level? What is
going on?
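To see how aggregation can flip a comparison, here is a minimal Python sketch with made-up numbers. The counts are HYPOTHETICAL (they are NOT the Berkeley data), and the two department names are placeholders.

```python
# HYPOTHETICAL counts, not the Berkeley data: (admitted, applied) per group.
applications = {
    "Dept X": {"men": (80, 100), "women": (18, 20)},
    "Dept Y": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applied):
    """Admission rate as a proportion."""
    return admitted / applied

# Admission rate within each department.
by_department = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in applications.items()
}

# Admission rate after aggregating over departments.
overall = {}
for group in ("men", "women"):
    admitted = sum(applications[dept][group][0] for dept in applications)
    applied = sum(applications[dept][group][1] for dept in applications)
    overall[group] = rate(admitted, applied)

for dept, rates in by_department.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"Overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

In this made-up example, women have the higher admission rate in each department, yet the lower rate overall, because most of the women applied to the department that admits few applicants. Whether something similar happens in the handout data is what the exercise asks you to check.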
Quantitative Variables Preliminary Analysis
 Graphs
 Dot plot – won’t use much – read about on your own
 Stem and leaf – won’t use much – read about on your own
 Histogram
 Boxplot (chapter 5)
 Q-Q plot (Friday or next week)
 Time plot (Friday or next week)
 Descriptive statistics
 Measures of center: mean, median
 Measures of spread: standard deviation, IQR, range
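The measures of center and spread listed above can be computed with Python's standard `statistics` module. This is a minimal sketch using the pre-test scores from the stem-and-leaf exercise in these notes. Quartile conventions differ between textbooks and software, so the IQR from this sketch (which uses the module's default method) may not match what your book or R reports.

```python
import statistics

# Pre-test scores from the stem-and-leaf exercise in these notes.
scores = [5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49]

# Measures of center.
mean = statistics.mean(scores)
median = statistics.median(scores)

# Measures of spread.
stdev = statistics.stdev(scores)          # sample standard deviation
q1, q2, q3 = statistics.quantiles(scores, n=4)  # default quartile method
iqr = q3 - q1
value_range = max(scores) - min(scores)

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"stdev  = {stdev:.2f}")
print(f"IQR    = {iqr}")
print(f"range  = {value_range}")
```

Notice that the mean comes out below the median here: the few low scores pull the mean down, which is what you would expect for a distribution with a tail to the left.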
Describing the distribution of a quantitative
variable
 You should focus on three things when describing the
distribution of a quantitative variable:
 Shape – unimodal (one peak), bimodal (two peaks), multimodal
(many peaks), bell-shaped, skewed left (tail to the left), skewed
right (tail to the right), symmetric, uniform (no peaks, basically
flat)
 Center – estimate the center (or use a descriptive statistic)
 If multiple peaks, report the peak locations
 Spread – estimate the spread (can use a descriptive statistic)
Dot Plot – On Your Own
 Most basic quantitative graph
 Use for a low number of observations (<50)
 Basically use a number line and place a dot above it for
each value you have observed.
 [Figure: example dot plot from Wikipedia]
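Separately from the Wikipedia figure, here is a minimal text-only Python sketch of the idea: a number line with one mark stacked above it for each observed value. The data is hypothetical, and the simple spacing assumes single-digit whole-number values.

```python
from collections import Counter

# Hypothetical small data set (single-digit whole numbers, for alignment).
values = [2, 3, 3, 4, 4, 4, 5, 5, 7]

counts = Counter(values)
low, high = min(values), max(values)
tallest = max(counts.values())

# Build the plot from the top row down: a mark appears at a value when
# that value was observed at least `level` times.
rows = []
for level in range(tallest, 0, -1):
    row = " ".join("*" if counts[v] >= level else " "
                   for v in range(low, high + 1))
    rows.append(row.rstrip())

# The number line goes underneath the marks.
rows.append(" ".join(str(v) for v in range(low, high + 1)))

dot_plot = "\n".join(rows)
print(dot_plot)
```

Values that were never observed (6 in this made-up data) still get a spot on the number line, so gaps stay visible.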
Stem and Leaf – On Your Own
 Your book discusses lots of options for these, including
split leaves (which is something R/Rcmdr will do).
 Basics: You take your values and set a stem – maybe tens.
Then the leaves are the ones place. For each stem, you list
its leaves in numeric order.
 Usually works decently for fewer than 100 observations
 Try it. Suppose you have scores on a pre-test for an at-risk
youth group as follows:
 5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49
Histogram
 Take the quantitative variable and break it up into “piles” or
“bins” (usually the same width).
 Count the number of observations in each bin or pile.
 Plot the frequencies per bin.
 Usually there are no spaces between bins (if there is a space, it
is a real gap in the data, unlike the spacing in a bar chart).
 You DO need to know the boundaries. (5,10], (10,15] as bins
IS different from [5,10), [10,15). (If anyone needs me to
explain open/closed brackets, please ask).
 Technology lets us vary the width of bins (and so, effectively,
the number of bins)
 You can also use unequal bin widths but then you need
something called density, not frequency.
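Two of the points above are easier to see with numbers: why the bin boundaries matter, and what density does for unequal bin widths. This is a minimal Python sketch with hypothetical observations. The function names are made up for illustration, and density is taken here as relative frequency divided by bin width, which is one common definition.

```python
# Hypothetical observations; note that 10 sits exactly on a bin boundary.
values = [6, 7, 10, 10, 12, 15]
edges = [5, 10, 15]  # two bins of width 5

def count_right_closed(values, edges):
    """Counts for bins (a, b]: the right endpoint is included."""
    return [sum(1 for v in values if lo < v <= hi)
            for lo, hi in zip(edges, edges[1:])]

def count_left_closed(values, edges):
    """Counts for bins [a, b): the left endpoint is included."""
    return [sum(1 for v in values if lo <= v < hi)
            for lo, hi in zip(edges, edges[1:])]

def densities(counts, edges):
    """Relative frequency per unit of bin width, so bar AREAS sum to 1."""
    total = sum(counts)
    return [count / total / (hi - lo)
            for count, (lo, hi) in zip(counts, zip(edges, edges[1:]))]

# Same data, same edges, different boundary convention: different counts.
right_counts = count_right_closed(values, edges)  # bins (5,10], (10,15]
left_counts = count_left_closed(values, edges)    # bins [5,10), [10,15)
print("(5,10], (10,15]:", right_counts)
print("[5,10), [10,15):", left_counts)

# Unequal bin widths: (5,10] has width 5 and (10,20] has width 10.
unequal_edges = [5, 10, 20]
unequal_counts = count_right_closed(values, unequal_edges)
unequal_density = densities(unequal_counts, unequal_edges)
print("counts:", unequal_counts, "densities:", unequal_density)
```

With the left-closed bins, the value 15 falls outside the last bin entirely, so the counts no longer add up to the number of observations. With unequal widths, plotting density instead of frequency keeps the area principle intact: a wide bin does not look "bigger" just because it is wide.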
Examples
 See page 2 of the handout
 Try to describe the shape of each histogram

 Then see page 3 of the handout


 We’re going to create a histogram by hand if there is time
 If no time, you can do this on your own.
Cookie Lab
 Time Permitting (otherwise, Friday)

 The last page (to turn in) is not due till the end of class
tomorrow. So don’t worry if we don’t get to it today. You
can look at it tonight or tomorrow in class (I’ll give the last
five minutes of class for you to work on it).
