
Please sign in (SIGNATURES) as you come in to class. Signing in saves
my voice, since I won't have to take attendance (this is only to settle the
class roster).

Introduction to Statistics
Lecture Notes
Chapters 3-5
What's up with the PowerPoint?
I don't usually use slides, but I am going to try
to use these to save my voice somewhat.

Notes: Still working on getting the class roster
settled. There has been some movement on the
waitlist; I will keep in touch as things develop.
Be sure you've signed in!
First homework is posted (on our course
website), but isn't due until next Friday (the
4th). The additional problem is NOT optional;
"additional" just means it is not a book problem.
Handouts for Today
There is one handout on graphs/descriptive
statistics going around. Save this to use
tomorrow in class.

There is a second handout: the anonymous
survey largely designed by the class on Monday.
Please go ahead and take a few minutes to fill
this out (no names!) and get it back to me. We'll
take a look at this data next week in lab.

If you missed class Monday, I have extra course
syllabuses at the front as well.
The W's of a Data Set
Who: the observations. The population is the set of all objects
for which you are interested in obtaining the value of some
parameter. Since we usually can't observe all objects, we take
a sample of objects (a subset of the overall population) to observe.
Note: There is NO such thing as a "population sample"
or "sample population."
What: the variables
Why: why was the data collected?
How: how was the data collected? (related to
design/sampling in chapters 12-13)
When/Where: more information that could be relevant
Chapters 3-5 Overview
Covers basic graphs and descriptive statistics for
both categorical and quantitative variables
This is what you would do as a preliminary
analysis for a variable.

Recall: a data set can have multiple variables in
it.

These chapters focus mostly on univariate (single-
variable) analyses. There is one comparative
graph, a side-by-side boxplot, in Chapter 5.
3 Rules of Data Analysis
Rule 1: Make a picture
Rule 2: Make a picture (really, before you do
anything else)
Rule 3: Make a picture (really, we mean a
well-chosen picture for your variables)
Categorical Variable Prelim Analysis
Frequency tables (one variable): summarize
counts by category
Contingency tables (2 or more variables):
summarize counts by category for multiple
variables
Bar charts
Pie charts
Frequency
What is frequency?
Frequency is the number of objects/cases per
category
You can also look at relative frequency.
Relative frequency is the number of objects/cases
per category divided by the total number of
objects.
Hence it gives proportions for each category out of
the total.
It is often converted to %.
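If you want to check a frequency table by computer, here is a minimal Python sketch. The responses are made up for illustration (they are not class data), and the variable names are just placeholders.

```python
from collections import Counter

# Hypothetical responses to one categorical survey question (made-up data).
responses = ["blue", "red", "blue", "green", "blue", "red", "green", "blue"]

# Frequency: number of cases per category.
frequency = Counter(responses)

# Relative frequency: frequency divided by the total number of cases.
total = sum(frequency.values())
relative_frequency = {
    category: count / total for category, count in frequency.items()
}

# Print category, frequency, proportion, and percent.
for category, count in frequency.most_common():
    share = relative_frequency[category]
    print(f"{category:<6} {count:>3} {share:.3f} {share:.1%}")
```

The relative frequencies should always add up to 1 (100%), which is a quick sanity check on any table.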
Bar Charts
One bar per category; height is determined
by frequency or relative frequency
Order of categories is arbitrary.
Does NOT let you talk about the shape of a
distribution.

Area principle: the area of each bar should be
proportional to the value it represents. This is often
violated when people try to make graphs "cool" and
go 3-D, etc. (see Example passed around).
Pie Charts
Take 100% of cases and divide up the 360
degrees of the circle based on relative frequencies.

We will use bar charts in preference to pie charts.

Note that for bar charts you do not need to
create bars for 100% of the cases. You could
look at the top three risk factors for a disease,
etc. However, we usually do have 100% of
cases shown.
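The pie chart rule above (each slice gets its relative frequency times 360 degrees) can be sketched in a few lines of Python. The counts are made up for illustration.

```python
# Hypothetical category counts (made-up data for illustration only).
counts = {"A": 10, "B": 30, "C": 60}

total = sum(counts.values())

# Each slice angle = relative frequency * 360 degrees.
angles = {category: 360 * count / total for category, count in counts.items()}

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Because a pie chart has to use up all 360 degrees, it only makes sense when the categories account for 100% of the cases.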
Contingency Tables - Example
See first page of Handout
Totals for rows/columns give marginal
distributions for each variable.
You can also look at conditional distributions.
Fix a row or column and work solely within that
row or column.

Concept of independence (will formalize later):
If the distribution of one variable is the same for all
categories of another variable, then the two
variables are independent.
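As a rough sketch of marginal and conditional distributions, the Python below works through a small table. The table is hypothetical (it is not the one on the handout), and the counts were chosen so that the conditional distributions match across rows.

```python
# Hypothetical contingency table of counts (made-up data):
# rows = class year, columns = preferred study time.
table = {
    "first-year": {"morning": 20, "evening": 30},
    "senior": {"morning": 40, "evening": 60},
}
columns = ["morning", "evening"]

grand_total = sum(sum(cells.values()) for cells in table.values())

# Marginal distributions: row totals and column totals over the grand total.
row_marginal = {r: sum(cells.values()) / grand_total for r, cells in table.items()}
col_marginal = {c: sum(table[r][c] for r in table) / grand_total for c in columns}

# Conditional distributions: fix a row and work solely within that row.
conditional_by_row = {
    r: {c: cells[c] / sum(cells.values()) for c in columns}
    for r, cells in table.items()
}

print("Row marginal:", row_marginal)
print("Column marginal:", col_marginal)
for r, dist in conditional_by_row.items():
    print(f"Study time given {r}:", dist)
```

In this made-up table both rows split 40% morning and 60% evening, so the distribution of study time is the same for every class year. That is the pattern the informal definition of independence describes; with real data the rows will rarely match exactly.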
On Your Own
Text has some discussion of segmented bar
charts and side-by-side bar charts (feel free to
read or skip)
Simpson's Paradox
Something that can happen when you aggregate
categorical data
Looking at overall averages or % can be misleading
Can get different results looking at breakdown
Berkeley Discrimination Data Example (see bottom
of page one of the handout)
Claims of Sexual Discrimination in 1973 Graduate
School Admissions
Overall, 44.28% of males who applied were admitted,
while only 34.58% of females were admitted.
Look what happens when you break down by the 6
largest departments though! (try this on your own or
with a partner). Is there evidence of discrimination
against females at the dept. level? What is going on?
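To see how aggregating can reverse a comparison, here is a small Python sketch. The counts are entirely made up for two hypothetical departments; they are NOT the Berkeley figures from the handout and only illustrate the mechanism.

```python
# Made-up admissions counts for two hypothetical departments.
# These are NOT the Berkeley figures; they only illustrate the mechanism.
# Each entry is (admitted, applied).
applications = {
    "Dept A": {"male": (80, 100), "female": (18, 20)},
    "Dept B": {"male": (4, 20), "female": (30, 100)},
}


def rate(admitted, applied):
    """Admission rate as a proportion."""
    return admitted / applied


# Rates within each department.
by_department = {
    dept: {group: rate(*cell) for group, cell in groups.items()}
    for dept, groups in applications.items()
}

# Aggregate rates: add up admitted and applied counts across departments first.
overall = {}
for group in ("male", "female"):
    admitted = sum(applications[d][group][0] for d in applications)
    applied = sum(applications[d][group][1] for d in applications)
    overall[group] = rate(admitted, applied)

for dept, rates in by_department.items():
    print(f"{dept}: male {rates['male']:.0%}, female {rates['female']:.0%}")
print(f"Overall: male {overall['male']:.0%}, female {overall['female']:.0%}")
```

In this toy example females have the higher admission rate in each department, yet the lower rate overall, because the two groups applied to the departments in very different proportions. Keep that in mind when you work through the handout data.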
Quantitative Variables Preliminary Analysis
Graphs
Dot plot: won't use much; read about on your own
Stem and leaf: won't use much; read about on
your own
Histogram
Boxplot (chapter 5)
Q-Q plot (Friday or next week)
Time plot (Friday or next week)
Descriptive statistics
Measures of center: mean, median
Measures of spread: standard deviation, IQR, range
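As a quick sketch, Python's built-in statistics module can compute these measures. The data are the pre-test scores from the stem-and-leaf exercise later in these notes. One assumption to flag: textbooks and software use slightly different conventions for quartiles, so the IQR from Python's default method may not match a hand calculation or R exactly.

```python
import statistics

# Pre-test scores from the stem-and-leaf exercise in these notes.
scores = [5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49]

# Measures of center.
mean = statistics.mean(scores)
median = statistics.median(scores)

# Measures of spread.
std_dev = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # Python's default quartile method
iqr = q3 - q1
data_range = max(scores) - min(scores)

print(f"mean = {mean:.2f}, median = {median}")
print(f"sd = {std_dev:.2f}, IQR = {iqr}, range = {data_range}")
```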
Describing the distribution of a quantitative
variable
You should focus on three things when
describing the distribution of a quantitative
variable:
Shape: unimodal (one peak), bimodal (two peaks),
multimodal (many peaks), bell-shaped, skewed left
(tail to the left), skewed right (tail to the right),
symmetric, uniform (no peaks, basically flat)
Center: estimate the center (or use a descriptive
statistic)
If multiple peaks, report the peak locations
Spread: estimate the spread (can use a
descriptive statistic)
Dot Plot: On Your Own
Most basic quantitative graph
Use for a low number of observations (<50)
Basically use a number line and place a dot
above it for each value you have observed.
[Figure: example dot plot from Wikipedia]
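A text-only dot plot is easy to sketch in Python. The values below are made up, and the layout assumes single-digit whole numbers so that the columns line up; it is a rough illustration, not a general plotting routine.

```python
from collections import Counter

# Hypothetical small data set (made-up, single-digit whole numbers).
values = [2, 3, 3, 4, 4, 4, 5, 7]

counts = Counter(values)
axis = list(range(min(values), max(values) + 1))  # the number line
tallest = max(counts.values())

# Build rows from the top of the tallest stack down to the number line.
rows = []
for level in range(tallest, 0, -1):
    row = " ".join("*" if counts[x] >= level else " " for x in axis)
    rows.append(row.rstrip())
rows.append(" ".join(str(x) for x in axis))

dot_plot = "\n".join(rows)
print(dot_plot)
```

Each star is one observation stacked above its value on the number line, so the number of stars equals the number of observations.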
Stem and Leaf: On Your Own
Your book discusses lots of options for these,
including split leaves (which is something
R/Rcmdr will do).
Basics: You take your values and set a stem,
maybe the tens place. Then the leaves are the ones place.
For each stem, you list the leaves that belong to it
in numeric order.
Usually works decently for fewer than 100
observations
Try it. Suppose you have scores on a pre-test for
an at-risk youth group as follows:
5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49
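Once you have tried it by hand, you can check your display against this minimal Python sketch. It assumes the simplest version described above (stem = tens place, leaf = ones place, no split leaves); the layout with a vertical bar is just one common convention.

```python
from collections import defaultdict

# Pre-test scores from the exercise above.
scores = [5, 11, 13, 21, 34, 36, 45, 47, 48, 48, 49]

# Stem = tens digit, leaf = ones digit.
stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)
    stems[stem].append(leaf)

# List every stem from smallest to largest, even one with no leaves.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem} | {leaves}".rstrip())

stem_and_leaf = "\n".join(lines)
print(stem_and_leaf)
```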
Histogram
Take the quantitative variable and break it up into
piles or bins (usually the same width).
Count the number of observations in each bin or pile.
Plot the frequencies per bin.
Usually no spaces between bins (if there is a space, it is a gap
in the data, NOT a separator like in a bar chart).
You DO need to know the boundaries. (5,10], (10,15]
as bins IS different from [5,10),[10,15). (If anyone
needs me to explain open/closed brackets, please
ask).
Technology lets us vary the width of bins (effectively,
the number of bins)
You can also use unequal bin widths, but then you
need something called density, not frequency.
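To see why the bin boundaries matter, here is a small Python sketch. The observations are made up and deliberately include values that sit exactly on a boundary. The function names are placeholders, and the density calculation assumes the usual definition (relative frequency divided by bin width, so that bar areas add up to 1).

```python
# Hypothetical observations (made-up), including values on bin boundaries.
data = [5, 7, 10, 10, 12, 15]


def bin_counts(values, edges, closed):
    """Count values per bin: closed="right" is (a, b], closed="left" is [a, b)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            a, b = edges[i], edges[i + 1]
            inside = (a < v <= b) if closed == "right" else (a <= v < b)
            if inside:
                counts[i] += 1
                break
    return counts


def densities(counts, edges):
    """Density per bin = relative frequency / bin width (binned values only)."""
    total = sum(counts)
    return [c / total / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]


edges = [5, 10, 15]
right_closed = bin_counts(data, edges, "right")  # bins (5,10], (10,15]
left_closed = bin_counts(data, edges, "left")    # bins [5,10), [10,15)
print("Right-closed counts:", right_closed)
print("Left-closed counts: ", left_closed)

# Unequal bin widths: plot density instead of frequency.
unequal_edges = [4, 10, 20]
unequal_counts = bin_counts(data, unequal_edges, "right")
unequal_density = densities(unequal_counts, unequal_edges)
print("Unequal-width counts:", unequal_counts)
print("Densities:", unequal_density)
```

The same six values give different counts under the two boundary conventions, and a value on the outer edge can fall out of the histogram entirely. With unequal widths, density keeps a wide bin from looking bigger just because it is wide.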
Examples
See page 2 of the handout
Try to describe the shape of each histogram

Then see page 3 of the handout
We're going to create a histogram by hand if there
is time
If no time, you can do this on your own.
Cookie Lab
Time Permitting (otherwise, Friday)

The last page (to turn in) is not due till the end
of class tomorrow. So don't worry if we don't
get to it today. You can look at it tonight or
tomorrow in class (I'll give the last five minutes of
class for you to work on it).
