Module 1-2 - Data and Variables

Module 1 - Section 2

Data and Variables

By: Rosana Fok


Population vs. Sample
• The entire collection of individuals is called the population of interest.
• A sample is a subset of the population, selected for study in some prescribed manner.
• A parameter is a numerical summary that describes a characteristic of the population (often unknown).
• A statistic is a numerical summary that describes a characteristic of the sample (known once the data are observed).

NOTE: We typically use Greek letters to denote parameters and Latin letters to denote statistics.
Commonly used Parameters and Statistics

Name                     Statistic   Parameter
Mean                     ȳ           μ (mu)
Standard deviation       s           σ (sigma)
Correlation              r           ρ (rho)
Regression coefficient   b           β (beta)
Proportion               p̂           p
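
As a rough illustration of the "Statistic" column, the sketch below computes each sample statistic from a small data set using Python's standard library. The data values and variable names are made up for illustration and are not part of the course material. The matching parameters (μ, σ, ρ, β, p) would be the same summaries taken over the whole population, which is usually not observable.

```python
import statistics

# Hypothetical sample of 5 students (values invented for illustration only).
hours_studied = [2.0, 4.0, 6.0, 8.0, 10.0]    # x
exam_score = [55.0, 60.0, 70.0, 72.0, 83.0]   # y
passed = [False, True, True, True, True]

n = len(exam_score)

# Sample mean (y-bar) and sample standard deviation (s, with n - 1 divisor).
y_bar = statistics.mean(exam_score)
s = statistics.stdev(exam_score)

x_bar = statistics.mean(hours_studied)
s_x = statistics.stdev(hours_studied)

# Sample covariance, used for both the correlation and the slope.
cov_xy = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(hours_studied, exam_score)
) / (n - 1)

r = cov_xy / (s_x * s)   # sample correlation
b = cov_xy / s_x ** 2    # least-squares slope of y on x
p_hat = sum(passed) / n  # sample proportion of students who passed

print(y_bar, s, r, b, p_hat)
```

Each of these numbers is known as soon as the sample is observed, which is what distinguishes a statistic from a parameter.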

Problem: It is generally impossible to
gather all observations from a population.

[Diagram: Population → (sampling) → Data → (descriptive statistics) → Statistics → (inference) → Parameter. The direct route from Population to Parameter is marked "??".]

Example
I want to estimate the proportion of male students in this Statistics
class by randomly selecting 20 students in this class.

• Population of interest: all students in this Statistics class
• Sample: 20 randomly selected students in this class
• Parameter: the proportion of male students in this Statistics class
• Statistic: the proportion of male students in this sample of 20 students
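
The example above can be sketched as a small simulation. The class roster below (45 male and 55 female students) is an invented assumption, chosen only so that the parameter can be computed and compared with the statistic; in practice the parameter is unknown and only the sample is observed.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical class roster: 45 male and 55 female students (invented numbers).
population = ["M"] * 45 + ["F"] * 55

# Parameter: proportion of male students in the whole class.
p = population.count("M") / len(population)

# Sample: 20 students chosen at random, without replacement.
sample = random.sample(population, 20)

# Statistic: proportion of male students in the sample of 20.
p_hat = sample.count("M") / len(sample)

print(f"parameter p = {p:.2f}, statistic p_hat = {p_hat:.2f}")
```

Changing the seed gives a different sample and usually a different p_hat, while p stays fixed.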


Example
I want to find out the average time my female students spend studying Statistics by asking 20 randomly selected female students from this class.

• Population of interest: all female students in this Statistics class
• Sample: 20 randomly selected female students in this class
• Parameter: the average study time of female students in this class
• Statistic: the average study time of the 20 randomly selected female students

Data
Recall: Data can be numbers, record names or other labels.

• To provide context, the data need:
    • 5 “W’s”: Who, What, When, Where, Why
    • 1 “H”: How

Who

• The Who of the data tells us the individual cases about which (or whom) we have collected data.
• Subjects or participants – people on whom we experiment
• The entire set of subjects is the population.
• The set of subjects you observe is your sample.
• Respondents – individuals who answer a survey
• Experimental units – animals, plants, and inanimate subjects

What
• What characteristic (or variable) are you measuring?
• Variables are characteristics recorded about each individual.
• A variable can take different values for different individuals.
• Some variables have units that tell how each value has been measured and tell the scale of the measurement.

Measurement            Unit
Distance               Meter
Mass                   Kilogram
Time                   Second
Electric current       Ampere
Temperature            Kelvin
Amount of substance    Mole


Why

• Why are you doing this? What is the reason for collecting data?
• The questions of interest shape what we think about and how we treat the variable.
• NOTE: we need the Who, What, and Why to analyze data.

When and Where

• Knowing when and where the data are collected gives us useful information about the context.
• Example: values recorded at the U of A in the 1960s may mean something different from similar values recorded last year (e.g., the average price of a pen that students use).
• Example: salary in Canada vs. salary in a third-world country

How
• How are the data being collected? How is the variable being measured? (Assigned Readings)
• Data must be gathered properly.
• Improper data collection methodology may lead to wrong conclusions.
• Proper data collection is critically important for appropriate analysis and validity of inferences.
• Ex: results from voluntary internet surveys are often useless.

Example
One of the reasons that the Monitoring the Future (MTF) project was started was
“to study changes in the beliefs, attitudes, and behavior of young people in the
United States.” Data are collected from 8th, 10th, and 12th graders each year.
To get a representative nationwide sample, surveys are given to a randomly
selected group of students. In Spring 2004, students were asked about alcohol,
illegal drug, and cigarette use. Describe the W’s, if the information is given. If
the information is not given, state that it is not specified.
• Who: 8th, 10th, and 12th graders
• What: alcohol, illegal drug, and cigarette use
• Why: “to study changes in the beliefs, attitudes, and behavior of young people in the United States”
• When: Spring 2004
• Where: United States
• How: survey

Variables
• Variables are characteristics recorded about each individual.
• Types of variables:
    • Categorical (Qualitative)
    • Numerical (Quantitative)

Categorical/Qualitative Variable
• A categorical variable places a subject into one of several groups or categories (or levels).
• Usually we determine the counts of cases that fall into each category.
• Two types:
    • Nominal: the levels have no order
    • Ordinal: the levels have some order

Example
• gender (M or F) → nominal
• hair color (blonde, white, black, red, etc.) → nominal
• letter grade (A+, A, A-, B+, B, B-, C+, C, C-, D+, D, F) → ordinal
• car manufacturer (Dodge, Ford, Honda, Others) → nominal
• opinion (strongly agree, agree, neutral, disagree, strongly disagree) → ordinal
• education level (high school diploma, undergraduate degree, graduate degree) → ordinal
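
The practical difference between nominal and ordinal levels can be sketched in a few lines of Python. The responses below are made up for illustration. Both kinds of variable can be counted, but only ordinal levels can be ranked in a meaningful way.

```python
from collections import Counter

# Nominal: categories can be counted, but there is no meaningful order.
hair_colors = ["black", "blonde", "black", "red", "black"]  # invented data
hair_counts = Counter(hair_colors)

# Ordinal: categories can be counted AND ranked.
# The rank of each level is its position in this ordered list.
opinion_levels = [
    "strongly disagree", "disagree", "neutral", "agree", "strongly agree",
]
rank = {level: position for position, level in enumerate(opinion_levels)}

responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]  # invented data
opinion_counts = Counter(responses)

# Sorting by rank is meaningful for an ordinal variable.
responses_sorted = sorted(responses, key=lambda level: rank[level])

print(hair_counts)
print(opinion_counts)
print(responses_sorted)
```

Sorting hair colors would only give alphabetical order, which says nothing about the variable itself.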

Numerical/Quantitative Variables
• A numerical variable measures a numerical quantity or amount in each subject.
• Two types:
    • Discrete: can only take on distinct values
    • Continuous: can take on any value in a given interval
• Not all data represented by numbers are numerical data (e.g., 1 = male, 2 = female).

Variable
    Categorical (Qualitative): Nominal, Ordinal
    Numerical (Quantitative): Discrete, Continuous

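
A minimal sketch of why numeric codes are not numerical data, using invented values. Averaging makes sense for discrete and continuous variables, but averaging category codes gives a number with no interpretation; counting is the appropriate summary.

```python
import statistics
from collections import Counter

# Invented data for illustration only.
siblings = [0, 1, 1, 2, 3]                  # discrete: counts
heights_cm = [158.2, 165.0, 170.4, 172.9]   # continuous: any value in an interval
gender_code = [1, 2, 2, 1, 2]               # codes only: 1 = male, 2 = female

# Means are meaningful for numerical variables.
mean_siblings = statistics.mean(siblings)
mean_height = statistics.mean(heights_cm)

# This runs without error, but the result (1.6) does not describe anything.
meaningless_mean = statistics.mean(gender_code)

# For coded categories, translate the codes back to labels and count.
labels = {1: "male", 2: "female"}
gender_counts = Counter(labels[code] for code in gender_code)

print(mean_siblings, mean_height)
print(meaningless_mean, gender_counts)
```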
Example
Data from a medical study contain values of many variables for each of the people who were the subjects of the study. Which of the following variables are categorical and which are quantitative?

• Age (years)
• Smoker (yes or no)
• Systolic blood pressure (millimeters of mercury)
• Level of calcium in the blood (micrograms per milliliter)
• Drug effectiveness (1 = strongly agree, 2 = agree, 3 = neutral, 4 = disagree, 5 = strongly disagree)
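
One possible way to record the answer to this exercise is sketched below, applying the definitions from the earlier slides. The dictionary layout and the short reasons are illustrative assumptions, not part of the original slides.

```python
# variable name -> (type, short reason), following the definitions in this module
variables = {
    "Age (years)": ("quantitative", "measured amount with units"),
    "Smoker (yes or no)": ("categorical", "places each subject in one of two groups"),
    "Systolic blood pressure (mmHg)": ("quantitative", "measured amount with units"),
    "Calcium level (micrograms per mL)": ("quantitative", "measured amount with units"),
    "Drug effectiveness (1-5 scale)": (
        "categorical",
        "the numbers are codes for ordered categories (ordinal)",
    ),
}

categorical = [name for name, (kind, _) in variables.items() if kind == "categorical"]
quantitative = [name for name, (kind, _) in variables.items() if kind == "quantitative"]

print("Categorical:", categorical)
print("Quantitative:", quantitative)
```

The drug effectiveness variable is the one to watch: it is stored as numbers, but the numbers only label ordered categories.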

Example
Each student has his or her own characteristics, e.g., age, student ID, height, weight, gender, hair color, nationality, grade (A+, A, A-, … D, F), number of siblings, per week, etc.

Student     Age   Student ID   Height   Gender   Grade
Alice       19    00001        165      F        B+
Boris       20    00002        170      M        A-
Catherine   18    00003        158      F        A

• This data table clearly shows the context of the data.
• Specifically, it tells us the “What” (column titles) and “Who” (row titles) for these data.
• NOTE: student ID is known as an identifier variable. It is not a quantitative variable, as it doesn’t have units. It is a categorical variable with one individual in each category. The university assigns you a unique student ID so that it can identify you in its system.
• Examples: SIN, ISBN
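
The identifier idea can be sketched with the three rows from the table above. The dictionary keys and the choice to store IDs as strings are assumptions made for this illustration. An identifier has one individual per value, and treating it as a number discards information without producing anything useful.

```python
import statistics

# The three students from the table above; key names are illustrative.
students = [
    {"name": "Alice", "age": 19, "student_id": "00001", "height": 165, "gender": "F", "grade": "B+"},
    {"name": "Boris", "age": 20, "student_id": "00002", "height": 170, "gender": "M", "grade": "A-"},
    {"name": "Catherine", "age": 18, "student_id": "00003", "height": 158, "gender": "F", "grade": "A"},
]

# An identifier variable has exactly one individual in each category.
ids = [student["student_id"] for student in students]
is_identifier = len(set(ids)) == len(ids)

# Treating the ID as a number loses the leading zeros: it is a label, not a quantity.
id_as_number = int(ids[0])
round_trip_matches = str(id_as_number) == ids[0]

# Age, by contrast, is quantitative, so a mean is meaningful.
mean_age = statistics.mean(student["age"] for student in students)

print(is_identifier, round_trip_matches, mean_age)
```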
Thank you for watching
this video!
