Lecture 01


Lecture 01: Data Basics

What is "Statistics"?

Statistics is the science of learning from data. Data are the facts and figures collected,

analyzed, and summarized for presentation and interpretation. All the data collected in a

particular study are referred to as the data set for the study. Any data set contains

information about some group of individuals. The information is organized in variables. Our

objective is to make more informed and better decisions based on the information we have.

What are the Individuals and Variables?

• Individuals (/elements/subjects/cases/...) are the objects described by a set of data.

Individuals may be people, but they may also be animals or things.

• A variable is any characteristic of an individual. A variable can take different values for
different individuals.

Example: A survey was conducted on students in an introductory statistics class. Below are a
few of the questions on the survey, along with the variables in which the responses were stored:

• gender: What is your gender?

• intro_extra: Do you consider yourself introverted or extraverted?

• sleep: How many hours do you sleep at night, on average?

• bedtime: What time do you usually go to bed?

• countries: How many countries have you visited?

• dread: On a scale of 1-5, how much do you dread being here?

Data set

Data collected on students in a statistics class on a variety of variables:

Stu.  gender  intro_extra  ···  dread
1     male    extravert    ···  3
2     female  extravert    ···  2
3     female  introvert    ···  4
4     other   extravert    ···  2
...   ...     ...          ...  ...
36    male    extravert    ···  3

What are the types of variables?

• A categorical variable places an individual into one of several groups or categories.

– e.g., eye colour, hair colour, gender. (Summarized by a population proportion.)

• A quantitative variable takes numerical values for which arithmetic operations such as
adding and averaging make sense. The values of a quantitative variable are usually
recorded with a unit of measurement such as seconds or kilograms. (Summarized by a population mean.)

All variables divide into two types: numerical (quantitative) and categorical.

POPULATION: ALL STUDENTS FROM STAT CLASSES

Stu.  gender  sleep  bedtime  countries  dread
1     male    5      12-2     13         3
2     female  7      10-12    7          2
3     female  5.5    12-2     1          4
4     female  7      12-2                2
5     female  3      12-2     1          3
6     female  3      12-2     9          4

“sleep” – quantitative variable
- Number of hours of sleep
- µ – population mean: the mean number of hours of sleep for all students

“bedtime” – categorical variable
- p – population proportion: the proportion of all students who went to bed after midnight
- Success = a student who went to bed after midnight

• gender: categorical

• sleep: quantitative

• bedtime: categorical

• countries: quantitative

• dread: quantitative (or categorical, depending on the study)
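
As a rough illustration of individuals, variables, and variable types, the six students shown in the table can be stored as one record per individual. This is only a sketch: the `variable_type` helper is hypothetical, it uses the crude rule "all recorded values are numbers, so the variable is quantitative", and it assumes that student 4's single remaining value belongs to dread, leaving countries missing.

```python
# One dict per individual (student); each key is a variable.
# None marks the value assumed to be missing for student 4.
students = [
    {"gender": "male",   "sleep": 5.0, "bedtime": "12-2",  "countries": 13,   "dread": 3},
    {"gender": "female", "sleep": 7.0, "bedtime": "10-12", "countries": 7,    "dread": 2},
    {"gender": "female", "sleep": 5.5, "bedtime": "12-2",  "countries": 1,    "dread": 4},
    {"gender": "female", "sleep": 7.0, "bedtime": "12-2",  "countries": None, "dread": 2},
    {"gender": "female", "sleep": 3.0, "bedtime": "12-2",  "countries": 1,    "dread": 3},
    {"gender": "female", "sleep": 3.0, "bedtime": "12-2",  "countries": 9,    "dread": 4},
]


def variable_type(values):
    """Heuristic: numbers only -> 'quantitative', anything else -> 'categorical'."""
    recorded = [v for v in values if v is not None]  # ignore missing values
    if all(isinstance(v, (int, float)) for v in recorded):
        return "quantitative"
    return "categorical"


# Classify every variable by looking at its column of values.
types = {name: variable_type([s[name] for s in students]) for name in students[0]}

for name, kind in types.items():
    print(f"{name}: {kind}")
```

The heuristic labels dread as quantitative because it is stored as numbers; whether a 1-5 rating should instead be treated as categorical is a decision for the analyst, not something the code can settle.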

Population vs. Sample

The distinction between population and sample is basic to statistics. To make sense of any

sample result, you must know what population the sample represents.

Population: described by parameters

Sample: described by statistics

• The population in a statistical study is the entire group of individuals about which we want
information, that is, all individuals.

• A sample is a part of the population from which we actually collect information. We use a
sample to draw conclusions about the entire population. This process is called statistical
inference.

• A sampling design describes exactly how to choose a sample from the population.

• A parameter is a number that describes the population. In practice, the value of a


parameter is not known because we can rarely examine the entire population.

E.g., µ, p and σ represent the mean, proportion and standard deviation of a population.

These numbers are parameters.

µ – population mean
- Mean of all individuals
- True mean
- Quantitative data

p – population proportion
- Proportion of all individuals with a certain characteristic
- True proportion
- Categorical data

A statistic is a number that can be computed from the sample data without making use of any

unknown parameters. In practice, we often use a statistic to estimate an unknown parameter.

E.g., x̄, p̄ and s represent the mean, proportion and standard deviation of a sample. These
numbers are statistics.

x – variable
x̄ (x bar) – sample mean
p̄ (p bar) – sample proportion

p̄ = x/n = (number of successes) / (sample size), where x here counts the successes

Note: Remember p and s: parameters come from populations and statistics come from samples.
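
A minimal sketch of computing statistics from a sample, treating the six students listed earlier as the sample. It assumes that the bedtime category "12-2" means "after midnight" (the success); the variable names are illustrative only.

```python
import statistics

# Sample data for the six students listed earlier
sleep_hours = [5, 7, 5.5, 7, 3, 3]
bedtimes = ["12-2", "10-12", "12-2", "12-2", "12-2", "12-2"]

n = len(sleep_hours)                    # sample size

x_bar = sum(sleep_hours) / n            # sample mean, estimates mu
s = statistics.stdev(sleep_hours)       # sample standard deviation, estimates sigma

# Assumption: a bedtime in the "12-2" category counts as a success
successes = sum(1 for b in bedtimes if b == "12-2")
p_bar = successes / n                   # sample proportion, estimates p

print(f"x_bar = {x_bar:.2f} hours")             # x_bar = 5.08 hours
print(f"s     = {s:.2f} hours")
print(f"p_bar = {successes}/{n} = {p_bar:.2f}")  # p_bar = 5/6 = 0.83
```

None of these numbers requires knowing µ, σ or p, which is what makes them statistics rather than parameters.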

Exploratory Data Analysis

Statistical tools and ideas help us examine data in order to describe their main features.

This examination is called exploratory data analysis.

Begin by examining each variable by itself. Then move on to study the relationships among

the variables.

• Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.


Distribution of a Variable

The distribution of a variable tells us what values it takes and how often it takes these values.

1. Distribution of a Categorical variable.

• The distribution of a categorical variable lists the categories and gives either the count, the
percent, or the proportion of individuals who fall into each category.

• Bar charts and pie charts display the distribution of a categorical variable graphically.

• A frequency table describes the distribution of a categorical variable numerically.

• A pie chart must include all the categories that make up a whole. Use a pie chart only
when you want to emphasize each category’s relation to the whole.

• To describe the distribution of a categorical variable, we need to write about the main
features.
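
The frequency table for a categorical variable can be sketched in a few lines. The example below uses the bedtime values of the six students listed earlier and prints the count, proportion and percent for each category, plus a crude text bar as a stand-in for a bar chart.

```python
from collections import Counter

# Categorical variable: bedtime category for each of the six students
bedtimes = ["12-2", "10-12", "12-2", "12-2", "12-2", "12-2"]

counts = Counter(bedtimes)   # category -> number of individuals
n = len(bedtimes)

print(f"{'category':<10}{'count':>6}{'proportion':>12}{'percent':>9}")
for category, count in counts.most_common():
    proportion = count / n
    bar = "#" * count        # one mark per individual, like a bar chart
    print(f"{category:<10}{count:>6}{proportion:>12.3f}{proportion:>9.1%}  {bar}")
```

The counts add up to the number of individuals, and the proportions add up to 1, which is why a pie chart is only appropriate when every category is included.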

2. Distribution of a numerical (quantitative) variable

• To describe the distribution of a quantitative variable, look for the overall pattern and
for striking deviations from that pattern.

• You can describe the overall pattern by its shape, center, variability/spread, and
location.

• An important kind of deviation is an outlier, an individual that falls outside the
overall pattern.

• Histograms, box-and-whisker plots, dot plots, and stem plots display the distribution of a
quantitative variable graphically.

• Mean and Median describe the center of the distribution of a quantitative variable
numerically.

• A percentile provides information about how the data are spread over the interval
from the smallest value to the largest value. The pth percentile of a data set is a
value such that at least p% of the items take on this value or less (and at least
(100 − p)% of the items take on this value or more).

(Figure: p% of the data lie at or below the pth percentile and (100 − p)% lie at or above it.)

• Q3 = 75th percentile (at least 75% of the items are at or below it)

• Q1 = 25th percentile (at least 25% of the items are at or below it)

• IQR = Q3 − Q1 = width of the middle 50% of the distribution

• Range = max − min, IQR = Q3 − Q1, and standard deviation (s.d.)/variance describe
the variability/spread of the distribution of a quantitative variable numerically.

• Min, Max, first-quartile (Q1), third-quartile (Q3) describe the location of the

distribution of a quantitative variable numerically.

• The shape of the distribution of a quantitative variable is either symmetric, right-skewed,
or left-skewed.
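
The numerical summaries above can be sketched for the sleep hours of the six students listed earlier. The `percentile` helper is a hypothetical implementation of one common textbook convention that satisfies the percentile definition given above (compute i = p·n/100; if i is not a whole number, round up to the next position; if it is, average positions i and i + 1). Statistical software often uses other interpolation rules, so quartiles may differ slightly between tools.

```python
import statistics


def percentile(data, p):
    """p-th percentile (0 < p < 100) using the 'round up or average' convention."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)  # i = p*n/100 split into parts
    whole = int(whole)
    if remainder:                 # i is not a whole number: round up to position ceil(i)
        return ordered[whole]     # 0-based index of that position
    return (ordered[whole - 1] + ordered[whole]) / 2  # average positions i and i+1


sleep_hours = [5, 7, 5.5, 7, 3, 3]

# Center
mean = statistics.mean(sleep_hours)
median = statistics.median(sleep_hours)

# Location
minimum, maximum = min(sleep_hours), max(sleep_hours)
q1 = percentile(sleep_hours, 25)
q3 = percentile(sleep_hours, 75)

# Variability / spread
data_range = maximum - minimum
iqr = q3 - q1
sd = statistics.stdev(sleep_hours)

print(f"center:   mean = {mean:.2f}, median = {median}")
print(f"location: min = {minimum}, Q1 = {q1}, Q3 = {q3}, max = {maximum}")
print(f"spread:   range = {data_range}, IQR = {iqr}, s.d. = {sd:.2f}")
```

With this convention the 50th percentile agrees with the median for these data, which is a quick sanity check on the helper.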

Exploratory analysis to inference

• Sampling is natural.

• Think about sampling something you are cooking - you taste (examine) a small part of
what you’re cooking to get an idea about the dish as a whole.

• When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough,
that’s exploratory analysis.

• If you generalize and conclude that your entire soup needs salt, that’s an inference.

• For your inference to be valid, the spoonful you tasted (sample) needs to be
representative of the entire pot (population).

– If your spoonful comes only from the surface and the salt is collected at the bottom
of the pot, what you tasted is probably not representative of the whole pot.

– If you first stir the soup thoroughly before you taste, your spoonful will more likely
be representative of the whole pot.
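
The soup analogy can be mimicked with a small simulation. Everything here is made up for illustration: the pot is a list of 1000 hypothetical "sips" ordered from surface to bottom, with all the salt settled in the bottom 200. A spoonful from the surface is compared with spoonfuls drawn at random, which plays the role of stirring.

```python
import random

rng = random.Random(1)   # fixed seed so the run is repeatable

# Hypothetical pot: saltiness of 1000 sips, listed from surface to bottom.
# The salt has settled, so only the bottom 200 sips are salty.
pot = [0.0] * 800 + [5.0] * 200
true_mean = sum(pot) / len(pot)          # parameter: average saltiness of the whole pot

spoon = 20                               # sample size: sips per spoonful

# Unstirred: the spoonful comes only from the surface.
surface_spoonful = pot[:spoon]
surface_mean = sum(surface_spoonful) / spoon

# Stirred: every sip is equally likely to end up in the spoon.
stirred_spoonful = rng.sample(pot, spoon)
stirred_mean = sum(stirred_spoonful) / spoon

# Many stirred spoonfuls, to see what they give on average.
many_means = [sum(rng.sample(pot, spoon)) / spoon for _ in range(2000)]
average_of_means = sum(many_means) / len(many_means)

print(f"whole pot (parameter):       {true_mean:.2f}")
print(f"surface spoonful:            {surface_mean:.2f}")
print(f"one stirred spoonful:        {stirred_mean:.2f}")
print(f"average of stirred spoonfuls: {average_of_means:.2f}")
```

The surface spoonful always reports no salt, however many times it is repeated, while individual stirred spoonfuls vary but centre on the saltiness of the whole pot.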

Statistical inference is primarily concerned with understanding and quantifying the

uncertainty of parameter estimates. While the equations and details change depending on

the setting, the foundations for inference are the same throughout all of statistics.
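
One way to see what "uncertainty of parameter estimates" means is to draw many samples from a population whose mean is known and watch how much the sample mean varies. The sketch below does this for a hypothetical population of simulated sleep hours; the population, its size, and the helper name `spread_of_sample_means` are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

# Hypothetical population: nightly sleep hours for 10,000 students
population = [rng.gauss(7, 1.5) for _ in range(10_000)]
mu = statistics.fmean(population)        # parameter, known only because we simulated it


def spread_of_sample_means(n, repetitions=1000):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)


# Larger samples give estimates that vary less from sample to sample.
spreads = {n: spread_of_sample_means(n) for n in (10, 100)}

print(f"population mean mu = {mu:.2f}")
for n, spread in spreads.items():
    print(f"n = {n:>3}: sample means vary with s.d. of about {spread:.2f}")
```

In a real study only one sample is available, so this sample-to-sample variation cannot be observed directly; inference methods estimate it from the single sample in hand.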
