
Stat 230

Introduction to Probability and Statistics

Lecture 1
Sections 1.1 & 1.2

Hagop Karakazian
What is Statistics?
• Statistics is the science of collecting, displaying, analyzing
and interpreting data, as well as of drawing conclusions
and making informed decisions based on data.

• Collecting 📋, displaying 📊, analyzing 📈
↳ Descriptive Statistics (Chapter 1 - very briefly)

• Interpreting data, drawing conclusions and making informed decisions based on this data
↳ Inferential Statistics (Chapter 6 on)

• What about Chapters 2 through 5?
↳ Probability Theory, the “meat” of the course.
Stat 230 - Karakazian
Population vs. Sample
• A population consists of all elements—individuals, items, or objects—whose
characteristics are being studied.
↳ Ex: All houses in Beirut, all students at AUB,…

• A sample is a portion/subset of a population selected for study.


• Types of samples:
• Representative = represents the characteristics of the population as closely as
possible. (Q: Example of a non-representative sample?)
• Random = Drawn in such a way that each element of the population has some
chance of being selected in the sample. (Q: Example of a non-random sample?)

• Studying the entire population → Census Survey (inconvenient + costly)

• Studying a sample → Sample Survey
• Inferences obtained from a representative random sample are generally more reliable.
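
As a minimal sketch of drawing a random sample, the snippet below uses Python's standard `random` module on a made-up population of ID numbers. The population, the seed, and the sample size are assumptions for illustration only. `sample` gives every element the same chance of selection, which is one particular way to meet the definition of a random sample above.

```python
import random

# Hypothetical population: ID numbers for 1,000 students (made-up data).
population = list(range(1, 1001))

# Fixed seed so the sketch is reproducible; drop it for a fresh draw each run.
rng = random.Random(230)

# Simple random sample of 50 elements, drawn without replacement.
sample = rng.sample(population, k=50)

print(len(sample))          # number of elements selected
print(sorted(sample)[:5])   # a peek at the smallest selected IDs
```

Whether such a sample is also representative depends on the population and on luck; randomness only controls how the elements are chosen.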
What are Statistics?
• A statistic is any quantity whose value can be calculated from sample data.
↳ Ex: average/mean, percentage/proportion, etc.

• In Inferential Statistics, a statistic is used to estimate the corresponding quantity regarding the population.
↳ Ex: Suppose we calculated an average height (our statistic) of a
randomly selected sample of 50 AUB students, and obtained 170cm, with
a standard deviation of 4cm (whatever this means).

↳ Chapter 7 will help us infer statements like:
We are 99% sure that the average height of all AUB students is between
about 168.5cm and 171.5cm

• This inference is based on Probability Theory which tells us the likelihood of all values a statistic can assume when a random sample has been selected from the population.
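
To make "all values a statistic can assume" concrete, here is a small simulation sketch. The population is simulated (it is not real AUB data), and the population size, sample size, and number of repetitions are assumptions chosen for illustration. Each repetition draws a fresh random sample and records its mean, so the list of means shows how the statistic varies from sample to sample.

```python
import random
import statistics

rng = random.Random(230)  # fixed seed for reproducibility

# Hypothetical population of 5,000 heights in cm (simulated, not real AUB data).
population = [rng.gauss(170, 10) for _ in range(5000)]

# Draw many random samples of size 50 and record the statistic
# (the sample mean) each time.
sample_means = [
    statistics.mean(rng.sample(population, k=50))
    for _ in range(1000)
]

# The population quantity being estimated.
print(round(statistics.mean(population), 1))

# The smallest and largest sample means observed across the repetitions.
print(round(min(sample_means), 1), round(max(sample_means), 1))
```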



Variables
• A variable is any characteristic whose value may change from
one element to another in the population.
↳ Ex: x=brand of calculator owned by a student
y=total wealth of a person

• A data set is a table that displays the values/observations/measurements of one or more variables corresponding to each element of a sample or population.



Types of Variables
• A Quantitative Variable assumes a numerical value:
A Discrete Variable can have only certain numerical
values, i.e. with no intermediate values.
↳ Ex: # of cars owned, # of deliveries/day

A Continuous Variable can assume any numerical value over some interval.
↳ Ex: Income, weight, time, distance, …

• A Qualitative Variable does not assume a numerical value but can be classified into two or more categories.
↳ Ex: {XS, S, M, L, XL}, {Fresh., Soph., Junior, Senior}

Displaying Data
Collected data is typically raw data (i.e. unprocessed,
ungrouped, and unsorted), which should be organized and
properly displayed for use.

We will focus on:


• Stem-and-Leaf Displays
• Dot Plots
• Histograms for discrete data
• Histograms for continuous data
Stem-and-Leaf Displays
[Figure: raw data and its stem-and-leaf display, with the stems and the leaves of stem 5 labeled]

A Stem-and-Leaf Display is used to condense quantitative (raw) data by splitting each value into two portions:
• The Leaf - Specified number of rightmost digit(s)
• The Stem - Remaining leading digit(s)
where the leaves of each stem are shown next to it in a horizontal list.
Example

The data range from 630 to 1370.

1-digit leaves ⇒ Stems range from 63 to 137 (too many!)
2-digit leaves ⇒ Stems range from 6 to 13 (good!)
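
The choice of leaf length can be sketched in Python. The helper `stem_and_leaf` and the data below are hypothetical (the values were made up to span 630 to 1370), and the sketch assumes non-negative integer data. It splits each value with `divmod`, so the leaf is the specified number of rightmost digits and the stem is whatever remains.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=1):
    """Return {stem: [leaves]} with each value split into stem and leaf."""
    base = 10 ** leaf_digits
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, base)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical data chosen to span 630 to 1370, like the example above.
data = [630, 745, 820, 825, 910, 1005, 1090, 1180, 1245, 1370]

one_digit = stem_and_leaf(data, leaf_digits=1)
two_digit = stem_and_leaf(data, leaf_digits=2)

print(min(one_digit), max(one_digit))  # range of stems with 1-digit leaves
print(min(two_digit), max(two_digit))  # range of stems with 2-digit leaves

# Print the 2-digit-leaf display, keeping stems that have no leaves.
for stem in range(min(two_digit), max(two_digit) + 1):
    leaves = " ".join(f"{leaf:02d}" for leaf in two_digit.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

With 1-digit leaves the stems would run over 75 possible values (63 through 137), while 2-digit leaves need only 8 (6 through 13), which is why the second choice gives a readable display.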



Dot Plots

A Dot Plot displays quantitative (raw) data using dots drawn above the
corresponding values on the number line.
Advantages:
• Displays the location & spread of the data
• Displays the number of occurrences (frequency) of each value
• Displays gaps and outliers (extreme values relative to the majority)
• Can also be used for multivariable data
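
A rough text version of a dot plot can be sketched as follows. The helper `text_dot_plot` and the data are hypothetical, the sketch assumes integer-valued observations, and the plot is rotated: values run down the page and dots extend to the right instead of stacking above a horizontal number line.

```python
from collections import Counter

def text_dot_plot(values):
    """Return the lines of a text dot plot: one row per value, one dot per occurrence."""
    counts = Counter(values)
    lines = []
    # Include every integer in the range so that gaps stay visible.
    for x in range(min(values), max(values) + 1):
        lines.append(f"{x:>3} | " + "•" * counts[x])
    return lines

# Hypothetical integer-valued observations (made up for illustration).
data = [18, 19, 19, 20, 21, 21, 21, 22, 22, 27]

for line in text_dot_plot(data):
    print(line)
```

In this made-up data set the empty rows for 23 through 26 show a gap, and 27 stands apart from the rest as a possible outlier.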



Frequency & Relative Frequency
• Frequency of a value = # of its occurrences in the data

↳ Ex: Frequency of 21 is 5

• Relative Frequency of a value = proportion or % of the whole collected data that has that value.
↳ Ex: In the dot plot above, we have 33 observations.
So 5 out of 33 observations are 21.
Rel. Frequency of 21 is 5/33 ≈ 0.15

Rel. Freq. = Frequency / Count, where Count is the sum of all frequencies.

Remark: 0 ≤ Rel. Freq. ≤ 1, and the relative frequencies should add up to 1.
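
The definitions above translate directly into a few lines of Python. This is a minimal sketch on made-up observations (not the data behind the lecture's dot plot), using `collections.Counter` from the standard library to tally the frequencies.

```python
from collections import Counter

# Hypothetical observations (made up); not the data behind the lecture's dot plot.
data = [19, 20, 20, 21, 21, 21, 22, 22, 23, 25]

freq = Counter(data)                 # frequency of each value
count = sum(freq.values())           # total number of observations
rel_freq = {value: f / count for value, f in freq.items()}

for value in sorted(freq):
    print(value, freq[value], round(rel_freq[value], 2))

# Relative frequencies lie between 0 and 1 and sum to 1
# (up to floating-point error).
total = sum(rel_freq.values())
```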
(Rel.) Frequency Distribution
A (rel.) frequency distribution is a tabulation of the (rel.) frequencies.

Here, frequency = # of games.

Note: The sum of rel. frequencies is not exactly 1 because of rounding after calculation of each rel. frequency.
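
The rounding effect is easy to reproduce. The sketch below uses made-up frequencies (three values that each occur once) and rounds every relative frequency to two decimals, which is an assumed rounding precision; the rounded entries then no longer add up to 1.

```python
# Hypothetical frequencies for three values (made up for illustration).
frequencies = [1, 1, 1]
count = sum(frequencies)

exact = [f / count for f in frequencies]
rounded = [round(r, 2) for r in exact]     # each one rounded to 2 decimals

print(rounded)                  # the tabulated relative frequencies
print(round(sum(rounded), 2))   # sum of the rounded entries
print(round(sum(exact), 2))     # sum before rounding
```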
Histogram for Discrete Data
A histogram is a graphical representation of the (rel.)
frequency distribution and is constructed as follows:



Histogram for Continuous Data
To construct a histogram for continuous data,

1) Subdivide the horizontal axis into a suitable number of class intervals or classes, such that each observation is contained in exactly one class.

Typically, between 5 and 20 classes should be satisfactory. One can also calculate:

# of classes ≈ √(# of observations)
# of classes ≈ 1 + 3.3 log₁₀(# of observations)

2) Approximate the class width ≈ (largest value − smallest value) / (number of classes), where largest value − smallest value is called the range.
3) Follow book’s instructions
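
The two rules of thumb and the class-width formula can be sketched as below. The function names are hypothetical, the logarithm is assumed to be base 10, and the inputs are the figures from the lecture's worked example (90 observations ranging from 2.97 to 18.26).

```python
import math

def suggested_class_counts(n):
    """Two rules of thumb for the number of classes, given n observations."""
    by_sqrt = math.sqrt(n)
    by_log = 1 + 3.3 * math.log10(n)   # base-10 logarithm is an assumption
    return by_sqrt, by_log

def approx_class_width(smallest, largest, num_classes):
    """Range divided by the number of classes."""
    return (largest - smallest) / num_classes

# Figures from the lecture's worked example.
n, smallest, largest = 90, 2.97, 18.26

by_sqrt, by_log = suggested_class_counts(n)
num_classes = round(by_sqrt)
width = approx_class_width(smallest, largest, num_classes)

print(round(by_sqrt, 2), round(by_log, 2))
print(num_classes, round(width, 3))
```

Both rules only suggest a starting point; the resulting width is usually rounded to a convenient number so that the class boundaries are easy to read.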



Example
Raw data of 90 observations

# of classes ≈ √90 ≈ 9.49 ≈ 9

class width ≈ (18.26 − 2.97) / 9 ≈ 1.699 ≈ 2

Of course, one could have started the 1st class from 0 or 2, as long as each measurement belongs to exactly one class.

Endpoints of the classes or class intervals are called class boundaries.
Ex: 1 is the lower boundary of the 1st class, and 3 is the upper boundary of the 1st class.
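
Here is a minimal sketch of building the class boundaries and counting the observations in each class. The helper names and the observations are hypothetical (only the extremes 2.97 and 18.26 come from the example). Placing a value that sits exactly on a boundary into the class to its right is an assumed convention; it guarantees that each value lands in exactly one class.

```python
def class_boundaries(start, width, num_classes):
    """Boundaries start, start + width, ..., start + num_classes * width."""
    return [start + i * width for i in range(num_classes + 1)]

def class_frequencies(values, boundaries):
    """Count the values in each class [lower, upper)."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        for i in range(len(counts)):
            if boundaries[i] <= v < boundaries[i + 1]:
                counts[i] += 1
                break
        else:
            # No class contained the value.
            raise ValueError(f"{v} falls outside all classes")
    return counts

# 9 classes of width 2 starting at 1, as in the example above.
boundaries = class_boundaries(start=1, width=2, num_classes=9)

# Hypothetical observations (made up); only 2.97 and 18.26 come from the example.
data = [2.97, 3.0, 4.5, 5.0, 7.2, 7.9, 8.1, 11.4, 18.26]

print(boundaries)
print(class_frequencies(data, boundaries))
```

These class frequencies are the bar heights of the histogram; dividing each by the number of observations gives the relative-frequency version.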
Frequency Polygons

• A Polygon is a graph formed by joining the midpoints of the tops of successive bars in a histogram with straight lines.

• As the number of classes increases, the polygon becomes smoother; the smooth curve it approaches is referred to as the distribution curve.
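
The points that a frequency polygon joins can be computed from the class boundaries and frequencies. The numbers below are made up for illustration; the sketch only finds the points and leaves the drawing to whatever plotting tool is at hand.

```python
# Hypothetical class boundaries and frequencies (made up for illustration).
boundaries = [1, 3, 5, 7, 9]
frequencies = [2, 5, 4, 1]

# The midpoint of the top of each bar is (class midpoint, bar height).
polygon_points = [
    ((lower + upper) / 2, f)
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies)
]

print(polygon_points)
```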



Shapes of Histograms
1. Symmetric Histograms

2. Skewed
• Right-Skewed or Positively Skewed
• Left-Skewed or Negatively Skewed

3. Uniform or Rectangular

4. Unimodal = 1 peak

5. Bimodal = 2 peaks
