STAT 1124 - Chapter 1
Chapter 1
What is statistics?
A dataset is a collection of information about some group of individuals (or subjects, cases, items, or units) such as people,
Individuals are people, animals, cars, things, or objects described by a dataset and on which data are collected.
Variables of interest are characteristics or properties or attributes of the individuals which may take different values for
different individuals.
An observation is a row in the spreadsheet which contains the measurement(s) (numbers, letters, or words) on one
individual’s variable(s).
A categorical variable has two or more groups or categories or classes into which an individual would be placed.
A quantitative variable takes numeric values recorded with a unit of measurement such as hours, minutes, percentages,
Note:
Sometimes the categories of a categorical variable are stored as numbers but these numbers are just labels for the categories
Thus, categorical variables represent data that are labels or names. Some categorical variables represent the data
CHAPTER 1. PICTURING DISTRIBUTIONS WITH GRAPHS 3
Example 1.1.1 Consider the following dataset obtained in a study about seven selected students from a Statistics class.
1. What individuals does the dataset describe? How many individuals are being studied?
2. How many variables does the dataset contain? What is the unit of measurement of each variable? Classify the corre-
Exploratory Data Analysis: Describing the main features of data using statistical tools and ideas.
Exploring Data
1. Examining and describing one variable and then studying the relationships among the variables.
2. Creating a relevant graph or graphs and then calculating numerical summaries of specific aspects of the data.
We usually want to display the distribution of a single variable in order to examine it.
Distribution of a Variable
The distribution of a variable indicates what values it takes and how often it takes these values.
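As a minimal sketch of this definition, the values of a variable can be tallied to get each value's count and percent. The function name distribution and the sample responses below are hypothetical and are not taken from the data in these notes.

```python
from collections import Counter

def distribution(values):
    """Map each distinct value to (count, percent of all observations)."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, round(100 * count / total, 1))
            for value, count in counts.items()}

# Hypothetical responses, for illustration only.
responses = [
    "high school graduate", "college graduate", "some college",
    "high school graduate", "some high school", "college graduate",
    "high school graduate", "some college",
]

for level, (count, percent) in distribution(responses).items():
    print(f"{level}: {count} ({percent}%)")
```

The counts add up to the number of observations, and the percents add up to 100 (up to rounding).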
Now we are going to describe the distribution of a single categorical variable using graphs.
The distribution of a categorical variable lists the categories of the variable as well as the count (frequency) or the percent
Example 1.2.1 787 employees of a company were asked to complete a survey on their education level (some high school,
high school graduate, some college, and college graduate). Here are the data on the percents and counts of employees who
1- Pie charts
Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the
categories.
• Pie charts are used to emphasize each category’s relation to the whole.
2- Bar graphs
Bar graphs represent each category of a categorical variable as a bar. The height of each bar over each category indicates
Figure 1.3 describes the distribution of education levels of employees when the bars follow the alphabetical order of education
levels.
Figure 1.5: Excel bar graph for Education Level when the bars appear in order of height
What is the height of the bar related to the “High School Graduate ” level?
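The two bar orders (alphabetical and by height) can be sketched in text form as follows. The function name bar_rows and the counts are hypothetical; they are not the survey results from Example 1.2.1.

```python
def bar_rows(counts, by_height=False):
    """Return (category, bar) pairs, alphabetically or from tallest to shortest bar."""
    if by_height:
        ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    else:
        ordered = sorted(counts.items())
    # The bar length (number of '#') plays the role of the bar height.
    return [(category, "#" * count) for category, count in ordered]

# Hypothetical counts, for illustration only.
counts = {"Some college": 4, "College graduate": 6, "High school graduate": 3}

for category, bar in bar_rows(counts, by_height=True):
    print(f"{category:<22}{bar}")
```

Ordering the bars by height makes it easier to see which categories occur most often; the heights themselves do not change.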
Quantitative variables often take many different values. The distribution of a quantitative variable tells us what values the
The two commonly used graphs to display the distribution of a quantitative variable are:
• Histograms show the distribution of a quantitative variable by using bars whose height represents the number (or
• Stemplots or stem-and-leaf plots separate each observation into a stem and a leaf that are then plotted to display
Histograms
• The range of the data set is divided into some (usually between 5 and 20) classes (or class intervals) with equal width.
• Since the class widths are equal, a taller bar has a larger area and represents more individuals.
• You can use the following formula to calculate the approximate class width:
• Each value in the data set falls into one and only one class. A class starts with the minimum value of the class called
the lower class limit and ends with the maximum value of the class called the upper class limit.
• The number of individuals which fall into each class is called the class frequency (or class count).
• The horizontal axis shows the classes and is marked in the units of measurement for the variable of interest while the
• There is no gap between adjacent histogram bars, which shows that the bars cover all the values of the variable. A gap
between bars in a histogram means that no values fall into that particular class.
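The steps above can be sketched in code. The class-width formula is not reproduced in these notes, so the sketch assumes a commonly used approximation: (largest value - smallest value) / (number of classes), rounded up. The function names and the ages are hypothetical.

```python
import math

def class_width(values, num_classes):
    """Assumed formula: range divided by the number of classes, rounded up."""
    return max(1, math.ceil((max(values) - min(values)) / num_classes))

def class_frequencies(values, num_classes):
    """Return (lower limit, upper limit, frequency) for each equal-width class."""
    low = min(values)
    width = class_width(values, num_classes)
    counts = [0] * num_classes
    for v in values:
        # Each value falls into exactly one class; the last class also
        # includes its upper limit so that the maximum is counted.
        index = min(int((v - low) // width), num_classes - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(num_classes)]

# Hypothetical ages, for illustration only.
ages = [21, 25, 32, 38, 40, 47, 53, 59, 60]
for lower, upper, freq in class_frequencies(ages, 4):
    print(f"{lower} to under {upper}: {freq}")
```

Here a value equal to a shared class limit is placed in the class that starts at that limit; other conventions exist, and software packages differ on this point.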
Example 1.3.1 The following histogram summarizes the annual sales amount (in thousands of dollars) for some selected
What is the difference between the two histograms above (Figure 1.6 and Figure 1.7)?
Note:
• There are some recommendations for selecting the number of classes in a histogram but there is no unique right choice.
• Trial and error and the resulting judgement are used to determine the number of classes in order to describe the shape.
• Not too few classes: when all values fall into a few classes with tall bars, information is lost (“skyscraper”
graph).
Which histogram best describes the numerical variables Sales Amount and Age?
• We use a graph such as a histogram to describe the overall pattern and to look for striking deviations from
that pattern.
• The overall pattern of a histogram can be described by its shape, center, and variability (spread).
• An important kind of deviation is an outlier, an individual that falls outside the overall pattern. We only look for
strong outliers that suggest something special about the observations, such as a typing error. Usually, look for outliers
• The shape of a distribution can be described by explaining the symmetry or skewness of the histogram and whether
the distribution has a single peak (unimodal) or multiple peaks (multimodal). Try to find major peaks not minor ups
• The centre of a distribution can be described by its midpoint, the value such that half of the observations are smaller
than it and half of the observations are larger.
• The variability or spread can be shown by the difference between the largest value and the smallest value, called the range.
• A histogram of a very large data set with small classes appears as a smooth curve.
• A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
• A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther
• A distribution is skewed to the left if the left side of the histogram extends much farther out than the right side.
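The center and spread described above can be computed directly. In this sketch the midpoint is taken to be the median, and the lengths of the two tails (distance from the midpoint to each extreme) are compared as a rough numeric hint of skewness. That comparison is an assumption of this sketch, not the method of these notes, which judge shape from the histogram. The function names and the data are hypothetical.

```python
import statistics

def center_and_spread(values):
    """Return (midpoint, range): the median and largest minus smallest value."""
    return statistics.median(values), max(values) - min(values)

def tail_lengths(values):
    """Return (left tail, right tail): distances from the midpoint to each extreme."""
    midpoint = statistics.median(values)
    return midpoint - min(values), max(values) - midpoint

# Hypothetical data with one large value, for illustration only.
data = [2, 3, 3, 4, 5, 6, 20]
midpoint, value_range = center_and_spread(data)
left, right = tail_lengths(data)
print(midpoint, value_range)  # midpoint and range
print(left, right)            # a much longer right tail hints at right skew
```

A single extreme value such as 20 stretches the range and the right tail, so it should also be checked as a possible outlier before the shape is described as skewed.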
The second graph for describing a quantitative variable is a stemplot (stem-and-leaf plot), which is mostly used for a small
data set (usually fewer than 100 observations) and provides more detail than a histogram.
1- Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining
final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2- Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this
column. Be sure to include all the stems needed to span the data, even when some stems have no leaves.
3- Write each leaf in the row to the right of its stem, in increasing order out from the stem.
147, 232, 547, 328, 295, 194, 368, 456, 410, 298, 321, 190, 211, 413, 123, 128, 189, 136, 150, 129, 110, 250, 259, 200, 200, 650,
Stems Leaves
1 1222345899
2 00135599
3 226
4 116
5 04
6 05
7 0
8 0
Leaf Unit = 10
Note: If you have too many leaves for one stem, you can split the stem.
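Steps 1 to 3 can be sketched in code. The function name stemplot and its leaf_unit parameter are hypothetical. With leaf_unit=1 the leaf is the final digit, as in step 1; with leaf_unit=10 the final digit is dropped (truncated, not rounded) before splitting, which is one way to read "Leaf Unit = 10". The sketch assumes non-negative values, and the data are made up for illustration.

```python
from collections import defaultdict

def stemplot(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf plot as strings."""
    # Express each value in leaf units, dropping anything smaller than one unit.
    scaled = sorted(int(v // leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        stem, leaf = divmod(s, 10)  # leaf = final digit, stem = the rest
        leaves[stem].append(leaf)   # sorted input keeps leaves in increasing order
    first, last = scaled[0] // 10, scaled[-1] // 10
    # Include every stem between the smallest and largest, even with no leaves.
    return [f"{stem} | {''.join(str(d) for d in leaves[stem])}"
            for stem in range(first, last + 1)]

# Hypothetical data, for illustration only.
for row in stemplot([12, 15, 21, 27, 27, 45]):
    print(row)
```

The stem 3 appears with no leaves because no value lies between 30 and 39, which matches the instruction in step 2 to include all stems needed to span the data.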
Acknowledgement
The core content of the slides is from the textbook of this course;
by