STAT 1124 - Chapter 1


Contents

1 Picturing Distributions with Graphs

1.1 Individuals and Variables
1.2 Categorical variables: pie charts and bar graphs
1.3 Quantitative variables: Histograms
1.4 Interpreting Histograms
1.5 Quantitative variables: Stemplots

Chapter 1

Picturing Distributions with Graphs

What is statistics?

• The science of data (information).


• The science of collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting numerical and
categorical information.

1.1 Individuals and Variables

A dataset is a collection of information about some group of individuals (also called subjects, cases, items, or units), such as people, cars, or nations.

Individuals are people, animals, cars, things, or objects described by a dataset and on which data are collected.

Variables of interest are characteristics, properties, or attributes of the individuals which may take different values for different individuals.

An observation is a row in the spreadsheet that contains the measurements (numbers, letters, or words) of the variables for one individual.

Categorical and Quantitative Variables

A categorical variable has two or more groups or categories or classes into which an individual would be placed.

A quantitative variable takes numeric values recorded with a unit of measurement such as hours, minutes, percentages, or kilograms. Ordinary arithmetic operations are meaningful for quantitative data.

Note:

Sometimes the categories of a categorical variable are stored as numbers, but these numbers are just labels for the categories and have no unit of measurement (no numerical meaning).

Thus, categorical variables record data that are labels or names. For some categorical variables, the order or rank of the categories is meaningful.
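The distinction in the note can be illustrated with a minimal Python sketch. The data here are hypothetical (they are not taken from Figure 1.1): `program_code` stands for a categorical variable stored as numbers, and `hours_studied` for a quantitative variable measured in hours.

```python
from collections import Counter

# Hypothetical data for seven students (not from Figure 1.1).
program_code = [1, 2, 2, 3, 1, 2, 3]                   # categorical: 1, 2, 3 are only labels
hours_studied = [5.0, 3.5, 8.0, 2.0, 6.5, 4.0, 7.0]    # quantitative: measured in hours

# Meaningful for a categorical variable: count the individuals in each category.
code_counts = Counter(program_code)

# Meaningful for a quantitative variable: ordinary arithmetic such as an average.
mean_hours = sum(hours_studied) / len(hours_studied)

# Computable, but with no interpretation: the "average program code".
mean_code = sum(program_code) / len(program_code)

print(code_counts)
print(round(mean_hours, 2), "hours")
print(mean_code, "(meaningless: the codes are labels)")
```

Python will happily average the codes, so it is up to the analyst to decide which operations make sense for each variable.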


Example 1.1.1 Consider the following dataset obtained in a study about seven selected students from a Statistics class.

Figure 1.1: A spreadsheet from seven students of a Statistics class

1. What individuals does the dataset describe? How many individuals are being studied?

2. How many variables does the dataset contain? What is the unit of measurement of each variable? Classify the corresponding variables as categorical or quantitative.



1.2 Categorical variables: pie charts and bar graphs

Exploratory Data Analysis: Describing main features of data by statistical tools and ideas.

Exploring Data

1. Examining and describing one variable and then studying the relationships among the variables.

2. Creating a relevant graph or graphs and then calculating numerical summaries of specific aspects of the data.

We usually want to display the distribution of a single variable in order to examine it.

Distribution of a Variable

The distribution of a variable indicates what values it takes and how often it takes these values.

Now we are going to describe the distribution of a single categorical variable using graphs.

The distribution of a categorical variable lists the categories of the variable as well as the count (frequency) or the percent of individuals who fall into each category.

Example 1.2.1 787 employees of a company were asked to complete a survey on their education level (some high school, high school graduate, some college, and college graduate). Here are the data on the percents and counts of employees who have different education levels. Complete the following table.

Education Level      | Number of Employees (frequency or count) | Percent of Employees | Relative Frequency
Some High School     | 123                                      |                      |
High School Graduate | 251                                      |                      |
Some College         | 189                                      |                      |
College Graduate     | 224                                      |                      |
Total                |                                          |                      |

1. What percentage of employees did not go to college?

2. What proportion of employees are college graduates?
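The arithmetic behind the table and the two questions can be sketched in Python as follows. The counts come from the example. The sketch assumes that "did not go to college" means the two categories Some High School and High School Graduate.

```python
# Counts from Example 1.2.1.
counts = {
    "Some High School": 123,
    "High School Graduate": 251,
    "Some College": 189,
    "College Graduate": 224,
}
total = sum(counts.values())  # should equal the 787 surveyed employees

# Relative frequency = count / total; percent = 100 * relative frequency.
relative_frequency = {level: n / total for level, n in counts.items()}
percent = {level: 100 * rf for level, rf in relative_frequency.items()}

for level, n in counts.items():
    print(f"{level:<22}{n:>5}{percent[level]:>8.1f}%{relative_frequency[level]:>8.3f}")
print(f"{'Total':<22}{total:>5}")

# Question 1 (assumed reading: the two categories without any college).
pct_no_college = percent["Some High School"] + percent["High School Graduate"]

# Question 2: a proportion is a relative frequency between 0 and 1.
prop_college_grad = relative_frequency["College Graduate"]

print(f"Did not go to college: {pct_no_college:.1f}%")
print(f"College graduates: {prop_college_grad:.3f}")
```

The percents should add up to 100 and the relative frequencies to 1, apart from rounding.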

1- Pie charts

Pie charts show the distribution of a categorical variable as a “pie” whose slices are sized by the counts or percents for the categories.

• A pie chart must include all the categories of a categorical variable.

• The slices of a pie represent the categories of the categorical variable.



• The size of each slice is proportional to the percent of individuals in its category.

• Pie charts are used to emphasize each category’s relation to the whole.

Figure 1.2: Excel pie chart for Education Level

2- Bar graphs

Bar graphs represent each category of a categorical variable as a bar. The height of each bar indicates the count or percent of the corresponding category.

Figure 1.3: Excel bar graph for Education Level

Figure 1.3 describes the distribution of education levels of employees when the bars follow the alphabetical order of education levels.

Figure 1.4: Excel bar graph for Education Level

Figure 1.5: Excel bar graph for Education Level when the bars appear in order of height

What is the height of the bar related to the “High School Graduate” level?
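The two bar orderings discussed above (alphabetical and by height) can be sketched as a rough text bar graph in Python. The counts are those of Example 1.2.1. The helper name `text_bar_graph`, the scale of one `#` per 10 employees, and the choice of tallest bar first are assumptions made for illustration only; the Excel figures may be laid out differently.

```python
# Counts from Example 1.2.1.
counts = {
    "Some High School": 123,
    "High School Graduate": 251,
    "Some College": 189,
    "College Graduate": 224,
}

# Two possible bar orders: alphabetical, and by height (tallest first, an assumption).
alphabetical = sorted(counts)
by_height = sorted(counts, key=counts.get, reverse=True)

def text_bar_graph(order, scale=10):
    """Return one text line per category; each '#' stands for `scale` employees."""
    return [f"{level:<22}{'#' * (counts[level] // scale)} {counts[level]}"
            for level in order]

print("Alphabetical order:")
print("\n".join(text_bar_graph(alphabetical)))
print("Ordered by height:")
print("\n".join(text_bar_graph(by_height)))
```

Reordering the bars changes only the presentation. The height of each bar, and therefore the distribution, stays the same.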

1.3 Quantitative variables: Histograms

Quantitative variables often take many different values. The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.

The two commonly used graphs to display the distribution of a quantitative variable are:

• Histograms show the distribution of a quantitative variable by using bars whose height represents the number (or percent) of individuals who take on a value within a particular class.

• Stemplots or stem-and-leaf plots separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.



Histograms

• The range of the data set is divided into several (usually between 5 and 20) classes (or class intervals) of equal width.

• Since the class widths are equal, a taller bar has a larger area and represents more individuals.

• You can use the following formula to calculate the approximate class width:

$$\text{Class Width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of Classes}}$$

• Each value in the data set falls into one and only one class. A class starts with its minimum value, called the lower class limit, and ends with its maximum value, called the upper class limit.

• The number of individuals that fall into a class is called the class frequency (or class count).

• The horizontal axis shows the classes and is marked in the units of measurement of the variable of interest, while the vertical axis shows the count or percent of each class.

• There is no gap or space between the bars of a histogram, which shows that the bars cover all the values of the variable. If there is a gap between bars in a histogram, no values fall into that particular class.
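The class-width formula and the counting of class frequencies can be sketched in Python as follows. The data are the Sales Data listed in Example 1.5.1. The number of classes (8), the rounded width (100), and the first lower class limit (50) are assumptions chosen for illustration, and the sketch does not claim to reproduce Figure 1.6.

```python
# Sales Data from Example 1.5.1.
sales = [147, 232, 547, 328, 295, 194, 368, 456, 410, 298,
         321, 190, 211, 413, 123, 128, 189, 136, 150, 129,
         110, 250, 259, 200, 200, 650, 700, 600, 500, 800]

num_classes = 8                                        # assumed number of classes
raw_width = (max(sales) - min(sales)) / num_classes    # approximate class width
class_width = 100                                      # raw width rounded up to a convenient value
start = 50                                             # assumed lower limit of the first class

# Lower class limits: 50, 150, ..., up to the class that contains the largest value.
lower_limits = list(range(start, max(sales) + 1, class_width))

# Each value falls into exactly one class: lower limit <= value < lower limit + width.
frequencies = [0] * len(lower_limits)
for value in sales:
    frequencies[(value - start) // class_width] += 1

print(f"Approximate class width: {raw_width}")
for low, freq in zip(lower_limits, frequencies):
    rel = freq / len(sales)
    print(f"{low:>4} to < {low + class_width:<4} {freq:>3} {rel:>6.3f}")
```

Using half-open classes (lower limit included, upper limit excluded) is one way to guarantee that a value sitting exactly on a boundary, such as 150, is counted in only one class.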

Example 1.3.1 The following histogram summarizes the annual sales amounts (in thousands of dollars) of some selected salespeople in a company for the last fiscal year.

Figure 1.6: StatGraphics histogram of Annual Sales



1- How many salespeople were selected for this study?

2- Complete the following table:

Class       | Count (Frequency) | Relative Frequency
50 to < 150 |                   |

Consider the following histogram:

Figure 1.7: StatGraphics histogram of Annual Sales

What is the difference between the two histograms above (Figure 1.6 and Figure 1.7)?

Note:

• There are some recommendations for selecting the number of classes in a histogram, but there is no unique right choice.

• Trial and error, together with judgement, is used to choose a number of classes that describes the shape well.

• Avoid too many classes, which leaves many classes with one or no observations (a “pancake” graph).

• Avoid too few classes, which puts all the values into a few classes with tall bars and loses information (a “skyscraper” graph).

Which histogram best describes each of the numerical variables, Sales Amount and Age?

1.4 Interpreting Histograms

• We use a graph such as a histogram to describe the overall pattern of the data and to look for striking deviations from that pattern.

• The overall pattern of a histogram can be described by its shape, center, and variability (spread).

• An important kind of deviation is an outlier, an individual that falls outside the overall pattern. We only look for strong outliers that suggest something special about the observations, such as a typing error. Outliers usually appear after a large gap in the distribution.

• The shape of a distribution can be described by the symmetry or skewness of the histogram and by whether the distribution has a single peak (unimodal) or multiple peaks (multimodal). Look for major peaks, not minor ups and downs in the bars of the histogram.

• The center of a distribution can be described by its midpoint: half of the observations have values smaller than the midpoint and half have values larger than the midpoint.

• The variability or spread can be shown by the difference between the largest value and the smallest value, called the range.

• A histogram of a very large data set with small classes appears as a smooth curve.

Figure 1.8: Data set size effect on a histogram
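The center (midpoint) and spread (range) described above can be computed with a short Python sketch. It uses the Sales Data listed in Example 1.5.1, takes the median as the midpoint, and uses the standard library function `statistics.median`. The comparison of the two distances from the midpoint is only a rough, informal hint about skewness, not a formal test.

```python
import statistics

# Sales Data from Example 1.5.1.
sales = [147, 232, 547, 328, 295, 194, 368, 456, 410, 298,
         321, 190, 211, 413, 123, 128, 189, 136, 150, 129,
         110, 250, 259, 200, 200, 650, 700, 600, 500, 800]

# Center: the midpoint with half of the observations below and half above (the median).
midpoint = statistics.median(sales)

# Spread: the range, largest value minus smallest value.
data_range = max(sales) - min(sales)

# Informal look at shape: how far each extreme lies from the midpoint.
distance_to_max = max(sales) - midpoint
distance_to_min = midpoint - min(sales)

print(f"Midpoint: {midpoint}")
print(f"Range: {data_range}")
print(f"Distance to largest value: {distance_to_max}, to smallest value: {distance_to_min}")
```

With an even number of observations, `statistics.median` averages the two middle values, so the midpoint need not be one of the observed values.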

Describing Distributions: Symmetric and Skewed Distributions

• A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Figure 1.9: Symmetric



• A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side.

Figure 1.10: Right-skewed

• A distribution is skewed to the left if the left side of the histogram extends much farther out than the right side.

Figure 1.11: Left-skewed

Figure 1.12: Symmetric and Skewed Distributions



1.5 Quantitative variables: Stemplots

The second graph for describing a quantitative variable is a stemplot, or stem-and-leaf plot, which is mostly used for a small data set (usually fewer than 100 observations) and provides more detail than a histogram.

How to make a stemplot:

1- Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

2- Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column. Be sure to include all the stems needed to span the data, even when some stems have no leaves.

3- Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Example 1.5.1 Sales Data

147, 232, 547, 328, 295, 194, 368, 456, 410, 298, 321, 190, 211, 413, 123, 128, 189, 136, 150, 129, 110, 250, 259, 200, 200, 650, 700, 600, 500, 800

Stems | Leaves
1 | 1222345899
2 | 00135599
3 | 226
4 | 115
5 | 04
6 | 05
7 | 0
8 | 0

Leaf Unit = 10

What does the 6 stem contain?

Note: If you have too many leaves for one stem, you can split the stem.
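The three steps for making a stemplot can be sketched in Python using the Sales Data of Example 1.5.1. The function name `stemplot` is hypothetical. The sketch assumes that, with Leaf Unit = 10, the units digit of each value is truncated (not rounded) before the value is split into stem and leaf.

```python
from collections import defaultdict

# Sales Data from Example 1.5.1.
sales = [147, 232, 547, 328, 295, 194, 368, 456, 410, 298,
         321, 190, 211, 413, 123, 128, 189, 136, 150, 129,
         110, 250, 259, 200, 200, 650, 700, 600, 500, 800]

def stemplot(values, leaf_unit=10):
    """Return the rows of a stemplot as text lines (digits below leaf_unit are truncated)."""
    leaves = defaultdict(list)
    # Sorting first means each stem receives its leaves in increasing order (step 3).
    for value in sorted(values):
        truncated = value // leaf_unit        # drop the digits below the leaf unit
        stem, leaf = divmod(truncated, 10)    # step 1: final digit is the leaf, the rest is the stem
        leaves[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, even stems with no leaves.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}")
    return rows

print("\n".join(stemplot(sales)))
print("Leaf Unit = 10")
```

Because the units digit is dropped, a row such as `6 | 05` stands for values of about 600 and 650. The exact original values are recoverable only when the leaf unit is 1.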

Acknowledgement

The core content of the slides is from the textbook of this course: The Basic Practice of Statistics (8th Edition) by Moore, Notz, and Fligner.
