The document discusses various ways to describe patterns of data distribution including the center, spread, shape, and unusual features. It covers different charts and graphs used to visualize distributions such as dot plots, bar charts, histograms, stem-and-leaf plots, box plots, and scatter plots. Tables are also presented as an alternative way to display data. Key aspects like comparing distributions and Simpson's paradox are mentioned.
Lecturer: Le Hoai Long (Ph.D.), lehoailong@hcmut.edu.vn

Center
• The center of a distribution is located at the median of the distribution.
• This is the point where about half of the observations are on either side.
Spread
• The spread of a distribution refers to the variability of the data.
• If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.
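As a minimal sketch of the two ideas above, the median can serve as the center and the range (largest minus smallest observation) as one simple measure of spread. The two data sets are made up for illustration, and the helper name `describe` is hypothetical.

```python
from statistics import median

# Hypothetical observations, made up for illustration only.
clustered = [48, 49, 50, 50, 51, 52]
spread_out = [10, 30, 50, 50, 70, 90]

def describe(data):
    """Return (median, range) as simple measures of center and spread."""
    return median(data), max(data) - min(data)

center_a, spread_a = describe(clustered)
center_b, spread_b = describe(spread_out)

# Both sets share the same center, but the second covers a much wider range.
print("clustered: ", center_a, spread_a)
print("spread out:", center_b, spread_b)
```

The range is only one possible choice here; other measures of spread would fit the same pattern.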
Shape
• The shape of a distribution is described by the following characteristics.
– Symmetry.
– Number of peaks. Distributions can have few or many peaks.
• Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal.
Shape (continued)
• The shape of a distribution is also described by the following characteristics.
– Skewness. Distributions with most of their observations on the left (toward lower values) are said to be skewed right; distributions with most of their observations on the right (toward higher values) are said to be skewed left.
– Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution.
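The direction of skew can also be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is an assumption of this example and not a formula given in the slides. A positive value suggests a right skew, a negative value a left skew, and a value near zero a roughly symmetric shape. The data are made up.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (an assumed, common definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets, made up for illustration only.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 10]  # most values low, long tail to the right
left_skewed = [-x for x in right_skewed]        # mirror image
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3))
```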
Gap and outlier
• Gaps: areas of a distribution where there are no observations.
• Outliers: extreme values that differ greatly from the other observations.

Chart and graph: Dotplot
• A dotplot is made up of dots plotted on a graph.
– Each dot can represent a single observation or a specified number of observations.
– The dots are stacked in a column over a category.
– If the categories are quantitative, the pattern of data in a dotplot can be described in terms of symmetry and skewness.
• Dotplots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.
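A rough text version of a dotplot can be sketched as follows, with one dot per observation. For simplicity the dots run along rows instead of being stacked in columns, and every integer between the smallest and largest value gets a row so that gaps stay visible. The data and the function name `dotplot` are hypothetical.

```python
from collections import Counter

def dotplot(values):
    """Return one text line per integer in the data range, one dot per observation."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        dots = "." * counts[value]  # a missing value has count 0, so the row stays empty
        lines.append(f"{value:>3} | {dots}")
    return lines

# Hypothetical counts, made up for illustration only.
visits = [1, 2, 2, 3, 3, 3, 4, 4, 6]
lines = dotplot(visits)
for line in lines:
    print(line)
```

The empty row at 5 is a gap in the sense defined above.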
Chart and graph: Bar Charts
• A bar chart is made up of columns plotted on a graph.
– The columns are positioned over a label that represents a categorical variable.
– The height of the column indicates the size of the group defined by the column label.
Chart and graph: Histograms
• Like a bar chart, a histogram is made up of columns plotted on a graph. Usually, there is no space between adjacent columns.
– The columns are positioned over a label that represents a quantitative variable.
– The column label can be a single value or a range of values.
– The height of the column indicates the size of the group defined by the column label.
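The column heights of a histogram are simply counts per range of values. The sketch below assumes equal-width ranges of the form [lo, lo + width); the data, the bin settings, and the function name `histogram_counts` are all made up for illustration.

```python
def histogram_counts(data, start, width, n_bins):
    """Count observations in n_bins consecutive ranges [lo, lo + width)."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

# Hypothetical test results, made up for illustration only.
strengths = [21, 23, 24, 25, 25, 26, 27, 28, 28, 29, 31, 34]
counts = histogram_counts(strengths, start=20, width=5, n_bins=3)

# Print each range with a bar whose length is the group size.
for i, count in enumerate(counts):
    lo = 20 + 5 * i
    print(f"{lo}-{lo + 4}: {'#' * count}")
```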
Bar chart and histogram
• In SPSS: Graphs => Legacy dialogs => Bar (Histogram)
Chart and graph: Difference Between Bar Charts and Histograms
• With bar charts, each column represents a group defined by a categorical variable; with histograms, each column represents a group defined by a quantitative variable.
• It is always appropriate to talk about the skewness of a histogram. What about bar charts?
Chart and graph: Stemplots (stem and leaf)
• A stemplot is used to display quantitative data, generally from small data sets (50 or fewer observations).
• The entries on the left are called stems, and the entries on the right are called leaves.
• Stemplots usually do not include explicit labels for the stems and leaves.
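A stemplot can be built by splitting each value into a stem and a leaf. The sketch below assumes non-negative two-digit integers, taking the tens digit as the stem and the units digit as the leaf; the data and the function name `stemplot` are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem-and-leaf lines for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include stems with no leaves so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {leaf_text}")
    return lines

# Hypothetical scores, made up for illustration only.
scores = [12, 15, 21, 24, 24, 27, 30, 33, 38, 51]
stem_lines = stemplot(scores)
for line in stem_lines:
    print(line)
```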
Chart and graph: Boxplot Basics
• A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" which goes from the first quartile (Q1) to the third quartile (Q3).
• Within the box, a vertical line is drawn at Q2, the median of the data set.
• Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the back whisker goes from Q3 to the largest non-outlier.
• If the data set includes one or more outliers, they are plotted separately as points on the chart.
• In SPSS: Graphs => Legacy dialogs => Boxplot
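The numbers behind a boxplot can be sketched as below. Two assumptions are made that the slides do not specify: quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`, and an observation counts as an outlier when it lies more than 1.5 × IQR beyond Q1 or Q3 (a common convention; other software may use different rules). The data and the function name `boxplot_summary` are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(data):
    """Return quartiles, whisker ends, and outliers using the assumed 1.5 * IQR rule."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": min(inside),    # smallest non-outlier
        "whisker_high": max(inside),   # largest non-outlier
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical observations, made up for illustration only.
durations = [2, 4, 5, 6, 7, 8, 9, 10, 30]
summary = boxplot_summary(durations)
for key, value in summary.items():
    print(key, value)
```

Different quartile conventions can give slightly different box edges for the same data, so the result may not match a given package exactly.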
Chart and graph: Scatterplot
• A scatterplot is a graphic tool used to display the relationship between two quantitative variables.
• A scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots.
• Each dot on the scatterplot represents one observation from a data set.
• Scatterplots are used to analyze patterns in bivariate data.
• These patterns are described in terms of linearity, slope, and strength.
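One way to put a number on the slope direction and strength of a linear pattern is the Pearson correlation coefficient. Using it here is an assumption of this sketch, since the slides do not name a specific measure. Its sign gives the direction of the slope, and its magnitude (between 0 and 1) gives the strength of the linear relationship. The paired data are made up.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations, made up for illustration only.
floor_area = [50, 80, 100, 120, 150]
cost = [55, 82, 98, 125, 148]

r = pearson_r(floor_area, cost)
print("positive slope" if r > 0 else "negative slope", round(r, 3))
```

A value near zero does not rule out a relationship; it only indicates that the pattern is not well described by a straight line.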
Compare distributions
• Focus on four features:
– Center.
– Spread.
– Shape.
– Unusual features.
Table
• Alternatively, data can be presented in table form:
– One-way table.
– Two-way table.
Table: one-way
• A one-way table is the tabular equivalent of a bar chart. Like a bar chart, a one-way table displays categorical data in the form of frequency counts and/or relative frequencies.
– Frequency tables: a one-way table that shows frequency counts for each category of a categorical variable.
– Relative frequency tables: a one-way table that shows relative frequencies for each category of a categorical variable.

Table: two-way
• A two-way table (also called a contingency table) is a useful tool for examining relationships between categorical variables. As in a one-way table, the entries in the cells of a two-way table can be frequency counts or relative frequencies.
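Both kinds of table can be sketched by counting records. In the example below, each record is a hypothetical (contract type, quality rating) pair made up for illustration; the one-way table counts a single variable, and the two-way table counts combinations of the two variables.

```python
from collections import Counter

# Hypothetical records, made up for illustration only: (contract type, quality rating).
records = (
    [("Civil", "Good")] * 3
    + [("Civil", "Poor")] * 2
    + [("Industrial", "Good")] * 2
    + [("Industrial", "Poor")] * 1
)

# One-way table: frequency counts and relative frequencies for contract type.
one_way = Counter(kind for kind, _ in records)
relative = {kind: count / len(records) for kind, count in one_way.items()}

# Two-way (contingency) table: counts for each (contract type, rating) cell.
two_way = Counter(records)

for kind in sorted(one_way):
    print(kind, one_way[kind], relative[kind])
for kind in sorted(one_way):
    print(kind, "Good:", two_way[(kind, "Good")], "Poor:", two_way[(kind, "Poor")])
```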
Be careful: Simpson's paradox
• Simpson's paradox (or the Yule-Simpson effect) is a paradox in which a correlation present in different groups is reversed when the groups are combined.
• It occurs when frequency data are hastily given causal interpretations.
• Simpson's paradox disappears when causal relations are brought into consideration (Wikipedia).
Be careful: Simpson's paradox (example)
• Consider the situation of two contractors in the table below (good quality / number of contracts).
• Who is better? (Long N.D. 2010)

Type of contract    Civil            Industrial       Total
Contractor A        40/60 (66.7%)    13/15 (86.7%)    53/75 (70.7%)
Contractor B        5/8   (62.5%)    42/50 (84.0%)    47/58 (81.0%)
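The reversal in the contractor table can be checked with a few lines of arithmetic. The sketch below uses the counts from the table (good quality / number of contracts); the variable and function names are hypothetical.

```python
# Counts taken from the contractor table: (good-quality contracts, total contracts).
contracts = {
    "A": {"Civil": (40, 60), "Industrial": (13, 15)},
    "B": {"Civil": (5, 8), "Industrial": (42, 50)},
}

def rate(contractor, kind):
    """Share of good-quality contracts for one contractor and contract type."""
    good, total = contracts[contractor][kind]
    return good / total

def overall_rate(contractor):
    """Share of good-quality contracts with both contract types combined."""
    good = sum(g for g, _ in contracts[contractor].values())
    total = sum(t for _, t in contracts[contractor].values())
    return good / total

for kind in ("Civil", "Industrial"):
    print(kind, round(rate("A", kind), 3), round(rate("B", kind), 3))
print("Total", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```

Contractor A has the higher rate within each contract type, yet Contractor B has the higher rate overall, because most of B's contracts are industrial, where both contractors have higher rates.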