Lecture 3 - Frequencies


Statistical analysis 2021/22

Lecture 3

Frequency Distributions
Data Visualization
Learning Objectives

After this lecture you should be able to:

LO 1: Summarize qualitative data by constructing a frequency distribution.


LO 2: Summarize quantitative data by constructing a frequency distribution.
LO 3: Construct and interpret a pie chart and a bar chart.
LO 4: Construct and interpret other graphs (e.g., histogram, polygon, ogive,
scatterplot).
Frequency Distribution: Qualitative Data
• A frequency distribution for qualitative data groups data into categories and
records the number of observations that fall into each category.

Source: McClave, J.T, & Sincich, T. (2018). Statistics. Twelfth Edition. Boston, MA:
Pearson, pp. 28-29.
Frequency Distribution: Qualitative Data Example
• To illustrate the construction of a frequency distribution with nominal data,
Table 1 shows the weather for the month of February 2010 in Seattle,
Washington.

MON        TUE        WED        THUR       FRI        SAT        SUN
1 Rainy    2 Rainy    3 Rainy    4 Rainy    5 Rainy    6 Rainy    7 Rainy
8 Rainy    9 Cloudy   10 Rainy   11 Rainy   12 Rainy   13 Rainy   14 Rainy
15 Rainy   16 Rainy   17 Sunny   18 Sunny   19 Sunny   20 Sunny   21 Sunny
22 Sunny   23 Rainy   24 Rainy   25 Rainy   26 Rainy   27 Rainy   28 Sunny
Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating With Numbers. New York, NY: McGraw-Hill
Education, p. 20.

• We first note that the weather in Seattle is categorized as cloudy, rainy,
or sunny. The first column in Table 2 lists these categories.
• Initially, we use a “tally” column to record the number of days that fall into
each category.
• A frequency distribution in its final form does not include the tally column.

WEATHER   TALLY                     FREQUENCY
CLOUDY    I                         1
RAINY     IIIII IIIII IIIII IIIII   20
SUNNY     IIIII II                  7

Total = 28 days
Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating With Numbers. New
York, NY: McGraw-Hill Education, p. 21.

WEATHER   Absolute Frequency   Relative Frequency   % Relative Frequency
(xi)      (fi)                 (ni = fi / N)        (Class percentage)
CLOUDY    1                    1 / 28 = 0,036       3,6 %
RAINY     20                   20 / 28 = 0,714      71,4 %
SUNNY     7                    7 / 28 = 0,250       25,0 %
TOTAL     28                   1                    100 %

The relative frequencies ALWAYS sum to 1 and the percentages ALWAYS sum to
100 % (up to rounding).
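The construction above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the list `weather` re-enters the 28 daily observations from the calendar table, and the helper name `frequency_table` is a hypothetical choice.

```python
from collections import Counter

# February 2010 weather in Seattle, day 1 to day 28, as listed in the calendar table.
weather = (
    ["Rainy"] * 8 + ["Cloudy"] + ["Rainy"] * 7
    + ["Sunny"] * 6 + ["Rainy"] * 5 + ["Sunny"]
)

def frequency_table(observations):
    """Return {category: (absolute, relative, percent)} for qualitative data."""
    counts = Counter(observations)
    n = len(observations)
    return {
        category: (f, f / n, 100 * f / n)
        for category, f in sorted(counts.items())
    }

table = frequency_table(weather)
for category, (f, rel, pct) in table.items():
    # Columns: category, absolute frequency, relative frequency, percentage.
    print(f"{category:<8}{f:>4}{rel:>8.3f}{pct:>7.1f} %")
```

The relative frequencies are computed from the exact counts, so their sum is 1 before any rounding is applied for display.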
Graphical Descriptive Methods: Qualitative Data

Bar Graph: The categories (classes) of the qualitative variable are represented
by bars, where the height of each bar is either the class frequency, class
relative frequency, or class percentage.

Pie Chart: The categories (classes) of the qualitative variable are represented
by slices of a pie (circle). The size of each slice is proportional to the class
relative frequency.

Pareto Diagram: A bar graph with the categories (classes) of the qualitative
variable (i.e., the bars) arranged by height in descending order from left to
right.

[Figure: pie chart "Weather in Seattle 2/2010": Rainy 71%, Sunny 25%, Cloudy 4%]
[Figure: bar chart "Weather in Seattle 2/2010": Weather Frequency in Days (axis 0 to 28) for Rainy (20), Sunny (7), and Cloudy]
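As a rough sketch of the Pareto ordering, the class frequencies from the Seattle weather example can be sorted in descending order and printed as text bars. The dictionary `frequencies` and the text-bar rendering are illustrative assumptions, not the charting tool used for the slides.

```python
# Class frequencies from the Seattle weather example.
frequencies = {"Cloudy": 1, "Rainy": 20, "Sunny": 7}

# A Pareto diagram orders the bars from the tallest to the shortest.
pareto_order = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order:
    # One '#' per day gives a rough text-mode bar.
    print(f"{category:<7} {'#' * count} {count}")
```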


Summarizing Qualitative Data: Example

• In this example, the circle is divided into segments proportional to the
relative frequencies of the Marital Status variable.

• For example, the 2010 data may emphasize the decline or rise in the
proportions compared with the 1960 data.
Summarizing Qualitative Data: Cautionary Comments on Charts

• The simplest graph should be used.

• Axes should be clearly marked with numbers and scales.

• Bars on bar charts should have the same width.

• The vertical axis should not have a very high value as its upper limit.

• The vertical axis should not be stretched.

Frequency Distribution for Quantitative Data
• A frequency distribution for quantitative data groups data into intervals
called classes, and records the number of observations that fall into each
class.

• When constructing a frequency distribution, make sure that:

• Classes are mutually exclusive.
• This means that no observation can fall into two different classes at
once. ... The classes should be of equal width; otherwise the frequency
distribution would give a distorted view of the data.

• Classes are exhaustive.
• Together, the classes cover the entire sample (or population).
Frequency Distribution: Steps

1) Identify the minimum and the maximum value of the observations.

2) Choose an adequate number of classes to cover the range from the
minimum to the maximum value.

3) Determine the class width.

4) Determine the class limits.

5) Classify the units / observations and count them across classes
(frequency).
Classes

• The number of classes usually ranges from 5 to 20 (guideline).
• Option 1: the number of classes should be the smallest whole number k that
makes the quantity 2^k greater than the number of measurements in the data set.
• Option 2: Sturges' formula: k = 1 + 3,3 · log10(n) ... not obligatory!
• The number of classes has to suit the problem at hand and has to be informative!

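$$\text{Class width (approx.)} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$$

$$\text{Class relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$

$$\text{Class midpoint} = \frac{\text{Class min} + \text{Class max}}{2}$$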
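The guidelines and formulas above can be tried out with a short sketch. The function names are hypothetical, and the values n = 60, smallest = 1 and largest = 10 are assumed purely for illustration. Both rules are guidelines only, so the result is a starting point rather than a prescribed number of classes.

```python
import math

def classes_power_of_two(n):
    """Option 1: smallest whole number k such that 2**k > n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def classes_sturges(n):
    """Option 2: Sturges' guideline k = 1 + 3.3 * log10(n), left unrounded."""
    return 1 + 3.3 * math.log10(n)

def class_width(smallest, largest, k):
    """Approximate class width: range divided by the number of classes."""
    return (largest - smallest) / k

def class_midpoint(class_min, class_max):
    """Midpoint of a class: average of its lower and upper limit."""
    return (class_min + class_max) / 2

# Assumed illustrative values (not prescribed by the slides).
n, smallest, largest = 60, 1, 10

k = classes_power_of_two(n)
print("Option 1, k =", k)
print("Option 2, k =", round(classes_sturges(n), 2))
print("Approximate width =", class_width(smallest, largest, k))
print("Midpoint of class 2-4 =", class_midpoint(2, 4))
```

In practice the approximate width is usually rounded to a convenient value (for example 2 instead of 1.5), which in turn changes the number of classes.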

Overlapping classes
• upper limit of a particular class = lower limit of the next class
• 2 - 4 (2 up to 4; four doesn't belong to this class)
• 4 - 6 (4 up to 6; four belongs to this class)
• chart: class midpoints + 2 additional points on the X-axis; absolute
frequencies on the Y-axis
• additional point 1: smallest midpoint – class width
• additional point 2: greatest midpoint + class width
• use when you have decimals

Non-overlapping classes
• upper limit of a particular class ≠ lower limit of the next class
• 31 - 40 (all the numbers, including 40, belong to this class)
• 41 - 50 (Xmin = 40,5; Xmax = 50,5)
• chart (e.g., a histogram): class minimum and class maximum on the X-axis;
relative frequencies on the Y-axis
• use when the data are integers (whole numbers)
Frequency Distribution: Quantitative Data Example
Example: Number of children in families with children.
• Calculate the absolute, relative, cumulative and relative
cumulative frequencies.
• Classify the data into overlapping and non-overlapping
classes.

1 3 2 4 6 1 4 1 7 1 3 2 1 1 2

1 2 10 4 2 1 3 1 1 5 6 8 1 1 6

1 3 1 3 4 3 6 8 4 3 2 2 1 2 3
1 3 4 2 3 3 2 1 1 2 3 1 5 2 1

Type of variable: numerical, ratio


Population: 60 families with children
Unit: 1 family with children
Number of Units: 60

xi      fi    f'i     fi%       ∑fi (Fi)   ∑fi% (Fi%)
1       20    0,333   33,33%    20         33,33%
2       12    0,200   20,00%    32         53,33%
3       12    0,200   20,00%    44         73,33%
4       6     0,100   10,00%    50         83,33%
5       2     0,033   3,33%     52         86,67%
6       4     0,067   6,67%     56         93,33%
7       1     0,017   1,67%     57         95,00%
8       2     0,033   3,33%     59         98,33%
9       0     -       0,00%     59         98,33%
10      1     0,017   1,67%     60         100,00%
Total   60    1       100%

The fi total is the count of all observations; the f'i total is always 1; the
fi% total and the last Fi% value are always 100%.
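The absolute, relative, cumulative and relative cumulative frequencies can be reproduced from the raw data with a short sketch. The variable names are illustrative; the list `children` re-enters the 60 observations from the example.

```python
from collections import Counter
from itertools import accumulate

# Number of children in 60 families with children (raw data from the example).
children = [
    1, 3, 2, 4, 6, 1, 4, 1, 7, 1, 3, 2, 1, 1, 2,
    1, 2, 10, 4, 2, 1, 3, 1, 1, 5, 6, 8, 1, 1, 6,
    1, 3, 1, 3, 4, 3, 6, 8, 4, 3, 2, 2, 1, 2, 3,
    1, 3, 4, 2, 3, 3, 2, 1, 1, 2, 3, 1, 5, 2, 1,
]

n = len(children)
counts = Counter(children)

# Listing every value from min to max keeps x = 9, which has frequency 0.
values = list(range(min(children), max(children) + 1))
absolute = [counts[x] for x in values]
cumulative = list(accumulate(absolute))

print(" xi   fi    f'i    Fi     Fi%")
for x, f, F in zip(values, absolute, cumulative):
    print(f"{x:>3}{f:>5}{f / n:>7.3f}{F:>6}{100 * F / n:>8.2f}")
```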
Frequency Distribution: Overlapping Classes

                                                      xmin      xmax      Xi
Classes   fi    f'i     fi%      ∑fi (Fi)  ∑fi% (Fi%)  (lower    (upper    (class
                                                      limit)    limit)    midpoint)
0-2       32    0,533   53,33%   32        53,33%      0         2         1
2-4       18    0,300   30,00%   50        83,33%      2         4         3
4-6       6     0,100   10,00%   56        93,33%      4         6         5
6-8       3     0,050   5,00%    59        98,33%      6         8         7
8-10      1     0,017   1,67%    60        100,00%     8         10        9
Total     60    1       100%
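A minimal sketch of the grouping into overlapping classes follows. One assumption is worth stating loudly: the class frequencies 32, 18, 6, 3, 1 are reproduced only when each class excludes its lower limit and includes its upper limit, so that convention is used here. Under a "lower limit up to, but not including, the upper limit" reading, the value 10 would fall outside the last class. The dictionary `freq` holds the ungrouped frequencies of the number-of-children example.

```python
# Ungrouped frequencies from the number-of-children example (value: count).
freq = {1: 20, 2: 12, 3: 12, 4: 6, 5: 2, 6: 4, 7: 1, 8: 2, 9: 0, 10: 1}
edges = [0, 2, 4, 6, 8, 10]
n = sum(freq.values())

rows = []
running = 0
for lower, upper in zip(edges, edges[1:]):
    # Assumed convention: lower limit excluded, upper limit included.
    f = sum(count for x, count in freq.items() if lower < x <= upper)
    running += f
    midpoint = (lower + upper) / 2
    rows.append((f"{lower}-{upper}", f, f / n, running, midpoint))

for label, f, rel, cum, mid in rows:
    print(f"{label:<6}{f:>4}{rel:>8.3f}{cum:>5}{100 * cum / n:>9.2f}%{mid:>6.1f}")
```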
Frequency Distribution: Non-overlapping Classes

                                                      xmin      xmax      Xi        di
Classes   fi    f'i     fi%      ∑fi (Fi)  ∑fi% (Fi%)  (lower    (upper    (mid-     (class
                                                      limit)    limit)    point)    width)
1-2       32    0,533   53,33%   32        53,33%      0,5       2,5       1,5       2
3-4       18    0,300   30,00%   50        83,33%      2,5       4,5       3,5       2
5-6       6     0,100   10,00%   56        93,33%      4,5       6,5       5,5       2
7-8       3     0,050   5,00%    59        98,33%      6,5       8,5       7,5       2
9-10      1     0,017   1,67%    60        100,00%     8,5       10,5      9,5       2
Total     60    1       100%
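For integer data, the non-overlapping classes can be sketched as follows. The lower and upper limits xmin and xmax are taken half a unit below and above the stated class limits, which is how the 0,5 and 2,5 in the table arise. Variable names are illustrative, and `freq` again holds the ungrouped frequencies of the number-of-children example.

```python
# Ungrouped frequencies from the number-of-children example (value: count).
freq = {1: 20, 2: 12, 3: 12, 4: 6, 5: 2, 6: 4, 7: 1, 8: 2, 9: 0, 10: 1}
classes = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

table = []
for low, high in classes:
    # Both stated limits belong to the class.
    f = sum(count for x, count in freq.items() if low <= x <= high)
    # Real limits: half a unit below and above the stated integer limits.
    x_min, x_max = low - 0.5, high + 0.5
    width = x_max - x_min
    table.append((f"{low}-{high}", f, x_min, x_max, width))

for label, f, x_min, x_max, width in table:
    print(f"{label:<6}{f:>4}{x_min:>6.1f}{x_max:>6.1f}{width:>5.1f}")
```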
Frequency Distribution: Quantitative Data Example with Charts
• A relocation specialist for a real estate firm in Mission Viejo, CA gathers
recent house sales data for a client from Seattle, WA.
• The table below shows the sale price (in $1,000s) for 36 single-family
houses.

• Frequency Distribution for House-Price Data


• The raw data has been converted into a frequency distribution in the
following table.
Reading a Frequency Distribution

• What is the price range over this time period?
• $300,000 up to $800,000

• How many of the houses sold in the $500,000 up to $600,000 range?
• 14 houses

Reading a Frequency Distribution (Continued)

• A cumulative frequency distribution specifies how many observations fall
below the upper limit of a particular class.

• Question: How many houses sold for less than $600,000?
• 29 houses

Reading & Constructing a Relative Frequency Distribution (Continued)


• A relative frequency distribution identifies the proportion or fraction of
values that falls into each class.

• Here are the relative frequency and the cumulative relative frequency
distributions for the house-price data.

Class (in $1,000s)   Frequency   Relative Frequency   Cumulative Relative Frequency
300 up to 400        4           4/36 = 0.11          0.11
400 up to 500        11          11/36 = 0.31         0.11 + 0.31 = 0.42
500 up to 600        14          14/36 = 0.39         0.11 + 0.31 + 0.39 = 0.81
600 up to 700        5           5/36 = 0.14          0.11 + 0.31 + 0.39 + 0.14 = 0.95
700 up to 800        2           2/36 = 0.06          0.11 + 0.31 + 0.39 + 0.14 + 0.06 ≈ 1.0
Total                36          1.0
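The same distributions can be computed directly from the class frequencies, as in this minimal sketch with illustrative variable names. One detail to keep in mind: the table adds up already-rounded relative frequencies, whereas the code divides the exact cumulative counts by 36, so the fourth cumulative value comes out as 34/36 ≈ 0.94 rather than 0.95.

```python
from itertools import accumulate

# House-price classes (in $1,000s) and their frequencies from the table.
classes = ["300 up to 400", "400 up to 500", "500 up to 600",
           "600 up to 700", "700 up to 800"]
frequency = [4, 11, 14, 5, 2]

n = sum(frequency)
relative = [f / n for f in frequency]

# Cumulative frequency: number of observations below each class's upper limit.
cumulative = list(accumulate(frequency))
cumulative_relative = [c / n for c in cumulative]

for label, f, rel, cum, cum_rel in zip(
    classes, frequency, relative, cumulative, cumulative_relative
):
    print(f"{label:<15}{f:>4}{rel:>7.2f}{cum:>5}{cum_rel:>7.2f}")
```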

Reading a Frequency Distribution (Continued)

• Question: What percent of the houses sold for at least $500,000 but not more
than $600,000?
• 39%
• Question: What percent of the houses sold for less than $600,000?
• 81%
Summarizing Quantitative Data

• Histogram

• Polygon

• Ogive

Visualization of Data

• Reasons for visualization


• Comprehension of dense information
• Seeing relationships
• Types of visualization
• Frequency distribution
• Histogram
• Quartile
• Distribution
• Multiple variables
• Scatter: simple, multiple, matrix
• Bubble chart
• Density chart
Charts / Graphs: Starting Points

Before you choose and construct a chart it is good to consider:

• Who is your audience?

• What message or information is relevant for your audience and should be
immediately spotted?

• How to support visual presentation with text?

• How will the chart be delivered to the audience?

• What charts are appropriate and which of those is the best?


A Non-Comprehensive Catalogue of Tricks

• Chart orientation

• Use of 3D

• Proper and improper comparisons

• Axes and scales

• Colors

• Labels


Graph Design Considerations

Title
• summarizes what the graph shows
• should identify what is being described and the units of measurements
• may be placed within the chart area or above or below the chart
Use of area frame
• depending on the separation from the main text
Axes should be labeled
• can be omitted if categories are very obvious (e.g., years)
Reference to source of information

Colors and shading


• Used to distinguish the areas representing different categories.
• If there is a natural order or sequence to the data you are plotting
then the colors or shading patterns used to represent the
categories should reflect that.
• Color combinations and shading patterns used are ultimately a
personal matter.
• When choosing color schemes, you will want mappings from data
to color that are not just numerically but also perceptually uniform.

3D effects
• popular (“looks beautiful”)
• in general, the use of 3D makes it much more difficult to
interpret the data presented
Common Graph Types

• Bar charts

• Column charts

• Histograms

• Line charts

• Pie charts

• Scatterplots
Bar Chart vs. Column Chart

• Both charts display data using rectangular bars where the length of the bar
is proportional to the data value.
• Both charts are used to compare two or more values. However…
• A bar chart is oriented horizontally.
• A column chart is oriented vertically.
• They cannot always be used interchangeably because of the difference in
their orientation.

Source: https://www.fusioncharts.com/blog/bar-charts-or-column-charts/

• Bar charts are good for displaying long data labels.
• Bar charts are good for displaying a large number of data sets on the
category axis.
• Column charts are good for displaying data sets with negative values.

Source: https://www.fusioncharts.com/blog/bar-charts-or-column-charts/
A Bar Chart with a Considerable Amount of Junk In It…
Bar Charts: Use the Full Axis and Avoid Distortion

Source: https://flowingdata.com/2012/08/06/fox-news-continues-charting-excellence/
Our Sales Compared to Competitors

We are doing just fine…


Our Sales Compared to Competitors

Falling behind…
Comparing Absolute Numbers
Comparing Relative Numbers
Histogram

• Special case of a bar chart


• Plots the distribution of a numeric variable’s values as a series of bars.
• Each bar typically covers a range of numeric values called a bin or
class; a bar’s height indicates the frequency of data points with a value
within the corresponding bin.
• Both axes have scales
• X axis - values of a numerical variable
• Y axis – frequencies
• Histograms are graphs that display the distribution of your continuous data.
• They are also useful for giving a rough view of the probability distribution.
Histograms & Central Tendency: An Example
• Use histograms to understand the
center of the data.
• In the histogram on the right, you
can see that the center is near
50.
• Most values in the dataset will be
close to 50, and values further
away are rarer.
• The distribution is roughly
symmetric and the values fall
between approximately 40 and
64.
Source:
https://statisticsbyjim.com/basics/histograms/

• A difference in means shifts the distributions horizontally along the X-axis
(unless the histogram is rotated).
• In the histograms on the right, one group has a mean of 50 while the other
has a mean of 65.
• Histograms help you grasp the degree of overlap between groups.
• In these histograms, there’s a relatively small amount of overlap.
Source: https://statisticsbyjim.com/basics/histograms/
Line Chart

Used to present (time) series data


• time or other series indicator on the X axis
• value of a variable on the Y axis
Used to present numeric variables
The US National Debt

Decreasing?
The US National Debt

Increasing?
Linear vs. Logarithmic

Source: https://www.motherjones.com/kevin-drum/2020/05/chart-trivia-which-is-better-log-or-linear/

A lot of people have a hard time understanding log charts!


Pie Charts

• Used to display a proportional distribution between different categories


• shares or percentages
• Used for presenting nominal variables only
• best used with relatively few categories

• The human brain thinks linearly: we can easily compare lengths/heights of
line segments but when it comes to angles and areas most of us can't judge
them well.
• Try to avoid the use of pie charts when comparing many categories or when
categories have similar values
Pie Charts: An Example

• Let's say that the charts represent the polling from a local election with
five candidates at three different points A, B, and C during an election:
• In the first race, is candidate 5 doing better than candidate 3?
• Who did better between time A and time B, candidate 2 or candidate 4?
• Who has the most momentum in the race?

• But now, let's look at that exact same information — parts of a whole —
articulated on a bar chart:

Source: https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
3D Pie Charts Are Evil

Paul crushed it in Q1… Bryan won Q2…Or did he?

The only difference between the two charts is that they are rotated 130
degrees relative to one another.

Source: http://www.getnerdyhr.com/3d-pie-charts-are-evil/
Scatter Plot

Used to show the relationship between pairs of quantitative measurements
made for the same object or individual
Numeric variables
• interval or ratio
Each axis represents one measurement (variable)
Scatter Plot
Pictographs: Example 1

• Pictograph shows a scale that represents the number of elementary students
who prefer chocolate chip cookies.
• This type of pictograph shows how a symbol can be used to represent data.
• One cookie symbol represents two students, and a half-cookie symbol is used
to represent one student.
• These data could easily have been presented in a bar chart using a scale to
present the figure rather than a symbol.
Source: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch9/picto-figuratifs/5214825-eng.htm
Pictographs: Example 2

• The Canadian dollar shrank to a value of 70 cents over 20 years because of
inflation.
• This means that the 2020 CAD was worth 70% of the value of the 2000 CAD!
• The size or area (total surface) of the dollar coin (loonie) is misleading.
• The dollar value differences represented are exaggerated.
• Since 70 cents is over half of one dollar, the 2020 loonie should appear
bigger than half the size of the 2000 loonie.
Source: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch9/picto-figuratifs/5214825-eng.htm
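A short calculation shows why scaling a symbol can exaggerate the difference. This sketch assumes the misleading graphic shrinks the coin's diameter in proportion to the value; the slide does not state how the picture was drawn, so that is an assumption. Because a circle's area grows with the square of its diameter, a coin drawn at 70% of the diameter covers only about half the area.

```python
import math

# Value of the 2020 dollar relative to the 2000 dollar, as stated on the slide.
value_ratio = 0.70

# Assumption: the graphic scales the coin's diameter by the value ratio.
# The area then shrinks by the square of that ratio.
area_if_diameter_scaled = value_ratio ** 2

# To make the area proportional to the value, the diameter should instead
# be scaled by the square root of the value ratio.
diameter_scale_for_true_area = math.sqrt(value_ratio)

print(f"Area shown when the diameter is scaled by 0.70: {area_if_diameter_scaled:.2f}")
print(f"Diameter scale that gives 70% of the area: {diameter_scale_for_true_area:.3f}")
```

Under this assumption the smaller coin covers 49% of the area of the larger one, which is less than half, even though its value is 70%.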
https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
Playing With Axes
Two Axes
