14 Data collection and graphical summaries

These indications are better shown by a frequency curve (Fig. 2.2(b)) than by a
polygon (Fig. 2.2(a)), which merely joins the tops of the histogram columns (and
in fact does not even properly portray the sample, as the areas are incorrectly
distributed over the classes).

2.3.2 Blob diagrams and box-and-whisker plots

A further graphical technique, which in some respects combines features of both
the stem-and-leaf table and the frequency table, is the 'blob' diagram. A suitable
scale is drawn and divided to represent the range of the values, and a blob is
placed on the scale to represent each measurement. Repeated values are shown
by piling up the blobs, so that frequencies are readily counted up. If desired,
values can be rounded whilst compiling the blobs, thus yielding an alternative
to the grouped frequency table. The final appearance is then as for the 'digidot'
of Section 2.3.4. Figure 2.3 shows an unrounded blob diagram for the valve-liner
data.
Back-to-back blob diagrams are also useful for presenting data from two
samples to highlight differences in average or variation. In Fig. 2.4, it is clear
that sample B exhibits more variation than sample A, and in general the whole
pattern for B is displaced to the right of that for A, i.e. there is a difference in
'average' (note that we are not yet defining precisely which average is meant: mean,
median, mode, etc.; the discrepancy is one of general location of the A and B
data sets).
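
The piling-up construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `blob_diagram`, the `step` parameter used for optional rounding, and the sample values are assumptions made for the example and are not taken from the valve-liner data.

```python
from collections import Counter

def blob_diagram(values, step=0.1, blob="•"):
    """Return a text blob diagram: one blob per value, repeats piled upward."""
    # Express each value as a whole number of scale steps (this is the optional rounding).
    ticks = [round(v / step) for v in values]
    counts = Counter(ticks)
    lo, hi = min(ticks), max(ticks)
    height = max(counts.values())
    rows = []
    # Build the rows from the top of the tallest pile down to the scale.
    for level in range(height, 0, -1):
        row = "".join(blob if counts.get(t, 0) >= level else " "
                      for t in range(lo, hi + 1))
        rows.append(row.rstrip())
    rows.append("-" * (hi - lo + 1))  # the scale, one character per step
    rows.append(f"{lo * step:g} to {hi * step:g} in steps of {step:g}")
    return "\n".join(rows)

# Hypothetical measurements, chosen only to show repeated values piling up.
sample = [1.2, 1.3, 1.3, 1.5, 1.3, 1.2, 1.6]
print(blob_diagram(sample))
```

Choosing a coarser `step` rounds the values while the blobs are compiled, which gives the alternative to the grouped frequency table mentioned in the text.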

2.3.3 Box-and-whisker plots

Where a sample or data set is numerous, the blob diagram may preserve too
much detail, possibly masking the main features of location and dispersion.
Again, if several or many (rather than two) samples are to be compared, multiple
blob diagrams are not easy to interpret. In these cases, the box-and-whisker
plot may be advantageous.

Fig. 2.3 Blob diagram for valve liner data (scale from 236 to 242).

Fig. 2.4 Back-to-back blob diagram for samples A and B.
Graphical presentation 15

We here assume that the reader is familiar with the median as a measure of
location. When all individual values are available, it is defined as the middle
value, or the average of the middle two, when the values are arranged in order.
We also introduce the first and third quartiles, the values which separate the
lower and upper 25% of the data from the central 50%. Note that the median
is also the second quartile.
To locate the median and quartiles in the ordered data, for example in a stem-
and-leaf table or a blob diagram, use the following expressions. The rationale
will be explained in connection with probability plotting in Chapter 13.
$Q_1 = x_i$, where $i = \tfrac{1}{4}(n + 1)$;
$M = x_i$, where $i = \tfrac{1}{2}(n + 1)$;
$Q_3 = x_i$, where $i = \tfrac{3}{4}(n + 1)$.
Thus, for the 52 items in the data set forming the main example in this chapter:
For $Q_1$, $i = \tfrac{1}{4}(52 + 1) = 13.25$; so $Q_1$ lies between the thirteenth and fourteenth
values from the lower end. These are in fact 238.9 and 239.0, so we may take
$Q_1$ as 238.9 approximately (second decimal place accuracy is not justified).
For $M$, $i = \tfrac{1}{2}(52 + 1) = 26.5$; with $x_{26}$ and $x_{27}$ both 239.5, the median is also
239.5.

Fig. 2.5 Box-and-whisker plot for valve liner data (scale from 236 to 242).



Fig. 2.6 Box-and-whisker plot expanded to blob diagram. A box-and-whisker plot (a)
would suggest some peculiarity in the original data. A follow-up diagram (b) reveals
values clustered at the upper end with wide gaps between values below the median.

For $Q_3$, $i = \tfrac{3}{4}(52 + 1) = 39.75$; $x_{39}$ and $x_{40}$ are both 240.1. So we have
$Q_1 = 238.9$, $M = 239.5$, $Q_3 = 240.1$.
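
The position rule used in this worked example, with positions at one quarter, one half and three quarters of (n + 1), can be sketched as follows. The function names are hypothetical, and linear interpolation between neighbouring ordered values is an assumption of this sketch: the text simply takes an approximate value when the position falls between two observations.

```python
def value_at_position(ordered, i):
    """Value at fractional 1-based position i in an ordered list.

    Assumption: interpolate linearly between the two neighbouring values.
    """
    lower = int(i)        # whole part of the position
    frac = i - lower      # fractional part
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Return ([i_Q1, i_M, i_Q3], [Q1, M, Q3]) using positions k(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three values for this sketch")
    positions = [(n + 1) * k / 4 for k in (1, 2, 3)]
    return positions, [value_at_position(ordered, i) for i in positions]

# For any data set of 52 items the positions match those in the text.
print(quartiles(range(1, 53))[0])  # [13.25, 26.5, 39.75]
```

With neighbouring values of 238.9 and 239.0 at positions 13 and 14, interpolation would give a first quartile of about 238.925, which agrees with the rounded value of 238.9 adopted above.
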
The box-and-whisker plot is shown in Fig. 2.5. The median is identified by
a line across the box, which covers the middle 50% range from $Q_1$ to $Q_3$. The
whiskers extend from the box to the extreme values at each end of the data.
Blob diagrams contain more information than box-and-whisker plots, and a
choice may need to be made according to the amount of detail required. Blobs
as a follow-up to unusual appearance in the box-and-whisker plots may be a
useful tactic, as illustrated in Fig. 2.6.
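
As a rough illustration of how the box and whiskers are laid out from the two extremes, the quartiles and the median, the sketch below renders a one-line text version. The function name `box_and_whisker`, the character choices and the `width` parameter are assumptions made for this example; the extreme values in the usage line are hypothetical, since only the scale of Fig. 2.5 is known here.

```python
def box_and_whisker(lowest, q1, median, q3, highest, width=61):
    """Render a one-line text box-and-whisker plot spanning lowest to highest.

    Assumes highest > lowest, so that the scale has non-zero length.
    """
    def col(v):
        # Map a data value to a character column on the scale.
        return round((v - lowest) / (highest - lowest) * (width - 1))

    line = [" "] * width
    for c in range(col(lowest), col(q1)):            # lower whisker
        line[c] = "-"
    for c in range(col(q3) + 1, col(highest) + 1):   # upper whisker
        line[c] = "-"
    for c in range(col(q1), col(q3) + 1):            # box covering the middle 50%
        line[c] = "="
    line[col(q1)] = "["
    line[col(q3)] = "]"
    line[col(median)] = "|"                          # median line across the box
    return "".join(line)

# Quartiles and median from the text; the extremes 236.0 and 242.0 are hypothetical.
print(box_and_whisker(236.0, 238.9, 239.5, 240.1, 242.0))
```

A plot of this kind shows only five numbers, which is why a follow-up blob diagram may be needed when the box or whiskers look unusual.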

2.3.4 Run chart and 'digidot'

The time sequence in which data occur may often be significant, and plotting
data in time (or other sequence) order will then be useful. A plot with time as
the horizontal (left-to-right) axis and a vertical scale for the variable of interest
is known variously as a line graph, run chart or time plot. In effect it is a control
chart without statistically based limits, and serves as a simple but effective data
summary permitting the detection of trends, cyclic behaviour, outlying values
and other features. It may be used with guidelines or zones based on practical
considerations, such as temperatures outside which extra heating or cooling equip-
ment may need to be switched on, limits to process or material characteristics
beyond which special action needs to be taken, and so on. Strict statistical
control is not required for all cases, though it is always worth recognising and
exploiting possibilities for preventing problems and reducing variation.
Figure 2.7 shows a run chart (with specification limits and nominal value)
and digidot for the valve-liner data. It emphasizes that many more results are
below the nominal than above, and there would seem to be more risk of violating
the lower specification limit than the upper. This point will recur later in the
book for this set of data.
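
The kind of reading taken from a run chart, namely how many results fall on each side of the nominal value and which lie outside the limits, can be sketched numerically. The function name, the returned keys and all the numbers in the example are hypothetical and are not the valve-liner values.

```python
def run_chart_summary(values, nominal, lower_limit, upper_limit):
    """Summarise a time-ordered series against a nominal value and two limits."""
    below = sum(1 for v in values if v < nominal)
    above = sum(1 for v in values if v > nominal)
    # 1-based positions, in time order, of any results outside the limits.
    outside = [t for t, v in enumerate(values, start=1)
               if v < lower_limit or v > upper_limit]
    return {"below_nominal": below,
            "above_nominal": above,
            "outside_limits": outside}

# Hypothetical series, nominal value and specification limits.
series = [10.1, 9.8, 9.7, 10.0, 9.4, 9.9, 10.6]
print(run_chart_summary(series, nominal=10.0, lower_limit=9.5, upper_limit=10.5))
```

Such counts complement the plot rather than replace it, since trends and cyclic behaviour are visible only when the values are seen in sequence.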

2.4 Two-dimensional data

2.4.1 Scatter diagram

Often two or more characteristics or properties are measured on the same items
or on the same occasions. It is then useful to assess whether there are relationships
between the variables, either in pairs or multiples. Numerical methods involve
regression and correlation techniques which lie beyond the scope of the present
text, and the presentation of three or more variables in graphical form also
involves some complications. The most effective way of presenting two variables
is the scatter diagram (or scatter plot).
The horizontal axis of the diagram is usually labelled x, the vertical y. If it
is suspected that one variable may, in some sense, be the 'cause' of the other,
Fig. 2.7 Run chart and digidot for valve liner data, with upper and lower specification limits marked.
Fig. 2.8 Scatter diagram with marginal blob diagrams (horizontal axis: resin content (%), from 40 to 70; a ringed point marks 2 results).

the cause variable is assigned to x and the effect variable to y. In fact the run
chart uses this convention in that time is often regarded as a cause of changes
in a system.
Each pair of x, y values is plotted at the appropriate coordinates, and any
relationship will appear as a pattern. If the data arise from different sources
(e.g. machines, operators, days, material batches) symbols or colour coding for
these subgroupings may reveal further relationships. It may also be useful to
compile blob diagrams for x and y on their respective axes, as shown in Fig. 2.8.
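
A minimal sketch of the bookkeeping behind a scatter diagram with marginal blob diagrams is given below. The grid steps, the function names and the data pairs are assumptions made for illustration; coincident points are shown by a digit here, in place of the ringed points used in the printed figure.

```python
from collections import Counter

def scatter_grid(pairs, x_step, y_step):
    """Count (x, y) pairs per grid cell, with marginal counts for x and y."""
    cells = Counter((round(x / x_step), round(y / y_step)) for x, y in pairs)
    x_margin, y_margin = Counter(), Counter()
    for (cx, cy), k in cells.items():
        x_margin[cx] += k   # pile heights for the blob diagram on the x axis
        y_margin[cy] += k   # pile heights for the blob diagram on the y axis
    return cells, x_margin, y_margin

def render(cells, mark="•"):
    """Text scatter diagram: a mark for one point, a digit for coincident points."""
    xs = [cx for cx, _ in cells]
    ys = [cy for _, cy in cells]
    rows = []
    for cy in range(max(ys), min(ys) - 1, -1):      # top row holds the largest y
        row = ""
        for cx in range(min(xs), max(xs) + 1):
            k = cells.get((cx, cy), 0)
            row += " " if k == 0 else (mark if k == 1 else str(min(k, 9)))
        rows.append(row.rstrip())
    return "\n".join(rows)

# Hypothetical (x, y) pairs; the second and third coincide.
pairs = [(40, 0.5), (50, 1.0), (50, 1.0), (60, 1.5)]
cells, x_margin, y_margin = scatter_grid(pairs, x_step=10, y_step=0.5)
print(render(cells))
```

To distinguish subgroups such as machines or operators, one option is to call `render` once per subgroup with a different `mark`.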

2.5 Tabular presentation of two-way classifications

2.5.1 Guidelines for tabulation

Observations often need to be classified in two or more dimensions, as for
example sales value by both product and destination, or absence records by

