
Probability and Statistics: First Week

Text Book: E. Kreyszig, Advanced Engineering Mathematics, John Wiley & Sons, Inc., 9th edition, 2006.

Reference Books:
M.R. Spiegel, J. Schiller and R.A. Srinivasan, Schaum's Easy Outlines of Probability and Statistics, McGraw-Hill, 2001.
R.E. Walpole, R.H. Myers, S.L. Myers and K. Ye, Probability & Statistics for Engineers & Scientists, Pearson Education, Inc., 8th edition, 2007.

Outline for week one:
Graphical Representation of Data: Stem-and-Leaf Plot, Histogram, Boxplot
Arithmetic Mean
Standard Deviation
Variance

Introduction to Statistics

Statistics is a word with a variety of meanings. First, the word statistics refers to numerical facts systematically arranged. For instance, we have statistics of prices, statistics of road accidents, statistics of crimes, statistics of births, etc. In all these examples, the word statistics denotes a set of numerical data in the respective fields. These all come under the heading of descriptive statistics, in which items are counted or measured and the results are combined in various ways to give useful results. That type of statistics certainly has its use in engineering.

In the second place, the word statistics is defined as a discipline that includes procedures and techniques used to collect, process and analyze numerical data to make inferences and to reach decisions in the face of uncertainty. It should be kept in mind that uncertainty does not imply ignorance; it refers to the incompleteness and instability of the available data.

Thirdly, the word statistics refers to numerical quantities calculated from sample observations; a single quantity that has been so calculated is called a statistic. The mean of a sample, for instance, is a statistic. The word statistics is plural when used in this sense.

Another type of statistics will engage our attention to a much greater extent, namely inferential statistics or statistical inference. For example, it is often not practical to measure all the items produced by a process. Instead, we take a sample and measure the quantities of interest on the sample. We then infer something about all the items from our knowledge of the sample.

Data Representation.

Data can be represented numerically or graphically in various ways. For example, a daily newspaper may contain tables of stock prices and money exchange rates, curves or bar charts illustrating economic or political developments, or pie charts showing how your tax dollar is spent. And there are numerous other representations of data for special purposes. Today we discuss the use of standard representations of data in statistics. For these, software packages such as Maple or Mathematica may be helpful. We explain the corresponding concepts and methods with the help of examples. Consider the following data:

89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89. (1)

These are $n = 14$ measurements of the tensile strength of sheet steel in kg/mm$^2$, recorded in the order obtained and rounded to integer values. To see what is going on, we sort these data, that is, we order them by size:

78, 81, 83, 84, 86, 87, 87, 89, 89, 89, 89, 90, 91, 99. (2)

We shall now discuss standard graphic representations used in statistics for obtaining information on properties of data.

2.1 Graphic Representation of Data

Stem-and-Leaf Plot. In 1977, John Tukey introduced a technique known as the stem-and-leaf plot. This technique offers a quick and novel way of simultaneously sorting and displaying a data set: each number in the data set is divided into two parts, a stem and a leaf. A stem is the leading digit(s) of each number and is used in sorting, while a leaf is the rest of the number, that is, the trailing digit(s). For (1) it is shown in Fig. 1. The numbers in (1) range from 78 to 99; see (2). We divide these numbers into 5 groups, 75-79, 80-84, 85-89, 90-94, 95-99. The integers in the tens position of the groups are 7, 8, 8, 9, 9. These form the stem in Fig. 1. The first leaf is 8 (representing 78). The second leaf is 134 (representing 81, 83, 84), and so on.

The number of times a value occurs is called its absolute frequency. Thus 78 has absolute frequency 1, the value 89 has absolute frequency 4, etc. The column to the extreme left in Fig. 1 shows the cumulative absolute frequencies, that is, the sum of the absolute frequencies of the values up to the line of the leaf. Thus, the number 4 in the second line on the left shows that (1) has 4 values up to and including 84. The number 11 in the next line shows that there are 11 values not exceeding 89, etc. Dividing the cumulative absolute frequencies by $n = 14$ (in Fig. 1) gives the cumulative relative frequencies.
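The construction just described can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the textbook: the function name `stem_and_leaf` and its `width` parameter are assumptions made here, and the code assumes two-digit integer data grouped into classes of five units, as in (1).

```python
from collections import defaultdict

# Data (1): tensile strength of sheet steel in kg/mm^2
data = [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]


def stem_and_leaf(values, width=5):
    """Group two-digit integers into classes of `width` units.

    Returns rows of (cumulative absolute frequency, stem, leaves).
    Classes that contain no data value are simply omitted.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        lower_limit = (v // width) * width   # 75, 80, 85, 90, 95 for data (1)
        groups[lower_limit].append(v)

    rows, cumulative = [], 0
    for lower_limit in sorted(groups):
        members = groups[lower_limit]
        cumulative += len(members)                       # running total of frequencies
        leaves = "".join(str(v % 10) for v in members)   # trailing digits
        rows.append((cumulative, lower_limit // 10, leaves))
    return rows


rows = stem_and_leaf(data)
for cum, stem, leaves in rows:
    print(f"{cum:>3} | {stem} | {leaves}")

# Cumulative relative frequencies: divide by n = 14
cum_rel = [cum / len(data) for cum, _, _ in rows]
```

For data (1) the rows carry the leaves 8, 134, 6779999, 01, 9 and the cumulative absolute frequencies 1, 4, 11, 13, 14, in agreement with the description of Fig. 1.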

Histogram. A histogram is a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies. For example, for the set of values 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, the histogram consists of rectangles over the values 1 to 6 with heights proportional to the frequencies 1, 2, 4, 2, 1, 1. For large sets of data, histograms are better at displaying the distribution of data than stem-and-leaf plots. The principle is explained in Fig. 2. The bases of the rectangles in Fig. 2 are the $x$-intervals (known as class intervals) 74.5-79.5, 79.5-84.5, 84.5-89.5, 89.5-94.5, 94.5-99.5, whose midpoints (known as class marks) are $x = 77, 82, 87, 92, 97$, respectively. The height of a rectangle with class mark $x$ is the relative class frequency $f_{\text{rel}}(x)$, defined as the number of data values in that class interval divided by $n$ ($n = 14$ in our case). Since all class intervals have the same width, the areas of the rectangles are proportional to these relative frequencies, so that histograms give a good impression of the distribution of data.

Center and Spread of Data: Median, Quartiles. As a center of the location of data values we can simply take the median, the data value that falls in the middle when the values are ordered. In (2) we have 14 values. The seventh of them is 87, the eighth is 89, and we split the difference, obtaining the median 88. (In general, we would get a fraction.) The variability of the data values can be measured by the range $R = x_{\max} - x_{\min}$, the largest minus the smallest data value; $R = 99 - 78 = 21$ in (2). The interquartile range $\text{IQR} = q_U - q_L$ gives better information. Here the upper quartile $q_U$ is the middle value among the data values above the median. The lower quartile $q_L$ is the middle value among the data values below the median. Thus in (2) we have $q_U = 89$ (the fourth value from the end), $q_L = 84$ (the fourth value from the beginning), and $\text{IQR} = 89 - 84 = 5$. The median is also called the middle quartile and is denoted by $q_M$.
The rule of splitting the difference (just applied to the middle quartile) is equally well used for the other quartiles if necessary.

Boxplot. The boxplot of (1) in Fig. 3 is obtained from the five numbers $x_{\min}$, $q_L$, $q_M$, $q_U$, $x_{\max}$ just determined. The box extends from $q_L$ to $q_U$. Hence it has the height IQR. The position of the median in the box shows that the data distribution is not symmetric. The two lines extend from the box to $x_{\min}$ below and to $x_{\max}$ above. Hence they mark the range $R$. Boxplots are particularly suitable for making comparisons. For example, Fig. 3 shows boxplots of the data sets (1) and

91, 89, 93, 91, 87, 94, 92, 85, 91, 90, 96, 93, 89 (3)

(consisting of $n = 13$ values). Ordering gives

85, 87, 89, 89, 90, 91, 91, 91, 92, 93, 93, 94, 96 (4)

(tensile strength, as before). From the plot we immediately see that the box of (3) is shorter than the box of (1) (indicating the higher quality of the steel sheets!) and that $q_M$ is located in the middle of the box (showing the more symmetric form of the distribution). Finally, $x_{\max}$ is closer to $q_U$ for (3) than it is for (1), a fact that we shall discuss later. For plotting the box of (3) we took from (4) the values $x_{\min} = 85$, $q_L = 89$, $q_M = 91$, $q_U = 93$, $x_{\max} = 96$.
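The five numbers behind the two boxplots can be reproduced with a short Python sketch. It is only an illustration under the quartile rule stated in these notes (middle value of the lower and upper halves, with the median itself left out when $n$ is odd, splitting the difference where needed); the function names `middle` and `five_number_summary` are chosen here and do not come from the textbook. Library routines often use other quartile conventions and may give slightly different values.

```python
def middle(sorted_values):
    """Middle value of a sorted list; split the difference if the count is even."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (x_min, q_L, q_M, q_U, x_max) using the rule described above."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]        # data values below the median position
    upper = xs[n - half:]    # data values above the median position
    return xs[0], middle(lower), middle(xs), middle(upper), xs[-1]


data_1 = [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]   # data (1)
data_3 = [91, 89, 93, 91, 87, 94, 92, 85, 91, 90, 96, 93, 89]       # data (3)

for name, data in (("(1)", data_1), ("(3)", data_3)):
    x_min, q_l, q_m, q_u, x_max = five_number_summary(data)
    print(name, "five numbers:", x_min, q_l, q_m, q_u, x_max,
          "| R =", x_max - x_min, "| IQR =", q_u - q_l)
```

For (1) this gives 78, 84, 88, 89, 99 with IQR 5, and for (3) it gives 85, 89, 91, 93, 96 with IQR 4, matching the values used for the boxplots.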

Mean. Standard Deviation. Variance

Medians and quartiles are easily obtained by ordering and counting, practically without calculation. But they do not give full information on data: you can change data values to some extent without changing the median. Similarly for the quartiles.

Arithmetic Mean. The average size of the data values can be measured in a more refined way by the mean

$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{1}{n}(x_1 + x_2 + \cdots + x_n). \tag{5}$$

This is the arithmetic mean of the data values, obtained by taking their sum and dividing by the data size $n$. Thus in (1),

$$\bar{x} = \frac{1}{14}(89 + 84 + \cdots + 89) = \frac{611}{7} \approx 87.3. \tag{6}$$

Every data value contributes, and changing one of them will change the mean.

Standard Deviation. The variability of the data values can be measured in a more refined way by the standard deviation $s$ or by its square, the variance

$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]. \tag{7}$$

Thus, to obtain the variance of the data, take the difference $x_j - \bar{x}$ of each data value from the mean, square it, take the sum of these $n$ squares, and divide it by $n - 1$. To get the standard deviation $s$, take the square root of $s^2$. For example, using $\bar{x} = 611/7$, we get for the data (1) the variance

$$s^2 = \frac{1}{13}\left[\left(89 - \frac{611}{7}\right)^2 + \left(84 - \frac{611}{7}\right)^2 + \cdots + \left(89 - \frac{611}{7}\right)^2\right] = \frac{176}{7} \approx 25.14. \tag{8}$$

Hence the standard deviation is $s = \sqrt{176/7} \approx 5.014$. Note that the standard deviation has the same dimension as the data values (kg/mm$^2$, see at the beginning), which is an advantage. On the other hand, the variance is preferable to the standard deviation in developing statistical methods.

Problem Set 24.1

Represent the data by a stem-and-leaf plot, a histogram, and a boxplot:

Question 3.1. 20 21 20 19 20 19 21 19
Solution. Sorting the data we have 19, 19, 19, 20, 20, 20, 21, 21. Hence $q_L = 19$, $q_M = 20$ and $q_U = 20.5$.

Question 3.2. 7 6 4 0 7 1 2 4 6 6
Solution. Sorting the data we have 0, 1, 2, 4, 4, 6, 6, 6, 7, 7. Hence $q_L = 2$, $q_M = 5$ and $q_U = 6$.

Question 3.3. 56 58 54 33 41 30 44 37 51 46 56 38 38 49 39
Solution. Sorting the data we have 30, 33, 37, 38, 38, 39, 41, 44, 46, 49, 51, 54, 56, 56, 58. The median is the eighth value, and each quartile is the fourth value of the seven values on its side of the median. Hence $q_L = 38$, $q_M = 44$ and $q_U = 54$.

Question 3.4. 12.1 10 12.4 10.5 9.2 17.2 11.4 11.8 14.7 9.9
Solution. Sorting the data we have 9.2, 9.9, 10.0, 10.5, 11.4, 11.8, 12.1, 12.4, 14.7, 17.2. Hence $q_L = 10.0$, $q_M = 11.6$ and $q_U = 12.4$.

Question 3.5. 70.6 70.9 69.1 71.3 70.5 69.7 71.5 69.8 71.1 68.9 70.3 69.2 71.2 70.4 72.8
Solution. Sorting the data we have 68.9, 69.1, 69.2, 69.7, 69.8, 70.3, 70.4, 70.5, 70.6, 70.9, 71.1, 71.2, 71.3, 71.5, 72.8. The median is the eighth value, and each quartile is the fourth value of the seven values on its side of the median. Hence $q_L = 69.7$, $q_M = 70.5$ and $q_U = 71.2$.

Question 3.6. -0.52 0.11 -0.48 0.94 0.24 -0.19 -0.55
Solution. Sorting the data we have -0.55, -0.52, -0.48, -0.19, 0.11, 0.24, 0.94. Hence $q_L = -0.52$, $q_M = -0.19$ and $q_U = 0.24$.

Question 3.7. Reaction time [sec] of an automatic switch
2.3 2.2 2.4 2.5 2.3 2.3 2.4 2.1 2.5 2.4 2.6 2.3 2.5 2.1 2.4 2.2 2.3 2.5 2.4 2.4
Solution. Sorting the data we have 2.1, 2.1, 2.2, 2.2, 2.3, 2.3, 2.3, 2.3, 2.3, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 2.5, 2.5, 2.5, 2.5, 2.6. Hence $q_L = 2.3$, $q_M = 2.4$ and $q_U = 2.45$.

Question 3.8. Carbon content [%] of coal
89 90 89 84 80 88 90 89 88 90 85 87 86 82 85 76 89 87 86 86
Solution. Sorting the data we have 76, 80, 82, 84, 85, 85, 86, 86, 86, 87, 87, 88, 88, 89, 89, 89, 89, 90, 90, 90. Hence $q_L = 85$, $q_M = 87$ and $q_U = 89$.

Question 3.9. Weight of filled bottles [g] in an automatic filling process
403 399 398 401 400 401 401
Solution. Sorting the data we have 398, 399, 400, 401, 401, 401, 403. Hence $q_L = 399$, $q_M = 401$ and $q_U = 401$.

Question 3.10. Gasoline consumption [gallons per mile] of six cars of the same model
14.0 14.5 13.5 14.0 14.5 14.0
Solution. Sorting the data we have 13.5, 14.0, 14.0, 14.0, 14.5, 14.5. Hence $q_L = 14.0$, $q_M = 14.0$ and $q_U = 14.5$.

Mean and Standard Deviation. Find the mean and compare it with the median. Find the standard deviation and compare it with the interquartile range.

Question 3.11. 20 21 20 19 20 19 21 19

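Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{159}{8} = 19.8750, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 0.6964,$$
so that $s = \sqrt{0.6964} \approx 0.8345$.

Question 3.12. 7 6 4 0 7 1 2 4 6 6
Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{43}{10} = 4.3, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 6.4556,$$
so that $s \approx 2.5408$.

Question 3.13. 70.6 70.9 69.1 71.3 70.5 69.7 71.5 69.8 71.1 68.9 70.3 69.2 71.2 70.4 72.8
Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{1057.3}{15} = 70.4867, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 1.0955,$$
so that $s \approx 1.0467$.

Question 3.14. -0.52 0.11 -0.48 0.94 0.24 -0.19 -0.55
Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{-0.45}{7} = -0.0643, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 0.2940,$$
so that $s \approx 0.5422$.

Question 3.15. Weight of filled bottles [g] in an automatic filling process
403 399 398 401 400 401 401
Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{2803}{7} = 400.4286, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 2.6190,$$
so that $s \approx 1.6183$.

Question 3.16. 5 22 7 23 6. Why is $|\bar{x} - q_M|$ so large?
Solution.
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{63}{5} = 12.6, \qquad s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(x_j - \bar{x})^2 = 82.3,$$
so that $s \approx 9.072$.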

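Also, here $q_M = 7$, which implies that $|\bar{x} - q_M| = |12.6 - 7| = 5.6$. The difference is large because the two values 22 and 23 lie far above the other three and pull the mean upward, whereas the median does not depend on how large they are.

Question 3.17. Construct the simplest possible data with $\bar{x} = 100$ but $q_M = 0$.

Question 3.18. Prove that $\bar{x}$ must always lie between the smallest and the largest data values.
Solution. Let $x_{\text{small}} = \min\{x_j : j = 1, \dots, n\}$ and $x_{\text{large}} = \max\{x_j : j = 1, \dots, n\}$. Since $x_j \le x_{\text{large}}$ for every $j$,
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j \le \frac{1}{n}\sum_{j=1}^{n} x_{\text{large}} = \frac{n\,x_{\text{large}}}{n} = x_{\text{large}}.$$
Also, since $x_j \ge x_{\text{small}}$ for every $j$,
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j \ge \frac{1}{n}\sum_{j=1}^{n} x_{\text{small}} = \frac{n\,x_{\text{small}}}{n} = x_{\text{small}}.$$
Consequently, $x_{\text{small}} \le \bar{x} \le x_{\text{large}}$.

Question 3.19. Writing Project: Average and Spread. Compare $q_M$, IQR and $\bar{x}$, $s$, illustrating the advantages and disadvantages with examples of your own.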
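The sample mean and the sample variance with divisor $n - 1$, as defined in formulas (5) and (7), can be checked numerically. The following Python sketch is only an illustration: the function name `mean_and_variance` is chosen here, and exact fractions are used for data (1) so that the values 611/7 and 176/7 appear without rounding.

```python
from fractions import Fraction
from math import sqrt


def mean_and_variance(values):
    """Return the arithmetic mean and the sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance


# Data (1), converted to exact fractions
data_1 = [Fraction(v) for v in
          [89, 84, 87, 81, 89, 86, 91, 90, 78, 89, 87, 99, 83, 89]]
mean_1, var_1 = mean_and_variance(data_1)
print(mean_1, var_1, sqrt(var_1))      # mean 611/7, variance 176/7, s about 5.014

# Data of Question 3.16, in ordinary floating-point arithmetic
mean_16, var_16 = mean_and_variance([5, 22, 7, 23, 6])
print(mean_16, var_16, sqrt(var_16))   # about 12.6, 82.3 and 9.07
```

Python's standard `statistics` module provides `mean`, `variance` and `stdev` with the same $n - 1$ convention; the explicit version above is kept only to mirror formula (7) term by term.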
