ELEMENTARY STATISTICAL METHODS

JWH Swanepoel • CJ Swanepoel • FC van Graan • JS Allison • L Santana

AXIE PUBLISHERS
45 Beyers Naude Avenue, Potchefstroom, 2531

Elementary Statistical Methods, 7th edition
ISBN EBook: 978-1-991000-09-5
ISBN Hard Copy: 978-1-991000-07-1

Copyright © 2021 Subject Group Statistics, North-West University

All rights reserved. No part of this publication may be reproduced in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher and authors.

Preface

In this set of notes an attempt is made to provide the novice student with an understanding of the nature and methods used in modern Statistics. We are convinced that the basic ideas of Statistics are clear and evident and that even beginners with little mathematical knowledge can master them. Therefore, the content of the course demands almost no prior mathematical knowledge, although a feeling for Mathematics will be an asset.

In recent years several books on Statistics at a basic level have appeared in the literature, most of which are written in English. This leaves the Afrikaans reader with only a few available books. With this set of notes we strive towards providing both Afrikaans and English students with the most needed statistical tools for handling the planned content of the course. The notes are gathered from our own insights, knowledge and experience, and will hopefully compare favourably to other international books on the same introductory level for these topics of interest. The books are available in both English and Afrikaans.

The notes contain several notable elements that will be of interest to the lecturer. Firstly, the concepts of population and sample are stressed from the very start and are presented separately throughout, to highlight the problem of statistical inference continuously.
We therefore do not discuss the concept of Descriptive Statistics in isolation, but continually link calculations from the data to the problem of statistical inference. When inference is required in order to estimate the population mean, it is attempted with the minimum of Probability Theory. In fact, we use the probability concept sparingly throughout. This approach, however, is appropriate as an introduction to more advanced courses in Mathematical Statistics.

As far as calculations are concerned, we are convinced of the importance of experience and regular exercise as a method of developing statistical skills. Therefore, exercises form an important part of this course. In most cases the calculations can be effectively simplified by using the computer package STAT1.3. The exercises were chosen in such a way that continuous revision of previous work is achieved. We tried to minimize the derivation and use of complicated formulas and rather to concentrate on the understanding and perception, as well as the application, of statistical ideas and methods.

Since Statistics plays a central role in modern science, we consider it an essential task to explain several relative obligations in every chapter and to place the contents of the chapters into perspective in these regards. Therefore, every chapter is of more than passing interest and value. The relationship between Statistics and the community entails, for example, ethical problems, some of which are discussed briefly to emphasize the responsibility of the student towards possible applications of Statistics in practice.

It will be necessary to purchase a pocket calculator which can handle at least square root calculations, squares and other powers, as well as calculations involving the exponential function (e^x).

  3.2 Frequency distribution of continuous data
    3.2.1 Frequency table
    3.2.2 Cumulative frequency distribution
    3.2.3 Relative frequency and percentage frequency distributions
    3.2.4 Relative and percentage cumulative frequency distributions
  3.3 Frequency distribution of discrete data
    3.3.1 Frequency table of discrete data
    3.3.2 Cumulative frequency table
    3.3.3 Relative frequency table
  3.4 Graphical representation of continuous data
    3.4.1 Dot plots
    3.4.2 Histograms
    3.4.3 Frequency polygons
    3.4.4 Cumulative frequency polygons
    3.4.5 Relative and percentage frequency polygons
    3.4.6 Relative cumulative frequency polygons
  3.5 Graphical representations of discrete data
    3.5.1 Dot plots
    3.5.2 Bar charts
    3.5.3 Pie charts
  Exercises

CHAPTER 4: DESCRIPTIVE MEASURES OF LOCATION
  4.1 Introduction
  4.2 The arithmetic mean
    4.2.1 Ungrouped data
    4.2.2 Grouped continuous data
    4.2.3 Grouped discrete data
    4.2.4 Properties of the arithmetic mean
  4.3 Mode
    4.3.1 Ungrouped data
    4.3.2 Grouped continuous data
    4.3.3 Grouped discrete data
    4.3.4 Properties of the mode
  4.4 The median
    4.4.1 Ungrouped data
    4.4.2 Grouped continuous data
    4.4.3 Grouped discrete data
    4.4.4 Properties of the median
  4.5 Quantiles
    4.5.1 Ungrouped data
    4.5.2 Grouped continuous data
  4.6 Comparison of the mean, median and mode
  Exercises

CHAPTER 5: DESCRIPTIVE MEASURES OF SPREAD
  5.1 Introduction
  5.2 Range of the data
    5.2.1 Ungrouped data
    5.2.2 Grouped data
    5.2.3 Properties of the range
  5.3 Inter-quartile range
  5.4 Quartile deviation
    5.4.1 Properties of the inter-quartile range and quartile deviation
  5.5 Standard deviation
    5.5.1 Ungrouped data
    5.5.2 Grouped data
    5.5.3 Properties of the standard deviation
  5.6 Relative measures of spread
  5.7 Stem-and-leaf plots
  5.8 Box plots
  5.9 Side-by-side box plots and stem-and-leaf plots
  Exercises

CHAPTER 6: SIMPLE REGRESSION AND CORRELATION ANALYSIS
  6.1 Introduction
    6.1.1 The discrete case
    6.1.2 The continuous case
  6.2 The scatter plot
  6.3 Pearson's correlation coefficient
  6.4 Spearman's correlation coefficient
  6.5 Regression: The linear relation
  6.6 The coefficient of determination
  6.7 Graphical representation of residuals
  6.8 Fitting of linear transformable curves
    6.8.1 The exponential curve
    6.8.2 The power curve
    6.8.3 The hyperbolic curve
  Exercises

CHAPTER 7: TIME-SERIES
  7.1 The nature and purpose of time-series analysis
  7.2 Movement components of a time-series
    7.2.1 The time-series model
  7.3 Long-term movement
    7.3.1 Method of least-squares
    7.3.2 Method of moving averages
  7.4 Seasonal component
  7.5 Cyclic movement
  Exercises

CHAPTER 8: PROBABILITY
  8.1 Introduction
  8.2 Review of set notation
  8.3 Basic probability concepts
  8.4 Tools for counting individual outcomes
  8.5 Conditional probability and the independence of events
  8.6 Laws of probability
  8.7 Random variables
  8.8 Probability distribution of a discrete random variable
  8.9 The binomial probability distribution
  8.10 Probability distribution of a continuous random variable
  8.11 The normal probability distribution
  8.12 Sampling distributions and the Central Limit Theorem
  8.13 Student's t probability distribution
  8.14 The chi-square distribution
  8.15 The F-distribution
  Exercises

CHAPTER 9: POINT AND INTERVAL ESTIMATION
  9.1 Introduction
  9.2 Types of estimators
  9.3 Point estimation
    9.3.1 Point estimate for the population mean μ
    9.3.2 Point estimate for the population variance σ²
    9.3.3 Point estimate for the population proportion π
    9.3.4 Properties of point estimates
  9.4 Interval estimation
    9.4.1 Large sample confidence interval for the population mean μ
    9.4.2 Large sample confidence intervals for the population proportion
    9.4.3 Small sample confidence intervals for the population mean μ
  Exercises

CHAPTER 10: HYPOTHESIS TESTING - ONE POPULATION
  10.1 Introduction
  10.2 The hypothesis testing procedure
  10.3 Single population: Hypotheses involving the population mean (μ)
    10.3.1 Large sample sizes
    10.3.2 Small sample sizes
  10.4 Single population: Hypotheses regarding a population proportion (π)
  Exercises

CHAPTER 11: HYPOTHESIS TESTING - TWO POPULATIONS
  11.1 Introduction
  11.2 The difference between two population means (μ₁ − μ₂)
    11.2.1 Large sample sizes
    11.2.2 Small sample sizes
  11.3 The difference between two population proportions (π₁ − π₂)
  Exercises

CHAPTER 12: PRACTICAL CONSIDERATIONS REGARDING SAMPLE SURVEYING
  12.1 The planning of sample surveys
    12.1.1 Purpose of the survey
    12.1.2 Description of the population
  12.2 Data collection
    12.2.1 Thoughts about the measuring procedures
    12.2.2 Methods of obtaining data
  12.3 Determining the sample size
    12.3.1 Population mean
    12.3.2 Population proportion
  12.4 Pilot surveys
  12.5 Questionnaires
  Exercises

CHAPTER 13: QUALITY CONTROL
  13.1 Introduction
  13.2 Control chart and sample proportions
  13.3 Control charts for sample means
  Exercises

CHAPTER 14: MULTIPLE REGRESSION
  14.1 Introduction
  14.2 Interpretation of the regression coefficients
  14.3 The coefficient of determination
  14.4 Inference for the population parameters
    14.4.1 The normal probability plot
  14.5 The contribution to R² of each of the variables
  14.6 Statistical significance of the regression as a whole
  Exercises

CHAPTER 15: ANALYSIS OF VARIANCE
  15.1 Introduction
  15.2 Intuitive concepts
  15.3 Analysis of variance notation
  15.4 ANOVA table and the F-test
    15.4.1 Interpretation of the test statistic
  15.5 Post hoc tests
  15.6 Testing ANOVA assumptions
  15.7 Computer output
  Exercises

CHAPTER 16: NONPARAMETRIC METHODS
  16.1 Introduction
  16.2 Nonparametric and parametric methods
  16.3 Nonparametric one sample tests
    16.3.1 The Sign test
    16.3.2 The Wilcoxon test for symmetry (the signed rank test for medians and means)
  16.4 Nonparametric two sample test: Difference in medians of independent samples (Wilcoxon rank sum test)
  16.5 Multiple group design with independent samples: The Kruskal-Wallis one-way analysis of variance by ranks
  Exercises

Appendix A: Random numbers
Appendix
Appendix
References

CHAPTER 1 INTRODUCTION

1.1 Statistics and modern society

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”

The above quote, which is indirectly attributed to author and futurist H.G. Wells in 1903, and stated during a presidential address of the American Statistical Association in 1950, points out that Science and Statistical thinking are fundamental tools needed for a modern society to function effectively.

The influence of science is immediately evident in nearly all areas of modern society. Where ancient societies were predominantly supported by primitive agricultural and/or hunting techniques, modern society is characterized by a more scientific existence.
Ingenious artefacts of modern life such as surgeons' apparatus, televisions, computers, compact disc players, digital music players and microwave ovens became available through breakthroughs in the fields of Electronics and Physics, and other related scientific fields. Processing fuel from coal, synthetic materials and plastics are the fruits of modern Chemistry. New planting techniques, improved livestock breeding methods, and improved fertilizers are all a result of research done in the fields of Biology and Agriculture. The incredible feats achieved in medicine in the last few decades are nothing short of miraculous, and are all the result of modern medical science. The natural sciences have indeed enriched society and paved the way for a brighter future.

Scientific methods have edged their way into the decision making processes of both governments and private undertakings. These decision making processes are founded on objective critical analyses and a systematic philosophy. Using these methods, decisions are usually only made after data have been collected and studied.

A vital aspect of modern science is the process of using empirical evidence. Scientists conducting experiments in a laboratory and recording the observed results for further analysis are making use of an empirical process. Just as the universe is studied through basic observations made through telescopes, so too are more earthbound subjects studied by observation. When analysts run opinion polls and surveys, they are engaged in collecting observations. Observations, or empirical evidence, are thus crucial to scientifically based decision making.

Academics have not always appreciated the use of empirical evidence; there have always been schools of thought that supposed that any knowledge could be determined by simply thinking about it.
However, it soon became apparent that studies based purely on contemplation lead to unreliable or erroneous results. Reality can only be understood through a combination of meticulous observation and critical thinking. This will lead to better understanding and interpretation of the events and things being studied.

The development of scientific knowledge is rather intricate and many factors play a role. In addition to human innovation, the concept of “scientific intuition”, as well as collaboration with other scientific fields and their methods, are also useful. This is important because one often finds that certain fields within scientific research tend to cling to particular schools of thought.

Another important aspect in the empirical process is the actual methods used to measure the information. These methods are usually made public. The results of the entire empirical process are a number of observations, also known as the data.

In summary, it should be noted that scientific methods implement these empirical processes, which allow scientists access to the observations (usually called data), which were obtained through some or other measurement process.

The question now is: how does one go about obtaining meaningful information from this data? In most cases the data by itself gives no clear answers to the problems being studied. A major difficulty that does come up is that data tends to vary. Sometimes this variation is purely the result of faulty measuring techniques, but mostly it is due to the very nature of the problem being studied. Usually including more observations in the study will mean that the amount of information available to answer specific questions increases as well. Unfortunately, it sometimes happens that either the cost of including more observations is too high, or that there are only a limited number of available observations. Whichever the case, the researcher usually has to do as much as possible with the data on hand.
It is at this stage that the subject of Statistics comes into its own.

Statistics is the science of extracting information from data.

Statistics is also widely used to make decisions based on data, but for now, the focus will lie on gaining information from data.

Statistics is thus a subject that is intertwined with nearly all other fields of scientific research. It studies methods used to work with data; these methods are used in nearly all of the sciences where observations and measurement are involved. Statistics could never take over the other sciences, since it focuses on only a single aspect of the research, namely the analysis and interpretation of observations. The expertise within those fields of science using statistical methods is thus vitally important. Put simply, Statistics is a “support science” employed to manage data. Once the data have been interpreted and the results have been delivered to the relevant user, the role of Statistics is complete.

Scientists working with observations should have at least some basic statistical knowledge in order to grasp the many advantages of statistical analysis and to understand the potential for its use in their subject. It is not necessary that they are able to perform all of the analysis themselves, but it is important that they have enough knowledge to be able to collaborate with professional statisticians. Scientists who think they can handle data by using only their own “common sense” will reach, even in very simple cases, only the most primitive sort of conclusions.

In general, scientists realise the importance of the scientific analysis of data in order to make reliable conclusions. In many scientific communities it is considered a necessary requirement that, in order for work to be accepted as high quality, the data be statistically analysed. Scientific journals demand that all results reported from research be handled statistically.
Therefore all research institutes compel researchers to make use of statistical services provided by either their own statistics departments or by independent statistical consultants.

In Section 1.2 the role of Statistics in the research process is discussed. In Section 1.3 the different types of data are examined. Section 1.4 investigates the use and role of the modern computer in statistical analysis.

1.2 Aspects of Statistics

Methods for dealing with data have been developed in the field of Statistics. These methods can be divided into the following three groups:

* Methods for data collection

A common misconception is that the collection of data can easily be handled by the novice researcher. It is well known that the manner in which the data are collected greatly influences the conclusions that can be made from the data. It is for this reason that one should carefully plan how the data should be collected. For example, one should not collect too few, but also not excessively many observations. The difference between the concepts of population and sample as well as the sampling methods will be explained in Chapter 2. Furthermore, the data collection process should, as far as possible, not be subjective and must also conform to various requirements in order to satisfy the assumptions of the analyses that will be applied later.

* Methods to order, represent graphically and summarize data

Raw data (i.e., data that are directly obtained from the measuring process) will not necessarily deliver any meaningful conclusions to the researcher. It is easier to determine the underlying properties of the data if the data are properly summarized, especially if the data set is large. The researcher is interested in the essential information contained within the data. A statistician is capable of obtaining this information from the data by using the methods of Descriptive Statistics.
Descriptive Statistics comprises the methods used to order, graphically represent and summarize data.

Descriptive statistics will be described in more detail in Chapters 3 to 5.

* Methods which can be used to draw conclusions from the data

Developing the methods which can be used to draw conclusions from data is one of the most challenging aspects of the subject of Statistics. In the midst of uncertainty surrounding the problems involved, there must be an accountable procedure which can be used to obtain answers from the data.

Basic statistical methods and concepts arose from multiple cultures and societies throughout history, but it is fair to say that modern methods and probability theory first appeared in Europe in the seventeenth century. Much of this development was stimulated by mathematical interest in gambling problems, which led to the formal definition of probability theory and ultimately gave rise to statistical inference (which will be defined below). The field has grown in leaps and bounds over the last 100 years thanks in no small part to statisticians like Neyman, Pearson, Fisher, Cox, Tukey, Cramér, and Wald, all helping to pioneer and develop a field which is ubiquitous in almost all other fields of science.

Before we formally define the concept of statistical inference, we first need to introduce the notion of populations and samples (these concepts will be described in more detail in Chapter 2):

A population is the complete group of elements from which one would like to gain information.

A sample is a subset of the elements of a population.

Example: Suppose that we require information about the lifetimes of televisions manufactured by a certain manufacturer. Here, the population is the set of all televisions produced by this manufacturer. It would be impossible to investigate every single television manufactured, so we rather investigate the lifetimes of a sample (or subset) of televisions drawn from this population.
The idea is then to use the information concerning the lifetimes of televisions extracted from this sample to be able to say something about the lifetimes of televisions in the whole population. This brings us then to the definition of statistical inference.

Statistical inference refers to the methods used to draw conclusions about a population from sample data.

The purpose of statistical inference is therefore to infer (i.e., make conclusions or decisions about) characteristics of a population using only the information obtained from a sample drawn from that population. Probability theory, which forms the basis of statistical inference, will be handled in Chapter 8, while inference procedures are discussed in Chapters 9, 10, and 11.

Example 1.1: Suppose that a professor in mathematics decides to make use of the average mark of one of the first year mathematics class groups to estimate the performance of the entire group. The process used to estimate the average performance of the entire group represents a typical problem in statistical inference. It is clear that any conclusions concerning the entire group will depend on a generalization that reaches far beyond the information contained within the sample data of the one group. Since the generalization might not be completely valid, an indication must be given as to the reliability of the result (for example, by assigning a probability).

The previous three methods form a closely knit unit since they are all indispensable when handling data. Each one of the three methods must be used sensibly, since the removal of any one of these will have a detrimental effect on the statistical process as a whole.

SELF-EVALUATION EXERCISE

Which of the following do not form part of Descriptive Statistics?
i. Graphical representation of the data.
ii. The summary of data.
iii. The drawing of conclusions about a population from a sample.
iv. The ordering of the data.
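
The idea behind statistical inference, as in the television example and Example 1.1 above, can be illustrated with a small simulation. The following is a minimal Python sketch and not part of the original notes: the population of television lifetimes is entirely hypothetical (simulated with made-up parameters), and the names `population` and `sample`, the seed, and the sample size of 100 are assumptions chosen purely for illustration.

```python
import random
import statistics

# Hypothetical population: lifetimes (in years) of 10 000 televisions.
# The values are simulated for illustration only; they are not real data.
rng = random.Random(2021)  # fixed seed so the sketch is reproducible
population = [rng.gauss(10.0, 2.0) for _ in range(10_000)]

# In practice the whole population cannot be investigated,
# so a sample (a subset of the population) is drawn instead.
sample = rng.sample(population, 100)

# The sample mean is used to say something about the population mean.
population_mean = statistics.mean(population)  # normally unknown
sample_mean = statistics.mean(sample)          # what the researcher observes

print(f"Population mean (normally unknown): {population_mean:.2f}")
print(f"Sample mean (estimate):             {sample_mean:.2f}")
print(f"Estimation error:                   {sample_mean - population_mean:+.2f}")
```

The sample mean will generally differ somewhat from the population mean, which is why an indication of reliability should accompany any conclusion drawn from a sample.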
1.3 Different types of data

1.3.1 “Measurement” and variables

In Statistics we deal with the methods used to collect data, where it is assumed that there is an underlying “measurement process” through which the properties of a particular element are determined by a “measurement instrument” and represented numerically.

Measurement is the process of assigning a numerical value to a property of an element.

For example, a particular person's opinion regarding an issue can be recorded as “for” or “against” the issue. These opinions can then be assigned numerical values, such as assigning a 0 if the person is “for” the issue, and a 1 if the person is “against” the issue. It is thus not unreasonable to require that the “measuring process” produces a numerical value as its output. One of the advantages of using numerical values instead of non-numeric values (i.e., using 0 and 1 instead of “for” and “against”) is that it can greatly simplify the analysis of the data.

A troubling aspect of the “measurement process” is the question of whether the measurements obtained from the observed element's properties are valid measurements.

Measurements are valid if they lead to useful information concerning the property being studied.

It is important to ensure that valid measurements are obtained in one's research since there are usually costs associated with the collection of data.

Example 1.2: Standardized IQ tests are commonly criticized because some population groups score lower, on average, than other groups. The question is whether or not the tests used in each population group result in reliable indications of intelligence.

When “measurement processes” are repeated it is usually found that there is variation in the results, that is, not all the observations are the same. For example, if the body mass of a group of people is measured, the first person could weigh 81 kg, the second 65 kg, the third 74 kg, etc.
The body mass of a person is thus a variable in the sense that each person has their own value associated with them. The variable in this case is the person's own body mass, which is different from person to person.

A variable is a property of an element that can take on different values from one element to the next.

Examples of variables are:
* height of a person - here the element is the person, the property or variable is height measured in metres, which can (realistically) take on values between 0.1 m and 2.8 m,
* weight of a banana - here the element is a banana, the property of interest is weight measured in grams, which realistically takes on values between 0.01 g and 1600 g,
* marks obtained by a student in an exam - the element being observed is a student and the property is the mark they receive in an exam, which ranges between 0% and 100%, and
* gender - here the element is a person and the property is gender, which can only take on the values “male” or “female”.

Now, because it is sometimes cumbersome to keep referring to a variable by its full name (like “body mass”), it is usually convenient to express the variable in terms of symbols such as x, y, or u. Of course, in each different case it must be clear as to which variables the symbols are referring.

1.3.2 Types of variables

Two different types of variables appear in Statistics, namely discrete and continuous variables.

A discrete variable can only take on values that are clearly distinguishable and disconnected from one another.

If a variable represents a whole number (for example the number of horses in a stable), then the values that the variable can take on are 0, 1, 2, ... and the variable is thus discrete. Discrete variables are typically the result of a process where “counts” are observed (the measurements are then whole numbers).

Example 1.3: The variable “number of rooms in a residence” can take on the values 1, 2, 3, 4, 5 or 6 (or more) (this means the variable “number of rooms” is discrete).
The number of homework problems that a statistics student receives daily is also a discrete variable. The gender of a person can be either “male” or “female”; if the gender is “female”, it could be represented by a 0 and “male” could be represented by a 1. This means that the variable “gender of a person” can be completely described by the values 0 and 1, thus making the variable discrete.

A continuous variable can, in principle, take on any value within an interval.

Example 1.4: Examples of continuous variables include: surface area of the floor in residences, ages of cats, the body mass of squirrels in a nature reserve, the heights of 25-year-old men living in Belgium, the daily volume of milk used by a restaurant, and fuel consumption of motorcars.

SELF-EVALUATION EXERCISE

Match column A with column B:

A                                                                                B
The velocity of a tennis ball served at a match at Wimbledon                     Population
The complete group of elements from which one would like to gain information     Statistical inference
Total number of spectators at a tennis match at Wimbledon                        Statistics
Make conclusions about a population from sample data                             Discrete variable
The science of extracting information from data                                  Continuous variable

1.3.3 Types of scales

The values that variables can take on can be expressed in terms of various scales, namely the nominal scale, ordinal scale, interval scale, and ratio scale. These different scales are defined below.

Nominal scale

A variable is measured on a nominal scale if its values serve only to distinguish between categories.

For example, the colour of a person's eyes can be categorized as “blue”, “brown”, or “green”, and the numbers 0, 1, and 2 can be used to denote each colour.
(Note: the numbers are only used to distinguish between the colours and have no further interpretation.) Other examples of variables measured on a nominal scale include the “make of cell-phone a student owns” (with categories “Apple”, “Samsung”, or “Sony”), and the blood type of a person (with categories “A+”, “A-”, “B+”, “B-”, “O+”, “O-”, “AB+”, and “AB-”).

Ordinal scale

A variable measured on an ordinal scale has the same properties as one measured on the nominal scale, but its categories can also be placed in order.

For example, an employee's job satisfaction can be classified as “poor”, “reasonable”, “good” and “excellent”. The numbers 0, 1, 2, and 3 can be used to denote each of these degrees of satisfaction respectively. Apart from the fact that the levels of satisfaction are different from one another, it can also be seen that there is an order to these levels of satisfaction, i.e., the higher the number of the category, the greater the amount of job satisfaction. Differences between the numbers don't really have any meaning. Another example of a variable measured on an ordinal scale is the weight class of a boxer (here the categories are “Bantamweight”, “Lightweight”, “Middleweight” and “Heavyweight”).

Interval scale

A variable measured on an interval scale has the same properties as those of the ordinal scale, but meaning has now been attached to the differences between values.

An example of this type of variable is “time as it appears on a watch”: The interval between 7:00 and 14:00 has the same interpretation as the interval between 8:00 and 15:00 (i.e., it is seven hours). However, the ratio between 7:00 and 14:00 (i.e., 7:00/14:00 = 0.5), or the ratio between 8:00 and 15:00 (i.e., 8:00/15:00 = 0.533), has no meaning. We cannot say that 7:00 is 0.5 times 14:00 or 8:00 is 0.533 times 15:00; this is because the starting point of the scale (say, midnight, 0:00) is arbitrarily chosen and any other starting point would serve just as well.
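
The claim that differences, but not ratios, are meaningful on an interval scale can be checked with a little arithmetic. The sketch below is an illustration only and is not part of the original notes: clock times are expressed as hours since an arbitrarily chosen starting point, and the helper name `hours_since` and the alternative starting point of 6:00 are assumptions made for the example.

```python
def hours_since(origin, clock_hour):
    """Express a clock time as the number of hours since a chosen starting point."""
    return clock_hour - origin

for origin in (0, 6):  # midnight, and a hypothetical alternative starting point
    a = hours_since(origin, 7)    # 7:00
    b = hours_since(origin, 14)   # 14:00
    difference = b - a
    ratio = a / b
    print(f"origin {origin}:00 -> difference = {difference} hours, ratio = {ratio:.3f}")

# origin 0:00 -> difference = 7 hours, ratio = 0.500
# origin 6:00 -> difference = 7 hours, ratio = 0.125
```

The difference stays at seven hours whichever starting point is used, while the ratio changes with the starting point, which is why it carries no meaning on this scale.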
Ratio scale

A variable measured on a ratio scale has the same properties as those of the interval scale, but now the ratios between values also have meaning, because the scale has a unique, non-arbitrary zero point.

Examples of this scale are “mass of bags of rubbish” and “height of a tree” because both of these have zero (which is unique and non-arbitrary) as their starting point. For example, suppose we have a variable which measures the mass of a bag of rubbish and we measure two bags: Bag A weighs 15 kg and Bag B weighs 60 kg. We can interpret the interval (or difference) between these two values as “Bag B is 45 kg heavier than Bag A”. However, since this variable is on the ratio scale, we can also interpret the ratio between the two bags as being “Bag B is 4 times as heavy as Bag A”.

1.3.4 Discrete data

Measurements obtained from observations of a discrete variable produce discrete data. This type of data can only take on values that are clearly distinguishable and disconnected from one another. 100% accurate measurements are possible in this case. Discrete data are obtained when, for example, one counts things, such as the number of times each of the numbers 1 to 6 appears face up when a six-sided die is thrown.

1.3.5 Continuous data

Measurements obtained from observations of a continuous variable produce continuous data. As opposed to discrete data, it is not possible to make 100% accurate measurements since the accuracy of the measurements is determined by the accuracy of the measuring instrument. Measurements are usually evaluated using certain predetermined measurement units, but there is always the possibility, in principle at least, of using more precise measurement units. The result of this is that measurements made on continuous variables are only approximations to the true measurements. Suppose, for example, that the measuring unit of height is taken to be in centimetres.
This means that, for instance, any one of the infinitely many different measurements between 175.5 cm and 176.5 cm is simply accepted as being 176 cm. It is thus possible to obtain data that, at first glance, appear to be discrete, but are simply rounded-off continuous data, and the number of possible values that can be assumed is more than what would usually be acceptable for discrete variables. Examples of continuous data are, among others, measurements of the heights or weights of people.

SELF-EVALUATION EXERCISES

Question 1: Which of the following definition(s) is/are correct?
i. Statistics is the science of extracting information from data, i.e., statistics makes sense of data.
ii. Statistical inference refers to methods used to draw conclusions about the population from sample data.
iii. Descriptive statistics refers to methods to collect data.
iv. Measurements are invalid if they lead to useful information concerning the property being studied.
v. Measurement involves the process of assigning a numerical value to a property of an element.

Question 2: Consider the following descriptions of variables. Which of these statements is/are false?
i. "The colour of the shirt that I am wearing" (where blue: "1", green: "2", ...) is an example of a discrete variable that is measured on the nominal scale.
ii. "The number of subjects I take this semester" is an example of a discrete variable and is measured on the nominal scale.
iii. "The amount of water I drink daily" is an example of a continuous variable that is measured on the ratio scale.
iv. "The time of day at which I walk to the Student Centre for a Chicken Wrap" is an example of a continuous variable that is measured on the interval scale.
v. "My height" is an example of a continuous variable that is measured on the ratio scale.
1.4 The use of the computer in Statistics

The rate at which computer hardware has developed in the past two decades, in conjunction with the development of software, has resulted in a multitude of computer packages that can be used to analyse data statistically. Most of these programs can be used on a network or on personal computers. The implication of this is that any person, even one without any statistical background, has access to statistical techniques that can produce answers in a fraction of a second. The problem with this situation is that people with little or no statistical knowledge do not have the necessary faculties to use these packages sensibly or to interpret the answers that are produced. This results in people drawing flawed conclusions, which in turn leads to inappropriate decisions being made.

Examples of the most renowned programs being used nationally and internationally in academic and research institutions include: Statistical Analysis Systems (SAS), Statistical Package for Social Sciences (SPSS), CSS Statistica and S-PLUS (or the freeware version known simply as R). These program packages are routinely improved and updated and are supported by user manuals and help files. R is one of the few free packages for both personal and network computer use. The user manuals usually make provision for people with little or no statistical knowledge by including some theoretical background in their text. This aids both the experienced and the inexperienced user by making the output easier to understand and interpret.

In this textbook the computer package called STAT1.3 will be used. This program is available from the North-West University's network. Students will receive assistance in the use of the program. The purpose of the computer work in this course is to:
a) Give students an indication of the role that computers play in statistical analysis.
b) Give students a deeper insight into important concepts in Statistics.
c) Expose students to routines involved in executing practical exercises in order for them to be able to handle a wider spectrum of examples.

Exercises

1. If we collect data on people's taste in music and label the categories Rock 'n Roll, Rap, R&B, Jazz and Classical, what type of data are we working with? Choose the correct option.
i. Discrete data on a ratio scale
ii. Continuous data on a nominal scale
iii. Discrete data on an ordinal scale
iv. Continuous data on a ratio scale
v. Discrete data on a nominal scale

2. Consider the following variables. In each case indicate whether the variable is discrete or continuous:
(a) The number of spectators at a soccer match.
(b) The amount of soda-pop (in ml) that a student drinks in one day.
(c) The mass of a boxer that competes in the heavyweight division.
(d) The number of clients that visit a supermarket daily.

3. Consider the following variables. In each case indicate the scale used to measure the variable:
(a) The position in which a person places in the Comrades marathon.
(b) The type of vehicle preferred by an individual, e.g., "Audi", "BMW", or "Mercedes".
(c) The salary of an employee of Eskom.
(d) The country in which a person is born, e.g., 1=USA, 2=South Africa, etc.
(e) [Text truncated in the source] ...ion of the Statistics lecturer, where 1="Excellent", 2="Good", 3="Average" and 4="Poor".
(f) The time taken to send a letter from Johannesburg to Moscow.
(g) The shelf life of milk.
(h) The number of life policies issued per day.
(i) The number of pages in a textbook.
(j) The flavours available in Bobtail food chunks.
(k) The wood types that can be used to make a desk.
(l) The voltage produced by a generator.
(m) The car types in the Mercedes range.

CHAPTER 2
SAMPLING METHODS

2.1 Introduction

Data collection is an integral part of statistical data analysis.
To derive trustworthy conclusions from information (data), we need to ensure that suitable and correct data collection methods are applied. Governments, industry and society need reliable information to make better decisions and to establish an informed society. Individuals and organizations therefore collect data because information is needed for many purposes. For example, records for administrative purposes are kept to make decisions about important issues, to formulate policy, or to adapt to new situations. Whatever the specific reason, data have to be collected to provide the needed information.

The nature of the information (data) which is collected is determined by the particular problem being analysed and the factors associated with the study. It is usually impossible to obtain complete information on all the possible items of interest within a particular study, usually due to a lack of all or some of the following: time, money, energy, equipment, labour (such as manpower), access to work places, or access to the complete sampling frame. The sampling frame is defined as follows:

[Definition box illegible in the source.]

Some examples of sampling frames include lists of all eligible voters held by the Independent Electoral Commission, the complete list of all matric learners writing the 2014 matric final exam across South Africa, a list of registered participants at a conference, and the South African Revenue Service's list of all taxpayers.

In spite of the restrictions listed above, the sample must also comply with the necessary requirement that it represents the underlying population as closely and correctly as possible, to ensure that the end results of the study are as reliable as possible, since the results relate to the unknown properties of the population (which is what we are ultimately interested in).

Before continuing our discussion of sampling and sampling methods, we will first formally define the concepts of populations, samples, and a census:
A population is the complete group of elements from which one would like to gain information.

[The definitions of a sample and a census are illegible in the source.]

The following notation for the sample size and population size will be used throughout the text:
The number of elements in a sample (i.e., the sample size) is denoted by n.
The number of elements in the population (i.e., the population size) is denoted by N.

If a census is conducted then no statistical inference (see Chapter 1 for the definition of inference) is required, since a census will reveal all the information inherent in the population. Descriptive methods will still be required to represent and order the data.

In general it is reasonable to expect that reliable conclusions concerning a population can only be made from samples that are representative of the population for the variables being studied. Therefore, samples cannot simply be chosen in any arbitrary fashion. The composition and nature of the population, as well as some other properties of the population, will influence the choice of the sampling procedure. For example, one should know exactly who or what is included in the population of interest and to which group the results are to be generalised. If a sample of doctors is drawn from all doctors working in Potchefstroom, do we wish to generalise the results obtained from this sample to all doctors in Potchefstroom, or to all doctors in North West? For a sample of 50 women in the age group 30-40, can the results extracted from this sample be generalised to all women, or only to all women aged 30-40?

This chapter provides an overview of some of the more popular sampling procedures which are used in practice. Advantages and disadvantages of each procedure are also briefly discussed.

SELF-EVALUATION EXERCISES

Study the two concepts
I. Population
II. Sample
With regard to the relation between the two concepts given above, which one of the following statements is correct?
i. I is part of II.
ii. II usually has more items than I.
iii. I is studied more intensely than II.
iv. The knowledge of II is used to gain knowledge of I.
v. I and II have no connection.

2.2 Sampling methods

From the previous section it should be clear that sampling plays an important role in statistical analysis and that its implementation should be carefully considered and planned. A sample survey costs less than a census, and results are obtained far more quickly for a sample survey than for a census, because fewer units are consulted and less data need to be processed.

A sample should be representative of the population no matter what the circumstances regarding the population may be or which sampling method is applied. In order to make a sample representative, close attention should be paid to the procedures involved in sampling and in analysing the data, since insufficient consideration of this aspect of statistics will lead to untrustworthy results and often disastrous consequences, such as unscientific, false conclusions drawn from wrong results. Therefore, selecting the best sampling method, as well as using the correct sample size, is a primary step in statistical analyses.

Insufficient attention to sampling and sampling methods can also mean that conclusions drawn from these samples are scientifically questionable, and so the best method for data collection must be selected for each case. Keep in mind that cost and data quality will be directly impacted by the method you choose.

Since every survey will differ from almost every other survey, there are no strict rules for determining the size of the sample required. The factors that will influence the size of the survey operations are time, cost, operational constraints and the desired precision of the results. It is important to evaluate and assess each of these issues in order to determine suitable sample sizes. Determining the sample size will be discussed in Chapter 12.
Sampling methods can be divided into two groups, namely probability procedures and non-probability procedures.

Probability sampling involves drawing a sample from a population based on the principle of randomization or chance. Probability sampling is more complex, more time-consuming and usually more costly than non-probability sampling. However, because units from the population are randomly selected and each unit's probability of inclusion can be calculated, reliable estimates for population parameters can be produced, and inferences can be made about the population. (The terms "parameter" and "estimates" will be discussed in later chapters.)

In non-probability sampling, since elements are chosen using subjective methods, there is no way to estimate the probability of any one element being included in the sample. The inability to calculate these probabilities prevents valid inferences from the sample to the population being considered. Statisticians are reluctant to use these methods, but in some simple situations non-probability methods can be useful, quick, inexpensive and convenient.

The difference between probability and non-probability sampling has to do with a basic assumption about the nature of the population under study. In probability sampling, every item has a chance of being selected. In non-probability sampling, there is an assumption that there is an even distribution of characteristics within the population. The researcher thus believes that any sample would be representative of the population with respect to the variable being considered. This means that, if the assumption is true, the non-probability sample will be trustworthy. For probability sampling, randomization is a feature of the selection process, rather than an assumption made concerning the structure of the population.
[Definition box partly illegible in the source; the legible part reads: "... probability of being included in the sample. A non-probability sample is taken by making use of more subjective criteria."]

The sampling procedures which will be discussed can be classified as follows:

Probability sample              Non-probability sample
• Simple random sampling        • Convenience sampling
• Stratified random sampling    • Judgement sampling
• Clustered sampling            • Quota sampling

Several other probability sampling methods exist, for example systematic sampling and multi-phase sampling, but these will not be discussed in this text.

2.2.1 Simple random sampling

Simple random sampling is the most general probability procedure, since the principle used here is also found, and used, in stratified sampling and clustered sampling. In order to randomly select an element from a population, one must assume that each element in the population has the same chance of being chosen. Also, each combination of members of the population has an equal chance of composing the sample. No element should be favoured above another in the selection process.

[Definition box illegible in the source.]

Any sampling method exhibiting the following properties can be classified as a simple random sample:
• The population should consist of N objects,
• the sample should consist of n objects, and
• all possible samples of n objects should be equally likely to occur.

Example 2.1: The national lottery draw, where a sample of 6 numbers is randomly generated from a population of 49, is a good example of simple random sampling. Each number has an equal chance of being selected and each combination of 6 numbers has the same chance of being the winning combination. Even though people tend to avoid combinations such as 1-2-3-4-5-6, it has the same chance of being the winning set of numbers as, for example, the combination 8-15-21-28-32-40.
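To make "the same chance" in Example 2.1 concrete, the number of possible samples can be counted. The sketch below assumes, as in the example, that 6 distinct numbers are drawn from 49 and that the order of drawing does not matter; the variable names are chosen here for illustration only.

```python
import math

N = 49  # population size: the numbers 1 to 49
n = 6   # sample size: the numbers drawn

# Number of distinct unordered samples of n objects from N objects.
n_samples = math.comb(N, n)

# Under simple random sampling every one of these samples is equally likely.
p_one_combination = 1 / n_samples

print(n_samples)            # 13983816
print(p_one_combination)    # the same for 1-2-3-4-5-6 as for 8-15-21-28-32-40
```

Every specific combination, however "unlikely" it may look, is therefore one of 13 983 816 equally likely outcomes.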
Whenever personal preference comes into play (either consciously or subconsciously) when drawing samples it will, almost without exception, lead to a non-random sample, and so some mechanical system for selecting these samples is preferable. In practice, if you wanted to select a simple random sample using one of these mechanical systems, you would first need to construct a list of all the units in the population.

One example of producing a simple random sample is by applying the so-called lottery method (see Example 2.1) for choosing n objects from a group of size N. Here, each of the N population members is assigned a unique number. The numbers are placed in a container, after which they are shuffled thoroughly. Then n numbers are blindly selected, randomly and independently, identifying the population members to be included in the sample. If it is agreed that a number, once chosen, cannot be chosen again, the method is referred to as sampling without replacement. If it is agreed that a number may be chosen more than once, the number must be returned to the container each time after being drawn. This process is known as sampling with replacement.

Another such mechanical method is to make use of generated random numbers. Random numbers are simply a random ordering of the digits 0, 1, 2, ..., 9. Table A1 in Appendix A is an example of a table of such numbers.

Table 2.1: Extract from Table A1

96599 17254 79613 49389 37448 66591 15245 93980 90991 10947
48809 66495 59414 88211 79005 89424 78962 21439 60317 35608

Table 2.1 represents an extract from Table A1. The grouping of the numbers is only used to aid readability and for no other reason.
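The lottery method described above, with and without replacement, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lottery_sample` and its parameters are hypothetical, and the standard-library `random` module stands in for the physical container of numbered slips.

```python
import random


def lottery_sample(N, n, replace=False, seed=None):
    """Draw n labels from the population labels 1..N (hypothetical helper)."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    labels = list(range(1, N + 1))     # one unique number per population member
    if replace:
        # With replacement: a drawn number is returned to the container,
        # so it may be drawn again.
        return rng.choices(labels, k=n)
    # Without replacement: a drawn number cannot be drawn again.
    return rng.sample(labels, n)


without = lottery_sample(N=800, n=10, seed=1)
with_repl = lottery_sample(N=800, n=10, replace=True, seed=1)
print(without)
print(with_repl)
```

Without replacement the ten labels are always distinct; with replacement, repeats are possible but not guaranteed.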
The random numbers in the table are used in the following manner: Suppose a sample of 10 elements must be obtained from a finite population of 800 elements, and that each element in the population has a unique number assigned to it. Suppose further that the numbers assigned to the elements in the population are the numbers beginning at 001 and ending at 800. The first step is then to arbitrarily choose a starting point in the random number table. Starting at this point, read the digits in groups of three. The first 10 numbers found in this way which both lie between 001 and 800 and are non-repeating (i.e., no two numbers in the group of 10 are the same) represent the numbered elements in the population which will be included in the sample.

Note: In the above example the digits are read off in groups of three because the population size (N) was a three-digit number. If N was, say, a five-digit number, then one would make use of groupings of five random digits, and so on.

Example 2.2(a): A random sample of size five is to be taken from a population of 250 people (N = 250). The following two rows of random numbers are to be used:

127356127345955561020301771
345156314751245625890242345

By beginning in the first row on the left-hand side, the following three-digit number groups are obtained: 127, 356, 127, 345, 955, 561, 020, 301, 771, 345, 156, etc. (Remember that N has 3 digits.) Some of the numbers obtained are greater than 250, making them useless for this population. If some of the numbers repeat themselves, and we keep the original and repeated numbers, this sampling is equivalent to sampling with replacement. Usually the repeating numbers are removed, implying sampling without replacement. The remaining numbers then represent the numbers of the elements in the population which should be chosen. The numbers of the five people to be included in the sample are: 127, 20, 156, 245 and 242.
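The selection in Example 2.2(a) can be reproduced mechanically. The following Python sketch is one possible way to do so; the helper names `read_groups` and `select` are assumptions introduced for this illustration and are not part of the text or of STAT1.3.

```python
def read_groups(digits, width):
    """Split a string of digits, left to right, into groups of `width` digits."""
    return [int(digits[i:i + width])
            for i in range(0, len(digits) - width + 1, width)]


def select(groups, N, n, replace=False):
    """Keep the first n usable labels (1..N); drop repeats unless replace=True."""
    chosen = []
    for g in groups:
        if not 1 <= g <= N:
            continue                  # label does not exist in the population
        if not replace and g in chosen:
            continue                  # sampling without replacement: skip repeats
        chosen.append(g)
        if len(chosen) == n:
            break
    return chosen


rows = ["127356127345955561020301771",
        "345156314751245625890242345"]

# Read each row from the left in groups of three digits (N = 250 has 3 digits).
groups = [g for row in rows for g in read_groups(row, 3)]

sample = select(groups, N=250, n=5)
print(sample)   # [127, 20, 156, 245, 242]
```

Calling `select(groups, N=250, n=5, replace=True)` instead keeps the repeated 127, which corresponds to sampling with replacement.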
One can also easily generate random numbers by making use of the "random" function on a pocket calculator or by using various computer packages.

The random numbers in Table 2.1 can be read off in a number of different ways. The next two examples illustrate two other possible ways of doing this.

Example 2.2(b): Suppose a random sample of size four is to be taken from the same population of 250 people (N = 250). As before, the following two rows of random numbers are to be used:

127356127345955561020301771
345156314751245625890242345

We now start in the first row on the right-hand side, reading right to left, taking one digit at a time until three digits are identified. The following three-digit number groups are then obtained: 177, 103, 020, 165, 559, 543, 721, 653, 721, etc. (Remember that N has 3 digits.) Again, numbers greater than 250 and numbers that repeat will be discarded (i.e., we are sampling without replacement), but the first occurrence of a repeating number is retained. The remaining numbers represent the numbers of the elements in the population which should be chosen. The numbers of the four people to be included in the sample are then: 177, 103, 20 and 165.

Example 2.2(c): If the digits are instead read in blocks of three, taking the blocks from right to left (without reversing the digits within a block), the following three-digit number groups are obtained: 771, 301, 020, 561, 955, 345, 127, 356, 127, 345, 242, 890, 625, 245, 751, 314, 156, 345. To select the four desired numbers, we had to continue to the second row, again from the right-hand side, three digits at a time. The four numbers in the case of sampling without replacement are 20, 127, 242 and 245. Sampling with replacement will produce the numbers 20, 127, 127 and 242.

These examples illustrate that an agreement is needed on how to use the generated random numbers before sampling is attempted.

Example 2.3: To draw a simple random sample from a home owners directory, each entry would need to be numbered sequentially.
If there were 10 000 entries in the directory and the required sample size was 2 000, then 2 000 numbers between 1 and 10 000 would have to be randomly generated by a computer or read from Table A1. Each number should have the same chance of being generated by the computer (in order to fulfil the simple random sampling requirement of an equal chance for every unit). The 2 000 home owners corresponding to the 2 000 computer-generated random numbers would make up the sample.
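Example 2.3 can be sketched with a computer-generated draw. The snippet below is a minimal illustration, assuming that the directory entries are numbered 1 to 10 000 and that the sample is drawn without replacement; the function name `draw_directory_sample` and the seed value are hypothetical, and Python's standard `random` module is used in place of Table A1.

```python
import random


def draw_directory_sample(n_entries=10_000, sample_size=2_000, seed=None):
    """Return `sample_size` distinct entry numbers between 1 and `n_entries`."""
    rng = random.Random(seed)   # the seed only makes the draw repeatable
    # rng.sample gives every entry number the same chance of selection
    # and never returns the same number twice.
    return rng.sample(range(1, n_entries + 1), sample_size)


sample = draw_directory_sample(seed=2021)
print(len(sample), len(set(sample)))   # 2000 2000
```

The home owners whose sequence numbers appear in `sample` would then make up the simple random sample.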

