The Analysis of Biological Data

Michael C. Whitlock and Dolph Schluter

Roberts and Company Publishers
South Yosemite Street, Greenwood Village, Colorado 80111 USA

Publisher: Ben Roberts
Text designer: Mark Ong, Side By Side Studios
Cover design: Mark Ong, Side By Side Studios
Compositor: Side By Side Studios

© 2009 by Roberts and Company Publishers

Reproduction or translation of any part of this work beyond that permitted by Section 107 or 108 of the 1976 United States Copyright Act without permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permissions Department at Roberts and Company Publishers.

ISBN: 978.0.951519404

Library of Congress Cataloging-in-Publication Data
Whitlock, Michael.
The analysis of biological data / Michael Whitlock and Dolph Schluter.
1. Biometry. I. Schluter, Dolph. II. Title.

To Sally and Wilson, Andrea and Maggie

Contents in brief

Preface
Acknowledgments
About the authors

Part 1  INTRODUCTION TO STATISTICS
1. Statistics and samples
Interleaf 1  Biology and the history of statistics
2. Displaying data
3. Describing data
4. Estimating with uncertainty
Interleaf 2  Pseudoreplication
5. Probability
6. Hypothesis testing
Interleaf 3  Why statistical significance is not the same as biological importance

Part 2  PROPORTIONS AND FREQUENCIES
7. Analyzing proportions
Interleaf 4  Correlation does not require causation
8. Fitting probability models to frequency data
Interleaf 5  Making a plan
9. Contingency analysis: associations between categorical variables

Part 3  COMPARING NUMERICAL VALUES
10. The normal distribution
Interleaf 6  Controls in medical studies
11. Inference for a normal population
12. Comparing two means
Interleaf 7  Which test should I use?
13. Handling violations of assumptions
14. Designing experiments
Interleaf 8  Data dredging
15. Comparing means of more than two groups
Interleaf 9  Experimental and statistical mistakes

Part 4  REGRESSION AND CORRELATION
16. Correlation between numerical variables
Interleaf 10  Publication bias
17. Regression
Interleaf 11  Using species as data points

Part 5  MODERN STATISTICAL METHODS
18. Multiple explanatory variables
19. Computer-intensive methods
20. Likelihood
21. Meta-analysis: combining information from multiple studies

Answers to practice problems
Literature cited
Statistical tables
Photo credits
Index

Contents

Preface
Acknowledgments
About the authors

Part 1  INTRODUCTION TO STATISTICS

1. Statistics and samples
  1.1 What is statistics?
  1.2 Sampling populations
    Example 1.2: Raining cats
    Populations and samples
    Properties of good samples
    Random sampling
    How to take a random sample
    The sample of convenience
    Volunteer bias
    Real data in biology
  1.3 Types of data and variables
    Categorical and numerical variables
    Explanatory and response variables
  1.4 Frequency distributions and probability distributions
  1.5 Types of studies
  1.6 Summary
  Practice problems
  Assignment problems

Interleaf 1  Biology and the history of statistics

2. Displaying data
  2.1 Displaying frequency distributions
    Frequency tables and bar graphs for categorical data
    Example 2.1A: Causes of teenage deaths
    Construction rules for bar graphs
    Frequency tables and histograms for numerical data
    Example 2.1B: Abundance of desert bird species
    Describing the shape of a histogram
    Interval width can affect histogram shape
    Example 2.1C: How many peaks?
    Construction rules for histograms
  2.2 Quantiles of a frequency distribution
    Percentiles and quantiles
    Cumulative frequency distribution
  2.3 Associations between categorical variables
    Example 2.3: Reproductive effort and avian malaria
    Contingency tables
    Grouped bar graph
    Mosaic plot
  2.4 Comparing numerical variables between groups
    Comparing histograms between groups
    Example 2.4: Blood responses to high elevation
    Comparing cumulative frequencies
  2.5 Displaying relationships between a pair of numerical variables
    Scatter plot
    Example 2.5A: Sins of the father
    Line graph
    Example 2.5B: Cyclic fluctuations in lynx numbers
    Maps
    Example 2.5C: The Antarctic ozone hole
  2.6 Principles of effective display
    Principles of graphical display
    Follow similar principles in display tables
  2.7 Summary
  Practice problems
  Assignment problems

3. Describing data
  3.1 Arithmetic mean and standard deviation
    Example 3.1: Gliding snakes
    The sample mean
    Variance and standard deviation
    Rounding means, standard deviations, and other quantities
    Coefficient of variation
    Calculating mean and standard deviation from a frequency table
  3.2 Median and interquartile range
    Example 3.2: I'd give my right arm for a female
    The median
    The interquartile range
    The box plot
  3.3 How measures of location and spread compare
    Example 3.3: Disarming fish
    Mean versus median
    Standard deviation versus interquartile range
  3.4 Proportions
    Calculating a proportion
    The proportion is like a sample mean
  3.5 Summary
  3.6 Quick Formula Summary
  Practice problems
  Assignment problems

4. Estimating with uncertainty
  4.1 The sampling distribution of an estimate
    Example 4.1: The length of human genes
    Estimating mean gene length with a random sample
    The sampling distribution of Ȳ
  4.2 Measuring the uncertainty of an estimate
    Standard error
    The standard error of Ȳ
    The standard error of Ȳ from data
  4.3 Confidence intervals
    The 2SE rule of thumb
  4.4 Summary
  4.5 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 2  Pseudoreplication

5. Probability
  5.1 The probability of an event
  5.2 Venn diagrams
  5.3 Mutually exclusive events
  5.4 Probability distributions
    Discrete probability distributions
    Continuous probability distributions
  5.5 Either this or that: adding probabilities
    The addition rule
    The probabilities of all possible mutually exclusive events add to one
    The general addition rule
  5.6 Independence and the multiplication rule
    Multiplication rule
    Example 5.6A: Smoking and high blood pressure
    "And" versus "or"
    Independence of more than two events
    Example 5.6B: This thing ate my money!
    Example 5.6C: Mendel's peas
  5.7 Probability trees
    Example 5.7: Sex and birth order
  5.8 Dependent events
    Example 5.8: Is this meat taken?
  5.9 Conditional probability and Bayes' theorem
    Conditional probability
    The general multiplication rule
    Bayes' theorem
    Example 5.9: Detection of Down syndrome
  5.10 Summary
  Practice problems
  Assignment problems

6. Hypothesis testing
  6.1 Making and using hypotheses
    Null hypothesis
    Alternative hypothesis
    To reject or not to reject
  6.2 Hypothesis testing: an example
    Example 6.2: The right hand of toad
    Stating the hypotheses
    The test statistic
    The null distribution
    Quantifying uncertainty: the P-value
    Statistical significance
    Reporting the results
  6.3 Errors in hypothesis testing
    Type I and Type II errors
  6.4 When the null hypothesis is not rejected
    Example 6.4: The genetics of mirror-image flowers
    The test
    Interpreting a non-significant result
  6.5 One-sided tests
  6.6 Hypothesis testing versus confidence intervals
  6.7 Summary
  Practice problems
  Assignment problems

Interleaf 3  Why statistical significance is not the same as biological importance

Part 2  PROPORTIONS AND FREQUENCIES

7. Analyzing proportions
  7.1 The binomial distribution
    Formula for the binomial distribution
    Number of successes in a random sample
    Sampling distribution of the proportion
  7.2 Testing a proportion: the binomial test
    Example 7.2: Sex and the X
    Approximations for the binomial test
  7.3 Estimating proportions
    Example 7.3: Radiologists' missing sons
    Estimating the standard error of a proportion
    Confidence intervals for proportions: the Agresti-Coull method
    Confidence intervals for proportions: the Wald method
  7.4 Deriving the binomial distribution
  7.5 Summary
  7.6 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 4  Correlation does not require causation

8. Fitting probability models to frequency data
  8.1 Example of a random model: the proportional model
    Example 8.1: No weekend getaway
  8.2 χ² goodness-of-fit test
    Null and alternative hypotheses
    Observed and expected frequencies
    The χ² test statistic
    The sampling distribution of χ² under the null hypothesis
    Calculating the P-value
    Critical values for the χ² distribution
  8.3 Assumptions of the χ² goodness-of-fit test
  8.4 Goodness-of-fit tests when there are only two categories
    Example 8.4: Gene content of the human X chromosome
  8.5 Fitting the binomial distribution
    Example 8.5: Designer two-child families?
  8.6 Random in space or time: the Poisson distribution
    Formula for the Poisson distribution
    Testing randomness with the Poisson distribution
    Example 8.6: Mass extinctions
    Comparing the variance with the mean
  8.7 Summary
  8.8 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 5  Making a plan

9. Contingency analysis: associations between categorical variables
  9.1 Associating two categorical variables
  9.2 Estimating association in 2 × 2 tables: odds ratio
    Odds
    Example 9.2: Take two aspirin and call me in the morning?
    Odds ratio
    Standard error and confidence interval for odds ratio
  9.3 The χ² contingency test
    Example 9.3: The gnarly worm gets the bird
    Hypotheses
    Expected frequencies assuming independence
    The χ² statistic
    Degrees of freedom
    P-value and conclusion
    A shortcut for calculating the expected frequencies
    The χ² contingency test is a special case of the χ² goodness-of-fit test
    Assumptions of the χ² contingency test
    Correction for continuity
  9.4 Fisher's exact test
    Example 9.4: The feeding habits of vampire bats
  9.5 G-tests
  9.6 Summary
  9.7 Quick Formula Summary
  Practice problems
  Assignment problems

Part 3  COMPARING NUMERICAL VALUES

10. The normal distribution
  10.1 Bell-shaped curves and the normal distribution
  10.2 The formula for the normal distribution
  10.3 Properties of the normal distribution
  10.4 The standard normal distribution and statistical tables
    Using the standard normal table
    Using the standard normal to describe any normal distribution
    Example 10.4: One small step for man?
  10.5 The normal distribution of sample means
    Calculating probabilities of sample means
  10.6 Central limit theorem
    Example 10.6: Pushing your buttons
  10.7 Normal approximation for the binomial distribution
    Example 10.7: The only good bug is a dead bug
  10.8 Summary
  10.9 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 6  Controls in medical studies

11. Inference for a normal population
  11.1 The t-distribution for sample means
    Student's t-distribution
    Finding critical values of the t-distribution
  11.2 The confidence interval for the mean of a normal distribution
    Example 11.2: Eye to eye
    The 95% confidence interval for the mean
    The 99% confidence interval for the mean
  11.3 The one-sample t-test
    Example 11.3: Human body temperature
    The effects of larger sample size: body temperature revisited
  11.4 The assumptions of the one-sample t-test
  11.5 Estimating the standard deviation and variance of a normal population
    Confidence limits for the variance
    Confidence limits for the standard deviation
    Assumptions
  11.6 Summary
  11.7 Quick Formula Summary
  Practice problems
  Assignment problems

12. Comparing two means
  12.1 Paired sample versus two independent samples
  12.2 Paired comparison of means
    Estimating mean difference from paired data
    Example 12.2: So macho it makes you sick?
    Paired t-test
    Assumptions
  12.3 Two-sample comparison of means
    Example 12.3: Spike or be spiked
    Confidence interval for the difference between two means
    Two-sample t-test
    Assumptions
    A two-sample t-test when standard deviations are unequal
  12.4 Using the correct sampling units
    Example 12.4: So long; thanks to all the fish
  12.5 The fallacy of indirect comparison
    Example 12.5: Mommy's baby, Daddy's maybe
  12.6 Interpreting overlap of confidence intervals
  12.7 Comparing variances
    The F-test of equal variances
    Levene's test for homogeneity of variances
  12.8 Summary
  12.9 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 7  Which test should I use?

13. Handling violations of assumptions
  13.1 Detecting deviations from normality
    Graphical methods
    Example 13.1: The benefits of marine reserves
    Formal test of normality
  13.2 When to ignore violations of assumptions
    Violations of normality
    Unequal standard deviations
  13.3 Data transformations
    Log transformation
    Arcsine transformation
    The square-root transformation
    Other transformations
    Confidence intervals with transformations
    A caveat: avoid multiple testing with transformations
  13.4 Nonparametric alternatives to one-sample and paired t-tests
    Sign test
    Example 13.4: Sexual conflict and the origin of new species
    The Wilcoxon signed-rank test
  13.5 Comparing two groups: the Mann-Whitney U-test
    Example 13.5: Sexual cannibalism in sagebrush crickets
    Tied ranks
    Large samples and the normal approximation
  13.6 Assumptions of nonparametric tests
  13.7 Type I and Type II error rates of nonparametric methods
  13.8 Summary
  13.9 Quick Formula Summary
  Practice problems
  Assignment problems

14. Designing experiments
  14.1 Why do experiments?
    Confounding variables
    Experimental artifacts
  14.2 Lessons from clinical trials
    Example 14.2: Reducing HIV transmission
    Design components
  14.3 How to reduce bias
    Simultaneous control group
    Randomization
    Blinding
  14.4 How to reduce the influence of sampling error
    Replication
    Balance
    Blocking
    Example 14.4A: Holey waters
    Extreme treatments
    Example 14.4B: Plastic hormones
  14.5 Experiments with more than one factor
    Example 14.5: Lethal combination
  14.6 What if you can't do experiments?
    Match and adjust
  14.7 Choosing a sample size
    Plan for precision
    Plan for power
    Plan for data loss
  14.8 Summary
  14.9 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 8  Data dredging

15. Comparing means of more than two groups
  15.1 The analysis of variance
    Example 15.1: The knees who say night
    Hypotheses
    ANOVA in a nutshell
    Calculating the mean squares
    The variance ratio, F
    ANOVA tables
    Variability explained: R²
    ANOVA with two groups
  15.2 Assumptions and alternatives
    The robustness of ANOVA
    Data transformations
    Nonparametric alternatives to ANOVA
  15.3 Planned comparisons
    Planned comparison between two means
  15.4 Unplanned comparisons
    Example 15.4: Wood wide web
    Testing all pairs of means using the Tukey-Kramer method
    Assumptions
  15.5 Fixed and random effects
  15.6 ANOVA with randomly chosen groups
    Example 15.6: Walking stick limbs
    ANOVA table
    Variance components
    Repeatability
    Assumptions
  15.7 Summary
  15.8 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 9  Experimental and statistical mistakes

Part 4  REGRESSION AND CORRELATION

16. Correlation between numerical variables
  16.1 Estimating a linear correlation coefficient
    The correlation coefficient
    Example 16.1: Manly digits
    Standard error
    Approximate confidence interval
  16.2 Testing the null hypothesis of zero correlation
    Example 16.2: What big inbreeding coefficients you have
  16.3 Assumptions
  16.4 The correlation coefficient depends on the range
  16.5 Spearman's rank correlation
    Example 16.5: The miracles of memory
    Procedure for large n
    Assumptions of Spearman's correlation
  16.6 The effects of measurement error on correlation
  16.7 Summary
  16.8 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 10  Publication bias

17. Regression
  17.1 Linear regression
    Example 17.1: The lion's nose
    The method of least squares
    Formula for the line
    Calculating the slope and intercept
    Populations and samples
    Predicted values
    Residuals
    Standard error of slope
    Confidence interval for the slope
  17.2 Confidence in predictions
    Confidence intervals for predictions
    Extrapolation
  17.3 Testing hypotheses about a slope
    Example 17.3: Chickadee alarms
    The t-test of regression slope
    The ANOVA approach
    Using R² to measure the fit of the line to data
  17.4 Regression toward the mean
  17.5 Assumptions of regression
    Outliers
    Detecting non-linearity
    Detecting non-normality and unequal variance
  17.6 Transformations
  17.7 The effects of measurement error on regression
  17.8 Nonlinear regression
    A curve with an asymptote
    Quadratic curves
    Formula-free curve fitting
    Example 17.8: The incredible shrinking seal
    Fitting a binary response variable
  17.9 Summary
  17.10 Quick Formula Summary
  Practice problems
  Assignment problems

Interleaf 11  Using species as data points

Part 5  MODERN STATISTICAL METHODS

18. Multiple explanatory variables
  18.1 From linear regression to general linear models
    Modeling with linear regression
    Generalizing linear regression
    Analyzing a categorical treatment variable
    Example 18.1: I feel your pain
  18.2 Analyzing experiments with blocking
    Analyzing data from a randomized block design
    Example 18.2: Zooplankton depredation
    Model formula
    Fitting the model to data
  18.3 Analyzing factorial designs
    Analysis of two fixed factors
    Example 18.3: Interaction zone
    Model formula
    Fitting the model to data
    The importance of distinguishing fixed and random factors
  18.4 Adjusting for the effects of a covariate
    Example 18.4: Mole-rat layabouts
    Testing interaction
    Dropping the interaction term
  18.5 Assumptions of general linear models
  18.6 Summary
  Practice problems
  Assignment problems

19. Computer-intensive methods
  19.1 Hypothesis testing using simulation
    Example 19.1: How did he know? The nonrandomness of haphazard choice
  19.2 Randomization test
    Example 19.2: Girls just wanna have genetic diversity
    Assumptions of randomization tests
  19.3 Bootstrap standard errors and confidence intervals
    Example 19.3: The language center in chimps' brains
    Bootstrap standard error
    Confidence intervals by bootstrapping
    Bootstrapping data sets with multiple samples
    Assumptions and limitations of the bootstrap
  19.4 Summary
  Practice problems
  Assignment problems

20. Likelihood
  20.1 What is likelihood?
  20.2 Two uses of likelihood in biology
    Phylogeny estimation
    Gene mapping
  20.3 Maximum likelihood estimation
    Example 20.3: Unruly passengers
    Probability model
    The likelihood formula
    The maximum likelihood estimate
    Likelihood-based confidence intervals
  20.4 Versatility of maximum likelihood estimation
    Example 20.4: Conservation scoop
    Probability model
    The likelihood formula
    The maximum likelihood estimate
    Bias
  20.5 Log-likelihood ratio test
    Likelihood ratio test statistic
    Testing a population proportion
  20.6 Summary
  20.7 Quick Formula Summary
  Practice problems
  Assignment problems

21. Meta-analysis: combining information from multiple studies
  21.1 What is meta-analysis?
    Why repeat a study?
  21.2 The power of meta-analysis
    Example 21.2: Aspirin and myocardial infarction
  21.3 Meta-analysis can give a balanced view
    Example 21.3: The Transylvania effect
  21.4 The steps of a meta-analysis
    Define the question
    Example 21.4: Testosterone and aggression
    Review the literature
    Compute effect sizes
    Determine the average effect size
    Calculate confidence intervals and make hypothesis tests
    Look for effects of study quality
    Look for associations
  21.5 File-drawer problem
  21.6 How to make your paper accessible to meta-analysis
  21.7 Summary
  21.8 Quick Formula Summary
  Practice problems
  Assignment problems

Answers to practice problems
Literature cited
Statistical tables
  Using statistical tables
  Statistical Table A: The χ² distribution
  Statistical Table B: The standard normal (Z) distribution
  Statistical Table C: The Student t-distribution
  Statistical Table D: The F-distribution
  Statistical Table E: Mann-Whitney U-distribution
  Statistical Table F: Tukey-Kramer q-distribution
  Statistical Table G: Critical values for the Spearman's correlation coefficient
Photo credits
Index

Preface

Modern biologists need the powerful tools of data analysis. As a result, an increasing number of universities offer, or even require, a basic data analysis course for all their biology students. We have been teaching such a course at the University of British Columbia for the last two decades. Over this period, we have sought a textbook that covered the material we needed in a first course at just the right level. We found that most texts were too technical and encyclopedic, or else they didn't go far enough, missing methods that were crucial to the practice of modern biology. We wanted a book that had a strong emphasis on intuitive understanding to convey meaning, rather than an over-reliance on formulas. We wanted to teach by example, and the examples needed to be interesting.
Most importantly, we needed a biology book, addressing topics important to biologists handling real data.

We couldn't find the book that we needed, so we decided to write this one to fill the gap. We include several unusual features that we have discovered to be helpful for effectively reaching our audience:

Interesting biology examples. Our teaching has shown us that biology students learn data analysis best in the context of interesting examples drawn from the medical and biological literature. Statistics is a means to an end, a tool to learn about nature. By emphasizing what we can learn about biology, the power and value of statistics becomes plain. Plus, it's just more fun for everyone concerned. Every chapter has several biological examples of key concepts, and each example is prefaced by a substantial description of the biological setting. The examples are illustrated with photos of the real organisms, so that students can look at what they're learning about. The emphasis on real and interesting examples carries into the problem sets; for each chapter, there are dozens of questions based on real data about biological issues.

Intuitive explanations of key concepts. Statistical reasoning requires a lot of new ways of thinking. Students can get lost in the barrage of new jargon and multitudinous tests. We have found that starting from an intuitive foundation, away from all the details, is extremely valuable. We take an intuitive approach to basic questions: What's a good sample? What's a confidence interval? Why do an experiment? The first several chapters establish this basic knowledge, and the rest of the book builds on it.

Practical data analysis. As its title suggests, this book focuses on data rather than the mathematical foundations of statistics. We teach how to make good graphical displays, and we emphasize that a good graph is the beginning point of any good data analysis.
We give equal time to estimation and hypothesis testing, and we avoid treating the P-value as an end in itself. The book does not demand a knowledge of mathematics beyond simple algebra. We focus on practicality over nuance, on biological usefulness over theoretical hand-wringing. We teach not only the "right" way of ...

... all children in Vancouver, Canada, suffering from asthma.

A sample is a much smaller set of individuals selected from the population.¹ The researcher uses this sample to draw conclusions that, hopefully, apply to the whole population. Examples include
> the fallen cats brought to one veterinary clinic in New York City,
> a selection of 20 human genes,
> a pub full of Australian voters,
> eight paradise tree snakes caught by researchers in Borneo, and
> a selection of 50 children in Vancouver, Canada, suffering from asthma.

In the above examples, the basic unit of sampling is literally a single individual. Sometimes, however, the basic unit of study is a group of individuals, in which case a sample consists of a set of such units. Examples of units include a single family, a colony of microbes, a plot of ground in a field, an aquarium of fish, and a cage of mice. Scientists use several terms to indicate the sampling unit, such as "unit," "individual," "subject," or "replicate."

Properties of good samples

Estimates based on samples are doomed to depart somewhat from the true population characteristics simply by chance. This chance difference from the truth is called sampling error. The spread of estimates resulting from sampling error indicates the precision of an estimate. The lower the sampling error, the higher the precision. Larger samples are less affected by chance and so, all else being equal, larger samples will have lower sampling error and higher precision.

¹ In biology, a "blood sample" or a "tissue sample" might refer to a substance taken from a single individual. In statistics, we reserve the word "sample" to refer to a set of units drawn from a population.
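The link between sample size and precision can be illustrated with a short simulation. The following is only a minimal sketch under stated assumptions: the population is made up (10,000 values drawn from a normal distribution), and the function name `spread_of_sample_means`, the sample sizes, and the seeds are hypothetical choices for illustration, not anything taken from the book. It draws many random samples of a given size and reports how widely their means are scattered.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, n_samples=2000, seed=1):
    """Return the standard deviation of the means of many random samples.

    This spread reflects sampling error: the larger it is, the less
    precise a single estimate of the population mean will be.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


# Hypothetical population of 10,000 measurements (made up for illustration).
_rng = random.Random(0)
population = [_rng.gauss(100, 15) for _ in range(10_000)]

spread_small = spread_of_sample_means(population, sample_size=10)
spread_large = spread_of_sample_means(population, sample_size=100)

print(f"spread of estimates, n = 10:  {spread_small:.2f}")
print(f"spread of estimates, n = 100: {spread_large:.2f}")
```

Under these assumptions, the estimates from the larger samples should cluster much more tightly around the population mean, which is what "lower sampling error and higher precision" means in practice.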
Ideally, our estimate is accurate (or unbiased), meaning that the average of estimates is centered on the true population value. If samples are not properly taken, measurements made on them might systematically underestimate (or overestimate) the population parameter. This is a second kind of error called bias.

The major goal of sampling is to minimize sampling error and bias in estimates. Figure 1.2-2 illustrates these goals by analogy with shooting at a target. Each point represents an estimate of the population bull's ...

...
> The injury rate of cats that have fallen from high-rise buildings is likely to be underestimated compared with a random sample, if measured only on cats that are brought to a veterinary clinic. Uninjured and fatally injured cats are less likely to make it to the vet and into the sample.
> The spectacular collapse of the North Atlantic cod fishery in the last century was caused in part by overestimating cod densities in the sea, which led to excessive allowable catches by fishing boats (Walters and Maguire 1996). Density estimates were too high because they relied heavily on the rates at which the fishing boats were observed to capture cod. However, the fishing boats tended to concentrate in the few remaining areas where cod were still numerous and they did not randomly sample the entire fishing area.
> The Literary Digest Poll was the largest poll in history (questionnaires were sent to 10 million people, of which 2.4 million responded), but it predicted the wrong outcome to the 1936 U.S. federal election (Freedman et al. 1997). This was probably because the list of people to receive questionnaires was obtained from magazine subscriptions, telephone books, automobile registrations, and club memberships. This tended to leave out people in low-income families whose voting preferences were very different from the higher-income people who received questionnaires.

A sample of convenience might also violate the assumption of independence if individuals in the sample are more similar to one another in their characteristics than individuals chosen randomly from the whole population. This is likely if, for example, the sample includes a disproportionate number of individuals who are friends or who are related to one another.

Volunteer bias

Human studies in particular must deal with the possibility of volunteer bias, which is a bias resulting from a systematic difference between the pool of volunteers (the volunteer sample) and the population to which they belong. The problem arises when the behavior of the subjects affects whether they are sampled.

In a large experiment to test the benefits of a polio vaccine, for example, participating schoolchildren were randomly chosen to receive either the vaccine or a saline solution (serving as the control). The vaccine proved effective, but the rate at which children in the saline group contracted polio was found to be higher than in the general population. Perhaps parents of children who had not been exposed to polio prior to the study, and therefore had no immunity, were more likely to volunteer their children for the study than parents of kids who had been exposed (Brownlee 1955, Bland 2000).

Compared with the rest of the population, volunteers might be
> more health conscious and more proactive;
> low-income (if volunteers are paid);
> more ill, particularly if the therapy involves risk, because individuals who are dying anyway might try anything;
> more likely to have time on their hands (e.g., retirees and the unemployed are more likely to answer telephone surveys);
> more angry, because people who are upset are sometimes more likely to speak up; or
> less prudish, because people with liberal opinions about sex are more likely to speak to surveyors about sex.

Such differences can cause substantial bias in the results of studies.
Bias can be minimized, however, by careful handling of the volunteer sample, but the resulting sample is still inferior to a random sample.

Real data in biology

In this book we use real data hard-won from observational or experimental studies in the lab and field and published in the literature. Do the samples on which the studies are based conform to the ideals outlined above? Alas, the answer is often no. Random samples, however much desired, are often not achieved by biologists working in the trenches. Real data are frequently based on samples that are not random, as the falling cats in Example 1.2 demonstrate.

Biologists deal with this problem by acknowledging that the problem exists, by pointing out where biases might arise in their studies,² and by carrying out further studies that attempt to control for any sampling problems evident in earlier work.

² We biologists are generally happier to find such bias in other researchers' data than in our own.

1.3 Types of data and variables

With a sample in hand, we can begin to measure variables. A variable is any characteristic or measurement that differs from individual to individual. Examples include running speed, reproductive rate, and genotype. Estimates (e.g., average running speed of a random sample of 10 lizards) are also variables, because they differ by chance from sample to sample. Data are the raw measurements of one or more variables made on a sample of individuals.

Categorical and numerical variables

Variables can be categorical or numerical.

Categorical variables describe membership in a category or group. They describe named characteristics of individuals that do not have magnitude on a numerical scale. Categorical variables are also called attribute or qualitative variables. Examples of categorical variables include
> sex chromosome genotype (e.g., XX, XY, XO, XXY, or XYY),
> method of disease transmission (e.g., water, air, animal vector, or direct contact),
> predominant language spoken (e.g., English, Mandarin, Spanish, Indonesian, etc.),
> life stage (e.g., egg, larva, juvenile, subadult, or adult),
> snakebite severity score (e.g., minimal severity, moderate severity, or very severe), and
> size class (e.g., small, medium, or large).

A categorical variable is nominal if the different categories have no inherent order. Nominal means "name." Sex chromosome genotype, method of disease transmission, and predominant language spoken are nominal variables. In contrast, the values of an ordinal categorical variable can be ordered, despite lacking magnitude on the numerical scale. Ordinal means "having an order." Life stage, snakebite severity score, and size class are ordinal categorical variables.

A variable is numerical when measurements of individuals are quantitative and have magnitude. These variables are numbers. Measurements that are counts, dimensions, angles, rates, and percentages are numerical. Examples of numerical variables include
> core body temperature (e.g., degrees Celsius, °C),
> territory size (e.g., hectares),
...

...

The distinction between experimental studies and observational studies is that experimental studies can determine cause-and-effect relationships between variables, whereas observational studies can only point to associations. An association between smoking and lung cancer might be due to the effects of smoking per se, or perhaps to an underlying predisposition to lung cancer in those individuals prone to smoking. It is difficult to distinguish these alternatives with observational studies alone. For this reason, experimental studies of the health hazards of smoking in non-human animals have helped make the case that cigarette smoking is dangerous to human health. But experimental studies are not always possible, even on animals. Smoking in humans, for example, is also associated with a higher suicide rate (Hemmingsson and Kriebel 2003). Is this association caused by the effects of smoking, or is it caused by the effects of some other variable?

Just because a study was carried out in the laboratory does not mean that the study is an experimental study in the sense described here. A complex laboratory apparatus and careful conditions may be necessary to obtain measurements of interest, but such a study is still observational if the assignment of treatments or groups is out of the control of the researcher.

1.6 Summary

> Statistics is a technology for measuring aspects of populations from samples and for quantifying the uncertainty of the measurements.
> Much of statistics is about estimation, which infers an unknown quantity of a population using sample data.
> Statistics also allows hypothesis testing, a method to determine how well a null hypothesis about a population characteristic, established for the purposes of argument, fits the sample data.
> Sampling error is the chance difference between an estimate describing a sample and that parameter of the whole population. Bias is a systematic discrepancy between an estimate and the population quantity.
> The goals of sampling are to increase the accuracy and precision of estimates and to ensure that it is possible to quantify precision.
> In a random sample, every individual in a population has the same chance of being selected and the selection of individuals is independent.
> A sample of convenience is a collection of individuals easily available to a researcher, but it is not usually a random sample.
> Volunteer bias is a systematic discrepancy in a quantity between the pool of ...
> Variables are measurements that differ among individuals.
> Variables are either categorical or numerical. A categorical variable describes which category an individual belongs to, while a numerical variable is expressed as a number.
> In studies of association between variables, one variable is often designated as the response variable (the variable being predicted) and the other variable designated as the explanatory variable (i.e., the variable being used to predict the response).
> In experimental studies, the researcher is able to assign subjects randomly to different treatments or groups. In observational studies, the assignment of individuals to treatments is not controlled by the researcher.

PRACTICE PROBLEMS

Answers to the practice problems are provided at the end of the book, starting on page 613.

1. Which of the following categorical variables are nominal? Which are ordinal?
   a. Letter grades ...
   b. First letter of students' last names
   ...
   d. Body mass index (underweight, normal, overweight, or obese)

2. ...
   b. Of the variables measured in Figure 1.2-1, which is the explanatory variable? Which is the response variable?

3. Which of the following numerical variables are ...
   Number of ...
   Fraction of ... in a large population
   Number of crimes committed ...
   Logarithm of body mass

4. A researcher carried out a study on the peppered moth to examine how the proportion of melanic (black) individuals changed over time in England. She measured all specimens of the moth that were collected and deposited in museums ... the year in which each moth was collected. This enabled her to plot the proportion of melanic moths year by year.
   a. Can the specimens from a given year be considered a random sample from the moth population? ...
   b. If not a random sample, what type of sample is it?
   ...

5. ... answer the following question: Should the U.S. Constitution be amended to ban same-sex marriages? Of the 402,485 votes cast, 168,480 answered "yes" and 234,005 answered "no."
   a. ... the variable measured in this survey ...
   b. Can the 402,485 votes be regarded as a random sample of people across the U.S.? Why ...
   c. The fraction of people answering "yes" was 168,480/402,485 = 0.42. ...
Wha pe and lage private colletions, and she also 1 8 Chapter | Statistics and samples fins might affect the acura ofthis Ina stdy of stress levels in U.S. amy ees stationed in Irn, eserchers obtained a com plete st the names of ert in rag tthe me ofthe study. The listed the eu alpha etclly and then numbered them consecus tively One hundred random numbers between 1 nd te total numberof eis Were then ge rated wsng random -mumbergoncrtor oa the ‘omer. The 100 ects whose numbers ear responded to those generate bythe computer were interviewed forthe stad 44, Whats the population eres inthis study? 1h The 100 rer itrviwed were randomly sampled as deseibed, Are the measurements silted by sampling enor? Explain. & Whatare te main advantages of corn sampling this example? Animportat quantity in conservation bioloy 4s the number of plant and animal species ina ing given areata be se aside for protection ‘To survey the community of small maramals inhabiting Kruger Nasional Park, alae cies ofl raps were placed randomly throughout ‘the park for one weck inthe ain dy season of 2004, Teaps were set each evening and checked ‘he following morning. Iva caught were “Meni, nage (0th ne cates could be iscnguished trom recaptures), and released. At ‘he ond of the survey, ther nambee a eal rarmal species inthe park, M, was estimated ‘by, the total number of species captured inthe suey 4 What the population being estimated in she sureey? ‘the simple of indivi aptredin the ‘raps likely tobe azandom sample? Why or ‘why nt? In your answer, address the to criteria tht define a sample random «© Tsthe number of species inte sample (m) 1ikely tbo an unbiased estimate of M, the total mimber of small mara species inthe pak? Why or why not? ‘Survival of sparows over thei Fit year oie ‘was measured on age university campus. The nests of 20 beeing pars were located by care- ful sare, representing less than 1% of al the reding parson campus. 
Chicks in each of these nests were given a unique set of color bands on their legs so that they could be identified individually once they had fledged from the nest. A total of 60 chicks were banded, of which only four survived to the following year. The researchers estimated the survival rate in the population as 4/60. Do the 60 chicks constitute a random sample? What consequences might this have for the estimate of survival rate?

[...] which of the following variables are
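
The numbered-list procedure described in the army recruits problem (list the names alphabetically, number them consecutively, then draw random numbers) can be sketched in a few lines of Python. This is only an illustrative sketch: the roster names, the roster size of 2,500, the function name `simple_random_sample`, and the seed are hypothetical, and it assumes the 100 random numbers are drawn without repeats, which the problem does not state.

```python
import random


def simple_random_sample(names, n, seed=None):
    """Draw n individuals so that each has the same chance of being selected.

    Mirrors the procedure in the problem: list alphabetically, number
    consecutively from 1, then pick n random numbers. Assumption: the
    numbers are distinct (sampling without replacement).
    """
    rng = random.Random(seed)           # seeded only so the sketch is repeatable
    ordered = sorted(names)             # list the names alphabetically
    numbered = dict(enumerate(ordered, start=1))  # number them 1..N
    chosen = rng.sample(range(1, len(ordered) + 1), n)  # n distinct numbers
    return [numbered[i] for i in chosen]


# Hypothetical roster; the real list of names is not given in the problem.
roster = [f"recruit_{i:04d}" for i in range(1, 2501)]
interviewed = simple_random_sample(roster, 100, seed=1)
print(len(interviewed))  # 100
```

Because every number between 1 and the roster size is equally likely to be drawn, and the draws do not depend on any characteristic of the individuals, the result satisfies the equal-chance criterion for a random sample given in the summary above.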
