DEPARTMENT OF BIOLOGICAL SCIENCES
BENUE STATE UNIVERSITY, MAKURDI
COURSE: BIO 307 (BIOSTATISTICS I)
COURSE LECTURER: Dr. Liamngee Kator

CORRELATION

A correlation is a statistical measure of the relationship between two variables. In correlated data, a change in the magnitude of one variable is associated with a change in the magnitude of another variable, either in the same direction (positive correlation) or in the opposite direction (negative correlation). The measure is best used for variables that demonstrate a linear relationship with each other. For example, ice cream sales increase as temperature increases during summer, and fan sales tend to increase as temperature increases.

The degree to which the change in one continuous variable is associated with a change in another continuous variable can be described mathematically in terms of the covariance of the two variables. Covariance is similar to variance, but whereas variance describes the variability of a single variable, covariance is a measure of how two variables vary together.

The correlation coefficient (r) is a value that indicates the degree of the relationship between variables. The coefficient can take any value from -1 to 1. The interpretations of the values are:

-1: Perfect negative correlation: the variables tend to move in opposite directions (i.e., when one variable increases, the other variable decreases).
0: No correlation: the variables do not have a relationship with each other.
1: Perfect positive correlation: the variables tend to move in the same direction (i.e., when one variable increases, the other variable also increases).

The correlation coefficient that indicates the degree of the relationship between two variables can be found using either of the following formulae:

$$r = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}$$

OR

$$r_{xy} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$
where $x = X - \bar{X}$ and $y = Y - \bar{Y}$. Therefore, the formula can be written in full as:

$$r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

Where:
$r_{xy}$: the correlation coefficient of the linear relationship between the variables x and y
$X$: the values of the x-variable in a sample
$\bar{X}$: the mean of the values of the x-variable
$Y$: the values of the y-variable in a sample
$\bar{Y}$: the mean of the values of the y-variable

In order to calculate the correlation coefficient using the formula above, you must undertake the following steps:
• Obtain a data sample with the values of the x-variable and y-variable.
• Calculate the means (averages) $\bar{X}$ for the x-variable and $\bar{Y}$ for the y-variable.
• For the x-variable, subtract the mean from each value of the x-variable (let's call this new variable "a"). Do the same for the y-variable (let's call this variable "b").
• Multiply each a-value by the corresponding b-value and find the sum of these multiplications (the final value is the numerator in the formula).
• Square each a-value and calculate the sum of the results. Do the same for the b-values, then multiply the two sums together.
• Find the square root of the value obtained in the previous step (this is the denominator in the formula).
• Divide the numerator by the denominator to obtain r.

Characteristics of simple correlation
• There is covariability of two variables
• Variables are normally distributed
• Both variables are selected randomly
• Whether variable X is dependent on variable Y is unknown
• Both variables are the effect of one cause
• One variable is not the function of the other

Assumptions of correlation
I. As is true for any statistical inference, the data are derived from a random, or at least representative, sample. If the data are not representative of the population of interest, one cannot draw meaningful conclusions about that population.
II. Both variables are continuous, jointly normally distributed, random variables. They follow a bivariate normal distribution in the population from which they were sampled.
III. Each pair of x-y values is measured independently of every other pair.
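The calculation steps listed earlier in this section can be sketched in Python as follows. This is only an illustrative sketch: the function name `pearson_r` and the small data set are assumptions made for the example and do not come from the course notes.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed with the deviation-from-mean steps."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n                      # mean of the x-variable
    mean_y = sum(ys) / n                      # mean of the y-variable
    a = [x - mean_x for x in xs]              # deviations of x from its mean
    b = [y - mean_y for y in ys]              # deviations of y from its mean
    numerator = sum(ai * bi for ai, bi in zip(a, b))
    # product of the two sums of squares, then its square root
    denominator = math.sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator


# Hypothetical data: y is exactly twice x, so the relationship is perfectly linear.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 6, 8, 10]
r = pearson_r(x_values, y_values)
print(round(r, 4))
```

Reversing the order of `y_values` in the same call gives a coefficient at the other extreme of the range, which matches the interpretation of perfect negative correlation.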
Scatter diagram for correlation

[Figure: four scatter diagrams of y against x, labelled Positive correlation, Negative correlation, No relationship and Curvilinear correlation]

N.B: A negative coefficient of correlation does not mean there is no relationship; it means the relationship is a negative one.

Example 1
Calculate and comment on the correlation coefficient (r) between the CA scores and examination scores of five students in BSU below.

CA scores (X)   | 25 | 26 | 27 | 30 | 35
Exam scores (Y) | 60 | 70 | 80 | 85 | 90

Solution

$$r = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}$$

n = 5, ΣX = 143, ΣY = 385, ΣXΣY = 55055, ΣXY = 11180, ΣX² = 4155, ΣY² = 30225, (ΣX)² = 20449, (ΣY)² = 148225.

$$r = \frac{11180 - \frac{55055}{5}}{\sqrt{\left(4155 - \frac{20449}{5}\right)\left(30225 - \frac{148225}{5}\right)}} = \frac{11180 - 11011}{\sqrt{(4155 - 4089.8)(30225 - 29645)}} = \frac{169}{\sqrt{65.2 \times 580}} = \frac{169}{\sqrt{37816}} = \frac{169}{194.46} \approx 0.87$$

There is a strong positive correlation between the CA scores and the examination scores.

REGRESSION

[...] forecasting. Regression analysis can provide insights that few other techniques can. The key benefits of using regression analysis are that it can:
1. Indicate if independent variables have a significant relationship with a dependent variable.
2. Indicate the relative strength of different independent variables' effects on a dependent variable.
3. Make predictions.

Regression models
Regression models involve the following variables:
• The unknown parameters, denoted as β, which may represent a scalar or a vector.
• The independent variables, X.
• The dependent variable, Y.

The linear regression model is represented as:

$$Y = \alpha + \beta X$$

Where
Y = Dependent variable
X = Independent variable
α = Intercept
β = Slope

where

$$\beta = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}} \quad \text{OR} \quad \beta = \frac{\sum xy}{\sum x^2}$$

with $x = X - \bar{X}$ and $y = Y - \bar{Y}$. The line of best fit should pass through the mean of x and y, thus we have

$$\bar{Y} = \alpha + \beta \bar{X}, \quad \text{so that} \quad \alpha = \bar{Y} - \beta \bar{X}$$

Characteristics of Regression
• The dependent variable Y has a linear relationship to the independent variable X
• The regression line passes through the mean of x and through the mean of y
• The regression coefficient (β) is the average change in the dependent variable Y for a unit change in the independent variable X.
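The slope and intercept formulas above can be sketched in Python as follows. This is a minimal sketch, not part of the course notes: the function name `fit_line_from_sums` is hypothetical, and the sums from Example 1 are reused purely as illustrative input (treating the examination score as the dependent variable is an assumption made for the example).

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (alpha, beta) of Y = alpha + beta * X from the raw sums."""
    s_xy = sum_xy - sum_x * sum_y / n      # corrected sum of cross products
    s_xx = sum_x2 - sum_x ** 2 / n         # corrected sum of squares of X
    if s_xx == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    beta = s_xy / s_xx                     # slope
    alpha = sum_y / n - beta * sum_x / n   # intercept: line passes through the means
    return alpha, beta


# Illustrative input: the sums quoted in Example 1 (n = 5).
alpha, beta = fit_line_from_sums(n=5, sum_x=143, sum_y=385, sum_xy=11180, sum_x2=4155)
mean_x, mean_y = 143 / 5, 385 / 5

print(round(beta, 3), round(alpha, 3))
# The fitted line evaluated at the mean of X should return the mean of Y.
print(round(alpha + beta * mean_x, 6), mean_y)
```

Working from the sums mirrors the hand calculation, so each intermediate quantity can be compared with a worked solution line by line.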
Assumptions of linear regression
• The sample is representative of the population at large
• The variance of the residuals is the same for any value of X
• There is no perfect linear relationship between explanatory variables

Example 2
An experiment was conducted at the College of Advanced and Professional Studies (CAPS), Makurdi to assess the effect of single phosphate fertilizer on crop yield. The data obtained are shown below.

Amount of fertilizer X (g/m²) | 0   | 80  | 130 | 180 | 240
Crop yield Y (kg/plot)        | 150 | 160 | 170 | 180 | 190

Fit a regression equation for the data and predict crop yield when the fertilizer amount is 400 g/m².

Solution
n = 5, ΣXΣY = 535500, ΣXY = 112900, ΣX² = 113300, (ΣX)² = 396900, $\bar{X}$ = 126, $\bar{Y}$ = 170

$$\beta = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sum X^2 - \frac{(\sum X)^2}{n}} = \frac{112900 - \frac{535500}{5}}{113300 - \frac{396900}{5}} = \frac{112900 - 107100}{113300 - 79380} = \frac{5800}{33920} = 0.17$$

$$\alpha = \bar{Y} - \beta \bar{X} = 170 - 0.17(126) = 170 - 21.42 = 148.58$$

For linear regression, Y = α + βX, therefore
Y = 148.58 + 0.17(400)
Y = 148.58 + 68
Y = 216.58

Hence crop yield is 216.58 kg/plot when the farmer applies 400 g/m² of fertilizer. The regression equation is Y = 148.58 + 0.17X.

Differences between correlation and regression

Correlation | Regression
It is not a predictive model | It is a predictive model
Measures the degree of relationship between two variables | Measures the strength of dependency
It does not matter which variable is dependent or independent | Variable X is independent while Y is dependent
Best relationship given by the coefficient of correlation, which ranges from -1 to +1 | Best relationship given by the line of best fit

Similarities between correlation and regression
• They both show the relationship between variables
• They are both used as methods of data analysis
• They are both represented on a scatter diagram

EXPERIMENTAL DESIGN

Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments.
The methods of experimental design are widely used in the fields of agriculture, medicine, biology, marketing research, and industrial production. Experimental design deals with the planning of the experiment, layout, collection of data, data tabulation, analysis of data and interpretation of results. In experimental design we look out for factors or their interaction to find out whether the experimental materials are homogeneous or heterogeneous.

There are three principles of experimental design: randomization, replication and local control.

Randomization: This is the allocation of treatments without bias such that no treatment, person or object is consistently favoured or handicapped. Randomization can be carried out by the use of random numbers, a random number table, raffle draws or a lottery. Randomization also helps in reducing experimental error. There are three simple types of randomization: Completely Randomized Design (CRD), Randomized Complete Block Design (RCBD) and Latin Square Design (LSD).

Replication: This refers to the number of times an experiment is repeated. There should be a minimum of 3 replications in any research. The principle of replication is based on the conviction that similar circumstances and conditions should produce highly identical results. Replication, just like randomization, reduces experimental error.

Control: The control group in an experiment is the group that does not receive any treatment. It is used as a benchmark against which other test results are measured. This group includes individuals who are very similar in many ways to the individuals who are receiving the treatment, in terms of age, gender, race, or other factors. A control group is used in an experiment as a point of comparison. By having a group that does not receive any sort of treatment, researchers are better able to isolate whether the experimental treatment did or did not affect the subjects who received it.
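The randomization principle described above (allocating treatments to experimental units by random numbers) can be sketched in Python for a completely randomized layout. This is only a sketch under stated assumptions: the function name `randomize_crd`, the treatment labels A to F, the five replications and the seed value are all illustrative choices, not taken from the notes.

```python
import random


def randomize_crd(treatments, replications, seed=None):
    """Allocate every treatment `replications` times to plots in random order."""
    rng = random.Random(seed)              # a seed makes the layout reproducible
    labels = [t for t in treatments for _ in range(replications)]
    rng.shuffle(labels)                    # unbiased random ordering of labels
    # Plots are numbered in sequence starting from 1.
    return {plot: label for plot, label in enumerate(labels, start=1)}


# Hypothetical layout: six treatments, five replications, thirty plots.
layout = randomize_crd(["A", "B", "C", "D", "E", "F"], replications=5, seed=307)
for plot, treatment in layout.items():
    print(plot, treatment)
```

Because the labels are shuffled as a whole, every treatment keeps exactly the planned number of replications while its position among the plots is left to chance.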
Types of experimental designs
• Completely Randomized Design (CRD)
• Randomized Complete Block Design (RCBD)
• Sampling experiments
• Latin Square Design
• Factorial experiments
• Split-plot design

ANALYSIS OF VARIANCE (ANOVA)

The responses between experimental units may vary due to many different causes, known and unknown. The process of separating and comparing the sources of variation in an experiment is called analysis of variance. One of the ways to minimize the experimental errors arising from these different sources of variation is randomization and replication of the experiment. Each of the aforementioned types of experimental design is therefore chosen to suit the experiment at hand. For the purpose of this study, the CRD and RCBD forms of experimental design will be discussed in detail.

Assumptions underlying the analysis of variance
• Experimental errors (or observations) are normally distributed
• Experimental errors (or observations) are independently distributed
• Experimental errors (or observations) from different treatments have the same variance

COMPLETELY RANDOMIZED DESIGN

Characteristics of CRD
i. It is mostly used for laboratory experiments and where materials are homogeneous
ii. Used when environmental effects are relatively controlled
iii. Used when several experimental units may be destroyed or fail to respond
iv. Experimental units are allotted at random, and the number of replications and treatments are restricted

Advantages of CRD
i. The design is simple
ii. Errors are minimized
iii. Data are easily calculated
iv. Statistical analysis is easy
v. Missing data can be calculated
vi. All available experimental materials can be utilized, and the method of data analysis remains simple even when data are missing or rejected

Disadvantage of CRD
• It is not very accurate

Linear model in CRD

$$X_{ij} = \mu + \tau_i + \varepsilon_{ij}$$

Where
$X_{ij}$ = any observation
$\mu$ = population mean
$\tau_i$ = treatment effect
$\varepsilon_{ij}$ = error term

Example 3.
The root yields (ton/acre) of plots fertilized with six levels of Nitrogen are presented below. Take F tab (0.05) = 2.62.
a. Design the ANOVA table for the above experiment
b. Test the hypothesis of the treatment effect
c. Determine if there is a significant difference in treatment means at 5% level of probability

Treatment | 1    | 2    | 3    | 4    | 5    | Total
A         | 31.3 | 33.4 | 29.2 | 32.2 | 33.9 | 160.0
B         | 38.8 | 37.5 | 37.4 | 35.8 | 38.4 | 187.9
C         | 40.9 | 39.2 | 39.5 | 38.6 | 39.8 | 198.0
D         | 40.9 | 41.7 | 39.4 | 40.1 | 40.0 | 202.1
E         | 39.7 | 40.6 | 39.2 | 38.7 | 41.9 | 200.1
F         | 40.6 | 41.0 | 41.5 | 41.1 | 39.8 | 204.0
Total     |      |      |      |      |      | 1152.1

Solution
H0: Root yields were the same in response to nitrogen fertilizer
H1: Root yields were not the same in response to nitrogen fertilizer

$$\text{Correction factor (C.F)} = \frac{(\sum X_{ij})^2}{tr} = \frac{(1152.1)^2}{30} = \frac{1327334.41}{30} = 44244.48$$

$$SS_{total} = \sum X_{ij}^2 - C.F = 31.3^2 + 33.4^2 + 29.2^2 + \dots + 39.8^2 - 44244.48 = 44555.61 - 44244.48 = 311.13$$

$$SS_{treatment} = \frac{\sum T_i^2}{r} - C.F = \frac{160^2 + 187.9^2 + 198^2 + 202.1^2 + 200.1^2 + 204^2}{5} - 44244.48 = \frac{222610.83}{5} - 44244.48 = 44522.17 - 44244.48 = 277.69$$

where $T_i$ is the total for treatment i, t is the number of treatments and r is the number of replications.

$$SS_{error} = SS_{total} - SS_{treatment} = 311.13 - 277.69 = 33.44$$

Degree of freedom for treatment = t - 1 = 6 - 1 = 5
Degree of freedom for error = t(r - 1) = 6(5 - 1) = 24
Degree of freedom for total = tr - 1 = 6(5) - 1 = 29

$$\text{Mean Square Treatment} = \frac{SS_{treatment}}{df} = \frac{277.69}{5} = 55.54$$

$$\text{Mean Square Error} = \frac{SS_{error}}{df} = \frac{33.44}{24} = 1.39$$

$$F_{calculated} = \frac{\text{Mean square treatment}}{\text{Mean square error}} = \frac{55.54}{1.39} = 39.96$$

TABLE OF ANOVA

Sources of Variation | Df | SS     | MS    | F cal | F tab (0.05)
Treatment            | 5  | 277.69 | 55.54 | 39.96 | 2.62
Error                | 24 | 33.44  | 1.39  |       |
Total                | 29 | 311.13 |       |       |

H0 is rejected since F cal is greater than F tab. Therefore, there is a significant difference in root yield in response to the nitrogen fertilizer.

RANDOMIZED COMPLETE BLOCK DESIGN

Characteristics of RCBD
i. The number of blocks is the number of replications
ii. Any treatment can be adjacent to any other treatment, but not to the same treatment within the block

Example 4
[The statement of this example and the first part of its solution are missing from the source. The surviving working, for 5 treatments in 4 blocks with C.F = 2000 and SS total = 124, follows.]

$$SS_{block} = \frac{\sum B_j^2}{t} - C.F = \frac{64^2 + 52^2 + 44^2 + 40^2}{5} - 2000 = \frac{4096 + 2704 + 1936 + 1600}{5} - 2000 = \frac{10336}{5} - 2000 = 2067.2 - 2000 = 67.20$$

$$SS_{treatment} = \frac{\sum T_i^2}{b} - C.F = \frac{41^2 + 37^2 + 32^2 + 42^2 + 48^2}{4} - 2000 = \frac{1681 + 1369 + 1024 + 1764 + 2304}{4} - 2000 = \frac{8142}{4} - 2000 = 2035.5 - 2000 = 35.50$$

where $B_j$ is the total for block j, $T_i$ is the total for treatment i, t is the number of treatments and b is the number of blocks.

$$SS_{error} = SS_{total} - SS_{block} - SS_{treatment} = 124 - 67.20 - 35.50 = 21.30$$

Degree of freedom for block = b - 1 = 4 - 1 = 3
Degree of freedom for treatment = t - 1 = 5 - 1 = 4
Degree of freedom for error = (t - 1)(b - 1) = (5 - 1)(4 - 1) = 12
Degree of freedom for total = tb - 1 = 5(4) - 1 = 19

$$\text{Mean Square Block} = \frac{SS_{block}}{df} = \frac{67.20}{3} = 22.40$$

$$\text{Mean Square Treatment} = \frac{SS_{treatment}}{df} = \frac{35.50}{4} = 8.88$$

$$\text{Mean Square Error} = \frac{SS_{error}}{df} = \frac{21.30}{12} = 1.78$$

$$F_{calculated\ (block)} = \frac{\text{Mean square block}}{\text{Mean square error}} = \frac{22.40}{1.78} = 12.58$$

$$F_{calculated\ (treatment)} = \frac{\text{Mean square treatment}}{\text{Mean square error}} = \frac{8.88}{1.78} = 4.99$$

TABLE OF ANOVA

Sources of Variation | Df | SS    | MS    | F cal | F tab (0.05)
Block                | 3  | 67.20 | 22.40 | 12.58 | 3.49
Treatment            | 4  | 35.50 | 8.88  | 4.99  | 3.26
Error                | 12 | 21.30 | 1.78  |       |
Total                | 19 | 124   |       |       |

H0 is rejected since F cal is greater than F tab. Therefore, there is a significant difference in growth rate of the AYB and a significant block effect.

MEAN SEPARATION

In Examples 3 and 4 above, which considered CRD and RCBD respectively, the F-test was computed only to guide us on whether to accept or reject the null hypothesis. However, the F-test did not tell us how the treatment means differed from each other. Mean separation therefore enables us to know how treatments differ from each other. The two most commonly used methods of separating means are Fisher's Least Significant Difference (FLSD) and Duncan's Multiple Range Test (DMRT).
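Returning to the CRD analysis in Example 3, the hand calculation can be cross-checked with a short Python sketch. This is an illustrative sketch, not the lecturer's method: the function name `crd_anova` is hypothetical, and the data assume that the Example 3 table reads as listed in the code (the entries for treatments E and F that are hard to read in the scan are inferred from the row totals and the quoted sum of squares).

```python
def crd_anova(groups):
    """One-way ANOVA for a CRD with equal replication per treatment."""
    t = len(groups)                                   # number of treatments
    r = len(next(iter(groups.values())))              # replications per treatment
    if any(len(values) != r for values in groups.values()):
        raise ValueError("this sketch assumes equal replication")
    observations = [x for values in groups.values() for x in values]
    cf = sum(observations) ** 2 / (t * r)             # correction factor
    ss_total = sum(x * x for x in observations) - cf
    ss_treatment = sum(sum(v) ** 2 for v in groups.values()) / r - cf
    ss_error = ss_total - ss_treatment
    df_treatment, df_error = t - 1, t * (r - 1)
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error
    return {
        "ss_total": ss_total, "ss_treatment": ss_treatment, "ss_error": ss_error,
        "df_treatment": df_treatment, "df_error": df_error,
        "ms_treatment": ms_treatment, "ms_error": ms_error,
        "f": ms_treatment / ms_error,
    }


# Root yields from Example 3 (E and F entries partly inferred, see lead-in).
yields = {
    "A": [31.3, 33.4, 29.2, 32.2, 33.9],
    "B": [38.8, 37.5, 37.4, 35.8, 38.4],
    "C": [40.9, 39.2, 39.5, 38.6, 39.8],
    "D": [40.9, 41.7, 39.4, 40.1, 40.0],
    "E": [39.7, 40.6, 39.2, 38.7, 41.9],
    "F": [40.6, 41.0, 41.5, 41.1, 39.8],
}
result = crd_anova(yields)
for name, value in result.items():
    print(name, round(value, 2))
```

The sketch carries full precision through every step, so its F ratio (about 39.85) is slightly lower than the 39.96 obtained by hand from rounded mean squares. The conclusion is unchanged because both values are far above the tabulated 2.62.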
Differences between FLSD and DMRT

FLSD | DMRT
It is suitably used when treatments are not many | Used when treatments are many
F-test must be significant before using it | F-test does not have to be significant before using it
It is very easy to use | It is more difficult and cumbersome to use
It can be used for all possible comparisons | It cannot be used to compare all means, only adjacent ones
It does not make use of alphabets in mean separation | Alphabets are used in mean separation

$$FLSD = t_\alpha \times \sqrt{\frac{2S^2}{r}}$$

Where $S^2$ = variance (error mean square), r = number of replicates (blocks) and 2 is a constant.

Example 5.
In solving Example 3 (c), we are going to apply the FLSD method of mean separation.

Treatment    | Yield mean
A            | 32.00
B            | 37.58
C            | 39.60
D            | 40.42
E            | 40.02
F            | 40.80
FLSD (0.05)  | 1.55

$$FLSD = t_\alpha \times \sqrt{\frac{2S^2}{r}} = 2.06 \times \sqrt{\frac{2 \times 1.39}{5}} = 2.06 \times 0.75 = 1.55$$

N.B. A clear interpretation of the table above must be given to earn full marks.

Foot note: If the difference between two treatment means is greater than the FLSD value, then those treatments are different from each other. Otherwise they are the same at the 5% level of significance.

PRACTICE QUESTIONS

1. Differentiate between correlation and regression analysis.
a. Using scatter diagrams, briefly explain correlation and regression.
b. Mention three assumptions of linear regression.

2. In an experiment to determine the relationship between plasma volume (PV) and body weight (BW) of eight members of Benue State University management staff in June 2017, the result below was obtained:

MGT staff | BW          | PV
1         | [illegible] | [illegible]
2         | 70.0        | 2.86
3         | 74.0        | 3.37
4         | 63.5        | 2.76
5         | 62.0        | 2.62
6         | 70.5        | 3.49
7         | 71.0        | 3.05
8         | 66.0        | 3.12

Determine and comment on the correlation coefficient.

3. a. Explain the terms used in the linear regression equation below:
Y = α + βX
b.
Plot a linear regression graph and show on it how α and β can be obtained. Given that the values α = 0.0857 and β = 0.043615 depict yield in Arachis hypogaea, determine the yield values in 2011 and 2012 for [illegible]0 and 70, respectively.

4. An experiment was conducted to determine the relationship between packed cell volume (X) and red blood cell count (Y) in 10 days. The following data were obtained: ΣX = 455, ΣY = 73.3 and β = 0.176. Find the prediction equation and sketch it graphically. Predict the red blood cell counts at packed cell volumes of 42, 35 and 50.

5. What do the following mean in statistics?
i. 0.05
ii. 0.01
5b. From the experimental data presented below, find if there are differences in means by making a decision statement at the 5% level of significance (F tab = [illegible], tα = 2.18).

Replicates | Variety 1 | Variety 2 | Variety 3 | Variety 4
I          | 2.0       | 1.0       | 3.0       | 2.0
II         | 1.0       | 2.4       | 2.6       | 3.0
III        | 4.0       | 2.0       | 5.0       | 1.0

6. (a) Define mean separation.
(b) State the two commonly used methods of mean separation, clearly stating the differences between the methods.

7. The following is a partial ANOVA table:

Source    | df          | SS          | MS          | F cal       | F tab (0.05)
Treatment | 2           | -           | -           | -           | -
Error     | [illegible] | [illegible] | 20          | [illegible] |
Total     | [illegible] | 300         |             |             |

Complete the table and answer the following questions:
a. How many treatments are there?
b. How many times were the treatments replicated?
c. What is your conclusion regarding the null hypothesis?

8. Six levels of nitrogen fertilizer were applied on thirty sugar beet plots numbered in sequence and arranged in a completely randomized design to determine the effect of the nitrogen on the leaf length. The table below shows the sequence of application:

1-[illegible] D (40.9) | 7-96-29 F (41.0) | 13-64-21 E (39.2) | 19-20-07 B (37.5) | 25-25-08 B (38.4)
2-97-30 [illegible] | 8-51-15 C (39.5) | 14-[illegible] [illegible] | 20-73-23 E (38.7) | 26-60-19 D [illegible]
3-42-11 C (40.9) | 9-74-24 E (39.7) | 15-62-20 D (39.4) | 21-44-12 C (39.8) | 27-[illegible] F (39.8)
4-07-02 A (31.3) | 10-79-25 E (40.6) | 16-28-09 B (38.8) | 22-01-01 A (32.2) | 28-15-04 A (33.9)
5-49-14 C (39.2) | 11-13-03 A (29.2) | 17-92-27 F (41.1) | 23-31-10 B (37.4) | 29-53-17 D (40.0)
6-14-05 A (33.4) | 12-85-26 F (41.5) | 18-45-13 C (38.6) | 24-17-06 B (35.8) | 30-65-22 [illegible]

Key: leaf length is shown in parentheses.
i. Determine if the leaf lengths of the sugar beets are the same or not.
ii. Design the ANOVA table for the above experiment.
iii. Is there any difference in treatment means at the 5% level of significance?

9. [illegible] judges scored four products. Given the following: SS block = 1.5, SS treatment = 13.3, SS total = 68; and the product means: A = 7.8, B = 9.0, C = 5.4 and D = [illegible].
a. Produce the ANOVA table.
b. Separate the means using the FLSD procedure. Take F tab (block) = 3.26, F tab (treatment) = 3.49, tα = 2.18.

10. Complete the table below and calculate the correlation coefficient of variables x and y.

[Table of x and y values with derived columns; the entries are not recoverable from the scan.]

11. [The opening of this question is illegible in the scan.]

Shoot dry weights (mg)

Control | Acetic | Propionic    | Butyric
4.23    | 3.85   | 3.15         | 3.66
4.38    | 3.78   | 3.65         | 3.67
4.10    | 3.91   | 3.82         | 3.62
3.99    | 3.94   | 3.69         | 3.54
4.35    | 3.86   | [illegible]  | [illegible]

Perform an analysis of variance and draw an appropriate conclusion about the treatment means.

12. A mathematics teacher recorded the length of time, y minutes, taken to travel to school when leaving home x minutes after 7 am on seven selected mornings. The results are as follows:

x | 0  | 10 | 20 | 30 | 40 | 50 | 60
y | 16 | 27 | 28 | 39 | 39 | 48 | 51

a. Plot the data on a scatter diagram.
b. Calculate the equation of the regression line of y on x, writing your answer in the form y = α + βx.
c. The maths teacher needs to arrive at school no later than 8:40 am. The number of minutes by which the maths teacher arrives early at school when leaving home x minutes after 7 am is denoted by z.
i.
Deduce that z = (100 - α) - (1 + β)x
ii. Hence estimate, to the nearest minute, the latest time that the maths teacher can leave home without arriving late at school.
