A Text Book of RESEARCH METHODOLOGY AND BIOSTATISTICS

Strictly as per the syllabus prescribed for M. Pharmacy Semester-III by the Pharmacy Council of India, New Delhi

M. Pharm.; M. Pharm., Ph.D., FLSL; Associate Professor; Department of Pharmaceutics; Department of Pharmacognosy; Swami Vivekanand College of Pharmacy; Oriental University, Indore (M.P.); Executive Editor, International Journal of Pharmacy & Life Sciences

M. Pharm., Ph.D., FAPP, FICPHS, FSRHCP, FRSH, FSPER; Associate Professor; Professor; Department of Pharmacognosy, Smt. S. S. Patil College of Pharmacy, Chopda, Maharashtra; Joint Secretaries, SPER Central Branch; President, IPN/APP/IRSH/SRHCP Maharashtra State Branch; College of Pharmacy, Indore (M.P.)

CONTENTS

Unit-I General Research Methodology
1.1 Research & Research Methodology
1.2 Objectives of research in research methodology
1.3 Requirements of Research Methodology
1.4 Practical difficulties
1.5 Review of literature
1.6 Study design
1.7 Types of studies
1.8 Strategies to eliminate errors/bias
1.9 Controls, randomization, crossover design
1.10 Placebo, blinding techniques

Unit-II Biostatistics
2.1 Definition
2.2 Application
2.3 Sample size, importance of sample size, factors influencing sample size, dropouts
2.4 Statistical tests of significance & type of significance tests
2.5 Parametric tests
2.6 Non-parametric tests, null hypothesis, P values, degree of freedom, interpretation of P values
2.7 p-Value

Unit-III Medical Research
3.1 Medical Research
3.2 Values in medical ethics
3.3 Conflicts between autonomy and beneficence/non-maleficence
3.4 Euthanasia
3.5 Informed consent
3.6 Confidentiality
3.7 Criticisms of orthodox medical ethics
3.8 Importance of communication
3.9 Control resolution
3.10 Guidelines
3.11 Ethics committees
3.12 Cultural concerns
3.13 Truth telling
3.14 Online business practices
3.15 Conflicts of interest
3.16 Referral
3.17 Vendor relationships
3.18 Treatment of family members
3.19 Sexual relationships
3.20 Fatality (Futility)

Unit-IV CPCSEA Guidelines for Laboratory Animal Facility
4.1 Goal
4.2 Veterinary care
4.3 Quarantine, stabilization and separation
4.4 Surveillance, diagnosis, treatment and control of disease
4.5 Personal hygiene
4.6 Animal experimentation involving hazardous agents
4.7 Multiple surgical procedures on single animal
4.8 Durations of experiments
4.9 Physical restraint
4.10 Functional areas
4.11 Environment
4.12 Animal husbandry
4.13 Activity
4.14 Food
4.15 Bedding
4.16 Water
4.17 Sanitation and cleanliness
4.18 Assessing the effectiveness of sanitation
4.19 Waste disposal
4.20 Pest control
4.21 Emergency, weekend and holiday care
4.22 Record keeping
4.23 Standard operating procedures (SOPs) / guidelines
4.24 Personnel and training
4.25 Transport of laboratory animals
4.26 Anaesthesia and euthanasia
4.27 Laboratory animal ethics

Unit-V Declaration of Helsinki
5.1 History
5.2 Introduction
5.3 Basic principles for all medical research
5.4 Additional principles for medical research combined with medical care

Bibliography

SYLLABUS

Research, objective, requirements, practical difficulties, review of literature, study design, types of studies, strategies to eliminate errors/bias, controls, randomization, crossover design, placebo, blinding techniques.

Definition, application, sample size, importance of sample size, factors influencing sample size, dropouts, statistical tests of significance, type of significance tests, parametric tests (Student's "t" test, ANOVA, correlation coefficient, regression), non-parametric tests (Wilcoxon rank tests, analysis of variance, correlation, chi square test), null hypothesis, P values, degree of freedom, interpretation of P values.

History, introduction, basic principles for all medical research, and additional principles for medical research combined with medical care.

UNIT-I GENERAL RESEARCH METHODOLOGY

1.1 RESEARCH & RESEARCH METHODOLOGY

Research is defined as a search for knowledge. It is also defined as a scientific and systematic search for pertinent information on a specific topic. In fact, research is an art of scientific investigation. The Advanced Learner's Dictionary of Current English lays down the meaning of research as "a careful investigation or inquiry especially through search for new facts in any branch of knowledge."

Research methodology is the set of specific procedures or techniques used to identify, select, process, and analyze information about a topic. In a research paper, the methodology section allows the reader to critically evaluate a study's overall validity and reliability.

Research may be very broadly defined as the systematic gathering of data and information and its analysis for the advancement of knowledge in any subject. Research attempts to find answers to intellectual and practical questions through the application of systematic methods.
Webster's Collegiate Dictionary defines research as "studious inquiry or examination; esp: investigation or experimentation aimed at the discovery and interpretation of facts, revision of accepted theories or laws in the light of new facts, or practical application of such new or revised theories or laws".

Some people consider research as a movement, a movement from the known to the unknown. It is actually a voyage of discovery. We all possess the vital instinct of inquisitiveness for, when the unknown confronts us, we wonder, and our inquisitiveness makes us probe and attain fuller and fuller understanding of the unknown. This inquisitiveness is the mother of all knowledge, and the method which man employs for obtaining knowledge of the unknown can be termed research.

Research is an academic activity and as such the term should be used in a technical sense. According to Clifford Woody, research comprises defining and redefining problems; formulating hypotheses or suggested solutions; collecting, organizing and evaluating data; making deductions and reaching conclusions; and at last carefully testing the conclusions to determine whether they fit the formulated hypothesis. D. Slesinger and M. Stephenson in the Encyclopaedia of Social Sciences define research as "the manipulation of things, concepts or symbols for the purpose of generalizing to extend, correct or verify knowledge, whether that knowledge aids in construction of theory or in the practice of an art."

Research is, thus, an original contribution to the existing stock of knowledge making for its advancement. It is the pursuit of truth with the help of study, observation, comparison and experiment. In short, the search for knowledge through an objective and systematic method of finding a solution to a problem is research. The systematic approach concerning generalization and the formulation of a theory is also research. As such, the term research refers to the systematic method consisting of enunciating the problem, formulating a hypothesis, collecting the facts or data, analyzing the facts and reaching certain conclusions, either in the form of solution(s) to the concerned problem or in certain generalizations for some theoretical formulation.

1.1.1 NEED FOR RESEARCH

The purpose of research is to discover answers to questions through the application of scientific procedures. The main aim of research is to find out the truth which is hidden and which has not been discovered as yet. The main needs and objectives are as mentioned below:
* To gain familiarity with a phenomenon or to achieve new insights into it.
* To portray accurately the characteristics of a particular individual, situation or group.
* To determine the frequency with which something occurs or with which it is associated with something else.
* To test a hypothesis of a causal relationship between variables.

1.1.2 TYPES OF RESEARCH

The basic types of research are as follows:
* Descriptive vs. Analytical.
* Applied vs. Fundamental.
* Quantitative vs. Qualitative.
* Conceptual vs. Empirical.

1.2 OBJECTIVES OF RESEARCH IN RESEARCH METHODOLOGY

* To gain familiarity with a phenomenon or to achieve new insights into it (studies with this object in view are termed exploratory or formulative research studies);
* To portray accurately the characteristics of a particular individual, situation or group (studies with this object in view are known as descriptive research studies);
* To determine the frequency with which something occurs or with which it is associated with something else (studies with this object in view are known as diagnostic research studies);
* To test a hypothesis of a causal relationship between variables (such studies are known as hypothesis-testing research studies).
1.3 REQUIREMENTS OF RESEARCH METHODOLOGY

The main requirement of research methodology is that it provides knowledge of the characteristics of the different methods and of how to apply them to a given subject area, as well as the ability to consider any given study or data set critically on the basis of the method of data collection and analysis applied.

In a thesis or dissertation, you have to discuss the methods you used to do the research. The methodology or methods section explains what you did and how you did it, allowing readers to evaluate the reliability and validity of the research. It should include:
* The type of research you did.
* How you collected and/or selected your data.
* How you analyzed your data.
* Any tools or materials you used in the research.
* Your rationale for choosing these methods.

Experiments need to be designed in order to obtain fruitful and well organized research outcomes. The motives for undertaking research include:
* Desire to get a research degree along with its consequential benefits.
* Desire to face the challenge in solving unsolved problems, i.e., concern over practical problems initiates research.
* Desire to get the intellectual joy of doing some creative work.
* Desire to be of service to society.
* Desire to get respectability.

1.4 PRACTICAL DIFFICULTIES

Researchers face several practical problems. Some of the important problems are as follows:
* The lack of scientific training in the methodology of research is a great impediment for researchers in our country. There is a paucity of competent researchers. Many researchers take a leap in the dark without knowing research methods. Most of the work which goes in the name of research is not methodologically sound. Research, to many researchers and even to their guides, is mostly a scissors-and-paste job without any insight shed on the collated materials. The consequence is obvious, viz., the research results quite often do not reflect the reality or realities. Thus, a systematic study of research methodology is an urgent necessity.
Before undertaking research projects, researchers should be well equipped with all the methodological aspects. As such, efforts should be made to provide short-duration courses for meeting this requirement.
* There is insufficient interaction between the university research departments on one side and business establishments, government departments and research institutions on the other side. A great deal of primary data of a non-confidential nature remains untouched/untreated by researchers for want of proper contacts. Efforts should be made to develop satisfactory liaison among all concerned for better and more realistic research.
* Business units are often reluctant to supply information, and this secrecy proves an impermeable barrier to researchers. Thus, there is the need for generating the confidence that the information/data obtained from a business unit will not be misused.
* Research studies overlapping one another are undertaken quite often for want of adequate information. This can be addressed by the compilation and revision, at regular intervals, of a list of subjects on which, and the places where, research is going on. Due attention should be given to the identification of research problems in various disciplines of applied science which are of immediate concern to the industries.
* There does not exist a code of conduct for researchers, and inter-university and interdepartmental rivalries are also quite common. Hence, there is need for developing a code of conduct for researchers which, if adhered to sincerely, can win over this problem.
* Many researchers in our country also face the difficulty of obtaining adequate and timely secretarial assistance, including computer assistance. This causes unnecessary delays in the completion of research studies. All possible efforts should be made in this direction so that efficient secretarial assistance is made available to researchers, and that too well in time. The University Grants Commission must play a dynamic role in solving this difficulty.
* Library management and functioning is not satisfactory at many places, and much of the time and energy of researchers is spent in tracing out the books, journals, reports, etc., rather than in tracing out relevant material from them. There is also the problem that many of our libraries are not able to get copies of old and new Acts/Rules, reports and other government publications in time. This problem is felt more in libraries located away from Delhi and/or the state capitals. Thus, efforts should be made to ensure the regular and speedy supply of all governmental publications to our libraries.
* There is also the difficulty of timely availability of published data from the various government and other agencies doing this job in our country. Researchers also face the problem that published data vary quite significantly because of differences in coverage by the agencies concerned. At times there may also be problems of conceptualization and problems relating to the process of data collection and related matters.

1.5 REVIEW OF LITERATURE

A literature review surveys books, scholarly articles, and any other sources relevant to a particular issue, area of research, or theory, and by so doing provides a description, summary, and critical evaluation of these works in relation to the research problem being investigated. Literature reviews are designed to provide an overview of the sources you have explored while researching a particular topic and to demonstrate to your readers how your research fits within a larger field of study.

A literature review may consist of simply a summary of key sources, but in the social sciences a literature review usually has an organizational pattern and combines both summary and synthesis, often within specific conceptual categories.
A summary is a recap of the important information of the source, but a synthesis is a re-organization, or a reshuffling, of that information in a way that informs how you are planning to investigate a research problem. The analytical features of a literature review might:
* Give a new interpretation of old material or combine new with old interpretations,
* Trace the intellectual progression of the field, including major debates,
* Depending on the situation, evaluate the sources and advise the reader on the most pertinent or relevant research, or
* Usually in the conclusion of a literature review, identify where gaps exist in how a problem has been researched to date.

The purpose of a literature review is to:
* Place each work in the context of its contribution to understanding the research problem being studied.
* Describe the relationship of each work to the others under consideration.
* Identify new ways to interpret prior research.
* Reveal any gaps that exist in the literature.
* Resolve conflicts amongst seemingly contradictory previous studies.
* Identify areas of prior scholarship to prevent duplication of effort.
* Point the way in fulfilling a need for additional research.
* Locate your own research within the context of existing literature (very important).

1.6 STUDY DESIGN

The various steps involved in a research process are not mutually exclusive; nor are they separate and distinct. They do not necessarily follow each other in any specific order, and the researcher has to be constantly anticipating, at each step in the research process, the requirements of the subsequent steps. The following order of steps provides a useful procedural guideline regarding the research process:
(1) Formulating the research problem.
(2) Extensive literature survey.
(3) Developing the hypothesis.
(4) Preparing the research design.
(5) Determining sample design.
(6) Collecting the data.
(7) Execution of the project.
(8) Analysis of data.
(9) Hypothesis testing.
(10) Generalizations and interpretation.
(11) Preparation of the report or presentation of the results, i.e., formal write-up of conclusions reached.

A brief description of the steps mentioned above is given below.

Formulating the research problem
The best way of understanding the problem is to discuss it with one's own colleagues or with those having some expertise in the matter. In an academic institution the researcher can seek help from a guide, who is usually an experienced person and has several research problems in mind.

Extensive literature survey
After the problem is formulated, a brief summary of it should be written down. It is compulsory for a research worker writing a thesis for a Ph.D. degree to write a synopsis of the topic and submit it to the necessary Committee or the Research Board for approval.

Development of working hypotheses
After an extensive literature survey, the researcher should state in clear terms the working hypothesis or hypotheses. Working hypotheses can be developed by using the following approach:
(a) Discussions with colleagues and experts about the problem, its origin and the objectives in seeking a solution.
(b) Examination of data and records, if available, concerning the problem for possible trends, peculiarities and other clues.
(c) Review of similar studies in the area or of studies on similar problems.
(d) Exploratory personal investigation, which involves original field interviews on a limited scale with interested parties and individuals with a view to securing greater insight into the practical aspects of the problem.

Preparing the research design
The research problem having been formulated in clear-cut terms, the researcher will be required to prepare a research design, i.e., he will have to state the conceptual structure within which the research would be conducted. Research purposes may be grouped into four categories, viz.
(i) Exploration
(ii) Description
(iii) Diagnosis
(iv) Experimentation

1.7 TYPES OF STUDIES

[Figure: algorithm for classifying types of studies. Experimental studies are split by "Random allocation?" into randomised controlled trials and non-randomised controlled trials. Observational studies are split by "Comparison group?" into analytical studies and descriptive studies. Analytical studies are further split by "Direction?" of exposure and outcome, including the case of exposure and outcome assessed at the same time.]

If the calculated value of F is greater than the table value of F (F-cal > F-tab), the null hypothesis is rejected and the difference between the standard deviations is significant, which means that the two samples under test cannot be supposed to be parts of the same population.

Test of Significance: Type # 3. Fisher's Z-Test or Z-Test

The Z-test is based on the normal probability distribution and is used for testing the significance of several measures. The relevant test statistic is worked out and compared with its probable value (to be read from the table showing the area under the normal curve) at a given level of significance in order to judge the significance of the measure concerned.

The Z-test is generally used to compare the mean of a large sample with the hypothetical mean of the population.

$$ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

Where,
$\bar{x}$ = Sample mean
$\mu$ = Hypothetical mean of population
$\sigma$ = Standard deviation
$n$ = Total number of observations

Test of Significance: Type # 4. χ²-Test (Chi-Square Test)

The chi-square test (named after the Greek letter χ, pronounced "ki") is a statistical method of testing significance which was worked out by Karl Pearson.

Any biological study is based on a limited number of individuals which constitute a sample. A sample is a small part of a population.
If a sample is drawn from a population at random, each individual of the population is given an equal opportunity to be included in the sample, and so the properties of the sample will reflect the properties of the population of which it is a part. Thus, the various statistics, like mean, ratios and variance, calculated from a random sample are estimates of those of the parent population. The mean, ratios and variance of a sample may be close to, but are seldom equal to, those of the parent population.

In a cross between tall and dwarf pea plants Mendel found 787 tall and 277 dwarf plants out of 1064 plants in the F2 generation. This yielded a ratio of 2.84 tall : 1 dwarf, which was close to, but numerically slightly different from, the expected ratio of 3 tall : 1 dwarf for the imaginary population. Such deviations are bound to occur in all biological situations like this. Thus, there should be some objective criterion to decide whether a set of observed data is in accordance with a specified or expected ratio. In other words, the χ²-test is applied to test the goodness of fit of the frequency of observed data with the expected or specified ratio.

The χ²-test is generally applied to enumeration data and rarely to measurement data. It involves the following steps.

1. Formation of a null hypothesis
In this, it is presumed that the observed data are not different from the specified or expected ratio, i.e., the deviation of the observed data from the expected ratio is not real but is due to chance only.

2. Test statistic or computation of the value of χ²
After the formation of the null hypothesis, the deviation of the observed data from the expected data will be considered due to chance only. Calculation of the value of χ² from the observed data is made in the following way:

(i) The observed frequencies of the different classes are arranged sequentially, e.g., Tall 787 and Dwarf 277 in the F2 generation of Tall × Dwarf.
(ii) The expected frequencies of the different classes are calculated on the basis of the expected ratio (as stated in the null hypothesis) and the total number of observed values. For example, the expected ratio of Tall to Dwarf is 3:1 and the total number of observations is 1064 plants. This ratio specifies that out of every 4 (3 + 1) plants, 3 (i.e., 3/4) will be tall and one (i.e., 1/4) will be dwarf. Therefore, out of the total of 1064 plants, three-fourths (1064 × 3/4 = 798) should be tall and one-fourth (1064 × 1/4 = 266) should be dwarf.

Likewise, in a dihybrid experiment involving Tall, Red flowered × Dwarf, White flowered plants, the expected ratio for F2 plants is Tall, Red : Tall, White : Dwarf, Red : Dwarf, White = 9 : 3 : 3 : 1. If the total number of plants in the F2 generation is 556, the distribution of frequencies for the different classes will be calculated in the following way:

| Class | Ratio | Expected frequency |
| Tall, Red | 9 or 9/16 | 9/16 of 556 = 556 × 9/16 = 312.75 |
| Tall, White | 3 or 3/16 | 3/16 of 556 = 556 × 3/16 = 104.25 |
| Dwarf, Red | 3 or 3/16 | 3/16 of 556 = 556 × 3/16 = 104.25 |
| Dwarf, White | 1 or 1/16 | 1/16 of 556 = 556 × 1/16 = 34.75 |
| Total | 9 + 3 + 3 + 1 = 16 | 556 |

(iii) The deviation is obtained by subtracting the expected frequency from the observed frequency of the respective class (deviation = observed frequency − expected frequency, or d = o − e). The value of the deviation may have either a positive (+) or a negative (−) sign. Thus, in the above example of the monohybrid cross, the deviation value for tall plants is 787 − 798 = −11 and that for dwarf plants is 277 − 266 = +11.

(iv) The deviation value (d) of each category is then squared so that all values become positive.
(v) The importance of a deviation depends on its magnitude as well as on the value of the expected frequency from which the deviation has occurred. Suppose that the magnitudes of the deviations in the two classes of data are equal, say 11 in each class of the cross Tall × Dwarf, but the expected frequencies of the two classes are different, say 798 for one class (Tall) and 266 for the other (Dwarf). Thus there is disparity in the importance of the deviations due to the unequal expected frequencies of the different classes of data. In order to remove this disparity, the deviation squares are divided by the expected frequencies of the respective classes, giving (o − e)²/e. All the numbers (quotients) so obtained are based on an expected frequency of one and are, therefore, equal in importance.

(vi) The sum (total) of the quotients of the different classes yields the calculated value of chi-square. Thus chi-square may be defined as the total of the quotients obtained by dividing the deviation squares of the different classes by their respective expected frequencies. This is expressed as follows:

$$ \chi^2 = \sum \frac{d^2}{e} = \sum \frac{(o - e)^2}{e} $$

Where,
$\sum$ = Summation
$o$ = Observed frequency of the class
$e$ = Expected frequency of the class
$d$ = Deviation, or difference between the observed frequency and the expected frequency ($d = o - e$)

(vii) Degree of freedom
It is one less than the total number of classes and is expressed as d.f. = n − 1, where n = number of classes.

Table 3. The χ² Distribution

| Degree of freedom | P = 0.99 | P = 0.95 | P = 0.05 | P = 0.01 | P = 0.001 |
| 1 | 0.000157 | 0.00393 | 3.841 | 6.635 | 10.83 |
| 2 | 0.020 | 0.103 | 5.991 | 9.210 | 13.82 |
| 3 | 0.115 | 0.352 | 7.815 | 11.345 | 16.27 |
| 4 | 0.297 | 0.711 | 9.488 | 13.28 | 18.47 |
| 5 | 0.554 | 1.145 | 11.070 | 15.09 | 20.52 |
| 6 | 0.872 | 1.635 | 12.592 | 16.81 | 22.46 |
| 7 | 1.239 | 2.167 | 14.067 | 18.48 | 24.32 |
| 8 | 1.646 | 2.733 | 15.507 | 20.09 | 26.12 |
| 9 | 2.088 | 3.325 | 16.919 | 21.67 | 27.88 |
| 10 | 2.558 | 3.940 | 18.307 | 23.21 | 29.59 |
| 11 | 3.053 | 4.575 | 19.675 | 24.73 | 31.26 |
| 12 | 3.571 | 5.226 | 21.026 | 26.22 | 32.91 |
| 13 | 4.107 | 5.892 | 22.362 | 27.69 | 34.53 |
| 14 | 4.660 | 6.571 | 23.685 | 29.14 | 36.12 |
| 15 | 5.229 | 7.261 | 24.996 | 30.58 | 37.70 |
| 16 | 5.812 | 7.962 | 26.296 | 32.00 | 39.25 |
| 17 | 6.408 | 8.672 | 27.587 | 33.41 | 40.79 |
| 18 | 7.015 | 9.390 | 28.869 | 34.81 | 42.31 |
| 19 | 7.633 | 10.12 | 30.144 | 36.19 | 43.82 |
| 20 | 8.260 | 10.85 | 31.410 | 37.57 | 45.32 |
| 21 | 8.897 | 11.59 | 32.671 | 38.93 | 46.80 |
| 22 | 9.542 | 12.34 | 33.924 | 40.29 | 48.27 |
| 23 | 10.20 | 13.09 | 35.172 | 41.64 | 49.73 |
| 24 | 10.86 | 13.85 | 36.415 | 42.98 | 51.18 |
| 25 | 11.52 | 14.61 | 37.652 | 44.31 | 52.62 |
| 26 | 12.20 | 15.38 | 38.885 | 45.64 | 54.05 |
| 27 | 12.88 | 16.15 | 40.113 | 46.96 | 55.48 |
| 28 | 13.56 | 16.93 | 41.337 | 48.28 | 56.89 |
| 29 | 14.26 | 17.71 | 42.557 | 49.59 | 58.30 |
| 30 | 14.95 | 18.49 | 43.773 | 50.89 | 59.70 |

(viii) Table value of χ²
It is obtained from the χ² table (Table 3) at the given degree of freedom and the 1% (0.01) or 5% (0.05) level of probability. The chi-square table lists the maximum values of chi-square obtainable purely due to chance at different probability levels and different degrees of freedom. These serve as points of reference while deciding whether the calculated value is due to real or chance deviation. The table value of χ² depends on the degree of freedom and the probability level.

(ix) Level of significance
At one degree of freedom, chance alone produces a χ² value of 3.84 or more in only 5% (0.05) of experiments; in the remaining 95% (0.95) of experiments chance alone gives a value below 3.84. A calculated χ² value of 3.84 or more is therefore taken to indicate a real deviation of the observed data from the expected data at the 5% level of significance.

After determining the χ² value from the observed data, the calculated value of χ² is compared with the table value of χ².

(a) If the calculated value of χ² is less than the table value at 5% probability for the given degree of freedom, the deviations of the observed frequencies from the expected frequencies are not significant and are accepted to be purely due to chance.
Therefore, the null hypothesis is accepted and it is inferred that the observed data are in accordance with the expected ratio.

(b) If the calculated value of χ² is equal to or greater than the table value of χ² at the 5% level of probability and the given degree of freedom, then the deviations of the observed data from the expected values are statistically significant. In such a case the null hypothesis is rejected and it is concluded that the observed data are not in accordance with the expected ratio.

2.5 PARAMETRIC TESTS (Student's "t" test, ANOVA, correlation coefficient, regression)

Student's t-test is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown.

In 1908 William Sealy Gosset, an Englishman publishing under the pseudonym Student, developed the t-test and the t distribution. The t distribution is a family of curves in which the number of degrees of freedom (the number of independent observations in the sample minus one) specifies a particular curve. As the sample size (and thus the degrees of freedom) increases, the t distribution approaches the bell shape of the standard normal distribution. In practice, for tests involving the mean of a sample of size greater than 30, the normal distribution is usually applied.

It is usual first to formulate a null hypothesis, which states that there is no effective difference between the observed sample mean and the hypothesized or stated population mean, i.e., that any measured difference is due only to chance.
In an agricultural study, for example, the null hypothesis could be that an application of fertilizer has had no effect on crop yield, and an experiment would be performed to test whether it has increased the harvest. In general, a t-test may be either two-sided (also termed two-tailed), stating simply that the means are not equivalent, or one-sided, specifying whether the observed mean is larger or smaller than the hypothesized mean. The test statistic t is then calculated. If the observed t-statistic is more extreme than the critical value determined by the appropriate reference distribution, the null hypothesis is rejected. The appropriate reference distribution for the t-statistic is the t distribution. The critical value depends on the significance level of the test (the probability of erroneously rejecting the null hypothesis).

For example, suppose a researcher wishes to test the hypothesis that a sample of size n = 25 with mean $\bar{x}$ = 79 and standard deviation s = 10 was drawn at random from a population with mean μ = 75 and unknown standard deviation.

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{79 - 75}{10 / \sqrt{25}} = 2 $$

Using the formula for the t-statistic, the calculated t equals 2. For a two-sided test at the common level of significance α = 0.05, the critical values from the t distribution on 24 degrees of freedom are −2.064 and 2.064. The calculated t does not exceed these values, hence the null hypothesis cannot be rejected with 95 percent confidence. (The confidence level is 1 − α.)

A second application of the t distribution tests the hypothesis that two independent random samples have the same mean. The t distribution can also be used to construct confidence intervals for the true mean of a population (the first application) or for the difference between two sample means (the second application).
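The one-sample calculation in the worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `one_sample_t` is a hypothetical helper introduced here for illustration, and the critical value 2.064 is copied from the text rather than computed from the t distribution.

```python
import math


def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """Return the one-sample t-statistic and its degrees of freedom."""
    standard_error = sample_sd / math.sqrt(n)  # s / sqrt(n)
    t_stat = (sample_mean - pop_mean) / standard_error
    return t_stat, n - 1


# Values from the worked example: n = 25, mean = 79, s = 10, mu = 75.
t_stat, df = one_sample_t(sample_mean=79, pop_mean=75, sample_sd=10, n=25)

# Two-sided critical value for alpha = 0.05 on 24 degrees of freedom,
# taken from the text (an assumption of this sketch, not computed here).
T_CRITICAL = 2.064
reject_null = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

Because |t| = 2 is smaller than 2.064, the sketch reports that the null hypothesis is not rejected, in agreement with the conclusion above.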
2.5.2 CORRELATION COEFFICIENT

Definition

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related: taller people tend to be heavier than shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people's weights is related to their heights.

Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also suspect there are correlations, but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.

Techniques in determining correlation

There are several different correlation techniques. The Survey System's optional Statistics Module includes the most common type, called the Pearson or product-moment correlation. The module also includes a variation on this type called partial correlation. The latter is useful when you want to look at the relationship between two variables while removing the effect of one or two other variables.

Like all statistical techniques, correlation is only appropriate for certain kinds of data. Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.

Correlation coefficient

The main result of a correlation is called the correlation coefficient (or "r"). It ranges from −1.0 to +1.0. The closer r is to +1 or −1, the more closely the two variables are related. If r is close to 0, it means there is little or no linear relationship between the variables.
If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them makes them easier to understand. The square of the coefficient (or r square) is equal to the percent of the variation in one variable that is related to the variation in the other. After squaring r, ignore the decimal point. An r of .5 means 25% of the variation is related (.5 squared = .25). An r value of .7 means 49% of the variance is related (.7 squared = .49).

A correlation report can also show a second result of each test: statistical significance. In this case, the significance level will tell you how likely it is that the correlations reported may be due to chance in the form of random sampling error. If you are working with small sample sizes, choose a report format that includes the significance level. This format also reports the sample size.

Karl Pearson's coefficient of correlation

Karl Pearson's coefficient of correlation (or simple correlation) is the most widely used method of measuring the degree of relationship between two variables. This coefficient assumes the following:
(i) That there is a linear relationship between the two variables.
(ii) That the two variables are causally related, which means that one of the variables is independent and the other one is dependent.
(iii) That a large number of independent causes are operating in both variables so as to produce a normal distribution.

Karl Pearson's coefficient of correlation can be worked out thus.
Karl Pearson's coefficient of correlation (or r):

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n \, \sigma_X \, \sigma_Y}$$

Alternatively, the formula can be written as:

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}}$$

Where;
X_i = i-th value of X variable
\bar{X} = mean of X
Y_i = i-th value of Y variable
\bar{Y} = mean of Y
n = number of pairs of observations of X and Y
\sigma_X = standard deviation of X (computed with divisor n)
\sigma_Y = standard deviation of Y (computed with divisor n)

In case we use assumed means (A_x and A_y for variables X and Y respectively) in place of true means, then Karl Pearson's formula is reduced to:

$$r = \frac{\dfrac{\sum dx_i\,dy_i}{n} - \left(\dfrac{\sum dx_i}{n}\right)\left(\dfrac{\sum dy_i}{n}\right)}{\sqrt{\dfrac{\sum dx_i^2}{n} - \left(\dfrac{\sum dx_i}{n}\right)^2}\;\sqrt{\dfrac{\sum dy_i^2}{n} - \left(\dfrac{\sum dy_i}{n}\right)^2}}$$

Where
\sum dx_i = \sum (X_i - A_x)
\sum dy_i = \sum (Y_i - A_y)
\sum dx_i^2 = \sum (X_i - A_x)^2
\sum dy_i^2 = \sum (Y_i - A_y)^2
\sum dx_i\,dy_i = \sum (X_i - A_x)(Y_i - A_y)
n = number of pairs of observations of X and Y.

This is the short cut approach for finding 'r' in case of ungrouped data. If the data happen to be grouped data (i.e., the case of bivariate frequency distribution), we shall have to write Karl Pearson's coefficient of correlation as under:

$$r = \frac{\dfrac{\sum f_{ij}\,dx_i\,dy_j}{n} - \left(\dfrac{\sum f_i\,dx_i}{n}\right)\left(\dfrac{\sum f_j\,dy_j}{n}\right)}{\sqrt{\dfrac{\sum f_i\,dx_i^2}{n} - \left(\dfrac{\sum f_i\,dx_i}{n}\right)^2}\;\sqrt{\dfrac{\sum f_j\,dy_j^2}{n} - \left(\dfrac{\sum f_j\,dy_j}{n}\right)^2}}$$

Where; f_{ij} is the frequency of a particular cell in the correlation table, f_i and f_j are the corresponding marginal (row and column) frequencies, and all other values are defined as earlier.

Karl Pearson's coefficient of correlation is also known as the product moment correlation coefficient. The value of 'r' lies between -1 and +1. Positive values of r indicate positive correlation between the two variables (i.e., changes in both variables take place in the same direction), whereas negative values of 'r' indicate negative correlation, i.e., changes in the two variables take place in opposite directions. A zero value of 'r' indicates that there is no linear association between the two variables. When r = (+) 1, it indicates perfect positive correlation and when it is (-) 1, it indicates perfect negative correlation, meaning thereby that variations in the independent variable (X) explain 100% of the variations in the dependent variable (Y). We can also say that for a unit change in the independent variable, if there happens to be a constant change in the dependent variable in the same direction, then correlation will be termed as perfect positive.
But if such change occurs in the opposite direction, the correlation will be termed as perfect negative. The value of 'r' nearer to +1 or -1 indicates a high degree of correlation between the two variables.

Multiple correlations

When there are two or more than two independent variables, the analysis concerning the relationship is known as multiple correlation and the equation describing such relationship as the multiple regression equation. We here explain multiple correlation and regression taking only two independent variables and one dependent variable (convenient computer programs exist for dealing with a great number of variables). In this situation the results are interpreted as shown below.

The multiple regression equation assumes the form

$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$

Where; X_1 and X_2 are two independent variables and Y is the dependent variable, and the constants a, b_1 and b_2 can be found by solving the following three normal equations:

$$\sum Y_i = na + b_1 \sum X_{1i} + b_2 \sum X_{2i}$$

$$\sum X_{1i} Y_i = a \sum X_{1i} + b_1 \sum X_{1i}^2 + b_2 \sum X_{1i} X_{2i}$$

$$\sum X_{2i} Y_i = a \sum X_{2i} + b_1 \sum X_{1i} X_{2i} + b_2 \sum X_{2i}^2$$

(It may be noted that the number of normal equations would depend upon the number of independent variables. If there are 2 independent variables, then 3 equations are used; if there are 3 independent variables, then 4 equations, and so on.)

In multiple regression analysis, the regression coefficients (viz., b_1 and b_2) become less reliable as the degree of correlation between the independent variables (viz., X_1, X_2) increases. If there is a high degree of correlation between independent variables, we have what is commonly described as the problem of multicollinearity. In such a situation we should use only one set of the independent variable to make our estimate.
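The three normal equations above form a 3 x 3 linear system in a, b1 and b2, so they can be solved directly. The following is a minimal sketch under invented data; the variable names are ours, and the correlation between the two predictors is computed only as a simple warning sign for multicollinearity.

```python
import numpy as np

# Hypothetical data: Y depends on two independent variables X1 and X2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([8.1, 8.9, 15.2, 15.8, 22.1, 22.9])
n = len(y)

# Coefficient matrix and right-hand side of the three normal equations.
A = np.array([
    [n,        x1.sum(),        x2.sum()],
    [x1.sum(), (x1 ** 2).sum(), (x1 * x2).sum()],
    [x2.sum(), (x1 * x2).sum(), (x2 ** 2).sum()],
])
rhs = np.array([y.sum(), (x1 * y).sum(), (x2 * y).sum()])

a, b1, b2 = np.linalg.solve(A, rhs)

# Correlation between the predictors; values near +1 or -1 suggest multicollinearity.
r_x1x2 = np.corrcoef(x1, x2)[0, 1]
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}, r(X1, X2) = {r_x1x2:.3f}")
```

When X1 and X2 are almost perfectly correlated the matrix A becomes nearly singular, which is the numerical counterpart of the unreliable coefficients described in the text.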
In fact, adding a second variable, say X_2, that is correlated with the first variable, say X_1, distorts the values of the regression coefficients. Nevertheless, the prediction for the dependent variable can be made even when multicollinearity is present, but in such a situation enough care should be taken in selecting the independent variables to estimate a dependent variable so that multicollinearity is reduced to the minimum.

2.5.3 REGRESSION

Regression is the determination of a statistical relationship between two or more variables. In simple regression, we have only two variables: one variable (defined as independent) is the cause of the behaviour of another one (defined as the dependent variable). Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y. The basic relationship between X and Y is given by

$$\hat{Y} = a + bX$$

where \hat{Y} denotes the estimated value of Y for a given value of X. This equation is known as the regression equation of Y on X (it also represents the regression line of Y on X when drawn on a graph), which means that each unit change in X produces a change of b in Y, which is positive for direct and negative for inverse relationships.

The generally used method to find the 'best' fit that a straight line of this kind can give is the least-squares method. To use it efficiently, we first determine

$$\sum x_i^2 = \sum X_i^2 - n\bar{X}^2, \qquad \sum y_i^2 = \sum Y_i^2 - n\bar{Y}^2, \qquad \sum x_i y_i = \sum X_i Y_i - n\bar{X}\bar{Y}$$

Then

$$b = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad a = \bar{Y} - b\bar{X}$$

These measures define a and b which will give the best possible fit through the original X and Y points, and the value of r can then be worked out as under:

$$r = \frac{b\sqrt{\sum x_i^2}}{\sqrt{\sum y_i^2}}$$

Thus, regression analysis is a statistical method to deal with the formulation of a mathematical model depicting the relationship amongst variables, which can be used for the purpose of prediction of the values of the dependent variable, given the values of the independent variable.

Alternatively, for fitting a regression equation of the type \hat{Y} = a + bX to the given values of X and Y variables, we can find the values of the two constants, viz., a and b, by using the following two normal equations:

$$\sum Y_i = na + b\sum X_i$$

$$\sum X_i Y_i = a\sum X_i + b\sum X_i^2$$

and then solving these equations for a and b. Once these values are obtained and have been put in the equation \hat{Y} = a + bX, we say that we have fitted the regression equation of Y on X to the given data. In a similar fashion, we can develop the regression equation of X on Y, viz., \hat{X} = a + bY, presuming Y as the independent variable and X as the dependent variable.

Curve fitting by the method of least squares

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data and is subject to a degree of uncertainty, since it may reflect the method used to construct the curve as much as it reflects the observed data.

Different types of curve fittings

Fitting functions to data points

Most commonly, one fits a function of the form y = f(x).

Fitting lines and polynomial functions to data points

The first degree polynomial equation

$$y = ax + b$$

is a line with slope a. A line will connect any two points, so a first degree polynomial equation is an exact fit through any two points with distinct x coordinates.
If the order of the equation is increased to a second degree polynomial, the following results:

$$y = ax^2 + bx + c$$

This will exactly fit a simple curve to three points.

If the order of the equation is increased to a third degree polynomial, the following is obtained:

$$y = ax^3 + bx^2 + cx + d$$

This will exactly fit four points.

A more general statement would be to say it will exactly fit four constraints. Each constraint can be a point, angle, or curvature (which is the reciprocal of the radius of an osculating circle). Angle and curvature constraints are most often added to the ends of a curve, and in such cases are called end conditions. Identical end conditions are frequently used to ensure a smooth transition between polynomial curves contained within a single spline. Higher-order constraints, such as "the change in the rate of curvature", could also be added. This, for example, would be useful in highway cloverleaf design to understand the rate of change of the forces applied to a car (see jerk) as it follows the cloverleaf, and to set reasonable speed limits accordingly.

The first degree polynomial equation could also be an exact fit for a single point and an angle, while the third degree polynomial equation could also be an exact fit for two points, an angle constraint, and a curvature constraint. Many other combinations of constraints are possible for these and for higher order polynomial equations.

If there are more than n + 1 constraints (n being the degree of the polynomial), the polynomial curve can still be run through those constraints. An exact fit to all constraints is not certain (but might happen, for example, in the case of a first degree polynomial exactly fitting three collinear points). In general, however, some method is then needed to evaluate each approximation. The least squares method is one way to compare the deviations.
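The difference between an exact fit and a least squares approximation can be seen in a small numerical sketch. The points below are invented; `numpy.polyfit` is used here as one convenient least squares routine, not as the method of the text.

```python
import numpy as np

# Three hypothetical points with distinct x: a second degree polynomial fits them exactly.
x3 = np.array([0.0, 1.0, 2.0])
y3 = np.array([1.0, 3.0, 11.0])
exact = np.polyfit(x3, y3, deg=2)        # coefficients a, b, c of a*x^2 + b*x + c
max_exact_err = np.abs(np.polyval(exact, x3) - y3).max()

# Five hypothetical points, still degree 2: more constraints than n + 1,
# so polyfit returns the least squares approximation instead of an exact fit.
x5 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y5 = np.array([1.0, 3.0, 11.0, 20.0, 40.0])
approx = np.polyfit(x5, y5, deg=2)
residuals = np.polyval(approx, x5) - y5

print("exact-fit coefficients:", np.round(exact, 6))
print("largest deviation, 5 points:", np.abs(residuals).max())
```

With three points the deviations vanish (up to rounding); with five points no quadratic passes through all of them, and the fitted curve is the one with the smallest sum of squared deviations.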
There are several reasons given to get an approximate fit when it is possible to simply increase the degree of the polynomial equation and get an exact match:
* Even if an exact match exists, it does not necessarily follow that it can be readily discovered. Depending on the algorithm used there may be a divergent case, where the exact fit cannot be calculated, or it might take too much computer time to find the solution. This situation might require an approximate solution.
* The effect of averaging out questionable data points in a sample, rather than distorting the curve to fit them exactly, may be desirable.
* Runge's phenomenon: high order polynomials can be highly oscillatory. If a curve runs through two points A and B, it would be expected that the curve would run somewhat near the midpoint of A and B as well. This may not happen with high-order polynomial curves; they may even have values that are very large in positive or negative magnitude. With low-order polynomials, the curve is more likely to fall near the midpoint (it is even guaranteed to run exactly through the midpoint on a first degree polynomial).
* Low-order polynomials tend to be smooth and high order polynomial curves tend to be "lumpy". To define this more precisely, the maximum number of inflection points possible in a polynomial curve is n - 2, where n is the order of the polynomial equation. An inflection point is a location on the curve where it switches from a positive radius to negative. We can also say this is where it transitions from "holding water" to "shedding water". Note that it is only "possible" that high order polynomials will be lumpy; they could also be smooth, but there is no guarantee of this, unlike with low order polynomial curves. A fifteenth degree polynomial could have, at most, thirteen inflection points, but could also have eleven, nine, or any odd number down to one.
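The oscillation mentioned in the list above can be reproduced numerically. The sketch below uses the function 1/(1 + 25x^2), a standard textbook example for Runge's phenomenon that is our choice rather than the source's, sampled at 11 equally spaced points.

```python
import numpy as np

def runge(x):
    """Test function commonly used to illustrate Runge's phenomenon."""
    return 1.0 / (1.0 + 25.0 * x ** 2)

nodes = np.linspace(-1.0, 1.0, 11)      # 11 equally spaced sample points
dense = np.linspace(-1.0, 1.0, 1001)    # fine grid for checking between the points

high = np.polyfit(nodes, runge(nodes), deg=10)   # exact fit through all 11 points
low = np.polyfit(nodes, runge(nodes), deg=4)     # least squares approximation

err_at_nodes = np.abs(np.polyval(high, nodes) - runge(nodes)).max()
high_err = np.abs(np.polyval(high, dense) - runge(dense)).max()
low_err = np.abs(np.polyval(low, dense) - runge(dense)).max()

print(f"degree 10: error at sample points {err_at_nodes:.1e}, between points {high_err:.2f}")
print(f"degree 4:  largest error between points {low_err:.2f}")
```

The degree 10 curve matches every sample point, yet between the outermost points it strays further from the function than the whole range of the data, whereas the low-order approximation stays comparatively close everywhere.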
The degree of the polynomial curve being higher than needed for an exact fit is undesirable for all the reasons listed previously for high order polynomials, but also leads to a case where there are an infinite number of solutions. For example, a first degree polynomial (a line) constrained by only a single point, instead of the usual two, would give an infinite number of solutions. This brings up the problem of how to compare and choose just one solution, which can be a problem for software and for humans as well. For this reason, it is usually best to choose as low a degree as possible for an exact match on all constraints, and perhaps an even lower degree if an approximate fit is acceptable.

Fig. 2.1 Polynomial curves fitting points generated with a sine function
Note: Red line is a first degree polynomial, green line is second degree, orange line is third degree and blue is fourth degree.

Fitting other functions to data points

In spectroscopy, data may be fitted with Gaussian, Lorentzian, Voigt and related functions.

In agriculture the inverted logistic sigmoid function (S-curve) is used to describe the relation between crop yield and growth factors. The blue figure was made by a sigmoid regression of data measured in farm lands. It can be seen that initially, i.e. at low soil salinity, the crop yield reduces slowly with increasing soil salinity, while thereafter the decrease progresses faster.

Algebraic fit versus geometric fit for curves

For algebraic analysis of data, "fitting" usually means trying to find the curve that minimizes the vertical (y-axis) displacement of a point from the curve (e.g., ordinary least squares). However, for graphical and image applications geometric fitting seeks to provide the best visual fit, which usually means trying to minimize the orthogonal distance to the curve (e.g., total least squares), or to otherwise include both axes of displacement of a point from the curve.
Geometric fits are not popular because they usually require non-linear and/or iterative calculations, although they have the advantage of a more aesthetic and geometrically accurate result.

Fitting plane curves to data points

If a function of the form y = f(x) cannot be postulated, one can still try to fit a plane curve. Other types of curves, such as conic sections (circular, elliptical, parabolic, and hyperbolic arcs) or trigonometric functions (such as sine and cosine), may also be used in certain cases. For example, trajectories of objects under the influence of gravity follow a parabolic path when air resistance is ignored. Hence, matching trajectory data points to a parabolic curve would make sense. Tides follow sinusoidal patterns, hence tidal data points should be matched to a sine wave, or to the sum of two sine waves of different periods if the effects of the Moon and Sun are both considered.

For a parametric curve, it is effective to fit each of its coordinates as a separate function of arc length; assuming that data points can be ordered, the chord distance may be used.

Curve fittings: Least square

Curve fitting is a problem that arises very frequently in science and engineering. Suppose that from some experiment n observations, i.e. values of a dependent variable y measured at specified values of an independent variable x, have been collected. In other words, we have a set of n data points

$$(x_1, y_1),\ (x_2, y_2),\ (x_3, y_3),\ \ldots,\ (x_n, y_n)$$

The first step in constructing a mathematical model of the underlying physical process is to plot these data points and postulate a function form f(x) to describe the general trend in the data. Some simple functions commonly used to fit data are:
* Straight line: f(x) = ax + b
* Parabola: f(x) = ax^2 + bx + c
* Polynomial: f(x) = a_m x^m + a_{m-1} x^{m-1} + ... + a_2 x^2 + a_1 x + a_0 (includes the previous two cases)
* Exponential: f(x) = c exp(ax)
* Gaussian, e.g. f(x) = c exp(-bx^2)
* Sine or cosine, e.g. f(x) = a cos(bx) + c

The coefficients a, b, c etc. in the formula of f(x) are parameters that we can adjust. Of course, since there are inevitable measurement errors in the data, in general we would not expect f(x) to fit the data perfectly. The best we can do is to choose the parameters of the function so as to minimize the fitting error, that is, the differences between the values of f at the data points and the y-values originally collected. The residual at the i-th data point is

$$r_i = y_i - f(x_i), \qquad i = 1, 2, \ldots, n$$

The length-n array of r_i values is called the residual vector r, and we aim to minimize the norm of this vector. Three vector norms are most widely used in applications; they give rise to the following three standard error measures:

* Average error: $$E_1(f) = \frac{1}{n}\lVert r \rVert_1 = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - f(x_i) \rvert$$
* Root-mean-square error: $$E_2(f) = \frac{1}{\sqrt{n}}\lVert r \rVert_2 = \left[\frac{1}{n}\sum_{i=1}^{n} \lvert y_i - f(x_i) \rvert^2\right]^{1/2}$$
* Maximum error: $$E_\infty(f) = \lVert r \rVert_\infty = \max_{1 \le i \le n} \lvert y_i - f(x_i) \rvert$$

Suppose that the formula of f contains the parameters a_1, a_2, ...; then the quantity E(f) that we wish to minimize will be a function of these parameters, E(a_1, a_2, ...).

If we choose the parameters of f in order to minimize the root-mean-square error, then the process is called "least squares fitting". Minimizing the root-mean-square error E_2(f) is equivalent to minimizing

$$\lVert r \rVert_2^2 = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$$

Write

$$E(a_1, a_2, \ldots) = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2, \qquad \frac{\partial E}{\partial a_k} = -2\sum_{i=1}^{n} \left(y_i - f(x_i)\right)\frac{\partial f(x_i)}{\partial a_k}$$

So we need to solve the equations

$$\frac{\partial E}{\partial a_k} = 0, \qquad k = 1, 2, \ldots \qquad (*)$$

When f(x) is a polynomial of degree m (with m + 1 coefficients), (*) can be written as a linear system Ma = B. With the coefficients ordered as a = (a_0, a_1, ..., a_m)^T,

$$M = \begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^m \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{m+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^m & \sum x_i^{m+1} & \cdots & \sum x_i^{2m} \end{bmatrix}, \qquad B = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^m y_i \end{bmatrix}$$

and a is the array of parameters (i.e. coefficients of the polynomial) that we solve for.

Multiple regressions

The general purpose of multiple regressions (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.
For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). You may also detect "outliers," that is, houses that should really sell for more, given their location and characteristics.

Personnel professionals customarily use multiple regression procedures to determine equitable compensation. You can determine a number of factors or dimensions such as "amount of responsibility" (Resp) or "number of people to supervise" (No_Super) that you believe to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.

In the social and natural sciences multiple regression procedures are very widely used in research.
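The salary comparison described above can be sketched as follows. All survey figures, the two in-house positions, and the tolerance of 1.0 used to call a salary "equitable" are invented for illustration; the fit also includes an intercept term a, which the example equation in the text does not show.

```python
import numpy as np

# Hypothetical salary survey of comparable companies:
# columns are amount of responsibility (Resp) and number of people supervised (No_Super).
survey_X = np.array([[1, 0], [2, 1], [3, 2], [4, 2], [5, 4], [6, 5]], dtype=float)
survey_salary = np.array([30.0, 36.0, 43.0, 46.0, 57.0, 63.0])   # in thousands

# Fit Salary = a + b1*Resp + b2*No_Super by least squares.
design = np.column_stack([np.ones(len(survey_X)), survey_X])
coef, *_ = np.linalg.lstsq(design, survey_salary, rcond=None)

def predicted_salary(resp, no_super):
    """Salary expected from the fitted regression equation."""
    return coef[0] + coef[1] * resp + coef[2] * no_super

# Hypothetical in-house positions: (Resp, No_Super, actual salary).
positions = {"analyst": (2, 0, 29.0), "team lead": (4, 3, 55.0)}
status = {}
for name, (resp, no_super, actual) in positions.items():
    residual = actual - predicted_salary(resp, no_super)   # actual minus predicted
    if abs(residual) <= 1.0:          # assumed tolerance for "paid equitably"
        status[name] = "equitable"
    else:
        status[name] = "underpaid" if residual < 0 else "overpaid"
    print(f"{name}: residual {residual:+.2f} -> {status[name]}")
```

A negative residual means the actual salary lies below the value predicted from the survey, and a positive residual means it lies above.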
In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.

COMPUTATIONAL APPROACH

The general computational problem that needs to be solved in multiple regression analysis is to fit a straight line to a number of points.

Fig. 2.2: Multiple regression analysis

In the simplest case (one dependent and one independent variable) you can visualize this in a scatter plot. The following topics are covered:
* Least Squares.
* The Regression Equation.
* Unique Prediction and Partial Correlation.
* Predicted and Residual Scores.
* Residual Variance and R-square.
* Interpreting the Correlation Coefficient R.

Least squares

In the scatterplot, we have an independent or X variable, and a dependent or Y variable. These variables may, for example, represent IQ (intelligence as measured by a test) and school achievement (grade point average; GPA), respectively. Each point in the plot represents one student, that is, the respective student's IQ and GPA. The goal of linear regression procedures is to fit a line through the points. Specifically, the program will compute a line so that the squared deviations of the observed points from that line are minimized.
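A minimal sketch of this line-fitting step, assuming six invented IQ and GPA pairs, is given below. The slope and intercept are obtained from the usual least squares formulas (slope equal to the sum of products of deviations divided by the sum of squared X deviations); the helper `sum_squared_deviations` is our own name.

```python
import numpy as np

# Hypothetical IQ scores (X) and grade point averages (Y) for six students.
iq = np.array([95.0, 100.0, 105.0, 110.0, 120.0, 130.0])
gpa = np.array([2.8, 3.1, 3.0, 3.3, 3.4, 3.6])

dx = iq - iq.mean()
dy = gpa - gpa.mean()
b = (dx * dy).sum() / (dx ** 2).sum()   # slope of the fitted line
a = gpa.mean() - b * iq.mean()          # intercept of the fitted line

def sum_squared_deviations(intercept, slope):
    """Sum of squared vertical deviations of the points from a candidate line."""
    return float(((gpa - (intercept + slope * iq)) ** 2).sum())

best = sum_squared_deviations(a, b)
print(f"GPA = {a:.3f} + {b:.4f} * IQ, sum of squared deviations = {best:.4f}")
```

Any other line, for instance one with a slightly shifted intercept or slope, gives a larger sum of squared deviations than the fitted one.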
This general procedure is sometimes also referred to as least squares estimation.

The regression equation

A line in a two dimensional or two-variable space is defined by the equation Y = a + b*X; in full text: the Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. For example, GPA may best be predicted as 1 + .02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since 1 + .02*130 = 3.6).

For example, the figure below shows a two dimensional regression equation plotted with three different confidence intervals (90%, 95% and 99%).

Fig. 2.3: Two dimensional regression equations plotted with three different confidence intervals (90%, 95% and 99%)

In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in the two dimensional space, but can be computed just as easily. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:

Y = a + b1*X1 + b2*X2 + ... + bp*Xp

Unique prediction and partial correlation

Note that in this equation, the regression coefficients (or B coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable.
Another way to express this fact is to say that, for example, variable X1 is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. You would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men;

