Statistical Techniques in Scientific Research: Statistics


STATISTICAL TECHNIQUES IN SCIENTIFIC RESEARCH

Dr. K. Manjunatha Prasad


Professor, Department of Statistics
Manipal University,
Manipal, Karnataka 576 104
Statistics is an important tool in (i) designing research, (ii) analyzing the data obtained and (iii) drawing conclusions
from them. Much research produces a large volume of raw data, which must be reduced suitably so that it can be read
easily and used for further analysis. Research therefore cannot ignore statistics. Although classification and tabulation
help reduce the size of the data, explaining its other characteristics requires going a step further and using the more
advanced tools that statistics offers. In dealing with data, researchers come across one or more of the following
circumstances:

• The data are to be described in the most efficient manner (in brief).
• Unknown quantities are to be estimated, relationships are to be established from observed data, or hypotheses are
to be tested and inferences drawn.
• A course of action is to be suggested under uncertainty.

We have three branches of statistics in handling the above circumstances.

STATISTICS

• Descriptive Statistics: Data Collection and Presentation
• Inductive Statistics: Estimation and Statistical Inference
• Deductive Statistics: Probability Theory

Descriptive Statistics refers to the analysis and synthesis of data so that the situation can be described better,
thereby promoting a better understanding of the facts.

Inductive Statistics is concerned with procedures in which the values for a group are estimated by examining a small
portion of that group. The group is known as the ‘population’ or ‘universe’ and the portion is known as the ‘sample’.
The values of interest in the sample are known as statistics and the corresponding values in the population are known
as parameters. Thus, inductive statistics is concerned with estimating universe parameters from sample statistics.
Inductive statistics is also known as ‘Inferential Statistics’ or ‘Sampling Statistics’.

Deductive Statistics is concerned with establishing rules and procedures for choosing one course from among alternative
courses of action under uncertainty. It uses probability theory and provides a rational basis for dealing with
situations influenced by chance-related factors.

Research comprises

(i) defining and redefining problems,


(ii) formulating hypothesis or suggested solutions,
(iii) collecting, organizing and evaluating data,
(iv) making deductions and reaching conclusions,
(v) carefully testing the conclusions to determine whether they fit the formulated hypotheses.

Purpose of Research:
1. To discover answers to questions through the application of scientific procedures, that is, to find out the truth
which is hidden and has not yet been discovered.
2. To gain familiarity with a phenomenon or to achieve new insight into it. – Exploratory or formulative research
3. To portray accurately the characteristics of a particular individual, situation or group. – Descriptive research
4. To determine the frequency with which something occurs or with which it is associated with something else. –
Diagnostic research
5. To test a hypothesis of a causal relationship between variables. – Hypothesis-testing research

Flow Chart for Research Process

1. Define the research problem
2. Review concepts and theories; review previous research findings
3. Formulate hypotheses
4. Design the research (including sample design)
5. Collect data (execution)
6. Analyze data (test hypotheses, if any)
7. Interpret and report

F – Feedback (helps in controlling the subsystem to which it is transmitted)
FF – Feed forward (serves the vital function of providing criteria for evaluation)

For research to be completed successfully within the planned time frame, the researcher must have an effective research
design, in which the sampling design and the statistical design are very important components. The overall research
design may be split into the following parts:

(a) The sampling design, which deals with the method of selecting the items to be observed for the given study;
(b) The observational design, which relates to the conditions under which the observations are to be made;
(c) The statistical design, which concerns how many items are to be observed and how the information and data
gathered are to be analyzed;
(d) The operational design, which deals with the techniques by which the procedures specified in the sampling,
statistical and observational designs can be carried out.

We focus our attention on the use of statistics in the data analysis and sampling parts of research.

Statistics in Summarizing the Data:

The important statistical measures used to summarize data are:

(i) Measures of Central Tendency
(ii) Measures of Dispersion
(iii) Measures of Asymmetry (skewness)
(iv) Measures of Relationship, etc.

Measure of Central Tendency: This measure indicates the point about which items have a tendency to cluster.

Measures of Central Tendency

• Arithmetic Mean
• Geometric Mean
• Harmonic Mean
• Median
• Mode

A measure of central tendency can represent the values of a variable by a single value, but it cannot reveal every
feature of the data. In particular, it gives no idea of the scatter of the values of a variable around the average.
To measure this scatter, we use measures of dispersion.

Amongst the measures of dispersion, the range is the simplest concept, but the variance (or standard deviation) has
mathematical significance because it explains the distribution of the data. The standard deviation is amenable to
further algebraic computation and is hence the most preferred measure of dispersion.

Measures of Dispersion

• Range
• Mean Deviation
• Standard Deviation
• Quartile Deviation
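
The measures of central tendency and dispersion listed above can be computed with Python's standard `statistics` module. The following is only a minimal sketch: the observations are hypothetical, the function name `summarize` is an illustrative choice, and the quartile deviation is taken here as half the interquartile range using the module's default quartile method.

```python
import statistics as st

# Hypothetical observations, chosen only for illustration.
data = [12, 15, 15, 18, 20, 22, 30]

def summarize(values):
    """Return common measures of central tendency and dispersion."""
    mean = st.mean(values)
    q1, _, q3 = st.quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    return {
        "arithmetic_mean": mean,
        "geometric_mean": st.geometric_mean(values),
        "harmonic_mean": st.harmonic_mean(values),
        "median": st.median(values),
        "mode": st.mode(values),
        "range": max(values) - min(values),
        # Mean absolute deviation about the arithmetic mean.
        "mean_deviation": sum(abs(x - mean) for x in values) / len(values),
        # Sample standard deviation (divisor n - 1).
        "standard_deviation": st.stdev(values),
        # Half the interquartile range (an assumption about the definition used).
        "quartile_deviation": (q3 - q1) / 2,
    }

summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean; printing the three side by side is a quick sanity check on any data set.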

Measure of Asymmetry: In a large series of data, the distribution is expected to be symmetric. In other words, the
values of the variable are expected to be clustered around the measure of central tendency. In practice this is often
not the case, and it is then very important to know the kind of asymmetry.

Measures of Relationship: So far, we have dealt with statistical measures that are useful for a univariate population.
When every measurement of a variable X has a corresponding value of a second variable Y, the pairs of values (X, Y)
form a bivariate population. When there are in addition corresponding values of a third variable Z, or of even more
variables, the tuples of values form a multivariate population.

In the case of bivariate or multivariate populations, we are interested in how two or more variables in the data relate
to one another. For example, one may want to know whether the number of hours students devote to studies (H) is
related to their family income (I), age (A), sex (S) or any other similar factor. In other words, the question is whether
there exists a function ‘F’ such that H = F(I, A, S, … ). There are several methods of determining the relationship
between variables, but no method can tell us with certainty that a correlation indicates a causal relationship. So, we
have the following basic types of questions in the case of a bivariate or multivariate population:

(i) Does there exist any association (or correlation) between two (or more) variables? If yes, of what degree?
(ii) Is there any cause and effect relationship between the two variables in the case of a bivariate population, or
between one variable on one side and two or more variables on the other side in the case of a multivariate
population? If yes, of what degree and in which direction?

The first question is answered through correlation techniques and the second through regression techniques.

Measures of Relation

Bivariate:
• Cross tabulation
• CS coefficient of rank correlation
• CP coefficient of correlation
• Simple regression

Multivariate:
• Coefficient of multiple correlation
• Coefficient of partial correlation
• Multiple regression

Simple regression, which deals with a bivariate population, explains to what extent an independent variable (X)
influences changes in a dependent variable (Y). The relationship is given by

$$Y = a + bX,$$

called the regression equation of Y on X. Here the regression coefficient $b$ is the change in Y corresponding to a unit
change in X. Similarly, the multiple regression equation is of the form

$$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k.$$
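
As an illustration of the regression of Y on X, the sketch below fits a straight line Y = a + bX by ordinary least squares. The symbols a (intercept) and b (regression coefficient), the function name and the data (hours of study against a score) are assumptions made for this example, not values from the text.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (a, b) for the line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # change in Y per unit change in X
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data: hours of study and score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

a, b = fit_simple_regression(hours, score)
print(f"Y = {a:.2f} + {b:.2f} X")
```

With these illustrative numbers each additional hour is associated with about 4.1 more points; as noted above, such a coefficient describes association and does not by itself establish causation.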

Estimation of Parameter:

In most research, conducting a census study is not practically possible. The usual approach is to generalize, or to
draw inferences from samples about the characteristics of the population from which the samples are taken. The
characteristics of the population that concern the researcher are known as parameters (often called parameters of
interest). The researcher selects a few items from the population for study, and this collection of selected items is
known as a sample. Sampling is done on the assumption that the sample data enable the researcher to estimate the
parameter. The sample should be truly representative of the population, without any bias, so that the conclusions
derived from it are valid and reliable. But whatever method the researcher adopts, it is nearly impossible to make
such an estimate free from error unless a census study is made. In the following diagram, the occurrence of sampling
error is explained.

As the true value of the parameter is not known, even the error is obtained through estimation. This estimate depends
on the level of confidence the researcher would like to have in it and on the sample size chosen. We define the
terms ‘Precision’ and ‘Confidence Level’ in this context.

Precision is the range within which the population parameter will lie, in accordance with the reliability specified in
the confidence level, expressed either as a percentage of the estimate or as a numerical quantity. The confidence level
or reliability is the expected percentage of times that the actual value will fall within the stated precision limits.

The estimate of a population parameter may be a single value or a range of values. A single-value estimate is referred
to as a point estimate, whereas a range of values is termed an interval estimate.

A good estimator possesses the following properties:

(i) An estimator should on average be equal to the value of the parameter being estimated. (Property of
Unbiasedness)
(ii) An estimator should have relatively small variance. (Property of Efficiency)
(iii) An estimator should use as much as possible of the information available from the sample. (Property of Sufficiency)
(iv) An estimator should approach the value of the parameter as the sample size becomes larger and larger.
(Property of Consistency)

If the population mean is the parameter of interest, then the point estimator of the population mean ($\mu$) is
$\bar{X}$, the sample mean. The interval estimator for the mean $\mu$ is an interval around $\bar{X}$, constructed for
a given degree of confidence with the help of the standard error.

For example, the 95% confidence interval for the population mean has lower limit $\bar{X} - 1.96\,SE$ and upper limit
$\bar{X} + 1.96\,SE$. In other words, the probability that the interval $[\bar{X} - 1.96\,SE,\ \bar{X} + 1.96\,SE]$
contains $\mu$ is 0.95:

$$P[\bar{X} - 1.96\,SE \le \mu \le \bar{X} + 1.96\,SE] = 0.95$$

[Figure: Confidence interval for the population mean at the 95% confidence level]

The standard error is determined by the sampling distribution, that is, by the variation of the statistic concerned
across all possible samples of the same size. For the population mean, the standard error is given by

$$SE = \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n-1)}}.$$

Similarly, we can discuss estimation of all other parameters associated with the population.
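
The interval estimate of the mean described above can be sketched in a few lines. The sample values and the function name are hypothetical; the multiplier 1.96 is the large-sample value used in the text, and for a sample as small as this one a t-table value would normally replace it.

```python
import math
import statistics as st

def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) limits of the interval x_bar +/- z * SE."""
    n = len(sample)
    x_bar = st.mean(sample)
    # Standard error of the mean: sample SD (divisor n - 1) over sqrt(n).
    se = st.stdev(sample) / math.sqrt(n)
    return x_bar - z * se, x_bar + z * se

# Hypothetical sample, for illustration only.
sample = [23, 25, 21, 27, 24, 26, 22, 24]

lower, upper = mean_confidence_interval(sample)
print(f"95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```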

Estimation of Sample Size:

In the estimation of the population mean above, we noted that the limits of the interval estimate are strongly
influenced by the number of items chosen in the sample. Determining an appropriate sample size in the sample design
is therefore very important for the reliability of the conclusions. The researcher should determine the sample size
keeping the following points in view:

(i) Nature of Universe: If the items of the universe are homogeneous, a small sample can serve the purpose.
But if the items are heterogeneous, a large sample would be required. Technically, this can be termed the
dispersion factor.

(ii) Number of classes proposed: If many class-groups are to be formed, a large sample would be required,
because a small sample may not give a reasonable number of items in each class-group.

(iii) Nature of Study: If items are to be intensively and continuously studied, the sample should be small. For a
general survey the sample should be large, but a small sample is considered appropriate in a
technical survey.

(iv) Type of sampling: The sampling technique plays an important part in determining the size of the sample. A small
random sample is apt to be much superior to a larger but badly selected sample.

(v) Standard of accuracy and acceptable confidence level: If the standard of accuracy or the level of precision is
to be kept high, we require a relatively larger sample. Because the standard error is proportional to
$1/\sqrt{n}$, doubling the accuracy at a fixed significance level requires a fourfold increase in the sample size.

(vi) Availability of finance: In practice, the size of the sample depends upon the amount of money available for
the study. This factor should be kept in view while determining the size of the sample. A larger
sample increases the cost of the sampling estimates.

(vii) Other considerations: The nature of the units, the size of the population, the size of the questionnaire, the
availability of trained investigators, the conditions under which the sampling is conducted and the time available
for completing the study are a few other considerations to which a researcher must pay attention while selecting
the size of the sample.

In general, we follow the sequence below while deciding the sample size.

Objectives of Research → Identify the Parameters of Interest → Determine the sample size subject to confidence level
and precision expected

Sample Size in case of estimating a mean:

Note that the limits of the confidence interval for the population mean are given by

$$\bar{X} \pm z \cdot SE = \bar{X} \pm z \frac{\sigma}{\sqrt{n}},$$

where $\bar{X}$ is the sample mean,
$z$ is the value of the standard variate at the given confidence level,
$n$ is the sample size, and
$\sigma$ is the standard deviation of the population.

If the researcher wishes to estimate the population mean within a desired precision $\pm e$, then

$$e = z \frac{\sigma}{\sqrt{n}} \quad\text{and therefore}\quad n = \frac{z^2 \sigma^2}{e^2}.$$

In the case of a finite population of size $N$, we get

$$e = z \cdot SE = z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \quad\text{and therefore}\quad n = \frac{z^2 \sigma^2 N}{(N-1)e^2 + z^2 \sigma^2}.$$

Often the standard deviation of the population is not known and the sample has not yet been taken. A rough estimate of
the population standard deviation is then given by

$$\hat{\sigma} = \frac{\text{Range of Population Distribution}}{6}.$$

The range may have to be obtained from past records or through a pilot survey of a large number of items.
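
The two sample-size formulas for a mean (infinite and finite population) can be checked numerically. The sketch below is illustrative only: the function name, the range of 60 (giving a rough standard deviation of 60/6 = 10), the precision of 2 units and the population of 500 are all assumed values.

```python
import math

def sample_size_for_mean(sigma, e, z=1.96, population_size=None):
    """Sample size needed to estimate a mean within +/- e.

    Uses n = z^2 sigma^2 / e^2, or the finite-population form
    n = z^2 sigma^2 N / ((N - 1) e^2 + z^2 sigma^2) when N is given.
    The result is rounded up to the next whole item.
    """
    if population_size is None:
        n = (z ** 2 * sigma ** 2) / e ** 2
    else:
        N = population_size
        n = (z ** 2 * sigma ** 2 * N) / ((N - 1) * e ** 2 + z ** 2 * sigma ** 2)
    return math.ceil(n)

# Rough estimate of sigma from an assumed range of 60 units.
sigma_hat = 60 / 6

n_infinite = sample_size_for_mean(sigma_hat, e=2)
n_finite = sample_size_for_mean(sigma_hat, e=2, population_size=500)
print(n_infinite, n_finite)
```

The finite-population figure is the smaller of the two, reflecting the correction factor in the standard error.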

Sample size when estimating the population proportion:

To find the sample size for estimating a population proportion, the reasoning is similar to that for the population
mean. We specify the precision and the confidence level and then estimate the sample size as follows.

Note that the standard error of a proportion is given by

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} \quad\text{(in the case of an infinite population)}$$

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N-n}{N-1}} \quad\text{(in the case of a finite population of size } N\text{)}$$

where $p$ is the sample proportion, $q = 1 - p$, $z$ is the standard variate for the appropriate confidence level and
$n$ is the sample size. Further, the confidence interval for the population proportion is given by

$$p \pm z \cdot SE$$

If $e$ is the precision rate (the acceptable error), then the sample size can be expressed as

$$n = \frac{z^2 pq}{e^2} \quad\text{(in the case of an infinite population)}$$

$$n = \frac{z^2 pq N}{e^2 (N-1) + z^2 pq} \quad\text{(in the case of a finite population)}$$

So, depending on the objectives and the parameter of interest, the method of identifying the sample size varies.
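
A short sketch of the sample-size calculation for a proportion follows. The function name and the inputs (p = 0.5, precision 0.05, 95% confidence, and a finite population of 2000) are assumptions chosen for illustration; p = 0.5 is a common conservative choice because it maximizes pq.

```python
import math

def sample_size_for_proportion(p, e, z=1.96, population_size=None):
    """Sample size needed to estimate a proportion within +/- e.

    Uses n = z^2 p q / e^2, or the finite-population form
    n = z^2 p q N / (e^2 (N - 1) + z^2 p q) when N is given.
    The result is rounded up to the next whole item.
    """
    q = 1 - p
    if population_size is None:
        n = (z ** 2 * p * q) / e ** 2
    else:
        N = population_size
        n = (z ** 2 * p * q * N) / (e ** 2 * (N - 1) + z ** 2 * p * q)
    return math.ceil(n)

n_infinite = sample_size_for_proportion(p=0.5, e=0.05)
n_finite = sample_size_for_proportion(p=0.5, e=0.05, population_size=2000)
print(n_infinite, n_finite)
```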

Research Hypothesis & Statistical Techniques

In most research, it is the practice to make some conjecture about the possible conclusion while designing the
research and to test it after obtaining the sample data. A hypothesis is an assumption or supposition to
be proved or rejected.

Definition: Hypothesis is a proposition or a set of propositions set forth as an explanation for the occurrence of some
specified group of phenomena either asserted merely as a provisional conjecture to guide some investigation or
accepted as highly probable in the light of established facts.

The question we come across in general:

What kind of hypotheses and tests are to be employed to achieve the objectives?

Answer:

It depends on what hypothesis the researcher has framed and how the variable is measured.

With reference to the same objective, we may have several hypotheses, and the methods to be employed may differ in
each case. Among the hypotheses below, (i)-(v) deal with the objective of verifying the impact of counseling on
students' performance. The research design is different in each case.

Examples:

(i) Students who receive counseling show greater performance
(ii) Students who receive counseling improve their performance
(iii) Students who receive counseling perform better than those not receiving the counseling
(iv) Counseling has an impact on students’ performance
(v) The number of hours of counseling received by students has an impact on performance
(vi) The family income status of students has an impact on performance

Characteristics of a Hypothesis:

(i) A hypothesis should be clear and precise.
(ii) A hypothesis should be capable of being tested.
(iii) A hypothesis should state the relationship between variables, if it happens to be a relational hypothesis.
(iv) A hypothesis should be limited in scope and must be specific. A researcher must remember that a narrower
hypothesis is generally more testable and should develop such hypotheses.
(v) A hypothesis should be stated as far as possible in the simplest terms so that it is easily
understandable by all concerned.
(vi) A hypothesis should be consistent with most known facts, i.e., it must be consistent with a substantial body of
established facts. In other words, it should be one which judges accept as being the most likely.
(vii) A hypothesis should be amenable to testing within a reasonable time.
(viii) A hypothesis must explain the facts that gave rise to the need for explanation.

Null Hypothesis and Alternative Hypothesis: The null hypothesis is an initial statement concerning a population
parameter. It is generally denoted by $H_0$. Any hypothesis which differs from the null hypothesis is called an
‘alternative hypothesis’ and is denoted by $H_1$.

Type I error: The error of rejecting the null hypothesis when it is true (and should have been accepted) is known as
a type I error.

Type II error: The error of accepting the null hypothesis when it is false (and should have been rejected) is known as
a type II error.

The probability of a type I error is usually fixed in advance and is understood as the level of significance of the
test. If the type I error is fixed at 5%, there are about 5 chances in 100 that we reject $H_0$ when $H_0$ is true.
With a fixed sample size n, however, reducing the probability of a type I error increases the probability of committing
a type II error; the two cannot be reduced simultaneously. In testing a hypothesis at a given level of significance,
we identify the rejection region for the test statistic in order to reject or accept the null hypothesis.

Flow Chart for Hypothesis Testing:

1. State $H_0$ as well as $H_1$.
2. Specify the level of significance ($\alpha$).
3. Decide the correct sampling distribution.
4. Obtain the sample and work out an appropriate value from the sample data.
5. Calculate the probability that the sample result would diverge as widely as it has from expectations, if the null
hypothesis were true (find the z-value or t-value for the purpose).
6. Compare this probability with the significance level ($\alpha/2$ in the case of a two-tailed test; $\alpha$ in the
case of a one-tailed test), that is, find whether the calculated z or t value is in the rejection region.
   • Yes: Reject $H_0$
   • No: Accept $H_0$

We now give some examples of research hypotheses together with the null and alternative hypotheses used to test
them.

Example 1.

Hypothesis: Students who receive counseling show greater performance.

In this case a benchmark for greater performance has to be defined. If obtaining a CGPA of more than 7.5 counts as good
performance, then we have the following null and alternative hypotheses:

Null Hypothesis: Performance = 7.5

Alternative Hypothesis: Performance > 7.5

(One-sample t test)

Example 2.

Hypothesis: Students who receive counseling improve their performance.

In this case we compare each student's performance after the counseling with the same student’s performance before
the counseling.

Null Hypothesis: No difference in performance before and after

Alternative Hypothesis: Performance after is greater than performance before.

(Paired-samples t test)

Example 3.

Hypothesis: Students who receive counseling seriously perform better than those who receive it casually.

Here we may have three categories of students: those not receiving the counseling, those receiving it casually, and
those taking it seriously. In this case we compare the performance of two of these groups.

Null Hypothesis: No difference in the performance of the two groups of students

Alternative Hypothesis: The performance of students receiving the counseling seriously is better than that of
students receiving the counseling casually.

(Independent-samples t test)

Example 4.

Hypothesis: Students’ performance depends on the kind of counseling they receive.

The categories of students in this example may be the same as earlier, with three groups.

Null Hypothesis: The kind of counseling does not cause variation in the performance of students.

Alternative Hypothesis: The kind of counseling causes variation in the performance of students.

(One-way ANOVA)

Example 5.

Hypothesis: The number of hours of counseling received by students has an impact on performance.

Here we are concerned with the relation between two variables.

Null Hypothesis: No correlation between the variables (number of hours of counseling received and performance)

Alternative Hypothesis: There is a linear correlation between the variables

(Correlation test)

Example 6.

Hypothesis: Family income status (Poor, Middle, Rich) is associated with the performance (Poor, Mediocre, Excellent)
of students.

In this case we are dealing with attributes rather than variables, so we use a nonparametric test.

Null Hypothesis: No association between the attributes

Alternative Hypothesis: The attributes are dependent

(Chi-square test)
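
To make the chi-square test of association concrete, the sketch below computes the test statistic for a 3 × 3 table of income status against performance. The counts are entirely hypothetical, the function name is illustrative, and the critical value 9.488 (5% level, 4 degrees of freedom) is taken from a standard chi-square table.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts. Rows: income Poor, Middle, Rich.
# Columns: performance Poor, Mediocre, Excellent.
observed = [
    [20, 15, 5],
    [15, 20, 15],
    [5, 15, 20],
]

CRITICAL_5PCT_DF4 = 9.488  # table value for alpha = 0.05 and 4 degrees of freedom

chi2, dof = chi_square_statistic(observed)
reject_null = chi2 > CRITICAL_5PCT_DF4
print(f"chi-square = {chi2:.3f}, df = {dof}, reject H0: {reject_null}")
```

For these made-up counts the statistic exceeds the critical value, so the null hypothesis of no association would be rejected at the 5% level.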

When more than one variable influences a dependent variable, we may employ multiple regression techniques.
Similarly, to reduce a large number of factors to a few, we employ factor analysis.

Limitations of Tests of Hypotheses:

(i) The tests should not be used in a mechanical fashion. It should be kept in view that testing is not decision
making itself; the tests are only useful aids for decision making. Hence proper interpretation of statistical
evidence is important for intelligent decisions.
(ii) Tests do not explain why a difference exists. They simply indicate whether the
difference is due to fluctuations of sampling or to other reasons, but they do not tell us which
other reason(s) cause the difference.

(iii) Results of significance tests are based on probabilities and as such cannot be expressed with full certainty. When
a test shows that a difference is statistically significant, it simply suggests that the difference is probably
not due to chance.
(iv) Statistical inferences based on significance tests cannot be said to be entirely correct evidence
concerning the truth of a hypothesis. This is especially so in the case of small samples, where the probability of
drawing erroneous inferences is generally higher. For greater reliability, the sample size should be
sufficiently enlarged.

Ref: C. R. Kothari

Some Examples of Data and Analysis through the SPSS package:

Data:

Descriptive Statistics of Data: (Comments under each observation are mandatory)

Gender

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Male          8        40.0        40.0              40.0
       Female       12        60.0        60.0             100.0
       Total        20       100.0       100.0

[Bar chart: Frequency by Gender (Male, Female)]

Age class

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  <=22          4        20.0        20.0              20.0
       23 - 26      11        55.0        55.0              75.0
       > 26          5        25.0        25.0             100.0
       Total        20       100.0       100.0

[Bar chart: Frequency by Age class (<=22, 23 - 26, > 26)]
Sales class

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid  less than 25        3        15.0        15.0              15.0
       26 - 50            10        50.0        50.0              65.0
       More than 50        7        35.0        35.0             100.0
       Total              20       100.0       100.0

[Bar chart: Frequency by Sales class (less than 25, 26 - 50, More than 50)]

Region

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid  State 1       8        40.0        40.0              40.0
       State 2       6        30.0        30.0              70.0
       State 3       6        30.0        30.0             100.0
       Total        20       100.0       100.0

[Bar chart: Frequency by Region (State 1, State 2, State 3)]

Descriptive Statistics

                      N    Minimum   Maximum    Mean    Std. Deviation
Age                   20      19        30      24.60        2.836
Sales (in Rs'000)     20      11        77      43.55       19.975
Valid N (listwise)    20
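
The frequency, percent and cumulative percent columns of the frequency tables above follow directly from category counts. The sketch below rebuilds the Gender table; since the raw data are not reproduced here, the list of 20 values is reconstructed from the reported counts (8 Male, 12 Female), and the function name is an illustrative choice.

```python
from collections import Counter

def frequency_table(values, order):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for category in order:
        freq = counts[category]
        pct = 100 * freq / total
        cumulative += pct
        rows.append((category, freq, round(pct, 1), round(cumulative, 1)))
    return rows

# Values reconstructed from the counts reported in the Gender table.
genders = ["Male"] * 8 + ["Female"] * 12

rows = frequency_table(genders, order=["Male", "Female"])
for category, freq, pct, cum in rows:
    print(f"{category:<8}{freq:>4}{pct:>8.1f}{cum:>8.1f}")
```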

Hypothesis: The average age of employees is as low as 24

Null Hypothesis: Mean age is 24

Alternative Hypothesis: Average age is not 24

One-Sample Test (Test Value = 24)

                                                        95% Confidence Interval of the Difference
         t      df    Sig. (2-tailed)    Mean Difference        Lower        Upper
Age     .946    19        .356                .60                -.73         1.93

Since Sig. (2-tailed) = .356 is greater than .05, we accept the null hypothesis.
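
The one-sample test output can be reproduced by hand from the summary figures in the Descriptive Statistics table (mean 24.60, standard deviation 2.836, N = 20). The sketch below does this; the function name is illustrative and the two-tailed 5% critical value 2.093 for 19 degrees of freedom is taken from a standard t table.

```python
import math

def one_sample_t(mean, sd, n, test_value):
    """Return (t statistic, degrees of freedom, standard error)."""
    se = sd / math.sqrt(n)
    t = (mean - test_value) / se
    return t, n - 1, se

# Summary figures for Age from the Descriptive Statistics table.
t, df, se = one_sample_t(mean=24.60, sd=2.836, n=20, test_value=24)

T_CRIT = 2.093  # t-table value, two-tailed alpha = 0.05, df = 19

mean_difference = 24.60 - 24
lower = mean_difference - T_CRIT * se
upper = mean_difference + T_CRIT * se
reject_null = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df}")
print(f"95% CI of the difference: [{lower:.2f}, {upper:.2f}]")
print(f"reject H0: {reject_null}")
```

The computed t of about 0.946 lies well inside the critical values, and the interval for the difference contains zero, which agrees with the decision to accept the null hypothesis.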

Hypothesis: Age and sales have correlation

Null Hypothesis: No correlation between the variables

Alternative Hypothesis: Linear correlation exists.

[Scatter plot: Sales (in Rs'000) against Age, with points labelled by case number]

Correlations

                                             Age     Sales (in Rs'000)
Age                 Pearson Correlation       1           .118
                    Sig. (2-tailed)           .           .619
                    N                         20          20
Sales (in Rs'000)   Pearson Correlation      .118          1
                    Sig. (2-tailed)          .619          .
                    N                         20          20

No correlation between age and sales, and the analysis continues.
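
The significance of a Pearson correlation can be checked with the statistic t = r√(n − 2) / √(1 − r²), which has n − 2 degrees of freedom. The sketch below applies it to the reported r = .118 with n = 20; the function name is illustrative and the two-tailed 5% critical value 2.101 for 18 degrees of freedom is taken from a standard t table.

```python
import math

def correlation_t(r, n):
    """Return (t statistic, degrees of freedom) for testing zero correlation."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

# Values reported in the Correlations table.
t, df = correlation_t(r=0.118, n=20)

T_CRIT = 2.101  # t-table value, two-tailed alpha = 0.05, df = 18
reject_null = abs(t) > T_CRIT

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_null}")
```

A t value of about 0.5 is far below the critical value, consistent with the reported two-tailed significance of .619.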
