Lecture Note

Learning objectives:
After completing this chapter, the student will be able to:
1. Define Statistics and Biostatistics
2. Enumerate the importance and limitations of statistics
3. Define and identify the different types of data and understand why we need to
classify variables
1.1 Origin and Growth of Statistics

The origin of modern statistics can be traced back to the 17th and 18th centuries when
mathematicians were mainly interested in the development of the theory of probability as
applied to the theory of chance. In the modern world of computers and information
technology, the importance of statistics is very well recognized by all the disciplines.
Statistics originated as a science of statehood and gradually found applications
in agriculture, economics, commerce, biology, medicine, industry, planning,
education and so on.
Nowadays, statistical thinking has become essential in many fields of study. Its
usefulness has spread to such diverse fields as agriculture, business, accounting,
marketing, economics, management, medicine, political science, psychology, sociology,
engineering, journalism, metrology, tourism, etc. In biomedical research, meaningful
conclusions can only be drawn based on data collected from a valid scientific design using
appropriate statistical methods. Therefore, the selection of an appropriate study design is
important to provide an unbiased and scientific evaluation of the research questions. Each
design is based on a certain rationale and is applicable in certain experimental situations.

1.1.1. Definition of Biostatistics

Biostatistics is the segment of statistics that deals with data arising from biological
processes or medical experiments. Thus biostatistics is the application of statistical
techniques in a health related area (application of statistical methods on biological,
medical and public health data).

Why biostatistics?

Because some statistical methods are more heavily used in health applications than
elsewhere (e.g., survival analysis and longitudinal data analysis).

The word statistics, on the other hand, has two meanings. In the more common
plural sense, statistics refers to numerical information (aggregates of facts). Examples
include statistics of births, disease cases, imports, exports, etc. In these examples,
statistics are numbers or facts. The subject of statistics (singular sense) has a much
broader meaning than just collecting and publishing numerical information. Statistics
in this sense may be defined as the science of collecting, organizing,
presenting, analyzing and interpreting data to assist in making more effective
decisions. This definition points out the five stages of any statistical investigation.

1.2 Definitions and classification of statistics

Definitions of Statistics

Some definitions of statistics:

 “Statistics may be regarded (i) as the study of populations, (ii) as the study of
variation, (iii) as the study of methods of the reduction of data.” Fisher [1950].
 “Statistics is the branch of the scientific method which deals with the data obtained
by counting or measuring the properties of populations of natural phenomena.”
Kendall and Stuart [1963].
 “Statistics is concerned with the inferential process, in particular with the planning
and analysis of experiments or surveys, and with the efficient summarizing of sets
of data.” Kruskal [1968].
 “Statistics may be defined as a science of collection, presentation, analysis and
interpretation of numerical data.” Croxton and Cowden.

All the above definitions of statistics can be summarized by the following statements.
A) Statistics as Numerical Data (Plural Sense): in this sense statistics are defined as
aggregates of numerically expressed facts (figures) collected in a systematic manner for a
pre-determined purpose.

B) Statistics as a Subject or Field of Study (Singular Sense): in this sense statistics is
defined as the science of collecting, organizing, presenting, analyzing and interpreting
numerical data for understanding a phenomenon or making wise decisions. Hence, statistics
is a procedural process that performs these five major activities on numerical data.

Classification of Statistics

Statistics can be classified into two broad classes: descriptive statistics and inferential
statistics.
1. Descriptive statistics:
 This part of statistics deals only with describing some characteristics of the data
collected without going beyond the data. In other words, it deals with only
describing the sample data without going any further: that is without attempting to
infer (conclude) anything about the population.
 Descriptive statistics deals with collection of data, its presentation in various
forms, such as tables, graphs and diagrams, and finding averages and other
measures which would describe the data.
 Descriptive statistics refers only to the actual data. That is, the data at hand.
 Descriptive Statistics is basically a kind of Statistics which is used to describe the
features of the data gathered by the researcher.
Examples:
 Classification of students in college of computing and informatics according to
their Department.
 The number of female students in this class.

2. Inferential Statistics:

 This type of statistics is concerned with drawing statistically valid conclusions
about the characteristics of the population (large group) based on information
obtained from a sample (small group). That is, this part of statistics is concerned
with generalizing the results of a sample to the entire population from which the
sample is drawn.
 It is the part of statistics that generalizes from sample to population using
probabilities, performs hypothesis tests, determines relationships between
variables, and makes predictions.
Example: Of 50 randomly selected people in the town of Wolkite, 10 people had the last
name Kebede. An example of inferential statistics is the following statement:
"about 20% of all people living in Ethiopia have the last name Kebede."

1.3 Stages in Statistical Investigation

According to the definition of statistics, we have the following five stages of a statistical
investigation:
1. Collection of data: The first stage of a statistical investigation. The data should be
collected with a specific and well-defined purpose so that the conclusions drawn are
not misleading. There are two methods of data collection, primary and secondary.
The primary method of data collection refers to obtaining original, first-hand data;
the secondary method of data collection involves obtaining data from other sources.
2. Organization of data: This is a methodology for classification and describing the
properties of data in a summary form. Editing, coding and classification are the three
steps in the organization of data.
3. Presentation of data: In this stage the collected and organized data are presented
in some systematic order to facilitate statistical analysis. The organized data are
presented with the help of tables, diagrams and graphs.
4. Analysis of data: This stage involves extracting relevant information from the
collected data (such as the mean, median, mode, range and variance) using
mathematical and statistical tools, mainly elementary mathematical operations.
5. Interpretation of data: This stage involves drawing valid conclusions from the
analyzed data. That is, interpretation of data involves making inferences (drawing
conclusions) based on the analysis of data.
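
As a small illustration of the analysis stage, the measures named in stage 4 can be computed with Python's standard `statistics` module. The sketch below uses a made-up list of ages (`ages` is hypothetical data, not taken from these notes) and is only meant to show what "extracting relevant information" looks like in practice.

```python
import statistics

# Hypothetical sample: ages (in years) of 8 students; illustrative only.
ages = [19, 20, 20, 21, 22, 20, 23, 19]

mean_age = statistics.mean(ages)        # arithmetic mean
median_age = statistics.median(ages)    # middle value of the sorted data
mode_age = statistics.mode(ages)        # most frequent value
data_range = max(ages) - min(ages)      # largest value minus smallest value
sample_var = statistics.variance(ages)  # sample variance (divisor n - 1)

print("mean:", mean_age)
print("median:", median_age)
print("mode:", mode_age)
print("range:", data_range)
print("variance:", sample_var)
```

Interpretation (stage 5) would then start from these summaries, for example by commenting on how far the ages spread around the mean.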

1.4 Definition of some common terms

1. Population:

 It is the complete set of possible measurements for which inferences are to be
made.
 It is the totality of all subjects, measurements or individuals possessing certain
common characteristics that are being studied.
 The population to which generalization is made is called Target or Source or
Reference Population.
 The population could be finite or infinite (an imaginary collection of units).
For example
 All students in Wolkite University.
 All trees under specified climatic conditions.
 All animals fed a certain type of diet.
 Population of all males between the ages of 15 and 18.
 Population of farms having a certain type of natural fertility.
 Population of households, etc.
There are two ways of investigation: Census and sample survey.
2. Census: a complete enumeration of the population. In most real problems it
cannot be carried out, hence we take a sample.
3. Sample: A sample from a population is the set of measurements that are actually
collected in the course of an investigation. It should be selected using some
pre-defined sampling technique in such a way that it represents the population very
well. For example:
 Monthly production data of a certain factory in the past 10 years.
 Small portion of a finite population.
4. Parameter: Characteristic or measure obtained from a population.
5. Statistic: Characteristic or measure obtained from a sample.
6. Sampling: The process or method of sample selection from the population.
7. Sample size: The number of elements or observation to be included in the sample.
8. Data: the raw material of statistics. It can be obtained either by measurement or
counting.
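
The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is simulated (the weights, the seed and the sample size of 50 are all assumptions made for illustration), and simple random sampling is used only as one possible sampling technique.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: weights (kg) of 1,000 individuals.
population = [round(random.gauss(60, 8), 1) for _ in range(1000)]

# Parameter: a measure obtained from the whole population (a census).
population_mean = statistics.mean(population)

# Sampling: select the sample by simple random sampling without replacement.
sample_size = 50
sample = random.sample(population, sample_size)

# Statistic: the same measure obtained from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values are usually close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.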

1.5 Applications, Uses and Limitations of statistics

a) Applications of statistics:
• In almost all fields of human endeavor.
• Almost all human beings in their daily life are subjected to obtaining numerical
• Applicable in some process e.g. invention of certain drugs, extent of environmental
• In industries especially in quality control area.
b) Uses of statistics
The main function of statistics is to enlarge our knowledge of complex phenomena. The
following are some uses of statistics:
1. It presents facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishes a technique of comparison.
5. Estimating unknown population characteristics.
6. Formulating and testing hypotheses.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
c) Limitations of statistics:
Statistics with all its wide application in every sphere of human activity has its own
limitations. Some of them are given below.
1. Statistics is not suitable for the study of qualitative phenomena: Since statistics
deals with sets of numerical data, it is applicable only to those subjects of enquiry
that can be expressed in terms of quantitative measurements. Qualitative phenomena
such as honesty, poverty, beauty and intelligence cannot be expressed numerically, so
statistical analysis cannot be applied to them directly. Nevertheless, statistical
techniques may be applied indirectly by first reducing the qualitative expressions to
quantitative terms. For example, the intelligence of a group of students can be studied on
the basis of their marks in a particular examination.
2. Statistics does not study individuals: Statistics does not give any specific importance
to the individual items; in fact it deals with an aggregate of objects. Individual items,
when they are taken individually do not constitute any statistical data and do not serve
any purpose for any statistical enquiry.
3. Statistical laws are not exact: It is well known that the mathematical and physical
sciences are exact. Statistical laws, however, are only approximations. Statistical
conclusions are not universally true; they are true only on average.
4. Statistics can be easily misused: Statistics must be used only by experts; otherwise,
statistical methods are dangerous tools in the hands of the inexpert. The use of
statistical tools by inexperienced and untrained persons might lead to wrong
conclusions. Statistics can also be misused by quoting wrong figures.

1.6 Types of variables and Scales of measurement (level of measurements)

1) Types of variables

A variable is any characteristic of a study unit (for example, an individual) that is measurable
and/or classifiable, and that can take different values for different units. For example, for an
individual we can consider the following as variables: age, religion, ethnicity,
place of residence, weight, height, body mass index, body temperature, blood sugar level,
knowledge about a disease, etc.
Variables, depending on their quantifiability, can be classified as qualitative and
quantitative variables.
I. Qualitative Variables:
 Variables which assume non-numerical values and can not be measured.
 Refers to objects or characteristics described by a set of data.
 Often termed as categorical data.

Examples: Brand of computer (Dell, Toshiba, Samsung, etc.), gender, religious
affiliation, state of birth, ethnicity, illness status (well or ill), treatment
outcome (improved or not improved), stage of breast cancer (I, II, III, IV), etc.
II. Quantitative Variables: A quantitative variable is a characteristic that can be
measured and expressed numerically. It can be of two types:
a) Discrete Variable:
 Variables which assume a finite or countable number of possible values.
 Usually obtained by counting.
 A type of numeric variable that can only exist in whole number.
Example: number of children in family, number of students in the class, etc.
b) Continuous Variables:
 Variables which can assume any value within an interval, and therefore have
uncountably many possible values.
 Usually obtained by measurement.
Examples:
 Height
 Weight
 Age
 Blood sugar level.

2) Scales of measurement (level of measurements)

Proper knowledge about the nature and type of data to be dealt with is essential in order
to specify and apply the proper statistical method for their analysis and inferences.
Measurement scale refers to the property of value assigned to the data based on the
properties of order, distance and fixed zero.
In mathematical terms, measurement is a functional mapping from the set of objects $\{O_i\}$
to the set of real numbers $\{M(O_i)\}$. The goal of measurement systems is to structure the
rule for assigning numbers to objects in such a way that the relationship between the
objects is preserved in the numbers assigned to the objects.

The different kinds of relationships preserved are called properties of the measurement
system.
The property of order exists when an object that has more of the attribute than another
object is given a bigger number by the rule system. This relationship must hold for all
objects in the "real world".
The property of ORDER exists when, for all $i, j$: if $O_i > O_j$, then $M(O_i) > M(O_j)$.

The property of distance is concerned with the relationship of differences between
objects. If a measurement system possesses the property of distance, the unit
of measurement means the same thing throughout the scale of numbers. That is, an inch
is an inch, no matter where it falls, immediately ahead or a mile down the road.
More precisely, an equal difference between two numbers reflects an equal difference in
the "real world" between the objects that were assigned the numbers. In order to define
the property of distance in mathematical notation, four objects are required: $O_i$, $O_j$,
$O_k$, and $O_l$. The difference between objects is represented by the "-" sign; $O_i - O_j$
refers to the actual "real world" difference between object $i$ and object $j$, while
$M(O_i) - M(O_j)$ refers to the difference between the numbers.

The property of DISTANCE exists when, for all $i, j, k, l$:
if $O_i - O_j \geq O_k - O_l$, then $M(O_i) - M(O_j) \geq M(O_k) - M(O_l)$.

Fixed Zero
A measurement system possesses a rational zero (fixed zero) if an object that has none of
the attribute in question is assigned the number zero by the system of rules. The object
does not need to really exist in the "real world", as it is somewhat difficult to visualize a
"man with no height". The requirement for a rational zero is this: if objects with none of
the attribute did exist, would they be given the value zero? Defining $O_0$ as the object with
none of the attribute in question, the definition of a rational zero becomes:

The property of FIXED ZERO exists if $M(O_0) = 0$.
The property of fixed zero is necessary for ratios between numbers to be meaningful.

Scale Types

Measurement is the assignment of numbers to objects or events in a systematic fashion.

Four levels of measurement scales are commonly distinguished: nominal, ordinal,
interval, and ratio. Each possesses different properties of the measurement system.
1. Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated
above.
 Level of measurement which classifies data into mutually exclusive, all-inclusive
categories in which no order or ranking can be imposed on the data.
 No arithmetic or relational operations can be applied.
 No quantitative information is conveyed.
 Thus it only gives names or labels to various categories.
Examples:
 Political party preference (Republican, Democrat, or Other)
 Sex (Male or Female)
 Marital status (married, single, widowed, divorced)
 Country code
 Regional differentiation of Ethiopia.
2. Ordinal Scales
Ordinal scales are measurement systems that possess the property of order, but not
the property of distance. The property of fixed zero is not important if the property of
distance is not satisfied.
 Level of measurement which classifies data into categories that can be ranked.
Differences between the ranks are not meaningful.
 Arithmetic operations are not applicable but relational operations are applicable.
 Ordering is the sole property of the ordinal scale.
Examples:
 Letter grades (A, B, C, D, F).
 Rating scales (Excellent, Very good, Good, Fair, Poor).
 Military status.
3. Interval Scales
Interval scales are measurement systems that possess the properties of order and
distance, but not the property of fixed zero.
 Level of measurement which classifies data that can be ranked and differences are
meaningful. However, there is no meaningful zero, so ratios are meaningless.
 All arithmetic operations except division are applicable.
 Relational operations are also possible.
Examples:
 IQ
 Temperature in degrees Celsius or Fahrenheit.

4. Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance,
and fixed zero. The added power of a fixed zero allows ratios of numbers to be
meaningfully interpreted; e.g., the ratio of Bekele's height to Martha's height is 1.32,
whereas this is not possible with interval scales.
 Level of measurement which classifies data that can be ranked, differences are
meaningful, and there is a true zero. True ratios exist between the different units
of measure.
 All arithmetic and relational operations are applicable.
Examples:
 Weight, Height, Number of students, Age, etc.

Use of level of measurements

 Helps you decide how to interpret the data from the variable.

 Helps you decide what statistical analysis is appropriate on the values that were
assigned. For example, if a measurement is nominal, then you know that you should never
average the data.
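
Why ratios are meaningless on an interval scale but meaningful on a ratio scale can be checked numerically. The sketch below is only an illustration: the two temperatures and the two heights are made-up values, and the Celsius/Fahrenheit and centimetre/inch conversions are chosen here as convenient examples of a change of unit.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: temperature has no fixed zero (0 degrees is arbitrary).
c1, c2 = 10.0, 20.0                      # hypothetical temperatures
f1, f2 = celsius_to_fahrenheit(c1), celsius_to_fahrenheit(c2)
ratio_celsius = c2 / c1                  # looks like "twice as hot"
ratio_fahrenheit = f2 / f1               # same two temperatures, different ratio

# Ratio scale: height has a fixed zero, so ratios survive a change of unit.
h1_cm, h2_cm = 150.0, 180.0              # hypothetical heights
h1_in, h2_in = h1_cm / 2.54, h2_cm / 2.54
ratio_cm = h2_cm / h1_cm
ratio_in = h2_in / h1_in

print("temperature ratios:", ratio_celsius, ratio_fahrenheit)
print("height ratios:     ", ratio_cm, ratio_in)
```

The temperature ratio depends on the unit used, so it carries no real information, while the height ratio is the same in centimetres and in inches.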

Review Questions
1. Classify the following scales of measurement as Nominal, Ordinal, Interval or
Ratio.
a. Your score on the first statistics test as a measure of your knowledge of statistics.
b. Religion: Muslim (1), Christian (2), Pagan (3), Others (4)
c. A response to the statement "Abortion is a woman's right" where "Strongly
Disagree" = 1, "Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly
Agree" = 5, as a measure of attitude toward abortion.
d. Times for swimmers to complete a 50-meter race
e. The height of the men in the same town.
f. Regions numbers of Ethiopia (1, 2, 3 etc.)
g. The number of students in a college
2. In each statement that follows, tell whether descriptive or inferential statistics has
been used.
a. The average age of the students in your statistics class is 20.7.
b. From past figures, it has been predicted that 56% of registered voters will vote in
the next election.
c. During the thanksgiving weekend last year, 73 people died in traffic accidents in
New York.
d. According to insurance company figures, the chance of your living to age 83 is
2.1 Methods of data collection
Once it is decided what type of study is to be made, it becomes necessary to collect
information about the study, mostly in the form of data. In order to draw
valid conclusions from data, the information has to be collected in a systematic manner.
Whatever the quality of the sampling and analysis methods, a haphazardly collected dataset is
less likely to produce valuable and generalizable information.
2.1.1 Sources of data

Data may be derived from several sources. Depending on the source, data can be
classified as Primary or Secondary data.
1. Primary Data
 Data measured or collected by the investigator or the user directly from the source.
 Data gathered for the first time by the researcher for a given purpose.
Examples:
o An enquiry made of each taxpayer in a city to obtain their opinion
about the tax collecting machinery.
o The data collected in a census study.
 Two activities involved: planning and measuring.
A) Planning:
 Identify source and elements of the data.
 Decide whether to consider sample or census.
 If sampling is preferred; decide on sample size, selection method etc.
 Decide measurement procedure.
 Set up the necessary organizational structure.
B) Measuring: there are different options.
 Focus Group
 Telephone Interview
 Mail Questionnaires
 Door-to-Door Survey
 New Product Registration
 Personal Interview and
 Experiments are some of the sources for collecting the primary data.
2. Secondary Data
 Data gathered or compiled from published and unpublished sources or files.
 Usually secondary data is obtained from yearbooks, census reports, survey
reports, official records or reported experimental results.
 When our source is secondary data, check:
 The type and objective of the situations.
 That the purpose for which the data were collected is compatible with the present
problem.
 That the nature and classification of the data are appropriate to our problem.
 That there are no biases and misreporting in the published data.
For example, let's assume a researcher is interested in studying the prevalence of family
planning utilization among women of reproductive age in a given woreda. The
researcher can either conduct a survey (primary data) or use the records of family
planning clinics in the woreda (secondary data).
Note: Data which are primary for one may be secondary for the other.

2.1.2 Data collection techniques

A questionnaire is the main data collection instrument in a formal sample survey. Before
examining the steps in designing a questionnaire we need to review the types of questions
used in questionnaires. Depending on the amount of freedom given to the respondent in
offering responses, there are two basic types of questions that can be used in
questionnaires: open-ended questions and closed-ended questions.
The type of questions will be determined by the form of responses wanted, the nature of
the respondents and their ability to answer the questions.

Open-ended questions: allow the respondent to answer freely in his or her own
words.

Example: What is your favorite memory from childhood?

Closed-ended questions: A predetermined list of alternative responses is presented to the
respondent for checking the appropriate one(s). This means that the respondent's answers
are restricted to a limited range of alternatives. Closed-ended questions fall
into one of two categories: dichotomous questions and multiple-choice questions.

A dichotomous question contains two alternatives in the predetermined list of responses.

Example: Yes-no, true-false, agree-disagree, like-dislike, fair-unfair and so on.

A multiple choice question offers more than two responses in the predetermined list of
alternate responses.

Example: How many children have you ever given birth to?

a. 1-2 b. 3-4 c. 5-6 d. 7-8 e. More than 8

The major data collection techniques:-

1) Survey through interview: A quantitative approach in which a standardized
questionnaire, administered through an interview, is used to collect information.
Advantages:
 Quick and inexpensive.
 Responses from different respondents are comparable.
 Easy to quantify and analyze.
 Useful in describing quantifiable characteristics of a large population.
 Very large and representative samples are feasible.
 Standardized questions make measurement more precise.
 Participants do not need to be able to read and write to respond.
Disadvantages:
 Doesn't give opportunity to probe and explore.
 Relatively inflexible.
 Less reliable to assess behavior and attitude of respondents.
2) Survey through self-administered questionnaire: A quantitative method in which a
standardized questionnaire, filled in by the respondents themselves, is used to collect
information.
Advantages:
 Quick and inexpensive.
 Responses from different respondents are comparable.
 Useful in describing quantifiable characteristics of a large population.
 Very large and representative samples are feasible.
 Standardized questions make measurement more precise.
Disadvantages:
 Participants need to be able to read and write to respond.
 High non-response rate.
 Doesn't give opportunity to probe and explore.
 Less reliable to assess behavior and attitude of respondents.
 Relatively inflexible.
3) Focus Group Discussion (FGD): A qualitative method to obtain in-depth
information on concepts and perceptions about a certain topic through spontaneous group
discussion of approximately 6-12 persons, guided by a facilitator.
Advantages:
 Excellent approach to gather in-depth information on the attitudes and beliefs of a group
 Group dynamics might generate more ideas than individual interviews
 Provides an excellent opportunity to probe and explore
 Participants are not required to read or write
Disadvantages:
 Requires a strong facilitator to guide discussion and ensure participation by all
 Doesn't give quantitative information
 It is difficult to organize the discussion
 The analysis is relatively difficult.
4) In-depth interview: A qualitative method that relies on person-to-person discussion.
Advantages:
 Good approach to gather in-depth attitudes and beliefs from individual respondents
 Provides an excellent opportunity to probe and explore
 Participants don't need to be able to read and write to respond
Disadvantages:
 Doesn't give quantitative information.
 It is time consuming.
 The respondent may feel like "a bug under a microscope".
 The analysis is relatively difficult.
5) Observation: A qualitative method that involves critical observation and recording of
the practice (behavior, culture, etc.) of individuals or a group.
Advantages:
 Excellent approach to discover behaviors
 Provides accurate information
Disadvantages:
 Usually takes a longer time.
 Liable to "observational bias".
6) Secondary data (the use of documentary sources): A quantitative approach which
uses data already collected by others.
Advantages:
 Less resource and time consuming.
Disadvantages:
 May not give in-depth information to answer the research question
 No knowledge of the accuracy of data collection
 Can be outdated
 The researcher has limited control over the sampling method and sample size
 Less likely to give qualitative information.

2.2 Methods of data Presentation

Having collected and edited the data, the next important step is to organize it, that is, to
present it in a readily comprehensible, condensed form that helps in drawing
inferences from it. It is also necessary that like items be separated from unlike ones. The
process of arranging data into classes or categories according to similarities is technically
called classification. Classification is a preliminary step that prepares the ground for
proper presentation of data. Mainly, the purpose of classification is to divide the data into
homogeneous groups or classes.
The classification of data is generally done on a geographical, chronological, qualitative or
quantitative basis, along the following lines:
1) In geographical classification, data are arranged according to places, areas or regions.
2) In chronological classification, data are arranged according to time, i.e., weekly,
monthly, quarterly, half-yearly, annually, etc.
3) In qualitative classification, data are arranged according to attributes like sex,
marital status, educational standard, stage or intensity of disease, etc.
4) In quantitative classification, data are arranged according to a characteristic
that has been measured, like height, weight, income of persons, vitamin content of a
substance, etc.
The presentation of data is broadly classified in to the following two categories:
• Tabular presentation
• Diagrammatic and Graphic presentation.
Frequency Distribution: the organization of raw data in table form with classes and
frequencies.
Raw Data: data collected in original form.
Frequency: the number of times a certain value or class of values occurs.
The reasons for constructing a frequency distribution:
1. To organize the data in a meaningful, intelligible way
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread.
4. To enable the researcher to draw charts and graphs for the presentation of data
5. To enable the reader to make comparisons between different data set.
 There are three basic types of frequency distributions
o Categorical frequency distribution
o Ungrouped frequency distribution
o Grouped frequency distribution
1) Categorical frequency Distribution:
- Used for data that can be placed in specific categories, such as nominal or ordinal data.
Example 2.1: A social worker collected the following data on marital status for 25
persons. (M=married, S=single, W=widowed, D=divorced)


Since the data are categorical, discrete classes can be used. There are four types of
marital status: M, S, D, and W. These types will be used as classes for the distribution. We
follow the procedure below to construct the frequency distribution.
Step 1: Make a table as shown.
Step 2: Tally the data and place the result in column (2).
Step 3: Count the tally and place the result in column (3).
Step 4: Find the percentage of values in each class by using
$\% = \dfrac{f}{n} \times 100$, where $f$ = frequency of the class and $n$ = total number of values.
Percentages are not normally a part of a frequency distribution, but they can be added since
they are used in certain types of diagrams such as pie charts.
Step 5: Find the totals for columns (3) and (4).

Combining all the steps, one can construct the following frequency distribution.
Class (1)   Tally (2)   Frequency (3)   Percent (4)

M           ////        5               20
S           //// //     7               28
D           //// //     7               28
W           //// /      6               24
Total                   25              100
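
Steps 2 to 5 can also be carried out with a short Python sketch using `collections.Counter`. The raw responses of Example 2.1 are not reproduced in these notes, so the list `responses` below is an assumption: it is rebuilt from the class frequencies in the table, in an arbitrary order.

```python
from collections import Counter

# Assumed raw data, reconstructed from the frequencies in the table above
# (the original 25 responses are not listed in the notes).
responses = ["M"] * 5 + ["S"] * 7 + ["D"] * 7 + ["W"] * 6

n = len(responses)              # total number of values
frequency = Counter(responses)  # Steps 2-3: frequency f of each class

print(f"{'Class':<8}{'Frequency':>10}{'Percent':>10}")
for cls in ["M", "S", "D", "W"]:
    f = frequency[cls]
    percent = f / n * 100       # Step 4: % = f / n * 100
    print(f"{cls:<8}{f:>10}{percent:>10.0f}")

# Step 5: totals of the frequency and percent columns.
print(f"{'Total':<8}{n:>10}{100:>10}")
```

With real survey data, `responses` would simply be the list of recorded categories, and the rest of the sketch would stay the same.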

2) Ungrouped frequency Distribution:

 A table of all the potential raw score values that could possibly occur in the data,
along with the number of times each actually occurred.
 Often constructed for small data sets or for data on a discrete variable.
Constructing ungrouped frequency distribution:
• First find the smallest and largest raw score in the collected data.
• Arrange the data in order of magnitude and count the frequency.
• To facilitate counting one may include a column of tallies.
Example 2.2: The following data represent the mark of 20 students.

80 76 90 85 80
70 60 62 70 85
65 60 63 74 75
76 70 70 80 85

Construct a frequency distribution, which is ungrouped.

Step 1: Find the range, Range=Max-Min=90-60=30.
Step 2: Make a table as shown
Step 3: Tally the data.
Step 4: Compute the frequency.

Mark   Tally   Frequency

60     //      2
62     /       1
63     /       1
65     /       1
70     ////    4
74     /       1
75     /       1
76     //      2
80     ///     3
85     ///     3
90     /       1
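
The same ungrouped frequency distribution can be obtained directly from the 20 marks of Example 2.2. The following is a minimal sketch using Python's `collections.Counter`; the variable names are chosen here for illustration.

```python
from collections import Counter

# Marks of the 20 students from Example 2.2.
marks = [80, 76, 90, 85, 80,
         70, 60, 62, 70, 85,
         65, 60, 63, 74, 75,
         76, 70, 70, 80, 85]

data_range = max(marks) - min(marks)  # Step 1: Range = Max - Min
frequency = Counter(marks)            # Steps 3-4: count each distinct mark

print("Range:", data_range)
print(f"{'Mark':<6}{'Tally':<7}{'Frequency':>9}")
for mark in sorted(frequency):        # arrange in order of magnitude
    f = frequency[mark]
    print(f"{mark:<6}{'/' * f:<7}{f:>9}")
```

A quick check is that the frequencies add up to the number of observations, here 20.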
3) Grouped frequency Distribution:
 A frequency distribution when several numbers are grouped in one class.
 When the range of the data is large, the data must be grouped in to classes that
are more than one unit in width.
Definition of some common terms

 Class limits: Separate one class in a grouped frequency distribution from
another. The limits could actually appear in the data, and there are gaps between the
upper limit of one class and the lower limit of the next.
 Units of measurement (U): the distance between two possible consecutive
measures. It is usually taken as 1, 0.1, 0.01, 0.001, etc.
 Class boundaries: Separate one class in a grouped frequency distribution from
another. The boundaries have one more decimal place than the raw data and
therefore do not appear in the data. There is no gap between the upper boundary
of one class and the lower boundary of the next class.
The lower class boundary is found by subtracting 0.5U from the corresponding lower
class limit and the upper class boundary is found by adding 0.5U to the
corresponding upper class limit.
 Class width: the difference between the upper and lower class boundaries of any
class. It is also the difference between the lower limits of any two consecutive
classes or the difference between any two consecutive class marks.
 Class mark (Mid points): it is the average of the lower and upper class limits or
the average of upper and lower class boundary.
 Cumulative frequency: is the number of observations less than/more than or
equal to a specific value.
 Cumulative frequency above: it is the total frequency of all values greater than
or equal to the lower class boundary of a given class.
 Cumulative frequency blow: it is the total frequency of all values less than or
equal to the upper class boundary of a given class.
 Cumulative Frequency Distribution (CFD): it is the tabular arrangement of
class interval together with their corresponding cumulative frequencies. It can be
more than or less than type, depending on the type of cumulative frequency used.
 Relative frequency (rf): it is the frequency divided by the total frequency. This
gives the percent of values falling in that class.

 Relative cumulative frequency (rcf): it is the cumulative frequency divided by

the total frequency. Gives the percent of the values which are less than or more
than the upper class boundary.
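
The relations among class limits, class boundaries, class mark and class width can be checked with a small helper. This is a sketch only: the function name `class_summary` and the class 10 – 14 (with U = 1) are hypothetical and chosen purely for illustration.

```python
def class_summary(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary, class mark, class width)."""
    lower_boundary = lower_limit - 0.5 * unit   # subtract 0.5U from the lower limit
    upper_boundary = upper_limit + 0.5 * unit   # add 0.5U to the upper limit
    class_mark = (lower_limit + upper_limit) / 2
    class_width = upper_boundary - lower_boundary
    return lower_boundary, upper_boundary, class_mark, class_width

# Hypothetical class 10 - 14 measured in whole units (U = 1).
print(class_summary(10, 14))
```

The class mark computed from the limits equals the one computed from the boundaries, since the two 0.5U adjustments cancel.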

Guidelines for classes

1. There should be between 5 and 20 classes.

2. The class width should preferably be an odd number. For whole-number data, this
guarantees that the class midpoints are integers instead of decimals.
3. The classes must be mutually exclusive. This means that no data value can fall into
two different classes
4. The classes must be all inclusive or exhaustive. This means that all data values must
be included.
5. The classes must be continuous. There are no gaps in a frequency distribution.
Classes that have no values in them must be included (unless it is the first or last
class, which can be dropped).
6. The classes must be equal in width. The exception here is the first or last class. It is
possible to have a "below ..." or "... and above" class. This is often used with ages.

Creating a Grouped Frequency Distribution

1. Find the largest and smallest values

2. Compute the Range = Maximum - Minimum
3. Select the number of classes desired. This is usually between 5 and 20, or use Sturges'
rule k = 1 + 3.32 log n (logarithm to base 10), where k is the number of classes desired
and n is the total number of observations.
4. Find the class width by dividing the range by the number of classes and rounding up:
w = R / k. There are two things to be careful of here. You must round up, not off.
Normally 3.2 would round to 3, but in rounding up, it becomes 4. If the range
divided by the number of classes gives an integer value (no remainder), then you can
either add one to the number of classes or add one to the class width. Sometimes
you're locked into a certain number of classes because of the instructions.

5. Pick a suitable starting point less than or equal to the minimum value. The starting
point is called the lower limit of the first class. Continue to add the class width to this
lower limit to get the rest of the lower limits.
6. To find the upper limit of the first class, subtract U from the lower limit of the second
class. Then continue to add the class width to this upper limit to find the rest of the
upper limits.
7. Find the boundaries by subtracting 0.5U from the lower limits and adding 0.5U
to the upper limits. The boundaries are also half-way between the upper limit
of one class and the lower limit of the next class.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies. Depending on what you're trying to accomplish, it
may not be necessary to find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies
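
Steps 3 to 6 of the procedure above can be sketched as follows. The sketch assumes whole-number data with U = 1; the function names and the data summary (50 observations, minimum 12, maximum 71) are hypothetical. For a whole-number range, `data_range // k + 1` rounds up when there is a remainder and adds one to the width when the division is exact, which is one of the two options mentioned in step 4.

```python
import math

def sturges_classes(n):
    """Number of classes from Sturges' rule, k = 1 + 3.32 log10(n), rounded up."""
    return math.ceil(1 + 3.32 * math.log10(n))

def class_width(data_range, k):
    """Class width for a whole-number range: round R / k up, or add one if exact."""
    return data_range // k + 1

def class_limits(start, width, k, unit=1):
    """Lower and upper limits of k classes beginning at `start`."""
    lowers = [start + i * width for i in range(k)]
    uppers = [low + width - unit for low in lowers]  # next lower limit minus U
    return list(zip(lowers, uppers))

# Hypothetical summary: 50 observations, minimum 12, maximum 71.
n, minimum, maximum = 50, 12, 71
k = sturges_classes(n)
w = class_width(maximum - minimum, k)
for low, high in class_limits(minimum, w, k):
    print(low, "-", high)
```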
Example 2.3: Construct a frequency distribution for the following data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27

Step 1: Find the highest and the lowest values: H = 39, L = 6.
Step 2: Find the range: R = H - L = 39 - 6 = 33.
Step 3: Select the number of classes desired using Sturges' formula:
k = 1 + 3.32 log(20) = 5.32, which is rounded up to 6.
Step 4: Find the class width: w = R/k = 33/6 = 5.5, which is rounded up to 6.
Step 5: Select the starting point; let it be the minimum observation.
6, 12, 18, 24, 30, 36 are the lower class limits.
Step 6: Find the upper class limits.
E.g. the first upper class limit = 12 - U = 12 - 1 = 11.
11, 17, 23, 29, 35, 41 are the upper class limits.
So combining step 5 and step 6, one can construct the following classes.
Class limits
6 – 11
12 – 17
18 – 23
24 – 29
30 – 35
36 – 41
Step 7: Find the class boundaries.
E.g. for class 1: lower class boundary = 6 - U/2 = 5.5,
upper class boundary = 11 + U/2 = 11.5.
• Then continue adding the class width to both boundaries to obtain the remaining
boundaries; one obtains the following classes.
Class boundary
5.5 – 11.5
11.5 – 17.5
17.5 – 23.5
23.5 – 29.5
29.5 – 35.5
35.5 – 41.5
Step 8: Tally the data.
Step 9: Write the numeric values for the tallies in the frequency column.
Step 10: Find the cumulative frequencies.
Step 11: Find the relative frequencies and/or relative cumulative frequencies.
The complete frequency distribution is as follows:

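Class limit   CB            CM     Tally      Freq   LCF   GCF   RF     LRCF
6 – 11        5.5 – 11.5    8.5    //         2      2     20    0.10   0.10
12 – 17       11.5 – 17.5   14.5   //         2      4     18    0.10   0.20
18 – 23       17.5 – 23.5   20.5   ///// //   7      11    16    0.35   0.55
24 – 29       23.5 – 29.5   26.5   ////       4      15    9     0.20   0.75
30 – 35       29.5 – 35.5   32.5   ///        3      18    5     0.15   0.90
36 – 41       35.5 – 41.5   38.5   //         2      20    2     0.10   1.00
(CB = class boundary, CM = class mark, LCF = less than cumulative frequency, GCF = greater
than cumulative frequency, RF = relative frequency, LRCF = less than relative cumulative frequency)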
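
The frequency columns of Example 2.3 can be reproduced with a short script. This is a sketch, assuming the class limits found in Steps 5 and 6 are typed in by hand; the variable names are illustrative.

```python
from itertools import accumulate

# Data of Example 2.3 and the class limits from Steps 5 and 6.
data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
limits = [(6, 11), (12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]
n = len(data)

# Frequency: number of observations between the limits of each class.
freq = [sum(low <= x <= high for x in data) for low, high in limits]

# Less than cumulative frequency: running total of the frequencies.
lcf = list(accumulate(freq))

# Greater than cumulative frequency: values at or above each lower boundary.
gcf = [n - (c - f) for c, f in zip(lcf, freq)]

rf = [f / n for f in freq]      # relative frequency
lrcf = [c / n for c in lcf]     # less than relative cumulative frequency

for (low, high), f, c, g, r, rc in zip(limits, freq, lcf, gcf, rf, lrcf):
    print(f"{low:2d} - {high:2d}  {f:2d}  {c:2d}  {g:2d}  {r:.2f}  {rc:.2f}")
```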

Diagrammatic and Graphic presentation of data

-These are techniques for presenting data in visual displays using geometric figures and pictures.
Importance:
• They have greater attraction.
• They facilitate comparison.
• They are easily understandable.

Diagrammatic presentation of data

Diagrams are appropriate for presenting discrete as well as qualitative data.
The three most commonly used diagrammatic presentations for discrete as well as
qualitative data are:
1) Pie chart 2) Pictogram 3) Bar chart

Pie chart
A pie chart is a circular chart divided into sectors, illustrating relative magnitudes or
frequencies of classes of a given variable. A pie chart usually represents categorical data, but it
can also be used for discrete quantitative data. The angle of each sector has to be
proportional to the relative frequency of a given class:
Angle of sector = (value of the part / the whole quantity) × 360°
The corresponding percentage is (value of the part / the whole quantity) × 100.
Example 2.4: Draw a suitable diagram to represent the following population in a town.
Men Women Girls Boys
2500 2000 4000 1500
Step 1: Find the percentage.
Step 2: Find the number of degrees for each class.
Step 3: Using a protractor and compass, graph each section and write its name and
corresponding percentage.

Class Frequency Percent Degree

Men 2500 25 90
Women 2000 20 72
Girls 4000 40 144
Boys 1500 15 54
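
Steps 1 and 2 of Example 2.4 can be sketched as below. The percentage of each class is its share of the total multiplied by 100, and the sector angle is the same share multiplied by 360 degrees. The dictionary names are illustrative.

```python
# Population of the town in Example 2.4.
population = {"Men": 2500, "Women": 2000, "Girls": 4000, "Boys": 1500}
total = sum(population.values())

# Step 1: percentage of each class; Step 2: degrees of each sector.
percent = {group: 100 * value / total for group, value in population.items()}
degrees = {group: 360 * value / total for group, value in population.items()}

for group in population:
    print(f"{group:6s} {population[group]:5d} {percent[group]:5.0f} {degrees[group]:5.0f}")
```

The angles add up to 360 degrees, which is a quick check before drawing the sectors.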


[Figure: pie chart of the town population in Example 2.4]

Pictogram
In a pictogram (or pictorial diagram), data are presented with the help of pictures. The
magnitudes of the variable are shown with pictures that depict the variable
approximately, and each symbol in the picture represents a fixed quantity of the variable.
Bar Charts:
- A set of bars (thick lines or narrow rectangles) representing some magnitude over
time or space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts, the most common being:
• Simple bar chart
• Component or subdivided bar chart
• Multiple bar chart

Simple Bar Chart

-They are used to display data on one variable.
-They are thick lines (narrow rectangles) having the same breadth. The magnitude of a
quantity is represented by the height/length of the bar.
Example 2.5: The following data represent the sales by product, 1957-1959, of a given
company for three products A, B and C.

Product   Sales ($) in 1957   Sales ($) in 1958   Sales ($) in 1959
A         12                  14                  18
B         24                  21                  18
C         24                  35                  54

[Figure: simple bar chart of sales ($) by product in 1957: A = 12, B = 24, C = 24]
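
A simple bar chart can be imitated in plain text, with one row per product and a bar whose length is proportional to the sales. This is only a rough sketch: the helper `bar_rows` and the scale of one `#` per 2 dollars are assumptions made for illustration.

```python
# Sales in 1957 from Example 2.5.
sales_1957 = {"A": 12, "B": 24, "C": 24}

def bar_rows(values, scale=2):
    """One text row per category; each '#' stands for `scale` units."""
    return [f"{label} | {'#' * (v // scale)} {v}" for label, v in values.items()]

for row in bar_rows(sales_1957):
    print(row)
```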


Component Bar chart

-When there is a desire to show how a total (or aggregate) is divided into its component
parts, we use a component bar chart.
-The bars represent the total value of a variable, with each total broken into its component
parts; different colours or patterns are used to identify the parts.
Example 2.6:
Draw a component bar chart to represent the sales by product from 1957 to 1959.


[Figure: component bar chart "Sales by product in 1957-1959"; x-axis: years of production (1957, 1958, 1959); y-axis: sales in $; bar segments: product A, product B, product C]
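
In a component bar chart, each segment starts where the previous one ends, so the quantities needed for drawing are the running totals within each year. The sketch below computes them for the sales data of Example 2.5; stacking the products in the order A, B, C is an assumption, and the variable names are illustrative.

```python
# Sales by product for 1957, 1958 and 1959 (Example 2.5).
years = [1957, 1958, 1959]
sales = {
    "A": [12, 14, 18],
    "B": [24, 21, 18],
    "C": [24, 35, 54],
}

bottoms = [0] * len(years)   # height reached so far in each year's bar
segments = {}                # product -> list of (start, end) per year
for product, values in sales.items():
    segments[product] = [(b, b + v) for b, v in zip(bottoms, values)]
    bottoms = [b + v for b, v in zip(bottoms, values)]

totals = bottoms             # full bar height = total sales of the year
for year, total in zip(years, totals):
    print(year, total)
```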

Multiple Bar charts

- These are used to display data on more than one variable.
- They are used for comparing different variables at the same time.
Draw a multiple bar chart to represent the sales by product from 1957 to 1959.

[Figure: multiple bar chart "Sales by Product in 1957-1959"; x-axis: years of production (1957, 1958, 1959); y-axis: sales in $; bars: product A, product B, product C]

Example 2.6: Draw a diagram presenting the sales by product in 1958, assuming that there was a
product D whose sales in 1958 were $100000.

Graphical Presentation of data

The histogram, frequency polygon and cumulative frequency graph (Ogive) are the most
commonly applied graphical representations for continuous data.

Procedures for constructing statistical graphs:

• Draw and label the X and Y axes.
• Choose a suitable scale for the frequencies or cumulative frequencies and label it on
the Y axis.
• Represent the class boundaries for the histogram or Ogive, or the midpoints for the
frequency polygon, on the X axis.
• Plot the points.
• Draw the bars or lines to connect the points.


Histogram:
A histogram is a graph that displays the data by using vertical bars of various heights to
represent frequencies. Class boundaries are placed along the horizontal axis. Class marks and
class limits are sometimes used as the quantities on the X axis. Unlike a bar graph, in a
histogram the bars must be adjacent.

Example 2.7: The following table summarizes the Biostatistics mid-exam scores of 38
students, marked out of 35.

If we want to draw a histogram for these data, it would look like this:

Frequency Polygon:
A frequency polygon depicts a frequency distribution for discrete or continuous numeric
data. Frequency polygons are a graphical device for understanding the shapes of
distributions. A histogram can easily be changed to a frequency polygon by joining the
midpoints of the tops of the adjacent rectangles of the histogram with straight lines. It is
also possible to draw a frequency polygon without drawing a histogram.
Example 2.8: The following frequency distribution represents the ages (in years) of 60
patients at a psychiatric counseling center.

Then we have to identify the midpoint of each interval.

Finally, we have to plot the midpoints (on the X axis) against the respective frequency of each
class (on the Y axis) and connect adjacent points with straight lines.
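
The points of a frequency polygon can be computed before any drawing is done. The sketch below uses the class boundaries and frequencies of Example 2.3 (an assumption made for illustration; these are not the data of Example 2.8), and the variable names are illustrative.

```python
# Class boundaries and frequencies of Example 2.3.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [2, 2, 7, 4, 3, 2]

# Midpoint of each class: average of its lower and upper boundaries.
class_marks = [(low + high) / 2 for low, high in zip(boundaries, boundaries[1:])]

# Points to plot: (midpoint on the X axis, frequency on the Y axis).
polygon_points = list(zip(class_marks, frequencies))
for x, y in polygon_points:
    print(x, y)
```

Joining consecutive points with straight lines gives the polygon.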

Ogive (cumulative frequency polygon)

An Ogive is a graph showing the cumulative frequency (less than or more than type) plotted
against the upper or lower class boundaries, respectively. That is, class boundaries are
plotted along the horizontal axis and the corresponding cumulative frequencies are plotted
along the vertical axis. The points are joined by a free-hand curve.
Example: Draw an Ogive (less than type) for the data of Example 2.3.
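
The points of the less than Ogive for Example 2.3 can be sketched as follows: each upper class boundary is paired with the cumulative frequency up to that boundary. Starting the curve at the lowest boundary with a cumulative frequency of 0 is a common convention and is assumed here; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies of Example 2.3.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
frequencies = [2, 2, 7, 4, 3, 2]

# Less than cumulative frequency at each upper class boundary.
less_than_cf = list(accumulate(frequencies))

# Assumed starting point (lowest boundary, 0), then one point per class.
ogive_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], less_than_cf))
for x, y in ogive_points:
    print(x, y)
```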

