Original Article

Sample size estimation and sampling techniques for selecting a representative sample
Aamir Omair
Department of Medical Education, Research Unit, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia

Introduction: The purpose of this article is to provide a general understanding of the concepts of sampling as applied to health-related research.
Sample Size Estimation: It is important to select a representative sample in quantitative research in order to be able to generalize the results to the target population. The sample should be of the required size and must be selected using an appropriate probability sampling technique. There are many hidden biases which can adversely affect the outcome of the study. Important factors to consider for estimating the sample size include the size of the study population, the confidence level, the expected proportion of the outcome variable (for categorical variables) or the standard deviation of the outcome variable (for numerical variables), and the required precision (margin of accuracy) of the study. The greater the precision required, the larger the required sample size.
Sampling Techniques: The probability sampling techniques applied in health-related research include simple random sampling, systematic random sampling, stratified random sampling, cluster sampling, and multistage sampling. These are preferred over the nonprobability sampling techniques because they allow the results of the study to be generalized to the target population.

Keywords: Sample, sample size, sampling techniques

Address for correspondence: Dr. Aamir Omair, Department of Medical Education, Research Unit, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia.

INTRODUCTION

Selecting a sample that is representative of the general population is an important part of quantitative research. One of the major reasons that articles are rejected by good quality peer-reviewed journals is a nonrepresentative sample or an inadequate sample size.[1] The results of a poorly selected sample that is different from the target population cannot be applied to the general population.[2] A smaller than required sample size may not have the appropriate power to identify significant differences or associations that may be present in the target population. The purpose of this article is to provide an overview of the importance of sampling in health research and to provide the readers with useful tips and resources for selecting a representative sample.

The first step in understanding the sampling process is to be familiar with the terminology [Table 1]. A sample is a subset of the total population that is of interest for the study topic. This “total” population is called the target population, to which the results of the study can be generalized.[3] For example, the outcomes of a study based on patients admitted to a tertiary care hospital in a major city cannot be generalized to patients presenting with the same condition to other types of health care facilities in smaller towns. The sample itself is selected from a section of the target population that is accessible to the researcher, which is called the study population.[4] The study population may be as simple as a list of patients admitted with a certain disease, or it may be as loosely defined as all patients visiting any health care facility with different signs and symptoms.

Even if the entire available study population is selected, it is important to keep in mind that the study population may still be inherently different from the target population.[5] Patients who present with a specific disease to a tertiary health care facility are not representative of all the patients with that disease in that area. The difference may be in the severity of the disease or even in the demographics of the
patients depending on the type of health care facility, that is, Ministry of Health, military hospital, private hospital, etc. Patients who present to a specific health care center may be different from those who go to another health center or another health care provider. Hence, it is not advisable to generalize the results from a single hospital-based study to the whole city, let alone the entire country.[6,7] Another important point to consider is that people who agree to participate in the study may be different from the nonresponders. In general, responders tend to be more health conscious and more literate, or they may be more likely to have a chronic condition as compared to an acute exacerbation of the disease. Hence, the results of the study may be different from the outcome in the general population either in a positive or negative direction, depending on how the responders are different from the nonresponders.[8]

It is important to show that the selected sample is representative of the study and the target population with regards to the demographic and other relevant characteristics that may affect the outcome of the study.[9] For example, in a study comparing the outcomes of diabetic patients managed by an endocrinologist with those managed by family physicians, it is important to ensure that the two groups have the same socioeconomic characteristics with regards to age, gender, income, and education, since all of these are related to the outcome.[10] It is also important to consider the severity and duration of disease, since patients presenting to the endocrinologist may be more likely to already have complications due to diabetes mellitus. It is recommended to obtain the relevant demographic and background information from the responders to demonstrate that they are representative of the target/study population.[10] Additional information that may be easily obtained should also be collected about the nonresponders/lost to follow-up cases, like area of residence, body mass index (BMI), smoking status, etc. This will be useful to demonstrate that the responders are similar to the nonresponders with regards to these background variables.[11]

Table 1: Sampling framework – from target population to the sample
Sampling unit | Description
Target population | This is a larger population – the results from a representative sample can be generalized to this level
Study population | This is the accessible part of the target population from where the sample is selected
Sampling frame | This is a list of all the members in the study population (may not be available in all cases)
Sample | Members of the study population who are selected for the study – should be representative of the study and target population

SAMPLE SIZE ESTIMATION

A sample must be of the required size in order to have the required degree of accuracy in the results as well as to be able to identify any significant difference/association that may be present in the study population.[12] Determining the minimum required sample size for achieving the main objectives of the study is of prime importance for all studies but is generally neglected by most novice researchers. A common practice is to select all the cases that are available (consecutive sampling) in a given period of time or to select a sample size based on a previous study.[13] Another practice is to select a sample of 50 or 100 patients depending upon the time and resources available.[14] While the above approaches may be adequate in some cases, they are generally not appropriate, especially for studies which require the comparison of two or more groups with respect to one or more outcomes of interest.

The factors that need to be considered when determining the required sample size include the size of the study population (from which the sample is to be selected), the confidence level (generally set at 95%), the expected prevalence or variance of the main outcome variable that is being studied, and the margin of error/accuracy that is acceptable for the study.[12,15] In studies comparing two or more groups, the power of the study is generally set at 80%, and additional information regarding the expected difference between the two groups will also be required.[16] Nowadays, it is not necessary to look up difficult formulae and go through complicated calculations in order to determine the required sample size. There are a number of free online software tools and easily accessible websites like Open-Epi,[17] RaoSoft,[18] Pi-face,[19] etc., which can estimate the required sample size for a number of permutations of the estimated parameters for the study population.

The researcher does need to do some preparation in advance before estimating the required sample size. The simplest scenario is a single sample study, where the prevalence of a specific variable is required in the study population, e.g., prevalence of diabetes mellitus or its complications. The additional information needed to determine the required sample size includes the estimated size of the study population (if very large then use 20,000), the expected prevalence of the main variable (if unknown then use 50%), and the required margin of accuracy (generally set at 10% or 5%).[20] The margin of accuracy describes how close the sample result is required to be to the actual population value; the more precise the required results, the greater is the sample size required. Generally, for an expected prevalence of around 50% for the outcome variable,
a margin of accuracy of ±10% requires a sample size of around 100, which increases to around 400 for an accuracy of ±5% and 10,000 for ±1% margin of accuracy.

When determining the sample size for estimating the mean value of a numerical variable (e.g., BMI, cholesterol level, etc.), the additional information required is the expected variance of that variable in the target population.[21] This information can be obtained from the literature review of similar studies in the form of the standard deviation (SD) for the required variable. The higher the SD, the greater will be the required sample size. In case the SD is not known for the target population, it can be estimated by taking the difference between the estimated “highest” and “lowest” values in the population and dividing it by four (±2 SD on either side of the mean for the “normal” distribution).[22] For example, the BMI for a group of diabetics is expected to have a high value of 48 and a low value of 16 kg/m². Hence, the “normal” range is 48 − 16 = 32, which gives an estimated value for the SD of ±8 (32 divided by 4). The other information required for determining the sample size is the accuracy of the estimated mean, that is, how close it should be to the actual population mean.[23] In the above example, for BMI, the accuracy can be set as ±1, ±2 or ±4 kg/m²; the general rule is that the more precise the required accuracy, the greater is the required sample size.[24] A summary of the information required for estimating the sample size is given in Table 2.

The requirements for determining the sample size for comparing two (or more) groups become more complex, since estimates are also needed for the expected prevalence in both groups (for categorical variables) or the expected difference of means (for numerical variables). But the basic rule is the same: the greater the variability of the variable under study or the more precise the required accuracy, the greater is the required sample size.[12] Table 3 shows the estimated sample sizes for a categorical variable (hypertension) and a numerical variable (systolic blood pressure) for comparing these variables between smokers and nonsmokers for different expected differences between the groups. It is up to the researcher to select the required criteria according to the study objectives and the available resources. It should be kept in mind that these are all based on estimates, and if the sample results are found to have more variability than was used in the estimation, then the study may be underpowered and real differences may fail to reach statistical significance. If there is provision for doing a pilot study, then the estimated prevalence or SD can be more accurately determined based on a smaller sample from within the study

Table 2: Information required for estimating the required sample size

Categorical outcome variable | Numerical outcome variable
Estimating proportion (%) for “one sample” | Estimating mean value for “one sample”
Estimated size of study population | Estimated size of study population
Confidence level (usually set at 95%) | Confidence level (usually set at 95%)
Anticipated proportion of outcome variable in the target population (if not known then set at 50%) | Expected SD of the outcome variable in the target population
Precision required (i.e., ± % difference in sample result from actual population value) | Acceptable margin of error (i.e., ± difference of the sample mean from the population mean)
Comparing a proportion between two samples | Comparing the mean between two samples
Confidence level (usually set at 95%) | Confidence level (usually set at 95%)
Power (usually set at 80%) | Power (usually set at 80%)
Ratio of the two groups (e.g., 1:1) | Ratio of the two groups (e.g., 1:1)
Expected proportion of the outcome variable in each of the two groups | Expected mean value in the two groups or the expected difference in the mean value
- | Expected SD/variance of the outcome variable in each group
SD: Standard deviation
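
The one-sample inputs that Table 2 lists for a categorical outcome can be turned into a number with the usual normal-approximation formula, n0 = z² p (1 − p) / e², optionally reduced by a finite population correction. The article does not say which formula the cited calculators use, so the formula choice, the function name `sample_size_proportion`, and its parameter names are illustrative assumptions. A minimal sketch:

```python
import math
from statistics import NormalDist


def sample_size_proportion(expected_p=0.5, margin=0.05, confidence=0.95,
                           population=None):
    """Minimum n to estimate a proportion within +/- margin (as fractions).

    Assumed formula (not stated in the article): n0 = z^2 * p * (1 - p) / e^2,
    optionally shrunk by a finite population correction.
    """
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * expected_p * (1 - expected_p) / margin ** 2
    if population is not None:
        # Finite population correction for a study population of known size.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    for margin in (0.10, 0.05, 0.01):
        print(margin,
              sample_size_proportion(0.5, margin),
              sample_size_proportion(0.5, margin, population=20_000))
```

Without the correction, an expected prevalence of 50% gives values close to the article's "around 100", "around 400", and "10,000" for margins of ±10%, ±5%, and ±1%. With a study population of 20,000 the correction makes little difference at ±10% or ±5% but lowers the ±1% figure noticeably.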

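For a numerical outcome, the BMI example in the text (highest 48, lowest 16 kg/m², SD estimated as the range divided by four) can be carried through to a sample size. The sketch below assumes the common formula n = (z × SD / e)², which the article does not state explicitly, and ignores the size of the study population; `sd_from_range` and `sample_size_mean` are hypothetical helper names.

```python
import math
from statistics import NormalDist


def sd_from_range(highest, lowest):
    """Rough SD estimate: the expected range spans about 4 SDs (mean +/- 2 SD)."""
    return (highest - lowest) / 4


def sample_size_mean(sd, margin, confidence=0.95):
    """Minimum n to estimate a mean within +/- margin (same units as sd).

    Assumed formula (not stated in the article): n = (z * sd / margin)^2.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / margin) ** 2)


if __name__ == "__main__":
    sd = sd_from_range(48, 16)      # BMI example from the text: 8 kg/m^2
    for margin in (1, 2, 4):        # required accuracy in kg/m^2
        print(margin, sample_size_mean(sd, margin))
```

Halving the margin roughly quadruples the required sample size, which is the "more precision, larger sample" rule in numerical form.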
Table 3: Estimated sample sizes required for varying expected differences between two samples (nonsmokers versus smokers) for a categorical and a numerical variable
Confidence level = 95%, Power = 80%, Ratio of nonsmokers:smokers = 1:1

Hypertension (%)
Nonsmokers (%) | Smokers (%) | Expected difference (%) | Required sample size
20 | 30 | 10 | 315+315 (630)
20 | 35 | 15 | 150+150 (300)
20 | 40 | 20 | 90+90 (180)

Systolic blood pressure (mmHg)
Nonsmokers (SD: ±15) | Smokers (SD: ±20) | Expected difference (mmHg) | Required sample size
120 | 125 | 5 | 200+200 (400)
120 | 130 | 10 | 50+50 (100)
120 | 135 | 15 | 25+25 (50)

SD: Standard deviation
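
The figures in Table 3 can be approximated with standard two-group formulas. The sketch below uses a continuity-corrected formula for two proportions and the usual formula for two means with unequal SDs, for equal group sizes. Which formulas produced the article's rounded figures is not stated, so this choice and the function names are assumptions, and small differences from the table are expected.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def _critical_values(confidence, power):
    """Return (z_alpha, z_beta) for a two-sided test."""
    return (_NORMAL.inv_cdf(1 - (1 - confidence) / 2),
            _NORMAL.inv_cdf(power))


def n_per_group_proportions(p1, p2, confidence=0.95, power=0.80):
    """Per-group n to detect p1 vs p2 (continuity-corrected, 1:1 ratio)."""
    z_a, z_b = _critical_values(confidence, power)
    d = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / d ** 2
    # Continuity correction inflates the uncorrected n slightly.
    n_cc = n / 4 * (1 + math.sqrt(1 + 4 / (n * d))) ** 2
    return math.ceil(n_cc)


def n_per_group_means(sd1, sd2, difference, confidence=0.95, power=0.80):
    """Per-group n to detect a given difference in means (1:1 ratio)."""
    z_a, z_b = _critical_values(confidence, power)
    return math.ceil((z_a + z_b) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2)


if __name__ == "__main__":
    for smokers in (0.30, 0.35, 0.40):      # hypertension, nonsmokers = 20%
        print(smokers, n_per_group_proportions(0.20, smokers))
    for diff in (5, 10, 15):                # systolic BP difference in mmHg
        print(diff, n_per_group_means(15, 20, diff))
```

Both functions show the same pattern as the table: doubling the expected difference cuts the required sample size to roughly a quarter.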

population for determining the required sample size more accurately.[25]

SAMPLING TECHNIQUES

The other important issue related to sampling is selecting the required sample size in such a manner that the sample is representative of the study population.[7] It is a common pitfall to opt for the easier option of convenience sampling, where “all” the available persons in the study population are selected for the study until the required sample size is reached. This is nonprobability sampling, where the sample is less likely to be representative of the study population due to inherent biases in the sampling process.[13] Other forms of nonprobability sampling include purposive sampling, quota sampling, and snowball sampling, where the sample is selected according to some predetermined criteria. These types of sampling techniques are more appropriate for small-scale studies which are not meant to be generalized to a larger population.[13]

The more relevant sampling technique is called “probability sampling” or “random sampling.”[26] It is important to note here that the word “random” as used in this context is different from its normal everyday usage. It is misleading to state that the sample was chosen “at random” from all the patients coming to the outpatient clinic when no formal random selection procedure was used. In order to be classified as random or probability sampling, every person in the study population must have an equal or known probability of being included in the sample.[7] It is quite common to overlook some hidden biases in the sampling process which adversely affect the outcome of the study. For example, suppose a study is conducted to determine the satisfaction of patients coming to a health care center and the decision is to sample every third patient coming out of the center. Apparently, this seems to be “unbiased” if every third person is selected accordingly. But one hidden factor is related to the outcome of the study, that is, satisfaction with the care provided. A person who is not satisfied with the health care provided would be unlikely to return to the center or would come only once a month, while a person who is satisfied would be returning more frequently, maybe 2-3 times a month. Hence, it is quite likely that the satisfaction survey shows a more positive result than the actual perception.[27] One way to account for this hidden bias is to interview only “new” patients who are visiting the clinic for the 1st time, or it may be sufficient to just ask the respondent how many times s/he has visited the clinic in the last month or year.[28] A similar bias may be associated with random digit dialing for a phone survey. Apparently, the computer dials a number randomly, so there should be no bias in the sample selection. Actually, there is still a hidden bias: people who have two phones (or dual SIM phones) are twice as likely to be selected as the majority of people who have only one number.[29] People with more than one phone are more likely to have a higher income, so this may bias any study which asks about their perceptions of health care insurance or even about choosing between prepaid and postpaid mobile phone services. This type of bias can be controlled for by recognizing it at the planning stage of the survey and including a question on “How many phone numbers do you have?” in the survey. The answer can be used to appropriately weight the responses of such respondents at the final analysis stage.[29]

The types of probability sampling methods include simple random sampling, systematic random sampling, and stratified random sampling [Table 4]; these three methods are more relevant when a sampling frame (list of the people in the study population) is available.[7] Simple random sampling is as simple as picking chits (names or numbers written on pieces of paper) from a box for a small study population of up to 30-50 people. For a larger study population, computer-generated random numbers can be used to select the respondents accordingly, e.g., the nth person coming out of a clinic or the nth person from each household.[7] Systematic random sampling is applicable when the study population is relatively large (100 or more) and a list is available of all the members, e.g., employees in a hospital, medical students in a class, or even beds in a hospital. The total number of subjects in the list is divided by the required sample size to obtain the “skip number,” e.g., to select 25 out of a list of 200 the skip number will be every 8th person on the list (200/25 = 8). The next step is to choose a number randomly between 1 and 8, which will be the first person selected, and then systematically select every 8th person from the list till the end of the list is reached, e.g., 3, 11, 19, 27, …, 195. It is important to remember that the first person should be chosen randomly; arbitrarily selecting the 1st person or the 8th person on the list will lead to zero probability of the other persons in the list being selected.[30] Stratified random sampling is a form of systematic random sampling with the addition that the list is stratified (arranged by categories) according

Table 4: List of different probability and nonprobability sampling methods
Probability sampling methods | Nonprobability sampling methods
Simple random sampling | Convenience sampling
Systematic random sampling | Consecutive sampling
Stratified random sampling | Quota sampling
Cluster sampling | Judgmental/purposive sampling
Multistage sampling | Snowball sampling
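
The first two probability methods in Table 4 can be sketched in a few lines once a sampling frame is available as a list. The example below follows the skip-number procedure described in the text (200/25 = 8, random start between 1 and 8). The function names, the `seed` parameter, and the use of integer division when the list length is not an exact multiple of the sample size are illustrative assumptions.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n members from the sampling frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=None):
    """Pick a random start within the first skip interval, then every skip-th member."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    skip = len(frame) // n              # the "skip number", e.g. 200 // 25 = 8
    start = rng.randint(1, skip)        # 1-based position of the first pick
    positions = range(start, len(frame) + 1, skip)
    # Trim in case the frame size is not an exact multiple of n.
    return [frame[pos - 1] for pos in positions][:n]


if __name__ == "__main__":
    frame = list(range(1, 201))         # e.g. 200 employees numbered 1..200
    print(simple_random_sample(frame, 25, seed=1))
    print(systematic_sample(frame, 25, seed=1))
```

Because the start is drawn at random, every member of the list has the same 1-in-8 chance of selection, which is the property lost when the first person is chosen arbitrarily.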

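For the random digit dialing example, one possible way to use the answer to “How many phone numbers do you have?” is to weight each respondent by the inverse of that count, so that someone reachable on two numbers counts half as much. The article only says the responses should be weighted appropriately; the inverse-count weight and the function name `weighted_proportion` are assumptions made for this sketch.

```python
def weighted_proportion(responses):
    """Weighted share of 'yes' answers.

    responses: iterable of (answered_yes, phone_numbers) pairs, where
    phone_numbers is the respondent's reported count of phone numbers.
    Assumed weight (not specified in the article): 1 / phone_numbers.
    """
    total_weight = 0.0
    yes_weight = 0.0
    for answered_yes, phone_numbers in responses:
        weight = 1 / phone_numbers
        total_weight += weight
        if answered_yes:
            yes_weight += weight
    return yes_weight / total_weight


if __name__ == "__main__":
    # Hypothetical survey: two-phone respondents say yes, one-phone say no.
    survey = [(True, 2), (True, 2), (False, 1), (False, 1)]
    unweighted = sum(1 for yes, _ in survey if yes) / len(survey)
    print(unweighted, weighted_proportion(survey))
```

In this made-up data the unweighted share of "yes" is one half, while the weighted share is lower because the over-represented two-phone respondents are scaled down.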
to a predetermined characteristic, e.g., gender, level of employees, or class in medical college. After arranging the list according to the specified criterion, the same process of selecting every nth person is followed as in systematic random sampling.[30] The stratified random sampling technique ensures that the sample contains approximately the same proportion of the specified criterion as in the study population. This is important when the outcome variable that is being studied is directly related to that particular characteristic, e.g., gender for a study of smoking, or level of employees for a study of employee satisfaction. The other two probability sampling methods, cluster sampling and multistage sampling, are more appropriate for community-based or large-scale surveys and will not be described in detail in this article. More information on these two methods can be obtained from other detailed texts on sampling.[7,9,10,15,30] The issue of avoiding bias due to nonresponse in sampling will be discussed in detail in the next article on data collection methods.

CONCLUSION

Sampling is an important consideration in all quantitative research which aims to generalize the findings of the study to a larger population. It is essential to have the required sample size as well as to select a representative sample using the appropriate sampling technique.

REFERENCES

1. Bordage G. Reasons reviewers reject and accept manuscripts: The strengths and weaknesses in medical education reports. Acad Med 2001;76:889-96.
2. University of Texas. Common Mistakes in Using Statistics: Spotting and Avoiding Them. Available from: http://www. [Last accessed on 2014 Sep 24; Last accessed on 2012 Aug 28].
3. Easton VJ, McColl JH. Statistics Glossary. Available from: [Last accessed on 2014 Sep 24].
4. Trochim WM. Research Methods Knowledge Base: Sampling Terminology, [Oct 28th, 2008]. Available from: http://www. [Last accessed on 2014 Sep 24].
5. Freedman DA. Sampling. Available from: http://www.stat. [Last accessed on 2014 Sep 24].
6. Population and Samples: The Principle of Generalization. Available from: [Last accessed on 2014 Sep 24].
7. Schutt RK, Engel RJ. Sampling. In: The Practice of Research in Social Work. 3rd ed., Ch. 5. Washington DC: SAGE Publications Inc.; 2008. Available from: data/24480_Ch5.pdf. [Last accessed on 2014 Sep 24].
8. Singer E. Introduction: Nonresponse bias in household surveys. Pub Opinion Quart 2006;70:637-45. Available from: http://www. [Last accessed on 2014 Sep 24].
9. Watt JH, van den Berg S. Sampling. In: Research Methods for Communication Science. Ch. 6. 2002. Available from: http:// [Last accessed on 2014 Sep 24].
10. Ross KE. Sample design for educational survey research. In: Quantitative Research Methods in Educational Planning. Paris: UNESCO International Institute for Educational Planning; 2005. p. 4. Available from: TR_Mods/Qu_Mod3.pdf. [Last accessed on 2014 Sep 24].
11. Barclay S, Todd C, Finlay I, Grande G, Wyatt P. Not another questionnaire! Maximizing the response rate, predicting non-response and assessing non-response bias in postal questionnaire studies of GPs. Fam Pract 2002;19:105-11. Available from: content/19/1/105.full.pdf. [Last accessed on 2014 Sep 24].
12. Israel GD. Determining Sample Size, [April 2009]. Available from: [Last accessed on 2014 Sep 25].
13. Explorable Psychology Experiments. Non-Probability Sampling, [17 May, 2009]. Available from: non-probability-sampling. [Last accessed on 2014 Sep 25].
14. Science Buddies. Sample Size: How Many Survey Participants Do I Need? Available from: science-fair-projects/project_ideas/Soc_participants.shtml. [Last accessed on 2014 Sep 25].
15. Kadam P, Bhalerao S. Sample size calculation. Int J Ayurveda Res 2010;1:55-7.
16. Andrews University. Applied statistics – Lesson 11: Power and sample size, [28 Jul, 2005]. Available from: http://www.andrews.edu/~calkins/math/edrm611/edrm11.htm. [Last accessed on 2014 Sep 25].
17. Dean AG, Sullivan KM, Soe MM. Open Epi: Open Source Epidemiological Statistics for Public Health, Version. Available from: htm. [Last accessed on 2014 Sep 25; Last updated on 2014 Sep 22].
18. Raosoft Inc. Sample Size Calculator. 2004. Available from: [Last accessed on 2014 Sep 25].
19. Lenth RV. Java applets for power and sample size, [Computer software, 2006]. Available from: http://www.homepage.stat. [Last accessed on 2014 Sep 25].
20. Triola MF. Estimates and sample sizes. In: Elementary Statistics. 12th ed., Ch. 7. New York: Prentice Hall Inc.; 2012. Available from: Sta1020/ElemStat_Triola_Chapter7.pdf. [Last accessed on 2014 Sep 25].
21. National Research Council (US) Committee on Guidelines for the Use of Animals in Neuroscience and Behavioral Research. Sample Size Determination. In: Guidelines for the Use of Animals in Neuroscience and Behavioral Research. Washington DC: National Academies Press; 2003. Appendix A. Available from: books/NBK43321/#a20007f55ddd00182. [Last accessed on 2014 Sep 25].
22. Henry GT. Sample size. In: Practical Sampling. Thousand Oaks CA: SAGE Publishing Inc.; 1990. p. 117-29.

23. Boston University, School of Public Health. Power and Sample Size Determination. Available from: http://www.sphweb.bumc. print.html. [Last accessed on 2014 Sep 25].
24. Penn State University. Stat 100 – Statistical Concepts and Reasoning. 2014. Available from: http://www.onlinecourses. [Last accessed on 2014 Sep 25].
25. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: Basic principles and common pitfalls. Nephrol Dial Transplant 2002;17:2087-93. Available from: long. [Last accessed on 2014 Sep 25].
26. Doherty M. Probability versus non-probability sampling in sample surveys. New Zealand Stat Rev 1994;21-8. Available from: d/$FILE/Probability%20 versus%20Non%20Probability%20Sampling.pdf. [Last accessed on 2014 Sep 26].
27. Lane DM. Research design. In: Online Statistics Education: An Interactive Multimedia Course of Study. Rice University, University of Houston, Tufts University. Available from: http:// [Last accessed on 2014 Sep 26].
28. World Health Organization. Toolkit on Monitoring Health Systems Strengthening: Service Delivery, [June 2008]. Available from: EN_PDF_Toolkit_HSS_ServiceDelivery.pdf. [Last accessed on 2014 Sep 26].
29. Ferraro D, Krenzke T, Montaquila J. RDD Telephone Surveys: Reducing Bias and Increasing Operational Efficiency. Joint Statistical Meeting: Section on Survey Research Methods; 2008. p. 1949-56. Available from: srms/proceedings/y2008/Files/301280.pdf. [Last accessed on 2014 Sep 26].
30. Daniel J. Choosing the type of probability sampling. In: Sampling Essentials. Ch. 5. SAGE Publications Inc.; 2012. Available from: pdf. [Last accessed on 2014 Sep 26].

How to cite this article: Omair A. Sample size estimation and sampling techniques for selecting a representative sample. J Health Spec 2014;2:142-7.

Source of Support: Nil, Conflict of Interest: None declared.

Journal of Health Specialties / October 2014 / Vol 2 | Issue 4 147

