
MODULE 1: NATURE OF STATISTICS

Introduction
 
Statistical Thinking will one day be as necessary for efficient citizenship as the ability to read
and write (H. G. Wells).
 
In 2017, The Economist published an article on one of the most striking changes in the world economy. It claims
that the world’s most valuable resource is no longer oil, but data. The five biggest tech giants –
Google, Amazon, Apple, Facebook, and Microsoft – have been profiting from
making use of consumer/customer data. This phenomenon prompted professionals to make greater use
of statistics and later popularized the concept of data science.
 
To date, many companies all over the world are hiring statisticians and data
scientists to further their competitive advantage. Since data are everywhere and nearly every field needs them
analyzed, you need to learn the essential knowledge and skills of statistics. As Florence Nightingale
puts it,
 
“Statistics… is the most important science in the whole world: for upon it depends the practical
application of every other science and of every art; the one science essential to all political and social
administration, all education, all organization based upon experience, for it only gives the results of
our experience.”

Lesson 1: History and Development of Statistics


 
Statistics as a science and art has undergone a series of developments and refinements
over time all over the world. Many experts from different fields such as medicine and health,
philosophy, mathematics, and science contributed to strengthening the foundations of the field of
statistics. Some of these notable developments are highlighted in the Timeline of Statistics designed
by Tom Fryer and in the notes on the history of statistics by Sweetland.
 
 450BC Hippias of Elis uses the average value of the length of a king’s reign (the mean) to work out the
date of the first Olympic Games, some 300 years before his time.
 431BC Attackers besieging Plataea in the Peloponnesian War calculate the height of the wall by
counting the number of bricks. The count was repeated several times by different soldiers. The
most frequent value (the mode) was taken to be the most likely. Multiplying it by the height of one
brick allowed them to calculate the length of the ladders needed to scale the wall.
 400BC In the Indian epic Mahabharata, King Rtuparna estimates the number of fruits and leaves
(2095 fruit and 50 000 000 leaves) on two great branches of a vibhitaka tree by counting the
number on a single twig, then multiplying by the number of twigs. The estimate is found to be very
close to the actual number. This is the first recorded example of sampling – “but this knowledge is
kept secret”, says the account.
 2AD Chinese census under the Han Dynasty finds 57.67 million people in 12.36 million
households – the first census from which data survives, and still considered by scholars to have
been accurate.
 7AD Census by Quirinius, governor of the Roman province of Judea, is mentioned in Luke’s
Gospel as causing Joseph and Mary to travel to Bethlehem to be counted.
 840 Islamic mathematician Al-Kindi uses frequency analysis – the most common symbol in a
coded message will stand for the most common letter – to break secret codes. Al-Kindi also
introduces Arabic numerals to Europe.
 10th century The earliest known graph, in a commentary on a book by Cicero, shows the
movements of the planets through the zodiac. It is apparently intended for use in monastery schools.
 1086 Domesday Book: survey for William the Conqueror of farms, villages, and livestock in his
new kingdom – the start of official statistics in England.
 1150 Trial of the Pyx, an annual test of the purity of coins from the Royal Mint, begins. Coins are drawn
at random, in fixed proportions to the number minted. It continues to this day.
 1188 Gerald of Wales completes the first population census of Wales.
 1303 A Chinese diagram entitled “The Old Method Chart of the Seven Multiplying Squares”
shows the binomial coefficients up to the eighth power – the numbers that are fundamental to
the mathematics of probability, and that appeared five hundred years later in the west as
Pascal’s triangle.
 1346 Giovanni Villani’s Nuova Cronica gives statistical information on the population and trade
of Florence.
 1560 Gerolamo Cardano calculates probabilities of different dice throws for gamblers.
 1570 Astronomer Tycho Brahe uses the arithmetic mean to reduce errors in his estimates of
the locations of stars and planets.
 1644 Michael van Langren draws the first known graph of statistical data that shows the size of
possible errors. It is of different estimates of the distance between Toledo and Rome.
 1654 Pascal and Fermat correspond about dividing stakes in gambling games and together
create the mathematical theory of probability.
 1657 Huygens’ On the Reasoning in Games of Chance is the first book on probability. He also
invented the pendulum clock.
 1663 John Graunt uses parish records to estimate the population of London.
 1693 Edmund Halley prepares the first mortality tables statistically relating death rates to age –
the foundation of life insurance. He also drew a stylized map of the path of a solar eclipse over
England – one of the first data visualization maps.
 1713 Jacob Bernoulli’s Ars conjectandi derives the law of large numbers – the more often you
repeat an experiment, the more accurately you can predict the result.
 1728 Voltaire and his mathematician friend de la Condamine spot that a Paris bond lottery is
offering more prize money than the total cost of the tickets; they corner the market and win
themselves a fortune.
 1749 Gottfried Achenwall coins the word statistics (in German, Statistik); he means the
information you need to run a nation-state.
 1757 Casanova becomes a trustee of, and may have had a hand in devising, the French national
lottery.
 1761 The Rev. Thomas Bayes proves Bayes’ theorem – the cornerstone of conditional probability
and the testing of beliefs and hypotheses.
 1786 William Playfair introduces graphs and bar charts to show economic data.
 1789 Gilbert White and other clergymen-naturalists keep records of temperatures, dates of first
snowdrops and cuckoos, etc.; the data is later useful for the study of climate change.
 1790 First US census, taken by men on horseback directed by Thomas Jefferson, counts
 3.9 million Americans.
 1791 First use of the word statistics in English, by Sir John Sinclair in his Statistical Account of
Scotland.
 1805 Adrien-Marie Legendre introduces the method of least squares for fitting a curve to a given
set of observations.
 1808 Gauss, with contributions from Laplace, derives the normal distribution – the bell-shaped
curve fundamental to the study of variation and error.
 1833 The British Association for the Advancement of Science sets up a statistics section. Thomas
Malthus, who analyzed population growth, and Charles Babbage are members. It later becomes
the Royal Statistical Society.
 1835 Belgian Adolphe Quetelet’s Treatise on Man introduces social science statistics and the
concept of the average man – his height, body mass index, and earnings.
 1839 The American Statistical Association is formed. Alexander Graham Bell, Andrew Carnegie,
and President Martin Van Buren will become members.
 1840 William Farr sets up the official system for recording causes of death in England and Wales.
This allows epidemics to be tracked and diseases compared – the start of medical statistics.
 1849 Charles Babbage designs his difference engine, embodying the ideas of data handling and
the modern computer. Ada Lovelace, Lord Byron’s daughter, writes the world’s first computer
program for it.
 1854 John Snow’s cholera map pins down the source of an outbreak as a water pump in Broad
Street, London, beginning the modern study of epidemiology.
 1859 Florence Nightingale uses statistics of Crimean War casualties to influence public opinion
and the War Office. She shows casualties month by month on a circular chart she devises, the
Nightingale rose, the forerunner of the pie chart. She is the first woman member of the Royal
Statistical Society and the first overseas member of the American Statistical Association.
 1868 Minard’s graphic diagram of Napoleon’s March on Moscow shows on one
diagram the distance covered, the number of men still alive at each kilometer of the march, and
the temperatures they encountered on the way.
 1877 Francis Galton, Darwin’s cousin, describes regression to the mean. In 1888 he introduces
the concept of correlation. At a Guess the Weight of an Ox contest in Devon he
describes the Wisdom of Crowds – that the average of many uninformed guesses is close to the
correct value.
 1886 Philanthropist Charles Booth begins his survey of the London poor, to produce his poverty
map of London. Areas were colored black, for the poorest, through to yellow for the upper-middle class
and wealthy.
 1894 Karl Pearson introduces the term standard deviation. If errors are normally distributed, 68% of
samples will lie within one standard deviation of the mean. Later he develops chi-squared tests
for whether two variables are independent of each other.
 1898 Von Bortkiewicz’s data on deaths of soldiers in the Prussian army from horse kicks
show that apparently rare events follow a predictable pattern, the Poisson distribution.
 1900 Louis Bachelier shows that fluctuations in stock market prices behave in the same way as
the random Brownian motion of molecules – the start of financial mathematics.
 1908 William Sealy Gosset, chief brewer for Guinness in Dublin, describes the t-test. It uses a
small number of samples to ensure that every brew tastes equally good.
 1911 Herman Hollerith, inventor of punchcard devices used to analyze data in the US census,
merges his company to form what will become IBM, pioneers of machines to handle business
data, and early computers.
 1916 During the First World War, car designer Frederick Lanchester develops statistical laws to
predict the outcomes of aerial battles: if you double their size, land armies are only twice as
strong, but air forces are four times as strong.
 1924 Walter Shewhart invents the control chart to aid industrial production and management.
 1935 George Zipf finds that many phenomena – river lengths, city populations – obey a power
law so that the largest is twice the size of the second-largest, three times the size of the third,
and so on. R. A. Fisher revolutionizes modern statistics. His Design of Experiments gives ways of
deciding which results of scientific experiments are significant and which are not.
 1937 Jerzy Neyman introduces confidence intervals in statistical testing. His work leads to
modern scientific sampling.
 1940-45 Alan Turing at Bletchley Park cracks the German wartime Enigma code, using advanced
Bayesian statistics and Colossus, the first programmable electronic computer.
 1944 The German tank problem: the Allies desperately need to know how many Panther tanks
they will face in France on D-Day. Statistical analysis of the serial numbers on gearboxes from
captured tanks indicates how many are being produced. Statisticians predict 270 a month; reports
from intelligence sources predict many fewer. The total turned out to be 276. Statistics had
outperformed spies.
 1948 Claude Shannon introduces information theory and the bit – fundamental to the digital
age.
 1948-53 The Kinsey Report gathers objective data on human sexual behavior. A large-scale survey of 5000
men and, later, 5000 women causes outrage.
 1950 Richard Doll and Bradford Hill establish the link between cigarette smoking and lung
cancer. Despite fierce opposition, the result is conclusively proved, to huge public health
benefit.
 1950s Genichi Taguchi’s statistical methods to improve the quality of automobile and electronics
components revolutionize Japanese industry, which far overtakes its western European competitors.
 1958 The Kaplan-Meier estimator gives doctors a simple statistical way of judging which
treatments work best. It has saved millions of lives.
 1972 David Cox introduces the proportional hazards model and the concept of partial likelihood.
 1977 John Tukey introduces the box-plot or box-and-whisker diagram, which shows the
quartiles, median, and spread of a data set in a single image.
 1979 Bradley Efron introduces bootstrapping, a simple way to estimate the distribution of
almost any sample of data.
 1982 Edward Tufte self-publishes The Visual Display of Quantitative Information, setting new
standards for the graphic visualization of data.
 1988 Margaret Thatcher becomes the first world leader to call for action on climate change.
 1993 The statistical programming language R is released, now a standard statistical tool.
 1997 The term Big Data first appears in print.
 2002 The amount of information stored digitally surpasses non-digital. Paul DePodesta uses
statistics – sabermetrics – to transform the fortunes of the Oakland Athletics baseball team; the
film Moneyball tells the story.
 2004 Launch of Significance magazine.
 2008 Hal Varian, chief economist at Google, says that statistics will be the sexy profession of the
next ten years.
 2012 Nate Silver, statistician, successfully predicts the result in all 50 states in the US
Presidential election. He becomes a media star and sets up what may be an over-reliance on
statistical analysis for the 2016 election. The Large Hadron Collider confirms the existence of a
Higgs boson with a probability of five standard deviations against the data being merely a coincidence.
 

Lesson 2: Basic Concepts of Statistics


 
Statistics refers to the scientific study that deals with the collection, organization and
presentation, analysis and interpretation of data.

Two Divisions of Statistics


1. Descriptive Statistics. This refers to the statistical procedures concerned with
describing the characteristics and properties of a group of persons, places, or things. It
organizes the presentation, description, and interpretation of data gathered without
trying to infer anything that goes beyond the data. The most common measures used
to describe data include the measures of central tendency (mean, median, mode),
measures of variation (range, variance, standard deviation, etc.), kurtosis, and
skewness.

Sample Research Questions (Objectives):


1.        What is the demographic profile of the respondents? (describe the
demographic profile of the respondents)
2.        What characteristics and qualifications do school principals look for in
a potential teacher applicant? (determine the characteristics and qualifications
that principals look for in a potential teacher applicant)
3.       Which group of learners has the best performance in the national
achievement test? (identify which group of learners has the best
performance in the national achievement test)
4.       How did the graduates of teacher education institutions perform in the
licensure examinations? (assess the licensure examination performance of the
graduates from teacher education institutions)
5.       What are the factors that affect the implementation of the school program?
(determine the different factors that affect the implementation of the school
program)
 
2. Inferential Statistics. This refers to statistical procedures that are used to draw
inferences about a large group of people, places, or things (population) based on the
information obtained from a small portion (sample) taken from the large group. The
most common procedures include tests of differences between and among
groups, tests of relationships and associations, and tests of effects.
 
Sample Research Questions (Objectives):
1.       To what degree do NCAE ratings predict freshman college GPA? (ascertain if
freshman college GPA can be predicted by NCAE ratings)
2.       To what extent do entry-level qualifications of graduates of teacher education
programs increase the likelihood of developing proficient teachers? (ascertain if
entry- level qualifications of graduates of teacher education programs increase
the likelihood of developing proficient teachers)
3.       How do K-3 pupils from different socio-economic status compare in their
reading and mathematics achievement after adjusting for family type? (compare
the reading and mathematics achievement of the K-3 pupils from different socio-
economic status after adjusting for family type)
4.       How do male and female learners differ in the national achievement test?
(ascertain if results of the national achievement test differ between sexes)
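
To make the two divisions concrete, here is a minimal Python sketch (standard library only, with hypothetical test ratings) that first summarizes a sample with descriptive statistics and then makes a simple inference about the population mean through an approximate confidence interval; the numbers and variable names are illustrative assumptions, not data from the module.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample: national achievement test ratings of 10 students.
sample = [88, 92, 79, 85, 90, 94, 83, 87, 91, 86]

# Descriptive statistics: summarize only the data at hand.
print("mean   =", statistics.mean(sample))
print("median =", statistics.median(sample))
print("stdev  =", round(statistics.stdev(sample), 2))

# Inferential statistics: use the sample to say something about the population.
# Rough 95% confidence interval for the population mean (normal approximation).
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / n ** 0.5
z = NormalDist().inv_cdf(0.975)          # about 1.96
print("95% CI for the population mean:",
      (round(mean - z * se, 2), round(mean + z * se, 2)))
```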

Population and Sample


Population refers to a large collection of people, objects, places, or things. Any numerical
value that describes a population is called a parameter. A sample refers to the small portion or subset
of the population. Any numerical value that describes a sample is called a statistic.
 
Example: The Department of Education, in a press briefing, stated that the average rating of 1 000 000
high school students all over the country who took the examination is 94%. A division supervisor
would like to study the performance of high school students in the national achievement test from
their schools division. Eighteen thousand high school students from their division had an average
rating of 92%.
 
Population: All high school students who took the national achievement test
Parameter: N = 1 000 000 high school students, average rating of 94%
Sample: All high school students from the specific schools division who took the national achievement
test
Statistic: n = 18 000 high school students, an average rating of 92%
 
Variable, Data, and Indicators
 
A variable is a characteristic or property of a population or sample which makes the members
different from each other. Variables can be classified as follows.
 
1. Independent Variable. This is the one thing you change. It is the variable that affects
another variable.
2. Dependent Variable. It is the variable being affected by another variable. The change that
happens is due to the influence of the independent variable.
3. Controlled Variable. This is the variable that you want to remain constant and unchanging.
4. Quantitative Variable. This is expressed as a number or can be quantified.
Types of Quantitative Variables
1.       Discrete Variable. This variable has a countable number of possible
values that can be listed in a finite amount of time.
2.       Continuous Variable. This variable can take on any value between
two specified values.
5. Qualitative Variable. This is information that cannot be expressed as a number; thus,
these are not quantifiable.
 
Example: To what degree do NCAE ratings predict college freshman GPA?
Independent Variable: NCAE ratings
Dependent Variable: college freshman GPA
Controlled Variable: Type of examination and test items
 
Both the NCAE ratings and college freshman GPA are quantitative, continuous variables. The number
of college freshman students is a discrete variable. The profile of the college freshman students such
as the program of study, sex, and school last attended are qualitative variables.
 
Data are facts or values gathered or observed from the samples or population being
studied. Indicators are data that directly measure the variable being studied. To be able to gather significant
and relevant data, indicators for each variable of interest must be established first. This will make
analysis and interpretation much easier and more convenient.
 
Example: The school principal wants to know if the feeding program implemented among K-3 pupils
for the past 6 months has been successful. The data/indicators that she may look for are the pupils'
weight and height before and after the feeding program.
 
Example: What is the socio-economic status of pupils and students in different private and public
schools in Batangas?
The variable socio-economic status is broad in scope and data may vary depending on the group of
persons being studied. Data or indicators may include parents’ educational attainment, parents’
occupation, household income, and other household conditions (house ownership,
appliances/gadgets, etc.)
 
Most research data can be classified into one of the three basic categories.
 
Category 1: A single group of participants with one score per participant. This type of data often
exists in research studies that are conducted simply to describe individual variables as they exist
naturally. Although several variables are being measured, the intent is to look at them one at a
time, and there is no attempt to examine relationships or differences between variables.
 
Category 2: A single group of participants with two or more variables measured for each participant.
The research study is specifically intended to examine relationships or differences between variables.
However, there is no attempt to control or manipulate the variables.
 
Category 3: Two or more groups of scores, with each score a measurement of the same variable. This involves
independent-measures and repeated-measures designs.
 
 

Levels of Measurement
 
Measurement levels refer to the different types of variables and determine how they can be analyzed.
 
1. Nominal. It is a variable whose values do not have an undisputed order. It may have two or
more exhaustive, non-overlapping categories, but there is no intrinsic ordering of the
categories. Examples: sex, socioeconomic status, civil status, school division, religious
affiliation, mother tongue
2. Ordinal. It holds values that have an undisputed order but no fixed unit of measurement. An
ordinal variable is similar to a nominal variable except that there is a clear ordering of
the categories, although the difference between ranks cannot be stated with
certainty. Examples: rating scales (Likert scales), shoe/shirt sizes, ranking, monthly
income (range)
3. Interval. An interval variable is similar to an ordinal variable except that the intervals are
equally spaced. It has a fixed unit of measurement, but zero is arbitrary and does not indicate
the absence of the quantity. Examples: temperature, pressure, IQ score, mental ability ratings
4. Ratio. A ratio variable is an interval variable with a true zero. It has a fixed unit of
measurement, and zero means the absence of the quantity being measured. Examples: weight, height, age, income
                                                                                                             
                                                                 

    

Lesson 3: Summation Notation


 
Summation notation is a convenient and simple way to give a concise expression for the sum of the
values of a variable. It is commonly used to express statistical formulas. It involves the
symbols shown below.
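
Since the table of symbols and the worked examples in the original module are given as images that are not reproduced here, the following is a minimal sketch of the notation with a small made-up example.

\[
\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n
\]
where $\sum$ (the Greek capital sigma) is the summation symbol, $i$ is the index of summation, $1$ and $n$ are the lower and upper limits of summation, and $x_i$ is the $i$th value of the variable. For instance, if $x_1 = 2$, $x_2 = 5$, and $x_3 = 7$, then
\[
\sum_{i=1}^{3} x_i = 2 + 5 + 7 = 14
\qquad\text{and}\qquad
\sum_{i=1}^{3} 4x_i = 4\sum_{i=1}^{3} x_i = 56.
\]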
Example: Express the following summation as a sum of individual terms.            

Evaluate the following summations.


MODULE 2: DATA COLLECTION AND SAMPLING DESIGN

Introduction
 
After identifying your research problem, the next step is to collect appropriate and relevant
data. Data collection is crucial to the success of any investigation or study. If the investigator was not
able to collect enough relevant data, the findings and results of the study will be affected; thus,
conclusions, generalizations, or implications derived from the available data may not be reliable or
valid. Becoming an expert in data collection methods and techniques requires time and effort.
Guidance from an experienced researcher or statistician may help you in working out your data
collection and sampling design.
 

Lesson 1: Sources of Data and Data Collection Methods

Data collection is a methodical process of gathering and analyzing specific information to give
solutions to relevant research questions.
 

Characteristics of Good Data


Ortega (2017) outlines seven (7) characteristics that define quality data.
 
1. Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot have any
erroneous elements and must convey the correct message without being misleading. This
accuracy and precision have a component that relates to the data's intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be off-
target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have
more serious consequences) and, therefore, justifiably worth higher levels of investment.
2. Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to
a set of options, and open answers are not permitted. Any answers other than these would not be
considered valid or legitimate based on the survey's requirements. This is the case for most data
and must be carefully considered when determining its quality. The people in each department
in an organization understand what data is valid or not to them, so the requirements must be
leveraged when evaluating data quality.
3. Reliability and Consistency: Many systems in today's environments use and/or collect the same
source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must be a
stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
4. Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too soon
or too late could misrepresent a situation and drive inaccurate decisions.
5. Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data.
Gaps in data collection lead to a partial view of the overall picture to be displayed. Without a
complete picture of how operations are running, uninformed actions will occur. It is important to
understand the complete set of requirements that constitute a comprehensive set of data to
determine whether or not the requirements are being met.
6. Availability and Accessibility: This characteristic can be tricky at times due to legal and regulatory
constraints. Regardless of the challenge, though, individuals need the right level of access to the
data to perform their jobs. This presumes that the data exists and is available for access to be
granted.
7. Granularity and Uniqueness: The level of detail at which data is collected is important because
confusion and inaccurate decisions can otherwise occur. Aggregated, summarized, and
manipulated collections of data could offer a different meaning than the data implied at a lower
level of detail. An appropriate level of granularity must be defined so that sufficient uniqueness and
distinctive properties become visible. This is a requirement for operations to function
effectively.

Types of Data
1. Primary Data. These are data collected by the investigator himself/herself for a
specific purpose. For instance, the data collected by an investigator for their research
project is an example of primary data.
2. Secondary Data. These are data collected by someone else for some other purpose but
are being utilized by the current investigator for a different purpose. For instance, census
data used to analyze the impact of education on career choice and earnings
is an example of secondary data.

Data Collection Tools and Instruments (Bhat, 2020)


 
1. Interview Method. The interviews conducted to collect quantitative data are more structured,
wherein the researchers ask only a standard set of questions and nothing more than that.
There are three major types of interviews conducted for data collection:
 Telephone interviews: For years, telephone interviews ruled the charts of data collection methods.
However, nowadays, there is a significant rise in conducting video interviews using the internet,
Skype, or similar online video calling platforms.
 Face-to-face interviews: It is a proven technique to collect data directly from the participants. It
helps in acquiring quality data as it provides the scope to ask detailed questions and probe
further to collect rich and informative data. Literacy requirements of the participant are
irrelevant, as face-to-face interviews offer ample opportunities to collect non-verbal data
through observation or to explore complex and unknown issues. Although it can be an
expensive and time-consuming method, the response rates for face-to-face interviews are often
higher.
 Computer-Assisted Personal Interviewing (CAPI): It is a similar setup to the face-to-face
interview, where the interviewer carries a desktop or laptop at the time of the
interview to upload the data obtained from the interview directly into the database. CAPI saves
a lot of time in updating and processing the data and also makes the entire process paperless, as
the interviewer does not carry a bunch of papers and questionnaires.
  
   2. Survey or Questionnaire Method. Checklists and rating-scale questions make up the bulk of
quantitative surveys, as they help in simplifying and quantifying the attitudes or behaviors of
the respondents.
 Web-based questionnaire: This is one of the ruling and most trusted methods for internet-based
research or online surveys. In a web-based questionnaire, the respondents receive an email containing
the survey link; clicking on it takes the respondent to a secure online survey tool from where he/she can
take the survey or fill in the survey questionnaire.
 Mail Questionnaire: In a mail questionnaire, the survey is mailed out to a host of the sample
population, enabling the researcher to connect with a wide range of audiences. The mail
questionnaire typically consists of a packet containing a cover sheet that introduces the
audience to the type of research and the reason why it is being conducted, along with a prepaid
return envelope to collect the data.
3. Observation Method. In this method, researchers collect quantitative data through systematic
observations, for example by counting the number of people present at a specific
event at a particular time and venue, or the number of people attending an event in a
designated place. Structured observation is used to collect quantitative rather than
qualitative data.
 Structured observation: In this type of observation method, the researcher has to make careful
observations of one or more specific behaviors in a more comprehensive or structured setting
compared to naturalistic or participant observation. In a structured observation, the
researchers, rather than observing everything, focus only on very specific behaviors of interest. It allows
them to quantify the behaviors they are observing. When the observations require a judgment
on the part of the observers, the process is often described as coding, which requires clearly defining a
set of target behaviors.
4. Documents and Records. Document review is a process used to collect data after reviewing
existing documents. It is an efficient and effective way of gathering data, as documents are
manageable and are a practical resource for getting qualified data from the past. Three primary document
types are analyzed for collecting supporting quantitative research data:
 Public Records: Under this document review, official, ongoing records of an organization are
analyzed for further research. For example, annual reports, policy manuals, student activities,
game activities in the university, etc.
 Personal Documents: In contrast to public documents, this type of document review deals with
individual personal accounts of individuals’ actions, behavior, health, physique, etc. For
example, the height and weight of the students, the distance students travel to attend the
school, etc.
 Physical Evidence: Physical evidence or physical documents deal with previous achievements of
an individual or of an organization in terms of monetary and scalable growth.
Lesson 2: Sampling Design
Sampling is a statistical procedure that is concerned with the selection of individual
observations. It allows us to make statistical inferences about the population.

Approaches to Determine the Sample Size


1. Using a census for a small population (N ≤ 200). This eliminates sampling error and provides data
on all the members or elements in the population.
2. Using a sample size of a similar study. The disadvantage of using the same method used by other
research is the possibility of repeating the same errors that were made in determining the sample
size for that study.
3. Using published tables. (research-advisors.com/tools/SampleSize.htm)
4. Using a formula
                                   a)    http://www.raosoft.com/samplesize.html
                      b) https://www.surveymonkey.com/mp/sample-size-calculator/
                     
c) http://sphweb.bumc.bu.edu/otlt/mph- modules/bs/bs704_power/BS704_Power_print.html 
In using a formula to compute the sample size, the basic information needed is as follows.
            a) Margin of error. It is the amount of error that you can tolerate. If 90% of respondents
answer yes while 10% answer no, you may be able to tolerate a larger amount of error than
if the respondents are split 50-50 or 45-55. A lower margin of error requires a larger sample size.
            b) Confidence level. It is the amount of uncertainty you can tolerate. Suppose that you have 20
yes-no questions in your survey. With a confidence level of 95%, you would expect that for
one of the questions (1 in 20), the percentage of people who answer yes would be more than the
margin of error away from the true answer. The true answer is the percentage you would
get if you exhaustively interviewed everyone in the population. A higher confidence level requires a larger sample size.
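
As an illustration of approach 4, here is a minimal Python sketch of one commonly used formula (Cochran's formula for estimating a proportion, with a finite population correction); the function name, default values, and the example population of 1 000 are illustrative assumptions, and the online calculators listed above may use slightly different formulas.

```python
import math

def sample_size(population, margin_of_error=0.05, confidence=0.95, p=0.5):
    """Estimate the required sample size for a proportion using Cochran's
    formula with a finite population correction (a sketch, not the only
    possible formula)."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]   # z-score lookup
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2       # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                     # finite population correction
    return math.ceil(n)

# A population of 1 000 needs roughly 278 respondents at a 5% margin of error
# and 95% confidence.
print(sample_size(population=1000))
```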

Sampling Techniques
 
1. Probability Sampling. It is a sampling technique wherein the members of the population are
given an (almost) equal chance to be included as a sample.
 Simple Random Sampling. All members of the population have an equal chance of being included in the sample.
Example: lottery method, table of random numbers
 Systematic Random Sampling (with a random start). It selects every kth member of the
population, with the starting point determined at random. Example: selecting every 5th member
of N = 1000 to get 200 samples; for instance, starting at the 7th member, we have the 12th, 17th,
22nd, and so on (see the sketch after this list).
 Stratified Random Sampling. This is used when the population can be divided into several smaller non-
overlapping groups (strata); the sample is then randomly selected from each group.
 Cluster Sampling. Also called area sampling, in which groups or clusters, instead of individuals,
are randomly selected as samples.
 Multi-stage Sampling. If the population is too big, two or more sampling techniques may be
used until the desired sample is obtained.
 
2. Non-probability Sampling. It is a sampling technique wherein the sample is determined by set
criteria, purpose, or the personal judgment of the researcher.
1. Purposive or Judgment Sampling. The sample is selected based on predetermined criteria set by
the researcher. Example: To determine the difficulties encountered by students in the
2017 national achievement test, only the Grade 6 pupils of the said school
will be included as a sample.
2. Convenience or Accidental Sampling. It relies on data collection from population members who
are conveniently available to participate in the study. Facebook polls or questions can
be mentioned as a popular example of convenience sampling.
3. Quota Sampling. It is a non-probability sampling technique in which researchers look for a
specific characteristic in their respondents, and then take a tailored sample that is in
proportion to the population of interest.
4. Snowball Sampling. The samples are determined by referrals made by previous members of the sample.
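
Here is the sketch referred to above: a minimal Python illustration (standard library only, with a hypothetical population of 1 000 student IDs) of simple random sampling and systematic random sampling with a random start.

```python
import random

population = list(range(1, 1001))        # hypothetical population of 1 000 student IDs

# Simple random sampling: every member has an equal chance of being selected.
simple_sample = random.sample(population, k=200)

# Systematic random sampling with a random start: select every kth member.
k = len(population) // 200               # sampling interval (here, every 5th member)
start = random.randrange(k)              # random starting point, 0 to k - 1
systematic_sample = population[start::k]

print(len(simple_sample), len(systematic_sample))   # 200 200
```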
MODULE 3: DATA PRESENTATION AND VISUALIZATION

Introduction
 
Data visualization is a graphical representation of information and data. The different data
visualization tools provide an accessible way to see and understand trends, outliers, and patterns in
data. Being another form of visual art, data visualization grabs the interest and attention of the
audience on the message. It helps to tell the important stories by curating data into a form easier to
understand, highlighting the most important aspect of the data set. However, data presentation and
visualization are not as simple as creating graphs and tables. Effective presentation and visualization
of data involve a balance between form (aesthetics) and function.
 

Lesson 1: Graphical Presentation of Data

A statistical graph (or chart) is a tool that helps readers understand the characteristics of a
distribution of a sample or a population. Effective data presentation follows these principles.
 
Five Essential Elements of Data Visualization (Data Craze, 2020)
 
1. Consistent Style and Colors. Carefully choose and maintain the same style across your
visualizations. Remember that the true meaning and value of data are not just in the numbers themselves.
2. Select the Right Visualization. A bar or pie chart is not the only visualization method in your
arsenal. Adjust what you want to present based on the purpose and type of data you have.
3. Less is More. Focus on the quality of what you want to present. An excessive number of charts
or indicators is distracting. Simplicity pays off – the less information there is to analyze, the
clearer the message.
4. Effective Visualization. The difference between effective and merely impressive visualization can be
huge. The data presented in the application should foremost give value – an effect in the form of
specific insights.
5. Data Quality. The trust of users is difficult to build but easy to lose. Unexpected
information is desirable; errors are not. Try to detect errors at an early stage.

What Graphs Should You Use?


 
Data should be matched appropriately to the right information visualization. The following are
some of the most common graphs used to present data (Klipfolio, Inc., 2020).
 
1. Bar Graph. It organizes data into rectangular bars that make it convenient to
compare related data sets.
When to Use: to compare two or more values in the same category; to compare parts of a whole;
when you do not have too many groups (fewer than 10); and to relate multiple similar data sets.
When Not to Use: the category you are visualizing has only one value associated with it, or the data is
continuous.
Design Best Practice: Use consistent colors and labelling throughout for identifying
relationships more easily. Simplify the length of the y-axis labels and don’t forget to start
from 0.

2. Line Chart. It organizes data points into a line, allowing readers to rapidly scan the information and understand trends.


When to Use:  to understand trends, patterns, and fluctuations; to compare different yet
related data sets with multiple series; and to make projections beyond your data
When Not to Use:  to demonstrate an in-depth view of your data
Design Best Practice:  Use different colors for each category you are comparing. Use solid
lines to keep the line chart clear and concise. Try not to compare more than four categories
in one line chart.
3. Scatter Plot. It organizes many different data points to highlight similarities in the
given data set. It is useful when looking for outliers and identifying correlation
between two variables.
When to Use:  to show the relationship between variables and to have a compact data visualization
When Not to Use:  to rapidly scan information or to have clear and precise data points
Design Best Practice:  Use no more than 1 or 2 trend lines to avoid confusion. Start the y-axis
at 0.
4. Histogram. It shows the distribution of data over a continuous interval or certain
period. It gives an estimate of where values are concentrated, what the extremes are,
and whether there are any gaps or unusual values throughout the data set.
When to Use:  to make comparisons of data sets over an interval or time and to show a
distribution of data
When Not to Use:  to compare three or more variables in data sets
Design Best Practice:  Avoid bars that are too wide that can hide important details or too
narrow that can cause a lot of noise. Use equal round numbers to create bar sizes. Use
consistent colors and labelling throughout.
5. Box Plot. Also known as a box-and-whisker diagram, it is a visual representation of
the distribution of data, usually across groups, based on a five-number
summary: minimum, first quartile, median, third quartile, and maximum. It also
shows outliers.
When to Use:  To display or compare a distribution of data and identify the minimum,
maximum and median of data.
When Not to Use:  to visualize individual, unconnected data sets
Design Best Practice:  Ensure font sizes for labels and legends are big enough and line widths
are thick enough. Use different symbols, line styles or colors to differentiate multiple data
sets. Remove unnecessary clutter from the plots.

Other useful graphs and charts, with their description, use, and other important features may be
found at The Data Visualization Catalogue via datavizcatalogue.com
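
To tie the chart descriptions above to practice, here is a minimal matplotlib sketch that draws a bar graph and a histogram following the design practices listed (y-axis starting at 0, consistent colors, equal-width bins); the program names and scores are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data: enrolment per program and a small set of test scores.
programs = ["BSEd", "BEEd", "BSMath", "BSStat"]
enrolment = [120, 95, 60, 45]
scores = [12, 15, 18, 21, 22, 25, 27, 28, 30, 31,
          33, 35, 36, 38, 40, 41, 43, 45, 48, 50]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Bar graph: compare values across a few categories; one color, y-axis from 0.
ax1.bar(programs, enrolment, color="steelblue")
ax1.set_title("Enrolment per Program")
ax1.set_ylim(bottom=0)

# Histogram: distribution over a continuous interval; equal-width bins.
ax2.hist(scores, bins=range(10, 61, 10), color="steelblue", edgecolor="white")
ax2.set_title("Distribution of Test Scores")
ax2.set_xlabel("Score")
ax2.set_ylabel("Frequency")

plt.tight_layout()
plt.show()
```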
 
Here are some tips on improving your charts and graphs (Visme, 2020).
1. Our eyes do not follow a specific order, so you need to create that order. Create a visualization
that deliberately takes viewers on a predefined visual path.
2. Our eyes first focus on what stands out, so be intentional with your focal point. Create charts
and graphs with one clear message that can be effortlessly understood.
3. Our eyes can only handle a few things at once, so do not overcrowd your design. Simplify your
charts so that they highlight the one main point you want your audience to take away.
4. Our brains are designed to immediately look for connections and try to find meaning in the data.
Assign colors deliberately to improve the functionality of your visualization.
5. We are guided by cultural conventions, so design with familiar conventions in mind.
 
Lesson 2: Tabular Presentation of Data
 
Almost all research and technical reports use tables to present data. Tabular presentation of
data is a systematic and logical arrangement of data into rows and columns with respect to the
characteristics of data.
 
Components of Tables
 
1. Table Number and Title. These are included for easy reference and identification. The title should
indicate the nature of the information included in the table.
2. Stub (Row Labels). It is placed on the left side of the table, indicating the specific items described in each row.
3. Captions (Column Headings). They are placed at the top of the columns of a table to explain the figures in
the body.
4. Body. The most important part of the table, which comprises the numerical contents and reveals the
whole story of the investigated problem.
5. Footnote. It provides further explanation that may be needed for any item included in the table.
6. Source note. It is placed at the bottom of the table to indicate the sources of data.
Tabular Presentation of Nominal and Ordinal Data
 
Nominal or ordinal data are presented using a frequency table or frequency distribution
table. The table displays the frequency count and percentage for each value of a variable.
 
Example: Suppose your research objective is to determine the profile of the respondents. The data
may be presented as follows.  
A contingency table or crosstabulation can also be used to display the relationship between
categorical variables. This type of presentation allows us to examine a hypothesis regarding the
independence or dependence between variables.
 
Example: Suppose your research objective is to determine the profile of the respondents. The data
may be presented in crosstabulation as follows.
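
Because the example tables themselves are not reproduced in this copy, the following is a minimal pandas sketch (with hypothetical respondent data) showing how a frequency table and a crosstabulation of two categorical variables can be produced.

```python
import pandas as pd

# Hypothetical respondent profile data.
respondents = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "civil_status": ["Single", "Single", "Married", "Married", "Single", "Single"],
})

# Frequency table: count and percentage for one categorical variable.
counts = respondents["sex"].value_counts()
freq_table = pd.DataFrame({
    "Frequency": counts,
    "Percentage": (100 * counts / len(respondents)).round(1),
})
print(freq_table)

# Contingency table (crosstabulation) of two categorical variables.
print(pd.crosstab(respondents["sex"], respondents["civil_status"], margins=True))
```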
Tabular Presentation of Interval and Ratio Data
 
The data on the interval or ratio scale are organized using a frequency distribution table.
These are the steps in constructing a frequency distribution table.
 
1. Determine the number of class intervals k = 1 + 3.322 log N (where N is the number of observations),
the range R = maximum value − minimum value, and the class size c = R/k.
2. Construct the class intervals based on the class size. The first and last class intervals should contain
the minimum and maximum values, respectively. It is advisable to start the first class interval
with the minimum value.
3. Arrange the data in either ascending or descending order. Then tally the scores based on the
class intervals in step 2.
4. Add columns for class boundaries, class mark or class midpoint, relative frequency, and
cumulative frequencies.
 
The class interval contains the lower (L) and upper (U) limits. (e.g., In the class interval 46
– 65, the lower limit is 46 and the upper limit is 65.)

The class mark or class midpoint (X) is the value in the middle of the class interval; that is,
X = (L + U)/2. (e.g., In the class interval 46 – 65, the class mark is (46 + 65)/2 = 55.5.)

The class boundaries are the true class limits of the class intervals. They are half a unit below the
lower limit and half a unit above the upper limit. (e.g., In the class interval 46 – 65, the class
boundaries are 45.5 – 65.5.)
 
The relative frequency (also known as percentage frequency) is computed using the formula
rf = (f/n) × 100%,
where f is the frequency of the class interval and n is the total of the frequencies.
 
The less than cumulative frequency (<cf) and greater than cumulative frequency (>cf) are
obtained by adding the frequencies from top to bottom and from bottom to top,
respectively.
 
Example: Using the scores of 50 students in a 55-item Mathematics test, construct a frequency
distribution table.
43 30 35 37 42 19 26 48 34 15

35 18 46 41 27 18 13 40 29 14

40 17 10 21 28 13 14 39 30 5

19 50 36 20 31 28 48 32 20 38

25 12 33 31 28 16 40 32 26 35

 
 
Solution:
 
Step 1: Determine the number of class intervals, the range, and the class size.
k = 1 + 3.322 log N = 1 + 3.322 log 50 ≈ 6.64, rounded up to 7
R = maximum − minimum = 50 − 5 = 45
c = R/k = 45/7 ≈ 6.43, rounded up to 7
                              
Step 2: Construct the class intervals based on the class size.
Since our minimum value is 5 and the class size is 7, the first class interval is 5 – 11. Note
that this class interval contains 7 values – 5, 6, 7, 8, 9, 10, 11.
To construct the succeeding intervals, add the class size to the lower and upper limits.
 
Class
Intervals
 5 – 11
12 – 18
19 – 25
26 – 32
33 – 39
40 – 46
47 – 53
 
Step 3: Arrange the data in either ascending or descending order. Then tally the scores based on the
class intervals in step 2.
 

5 14 18 20 27 30 32 35 40 43

10 14 18 21 28 30 33 36 40 46

12 15 19 25 28 31 34 37 40 48

13 16 19 26 28 31 35 38 41 48

13 17 20 26 29 32 35 39 42 50
 
This data set can be organized or sorted using a stem-and-leaf plot. A stem-and-leaf plot is a special
table where each data value is split into a stem (the first digit or digits) and a leaf (the last digit).
 

Stem Leaf

0 5

1 0233445678899

2 00156678889

3 001122345556789

4 000123688

5 0

 
Step 4. Add columns for class boundaries, class mark or class midpoint, relative frequency, and
cumulative frequencies.
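
Since the completed table is not reproduced in this copy, here is a minimal Python sketch (standard library only) that carries out Steps 1–4 for the 50 scores above and prints the frequency, class mark, class boundaries, relative frequency, and less-than cumulative frequency of each class interval.

```python
import math

scores = [43, 30, 35, 37, 42, 19, 26, 48, 34, 15,
          35, 18, 46, 41, 27, 18, 13, 40, 29, 14,
          40, 17, 10, 21, 28, 13, 14, 39, 30, 5,
          19, 50, 36, 20, 31, 28, 48, 32, 20, 38,
          25, 12, 33, 31, 28, 16, 40, 32, 26, 35]

n = len(scores)
k = math.ceil(1 + 3.322 * math.log10(n))   # number of class intervals: 7
R = max(scores) - min(scores)              # range: 50 - 5 = 45
c = math.ceil(R / k)                       # class size: 7

lower, cum = min(scores), 0
print("Class    f   X     Boundaries      rf(%)  <cf")
for _ in range(k):
    upper = lower + c - 1
    f = sum(lower <= x <= upper for x in scores)   # tally the class interval
    cum += f
    mark = (lower + upper) / 2                     # class mark (midpoint)
    print(f"{lower:>2}-{upper:<2}  {f:>3}  {mark:<5} "
          f"{lower - 0.5:>5}-{upper + 0.5:<5} {100 * f / n:>6.1f}  {cum:>3}")
    lower = upper + 1
```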
