
HST 403: RESEARCH METHODOLOGY
DR. C. U. OKEKE

MEASUREMENT OF VARIABLES

A population is a set of existing units (usually people, objects or events). It is any complete
group with at least one characteristic in common.

Examples of a population are:

a) All year four students in POT department.
b) All students in the school of health.
c) All consumers who bought cell phones in 2019.

One or more characteristics of population units can be studied.

WHAT IS A VARIABLE?
Any characteristic of a population unit is called a variable. It is a characteristic of a population unit
with a quantity or quality that varies.
For instance, if we want to study the PHY 101 scores of the 2015 and 2014 sets of POT students, the
variable is their PHY 101 score. If you want to determine the teaching effectiveness of POT
lecturers, the variable of interest becomes "teaching effectiveness".
Measurement is normally carried out to assign a value of a variable to each population unit. For
instance, we may express a PHY 101 score as a percentage (30%, 50%, 35%, 80%, etc.) or assign 'kg'
to the weight of rats used in an experiment.
A variable can also be called a data item. Some common examples are age, sex, business income
and expenses, country of birth, class grades, eye colour, and vehicle type. They are called variables
because their values may vary between data units in a population and may change over time. For
example, students' grade scores in PHY 101 can vary among year 4 POT students.

NATURE OF VARIABLES
A variable can be quantitative or qualitative in nature.
A variable is said to be quantitative when the possible measurements are numbers that represent
quantities (that is, when it answers the question "how much" or "how many").
Examples are "the number of students in each level in POT department" and "the weight of patients
attending an ante-natal clinic".
A variable is said to be qualitative or categorical in nature when its values describe a 'quality'
or 'characteristic' of a data unit. It answers the question "what type" or "which category".
Examples of categorical variables are:
a) A person's gender (male or female)
b) Make of car
c) Size of cloth (small, medium, large, extra large)
d) People's attitudes (i.e. strongly agree, agree, disagree, and strongly disagree), etc.
The data collected on categorical variables are described as qualitative data.

TYPES OF MEASUREMENT OF VARIABLES

There are two major types of variables:

a) Numerical or quantitative variables
b) Categorical or qualitative variables.

NUMERIC OR QUANTITATIVE VARIABLES

These variables describe "how many" or "how much" in a given measurement. They can be further
described as:

a) A continuous variable, or (b) A discrete variable.


a) A continuous variable is a quantitative or numeric variable whose observations can take
any value within a certain range of real numbers.
The value given to an observation (an observation is a population unit being measured) for a continuous
variable can be as fine-grained as the instrument of measurement allows.
Examples include height, time, age, and temperature.
b) A discrete variable is a numeric variable whose observations can only take values based on
a count, that is, from a set of distinct whole numbers. It cannot take a fractional value between one
value and the next closest value. Examples include the number of hospitals in Owerri, the number of
hypertensive patients in FUTO, and the number of children in a family. They are measured in whole
units, like 50 hospitals in Owerri or 10 children in a family.

CATEGORICAL OR QUALITATIVE VARIABLES

These are variables that answer the questions "what type" or "which category". They fall into
mutually exclusive (in one category or another) and exhaustive (covering all possible options)
categories. Categorical or qualitative variables tend to be represented by non-numeric values.
They are further described as ordinal or nominal variables.
a) An ordinal variable is a categorical variable whose observations can take values that can
be logically ordered or ranked. The observations can be ranked higher or lower than one another, but
the ranking does not necessarily establish a numeric difference between categories. Examples are
academic grades (i.e. A, B, C) and people's attitudes (i.e. strongly agree, agree, disagree, strongly
disagree), etc.
b) A nominal variable is one whose observations have values that cannot be organized into a
logical sequence. Examples are sex, business type, eye colour, religion, and brand.

SCALES/LEVELS OF MEASUREMENT

There are 4 scales of measurement in statistics, named according to the variables they are used
for. Data can be classified as belonging to one of the 4 scales: Nominal, Ordinal, Interval
and Ratio. Each level of measurement has some important properties that are useful to know.
For example, only the ratio scale has a meaningful zero.

1. Nominal Scale. Nominal is from the Latin nominalis, which means "pertaining to
names". Nominal variables (also called categorical variables) can be placed into categories. They
don't have a numeric value and so cannot be added, subtracted, divided or multiplied. They also
have no order; if they appear to have an order, then you probably have ordinal variables instead.
They support mode and counts (frequency distribution).
Examples:
a. Gender: Male, Female, Other.
b. Hair Color: Brown, Black, Blonde, Red, Other.
c. Type of living accommodation: House, Apartment, Trailer, Other.
d. Genotype: AA, AS, SS
e. Religious preference: Buddhist, Mormon, Muslim, Jewish, Christian, Other.

[Pie chart: religious preferences (Christian 50%, Muslim 30%, Mormon 15%, Heathen 5%).]

2. Ordinal Scale. Ordinal means in order or rank, for instance 'first, second, third, … ninety-ninth',
etc. It provides order of values, counts (frequency distribution), mode and median.
Examples are
a. High school class ranking: 1st, 9th, 87th…
b. Socioeconomic status: poor, middle class, rich.
c. The Likert Scale: strongly disagree, disagree, neutral, agree, strongly agree.
d. Level of Agreement: yes, maybe, no.
e. Time of Day: dawn, morning, noon, afternoon, evening, night.
f. Political Orientation: left, center, right.

3. Interval Scale. An interval scale has ordered numbers with meaningful divisions: it has values at
equal intervals that mean something. Temperature is on the interval scale. A thermometer might
have intervals of ten degrees; a difference of 10 degrees between 90 and 100 means the same as a
difference of 10 degrees between 150 and 160. Compare that to high school ranking (which is
ordinal), where the difference between 1st and 2nd might be .01 and between 10th and 11th .5. If
you have meaningful divisions, you have something on the interval scale.
It provides order of values, counts (frequency distribution), mode, median and mean; differences
between values can be quantified; and values can be added or subtracted (but, unlike the ratio
scale, not meaningfully multiplied or divided).

4. Ratio Scale. Exactly the same as the interval scale, except that zero on the scale means
"none" or "does not exist". For example, a weight of zero means no weight, and 4 kg is twice as
heavy as 2 kg. Another example is the price of goods: ₦0 represents no cost, and a ₦90 book is
three times as costly as a ₦30 book. Temperature in Celsius, on the other hand, is not on a ratio
scale, because its zero does not mean the absence of temperature (zero on the Celsius scale is just
the freezing point of water; it doesn't mean that temperature ceases to exist). The ratio scale of
measurement provides order of values, counts (frequency distribution), mode, median and mean;
differences between values can be quantified; values can be added, subtracted, multiplied and
divided; and it has a true zero.

Examples: Weight, Height, Sales Figures, Ruler measurements, Income earned in a week,
Years of education, Number of children etc

Characteristic                                 Nominal  Ordinal  Interval  Ratio
The 'order' of values is known                 No       Yes      Yes       Yes
'Count', i.e. frequency distribution           Yes      Yes      Yes       Yes
Mode                                           Yes      Yes      Yes       Yes
Median                                         No       Yes      Yes       Yes
Mean                                           No       No       Yes       Yes
Can quantify difference between each value     No       No       Yes       Yes
Can add and subtract values                    No       No       Yes       Yes
Can multiply and divide values                 No       No       No        Yes
Has 'true zero'                                No       No       No        Yes
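As an illustration of the table, here is a minimal Python sketch (not from the lecture notes; the sample data are invented) of the kind of "average" each scale supports, using the built-in statistics module:

```python
# A minimal sketch, assuming invented sample data: the mode is the only
# average valid for nominal data, the median needs at least ordinal data,
# and the mean needs interval or ratio data.
import statistics

genotype = ["AA", "AS", "SS", "AA", "AA"]    # nominal
grades = ["A", "B", "B", "C", "B"]           # ordinal (A > B > C)
weights_kg = [60.5, 72.0, 55.3, 80.1, 66.2]  # ratio (true zero)

print(statistics.mode(genotype))    # 'AA'  - category counts only
print(statistics.median(grades))    # 'B'   - ordering is meaningful
print(statistics.mean(weights_kg))  # 66.82 - arithmetic is meaningful
```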

Common Types of Variables and their Meanings


 Categorical variable: a variable that can be put into categories. For example, the category
"Toothpaste Brands" might contain the values Colgate and Aquafresh.
 Confounding variable: extra variables that have a hidden effect on your experimental
results.
 Continuous variable: a variable with an infinite number of possible values, like "time" or "weight".
 Control variable: a factor in an experiment which must be held constant. For example, in an
experiment to determine whether light makes plants grow faster, you would have to control for soil
quality and water.
 Dependent variable: the outcome of an experiment. As you change the independent
variable, you watch what happens to the dependent variable.
 Discrete variable: a variable that can only take on a certain number of values. For example,
“number of cars in a parking lot” is discrete because a car park can only hold so many cars.
 Independent variable: a variable that is not affected by anything that you, the researcher,
do. Usually plotted on the x-axis.
 Lurking variable: a "hidden" variable that affects the relationship between the independent
and dependent variables.
 A measurement variable has a number associated with it. It's an "amount" of something, or
a "number" of something.
 Nominal variable: another name for categorical variable.
 Ordinal variable: similar to a categorical variable, but there is a clear order. For example,
income levels of low, middle, and high could be considered ordinal.
 Qualitative variable: a broad category for any variable that can’t be counted (i.e. has no
numerical value). Nominal and ordinal variables fall under this umbrella term.
 Quantitative variable: A broad category that includes any variable that can be counted, or
has a numerical value associated with it. Examples of variables that fall into this category include
discrete variables and ratio variables.
 Random variables are associated with random processes and give numbers to outcomes of
random events.
 A ranked variable is an ordinal variable; a variable where every data point can be put in
order (1st, 2nd, 3rd, etc.).
 Ratio variables: similar to interval variables, but has a meaningful zero.
Less Common Types of Variables
 Active Variable: a variable that is manipulated by the researcher.
 Antecedent Variable: a variable that comes before the independent variable.
 Attribute variable: another name for a categorical variable (in statistical software) or a
variable that isn’t manipulated (in design of experiments).
 Binary variable: a variable that can only take on two values, usually 0/1. Could also be
yes/no, tall/short or some other two-value combination.
 Collider Variable: a variable represented by a node on a causal graph that has paths pointing
in as well as out.
 Covariate variable: similar to an independent variable, it has an effect on the dependent
variable but is usually not the variable of interest. See also: Concomitant variable.
 Criterion variable: another name for a dependent variable, when the variable is used in non-
experimental situations.
 Dichotomous variable: Another name for a binary variable.
 Dummy Variables: used in regression analysis when you want to assign relationships to
unconnected categorical variables. For example, if you had the categories “has dogs” and “owns a
car” you might assign a 1 to mean “has dogs” and 0 to mean “owns a car.”
 Endogenous variable: similar to dependent variables, they are affected by other variables in
the system. Used almost exclusively in econometrics.
 Exogenous variable: variables that affect others in the system.
 Explanatory Variable: a type of independent variable. When a variable is independent, it is
not affected at all by any other variables. When a variable isn’t independent for certain, it’s an
explanatory variable.

 Extraneous variables are any variables that you are not intentionally studying in your
experiment or test.
 A grouping variable (also called a coding variable, group variable or by variable) sorts data
within data files into categories or groups.
 Identifier Variables: variables used to uniquely identify situations.
 Indicator variable: another name for a dummy variable.
 Interval variable: a meaningful measurement between two variables. Also sometimes used
as another name for a continuous variable.
 Intervening variable: a variable that is used to explain the relationship between variables.
 Latent Variable: a hidden variable that can’t be measured or observed directly.
 Manifest variable: a variable that can be directly observed or measured.
 Manipulated variable: another name for independent variable.
 Mediating variable or intervening variable: variables that explain how the relationship
between variables happens. For example, it could explain the difference between the predictor and
criterion.
 Moderating variable: changes the strength of an effect between independent and
dependent variables. For example, psychotherapy may reduce stress levels for women more than
men, so sex moderates the effect between psychotherapy and stress levels.
 Nuisance Variable: an extraneous variable that increases variability overall.
 Observed Variable: a measured variable (usually used in SEM).
 Outcome variable: similar in meaning to a dependent variable, but used in a non-
experimental study.
 Polychotomous variables: variables that can have more than two values.
 Predictor variable: similar in meaning to the independent variable, but used in regression
and in non-experimental studies.
 Responding variable: an informal term for dependent variable, usually used in science fairs.
 Scale Variable: basically, another name for a measurement variable.
 Study Variable (Research Variable): can mean any variable used in a study, but does have a
more formal definition when used in a clinical trial.
 Test Variable: another name for the Dependent Variable.
 Treatment variable: another name for independent variable.

NAME      AGE   SEX      PHY 101   GRADE
IHEDORO   20    Female   60        B
MICHAEL   23    Male     40        F
IFEOMA    19    Female   75        A

1. AGE, SEX, PHY 101 and GRADE are called data items: a data item is a characteristic or attribute
of a data unit which is measured.
2. IHEDORO, MICHAEL and IFEOMA are data units: a data unit is one entity (such as a person or
business) in the population being studied, about which data are collected.
3. Male and Female are non-numeric or categorical values.
4. 60, 40 and 75 are values of a numeric variable.
5. B, F and A are values of an ordinal variable.

Methods of sampling from a population


It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the cost
and workload, and may make it easier to obtain high quality information, but this has to be
balanced against having a large enough sample size with enough power to detect a true
association.

If a sample is to be used, by whatever method it is chosen, it is important that the


individuals selected are representative of the whole population. This may involve specifically
targeting hard to reach groups. For example, if the electoral roll for a town was used to
identify participants, some people, such as the homeless, would not be registered and
therefore excluded from the study by default.

There are several different sampling techniques available, and they can be subdivided into
two groups: probability sampling and non-probability sampling. In probability (random)
sampling, you start with a complete sampling frame of all eligible individuals from which you
select your sample. In this way, all eligible individuals have a chance of being chosen for the
sample, and you will be more able to generalise the results from your study. Probability
sampling methods tend to be more time-consuming and expensive than non-probability
sampling. In non-probability (non-random) sampling, you do not start with a complete
sampling frame, so some individuals have no chance of being selected. Consequently, you
cannot estimate the effect of sampling error and there is a significant risk of ending up with
a non-representative sample which produces non-generalisable results. However, non-
probability sampling methods tend to be cheaper and more convenient, and they are useful
for exploratory research and hypothesis generation.

Probability Sampling Methods

1. Simple random sampling

In this case each individual is chosen entirely by chance and each member of the population has an
equal chance, or probability, of being selected. One way of obtaining a random sample is to give
each individual in a population a number, and then use a table of random numbers to decide which
individuals to include. For example, if you have a sampling frame of 1000 individuals, labelled 0 to
999, use groups of three digits from the random number table to pick your sample. So, if the first
three numbers from the random number table were 094, select the individual labelled “94”, and so
on. As with all probability sampling methods, simple random sampling allows the sampling error to
be calculated and reduces selection bias. A specific advantage is that it is the most straightforward
method of probability sampling. A disadvantage of simple random sampling is that you may not
select enough individuals with your characteristic of interest, especially if that characteristic is
uncommon. It may also be difficult to define a complete sampling frame, and inconvenient to
contact the selected individuals, especially if different forms of contact are required (email, phone,
post) and your sample units are scattered over a wide geographical area.
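A minimal sketch of this in Python, assuming the hypothetical sampling frame of 1000 individuals labelled 0 to 999 from the example above:

```python
# Simple random sampling: every individual has an equal chance of
# selection, and sampling is without replacement.
import random

sampling_frame = list(range(1000))           # individuals labelled 0..999
sample = random.sample(sampling_frame, k=100)
print(sorted(sample)[:10])                   # a few of the selected labels
```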

2. Systematic sampling

Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a population of
size x, you should select every (x/n)th individual for the sample. For example, if you wanted a
sample size of 100 from a population of 1000, select every 1000/100 = 10th member of the
sampling frame.

Systematic sampling is often more convenient than simple random sampling, and it is easy
to administer. However, it may also lead to bias, for example if there are underlying patterns
in the order of the individuals in the sampling frame, such that the sampling technique
coincides with the periodicity of the underlying pattern. As a hypothetical example, if a
group of students were being sampled to gain their opinions on college facilities, but the
Student Record Department’s central list of all students was arranged such that the sex of
students alternated between male and female, choosing an even interval (e.g. every
20th student) would result in a sample of all males or all females. Whilst in this example the
bias is obvious and should be easily corrected, this may not always be the case.
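A minimal Python sketch of systematic selection; the random start within the first interval is an assumption added here so the sample does not always begin with the first individual:

```python
# Systematic sampling: take every k-th individual from the frame,
# where k = population size / sample size.
import random

def systematic_sample(frame, n):
    k = len(frame) // n          # the sampling interval
    start = random.randrange(k)  # random starting point in the first interval
    return frame[start::k][:n]

frame = list(range(1000))
print(systematic_sample(frame, 100)[:5])  # e.g. [7, 17, 27, 37, 47]
```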
 

3. Stratified sampling

In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample is
then obtained by taking equal sample sizes from each stratum. In stratified sampling, it may
also be appropriate to choose non-equal sample sizes from each stratum. For example, in a
study of the health outcomes of nursing staff in a county, if there are three hospitals each
with different numbers of nursing staff (hospital A has 500 nurses, hospital B has 1000 and
hospital C has 2000), then it would be appropriate to choose the sample numbers from each
hospital proportionally (e.g. 10 from hospital A, 20 from hospital B and 40 from hospital C).
This ensures a more realistic and accurate estimation of the health outcomes of nurses
across the county, whereas simple random sampling would over-represent nurses from
hospitals A and B. The fact that the sample was stratified should be taken into account at
the analysis stage.

Stratified sampling improves the accuracy and representativeness of the results by reducing
sampling bias. However, it requires knowledge of the appropriate characteristics of the
sampling frame (the details of which are not always available), and it can be difficult to
decide which characteristic(s) to stratify by.
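A minimal Python sketch of proportional stratified sampling for the hypothetical three-hospital example above, drawing a total sample of 70:

```python
# Proportional allocation: each stratum contributes in proportion to
# its size (500:1000:2000 nurses -> 10:20:40 of a sample of 70).
import random

strata = {"A": list(range(500)), "B": list(range(1000)), "C": list(range(2000))}
total = sum(len(s) for s in strata.values())

sample = {}
for name, stratum in strata.items():
    n_h = round(70 * len(stratum) / total)   # stratum sample size
    sample[name] = random.sample(stratum, n_h)

print({name: len(s) for name, s in sample.items()})  # {'A': 10, 'B': 20, 'C': 40}
```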
 

4. Clustered sampling

In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In two-
stage cluster sampling, a selection of individuals from each cluster is then randomly selected
for inclusion. Clustering should be taken into account in the analysis. The General Household
Survey, which is undertaken annually in England, is a good example of a (one-stage) cluster
sample. All members of the selected households (clusters) are included in the survey.

Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact lots of
individuals in a few GP practices than a few individuals in many different GP practices.
Disadvantages include an increased risk of bias, if the chosen clusters are not representative
of the population, resulting in an increased sampling error.
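A minimal Python sketch of one-stage cluster sampling; the clusters and their members are invented for illustration:

```python
# One-stage cluster sampling: randomly select whole clusters, then
# include every member of each chosen cluster.
import random

clusters = {
    "practice_1": ["p01", "p02", "p03"],
    "practice_2": ["p04", "p05"],
    "practice_3": ["p06", "p07", "p08", "p09"],
    "practice_4": ["p10", "p11"],
}
chosen = random.sample(list(clusters), k=2)        # select 2 clusters at random
sample = [m for c in chosen for m in clusters[c]]  # all members of chosen clusters
print(chosen, sample)
```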
 

Non-Probability Sampling Methods

1. Convenience sampling

Convenience sampling is perhaps the easiest method of sampling, because participants are
selected based on availability and willingness to take part. Useful results can be obtained,
but the results are prone to significant bias, because those who volunteer to take part may
be different from those who choose not to (volunteer bias), and the sample may not be
representative of other characteristics, such as age or sex. Note: volunteer bias is a risk of all
non-probability sampling methods.
 

2. Quota sampling

This method of sampling is often used by market researchers. Interviewers are given a quota
of subjects of a specified type to attempt to recruit. For example, an interviewer might be
told to go out and select 20 adult men, 20 adult women, 10 teenage girls and 10 teenage
boys so that they could interview them about their television viewing. Ideally the quotas
chosen would proportionally represent the characteristics of the underlying population.
Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics that
weren’t considered (a consequence of the non-random nature of sampling). 

3. Judgement (or Purposive) Sampling

Also known as selective, or subjective, sampling, this technique relies on the judgement of
the researcher when choosing who to ask to participate. Researchers may thus implicitly
choose a "representative" sample to suit their needs, or specifically approach individuals
with certain characteristics. This approach is often used by the media when canvassing the
public for opinions and in qualitative research. Judgement sampling has the advantage of
being time- and cost-effective to perform whilst resulting in a range of responses
(particularly useful in qualitative research). However, in addition to volunteer bias, it is also
prone to errors of judgement by the researcher, and the findings, whilst being potentially
broad, will not necessarily be representative.
 

4. Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach groups.
Existing subjects are asked to nominate further subjects known to them, so the sample
increases in size like a rolling snowball. For example, when carrying out a survey of risk
behaviours amongst intravenous drug users, participants may be asked to nominate other
users to be interviewed.

Snowball sampling can be effective when a sampling frame is difficult to identify. However,
by selecting friends and acquaintances of subjects already investigated, there is a significant
risk of selection bias (choosing a large number of people with similar characteristics or views
to the initial individual identified).
 

Bias in sampling

There are five important potential sources of bias that should be considered when selecting
a sample, irrespective of the method used. Sampling bias may be introduced when:

1. Any pre-agreed sampling rules are deviated from


2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to
contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people
who have recently moved to an area)

DATA COLLECTION TECHNIQUES

There are four different data collection techniques: observation, questionnaire, interview
and focus group discussion.

1. OBSERVATION: Seeing is believing, they say. Making direct observations of simple
phenomena can be a very quick and effective way of collecting data with minimal intrusion.
Establishing the right mechanism for making the observation is all you need.

Advantages

Non-responsive sample subjects are a non-issue when you’re simply making direct
observation.

If the observation is simple and doesn’t require interpretation (e.g. the number of cars
driving through an intersection per hour), this model doesn’t require a very extensive and
well-tailored training regime for the survey workforce.

Infrastructure requirement and preparation time are minimal for simple observations.

Disadvantages

More complex observations that ask observers to interpret something (e.g. how many cars
are driving dangerously) require more complex training and are prone to bias.

Analysis may rely heavily on experts who must know what to observe and how to interpret
the observations once the data collection is done.

There is the possibility of missing out on the complete picture due to the lack of direct
interaction with sample subjects.

Use Case

Making direct observations can be a good way of collecting simple information about
mechanical, orderly tasks, like checking the number of manual interventions required in a
day to keep an assembly line functioning smoothly.

2. QUESTIONNAIRE: A questionnaire can be defined as a set of questions, along with
answer choices, asked of respondents, mainly used for gathering information or for survey
purposes. Questionnaires are stand-alone instruments of data collection administered to the
sample subjects through mail, phone or online. They have long been one of the most
popular data collection techniques.

Advantages

Questionnaires give the researchers an opportunity to carefully structure and formulate the
data collection plan with precision.

Respondents can take these questionnaires at a convenient time and think about the
answers at their own pace.

The reach is theoretically limitless. The questionnaire can reach every corner of the globe if
the medium allows for it.

Disadvantages

Questionnaires without human intervention (as we have taken them here) can be quite
passive and miss out on some of the finer nuances, leaving the responses open to
interpretation. Interviews and focus group discussions, as we shall see later, are
instrumental in overcoming this shortfall of questionnaires.

Response rates can be quite low. Questionnaires can be designed well by choosing the
right question types to optimize response rates, but very little can be done to encourage
the respondents without directly conversing with them.

Use Case

The survey can be carried out through directly administered questionnaires when the
sample subjects are relatively well-versed in the ideas being discussed and comfortable
making the right responses without assistance. A survey about newspaper reading
habits, for example, would be perfect for this mode.

3. INTERVIEW:

Conducting interviews can help you overcome most of the shortfalls of the previous two
data collection techniques that we have discussed here by allowing you to build a deeper
understanding of the thinking behind the respondents’ answers.

Advantages

 Interviews help the researchers uncover rich, deep insight and learn information that
they may have missed otherwise.
 The presence of an interviewer can give the respondents additional comfort while
answering the questionnaire and ensure correct interpretation of the questions.
 The physical presence of a persistent, well-trained interviewer can significantly
improve the response rate.

Disadvantages

 Reaching out to all respondents to conduct interviews is a massive, time-consuming


exercise that leads to a major increase in the cost of conducting a survey.
 To ensure the effectiveness of the whole exercise, the interviewers must be well-
trained in the necessary soft skills and the relevant subject matter.

Use Case

Interviews are the most suitable technique for surveys that touch upon complex issues like
healthcare and family welfare. The presence of an interviewer to help respondents
interpret and understand the questions can be critical to the success of the survey.

4. FOCUS GROUP DISCUSSION:

Focus group discussions take the interactive benefits of an interview to the next level by
bringing a carefully chosen group together for a moderated discussion on the subject of the
survey.

Advantages

 The presence of several relevant people together at the same time can encourage
them to engage in a healthy discussion and help researchers uncover information that they
may not have envisaged.
 It helps the researchers corroborate the facts instantly; any inaccurate response will
most likely be countered by other members of the focus group.
 It gives the researchers a chance to view both sides of the coin and build a balanced
perspective on the matter.

Disadvantages

 Finding groups of people who are relevant to the survey and persuading them to
come together for the session at the same time can be a difficult task.
 The presence of excessively loud members in the focus group can subdue the
opinions of those who are less vocal.
 The members of a focus group can often fall prey to group-think if one of them turns
out to be remarkably persuasive and influential. This will bury the diversity of opinion that
may have otherwise emerged. The moderator of a focus group discussion must be on guard
to prevent this from happening.

Use Case

Focus group discussions with the lecturers of a university can be a good way of collecting
information on ways in which our education system can be made more research-driven.

DATA ANALYTICAL TECHNIQUES

Data analysis is defined as a process of cleaning, transforming, and modeling data to
discover useful information for business decision-making. The purpose of data analysis is to
extract useful information from data and to make decisions based upon that analysis.

Whenever we make a decision in our day-to-day lives, we think about what happened
last time or what will happen if we choose a particular option. This is nothing but
analyzing our past or future and making decisions based on it. For that, we gather memories
of our past or dreams of our future, which is nothing but data analysis. When an analyst
does the same thing for business purposes, it is called data analysis.

What is the first thing that comes to mind when we see data? The first instinct is to find
patterns, connections, and relationships. We look at the data to find meaning in it.

Similarly, in research, once data is collected, the next step is to get insights from it. For
example, if a clothing brand is trying to identify the latest trends among young women, the
brand will first reach out to young women and ask them questions relevant to the research
objective. After collecting this information, the brand will analyze that data to identify
patterns — for example, it may discover that most young women would like to see more
variety of jeans.

Data analysis is how researchers go from a mass of data to meaningful insights. There are
many different data analysis methods, depending on the type of research. Here are a few
methods you can use to analyze quantitative and qualitative data.

Analyzing Quantitative Data


Data Preparation

The first stage of analyzing data is data preparation, where the aim is to convert raw data
into something meaningful and readable. It includes four steps:

Step 1: Data Validation

The purpose of data validation is to find out, as far as possible, whether the data collection
was done as per the pre-set standards and without any bias. It is a four-step process, which
includes…

 Fraud, to infer whether each respondent was actually interviewed or not.

 Screening, to make sure that respondents were chosen as per the research criteria.

 Procedure, to check whether the data collection procedure was duly followed.

 Completeness, to ensure that the interviewer asked the respondent all the
questions, rather than just a few required ones.

To do this, researchers would need to pick a random sample of completed surveys and
validate the collected data. (Note that this can be time-consuming for surveys with lots of
responses.) For example, imagine a survey with 200 respondents split into 2 cities. The
researcher can pick a sample of 20 random respondents from each city. After this, the
researcher can reach out to them through email or phone and check their responses to a
certain set of questions.

Step 2: Data Editing

Typically, large data sets include errors. For example, respondents may fill fields incorrectly
or skip them accidentally. To make sure that there are no such errors, the researcher should
conduct basic data checks, check for outliers, and edit the raw research data to identify and
clear out any data points that may hamper the accuracy of the results.

For example, an error could be fields that were left empty by respondents. While editing the
data, it is important to make sure to remove or fill all the empty fields.

Step 3: Data Coding

This is one of the most important steps in data preparation. It refers to grouping and
assigning values to responses from the survey.

For example, if a researcher has interviewed 1,000 people and now wants to find the
average age of the respondents, the researcher will create age buckets and categorize the
age of each respondent according to these codes. (For example, respondents between 13-15
years old would have their age coded as 0, 16-18 as 1, 19-21 as 2, etc.)

Then, during analysis, the researcher can deal with simplified age brackets rather than a
massive range of individual ages, as in the sketch below.
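A minimal Python sketch of this coding step; the bucket edges follow the example above, and the final catch-all code is an assumption added for illustration:

```python
# Data coding: map raw ages into coded buckets so the analysis deals
# with a few categories rather than many individual ages.
def code_age(age):
    if 13 <= age <= 15:
        return 0
    elif 16 <= age <= 18:
        return 1
    elif 19 <= age <= 21:
        return 2
    return 3  # catch-all bucket for older respondents (assumed)

ages = [14, 17, 19, 20, 16, 25]
print([code_age(a) for a in ages])  # [0, 1, 2, 2, 1, 3]
```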
Quantitative Data Analysis Methods

After these steps, the data is ready for analysis. The two most commonly used quantitative
data analysis methods are descriptive statistics and inferential statistics.

Descriptive Statistics

Typically descriptive statistics (also known as descriptive analysis) is the first level of
analysis. It helps researchers summarize the data and find patterns.

Descriptive statistics allow you to characterize your data based on its properties. There are
four major types of descriptive statistics:

1. Measures of Frequency:

* Count, Percent, Frequency

* Shows how often something occurs

* Use this when you want to show how often a response is given

2. Measures of Central Tendency

* Mean, Median, and Mode

* Locates the distribution by various points

* Use this when you want to show an average or the most commonly indicated response

3. Measures of Dispersion or Variation

* Range, Variance, Standard Deviation

* Identifies the spread of scores by stating intervals

* Range = difference between the high and low points

* Variance or Standard Deviation = difference between observed scores and the mean

* Use this when you want to show how "spread out" the data are. It is helpful to know when
your data are so spread out that it affects the mean

4. Measures of Position

* Percentile Ranks, Quartile Ranks

* Describes how scores fall in relation to one another. Relies on standardized scores

* Use this when you need to compare scores to a normalized score (e.g., a national norm)
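A minimal Python sketch computing one measure from each of the four groups above, using the built-in statistics module (the scores are invented):

```python
# One example from each group of descriptive statistics.
from collections import Counter
import statistics

scores = [60, 40, 75, 60, 55, 80, 60, 45]

print(Counter(scores))                    # frequency: counts of each score
print(statistics.mean(scores))            # central tendency: mean
print(statistics.median(scores))          # central tendency: median
print(statistics.mode(scores))            # central tendency: mode (60)
print(max(scores) - min(scores))          # dispersion: range
print(statistics.stdev(scores))           # dispersion: sample standard deviation
print(statistics.quantiles(scores, n=4))  # position: quartile cut points
```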

Descriptive statistics provide absolute numbers. However, they do not explain the rationale
or reasoning behind those numbers. Before applying descriptive statistics, it’s important to
think about which one is best suited for your research question and what you want to show.
For example, a percentage is a good way to show the gender distribution of respondents.

Descriptive statistics are most helpful when the research is limited to the sample and does
not need to be generalized to a larger population. For example, if you are comparing the
percentage of children vaccinated in two different villages, then descriptive statistics is
enough.

Since descriptive analysis is mostly used for analyzing a single variable, it is often called
univariate analysis.

Analyzing Qualitative Data

Qualitative data analysis works a little differently from quantitative data, primarily because
qualitative data is made up of words, observations, images, and even symbols. Deriving
absolute meaning from such data is nearly impossible; hence, it is mostly used for
exploratory research. While in quantitative research there is a clear distinction between the
data preparation and data analysis stage, analysis for qualitative research often begins as
soon as the data is available.

Data Preparation and Basic Data Analysis

Analysis and preparation happen in parallel and include the following steps:

1. Getting familiar with the data: Since most qualitative data is just words, the
researcher should start by reading the data several times to get familiar with it and start
looking for basic observations or patterns. This also includes transcribing the data.

2. Revisiting research objectives: Here, the researcher revisits the research objective
and identifies the questions that can be answered through the collected data.

3. Developing a framework: Also known as coding or indexing, here the researcher


identifies broad ideas, concepts, behaviors, or phrases and assigns codes to them. For
example, coding age, gender, socio-economic status, and even concepts such as the positive
or negative response to a question. Coding is helpful in structuring and labeling the data.

4. Identifying patterns and connections: Once the data is coded, the researcher can start
identifying themes, looking for the most common responses to questions, identifying data
or patterns that can answer research questions, and finding areas that can be explored
further.

Qualitative Data Analysis Methods

Several methods are available to analyze qualitative data. The most commonly used data
analysis methods are:

 Content analysis: This is one of the most common methods of analyzing qualitative
data. It is used to analyze documented information in the form of texts, media, or even
physical items. When to use this method depends on the research questions. Content
analysis is usually used to analyze responses from interviewees.

 Narrative analysis: This method is used to analyze content from various sources,
such as interviews of respondents, observations from the field, or surveys. It focuses on
using the stories and experiences shared by people to answer the research questions.

 Discourse analysis: Like narrative analysis, discourse analysis is used to analyze


interactions with people. However, it focuses on analyzing the social context in which the
communication between the researcher and the respondent occurred. Discourse analysis
also looks at the respondent’s day-to-day environment and uses that information during
analysis.

 Grounded theory: This refers to using qualitative data to explain why a certain
phenomenon happened. It does this by studying a variety of similar cases in different
settings and using the data to derive causal explanations. Researchers may alter the
explanations or create new ones as they study more cases until they arrive at an explanation
that fits all cases.

These methods are the ones used most commonly. However, other data analysis methods,
such as conversational analysis, are also available.

Data analysis is perhaps the most important component of research. Weak analysis
produces inaccurate results that not only hamper the authenticity of the research but also
make the findings unusable. It’s imperative to choose your data analysis methods carefully
to ensure that your findings are insightful and actionable.

1. MEASURES OF FREQUENCY

The first step in turning data into information is to create a distribution. The most
primitive way to present a distribution is to simply list, in one column, each value
that occurs in the population and, in the next column, the number of times it occurs.
It is customary to list the values from lowest to highest. This simple listing is called
a frequency distribution. A more elegant way to turn data into information is to draw
a graph of the distribution. Customarily, the values that occur are put along the
horizontal axis and the frequency of the value is on the vertical axis.

Example

STUDENT AGES   FREQUENCY   RELATIVE FREQUENCY
20             15          0.32
23             5           0.11
19             7           0.15
21             9           0.20
n = 46

[Bar charts: frequency and relative frequency distributions of student ages (19, 20, 21, 23 years).]
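A minimal Python sketch of building such a frequency and relative frequency distribution; the raw ages are invented for illustration:

```python
# Build a frequency distribution with Counter, then convert each
# count into a relative frequency (count / number of observations).
from collections import Counter

ages = [20, 19, 21, 20, 23, 19, 20, 21, 20, 19]
freq = Counter(ages)
n = len(ages)
for age in sorted(freq):
    print(age, freq[age], round(freq[age] / n, 2))
```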
2. MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a single value that attempts to describe a set of data by identifying
the central position within that set of data. As such, measures of central tendency are sometimes
called measures of central location. They are also classed as summary statistics. The mean (often
called the average) is most likely the measure of central tendency that you are most familiar with,
but there are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than
others.

Mean (Arithmetic)

The mean (or average) is the most popular and well-known measure of central tendency. It
can be used with both discrete and continuous data, although its use is most often with
continuous data. The mean is equal to the sum of all the values in the data set divided by
the number of values in the data set. So, if we have n values in a data set and they have
values x1, x2, …, xn, the sample mean, usually denoted by x¯ (pronounced "x bar"), is:

x¯ = (x1 + x2 + ⋯ + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter ∑,
pronounced "sigma", which means "sum of...":

x¯ = ∑x / n

You may have noticed that the above formula refers to the sample mean. So, why have we
called it a sample mean? This is because, in statistics, samples and populations have very
different meanings, and these differences are very important, even if, in the case of the
mean, they are calculated in the same way. To acknowledge that we are calculating the
population mean and not the sample mean, we use the Greek lower-case letter "mu",
denoted as μ, and divide by the population size N:

μ = ∑x / N

The mean is essentially a model of your data set. It is the value that is most common. You
will notice, however, that the mean is not often one of the actual values that you have
observed in your data set. However, one of its important properties is that it minimises
error in the prediction of any one value in your data set. That is, it is the value that produces
the lowest amount of error from all other values in the data set.

An important property of the mean is that it includes every value in your data set as part of
the calculation. In addition, the mean is the only measure of central tendency where the
sum of the deviations of each value from the mean is always zero.

When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. These are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value. For example, consider the wages of staff at a
factory below:

Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests
that this mean value might not be the best way to accurately reflect the typical salary of a
worker, as most workers have salaries in the $12k to 18k range. The mean is being skewed
by the two large salaries. Therefore, in this situation, we would like to have a better
measure of central tendency. As we will find out later, taking the median would be a better
measure of central tendency in this situation.

Another time when we usually prefer the median over the mean (or mode) is when our data
is skewed (i.e., the frequency distribution for our data is skewed). If we consider the normal
distribution - as this is the most frequently assessed in statistics - when the data is perfectly
normal, the mean, median and mode are identical. Moreover, they all represent the most
typical value in the data set. However, as the data becomes skewed the mean loses its
ability to provide the best central location for the data because the skewed data is dragging
it away from the typical value. However, the median best retains this position and is not as
strongly influenced by the skewed values. This is explained in more detail in the skewed
distribution section later in this guide.

Mean of grouped data

A mean can be determined for grouped data, or data that is placed in intervals. Unlike listed
data, the individual values for grouped data are not available, and you are not able to
calculate their sum. To calculate the mean of grouped data, the first step is to determine the
midpoint (also called a class mark) of each interval, or class. These midpoints must then be
multiplied by the frequencies of the corresponding classes. The sum of the products divided
by the total number of values will be the value of the mean.

In other words, the mean for a population can be found by dividing ∑mf by N, where m is
the midpoint of the class (the sum of the lower and upper class limits divided by 2)
and f is the frequency. As a result, the formula μ = ∑mf/N can be written to summarize the
steps used to determine the value of the mean for a set of grouped data. If the set of data
represented a sample instead of a population, the process would remain the same, and the
formula would be written as x¯ = ∑mf/n.

N= number of data for population


n= number of data for sample
∑= sum of
x¯= sample mean
μ= population mean

Consider this data set (1a):

2, 4, 5, 3, 2, 5, 7, 8, 3, 9, 10, 2, 5, 4, 5, 8, 2, 11, 13, 2, 20, 3, 15, 15, 17, 19, 19, 18, 18, 4, 12, 12, 7, 4, 8,
9, 8, 4, 9, 18, 13, 15, 18, 16, 20

Find the mean.

Class group   Frequency (f)   Mid-point (m)   mf
0-5           17              2.5             42.5
5.5-10        10              7.75            77.5
10.5-15       8               12.75           102
15.5-20       10              17.75           177.5
              n = 45                          ∑mf = 399.5

Mean (x¯) = ∑mf/n = 399.5/45 = 8.88
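A minimal Python sketch of the same grouped-mean calculation:

```python
# Grouped mean: sum of (midpoint x frequency) divided by total frequency.
midpoints = [2.5, 7.75, 12.75, 17.75]
frequencies = [17, 10, 8, 10]

n = sum(frequencies)
mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
print(round(mean, 2))  # 8.88
```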

Median

The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to calculate
the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56 (highlighted in bold). It is the middle
mark because there are 5 scores before it and 5 scores after it. For an odd number of scores,
the median is the ((n + 1)/2)th score. This works fine when you have an odd number of scores,
but what happens when you have an even number of scores? What if you had only 10
scores? Well, you simply take the middle two scores, i.e. the (n/2)th and ((n/2) + 1)th scores,
and average them. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th scores in our data set and average them to get a
median of 55.5: the 5th score is 55 and the 6th is 56, so the median is (55 + 56)/2 = 55.5.

Median of grouped data

M = L + ((n/2 - cf) / f) × c

To find the median class in data set 1a above:

Class group   Frequency (f)   Mid-point (m)   mf      cf
0-5           17              2.5             42.5    17
5.5-10        10              7.75            77.5    27
10.5-15       8               12.75           102     35
15.5-20       10              17.75           177.5   45
              n = 45                          ∑mf = 399.5

M = value of (n/2)th observation

= value of (45/2)th observation

= value of 22.5th observation

From the column of cumulative frequency cf, we find that the 22.5th observation lies in the class 5.5-
10.

∴ The median class is 5.5-10.

Now, using the formula, where:

L = lower boundary point of the median class = 5.5

n = total frequency = 45

cf = cumulative frequency of the class preceding the median class = 17

f = frequency of the median class = 10

c = class length of the median class = 10 - 5.5 = 4.5

Median M = L + ((n/2 - cf) / f) × c

M = 5.5 + ((22.5 - 17) / 10) × 4.5

= 5.5 + 0.55 × 4.5

= 7.975 ≈ 7.98

7.98 lies within the median class 5.5-10.
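A minimal Python sketch of this grouped-median calculation, with each class given as (lower boundary, class length, frequency) to match the table above:

```python
# Grouped median: find the class containing the (n/2)-th observation,
# then apply M = L + ((n/2 - cf) / f) * c.
classes = [(0, 5, 17), (5.5, 4.5, 10), (10.5, 4.5, 8), (15.5, 4.5, 10)]

n = sum(f for _, _, f in classes)
half, cf = n / 2, 0
for L, c, f in classes:
    if cf + f >= half:                            # the median class
        print(round(L + (half - cf) / f * c, 3))  # 7.975
        break
    cf += f
```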

Mode

The mode is the most frequent score in our data set: it is simply the value with the highest
frequency, and it appears as the highest bar on a bar chart or histogram. You can, therefore,
sometimes consider the mode as being the most popular option.

65 55 89 56 35 14 56 55 87 45

The modes, Z, of the data above are 55 and 56 (each occurs twice).

2, 2, 3, 3, 3, 4, 5, 5, 6, 6, 7, 2. The modes here are 2 and 3 (each occurs three times).

Mode for a grouped data

Class group   Frequency (f)   Mid-point (m)   mf      cf
0-5           17              2.5             42.5    17
5.5-10        10              7.75            77.5    27
10.5-15       8               12.75           102     35
15.5-20       10              17.75           177.5   45
              n = 45                          ∑mf = 399.5

The modal class here is 0-5 (it has the highest frequency, 17).

Formula for mode, where:

L = lower boundary point of the modal class

f1 = frequency of the modal class

f0 = frequency of the preceding class

f2 = frequency of the succeeding class

c = class length of the modal class

Z = L + ((f1 - f0) / (2f1 - f0 - f2)) × c

Z = 0 + ((17 - 0) / (2 × 17 - 0 - 10)) × 4.5

= 0 + (17/24) × 4.5 ≈ 3.19

3.19 lies in the modal class 0-5.
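A minimal Python sketch of the same grouped-mode formula:

```python
# Grouped mode: Z = L + ((f1 - f0) / (2*f1 - f0 - f2)) * c.
L, c = 0, 4.5           # lower boundary and length of the modal class 0-5
f1, f0, f2 = 17, 0, 10  # modal, preceding and succeeding class frequencies

Z = L + (f1 - f0) / (2 * f1 - f0 - f2) * c
print(round(Z, 2))      # 3.19, which lies in the modal class 0-5
```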

An example of a mode is presented below:

Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below:

[Bar chart: frequency of each form of transport, with bus the most common category.]
We can see above that the most common form of transport, in this particular data set, is the
bus. However, one of the problems with the mode is that it is not unique, so it leaves us
with problems when we have two or more values that share the highest frequency, such as
below:

[Bar chart: two categories share the highest frequency, so the mode is not unique.]
We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data because we are more likely not to
have any one value that is more frequent than the others. For example, consider measuring
30 people's weights (to the nearest 0.1 kg). How likely is it that we will find two or more
people with exactly the same weight (e.g., 67.4 kg)? The answer is: probably very unlikely.
Many people might be close, but with such a small sample (30 people) and a large range of
possible weights, you are unlikely to find two people with exactly the same weight; that is,
to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.

Another problem with the mode is that it will not provide us with a very good measure of
central tendency when the most common mark is far away from the rest of the data in the
data set, as depicted in the diagram below:

[Chart: a distribution whose mode lies far from the bulk of the data.]
In the above diagram the mode has a value of 2. We can clearly see, however, that the
mode is not representative of the data, which is mostly concentrated around the 20 to 30
value range. To use the mode to describe the central tendency of this data set would be
misleading.

Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed because this is a common
assumption underlying many statistical tests. An example of a normally distributed set of
data is presented below:

[Figure: a normally distributed data set.]
When you have a normally distributed sample, you can legitimately use either the mean or
the median as your measure of central tendency. In fact, in any symmetrical distribution the
mean, median and mode are equal. However, in this situation, the mean is widely preferred
as the best measure of central tendency because it is the measure that includes all the
values in the data set for its calculation, and any change in any of the scores will affect the
value of the mean. This is not the case with the median or mode.

However, when our data is skewed, for example, as with the right-skewed data set below:

[Figure: a right-skewed data set.]
We find that the mean is being dragged in the direction of the skew. In these situations, the
median is generally considered to be the best representative of the central location of the
data. The more skewed the distribution, the greater the difference between the median and
mean, and the greater emphasis should be placed on using the median as opposed to the
mean. A classic example of the above right-skewed distribution is income (salary), where
higher-earners provide a false representation of the typical income if expressed as a mean
and not a median.

If dealing with a normal distribution, and tests of normality show that the data is non-
normal, it is customary to use the median instead of the mean. However, this is more a rule
of thumb than a strict guideline. Sometimes, researchers wish to report the mean of a
skewed distribution if the median and mean are not appreciably different (a subjective
assessment), and if it allows easier comparisons to previous research to be made.

Summary of when to use the mean, median and mode

Please use the following summary table to know what the best measure of central tendency
is with respect to the different types of variable.

Type of Variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median

Skewness of data: this is the degree of distortion from the symmetrical bell curve, or
normal distribution. It measures the lack of symmetry in a data distribution.
It differentiates extreme values in one tail versus the other. A symmetrical distribution will
have a skewness of 0.

There are two types of skewness: positive and negative.

Positive skewness is when the tail on the right side of the distribution is longer or
fatter. The mean and median will be greater than the mode.

Negative skewness is when the tail on the left side of the distribution is longer or fatter than
the tail on the right side. The mean and median will be less than the mode.
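A minimal Python sketch of one common skewness measure, the Fisher-Pearson coefficient g1 = m3 / m2^(3/2), where m2 and m3 are the second and third moments about the mean; applied to the skewed salary data from earlier, it comes out strongly positive:

```python
# Sample skewness: 0 for symmetrical data, positive for a long right
# tail, negative for a long left tail.
def skewness(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third moment
    return m3 / m2 ** 1.5

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # in $k, right-skewed
print(skewness(salaries))  # clearly positive
```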

Chapter 1. Descriptive Statistics and
Frequency Distributions
This chapter is about describing populations and samples, a subject known as descriptive
statistics. This will all make more sense if you keep in mind that the information you want to
produce is a description of the population or sample as a whole, not a description of one
member of the population. The first topic in this chapter is a discussion of distributions,
essentially pictures of populations (or samples). Second will be the discussion of descriptive
statistics. The topics are arranged in this order because the descriptive statistics can be
thought of as ways to describe the picture of a population, the distribution.

Distributions

The first step in turning data into information is to create a distribution. The most primitive
way to present a distribution is to simply list, in one column, each value that occurs in the
population and, in the next column, the number of times it occurs. It is customary to list the
values from lowest to highest. This simple listing is called a frequency distribution. A more
elegant way to turn data into information is to draw a graph of the distribution. Customarily,
the values that occur are put along the horizontal axis and the frequency of the value is on the
vertical axis.

Ann is the equipment manager for the Chargers athletic teams at Camosun College, located in
Victoria, British Columbia. She called the basketball and volleyball team managers and
collected the following data on sock sizes used by their players. Ann found out that last year
the basketball team used 14 pairs of size 7 socks, 18 pairs of size 8, 15 pairs of size 9, and 6
pairs of size 10. The volleyball team used 3 pairs of size 6, 10 pairs of size 7, 15
pairs of size 8, 5 pairs of size 9, and 11 pairs of size 10. Ann arranged her data into a
distribution and then drew a graph called a histogram. Ann could have created a relative
frequency distribution as well as a frequency distribution. The difference is that instead of
listing how many times each value occurred, Ann would list what proportion of her sample
was made up of socks of each size.
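
As a minimal sketch of the idea (illustrative Python, not Ann's actual spreadsheet work), her combined counts can be turned into a frequency distribution and a relative frequency distribution like this:

# Ann's counts for each team, keyed by sock size
basketball = {7: 14, 8: 18, 9: 15, 10: 6}
volleyball = {6: 3, 7: 10, 8: 15, 9: 5, 10: 11}

# Combine the two teams into one frequency distribution
freq = {}
for team in (basketball, volleyball):
    for size, count in team.items():
        freq[size] = freq.get(size, 0) + count

total = sum(freq.values())   # 97 pairs in all
for size in sorted(freq):
    # size, its frequency, and its relative frequency (proportion)
    print(size, freq[size], round(freq[size] / total, 2))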

You can use the Excel template below (Figure 1.1) to see all the histograms and frequencies
she has created. You may also change her numbers in the yellow cells to see how the graphs
will change automatically.

Figure 1.1 Interactive Excel Template of a Histogram – see Appendix 1.

Notice that Ann has drawn the graphs differently. In the first graph, she has used bars for
each value, while on the second, she has drawn a point for the relative frequency of each size,
and then “connected the dots”. While both methods are correct, when you have values that
are continuous, you will want to do something more like the “connect the dots” graph. Sock
sizes are discrete; they only take on a limited number of values. Other things
have continuous values; they can take on an infinite number of values, though we are often
in the habit of rounding them off. An example is how much students weigh. While we usually
give our weight in whole kilograms in Canada (“I weigh 60 kilograms”), few have a weight
that is exactly so many kilograms. When you say “I weigh 60”, you actually mean that you
weigh between 59 1/2 and 60 1/2 kilograms. We are heading toward a graph of a distribution
of a continuous variable where the relative frequency of any exact value is very small, but the
relative frequency of observations between two values is measurable. What we want to do is
to get used to the idea that the total area under a “connect the dots” relative frequency graph,
from the lowest to the highest possible value, is one. Then the part of the area under the graph
between two values is the relative frequency of observations with values within that range.
The height of the line above any particular value has lost any direct meaning, because it is
now the area under the line between two values that is the relative frequency of an
observation between those two values occurring.

You can get some idea of how this works if you go back to the bar graph of the distribution of
sock sizes, but draw it with relative frequency on the vertical axis. If you arbitrarily decide
that each bar has a width of one, then the area under the curve between 7.5 and 8.5 is simply
the height times the width of the bar for sock size 8: .34 × 1 (33 of the 97 pairs were size 8). If you wanted to find the
relative frequency of sock sizes between 6.5 and 8.5, you could simply add together the area
of the bar for size 7 (that’s between 6.5 and 7.5) and the bar for size 8 (between 7.5 and 8.5).
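
A quick sketch of that area-adding idea, using the combined relative frequencies from Ann's data (33 of the 97 pairs were size 8 and 24 were size 7):

# Relative frequency of each size, with each bar one unit wide
rel_freq = {6: 3/97, 7: 24/97, 8: 33/97, 9: 20/97, 10: 17/97}

area_size8 = rel_freq[8] * 1                   # about .34
area_6_5_to_8_5 = rel_freq[7] + rel_freq[8]    # about .59
print(round(area_size8, 2), round(area_6_5_to_8_5, 2))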

Descriptive statistics

Now that you see how a distribution is created, you are ready to learn how to describe one.
There are two main things that need to be described about a distribution: its location and its
shape. Generally, it is best to give a single measure as the description of the location and a
single measure as the description of the shape.

Mean

To describe the location of a distribution, statisticians use a typical value from the
distribution. There are a number of different ways to find the typical value, but by far the
most used is the arithmetic mean, usually simply called the mean. You already know how
to find the arithmetic mean; you are just used to calling it the average. Statisticians use
average more generally — the arithmetic mean is one of a number of different averages.
Look at the formula for the arithmetic mean:
μ = ∑x / N
All you do is add up all of the members of the population, ∑x, and divide by how many
members there are, N. The only trick is to remember that if there is more than one member of
the population with a certain value, to add that value once for every member that has it. To
reflect this, the equation for the mean sometimes is written:

μ = ∑fi(xi) / N

where fi is the frequency of members of the population with the value xi.

This is really the same formula as above. If there are seven members with a value of ten, the
first formula would have you add ten seven times. The second formula simply has you
multiply ten by seven — the same thing as adding together seven tens.
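
The equivalence of the two formulas is easy to check with a few lines of Python (the values are invented for the illustration):

# Seven members with value 10 and three members with value 12
values = [10] * 7 + [12] * 3

# First formula: add up every member, divide by N
mu = sum(values) / len(values)

# Second formula: multiply each value by its frequency first
freq = {10: 7, 12: 3}
mu_weighted = sum(f * x for x, f in freq.items()) / sum(freq.values())

print(mu, mu_weighted)   # both print 10.6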

Other measures of location are the median and the mode. The median is the value of the
member of the population that is in the middle when the members are sorted from smallest to
largest. Half of the members of the population have values higher than the median, and half
have values lower. The median is a better measure of location if there are one or two
members of the population that are a lot larger (or a lot smaller) than all the rest. Such

35
extreme values can make the mean a poor measure of location, while they have little effect on
the median. If there are an odd number of members of the population, there is no problem
finding which member has the median value. If there are an even number of members of the
population, then there is no single member in the middle. In that case, just average together
the values of the two members that share the middle.

The third common measure of location is the mode. If you have arranged the population into
a frequency or relative frequency distribution, the mode is easy to find because it is the value
that occurs most often. While in some sense, the mode is really the most typical member of
the population, it is often not very near the middle of the population. You can also have
multiple modes. I am sure you have heard someone say that “it was a bimodal distribution”.
That simply means that there were two modes: two values that tied for occurring most often.
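
Python's statistics module handles the odd and even median cases, as well as multiple modes, directly (the lists below are made up for illustration):

import statistics

odd = [3, 9, 4, 7, 5]        # sorted: 3 4 5 7 9, so the middle is 5
even = [3, 9, 4, 7, 5, 6]    # middle two are 5 and 6, averaged to 5.5

print(statistics.median(odd))               # 5
print(statistics.median(even))              # 5.5
print(statistics.mode([7, 8, 8, 9]))        # 8, the most frequent value
print(statistics.multimode([7, 7, 8, 8]))   # [7, 8], a bimodal case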

If you think about it, you should not be surprised to learn that for bell-shaped distributions,
the mean, median, and mode will be equal. Most of what statisticians do when describing or
inferring the location of a population is done with the mean. Another thing to think about is
using a spreadsheet program, like Microsoft Excel, when arranging data into a frequency
distribution or when finding the median or mode. By using the sort and distribution
commands in 1-2-3, or similar commands in Excel, data can quickly be arranged in order or
placed into value classes and the number in each class found. Excel also has a function,
=AVERAGE(…), for finding the arithmetic mean. You can also have the spreadsheet
program draw your frequency or relative frequency distribution.

One of the reasons that the arithmetic mean is the most used measure of location is because
the mean of a sample is an unbiased estimator of the population mean. Because the sample
mean is an unbiased estimator of the population mean, the sample mean is a good way to
make an inference about the population mean. If you have a sample from a population, and
you want to guess what the mean of that population is, you can legitimately guess that the
population mean is equal to the mean of your sample. This is a legitimate way to make this
inference because the mean of all the sample means equals the mean of the population, so if
you used this method many times to infer the population mean, on average you’d be correct.

All of these measures of location can be found for samples as well as populations, using the
same formulas. Generally, μ is used for a population mean, and x̄ is used for a sample mean.
Upper-case N, really a Greek nu, is used for the size of a population, while lower case n is
used for sample size. Though it is not universal, statisticians tend to use the Greek alphabet
for population characteristics and the Roman alphabet for sample characteristics.

Measuring population shape

Measuring the shape of a distribution is more difficult. Location has only one dimension
(“where?”), but shape has a lot of dimensions. We will talk about two, and you will find that
most of the time, only one dimension of shape is measured. The two dimensions of shape
discussed here are the width and symmetry of the distribution. The simplest way to measure
the width is to do just that—the range is the distance between the lowest and highest
members of the population. The range is obviously affected by one or two population
members that are much higher or lower than all the rest.

The most common measures of distribution width are the standard deviation and the variance.
The standard deviation is simply the square root of the variance, so if you know one (and
have a calculator that does squares and square roots) you know the other. The standard

deviation is just a strange measure of the mean distance between the members of a population
and the mean of the population. This is easiest to see if you start out by looking at the formula
for the variance:
σ² = ∑(x − μ)² / N
Look at the numerator. To find the variance, the first step (after you have the mean, μ) is to
take each member of the population, and find the difference between its value and the mean;
you should have N differences. Square each of those, and add them together, dividing the sum
by N, the number of members of the population. Since you find the mean of a group of things
by adding them together and then dividing by the number in the group, the variance is
simply the mean of the squared distances between members of the population and the
population mean.
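
A direct translation of the formula into Python (the population below is invented for the example):

x = [2, 4, 4, 4, 5, 5, 7, 9]   # a small, made-up population
N = len(x)
mu = sum(x) / N                # 5.0

# Variance: the mean of the squared distances from the mean
variance = sum((xi - mu) ** 2 for xi in x) / N
print(variance, variance ** 0.5)   # 4.0, and sigma = 2.0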

Notice that this is the formula for a population characteristic, so we use the Greek σ, and we
write the variance as σ², or sigma squared. Because the standard deviation is simply the
square root of the variance, its symbol is simply sigma, σ.

One of the things statisticians have discovered is that at least 75 per cent of the members of
any population are within two standard deviations of the mean of the population. This is
known as Chebyshev’s theorem. If the mean of a population of shoe sizes is 9.6 and the
standard deviation is 1.1, then at least 75 per cent of the shoe sizes are between 7.4 (two
standard deviations below the mean) and 11.8 (two standard deviations above the mean). The
same theorem can be stated in probability terms: the probability that a member of a
population is within two standard deviations of the population mean is at least .75.
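
The shoe-size arithmetic takes only a couple of lines to check:

mu, sigma = 9.6, 1.1
# Chebyshev: at least 75% of the population lies within these bounds
print(round(mu - 2 * sigma, 1), round(mu + 2 * sigma, 1))   # 7.4 11.8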

It is important to be careful when dealing with variances and standard deviations. In later
chapters, there are formulas using the variance, and formulas using the standard deviation. Be
sure you know which one you are supposed to be using. Here again, spreadsheet programs
will figure out the standard deviation for you. In Excel, there is a function, =STDEVP(…),
that does all of the arithmetic. Most calculators will also compute the standard deviation.
Read the little instruction booklet, and find out how to have your calculator do the numbers
before you do any homework or have a test.

The other measure of shape we will discuss here is the measure of skewness. Skewness is
simply a measure of whether or not the distribution is symmetric or if it has a long tail on one
side, but not the other. There are a number of ways to measure skewness, with many of the
measures based on a formula much like the variance. The formula looks a lot like that for the
variance, except the distances between the members and the population mean are cubed,
rather than squared, before they are added together:
sk = ∑(x − μ)³ / N
At first, it might not seem that cubing rather than squaring those distances would make much
difference. Remember, however, that when you square either a positive or negative number,
you get a positive number, but when you cube a positive, you get a positive and when you
cube a negative you get a negative. Also remember that when you square a number larger
than one, it gets larger, but when you cube it, it gets a whole lot larger. Think about a distribution
with a long tail out to the left. There are a few members of that population much smaller than
the mean, members for which (x – μ) is large and negative. When these are cubed, you end up
with some really big negative numbers. Because there are no members with such large,
positive (x − μ), there are no corresponding really big positive numbers to add in when you
sum up the (x − μ)³, and the sum will be negative. A negative measure of skewness means that
there is a tail out to the left, a positive measure means a tail to the right. Take a minute and
convince yourself that if the distribution is symmetric, with equal tails on the left and right,
the measure of skew is zero.
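
Here is a minimal sketch of this skewness measure in Python. Note that it implements the unstandardized formula above; many texts also divide by the cube of the standard deviation, which this sketch deliberately does not do:

def skew(x):
    # Mean of the cubed distances from the mean, as in the formula above
    mu = sum(x) / len(x)
    return sum((xi - mu) ** 3 for xi in x) / len(x)

print(skew([1, 2, 3, 4, 5]))      # 0.0 (symmetric)
print(skew([1, 2, 3, 4, 15]))     # positive (long tail to the right)
print(skew([-15, 2, 3, 4, 5]))    # negative (long tail to the left)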

To be really complete, there is one more thing to measure: kurtosis, or peakedness. As you
might expect by now, it is measured by taking the distances between the members and the
mean and raising them to the fourth power before averaging them together.

Measuring sample shape

Measuring the location of a sample is done in exactly the way that the location of a
population is done. However, measuring the shape of a sample is done a little differently than
measuring the shape of a population. The reason behind the difference is the desire to have
the sample measurement serve as an unbiased estimator of the population measurement. If we
took all of the possible samples of a certain size, n, from a population and found the variance
of each one, and then found the mean of those sample variances, that mean would be a little
smaller than the variance of the population.
You can see why this is so if you think it through. If you knew the population mean, you
could find ∑(x − μ)²/n for each sample, and have an unbiased estimate for σ².
However, you do not know the population mean, so you will have to infer it. The best way to
infer the population mean is to use the sample mean x̄. The variance of a sample will then be
found by averaging together all of the ∑(x − x̄)²/n.
The mean of a sample is obviously determined by where the members of that sample lie. If
you have a sample that is mostly from the high (or right) side of a population’s distribution,
then the sample mean will almost surely be greater than the population mean. For such a
sample, ∑(x − x̄)²/n would underestimate σ². The same is true for samples
that are mostly from the low (or left) side of the population. If you think about what kind of
samples will have ∑(x − x̄)²/n that is greater than the population σ², you
will come to the realization that it is only those samples with a few very high members and a
few very low members — and there are not very many samples like that. By now you should
have convinced yourself that ∑(x − x̄)²/n will result in a biased estimate
of σ². You can see that, on average, it is too small.
How can an unbiased estimate of the population variance, σ², be found?
If ∑(x − x̄)²/n is on average too small, we need to do something to make it
a little bigger. We want to keep the ∑(x − x̄)², but if we divide it by
something a little smaller, the result will be a little larger. Statisticians have found out that the
following way to compute the sample variance results in an unbiased estimator of the
population variance:

s² = ∑(x − x̄)² / (n − 1)
If we took all of the possible samples of some size, n, from a population, and found the
sample variance for each of those samples, using this formula, the mean of those sample
variances would equal the population variance, σ².

Note that we use s² instead of σ², and n instead of N (really nu, not en) since this is for a
sample and we want to use the Roman letters rather than the Greek letters, which are used for
populations.
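
Python's statistics module keeps the two formulas separate, which makes the distinction easy to see (the sample values are made up):

import statistics

sample = [6, 7, 7, 8, 8, 8, 9, 10]

print(statistics.variance(sample))    # divides by n - 1: about 1.55
print(statistics.pvariance(sample))   # divides by N: about 1.36
print(statistics.stdev(sample))       # square root of the n - 1 version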

There is another way to see why you divide by n-1. We also have to address something
called degrees of freedom before too long, and the degrees of freedom are the key in the
other explanation. As we go through this explanation, you should be able to see that the two
explanations are related.
Imagine that you have a sample with 10 members, n=10, and you want to use it to estimate
the variance of the population from which it was drawn. You write each of the 10 values on a
separate scrap of paper. If you knew the population mean, you could start by computing all
10 of the (x − μ)². However, in the usual case, you do not know μ, and you must start by
finding x̄ from the values on the 10 scraps to use as an estimate of μ. Once you have found x̄,
you could lose any one of the 10 scraps and still be able to find the value that was on the lost
scrap from the other 9 scraps. If you are going to use x̄ in the formula for sample variance,
only 9 (or n − 1) of the x’s are free to take on any value. Because only n − 1 of the x’s can vary
freely, you should divide ∑(x − x̄)² by n − 1, the number of x’s that are
really free. Once you use x̄ in the formula for sample variance, you use up one degree of
freedom, leaving only n − 1. Generally, whenever you use something you have previously
computed from a sample within a formula, you use up a degree of freedom.
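
The lost-scrap argument can be acted out in a few lines of Python (the sample is hypothetical):

sample = [4, 7, 7, 8, 9, 9, 10, 10, 12, 14]   # n = 10
xbar = sum(sample) / len(sample)              # 9.0

# Lose the last scrap; the mean pins down what it must have been
lost = len(sample) * xbar - sum(sample[:9])
print(lost)   # 14.0, so only n - 1 of the values were free to vary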

A little thought will link the two explanations. The first explanation is based on the idea
that x̄, the estimator of μ, varies with the sample. It is because x̄ varies with the sample that a
degree of freedom is used up in the second explanation.

The sample standard deviation is found simply by taking the square root of the sample
variance:
s = √[ ∑(x − x̄)² / (n − 1) ]
While the sample variance is an unbiased estimator of population variance, the sample
standard deviation is not an unbiased estimator of the population standard deviation — the
square root of the average is not the same as the average of the square roots. This causes
statisticians to use variance where it seems as though they are trying to get at standard
deviation. In general, statisticians tend to use variance more than standard deviation. Be
careful with formulas using sample variance and standard deviation in the following chapters.
Make sure you are using the right one. Also note that many calculators will find standard
deviation using both the population and sample formulas. Some use σ and s to show the
difference between the population and sample formulas; others use σn and σn−1 to show the
difference.

If Ann wanted to infer what the population distribution of Chargers players’ sock sizes
looked like, she could do so from her sample. If she is going to send coaches
packages of socks for the players to try, she will want the packages to contain an
assortment of sizes that will allow each player to have a pair that fits. Ann wants to infer what
the distribution of the players’ sock sizes looks like. She wants to know the mean and
variance of that distribution. Her data, again, are shown in Table 1.1.

Table 1.1 Ann’s Data

Size Frequency

6 3

7 24

8 33

9 20

10 17

The mean sock size can be found:

x̄ = (3×6 + 24×7 + 33×8 + 20×9 + 17×10) / 97 ≈ 8.25
To find the sample standard deviation, Ann decides to use Excel. She lists the sock sizes that
were in the sample in column A (see Table 1.2) and the frequency of each of those sizes in
column B. In column C, she has the computer find (x − x̄)² for each sock size, using the
formula =(A1-8.25)^2 in the first row and then copying it down to the other four rows. In D1,
she multiplies C1 by the frequency, using the formula =B1*C1, and copies it down into the
other rows. Finally, she finds the sample variance by adding up the five numbers in column D
and dividing by n − 1 = 96, using the Excel formula =SUM(D1:D5)/96; the sample standard
deviation is the square root of that result. The spreadsheet appears like this when she is done:

Table 1.2 Sock Sizes

      A     B        C        D
1     6     3      5.06     15.19
2     7    24      1.56     37.50
3     8    33      0.06      2.06
4     9    20      0.56     11.25
5    10    17      3.06     52.06
6        n = 97             Var ≈ 1.23
7                           Std.dev ≈ 1.11

Ann now has an estimate of the variance of the sizes of socks worn by basketball
and volleyball players, about 1.23. She has inferred that the population of Chargers players’ sock
sizes has a mean of 8.25 and a variance of about 1.23.
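
Ann's arithmetic can also be checked in a few lines of Python (an illustrative sketch, not part of her spreadsheet; note the division by n − 1 = 96, as in her Excel formula):

freq = {6: 3, 7: 24, 8: 33, 9: 20, 10: 17}
n = sum(freq.values())                                 # 97
xbar = sum(size * f for size, f in freq.items()) / n   # about 8.25

# Sum of squared deviations, weighted by frequency, then divide by n - 1
ss = sum(f * (size - xbar) ** 2 for size, f in freq.items())
var = ss / (n - 1)
print(round(xbar, 2), round(var, 2), round(var ** 0.5, 2))   # 8.25 1.23 1.11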

Ann’s collected data can simply be added to the following Excel template. The calculations
of both variance and standard deviation have been shown below. You can change her
numbers to see how these two measures change.

Figure 1.2 Interactive Excel Template to Calculate Variance and Standard Deviation – see
Appendix 1.

Summary

To describe a population you need to describe the picture or graph of its distribution. The two
things that need to be described about the distribution are its location and its shape. Location
is measured by an average, most often the arithmetic mean. The most important measure of
shape is a measure of dispersion (roughly, width), most often the variance or its square root,
the standard deviation.

Samples need to be described, too. If all we wanted to do with sample descriptions was
describe the sample, we could use exactly the same measures for sample location and
dispersion that are used for populations. However, we want to use the sample describers for
dual purposes: (a) to describe the sample, and (b) to make inferences about the description of
the population that sample came from. Because we want to use them to make inferences, we
want our sample descriptions to be unbiased estimators. Our desire to measure sample
dispersion with an unbiased estimator of population dispersion means that the formula we use
for computing sample variance is a little different from the one used for computing
population variance.
