Quantitative Analysis of Categorical Variables

Vol. 6, 2021-04

Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org

doi: 10.13140/RG.2.2.19233.33129

Abstract

Categorical or qualitative variables are commonly found in research and data analysis due to
the lack of suitable quantitative measuring systems or simply as a result of the inherently
qualitative nature of the information considered. However, statistical analysis and modelling
tools are based on quantitative variables. Thus, in order to incorporate categorical variables in
the analysis, either as factors or as response variables, they must be quantified in some way. In
this report, different approaches are explained and discussed for transforming categorical
variables into quantitative variables suitable for modelling or statistical analysis, including the
use of binary or “dummy” variables, the implementation of arbitrary scales (for ordered
categorical variables), the definition of fuzzy membership variables, and the determination and
transformation of probability values. Different application examples are presented in order to
illustrate the different approaches for quantifying categorical variables.

Keywords

Arbitrary Scales, Binary variables, Categorical variables, Dummy variables, Fuzzy membership,
Logit transformation, Modelling, Regression, Statistical Analysis.

1. Introduction

Experimental variables are commonly classified into categorical (or qualitative) and numerical
(or quantitative) variables. The main difference between these two types of variables is the
nature of the information contained by them. In principle, numerical variables are represented
by numbers with a mathematical meaning. That is, the values represented by such numbers are
subject to mathematical relations (equalities and inequalities) and operations (addition,
subtraction, etc.). Anything else is simply a categorical variable, even when it involves numbers.

07/04/2021 ForsChem Research Reports Vol. 6, 2021-04 (1 / 25)


www.forschem.org
Quantitative Analysis of
Categorical Variables
Hugo Hernandez
ForsChem Research
hugo.hernandez@forschem.org

For example, if I count the number of people living in a city, the resulting variable is numerical
or quantitative because it has a mathematical meaning. I can say, for example, that a city with
300,000 inhabitants (City #1) has a larger population than a city with 200,000 inhabitants
(City #2). I can also say that a city with 400,000 inhabitants (City #3) has twice the population
of City #2. However, even though I have denoted each city with a number, that number has no
mathematical meaning. I cannot say that City #2 is larger or bigger in any sense than City #1,
nor can I say that City #3 is the result of adding City #1 and City #2. Thus, the identification
of the city, although expressed with numbers, is a categorical or qualitative variable and not
a numerical one.

Categorical variables are commonly observed in any field of research. It is important, however,
to remark that the categorical nature of a variable is determined by the specific measuring
system employed rather than by the variable itself. For example, if I am studying the color of
light, I can define color as a qualitative variable with different categories including: Blue, green,
red, yellow, etc., or I can measure the wavelength or the frequency of each color, resulting in a
numerical representation of color.

It is now clear that categorical variables exist as a means to describe a variable in the absence of
measurement systems capable of providing a numerical result (with mathematical meaning).
Unfortunately, the straightforward analysis of categorical variables is quite limited due to the
non-mathematical nature of the observations. Amongst the most common measures used in
descriptive statistics, only the mode can be determined for a categorical variable. All other
measures and analyses (correlation, regression, probability distribution functions, etc.) require
numerical data with mathematical meaning.

The purpose of this report is to summarize some of the most common methods that can be
employed for the quantification of any categorical variable, in order to perform a quantitative
analysis. These methods include the introduction of binary (“dummy”) variables, the definition
of arbitrary quantitative scales, and different types of fuzzy quantification. Each one of them is
explained in more detail in the following Sections.

2. Binary or “Dummy” Variables

The first approach considered for transforming categorical variables into numerical variables is
the use of binary, Boolean, or “dummy” variables. These binary variables can only take two
values: 0 or 1. They are considered Boolean variables since 0 may represent a False statement,
while 1 represents a True statement. These variables are also denoted as “dummy” variables
considering that they are artificial and fictitious imitations of the original qualitative variable.

The transformation of a categorical variable into binary variables is relatively simple. First of all,
it is possible to create one binary variable for each different category in the qualitative variable.
Then, for each observation the binary variable of a particular category will have a value of 1 if
the observation belongs to that category, or a value of 0 if not. As an example, Table 1 shows
the transformation of a categorical color variable into different binary variables.

While each category of the qualitative variable has its own binary variable, the binary
variables are not all independent. If a qualitative variable has 𝑛 different possible
categories, and we know 𝑛 − 1 of the binary variables, then the last binary variable is already
known (or inferred), that is, it is dependent on all the other binary variables.

Table 1. Example of the transformation of a categorical variable into binary variables


Observation Categorical variable Binary Variables
# Color Black Blue Green Red White
1 Blue 0 1 0 0 0
2 White 0 0 0 0 1
3 Black 1 0 0 0 0
4 Blue 0 1 0 0 0
5 Green 0 0 1 0 0
6 Green 0 0 1 0 0
7 Black 1 0 0 0 0
8 Red 0 0 0 1 0
9 Blue 0 1 0 0 0
10 Red 0 0 0 1 0
11 Red 0 0 0 1 0
12 Blue 0 1 0 0 0
13 Red 0 0 0 1 0
14 White 0 0 0 0 1
15 Green 0 0 1 0 0

This dependence can be mathematically expressed as follows:


$$\sum_{i=1}^{n} b_i = 1, \qquad b_i \in \{0, 1\} \tag{2.1}$$

where 𝑏𝑖 represents the 𝑖-th binary variable. Eq. (2.1) simply indicates that any observation of
the qualitative variable must be exactly one of the categories available. Eq. (2.1) can also be
arranged as follows:

$$b_j = 1 - \sum_{i \neq j} b_i \tag{2.2}$$

clearly showing the linear dependence of any 𝑗-th binary variable on all other binary variables.
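
As an illustration of Eq. (2.1) and (2.2), the following minimal Python sketch encodes the color observations of Table 1 as binary variables and checks both relations. The function and variable names (`to_binary`, `binary_rows`, etc.) are illustrative choices, not part of the original report.

```python
# The 15 color observations listed in Table 1.
colors = ["Blue", "White", "Black", "Blue", "Green", "Green", "Black", "Red",
          "Blue", "Red", "Red", "Blue", "Red", "White", "Green"]

# One binary variable per category, in alphabetical order as in Table 1.
categories = sorted(set(colors))

def to_binary(observation, categories):
    """Return one 0/1 value per category for a single observation."""
    return [1 if observation == c else 0 for c in categories]

binary_rows = [to_binary(obs, categories) for obs in colors]

# Eq. (2.1): the binary variables of every observation add up to 1.
row_sums = [sum(row) for row in binary_rows]

# Eq. (2.2): the last binary variable (White) follows from the other four.
recovered_white = [1 - sum(row[:-1]) for row in binary_rows]
```

Because the last column can always be recovered from the others, only 𝑛 − 1 of the 𝑛 binary variables carry independent information.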

Binary variables have another curious property:

$$b_i = b_i^k, \qquad i = 1, \dots, n, \quad k > 0 \tag{2.3}$$

Thus, any positive power 𝑘 of a binary variable is confounded with the variable itself. Higher-order
terms in a binary variable therefore add no new information, which makes binary variables in
general unsuitable for non-linear models.

It is also possible to consider interactions between binary variables. If two binary variables are
obtained from the same categorical variable, both of them cannot have a value of 1 at the
same time (since they are mutually exclusive), and therefore their interaction will always be:

$$b_i \cdot b_j = 0, \qquad i \neq j \tag{2.4}$$

This, of course, is not useful for any quantitative analysis.

On the other hand, the interaction between two different categorical variables results in non-
zero vectors, but they will show linear dependence. Let us simply consider two categorical
variables (𝐶𝐴 , 𝐶𝐵 ), each with only two different categories, represented by binary variables
(𝑏𝐴,1 , 𝑏𝐴,2 , 𝑏𝐵,1 , 𝑏𝐵,2 ). Four different interactions are possible (discarding the interactions
between categories of the same variable): 𝑏𝐴,1 ∙ 𝑏𝐵,1 , 𝑏𝐴,1 ∙ 𝑏𝐵,2 , 𝑏𝐴,2 ∙ 𝑏𝐵,1 , and 𝑏𝐴,2 ∙ 𝑏𝐵,2 . Now,
from Eq. (2.2) we obtain:

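$$b_{A,2} = 1 - b_{A,1} \tag{2.5}$$

$$b_{B,2} = 1 - b_{B,1} \tag{2.6}$$

and therefore the interactions become:

$$b_{A,1} \cdot b_{B,2} = b_{A,1} \cdot (1 - b_{B,1}) = b_{A,1} - b_{A,1} \cdot b_{B,1} \tag{2.7}$$

$$b_{A,2} \cdot b_{B,1} = (1 - b_{A,1}) \cdot b_{B,1} = b_{B,1} - b_{A,1} \cdot b_{B,1} \tag{2.8}$$

$$b_{A,2} \cdot b_{B,2} = (1 - b_{A,1}) \cdot (1 - b_{B,1}) = 1 - b_{A,1} - b_{B,1} + b_{A,1} \cdot b_{B,1} \tag{2.9}$$

Thus,

$$b_{A,1} \cdot b_{B,1} + b_{A,1} \cdot b_{B,2} + b_{A,2} \cdot b_{B,1} + b_{A,2} \cdot b_{B,2} = 1 \tag{2.10}$$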
Or, in general:

$$\sum_{i=1}^{n_A} \sum_{j=1}^{n_B} b_{A,i} \cdot b_{B,j} = 1 \tag{2.11}$$

where 𝑛𝐴 is the number of different categories in variable 𝐶𝐴 and 𝑛𝐵 is the number of different
categories in variable 𝐶𝐵 .
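
The relations in Eq. (2.4) and (2.11) can be checked numerically. The sketch below uses a small set of hypothetical observations of two categorical variables (the category labels `a1`, `a2`, `b1`, `b2` and the data are made up for illustration only).

```python
from itertools import product

# Hypothetical paired observations of two categorical variables C_A and C_B.
obs_A = ["a1", "a2", "a1", "a2", "a1"]
obs_B = ["b1", "b1", "b2", "b2", "b2"]
cats_A, cats_B = ["a1", "a2"], ["b1", "b2"]

def indicator(value, category):
    """Binary variable: 1 if the observation belongs to the category."""
    return 1 if value == category else 0

interaction_sums = []
same_variable_products = []
for a, b in zip(obs_A, obs_B):
    # All n_A * n_B interaction terms between the two variables.
    terms = [indicator(a, ca) * indicator(b, cb)
             for ca, cb in product(cats_A, cats_B)]
    interaction_sums.append(sum(terms))  # Eq. (2.11): always equal to 1
    # Interaction between two categories of the same variable, Eq. (2.4).
    same_variable_products.append(indicator(a, "a1") * indicator(a, "a2"))
```

For every observation exactly one of the four cross-variable interaction terms equals 1, so including all of them together with an intercept in a linear model would produce linearly dependent columns.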

Higher order interactions involving more than two categorical variables are also possible, and
they will also show linear dependence between all possible terms. This can be expressed
mathematically for 𝑁 different categorical variables as follows:
$$\sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_N=1}^{n_N} \prod_{j=1}^{N} b_{j,i_j} = 1 \tag{2.12}$$

It is also possible to construct binary variables by grouping categories, instead of using
individual categories. The binary variable of a group of categories from a single categorical
variable can be determined as follows:

$$b_{i,j,k,\dots,s} = b_i + b_j + b_k + \cdots + b_s \tag{2.13}$$

where 𝑖, 𝑗, 𝑘, … , 𝑠 are arbitrary categories of the qualitative variable.

Binary variables are not limited exclusively to categorical variables. It is always possible to
transform discrete numerical variables into binary variables following the same idea behind
categorical variables. Each value of the discrete variable is then represented by a binary
variable, as follows:
$$b_{X=x} = \begin{cases} 0, & X \neq x \\ 1, & X = x \end{cases} \tag{2.14}$$

Also, groups or intervals of discrete values can be represented by binary variables. The same
idea can also be extrapolated to continuous numerical variables. By designating intervals as
categories, each continuous interval can be represented by a binary variable according to:

$$b_{x_L \le X \le x_U} = \begin{cases} 0, & X < x_L \lor X > x_U \\ 1, & x_L \le X \le x_U \end{cases} \tag{2.15}$$

In this case, a closed interval was considered. However, open or semi-open intervals can also be
used. If a set of orthogonal binary variables is required, the intervals considered must be
mutually exclusive. If the intervals are also exhaustive, the linear dependence between binary
variables (Eq. 2.1) will also be valid.
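
The interval-based construction can be sketched as follows. The interval limits and sample values are hypothetical, and semi-open intervals (lower, upper] are assumed here (instead of the closed interval of Eq. 2.15) so that the categories are mutually exclusive and exhaustive over (0, 4].

```python
# Hypothetical semi-open intervals (lower, upper] covering (0, 4].
intervals = [(0, 1), (1, 2), (2, 3), (3, 4)]

def interval_binaries(x, intervals):
    """One binary variable per interval: 1 if lower < x <= upper, else 0."""
    return [1 if lower < x <= upper else 0 for lower, upper in intervals]

values = [0.4, 1.0, 1.95, 2.0, 3.7]  # hypothetical continuous observations
encoded = [interval_binaries(x, intervals) for x in values]

# With exclusive and exhaustive intervals, Eq. (2.1) holds for every value.
sums = [sum(row) for row in encoded]
```

With closed intervals sharing a limit, a boundary value such as 1.0 would belong to two categories at once and the binary variables would no longer be orthogonal.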

Binary variables and their interactions have been successfully used in linear regression models
[1-3]. However, the linear dependence and confounding between binary variables should
always be considered when selecting the model terms.

3. Arbitrary Scales

Qualitative variables can also be represented using arbitrary numerical scales. This approach is
more difficult to implement than using binary variables because it requires a reasonable
assignment of values to each category, following a certain mathematical sense of order or
“intensity” for each category. For example, if we consider the thermal sensation experienced
by different individuals, different arbitrary scales can be proposed as illustrated in Table 2. The
first scale denotes the degree of “hotness” starting from the lowest (1) to the highest (7). The
second scale is the opposite, indicating the degree of “coldness” of each sensation. The third
scale is a widely known numerical subjective scale of the degree of warmth, known as the
ASHRAE scale [4]. The last two scales indicate possible quantitative representations of the
categories using representative values in the Celsius scale of temperature. In one case, the
Celsius temperature values are equally spaced between adjacent categories, and in the other,
they are unequally spaced.

Table 2. Example of the transformation of a categorical variable into arbitrary numerical scales
Thermal "Hotness" "Coldness" ASHRAE Celsius Celsius
sensation scale scale scale scale I scale II
Cold 1 7 -3 10 0
Cool 2 6 -2 15 10
Slightly cold 3 5 -1 20 20
Neutral 4 4 0 25 25
Slightly warm 5 3 1 30 30
Warm 6 2 2 35 40
Hot 7 1 3 40 50

There are several difficulties when dealing with arbitrary scales. First, there is an unlimited
number of possible scales to be used. Second, not all qualitative variables have categories that
can be ordered following an “intensity” concept. For example, if we consider the variable
animals in a farm we may have the following categories: cows, sheep, chickens, horses, etc.
How can we define a numerical scale in this case? We might consider ordering the animals
according to their average mass, average size, etc., but there is no such thing as an
“animality” index determining which of those species is more “animal” than the others. Third,
consecutive categories are not necessarily equally spaced in “intensity”, as was shown
in the last example of Table 2. It is also not possible to know a priori whether the arbitrary
spacing between consecutive categories is adequate.

The main advantage of an arbitrary scale over binary variables is the fact that there is only one
quantitative variable per categorical variable, independently of the number of categories
considered. For example, if the variable is a response variable, an arbitrary scale is better suited
than a set of binary variables. However, when the number of quantitative variables is not an
issue, the binary approximation should be preferred.
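
A short sketch may help to show how the choice of scale influences the result. The two scales below are copied from Table 2, whereas the list of responses is hypothetical and the names `ASHRAE`, `CELSIUS_II` and `mean_score` are illustrative.

```python
# Two of the arbitrary scales of Table 2.
ASHRAE = {"Cold": -3, "Cool": -2, "Slightly cold": -1, "Neutral": 0,
          "Slightly warm": 1, "Warm": 2, "Hot": 3}
CELSIUS_II = {"Cold": 0, "Cool": 10, "Slightly cold": 20, "Neutral": 25,
              "Slightly warm": 30, "Warm": 40, "Hot": 50}

responses = ["Cool", "Neutral", "Warm", "Hot"]  # hypothetical survey answers

def mean_score(responses, scale):
    """Average of the numerical values assigned to the responses."""
    return sum(scale[r] for r in responses) / len(responses)

mean_ashrae = mean_score(responses, ASHRAE)          # (-2 + 0 + 2 + 3) / 4
mean_celsius = mean_score(responses, CELSIUS_II)     # (10 + 25 + 40 + 50) / 4
```

For these responses the mean is 0.75 on the ASHRAE scale, just below “Slightly warm” (1), but 31.25 on the Celsius scale II, just above “Slightly warm” (30). The same observations thus lead to slightly different summaries depending on the arbitrary spacing chosen.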

4. Fuzzy Categories

One of the main difficulties observed when dealing with qualitative variables is the fuzziness of
the categories available. In some cases, the categories might seem clearly different, but the
assignment of a particular value may depend on subjective criteria specific to the observer, or
on the conditions of observation. For example, considering the qualitative variable presented
in Table 1, different category assignments are possible if the observer is color blind,
or if the object is observed using light with a different wavelength. Also, the categories might
be ambiguously defined for certain cases. For example, a turquoise color might be considered
as “blue” by certain observers but also as “green” by others (unless an unambiguous and clear
definition of the categories is provided to the observers). Such fuzziness in the determination
of the category may also occur in the example of Table 2. Some people might consider the
environment “slightly cold” while other people would respond to the same environment as
“cool” and others as “neutral”. The fuzziness in the quantification of categorical variables leads
to increased uncertainty. It is possible to incorporate such fuzziness during the quantification
of the variable in different ways, some of which will be explained in this Section.

4.1. Membership Values

One possible approach is using membership values or membership functions [5]. The
membership value (𝑚) is closely related to a probability value. It represents the fraction or
probability that a certain element belongs to a particular category. Thus, any observed element
will have a membership value (𝑚𝑖 ) for each category 𝑖 of the observed variable, which can be
inferred from a sample of observations. Membership values (or membership variables) must
follow the rule:
$$\sum_{i=1}^{n} m_i = 1, \qquad 0 \le m_i \le 1 \tag{4.1}$$
Or equivalently,

$$m_j = 1 - \sum_{i \neq j} m_i \tag{4.2}$$

Notice the strong similarities between Eq. (2.1) and (4.1), and between Eq. (2.2) and (4.2). One
may argue that binary variables are a particular case of the more general membership
variables. The definition of membership values can be done based on samples of data from
different observers, or can be arbitrarily defined following certain rules or functions. For
example in the case of colors, turquoise might be defined as 0.5 blue and 0.5 green. Such
arbitrary definition has similar disadvantages as in the case of arbitrary scales. However, it
might provide a better quantification of categorical variables than simple binary variables.
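
As a minimal sketch of how membership values might be inferred from a sample of observers, the code below assumes ten hypothetical answers describing the same turquoise object and estimates each membership value as the fraction of observers choosing that category.

```python
from collections import Counter

# Hypothetical answers from ten observers looking at the same turquoise object.
answers = ["Blue", "Green", "Blue", "Green", "Green",
           "Blue", "Blue", "Green", "Blue", "Green"]
categories = ["Black", "Blue", "Green", "Red", "White"]

counts = Counter(answers)  # categories never chosen count as 0

# Membership value of each category: fraction of observers choosing it.
membership = {c: counts[c] / len(answers) for c in categories}

total = sum(membership.values())  # Eq. (4.1): must add up to 1
```

In this made-up sample the object is 0.5 “Blue” and 0.5 “Green”, matching the arbitrary definition of turquoise mentioned above. If all observers agreed, the membership values would reduce to ordinary binary variables.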

4.2. Probability Values

Transformation of intervals of numerical variables into membership variables instead of binary
variables is also possible. This situation must be considered when dealing with uncertainty in
the observation of the numerical variable. For example, let us assume that we want to split a
certain quantitative variable into the following intervals: (0,1], (1,2], (2,3], (3,4], etc. Let us also
assume that the variable is measured with normally distributed white noise with a standard
deviation of 0.1. Then, after measuring the variable, a value of 1.95 is obtained. While it seems
logical to assign this value to the interval (1,2], it is also possible that the true value belongs to
(2,3] but the measurement drifted due to the random error. Thus, it would be possible to assign
a probability value (𝑃) to each interval, indicating the probability that the true value belongs to
that interval. This calculation is summarized in Table 3. As can be seen, the 𝑃-values represent
the values of the membership variable for each interval (category). Thus, there is only a 69.15%
chance that the result belongs to interval (1,2], and a 30.85% chance that it belongs to (2,3].

Table 3. Calculation of 𝑃-values in the transformation of numerical intervals into membership
variables. 𝑍 represents the standard normal random variable.

Interval   Lower limit (LL)   Upper limit (UL)   P(Z ≤ (LL − 1.95)/0.1)   P(Z ≤ (UL − 1.95)/0.1)   P-value (Difference)
(0, 1]     0                  1                  5.48912 × 10^−85         1.04945 × 10^−21         1.04945 × 10^−21
(1, 2]     1                  2                  1.04945 × 10^−21         0.691462461              0.691462461
(2, 3]     2                  3                  0.691462461              1                        0.308537539
(3, 4]     3                  4                  1                        1                        0
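
The calculation of Table 3 can be reproduced with a few lines of Python. This is only a sketch assuming a normally distributed measurement error; the function names are illustrative, and the normal distribution function is written with the complementary error function so that very small tail probabilities are not lost to rounding.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def interval_memberships(x_obs, sigma, intervals):
    """P-value of each interval (LL, UL] given an observed value x_obs."""
    return {
        (ll, ul): normal_cdf((ul - x_obs) / sigma) - normal_cdf((ll - x_obs) / sigma)
        for ll, ul in intervals
    }

# Observed value 1.95 with a standard deviation of 0.1, as in Table 3.
m = interval_memberships(1.95, 0.1, [(0, 1), (1, 2), (2, 3), (3, 4)])
```

The resulting dictionary `m` holds one membership value per interval, with practically all of the probability shared between (1, 2] and (2, 3].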

In general, we can say that:


$$m_i = P(X \le UL_i) - P(X \le LL_i) \tag{4.3}$$

where 𝑋 represents the numerical variable, and 𝑈𝐿𝑖 and 𝐿𝐿𝑖 represent the upper and lower
limits of interval 𝑖, respectively.

When the categorical variable is a response, the probability value itself can be used to quantify
the variable.

4.3. Bounded Functions

When dealing with probabilities, the use of linear and polynomial functions may result in
inconsistent probability values because probabilities are naturally bounded between 0 and 1
whereas linear and low-order polynomial functions are unbounded. Thus, bounded functions
that are consistent with the natural limits of probabilities are preferred for describing
probability functions. Two representative examples are the logistic (logit) function and the
inverse standard normal probability (probit) function [6]. The logistic transformation is the
following:

$$L(P) = \log\left(\frac{P}{1 - P}\right) \tag{4.4}$$

or equivalently,

$$P(L) = \frac{1}{1 + e^{-L}} \tag{4.5}$$

where 𝐿 ∈ ℝ, and log is the natural logarithm.

The probit transformation is the following:

$$Z(P) = \phi_{Normal}^{-1}(P) = \sqrt{2}\,\operatorname{erf}^{-1}(2P - 1) \tag{4.6}$$

or equivalently,

$$P(Z) = \frac{1 + \operatorname{erf}\left(\frac{Z}{\sqrt{2}}\right)}{2} \tag{4.7}$$

where 𝑍 ∈ ℝ, erf is the error function and erf⁻¹ is the inverse error function. The logit and
probit transformations are illustrated in Figure 1.
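
Both transformations and their inverses are available through the Python standard library. The sketch below follows Eq. (4.4) to (4.7); the function names are illustrative, and the probit is obtained from `statistics.NormalDist` rather than from an explicit inverse error function.

```python
import math
from statistics import NormalDist

def logit(p):
    """Eq. (4.4): maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(l):
    """Eq. (4.5): maps a real number back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-l))

def probit(p):
    """Eq. (4.6): inverse of the standard normal cumulative probability."""
    return NormalDist().inv_cdf(p)

def inverse_probit(z):
    """Eq. (4.7): standard normal cumulative probability via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A model fitted on the transformed scale (𝐿 or 𝑍) can take any real value, while the back-transformed prediction always remains between 0 and 1.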

Figure 1. Non-linear bounded-function transformations of probability values. Solid blue line:
Logit transformation. Dashed red line: Probit transformation.

5. Application Examples

5.1. Example 1: Regression using Binary Variables

Binary variables are commonly used to incorporate the effect of categorical variables in
regression models. Let us recall that binary variables contribute only linear effects to the
regression model. Cohen [2] reports an example of linear regression using binary variables. In
this example, the grain size and composition of different samples of surface sediments are
analyzed. Particularly, grain size is determined using a set of three sieves, resulting in the
following size intervals: < 0.063 𝑚𝑚 (small size), 0.063 − 0.125 𝑚𝑚 (medium size), and
0.125 – 0.250 𝑚𝑚 (large size). In addition, the concentration of lead (ppm) and iron (%) in each
sample was quantified. The results extracted from [2] are summarized in Table 4. The binary
variables representing the categorical variable “Grain Size” (one binary variable for each
category) are also included.

A simple linear regression analysis expressing the lead contents in terms of the iron contents
results in the following expression:

$$\mathrm{Pb}\,(\mathrm{ppm}) = 11.78 + 21.57 \cdot \mathrm{Fe}(\%) + 7.56 \cdot \Xi, \qquad R^2_{adj} = 0.745 \tag{5.1}$$

where Ξ represents a standard random error, and $R^2_{adj}$ is the adjusted coefficient of
determination.

Table 4. Surface sediment analysis of different samples [2]

Fe (%) Pb (ppm) Grain Size $b_{Small}$ $b_{Medium}$ $b_{Large}$
0.72 47 Small 1 0 0
1.31 43 Small 1 0 0
1.88 63 Small 1 0 0
2.15 61 Small 1 0 0
2.33 59 Small 1 0 0
2.27 68 Small 1 0 0
0.63 31 Medium 0 1 0
0.88 24 Medium 0 1 0
1.13 31 Medium 0 1 0
1.40 35 Medium 0 1 0
1.51 39 Medium 0 1 0
1.58 41 Medium 0 1 0
0.25 22 Large 0 0 1
0.57 19 Large 0 0 1
0.80 26 Large 0 0 1
0.97 26 Large 0 0 1
1.21 34 Large 0 0 1
1.50 41 Large 0 0 1

The effect of the grain size can be included by incorporating the binary variables into a multiple
regression model. Since all three binary variables are linearly dependent, one of them will be
selected as a reference category (or a baseline) and the model will include all other variables.
By arbitrarily selecting the large grain size as baseline, the following regression model is
obtained:

$$\mathrm{Pb}\,(\mathrm{ppm}) = 15.78 + 13.83 \cdot \mathrm{Fe}(\%) + 16.47 \cdot b_{Small} + 1.28 \cdot b_{Medium} + 4.45 \cdot \Xi, \qquad R^2_{adj} = 0.9115 \tag{5.2}$$

This model represents a significant improvement with respect to the simple linear model (Eq.
5.1), which is evident due to a lower standard error (coefficient of Ξ) and a higher adjusted
coefficient of determination. The coefficient of the Medium binary variable, however,
presented a standard error of 2.66, larger than the value of the coefficient. Thus, it is not
reliable and can be removed from the model. This means that both large- and medium-sized
grains will be considered as a single reference category. The resulting model now becomes:

$$\mathrm{Pb}\,(\mathrm{ppm}) = 16.12 + 14.13 \cdot \mathrm{Fe}(\%) + 15.62 \cdot b_{Small} + 4.33 \cdot \Xi, \qquad R^2_{adj} = 0.9161 \tag{5.3}$$

which provides an additional improvement with respect to Eq. (5.2). The standard random
variable Ξ presented in Eq. (5.2) and (5.3) can be represented by the standard normal random
variable.
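
For readers who wish to reproduce the fit of Eq. (5.3), the following sketch performs an ordinary least squares regression on the data of Table 4 using only the Python standard library. The helper functions `solve` and `least_squares` are illustrative implementations (normal equations solved by Gaussian elimination) and are not necessarily the procedure used to obtain the reported coefficients.

```python
# Data from Table 4: (Fe %, Pb ppm, grain size).
data = [
    (0.72, 47, "Small"), (1.31, 43, "Small"), (1.88, 63, "Small"),
    (2.15, 61, "Small"), (2.33, 59, "Small"), (2.27, 68, "Small"),
    (0.63, 31, "Medium"), (0.88, 24, "Medium"), (1.13, 31, "Medium"),
    (1.40, 35, "Medium"), (1.51, 39, "Medium"), (1.58, 41, "Medium"),
    (0.25, 22, "Large"), (0.57, 19, "Large"), (0.80, 26, "Large"),
    (0.97, 26, "Large"), (1.21, 34, "Large"), (1.50, 41, "Large"),
]

def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Ordinary least squares through the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

y = [pb for _, pb, _ in data]
# Design matrix: intercept, Fe (%), and b_Small (Medium and Large as baseline).
X = [[1.0, fe, 1.0 if size == "Small" else 0.0] for fe, _, size in data]
beta = least_squares(X, y)
intercept, slope_fe, coef_small = beta

# Residual standard error with n - 3 degrees of freedom (coefficient of Xi).
residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
sse = sum(e * e for e in residuals)
std_error = (sse / (len(y) - 3)) ** 0.5
```

Adding a fourth column for the Medium binary variable to `X` (and adjusting the degrees of freedom) gives the corresponding fit with the large grain size as the only baseline.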

The mathematical interpretation of the binary variable in Eq. (5.3) is the following:

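$$\mathrm{Pb}\,(\mathrm{ppm}) = \begin{cases} 31.74 + 14.13 \cdot \mathrm{Fe}(\%) + 4.33 \cdot \Xi, & \text{Grain size = Small} \\ 16.12 + 14.13 \cdot \mathrm{Fe}(\%) + 4.33 \cdot \Xi, & \text{Grain size = Medium or Large} \end{cases} \tag{5.4}$$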

Figure 2. Surface sediment composition of samples with different grain size [2]. Green: Small
grain size. Red: Medium-Large grain size. Solid lines: Regression model (5.4).

This is graphically presented in Figure 2. Alternatively, it is possible to build individual models
for each group of categories [7]. In this case, we obtain by simple linear regression (using the
same groups of categories as before):

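$$\mathrm{Pb}\,(\mathrm{ppm}) = \begin{cases} 34.34 + 12.66 \cdot \mathrm{Fe}(\%) + 6.06 \cdot \Xi, & \text{Grain size = Small} \\ 14.57 + 15.62 \cdot \mathrm{Fe}(\%) + 3.55 \cdot \Xi, & \text{Grain size = Medium or Large} \end{cases} \tag{5.5}$$
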
Model (5.5) can also be expressed in terms of the binary variable 𝑏𝑆𝑚𝑎𝑙𝑙 as follows:

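$$\begin{aligned} \mathrm{Pb}\,(\mathrm{ppm}) &= 14.57 + 19.77 \cdot b_{Small} + (15.62 - 2.96 \cdot b_{Small}) \cdot \mathrm{Fe}(\%) + (3.55 + 2.51 \cdot b_{Small}) \cdot \Xi \\ &\approx 14.57 + 19.77 \cdot b_{Small} + (15.62 - 2.96 \cdot b_{Small}) \cdot \mathrm{Fe}(\%) + 4.39 \cdot \Xi, \qquad R^2_{adj} = 0.9129 \end{aligned} \tag{5.6}$$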

The first expression corresponds to a heteroscedastic model, where the standard error is also a
function of the binary variable 𝑏𝑆𝑚𝑎𝑙𝑙 . The second expression represents a homoscedastic
approximation. The value of the homoscedastic standard error can be found as follows:

$$\sigma_{homosc} = E(\sigma_{heterosc}) = E(3.55 + 2.51 \cdot b_{Small}) = 3.55 + 2.51 \cdot E(b_{Small}) = 3.55 + \frac{2.51}{3} \approx 4.39 \tag{5.7}$$

In this case, $E(b_{Small}) = \frac{1}{3}$ because each of the three categories has the same number of
observations.

Model (5.5) is graphically presented in Figure 3.

Figure 3. Surface sediment composition of samples with different grain size [2]. Green: Small
grain size. Red: Medium-Large grain size. Solid lines: Regression model (5.5).

While model (5.6) slightly reduces the sum of squared errors (273 vs. 281.8) with respect to
model (5.3), it contains one additional interaction term (𝑏𝑆𝑚𝑎𝑙𝑙 ∙ 𝐹𝑒(%)), so the
standard error increases (4.39 vs. 4.33) and the adjusted coefficient of determination
decreases (0.9129 vs. 0.9161). Following the principle of parsimony, model (5.3) would then be
preferred.

5.2. Example 2: Experimental Optimization using Binary Variables

Oyelade et al. [8] proposed a D-optimal design for maximizing the conversion of biodiesel
under different conditions, using two different catalysts: Calcined snail shell (CSS) and
potassium hydroxide (KOH). The results obtained are summarized in Table 5. Only the binary
variable 𝑏𝐶𝑆𝑆 is included, since 𝑏𝐾𝑂𝐻 is its complement.

A second order model is obtained after minimizing the standard error of the residuals. The best
model obtained is the following:

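$$\begin{aligned} \mathrm{Yield}\,(\%) ={}& 892.74 - 29 \cdot T + 0.2385 \cdot T^2 + 1.215 \cdot t - 0.00583 \cdot t^2 + 41.99 \cdot C - 2.63 \cdot C^2 \\ & - 0.4188 \cdot T \cdot C - 0.05838 \cdot t \cdot C + (1.018 \cdot T + 0.2873 \cdot t - 92.42) \cdot b_{CSS} \\ & + 3.736 \cdot \Xi, \qquad R^2_{adj} = 0.8713 \end{aligned} \tag{5.8}$$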

where 𝑇 is temperature in °𝐶, 𝑡 is time in minutes, and 𝐶 is catalyst concentration in 𝑤/𝑤.

Table 5. D-optimal design of experiments for Biodiesel production [8]

Experiment #   Temperature (°C)   Time (min)   Cat. Conc. (w/w)   Cat. Type   $b_{CSS}$   Biodiesel Yield (%)
1 65 120 2 KOH 0 82
2 55 60 1 KOH 0 88
3 55 120 2 CSS 1 93
4 65 60 3 KOH 0 80
5 65 60 2 CSS 1 73
6 55 120 3 KOH 0 96
7 55 120 1 KOH 0 89
8 65 60 3 KOH 0 75
9 55 60 1 CSS 1 65
10 60 60 3 CSS 1 67
11 60 60 3 CSS 1 64
12 55 90 3 CSS 1 90
13 65 60 1 CSS 1 65
14 55 90 3 CSS 1 86
15 60 90 2 KOH 0 85
16 65 120 3 CSS 1 78
17 55 120 1 CSS 1 87
18 55 60 2 KOH 0 87
19 60 75 1.5 CSS 1 71
20 65 60 1 CSS 1 65
21 65 120 3 CSS 1 89
22 62.5 105 2 CSS 1 88
23 65 120 1 CSS 1 92
24 55 120 3 KOH 0 92

In this case, the model interpretation is the following:

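$$\begin{aligned} \mathrm{Yield}_{KOH}\,(\%) ={}& 892.74 - 29 \cdot T + 0.2385 \cdot T^2 + 1.215 \cdot t - 0.00583 \cdot t^2 + 41.99 \cdot C - 2.63 \cdot C^2 \\ & - 0.4188 \cdot T \cdot C - 0.05838 \cdot t \cdot C + 3.736 \cdot \Xi \end{aligned} \tag{5.9}$$

$$\begin{aligned} \mathrm{Yield}_{CSS}\,(\%) ={}& 800.32 - 27.99 \cdot T + 0.2385 \cdot T^2 + 1.503 \cdot t - 0.00583 \cdot t^2 + 41.99 \cdot C - 2.63 \cdot C^2 \\ & - 0.4188 \cdot T \cdot C - 0.05838 \cdot t \cdot C + 3.736 \cdot \Xi \end{aligned} \tag{5.10}$$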

According to model (5.8), the optimal yield can be obtained when:

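$$\frac{\partial \mathrm{Yield}}{\partial T} = \frac{\partial \mathrm{Yield}}{\partial t} = \frac{\partial \mathrm{Yield}}{\partial C} = 0 \tag{5.11}$$

as long as $\frac{\partial^2 \mathrm{Yield}}{\partial T^2} < 0$, $\frac{\partial^2 \mathrm{Yield}}{\partial t^2} < 0$ and $\frac{\partial^2 \mathrm{Yield}}{\partial C^2} < 0$.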

Condition (5.11) implies:

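$$\begin{bmatrix} 0.477 & 0 & -0.4188 \\ 0 & 0.0117 & 0.05838 \\ 0.4188 & 0.05838 & 5.2521 \end{bmatrix} \cdot \begin{bmatrix} T \\ t \\ C \end{bmatrix} = \begin{bmatrix} 29 - 1.018 \cdot b_{CSS} \\ 1.215 + 0.2873 \cdot b_{CSS} \\ 41.99 \end{bmatrix} \tag{5.12}$$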

Eq. (5.12) provides an expression for the determination of the process conditions as a function
of the binary variable $b_{CSS}$. Unfortunately, $\frac{\partial^2 \mathrm{Yield}}{\partial T^2} > 0$ while $\frac{\partial^2 \mathrm{Yield}}{\partial t^2} < 0$ and $\frac{\partial^2 \mathrm{Yield}}{\partial C^2} < 0$,
indicating that the solution to Eq. (5.12) is not a maximum but a saddle point. In other words,
the maximum will be found at a boundary of the search region (particularly for the process
temperature).

Considering the limits of the experimental design, the optimal conditions for biodiesel
production using KOH as a catalyst are found to be: 𝑇 = 55 °𝐶, 𝑡 = 91.23 𝑚𝑖𝑛., and
𝐶 = 2.59 𝑤/𝑤, with a predicted Biodiesel yield of 99.18%. On the other hand, when CSS is
used as a catalyst, the optimal conditions found are: 𝑇 = 55 °𝐶, 𝑡 = 117.32 𝑚𝑖𝑛., and
𝐶 = 2.30 𝑤/𝑤, with a predicted Biodiesel yield of 92.71%. According to these results, it is
clearly observed that KOH allows obtaining higher Biodiesel yields. These optimal results are
obtained by formulating individual optimization problems for each category (KOH or CSS). The
overall optimum value can either be found by simple comparison, or it can also be obtained
using mixed-integer optimization methods [9].

Please recall that these results are based on a model obtained from a limited sample of
observations. An experimental validation of the optimum prediction is always required.
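
The optimization by category can be sketched as follows. The temperature is assumed fixed at the boundary value reported above (55 °C), and the remaining stationarity conditions for 𝑡 and 𝐶 are solved as a 2 × 2 linear system for each catalyst. The function names are illustrative, and the coefficients are the rounded values printed in Eq. (5.8), so the last digits may differ slightly from the reported optimum.

```python
def yield_model(T, t, C, b_css):
    """Expected biodiesel yield according to Eq. (5.8), without the error term."""
    return (892.74 - 29 * T + 0.2385 * T**2 + 1.215 * t - 0.00583 * t**2
            + 41.99 * C - 2.63 * C**2 - 0.4188 * T * C - 0.05838 * t * C
            + (1.018 * T + 0.2873 * t - 92.42) * b_css)

def optimum_at_fixed_T(T, b_css):
    """Solve dYield/dt = 0 and dYield/dC = 0 for t and C at a fixed T."""
    # Stationarity conditions written as a linear system:
    #   a11 * t + a12 * C = r1
    #   a12 * t + a22 * C = r2
    a11, a12, a22 = 2 * 0.00583, 0.05838, 2 * 2.63
    r1 = 1.215 + 0.2873 * b_css
    r2 = 41.99 - 0.4188 * T
    det = a11 * a22 - a12 * a12
    t = (r1 * a22 - a12 * r2) / det  # Cramer's rule
    C = (a11 * r2 - a12 * r1) / det
    return t, C, yield_model(T, t, C, b_css)

# One optimization problem per category of the binary variable b_CSS.
results = {name: optimum_at_fixed_T(55.0, b) for name, b in (("KOH", 0), ("CSS", 1))}
best_catalyst = max(results, key=lambda name: results[name][2])
```

Comparing the two entries of `results` corresponds to the "simple comparison" approach mentioned above. Before using this shortcut, it should be checked that the computed values of 𝑡 and 𝐶 remain inside the limits of the experimental design.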

5.3. Example 3: Ordered Categorical Variables

Farewell [10] provided data from the Washington State Controlled Substances Therapeutic
Research Program on the subjective intensity of nausea experienced by cancer patients treated
by chemotherapy. The data is summarized in Table 6. The patients were classified depending
on the use of cisplatin as a medication in the corresponding treatment.

Table 6. Nausea severity data [10]

Severity of Nausea   Nausea Severity Arbitrary Scale   No cisplatin   Cisplatin   Total
None                 0                                 43             7           50
Mild                 1                                 39             7           46
Moderate (Low)       2                                 13             3           16
Moderate (Middle)    3                                 22             12          34
Moderate (High)      4                                 15             15          30
Severe               5                                 29             14          43
Total                                                  161            58          219

The proportion of patients experiencing the different degrees of nausea is represented as a
cumulative probability plot in Figure 4. The data is compared with the expected behavior
according to the uniform distribution model. The distribution of all cases closely resembles the
behavior of the uniform model with a random goodness-of-fit [11] ℛ² = 94.36%, whereas the
individual groups appear to deviate from uniformity (ℛ² < 90%).

A χ²-test can be used to check if there is an effect of cisplatin use on nausea severity. Using a
contingency table, it is possible to obtain χ² = 18.18 with a 𝑃-value of 0.0027. Thus, significant
differences are observed between the nausea severities with cisplatin vs. the severities without
cisplatin.
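
The χ²-test above can be reproduced from the counts of Table 6 with the standard Pearson statistic. The sketch below uses only the Python standard library; the closed-form expression for the upper-tail probability is valid only for 5 degrees of freedom, which is the case for this 6 × 2 contingency table.

```python
import math

# Counts from Table 6: (no cisplatin, cisplatin) for each severity level.
observed = [(43, 7), (39, 7), (13, 3), (22, 12), (15, 15), (29, 14)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for row, r_tot in zip(observed, row_totals):
    for obs, c_tot in zip(row, col_totals):
        expected = r_tot * c_tot / n
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (6 - 1) * (2 - 1) = 5

def chi2_upper_tail_5dof(x):
    """P(chi-square > x) for exactly 5 degrees of freedom (closed form)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0))

p_value = chi2_upper_tail_5dof(chi2)
```

For tables of other sizes, a general χ² distribution routine from a statistical library would be needed instead of the closed-form expression used here.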

Now, let us try modelling the probability of severity using the arbitrary scale of nausea severity
(𝑁𝑆) and the binary variable representing the use of cisplatin (𝑏𝐶𝑖𝑠𝑝𝑙𝑎𝑡𝑖𝑛 ). A second order
effect of the arbitrary scale and the interaction between nausea severity and cisplatin use is
considered. The best model (lowest standard deviation in residual error) obtained is:

$$\begin{aligned} P ={}& 0.2682 - 0.08733 \cdot NS + 0.01275 \cdot NS^2 - 0.1427 \cdot b_{Cisplatin} + 0.0571 \cdot NS \cdot b_{Cisplatin} \\ & + 0.0551 \cdot \Xi, \qquad R^2_{adj} = 0.4681 \end{aligned} \tag{5.13}$$

Figure 4. Cumulative probability distribution of cases according to nausea severity. The uniform
distribution model (dashed red line) is presented for comparison. Top left: No cisplatin. Top
right: Cisplatin. Bottom: Total cases.

This model is graphically presented in Figure 5. In general, the probability of nausea decreases
with severity when no cisplatin is used but increases with severity when cisplatin is used. Even
though the adjusted determination coefficient of the model is relatively low, these trends are
clear.

Figure 5. Probability distribution of nausea severity with cisplatin (purple) and without cisplatin
(green). Dots: Experimental observations. Dashed lines: Model (5.13).

A similar conclusion can be obtained considering the odds-ratio (𝑂𝑅), defined as:

$$OR = \frac{P(NS, b_{Cisplatin} = 0)}{P(NS, b_{Cisplatin} = 1)} \tag{5.14}$$

𝑂𝑅 values close to 1 indicate that the risk of nausea for a particular severity level is
independent of the use of cisplatin. 𝑂𝑅 values greater than 1 represent a higher probability of
presenting the severity level without cisplatin, and 𝑂𝑅 values less than 1 represent a higher
probability of presenting the severity level with cisplatin. The expected value of each
probability given in Eq. (5.13) can be replaced in Eq. (5.14) resulting in:

$$OR = \frac{0.2682 - 0.08733 \cdot NS + 0.01275 \cdot NS^2}{0.1255 - 0.03023 \cdot NS + 0.01275 \cdot NS^2} \tag{5.15}$$

According to model (5.15), the 𝑂𝑅 values decrease as the nausea severity increases.
Alternatively, the observed 𝑂𝑅 values can be fitted using a polynomial model. Considering a
cubic polynomial, the following model is obtained:

$$OR = 2.1933 + 0.2963 \cdot NS - 0.4782 \cdot NS^2 + 0.07219 \cdot NS^3 + 0.1075 \cdot \Xi, \quad R_{adj}^2 = 0.9807 \tag{5.16}$$


Figure 6. Odds ratio (No cisplatin vs. Cisplatin) as a function of nausea severity. Dots:
Experimental observations. Dashed line: Model (5.15). Dotted line: Model (5.16).

These models are illustrated in Figure 6. In both cases, it is clear that the odds ratio is greater
than 1 for low nausea severities, but less than 1 for high nausea severities. Thus, the use of
cisplatin increases the risk of higher-severity nausea in the patients. If a different nausea
severity scale is used, the models will change from a quantitative perspective, but the
corresponding qualitative conclusions are not expected to change.
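The comparison between observed and modelled ratios can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the observed ratio is assumed to be the quotient of the within-group proportions of Table 6 (cases at a severity level divided by the group total), as suggested by Eq. (5.14).

```python
# Counts from Table 6 for severity levels NS = 0..5.
no_cisplatin = [43, 39, 13, 22, 15, 29]
cisplatin = [7, 7, 3, 12, 15, 14]


def observed_ratio(ns):
    """Ratio of within-group proportions, following Eq. (5.14) (assumed reading)."""
    p_without = no_cisplatin[ns] / sum(no_cisplatin)
    p_with = cisplatin[ns] / sum(cisplatin)
    return p_without / p_with


def model_probability(ns, b_cisplatin):
    """Expected value of model (5.13), i.e. without the residual term."""
    return (0.2682 - 0.08733 * ns + 0.01275 * ns**2
            - 0.1427 * b_cisplatin + 0.0571 * ns * b_cisplatin)


def ratio_515(ns):
    """Eq. (5.15): quotient of the two expected probabilities of model (5.13)."""
    return model_probability(ns, 0) / model_probability(ns, 1)


def ratio_516(ns):
    """Expected value of the cubic model (5.16)."""
    return 2.1933 + 0.2963 * ns - 0.4782 * ns**2 + 0.07219 * ns**3


for ns in range(6):
    print(f"NS={ns}: observed={observed_ratio(ns):.3f}, "
          f"(5.15)={ratio_515(ns):.3f}, (5.16)={ratio_516(ns):.3f}")
```

Under these assumptions, all three series are above 1 for severity levels 0 to 2 and below 1 for levels 3 to 5, which is the pattern described in the text.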

5.4. Example 4: Membership and Arbitrary Scales

Reinhart & Chien [12] analyzed the effect of different concentrations of chlorpromazine and
sodium salicylate on the shape of red blood cells as observed under a scanning electron
microscope. The shape of the red blood cells was assigned following the classification
proposed by Bessis [13]. The results obtained in these experiments are summarized in Table 7.
The shape of the red cells is clearly a qualitative variable with multiple categories. Fortunately,
the categories can be ordered and therefore it is possible to assign arbitrary numerical scales.
One possible scale consists in assigning a value of 0 to discocytes, as they are the normal, disc-
shaped, healthy shape of the red cells. Now, two types of deformations of red cells are
observed in the experiments. One type of deformation is the appearance of a stoma or opening
in the cell. Cells with stomas are denoted as stomatocytes, and different degrees of
stomatocytosis are observed. Let us assign arbitrary values from 1 (stomatocytes I) to 4
(spherostomatocytes) to the different levels of deformation. On the other hand, different
degrees of echinocytosis, the appearance of spicules, are assigned negative values from −1
(echinocytes I) to −4 (spheroechinocytes).


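Table 7. Effect of chlorpromazine and salicylic acid on the proportion of red cells classified
according to shape [12]

| Red-cell shape | Control | Chlorpromazine 0.02 mmol/L | Chlorpromazine 0.1 mmol/L | Chlorpromazine 0.5 mmol/L | Salicylic acid 7.5 mmol/L | Salicylic acid 30 mmol/L | Salicylic acid 120 mmol/L |
|---|---|---|---|---|---|---|---|
| Spherostomatocytes | 0% | 0% | 1% | 92% | 0% | 0% | 0% |
| Stomatocytes III | 0% | 0% | 21% | 8% | 0% | 0% | 0% |
| Stomatocytes II | 0% | 8% | 44% | 0% | 0% | 0% | 0% |
| Stomatocytes I | 18% | 52% | 28% | 0% | 5% | 0% | 0% |
| Discocytes | 78% | 40% | 6% | 0% | 63% | 2% | 0% |
| Echinocytes I | 4% | 0% | 0% | 0% | 32% | 17% | 0% |
| Echinocytes II | 0% | 0% | 0% | 0% | 0% | 53% | 2% |
| Echinocytes III | 0% | 0% | 0% | 0% | 0% | 28% | 9% |
| Spheroechinocytes | 0% | 0% | 0% | 0% | 0% | 0% | 89% |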

In addition, the proportion of red cells observed in each shape category ($s$) can be considered
as a membership value ($m_s$). Then, using the arbitrary scale it is possible to quantify the
average shape of the red cells as:

$$\bar{S} = \sum_{s=-4}^{4} s \cdot m_s \tag{5.17}$$

The average shape values calculated with Eq. (5.17) using the data given in Table 7 are
summarized in Table 8 and Figure 7. Now, since the average shape value is naturally bounded
between −4 and 4 (which are the extreme categories), it is possible to use an equivalent logit
transformation as follows:
$$L = \log\left(\frac{\frac{4+\bar{S}}{8}}{1-\frac{4+\bar{S}}{8}}\right) = \log\left(\frac{4+\bar{S}}{4-\bar{S}}\right) \tag{5.18}$$

where the term $\frac{4+\bar{S}}{8}$ transforms the $[-4, 4]$ boundary into a $[0, 1]$ boundary.
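The calculation of Eqs. (5.17) and (5.18) can be sketched in Python from the proportions in Table 7. The dictionary keys and function names below are illustrative choices, and the natural logarithm is assumed for the logit.

```python
import math

# Arbitrary scale: +4 (spherostomatocytes) ... 0 (discocytes) ... -4 (spheroechinocytes).
scale = [4, 3, 2, 1, 0, -1, -2, -3, -4]

# Proportions (%) from Table 7, listed in the same order as `scale`.
table7 = {
    "Control":             [0, 0, 0, 18, 78, 4, 0, 0, 0],
    "Chlorpromazine 0.02": [0, 0, 8, 52, 40, 0, 0, 0, 0],
    "Chlorpromazine 0.1":  [1, 21, 44, 28, 6, 0, 0, 0, 0],
    "Chlorpromazine 0.5":  [92, 8, 0, 0, 0, 0, 0, 0, 0],
    "Salicylic acid 7.5":  [0, 0, 0, 5, 63, 32, 0, 0, 0],
    "Salicylic acid 30":   [0, 0, 0, 0, 2, 17, 53, 28, 0],
    "Salicylic acid 120":  [0, 0, 0, 0, 0, 0, 2, 9, 89],
}


def average_shape(percentages):
    """Eq. (5.17): membership-weighted mean of the arbitrary scale."""
    return sum(s * p / 100 for s, p in zip(scale, percentages))


def logit_shape(s_bar):
    """Eq. (5.18): logit of the average shape rescaled from [-4, 4] to [0, 1]."""
    return math.log((4 + s_bar) / (4 - s_bar))


for label, percentages in table7.items():
    s_bar = average_shape(percentages)
    print(f"{label}: S = {s_bar:.2f}, L = {logit_shape(s_bar):.3f}")
```

Note that the logit is undefined when all cells fall in an extreme category (average shape of exactly 4 or −4), which does not occur in this data set.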

Then, the logit-transformed variable $L$ can be represented by a linear function of the compound
concentration ($C$) and the standard deviation of the estimation $\sigma_L$ as follows:

$$L = \alpha C + \sigma_L \Xi \tag{5.19}$$

The parameters fitted for each compound (chlorpromazine and salicylic acid) by least-squares
minimization are presented in Table 9.
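A minimal sketch of this fit is given below. It assumes that the control sample (zero concentration) is included in the data of both compounds and that the standard deviation is computed with one fitted parameter; these assumptions are not stated explicitly in the text, although they give values close to those reported. The function names are illustrative.

```python
import math

# (concentration in mmol/L, average shape value) pairs from Eq. (5.17).
# Including the control (C = 0) in both series is an assumption of this sketch.
data = {
    "chlorpromazine": [(0, 0.14), (0.02, 0.68), (0.1, 1.83), (0.5, 3.92)],
    "salicylic acid": [(0, 0.14), (7.5, -0.27), (30, -2.07), (120, -3.87)],
}


def logit_shape(s_bar):
    """Eq. (5.18) with the natural logarithm."""
    return math.log((4 + s_bar) / (4 - s_bar))


def fit_through_origin(points):
    """Least-squares slope of L = alpha * C and residual standard deviation."""
    conc = [c for c, _ in points]
    logits = [logit_shape(s) for _, s in points]
    alpha = sum(c * l for c, l in zip(conc, logits)) / sum(c * c for c in conc)
    sse = sum((l - alpha * c) ** 2 for c, l in zip(conc, logits))
    sigma = math.sqrt(sse / (len(points) - 1))  # one fitted parameter
    return alpha, sigma


for compound, points in data.items():
    alpha, sigma = fit_through_origin(points)
    print(f"{compound}: alpha = {alpha:.4f} L/mmol, sigma_L = {sigma:.4f}")
```

Because model (5.19) has no intercept, the slope has the closed form $\alpha = \sum C L / \sum C^2$, and no iterative optimizer is needed.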


The resulting function describing the average shape value becomes:

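$$\bar{S} = 4\left(\frac{e^{\alpha C + \sigma_L \Xi} - 1}{e^{\alpha C + \sigma_L \Xi} + 1}\right) \approx 4\left(\frac{e^{\alpha C} - 1}{e^{\alpha C} + 1}\right) + \sigma_{\bar{S}} \Xi \tag{5.20}$$

where $\sigma_{\bar{S}}$ is an approximate standard error on the estimation of $\bar{S}$.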
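The back-transformation in Eq. (5.20) can be illustrated with the short Python sketch below, which uses the slopes reported in Table 9 and the average shape values of Table 8. As an assumption of this sketch, the control sample is included for both compounds and the standard error is taken as the residual standard deviation with one fitted parameter. The function names are hypothetical.

```python
import math

# Slopes (L/mmol) as reported in Table 9.
alpha = {"chlorpromazine": 9.2291, "salicylic acid": -0.0344}

# (concentration in mmol/L, average shape value) pairs from Table 8.
# Including the control (C = 0) in both series is an assumption of this sketch.
data = {
    "chlorpromazine": [(0, 0.14), (0.02, 0.68), (0.1, 1.83), (0.5, 3.92)],
    "salicylic acid": [(0, 0.14), (7.5, -0.27), (30, -2.07), (120, -3.87)],
}


def predicted_shape(alpha_value, conc):
    """Expected value of Eq. (5.20): inverse of the logit in Eq. (5.18)."""
    e = math.exp(alpha_value * conc)
    return 4 * (e - 1) / (e + 1)


def residual_std(compound):
    """Residual standard deviation of the average shape value."""
    residuals = [s - predicted_shape(alpha[compound], c) for c, s in data[compound]]
    return math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 1))


for compound in data:
    print(f"{compound}: sigma_S = {residual_std(compound):.4f}")
```

By construction, the predicted average shape stays within the interval from −4 to 4 for any concentration, which is the reason for working with the logit transformation.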

Table 8. Effect of chlorpromazine and salicylic acid on the average shape value of red cells

| Compound | Concentration (mmol/L) | Average shape value | Logit transformation |
|---|---|---|---|
| Control | 0 | 0.14 | 0.070 |
| Chlorpromazine | 0.02 | 0.68 | 0.343 |
| Chlorpromazine | 0.1 | 1.83 | 0.988 |
| Chlorpromazine | 0.5 | 3.92 | 4.595 |
| Salicylic acid | 7.5 | -0.27 | -0.135 |
| Salicylic acid | 30 | -2.07 | -1.146 |
| Salicylic acid | 120 | -3.87 | -4.103 |

Figure 7. Average red cell shape value estimated using Eq. (5.17) as a function of
chlorpromazine (left) and salicylic acid (right) concentration. Dotted black lines: Estimation
given by model (5.20).

Table 9. Parameter values fitted by least-squares minimization for describing the effect of
chlorpromazine and salicylic acid on the average shape value of red cells

| Parameter | Chlorpromazine | Salicylic acid |
|---|---|---|
| $\alpha$ [L/mmol] | 9.2291 | -0.0344 |
| $\sigma_L$ | 0.1076 | 0.1058 |
| $R_{adj}^2$ (5.19) | 99.74% | 99.70% |
| $\sigma_{\bar{S}}$ | 0.2065 | 0.1899 |
| $R_{adj}^2$ (5.20) | 98.48% | 98.93% |


5.5. Example 5: Comparison of Distributions

Fantou et al. [14] analyzed the homogeneity and stability of O/W emulsions prepared using
modified xanthan derivatives as surfactants. Homogeneity and stability are categorical
responses. An experiment can be either homogeneous or heterogeneous, and either stable or
unstable. Homogeneity is analyzed in this case, considering the differences in droplet size
distribution between a top sample and a bottom sample of the same emulsion. According to
the authors, “emulsions are considered as homogenous as far as both droplet size and
distribution are identical over the entire volume (from top to bottom)”. Stability is analyzed
considering the differences in droplet size distribution over time (from 1 day up to 2 months in
this example).

Even though homogeneity and stability are based on a quantitative variable (droplet size
distribution), the categorical evaluation is rather subjective. Let us consider, for example, the
top and bottom droplet size distributions determined after 1 day of preparation for three
emulsions containing different amounts of surfactant. The corresponding droplet size
distributions obtained by static light scattering are presented in Figure 8. When 0.05% 𝑤𝑡. of
surfactant is used, the droplet size distributions at the top and the bottom of the emulsion are
clearly different. Also, when 0.2% 𝑤𝑡. surfactant is used, the droplet size distributions at the
top and bottom are practically identical. However, when 0.1% 𝑤𝑡. surfactant is used, the
droplet size distributions are not strictly identical, but they are also not so different. Thus, it is
not clear if the system is homogeneous or not in this case.

Figure 8. Log10 droplet size distribution of O/W emulsions prepared using different amounts of
surfactant from samples taken at different positions (dotted blue lines: top samples, dot-
dashed red lines: bottom samples). Left: 0.05% 𝑤𝑡. surfactant. Center: 0.1% 𝑤𝑡. surfactant.
Right: 0.2% 𝑤𝑡. surfactant.

In order to quantify homogeneity, let us determine the similitude between the top and bottom
probability density functions ($\rho_{Top}$ and $\rho_{Bottom}$) of the logarithm of the droplet size, according
to the following expression [15]:

$$\mathscr{s} = 1 - \frac{\int \left| \rho_{Top}(x) - \rho_{Bottom}(x) \right| dx}{2} \tag{5.21}$$


where $x = \log_{10}(d_s)$, and $d_s$ is the droplet size in μm. Since the full raw data producing the
distributions shown in Figure 8 is not available, let us consider two different strategies for
approximating the droplet size distributions. The first approach consists in approximating the
density functions using piecewise splines [16]. The second approach consists in approximating
the logarithm of the droplet size distribution by fitting a normal distribution model. The
parameters estimated by minimization of the error in the probability density function value,
along with the similitude values obtained using both approaches are summarized in Table 10.
Furthermore, the log-normal approximations are illustrated and compared to the
corresponding spline distributions in Figure 9. Both approaches show a similar trend in the
results. In this example, the similitude indicates the probability of both samples (top and
bottom) having the same size distribution. Thus, a confidence cut-off value (similar to that used
in hypothesis testing) can determine if the emulsion is homogeneous or not.

Table 10. Quantification of emulsion homogeneity using the similitude between top and
bottom droplet size distributions

| Property | Sample | 0.05% surfactant | 0.10% surfactant | 0.20% surfactant |
|---|---|---|---|---|
| Top-bottom similitude (splines) | | 27.04% | 91.49% | 98.71% |
| Sample average log-size | Bottom | 0.768 | 1.217 | 1.137 |
| Sample average log-size | Top | 1.437 | 1.304 | 1.141 |
| Sample st.dev. log-size | Bottom | 0.349 | 0.395 | 0.347 |
| Sample st.dev. log-size | Top | 0.221 | 0.366 | 0.347 |
| Top-bottom similitude (log-normal) | | 23.21% | 90.41% | 99.61% |
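The similitude of Eq. (5.21) for the log-normal approximation can be sketched numerically in Python using the means and standard deviations listed in Table 10. The integration grid (from −3 to 5 in log10 units, with 8000 intervals) and the trapezoidal rule are assumptions of this sketch, not necessarily the procedure used to obtain the tabulated values. The function names are illustrative.

```python
import math

# (mean, standard deviation) of log10 droplet size from Table 10,
# keyed by surfactant concentration (% wt.).
params = {
    "0.05%": {"top": (1.437, 0.221), "bottom": (0.768, 0.349)},
    "0.10%": {"top": (1.304, 0.366), "bottom": (1.217, 0.395)},
    "0.20%": {"top": (1.141, 0.347), "bottom": (1.137, 0.347)},
}


def normal_pdf(x, mean, std):
    """Normal probability density of x = log10(droplet size)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def similitude(top, bottom, lo=-3.0, hi=5.0, n=8000):
    """Eq. (5.21) evaluated with the trapezoidal rule on a uniform grid."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        diff = abs(normal_pdf(x, *top) - normal_pdf(x, *bottom))
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * diff
    return 1 - total * h / 2


for conc, p in params.items():
    print(f"{conc} wt. surfactant: similitude = {similitude(p['top'], p['bottom']):.2%}")
```

Since the parameters in Table 10 are rounded, the computed values are expected to agree with the tabulated log-normal similitudes only to within about one percentage point.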

Figure 9. Log-normal approximation of droplet size distribution for O/W emulsions prepared
using different amounts of surfactant. Dashed blue lines: Spline approximation. Solid red lines:
Log-normal approximation.


Figure 10. Emulsion homogeneity quantified by top-bottom similitude, predicted by a logit
model (solid blue line) vs. experimental observations from size distributions fitted using splines
(red diamonds) and log-normal models (green squares).

Using a logit transformation, the top-bottom similitude values can be related to the surfactant
concentration as illustrated in Figure 10. According to this model, a homogeneous emulsion is
obtained with a 95% confidence using 0.113% 𝑤𝑡. surfactant, and with a 99% confidence
using 0.162% 𝑤𝑡. surfactant.

A similar approach can be used for analyzing stability, comparing size distributions obtained at
different times.

6. Conclusion

The quantitative analysis of categorical variables requires using a suitable strategy for
transforming the qualitative data into numerical information. Depending on the nature of the
categorical variable (e.g. whether the categories can be ordered or not), different
quantification strategies can be used. When the categorical variables are used as input
variables or factors (for example in regression analysis or in experimental design), the
introduction of binary or “dummy” variables (having values of 0 or 1) is a successful approach.
If the definition of the categories is fuzzy or uncertain, binary variables can be replaced with
membership values (having values between 0 and 1). In these cases, a categorical variable with
𝑁 categories will be represented by 𝑁 − 1 numerical variables. On the other hand, when a
categorical variable is used as a response variable, increasing the number of variables is not a
suitable approach, and the use of arbitrary scales is preferred. In some cases, the arbitrary
scales can also be replaced by probability values (also having values between 0 and 1). Non-
linear transformations, such as the logit or probit transformations, of those probabilities or
arbitrary scales can be used when the natural bounds of these numerical surrogates need to be
preserved. Different examples are included to illustrate the quantification and analysis of
categorical variables.


Acknowledgments

This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.

References

[1] Hardy, M. A. (1993). Regression with dummy variables (Quantitative Applications in the
Social Sciences 93). Sage Publications Inc., Newbury Park (CA, USA).

[2] Cohen, A. (1991). Dummy variables in stepwise regression. The American Statistician, 45(3),
226-228.

[3] Suits, D. B. (1957). Use of dummy variables in regression equations. Journal of the American
Statistical Association, 52(280), 548-551.

[4] Humphreys, M. A., & Hancock, M. (2007). Do people like to feel ‘neutral’?: Exploring the
variation of the desired thermal sensation on the ASHRAE scale. Energy and buildings, 39(7),
867-874.

[5] Dubois, D. J. (1980). Fuzzy sets and systems: theory and applications (Vol. 144). Academic
press. pp. 255-256.

[6] Agresti, A. (2018). An introduction to categorical data analysis. John Wiley & Sons. Chapter
4.

[7] Holgersson, H. E. T., Nordström, L., & Öner, Ö. (2014). Dummy variables vs. category-wise
models. Journal of Applied Statistics, 41(2), 233-241.

[8] Oyelade, J. O., Idowu, D. O., Oniya, O. O., & Ogunkunle, O. (2017). Optimization of biodiesel
production from sandbox (Hura crepitans L.) seed oil using two different catalysts. Energy
Sources, Part A: Recovery, Utilization, and Environmental Effects, 39(12), 1242-1249.

[9] Adjiman, C. S., Androulakis, I. P., & Floudas, C. A. (2000). Global optimization of mixed‐
integer nonlinear problems. AIChE Journal, 46(9), 1769-1797.

[10] Farewell, V. T. (1982). A note on regression analysis of ordinal data with variability of
classification. Biometrika, 69(3), 533-538.

[11] Hernandez, H. (2019). Goodness-of-fit of Randomistic Models. ForsChem Research Reports,
4, 2019-10. doi: 10.13140/RG.2.2.35386.34248.

[12] Reinhart, W. H., & Chien, S. (1986). Red Cell Rheology in Stomatocyte-Echinocyte
Transformation: Roles of Cell Geometry and Cell Shape. Blood, 67(4), 1110-1118.


[13] Bessis, M. (1972). Red cell shapes. An illustrated classification and its rationale. In: Bessis,
M., Weed, R. I., & Leblond, P. F. (Eds.). Red Cell Shape: Physiology, Pathology, Ultrastructure.
Springer Verlag, New York. pp. 1-25.

[14] Fantou, C., Comesse, S., Renou, F., & Grisel, M. (2019). Hydrophobically modified xanthan:
thickening and surface active agent for highly stable oil in water emulsions. Carbohydrate
polymers, 205, 362-370.

[15] Hernandez, H. (2018). Parameter Identification using Standard Transformations: An
Alternative Hypothesis Testing Method. ForsChem Research Reports, 3, 2018-04. doi:
10.13140/RG.2.2.14895.02728.

[16] Hernandez, H. (2020). Reconstructing Probability Distributions using Quantile-based
Splines. ForsChem Research Reports, 5, 2020-21. doi: 10.13140/RG.2.2.14827.36645.
