Instrumentation for sensory measurements

Name: Azzam Mousa Khan


University ID: 44285470
Introduction
One of the most traditional methods available to sensory scientists is difference testing (Meilgaard et al. 2007; Lawless and Heymann 2010; Stone et al. 2012). Difference tests have traditionally been carried out to assess whether there is a sensory difference between samples (Amerine et al. 1965), while binomial tests have been used to determine whether a statistically significant difference is observed between the samples (Peryam and Swartz 1950). In response to the difficulty of interpreting non-significant test results, this approach was later refined through power considerations, both theoretical and operational [see Ennis and Jesionka (2011) and Kim and Lee (2012) for discussions of theoretical and operational power considerations, respectively].
More recently, difference testing has undergone an additional shift in perspective in order to provide businesses with information that is as detailed and reliable as possible, driven by a market focus on cost savings and a growing public-policy focus on health initiatives. Specifically, in the latter half of the twentieth century, difference testing increasingly drew on ideas from psychology (Thurstone 1927) and signal detection theory (Green and Swets 1966) to become useful not just for providing “yes/no” judgments of significant differences, but for scaling the magnitude of the sensory differences among products (Ura 1960; David and Trivedi 1962; Frijters 1979, 1980; Ennis et al. 1988; Mullen et al. 1988; O'Mahony 1992; O'Mahony et al. 1994; Stillman and Irwin 1995; Hautus and Irwin 1995; Ennis et al. 1998). This use of difference tests as measurement instruments agrees philosophically with Daniel Ennis's (1990) comment:

“Is hypothesis testing really important? A more important issue than that concerning possible sensory differences is the size and importance of these differences.”

Forced‐Choice Difference Testing Methods


In this article, we begin by reviewing the methods we consider. To facilitate direct comparisons, we do not consider methods with response criteria that can vary by individual, such as the A/Not-A, Same–Different, Degree of Difference or 2-Alternative Choice (2-AC) tests (Bi 2002; Alfaro-Rodriguez et al. 2007; Lee et al. 2007; Kim et al. 2008, 2012; Young et al. 2008; Christensen and Brockhoff 2009; Chae et al. 2010; Christensen et al. 2011; Santosa et al. 2011; Sung et al. 2011; Christensen et al. 2012a; Ennis and Ennis 2012) – comparison of these methods with the tests covered in the present review is an important topic for future research. Also, although replication is a topic of justifiable interest in sensory science (Brockhoff and Schlich 1998; Ennis and Bi 1998; Bi and Ennis 1999, 2001; Kunert 2001; Brockhoff 2003; Liggett and Delwiche 2005; Bi 2006; Meyners 2007a), we focus on unreplicated testing – tests that are more precise in a single evaluation will be more precise in replicated evaluations as long as the psychological processes involved in the evaluations remain constant. For similar reasons, we do not cover double discrimination tests [see Bi (2006) for a discussion]. Finally, as we discuss in more detail at the end of this section, we focus on methods in which only two groups of samples, A and B, are compared.
Specified Methods
For the specified testing methods, an attribute of interest, such as “saltiness,” is explicitly identified in the
test instructions. In the specified tests listed later, we assume that sample A is hypothesized to be
perceived as having more of the specified attribute.
2‐Alternative Forced‐Choice Test
In the 2‐Alternative Forced‐Choice (2‐AFC) test, respondents are presented with one sample of A and one sample of B. The respondents are instructed to identify the sample with the most of the specified attribute (e.g., “Please identify the saltiest sample”). The two possible presentation orders (AB and BA) are balanced across the experiment. If a sample of A is chosen, the response is considered correct. The guessing probability is 1/2.
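Given results from any of the forced-choice tests described here, significance against chance can be assessed with an exact binomial test based on the test's guessing probability. The following Python sketch illustrates the idea for the 2-AFC; the counts are hypothetical, invented purely for illustration.

```python
# A minimal sketch: exact binomial test for a forced-choice difference test.
# The counts below are hypothetical, purely for illustration.
from scipy.stats import binomtest

n_correct, n_total = 42, 60   # hypothetical 2-AFC results
p_guess = 1 / 2               # guessing probability for the 2-AFC test

# One-sided test: are respondents correct more often than chance?
result = binomtest(n_correct, n_total, p_guess, alternative="greater")
print(f"observed proportion correct: {n_correct / n_total:.3f}")
print(f"p-value vs. chance: {result.pvalue:.4f}")
```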
3‐AFC Test
In the 3‐AFC test, respondents are presented with one sample from one group and two from the other. If
two B samples are presented, the respondents are instructed to identify the sample with the most of the
specified attribute (e.g., “Please identify the saltiest sample”). If two A samples are presented, the respondents are instructed to identify the sample with the least of the specified attribute (e.g., “Please identify the least salty sample”). The six possible presentation orders (ABB, BAB, BBA, BAA, ABA and
AAB) are balanced across the experiment. If the odd sample is chosen, then the response is considered
correct. The guessing probability is 1/3.
4‐AFC Test
In the 4‐AFC test, respondents are presented with one sample from one group and three from the other. If
three B samples are presented, the respondents are instructed to identify the sample with the most of the
specified attribute (e.g., “Please identify the saltiest sample”). If three A samples are presented, the respondents are instructed to identify the sample with the least of the specified attribute (e.g., “Please
identify the least salty sample”). The eight possible presentation orders (ABBB, BABB, BBAB, BBBA,
BAAA, ABAA, AABA, AAAB) are balanced across the experiment. If the odd sample is chosen, then the
response is considered correct. The guessing probability is 1/4.
More generally, it is possible to consider m‐AFC tests, in which m samples are presented. But we do not consider such tests here, as they provide only a trivial increase in test power (Ennis and Jesionka 2011), are impractical, and are not in common use.
Specified Tetrad Test
In the specified Tetrad test, respondents are presented with two samples from one group and two from the
other. The respondents are instructed to identify the two samples with the most of the specified attribute
(e.g., “Please identify the two saltiest samples”). The six possible presentation orders (AABB, ABAB,
ABBA, BABA, BAAB and BBAA) are balanced across the experiment. If the two A samples are chosen,
then the response is considered correct. The guessing probability is 1/6.
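The guessing probabilities quoted for the specified methods follow from elementary counting and can be verified by brute-force enumeration. A minimal sketch in plain Python (the sample labels are arbitrary):

```python
# Verify the guessing probabilities of the specified tests by counting.
from fractions import Fraction
from itertools import combinations

# m-AFC tests: one sample is the correct choice among m presented.
for m in (2, 3, 4):
    print(f"{m}-AFC guessing probability: {Fraction(1, m)}")

# Specified Tetrad: a respondent picks 2 of the 4 samples, and only
# one of the C(4,2) = 6 possible pairs is the pair of A samples.
pairs = list(combinations(["A1", "A2", "B1", "B2"], 2))
correct = [p for p in pairs if set(p) == {"A1", "A2"}]
print(f"Specified Tetrad guessing probability: {Fraction(len(correct), len(pairs))}")
```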
Unspecified Methods
For the unspecified testing methods, it is not necessary to specify an attribute of interest – respondents are instead asked to compare samples with respect to an overall level of sensory difference. As mentioned earlier, we assume the two groups of samples are labeled A and B.
Duo‐Trio Test
In the standard form of the Duo‐Trio test, respondents are presented with a reference sample together with two test samples. Respondents are asked which test sample is most similar to the reference. The two possible presentation orders (AB and BA) are balanced across the experiment. If the sample from the same group as the reference is selected, then the response is considered correct. The guessing probability is 1/2.
Triangle Test
In the Triangle test, respondents are presented with two samples from one group and one from the other.
Respondents are asked which sample is most different from the other two. The six possible presentation orders (ABB, BAB, BBA, BAA, ABA and AAB) are balanced across the experiment. If the odd sample is chosen, then the response is considered correct. The guessing probability is 1/3.
Tetrad Test
In the Tetrad test, respondents are presented with two samples from one group and two from the other.
The respondents are instructed to group the samples into two groups of two. The six possible presentation
orders (AABB, ABAB, ABBA, BABA, BAAB and BBAA) are balanced across the experiment. If the
samples are grouped correctly, the response is considered correct. The guessing probability is 1/3.
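The 1/3 guessing probability reflects the fact that four samples can be split into two unordered pairs in exactly three ways, only one of which groups the two A samples together. A small sketch confirming this by enumeration (labels are arbitrary):

```python
# Enumerate the ways to split four samples into two unordered pairs.
samples = {"A1", "A2", "B1", "B2"}
# Fixing "A1" in the first pair enumerates each pairing exactly once.
pairings = [({"A1", other}, samples - {"A1", other})
            for other in samples - {"A1"}]
correct = [p for p in pairings if {"A1", "A2"} in p]
print(f"{len(pairings)} pairings, {len(correct)} correct: "
      f"guessing probability {len(correct)}/{len(pairings)}")
```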
Two‐Out‐of‐Five Test
In the standard form of the Two‐Out‐of‐Five test, respondents are presented with two samples from one
group and three from the other. The respondents are instructed to group the samples into a group of two
and a group of three. The 20 possible presentation orders are balanced across the experiment. If the
samples are grouped correctly, the response is considered correct. The guessing probability is 1/10.
Two‐Out‐of‐Five Test with Forgiveness
A modification to the Two‐Out‐of‐Five test has recently been proposed (Ennis 2013). Data are collected
in the same manner as in the standard Two‐Out‐of‐Five test. But a response is scored as correct as long as
the group of two contains two samples from the same group. The guessing probability for the Two‐Out‐
of‐Five test with forgiveness is 2/5.
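Both guessing probabilities can be verified the same way: of the C(5,2) = 10 possible groups of two, only one is the exact {A, A} pair, while the forgiveness rule also accepts the C(3,2) = 3 same-group {B, B} pairs, giving 4/10 = 2/5. A sketch:

```python
# Two-Out-of-Five guessing probabilities, with and without forgiveness.
from fractions import Fraction
from itertools import combinations

samples = ["A1", "A2", "B1", "B2", "B3"]
groups_of_two = list(combinations(samples, 2))        # C(5,2) = 10 choices

exact = [g for g in groups_of_two if set(g) == {"A1", "A2"}]
same_group = [g for g in groups_of_two if g[0][0] == g[1][0]]

print("standard scoring:   ", Fraction(len(exact), len(groups_of_two)))      # 1/10
print("forgiveness scoring:", Fraction(len(same_group), len(groups_of_two))) # 2/5
```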
Other Tests
There are several other tests that we do not include in this review. They include the Dual Pair (Rousseau
and Ennis 2001; Rousseau and O'Mahony 2001; Rousseau et al. 2002) and Dual Standard (Peryam and
Swartz 1950) tests – these tests are not commonly used and have been found, both theoretically and
experimentally, to lack sensitivity. We also do not consider tests of more than two groups of samples.
These tests, such as Torgerson's method of Triads (Torgerson 1958; Ennis et al. 1988; Rousseau 2007),
have shown promise, but require further investigation before they can be compared with the methods
mentioned earlier with respect to precision of measurement.

Decision Rules
The second main assumption made by Thurstonian scaling is that subjects compare the percepts in a
systematic fashion when making their decisions in a sensory difference test. For example, in a 2‐AFC test,
a subject will respond that Sample A is less intense than Sample B if the percept corresponding to Sample
A is less intense than the percept corresponding to Sample B. If Sample A is less intense on average than
Sample B, then the subject will be correct. Such a case occurs if, for example, percepts a1 and b are
chosen in Fig. 1. On the other hand, if the percept a2 is instead chosen for Sample A in Fig. 1, then the
subject will give an incorrect answer. Because the percepts take on momentary values drawn from normal distributions, the probability of a correct response can be computed as a function of δ.

Figure 1
Possible Percepts in the 2‐Alternative Forced‐Choice (2‐AFC) Test
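Under these assumptions, the 2-AFC decision rule can be checked by simulation: draw momentary percepts from unit-variance normal distributions whose means differ by δ, apply the rule, and compare the proportion correct with the known closed form Pc = Φ(δ/√2). A minimal Python sketch (the trial count and seed are arbitrary):

```python
# Monte Carlo check of the 2-AFC decision rule against the known
# psychometric function Pc = Phi(delta / sqrt(2)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
delta, n_trials = 1.0, 200_000

a = rng.normal(0.0, 1.0, n_trials)     # percepts of sample A
b = rng.normal(delta, 1.0, n_trials)   # percepts of sample B
# Decision rule: the sample with the more intense percept is named
# as having more of the attribute; correct when b > a.
p_simulated = np.mean(b > a)
p_theory = norm.cdf(delta / np.sqrt(2))
print(f"simulated: {p_simulated:.4f}, theoretical: {p_theory:.4f}")
```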
Similarly, for each difference testing method, the relationship between δ and the probability of a correct
response can be deduced once a decision rule has been specified. This relationship is called a
psychometric function (cf. Ennis 1993) – psychometric functions have been derived for the 2‐AFC
(Thurstone 1927), 3‐AFC (Elliot 1964), 4‐AFC (Bi et al. 2010), Triangle (Ura 1960; David and
Trivedi 1962; Bradley 1963), Duo‐Trio (David and Trivedi 1962) and the Specified and Unspecified
Tetrad (Ennis et al. 1998) tests, and have been approximated for the Two‐Out‐of‐Five test (Ennis 2013).
In addition, Bi and O'Mahony (2013) have provided simplified expressions for the psychometric
functions of both the Triangle and Unspecified Tetrad tests.
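For illustration, the sketch below evaluates three of these published psychometric functions numerically: the 2-AFC closed form, the 3-AFC integral, and the Ura/David–Trivedi form of the Triangle function. The integral expressions are standard in the literature; the code itself is only our illustration.

```python
# Numerical psychometric functions: probability of a correct response
# as a function of delta for three forced-choice tests.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pc_2afc(delta):
    return norm.cdf(delta / np.sqrt(2))

def pc_3afc(delta):
    # Pc = integral of phi(z - delta) * Phi(z)^2 dz
    f = lambda z: norm.pdf(z - delta) * norm.cdf(z) ** 2
    return quad(f, -np.inf, np.inf)[0]

def pc_triangle(delta):
    # Ura (1960) / David and Trivedi (1962) form of the Triangle function.
    f = lambda z: norm.pdf(z) * (norm.cdf(-z * np.sqrt(3) + delta * np.sqrt(2 / 3))
                                 + norm.cdf(-z * np.sqrt(3) - delta * np.sqrt(2 / 3)))
    return 2 * quad(f, 0, np.inf)[0]

for d in (0.0, 1.0, 2.0):
    print(f"delta={d}: 2-AFC={pc_2afc(d):.3f}, "
          f"3-AFC={pc_3afc(d):.3f}, Triangle={pc_triangle(d):.3f}")
```

At δ = 0, the printed values recover the guessing probabilities (1/2, 1/3 and 1/3), which is a useful sanity check on the implementations.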
Differences in decision rules explain why some difference tests lead to more correct answers than others
(Frijters 1979) – some decision rules are more efficient than others and are less easily influenced by noise
in the percepts. For example, Jesionka et al. (2014) illustrate, through a case‐by‐case analysis, why the 3‐
AFC leads to more correct answers than the Triangle test.
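This difference can be demonstrated by feeding identical percepts through both decision rules. In the sketch below (a simulation under the usual equal-variance Thurstonian assumptions, with sample A assumed more intense), the 3-AFC rule picks the most intense percept, while the Triangle rule picks the percept most distant from the other two:

```python
# Identical percepts, different decision rules: 3-AFC vs. Triangle.
import numpy as np

rng = np.random.default_rng(1)
delta, n = 1.0, 200_000

# Each trial presents one A percept (mean delta) and two B percepts
# (mean 0); column 0 always holds the odd (A) sample.
percepts = np.column_stack([
    rng.normal(delta, 1.0, n),
    rng.normal(0.0, 1.0, n),
    rng.normal(0.0, 1.0, n),
])

# 3-AFC rule: choose the most intense percept.
afc_correct = np.argmax(percepts, axis=1) == 0

# Triangle rule: choose the percept farthest from the other two
# (largest summed distance to the remaining percepts).
dists = np.abs(percepts[:, :, None] - percepts[:, None, :]).sum(axis=2)
tri_correct = np.argmax(dists, axis=1) == 0

print(f"3-AFC proportion correct:    {afc_correct.mean():.3f}")  # theory ~0.63
print(f"Triangle proportion correct: {tri_correct.mean():.3f}")  # theory ~0.42
```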
Recently, investigations have been conducted into techniques to encourage subjects to employ more
efficient decision rules (van Hout et al. 2011; M. Kim et al. 2012). A challenge for this line of research is
that, unless the decision rules that the subject used are known, it is not possible to use Thurstonian theory
to transform the results into d′ values. Through the skillful application of signal detection theory
(Hautus et al. 2007; O'Mahony and Hautus 2008; Wichchukit and O'Mahony 2010), it may be possible to
meet this challenge, but to our knowledge it has not yet been fully met.

Precision of Measurement
Psychometric Functions and Precision
As noted in the previous section, the psychometric function of a sensory difference test relates the underlying sensory difference δ to the probability of a correct response. Figure 2 shows the psychometric functions for the specified methods described earlier, while Fig. 3 shows the psychometric functions for the unspecified methods.

Figure 2
Psychometric Functions for the Specified Methods of Forced‐Choice Difference Testing
Discussed in this Article

Figure 3
Psychometric Functions for the Unspecified Methods of Forced‐Choice Difference Testing
Discussed in this Article
When using the results of a difference test to estimate δ, the shape of the psychometric function determines the precision of that estimate. If the psychometric function is relatively flat in the region of δ values surrounding d′, then a wide range of δ values would yield results similar to those observed experimentally, and the estimate of δ will not be precise. On the other hand, if the psychometric function is relatively steep in the region of δ values surrounding d′, only δ values close to d′ would yield results similar to those observed experimentally, and the estimate of δ will be more precise. For example, Fig. 4 shows the psychometric function of the Triangle test over the range 0.5 ≤ δ ≤ 1.5, while Fig. 5 shows the psychometric function of the Tetrad test over the same range.

Figure 4
The Relationship Between a Range of Similar Values for the Proportion Correct and the
Corresponding Range of δ Values, in a Triangle Test

Figure 5
The Relationship Between a Range of Similar Values for the Proportion Correct and the
Corresponding Range of δ Values, in a Tetrad Test
Over this range of δ values, the psychometric function for the Tetrad test is steeper than the psychometric function of the Triangle test. As a result, small deviations in the proportion of correct responses (in this case a deviation of ±0.03) correspond to smaller deviations in δ for the Tetrad test than for the Triangle test. This implies that the Tetrad test is more precise near δ = 1 than the Triangle test [see Ennis and Christensen (2014) for additional discussion of this topic].
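The steepness argument can be made concrete by differentiating the two psychometric functions numerically. The sketch below reuses the Triangle function from earlier and derives a Tetrad function directly from the grouping decision rule (a response is correct when both A percepts fall below, or both above, both B percepts); this coding is our own illustration, not a published implementation:

```python
# Compare the slopes of the Triangle and Tetrad psychometric
# functions near delta = 1 (steeper slope -> more precise estimate).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pc_triangle(delta):
    f = lambda z: norm.pdf(z) * (norm.cdf(-z * np.sqrt(3) + delta * np.sqrt(2 / 3))
                                 + norm.cdf(-z * np.sqrt(3) - delta * np.sqrt(2 / 3)))
    return 2 * quad(f, 0, np.inf)[0]

def pc_tetrad(delta):
    # Correct when both A percepts lie below, or both above, both B percepts.
    below = lambda t: 2 * norm.pdf(t - delta) * (1 - norm.cdf(t - delta)) * norm.cdf(t) ** 2
    above = lambda t: 2 * norm.pdf(t - delta) * norm.cdf(t - delta) * (1 - norm.cdf(t)) ** 2
    return quad(below, -np.inf, np.inf)[0] + quad(above, -np.inf, np.inf)[0]

def slope(pc, delta, h=1e-4):
    return (pc(delta + h) - pc(delta - h)) / (2 * h)   # central difference

for name, pc in (("Triangle", pc_triangle), ("Tetrad", pc_tetrad)):
    print(f"{name}: Pc(1) = {pc(1.0):.3f}, slope at delta=1 = {slope(pc, 1.0):.3f}")
```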

Variance in Estimate of δ
The variance in the estimate of δ depends only on the experimentally observed proportion of correct responses and the sample size. When discussing this variance, it is thus standard practice to refer to the so‐called “B values,” which are the product of the variance and the sample size (cf. Bi et al. 1997). Recently, Bi and O'Mahony (2013) compiled the B values for several of the common difference testing methods into a single resource, and contributed the B values for the Specified Tetrad test. These B values, which were provided separately in Bi et al. (1997, 2010), Bi (2006) and Ennis (2012), have been combined with B values for the Two‐Out‐of‐Five test (Ennis 2013) to create Figs 6 and 7. These figures show these values for the testing methods in our list of forced‐choice testing methods, separated according to whether the tests are specified or unspecified.

Figure 6
B Values for the Specified Methods of Forced‐Choice Difference Testing Discussed in this
Article
Figure 7
B Values for the Unspecified Methods of Forced‐Choice Difference Testing Discussed in this
Article

Comparing Figs 6 and 7 with Figs 2 and 3, we see that the B values for a difference test are indeed smallest where the test's psychometric function is steepest. Two other important notes are that the variance in d′ is not the same as the perceptual noise described earlier, and that variances derived from B values should not be used to create confidence intervals (cf. Pawitan 2001 and Christensen and Brockhoff 2009). The B values are only meant to provide a rough assessment of the precision or imprecision of the various testing methods over a range of possible δ values, for comparative purposes.
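For a psychometric function ψ, the delta method gives Var(d′) ≈ pc(1 − pc)/(N[ψ′(δ)]²), so B = pc(1 − pc)/[ψ′(δ)]². The sketch below applies this to the Triangle test; the formula is the standard delta-method result underlying the published B values (cf. Bi et al. 1997), but the numerical details here are our own illustration:

```python
# Delta-method B values: B = pc * (1 - pc) / psi'(delta)^2,
# so that Var(d') is approximately B / N.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pc_triangle(delta):
    f = lambda z: norm.pdf(z) * (norm.cdf(-z * np.sqrt(3) + delta * np.sqrt(2 / 3))
                                 + norm.cdf(-z * np.sqrt(3) - delta * np.sqrt(2 / 3)))
    return 2 * quad(f, 0, np.inf)[0]

def b_value(psi, delta, h=1e-4):
    pc = psi(delta)
    slope = (psi(delta + h) - psi(delta - h)) / (2 * h)
    return pc * (1 - pc) / slope ** 2

for d in (0.5, 1.0, 2.0):
    print(f"delta={d}: Triangle B value = {b_value(pc_triangle, d):.1f}")
```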

Expected Width of Likelihood‐Based Confidence Intervals


As an alternative to Wald confidence intervals, which are based on the variance in d′, Christensen and Brockhoff have proposed the use of likelihood‐based confidence intervals (Christensen and Brockhoff 2009; Christensen 2011). Likelihood‐based confidence intervals can be computed via the inverse psychometric function, that is, the function that maps the probability of a correct response to δ. Moreover, likelihood‐based confidence intervals have the desirable property that no δ values outside the interval have greater likelihood than those inside (cf. Pawitan 2001). Thus, the expected width of likelihood‐based confidence intervals serves as a more meaningful metric for assessing the precision of a difference test. But unlike B values, this metric depends on the sample size used for testing.
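As an illustration of the idea (Christensen and Brockhoff's own implementation is available in their sensR package for R; the sketch below is only a simplified Python analogue with hypothetical counts), a likelihood-based interval for δ can be obtained by profiling the binomial log-likelihood through the psychometric function and cutting it at the 95% chi-squared threshold:

```python
# Sketch of a likelihood-based 95% CI for delta in a Triangle test.
# Hypothetical data: x correct responses out of n.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq, minimize_scalar
from scipy.stats import norm

def pc_triangle(delta):
    f = lambda z: norm.pdf(z) * (norm.cdf(-z * np.sqrt(3) + delta * np.sqrt(2 / 3))
                                 + norm.cdf(-z * np.sqrt(3) - delta * np.sqrt(2 / 3)))
    return 2 * quad(f, 0, np.inf)[0]

x, n = 18, 30                      # hypothetical Triangle test results

def loglik(delta):
    p = min(max(pc_triangle(delta), 1e-12), 1 - 1e-12)
    return x * np.log(p) + (n - x) * np.log(1 - p)

# Maximum-likelihood estimate of delta (search over a plausible range).
dhat = minimize_scalar(lambda d: -loglik(d), bounds=(0, 6), method="bounded").x
cut = loglik(dhat) - 3.841 / 2     # chi-squared(1) 95% cutoff

lower = brentq(lambda d: loglik(d) - cut, 1e-6, dhat) if loglik(1e-6) < cut else 0.0
upper = brentq(lambda d: loglik(d) - cut, dhat, 10)
print(f"dhat = {dhat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```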

Figures 8 and 9 show the expected widths of the likelihood‐based 95% confidence intervals for the specified and unspecified difference tests, respectively, for N = 30. Note that these expected widths appear in Ennis and Christensen (2014) for the Unspecified Tetrad, Triangle and 2‐AFC tests; otherwise, they appear for the first time in the present article.
Figure 8
Expected Widths of Likelihood‐Based Confidence Intervals for the Specified Methods of Forced‐Choice Difference Testing Discussed in this Article (N = 30)

Figure 9
Expected Widths of Likelihood‐Based Confidence Intervals for the Duo‐Trio, Triangle and
Tetrad Tests (N = 30)

From these figures, we see that both methods of Tetrad testing, specified and unspecified, are, in theory, the most precise methods for which likelihood‐based confidence intervals have been developed. However, some caveats are in order. Both of these methods require the evaluation of four samples, which could lead to additional noise. In addition, if four samples are to be evaluated in a specified condition, it would perhaps be feasible to perform a double‐replicated 2‐AFC test with the same amount of product preparation. Thus, experimental comparisons of the various methods, especially those involving Tetrad testing, are crucial for understanding the behavior of these tests in practice [see Garcia et al. (2012), Ishii et al. (2014) and Garcia et al. (2013) for three such recent comparisons].

Conclusion and Future Directions


In this article, we have reviewed the recent developments in sensory difference testing that have allowed a shift away from the binary perspective of “significantly different or not” to the more nuanced perspective of measuring sensory effect size. According to this latter perspective, products are necessarily different when they are reformulated, but the difference may be so small as to be irrelevant to consumers. The question of what size of sensory difference is meaningful to consumers then becomes a crucial one that can only be investigated experimentally. Similarly, although the concept of power has helped sensory scientists understand the shortcomings of such methods as the Triangle test, this new perspective indicates that it is the most precise methods, rather than the most powerful methods, that are the most practically valuable.
For the next steps, it is important to develop reliable methods for determining the size of a meaningful sensory difference, within a given product category and for a given target consumer group, as it is impractical to conduct consumer tests for every product reformulation. The development of these consumer‐relevant action standards will prove to be an important next step in the evolution of sensory difference testing. It is also important to continue to translate the theory of confidence intervals from statistics (e.g., Agresti and Coull 1998 and Brown et al. 2002) into sensory science. Using confidence intervals, we can assess whether we are confidently below a consumer‐relevant action standard, and hence meaningfully equivalent, or confidently above such a standard, and hence meaningfully different. Related to this statistical progress, it remains to be seen how the measurement perspective will integrate with the increasing use of Bayesian analysis in sensory science (Bi 2003, 2011b; Bayarri et al. 2008; Carbonell et al. 2008; Duineveld and Meyners 2008). Finally, it will be important to extend all of this theory to other testing methods, such as the Degree of Difference test, that are in common use but are not covered in this review.

References

Alfaro‐Rodriguez, H., Angulo, O. and O'Mahony, M. 2007. Be your own placebo: A double paired preference test approach for establishing expected frequencies. Food Qual. Prefer. 18, 353–361.

Amerine, M., Pangborn, R. and Roessler, E. 1965. Principles of Sensory Evaluation of Food, Academic Press, New York, NY.

Angulo, O., Lee, H.‐S. and O'Mahony, M. 2007. Sensory difference tests: Overdispersion and warm‐up. Food Qual. Prefer. 18, 190–195.

Bayarri, S., Carbonell, I., Izquierdo, L. and Tárrega, A. 2008. Replicated triangle and duo–trio tests: Discrimination capacity of assessors evaluated by Bayes' rule. Food Qual. Prefer. 19, 519–523.

Bi, J. 2002. Variance of d′ for the same–different method. Behav. Res. Methods 34, 37–45.

Bi, J. 2003. Difficulties and a way out: A Bayesian approach for sensory difference and preference tests. J. Sensory Studies 18, 1–18.

Bi, J. and Kuesten, C. 2012. Intraclass correlation coefficient (ICC): A framework for monitoring and assessing performance of trained sensory panels and panelists. J. Sensory Studies 27, 352–364.