Psychophysics Notes

Psychophysics 2019-2020

1. Classic psychophysics: Historical introduction and Basic concepts

I. Gustav Theodor Fechner (1801-1887)


- Is the founding father of psychophysics
- Was a professor of physics in Leipzig, and developed an eye disease (from studying color afterimages)
- He was a “strange guy”
• He wrote papers on chemistry and physics
• Made poems and humorous texts (pen-name “Dr. Mises”), including “Vergleichende
Anatomie der Engel” (1825) → comparative anatomy of angels
• Was a mystic and metaphysician, wrote about philosophical topics, including “Das
Büchlein vom Leben nach dem Tod” (1836) and “Zendavesta, oder über die Dinge des
Himmels und des Jenseits” (1851)
- → These were strange topics for a professor of physics

- Panpsychism and monism, including “Nanna, oder über das Seelenleben der Pflanzen” (1848)
and “Über die physikalische und philosophische Atomenlehre” (1853)
• Everything in nature has a soul (“beseelt”) (= panpsychism)
• He rejects the Cartesian dualism of mind and body
• He defends a monistic view on the relation between the physical and the mental as two
aspects of the same thing (only one identity)
▪ Important core notion of his philosophical view on nature and starting point of
psychophysics: he wanted to study the relation between these two quantitatively
• Monism in the form of panpsychism → everything is psychic, mental
• Starting point psychophysics = monism
▪ The start was already before the start of psychology

- He wanted to map the functional relation between the physical and the mental and thus
develops psychophysics
• = the exact science of the functional relation between the body and the mind
- “Elemente der Psychophysik” (1860):
• Start of psychophysics (even before the official start of psychology)
• Milestone in the development of psychology as a science

II. “Elemente der Psychophysik”


- Two volumes:
• “Outer psychophysics”: the relation between the intensity of the physical stimuli (R for
“Reiz”) and the intensity of the sensation (S)
▪ (Perceived) intensity
▪ R = stimulus and S = sensation! (so watch out with response and stimulus)
• “Inner psychophysics”: the relation between the intensity of the neural excitation (E) and
the intensity of the sensation (S)
▪ How strong you perceive stimuli to be in relation
to brain activity
▪ It’s all in the brain and the mind

III. Threshold and threshold measurements


- Threshold (Lat. “limen”) = limit between stimuli that evoke one kind of responses and stimuli
that evoke another kind of responses
• Very broad, general concept, transition between stimuli leading to one kind of response
and stimuli that evoke another kind of response
• It’s always expressed in terms of stimulus intensity (on the stimulus axis)
• It’s a transition in what happens in our mind in relation to stimulus strength
• Different kinds of thresholds
- Absolute threshold (RL): limit that indicates the transition between absence and presence of
a sensation
• Most basic type of threshold
• Suddenly you start noticing the stimulus, from 0 to something, or suddenly you stop
noticing the stimulus (at the high end of the spectrum)
• Always expressed in terms of stimulus intensity (on the stimulus axis)
- Differential threshold (DL): the smallest added stimulus intensity that allows perceiving a
difference
• Again about a difference in stimulus intensity, but here it is all above the absolute
threshold, at levels of intensity higher than the RL

- Lowest threshold (“floor”): the minimal stimulus intensity (or signal strength) that is required
to be perceived
• Absolute is most often the lowest threshold, but you can also have an absolute threshold
at the other side of the spectrum (at the high end), so these two concepts aren’t the
same!
- Different ways to measure this (next lecture)
• Mostly a straightforward detection task
- Ideal situation: a discrete step function
• Discrete step function from nothing at all to perceiving the stimulus
- In reality: often no abrupt transition

- Graph of ideal stimulus threshold


- Always pay attention to the labels on the axes!
- Starting from 0 stimulus intensity, the detection probability is 0, but suddenly, from a certain
stimulus intensity on, the detection probability is 1
- Fixed absolute threshold because you go from 0 to 1
- But this is what you find in practice (watch out: other y-axis)
• For very low stimuli they sometimes say yes but this happens rarely
• From some levels they begin to say more and more yes
- It is not a 0% to 100% but a gradual transition, that is fitted to the data
- Psychometric function, usually a sigmoid function (S-
shape), with the lower part of the S indicating low levels
and the upper part of the S indicating high levels of
detection
• It is very hard to say where the transition exactly
occurs, it is somewhat arbitrary
• Part of the challenge in psychometrics is deriving
thresholds from that function
• A lot of tools in psychophysics are developed to
deal with this kind of problem

- Within the perceptual range of a stimulus dimension, we can examine the differential threshold
(DL) or the just noticeable difference (JND or j.n.d.)
• These two are the same (DL and JND)
- Once again we can use several tasks (next lecture)
- Mostly discrimination tasks
• For absolute threshold we (usually) do a detection task, for difference threshold we
(usually) do a discrimination task (is this the same or different?)

- This graph shows the difference threshold (how big does the difference need to be in terms of
stimulus intensity?)
• This is an ideal case, in reality it is more
complicated than absolute threshold
- The stimuli lower and higher than the standard are
comparison stimuli (the other is standard)
• For comparison stimuli clearly smaller than the standard, you expect people to (almost)
never say that the comparison is larger than the standard (so a probability of 0)
• 50/50 chance they say that they are larger or smaller if there is only a small difference
between them
• From a certain high intensity, sufficiently above standard, they will always say it is larger
- Not a simple step function that goes from step 0 to step 50% to step 100%
- Middle region is where people cannot make distinction between comparison and standard
(interval of uncertainty), they cannot discriminate
• Divide that interval by 2 to get difference threshold
• When the IU is big, there is a big difference needed in terms of stimulus intensity
- This graph is the ideal, theoretical case
IV. Weber’s law
- In the same period or even a bit before Fechner
- We can measure the DL or the JND at different locations along the physical scale
• Is it always the same amount of difference that you need to detect a difference between
the stimuli?
- Observation: DL has no fixed value but a value in proportion to the stimulus intensity of a
standard stimulus
• Not the same absolute number, not a fixed number
• Smaller stimuli means smaller increments needed, bigger stimuli means bigger
increments needed
• If you double the weight, you also need to double the increment weight to feel a
difference
• You can apply this notion on all kinds of stimuli (weight, length …)
- Example:
• 50 g + 1 g
• 100 g + 2 g
• 200 g + 4 g

➔ So increments are in proportion to absolute stimulus level: basic notion behind Weber’s law
➔ The bigger the standard stimulus, the more needs to be added to notice a difference

- Basic rule discovered by Ernst Heinrich Weber (1795-1878) in 1834 (“De pulsu, resorptione,
auditu et tactu”)
- Most straightforward formulation: the stimulus intensity must be increased by a constant
fraction of its value in order to obtain a just noticeable difference
- This fraction is usually represented as:
• k = ΔI / I
• I for intensity and ΔI for the smallest added increment (increment of intensity level
divided by intensity level of the standard stimulus)
• → ΔI is the added intensity that leads to a JND
- → Weber fraction or Weber constant (= k, which is a proportion, not a fixed number)
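
- A minimal sketch of this proportionality (the Weber fraction k = 0.02 below matches the weight
example above and is purely illustrative):

```python
# Minimal sketch of Weber's law: the predicted JND (delta_I) grows in
# proportion to the standard intensity I, with k the assumed Weber fraction.

def weber_jnd(intensity, k=0.02):
    """Just noticeable difference predicted by Weber's law: delta_I = k * I."""
    return k * intensity

for standard in (50, 100, 200):            # grams, as in the weight example
    print(standard, "g needs +", weber_jnd(standard), "g to be noticed")
```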

- A lot of psychophysical research aims at:


• Investigating whether Weber’s law is applicable across the entire range of all sensory
modalities
• Determining the Weber fraction for all sensory modalities
- Note: it might be useful to know the upper threshold (“ceiling”): the limit above which the
subject is no longer able to perceive the differences
• At some level, the sensory organ is no longer processing the sensory dimension as such
• Switch from sensing the sensory domain as such to sensing pain
• For example: very hot (100 degrees), you don’t experience temperature anymore but just
pain, and if 101 you don’t experience the temperature anymore but also just pain, you
can’t differentiate anymore
• (harmful stimulus intensities → pain)
- If you need small fractions, you are very sensitive in that domain
• We are for example better in visual than in smell (for many animals opposite)
- Two ways to present Weber’s law graphically:
• To plot ΔI against I
▪ Rising straight line with k = slope
▪ High sensitivity : minor increase (small k)
▪ Low sensitivity: major increase (large k)
• To plot k (ΔI / I) against I
▪ Straight line that plots all I-values at the same (constant) value k, parallel with the
abscissa
▪ Weber fraction is constant over all stimuli, therefore a straight line

- S = I (stimulus intensity)
- First graph:
• You have to add more and more for
larger values, you have to add
proportionally more
• Large slope = low sensitivity, low
slope = high sensitivity
- Second graph: weber fraction itself
• You always have the same fractional
measure within a sensory dimension
• Constant over the entire stimulus intensity domain
• Again, larger slope = low sensitivity, lower slope = high sensitivity

- Usually Weber’s law applies well for the middle part of the stimulus range (flat U-function)
• So applies well at middle level, but at lower end and higher end you get some increase,
so not a real flat function but a little bit of increase at both ends (U-function)
• So there are exceptions to Weber’s law at the extreme ends of the stimulus range
- Deviation is often quite major for very low stimulus intensities
- Modifications:
• ΔI = k ( I + Ir ) with Ir a small added value
• ΔI = k · I^n with n not necessarily equal to 1
- Big differences in sensitivity:
• k = .016 for luminance
• k = .33 for sound volume

V. Weber-Fechner law
- Fechner’s major contribution was in developing methods (see before; he also wanted a mapping
between stimulus intensity and sensations)
- Psychophysical measurement requires a starting point and a measurement unit
• Ruler has a 0 point, and a certain increment (millimetre, centimetre) = measurement unit
- Fechner realized that the absolute threshold (RL, where you begin sensing the stimulus) could
be used to determine the starting point and the JND to determine a measurement unit
- This way he derived from Weber’s law (ΔR/ R = k) the so-called Weber-Fechner law:
• S = k log R
• Sensation = logarithm of stimulus intensity multiplied by k
- Not an obvious graph
• Sensation on x-axis, stimulus strength on y-axis
- You need a bit more than zero stimulation before you start having a
sensation, so the graph doesn’t start exactly at 0 of
the y-axis for 0 at the x-axis
• The starting point is higher than 0 on the stimulus axis
- Rather than adding a constant amount on the stimulus
axis, you have to multiply by the Weber fraction: the stimulus
steps have to become larger and larger to get equal steps
on the sensation axis
• Therefore the graph is not a linear shape
• We multiply on the y-axis to get equally big additions on sensation (x-axis)
- Strength of sensation = logarithm of stimulus intensity multiplied by the Weber fraction

- Significance: to increase the strength of the sensation (S) as an arithmetic sequence (summed
with a constant) one has to increase the stimulus intensity (R) according to a geometric
sequence (multiplied by a constant)
• Arithmetic = summed with a constant (x-axis)
• Geometric sequence = multiplied (y-axis)
- Indirect scaling method (see later)
• You use threshold measurements as an indirect quantitative measurement to come up
with a good scale
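
- A minimal sketch of this arithmetic/geometric property (k = 1 and the stimulus values are
arbitrary illustration choices):

```python
import numpy as np

# Sketch of the arithmetic/geometric property of S = k * log(R): doubling the
# stimulus R at every step (geometric sequence) adds a constant amount to the
# sensation S (arithmetic sequence). k = 1 is an arbitrary illustration value.
k = 1.0
R = np.array([10.0, 20.0, 40.0, 80.0, 160.0])   # each step multiplies R by 2
S = k * np.log(R)
print(np.diff(S))                               # constant steps of log(2) ~ 0.693
```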

- Basic idea of psychophysics:


• Sensation magnitude (psychology part of
psychophysics)
• Stimulus intensity (physical dimension)
• Mapping strength of stimulus intensity in the
relation to the strength of the sensations
• This mapping is the essence of psychophysics
• You can apply that idea on all sorts of dimensions
(not only sensory, but also things like aesthetic appreciation, how much you like your
partner …)

VI. Stevens’ Power Law


- Another version of that kind of function
- Weber-Fechner law (based on an indirect scaling method):
• Ψ = k log Φ
• Intensity of psychological sensation = k log physical intensity
- Stevens’ Power Law (based on a direct scaling method, “magnitude estimation”): Stevens has a
direct method, a power law (rather than a logarithmic one)
• Ψ = k Φ^n
• k times physical intensity to the power n
• Power function instead of logarithmic function
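
- A small sketch comparing the two laws on the same (made-up) physical intensities, with
illustrative values k = 1 and n = 0.5:

```python
import numpy as np

# Sketch comparing the two psychophysical laws; k = 1 and the exponent n = 0.5
# are illustrative values, not measured constants.
phi = np.array([1.0, 10.0, 25.0, 50.0, 100.0])   # physical intensity
fechner = 1.0 * np.log(phi)                      # psi = k log(phi)
stevens = 1.0 * phi ** 0.5                       # psi = k phi**n (compressive for n < 1)
print(np.column_stack([phi, fechner, stevens]))
```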
- “When description gives way to measurement, calculation replaces debate” (S.S. Stevens, 1951)
2. Psychometric function (basics): methods and procedures

I. Introduction
- Psychophysics: scientific study of the (quantitative) relations between physical stimuli and
sensations
- Classical psychophysics: threshold measurements as an
indirect scaling method
- Modern psychophysics: signal detection theory and
direct scaling methods (more broadly applicable than
sensory modalities)

- Absolute threshold (RL – Reiz Limen):


• Limit that indicates the transition between absence and presence of a sensation (i.e.
minimum stimulation needed to detect a stimulus)
- Differential threshold (DL – Difference Limen):
• The smallest added stimulus intensity that allows perceiving a just noticeable difference
(JND)

- Weber’s law (1834):


• The stimulus intensity must be increased by a constant fraction of its value in order to
obtain a just noticeable difference: ΔI / I = k

- Fechner (1860):
• The absolute threshold (RL) and the just noticeable difference (JND) can be used to
determine the starting point and the measurement unit, respectively (both necessary to
truly measure sensations):

- S = k . log R
• To increase the strength of the sensation (S) as an arithmetic sequence (summed with a
constant), one has to increase the stimulus intensity (R) according to a geometric sequence
(multiplied by a constant)
- Weber-Fechner Law (based on an indirect scaling method): Ψ = k log Φ
- Stevens’ Power Law (based on a direct scaling method, “magnitude estimation”): Ψ = k Φ^n


• This is a more direct scaling method

- Psychometric function:
• A function that describes the relationship between stimulus
intensity and probabilities of observer responses in a
classification (forced-choice) task

II. Classical psychophysical methods


- Simple method to measure perception thresholds:
• Consists in presenting a stimulus and asking subjects whether he/she perceives the
stimulus or not
• However, biological systems are subject to noise → the absolute threshold is not simply a
stimulus level below which 0% is perceived and above which 100% is perceived
- → Threshold determination requires a statistical approach
- Ideally: this forms a step function
• you can’t perceive anything and from a certain intensity you see the stimulus
- In real life it isn’t like this, it is a sigmoid function
• Not easy to find a threshold because of this
- Fechner introduced three measurement methods:
• Method of constant stimuli
• Method of limits
• Method of adjustment
- → systematic overview of psychophysical tasks and procedures

1) Main tasks

- Two main kinds of tasks can be used to determine limits:


• Adjustment: the subject has to adjust the intensity of a stimulus
▪ Non-forced choice
▪ Just ask participant
• Classification: the subject has to classify the stimuli displayed
▪ Forced choice
▪ Classify displayed stimulus
➔ To obtain more standardized measurements, it is usually better not to have the subject
determine the stimulus level him/herself (so classification rather than adjustment, which is
more precise)

- Three types of classification tasks:


• Yes/no
▪ Was the stimulus present?
▪ Single interval design → more suitable for subjective experience
• 2AFC
▪ In which interval was the stimulus present?
▪ → more suitable for performance limits
• Identification
▪ What was the stimulus?
▪ For example categorisation
▪ Single interval design → more suitable for subjective experience

a. Yes/no tasks:

- Was the stimulus present?


- Advantages:
• Very simple, low cognitive load
- Disadvantages:
• Subjects can use their own criterion to answer ‘yes’ or ‘no’ (response bias) → maybe if
you are not sure you tend to say yes, but someone else may be biased to say no
- The threshold can only be dissociated from this internal criterion if catch trials are included in
the paradigm, so that both hit rate (HR) and false alarm rate (FAR) can be measured
• Hit rate = participant said yes and there was actually something
• False alarm rate = participant said yes but there was nothing

- By influencing the criterion during the instructions (e.g. by reward/punishment), the different
criterion values of the HR and FAR can be plotted in a so-called ROC-graph (ROC = Receiver
Operating Characteristics)
• The area under the ROC-curve is an excellent measure for the
sensitivity to the signal, independent of the response bias
• So the area under the curve (above the chance diagonal) is a good
measure of sensitivity to the signal
- See further: signal-detection theory (SDT)
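
- A small sketch of such a bias-free sensitivity measure, assuming the equal-variance Gaussian
model of signal-detection theory (the HR and FAR values are made up):

```python
from scipy.stats import norm

# Sketch (hypothetical numbers): under equal-variance Gaussian SDT assumptions,
# sensitivity d' follows from one (HR, FAR) pair, and the area under the
# corresponding ROC curve equals Phi(d' / sqrt(2)).
hit_rate, false_alarm_rate = 0.80, 0.20
d_prime = norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)
auc = norm.cdf(d_prime / 2 ** 0.5)
print(round(d_prime, 2), round(auc, 2))    # sensitivity, bias-free area measure
```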

b. Two-alternative forced choice tasks (2AFC):

- For example: in which interval was the stimulus present?


- Essential: the subject is offered two stimuli instead of one
- 2 possibilities:
• Detection task: one signal and one blank
▪ Which one contains the signal? (top left)
• Discrimination task: two signals
▪ Which one contains the strongest signal? (bottom left)
- Other distinction:
• Successive: temporal 2AFC or 2IFC (top right)
• Simultaneous: side by side: spatial 2AFC (bottom right)

- Distinction between 2AFC and 2IFC:


• 2AFC: the 2 stimuli are presented together on the screen → simultaneous
• 2IFC: the 2 stimuli are presented in the same display position but in temporal order (so
not at the same time, after each other) → successive
➔ Variations of the alternative forced-choice tasks
➔ Here: examples of different methods to measure an
orientation discrimination
➔ N = number of stimuli presented on each trial
➔ So you can also have three or more stimuli, a lot of
variation is possible

- Advantages:
• By assigning two kinds of stimuli randomly to both intervals or positions, performance
can easily be compared to chance
• Under certain conditions (see SDT) the percentage of correct responses corresponds to
the area under the ROC curve, without the obligation to use the same cumbersome
procedure
- Disadvantages:
• A large number of trials is needed (especially if one wants to know the complete
psychometric function)

- Yes/no task vs. 2AFC task:


• Yes/no task: the two alternatives (present vs.
absent) are presented on separate trials
• 2AFC task: the two alternatives (e.g. right vs.
left) are presented within the same trial
- → So different trials vs. the same trial

c. Identification task:

- What was the stimulus?


- Most efficient method: presenting a limited number of stimuli and asking the subject to identify
these stimuli
- Tradeoff:
• If a small number of stimuli is used → Identification is easier to learn, but the guess rate
is higher
• If a large number of stimuli is used → The guess rate is lower, but identification is harder
to learn (higher cognitive load)
• Optimum : choosing ≈ 4 stimuli
2) Measurement methods

- 3 main methods to measure thresholds:


• Method of adjustment
▪ Method in which observers freely adjust the magnitude of a stimulus in order to
reach a criterion, for example a threshold
• Method of limits
▪ Method in which observers are presented with a series of stimuli of increasing or
decreasing magnitude, and report when the stimulus appears to change state (for example
from visible to invisible)
• Method of constant stimuli
▪ Method in which the magnitude of the stimulus presented on each trial is
selected from a predefined set

a. Method of adjustment:

- The easiest and most straightforward way to determine thresholds is to ask the subject to set
the physical stimulus value a certain way:
- For example measurement of an absolute threshold:
• → Increase the stimulus intensity until the subject detects the stimulus
- It is usually measured by alternating a number of ascending and descending sequences (left)

- Instructions are very important in adjustment tasks, because the subject controls the stimulus
level him/herself (directly or indirectly via experimenter)
- The easiest is to use matching instructions:
• Setting the test stimulus in such a way that it corresponds to a standard or a reference
stimulus (right)

b. Method of limits:

- Variation of the intensity of just one stimulus (e.g., luminance or volume) → after each
stimulus, subjects have to report whether they perceived the stimulus or not
• The threshold will be determined by alternation of a number of ascending and
descending sequences (cf. adjustment method; see Table 1)
• The mean of the stimulus intensity of the last two trials is used as a transition point
within each sequence
• The mean of the transition points within several sequences can be used as a measure of
threshold

➔ So similar to the previous method, but slightly different in the way the threshold is calculated
➔ More trial by trial (before, more constant)
- Average for finding the transition point
- Do it several times, and then the
average of all of these transition points

c. Method of constant stimuli

- The experimenter chooses a number of stimulus values around the threshold (for example
based on adaptive procedure) → each of the stimulus values (for example 5 or 7) is presented a
fixed number of times (for example 50) in random order
• So present trials of different intensity in random order
• For each of these stimulus values, a frequency can be plotted for a number of response
categories (for example ‘yes’ answers in a yes/no task or one of the two alternative
responses in 2AFC), possibly as a proportion or percentage
• Such graph is called a psychometric function

- Psychometric function:
• Usually a continuous S-shaped function
• Exact threshold determination is not trivial
- Disadvantage:
• Many trials are necessary and not all data points are
useful
• (solution: adaptive procedure; see after)
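
- A minimal sketch of how the randomized trial list for the method of constant stimuli can be built
(the stimulus values and number of repeats are illustrative):

```python
import random

# Sketch of a method-of-constant-stimuli trial list: a few fixed stimulus
# values around the expected threshold, each repeated a fixed number of times,
# presented in random order.
stimulus_values = [0.2, 0.35, 0.5, 0.65, 0.8]
repeats = 50
trials = stimulus_values * repeats
random.shuffle(trials)
print(len(trials), trials[:10])
```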

3) Function estimation

- Example: Temporal 2AFC to measure a differential threshold (temporal so successive)


- Stimulus: homogeneous light spot with 5 possible luminance values (left)
- Every trial has two intervals out of which the subject has to choose the one with the highest
brightness (right)

- If test stimulus doesn’t differ much from reference: closer to 0


• 2 points close to 0 because offered in 2 sequences/orders: first test stimulus and then
the reference or first the reference and then the test stimulus
• Performance varies gradually from chance level (50%) to perfect level
• Around the 50% level, there is an interval in which the subject is not sure of his/her
percept → Interval of Uncertainty (IU)
- Disadvantages:
• Not easy to determine the differential threshold
• Difficult to choose a performance level because there is a continuous transition and
because of some fluctuations due to measurement errors

- Necessary to estimate the true function by fitting a curve through the data points (left)
• Usually in papers you see curves like this (instead of datapoints like above)

- Example (right):
• Results of a standard two-interval forced-choice (2IFC) experiment
• The various stimulus contrasts are illustrated on the abscissa
• Black circles give the proportion of correct responses for each contrast
• The green line is the best fit of a psychometric function, and the calculated contrast
detection threshold (CT) is indicated by the arrow
▪ A contrast difference of 0.09 is needed for the participant to detect it
▪ You need a curve to see this, not just the data points

- Theoretical point of view: data are often fitted with a cumulative Gaussian distribution
• Probit analysis is a possibility (Finney, 1971)
- Reason: the internal representation of the stimulus is supposed to have a normal distribution
- The perceived difference between two stimuli is inversely proportional to the overlap
between both normal distributions (Z-scores):
• Good stimulus representation (small σ) → steep slope
• Less good stimulus representation (larger σ) → shallower slope
- Much more variance so much more uncertain of your perception
• The closer to the ideal step function, the more precise you are (and so more precise
perception) → smaller standard deviation

- If a discrete threshold needs to be determined, an arbitrary choice has to be made (JND = IU/2)
• So not always easy to determine threshold, and arbitrary in this case because you as
experimenter say where the threshold is
- 2 frequent choices (pro’s and con’s for each):
• Half the distance along the abscissa between the 20% and 80% points on the ordinate
▪ Advantage: the chosen threshold value is in accordance with one SD of the
underlying gaussian distribution
• Half the distance along the abscissa between the 25% and 75% points on the ordinate
▪ Advantage: the criterion is situated halfway between chance (50%) and perfect
performance (0% and 100%)

- Point of subjective equality (PSE): The physical magnitude of a stimulus at which it appears
perceptually equal in magnitude to that of another stimulus
• 50% point of subjective equality → usually, not situated on the 0- value (because of the
response bias)
• There is some response bias of the participant so the curve will be a bit shifted
• Green = 360, blue = 290 (estimated)
- Example: at which point does the participant consider the face
angry, depending on the color of the face? (data from 1
participant)
• Red they perceived much sooner as angry, they say more
easily they are angry
• Blue they perceived much sooner as fearful, they say more
easily they are fearful
- → So you can shift the PSE with the different colors

- Alternative functions:
• Logistic function (small difference, but faster)
• Weibull function (larger difference)
- The choice of the stimulus levels is very important for threshold determination:
• Points around 20% and 80% of correct answers are the most informative, while the 0%
and 100% points are theoretically redundant
• In practice, it is useful to include the easy conditions
to keep the subjects motivated
• So the ends are not informative, but they keep them
for motivation, the middle part is the important and
informative part

III. Adaptive psychophysical methods


- Motivation: classic psychophysical procedures need a lot of trials for which the data are not
informative
• Adaptive test procedures can be used to focus more quickly on the interesting stimulus
levels
- Essential:
• The stimulus level is chosen based on the performance of the subject on previous trials

- Three main categories of adaptive procedures: depending on:


• The strategy to choose the stimulus levels
• The way to determine the threshold
- Three categories:
• Adaptive staircase procedures
• PEST procedures
• Maximum-likelihood adaptive procedures
1) Adaptive staircase procedures

- An adaptive variant of the “staircase”, as described above (method of limits)


- Start with an arbitrary (but large enough) stimulus value, then:
• Correct answer: lower the stimulus value with a fixed increment
• Wrong answer: raise the stimulus value with a fixed increment
• You start with an easy one, and you lower until they don’t perceive the stimulus
anymore, and then you go back up, until they again perceive it, and then dropping
again… (see left figure)
- Usually 6 to 9 turning points are used to estimate the threshold (usually the mean stimulus
intensity at the turning points)
- Advantage:
• More efficient because most of the stimulus values are centered around a threshold
- Disadvantage:
• Subjects can figure out the modifications (and the structure of the task)
- Solution: two “staircases” mixed → = “interleaved staircases” (see right figure)
• It is possible to randomize both “staircases” instead of systematically mixing them
• So you start with an easy stimulus and a stimulus that you can’t see (difficult trial first),
and they both converge to a certain threshold

- Disadvantage:
• Threshold converges to 50% correct (probability of a correct answer = probability of a
wrong answer); which is too low for some purposes
• 50% because you converge to the point where correct and incorrect answers alternate (they
are close to each other), so we need a strategy to converge to a level above 50%
- Levitt (1971) has developed a general transformation procedure to acquire specific values on
the psychometric curve through an adaptive “staircase” procedure
• Idea: to get a higher performance level, the subject needs to give several (successive)
correct answers before the stimulus level is lowered:
▪ E.g. “two-down, one-up” procedure converges to 70.7%
▪ E.g. “three-down, one-up” procedure converges to 79.4%
- The adaptive staircase procedure is the most frequently used adaptive test procedure because
it is the most straightforward
- Advantages:
• It is easy to choose the next stimulus level, increment, stop criterion and threshold
(adaptive staircase → PEST)
• No assumptions with respect to the shape of the psychometric function (non-parametric
→ maximum-likelihood)
▪ Only assumption: monotonic link between the stimulus level and the
performance level
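
- A minimal sketch of such a transformed staircase (the simulated observer, step size and
stopping rule are hypothetical choices):

```python
import math
import random

# Sketch of a "two-down, one-up" transformed staircase (converges to ~70.7%
# correct). The simulated observer is hypothetical: its probability of a
# correct 2AFC response rises from 0.5 to 1 around a true threshold of 0.5.

def observer_correct(level, threshold=0.5, slope=10.0):
    p = 0.5 + 0.5 / (1.0 + math.exp(-slope * (level - threshold)))
    return random.random() < p

level, step = 1.0, 0.05
streak, reversals, last_direction = 0, [], None
while len(reversals) < 8:                  # stop after 8 turning points
    if observer_correct(level):
        streak += 1
        if streak < 2:                     # need two correct in a row to go down
            continue
        streak, direction = 0, -1
    else:
        streak, direction = 0, +1          # one wrong answer -> go up
    if last_direction is not None and direction != last_direction:
        reversals.append(level)            # record turning point
    last_direction = direction
    level = max(0.0, level + direction * step)

print("threshold estimate:", sum(reversals) / len(reversals))
```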

2) PEST procedures

- PEST: Parameter estimation by sequential testing


• Designed to address the problem of step size and starting intensity
- Idea: a property of some signal (the independent variable) is adjusted to find the magnitude
that results in a performance of specified accuracy (the dependent variable)
- Essential: more complex algorithm for the search of the threshold by:
• Changing the criterion for a change of stimulus intensity
• Changing the increment as the number of trials increases

- Concretely: a statistical test is used to check whether the performance level up to a given
moment is higher or lower than the target performance level
• Increment halves with every subsequent
direction change
• Not just using criteria (for example two correct
and then lower one)

- Threshold: e.g. stimulus intensity at the end of the sequence


- Criterion for the efficiency of the procedure: product of the number of trials and the variance of
the measurements (Taylor & Creelman, 1967)
• E.g. Taylor and Creelman tested participants' ability to detect a sound by varying its
loudness until participants detected it on 75% of the times it was played
• → For PEST, this efficiency is usually 40 to 50%

3) Maximum-likelihood procedures

- With the previous methods, subsequent changes in stimulus intensity rely on the subject’s
previous two or three responses
- Main idea of the maximum likelihood method:
• The stimulus intensity presented at each trial is determined by statistical estimation of
the subject threshold based on all responses from the beginning of the session
• At each trial, a psychophysics function is updated and then serves to select a stimulus
intensity for the upcoming trial
- So using that fit of the function you will chose the intensity of the stimulus on the next trial
• Fitting the data on the previous data you have accumulated
- Maximum-likelihood adaptive procedure:
• Estimation procedure in which the best-fitting model is defined to be that which
maximizes the likelihood function
- Essential:
• Performance up to a given moment is used to estimate as well as possible the entire
psychometric function
• There are assumptions with regard to the shape of the psychometric function
(parametric)
- Likelihood:
• The probability with which a hypothetical observer characterized by assumed model
parameters would reproduce exactly the responses of a human observer
• The likelihood is a function of parameter values

- Graph:
• No very clear criteria (steps are not always the same)
• It converges more quickly to a threshold

- “Best PEST” (Pentland, 1980) is an example of a maximum-likelihood method


• After each trial, the likelihood function is calculated based on the responses to all
previous trials
• A value for the threshold parameter is estimated using a maximum likelihood criterion
• The stimulus intensity to be used on the next trial corresponds to the threshold estimate
determined from all previous trials
- The best PEST assumes a specific form of the psychometric function (logistic function) and
estimates only the threshold parameter of the psychometric function
• Corresponds to the best possible estimation of 50% performance point (according to
“maximum-likelihood” criterion)
• Final threshold: last stimulus value

- Best PEST:
• Lower the stimulus intensity after a correct
response
• Raise the stimulus intensity after an incorrect
response
• Step sizes are not fixed (and tend to decrease)
• 2nd trial: always the highest/lowest, given
the range set by the experimenter
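
- A minimal sketch of such a loop (assuming a logistic psychometric function with a known slope;
the simulated observer, threshold grid and parameter values are hypothetical):

```python
import numpy as np

# Sketch of a best-PEST-style loop: after every response the threshold of an
# assumed logistic psychometric function (fixed slope) is re-estimated by
# maximum likelihood over all trials so far, and the next stimulus is placed
# at that estimate.
slope = 10.0
grid = np.linspace(0.0, 1.0, 101)        # candidate threshold values
true_threshold = 0.4                     # hypothetical observer threshold
levels, responses = [], []
level = 1.0                              # start with an easy stimulus

for trial in range(40):
    p_correct = 1.0 / (1.0 + np.exp(-slope * (level - true_threshold)))
    responses.append(np.random.rand() < p_correct)
    levels.append(level)
    # log-likelihood of every candidate threshold, given all responses so far
    p = 1.0 / (1.0 + np.exp(-slope * (np.array(levels)[:, None] - grid)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    loglik = np.sum(np.where(np.array(responses)[:, None],
                             np.log(p), np.log(1 - p)), axis=0)
    level = grid[np.argmax(loglik)]      # next trial at the current ML estimate

print("final threshold estimate:", level)
```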

- QUEST (Watson & Pelli, 1983): Quick Estimate by Sequential Testing
• QUEST is the Bayesian version of best PEST
• Using QUEST, all information is used to determine the next stimulus level by means of
Bayesian methods (i.e. new data serve to adjust our pre-existing knowledge or beliefs
regarding the value of the threshold parameter)
- For a sequence of stimuli of a definite length, the QUEST-procedure is more efficient (e.g., 84%
as opposed to 40 to 50% for PEST)
- NB: In recent years, better and more efficient procedures have been developed (MUEST, ZEST, etc.)
- Before a Quest run: the researcher postulates a prior probability distribution (mean and
precision) which reflects his/her belief about the threshold
• The prior has the effect of curbing the excessive stepsizes that were observed in the best
PEST run
- Prior: guide to the selection of stimulus intensities, and plays a smaller and smaller role relative
to the contribution of the data collected over the trials
- So the more data accumulated, the more you will rely on the actual data, and less the
hypothesis of the researcher
• In the beginning you rely mostly on what experimenter estimated (his hypothesis, the
prior), later and later you will rely more on the actual data

IV. Conclusion
- Important to keep in mind the main procedures and methods, as well as the advantages and
drawbacks of each method
3. The psychometric function: advanced aspects

I. What are psychometric functions?


- Interested in mental, psychological behavior, and relate it to some physical stimulus
• Mapping from physical stimulus to psychological behavior = psychometric function
- The psychometric function is a summary of the relation between performance in a classification
task (such as the ability to detect or discriminate between stimuli) and stimulus level/intensity
- We need a set of statistical techniques that allow precise quantitative modelling of
psychophysical data
- 3 steps of modelling data:
• Collect psychophysical data
• Estimate model parameters from data
• Model evaluation and criticism (goodness-of-fit, systematic prediction errors, confidence
regions, bias, ...)
- Summary: reduce to a certain set of parameters that reflect what you are interested in

II. What are psychophysical data?


- Behavior when we are manipulating some kind of physical stimulus
- Different kinds of paradigms
- For example: 2-AFC method of constant stimuli in a contrast detection task
• Look at what the minimum intensity is to see the grating stimulus
• Manipulated: intensity of the stimulus
- What is stimulus intensity in this example? (independent variable)
• Contrast of the grating
- What is measured in this example? (dependent variable)
• Left or right
- What will the data look like?

- This is the raw data:


• Contrast values from 10% to 90% contrast
• Percentage correct, increases from 0.4 to 1
- 2AFC: when you present no stimulus people will be as likely to
say left or right (50/50)
- You will only be able to approximate/get an estimation of the
true value over different trials

- Visualisation of the data


- Physical intensity on x-axis, summary of behavior on y-axis
- Summarize the data set as best as we can in a way that also is
meaningful
- What do we learn from the data?
• Response thresholds:
▪ Which stimulus intensity is required to produce a certain performance level? (e.g.
75% correct)
▪ Is motivated by the experimenter
• Rate of improvement as stimulus intensity increases
▪ How quickly do people improve when intensity becomes higher?
- These are two measures that can be derived from the psychometric function

- This is the actual fitted function →


- !Important!
- Psychometric function fitting is much more than fitting a
sigmoidal curve through a number of data points
• Minimising error, but based on a statistical model
• Find the intercept and slope at which the distance between
the graph and the data points is minimal
• But you put different weights to different datapoints, so it
is not just fitting a curve

III. The psychometric function as a psychophysical model


- Psychophysical model of the psychological mechanism and stochastic process underlying
psychophysical data
- Relatively simple, providing a performance description but no explanation of the signal-to-noise
ratio of the psychological mechanism
- Of course, comparing performances in different experimental conditions can lead to an
explanation
• Example?

- Formalizing the psychometric function Ψ:


• Suppose we have K signal levels (independent variable) yielding a vector x = (x1, x2, …, xK)
▪ In our example the K signal levels = the different contrast levels
• For each signal level we have a proportion of correct responses (dependent variable),
yielding a vector y = (y1, y2, …, yK) of K proportions
• If we have ni trials for each signal level xi, the number of correct responses is equal to yi·ni
▪ We rely on a number of trials we collected (just one trial is less informative than
for example 100, more trials = more information)
- The psychometric function Ψ(x) relates an observer's performance to an independent variable,
usually some physical quantity of a stimulus in a psychophysical task

- As a general form, we define:


• Ψ(x|α, β, γ, λ) = γ + (1 − γ − λ)F(x; α, β)
- The shape is determined by the parameter vector θ = (α, β, γ, λ) and the choice of F
• F is a sigmoidal function of stimulus intensity x ranging from 0 to 1 (e.g., cumulative
Gaussian, logistic or Weibull, different kinds of sigmoidal functions you can use, each
with a different theoretical motivation) with mean α and standard deviation β, describing
the performance of the underlying psychological mechanism
• γ specifies the guess-rate, i.e. performance when no stimulus information is available
(task-specific, so depends on the task, 50% in 2-AFC), is the
distance from x-axis to where the line starts
• λ specifies the rate of lapsing, i.e. stimulus-independent
errors
▪ if the lapse is 3%, your curve will asymptote at
97%, because even if the stimulus is very clear, a
participant can still make errors because for
example he just blinked and missed the stimulus
▪ so you won’t get the proportion correct of 1 but
for example of 0.97 or 0.98
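
- A minimal sketch of this general form, with F a cumulative Gaussian (the parameter values are
the illustrative ones used in the simulated example below):

```python
import numpy as np
from scipy.stats import norm

# Sketch of Psi(x | alpha, beta, gamma, lambda) = gamma + (1 - gamma - lambda) * F(x; alpha, beta),
# with F a cumulative Gaussian (mean alpha, spread beta).

def psi(x, alpha=0.5, beta=0.1, gamma=0.5, lam=0.0):
    return gamma + (1.0 - gamma - lam) * norm.cdf(x, loc=alpha, scale=beta)

x = np.linspace(0.1, 0.9, 9)
print(np.round(psi(x), 3))     # rises from the guess rate gamma toward 1 - lambda
```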

- Throughout this class we rely on simulated data ( α = 0.5; β = 0.1; γ = 0.5; λ = 0; F is cumulative
Gaussian)
• Bell shape: two different parameters:
▪ The mean, which is the centre (alpha value), the location
▪ The width of the distribution, summarizes where bulk of density is located (beta)
• Cumulative gaussian is sigmoidal, and alpha will be locating where the direction of
acceleration switches from sign, and beta says how steep the curve is
▪ If you play with the location (alpha), the sigmoidal will also move left/right
▪ The curve ranges between 0 and 1

IV. Parameter estimation


- For which values of α, β, γ, λ does the psychometric function Ψ provide a good description of
the data?
- First graph:
• Alpha is quite far off, misfit → fit of
threshold is bad (but will rarely happen
with estimation software)
- Second graph:
• We think the function is shallower than it
actually is, although the alpha is good, the
threshold is located good, so also misfit →
underestimate rate of improvement
because slope misfit

- The green one was used to generate the data


- With computer generating you are always a little bit off, but that is not too
bad (green differs only a little bit from the red)
• It spreads a bit around the real data

- What is a “good” description?


• We need a currency to assess deviations from the model predictions
• Deviations from the model prediction can be due to randomness or
noise (small sample sizes) and do not mean that Ψ fails to describe the performance of
the underlying psychological mechanism
- With the first point we have a big deviation, with the others the deviation
is smaller (see figure)

- Psychophysical responses are the result of a stochastic process, meaning that these responses
are to some extent random instead of deterministic
- The amount of randomness depends on the amount of stimulus
information and amount of trials per block
- The psychometric function model contains specific assumptions to
capture the stochastic process

- We assume that psychophysical trials are Bernoulli processes:


• The outcome is either 0 or 1 (you can view it as a coin flip, but heads and tails don’t have
always equal proportions)
• There is an underlying probability to get a correct answer (probability to get a correct
answer is different depending on the stimulus information)
- For a single trial Bernoulli distribution with only 0 and 1, but in experiment binomial, because
we don’t have a single answer but a certain number of correct answers on a number of trials
- For the n trials at the same stimulus intensity x, the number of correct answers z is then
binomially distributed:
• f(z) = C(n, z) · p^z · (1 − p)^(n−z)
- Substituting the proportion correct y = z/n for the number of correct answers yields the same
formula expressed in the proportion correct instead of the number correct:
• f(y) = C(n, ny) · p^(ny) · (1 − p)^(n(1−y))

- We assume an underlying process that gives us the probability of getting a correct response,
which is dependent on the number of trials

- However, after an experiment we already know the data y, and we are interested in estimating
p, thus f(y) is not really interesting
• When we do fitting, we already have the data so the concept of probability is not really
appropriate anymore (because you either have observed it or not)

- We are interested in the “probability” of p conditional on the data we observed:


• That is why we are interested in likelihood rather than probability

- This is also known as the binomial likelihood function (≠ probability!)


• The likelihood of a particular parameter, having a certain value and given the data we
have, so we know the proportion correct and number of trials it is based on, and knowing
this you want to know the likelihood of having a certain number of probability correct
• !!So the formula is the same but the interpretation isn’t!!
• If you evaluate the likelihood of any value between 0 and 1, you get a curve of
likelihoods, and a value of p associated with the highest likelihood will be the best
- In a psychophysical experiment we typically have a vector y = (y1, …, yK) of proportions of correct
responses with N = n1 + … + nK trials in total
• We don’t have one proportion correct but K proportions correct with a particular number
of trials
- If we assume all trials to be independent sets of Bernoulli trials and all blocks of trials to be
independent, then the joint likelihood is the product of the individual likelihoods:
• Joint likelihood: multiply individual likelihood functions (from 1 to K which means
different stimulus levels)
• For a particular value of p, what is the likelihood we have observed this set of data?

- As said earlier, the probability of a correct response is specified by the psychometric function
Ψ: Ψ(x|α, β, γ, λ) = γ + (1 − γ − λ)F(x; α, β) with parameter vector θ = (α, β, γ, λ)
- The likelihood thus becomes:
• L(θ; y) = ∏(i = 1..K) C(ni, ni·yi) · Ψ(xi; θ)^(ni·yi) · (1 − Ψ(xi; θ))^(ni(1−yi))
- And only depends on θ (and the monotonic function F we choose)

- For reasons of convenience, we typically optimise the log-likelihood


- As you can see, the log has the nice property that products become sums
- The log-likelihood l(θ) is:
• l(θ; y) = Σ(i = 1..K) [ ni·yi · log Ψ(xi; θ) + ni(1−yi) · log(1 − Ψ(xi; θ)) ] plus a constant that
does not depend on θ
➔ In words: a loaded coin flip is assumed to underlie psychophysical responses (loaded since the
probability of heads and tails is not the same)
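
- A minimal sketch of maximizing this log-likelihood numerically (γ = 0.5 and λ = 0 are held fixed,
F is a cumulative Gaussian, and x, y, n are hypothetical data):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of maximum-likelihood estimation of alpha and beta under the binomial
# model described above (gamma = 0.5, lambda = 0, F = cumulative Gaussian).
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])        # stimulus levels
y = np.array([0.50, 0.55, 0.80, 0.95, 1.00])   # proportions correct
n = np.array([40, 40, 40, 40, 40])             # trials per level

def negative_log_likelihood(params):
    alpha, beta = params
    p = 0.5 + 0.5 * norm.cdf(x, loc=alpha, scale=abs(beta))
    p = np.clip(p, 1e-9, 1 - 1e-9)             # avoid log(0)
    z = y * n                                  # number of correct responses
    # binomial coefficient omitted: it does not depend on the parameters
    return -np.sum(z * np.log(p) + (n - z) * np.log(1 - p))

fit = minimize(negative_log_likelihood, x0=[0.5, 0.1], method="Nelder-Mead")
print(fit.x)                                   # maximum-likelihood alpha, beta
```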

- Expected variability:
• For a contrast value, you can have a lot of variation

- How extreme (unlikely) a probability of success equal to 85% correct is all depends on the
amount of variability to be expected given n and p
• On the left the green point is not unexpected, but on the right it is very unexpected
- Variation of the binomial distribution:
• Function of n and p
• Likelihood depends on n and p
- The fact that it depends on n and p is because in a binomial,
the variance of the distribution (of correct responses or
proportion) varies with the true probability of success
- In this case:
• Variance peaks at 0.5, expected variability is higher at
0.5 than at 1
• If probability of heads is 1, you basically always observe
heads, there is no variability
• So variance fluctuates with probability correct, variance is a function of n and p
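
- A small numerical sketch of this point (n = 40 trials per block is an illustrative choice):

```python
import numpy as np

# Sketch: for a binomial proportion the standard deviation is sqrt(p*(1-p)/n),
# so the expected spread of data points is largest at p = 0.5 and zero at p = 1.
p = np.array([0.5, 0.75, 0.9, 1.0])
n = 40
print(np.round(np.sqrt(p * (1 - p) / n), 3))
```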

- The model "knows" how much variability to expect given n and p by assuming binomial
variability
- We can use the model to generate a number of data sets using a specific Ψ as a generating
function

- The two empirical data points (colored squares) deviate equally from the model predictions, but
the likelihood of Ψ governing the observer’s responses is much lower for the green square,
assuming binomial variability and a specific amount of trials
• A true psychometric function, linked to binomial distribution
• Variability of data points decreases when probability correct
increases (when we go to 1)

- Parameter estimation:
• We choose the values of α, β, γ, λ for which the overall likelihood is maximal = maximum-
likelihood estimation
• Let’s map out the likelihood surface for different combinations of α and β assuming a
specific psychometric function Ψ ( α = .5, β = .1, γ = .5, λ = 0) and a binomial process
underlying the psychophysical responses
- Example:
• Z = log likelihood function
• The estimated values for α and β are: 0.535 and 0.084
• The maximum-likelihood estimator 𝜃̂ of θ is that for
which l(𝜃̂; y) ≥ l(θ; y) holds for all θ

V. Model evaluation and criticism

- Goodness-of-fit assessment and model checking:


• Likelihood maximization yields the best-fitting parameter combination for the model
under consideration
• But what is the likelihood that the model under consideration is actually underlying the
data? In other words, is the model "true" ?
- How can the model fail?
• The failure of goodness of fit may result from failure of one or more of the assumptions
of one’s model:
▪ Inappropriate functional form (function doesn’t maps well on the data)
▪ Assumption that observer responses are binomial may be false (e.g., serial
dependencies)
▪ The observer’s psychometric function may be nonstationary (e.g., learning or
fatigue)
• Results in overdispersion or extra-binomial variation:
▪ Data points are (significantly) further away from the fitted curve than expected
- Model failure examples (see figure):

- Assessing overdispersion: deviance


- In maximum-likelihood parameter estimation the parameter vector 𝜃̂ returned by the
estimation routine is such that L(𝜃̂; y) ≥ L(θ; y)
• Thus, whatever error metric Z is used to assess goodness-of-fit, Z(𝜃̂; y) ≥ Z(θ; y) should
hold for all θ
- The log-likelihood ratio deviance is a monotonic transformation of likelihood and therefore
fulfills this criterion
• Deviance is defined as: D = 2 [ l(θmax; y) − l(θ̂; y) ]
• Deviance: comparison, compare likelihood of best estimate with θmax (saturated model),
deviance compares likelihood of saturated, perfect model with likelihood of fitted model
- Where θmax denotes the parameter vector of the saturated model without residual error
between empirical data and model prediction
- The reason why deviance is preferred as a goodness-of-fit statistic is that, asymptotically, it is
distributed as χ² with K degrees of freedom, where K denotes the number of data points (blocks of trials)

- Fitted model (𝜃̂) vs. saturated model (θmax):


- Deviance failed models:
• We have data and our best estimate, and
deviations from that best estimate
• For each new data set we can calculate the
deviance
• If original fit compared to data is not good,
we will see that deviance for this will be
much higher

- Monte Carlo generation of the deviance distribution:


• Why?
• What distribution of deviance values would we expect, for a given n, if the model is true?
▪ 1: Obtain the best fitting psychometric function Ψ
▪ 2: Generate a large number of new datasets using Ψ as the generating function
and assuming binomial variability
▪ 3: Calculate deviance for each simulated dataset
▪ 4: Compare the empirical deviance to the distribution of simulated deviance
values (compare observed to simulated deviance values)
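
- A minimal sketch of these four steps (the generating function and data below are the
illustrative ones used throughout, not real results):

```python
import numpy as np
from scipy.stats import norm

# Sketch of the Monte Carlo deviance test: psi_hat stands in for the
# best-fitting psychometric function (illustrative parameters: alpha = 0.5,
# beta = 0.1, gamma = 0.5, lambda = 0); x, n and y_obs are hypothetical data.
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
n = np.array([40, 40, 40, 40, 40])
y_obs = np.array([0.50, 0.55, 0.80, 0.95, 1.00])
psi_hat = lambda s: 0.5 + 0.5 * norm.cdf(s, loc=0.5, scale=0.1)

def deviance(y, p, n):
    # 2 * [loglik(saturated model, p_i = y_i) - loglik(fitted model, p_i)]
    y = np.clip(y, 1e-9, 1 - 1e-9)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return 2 * np.sum(n * (y * np.log(y / p) + (1 - y) * np.log((1 - y) / (1 - p))))

p = psi_hat(x)
simulated = np.array([deviance(np.random.binomial(n, p) / n, p, n)
                      for _ in range(10000)])
p_value = np.mean(simulated >= deviance(y_obs, p, n))
print(p_value)      # small p-value -> the fitted model is unlikely to be true
```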

- Deviance simulated model:


• The null hypothesis that the model under consideration underlies the data cannot be
rejected (p = 0.457)
• Left: dataset from start of class, generated according to the rules of the model
• The distribution of deviance values range from 0 to 25
• If we think this model is true, we expect this blue distribution of values, and the one we
observe falls nicely into this, so there is no weird stuff going on, it is in line with the
expectations

- Deviance “failed” model 1 (left)


• It is unclear whether the data were generated by the best-fitting psychometric function
under the assumption of binomial variability (p = 0.063)
• Particular part of function is deviating, and we see this reflecting in the deviance values
• The distribution of deviance values is similar
- Deviance “failed” model 2 (right)
• It is unlikely that the data were generated by the best-fitting psychometric function
under the assumption of binomial variability (p = 0)
• Deviances here are really far off what we expect
• Is the model generating data consistent with the assumption that the model is true?
- Too good to be true (underdispersion)
• The agreement between data and fit is too good under the assumption of binomial
variability (p = 1)

- Assessing systematic prediction errors:


- Are we finished when the deviance seems ok?
- Frequently it is helpful to examine the residuals directly, rather than just a summary statistic
such as deviance
• Each deviance residual di is defined as the square root of the deviance value for
datapoint i in isolation, signed according to the direction of the arithmetic residual yi – pi
• For binomial data in the case of psychometric function fitting this results in:
di = sign(yi − pi) · √( 2 ni [ yi log(yi/pi) + (1 − yi) log((1 − yi)/(1 − pi)) ] )
• With pi = Ψ(xi; θ)
- This is a very global measure, summary in a single measure of what we just have seen
- Difference between model prediction and observed data, take the sign of it (positive or
negative), compare predicted and observed proportion, if they are close to each other deviance
gets smaller
• We assess the sign, (positive or negative), then compare proportion, and then weigh with
number of trials
• If number of trials is big the weight is higher

- No clear problem according to global goodness-of-fit test (left)


- The correlation r between deviance residuals and model predictions reveals systematic deviations (right)
- Monte-Carlo simulation indicates that such a high correlation is unlikely to occur when the
model is true (but, beware for the opposite!)
- Deviance residuals can also be used to show the presence of perceptual learning
• Analysis is the same, but now the correlation coefficient is calculated in function of
temporal order in which the blocks were run

- Importance of goodness-of-fit tests


• The importance of goodness-of-fit tests cannot be stressed enough, given that the
estimated values of threshold and slopes, as well as the estimated confidence regions are
usually of very limited use if the data do not appear to have come from the hypothesized
model

- Obtaining confidence regions:


• The same Monte-Carlo methods can be used to obtain
confidence intervals on the estimated parameters =
parametric bootstrap (procedure is very similar)
▪ 1: Obtain the best fitting psychometric function Ψ
▪ 2: Generate a large number of new datasets using Ψ
as the generating function and assuming binomial
variability
▪ 3: For each dataset, refit Ψ
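
- A minimal sketch of this parametric bootstrap, reusing the illustrative generating function and
hypothetical trial numbers from before:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of the parametric bootstrap: refit alpha and beta to many datasets
# generated from the best-fitting function (again the illustrative function
# with alpha = 0.5, beta = 0.1, gamma = 0.5, lambda = 0).
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
n = np.array([40, 40, 40, 40, 40])
gen = lambda s, a, b: 0.5 + 0.5 * norm.cdf(s, loc=a, scale=b)

def refit(y):
    def nll(params):
        p = np.clip(gen(x, params[0], abs(params[1])), 1e-9, 1 - 1e-9)
        z = y * n
        return -np.sum(z * np.log(p) + (n - z) * np.log(1 - p))
    return minimize(nll, x0=[0.5, 0.1], method="Nelder-Mead").x

p_true = gen(x, 0.5, 0.1)                              # step 1: fitted psi
estimates = np.array([refit(np.random.binomial(n, p_true) / n)   # steps 2-3
                      for _ in range(500)])
print(np.percentile(estimates, [2.5, 97.5], axis=0))   # 95% CI for alpha, beta
```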

- Assessing bias in estimated parameters:


• Are we good when:
▪ Deviance seems ok
▪ No systematic prediction errors
▪ Narrow confidence regions
• Using Monte-Carlo methods to generate a large number of datasets using a specific Ψ as
the generating function and assuming binomial variability
• Since we know the "true" Ψ, we can determine how well the estimated parameters
approximate the true parameters
• So fitted model seems okay, everything will look fine for this model, but if you look at
true values you will see a systematic difference with the estimated values

- Importance of λ
• In this figure you can see the importance of having a lambda parameter
• If this is non-zero, it will influence quality of estimates if you assume it is 0 in your fitting
procedure
• Data point no longer at 1 but slightly lower because person has lapsed, but we keep
lambda fixed at 0, you see that estimation is strongly affected by that one data point
▪ This will result in difference of estimated threshold, you will never be able to
obtain true threshold because model is biased in lambda
• So it will influence the quality of your estimates when you assume lambda = 0

VI. Some concluding remarks


- Ψ versus explanatory models:
• The 2-AFC method of constant stimuli allows us to model the stochastic process
underlying psychophysical responses quantitatively
• The statistical techniques described in this presentation can be easily used for more
complex explanatory psychophysical models
• We simply use these models instead of Ψ to specify the success probability p, assuming the
same stochastic process
- So not really an explanatory model
• Doesn’t tell us necessarily how for example the visual system works

- 2-AFC versus method of adjustment:


• Each adjustment (e.g., left or right) can be seen as one two-alternative forced-choice
response based on a sample from a Bernoulli process
• Amount of adjustments is typically uncontrolled and samples are not independent
▪ Therefore, the binomial distribution cannot be assumed and you lose
considerable statistical power
- This whole procedure is optimised for a 2AFC method
• If method of adjustment you violate the independence between stimuli
• You can’t model data of method of adjustment with this

- Method of constant stimuli versus adaptive procedures:


• Similar objections can be made against adaptive procedures such as staircases
• The stimulus on any given trial depends by definition on the response(s) on the previous trial(s)
• Sequence of Bernoulli random variables is again not i.i.d. distributed, the binomial
distribution cannot be assumed
- Again, you violate independence assumption and you can’t use this method
4. Psychophysical scaling

- Scaling is the assignment of objects to numbers according to a rule

I. Introduction: definition and goals


- Psychophysics is all about understanding the relationship between an external stimulus
(physical is easy to measure) and the resulting sensations and perceptions (mapped through
some internal representation)
- Psychophysical scaling refers to the process of quantifying mental events
• Distances in stimulus space are mapped onto distances in psychological space
- The physical and psychological spaces can be uni- or multidimensional
- Psychophysical functions refer to mathematical relations between physical and psychological
scales
- Examples psychophysical scaling:
• Weber’s Law
• Psychophysical function

II. A brief history of scaling


- Fechner was actually predated by the mathematician Daniel Bernoulli
who proposed a function of decreasing marginal gain for the
psychological value of economic goods
• For example: when a poor person gets €5 his happiness will rise
much more than a rich person who gets €5
- Fechner wanted to support his logarithmic law through
discrimination data
- He assumed that the JND was a useful unit for sensation
• Every JND constitutes a constant increase in sensation
magnitude (1 JND = 1 unit on the sensation axis)
• Non-linear way to go from physical to psychological
- The logarithmic nature of his law stems from Weber's observations that the physical JND scales
with stimulus intensity
- Example of indirect scaling: sensation magnitudes critically depend on the theory that specifies
how magnitudes map onto discrimination data
• For example: scaling on discrimination data

- Direct scaling methods developed early as well


- Plateau (1872) and Delboeuf (1873) used partition methods (see further)
• They did not find evidence for a logarithmic function
- Merkel (1888) proposed doubled stimuli
- Sensation magnitude is never assessed directly: Direct and indirect scaling methods are just
means to infer it, based on empirical data and underlying theory
• Sensation magnitude isn’t directly measurable
• It depends on what the participants report

- What kind of scale do we want? → there are different scale types:


• Most scaling attempts aim to arrive at interval or ratio scales
• Empirical procedures that produce scales provide no guarantee of arriving at the scale type
that is intuitively aimed for (e.g., asking observers to double a stimulus does not
guarantee a ratio scale)
▪ Ratio: e.g. when a sound goes from 5 to 10, is it perceived as twice as loud?
• Ordinal: can be used as a psychophysical scale
▪ For example, dividing people into age groups
▪ You can say that one group is older than the other, but you don’t know individual
ages, and you can’t calculate a mean or variance

III. Scaling by discrimination methods


- Rationale: "differences between sensations can be detected, but their absolute magnitudes are
less well apprehended" (Luce & Krumhansl, 1988, p. 39)
- Inferring sensation magnitudes from "proportion greater than" judgments
• You can’t ask people how strong the sensation was, but you can ask if one sensation was
stronger than the other

1) Fechnerian discrimination scales

- Discriminatory ability increases as the difference between psychological magnitudes increases


- Fechner himself relied on the JND to construct scales of sensation magnitude (JND’s define
pairs of stimuli that are equally discriminable)
- Combination of assuming that one JND equals a unit increase in sensation magnitude and
Weber's law gives the logarithmic law
- Weber's law does not hold across the entire physical scale, thus Fechner's assumptions are not
completely valid
• Weber’s law only holds for the middle part of the physical scale, not for the extremes
- JND’s can still be used, by calculating them as a function of stimulus intensity (rather than relying
on Weber's law); see the sketch below
• The bigger the JND on the physical scale,
the smaller the slope of the psychophysical function
- Example:
• Dol scale for the perception of pain (one
dol = two JND’s)
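A minimal sketch of the idea above (illustrative, not from the course materials): stack JNDs on top of the absolute threshold, counting each JND as one unit of sensation; if Weber's law is assumed for the JND, the resulting scale grows logarithmically with intensity. The Weber fraction and threshold values below are arbitrary.

```python
# Minimal sketch (illustrative assumptions): a Fechnerian discrimination scale
# built by stacking JNDs. Each JND counts as one unit of sensation. With
# Weber's law (JND = k * I) the scale grows logarithmically with intensity.
import numpy as np

def jnd(intensity, weber_fraction=0.1):
    """Size of the JND at a given intensity (Weber's law assumed)."""
    return weber_fraction * intensity

absolute_threshold = 1.0   # arbitrary units
n_steps = 30

intensity = absolute_threshold
scale = [(intensity, 0)]                 # (physical intensity, sensation in JND units)
for step in range(1, n_steps + 1):
    intensity += jnd(intensity)          # climb one JND on the physical axis
    scale.append((intensity, step))      # one extra unit on the sensation axis

# Sensation grows as log(intensity / threshold) / log(1 + weber_fraction)
for i, s in scale[::10]:
    print(f"intensity = {i:7.2f}  ->  sensation = {s:2d} JND units")
```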
2) Are JND’s equal changes in sensation magnitude?

- If all JND’s imply equal changes in sensation magnitude, then the physical size of the JND must
be inversely related to the slope of the psychophysical function relating sensation magnitude to
stimulus intensity
• Hellman et al. (1987) suggest this is not the case
- The JND here is bigger than the JND for the other, steeper function

- Durup and Pieron (1933) observed that JNDs are not subjectively equal in the visual
modality either
- Blue and red lights were adjusted until they appeared equal in brightness
• When their intensities were then increased by the same number of JNDs, they no longer
appeared subjectively equally bright
- Consequence: JNDs cannot be used as a basic unit for
sensation magnitude
• The JND measure does not translate into an equal change of sensation magnitude
- Ekman proposed a modification: Subjective size of the JND is not constant, but increases with
sensation magnitude (i.e., Weber's law, but then in psychological space)

3) Thurstonian scaling

- Law of comparative judgment: theoretical model describing internal processes that enable the
observer to make paired comparison judgments
• From these judgments, it is possible to calculate psychological scale values
- Thurstone assumes that stimulus presentation results in a discriminal process that has a value
on a psychological continuum
• The variability in this process is called discriminal dispersion (= spreading)
• The psychological scale value is the mean of the distribution of discriminal processes

- Only indirect measurements can uncover this, by considering the proportions of comparative
judgments between stimuli
- Discrimination of two stimuli A and B results in a discriminal difference → the standard deviation of
discriminal differences is given by: σ_AB = √(σ_A² + σ_B² − 2·ρ_AB·σ_A·σ_B)
- On each presentation of the stimulus pair, the observer chooses the strongest one
- Thus, the difference between psychological scale values is: S_A − S_B = z_AB · σ_AB
- Assuming that there is no correlation, and the dispersions are the same, this simplifies to:
S_A − S_B = z_AB · σ√2 (with z_AB the z-transform of the proportion of times A is judged greater than B)
- Of course, in an actual scaling problem, this procedure will
be used for many paired comparisons

- If the simplifying assumptions hold, the paired comparison method is consistent with
Fechner's law, if Weber's law holds (exercise: think about why this is the case)
- Thurstone's model requires transitivity: If A > B and B > C, A > C must follow
- Sometimes transitivity fails because psychological experiences vary in more than one
dimension: multidimensional scaling
- Thurstonian scaling is applicable to dimensions that are not easily quantified (e.g., beauty)
• It can be applied to any dimension; the dimension does not have to be easily quantifiable (see the sketch below)
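A minimal sketch of Thurstone's Case V solution under the simplifying assumptions above (zero correlation, equal dispersions); the proportion matrix and the row-mean solution shown here are illustrative assumptions, not the course's own example.

```python
# Minimal sketch (illustrative assumptions): Thurstone Case V scaling from a
# matrix of paired-comparison proportions. P[i, j] is the proportion of trials
# on which stimulus i was judged "greater" than stimulus j. Under Case V the
# scale values are the row means of the z-transformed proportions, up to an
# arbitrary origin and unit.
import numpy as np
from scipy.stats import norm

P = np.array([
    [0.50, 0.65, 0.80, 0.92],
    [0.35, 0.50, 0.70, 0.85],
    [0.20, 0.30, 0.50, 0.68],
    [0.08, 0.15, 0.32, 0.50],
])

Z = norm.ppf(P)                     # z(P[i, j]) estimates S_i - S_j
scale_values = Z.mean(axis=1)       # Case V least-squares solution
scale_values -= scale_values.min()  # fix the (arbitrary) origin at 0

print("Thurstone Case V scale values:", np.round(scale_values, 2))
```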

4) Multidimensional scaling

- Multidimensional scaling makes it possible to identify underlying subjective dimensions
associated with the perception of differences among stimuli and to assign to each stimulus a
psychological scale value on each of these dimensions
- Observers typically judge the overall (dis)similarity of all possible stimulus pairs
- The rank orders of the overall similarities are usually sufficient to derive the psychological
dimensions
- These dimensions can then be visualised as a multidimensional space, with the psychological
scale values of each stimulus being coordinates in that space

- Metric MDS assumes a particular distance metric (e.g., Euclidean, city-block, ...)
- City-block distance frequently applies for dimensions that can be intuitively
added, Euclidean distance for so-called "integral" dimensions
- Nonmetric MDS assumes that only the rank order of distances needs to be modelled

- Example from Shepard (1980) on similarity between colors elicited by wavelengths

- The MDS solution resembles the well-known color wheel very well
• (Purple gap: purple is not a single-wavelength color)
- Example from Op de Beeck et al. (2001, 2003) on shape similarity (left)
- The nonmetric Euclidean solution mimics the parameter space quite well (right)

- The number of dimensions is usually unknown; stress is a measure that quantifies the
difference between the data and the predicted data obtained by the MDS solution
- Often used as a useful data reduction technique
- The visualisations are easy to grasp (see the sketch below)
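A minimal sketch of nonmetric MDS using scikit-learn (an assumed tool choice, not prescribed by the course): pairwise dissimilarities of a hidden two-dimensional configuration are fed to the algorithm, which recovers coordinates up to rotation and scale, together with a stress value.

```python
# Minimal sketch (illustrative assumptions): recover a 2-D configuration from
# a dissimilarity matrix with nonmetric MDS (only rank order is used).
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
true_points = rng.uniform(size=(10, 2))           # hidden "psychological" configuration
dissimilarities = squareform(pdist(true_points))  # pairwise (dis)similarity judgments

mds = MDS(n_components=2, metric=False,           # nonmetric MDS
          dissimilarity="precomputed", random_state=0)
configuration = mds.fit_transform(dissimilarities)

print("stress:", round(mds.stress_, 3))           # badness of fit of the MDS solution
print(configuration[:3])                          # recovered coordinates (up to rotation/scale)
```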

IV. Scaling by partition and magnitude methods


- Observers directly compare subjective magnitudes of stimuli or differences between stimuli
- Developed as doubts grew about the validity of discrimination scaling
- In discrimination scaling, observers make ordinal judgments ("smaller/larger than")
- Here, observers are required to make more sophisticated judgments (a more continuous method)

1) Partition scaling

- Designed to construct interval scales directly from the judgments


- Observers try to partition the psychological continuum in equal perceptual intervals
• Equisection scaling and category scaling
• Equisection: divide psychological space in equal intervals (ask people to generate
something with equal steps, create stimuli based on equal intervals)

Equisection:

- Section the psychological continuum into distances that are judged equal
• The distances between the sections have to be perceptually equal
• Observers choose or generate stimuli that divide the continuum in this way
• For example: If stimulus A is the minimal and D the maximal stimulus, set B and C such
that distances A-B, B-C, and C-D are all equal
- Plateau (1872) introduced this method by letting painters reproduce
a gray midway between black and white (bisection)
- Simultaneous equisection and progressive equisection (see figure)
- Example: simultaneous equisection for
tones varying in frequency
- Validation procedures are important! Verify that observers are indeed capable
of sectioning the psychological continuum (see figure below)

Category scaling:

- In equisection scaling, observers choose or generate stimuli; in category
scaling, observers categorize stimuli that are generated for them
- Prone to several kinds of response bias → range of stimuli presented
influences usage of categories
- Iterative procedures or verbal labels can ameliorate this bias

2) Magnitude scaling

- Magnitude scaling became of interest when acoustic engineers started to try and specify a good
psychological scale of loudness
- Sound intensity can be converted to the logarithmic decibel scale, and according to Fechner's
law, this should be sufficient to arrive at a quantification of the loudness of sound
• 80 dB does not appear to be twice as loud as 40 dB
• Constructing a proper ratio scale of loudness was necessary

Estimation

- Richardson and Ross (1930) assigned the number 1 to a standard tone, and then let observers
determine numbers associated with sounds varying in intensity
- This yielded a power function, R = k · I^n, rather than a logarithmic function! (R = response, I = stimulus intensity)

- S. S. Stevens then proposed that this could be used to replace Fechner's logarithmic law

- Corollary: if the power law holds, and Weber's law holds, subjective JND’s should scale
with sensation magnitudes (cf. Ekman's law)
- A useful feature of power functions: they become linear if the logarithm is taken on both sides:
log R = log k + n · log I
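A minimal sketch of this log-log trick (illustrative values throughout): simulated magnitude estimates that follow a power law are fit with a straight line in log-log coordinates, so the exponent is simply the slope.

```python
# Minimal sketch (illustrative assumptions): magnitude estimates following a
# power law R = k * I**n become a straight line in log-log coordinates, so the
# exponent n can be read off as the slope of a simple linear regression.
import numpy as np

intensity = np.array([5, 10, 20, 40, 80, 160], dtype=float)   # physical intensities
true_k, true_n = 2.0, 0.3
rng = np.random.default_rng(3)
estimates = true_k * intensity**true_n * rng.lognormal(0, 0.05, intensity.size)

slope, intercept = np.polyfit(np.log(intensity), np.log(estimates), 1)
print("estimated exponent n:", round(slope, 2))
print("estimated constant k:", round(np.exp(intercept), 2))
```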

Production

- Inverse procedure compared to estimation → given numerical values,
observers produce stimuli that are in accordance with these values
- Combination of estimation and production can offset particular
systematic errors inherent to both procedures (e.g., regression effect)

3) Individual differences

- Magnitude estimation functions are very different across observers:
do these differences stem from the usage of the numbers or
from real differences in sensation?
- Experimental controls are necessary to minimize potential
biases: stimulus transformations and response transformations
- Are magnitude estimates proportional to sensation
magnitudes? (= common assumption in experiments)
- The only thing we measure is from stimulus to response

- Line-length calibration is a method to estimate response
transformations from magnitude estimates of the perceived
length of lines
- Perceived length is assumed to be linearly related to physical
length, so magnitude estimates should reveal the response
transformation function
- If an observer uses the same response transformation
function for loudness and length, it is possible to adjust the
loudness judgments based on the length judgments

- Cross-modality matching allows observers to not use
numbers at all
- Stimuli from different modalities are adjusted such that they
appear equally intense
- Gives the same results as the transformed loudness function in the previous figure
• E.g. ask to compare their level of pain with a level of sound
• A lot of pain is then compared with a loud sound
- In magnitude matching observers judge sensory magnitudes of stimuli from two different
modalities on the same scale
- One modality serves as the standard or reference, the other as the test modality
- The assumption is that individuals are more alike in the perception of the standard modality,
and this can be used to correct the functions obtained in the test modality
- Example:
• Groups of "tasters" and "non-tasters" had to judge taste intensities of bitter and salty
compounds
• The standard modality was the loudness of 1000 Hz tones

V. Conclusion

- Psychophysical scaling has served two broad purposes:


• (1) The traditional purpose advanced by Fechner
• (2) The pragmatic purpose to study how experiences change under multivariate
stimulation

- The last section describes a worked out example of trying to answer a psychophysical question
using these methods
5. Signal detection theory

I. Why signal detection theory?

- Early psychophysicists assumed a close correspondence between verbal reports and concurrent
neurological changes in the sensory system caused by stimulation
• They assumed a close correspondence between what people said and what was
happening in sensorial system
- It was assumed that - for well-trained observers - p(yes) was a function of the stimulus and the
biological state of the sensory system
• You have to be an expert in introspecting your sensations, so a well-trained observer
- Core idea: no nonsensory variables influence the proportion of "yes" responses; but we have already seen
that there is no clear step function, only a gradual (sigmoidal) increase
- Many nonsensory variables were found to influence p(yes) as well:
• Probability of stimulus occurrence
▪ Even these well-trained observers were influenced by how often the stimulus
occurred (when it was always present they said "yes" more often than when it was
present on only half of the trials)
• Response consequences
- The threshold concept is not applicable to stimulus detection behavior

- Tanner and Swets (1954) proposed that statistical detection theory might be used to build a
model closely approximating how people actually behave in detection situations
- Green and Swets (1966) describe signal detection theory in detail
- The crucial assumption is that signals are always detected against a certain background level of
activity or noise → the observer must therefore make an observation and try to decide
whether this observation was generated by the signal or by the noise
- Signal detection theory: a theory relating choice behavior to a psychological decision space
• An observer's choices are determined by the distances between distributions in the
space due to different stimuli (sensitivities) and by the manner in which the space is
partitioned to generate the possible responses (response biases)
• The difference between the noise and signal distributions maps onto how sensitive you are, and
the way in which you cut up the underlying decision space to make your decision reflects
your response bias
- Today: three different paradigms that lie at the core of how signal detection theory is applied

II. Yes/No experiment: sensitivity and bias

- One-interval design: Participants are presented with a single stimulus and have to classify it in
one of two classes (present/absent, left/right, slow/fast, soft/loud)
- Performance on a task can be decomposed in:
• The extent to which responses mimic the stimuli (sensitivity),
• and the extent to which observers prefer to use one response more than the other (bias)
- Depending on the task, it can be useful to be biased to one or the other response (e.g., tumor
detection by radiologists, eyewitness testimonies, ...)
• Tumor detection by radiologists: a very difficult task; given an X-ray you have
to say whether there is a tumor or not, so it is a basic yes/no paradigm, but radiologists may be
inclined to say "yes" more often because failing to detect a tumor is harmful
• Accusing someone of being at a crime scene can be detrimental, so eyewitnesses may be more
inclined to say "no"

1) Sensitivity

- Typical yes/no experiment: participants are shown a series of faces they have to remember
• In the test phase, some new ones ("lures") are added and participants have to classify
each face as being "old" or "new"
- → Sensitivity refers to how well participants can discriminate old from new faces
- The table can be summarized by two numbers: hit rate (saying yes when
there is something) and false alarm rate (saying yes when there is nothing)
• H = P ("Yes" |Old)
• F = P ("Yes" |New)

- The table can be rewritten as follows:


- What is a good measure of sensitivity?
• It needs to be a function of H and F
• Perfect sensitivity implies H = 1 and F = 0
• Zero sensitivity implies that H & F do not depend on the stimulus and are the same

- Can H or F be a measure of sensitivity?


• No! It must depend on both H and F
• Ideally, it increases when H increases and decreases when F increases
- Two possibilities:
• Sensitivity = H – F → (just take the difference)
• Sensitivity = p(c) = ½[H + (1 − F)] → proportion correct: taking the hit rate and the correct
rejection rate, summing them and dividing by 2
• Equivalently, p(c) = ½(H − F) + ½
- Both solutions depend on each other and will yield similar conclusions
• But: they are not ideal sensitivity indices in this yes/no task

- Measure based on signal detection theory:


• Sensitivity = d’ = z(H) – z(F)
- z(·) is the inverse of the standard normal cumulative distribution function (cf. Thurstone)
• When H = F, d’ = 0
• When H ≥ F, d’ ≥ 0
- Perfect accuracy implies infinite d’! (see the sketch below)
• A rate of 1 (H = 1) maps onto plus infinity and a rate of 0 (F = 0) onto minus infinity, because
the normal distribution ranges from minus to plus infinity
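A minimal sketch of the d′ computation (illustrative): the half-count nudge used to keep rates away from exactly 0 or 1 is a common convention assumed here, not something prescribed in these notes.

```python
# Minimal sketch (illustrative assumptions): d' from hit and false-alarm
# counts. norm.ppf is the inverse of the standard normal CDF (the z-transform).
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    n_old = hits + misses
    n_new = false_alarms + correct_rejections
    H = (hits + 0.5) / (n_old + 1)          # nudge avoids H = 0 or 1 exactly
    F = (false_alarms + 0.5) / (n_new + 1)  # (a common convention, assumed here)
    return norm.ppf(H) - norm.ppf(F)

print(round(d_prime(hits=40, misses=10, false_alarms=15, correct_rejections=35), 2))
```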

- What is the justification of using d′ over p(c) or H − F?


- A good sensitivity measure should be invariant when factors other than sensitivity change
• Independent of their willingness to say "yes" or "no", d′ should remain the same
- Observers who generate (H, F) pairs like (.8, .4) can also generate (.6, .2) or (.35, .07)
• These pairs are generated by different response biases, but still give equal sensitivity
• Three different pairs of hit and false alarm rates that map onto different proportions
correct, but when you calculate d’ they will be the same
• When H is plotted against F, these points lie on an isosensitivity curve
• The pairs are generated by different response biases, but by the same d’ → when we plot
H and F against each other, we see that they lie on one isosensitivity curve
- It shouldn’t matter whether people are inclined to say yes or no
• You don’t want sensitivity to depend on anything other than sensitivity!

- Visualisation of H vs. F is also called receiver operating characteristic


- Two important characteristics:
• (1) H = 1 can only be achieved for F = 1 (same for 0) and
• (2) steepness of the slope decreases as bias to say "yes"
increases
- This curve develops non-linearly
• If d’ is larger than zero, the curve bows upward
• For a very large d’ you will have a very high hit rate even at very low
false alarm rates
• The pairs on one curve are generated by the same d’ but by different people
with different response biases
• H = 1 with F = 0 would be a perfect decision, which is almost impossible
• Varying amounts of bias: when you rarely say "yes" you are at the lower part of the curve
(bottom of the lines), and when you say "yes" a lot you are at the upper part (top of the lines)
- Implied ROC: for a fixed value of the sensitivity measure (e.g., the same d’), we can plot all the
false alarm and hit rate pairs it implies

- ROC curves can also be expressed in Z-coordinates (zROC curves):


• z(H) = z(F) + d’
- In z-transformed coordinates:
• d’ = z(H) – z(F)
• This gives a straight line, and the difference between the
lines is d’

- Any measure of sensitivity has an implied ROC, so what ROC does
p(c) generate?
• Straight lines (look at the definition of proportion correct)
• It is a linear function of H and F, so the
ROC of proportion correct
consists of linear functions in normal
ROC space and curved functions in zROC space
- Contrasting predictions based on ROC curves
are very useful, and empirical data can be
used to provide support for either measure (see the sketch below)
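A minimal sketch contrasting the two implied ROCs (illustrative values d′ = 1.5 and p(c) = .75): sweeping F and computing the H that keeps d′ constant (a curved ROC) versus the H that keeps p(c) constant (a straight line).

```python
# Minimal sketch (illustrative assumptions): implied ROCs for a constant d'
# versus a constant p(c), traced by sweeping the false-alarm rate.
import numpy as np
from scipy.stats import norm

F = np.linspace(0.01, 0.99, 99)

d_prime = 1.5
H_same_dprime = norm.cdf(norm.ppf(F) + d_prime)   # H = Phi(z(F) + d'), curved

pc = 0.75
H_same_pc = np.clip(2 * pc - 1 + F, 0, 1)         # from p(c) = (H - F)/2 + 1/2, straight

for f, h1, h2 in zip(F[::24], H_same_dprime[::24], H_same_pc[::24]):
    print(f"F = {f:.2f}   H(d'=1.5) = {h1:.2f}   H(p(c)=.75) = {h2:.2f}")
```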
2) The signal detection model

- d’ was merely introduced, what kind of discrimination process does it imply? What is the
internal representation and how do observers arrive at a decision?
- Signal detection model underlying this d’
• There is always some kind of activity, and stimulus will generate additional activity on top
of the noise
• There is considerable overlap, so observers have to partition the space somewhere
▪ As soon as the activity level is strong
enough, for example bigger than 0, I will
say it is an old stimulus
▪ If the activity level is smaller than 0, I will
say it is a new stimulus
• This is the decision space
• The criterion can be moved across the decision space,
and depending on the criterion you will get
different hit rates and false alarm rates

- The two distributions together comprise the decision space


- The observer can assess the familiarity of the stimulus, but does not know from which
distribution it came
- What is a good strategy to make a decision?
• Establishing a criterion somewhere in decision space → when familiarity is above this
criterion, observers will say "yes", when below observers will say "no"
• Four possible response alternatives are possible given that stimuli can be generated by
two different distributions.
• This decision space defines an ROC, and moving the criterion will generate different
points on a particular ROC curve

- d′ can be conceptualized as the distance between distributions


• This only holds for equal-variance Gaussian SDT!
- All the different response types (hits, false alarms, ...) correspond to areas under the curves relative to
the criterion
• If the criterion shifts to the right, the proportion of hits will decrease
and so will the false alarms; if it shifts to the left, both will increase
• Depending on where the criterion is, you get different hit
rates and false alarm rates
- d’ is the distance between these two distributions
• You can put your criterion anywhere; it does not
change the distance between the two distributions, so d’ is
independent of response bias
• Moving the decision criterion implies moving along the ROC
curve (observers with the same d’ but a different response bias lie on different parts of the ROC curve)
• The distance equals sensitivity, and where you place the criterion generates the observed data
• Depending on how you partition the space you will get different hits and false alarms, but
they will give you the same d’ (as illustrated in the sketch below)
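A minimal sketch of the equal-variance Gaussian model just described (illustrative parameter values): familiarity values are sampled from the noise and signal distributions, a fixed criterion partitions the axis, and d′ and c are recovered from the resulting hit and false-alarm rates.

```python
# Minimal sketch (illustrative assumptions): equal-variance Gaussian SDT.
# Placing the two distributions symmetrically around 0 (at -d'/2 and +d'/2)
# makes the recovered c coincide with the criterion's position on the axis.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
true_d, criterion, n = 1.5, 0.4, 100_000

new_items = rng.normal(-true_d / 2, 1.0, n)   # noise distribution
old_items = rng.normal(+true_d / 2, 1.0, n)   # signal distribution

H = np.mean(old_items > criterion)            # hit rate
F = np.mean(new_items > criterion)            # false-alarm rate

d_hat = norm.ppf(H) - norm.ppf(F)
c_hat = -0.5 * (norm.ppf(H) + norm.ppf(F))
print(f"recovered d' = {d_hat:.2f}, recovered c = {c_hat:.2f}")
```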
3) Response bias

- d’ does not depend on response bias, so a good measure of response bias is also independent
of sensitivity
- Bias should also depend on H and F, but it should now increase when both increase (and vice versa), as it
aims to quantify the tendency to prefer one of the responses
- Where sensitivity relies on a difference between H and F, response bias should rely on a sum
- SDT suggested one sensitivity measure d’, but a variety of bias measures is available

4) Criterion location

- Criterion location c (where you partition the space) is defined as:


• c = −½ [z(H) + z(F)]
- When the false-alarm rate and the miss rate are equal, z(F) = z(1 − H) = −z(H) and c = 0
• c is 0 when you have no tendency to say yes or no; an equal tendency towards both
- Negative values occur when the false alarm rate is higher than the miss rate (and vice versa)
• When you have a stronger tendency to say yes, c will be negative, and when you have a
stronger tendency to say no, c will be positive
• When the criterion lies in the negative region, "yes" responses increase; when it lies in the
positive region, "no" responses increase (and "yes" responses decrease)
- So: in the SDT model, a negative criterion location means a bias towards "yes" responses,
and a positive criterion location means a bias towards "no" responses
- c ranges from minus to plus infinity

- In ROC space, isobias curves are orthogonal to the isosensitivity curves


• The isobias curve for a c value of zero runs from the upper left to the lower right corner
• That is the curve for observers who do not show any bias
• When you shift the criterion, the blue region corresponds to more "yes" responses and the
yellow region to more "no" responses
• The same bias across different d’ values traces out an isobias curve

5) Relative criterion location

- This bias measure scales the criterion location with sensitivity


• c’ = c / d’
- Rationale? When sensitivity changes, the tendency to say "yes" changes if the criterion location
c remains constant
• Here, the distance to the means of the distributions is taken into account
• It scales the c value by d’, so it adjusts for the fact that the same c implies a different tendency to
say yes at different sensitivities
6) Likelihoodratio

- For each distribution, the decision variable has an associated "likelihood" (i.e., the height of the
distribution at that point)
- The likelihood ratio is another possible way to quantify bias:
• β = f(x|Old) / f(x|New)
- Under the standard SDT assumption (equal-variance Gaussian), this simplifies to:
• β = e^(c·d′)
• ln(β) = c·d′
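A minimal sketch computing the measures above from an (H, F) pair, using the (.8, .4), (.6, .2), (.35, .07) pairs mentioned earlier; the function name bias_measures is an illustrative assumption.

```python
# Minimal sketch (illustrative assumptions): the bias measures discussed above,
# computed from a hit rate H and a false-alarm rate F under the
# equal-variance Gaussian model.
import math
from scipy.stats import norm

def bias_measures(H, F):
    zH, zF = norm.ppf(H), norm.ppf(F)
    d = zH - zF                      # sensitivity d'
    c = -0.5 * (zH + zF)             # criterion location
    c_rel = c / d                    # relative criterion location c' = c / d'
    beta = math.exp(c * d)           # likelihood ratio, beta = e^(c * d')
    return d, c, c_rel, beta

for H, F in [(0.8, 0.4), (0.6, 0.2), (0.35, 0.07)]:
    d, c, c_rel, beta = bias_measures(H, F)
    print(f"H={H:.2f} F={F:.2f}  d'={d:.2f}  c={c:.2f}  c'={c_rel:.2f}  beta={beta:.2f}")
```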

7) ROC curves for bias measures

- Each isobias curve is for a particular c value, traced across sensitivity values

8) Response bias

- Which measure should be preferred?


- Three standards:
• (1) empirical support
• (2) increase as H or F increase, and
• (3) independent of sensitivity
- Criterion location c satisfies (2) and (3), (1)
has yielded ambiguous evidence
- (3) has been disputed as well: experiments have shown that c covaries with d′
- Empirical support is problematic because basically none of these isobias curves conforms well
to people's responses
- People usually prefer the criterion location c (although its empirical independence from d’ is not ideal)

III. The rating experiment and empirical ROC’s

- In yes-no experiments, participants determine which of two events occurred on a given trial
- In rating experiments, observers can make graded reports about the degree of their experience
by setting multiple criteria simultaneously
• Yes/no: present/absent vs. graded reports: e.g. strongly absent, slightly absent, slightly present, …
- Example: how is odour memory influenced by the passage of time? (left)
• Continuous way in which ratings are distributed
• We actually ask to partition the space in more than only two sections
- How to calculate sensitivity?
• Option 1: ignore confidence judgments, and collapse responses in two classes, and
calculate d′ as was done previously
• Option 2: calculate H and F for each cell, and tabulate cumulative probabilities, this
implies successively partitioning decision space (i.e., shifting the criterion location)
- This yields the following table (right):
• Based on the (H, F) pairs, one can now calculate a set of d′

- We can also visualise the data, creating an
empirical ROC
- Visualise the cumulative false alarm and hit rates
- Non-linear behaviour of H and F in relation
to each other
- The differences between the curves reflect d’ (right)
- A problem arises when the slope is not equal to 1,
because the distance between the curves then differs
(a bigger distance when the slope is bigger than 1, and smaller when the slope is smaller than 1)

- What if the slope of the zROC curve is not equal to 1? Detectability as quantified by d′ is no
longer constant (i.e., it varies with criterion location)!
- What kind of underlying distribution can cause this type of ROC? (left)
• When the slope is smaller than 1, moving one z-unit on the F axis of the zROC implies
moving less than one z-unit on the H axis
- Principled way to compute d′? (see Chapter)
- The distributions differ in width (variance), so you can no longer compute d’ with the formula we have,
because it would give different d’ values for different criteria

- The empirical ROC offers a straightforward alternative nonparametric way of estimating
sensitivity: the area under the ROC (right)
• Sensitivity bigger than 0 will give you more area under the curve
• Independent of underlying model, you just base yourself on empirical data
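A minimal sketch (with made-up rating counts) of both ideas above: cumulative hit and false-alarm rates give one d′ per criterion placement, and the trapezoidal area under the empirical ROC gives a nonparametric sensitivity estimate.

```python
# Minimal sketch (illustrative assumptions): from rating-scale counts, compute
# cumulative hit and false-alarm rates (one pair per criterion), a d' per
# criterion, and the area under the empirical ROC.
import numpy as np
from scipy.stats import norm

# Hypothetical counts for ratings from "sure new" (left) to "sure old" (right)
old_counts = np.array([10, 15, 25, 50, 100])   # responses to old items
new_counts = np.array([80, 60, 30, 20, 10])    # responses to new items

# Cumulate from the most confident "old" end: each cut-off yields one (F, H) pair
H = np.cumsum(old_counts[::-1]) / old_counts.sum()
F = np.cumsum(new_counts[::-1]) / new_counts.sum()

d_per_criterion = norm.ppf(H[:-1]) - norm.ppf(F[:-1])   # drop the trivial (1, 1) point
print("d' per criterion:", np.round(d_per_criterion, 2))

# Area under the empirical ROC (trapezoid rule), anchored at (0, 0)
F_full = np.concatenate(([0.0], F))
H_full = np.concatenate(([0.0], H))
auc = np.sum(np.diff(F_full) * (H_full[1:] + H_full[:-1]) / 2)
print("area under ROC:", round(auc, 3))
```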
- Response bias in the rating experiment:
• Given slopes of 1 in zROC space, all measures of resp. bias discussed earlier can be used
• When slope does not equal one, similar problems arise as for sensitivity
• Likelihood ratio can still be calculated, but now the distributions intersect at two points!

- Using a rating experiment is just one way to generate an empirical ROC


- Other ways include monetary rewards, verbal instructions, or manipulating presentation
probability

IV. Two-alternative forced choice (2AFC)


- Two stimuli are presented on each trial, observer makes a binary choice based on the
presentation of both stimuli
• So the main difference is that not one but two stimuli are presented on each trial
• The binary choice concerns the relation between the two stimuli rather than just
presence or absence
- Example: both new and old word presented and observer has to indicate which of the two is
the old one (make judgement about the relation)
- H and F are now arbitrary, given that both stimuli were
present:
• H = P("old on top" |Old, New)
• F = P("old on top" |New, Old)

- Calculating sensitivity follows the same kind of procedure:


• d′ = (1/√2) · [z(H) − z(F)]
- This implies that 2-AFC is easier than yes-no, and d′ needs to be adjusted by a factor of √2
• The procedure is basically the same, but the d’ that you
would usually calculate has to be corrected by a factor of √2
• The sensitivity you obtain with the regular formula is too high; 2-AFC is easier
than yes/no, so you have to apply this correction to make the two comparable
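A minimal sketch of the correction (hypothetical H and F values): the same z-difference, divided by √2 for the 2-AFC design.

```python
# Minimal sketch (illustrative assumptions): 2-AFC d' with the sqrt(2) correction.
import math
from scipy.stats import norm

H, F = 0.9, 0.3   # hypothetical "old on top" rates for the two stimulus orders
d_yes_no = norm.ppf(H) - norm.ppf(F)   # uncorrected z-difference
d_2afc = d_yes_no / math.sqrt(2)       # corrected 2-AFC sensitivity
print(f"uncorrected: {d_yes_no:.2f}   corrected 2-AFC d': {d_2afc:.2f}")
```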

- Why is this the case?


• When collapsing the distributions onto a decision variable that
is orthogonal to the decision boundary (criterion location),
the distance between the distributions is √2·d’

- This √2 adjustment is based on an optimal decision rule


• Observers can partition the space along the vertical or
horizontal axis as well
• If so, they treat the 2-AFC task as a yes-no task and
effectively ignore useful information
- Methods to calculate response bias are exactly the same as discussed earlier
- In a 2-AFC task, observers do not regularly display extreme biases
• It does not matter whether the bias is towards "top" or "bottom", because it does not give you any
advantage: whether the new or the old item appears on top is randomised
• Because of this, p(c) is sometimes argued to be a good sensitivity measure in this context
- 2-AFC area theorem: p(c) in 2-AFC by an unbiased observer
equals the area under the yes-no isosensitivity curve

- When the distributions have unequal variance, and the optimal
decision rule is used, the distributions become equal-variance
distributions, and the zROC curve has a unit slope, which is a
major advantage!

- Some empirical findings:


• The √2 relationship has not always been observed
• For temporal 2-AFC, the interstimulus interval matters
a lot
• Two reasons why 2-AFC has been adopted a lot:
▪ It discourages response bias
▪ Performance levels are high, and thus smaller stimulus differences can be
measured
