
CHAPTER 3

Adapting Relevance Theory to Accommodate Visual Communication

3.1 INTRODUCTION

In this chapter, my goal is to explore how the RT model can be applied to communication involving visuals. The reason for undertaking this, as
explained in Chapter 1, is that I believe that Sperber and Wilson are right
to claim that the “principle of relevance is essential to explaining human
communication” (1995: vii), while they acknowledge that “language is not a
necessary medium for communication” (1995: 174). But to substantiate the
contention that RT can be expanded to account for all communication it will
be necessary to revisit RT’s central tenets. I intend to push classic RT’s
“translatability” to visuals and multimodal discourse as far as it makes sense
to do so. The first reason for this is that it would be a pity to under-use its
insights and to leave unexplored opportunities for it to accommodate a
wider range of ostensive stimuli. The second reason is that this will help
us understand in what respects communi- cation functions in the same way
irrespective of mode, medium, or genre, and in what respects it differs from
one mode, medium, or genre to another. That being said, I will not hesitate
to draw attention to those aspects of RT which in my view require serious
adaptation, possibly even rejection, when this is necessary to accommodate
communication in other (combinations of) modes or media. Usually the
parts where application to visuals is problematic point to dimensions
where visual communication (and by extension communication in other
modes, such as sound, music, and gesture) really works differently than
communication in language. Inevitably, much of what I have to say will

Visual and Multimodal Communication. Charles Forceville, Oxford University Press (2020). © Oxford University Press.
DOI:10.1093/oso/9780190845230.001.0001
be exploratory and require further debate. Here and there I will therefore take the liberty of being speculative.
A few preliminary remarks are in order. First, as indicated before, the word
“visuals” will be taken in a very broad sense. I agree with Lisa El Refaie that
“it is important to recognize . . . that visual meaning-making is by no means
limited to the use of iconic pictures; it also includes nonrepresentational
aspects of visual design, such as style, layout, color, and typography”
(2019: 38). Of course the differences between these various phenomena
are substantial and have a bearing on how they can achieve relevance. For
present purposes, the main point is that “visuals” contrast with other
modes or modalities (which are the two interchangeable terms I will use
for what in other approaches are called semiotic systems or semiotic
resources), specifically with the spoken-verbal mode that has hitherto
been the privileged one in RT scholarship. Sometimes the distinction is
problematic, since for instance written language has visual dimensions as well
as verbal ones: typed language is printed in a specific font, with a specific
size. We are usually unaware of this, but are alerted to it when we suddenly
encounter a word in CAPITALS, or in bold, or in italics, or with a section
heading printed in another font or size. Children’s books and instruction
manuals may use different colors for different segments of texts. Certain
poems exploit the visual dimensions of language. The lines in George
Herbert’s “Easter-Wings” (1633) are arranged on the page in such a way
that, when the page is turned 90°, the poem’s two stanzas resemble a pair of wings, while the lines in his “The Altar” (1633) have been arranged to
resemble the object of its title. Dadaists conducted radical experiments
with the type fonts of words and letters and their spatial arrangement on
the page. Comics artists often find creative ways in which to visualize
onomatopoeia (AAARRRGG! Boing! Zzzzzz). Contemporary fiction
sometimes explores the visual dimension, and its interrelations with the
written-verbal mode, in highly daring and thought-provoking manners
(Gibbons 2012). Whenever written language has such salient visual
dimensions that it seems reasonable to assume that these dimensions play a
role in meaning-making, I consider these dimensions pertinent to what is
discussed in this book. The same holds for gestures. Gesturing is usually
considered a mode in itself (e.g., Müller 2008; Cienki and Müller 2008),
even though people see gestures, which thereby could also be called
“visual.” I will not be overly concerned by such problems of delimitation. My
focus on the visual dimension is motivated by practical reasons pertaining
to space, reproducibility of data for analysis, and the areas I am more or
less knowledgeable about rather than by principled reasons.
Second, I will in this chapter talk about both “pure” visuals and visuals
accompanied by language. In fact, the former constitute a fairly untypical situation: in the service of optimal relevance, ostensive visuals combine
more often than not with language; messages of the latter type are called “multimodal.” Multimodality, however, is an ill-defined concept. As the sociologist Luc Pauwels warns, “Multimodal research is an ambitious venture given
the fact that even most forms of mono-modal or single mode analysis (for example, the analysis of static photographs) are still underdeveloped—in
other words, not able to tap into the full expressive potential of this
medium” (2015: 73). In section 3.2 I will address the thorny concept of
“multimodality” at greater length.
Third, while spoken language dominates the face-to-face variety of
communication habitually discussed in RT, visuals are typically used in
many forms of mass-communication. For the time being, I will take for granted this mass-communicative aspect of visuals; in Chapter 4 I will return to this crucial dimension in more detail.
Fourth, while the utterances exchanged between Mary and Peter typically analyzed in RT are one or two complete or incomplete sentences, the
visuals that will be the center of attention in this book are often, in one way
or another, complete discourses. We need to be alert to a possible imbalance
between the amount of information that is conveyed in a single sentence vis-à-vis that in a complete visual or word-and-image discourse.
Fifth, I will standardly assume that the examples of pictures discussed
in this book are ostensively used. That is, they are used by a communicator
as a message that aims to trigger a positive cognitive effect or reward in an
addressee or audience—and thus come with the presumption of relevance.
Usually, of course, it is the context in which the picture functions that
marks it as a piece of ostensive-inferential communication.
The structure of this chapter is as follows: I will begin by briefly discussing
multimodality and proceed by reflecting on some passages by Carston (2002)
that help pave the way for applying RT beyond the verbal mode. Then I will
establish where the applicability of RT (as summarized in Chapter 2) to
visuals is in my view completely or fairly unproblematic, and gradually
proceed to address more knotty issues.

3.2 MULTIMODALITY

Visuals are often accompanied by language. As Gombrich points out, “The understanding of images, whether still or moving, is vastly facilitated by the addition of verbal explanations” (1999: 227–228). Visuals and written
discourse (not an ideal word, but preferable to “texts,” because of the
latter’s strong verbal connotations) are each generally accepted to have
“mode-status.” This means that both exclusively written discourses and
exclusively visual discourses count as monomodal ones. A combination of
visuals and written language would thereby constitute one type of multimodal
discourse.
As such, this should be enough for embarking on the case study chapters
in the second half of this book. However, it will be useful to briefly reflect on
multimodality, as this field is quickly developing into a discipline in its own
right. This development was helped by the publication of a handbook (Jewitt
2009, 2014a), several textbooks and edited volumes (e.g., Kress and Van
Leeuwen 2001, 1996, 2006; Kress 2010; Royce and Bowcher 2007; Baldry
and Thibault 2006; Bateman 2014; Bateman et al. 2017; Archer and Breuer
2015; Jewitt et al. 2016; Wildfeuer et al. 2019), the Routledge Studies in
Multimodality series (editor: Kay O’Halloran), and the foundation of a
journal (Multimodal Communication, since 2012). However, the concept of
multimodality has hitherto not yielded a generally accepted definition.
Kress and Van Leeuwen (2001) characterize mode very broadly: “Modes
are semiotic resources which allow the simultaneous realisation of discourses
and types of (inter)action. Design then uses these resources, combining semiotic modes and selecting from the options which they make available according to the interests of a particular communication situation,”
adding that “media become modes once their principles of semiosis begin
to be conceived of in more abstract ways (as ‘grammars’ of some kind)”
(2001: 21–22), and accord mode-status, for instance, to narrative (2001:
22). Kress (2010: 58) refrains from a definition but considers gaze, facial
expression, gesture, and spatial positioning each as modes. Later in his book the volatility of the concept transpires from a distinction between a social and a formal meaning: “socially, what counts as mode is a matter for a community and its social-representational needs. What a community decides to regard and use as mode is mode. Formally, what counts as mode is a matter
of what a social-semiotic theory of mode requires a mode to be and to do”
(2010: 87, emphases in original).
In 2009, Carey Jewitt edited The Routledge Handbook of Multimodality,
followed in 2014 by a revised edition. In her introductory chapter to this revised edition, Jewitt discusses mode as one of multimodality’s core concepts
that “are in a state of change and fluidity, and are continuously taken up and
shaped in different ways by different approaches to multimodal research”
(2014b: 22).
In the introduction to another handbook, Jewitt et al. (2016) formulate
three premises of multimodality: “(1) Meaning is made with different semiotic
resources, each offering distinct potentialities and limitations; (2) Meaning
making involves the production of multimodal wholes; (3) If we want to study
meaning, we need to attend to all semiotic resources being used to make a
complete whole” (2016: 3), and provide a very general definition of mode: “a
set of socially and culturally shaped resources for making meaning [with] distinct affordances” (2016: 9).
There is a serious risk that as a result of the lack of a precise definition
of what counts as a mode, and of the cherished openness of the concept, mode-status can be accorded to any meaning-generating principle or dimension. The question of how many modes there are thereby becomes worrisomely
similar to the (probably apocryphal) medieval conundrum of how many angels
can dance on the head of a pin. In other words, the whole concept of mode
becomes so hazy as to be completely vacuous.
Clearly, there are fuzzy border areas between various modes, and it
turns out to be difficult to marry theoretical rigor and pragmatic usefulness
into a good definition of mode. Eventually, it is to a considerable extent a
matter of paradigmatic preference what one wishes to count as a mode,
whether one wants to accept submodes, and how long a catalogue of
modes would still be considered workable. Bateman (2016), for instance,
takes for granted that “mode” cannot be captured in an exhaustive list, but
at least proposes three levels, all of which are individually and specifically
necessary for the determination of each and every semiotic mode:

First, a material substrate must be fixed as an essential component for any semiotic mode; this material may itself stretch over diverse sensory channels.
Second, a mid-level, “mediating” stratum provides more (i.e., grammar-like) or
less (lexicon-like) compositionally functioning structural possibilities capable
of drawing “functionally”-motivated differentiations in form. Third and finally,
“above,” or “surrounding” these levels of semiotic abstraction, we place our
more abstract stratum of (local) discourse semantics, which operates abductively
on the descriptions of the lower levels of abstractions (2016: 46–47, emphases in
original).

I find the third level in particular rather daunting. Since the concept of a mode
can in my view only be useful if it covers a fairly limited number, I have decided, pace Bateman (2016), to commit myself provisionally to the following list: (1) visuals; (2) written language; (3) spoken language; (4) bodily behavior (comprising gestures, postures, facial expressions, and [manner of] movement); (5) sound; (6) music; (7) olfaction (see Plümacher and Holz 2007);
(8) taste; (9) touch. (This is a slight adaptation of the list I proposed in
Forceville (2006: 383), in which (4) covered only “gestures.”) The rationale
here is to try to keep as close a correspondence as possible between sensory
perception and mode, and to consider other meaning-generating
mechanisms as simply not belonging to “modality.” From this perspective it
would be highly convenient if we could restrict ourselves to the visual, the
aural, the olfactory, the tactile, and the gustatory modes. However, for
various reasons this is not a viable option. For one thing, we see both
written language and visuals, but whereas we have to learn the specific
language of the culture in which we are born, we often immediately
understand the meaning of visible and depicted objects and persons.
Moreover, we hear spoken language, music, and non-verbal sound.
Generally speaking, different input sources are associated with very
different material and social circumstances. Language learning requires a
substantial period of time, while this is not, or is far less, the case for visuals. A language’s spoken variety is crucial in typical face-to-face communication, thereby
interacting with the visual mode. Music tends to be experienced for
aesthetic and perhaps ritual reasons rather than for communicative ones,
while sound is an often accidental by-effect of actions and events. To
further complicate the issue, conceptual metaphor theory (CMT)—one of
the most important paradigms in which I have worked—for a long time
focused exclusively on the verbal manifestations of conceptual metaphors
(Lakoff and Johnson 1980), but the first two non-verbal meaning-generating mechanisms that began to be studied within CMT were visuals
(e.g., Forceville 1996) and gestures (e.g., McNeill 1992, 2005; Cienki and
Müller 2008; Müller 2008; Müller and Cienki 2009; Mittelberg and Waugh
2009). Since co-speech gesturing yielded one of the first types of
multimodal metaphor to be systematically studied, it would be awkward to
exclude gestures from the list of modes. Difficulties do not end here, as
written language, by virtue of letter fonts and sizes and layout (Bateman
2008), undeniably has visual dimensions as well. Rhythmic sounds (such as
a woodpecker’s pecking and a machine ramming piles into the ground) may
approach the musical mode. Braille is touched as well as read, while bodily
movements can, in certain cases, be felt as well as seen or heard. We can
consider the medium of books without pictures to be drawing on the verbal
mode—although even such books convey meaning by visual information
such as layout and type fonts—while lyric-less music depends wholly on the
musical mode. In practice, however, many discourses are multimodal. The
film medium, for instance, usually combines visuals, spoken language,
written language, music, sound, and bodily behavior, and is thus highly multimodal. All this means that it is difficult to discuss mode—however defined—
without discussing medium, as transpires from the analyses by Müller and
Kappelhoff (2018) and Greifenstein et al. (2018); for a response, see
Forceville (2018). Bateman et al. (2017) acknowledge the link between
mode and medium by discussing multimodality in terms of four
different “canvases,” where a canvas is “anything where we can inscribe
material regularities that may then be perceived and taken up in
interpretation” (2017: 87). Four key dichotomies regulating meaning-making mechanisms in a canvas are distinguished: static versus dynamic; transient versus intransient; 2D versus 3D; participant versus observer.
The case studies in the second half of this book, visuals or visuals
accompanied by written language, belong clearly in Bateman et al.’s “use case
area 3: spatial, static” (2017: 261–324). Even this pervasive and most extensively studied combination of modes has not resulted in a generally accepted theory. This means that I cannot rely on much tried-and-tested
research when discussing how pictures and their accompanying written
texts together constitute a discourse aiming for relevance. This should not be a problem for present purposes, however. The verbal component of the word-and-image combinations I discuss in this book is invariably fairly short, often no more


than a few words, a tag line, or one or two sentences. This linguistic material rarely if ever completely reduplicates the information in the visual modality, but the nature of its relation can take many forms. Barthes (1986
[1964])—taken as starting point to chart various approaches to word-and-image combinations by Bateman (2014)—made the still pertinent distinction
between, on the one hand, language that “anchors” visuals, which means that
it zooms in on information that is already, in one way or another, present
in the image, and, on the other hand, language that “relays” information,
which means that it provides information that, in one way or another,
complements information in the image. But as suggested in Forceville
(1996: 73), images can anchor written text no less than the other way
round, and there is a continuum rather than a sharp distinction between
anchoring and relaying. Moreover, further refinements are possible: the
language can name or label (part of) the visuals it accompanies, it can
specify information therein, it can add pertinent details, or it can contrast
with the visuals (see Martinec and Salway 2005; Unsworth 2007; Unsworth
and Clérigh 2014). I will not explore the thorny issue of multimodality any
further here (for more discussion, see Forceville, in preparation). In this
book, I will focus on two types of static discourses: monomodal discourses
of the visual, and multimodal discourses of the visual-plus-written-verbal
variety.

3.3 RELEVANCE THEORY VIEWS ON APPLICATIONS BEYOND LANGUAGE

Sperber and Wilson claim that “linguistic communication is the strongest possible form of communication” (1995: 175). This is undoubtedly right if one thinks of the nuanced and precise way in which one can communicate in language, but visual communication has other affordances.
Fortunately, RT offers clear starting points for applications to the visual
realm. I feel encouraged, for instance, by a major point in RT scholar
Robyn Carston’s Thoughts and Utterances: The Pragmatics of Explicit
Communication (2002). The philosophers she takes to task, she argues, base
their theories on the misguided idea that everyday verbal communication
functions essentially in the same way as the language of logic. In the
language of logic there is, or should be, a complete distinction between the
purely semantic information in a proposition and the pragmatically inferred
information. This distinction is important for linguists of the language-of-
logic persuasion, for “the semantics of a formal logical language is typically
given in terms of a truth theory for the language, which assigns to each
sentential formula conditions on its truth; the propositionality and context-
independence of the sentences of the language are important factors in
making this feasible” (Carston 2002: 50). Informally formulated, Carston’s opponents (she specifically discusses Higginbotham’s views; e.g., Higginbotham 1994) insist that a sentence must always be traceable to a proposition, a statement that can be evaluated as “true” or
“false”; and for this to be possible, all the pertinent information must be
conveyed linguistically. Consequently, it cannot be the case, according to the language-logicians, that a proposition conveys explicit information that is not completely captured in its semantic, verbalized part. Carston, by
contrast, asserts, in accordance with RT, that information typically results
from combining semantic information with contextually derived
information. Utterances in everyday verbal communication are often
highly incomplete (“on the top shelf” instead of “the marmalade is on the top
shelf”; “never” instead of “I will never play golf”; “Irene” instead of “Irene
will mind the children tomorrow evening when we attend the Stopera in
Amsterdam in order to see the dance performance in honor of Hans van
Manen’s 80th birthday”), because the verbal information is easily and
routinely complemented by pertinent information from the context.
Carston makes clear that semantic, linguistic meaning always
underdetermines what the speaker wants to convey, and the two can
therefore not be conflated: “There is linguistic meaning, the information
encoded in the particular lexical-syntactic form employed, and there is the
thought or proposition which it is being used to express” (2002: 17). Even
though she herself focuses predominantly on spoken communication,
Carston emphasizes that language, however important, always remains
subservient to the goal of communicating information and emotions:

A linguistic system is undeniably enabling; it allows us to achieve a degree of explicitness, clarity and abstractness not possible in non-verbal communication . . . , but it is not essential for the basic function of referring, and the
predicates it offers are but a tiny subset of the properties and relations that
humans can think about and communicate. . . . [W]hat the coded bits of an utterance do is “set the inferential process on the right track.” Verbal communication, on this view, is not a means of thought duplication; the thought(s)
that the speaker seeks to communicate are seldom, if ever, perfectly
replicated in the mind of the audience; communication is deemed successful
(that is, good enough) when the interpretation derived by the addressee
sufficiently resembles the thoughts the speaker intended to communicate
(Carston 2002: 47).

She repeats this point in more technical language as follows:

What bridges the gap between the underdetermining encoding of a natural-language utterance and the thought(s) expressed is a powerful pragmatic inferential mechanism, whose job it is to figure out the informative intention behind a linguistic utterance (or any other act of ostension) (Carston 2002: 76, emphasis mine, ChF).



It is also useful at this stage to be reminded of the fact that Sperber and
Wilson themselves relativize the importance of coded communication:

Rather than seeing the fully coded communication of a well-defined paraphrasable meaning as the norm, we treat it as a limiting case that is never encountered
in reality. Rather than seeing a mixture of explicitness and implicitness, and
of paraphrasable and unparaphrasable effects, as a departure from the norm,
we treat them as typical of ordinary, normal communication (Sperber and
Wilson 2012b: 87).

Moreover, they acknowledge that coding is not necessarily restricted to language:

Much animal communication is purely coded: for example, the bee dance used to indicate the direction and distance of nectar. . . . It is arguable that some human non-verbal communication is purely coded: for example, the interpretation
by neo-nates of facial expressions of emotion (Sperber and Wilson 2012c: 263).

Once these relaxations are accepted, the step from verbal communication
to communication involving other modalities is not such a big one
anymore. Indeed, De Brabanter (2010) discusses gestures from the angle of
RT, as does Wharton (2009). The latter, moreover, discusses Grice’s
example of Herod presenting the severed head of John the Baptist, and
accepts that such osten- sive behavior can be understood as a specimen of
“overt intentional commu- nication” (2009: 33).
The following sections will demonstrate in more detail how RT can be applied to visuals.

3.4 THE COGNITIVE AND COMMUNICATIVE PRINCIPLES OF RELEVANCE: VISUALS

This can be a very short section. The First, Cognitive Principle of Relevance pertains to all phenomena in the environment that have a
bearing on our Darwinian goals of survival and reproduction. Human
beings are constantly on the alert for any information, whether mediated or
not (and if mediated: irrespective of the medium in which it appears),
that is potentially relevant, so this holds for the visual realm as well. Like
verbal information, visual information that is clearly not addressing a
person can be regarded by that person as containing potentially highly
pertinent “symptoms” of the environment. It may be that by sheer
coincidence you happen to see some sort of visual on a colleague’s desk (a
diagram or blueprint, a caricature of the boss, a sketch of a handsome co-worker) that gives you information that you find highly
relevant, but it is no more an ostensive stimulus than a dark cloud spelling
rain. Both the accidentally discovered visual and the dark cloud may
trigger pertinent inferences—but thanks to the (First) Cognitive, not the
(Second) Communicative, Principle of Relevance.
By contrast, if the colleague had shown any of these visuals to you, they
would have been ostensive stimuli which, as always, come with the presumption of optimal relevance. However, most of the time visuals (whether or not
accompanied by information in other modes) are used ostensively not between two people but as part of a mass-communicative message. This latter
issue is discussed in Chapter 4.

3.5 RELEVANCE TO AN INDIVIDUAL IN VISUALS

There does not seem to be a big difference between how the two intentions
function in spoken language and visuals. As for the communicative intention: an ostensive visual stimulus, too, needs first of all to be recognized (= acknowledged) and fulfilled (= heeded). Some ostensive stimuli may actually be
difficult or impossible not to notice. For instance, when somebody addresses
you very loudly from close by or you hear the shrill doorbell in your own
house, if you are not hard of hearing, in all likelihood you can’t avoid
noticing it. Similarly, a flickering or moving advertisement on your computer
screen irritatingly imposes itself upon your attention. That is, you cannot help
but recognize the communicative intention in these cases—and it is doubtful
whether you can choose not to fulfill it by ignoring it.
Next, the issue arises of whether an (envisaged) addressee will
recognize the informative intention by processing and interpreting the
visual stimulus; and finally, only if the addressee then accepts that
interpretation as (probably) true or more generally accepts the positive
cognitive effects or cognitive rewards the interpretation triggers can we say
that the informative intention is not only recognized but also fulfilled.
Let us first consider ostensive pictorial communication in a Mary-and-Peter type of exchange, that is, a picture is shown by one person to an individual addressee in a face-to-face situation. This situation is quite rare in
Western society (but less so in some other societies; see, e.g., Munn 2016;
Wilkins 2016). One example is a passer-by showing the way to a lost traveler
by drawing a map. The friendly passer-by has a relatively easy job: she wants
to help an audience of one (the traveler) to reach his destination with the
aid of her drawn-on-the-spot map and she can fine-tune her visual or
multimodal message (it turns multimodal if the visuals are accompanied by
language—for instance, in the form of written or spoken street names) in
interaction with the traveler. In the old board game Pictionary, players take
turns drawing a
picture cueing a word printed on a card only they get to see in such a way that
their teammate is (hopefully) capable of guessing that word on the basis of
the picture. I propose this counts as a form of one-to-one visual communication. Here is a third example: in an educational context a teacher expects her
pupils to make a drawing satisfying certain requirements (and not merely as
a form of self-expression) for a mark. The task could be, for instance, “Draw
what impressed you most in our recent excursion to the local art gallery,” or
“Draw your family,” or “Draw what you did during the weekend.” Each pupil
would then make a drawing for one person: the teacher—although, again,
it is likely that the visuals would be accompanied by language, as “in many
graphic texts produced by children, drawing and writing are co-present”
(Mavers 2009: 265). Another situation would be the explanation of a technical problem by an expert to a fellow expert supporting it with a diagram
or sketch. In these examples the visuals are relatively expendable: after they
have served their purpose, they are deleted or thrown away, although a proud
father may stick up the product of a child’s enthusiastic exertions, specially
made for him, for a while in the living room. Similarly, certain visuals may
be stored for a period of time for later consultation—or even archived for
eternity. But these are presumably relatively rare cases of one-to-one visual
communication. Let us say, for argument’s sake, that mass-communication technically begins when a sender addresses an audience of at least two persons. Of course, the closer the number of addressees is to two, the less typical the situation is as a mass-communicative one, but for theoretical
purposes the distinction suffices. Importantly, the central RT tenet that
relevance is always relevance to an individual holds with undiminished force
when the audience consists of more than a single addressee—whether two,
2,000, or 2 million of them. The implications of the relevance-to-an-individual tenet in mass-communication will be explored in more detail in Chapter 4.

3.6 EFFECT (BENEFIT) AND EFFORT (COST) REVISITED

The principles of effect and effort apply in much the same way in visual communication as they do in face-to-face verbal communication, given that any
ostensive stimulus comes with the presumption of optimal relevance to the
addressee. To recall, a presumption is by no means the same as a guarantee: as
addressees we are quite often disappointed in the implicitly promised
relevance of a communicator addressing us. Just as a narcissist at a party may
bore us to death with the presumption of the relevance of her self-aggrandizing chatter, so we may find a TV-program or YouTube film
completely irrelevant because the meager positive cognitive effects or
emotional rewards do not warrant the investment of even our minimal
mental effort. Similarly, the discussion in a science program on TV may be

A D A P t I Ng R e L e VA NC e t H e OR y [
so difficult that we are not prepared to

[ 80 ] Visual and Multimodal Communication


invest the required mental effort in the hope to secure any positive effects.
In the latter cases, though, we have the opportunity to click or zap away
when we give up trying to understand, while we may not so easily be able
to flee the party bore. But like the bore’s talk, the TV-program and the
YouTube film come to us with the presumption of relevance: “I am worth
your attention; if you process messages conveyed by me, they will have
positive cognitive effects, or cognitive rewards, in your cognitive
environment.”
Consider another example: you may find a diagram in a textbook that is supposed to help you to understand a point made in the body text, but it is so impenetrable that you cannot be bothered to expend the mental effort needed to make it relevant to you. But just as we are likely to pay close attention to an endlessly spun-out story of someone close to us about something important that has befallen her because she is dear to us, so we may be motivated to invest a lot of effort in grasping the diagrams of a textbook whose contents we will be examined about, or in scrutinizing the pictures in the users' manual of a machine we hope to get working. That is, there may be sociocultural, emotional, or practical reasons that encourage us to invest an inordinate amount of effort in interpreting a visual.

3.7 VISUALS AND THE CODE MODEL

So far, so good. Thanks to the parallels with processing verbal stimuli, the processing of visual stimuli can hitherto by and large be accommodated within the RT framework. But further pursuing the parallels gets us into deeper waters, since Sperber and Wilson specify that the first stage in the interpretation process of utterances is that they are decoded. Only after the various elements have been decoded and their underlying logical form has been decided on can utterances be further processed by means of reference assignment, disambiguation, and enrichment. Recall that an addressee draws on the logical form for the derivation of explicatures and implicatures from an utterance. The concept of logical form is thus a central issue in RT. For instance, only a fully propositional logical form allows for interaction with other conceptual representations so as to enable deciding issues like contradiction and implication (Sperber and Wilson 1995: 72).
Accepting that logical forms do not have to be fully propositional, Sperber and Wilson state that there can also be less-than-fully-propositional logical forms, as long as they are well-formed. Such "assumption schemas" (Sperber and Wilson 1995: 73), which Carston defines in terms of "non-propositional (non-truth-evaluable) logico-conceptual structure" (2002: 59), require further completion, to be achieved in the future, in order to develop, hopefully, into fully propositional logical forms. But even allowing for the existence of not-yet-fully-propositional logical forms is not going to help solve the following major problem. Crucially, a logical form, to repeat an earlier quotation, is a
"syntactically structured string of concepts with some slots, or free variables, indicating where certain contextual values (in the form of concepts) must be supplied" (Carston 2002: 64, emphasis mine, ChF). If we take "syntax" to refer specifically to grammatically structured combinations of elements, only linguistic input can result in the retrieval of an underlying logical form. To learn a language one must learn how its elements can be structured in grammatically acceptable ways. These elements are words. A language has a vast but limited number of them, which we can find in dictionaries. Dictionaries also give us their correct meaning, spelling, and, sometimes, an appropriate context of use. That is, verbal utterances are governed by the codes of semantics, orthography, vocabulary, and grammar. Now here comes the rub: unlike verbal utterances, visuals have neither a vocabulary nor a grammar in the strict sense in which this holds for language. Inasmuch as a language's grammar prescribes the correct use of its elements, it has no equivalent in visuals. And while a dictionary more or less exhaustively inventories the elements of a language, no similarly exhaustive "pictionaries" exist. True, there are certain elements of pictures that have the language-like properties of consisting of a limited set of items (= "having a vocabulary") and of being constrained in their combinability (= "having a syntax"). Examples are the use of the emotion-enhancing flourishes above characters' heads (Forceville 2005a, 2011), text balloons (Forceville et al. 2010), and the navigation from panel to panel (Cohn 2013) in comics; I will come back to these in Chapter 9. But if we accept these as "languages," they are extremely rudimentary ones, and lack the enormous precision that verbal languages allow for. These exceptions (and there are more) do not change the fact that the medium of visuals as a whole does not have a grammar and vocabulary in the way that the medium of language does. As Messaris formulated it, "Visual communication does not have an explicit syntax for expressing analogies, contrasts, causal claims, and other kinds of propositions" (1997: xi). I therefore see it as misleading to postulate that pictures can be governed by a "grammar" (Kress and Van Leeuwen 1996, 2006; see Forceville 1999a; Bateman et al. 2004; Kwiatkowska 2013: 13; Gibbons 2012: 14 for discussion)—unless this word is used very loosely, non-literally. But that is not the way in which Sperber and Wilson use the word "grammar"; for them, as for me, grammar governs the admissible relations between words—not visuals, not sounds, not music, not gestures, not smells, not tastes.
This brings us back to the difficult issue of the logical form. The conclusion appears to be simple: since pictures and (most) other visual elements are not subject to a syntactic and semantic code, they cannot be encoded and decoded, and therefore by definition cannot give access to a logical form. As Forceville and Clark (2014) point out, this in turn problematizes the issue of explicatures in pictures, since "ostensive stimuli which do not encode logical forms will, of course, only have implicatures" (Sperber and Wilson 1995: 182). The consequence of respecting this definition would be that ostensive pictures cannot convey explicit information (since that is what explicatures communicate) or, to reformulate, there would then be no visuals that on their own, unaccompanied by language, transmit explicit assumptions. However, as Forceville and Clark (2014) argue, this is problematic, since there seem to be at least some types of visuals that communicate coded, explicit information (see also Forceville 2014; Tseronis and Forceville 2017a).
One way of solving the problem raised by Sperber and Wilson's strict definition of explicatures is to understand "syntax" in a broader sense than it is used in linguistics, and to postulate that the logical form in the "language of thought" is to be understood as a system of rules as to how not just verbal elements but also information in other modes (such as visuals and sounds) can be integrated to form fully propositional forms. This solution would mean abandoning the requirement that in the logical form concepts must be governed by grammar; it would be sufficient to say that they are governed by some sort of structure. This means dropping the idea that we (always) think grammatically or (always) think in grammar, i.e., discarding the assumption that "the grammar of thought" has the same underlying principles as "the grammar of language." Now as indicated before, there are very good reasons to believe that, indeed, we by no means always "think in language." Introspection suggests this to be the case, but we can also ask: do painters think grammatically, or think in the grammar of language? Do composers? The long tradition of systematically investigating language as the supreme vehicle for communication has led to a skewed view of its importance: "sentence-like linguistic expressions are not primary, but are based on our more visceral, incarnate sources of meaning and understanding" (Johnson 2015: 10).
It is important to recall here that semiotics has a long tradition of using
the word “code” and has routinely referred to “decoding” as a way to make
sense of pictures. Chandler (2017) devotes a chapter in his Semiotics: The
Basics to the concept of codes, and he summarizes insights from authors
such as De Saussure, Jakobson, Gombrich, and Hall as follows:

The concept of code is prominent in structuralist discourse. . . . In structuralist theory, the production and interpretation of messages and texts (in any medium) are seen as dependent upon codes. . . . Since the (intended) meaning of a sign depends on the code within which it is situated, codes provide a framework within which signs make sense. They embody rules or conventions of interpretation which systematically regulate the ways in which meanings are produced. This is not the same as saying that codes alone determine meaning. . . . We read meaning into texts, drawing upon our existing knowledge and experience of signs, referents, and codes in order to make explicit what is only implicit. . . .

A semiotic code is closely associated with a set of interpretive and representational practices familiar to its users, and the conventions of codes represent a social dimension in structuralist semiotics. . . .

Codes provide relational frameworks within which social and cultural meanings are produced. Unfamiliar experiences are interpreted analogically in relation to already codified knowledge. The dominant codes help to maintain a broad conceptual consensus and thus facilitate cultural transmission. . . .

Semioticians seek to identify and describe the various codes that are taken for granted in this way. The task of the analyst involves identifying and making explicit the system of distinctions, conventions, categories, operations, and relations underlying a particular social practice, which gives familiar phenomena cultural meaning and value as signs (2017: 177–179, emphases in original).

Obviously, the word "code" is used in a much broader sense in semiotics than in RT. Chandler mentions three main codes, each with a variety of subcodes: interpretative codes (including those governing perception and ideology); social codes (including those governing language, bodily contact, proximity and appearance, fashion, behavior, etc.); and representational codes (including codes pertaining to science, aesthetics, genre, rhetoric, mass media, etc.). He adds that "most codes are not explicitly formulated and are usually followed unconsciously. Some theorists question whether some of the looser systems constitute codes at all" (2017: 187).
Without proposing that we equate the precision and sophistication of the syntactic and semantic codes of language as used in RT with those in the list provided above, I would nonetheless maintain that in essence they pertain to the same thing: a set of conventions and rules that have to be learned and that guide our interpretation of what semiotics calls "signs" and RT calls "ostensive stimuli." A person who is not in possession of the proper code cannot check an ostensive stimulus (whether an utterance or a visual or a sound or a gesture) against the encyclopedic knowledge (pertaining to objects, people, scripts, schemata) in his cognitive environment, and will misinterpret it, or not understand it at all.
That being said, there are important differences between verbal and non-verbal codes. To understand utterances in an unfamiliar language one truly has to learn that new language from scratch, but we can recognize and understand many objects, people, and events in visuals because they very closely resemble objects, people, and events in everyday life. We routinely identify them, and usually do so correctly. This fact has led critics of semiotics to point out that although interpreting film, for instance, requires an understanding of some basic conventions (such as different ways of editing together two shots), "learning" to watch film is nothing like learning a new language. As Anderson puts it, "The perception and comprehension of motion pictures is regarded as a subset of perception and comprehension in general, and the workings of the perceptual systems and the mind of the spectator are viewed in the context of their evolutionary development" (Anderson 1996: 10; for similar views, see, e.g., Bordwell 1985, 1989; Bordwell and Thompson 2008; Carroll 1996; Ildirar and Schwan 2011).
Chandler admits that the popular structuralist terms "encoding" and "decoding" have sometimes had "the unfortunate consequence of making the processes of constructing and interpreting texts (visual, verbal, or otherwise) sound too programmatic," acknowledging that "inference is required to 'go beyond the information given'" (2017: 228, emphasis in original). Chandler wisely dropped from his book a sentence that appeared in its early online predecessor, Semiotics for Beginners: "In the context of semiotics, 'decoding' involves not simply basic recognition and comprehension of what a text 'says' but also the interpretation and evaluation of its meaning with reference to relevant codes" (Chandler n.d.: "Encoding/decoding," emphases in original). This latter sentence would seem to suggest that interpretation and evaluation are entirely a matter of "decoding"—and this is precisely why Sperber and Wilson saw semiotics as having failed to provide an adequate model for communication. One of the great strengths of RT, after all, is that it shows how much of meaning-making is a matter of combining ostensive stimuli with ad hoc context, yielding implicatures as well as explicatures.
In short, the word "code" in semiotics tends to be used fairly broadly, and thus it does not mean exactly the same as it means in language (namely, the set of rules governing the correct use of grammar and vocabulary). That being said, RT acknowledges that even the language code cannot specify the precise meaning of each single word or phrase: the word "open" means something slightly different in "he opened the tin," "he opened the door," and "he opened his heart." In RT, such different uses would be marked by asterisks, so that we can distinguish between OPEN*, OPEN**, and OPEN***. We could do something similar for the words "code" and "en/decoding" themselves: in the case of linguistic utterances, an addressee DECODES*, while in the case of non-linguistic ostensive stimuli he DECODES**. But I submit that the similarities between the two meanings are much more important than the differences. I thus propose that we use the words "code," "encoding," and "decoding" not only for linguistic utterances but also for at least some (parts of) ostensive visual stimuli (and by extension also for some ostensive stimuli in different sign systems).
Accepting this claim would mean that not just decoded ostensive verbal stimuli but also (some) decoded ostensive visual stimuli can serve as the raw input for the next step in the process of deriving relevance: disambiguation, reference assignment, and various enrichment procedures—which in turn allow for the derivation of explicatures and implicatures. But before we turn to these issues, let me briefly consider what other semiotic concepts will be useful in an RT analysis of mass-communicative visuals.

3.8 SOME OTHER USEFUL SEMIOTICS CONCEPTS PERTAINING
TO CODE** IN VISUALS

Of the many tripartite divisions Charles Sanders Peirce proposed, the only
threesome that is regularly used outside of semiotics scholarship (e.g., by
Clark 1996) is that of symbol, icon, and index. I will here draw on Chandler
(2017) for their definition:

Symbolic: based on a relationship which is fundamentally unmotivated, arbitrary, and purely conventional (rather than being based on resemblance or direct connection to physical reality)—so that it must be agreed upon and learned: e.g. language in general (plus specific languages, alphabetical letters, punctuation marks, words, phrases, and sentences), numbers, Morse code, traffic lights, national flags.

Iconic: based on perceived resemblance or imitation (involving some recognizably similar quality such as appearance, sound, feeling, taste, or smell)—e.g. a portrait, a cartoon, a scale-model, onomatopoeia, metaphors, realistic sounds in "programme music," sound effects in radio drama, a dubbed film soundtrack, imitative gestures.

Indexical: based on direct connection (physical or causal). This link can be observed or inferred: e.g. "natural signs" (smoke, thunder, footprints, echoes, non-synthetic odours and flavours), medical symptoms (pain, a rash, pulse-rate), measuring instruments (weathercock, thermometer, clock, spirit-level), "signals" (a knock on a door, a phone ringing), pointers (a pointing "index" finger, a directional signpost), recordings (a photograph, a film, video or television shot, an audio-recorded voice), personal "trademarks" (handwriting, catchphrases) (Chandler 2017: 41, emphases in original).

It should be realized, as Chandler points out, that although the three are often referred to as kinds of signs, Peirce envisaged them as different dimensions that most signs possess simultaneously. However, usually one of them is dominant over the others, so that in common parlance a sign is considered a symbol, an icon, or an index.
In many ostensively used visuals we recognize elements because they iconically cue their referents in everyday life. It is not just people or single objects that are iconically understood, but also certain actions; and certain clusters of objects, people, and actions, which are usually referred to as scripts or scenarios. A cluster consisting of pews, people kneeling, an altar, and a priest will activate the "church" scenario; people at a table with cutlery on it, and somebody with a menu standing next to it, evokes the "restaurant" scenario. Incidentally, it is not just that the recognition of individual elements leads to the recognition of the scenario; this is frequently a two-way process, as often the recognition of the scenario helps, or is even necessary, to recognize certain individual elements that would otherwise be unidentifiable or ambiguous.
Roman Jakobson emphasizes that one dimension of communication is the "phatic" function (1960: 355): some communication has as its primary goal the act of communicating itself, for social reasons rather than for exchanging information. Pleasantries about the weather exchanged between strangers exemplify this phatic function. Lovers' talk that bears little content may similarly be considered as primarily serving the phatic function. Jakobson moreover introduced the "poetic function" to remind his readers that language users sometimes draw attention to language itself rather than trying to transparently convey information. We are here to think not only of poetry, but also, for instance, of rituals. Importantly, Jakobson's model "does not account for acts of communication purely in terms of encoding and decoding" (Chandler 2017: 232). What matters for present purposes is that, on the one hand, pace Sperber and Wilson (1995: 6), semiotics does not want to reduce all communication to the "code model." On the other hand, there are good reasons to assume that we interpret at least some parts of visuals in a way that deserves to be called "decoding" in the much stricter sense that RT reserves for this activity. Interestingly, in a passage discussing onomatopoeic verbs such as "sizzling," "mooing," and "hiccupping," RT scholar Tim Wharton acknowledges that "there might be different types of coding" (2003: 76), and he reminds us that "in the ethological literature, non-human animal communication systems are often referred to as codes," proposing to call these "natural codes" (2003: 80, emphasis in original). Such observations about loosening the idea of what should count as a code open up the possibility of talking about certain visuals in terms of complete logical forms, which could thereby spawn not just implicatures but also explicatures.

3.9 REFERENCE ASSIGNMENT, DISAMBIGUATION, ENRICHMENT, AND LOOSE USE IN VISUALS

Having argued that at least some elements of visuals are decoded, I will now further pursue the analogy with language. As we saw in Chapter 2, verbal messages rarely come in such a complete form that they straightaway allow for the derivation of explicatures and implicatures. They need to fit the format of a "logical form" or "assumption schema," which can be, but need not be, fully propositional. To allow for the derivation of inferences, the decoded verbal information requires reference assignment, disambiguation, and various forms of enrichment. Let me reconsider these operations with a view to their possible application to ostensively used visuals.

3.9.1 Reference Assignment

First of all, in most realistic pictures we need to know who is who, and what is what. In most photographs, there is a relation of resemblance between people as they appear in the picture and as they are known to look in real life—what Peirce calls an iconic relationship between signifier and signified. Such a resemblance is supposed to be specifically salient in photographs for passports, which have a clear ostensive function: "This is what person X, whose name and other biographical details appear elsewhere in this document, and who is this document's carrier, looks like." In other photographs, the resemblance may, for all sorts of reasons, not be so clear, and the assignment of the correct referent may require some reasoning. Here is an example: Mary fetches a photograph taken at last year's Christmas party, showing it to Peter to prove that their friend Irene was actually there, while Peter had insisted that Irene could not have been present at that party as he is convinced she was abroad during that whole December month. Technically speaking, Mary tries to persuade Peter to delete one assumption in his cognitive environment (Irene was abroad last Christmas) and replace it with another (Irene was at Mary and Peter's party last Christmas). However, the photograph may be blurred, or the person supposedly being Irene may be difficult to identify because she is seen from the back only. In such a situation the issue of "reference assignment" may require (mental) work and/or background knowledge. For instance, the person under consideration wears an unusual hat that both Mary and Peter know to be Irene's.
Many issues may problematize reference assignment. Here is an attested example of one such issue. Quite some years ago (I have to say in my defense), a student gave a presentation in a seminar, showing a photograph of a young, blond woman with a milky moustache. One mature student and I saw a young blond woman with a milky moustache; all the other students saw Paris Hilton with a milky moustache (the photograph was part of the celebrity-endorsed "Make Mine Milk" campaign promoting the drinking of milk to young people). Clearly, spotting resemblance between somebody in a photograph, on the one hand, and the real-life referent, on the other, requires that the addressee of the photograph has the knowledge of who the referent is, and what she looks like, stored somewhere in his cognitive environment. Note that in this case, only seeing a blond woman, as the mature student and I did, still left intact a large part of the message—but not all of it, for we missed the celebrity status of this endorser of milk-drinking. In the case of drawn or painted rather than photographed people, other problems pertaining to reference assignment may arise. Cartoonists may cue a depicted person's identity by means of certain salient features (a big nose, a bald head, prominent breasts, protruding teeth), or by certain props—much as sculptures of saints in Catholic churches are recognized by certain postures and attributes—rather than by detailed resemblance. In portraits, artists may deliberately make the resemblance of their sitters subservient to other interests, such as trying to bring out the sitter's character or status, or expressing their own idiosyncratic style of painting.
Difficulties with reference assignment do not only emerge with persons. Objects and buildings, too, may pose problems. We may not recognize an object in the first place; or we are puzzled about what it is because we see it represented from an unfamiliar angle or because we see only part of it. Historians who want to use photographs as evidence, and thus ostensively, may need to ensure that there is no disagreement about the referent of a specific, unique building supposedly represented in a given photograph, taken at a specific time.
Reference assignment also pertains to ostensive visuals' depiction of activities. Are these two boys playing or fighting? Is this nude woman sophisticatedly exhibitionistic, or has she been surreptitiously photographed by a paparazzo? Are these police officers involved in self-defense against a dangerous criminal, or are they beating up a helpless victim (the issue in the infamous Rodney King case)? Consider Figure 3.1. We see a man using a pole apparently to cross a ditch. The innocent viewer may be forgiven for thinking he simply does this to get to its other side in order to continue on his way. In fact, however, the man is engaged in a simulation of the sports activity that is popular in the province of Frisia in the Netherlands and is known as "fierljeppen"; the goal is to descend as far as possible at the other end of a ditch by climbing the pole as fast and high as possible during the arc it makes in the jump (and of course not to fall into the ditch). For the knowledgeable viewer, the fact that this is a simulation rather than the real thing would already be clear because the pole used by the man is by no means long enough. One would have to know about the existence of this sport and its rules in the first place, and then be made aware, probably from the context or situation in which the photograph would be used ostensively, how to categorize the activity correctly. Similar needs for reference assignment arise for all unfamiliar activities, including certain professional actions and rituals.

Figure 3.1 Man involved in simulating "Fierljeppen" ("far-leaping"), screen shot from video "Amsterdam-Fierljeppen" © DPA Productions/Matt Doyle, early 2010s. Source: https://vimeo.com/56923508. The shot was found in the online edition of the Leeuwarder Courant, October 12, 2012 (http://www.lc.nl/friesland/fierljeppen-op-amerikaanse-sportzender-espn-15311673.html), last accessed January 2, 2020.
In short, reference assignment in visuals requires knowledge no less than in language. Often at least part of this knowledge can be retrieved from the immediate context in which the picture appears, but usually encyclopedic knowledge is indispensable. Reference assignment may also be facilitated by similar items in a series that preceded the picture under consideration (e.g., in the case of a comics panel in a newspaper strip or an album, or a standalone cartoon by the regular cartoonist of the newspaper one subscribes to)—or simply by a verbal caption identifying the person(s) or activity depicted, or by the newspaper article for which the picture serves as an illustration. In the latter situation the discourse is multimodal rather than purely visual. Pictures not accompanied by language that renders salient some of their elements, or accompanied by language that lacks deictics or verbs, may still cause doubt as to the precise referent of elements. For instance, there are numerous ways of characterizing the man in Figure 3.1: "person," "adult," "man," "man-wearing-jeans," "balding-man-wearing-jeans-and-white-shirt-and-sunglasses," and so on. Often the context resolves the question of what is the right level of characterization, but as we will see in Chapter 10 vagueness is sometimes a deliberate strategy.

3.9.2 Disambiguation

Is there, in pictures, a need to disambiguate elements in a way that is similar to the need to ensure in what sense the word "bank" is used in the sentence "he went to the bank"? I hesitate about this issue. The word "bank" is a homonym: it has two different meanings, which can be found in a dictionary. A visual element, as we have seen, cannot be checked in a similar way, as there is not a limited set of "visuals" comparable to the limited set of linguistic items in a language. When we are uncertain about the identity of a certain visual element, we want to know what/who it is, and then we are back with reference assignment. Of course ambiguity in visuals can be deliberate. Consider, for instance, the naughty sexual innuendo of the logos in Figures 3.2 and 3.3. We can also think of Arcimboldo's "fruit-and-vegetable faces." The point here is that we are not expected to resolve the ambiguity but to relish it. So provisionally I will take it that the procedure of verbal "disambiguation" does not have an equivalent in the sorting out of ostensive visuals that is distinct from reference assignment.

Figure 3.2 Deliberately (?) ambiguous "A-Style" clothing company logo. Source: https://www.boredpanda.com/worst-logo-fails-ever/?utm_source=google&utm_medium=organic&utm_campaign=organic, last accessed January 2, 2020.

Figure 3.3 Deliberately (?) ambiguous "Dirty Bird" restaurant logo. Source: https://www.adforum.com/creative-work/ad/player/34501437/logo/dirty-bird, last accessed January 2, 2020.



3.9.3 Enrichment

Just as we routinely enrich phrases such as "soon" (as in "John will be here soon"), "extremely hot" (as in "the water in the swimming pool is extremely hot"), and "red" (as in "he looks red in the face"), we enrich pictorial elements in visuals. Even the most detailed realistic picture cannot depict everything, and if it is used ostensively, too much detail might make it less relevant because of the concomitantly increasing processing effort (remember that, as discussed in Chapter 1 with reference to Arnheim's discussions of "gestalts," human beings prefer simple over complex visual configurations). A photographic image, for instance, by definition has borders that leave out part of what is depicted. A Dutch passport photograph must be in black and white, and excludes everything but a person's head and shoulders (and moreover only shows frontal representations at that—unlike, for instance, mug shots of suspects at police stations, which provide close-ups both head-on and in profile); we mentally supply the rest of the person's body. Passport photographs, in fact, are not exceptional in showing only parts of people, (in)animate objects, and buildings. We always routinely imagine missing parts on the basis of gestalt perceptions and the mental schemata we have available in our cognitive environment. The mother proudly showing a photograph to give you an idea of what her child looks like (Figure 3.4a) intends to prompt admiration, not pity because the child would be understood to be facially disfigured. Similarly, in most comics, cartoons, and animation films, an artist deliberately leaves out many details. A few details such as some windows, doors, and contours suffice to depict a skyscraper in a cartoon (Figure 3.4b). Stick figures in some comics lack body parts (Figure 3.4c), and
some manga artists omit characters’ noses (Cohn 2013: 155). These are
instances, I propose, of visuals that in the viewer habitually cue various
strategies of “enrichment.” It is interesting to

(a) (b) (c)

Figure 3.4 (a) An extreme close-up routinely makes us enrich the representation into a complete face or body. Source: https://steemit.com/tutorial/@armiden/how-to-take-a-good-and-true-video-image, last accessed January 2, 2020.
(b) A skyscraper cityscape. Windows are absent or indicated as highly stylized slits only. Source: http://clipart-library.com/clipart/kiMaRg4ij.htm, last accessed January 2, 2020.
(c) A stick figure to which the viewer mentally adds facial features, hands, feet, etc.

Source: Internet, provenance unknown.



reflect on the question of whether there is a visual equivalent not just of the broadening variety of enrichment that we have hitherto considered but also of the narrowing variety. “Narrowing” is the term for a word whose intended meaning is more specific than its decoded form. An example borrowed from Ruiz de Mendoza and Díez Velasco (2002) would be that the meaning of the word “pill” in “She’s on the pill” needs to be narrowed to “She’s on the anti-conception pill.” A visual equivalent would have to be an ostensive stimulus of a general phenomenon that is in context routinely taken to be a specimen of a subcategory of that phenomenon. Presumably, in a specific narrative context a visually represented box of pills may indeed be understood to depict anti-conception pills. An example of visual broadening is discussed in Figure 9.5a in Chapter 9.

3.9.4 Loose Use of Visuals

In many types of ostensive pictures, the drawing style is, in one way or another, not “realistic,” as we have seen in the discussion of “enrichment.” Now what counts as realistic is in itself a knotty issue, since the idea of realism is subject to change over time and place. Nonetheless, most of us have a fairly clear everyday idea of when a picture counts as “realistic,” namely, when the depiction of something closely, in a more or less photographic manner, resembles the way we perceive that thing in reality. But in many situations, the maker of a picture deliberately and routinely fails to adhere to these conventions of realism. That is, like the woman who answers “2.30” rather than “2.28” when asked the time by a passer-by in the street and me saying “Jag talar inte svenska” rather than “I don’t speak Swedish” to the Malmö market grocer (see Chapter 2), some visuals deliberately deviate from a faithful depiction. Viewers unproblematically fill in the missing details in such “short-hand depiction” thanks to their knowledge of stereotypes and standard scenarios, as we saw in Figures 3.4b and 3.4c. Other candidates for “loose visuals” are the stylized pictures of elements of a machine in a manual for the prospective user, who is to assemble these elements himself. Again, details that are unmistakably recognizable in the real-life referent that the communicator wants to capture are deliberately omitted in the representation of that referent. Why would cartoonists and manual designers deliberately indulge in incompleteness in their visuals? Of course, the answer is that they do so in the interest of relevance, more specifically, the reduction of processing effort. To achieve optimal relevance, a cartoonist, like any communicator, needs to ensure that the envisaged addressee “gets” the critical, often more or less funny comment on a state of affairs in the world without being unduly puzzled. This means among other things that a viewer needs immediately to recognize the situation, and often the person(s) depicted. Cluttering cartoons with too much detail

would detract from the addressee’s ease of decoding the information the visual communicator wants to convey. The same holds for manuals: the graphic designer uses simplified visuals to communicate without possible ambiguity: “This plug goes in the third, yellow-colored socket from the right on the right-hand bottom side of the back of your computer.”
At the risk of stating the obvious: not all “absences” count as loose use. There may be circumstances in which, for whatever reason, something crucially important is left out of a picture. It may well be that a clever cartoonist working under a dictatorial regime manages to make the absence of something expected blatantly salient. The absent part is then an essential part of what makes the message relevant and needs to be picked up by the target audience. But the elimination of crucial details in pictures can also have other reasons: a communicator may change the effects of an ostensively used photograph, for instance, by “cropping” it (i.e., printing a detail of a photograph rather than the whole photograph, as in Figure 3.4a), or by photoshopping it. In this case the original picture (if there is one) has been manipulated. There are benign as well as malignant reasons for doing this. When there are malignant reasons, we are coming closer to the issue of “lying” with pictures—an issue that will be taken up in Chapter 10.
In short, I conclude that the concept of loose use as adopted by Sperber and Wilson is applicable to visuals. The fact of this applicability is in itself pertinent for another reason. While acknowledging that loose uses are unproblematic in everyday communication, Wilson and Sperber point out that “they do raise a serious issue for any philosophy of language based on a maxim or convention of truthfulness” (Wilson and Sperber 2012a: 55). Accommodating loose use in the RT model therefore is one of the features that distance communication from strict propositionality, and this can only bode well for modes of communication, such as the visual, that for lack of a grammar pose problems for propositionality anyway.

3.10 EXPLICATURES, IMPLICATURES, AND SYMPTOMATIC MEANING IN VISUALS

If it is accepted that at least some visuals, or visual elements, are coded, and that visuals are subject to the procedures of reference assignment and enrichment, the next step is to ask, “Can decoded, reference-assigned, and enriched (parts of) pictures amount to complete logical forms?” If so, they can trigger explicatures, that is, truth-evaluable propositions. If not, we would simply have to accept that communication by ostensive visuals can only give rise to implicatures. Yus, expanding on earlier work (Yus 2008), argues that “visual content can . . . lead to visual explicatures” (Yus 2016: 271, emphasis in original). Forceville and Clark (2014), too, want to salvage the idea that there are degrees of explicitness in visuals and to acknowledge that some visuals contain explicit information. Nonetheless, we deemed it premature to declare the label “explicatures” applicable to visuals. Therefore, we decided to provisionally adopt the label “explicature-like” (Forceville and Clark 2014: 452). Here, however, I will adopt (as in Yus 2008, 2016, and Forceville 2014) the stronger position of affirming that at least some (parts of) visuals can trigger explicatures, that is, that there are indeed visuals that, without any help from other modes, notably language, can singlehandedly present information of which we would say: “this is true” or “this is false.” Let me consider some visuals as candidates for such propositional status, followed by discussion.

(1) “This photograph shows unambiguously that celebrities X and Y were passionately kissing in a parking garage.”
(2) “This photograph truthfully depicts what the famous statues of Bamiyan looked like before they were destroyed by the Taliban.”
(3) “This painting, made by Han Van Meegeren at the injunction of a court of law, demonstrates without a doubt that he was capable of making such high-quality Vermeer forgeries that art experts were deceived by them.”

Note that in all three cases deriving the pertinent information communicatively requires that the picture is used ostensively. It is not difficult to imagine situations in which this is plausibly the case. As for the paparazzo photograph described in (1), the editor-in-chief of a gossip magazine, faced with the dilemma of whether she will publish a juicy article about the rumors that X and Y are having an affair, risking a libel suit if wrong, could print the photograph to prove that the rumor is true. As for the Bamiyan statue discussed in (2)—see Figure 3.5—art historians who would want to restore the statue could present the photograph, probably along with many others, as a correct representation of what the statue looked like before it was destroyed. And as for the historically attested Van Meegeren case (3): after World War II, this Dutch art dealer was accused of having stolen and sold paintings by famous artists, including Johannes Vermeer, to the Nazis, and was therefore charged with collaboration. However, Van Meegeren claimed that he was innocent because he had actually forged the supposed masterpieces himself. The skeptical judges required him to paint a “Vermeer” on the spot to prove his point—and when he successfully did so, he was indeed found not guilty of the charge.

A critic might object that additional information from a wider context is necessary to promote these visuals to the status of giving rise to explicatures, since on their own they supply at best (perhaps essential) proof for a fully propositional logical form. But this surely is not fundamentally different from having to complete lapidary verbal forms such as “on the top shelf,”

Figure 3.5 Buddha statue at Bamiyan, Afghanistan, before it was destroyed by the Taliban,
photographer unknown.

“tomorrow,” or “Javier.” In this line of thinking, in a given context the visuals, after referent assignment and enrichment, would be equivalent to the verbal example “Willem Alexander van Oranje Nassau has been king of The Netherlands since 30 April 2013”—an utterance whose propositional status also requires accepting several premises (pertaining to naming, nationhood, and the Gregorian calendar, for instance) as well as presupposing encyclopedic knowledge. That being said, it would still have to be acknowledged that accepting the utterance about the Dutch king as fully propositional requires less information from the context than the three visuals discussed in (1)–(3) above. Mindful of Sperber and Wilson’s observation that “an explicature is explicit to a greater or lesser degree” (1995: 182), if we accept them as giving rise to explicatures, the visuals discussed in (1)–(3) above verge toward the less explicit pole of the continuum, as they require a substantial amount of contextual information before they can be processed. But even so, in a given context, each of these could be used as an ostensive stimulus giving rise to minimal explicatures such as, “Celebrities X and Y kissed (at least once) in a parking lot”; “This is what (at time Y) one of the Buddha statues at Bamiyan looked like”; “This painting proves that Van Meegeren was able to forge a credible Vermeer.”
I will tentatively propose, then, that an ostensive use of these pictures in a certain context enables the envisaged addressees to derive specific explicatures. I will for the time being compromise and, when referring to explicatures in visuals, silently consider these to be “visual explicatures”—thereby leaving room to distinguish them from verbal explicatures.



Each of these visual explicatures is in turn combined with contextual assumptions to give rise to implicatures of various strengths such as “This photograph will cause a scandal for X and Y” or “the spouses of X and Y will be displeased when they see this photograph”; “A lot of time and money will have to be invested to restore the Bamiyan Buddha statue to its original status,” or “How wonderful did the original statue look!”; “Van Meegeren needs to be acquitted of the charge that he sold original paintings by Old Masters to the Nazis,” or “Van Meegeren is a master forger.”
Finally, we saw in Chapter 2 that verbal utterances can give rise to meanings that were definitely not intended by their communicators but that nonetheless may trigger highly relevant positive cognitive effects in an addressee. I adopted the label “symptomatic meaning” from the film scholar David Bordwell for this phenomenon (RT scholar Tim Wharton [2009] uses the Gricean term “natural meaning”). Since this is not ostensive communication, I propose not to see such meanings as part of communicative acts. But whether this exclusion is accepted or not, symptomatic/natural meanings play a role in visuals no less than in language. Indeed, the identification of symptomatic meaning in visuals is considered core business by many social semioticians and cultural studies scholars: they attempt to lay bare what subconscious, implicated premises constitute the ideological bias of the communicator. We can here for instance think of the systematic depiction, in pedagogical textbooks, of men rather than women to represent generic human beings, or of white-skinned rather than black-skinned or yellow-skinned people to do so. Kress and Van Leeuwen (2006 [1996]) discuss many examples of what would qualify as symptomatic meaning—although their analyses are sometimes debatable (Forceville 1999a).
Summarizing the argument up till now, I propose that visuals, after reference assignment and enrichment, can direct the addressee to derive not only strong or weak implicatures but also (visual) explicatures. Visuals may also give rise to symptomatic meaning, but I consider this meaning to fall outside the scope of communication and will more or less ignore it in the remainder of this book. How such explicatures and implicatures are derived to achieve relevance in everyday visuals will be addressed in more detail in the case study chapters.

3.11 DESCRIPTIVE VERSUS INTERPRETIVE USES OF VISUALS?

As we saw in Chapter 2, utterances can both describe a state of affairs and interpret a representation (another utterance or a thought). Clearly, the most common use of ostensive visuals, like that of ostensive utterances, is the “description” (the noun is a bit awkward for visuals, but since it is a technical term in RT, I will stick to it here) of actual states of affairs, but they can also describe possible states of affairs, both desirable and undesirable ones. Many advertisements depicting the use of the product promoted, for instance, describe the enviable happiness the product’s user will experience thanks to it. By contrast, cartoons could for instance visually describe the undesirable outcome of a certain politician’s actions. But can visuals also be used interpretively? That is, is it possible for a communicator to attribute an utterance or thought to another (virtual or real) communicator by visual means? The answer is an unambiguous “yes”—at least for the genres of cartoons and comics. Consider the examples in Figures 3.6a–c.
In Figure 3.6a, a pastiche (?) of a famous Saussurean example, the cartoonist communicator visually describes the two creatures, and verbo-visually describes the left creature’s utterance, presumably with different intonations and degrees of loudness, of various pronunciations of the word “tree.” The cartoonist visually interprets the thoughts of both creatures. (This cartoon also nicely illustrates the RT idea that ultimately every utterance is an interpretation of a thought by rendering the utterance in verbal, and the thought in visual form.)

(a) (b)

(c)

Figure 3.6 (a) Similarity, not identity, between concepts in the communicator’s and the addressee’s mind: communicating the concept “tree.” Source: http://d3fhkv6xpescls.cloudfront.net/blog/wp-content/uploads/2011/02/miscommunication.jpg, last accessed January 2, 2020.
(b) S. Bailie: The workers’ ideal as envisioned by the bourgeoisie. Source: Vers l’Avenir,
Brussels, 1912, p. 40.
(c) Men discussing early 20th-century art. Source: Panel from Soirs de Paris (1989)
by Philippe Petit-Roulet (writer) and François Avril (artist) © 2017 Humanoids, Inc.
Los Angeles, p. 24.

Figure 3.6b has a caption in French, which translates as follows: “To be ‘at home,’ to live in ‘his’ house, cultivate his plot of land, to be his own lord and master, who then, at the hour when destinies were to be decided, has not had this dream and thought about realizing this ambition?” We do not need this caption, however, to understand that the scene in the smoke “balloon” is a depiction of the man’s “pipe dream.” Put differently, the depiction in the cloud of smoke is a visual interpretation of the man’s thoughts pertaining to a desirable state of affairs. (I am indebted to Janet Polasky for alerting me to this illustration.)
Figure 3.6c is a panel from the album Soirs de Paris, in which the artists apparently set themselves the task of using no language in the text balloons of their characters. In this particular panel, we understand on the basis of the visual description of a number of elements that the men are at a party. The contents of their speech balloons are visual interpretations of their utterances about several early 20th-century paintings.
Further proof that “describing” need not be done literally can be found in the existence of pictorial/visual metaphors (see, e.g., Forceville 1996, 2008a, 2016a; Bounegru and Forceville 2011; El Refaie 2003, 2009, 2019; Abdel-Raheem 2019; Benedek and Nyíri 2019).

3.12 BLENDING/CONCEPTUAL INTEGRATION THEORY PERSPECTIVES

Gilles Fauconnier and Mark Turner (2002) were perhaps a bit too enthusiastic when they implied that “the way we think” necessarily goes via “blends,” but they certainly drew attention to, and modeled, an important phenomenon in human cognition. Blending theory (BT; its later developments are labeled conceptual integration theory/CIT) can help account for how it is we understand certain visual and verbo-visual discourses. Whereas I cannot do justice to the theory here, I trust that a quick summary, using some relevance theory terminology, will give an idea of its potential usefulness for present purposes. In this summary I rely heavily on, and sometimes literally quote from, my earlier analyses (Forceville 2004, 2012, 2013). Fauconnier and Turner claim roughly the following: many of the conceptual domains an addressee of an ostensive stimulus draws on in meaning-making are not in themselves sufficient to make sense of discourse and, more generally, artifacts. Very often, the interpreter needs to evoke two or more conceptual domains (called “mental input spaces” in BT), turning them into a new, ad hoc conceptual domain (called the “blended space,” or the “blend”). This blending is possible thanks to the fact that the input spaces share certain similarities (modeled in the “generic space”). This ad hoc combining ability



is an efficient feature of the human mind: just as knowledge of the grammar and vocabulary of a language allows a person to understand an infinite number of new, creative sentences, so a knowledge of scores of conceptual domains enables the addressee of a new message to cobble them together in unprecedented ways when the message requires this. A well-known example in the BT literature is “land-yacht” to designate a certain type of car. Presumably we do not (or did not, when this hybrid was first used) possess a ready-made mental space LAND-YACHT—but we do have available the CAR and YACHT mental spaces, which on the basis of certain shared properties in the generic space (such as “constituting a means of transport”) merge in the blend to create a new meaning: an intimidatingly large car that has certain properties (luxuriousness, size, price, unwieldiness, etc.) of a yacht. Here is another example:

The punning name “Nim Chimpsky,” for an ape that was able to deploy a rudimentary form of sign language, is a blend that has the name of the linguist “Noam Chomsky” and the noun “chimpanzee” as its two input spaces. Shared properties include “being a mammal” and “belonging to a species with a fairly highly developed signaling system.” On a purely formal level, the input spaces share certain sounds as well, notably the “ch,” prominent as the first phoneme of surname and noun, respectively, and the “m.” All these would thus be represented in the generic space. Unique properties of the “Chomsky” input space include him being, presumably, one of the most informed and famous language experts of his species and a proponent of the idea that the ability to use language is innate, while the chimpanzee input space confers about everything generally known to be true of this species of apes to the blend. The blended space also inherits the consonants and the monosyllabic structure of the linguist’s first name (“Noam” → “Nim”) and the last syllable of his second name (“sky”), while the first part of its second name (“chimp”) is inherited from the noun “chimpanzee.” The emergent structure in the blend is a felicitous representation of a language-using chimp (Forceville 2013: 256–257).

BT has tended to focus on the addressee’s perspective, but of course the addressee’s job is the reverse of the communicator’s: while the addressee needs to “unpack” or “run” the blend, he can do so only because the communicator created the blend in such a way as to enable him to do this—indeed make him do this. The ad hoc blend thus draws at least on one, but often more than one, unique feature of each of the contributing input spaces. But the input spaces also have many features, often also structural relations, in common. The shared features and structures are preserved in the blend. In the model,

[Diagram: the Generic Space at the top, Input Space 1 and Input Space 2 in the middle, and the Blended Space at the bottom]

Figure 3.7 The blending space model (from Forceville 2013, adapted from Fauconnier and Turner 2002: 46). Key: The big circles are mental spaces. The black and open dots represent properties. Uninterrupted lines between dots represent a property shared across spaces. Interrupted lines represent properties uniquely imparted to the blended space by each of the input spaces. The square in the blended space contains the pertinent properties in the blend. Unconnected black dots in the input spaces represent properties that were not imparted to the blended space; open dots in the blended space represent properties that were present in neither of the input spaces and are generated thanks to the combining of the input spaces: these, then, symbolize the new, “emergent” properties.

this commonality is represented in the generic space. With the shared features
and structures of the input spaces as the base, the blend imports pertinent
features from the input spaces to create “emergent meaning” (see Figure 3.7).
The BT model also has serious limitations. While acknowledging the importance of pragmatic factors in meaning-making, it does not pay much attention to them. Veale et al. (2013) point out that the BT focus on the perspective of the recipient leads to a kind of reverse-engineering: in retrospect, the addressee can always discern which input spaces gave rise to the blend, but BT has nothing to say about how a communicator went about creating it—which makes it of only limited use when one is interested in ad hoc meaning production: “Blending theory cannot be considered a true theory of producer-centric creativity until it can explicitly identify the heuristics, pathways and mechanisms that allow a producer to infer the contents of a second input space for a given input in a specific goal-oriented context” (Veale et al. 2013: 49; see also Veale 2012).
Brandt (2013), too, identifies problems with BT. Her central thesis is that insights from semiotics are indispensable for further development of BT—indeed of cognitive linguistics more generally. Specifically, she rightly criticizes the lack of attention paid in BT to the crucial importance of “the situation of enunciation, as an experiential source in meaning construction” (Brandt 2013: 219). This omission makes mental space theory needlessly complex. Brandt points out, for instance, that Fauconnier wrestles with the thorny issue of what constitutes a mental space. He comes up with six types of spaces: time spaces, space spaces, domain spaces (such as activities), hypothetical spaces (e.g., “if I were you . . .” or “imagine the following . . .”), tenses, and moods. Fauconnier concludes:

I will assume . . . that a combination of pragmatic and grammatical factors makes the appropriate domain type accessible. Clearly, however, this is at present an unsolved (although unavoidable) problem. In context, we have no trouble telling what the domain type is. How we do it is far from understood (Fauconnier 1997: 138, quoted in Brandt 2013: 207).

But actually it is not at all “far from understood.” Surely, it is the presumption of optimal relevance that guides the interpretation of each ostensive stimulus-in-context. In a given situation, competent addressees unproblematically realize which “domain types” they need to bring to bear on the message, and how to do this. Also, of course, a competent addressee is able to assign referents to the pertinent elements in the input spaces that partake in the blend.
But BT/CIT has at least two attractive dimensions that can be useful to
help make sense of the visuals and multimodal discourses that are the focus of
attention in this book. In the first place, since mental spaces are supra-modal,
BT can easily accommodate information in different modalities. One input
space can be visual, for instance, while a second input space can be verbal.
In the second place, it can deal with more than two input spaces. A
blended space (for instance in a commercial) could be “fed” by a visual, a
spoken word, a written word, a musical, and a sonic input space, while, as
we will see, addressees often have to invoke various “intertexts” and
“scenarios” to make sense of a discourse. BT can model these in terms of
mental spaces that need to be invoked for successful interpretation.

3.13 PICTORIAL/VISUAL AND MULTIMODAL METAPHOR AND RELEVANCE THEORY

A substantial part of my own scholarly work has been devoted to pictorial (or visual) and multimodal metaphor. Since metaphoricity is not a central issue in this book, it is not opportune to dwell on this work. Nonetheless, it might be odd not to refer to it at all, particularly because, as briefly mentioned in Chapter 2, my account of metaphor does not square with RT’s categorization of metaphor as a variety of loose use. This discrepancy, incidentally, did not prevent me from drawing on RT in the development of my model of pictorial metaphor in Forceville (1996).
My own understanding of metaphor, including its visual and multimodal varieties, is in the spirit of Romero and Soria (2014), who see as the great problem of RT’s view of metaphor as a form of loose use that it cannot account for the ad hoc, emergent meaning that is typical of its creative varieties. Emergent meaning in metaphor is meaning that does not reside in the target, nor in the source, but comes into being in their interaction (Black 1979). Discussing the example “Robert is a bulldozer,” the authors state:

The reason why . . . it is hard for relevance theorists to solve the emergent property issue is that they think that emergent properties not only have to be attributed to the topic [or: target, ChF] but also to the denotation of the metaphorical vehicle [or: source, ChF]. When the speaker uses “bulldozer” metaphorically, he is not talking about a bulldozer but about Robert, a person, and he is not interested in applying this expression to things that it literally applies to. Nothing is meant to be conveyed about literal bulldozers. In its metaphorical sense, this predicate is not applied to certain machines, the metaphorical meaning of its properties does not have the requirement that they have in RT of being applicable to both a tractor fitted with caterpillar [tracks] and Robert, a person (Romero and Soria 2014: 499).

Instead, the authors propose to follow Black: “metaphorical interpretation is explained taking into account metaphorical ad hoc concepts that result from an inferential task that involves a partial mapping from a conceptual domain [o]nto another” (Romero and Soria 2014: 490). In a process of pragmatic adjustment, the addressee of a metaphor activates only those properties and evaluations of the source in the mapping that are pertinent in the specific situation at hand. Such pragmatic adjustment takes the form of a matching process fine-tuning the mappings from source to target domain that is to result in the understanding of the now metaphorically transformed target (“Robert-as-bulldozer”).

3.14 SUMMARY

Given a very broad understanding of visuals as any non-verbal representation accessed with the eyes that carries potential meaning, it has been argued in this chapter that the RT model can be applied to ostensive visuals in a substantial number of respects. A communicator deploying visuals ostensively wants to inform an addressee or audience of something (and/or express an attitude, belief, or emotion vis-à-vis that information), and communicate this intention by any of a wide range of attention-grabbing devices. Like ostensive verbal stimuli and stimuli in other modes, these visuals thus come with the promise that they are worth the audience’s attention, since they have something to convey that is supposedly relevant to the audience. Whether this promise is actually fulfilled depends, as always, on whether the visuals trigger any positive cognitive/emotional effects in the audience, and whether these effects are balanced by the amount of mental effort this audience needs to invest to process these effects.
Ostensive visuals, whether or not accompanied by language or other
modes, thus come with the presumption of relevance. Even though visuals
do not have a grammar or a vocabulary in the way languages have, visuals
have parts that are decoded, either because they resemble their counterparts
in reality or because they have a meaning that has been ascribed to them
convention- ally. All visuals are in one way or another incomplete and require
that the ad- dressee assign a referent to one or more of their elements. In
addition, usually various forms of enrichment are called for, as many visual
communicators— for instance, in cartoons and drawings—tend to leave out
certain details in the interest of optimizing relevance by reducing
processing effort on the part of the addressees. Enrichment can also take
the form of the recognition of intertextual references, or the awareness that
different mental concepts have been conflated into a “blend.” Even after
performing these mental operations, however, the addressee will not be able to
derive explicatures and implicatures without taking into account a huge
amount of contextual information. The reason for this is that visual
representations are governed by structuring principles but not by a
grammar specifying which elements are admissible in them and how these
elements can be combined. Consequently, most enriched visuals are in need of
additional information, such as that supplied in the verbal text
that accompanies the picture or in the situational context in which the
visuals are ostensively used, to be capable of being judged relevant. One
consequence of this is that while all visuals have implicatures, it remains a
matter of debate to what extent we should routinely ascribe explicatures to them.
But inasmuch as some subtypes of visuals, and some parts of visuals, I
have argued, are completely coded, they can be claimed to give rise to
explicatures. Since this is a controversial claim that will require further
debate in the RT community, each time I refer to an explicature in visuals,
this is to be understood as meaning a visual explicature. It will thus still
have to be resolved whether “verbal explicatures” and “visual explicatures”
can ultimately be conflated or whether these terms should be
understood as referring to related but distinct concepts.
Finally, visuals can both describe actual and possible states of affairs
and interpret other agents’ descriptions, and can do so both literally and
non-literally. They are thereby capable of metarepresentation.

Visuals usually function as part of mass-communicative messages. The
next chapter will therefore address the question of how the key notion of
“relevance to an individual” fares in situations where Mary needs not
just to be optimally relevant to dear Peter, but instead needs to take into
account the cognitive environments of dozens, thousands, or millions of
envisaged addressees.
