LLM4SGG: Large Language Model For Weakly Supervised Scene Graph Generation
Kibum Kim1 Kanghoon Yoon1 Jaehyeong Jeon1 Yeonjun In1 Jinyoung Moon2
Donghyun Kim3 Chanyoung Park1 *
1 KAIST   2 ETRI   3 Korea University
{kb.kim,ykhoon08,wogud405,yeonjun.in,cy.park}@kaist.ac.kr, jymoon@etri.re.kr, d_kim@korea.ac.kr
arXiv:2310.10404v6 [cs.CV] 21 Mar 2024
Abstract

Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked two issues involved in the triplet formation process from the captions: 1) the semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) the low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and the alignment of entity/predicate classes with the target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images. Our code is available at https://github.com/rlqja1107/torch-LLM4SGG

1. Introduction

Scene Graph Generation (SGG) is a fundamental task in computer vision, aiming at extracting structured visual knowledge from images [21, 32, 36, 39, 45, 56]. Most existing SGG methods are fully-supervised, i.e., they heavily rely on ground-truth annotations that involve the class information of entities and predicates as well as the bounding boxes of entities [19, 48, 50]. However, since creating extensively annotated scene graph datasets is costly, the heavy reliance on these annotations imposes practical limitations on model training [54]. To mitigate the high cost associated with manual annotations, weakly-supervised scene graph generation (WSSGG) approaches have recently emerged, aiming at training an SGG model without any annotated scene graph dataset. Specifically, the main idea of recent WSSGG methods is to leverage image captions along with associated images, as they can be easily collected from the Web [20, 47, 54, 57].

The training process of a WSSGG model using image captions requires four steps, as illustrated in Figure 1(a). Step 1: Preparing an image and its caption. Step 2: Parsing the image caption, i.e., triplets formed as ⟨subject, predicate, object⟩ are extracted from the image caption through an off-the-shelf parser [30, 43]. Step 3: Aligning the triplets in the caption with entity/predicate classes of interest, i.e., the entity (subject, object) and predicate classes in the triplets obtained in Step 2 are aligned with the entity and predicate classes in the target data, respectively. This alignment is based on their synonyms/hypernyms/hyponyms contained in an external knowledge base (KB), e.g., WordNet [25]. Step 4: Grounding unlocalized entities in the extracted triplets, i.e., unlocalized entities (subjects and objects) are matched with relevant image regions generated by a pre-trained object detector, e.g., Faster R-CNN [29]. The localized entities and predicates in the extracted triplets then serve as pseudo-labels for training an SGG model.

Existing WSSGG approaches mainly focus on Step 4 [20, 33, 47, 57]. For example, in Figure 1(a), their efforts have been focused on grounding the entity person in an unlocalized triplet with an image region that captures the sitting behavior. More precisely, LSWS [47] exploits the contextual object information to accurately ground the unlocalized entities, leveraging the linguistic structure embedded within the triplets. Another line of research [20] employs a pre-trained vision-language model [15] to reflect the semantic interactions among entities within the image caption.

Figure 1. (a) The pipeline of weakly-supervised SGG. (b) The predicate distribution of unlocalized triplets (Parser+KB vs. Ours). In Parser+KB, the distribution becomes heavily long-tailed, and 12 out of 50 predicates are non-existent. (c) Semantic over-simplification caused by a rule-based parser in Step 2. (d) Low-density scene graph caused by the static structure of KB in Step 3.

However, we argue that existing WSSGG approaches overlook the importance of the triplet formation process conducted in Step 2 and Step 3. We identify the two major issues described below, i.e., semantic over-simplification and low-density scene graph, which incur incomplete unlocalized triplets after Steps 2 and 3. These incomplete triplets mostly contain uninformative predicates and are limited in number, and they negatively impact the training of an SGG model even when entities are correctly grounded in Step 4. To demonstrate the impact of incomplete unlocalized triplets, we follow the conventional process to extract unlocalized triplets (i.e., Steps 1-3), and conduct an examination of triplets obtained from the COCO caption dataset, which are generated through the Scene Parser [43] in Step 2 and WordNet [25] in Step 3. As a result, we identify the following two issues:

• Semantic Over-simplification: We find that the standard scene graph parser [43] commonly used in Step 2, which operates on heuristic rule-based principles, leads to a semantic over-simplification of predicates in the extracted triplets. In other words, fine-grained predicates are undesirably converted into coarse-grained predicates, which we refer to as semantic over-simplification. For example, in Figure 1(c), an informative predicate lying on (i.e., a fine-grained predicate) in the image caption is converted into a less informative predicate on (i.e., a coarse-grained predicate), because the rule-based parser fails to capture the predicate lying on at once, and its heuristic rules fall short of accommodating the diverse range of caption structures. As a result, the predicate distribution becomes heavily long-tailed, in which coarse-grained predicates (e.g., with, on, in) greatly outnumber fine-grained predicates (e.g., parked on, covered in) (Figure 1(b)). To make the matter worse, numerous fine-grained predicates eventually end up with a frequency of 0, even though they are originally present in the captions. Specifically, 12 out of 50 predicates are non-existent, which means that these 12 predicates can never be predicted, since the model is not trained on them at all.

• Low-Density Scene Graph: We find that the KB-based triplet alignment in Step 3 leads to low-density scene graphs, i.e., the number of remaining triplets after Step 3 is small. We attribute the low-density scene graphs primarily to the utilization of the KB in Step 3. Specifically, a triplet is discarded if any of its three components (i.e., subject, predicate, object) or their synonyms/hypernyms/hyponyms fail to align with the entity or predicate classes in the target data. For example, in Figure 1(d), the triplet ⟨elephant, carrying, log⟩ is discarded because neither log nor its synonyms/hypernyms exist in the Visual Genome dataset, even though elephant and carrying do exist. In Table 1, we report the number of triplets and images in the Visual Genome dataset, a common benchmark used in fully-supervised SGG approaches, and the COCO caption dataset, a common benchmark used in weakly-supervised SGG approaches. We observe that on average the Visual Genome dataset contains 7.1 triplets (i.e., 405K/57K) per image (see Table 1(a)), while the COCO caption dataset contains only 2.4 triplets (i.e., 154K/64K) per image (see Table 1(b)). This indicates that existing WSSGG approaches suffer from a lack of sufficient supervision per image, leading to poor generalization and performance degradation [47, 49].
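To make the Step 3 failure mode concrete, the KB-based alignment can be sketched as below; the target class lists and the mini knowledge base are illustrative stand-ins for Visual Genome's class lists and WordNet, not the actual resources.

```python
# Illustrative sketch of the conventional Step 3 (KB-based alignment).
# TARGET_ENTITIES/TARGET_PREDICATES and KB are toy stand-ins for the
# target data's class lists and WordNet; they are not the real resources.

TARGET_ENTITIES = {"elephant", "branch", "person", "animal"}
TARGET_PREDICATES = {"carrying", "sitting on", "on"}

# Toy KB: maps a word to its synonyms/hypernyms/hyponyms.
KB = {
    "pony": {"animal"},   # hypernym that is present in the target classes
    "log": {"wood"},      # related word, but "wood" is not a target class
}

def align(word, target_classes):
    """Return the aligned target class for `word`, or None if none matches."""
    if word in target_classes:
        return word
    for related in KB.get(word, ()):
        if related in target_classes:
            return related
    return None

def align_triplet(subj, pred, obj):
    """Keep a triplet only if all three components align; otherwise discard."""
    s = align(subj, TARGET_ENTITIES)
    p = align(pred, TARGET_PREDICATES)
    o = align(obj, TARGET_ENTITIES)
    return (s, p, o) if None not in (s, p, o) else None

# <pony, sitting on, person> survives via the KB (pony -> animal), while
# <elephant, carrying, log> is discarded because "log" cannot be aligned.
print(align_triplet("pony", "sitting on", "person"))
print(align_triplet("elephant", "carrying", "log"))
```

The all-or-nothing rule in `align_triplet` is what drives the low-density issue: a single unalignable component drops the entire triplet.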
In summary, relying on the static structured knowledge of the KB is insufficient to cover the semantic relationships among a wide range of words, which incurs the low-density scene graph after Step 3.

Dataset              | How to annotate | # Triplet | # Image
Fully-supervised approach
(a) Visual Genome    | Manual          | 405K      | 57K
Weakly-supervised approach
(b) COCO Caption     | Parser+KB       | 154K      | 64K
(c) COCO Caption     | LLM             | 344K      | 64K

Table 1. Comparison of scene graph density.

To alleviate the semantic over-simplification and low-density scene graph issues, we propose a new approach, namely Large Language Model for weakly-supervised SGG (LLM4SGG), that adopts a pre-trained Large Language Model (LLM), which has shown remarkable transferability to various downstream tasks in NLP such as symbolic reasoning, arithmetic, and common-sense reasoning [2, 5, 38]. Inspired by the idea of Chain-of-Thought² (CoT) [41], which arrives at an answer in a stepwise manner, we separate the triplet formation process into two chains, each of which replaces the rule-based parser in Step 2 (i.e., Chain-1) and the KB in Step 3 (i.e., Chain-2). More precisely, we design a prompt for extracting triplets from a caption, and ask the LLM to extract triplets formed as ⟨subject, predicate, object⟩ (Chain-1). We expect that the predicates extracted based on a comprehensive understanding of the caption's context via the LLM are semantically rich, thereby alleviating the semantic over-simplification issue. Besides, to alleviate the low-density scene graph issue, we additionally incorporate a paraphrased version of the original caption. To this end, we further design a prompt for paraphrasing the original caption and extracting more triplets from the paraphrased caption. However, entities and predicates in the triplets obtained after Chain-1 are not yet aligned with the target data. Hence, we design another prompt to align them with entity/predicate classes of interest, and ask the LLM to align them with semantically relevant lexemes included in a predefined lexicon, which is the set of vocabularies present in the target data (Chain-2). To further engage the LLM in the reasoning process of Chain-1 and Chain-2, we employ in-context few-shot learning, which incorporates a few input-output examples within the prompt, enabling the LLM to perform the task without the need for fine-tuning.

²We use the CoT strategy as a means to arrive at an answer in a stepwise manner, which differs from Chain-of-Thought prompting.

To validate the effectiveness of LLM4SGG, we apply it to state-of-the-art WSSGG methods [54, 57]. Through extensive experiments, we show that LLM4SGG significantly enhances the performance of existing WSSGG methods in terms of mean Recall@K and Recall@K on Visual Genome and GQA datasets by alleviating the semantic over-simplification and low-density scene graph issues (see Table 1(c), where the number of triplets increased to 344K). A further appeal of LLM4SGG is that it is data-efficient, i.e., it outperforms state-of-the-art baselines even with a small amount of training images, verifying the effectiveness of LLM4SGG.

In summary, we make the following contributions:
• We identify two major issues overlooked by existing WSSGG studies, i.e., semantic over-simplification and low-density scene graph.
• We leverage an LLM along with the CoT strategy and the in-context few-shot learning technique to extract informative triplets without the need for fine-tuning the LLM. To the best of our knowledge, we are the first to leverage an LLM for the SGG task.
• LLM4SGG outperforms the state-of-the-art WSSGG methods, especially in terms of mR@K, demonstrating its efficacy in addressing the long-tail problem in WSSGG for the first time.

2. Related Works

Weakly-Supervised Scene Graph Generation (WSSGG). The WSSGG task aims to train an SGG model without relying on an annotated scene graph dataset. To achieve this, most WSSGG studies [20, 47, 54, 57] utilize image captions and ground unlocalized triplets with image regions. Specifically, VSPNet [49] proposes an iterative graph alignment algorithm to reflect the high-order relations between unlocalized triplets and image regions. SGNLS [57] uses a pre-trained object detector [29] to ground the entities in unlocalized triplets that share the same classes with the output of the object detector. In addition to the information derived from the object detector, [20] employs a pre-trained vision-language model [15] to capture the semantic interactions among objects. VS3 [54] uses a grounding-based object detector [18], which calculates the similarity between the entity text in an unlocalized triplet and image regions, thereby grounding the unlocalized triplets. However, these methods overlook the triplet formation process, which leads to the semantic over-simplification (Step 2) and the low-density scene graph (Step 3). In this regard, existing methods result in sub-optimal performance even when unlocalized triplets are correctly grounded in Step 4.

Large Language Model (LLM). LLMs have demonstrated remarkable transferability to various downstream tasks such as symbolic reasoning, arithmetic, and common-sense reasoning [2, 5, 38]. Specifically, GPT-3 (175B) [2] stands as a cornerstone that broke new ground across numerous language tasks. Inspired by GPT-3, PaLM (540B) [5], LLaMA (65B) [38], OPT (175B) [53], and LaMDA (137B) [37] have been subsequently introduced. More recently, advanced GPT models (e.g., GPT-4 [28], ChatGPT [27]) fine-tuned with human feedback have gained prominence and have been widely applied to diverse applications, e.g., planners of tools [9, 24] and mobile
task automation [42]. In this work, we employ the power of an LLM (i.e., ChatGPT) to alleviate the two issues, i.e., semantic over-simplification and low-density scene graph, in the context of the WSSGG task.

In-Context Few-shot Learning. In-context few-shot learning incorporates a few input-output examples related to a target task, conditioning the LLM on the context of the examples. Specifically, GPT-3 [2] pioneered the concept of in-context learning to facilitate an LLM as a versatile model for diverse tasks. This breakthrough has spawned a plethora of research leveraging in-context few-shot learning for various tasks [24, 26, 46, 55]. More precisely, Chameleon [24] integrates a few examples to enhance its understanding of the tool planning task. [26] utilizes positive and negative examples related to questions for question generation tasks. ReAct [46] incorporates examples of reasoning with actions for solving decision-making tasks. Inspired by recent in-context few-shot learning approaches, we provide a few examples to LLMs to help 1) understand the process of triplet extraction from a caption (i.e., Step 2), and 2) align the entity/predicate classes with the target data (i.e., Step 3) in the context of the WSSGG task.

3. Method

In this section, we describe LLM4SGG in detail. We would like to emphasize that LLM4SGG mainly focuses on the triplet formation process conducted in Step 2 (parsing) and Step 3 (aligning), while existing WSSGG approaches mainly focus on Step 4 (grounding). We start by presenting the problem formulation of WSSGG (Section 3.1), followed by the prompt configuration (Section 3.2). Next, we introduce how LLMs are adopted to address the two issues of conventional WSSGG approaches when parsing the image caption (Section 3.3) and aligning the triplets in captions with entity/predicate classes of interest (Section 3.4). Finally, we ground the unlocalized triplets by associating them with bounding boxes (i.e., image regions) and train the SGG model using the localized triplets (Section 3.5). The overall pipeline of LLM4SGG is shown in Figure 2.

3.1. Problem Formulation

In the weakly-supervised SGG task, we aim to generate a scene graph when the ground-truth scene graph is not given, i.e., there are no bounding boxes and no entity/predicate class information. Instead, existing WSSGG approaches [20, 33, 47, 54, 57] use image captions along with associated images to produce scene graphs, i.e., localized triplets. More precisely, they extract a set of triplets Gw = {⟨si, pi, oi⟩}_{i=1}^{Nw} from the image captions, where Nw is the number of triplets extracted from the captions. However, while the extracted triplets contain the class information (i.e., si,c, oi,c and pi,c), they are unlocalized, since the bounding boxes si,b and oi,b are not included in the caption. Therefore, it is essential to perform the grounding step to associate the unlocalized triplets with bounding boxes. Once we have obtained the localized triplets, we can apply the conventional SGG training scheme, i.e., Tθ : I → Gf. In this paper, our focus is to address the semantic over-simplification and low-density scene graph issues regarding the unlocalized triplets Gw, which have been overlooked in existing WSSGG studies. Specifically, we aim to produce an enhanced Gw by refining the process of scene graph dataset construction via an LLM. This refinement includes the triplet extraction step from the caption (Step 2) and the alignment of entity/predicate classes (Step 3), leveraging the LLM's comprehensive understanding of language and reasoning ability.

3.2. Prompt Configuration

In fact, it is not a trivial task for an LLM to immediately generate triplets from a caption whose entities and predicates are aligned with entity/predicate classes of interest, as such a task is novel for the LLM. Inspired by the idea of Chain-of-Thought (CoT) [41], which arrives at an answer in a stepwise manner, we separate the triplet formation process into the following two chains: Chain-1 – extracting triplets from captions; Chain-2 – aligning entities and predicates with the entity/predicate classes of interest. To carefully design each chain, we define the LLM function, i.e., LLM(·), with the following prompt input:

Output = LLM(Task description, In-context examples, Actual question),  (1)

where the three arguments together constitute the prompt input.
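As a rough illustration, the prompt input of Equation 1 can be assembled as plain text; the `build_prompt` helper and the example Q/A pair below are hypothetical, and a real implementation would send the resulting string to an LLM API.

```python
# Sketch of the Equation 1 prompt: a task description, K in-context examples,
# and the actual question are concatenated into a single prompt input.
# The helper and the example Q/A pair are illustrative, not the paper's exact prompt.

def build_prompt(task_description, examples, actual_question):
    parts = [task_description]
    for question, answer in examples:          # in-context few-shot examples
        parts.append(f"Q. {question}\nA. {answer}")
    parts.append(f"Q. {actual_question}\nA.")  # the LLM completes the final answer
    return "\n\n".join(parts)

task = ("From the given sentence, the task is to extract meaningful triplets "
        "formed as <subject, predicate, object>.")
examples = [
    ('Given the sentence "A man riding a horse," extract meaningful triplets.',
     "<man, riding, horse>"),
]
question = ('Given the sentence "A surfer on rock holding wooden surfboard," '
            "extract meaningful triplets.")

prompt = build_prompt(task, examples, question)
print(prompt)
```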
Figure 2. The pipeline of LLM4SGG. Given an image with its caption, we use an LLM to extract triplets from the original caption (Step 2-1) and the paraphrased caption (Step 2-2). Then, we align the entity/predicate classes within the extracted triplets with semantically similar lexemes in the target data via an LLM (Step 3), obtaining the unlocalized triplets. Lastly, we ground the unlocalized triplets over image regions (Step 4), followed by the training of an SGG model.
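The two LLM steps in Figure 2 can be sketched end-to-end as follows; the extraction output and the alignment dictionaries are illustrative stand-ins for actual LLM responses (a real implementation would issue the Chain-1 and Chain-2 prompts to an LLM such as ChatGPT).

```python
# Sketch of Steps 2-3 in Figure 2, with LLM calls replaced by stand-ins.

def chain1_extract(caption):
    """Chain-1 stand-in: pool triplets from the original caption and from an
    LLM-paraphrased caption (the lists mimic LLM outputs for the Figure 2
    example caption)."""
    from_original = [("surfer", "on", "rock"), ("surfer", "holding", "surfboard")]
    from_paraphrase = [("surfer", "stands on", "rock"), ("surfer", "holding", "surfboard")]
    return from_original + from_paraphrase

# Chain-2 stand-ins: lexeme -> target class (missing keys align to None).
ENTITY_ALIGN = {"surfer": "person", "rock": "rock", "surfboard": "surfboard"}
PREDICATE_ALIGN = {"on": "on", "stands on": "standing on", "holding": "holding"}

def chain2_align(triplets):
    """Align each component to the target classes, discard triplets with any
    unaligned (None) component, and merge duplicates."""
    aligned = []
    for s, p, o in triplets:
        s2, p2, o2 = ENTITY_ALIGN.get(s), PREDICATE_ALIGN.get(p), ENTITY_ALIGN.get(o)
        if None not in (s2, p2, o2):
            aligned.append((s2, p2, o2))
    return list(dict.fromkeys(aligned))  # de-duplicate while keeping order

unlocalized = chain2_align(chain1_extract("A surfer on rock holding wooden surfboard"))
print(unlocalized)
```

Pooling triplets from both the original and the paraphrased caption is what densifies the scene graph; the None filter at the end mirrors the discard rule of the conventional approach.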
3.3. Chain-1: Triplet Extraction via LLM (Step 2 in Figure 2)

Based on the LLM's comprehensive understanding of the context of an image caption, we aim to extract triplets from the caption. As discussed in Section 1, we use not only the original caption but also a paraphrased caption generated by the LLM to address the low-density scene graph issue.

Extracting triplets from a paraphrased caption (Step 2-2). To extract triplets from a paraphrased caption, we inform the LLM about the task at hand by presenting the following prompt: FROM THE GIVEN SENTENCE, THE TASK IS TO EXTRACT MEANINGFUL TRIPLETS FORMED AS ⟨SUBJECT, PREDICATE, OBJECT⟩ (i.e., Task description in Equation 1). We then instruct the LLM to follow two steps, i.e., a paraphrasing step and a triplet extraction step. To help the LLM understand the process of performing the two steps, we present a few examples to the LLM that describe how to answer the questions (i.e., In-context examples in Equation 1)³, which involves a manual construction of questions and corresponding answers for the paraphrasing and triplet extraction steps. That is, for a given caption, we show the LLM how we expect a paraphrased caption and the extracted triplets to look (e.g., given "Four clocks sitting on a floor next to a woman's feet," we show the LLM that a paraphrased sentence would be "Four clocks are placed on the floor beside a woman's feet," and the extracted triplets would be ⟨clocks, placed on, floor⟩ and ⟨clocks, beside, feet⟩). Lastly, we show the caption of our interest to the LLM, and let the LLM extract triplets from the caption (i.e., Actual question in Equation 1). Please refer to Figure 2 right (bottom) for an example of the prompt input used in Step 2-2.

Extracting triplets from the original caption (Step 2-1). As extracting triplets from the original caption does not involve the caption paraphrasing step, we exclude it from the prompt used in Step 2-2. Please refer to Figure 2 right (top) for an example of the prompt input used in Step 2-1.

In summary, we obtain triplets from both the original and paraphrased captions after Step 2-1 and Step 2-2, respectively, which in turn alleviates the semantic over-simplification issue of predicates and the low-density scene graph issue.

³We use captions in the COCO caption dataset for the examples.

3.4. Chain-2: Alignment of Classes in Triplets via LLM (Step 3 in Figure 2)

The entities (i.e., subjects and objects) and predicates within the triplets obtained from Step 2 described in Section 3.3 are not yet aligned with the target data. Based on the semantic reasoning ability of the LLM, we aim to align them with the semantically relevant lexemes in the target data.

Aligning entities in the triplets with entity classes of interest. We instruct the LLM with the following prompt: GIVEN THE LEXEME {ENTITY}, FIND SEMANTICALLY RELEVANT LEXEME IN THE PREDEFINED ENTITY LEXICON, where the predefined entity lexicon is Ce (i.e., Task description in Equation 1). Similar to Section 3.3, we present a few examples to the LLM that describe how to answer the questions (i.e., In-context examples in Equation 1). For example, we provide the LLM with a few examples regarding hierarchical relationships, such as pigeon being semantically relevant to bird, and singular-plural relationships, such as surfboards being semantically relevant to surfboard. Lastly, we show the entity of our interest to the LLM (i.e., Actual question in Equation 1), which enables the LLM to generate an answer by finding the most
semantically relevant entity in Ce . Please refer to Table 7 in Genome (VG) caption [44]. For fair comparisons, we use
Appendix A.1 for an example of the prompt input. the same set of images that have been utilized in previous
Aligning predicates in the triplets with predicate classes WSSGG studies [20, 47, 54, 57], leading to the utilization
of interest. Likewise, we instruct the LLM with the fol- of 64K images on COCO caption dataset, 145K images on
lowing prompt: G IVEN THE LEXEME { PREDICATE }, FIND CC caption dataset, and 57K images on VG caption dataset.
SEMANTICALLY RELEVANT LEXEME IN THE PREDEFINED To evaluate the trained SGG model, we employ the widely
PREDICATE LEXICON , where the predefined predicate lex- used Visual Genome (VG) dataset [13] and GQA dataset
icon is Cp (i.e., Task description in Equation 1). We also [11]. The VG dataset contains the ground-truth localized
present a few examples to the LLM that describe how to triplet information annotated by humans. We follow the
answer the questions (i.e., In-context examples in Equa- standard split of VG [44], which consist of 150 entity
tion 1). For example, we provide the LLM with a few exam- classes and 50 predicate classes. For the GQA dataset used
ples regarding tense relationships such as the lies on being in the SGG task, we follow the same pre-processing step of
semantically relevant to lying on, and positional relation- a previous SGG study [8], which involves selecting top-200
ship such as next to being semantically relevant to near. frequent entity classes and top-100 frequent predicate
Lastly, we show the predicate of our interest to the LLM classes. In both datasets, 30% of the total images are used
(i.e., Actual question in Equation 1). Please refer to Ta- for evaluation. Please refer to Appendix C.1 for more
ble 8 in Appendix A.1 for an example of the prompt in- details regarding the datasets. Note that we mainly use
put. Furthermore, regarding the alignment of classes within the COCO caption dataset for analysis of LLM4SGG
a large predefined lexicon, please refer to Appendix A.2. throughout this paper, i.e., CC and VG caption datasets are
Note that as our main focus in this work is on developing exclusively used in quantitative result on VG (Section 4.1).
a framework for utilizing LLMs in weakly-supervised SGG Evaluation metrics. Recent fully-supervised SGG stud-
rather than finding a specific prompt design that works the ies [1, 48, 51] have emphasized improving the accuracy of
best, we tested with a single prompt design. However, since predictions for fine-grained predicates rather than coarse-
prompt designs are crucial in leveraging LLMs, we plan to grained predicates, since the former construct richer scene
explore various designs in our future work. graphs. As a result, they commonly use mean Recall@K
After performing Chain-1 (Section 3.3) and Chain-2 (mR@K) that computes the average of Recall@K (R@K)
(Section 3.4), we obtain intermediate unlocalized triplets across all predicates. In line with the recent emphasis on
N̂w fine-grained predicates, we incorporate both mR@K and
Ĝw = {si , pi , oi }i=1 , where si,c , oi,c ∈ {Ce ∪ None} and
pi,c ∈ {Cp ∪ None}. It is worth noting that if there is no se- R@K in our evaluation, whereas previous WSSGG studies
mantically relevant lexeme, we request the LLM to generate [20, 47, 54] mainly rely on the R@K metric alone. More-
None as the answer, due to the fact that the entity/predicate classes in the target data may not cover a wide range of entities/predicates. Similar to the conventional approach, we discard a triplet if any of its three components (i.e., subject, predicate, and object) is None. Lastly, we obtain the final unlocalized triplets Gw = {si, pi, oi}_{i=1}^{Nw} (Nw ≤ N̂w), where si,c, oi,c ∈ Ce and pi,c ∈ Cp.

3.5. Model Training

Given the final unlocalized triplets Gw, we ground them over relevant image regions to obtain localized triplets, i.e., we obtain si,b and oi,b. To this end, we employ two state-of-the-art grounding methods, i.e., SGNLS [57] and VS3 [54]. Please refer to Appendix A.3 for more detail about how each method performs grounding. After grounding Gw, we obtain localized triplets and use them as pseudo-labels for training a supervised SGG model. Please refer to Appendix A.4 for more detail about the model training.

4. Experiment

Datasets. To train an SGG model without an annotated scene graph dataset, we use three caption datasets: COCO caption [3], Conceptual (CC) caption [31], and Visual Genome (VG) caption [13].

Moreover, we report F@K, which is the harmonic mean of R@K and mR@K, to jointly consider R@K and mR@K, following previous SGG studies [12, 51]. Regarding the evaluation task, we follow previous WSSGG studies and adopt the Scene Graph Detection (SGDet) task, where neither the ground-truth bounding boxes nor the entity class information is provided. Please refer to Appendix C.2 for more detail regarding the task.

Baselines. Please refer to Appendix C.3 for details regarding the baselines.

Implementation details. Please refer to Appendix C.4 regarding the implementation details.

4.1. Quantitative Result on VG

Table 2 shows the performance of baseline models and those when LLM4SGG is applied. We have the following observations based on the COCO caption dataset: 1) Applying LLM4SGG to SGNLS and VS3 improves the performance in terms of R@K and mR@K, which demonstrates the effectiveness of the triplet formation through LLM4SGG. Notably, LLM4SGG significantly improves mR@K, implying that LLM4SGG effectively alleviates the long-tailed problem in WSSGG. This can be clearly seen in Figure 3, which shows the performance gain on fine-grained predicates.
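As a concrete reference for the metric described above, F@K is simply the harmonic mean of R@K and mR@K. A minimal sketch (the example values are taken from the VS3+LLM4SGG COCO caption row of Table 2):

```python
def f_at_k(recall_k: float, mean_recall_k: float) -> float:
    """Harmonic mean of R@K and mR@K, jointly reflecting both metrics."""
    if recall_k + mean_recall_k == 0:
        return 0.0
    return 2 * recall_k * mean_recall_k / (recall_k + mean_recall_k)

# Example: R@50 = 8.91, mR@50 = 7.11 (VS3 + LLM4SGG on COCO caption)
print(round(f_at_k(8.91, 7.11), 2))  # → 7.91, matching the F@50 column
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot obtain a high F@K by excelling only on head predicates (high R@K) while neglecting tail predicates (low mR@K).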
| Method | Dataset | R@50 | R@100 | mR@50 | mR@100 | F@50 | F@100 |
|---|---|---|---|---|---|---|---|
| Motif (CVPR'18) – Fully-supervised | VG | 31.89 | 36.36 | 6.38 | 7.57 | 10.63 | 12.53 |
| LSWS (CVPR'21) | COCO Caption | 3.29 | 3.69 | 3.27 | 3.66 | 3.28 | 3.67 |
| SGNLS (ICCV'21) | COCO Caption | 3.80 | 4.46 | 2.51 | 2.78 | 3.02 | 3.43 |
| SGNLS (ICCV'21)+LLM4SGG | COCO Caption | 5.09 (+1.29) | 5.97 (+1.51) | 4.08 (+1.57) | 4.49 (+1.71) | 4.53 (+1.51) | 5.13 (+1.70) |
| Li et al. (MM'22) | COCO Caption | 6.40 | 7.33 | 1.73 | 1.98 | 2.72 | 3.12 |
| VS3 (CVPR'23) | COCO Caption | 6.60 | 8.01 | 2.88 | 3.25 | 4.01 | 4.62 |
| VS3 (CVPR'23)+LLM4SGG | COCO Caption | 8.91 (+2.31) | 10.43 (+2.42) | 7.11 (+4.23) | 8.18 (+4.93) | 7.91 (+3.90) | 9.17 (+4.55) |
| VS3 (CVPR'23)+Rwt | COCO Caption | 4.25 | 5.04 | 5.17 | 5.99 | 4.67 | 5.47 |
| VS3 (CVPR'23)+Rwt+LLM4SGG | COCO Caption | 5.10 (+0.85) | 6.34 (+1.30) | 8.42 (+3.25) | 9.90 (+3.91) | 6.35 (+1.69) | 7.73 (+2.26) |
| VS3 (CVPR'23) | CC Caption | 6.69 | 8.20 | 1.73 | 2.04 | 2.75 | 3.27 |
| VS3 (CVPR'23)+LLM4SGG | CC Caption | 9.47 (+2.78) | 10.69 (+2.49) | 5.40 (+3.67) | 6.09 (+4.05) | 6.88 (+4.13) | 7.76 (+4.49) |
| VS3 (CVPR'23) | VG Caption | 14.54 | 18.48 | 2.80 | 3.79 | 4.70 | 6.29 |
| VS3 (CVPR'23)+LLM4SGG | VG Caption | 18.40 (+3.86) | 22.28 (+3.80) | 6.26 (+3.46) | 7.60 (+3.81) | 9.34 (+4.64) | 11.33 (+5.04) |

Table 2. Performance comparisons on the SGDet task. The best performance among WSSGG models within each dataset is in bold. The parenthesized (+) numbers indicate the absolute performance improvement after applying LLM4SGG. Rwt denotes using the reweighting strategy [51].
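The Rwt rows in Table 2 apply a predicate reweighting strategy to counter the long-tailed predicate distribution. As a hedged illustration only (the exact weighting scheme of [51] may differ), an inverse-frequency weighted cross-entropy over predicate classes can be sketched as follows; `logits`, `labels`, and `counts` are toy values:

```python
import numpy as np

def reweighted_ce(logits: np.ndarray, labels: np.ndarray, counts: np.ndarray) -> float:
    """Cross-entropy where each predicate class is weighted by its inverse
    training frequency (normalized so the weights average to 1)."""
    weights = 1.0 / np.maximum(counts, 1)
    weights = weights / weights.mean()          # rare predicates get weight > 1
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * nll).mean())

# Toy example: 3 predicate classes with a long-tailed count distribution
logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = np.array([0, 2])                       # second sample is a tail class
counts = np.array([1000, 100, 10])              # head vs. tail frequencies
print(reweighted_ce(logits, labels, counts))    # tail-class errors dominate
```

Upweighting tail classes trades head-predicate accuracy (R@K drops) for tail-predicate accuracy (mR@K rises), which is exactly the pattern visible in the Rwt rows of Table 2.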
| Row | Components | # Triplets | R@50 / 100 | mR@50 / 100 | F@50 / 100 |
|---|---|---|---|---|---|
| (a) | – | 154K | 6.60 / 8.01 | 2.88 / 3.25 | 4.01 / 4.62 |
| (b) | ✓ | 243K | 9.46 / 11.22 | 3.43 / 3.92 | 5.03 / 5.81 |
| (c) | ✓ ✓ | 256K | 8.42 / 9.85 | 5.99 / 6.95 | 7.00 / 8.15 |
| (d) | ✓ ✓ | 327K | 11.76 / 13.38 | 3.50 / 4.05 | 5.39 / 6.22 |
| (e) | ✓ ✓ ✓ | 344K | 8.91 / 10.43 | 7.11 / 8.18 | 7.91 / 9.17 |
| Method | R@50 / 100 | mR@50 / 100 | F@50 / 100 |
|---|---|---|---|
| Motif (Fully-supervised) | 28.90 / 33.10 | 6.40 / 7.70 | 10.48 / 12.49 |
| VS3 | 5.90 / 6.97 | 1.60 / 1.81 | 2.52 / 2.87 |
| VS3+LLM4SGG | 8.88 / 10.38 | 5.33 / 6.51 | 6.66 / 8.00 |
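All of the recalls above follow the SGDet matching rule described in Appendix C.2: a predicted triplet counts as correct only if its subject and object boxes overlap the ground-truth boxes with IoU > 0.5 and all three labels match. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format; `triplet_matches` is an illustrative helper, not the paper's evaluation code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def triplet_matches(pred, gt, thr=0.5):
    """A predicted (subj_box, subj_cls, pred_cls, obj_box, obj_cls) tuple
    matches a ground-truth triplet only if all labels agree and both boxes
    pass the IoU threshold."""
    s_box, s_cls, p_cls, o_box, o_cls = pred
    gs_box, gs_cls, gp_cls, go_box, go_cls = gt
    return (s_cls, p_cls, o_cls) == (gs_cls, gp_cls, go_cls) \
        and iou(s_box, gs_box) > thr and iou(o_box, go_box) > thr

print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # → 0.333 (50 / 150)
```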
References

[1] Bashirul Azam Biswas and Qiang Ji. Probabilistic debiasing of scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10429–10438, 2023.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[3] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[4] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer, 2020.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[6] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7373–7382, 2021.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Xingning Dong, Tian Gan, Xuemeng Song, Jianlong Wu, Yuan Cheng, and Liqiang Nie. Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19427–19436, 2022.
[9] Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. AssistGPT: A general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint arXiv:2306.08640, 2023.
[10] Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. Towards open-vocabulary scene graph generation with prompt-based finetuning. In European Conference on Computer Vision, pages 56–73. Springer, 2022.
[11] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[12] Siddhesh Khandelwal and Leonid Sigal. Iterative scene graph generation. Advances in Neural Information Processing Systems, 35:24295–24308, 2022.
[13] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[14] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
[15] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
[16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[17] Lin Li, Guikun Chen, Jun Xiao, Yi Yang, Chunping Wang, and Long Chen. Compositional feature augmentation for unbiased scene graph generation. arXiv preprint arXiv:2308.06712, 2023.
[18] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
[19] Rongjie Li, Songyang Zhang, Bo Wan, and Xuming He. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11109–11119, 2021.
[20] Xingchen Li, Long Chen, Wenbo Ma, Yi Yang, and Jun Xiao. Integrating object-aware and interaction-aware knowledge for weakly supervised scene graph generation. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4204–4213, 2022.
[21] Yongzhi Li, Duo Zhang, and Yadong Mu. Visual-semantic matching by exploring high-order attention and distraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786–12795, 2020.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[24] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842, 2023.
[25] George A Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[26] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.
[27] OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2023.
[28] OpenAI. GPT-4. Technical report, 2023.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[30] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, pages 70–80, 2015.
[31] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[32] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8376–8384, 2019.
[33] Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, and Chenliang Xu. A simple baseline for weakly-supervised scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16393–16402, 2021.
[34] Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, and Leonid Sigal. Energy-based learning for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13936–13945, 2021.
[35] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3716–3725, 2020.
[36] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
[37] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
[38] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[39] Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, and Xilin Chen. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1508–1517, 2020.
[40] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
[41] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[42] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Empowering LLM to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272, 2023.
[43] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6609–6618, 2019.
[44] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017.
[45] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10685–10694, 2019.
[46] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[47] Keren Ye and Adriana Kovashka. Linguistic structures as weak supervision for visual scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8289–8299, 2021.
[48] Kanghoon Yoon, Kibum Kim, Jinyoung Moon, and Chanyoung Park. Unbiased heterogeneous scene graph generation with relation-aware message passing neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3285–3294, 2023.
[49] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Weakly supervised visual semantic parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3736–3745, 2020.
[50] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural Motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5831–5840, 2018.
[51] Ao Zhang, Yuan Yao, Qianyu Chen, Wei Ji, Zhiyuan Liu, Maosong Sun, and Tat-Seng Chua. Fine-grained scene graph generation with data transfer. In European Conference on Computer Vision, pages 409–424. Springer, 2022.
[52] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
[53] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[54] Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, and Chang-Wen Chen. Learning to generate language-supervised and open-vocabulary scene graph using pre-trained visual-semantic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2915–2924, 2023.
[55] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
[56] Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, and Yin Li. Comprehensive image captioning via scene graph decomposition. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV, pages 211–229. Springer, 2020.
[57] Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, and Yin Li. Learning to generate scene graph from natural language supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1823–1834, 2021.
[58] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
LLM4SGG: Large Language Model for Weakly Supervised
Scene Graph Generation
Supplementary Material
A. Details of Method

A.1. Details of Prompt

We provide the complete prompts for extracting triplets from paraphrased captions (Table 5) and from original captions (Table 6). Moreover, we provide those for aligning entity classes (Table 7) and predicate classes (Table 8) with entity and predicate classes in the target data, respectively.

Task Description
From the given sentence, the task is to extract meaningful triplets formed as <subject, predicate, object>. To extract meaningful triplets from the sentence, please follow the following two steps.
Step 1: Paraphrase the sentence.
Step 2: From the paraphrased sentence obtained in the Step 1, extract meaningful triplets formed as <subject, predicate, object>. Note that the subject is the entity or noun that performs the action or is being described, and the object is the entity or noun that is affected by the action or is receiving the action. The predicate is a verb or adjective without auxiliary verb.
In-context Examples
Let's take a few examples to understand how to extract meaningful triplets.
Question: Given the sentence "a slice of bread is covered with a sour cream and guacamole," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: A piece of bread is topped with both sour cream and guacamole.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <bread, topped with, sour cream>, <bread, topped with, guacamole>.
The meaningful triplets are <bread, topped with, sour cream>, and <bread, topped with, guacamole>.
Question: Given the sentence "A beautiful woman walking a dog on top of a beach," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: A lovely woman strolling with a dog on the beach.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <woman, strolling with, dog>, <woman, on, beach>, <dog, on, beach>.
The meaningful triplets are <woman, strolling with, dog>, <woman, on, beach>, and <dog, on, beach>.
Question: Given the sentence "Four clock sitting on a floor next to a woman's feet," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: Four clocks are placed on the floor beside a woman's feet.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <clocks, placed on, floor>, <clocks, beside, feet>.
The meaningful triplets are <clocks, placed on, floor> and <clocks, beside, feet>.
Question: Given the sentence "One person sits in a chair looking at her phone while another rests on the couch," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: A person is seated in a chair, using their phone, while someone else is relaxing on the couch.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <person, seated in, chair>, <person, using, phone>, <person, relaxing on, couch>.
The meaningful triplets are <person, seated in, chair>, <person, using, phone>, and <person, relaxing on, couch>.
Question: Given the sentence "A lady and a child near a park bench with kites and ducks flying in the sky and on the ground," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: A woman and a child are close to a park bench, while kites soar through the sky and ducks move around on the ground.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <woman, close to, park bench>, <child, close to, park bench>, <kites, soar through, sky>, <ducks, move around, ground>.
The meaningful triplets are <woman, close to, park bench>, <child, close to, park bench>, <kites, soar through, sky>, and <ducks, move around, ground>.
Question: Given the sentence "Two men sit on a bench near the sidewalk and one of them talks on a cell phone," extract meaningful triplets.
Answer: Step 1: The sentence can be paraphrased as: Two guys are seated on a bench near the road, and one of them talks on a mobile phone.
Step 2: Meaningful triplets, where the subject and object are the simple noun, extracted from the paraphrased sentence are: <guys, seated on, bench>, <bench, near, road>, <guy, talks on, phone>.
The meaningful triplets are <guys, seated on, bench>, <bench, near, road>, and <guy, talks on, phone>.
Actual Question
Question: Given the sentence Input, extract meaningful triplets. Answer:

Table 5. Prompt for triplet extraction from paraphrased caption.

Task Description
From the given sentence, the task is to extract meaningful triplets formed as <subject, predicate, object>. Note that the subject is the entity or noun that performs the action or is being described, and the object is the entity or noun that is affected by the action or is receiving the action. The predicate is a verb or adjective without auxiliary verb, and is represented without the tense.
In-context Examples
Let's take a few examples to understand how to extract meaningful triplets.
Question: Given the sentence "a slice of bread is covered with a sour cream and guacamole," extract the meaningful triplets.
Answer: Meaningful triplets are <bread, covered with, sour cream>, and <bread, covered with, guacamole>.
Question: Given the sentence "A beautiful woman walking a dog on top of a beach," extract meaningful triplets.
Answer: Meaningful triplets are <woman, walking with, dog>, <woman, on, beach>, and <dog, on, beach>.
Question: Given the sentence "Four clock sitting on a floor next to a woman's feet," extract the meaningful triplets.
Answer: Meaningful triplets are <clock, sitting on, floor> and <clock, next to, feet>.
Question: Given the sentence "One person sits in a chair looking at her phone while another rests on the couch," extract meaningful triplets.
Answer: Meaningful triplets are <person, sits in, chair>, <person, looking at, phone>, and <person, rests on, couch>.
Question: Given the sentence "A lady and a child near a park bench with kites and ducks flying in the sky and on the ground," extract meaningful triplets.
Answer: Meaningful triplets are <lady, near, park bench>, <child, near, park bench>, <kites, flying in sky>, and <ducks, on, ground>.
Question: Given the sentence "Two men sit on a bench near the sidewalk and one of them talks on a cell phone," extract meaningful triplets.
Answer: Meaningful triplets are <men, sit on, bench>, <bench, near, sidewalk>, and <man, talks on, phone>.
Actual Question
Question: Given the sentence Input, extract meaningful triplets. Answer:

Table 6. Prompt for triplet extraction from original caption.

Task Description
The predefined entity lexicon containing 150 lexemes is numbered as follows: 1.airplane 2.animal 3.arm 4.bag 5.banana 6.basket 7.beach 8.bear 9.bed 10.bench 11.bike 12.bird 13.board 14.boat 15.book 16.boot 17.bottle 18.bowl 19.box 20.boy 21.branch 22.building 23.bus 24.cabinet 25.cap 26.car 27.cat 28.chair 29.child 30.clock 31.coat 32.counter 33.cow 34.cup 35.curtain 36.desk 37.dog 38.door 39.drawer 40.ear 41.elephant 42.engine 43.eye 44.face 45.fence 46.finger 47.flag 48.flower 49.food 50.fork 51.fruit 52.giraffe 53.girl 54.glass 55.glove 56.guy 57.hair 58.hand 59.handle 60.hat 61.head 62.helmet 63.hill 64.horse 65.house 66.jacket 67.jean 68.kid 69.kite 70.lady 71.lamp 72.laptop 73.leaf 74.leg 75.letter 76.light 77.logo 78.man 79.men 80.motorcycle 81.mountain 82.mouth 83.neck 84.nose 85.number 86.orange 87.pant 88.paper 89.paw 90.people 91.person 92.phone 93.pillow 94.pizza 95.plane 96.plant 97.plate 98.player 99.pole 100.post 101.pot 102.racket 103.railing 104.rock 105.roof 106.room 107.screen 108.seat 109.sheep 110.shelf 111.shirt 112.shoe 113.short 114.sidewalk 115.sign 116.sink 117.skateboard 118.ski 119.skier 120.sneaker 121.snow 122.sock 123.stand 124.street 125.surfboard 126.table 127.tail 128.tie 129.tile 130.tire 131.toilet 132.towel 133.tower 134.track 135.train 136.tree 137.truck 138.trunk 139.umbrella 140.vase 141.vegetable 142.vehicle 143.wave 144.wheel 145.window 146.windshield 147.wing 148.wire 149.woman 150.zebra.
Given the lexeme, the task is to find semantically relevant lexeme from the predefined entity lexicon. However, if there is no semantically relevant lexeme in the predefined entity lexicon, please answer 0.None.
In-context Examples
Let's take a few examples.
Question: Given the lexeme "water," find semantically relevant lexeme in the predefined entity lexicon. Answer: 0.None
Question: Given the lexeme "bus," find semantically relevant lexeme in the predefined entity lexicon. Answer: 142.vehicle
Question: Given the lexeme "steel," find semantically relevant lexeme in the predefined entity lexicon. Answer: 0.None
Question: Given the lexeme "vanity," find semantically relevant lexeme in the predefined entity lexicon. Answer: 110.shelf
Question: Given the lexeme "desktop," find semantically relevant lexeme in the predefined entity lexicon. Answer: 72.laptop
Question: Given the lexeme "cobble," find semantically relevant lexeme in the predefined entity lexicon. Answer: 104.rock
Question: Given the lexeme "poles," find semantically relevant lexeme in the predefined entity lexicon. Answer: 99.pole
Question: Given the lexeme "wastebasket," find semantically relevant lexeme in the predefined entity lexicon. Answer: 6.basket
Question: Given the lexeme "blue," find semantically relevant lexeme in the predefined entity lexicon. Answer: 0.None
Question: Given the lexeme "motorcyclist," find semantically relevant lexeme in the predefined entity lexicon. Answer: 98.player
Question: Given the lexeme "passenger," find semantically relevant lexeme in the predefined entity lexicon. Answer: 91.person
Question: Given the lexeme "pigeon," find semantically relevant lexeme in the predefined entity lexicon. Answer: 12.bird
Question: Given the lexeme "grass," find semantically relevant lexeme in the predefined entity lexicon. Answer: 96.plant
Question: Given the lexeme "surfboards," find semantically relevant lexeme in the predefined entity lexicon. Answer: 125.surfboard
Question: Given the lexeme "striped shirts," find semantically relevant lexeme in the predefined entity lexicon. Answer: 111.shirt
Actual Question
Question: Given the lexeme Input, find semantically relevant lexeme in the predefined entity lexicon. Answer:

Table 7. Prompt for alignment of entity class.

Question: Given the lexeme "hangs on," find semantically relevant lexeme in the predefined predicate lexicon. Answer: 19.hanging from
Actual Question
Question: Given the lexeme Input, find semantically relevant lexeme in the predefined predicate lexicon. Answer:

Table 8. Prompt for alignment of predicate class.

ally reach the maximum prompt length (e.g., 4096 tokens in ChatGPT). This makes it impossible to align a query entity with entity classes in a large predefined lexicon. To address this, as shown in Figure 5, we can construct several prompts in a hierarchical way. More precisely, taking the 1,594 entity classes in VinVL [52] as an example of a large predefined lexicon, we begin by separating the 1,594 entity classes into sub-groups,
each of which contains 200 or fewer entity classes. This enables us to avoid the maximum prompt length restriction and list all the entities of each sub-group within the prompt. Then, we ask the LLM to align the query entity (e.g., Ramen) with entities in each sub-group, yielding semantically relevant entities within each sub-group (e.g., breakfast, food truck, and noodle). Finally, we ask the LLM once more to align the query entity with the intermediate output, allowing us to identify the most semantically relevant entity (e.g., noodle) within the large predefined lexicon. Through this process, we can successfully align a query entity with entity classes in a large predefined lexicon. Aligning predicate classes in a larger predefined lexicon can be done similarly.

Figure 5. Framework for addressing large predefined lexicons when aligning classes with those of interest.

A.3. Details of Grounding Methods

To ground unlocalized triplets, we employ two state-of-the-art grounding methods, i.e., SGNLS [57] and VS3 [54]. Herein, we provide a detailed explanation of each grounding method.

SGNLS. SGNLS employs a pre-trained Faster R-CNN [29] object detector trained on Open Images [14] to ground unlocalized triplets. More precisely, it grounds the subject and object within the unlocalized triplet with image regions that share the same class with the subject/object. It is important to note that the set of 601 entity classes in Open Images does not completely cover the 150 entity classes in Visual Genome [13]. In other words, there are entity classes in Open Images that do not exist in Visual Genome. Therefore, a knowledge base (i.e., WordNet [25]) is used to align as many of Open Images' entity classes as possible with Visual Genome's entity classes. The aligned entity classes of Open Images are then used to compare with the subjects and objects in the unlocalized triplet for grounding. The predicate within the localized subject and object serves as a pseudo-label for training the SGG model.

VS3. VS3 employs a grounding-based object detector (i.e., GLIP [18]) to ground unlocalized triplets. Specifically, GLIP disentangles the task of object localization, which involves identifying the object's bounding box, from object recognition, which entails recognizing the class associated with that bounding box. Unlike the simultaneous object detection and recognition through the Region Proposal Network (RPN) in Faster R-CNN, GLIP initially detects bounding boxes. Then, given the text features corresponding to the entity classes in the target data, it calculates the similarity between these text features [7] and the visual features [6] of the bounding boxes. The grounding of the subject and object within an unlocalized triplet is achieved by choosing the bounding boxes with the highest similarity scores. Once subject and object grounding is achieved, the predicate serves as a pseudo-label for training the SGG model.

A.4. Details of Model Training

Here, we provide a detailed explanation of model training after obtaining the localized triplets, i.e., Gw = {si, pi, oi}_{i=1}^{Nw}, where si,b and oi,b are obtained from the grounding methods. We follow the original training strategy of each method, i.e., SGNLS [57] and VS3 [54].

SGNLS. SGNLS uses a Transformer-based UNITER model [4] on top of a pre-trained Faster R-CNN [29] object detector to capture contextual information of neighboring objects. Each contextualized representation of the subject and object is input into an entity classifier to generate entity logits. The entity classifier and UNITER model are then trained using cross-entropy loss, with supervision provided by the entity labels (i.e., si,c and oi,c), respectively. For the predicate, its representation is obtained by feed-forwarding the contextualized representations of the subject and object, and is fed into the predicate classifier. Then, the predicate classifier and UNITER model are trained using cross-entropy loss with supervision from the predicate class pi,c.

VS3. VS3 builds an additional predicate classifier on top of a pre-trained GLIP [18] object detector for predicate prediction. Specifically, the concatenation of the visual features and spatial information of si and oi is fed into an MLP to obtain the predicate representation. Based on this representation, the predicate classifier generates the predicate's logit. The predicate classifier and a cross-modal fusion module within GLIP are trained with supervision provided by the predicate class pi,c using cross-entropy loss. When training entities (subject and object), the approach is similar to the class recognition task in GLIP. Specifically, it maximizes the dot product between the text features of the entity classes (i.e., si,c and oi,c) and the visual features of their bounding boxes using binary focal loss [22].

B. Regarding the impact of grounding method on LLM4SGG

In Table 2 of the main paper, we observe that applying LLM4SGG to VS3 [54] (i.e., VS3+LLM4SGG) results in
greater performance improvement compared to applying it
Frequency
to SGNLS [57] (i.e., SGNLS+LLM4SGG). We provide a
detailed explanation regarding the impact of the grounding
method on LLM4SGG.
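The class-matching grounding used by SGNLS (Section A.3) can be sketched as follows. This is an illustrative reconstruction, not SGNLS's released code; the `ground_triplet` helper and its data layout are our own assumptions, and the WordNet-based alignment of Open Images classes to Visual Genome classes that precedes the class comparison is omitted.

```python
# Illustrative sketch (not SGNLS's actual code) of the class-matching
# grounding in Section A.3: the subject/object of an unlocalized triplet
# are grounded on detected regions whose predicted class matches theirs;
# the predicate then becomes a pseudo label for the localized pair.

def ground_triplet(triplet, detections):
    """triplet: (subj_class, predicate, obj_class);
    detections: list of (box, entity_class) pairs from the detector."""
    subj_cls, predicate, obj_cls = triplet
    subj_boxes = [box for box, cls in detections if cls == subj_cls]
    obj_boxes = [box for box, cls in detections if cls == obj_cls]
    if not subj_boxes or not obj_boxes:
        # Unmatched entity classes: the triplet is discarded, which is
        # the low-density scene graph issue discussed in Section B.
        return None
    return subj_boxes[0], predicate, obj_boxes[0]

detections = [((0, 0, 50, 80), "man"), ((0, 60, 200, 120), "beach")]
print(ground_triplet(("man", "walking on", "beach"), detections))
print(ground_triplet(("phone", "on", "table"), detections))  # -> None
```

Note how a single unmatched entity class (here, phone) discards the whole triplet, which is exactly why classes uncovered by Open Images shrink the training set.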
As mentioned in Section A.3, SGNLS includes the process of aligning the 601 entity classes from the Faster R-CNN [29] trained on Open Images [14] with the 150 entity classes in Visual Genome [13]. We find that 34 out of the 150 entity classes in Visual Genome are not aligned in the end; in other words, the 601 entity classes in Open Images do not cover these 34 entity classes. As a result, unlocalized triplets containing these 34 entity classes fail to be grounded, since no image region carries the corresponding classes, and are discarded rather than used for training. In fact, 100K among the 344K unlocalized triplets obtained through LLM4SGG are discarded, exacerbating the low-density scene graph issue. On the other hand, VS3 fully utilizes all 344K triplets. As mentioned in Section A.3, VS3 computes the similarity between the text features [7] of the entities (subject and object) in the unlocalized triplet and the visual feature [6] of each image region, and the image region with the highest score is grounded with that entity. This indicates that the subject and object are always successfully grounded with the image regions having the highest score. Therefore, all 344K triplets are grounded and used for training, effectively alleviating the low-density scene graph issue.

Taking a further step, in Section 3.4, we use an LLM to align the 601 entity classes in Open Images with the 150 entity classes in Visual Genome. However, we observe that even when an LLM is used for the alignment, 30 entity classes are still not aligned because they have completely different semantic meanings from the 601 entity classes in Open Images. For example, phone and racket do not overlap in semantic meaning with any of the 601 entity classes in Open Images. This suggests that the grounding method (i.e., SGNLS) with a Faster R-CNN trained on Open Images somewhat limits the effectiveness of LLM4SGG.

C. Experiment Setup

C.1. Dataset

The Visual Genome dataset [13] and the GQA dataset [11] are widely used in the SGG task. The Visual Genome dataset is a benchmark dataset used for evaluating the fully-supervised approach [12, 19, 48, 51]. In its test data, each image has 13.8 objects and 6.9 predicates on average. Recently, the GQA dataset has also been used for evaluating the fully-supervised approach [8, 10, 17, 34]. In its test data, each image contains 9.3 objects and 4.6 predicates on average. The predicate distributions of Visual Genome and GQA are plotted in Figure 6.

Figure 6. Predicate distribution for Visual Genome (Top) and GQA (Bottom) in test data.

C.2. Evaluation Protocol

In the SGDet task, a predicted triplet is considered correct if the regions corresponding to the subject and object overlap with the ground-truth boxes with an IoU > 0.5, while also having the correct subject and object labels. When incorporating a correct triplet into the R@K and mR@K performance, we first compute the triplet score by multiplying the subject score, predicate score, and object score for every subject-object pair in an image, and then sort the triplets by this score. If the correct triplet falls within the top-K of the sorted triplets, it contributes to the R@K and mR@K performance.

C.3. Baselines

To compare LLM4SGG with baselines, we include the fully-supervised approach [50] and weakly-supervised approaches [20, 47, 54, 57].
• Motif [50] (Fully-supervised): Based on the analysis of repeated patterns (i.e., motifs), this method employs a Bi-LSTM to capture the motifs appearing across images.
• LSWS [47]: This method applies a Graph Neural Network (GNN) to triplets extracted from captions to capture the linguistic structure present among the triplets, with the aim of improving the grounding of unlocalized triplets. Furthermore, it extends its capability by iteratively estimating scores between image regions and text entities.
• SGNLS [57]: To ground the unlocalized triplets over image regions, it leverages information from a pre-trained object detector (i.e., Faster R-CNN [29]). Specifically, when the class of a text entity matches the class of a bounding box, the text entity is grounded on that bounding box. Once localized triplets are acquired, a Transformer-based Uniter model [4] is trained based on the contextualized representation of entities under the supervision of the localized triplets.
• [20]: In the process of grounding unlocalized triplets, this method not only leverages a pre-trained object detector
Figure 7. Performance comparison per class when adopting the reweighting method ((a) Performance per class). The red-colored predicates denote the predicates with a frequency of 0 in the conventional approach, while LLM4SGG generates all of them with a frequency greater than 0. (Bar: number of predicate instances, Line: Recall@100)
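The reweighting method evaluated in Figure 7 can be sketched under our assumption that it follows the common inverse-frequency recipe (the paper's exact reweighting strategy may differ); the function name and normalization are ours.

```python
# Hedged sketch of frequency-based loss reweighting of the kind
# evaluated in Figure 7: each predicate class is weighted inversely to
# its training frequency, so fine-grained (rare) predicates contribute
# more to the loss. This is an assumed standard recipe, not the
# paper's exact implementation.

from collections import Counter

def predicate_weights(predicate_labels):
    counts = Counter(predicate_labels)
    total = sum(counts.values())
    # weight_c is proportional to total / count_c, normalized so that
    # the weights average to 1.
    raw = {c: total / n for c, n in counts.items()}
    norm = sum(raw.values()) / len(raw)
    return {c: w / norm for c, w in raw.items()}

labels = ["on"] * 8 + ["parked on"] * 2
w = predicate_weights(labels)
print(w["parked on"] > w["on"])  # -> True: the rare predicate weighs more
```

Such weights would typically be passed to a weighted cross-entropy loss over predicate classes during training.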
Table 10. Performance comparison for exploration of training space.
Method                    zR@50   zR@100
Motif (Fully-supervised)  0.31    0.60
VS3                       1.16    1.46
VS3+LLM4SGG               2.20    3.02

Figure 9. Example illustrating semantic ambiguity between (a) and (b): both cases contain the candidate triplets <man, on, beach> and <man, walking on, beach>; in (a) <man, on, beach> is selected, while in (b) <man, walking on, beach> is selected.
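The fine-grained predicate selection discussed in Section D.3, i.e., keeping, for each subject-object pair, the predicate that is rarest over the entire set of unlocalized triplets, can be sketched as follows (an illustrative sketch, not the authors' implementation).

```python
# Illustrative sketch of the fine-grained predicate selection in D.3:
# when multiple predicates occur between the same subject-object pair,
# keep the one that is least frequent in the entire set of unlocalized
# triplets, treating rarity as a proxy for fine-grainedness.

from collections import Counter, defaultdict

def select_fine_grained(triplets):
    freq = Counter(pred for _, pred, _ in triplets)
    candidates = defaultdict(list)
    for subj, pred, obj in triplets:
        candidates[(subj, obj)].append(pred)
    # The least frequent predicate is treated as the most fine-grained.
    return [(s, min(preds, key=freq.__getitem__), o)
            for (s, o), preds in candidates.items()]

triplets = [("man", "on", "beach"), ("man", "walking on", "beach"),
            ("dog", "on", "beach")]
print(select_fine_grained(triplets))
# "walking on" (frequency 1) is kept over "on" (frequency 2)
```

As in the Figure 9 example, the same rule applied to every image makes the chosen predicate consistent across images that share the subject-object pair.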
test set shown in Figure 6, focusing on coarse-grained predicates improves the performance in terms of R@K. Therefore, Motif shows high R@K performance while deteriorating the mR@K performance. On the other hand, as shown in Figure 8(b), the predicate class distribution of triplets generated by LLM4SGG is less skewed than that of the Visual Genome dataset. As a result, VS3+LLM4SGG trained with the dataset generated by LLM4SGG shows a performance improvement on fine-grained predicates but relatively inferior performance on coarse-grained predicates. Therefore, VS3+LLM4SGG outperforms Motif in terms of mR@K while showing inferior performance on R@K. Furthermore, applying the reweighting strategy (VS3+Rwt+LLM4SGG) further enhances the performance on fine-grained predicates, leading to an improvement in mR@K. The improvement in mR@K over the fully-supervised approach, which indicates the capability of constructing richer scene graphs, demonstrates the effectiveness of LLM4SGG.

D.3. Experiment for Predicate Selection

Regarding the implementation details, to further alleviate the long-tailed predicate distribution after Step 3 in our framework, we select the most fine-grained predicate when there are multiple predicates between the same subject-object pair, where the fine-grainedness is determined based on the predicate distribution within the entire set of unlocalized triplets. In Table 9, we further conduct experiments under three different scenarios to understand the effect of predicate selection: random predicate selection, coarse-grained predicate selection, and fine-grained predicate selection.

Table 9. Experiment for predicate selection.
Row  Selection        R@50 / 100     mR@50 / 100   F@50 / 100
VS3+LLM4SGG
(a)  Random           8.15 / 9.55    5.10 / 6.19   6.27 / 7.51
(b)  Coarse-grained   9.79 / 11.37   2.52 / 3.03   4.01 / 4.78
(c)  Fine-grained     8.91 / 10.43   7.11 / 8.18   7.91 / 9.17
VS3
(d)  Random           6.60 / 8.01    2.88 / 3.25   4.01 / 4.62
(e)  Coarse-grained   6.99 / 8.20    2.66 / 2.99   3.85 / 4.38
(f)  Fine-grained     6.18 / 7.43    3.82 / 4.27   4.72 / 5.42

We have the following observations: 1) LLM4SGG with the random selection (row (a)) and with the selection of coarse-grained (row (b)) and fine-grained predicates (row (c)) consistently outperforms the baseline with the corresponding selection (rows (d), (e), and (f)), which verifies the effectiveness of LLM4SGG. 2) LLM4SGG with the selection of coarse-grained predicates (row (b)) severely deteriorates the mR@K performance while increasing the R@K performance compared to the random selection (row (a)). While the R@K performance is improved by increasing the instances of coarse-grained predicates, the fine-grained predicates in an image are not effectively utilized when combined with coarse-grained predicates, resulting in a decrease in the mR@K performance. In contrast, selecting the fine-grained predicates (row (c)) significantly increases the mR@K performance. 3) LLM4SGG with the selection of fine-grained predicates (row (c)) yields higher R@K and mR@K than the random selection (row (a)). Regarding the improvement in R@K, we attribute it to the effect of alleviating semantic ambiguity [51]. For example, Figure 9(a) and (b) illustrate two cases where unlocalized triplets parsed from several captions exhibit the predicates on and walking on between the same subject man and object beach. If we randomly select the predicates, the model may learn on in one image and walking on in another even though they are associated with the same subject and object, confusing the SGG model due to similar visual features paired with different predicates. For this reason, selecting walking on for both images helps mitigate the semantic ambiguity, resulting in enhanced performance.

D.4. Experiment for Exploration of Training Space

In Table 10, we show the results of zero-shot Recall@K (zR@K), as introduced by [35], to explore the training space of triplets extracted from captions. It is important to note that the zR@K metric used in the fully-supervised approach evaluates how well the model predicts ⟨subject, predicate, object⟩ triplet sets that have never been observed in the training data of the Visual Genome dataset [13]. Therefore, while triplets extracted from the captions in the COCO dataset may accidentally overlap with these zero-shot triplets, we employ this metric to understand how much our proposed approach broadens the training space. When we compare VS3, which learns from triplets extracted from captions through the conventional approach, to Motif, a fully-supervised approach trained on the Visual Genome dataset, we observe that VS3 achieves a higher zR@K, implying that captions contain a broader range of diverse compositional relationships than the Visual Genome dataset. On the other hand, VS3 with LLM4SGG (i.e., VS3+LLM4SGG) significantly improves the zR@K performance compared to VS3. This suggests that the triplets generated through LLM4SGG are proficient at capturing the compositional relationships found in captions, thereby expanding the training space of triplets. This expansion is achieved by addressing the semantic over-simplification issue, leading to the creation of a more varied set of predicates, and the low-density scene graph issue, resulting in a wider range of compositional triplets. We argue that the use of the zR@K metric demonstrates the effectiveness of LLM4SGG in terms of expanding the training space.

D.5. Experiment for New Prompt Design

In Table 11, we conduct an experiment with a new prompt design. In fact, the prompt for extracting triplets from captions in Section 3.3 and the prompt for aligning entity/predicate classes with the target data in Section 3.4 can be combined into one. More precisely, we instruct the LLM to follow four steps for triplet extraction from a paraphrased caption: Step 1) Paraphrase the caption. Step 2) Extract meaningful triplets from the paraphrased caption obtained in Step 1. Step 3) Find a semantically relevant lexeme in the predefined entity lexicon for the subject and object of the triplets obtained in Step 2, where the entity lexicon is enumerated in Table 7. Step 4) Find a semantically relevant lexeme in the predefined predicate lexicon for the predicate of the triplets obtained in Step 3, where the predicate lexicon is enumerated in Table 8. Following the four steps, we include stepwise results in the in-context examples for in-context few-shot learning, and insert the caption from which we want to extract triplets in the actual question.

Table 11. Experiment for new prompt.
Method            R@50 / 100     mR@50 / 100   F@50 / 100
VS3               6.60 / 8.01    2.88 / 3.25   4.01 / 4.62
VS3+LLM4SGGcomb   8.66 / 10.20   6.28 / 7.06   7.28 / 8.34
VS3+LLM4SGG       8.91 / 10.43   7.11 / 8.18   7.91 / 9.17

As shown in Table 11, LLM4SGG with the combined prompt (i.e., VS3+LLM4SGGcomb) outperforms the baseline, implying that even though the prompt design can be diverse, it consistently verifies the efficacy of the LLM-based triplet formation process. On the other hand, VS3+LLM4SGGcomb shows inferior performance compared to VS3+LLM4SGG. This is due to a practical reason incurred by the length limit of the GPT-3.5 prompt, which allows only up to 4096 tokens. Specifically, VS3+LLM4SGGcomb integrates the instructions of all four steps in the task description, including the definition of the entity/predicate lexicons, and in-context examples following the four steps within a single prompt, which leads to a longer prompt while the maximum length is constrained. For this reason, we could only accommodate four in-context examples. In contrast, VS3+LLM4SGG divides the four steps into two chains (Section 3.3 and Section 3.4), allowing more in-context examples in each chain. Specifically, VS3+LLM4SGG contains six in-context examples in Chain-1 (i.e., triplet extraction) and fourteen in-context examples in Chain-2 (i.e., alignment of entity/predicate classes). The increased number of in-context examples equips the LLM to adapt more effectively to the provided few-shot task [41], ultimately enhancing its adaptability for the tasks of triplet extraction and alignment of entity/predicate classes. In summary, under the practical length limit of GPT-3.5, we observe that an approach that divides the original four steps into two chains is more effective in extracting triplets aligned with the entities/predicates of interest than an approach that combines them into a single prompt.

D.6. Case Studies on Extracting Fine-grained Predicates

Figure 10. Case studies on extracting fine-grained predicates with a frequency of 0 in the conventional approach (i.e., Parser+KB). (a): LLM4SGG extracts the fine-grained predicate found in the caption. (b): LLM4SGG extracts the fine-grained predicate while aligning it with a predicate in the target data. (c): In LLM4SGG, paraphrasing the caption via the LLM aids in identifying a fine-grained predicate, imparting a more specific meaning to it.

In Figure 10, we present case studies on predicates for which the conventional approach (i.e., Parser+KB) eventually ends up with a frequency of 0, while LLM4SGG does not. In Figure 10(a), despite the presence of the fine-grained predicate on back of in the caption, the conventional approach fails to extract it, since it relies on a heuristic rule-based parser [43] that lacks the capability of understanding the context of the caption and extracting the predicate on back of at once. On the other hand, LLM4SGG
successfully extracts the predicate on back of through a comprehensive understanding of the caption's context. In Figure 10(b), we observe that the conventional approach, which suffers from semantic over-simplification, extracts the coarse-grained predicate to instead of connected to, whereas LLM4SGG extracts the fine-grained predicate connected to within the caption. Subsequently, by aligning connected to with a semantically relevant lexeme in the target data, attached to, through the LLM's semantic reasoning ability, it eventually generates the fine-grained predicate attached to. This demonstrates the effectiveness of LLM-based triplet extraction in addition to LLM-based alignment, leading to capturing fine-grained predicates. Interestingly, in Figure 10(c), we observe that the paraphrasing step (Section 3.3) in LLM4SGG aids in extracting fine-grained predicates. Specifically, the paraphrased caption conveys a more specific meaning, i.e., from someone's glass to glass that belongs to someone, thus enabling the extraction of the fine-grained predicate belonging to, which the conventional approach cannot achieve. Through these case studies, we demonstrate the effectiveness of LLM4SGG in extracting fine-grained predicates.

D.7. Qualitative Results

To further verify the effectiveness of LLM4SGG on the Visual Genome dataset [13], in Figure 11, we showcase qualitative results from the test data, comparing the baseline (i.e., VS3 [54]) with LLM4SGG applied to VS3 (i.e., VS3+LLM4SGG).

Figure 11. Qualitative results on the Visual Genome dataset. (a) Predicate of rare frequency (parked on): a predicate that rarely appears in the conventional approach. (b) Predicate of frequency 0 (mounted on) and (c) predicate of frequency 0 (lying on): predicates that never appear in the conventional approach.

In Figure 11(a), we observe that LLM4SGG accurately predicts the fine-grained predicate parked on between the subject bus and object street, which rarely appears in the training data built by the conventional approach. In contrast, the baseline (i.e., VS3) following the conventional approach predicts the coarse-grained predicate on due to the semantic over-simplification, incurring the long-tailed predicate distribution. Furthermore, in Figure 11(b), we observe that LLM4SGG makes a correct prediction on a predicate whose frequency is 0 in the conventional approach (i.e., mounted on). On the other hand, the baseline predicts the coarse-grained predicate on and never predicts mounted on since it has not been observed during training. Interestingly, while in Figure 11(c) LLM4SGG makes an incorrect prediction by predicting lying on instead of on, we argue that lying on is a more realistic answer that provides a richer context. However, as lying on is never observed while training the baseline (i.e., VS3), it can never be predicted. Through the qualitative analysis, we again demonstrate the effectiveness of alleviating the long-tailed problem in WSSGG.

E. Experiment on GQA

E.1. Training and Evaluation

Training. To train an SGG model for evaluation on the GQA dataset [11], LLM4SGG requires three modifications, encompassing the change of 1) the predefined entity lexicon, 2) the predicate lexicon, and 3) the in-context examples in Section 3.4, while maintaining the triplet extraction process in Section 3.3. More precisely, in the predefined entity and predicate lexicons, we replace the entity and predicate classes originally associated with the Visual Genome dataset [13] with the entity and predicate classes in the GQA dataset, respectively. For the in-context examples, we substitute the examples initially relevant to the Visual Genome dataset with examples related to the GQA dataset. After obtaining unlocalized triplets composed of entity/predicate classes in the GQA dataset, the process of grounding them remains unchanged from the grounding method used for the Visual Genome dataset. Please refer to Section A.3 for details of the grounding process.

Evaluation. To evaluate a trained model on the GQA dataset, only a single modification is needed in the architecture of VS3 [54]. Specifically, we change the output entity classes of bounding boxes from the entity classes of the Visual Genome dataset to those of the GQA dataset. Regarding how VS3 determines the entity classes of bounding boxes, the target data's entity classes are listed in text format, e.g., airplane. animal. arm. ... . Then, a text encoder [7] encodes the enumerated text list to obtain the text features of each entity class. The similarity between these text features and the visual features [6] of the bounding boxes is computed, and for each bounding box, the text entity with the highest similarity is chosen as its class. In this procedure, we simply list the entity classes from the GQA dataset instead of Visual Genome's entity classes. This ensures that the entity
classes assigned to the bounding boxes align with the entity classes in the GQA dataset. Subsequently, we proceed with the standard evaluation protocol used for the Visual Genome dataset.

E.2. Performance Comparison per Class

In Figure 12(a), we show a performance comparison per class on the GQA dataset [11]. For the baseline (i.e., VS3 [54]), we observe that the performance for most predicates, except for coarse-grained predicates, is nearly zero. This is attributed to the semantic over-simplification inherent in the conventional approach, leading to a long-tailed predicate class distribution, as shown in Figure 12(b). The long-tailed problem is severely exacerbated in the GQA dataset due to its inclusion of more complicated predicates (e.g., standing next to), as well as a greater variety of fine-grained predicates, such as talking on and driving down, which are challenging to extract from captions using a heuristic rule-based parser [43]. In fact, 44 out of 100 predicates have a frequency of 0, meaning that they are never predicted. On the other hand, as shown in Figure 12(b), LLM4SGG effectively addresses the long-tailed problem by alleviating the semantic over-simplification and low-density scene graph issues, increasing the number of instances that belong to the fine-grained predicates so that there are no predicates with a frequency of 0. As a result, LLM4SGG significantly enhances the performance on fine-grained predicates, thereby improving the mR@K performance. This demonstrates the effectiveness of LLM4SGG on the more challenging GQA dataset.

E.3. Qualitative Results

To further demonstrate the effectiveness of LLM4SGG on a more challenging dataset, GQA [11], we present qualitative results comparing the baseline (i.e., VS3) with LLM4SGG applied to VS3 (i.e., VS3+LLM4SGG) in Figure 13. In Figure 13(a), (b), and (c), we showcase examples from the test data whose predicates have a frequency of 0 in the training data when triplets are generated using the conventional approach, i.e., grazing in, skiing on, and standing in front of. In Figure 13(a) and (b), we observe that the baseline predicts the coarse-grained predicate in, which is the second most frequent predicate, as shown in Figure 12(b). The prediction of the coarse-grained predicate is attributed to the semantic over-simplification issue of the conventional approach, leading to a long-tailed problem. On the other hand, LLM4SGG correctly predicts the fine-grained predicates grazing in and skiing on by effectively alleviating the semantic over-simplification issue. In Figure 13(c), we encounter a complicated predicate, standing in front of, between the subject woman and object door. Such predicates are intricate to extract from captions unless they are comprehensively understood and extracted at once. LLM4SGG, however, adeptly extracts the fine-grained predicate standing in front of by understanding the entire context of the caption via the LLM. LLM4SGG makes it possible to learn the
Figure 13 (partial): VS3+LLM4SGG predicts <sheep, grazing in, field>.

Method                    # Triplet   R@50 / 100     mR@50 / 100   F@50 / 100
VS3                       154K        6.60 / 8.01    2.88 / 3.25   4.01 / 4.62
+LLM4SGG-LLaMA2 (7B)      228K        8.26 / 9.91    5.90 / 6.72   6.88 / 8.01
+LLM4SGG-GPT-3.5 (175B)   344K        8.91 / 10.43   7.11 / 8.18   7.91 / 9.17
Figure 14. Qualitative results for utilization of a smaller language model (i.e., LLaMA2-7B), shown over six image captions (e.g., "Large dog laying on top of a bed and looking up at mirror", "A bus is riding down the street with passengers"). (a) Mitigation of semantic over-simplification. (b) Discarding triplets due to lack of semantic reasoning. (c) Incorrect alignment due to lack of semantic reasoning. Red arrow: failure of alignment, Blue arrow: success of alignment, Orange arrow: incorrect alignment.
Table 13. Num. tokens and cost per image containing 5 captions.
Steps                                                   Num. output/input tokens per image   Cost per image
(Step 2-1) Triplet extraction from original caption     0.16K / 0.52K                        $0.00050
(Step 2-2) Triplet extraction from paraphrased caption  0.48K / 0.89K                        $0.00117
(Step 3) Alignment of object classes                    0.11K / 1.18K                        $0.00076
(Step 3) Alignment of predicate classes                 0.11K / 0.82K                        $0.00058
G. Evaluating the Quality of Triplets Generated by LLM4SGG

In Figure 15, we quantitatively evaluate the quality of triplets generated by LLM4SGG. To this end, we combine triplets made by the conventional approach (i.e., Parser+KB) and triplets generated by LLM4SGG at different ratios. Specifically, we gradually increase the ratio of triplets generated by LLM4SGG while decreasing the ratio of triplets made by the conventional approach. For example, at an 8:2 ratio, we randomly select 80% of the total triplets made by the conventional approach and 20% of the total triplets generated by LLM4SGG. As shown in Figure 15, as we increase the ratio of LLM4SGG's triplets, we observe a gradual improvement in mR@K and F@K performance.

H. Cost for Utilization of a Large Language Model

The cost of using ChatGPT for augmenting the data is summarized in Table 13. Given that input tokens and output tokens cost $0.0005 and $0.0015 per 1K tokens, respectively, the cost per image for Step 2-1 is computed as (520/1K) × 0.0005 + (160/1K) × 0.0015; the costs of the other steps are computed similarly.
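The cost arithmetic in Section H can be reproduced with a small helper; the function name and layout are ours, and the per-1K-token rates are taken directly from the text above.

```python
# Reproducing the per-image cost arithmetic from Section H / Table 13:
# input tokens cost $0.0005 and output tokens $0.0015 per 1K tokens.

def cost_per_image(input_tokens, output_tokens,
                   in_rate=0.0005, out_rate=0.0015):
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate

# Step 2-1: 0.52K input / 0.16K output tokens (Table 13 lists $0.00050)
print(f"Step 2-1: ${cost_per_image(520, 160):.5f}")
# Step 2-2: 0.89K input / 0.48K output tokens (Table 13 lists $0.00117)
print(f"Step 2-2: ${cost_per_image(890, 480):.5f}")
```

The same formula applied to the Step 3 rows of Table 13 yields the remaining costs.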