
Zero-Shot Image Harmonization with Generative Model Prior

Jianqi Chen Zhengxia Zou Yilan Zhang Keyan Chen Zhenwei Shi*
Beihang University
https://github.com/WindVChen/Diff-Harmonization
arXiv:2307.08182v1 [cs.CV] 17 Jul 2023

Figure 1. Given a composite image, our method can achieve its harmonized result, where the color space of the foreground is aligned with
that of the background. Our method does not need to collect a large number of composite images for training, but only utilizes a pretrained
text-to-image generative model. The first column from the left in the upper row is the source image of the foreground (“house”), and the
others are the harmonized results of the foreground in different backgrounds. In the lower row, we take one of the composite images as an
example to show the entire harmonization process.

* Corresponding author.

Abstract

Recent image harmonization methods have demonstrated promising results. However, due to their heavy reliance on a large number of composite images, these works are expensive in the training phase and often fail to generalize to unseen images. In this paper, we draw lessons from human behavior and come up with a zero-shot image harmonization method. Specifically, in the harmonization process, a human mainly relies on a long-term prior on harmonious images and pulls the composite image close to that prior. To imitate that, we resort to pretrained generative models for the prior of natural images. To guide the harmonization direction, we propose an Attention-Constraint Text which is optimized to well illustrate the image environments. Some further designs are introduced for preserving the foreground content structure. The resulting framework, highly consistent with human behavior, can achieve harmonious results without burdensome training. Extensive experiments have demonstrated the effectiveness of our approach, and we have also explored some interesting applications.

1. Introduction

Image harmonization is a technique for aligning the color space of foreground objects with that of the background in a composite image. Recent state-of-the-art methods [32, 61, 26, 20, 9, 24, 10, 54, 18, 66, 5, 50, 7] are mostly based on deep learning networks, which have achieved more promising results than traditional ones [29, 62, 42, 38, 53]. However, the performance of these methods, whether performing image harmonization in a supervised manner with paired inharmonious/harmonized images [61, 26, 20, 9, 24, 10, 54, 18, 7] or constructing a zero-sum game by means of the GAN [17] technique [66, 5], is highly correlated with the quality of the collected composite images. If the collected composite images cannot cover real-world situations, the trained network will fail to obtain satisfactory results in practical use.

Compared with real image collection, producing a real-world composite image requires more labor, as one should crop a foreground object from an image and then paste it at a reasonable place in another one. To avoid the unbearable cost of making a large real composite image dataset, the current mainstream alternative [54, 5, 10] is to "synthesize" composite images. Specifically, by applying a variety of color transforms [42, 59, 39, 14, 30] on the semantic regions of existing large-scale datasets [4, 33, 67], we can obtain enough "synthesized" composite images, together with their paired harmonized images (i.e., the original ones). The more transformations considered, the closer one gets to the real situation. However, this also results in a rather large dataset [10], which places a heavy burden on the training phase. Such a costly approach is in fact inconsistent with human behavior.

Let us think about how we humans harmonize an image. Suppose we are familiar with the basic controls of an image editing software (e.g., Photoshop). Given a composite image, we first locate the inconsistency between foreground and background (with the help of a mask). Then, based on the foreground/background regions, we coarsely identify their possible imaging environments, which guide our editing direction. By editing the foreground region and repeating the previous steps, we can often get a satisfactory harmonious result at last. This process, without training on large amounts of inharmonious data, harmonizes images in a zero-shot manner, and we attribute its success and feasibility to humans' long-term strong prior. Specifically, most of what we observe in our lives are real and harmonious images, and if asked to imagine something, we would tend to give a harmonious answer. Thus, an inharmonious image is an outlier to the prior (possibly the reason why we define it as "inharmonious"), and the above harmonization process gradually pulls this outlier closer to our prior. There is simply no need to know the distribution of composite images. We illustrate the human process in Fig. 2.

Figure 2. Human behavior of harmonizing a composite image. We humans can perform image harmonization relying only on our long-term prior, without seeing many composite images in advance. E.g., we can harmonize the "over-bright" dog above.

So, can we imitate human behavior in harmonizing images without relying on composite image collection? To answer this question, we need to find the "strong prior" first, and recent large-scale text-to-image models [47, 40, 64, 43] come into our sight. Trained on extremely huge amounts of data, these generative models have demonstrated their extraordinary generative power and are intrinsically embedded with the prior of the real image distribution. Among them, Stable Diffusion [43], featuring open source and fabulous results, has gained rather unprecedented attention, so we base our method on it. To guide the harmonization direction, we propose an Attention-Constraint Text. This synthesized text, inspired by the Textual Inversion technique [16, 45], is optimized to restrict attention to the foreground/background area, thus well delineating their respective environments (just like the aforementioned human behavior). We can then combine the synthesized text with existing text-guided local editing approaches [21, 36] to conduct the image harmonization process. However, a direct combination causes severe content distortion, making harmonization lose its meaning. To this end, we exploit the self-attention maps in Stable Diffusion and also leverage a supplemental content text to preserve the structure. The final framework is an iterative harmonization loop (again similar to human behavior), which can achieve high content retention and satisfactory harmonized results. Fig. 1 displays some of our results.

In summary, our contributions are as follows:

• We introduce a new perspective on image harmonization. By thinking about human behavior, we analyze the feasibility of zero-shot image harmonization based on the generative model prior.

• We propose a zero-shot image harmonization framework. With the designed Attention-Constraint Text and some operations for content retention, we can achieve satisfactory results without relying on burdensome composite image collection. This framework is also highly consistent with human behavior.

• We conduct extensive experiments on various cases to demonstrate the effectiveness of the method. Further applications are also explored, such as artwork harmonization and user auxiliary, where our method has great potential.

2. Related Works

Image harmonization. Compared with traditional methods [29, 62, 42, 38, 53] that rely on matching low-level statistics between foreground and background, Tsai et al. [54] pioneered leveraging deep learning in image harmonization and demonstrated the superiority of its powerful semantic representation capability. To satisfy the data-hungry training of deep networks, Cong et al. [10] constructed a large synthesized image harmonization dataset by applying a variety of color transforms [42, 59, 39, 14, 30] on segmentation regions of existing datasets [4, 33, 28]. Many of the following works [61, 26, 9, 18, 50, 7] then based their methods on this large synthesized dataset. Some other approaches have either tried to leverage different color transforms [20, 24] or leveraged the GAN [17] technique to explore harmonization with unpaired image data [5, 66], but they still heavily rely on transforming segmented regions for composite image collection. Although some promising results have been achieved, whether the synthesized composite images can reflect the real-world situation determines the final generalization and performance of these methods. With more color transformations considered, the trained network can be more robust, yet the training cost also becomes more unaffordable.

Text-to-image synthesis. Recently, we have witnessed the incredible generative power of many large-scale text-to-image models [41, 13, 47, 40, 64, 43]. Trained on vast amounts of image-text data, these models have the real data distribution embedded in themselves and gain the ability to generate stunning and fabulous images that well follow the given textual guidance. Among them, the diffusion models [47, 40, 43], an especially interesting class of generative models that perform the denoising process iteratively from random Gaussian noise [51, 22], have achieved rather unprecedented attention. Many downstream methods have emerged that apply the diffusion model to a variety of tasks, such as image restoration [57, 63], image inpainting [31, 46], adversarial attack [6], and even recognition [1, 8]. Still, image editing applications cover a great proportion. With guidance from text [23], a trained classifier [12], etc., these methods [21, 36, 58, 3, 11, 60, 2] manipulate the local content of an image while preserving the remaining area. Since they mainly aim to change the local content structure, these methods are not suitable for harmonizing an image. Methods of style transfer [27] also cannot be adopted, as they tend to transfer the style between two images, whereas we focus on making one composite image harmonious.

3. Preliminaries

Before introducing our proposed method, in this section we briefly review the involved techniques to better understand what follows.

Diffusion model. Denoising diffusion probabilistic models (DDPMs) [22] are a class of generative models that perform an iterative image denoising process starting from random Gaussian noise. A U-Net [44]-like neural network is trained by minimizing the following objective function:

$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \quad (1)$

where $\epsilon_\theta$ is the network for predicting the noise $\epsilon$ that is applied on the input clean image $x_0$ with different intensities: $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon$, where $\alpha_t$ are a series of fixed hyper-parameters. By gradually denoising for $T$ timesteps, the diffusion models can then generate images following the real data distribution from a randomly sampled noise image. These models further generalize to conditional generation, guided by image classes or texts, where the predicted noise becomes $\epsilon_\theta(x_t, t, C)$, with $C$ denoting the additional guidance.

Since the denoising process is conducted repeatedly in the high-dimensional image space, the training and inference of diffusion models are unbearably expensive. Thus, Rombach et al. [43] proposed to map the input image to a latent space with an autoencoder network before conducting the forward and reverse processes. They further introduce the cross-attention mechanism [56] into the network besides the commonly used self-attention [51, 22, 37, 12], thus facilitating guidance from various modalities. By training on large amounts of image-text data pairs [48], this work was later developed into the well-known Stable Diffusion, which is open source and displays amazing generative performance.

DDIM and DDIM inversion. To accelerate the sampling of diffusion models, Song et al. [52] generalize DDPMs from a particular Markovian process to non-Markovian processes (DDIM), which leads to deterministic sampling given an initial noise $x_T$:

$x_{t-1} = \sqrt{\alpha_{t-1}}\, \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta(x_t)}{\sqrt{\alpha_t}} + \sqrt{1 - \alpha_{t-1}} \cdot \epsilon_\theta(x_t) \quad (2)$

Since DDIM can further be taken as the Euler integration of a particular ODE [52], it provides the feasibility of mapping a real image back to its corresponding initial latent by reversing Eq. (2). This operation, named DDIM inversion, paves the way for later editing of real images [11, 36].
for harmonizing an image. Methods of style transfer [27] Real image local editing. Recently, Hertz et al. [21]
also cannot be adopted, as they tend to transfer the style be- deeply exploited the semantic strength of the cross-attention
tween two images, whereas we focus on making one com- maps in the large-scale text-to-image model [47] and pro-
posite image harmonious. posed Prompt-to-Prompt (abbreviated as P2P). Leveraging

3

Figure 3. The pipeline of our method. The harmonization process is an iterative loop, which is aligned with human behavior. Starting from the composite image, we invert the last harmonized result into its diffusion latent and then leverage some harmonization operations to obtain the next harmonized result. Among the harmonization operations, (a) Attention-Constraint Text optimization is designed to get texts that can describe the foreground/background environment better. For (b) foreground content retention, we leverage self-attention maps and a supplemental content text. (We also utilize the P2P [21] and Null-Text [36] techniques. For the sake of brevity, the figure does not explicitly draw that part, and we refer to their work for more details.)

4. Method

With the techniques mentioned in Sec. 3, given an input real image, we can leverage DDIM inversion [52] to obtain its initial latent, optimize the unconditional token embeddings to well reconstruct the input [36], and finally resort to P2P for real image local editing [21]. Thus, we can come up with a straightforward approach for image harmonization: first describe the environments of the foreground and the background, each with one text. Based on P2P [21], we can fix the cross-attention maps between the foreground text and the image pixels, and change the foreground text to the background text. It is then expected that the background text can guide the editing of the foreground area environment, thus achieving image harmonization.

However, from our experiments (see Sec. 5.2), the above straightforward approach suffers from two problems. Firstly, it is arduous to find two words that can accurately represent the environments of the foreground and the background. For an image composed of rich elements, even an experienced and well-read person has to make some effort on it. Since only two words are quite limited, one might consider using more words to improve the representation, yet that would introduce a new problem of different matching orders between foreground and background texts. The other problem is content distortion. When leveraging P2P [21] for image harmonization, there are usually nontrivial distortions in the foreground content structure. This goes against the purpose of image harmonization, where we hope to harmonize the image while preserving the foreground content structure as much as possible.

To tackle these problems, we propose a framework (as depicted in Fig. 3), where two Attention-Constraint Texts are optimized to well describe the image environments, and some operations are exploited to keep the content structure. The pipeline achieves harmonious results by iterating the harmonization loop. With the diffusion model itself frozen and only the text embeddings optimized, the method is mainly based on the generative model prior without relying on a large number of composite training images, and is thus highly consistent with the human behavior described in Sec. 1. We now describe the framework in more detail.

4.1. Attention-Constraint Text

From Fig. 4, we can observe the amazing generative power of Stable Diffusion [43], which can create images that well reflect the text guidance. If we visualize the cross-attention maps in the diffusion process, we are surprised to find that even at the earlier timesteps with quite noisy images, there are obvious correlations between the text and the image pixels. The model seems to have decided which image area corresponds to which text right from the initial diffusion timestep. Based on this, Hertz et al. [21] can achieve local image editing by changing the text while maintaining the cross-attention maps. Then, to achieve image harmonization, we can leverage a word whose attention is mainly constrained to the foreground area of the composite image, and replace it with another word that illustrates the background environment.

Figure 4. Given the prompt text "A bear walking on Mars", Stable Diffusion can generate an image well aligned with the semantic information. Visualizing the intermediate cross-attention maps between text and image (from t = T to t = 0), we can observe that there are clear correlations even at the earlier steps.

Let us leave the foreground word alone for now and focus on the background one first. In the text-to-image process, Stable Diffusion does not leverage the guidance from texts directly, but in fact utilizes the text embeddings that the texts correspond to. Thus, while we may not be able to well illustrate the environment with only one word even after consulting the entire dictionary, it is possible to find a representative high-dimensional text embedding. According to the aforementioned correlations between texts and image pixels, we here propose a coarse quantitative indicator, based on cross-attention maps, to evaluate whether a text embedding can represent the environment. Specifically, we take the embedding as an approximation of the environment if its attention is well focused on the background region, neither overflowing nor underfilling it. To achieve that, we can optimize a text embedding as follows:

$\mathcal{L}_{Emb} = \left\| M - \frac{\mathrm{Att}(Emb)}{\max(\mathrm{Att}(Emb))} \right\|_2^2 \quad (3)$

where $Emb$ denotes the optimized text embedding, $M$ represents the background mask, and $\mathrm{Att}$ is the cross-attention map between the embedding and the image pixels. To align with the value range of the mask, we normalize the attention map by its maximum value. In practice, we can simply describe the background environment coarsely with an initial word and then minimize Eq. (3) to obtain a more representative refined embedding, which facilitates the process a lot. To avoid overfitting during the optimization, we add a regularization term that keeps the refined embedding close to its initial value:

$\min_{Emb}\left(\mathcal{L}_{Emb} + w \times \left\| Emb - Emb_{init} \right\|_2^2\right) \quad (4)$

where $Emb_{init}$ is the embedding of the initial text that coarsely illustrates the environment, and $w$ is a hyper-parameter balancing the two terms. We coin the obtained embedding the Attention-Constraint Text.

With the optimized embedding for the background, we can then replace the word whose attention is fixed on the foreground area. One may consider leveraging a noun describing the foreground content, such as "girl", "car", or "dog", as these words focus well on foreground regions while being easier to think of than environment words. However, we found that substituting these words fails to achieve harmonious results (see Fig. 7), probably due to the lack of environmental information as guidance. Thus, in practice we also optimize an Attention-Constraint Text for the foreground, based on an initial coarse description of its environment.

Inspired by Textual Inversion [16, 45], our method shares a similarity with those works in that the Attention-Constraint Text also comes from the optimization of a text embedding. However, there are essential differences. From the perspective of goals, [16, 45] aim to abstract specific objects from multiple images and further generate richer images with the optimized text, while our method is based on only one composite image and the target is to harmonize the image. From the perspective of implementation, [16, 45] optimize the text by reconstructing multiple input images, while ours constrains the attention of the text to specific regions (foreground/background) to better represent the image environments.
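
For clarity, a minimal sketch of the objective in Eqs. (3)-(4) is given below. It assumes the cross-attention map has already been extracted from the frozen Stable Diffusion U-Net, averaged over heads/layers, and resized to the mask resolution; all names are illustrative placeholders rather than the actual implementation.

```python
import torch

def attention_constraint_loss(attn_map, bg_mask, emb, emb_init, w=5000.0):
    """Illustrative version of Eqs. (3)-(4).

    attn_map: cross-attention map between the text embedding and the image
              pixels (assumed already resized to the mask resolution).
    bg_mask:  binary background mask M.
    emb/emb_init: optimized and initial text embeddings.
    """
    attn_norm = attn_map / attn_map.max()          # align with the mask's value range
    l_emb = ((bg_mask - attn_norm) ** 2).sum()     # Eq. (3): focus attention on the background
    reg = ((emb - emb_init) ** 2).sum()            # Eq. (4): stay close to the initial word
    return l_emb + w * reg
```

The weight w trades off how sharply the attention is pushed toward the mask against how far the refined embedding may drift from the user-given initial word.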

4.2. Content retention

To achieve image harmonization while avoiding much content distortion, we preserve the content structure from two aspects.

Self-attention exploitation. Although Hertz et al. [21] have been successful in local editing of images, we found that even if we changed only the environment text and not the object itself, the content structure would distort a lot (see Fig. 8). Different from P2P [21], which depends on the cross attention, we leverage the self-attention maps to achieve high content retention. As demonstrated previously in [49, 55], self-similarity-based descriptors can capture structure while ignoring appearance information. By first fixing the self-attention maps in Stable Diffusion for the foreground environment text and then replacing the text while keeping the cross-attention maps as in [21], we can better retain the content. By utilizing the self-attention mechanism, we in fact restrict each editing iteration to small changes (see Fig. 3), just like human behavior. Thus, in practical use, we apply the diffusion editing several times until a satisfactory harmonized result is obtained.
map between the embedding and the image pixels. To align By fixing the self-attention maps in the Stable Diffusion

Supplemental text for foreground content. As mentioned in Sec. 4.1, we currently only require one environment text for the foreground and the background respectively. However, in our experiments (Fig. 8), we found that the environment text alone failed to well preserve the foreground structure, even when we leveraged the self attention. Furthermore, we also noticed that leveraging only the environment text poses more challenges for reconstructing the foreground area (a longer optimization of the unconditional embedding with [36] is needed).

We attribute these phenomena to the lack of text guidance. Specifically, since the environment text does not contain information about the image content, when we force the DDIM-inverted latent to reconstruct its input as in [36], we actually rely on the optimized unconditional embedding to represent the content information. This optimized unconditional embedding, specific only to its paired environment text (foreground), does not generalize to another environment text (background). Thus, when the foreground environment text is replaced with the background one, we fail to preserve the foreground content information, which then causes the content distortion. Therefore, we supplement another text for the foreground content besides the environment one. Since the content text embedding was well trained beforehand on a huge number of images [43], it can easily adapt to pairing with other texts, and thus achieves high content retention (see Fig. 8).

It should also be noted that the resulting content retention with the supplemental text has no relation to the fixation of the cross-attention maps in P2P [21]. To verify this, we conduct experiments in Fig. 8 where only the self attention is utilized. It can be observed that the content is still highly preserved without fixing cross-attention maps.

5. Results

We start by comparing our method with existing image harmonization methods, both qualitatively and quantitatively. Then, we conduct ablation studies of the designs in our method. Some further applications are explored at last. More implementation details of our method can be found in the Supplementary Material.

5.1. Comparisons

To demonstrate our superiority, in Fig. 5 we display the harmonized results of our method and the current state-of-the-art harmonization methods [18, 61, 26, 19]. Since our method is not trained on a large number of synthesized composite images like theirs, for fairness and practicality, comparisons are performed on 158 real composite images collected from Flickr by ourselves (some examples are displayed in Fig. 1) and from [9, 20]. From the qualitative results, we can observe that our method, even without relying on heavy training, achieves comparable or even better (e.g., Row 3 in Fig. 5) performance than the other methods, demonstrating the effectiveness of our zero-shot approach.

As ground-truth harmonized results are unavailable for the real composite images, following [16, 36, 2, 58, 60], we conduct a subjective user study for the quantitative comparisons. Specifically, we constructed a questionnaire in which 29 participants were invited to select visually harmonious images. Each time, participants were given two images, either the results of the harmonization methods or the original composite image. The displaying order was randomly shuffled and the participants had to pick the more harmonious one of the two. A total of 22,910 votes were collected. From the voting percentages shown in Fig. 6, we can observe that our zero-shot approach successfully achieves comparable results to existing harmonization methods trained on a large number of composite images, demonstrating the effectiveness and superiority of our work.

5.2. Ablation Study

We conduct ablation studies on the designs in Sec. 4. For more ablation experiments and analyses, please refer to the Supplementary Material.

Ablation of Attention-Constraint Text. In Fig. 7, we demonstrate the effectiveness of the proposed Attention-Constraint Text. Without the optimization, it can be seen from the attention maps that the initial text fails to well illustrate the environments of the foreground/background areas, resulting in a deviated harmonization direction. The presence of the initial text is also important, otherwise we cannot obtain satisfactory results with text embeddings optimized from scratch. Furthermore, we verified the infeasibility of substituting the environment text with a content text for the foreground, mentioned in Sec. 4.1, which fails to achieve harmonious results.

Ablation of content retention designs. From Fig. 8, we can observe that with the self attention leveraged, the content structure is largely preserved. This is further improved when adding the supplemental content text. To verify that the improvement does not come from the fixation of cross-attention maps in P2P [21], we show the results with and without fixed cross-attention. As shown in Fig. 8, we can see that there is almost no difference before and after, validating our statement in Sec. 4.2.

Figure 5. Qualitative comparisons with other state-of-the-art harmonization methods. From left to right are composite images, foreground masks, and the harmonized results of Intrinsic [19], IHT [18], Harmonizer [26], DCCF [61], and ours. Please zoom in for a better view. The rightmost column is the foreground/background environment text (e.g., "Summer*/Spring*", "Bright*/Dim*", "Cold*/Sunset*"), where the asterisk in the upper right corner indicates the Attention-Constraint optimization. (For simplicity, we here ignore the supplemental content text.)

Figure 6. User study of the harmonization performance. Users are invited to select the more harmonious one between two images. Here, we display the voting percentages of different methods [61, 26, 19, 18]. "Composite" denotes the original composite image. Voting percentages (ours vs. the alternative): 75.45% vs. 24.55% (Composite), 44.11% vs. 55.89% (IntrinsicIH), 45.83% vs. 54.17% (IHT), 59.45% vs. 40.55% (Harmonizer), 56.33% vs. 43.67% (DCCF).

Figure 7. Ablation of the Attention-Constraint Text. The Attention-Constraint optimization and the initial text are ablated: (a) final, (b) w/o text optimization, (c) w/o initial text, (d) double content text. We further verify the infeasibility of utilizing a double content text for the foreground (mentioned in Sec. 4.1). Please zoom in for a better view.

5.3. Applications

In this subsection, we go a step further and explore the applications of our method. We mainly focus on two: artwork harmonization and user auxiliary.

Artwork harmonization. Besides the harmonized results on the natural composite images displayed in Sec. 5.1, we try to generalize our method to composite images with a large style gap between the foreground and the background. Here, we evaluate the performance on some artistic pictures.

To achieve better results, we slightly relax the constraint on content retention by utilizing self-attention maps only in certain diffusion timesteps (for natural images, all timesteps are considered). From Fig. 9, it can be seen that our method still succeeds in harmonizing the artwork images. We attribute this success to the powerful generative model prior. As mentioned in Sec. 1, a human mainly relies on a long-term prior on harmonious images to conduct the image harmonization process, and can often arrive at harmonious results. Similarly, thanks to the extremely large amount of real-world training data [41, 13, 47, 40, 64, 43], the pretrained large-scale generative models inherently have that prior. Thus, our method, based on the generative model prior, gains better generalization to a variety of image types, while the existing approaches [18, 9, 61, 20, 26, 24, 7] are limited to the specific scope of the synthesized composite images seen in their training phase.

Figure 8. Ablation of the operations for content retention. We ablate the utilization of self-attention maps and the supplemental text: (a) composite image, (b) w/o self attention and w/o supplemental text, (c) w/o supplemental text, (d) final, (e) w/o cross attention. We also ablate the utilization of cross-attention maps in P2P [21] to verify the effectiveness of our operations (mentioned in Sec. 4.2). Please zoom in for a better view.

Figure 9. Further application of artwork harmonization. We explore generalizing our method on artwork. The small image in the lower left corner of each image is the original composite image. Please zoom in for a better view.

User auxiliary. As depicted in Fig. 3, our method is an iterative harmonization process: each time, the composite image is harmonized a little. Thus, compared with the one-step methods [20, 18, 10, 54], the user observes all the intermediate results, which brings many benefits and makes the method a white-box approach. Specifically, due to the trivial content distortion between the results of two adjacent iterations, we can optimize a 3D LUT [65, 25], which is a lookup table that maps RGB values between two images. By predicting either a global 3D LUT for the entire foreground region or multiple local LUTs for meaningful foreground parts such as faces, we can reveal the latent operations of the model, which in turn can provide guidance for an image editor to harmonize images. The process is illustrated in Fig. 10.

Figure 10. Further application as user auxiliary. With our iterative harmonization process, the intermediate operations of the network can be extracted, which can be leveraged as guidance for the human editing process.
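
As an illustration of this user-auxiliary idea, the sketch below estimates a global 3D LUT between the foreground regions of two adjacent iteration results by simple histogram-bin averaging; it is a hedged example under our own naming, not the learned LUT methods of [65, 25].

```python
import numpy as np

def fit_global_lut(src, dst, bins=16):
    """Estimate a global 3D LUT mapping the RGB values of `src` to `dst`.

    src, dst: float arrays in [0, 1] of shape (H, W, 3), holding the foreground
    region of two adjacent harmonization iterations. Each occupied RGB bin
    stores the mean target colour of the pixels that fall into it.
    """
    idx = np.clip((src * bins).astype(int), 0, bins - 1)              # per-channel bin indices
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    lut = np.zeros((bins ** 3, 3))
    counts = np.zeros(bins ** 3)
    np.add.at(lut, flat.ravel(), dst.reshape(-1, 3))                  # sum target colours per bin
    np.add.at(counts, flat.ravel(), 1)
    nz = counts > 0
    lut[nz] = lut[nz] / counts[nz][:, None]                           # mean target colour per bin
    return lut.reshape(bins, bins, bins, 3)

def apply_lut(img, lut):
    bins = lut.shape[0]
    idx = np.clip((img * bins).astype(int), 0, bins - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]
```

An editor could inspect such a LUT (or several local ones) to see, in familiar colour-grading terms, what the model did between two iterations.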

6. Conclusions

In this work, we introduce the feasibility of imitating human behavior and propose a zero-shot image harmonization method. Without relying on burdensome training on a huge amount of composite image data, we can achieve satisfactory harmonized results with only the prior of a pretrained generative model and the guidance of text describing the image environment. The effectiveness of our method has been validated through a variety of cases. Even for composite images with large inconsistencies between the background and the foreground, such as artistic ones, our method can still lead to a harmonious result. Furthermore, we have discussed the limitations of the method and future work in depth (please refer to the Supplementary Material). We hope that our method can pave the way for further research on zero-shot image harmonization.

References

[1] Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021. 3
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022. 3, 6
[3] Manuel Brack, Patrick Schramowski, Felix Friedrich, Dominik Hintersdorf, and Kristian Kersting. The stable artist: Steering semantics in diffusion latent space. arXiv preprint arXiv:2212.06013, 2022. 3
[4] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR 2011, pages 97–104. IEEE, 2011. 2, 3
[5] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2019. 1, 2, 3
[6] Jianqi Chen, Hao Chen, Keyan Chen, Yilan Zhang, Zhengxia Zou, and Zhenwei Shi. Diffusion models for imperceptible and transferable adversarial attack. arXiv preprint arXiv:2305.08192, 2023. 3
[7] Jianqi Chen, Yilan Zhang, Zhengxia Zou, Keyan Chen, and Zhenwei Shi. Dense pixel-to-pixel harmonization via continuous image representation. arXiv preprint arXiv:2303.01681, 2023. 1, 2, 3, 8
[8] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788, 2022. 3
[9] Wenyan Cong, Xinhao Tao, Li Niu, Jing Liang, Xuesong Gao, Qihao Sun, and Liqing Zhang. High-resolution image harmonization via collaborative dual transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18470–18479, 2022. 1, 2, 3, 6, 8
[10] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Dovenet: Deep image harmonization via domain verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8394–8403, 2020. 1, 2, 3, 8
[11] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022. 3
[12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3
[13] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021. 3, 8, 12
[14] Ulrich Fecker, Marcus Barkowsky, and André Kaup. Histogram-based prefiltering for luminance and chrominance compensation of multiview video. IEEE Transactions on Circuits and Systems for Video Technology, 18(9):1258–1267, 2008. 2, 3
[15] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022. 13
[16] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 2, 5, 6
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. 2, 3, 13
[18] Zonghui Guo, Zhaorui Gu, Bing Zheng, Junyu Dong, and Haiyong Zheng. Transformer for image harmonization and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 1, 2, 3, 6, 7, 8
[19] Zonghui Guo, Haiyong Zheng, Yufeng Jiang, Zhaorui Gu, and Bing Zheng. Intrinsic image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16367–16376, 2021. 6, 7
[20] Yucheng Hang, Bin Xia, Wenming Yang, and Qingmin Liao. Scs-co: Self-consistent style contrastive learning for image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19710–19719, 2022. 1, 2, 3, 6, 8
[21] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 2, 3, 4, 5, 6, 8, 12
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
[23] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 3, 4, 12
[24] Yifan Jiang, He Zhang, Jianming Zhang, Yilin Wang, Zhe Lin, Kalyan Sunkavalli, Simon Chen, Sohrab Amirghodsi, Sarah Kong, and Zhangyang Wang. Ssh: A self-supervised framework for image harmonization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4832–4841, 2021. 1, 2, 3, 8
[25] Hakki Can Karaimer and Michael S Brown. A software platform for manipulating the camera imaging pipeline. In European Conference on Computer Vision, pages 429–444. Springer, 2016. 8
[26] Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, and Rynson WH Lau. Harmonizer: Learning to perform white-box image and video harmonization. In European Conference on Computer Vision, pages 690–706. Springer, 2022. 1, 2, 3, 6, 7, 8
[27] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022. 3
[28] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):1–11, 2014. 3
[29] Jean-Francois Lalonde and Alexei A Efros. Using color compatibility for assessing image realism. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. 2, 3
[30] Joon-Young Lee, Kalyan Sunkavalli, Zhe Lin, Xiaohui Shen, and In So Kweon. Automatic content-aware color and tone stylization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2470–2478, 2016. 2, 3
[31] Wenbo Li, Xin Yu, Kun Zhou, Yibing Song, Zhe Lin, and Jiaya Jia. Sdm: Spatial diffusion model for large hole image inpainting. arXiv preprint arXiv:2212.02963, 2022. 3
[32] Jingtang Liang, Xiaodong Cun, Chi-Man Pun, and Jue Wang. Spatial-separated curve rendering network for efficient and high-resolution image harmonization. In European Conference on Computer Vision, pages 334–349. Springer, 2022. 1
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 2, 3
[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 12
[35] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022. 13
[36] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022. 2, 3, 4, 6, 12
[37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021. 3
[38] Francois Pitie, Anil C Kokaram, and Rozenn Dahyot. N-dimensional probability density function transfer and its application to color transfer. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1434–1439. IEEE, 2005. 2, 3
[39] François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007. 2, 3
[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 2, 3, 8, 12
[41] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021. 3, 8, 12
[42] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001. 2, 3
[43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2, 3, 4, 6, 8, 12, 13
[44] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. 3
[45] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 2, 5
[46] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 3
[47] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2, 3, 8, 12
[48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022. 3
[49] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 5
[50] Konstantin Sofiiuk, Polina Popenova, and Anton Konushin. Foreground-aware semantic representations for image harmonization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1620–1629, 2021. 1, 3
[51] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 3
[52] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 3, 4, 12, 13
[53] Kalyan Sunkavalli, Micah K Johnson, Wojciech Matusik, and Hanspeter Pfister. Multi-scale image harmonization. ACM Transactions on Graphics (TOG), 29(4):1–10, 2010. 2, 3
[54] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3789–3797, 2017. 1, 2, 3, 8
[55] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali
Dekel. Splicing vit features for semantic appearance transfer.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 10748–10757, 2022.
5
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. Advances in neural
information processing systems, 30, 2017. 3
[57] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot im-
age restoration using denoising diffusion null-space model.
arXiv preprint arXiv:2212.00490, 2022. 3
[58] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya
Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and
Shiyu Chang. Uncovering the disentanglement capabil-
ity in text-to-image diffusion models. arXiv preprint
arXiv:2212.08698, 2022. 3, 6
[59] Xuezhong Xiao and Lizhuang Ma. Color transfer in cor-
related color space. In Proceedings of the 2006 ACM in-
ternational conference on Virtual reality continuum and its
applications, pages 305–309, 2006. 2, 3
[60] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun
Zhang. Smartbrush: Text and shape guided object inpaint-
ing with diffusion model. arXiv preprint arXiv:2212.05034,
2022. 3, 6
[61] Ben Xue, Shenghui Ran, Quan Chen, Rongfei Jia, Binqiang
Zhao, and Xing Tang. Dccf: Deep comprehensible color fil-
ter learning framework for high-resolution image harmoniza-
tion. In European Conference on Computer Vision, pages
300–316. Springer, 2022. 1, 2, 3, 6, 7, 8
[62] Su Xue, Aseem Agarwala, Julie Dorsey, and Holly Rush-
meier. Understanding and improving the realism of image
composites. ACM Transactions on graphics (TOG), 31(4):1–
10, 2012. 2, 3
[63] Yueqin Yin, Lianghua Huang, Yu Liu, and Kaiqi Huang.
Diffgar: Model-agnostic restoration from generative arti-
facts using image-to-image diffusion models. arXiv preprint
arXiv:2210.08573, 2022. 3
[64] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-
jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-
fei Yang, Burcu Karagol Ayan, et al. Scaling autoregres-
sive models for content-rich text-to-image generation. arXiv
preprint arXiv:2206.10789, 2022. 2, 3, 8, 12
[65] Hui Zeng, Jianrui Cai, Lida Li, Zisheng Cao, and Lei Zhang.
Learning image-adaptive 3d lookup tables for high perfor-
mance photo enhancement in real-time. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2020. 8
[66] Fangneng Zhan, Shijian Lu, Changgong Zhang, Feiying Ma,
and Xuansong Xie. Adversarial image composition with
auxiliary illumination. In Proceedings of the Asian Confer-
ence on Computer Vision, 2020. 1, 2, 3
[67] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fi-
dler, Adela Barriuso, and Antonio Torralba. Semantic under-
standing of scenes through the ade20k dataset. International
Journal of Computer Vision, 127(3):302–321, 2019. 2

A. Overview
In the appendix, we will first introduce more implemen-
tation details of our proposed method in Appendix B. Then,
in Appendix C, we conduct more ablation studies, includ-
ing the existence of the regularization term in the optimiza-
tion, the prompt text format and order, and the number of
harmonization iterations. We give a deep analysis of our
method’s limitations and future work in Appendix D. Some
failure cases are displayed in Appendix E, and further trials
on content retention are demonstrated in Appendix F at last.

B. Implementation Details

We base our method on the pretrained Stable Diffusion [43] and images with 512×512 resolution. The DDIM sampling schedule [52] is leveraged for a deterministic diffusion process, and the number of diffusion timesteps is set to 50. We remove the conditional guidance in DDIM inversion by setting the guidance scale to 0, while in the reverse process, the guidance scale in the classifier-free guidance [23] is set to 2.5.

The optimization of the Attention-Constraint Text is based on the cross-attention maps in the diffusion process, and we have explored two kinds of implementations: an optimizing style and a training style. The former optimizes the text embedding at each diffusion timestep, similar to [36]. In each step, we optimize the embedding several times based on the embedding from the last step; thus we obtain a series of optimized embeddings when finishing the final timestep. Different from this, the latter implementation trains a single embedding. By taking all the timesteps as a dataset, we can iteratively sample a batch from this dataset to train the embedding. When adopting AdamW [34] with an initial learning rate of 1e−3 (1e−2 for the training style), setting w in Eq. (4) of the main paper to 5000 (1000 for the training style), and optimizing two times in each timestep (training for 50 epochs with batch size 4), we found that these two implementations can both achieve satisfactory harmonized results. If not specified, we leverage the optimizing style in the context.

We also adopt AdamW [34] in the reconstruction process [36] with a learning rate of 1e−1. Following [36], we utilize an increasing strategy, starting from one iteration with an increment every ten timesteps, to decide the number of iterations of the unconditional embedding optimization in each diffusion timestep. All the experiments are run on a single RTX 3090 GPU.
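
Under the hyper-parameters listed above, the "optimizing style" can be summarized with the following hedged sketch; `get_cross_attention` is a hypothetical helper that returns the cross-attention map of the frozen Stable Diffusion U-Net for a given embedding, latent, and timestep, and the surrounding data handling is omitted.

```python
import torch

def optimize_attention_constraint_text(emb_init, latents, timesteps,
                                        get_cross_attention, bg_mask,
                                        steps_per_t=2, lr=1e-3, w=5000.0):
    """Sketch of the per-timestep "optimizing style" described above.

    At every diffusion timestep, the text embedding is refined a few times so
    that its cross-attention focuses on the background mask (Eq. (3) of the
    main paper), while a regularizer keeps it near the initial word embedding
    (Eq. (4)). All names are illustrative placeholders, not the actual code.
    """
    emb = emb_init.clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([emb], lr=lr)
    refined = []
    for latent, t in zip(latents, timesteps):            # one inverted latent per timestep
        for _ in range(steps_per_t):
            attn = get_cross_attention(emb, latent, t)
            attn = attn / attn.max()                     # normalize to the mask's value range
            loss = ((bg_mask - attn) ** 2).sum() + w * ((emb - emb_init) ** 2).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        refined.append(emb.detach().clone())             # one refined embedding per timestep
    return refined
```

The training style would instead treat the (latent, timestep) pairs as a small dataset and update a single shared embedding over several epochs, as described above.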

C. More Ablation Studies

C.1. Regularization of Attention-Constraint Text

As mentioned in Sec. 4.1 of the main paper, we leverage an additional regularization term to restrict the optimization of the text embedding. We ablate the regularization in Fig. 11, and the results demonstrate its effectiveness. With the regularization, the text embedding does not drift far away from its initial value and thus plays a better guiding role.

Figure 11. Ablation of the regularization term in the Attention-Constraint Text optimization: (a) composite image, (b) w/o regularization, (c) final.

C.2. Prompt text format

Currently, we leverage just one environment text and one content text for the foreground/background. However, it can be observed that the given examples of the large-scale text-to-image models [41, 13, 47, 40, 64, 43] often use a more complete and formal sentence to describe a picture. Thus, we here explore the effect of different prompt text formats. As shown in Fig. 12, the currently used simplified texts achieve better results. We believe that this is due to the restrictions imposed by long prompt texts. When utilizing cross-attention maps in P2P [21], more text tokens bring more constraints on the image content, while the simple text format facilitates the harmonization a lot.

Figure 12. Ablation of the format and order of the initial text. The effect of formal/simple text is explored, together with whether the environment text is placed at the end: (a) composite image, (b) formal format, (c) order reverse, (d) final. Please zoom in for a better view.
12
Figure 13. Ablation of the number of harmonization iterations. More iterations allow for more impact from the background environment, yet also more content distortion. The number at the upper right corner of the image denotes the iteration number.

C.3. Prompt text order

Interestingly, we found that different orders of the prompt texts lead to some differences in the results. When the environment text is placed before the content text, there is a drop in performance (see Fig. 12). Looking deep into the structure of Stable Diffusion [43], we found that this is because of the usage of the causal attention mask, with which the later text token is blended with the semantics of the former token. Thus, if the content text is placed behind, it will be affected by the optimization of the former environment text, which then leads to the failure of the content retention mentioned in Sec. 4.2 of the main paper. This phenomenon is also consistent with the findings of [15].

C.4. Harmonization iterations

As our method is an iterative process, here we explore how many iterations lead to a stable harmonious result. From Fig. 13, we can see that there is a tradeoff between harmony and content retention. More iterations allow for more impact from the background environment, yet also more content distortion. Practically, we set the number of iterations to 10 and select the best result from them.

D. Limitations and Future Work

While we can achieve image harmonization without training on a large number of composite images, our method is still subject to some limitations to be addressed in follow-up work.

The first limitation is the time cost. Due to the progressive denoising process of the diffusion model, where a full sampling involves multiple timesteps, the diffusion itself is much more time-consuming [52, 43] compared with other generative models such as GANs [17]. As mentioned in Sec. 4 of the main paper, our method can be viewed as a hyper-diffusion with iterative diffusion samplings, harmonizing the composite image a little each time. Thus, the harmonization process costs a lot of time. By leveraging recent efficient sampling strategies [52, 35], we can largely alleviate the burden of the time cost, yet it still struggles to apply the method to real-time applications. Further acceleration of the diffusion process needs to be explored for this problem.

Another limitation is content retention. For image harmonization, the hope is to harmonize an image while keeping its content structure. Thus, in Sec. 4.2 of the main paper, we have exploited the self attention and the supplemental text, which can achieve high content retention. However, we also found that for some specific cases (see Appendix E), the distortion may still remain severe. A few further approaches have been tried, yet with little success (see Appendix F).

Moreover, as mentioned in Sec. 4.1 of the main paper, the Attention-Constraint Text is optimized based on an initial text, which is provided by users to coarsely illustrate the environments of the image. Although for many images we can easily come up with the initial text, there are still some for which we may need to think a bit more, which is inconvenient. Also, if the prompt text given by the user deviates greatly, the result will not be satisfactory (see Appendix E). The harmonization results thus depend to some extent on the initial text.
13
Figure 14. We here display two representative failure cases: (a) content distortion when faced with a huge inconsistency between foreground and background, and (b) a deviated result caused by misleading text.

One avenue for future work is to provide the initial text automatically. This can be achieved by training a network for generating environment texts. By constructing a dataset with pairs of images and texts describing the environment, the whole harmonization process can be freed from human intervention and thus facilitated.

In addition, more objective quantitative indicators need to be explored in the future. As described in Sec. 5.1 of the main paper, we currently resort to subjective user surveys to quantitatively evaluate the performance of the harmonization methods on real-world composite images. Such an evaluation approach is labor-intensive and time-consuming. Therefore, future research should also focus on this part.

E. Failure Cases

In Fig. 14, we demonstrate the limitations of our method mentioned in Appendix D. We can observe that for images with huge inconsistencies between the foreground and the background, the content structure distorts more easily. Also, if the initial text provided is inaccurate, the result will be misleading and inharmonious.

F. Further Trials on Content Retention

As mentioned in Appendix D, we have also tried some other approaches for preserving content, yet with little success. We list these trials here as potentially valuable references for further research.

Reliance on 3D LUT. In Sec. 5.3 of the main paper, we described the further application to user guidance by optimizing a 3D LUT. Thus, we have also considered optimizing a 3D LUT between the initial composite image and the final harmonized result. However, from the results in Fig. 15, we can see that it does not usually work (it seems fine for the "handbag", while bad for the "girl"). We attribute the failure to the content distortion. Since the 3D LUT is intrinsically an RGB-to-RGB global transformation, the content structures of the two images need to be close enough to obtain good LUT parameters, a condition our method cannot satisfy due to the content distortion that remains in our method. Furthermore, as shown in Fig. 15, our method performs more local editing, which cannot be fully represented by a global 3D LUT.

Figure 15. We here display our two further trials for content retention: (b) optimizing a 3D LUT, and (c) applying the Attention-Constraint optimization also to the content text; (a) is the composite image and (d) the final result.

Optimization of content text. As we leverage the Attention-Constraint Text to well illustrate the environment of the images in Sec. 4.1 of the main paper, a natural thought is whether we can apply the same to the content text for content retention. Motivated by this, we try to optimize the content text to focus on the foreground area of the cross-attention maps. However, the content retention fails to improve and even gets worse (see Fig. 15). Similar to the statement about the supplemental text effect (Sec. 4.2 of the main paper), we believe this is because the optimization of the content text breaks the text embedding's generalization, making it specific to only one context and not generalizable to another, thus leading to more serious content distortions.
14
