PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Zhen Li1,2*  Mingdeng Cao2,3*  Xintao Wang2†  Zhongang Qi2  Ming-Ming Cheng1†  Ying Shan2
1 VCIP, CS, Nankai University   2 ARC Lab, Tencent PCG   3 The University of Tokyo
https://photo-maker.github.io/
more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Trained on the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications.

1. Introduction

Customized image generation related to humans [28, 35, 51] has received considerable attention, giving rise to numerous applications, such as personalized portrait photos [1], image animation [67], and virtual try-on [59]. Early methods [39, 41], limited by the capabilities of generative models (i.e., GANs [19, 29]), could only customize the generation of the facial area, resulting in low diversity, scene richness, and controllability. Thanks to larger-scale text-image pair training datasets [55], larger generation models [44, 52], and text/visual encoders [45, 46] that can provide stronger semantic embeddings, diffusion-based text-to-image generation models have been evolving continuously. This evolution enables them to generate increasingly realistic facial details and rich scenes. Controllability has also greatly improved thanks to text prompts and structural guidance [40, 65].

Meanwhile, nurtured by powerful diffusion text-to-image models, many diffusion-based customized generation algorithms [17, 50] have emerged to meet users' demand for high-quality customized results. The most widely used in both commercial and community applications are DreamBooth-based methods [2, 50]. Such applications require dozens of images of the same identity (ID) to fine-tune the model parameters. Although the generated results have high ID fidelity, there are two obvious drawbacks: one is that the customized data used for fine-tuning must be collected manually each time, which is very time-consuming and laborious; the other is that customizing each ID requires 10-30 minutes, consuming a large amount of computing resources, especially when the generation model becomes larger. Therefore, to simplify and accelerate the customized generation process, recent works, driven by existing human-centric datasets [29, 36], have trained visual encoders [11, 62] or hypernetworks [5, 51] to represent the input ID images as embeddings or LoRA [25] weights of the model. After training, users only need to provide an image of the ID to be customized, and personalized generation can be achieved through a few dozen fine-tuning steps or even without any tuning process. However, the results customized by these methods cannot simultaneously possess ID fidelity and generation diversity like DreamBooth (see Fig. 3). This is because: 1) during the training process, both the target image and the input ID image are sampled from the same image, so the trained model easily memorizes characteristics unrelated to the ID, such as expressions and viewpoints, which leads to poor editability; and 2) relying solely on a single ID image to be customized makes it difficult for the model to discern the characteristics of the ID to be generated from its internal knowledge, resulting in unsatisfactory ID fidelity.

Based on the above two points, and inspired by the success of DreamBooth, in this paper we aim to: 1) ensure that the ID image condition and the target image exhibit variations in viewpoints, facial expressions, and accessories, so that the model does not memorize information that is irrelevant to the ID; and 2) provide the model with multiple different images of the same ID during the training process to more comprehensively and accurately represent the characteristics of the customized ID.

Therefore, we propose a simple yet effective feed-forward customized human generation framework that can receive multiple input ID images, termed PhotoMaker. To better represent the ID information of each input image, we stack the encodings of multiple input ID images at the semantic level, constructing a stacked ID embedding. This embedding can be regarded as a unified representation of the ID to be generated, and each of its subparts corresponds to an input ID image. To better integrate this ID representation and the text embedding into the network, we replace the class word (e.g., man and woman) of the text embedding with the stacked ID embedding. The resulting embedding simultaneously represents the ID to be customized and the contextual information to be generated. Through this design, without adding extra modules to the network, the cross-attention layers of the generation model itself can adaptively integrate the ID information contained in the stacked ID embedding.

At the same time, the stacked ID embedding allows us to accept any number of ID images as input during inference while maintaining generation efficiency like other tuning-free methods [56, 62]. Specifically, our method requires about 10 seconds to generate a customized human photo when receiving four ID images, which is about 130× faster than DreamBooth¹. Moreover, since our stacked ID embedding can represent the customized ID more comprehensively and accurately, our method can provide better ID fidelity and generation diversity compared to state-of-the-art tuning-free methods. Compared to previous methods, our framework has also greatly improved in terms of controllability. It can not only perform common recontextualization but also change the attributes of the input human image (e.g., accessories and expressions), generate a human photo with completely different viewpoints from the input ID, and even modify the input ID's gender and age (see Fig. 1).

¹ Tested on one NVIDIA Tesla V100.
It is worth noting that our PhotoMaker also unlocks many possibilities for users to generate customized human photos. Specifically, although the images that build the stacked ID embedding come from the same ID during training, we can use different ID images to form the stacked ID embedding during inference to merge and create a new customized ID. The merged new ID can retain the characteristics of the different input IDs. For example, we can generate a Scarlett Johansson that looks like Elon Musk, or a customized ID that mixes a person with a well-known IP character (see Fig. 1(c)). At the same time, the merging ratio can be simply adjusted by prompt weighting [3, 21] or by changing the proportion of different ID images in the input image pool, demonstrating the flexibility of our framework.

Our PhotoMaker necessitates the simultaneous input of multiple images with the same ID during the training process, thereby requiring the support of an ID-oriented human dataset. However, existing datasets either do not classify by IDs [29, 35, 55, 68] or only focus on faces without including other contextual information [36, 41, 60]. We therefore design an automated pipeline to construct an ID-oriented dataset to facilitate the training of our PhotoMaker. Through this pipeline, we can build a dataset that includes a large number of IDs, each with multiple images featuring diverse viewpoints, attributes, and scenarios. Meanwhile, in this pipeline, we can automatically generate a caption for each image, marking out the corresponding class word [50], to better adapt to the training needs of our framework.

2. Related work

Text-to-Image Diffusion Models. Diffusion models [23, 58] have made remarkable progress in text-conditioned image generation [30, 47, 49, 52], attracting widespread attention in recent years. The remarkable performance of these models can be attributed to high-quality large-scale text-image datasets [9, 54, 55], continuous upgrades of foundational models [10, 43], conditioning encoders [26, 45, 46], and improvements in controllability [34, 40, 63, 65]. Due to these advancements, Podell et al. [44] developed the currently most powerful open-source generative model, SDXL. Given its impressive capabilities in generating human portraits, we build our PhotoMaker based on this model. However, our method can also be extended to other text-to-image synthesis models.

Personalization in Diffusion Models. Owing to the powerful generative capabilities of diffusion models, many researchers have explored personalized generation based on them. Currently, mainstream personalized synthesis methods can be divided into two categories. One relies on additional optimization during the test phase, such as DreamBooth [50] and Textual Inversion [17]. Given that both pioneering works require substantial time for fine-tuning, some studies have attempted to expedite the process of personalized customization by reducing the number of parameters needed for tuning [2, 20, 32, 64] or by pre-training with large datasets [18, 51]. Despite these advances, they still require extensive fine-tuning of the pre-trained model for each new concept, making the process time-consuming and restricting its applications. Recently, some studies [12, 13, 27, 37, 38, 56, 61] attempt to perform personalized generation using a single image with a single forward pass, significantly accelerating the personalization process. These methods either utilize personalization datasets [12, 57] for training or encode the images to be customized in the semantic space [11, 27, 38, 56, 61, 62]. Our method focuses on the generation of human portraits based on both of the aforementioned technical approaches. Specifically, it relies not only on the construction of an ID-oriented personalization dataset, but also on obtaining the embedding that represents the person's ID in the semantic space. Unlike previous embedding-based methods, our PhotoMaker extracts a stacked ID embedding from multiple ID images. While providing a better ID representation, the proposed method maintains the same high efficiency as previous embedding-based methods.

3. Method

3.1. Overview

Given a few ID images to be customized, the goal of our PhotoMaker is to generate a new photo-realistic human image that retains the characteristics of the input IDs and changes the content or the attributes of the generated ID under the control of the text prompt. Although we input multiple ID images for customization like DreamBooth, we still enjoy the same efficiency as other tuning-free methods, accomplishing customization with a single forward pass while maintaining promising ID fidelity and text editability. In addition, we can also mix multiple input IDs, and the generated image can well retain the characteristics of the different IDs, which opens up possibilities for more applications. The above capabilities are mainly brought by our proposed simple yet effective stacked ID embedding, which provides a unified representation of the input IDs. Furthermore, to facilitate training our PhotoMaker, we design a data construction pipeline to build a human-centric dataset classified by IDs. Fig. 2(a) shows the overview of the proposed PhotoMaker. Fig. 2(b) shows our data construction pipeline.
Figure 2. Overviews of the proposed (a) PhotoMaker and (b) ID-oriented data construction pipeline. For the proposed PhotoMaker,
we first obtain the text embedding and image embeddings from text encoder(s) and image encoder, respectively. Then, we extract the
fused embedding by merging the corresponding class embedding (e.g., man and woman) and each image embedding. Next, we concatenate
all fused embeddings along the length dimension to form the stacked ID embedding. Finally, we feed the stacked ID embedding to all
cross-attention layers for adaptively merging the ID content in the diffusion model. Note that although we use images of the same ID with
the masked background during training, we can directly input images of different IDs without background distortion to create a new ID
during inference.
3.2. Stacked ID Embedding

Encoders. Following recent works [27, 56, 61], we use the CLIP [45] image encoder E_img to extract image embeddings because of its alignment with the original text representation space in diffusion models. Before feeding each input image into the image encoder, we fill the image areas other than the body part of the specific ID with random noise to eliminate the influence of other IDs and the background. Since the data used to train the original CLIP image encoder mostly consists of natural images, to better enable the model to extract ID-related embeddings from the masked images, we fine-tune part of the transformer layers in the image encoder when training our PhotoMaker. We also introduce additional learnable projection layers to project the embedding obtained from the image encoder into the same dimension as the text embedding. Let {X_i | i = 1 ... N} denote the N input ID images acquired from a user; we thus obtain the extracted embeddings {e_i ∈ R^D | i = 1 ... N}, where D denotes the projected dimension. Each embedding corresponds to the ID information of an input image. For a given text prompt T, we extract the text embedding t ∈ R^{L×D} using the pre-trained CLIP text encoder E_text, where L denotes the length of the embedding.
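A minimal PyTorch-style sketch of this encoding step, assuming a CLIP ViT-L/14 vision tower whose forward pass returns a pooled feature; the module name `clip_vision`, the dimensions, and the single linear projection are illustrative stand-ins rather than the authors' actual implementation:

```python
import torch
import torch.nn as nn

class IDImageEncoder(nn.Module):
    """Encode each masked ID image into an embedding e_i of dimension D."""

    def __init__(self, clip_vision: nn.Module, clip_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        self.clip_vision = clip_vision             # partially fine-tuned during PhotoMaker training
        self.proj = nn.Linear(clip_dim, text_dim)  # learnable projection to the text-embedding dim D

    @staticmethod
    def mask_non_id_regions(images: torch.Tensor, person_masks: torch.Tensor) -> torch.Tensor:
        """Replace everything outside the person mask with random noise."""
        noise = torch.randn_like(images)
        return images * person_masks + noise * (1.0 - person_masks)

    def forward(self, images: torch.Tensor, person_masks: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, H, W); person_masks: (N, 1, H, W) with 1 = target ID, 0 = background/other IDs
        masked = self.mask_non_id_regions(images, person_masks)
        feats = self.clip_vision(masked)           # assumed to return pooled features of shape (N, clip_dim)
        return self.proj(feats)                    # e_1 ... e_N, shape (N, D)
```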
Stacking. Recent works [17, 50, 62] have shown that, in text-to-image models, personalized character ID information can be represented by unique tokens. Our method has a similar design to better represent the ID information of the input human images. Specifically, we mark the corresponding class word (e.g., man and woman) in the input caption (see Sec. 3.3). We then extract the feature vector at the position of the class word in the text embedding. This feature vector is fused with each image embedding e_i; we use two MLP layers to perform this fusion operation. The fused embeddings can be denoted as {ê_i ∈ R^D | i = 1 ... N}. By incorporating the feature vector of the class word, each fused embedding represents the corresponding input ID image more comprehensively. In addition, during the inference stage, this fusion operation also provides stronger semantic controllability for the customized generation process. For example, we can customize the age and gender of the human ID by simply replacing the class word (see Sec. 4.2).

After obtaining the fused embeddings, we concatenate them along the length dimension to form the stacked ID embedding:

$$ s^* = \mathrm{Concat}([\hat{e}_1, \ldots, \hat{e}_N]), \qquad s^* \in \mathbb{R}^{N \times D}. \tag{1} $$

This stacked ID embedding can serve as a unified representation of multiple ID images while retaining the original representation of each input ID image. It can accept the encoded embeddings of any number of ID images, so its length N is variable. Compared to DreamBooth-based methods [2, 50], which input multiple images to fine-tune the model for personalized customization, our method essentially sends multiple embeddings to the model simultaneously. After packaging the multiple images of the same ID into a batch as the input of the image encoder, a stacked ID embedding can be obtained through a single forward pass, significantly enhancing efficiency compared to tuning-based methods. Meanwhile, compared to other embedding-based methods [61, 62], this unified representation can maintain both promising ID fidelity and text controllability, as it contains more comprehensive ID information. In addition, it is worth noting that, although we only use multiple images of the same ID to form the stacked ID embedding during training, we can use images from different IDs to construct it during the inference stage. Such flexibility opens up possibilities for many interesting applications. For example, we can mix two persons that exist in reality, or mix a person and a well-known IP character (see Sec. 4.2).
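A compact sketch of the fusion-and-stacking step of Eq. (1), under the same assumptions as the encoder sketch above; the two-linear-layer MLP with a GELU in between and the way the class-word feature is broadcast are illustrative choices:

```python
import torch
import torch.nn as nn

class StackedIDEmbedding(nn.Module):
    """Fuse each image embedding e_i with the class-word text feature, then stack along the length dimension."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        # two MLP layers performing the fusion described in the paper
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, image_embeds: torch.Tensor, class_token_embed: torch.Tensor) -> torch.Tensor:
        # image_embeds: (N, D) -- e_1 ... e_N from the image encoder
        # class_token_embed: (D,) -- text feature at the marked class-word position
        n = image_embeds.shape[0]
        class_rep = class_token_embed.unsqueeze(0).expand(n, -1)          # (N, D)
        fused = self.fuse(torch.cat([image_embeds, class_rep], dim=-1))   # \hat{e}_1 ... \hat{e}_N
        return fused                                                      # stacked ID embedding s*, shape (N, D)
```

During inference the N rows may come from different identities, which is what enables the identity-mixing applications mentioned above.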
Merging. We use the inherent cross-attention mechanism in diffusion models to adaptively merge the ID information contained in the stacked ID embedding. We first replace the feature vector at the position corresponding to the class word in the original text embedding t with the stacked ID embedding s^*, resulting in an updated text embedding t^* ∈ R^{(L+N−1)×D}. Then, the cross-attention operation can be formulated as:

$$ \mathbf{Q} = \mathbf{W}_Q \cdot \phi(z_t), \quad \mathbf{K} = \mathbf{W}_K \cdot t^*, \quad \mathbf{V} = \mathbf{W}_V \cdot t^*, $$
$$ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right) \cdot \mathbf{V}, \tag{2} $$

where φ(·) is an embedding encoded from the input latent z_t by the UNet denoiser, and W_Q, W_K, and W_V are projection matrices. Besides, we can adjust the degree of participation of one input ID image in generating the new customized ID through prompt weighting [3, 21], demonstrating the flexibility of our PhotoMaker. Recent works [2, 32] found that good ID customization performance can be achieved by simply tuning the weights of the attention layers. To make the original diffusion model better perceive the ID information contained in the stacked ID embedding, we additionally train LoRA [2, 25] residuals of the matrices in the attention layers.
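A sketch of the token-replacement step that builds the updated text embedding t*; the tensor layout (a single prompt with the class word at index `class_pos`) is assumed for illustration:

```python
import torch

def build_updated_text_embedding(text_embeds: torch.Tensor,
                                 stacked_id_embeds: torch.Tensor,
                                 class_pos: int) -> torch.Tensor:
    """Replace the class-word token with the stacked ID embedding.

    text_embeds:       (L, D) -- embedding of the prompt, e.g. "a man wearing a spacesuit"
    stacked_id_embeds: (N, D) -- s* from Eq. (1)
    returns t*:        (L + N - 1, D), fed to every cross-attention layer
    """
    return torch.cat(
        [text_embeds[:class_pos],       # tokens before the class word
         stacked_id_embeds,             # s* takes the place of the single class-word token
         text_embeds[class_pos + 1:]],  # tokens after the class word
        dim=0,
    )
```

The UNet's cross-attention layers, whose projection matrices carry the trained LoRA residuals, then attend over t* exactly as in Eq. (2), so no new modules are added to the network.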
3.3. ID-Oriented Human Data Construction

Since our PhotoMaker needs to sample multiple images of the same ID when constructing the stacked ID embedding during the training process, we need a dataset classified by IDs to drive its training. However, existing human datasets either do not annotate ID information [29, 35, 55, 68], or the richness of the scenes they contain is very limited [36, 41, 60] (i.e., they only focus on the face area). Thus, in this section, we introduce a pipeline for constructing a human-centric text-image dataset that is classified by different IDs. Fig. 2(b) illustrates the proposed pipeline. Through this pipeline, we can collect an ID-oriented dataset that contains a large number of IDs, where each ID has multiple images covering different expressions, attributes, scenes, etc. This dataset not only facilitates the training process of our PhotoMaker but may also inspire future ID-driven research. The statistics of the dataset are shown in the appendix.

Image downloading. We first compile a roster of celebrities, which can be obtained from VoxCeleb1 and VGGFace2 [7]. We search for the names in a search engine according to the list and crawl the data. About 100 images were downloaded for each name. To generate higher-quality portrait images [44], we filtered out images whose shortest side has a resolution of less than 512 during the download process.

Face detection and filtering. We first use RetinaFace [16] to detect face bounding boxes and filter out detections with small sizes (less than 256 × 256). If an image does not contain any bounding boxes that meet the requirements, the image is filtered out. We then perform ID verification on the remaining images.

ID verification. Since an image may contain multiple faces, we first need to identify which face belongs to the current identity group. Specifically, we send all the face regions in the detection boxes of the current identity group into ArcFace [15] to extract identity embeddings and calculate the L2 similarity of each pair of faces. We sum the similarities computed between each identity embedding and all other embeddings to obtain a score for each bounding box, and select the bounding box with the highest sum score for each image with multiple faces. After bounding box selection, we recompute the sum score for each remaining box. We then calculate the standard deviation δ of the sum scores within each ID group and empirically use 8δ as a threshold to filter out images with inconsistent IDs.
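A sketch of the scoring and filtering logic, assuming ArcFace embeddings for one identity group are stored as a NumPy array; the 8δ threshold follows the paper, while the use of cosine similarity on normalized embeddings and the exact thresholding rule (mean minus 8δ) are illustrative assumptions:

```python
import numpy as np

def sum_similarity_scores(embeds: np.ndarray) -> np.ndarray:
    """embeds: (M, 512) ArcFace identity embeddings for one identity group.

    Returns, for each embedding, the sum of its similarities to all other embeddings.
    Cosine similarity on normalized embeddings stands in for the paper's pairwise face similarity.
    """
    embeds = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    sim = embeds @ embeds.T          # (M, M) pairwise similarities
    np.fill_diagonal(sim, 0.0)       # exclude self-similarity
    return sim.sum(axis=1)

def filter_inconsistent_ids(embeds: np.ndarray, k: float = 8.0) -> np.ndarray:
    """Keep indices whose sum score is within k standard deviations of the group mean."""
    scores = sum_similarity_scores(embeds)
    delta = scores.std()
    return np.where(scores >= scores.mean() - k * delta)[0]
```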
Cropping and segmentation. We first crop the image with a larger square box based on the detected face area, while ensuring that the facial region occupies more than 10% of the cropped image. Since we need to remove the irrelevant background and IDs from the input ID image before sending it into the image encoder, we need to generate a mask for the specified ID. Specifically, we employ Mask2Former [14] to perform panoptic segmentation for the 'person' class. We keep the mask with the highest overlap with the facial bounding box corresponding to the ID. Besides, we discard images where no mask is detected, as well as images where no overlap is found between the bounding box and the mask area.
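A sketch of selecting the person mask that belongs to the verified face, assuming the panoptic output has already been reduced to a list of binary 'person' masks; measuring overlap as the fraction of the face box covered is an assumption:

```python
import numpy as np
from typing import Optional

def select_person_mask(person_masks: list[np.ndarray],
                       face_box: tuple[int, int, int, int]) -> Optional[np.ndarray]:
    """Return the binary person mask that overlaps the face box the most, or None.

    person_masks: list of (H, W) boolean masks for the 'person' class.
    face_box: (x1, y1, x2, y2) of the verified face for the current ID.
    """
    x1, y1, x2, y2 = face_box
    box_area = max((x2 - x1) * (y2 - y1), 1)
    best_mask, best_overlap = None, 0.0
    for mask in person_masks:
        overlap = mask[y1:y2, x1:x2].sum() / box_area   # fraction of the face box covered by this mask
        if overlap > best_overlap:
            best_mask, best_overlap = mask, overlap
    # Discard the image if no mask overlaps the face box at all.
    return best_mask if best_overlap > 0 else None
```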
between the sub-caption after segmentation and the specific Evaluation dataset. Our evaluation dataset includes 25
ID region in the image. Besides, we calculate the label sim- IDs, which consist of 9 IDs from Mystyle [41] and an addi-
ilarity between the class word of the current segment and tional 16 IDs that we collected by ourselves. Note that these
the class word of the current identity group through Sen- IDs do not appear in the training set, serving to evaluate the
tenceFormer [48]. We choose to mark the class word cor- generalization ability of the model. To conduct a more com-
responding to the maximum product of the CLIP score and prehensive evaluation, we also prepare 40 prompts, which
the label similarity. cover a variety of expressions, attributes, decorations, ac-
tions, and backgrounds. For each prompt of each ID, we
4. Experiments generate 4 images for evaluation. More details are listed in
the appendix.
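A sketch of the class-word marking logic for the simple cases (a single class word, or a caption that contains the identity group's majority class word); the fallback with dependency parsing and CLIP/label-similarity scoring is omitted, and the class-word vocabulary, the singularization map, and the `<...>` marker are illustrative choices:

```python
from collections import Counter
from typing import Optional

CLASS_WORDS = {"man", "woman", "boy", "girl", "person"}
SINGULAR = {"men": "man", "women": "woman", "boys": "boy", "girls": "girl", "people": "person"}

def singularize(caption: str) -> str:
    return " ".join(SINGULAR.get(w, w) for w in caption.split())

def group_class_word(captions: list[str]) -> str:
    """Most frequent class word across all captions of one identity group."""
    counts = Counter(w for c in captions for w in c.split() if w in CLASS_WORDS)
    return counts.most_common(1)[0][0]

def mark_caption(caption: str, group_word: str) -> Optional[str]:
    """Wrap the class word of the current ID, e.g. 'a man riding a bike' -> 'a <man> riding a bike'."""
    words = singularize(caption).split()
    in_caption = [w for w in words if w in CLASS_WORDS]
    if len(in_caption) == 1:            # only one class word: annotate directly
        target = in_caption[0]
    elif group_word in in_caption:      # otherwise prefer the identity group's class word
        target = group_word
    else:
        return None                     # fall back to dependency parsing + CLIP score (not shown)
    idx = words.index(target)
    words[idx] = f"<{words[idx]}>"
    return " ".join(words)
```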
4. Experiments

4.1. Setup

Implementation details. To generate more photo-realistic human portraits, we employ the SDXL model [44] stable-diffusion-xl-base-1.0 as our text-to-image synthesis model. Correspondingly, the training data is resized to a resolution of 1024 × 1024. We employ CLIP ViT-L/14 [45] and an additional projection layer to obtain the initial image embeddings e_i. For text embeddings, we keep the original two text encoders in SDXL for extraction. The overall framework is optimized with Adam [31] on 8 NVIDIA A100 GPUs for two weeks with a batch size of 48. We set the learning rate to 1e-4 for the LoRA weights and 1e-5 for the other trainable modules. During training, we randomly sample 1-4 images with the same ID as the current target ID image to form the stacked ID embedding. Besides, to improve the generation performance when using classifier-free guidance, we replace the updated text embedding t^* with a null-text embedding with a probability of 10%. We also use the masked diffusion loss [6] with a probability of 50% to encourage the model to generate more faithful ID-related areas. During the inference stage, we use delayed subject conditioning [62] to resolve conflicts between the text and ID conditions. We use 50 steps of the DDIM sampler [58]. The scale of classifier-free guidance is set to 5.
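A small sketch of the per-sample conditioning choices described above (a random 1-4 same-ID reference images, 10% null-text dropout for classifier-free guidance); the data structures and helper name are illustrative:

```python
import random
import torch

def sample_conditioning(id_image_pool: list[torch.Tensor],
                        text_embeds: torch.Tensor,
                        null_text_embeds: torch.Tensor,
                        p_uncond: float = 0.1) -> tuple[torch.Tensor, torch.Tensor]:
    """Pick the reference images and the text condition for one training sample.

    id_image_pool: all images belonging to the same ID as the target image.
    text_embeds / null_text_embeds: updated text embedding t* and its null-text counterpart.
    """
    n_refs = random.randint(1, min(4, len(id_image_pool)))
    ref_images = torch.stack(random.sample(id_image_pool, n_refs))   # (n_refs, 3, H, W)
    # With probability p_uncond, drop the text condition so that
    # classifier-free guidance can be applied at inference time.
    cond = null_text_embeds if random.random() < p_uncond else text_embeds
    return ref_images, cond
```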
Evaluation metrics. Following DreamBooth [50], we use the DINO [8] and CLIP-I [17] metrics to measure ID fidelity and the CLIP-T [45] metric to measure prompt fidelity. For a more comprehensive evaluation, we also compute the face similarity by detecting and cropping the facial regions of the generated image and a real image with the same ID. We use RetinaFace [16] as the detection model, and face embeddings are extracted by FaceNet [53]. To evaluate generation quality, we employ the FID metric [22, 42]. Importantly, as most embedding-based methods tend to incorporate facial pose and expression into the representation, the generated images often lack variation in the facial region. Thus, we propose a metric, named Face Diversity, to measure the diversity of the generated facial regions. Specifically, we first detect and crop the face region in each generated image. Next, we calculate the LPIPS [66] scores between each pair of facial areas over all generated images and take the average. The larger this value, the higher the diversity of the generated facial area.
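A sketch of the Face Diversity metric, assuming the `lpips` package and a list of pre-cropped face tensors; the AlexNet LPIPS backbone and the requirement that all crops share a common size are assumptions, and the face detection/cropping step is omitted:

```python
import itertools
import torch
import lpips

def face_diversity(face_crops: list[torch.Tensor]) -> float:
    """Average LPIPS distance over all pairs of generated face crops (higher = more diverse).

    face_crops: list of (3, H, W) tensors in [-1, 1], all resized to the same spatial size.
    """
    loss_fn = lpips.LPIPS(net="alex")
    scores = []
    with torch.no_grad():
        for a, b in itertools.combinations(face_crops, 2):
            scores.append(loss_fn(a.unsqueeze(0), b.unsqueeze(0)).item())
    return sum(scores) / len(scores)
```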
Evaluation dataset. Our evaluation dataset includes 25 IDs, consisting of 9 IDs from Mystyle [41] and an additional 16 IDs that we collected ourselves. Note that these IDs do not appear in the training set, serving to evaluate the generalization ability of the model. To conduct a more comprehensive evaluation, we also prepare 40 prompts, which cover a variety of expressions, attributes, decorations, actions, and backgrounds. For each prompt of each ID, we generate 4 images for evaluation. More details are listed in the appendix.

4.2. Applications

In this section, we elaborate on the applications that our PhotoMaker can empower. For each application, we choose the comparison methods that are most suitable for the corresponding setting, drawn from DreamBooth [50], Textual Inversion [17], FastComposer [62], and IPAdapter [63]. We prioritize using the official model provided by each method. For DreamBooth and IPAdapter, we use their SDXL versions for a fair comparison. For all applications, we choose four input ID images to form the stacked ID embedding in our PhotoMaker. For fairness, we also use four images to train the methods that need test-time optimization. We provide more samples in the appendix for each application.

Recontextualization. We first show results with simple context changes, such as modified hair color and clothing, or backgrounds generated from basic prompt control. Since all methods can adapt to this application, we conduct quantitative and qualitative comparisons of the generated results (see Tab. 1 and Fig. 3). The results show that our method generates high-quality images while ensuring high ID fidelity (with the largest CLIP-I and DINO scores, and the second-best Face Similarity). Compared to most methods, our method generates images of higher quality, and the generated facial regions exhibit greater diversity. At the same time, our method maintains an efficiency consistent with embedding-based methods. For a more comprehensive comparison, we show the user study results in Sec. B of the appendix.

[Figure 3 shows, for five identities, reference images and results from DreamBooth, Textual Inversion, FastComposer, IPAdapter, and PhotoMaker (Ours) with the prompts "a man wearing a spacesuit", "a man wearing a Christmas hat", "a woman sitting at the beach, with purple sunset", "a woman with red hair", and "a woman wearing a doctoral cap".]
Figure 3. Qualitative comparison on universal recontextualization samples. We compare our method with DreamBooth [50], Textual
Inversion [17], FastComposer [62], and IPAdapter [63] for five different identities and corresponding prompts. We observe that our method
generally achieves high-quality generation, promising editability, and strong identity fidelity. (Zoom-in for the best view)
Method | CLIP-T↑ (%) | CLIP-I↑ (%) | DINO↑ (%) | Face Sim.↑ (%) | Face Div.↑ (%) | FID↓ | Speed↓ (s)
DreamBooth [50] | 29.8 | 62.8 | 39.8 | 49.8 | 49.1 | 374.5 | 1284
Textual Inversion [17] | 24.0 | 70.9 | 39.3 | 54.3 | 59.3 | 363.5 | 2400
FastComposer [62] | 28.7 | 66.8 | 40.2 | 61.0 | 45.4 | 375.1 | 8
IPAdapter [63] | 25.1 | 71.2 | 46.2 | 67.1 | 52.4 | 375.2 | 12
PhotoMaker (Ours) | 26.1 | 73.6 | 51.5 | 61.8 | 57.7 | 370.3 | 10
Table 1. Quantitative comparison on the universal recontextualization setting. The metrics used for benchmarking cover the ability to
preserve ID information (i.e., CLIP-I, DINO, and Face Similarity), text consistency (i.e., CLIP-T), diversity of generated faces (i.e., Face
Diversity), and generation quality (i.e., FID). Besides, we define personalized speed as the time it takes to obtain the final personalized
image after feeding the ID condition(s). We measure personalized time on a single NVIDIA Tesla V100 GPU. The best result is shown in
bold, and the second best is underlined.
Bringing a person in artwork/old photos into reality. By taking artistic paintings, sculptures, or old photos of a person as input, our PhotoMaker can bring that person from the last century or even ancient times into the present to "take" photos for them. Fig. 4(a) illustrates the results. Compared to our method, both DreamBooth and SDXL have difficulty generating realistic human images that have not appeared in real photos. In addition, due to DreamBooth's heavy reliance on the quality and resolution of the customized images, it is difficult for DreamBooth to generate high-quality results when old photos are used for customized generation.

Changing age or gender. By simply replacing class words
[Figure 4: reference images and results from DreamBooth, SDXL, and PhotoMaker (Ours).]