Paper 2

Customizing Realistic Human Photos via Stacked ID Embedding
Zhen Li1,2 * Mingdeng Cao2,3 * Xintao Wang2 † Zhongang Qi2 Ming-Ming Cheng1 † Ying Shan2
1
VCIP, CS, Nankai University 2 ARC Lab, Tencent PCG 3 The University of Tokyo
https://photo-maker.github.io/
A man wearing A woman wearing A man wearing a A boy wearing a
headphones with red hair sunglasses and necklace Christmas hat doctoral cap
arXiv:2312.04461v1 [cs.CV] 7 Dec 2023
User inputs User inputs
(a) Attributes change

A man coding in front of a A man in a helmet and A man wearing
A man wearing a spacesuit
computer vest riding a motorcycle headphones
(b) Artwork / old-photo to reality

A woman holding a bottle
A photo of a woman A woman with red hair A woman in the snow
of red wine
(c) Identity mixing

Figure 1. Given a few images of input ID(s), the proposed PhotoMaker can generate diverse personalized ID images based on the text
prompt in a single forward pass. Our method can well preserve the ID information from the input image pool while generating realistic
human photos. PhotoMaker also empowers many interesting applications such as (a) changing attributes, (b) bringing persons from
artworks or old photos into reality, or (c) performing identity mixing. (Zoom-in for the best view)
Abstract (ID) fidelity, and flexible text controllability. In this work,

we introduce PhotoMaker, an efficient personalized text-
Recent advances in text-to-image generation have made to-image generation method, which mainly encodes an ar-
remarkable progress in synthesizing realistic human photos bitrary number of input ID images into a stack ID embed-
conditioned on given text prompts. However, existing per- ding for preserving ID information. Such an embedding,
sonalized generation methods cannot simultaneously sat- serving as a unified ID representation, can not only encap-
isfy the requirements of high efficiency, promising identity sulate the characteristics of the same input ID comprehen-
sively, but also accommodate the characteristics of differ-
∗ Interns in ARC Lab, Tencent PCG † Corresponding authors
ent IDs for subsequent integration. This paves the way for
1
more intriguing and practically valuable applications. Be- Fig. 3). This is because that: 1) during the training pro-
sides, to drive the training of our PhotoMaker, we propose cess, both the target image and the input ID image sample
an ID-oriented data construction pipeline to assemble the from the same image. The trained model easily remembers
training data. Under the nourishment of the dataset con- characteristics unrelated to the ID in the image, such as ex-
structed through the proposed pipeline, our PhotoMaker pressions and viewpoints, which leads to poor editability,
demonstrates better ID preservation ability than test-time and 2) relying solely on a single ID image to be customized
fine-tuning based methods, yet provides significant speed makes it difficult for the model to discern the characteris-
improvements, high-quality generation results, strong gen- tics of the ID to be generated from its internal knowledge,
eralization capabilities, and a wide range of applications. resulting in unsatisfactory ID fidelity.
Based on the above two points, and inspired by the suc-
cess of DreamBooth, in this paper, we aim to: 1) ensure that
1. Introduction the ID image condition and the target image exhibit varia-
tions in viewpoints, facial expressions, and accessories, so
Customized image generation related to humans [28, 35, that the model does not memorize information that is irrele-
51] has received considerable attention, giving rise to nu- vant to the ID; 2) provide the model with multiple different
merous applications, such as personalized portrait pho- images of the same ID during the training process to more
tos [1], image animation [67], and virtual try-on [59]. Early comprehensively and accurately represent the characteris-
methods [39, 41], limited by the capabilities of generative tics of the customized ID.
models (i.e., GANs [19, 29]), could only customize the gen- Therefore, we propose a simple yet effective feed-
eration of the facial area, resulting in low diversity, scene forward customized human generation framework that can
richness, and controllability. Thanks to larger-scale text- receive multiple input ID images, termed as PhotoMaker.
image pair training datasets [55], larger generation mod- To better represent the ID information of each input im-
els [44, 52], and text/visual encoders [45, 46] that can pro- age, we stack the encodings of multiple input ID images
vide stronger semantic embeddings, diffusion-based text- at the semantic level, constructing a stacked ID embedding.
to-image generation models have been continuously evolv- This embedding can be regarded as a unified representation
ing recently. This evolution enables them to generate in- of the ID to be generated, and each of its subparts corre-
creasingly realistic facial details and rich scenes. The con- sponds to an input ID image. To better integrate this ID
trollability has also greatly improved due to the existence of representation and the text embedding into the network, we
text prompts and structural guidance [40, 65] replace the class word (e.g., man and woman) of the text
Meanwhile, under the nurturing of powerful diffusion embedding with the stacked ID embedding. The result em-
text-to-image models, many diffusion-based customized bedding simultaneously represents the ID to be customized
generation algorithms [17, 50] have emerged to meet users’ and the contextual information to be generated. Through
demand for high-quality customized results. The most this design, without adding extra modules in the network,
widely used in both commercial and community applica- the cross-attention layer of the generation model itself can
tions are DreamBooth-based methods [2, 50]. Such ap- adaptively integrate the ID information contained in the
plications require dozens of images of the same identity stacked ID embedding.
(ID) to fine-tune the model parameters. Although the re-
At the same time, the stacked ID embedding allows us to
sults generated have high ID fidelity, there are two obvious
accept any number of ID images as input during inference
drawbacks: one is that customized data used for fine-tuning
while maintaining the efficiency of the generation like other
each time requires manual collection and thus is very time-
tuning-free methods [56, 62]. Specifically, our method re-
consuming and laborious; the other is that customizing each
quires about 10 seconds to generate a customized human
ID requires 10-30 minutes, consuming a large amount of
photo when receiving four ID images, which is about 130×
computing resources, especially when the generation model
faster than DreamBooth1 . Moreover, since our stacked ID
becomes larger. Therefore, to simplify and accelerate the
embedding can represent the customized ID more compre-
customized generation process, recent works, driven by ex-
hensively and accurately, our method can provide better ID
isting human-centric datasets [29, 36], have trained visual
fidelity and generation diversity compared to state-of-the-
encoders [11, 62] or hypernetworks [5, 51] to represent the
art tuning-free methods. Compared to previous methods,
input ID images as embeddings or LoRA [25] weights of
our framework has also greatly improved in terms of con-
the model. After training, users only need to provide an im-
trollability. It can not only perform common recontextual-
age of the ID to be customized, and personalized generation
ization but also change the attributes of the input human
can be achieved through a few dozen steps of fine-tuning or
image (e.g., accessories and expressions), generate a hu-
even without any tuning process. However, the results cus-
man photo with completely different viewpoints from the
tomized by these methods cannot simultaneously possess
ID fidelity and generation diversity like DreamBooth (see 1 Test on one NVIDIA Tesla V100
2
input ID, and even modify the input ID’s gender and age as DreamBooth [50] and Textual Inversion [17]. Given
(see Fig. 1). that both pioneer works require substantial time for fine-
It is worth noticing that our PhotoMaker also unleashes tuning, some studies have attempted to expedite the pro-
a lot of possibilities for users to generate customized hu- cess of personalized customization by reducing the num-
man photos. Specifically, although the images that build ber of parameters needed for tuning [2, 20, 32, 64] or by
the stacked ID embedding come from the same ID during pre-training with large datasets [18, 51]. Despite these ad-
training, we can use different ID images to form the stacked vances, they still require extensive fine-tuning of the pre-
ID embedding during inference to merge and create a new trained model for each new concept, making the process
customized ID. The merged new ID can retain the charac- time-consuming and restricting its applications. Recently,
teristics of different input IDs. For example, we can gener- some studies [12, 13, 27, 37, 38, 56, 61] attempt to per-
ate Scarlett Johansson that looks like Elun Musk or a cus- form personalized generation using a single image with a
tomized ID that mixes a person with a well-known IP char- single forward pass, significantly accelerating the person-
acter (see Fig. 1(c)). At the same time, the merging ratio alization process. These methods either utilize personaliza-
can be simply adjusted by prompt weighting [3, 21] or by tion datasets [12, 57] for training or encode the images to be
changing the proportion of different ID images in the input customized in the semantic space [11, 27, 38, 56, 61, 62].
image pool, demonstrating the flexibility of our framework. Our method focuses on the generation of human portraits
Our PhotoMaker necessitates the simultaneous input of based on both of the aforementioned technical approaches.
multiple images with the same ID during the training pro- Specifically, it not only relies on the construction of an ID-
cess, thereby requiring the support of an ID-oriented hu- oriented personalization dataset, but also on obtaining the
man dataset. However, existing datasets either do not clas- embedding that represents the person’s ID in the seman-
sify by IDs [29, 35, 55, 68] or only focus on faces with- tic space. Unlike previous embedding-based methods, our
out including other contextual information [36, 41, 60]. We PhotoMaker extracts a stacked ID embedding from multi-
therefore design an automated pipeline to construct an ID- ple ID images. While providing better ID representation,
related dataset to facilitate the training of our PhotoMaker. the proposed method can maintain the same high efficiency
Through this pipeline, we can build a dataset that includes as previous embedding-based methods.
a large number of IDs, each with multiple images featuring
diverse viewpoints, attributes, and scenarios. Meanwhile, 3. Method
in this pipeline, we can automatically generate a caption for
3.1. Overview
each image, marking out the corresponding class word [50],
to better adapt to the training needs of our framework. Given a few ID images to be customized, the goal of our
PhotoMaker is to generate a new photo-realistic human im-
2. Related work age that retains the characteristics of the input IDs and
changes the content or the attributes of the generated ID
Text-to-Image Diffusion Models. Diffusion models [23, under the control of the text prompt. Although we input
58] have made remarkable progress in text-conditioned multiple ID images for customization like DreamBooth, we
image generation [30, 47, 49, 52], attracting widespread still enjoy the same efficiency as other tuning-free methods,
attention in recent years. The remarkable performance accomplishing customization with a single forward pass,
of these models can be attributable to high-quality large- while maintaining promising ID fidelity and text edibility.
scale text-image datasets [9, 54, 55], the continuous up- In addition, we can also mix multiple input IDs, and the
grades of foundational models [10, 43], conditioning en- generated image can well retain the characteristics of differ-
coders [26, 45, 46], and the improvement of controllabil- ent IDs, which releases possibilities for more applications.
ity [34, 40, 63, 65]. Due to these advancements, Podell et The above capabilities are mainly brought by our proposed
al. [44] developed the currently most powerful open-source simple yet effective stacked ID embedding, which can pro-
generative model, SDXL. Given its impressive capabilities vide a unified representation of the input IDs. Furthermore,
in generating human portraits, we build our PhotoMaker to facilitate training our PhotoMaker, we design a data con-
based on this model. However, our method can also be ex- struction pipeline to build a human-centric dataset classified
tended to other text-to-image synthesis models. by IDs. Fig. 2(a) shows the overview of the proposed Pho-
toMaker. Fig. 2(b) shows our data construction pipeline.
Personalization in Diffusion Models. Owing to the pow-
erful generative capabilities of the diffusion models, more 3.2. Stacked ID Embedding
researchers try to explore personalized generation based on
them. Currently, mainstream personalized synthesis meth- Encoders. Following recent works [27, 56, 61], we use the
ods can be mainly divided into two categories. One re- CLIP [45] image encoder Eimg to extract image embeddings
lies on additional optimization during the test phase, such for its alignment with the original text representation space
3
Training ID Images
Diffusion Model 1. Image Downloading
Image Encoder
2. Face Detection & Filtering
Image
Embeddings 3. ID Verification
Stacked ID Embedding Grouped by IDs
Text MLPs
Inference ID Images Embedding 4. Cropping & Segmentation
Grouped by IDs
Updated Text Embedding
Text Encoder(s) Training 5. Captioning & Marking

Inference
“a woman with long hair and a white shirt
“A man in a suit and tie Shared is looking down”
“A man wearing a blue cap” “a man in a black suit with a sword in her
with a badge on his lapel” hand”
…
(a) (b)
Figure 2. Overviews of the proposed (a) PhotoMaker and (b) ID-oriented data construction pipeline. For the proposed PhotoMaker,
we first obtain the text embedding and image embeddings from text encoder(s) and image encoder, respectively. Then, we extract the
fused embedding by merging the corresponding class embedding (e.g., man and woman) and each image embedding. Next, we concatenate
all fused embeddings along the length dimension to form the stacked ID embedding. Finally, we feed the stacked ID embedding to all
cross-attention layers for adaptively merging the ID content in the diffusion model. Note that although we use images of the same ID with
the masked background during training, we can directly input images of different IDs without background distortion to create a new ID
during inference.
in diffusion models. Before feeding each input image into age embedding ei . We use two MLP layers to perform such
the image encoder, we filled the image areas other than the a fusion operation. The fused embeddings can be denoted
body part of a specific ID with random noises to eliminate as {êi ∈ RD | i = 1 . . . N }. By combining the feature
the influence of other IDs and the background. Since the vector of the class word, this embedding can represent the
data used to train the original CLIP image encoder mostly current input ID image more comprehensively. In addition,
consists of natural images, to better enable the model to ex- during the inference stage, this fusion operation also pro-
tract ID-related embeddings from the masked images, we vides stronger semantic controllability for the customized
finetune part of the transformer layers in the image encoder generation process. For example, we can customize the age
when training our PhotoMaker. We also introduce addi- and gender of the human ID by simply replacing the class
tional learnable projection layers to inject the embedding word (see Sec. 4.2).
obtained from the image encoder into the same dimension After obtaining the fused embeddings, we concatenate
as the text embedding. Let {X i | i = 1 . . . N } denote N them along the length dimension to form the stacked id em-
input ID images acquired from a user, we thus obtain the bedding:
extracted embeddings {ei ∈ RD | i = 1 . . . N }, where D
denotes the projected dimension. Each embedding corre- s∗ = Concat([ê1 , . . . , êN ]) s∗ ∈ RN ×D . (1)
sponds to the ID information of an input image. For a given This stacked ID embedding can serve as a unified repre-
text prompt T , we extract text embeddings t ∈ RL×D using sentation of multiple ID images while it retains the origi-
the pre-trained CLIP text encoder Etext , where L denotes nal representation of each input ID image. It can accept
the length of the embedding. any number of ID image encoded embeddings, therefore,
Stacking. Recent works [17, 50, 62] have shown that, in the its length N is variable. Compared to DreamBooth-based
text-to-image models, personalized character ID informa- methods [2, 50], which inputs multiple images to finetune
tion can be represented by some unique tokens. Our method the model for personalized customization, our method es-
also has a similar design to better represent the ID informa- sentially sends multiple embeddings to the model simul-
tion of the input human images. Specifically, we mark the taneously. After packaging the multiple images of the
corresponding class word (e.g., man and woman) in the in- same ID into a batch as the input of the image encoder,
put caption (see Sec. 3.3). We then extract the feature vector a stacked ID embedding can be obtained through a single
at the corresponding position of the class word in the text forward pass, significantly enhancing efficiency compared
embedding. This feature vector will be fused with each im- to tuning-based methods. Meanwhile, compared to other
4
embedding-based methods [61, 62], this unified representa- also may inspire potential future ID-driven research. The
tion can maintain both promising ID fidelity and text con- statistics of the dataset are shown in the appendix.
trollability, as it contains more comprehensive ID informa- Image downloading. We first list a roster of celebrities,
tion. In addition, it is worth noting that, although we only which can be obtained from VoxCeleb1 and VGGFace2 [7].
used multiple images of the same ID to form this stacked We search for names in the search engine according to the
ID embedding during training, we can use images that come list and crawled the data. About 100 images were down-
from different IDs to construct it during the inference stage. loaded for each name. To generate higher quality portrait
Such flexibility opens up possibilities for many interesting images [44], we filtered out images with the shortest side of
applications. For example, we can mix two persons that ex- the resolution less than 512 during the download process.
ist in reality or mix a person and a well-known character IP
(see Sec. 4.2). Face detection and filtering. We first use RetinaNet [16]
to detect face bounding boxes and filter out the detections
Merging. We use the inherent cross-attention mechanism
with small sizes (less than 256 × 256). If an image does not
in diffusion models to adaptively merge the ID informa-
contain any bounding boxes that meet the requirements, the
tion contained in stacked ID embedding. We first replace
image will be filtered out. We then perform ID verification
the feature vector at the position corresponding to the class
for the remaining images.
word in the original text embedding t with the stacked
id embedding s∗ , resulting in an update text embedding ID verification. Since an image may contain multiple faces,
t∗ ∈ R(L+N −1)×D . Then, the cross-attention operation can we need first to identify which face belongs to the current
be formulated as: identity group. Specifically, we send all the face regions in
the detection boxes of the current identity group into Arc-
( Face [15] to extract identity embeddings and calculate the
Q = WQ · ϕ(zt ); K = WK · t∗ ; V = WV · t∗ L2 similarity of each pair of faces. We sum the similarity
T
Attention(Q, K, V) = softmax( QK√ ) · V,
d
calculated by each identity embedding with all other em-
(2) beddings to get the score for each bounding box. We select
where ϕ(·) is an embedding that can be encoded from the the bounding box with the highest sum score for each im-
input latent by the UNet denoiser. WQ , WK , and WV age with multiple faces. After bounding box selection, we
are projection matrices. Besides, we can adjust the de- recompute the sum score for each remaining box. We calcu-
gree of participation of one input ID image in generating late the standard deviation δ of the sum score by ID group.
the new customized ID through prompt weighting [3, 21], We empirically use 8δ as a threshold to filter out images
demonstrating the flexibility of our PhotoMaker. Recent with inconsistent IDs.
works [2, 32] found that good ID customization perfor- Cropping and segmentation. We first crop the image with
mance can be achieved by simply tuning the weights of the a larger square box based on the detected face area while
attention layers. To make the original diffusion models bet- ensuring that the facial region can occupy more than 10%
ter perceive the ID information contained in stacked ID em- of the image after cropping. Since we need to remove the
bedding, we additionally train the LoRA [2, 25] residuals of irrelevant background and IDs from the input ID image be-
the matrices in the attention layers. fore sending it into the image encoder, we need to gener-
ate the mask for the specified ID. Specifically, we employ
3.3. ID-Oriented Human Data Construction the Mask2Former [14] to perform panoptic segmentation
Since our PhotoMaker needs to sample multiple images of for the ‘person’ class. We leave the mask with the highest
the same ID for constructing the stacked ID embedding dur- overlap with the facial bounding box corresponding to the
ing the training process, we need to use a dataset classi- ID. Besides, we choose to discard images where the mask
fied by IDs to drive the training process of our PhotoMaker. is not detected, as well as images where no overlap is found
However, existing human datasets either do not annotate ID between the bounding box and the mask area.
information [29, 35, 55, 68], or the richness of the scenes Captioning and marking We use BLIP2 [33] to generate
they contain is very limited [36, 41, 60] (i.e., they only fo- a caption for each cropped image. Since we need to mark
cus on the face area). Thus, in this section, we will intro- the class word (e.g., man, woman, and boy) to facilitate the
duce a pipeline for constructing a human-centric text-image fusion of text and image embeddings, we regenerate cap-
dataset, which is classified by different IDs. Fig. 2(b) illus- tions that do not contain any class word using the random
trates the proposed pipeline. Through this pipeline, we can mode of BLIP2 until a class word appears. After obtain-
collect an ID-oriented dataset, which contains a large num- ing the caption, we singularize the class word in the caption
ber of IDs, and each ID has multiple images that include dif- to focus on a single ID. Next, we need to mark the posi-
ferent expressions, attributes, scenes, etc. This dataset not tion of the class word that corresponds to the current ID.
only facilitates the training process of our PhotoMaker but Captions that contain only one class word can be directly
5
annotated. For captions that contain multiple class words, ric [22, 42]. Importantly, as most embedding-based meth-
we count the class words contained in the captions for each ods tend to incorporate facial pose and expression into the
identity group. The class word with the most occurrences representation, the generated images often lack variation in
will be the class word for the current identity group. We the facial region. Thus, we propose a metric, named Face
then use the class word of each identity group to match Diversity, to measure the diversity of the generated facial re-
and mark each caption in that identity group. For a cap- gions. Specifically, we first detect and crop the face region
tion that does not include the class word that matches that in each generated image. Next, we calculate the LPIPS [66]
of the corresponding identity group, we employ a depen- scores between each pair of facial areas for all generated im-
dence parsing model [24] to segment the caption according ages and take the average. The larger this value, the higher
to different class words. We calculate the CLIP score [45] the diversity of the generated facial area.
between the sub-caption after segmentation and the specific Evaluation dataset. Our evaluation dataset includes 25
ID region in the image. Besides, we calculate the label sim- IDs, which consist of 9 IDs from Mystyle [41] and an addi-
ilarity between the class word of the current segment and tional 16 IDs that we collected by ourselves. Note that these
the class word of the current identity group through Sen- IDs do not appear in the training set, serving to evaluate the
tenceFormer [48]. We choose to mark the class word cor- generalization ability of the model. To conduct a more com-
responding to the maximum product of the CLIP score and prehensive evaluation, we also prepare 40 prompts, which
the label similarity. cover a variety of expressions, attributes, decorations, ac-
tions, and backgrounds. For each prompt of each ID, we
4. Experiments generate 4 images for evaluation. More details are listed in
the appendix.
4.1. Setup
4.2. Applications
Implementation details. To generate more photo-
realistic human portraits, we employ SDXL model [44] In this section, we will elaborate on the applications that our
stable-diffusion-xl-base-1.0 as our text-to- PhotoMaker can empower. For each application, we choose
image synthesis model. Correspondingly, the resolution of the comparison methods which may be most suitable for the
training data is resized to 1024 × 1024. We employ CLIP corresponding setting. The comparison method will be cho-
ViT-L/14 [45] and an additional projection layer to obtain sen from DreamBooth [50], Textual Inversion [17], Fast-
the initial image embeddings ei . For text embeddings, we Composer [62], and IPAdapter [63]. We prioritize using the
keep the original two text encoders in SDXL for extraction. official model provided by each method. For DreamBooth
The overall framework is optimized with Adam [31] on 8 and IPAdapter, we use their SDXL versions for a fair com-
NVIDIA A100 GPUs for two weeks with a batch size of 48. parison. For all applications, we have chosen four input
We set the learning rate as 1e − 4 for LoRA weights, and ID images to form the stacked ID embedding in our Pho-
1e − 5 for other trainable modules. During training, we ran- toMaker. We also fairly use four images to train the methods
domly sample 1-4 images with the same ID as the current that need test-time optimization. We provide more samples
target ID image to form a stacked ID embedding. Besides, in the appendix for each application.
to improve the generation performance by using classifier- Recontextualization We first show results with simple con-
free guidance, we have a 10% chance of using null-text em- text changes such as modified hair color and clothing, or
bedding to replace the original updated text embedding t∗ . generate backgrounds based on basic prompt control. Since
We also use masked diffusion loss [6] with a probability of all methods can adapt to this application, we conduct quan-
50% to encourage the model to generate more faithful ID- titative and qualitative comparisons of the generated re-
related areas. During the inference stage, we use delayed sults (see Tab. 1 and Fig. 3). The results show that our
subject conditioning [62] to solve the conflicts between text method can well satisfy the ability to generate high-quality
and ID conditions. We use 50 steps of DDIM sampler [58]. images, while ensuring high ID fidelity (with the largest
The scale of classifier-free guidance is set to 5. CLIP-T and DINO scores, and the second best Face Sim-
Evaluation metrics. Following DreamBooth [50], we use ilarity). Compared to most methods, our method generates
DINO [8] and CLIP-I [17] metrics to measure the ID fidelity images of higher quality, and the generated facial regions
and use CLIP-T [45] metric to measure the prompt fidelity. exhibit greater diversity. At the same time, our method can
For a more comprehensive evaluation, we also compute the maintain a high efficiency consistent with embedding-based
face similarity by detecting and cropping the facial regions methods. For a more comprehensive comparison, we show
between the generated image and the real image with the the user study results in Sec. B in the appendix.
same ID. We use RetinaFace [16] as the detection model. Bringing person in artwork/old photo into reality. By
Face embedding is extracted by FaceNet [53]. To evalu- taking artistic paintings, sculptures, or old photos of a per-
ate the quality of the generation, we employ the FID met- son as input, our PhotoMaker can bring a person from the
6
References DreamBooth Textual Inversion FastComposer IPAdapter PhotoMaker (Ours)
A man wearing a
spacesuit
A man wearing a
Christmas hat
A woman sitting
at the beach,
with purple
sunset
A woman with red
hair
A woman wearing
a doctoral cap
Figure 3. Qualitative comparison on universal recontextualization samples. We compare our method with DreamBooth [50], Textual
Inversion [17], FastComposer [62], and IPAdapter [63] for five different identities and corresponding prompts. We observe that our method
generally achieves high-quality generation, promising editability, and strong identity fidelity. (Zoom-in for the best view)
CLIP-T↑ (%) CLIP-I↑ (%) DINO↑ (%) Face Sim.↑ (%) Face Div.↑ (%) FID↓ Speed↓ (s)
DreamBooth [50] 29.8 62.8 39.8 49.8 49.1 374.5 1284
Textual Inversion [17] 24.0 70.9 39.3 54.3 59.3 363.5 2400
FastComposer [62] 28.7 66.8 40.2 61.0 45.4 375.1 8
IPAdapter [63] 25.1 71.2 46.2 67.1 52.4 375.2 12
PhotoMaker (Ours) 26.1 73.6 51.5 61.8 57.7 370.3 10
Table 1. Quantitative comparison on the universal recontextualization setting. The metrics used for benchmarking cover the ability to
preserve ID information (i.e., CLIP-I, DINO, and Face Similarity), text consistency (i.e., CLIP-T), diversity of generated faces (i.e., Face
Diversity), and generation quality (i.e., FID). Besides, we define personalized speed as the time it takes to obtain the final personalized
image after feeding the ID condition(s). We measure personalized time on a single NVIDIA Tesla V100 GPU. The best result is shown in
bold, and the second best is underlined.
last century or even ancient times to the present century to liance of DreamBooth on the quality and resolution of cus-
“take” photos for them. Fig. 4(a) illustrate the results. Com- tomized images, it is difficult for DreamBooth to generate
pared to our method, both Dreambooth and SDXL have dif- high-quality results when using old photos for customized
ficulty generating realistic human images that have not ap- generation.
peared in real photos. In addition, due to the excessive re-
Changing age or gender. By simply replacing class words
7
References DreamBooth SDXL PhotoMaker (Ours) References DreamBooth SDXL PhotoMaker (Ours)
A man coding in front A woman happily smiling,

of a computer looking at the camera
A man piloting a A girl wearing a

spaceship Christmas hat
(a) (b)
Figure 4. Applications on (a) artwork and old photo, and (b) changing age or gender. We are able to bring the past people back to real
life or change the age and gender of the input ID. For the first application, we prepare a prompt template A photo of <original
prompt>, photo-realistic for DreamBooth and SDXL. Correspondingly, we change the class word to the celebrity name in the
original prompt. For the second one, we replace the class word to <class word> <name>, (at the age of 12) for them.
References DreamBooth SDXL PhotoMaker (Ours)
(e.g. man and woman), our method can achieve changes
in gender and age. Fig. 4(b) shows the results. Although
SDXL and DreamBooth can also achieve the correspond-
ing effects after prompt engineering, our method can more
easily capture the characteristic information of the charac- A man holding a bottle
of red wine
ters due to the role of the stacked ID embedding. Therefore,
our results show a higher ID fidelity.
Identity mixing. If the users provide images of different
IDs as input, our PhotoMaker can well integrate the char-
A woman frowning
acteristics of different IDs to form a new ID. From Fig. 5, at the camera
we can see that neither DreamBooth nor SDXL can achieve

identity mixing. In contrast, our method can retain the char-
acteristics of different IDs well on the generated new ID,
regardless of whether the input is an anime IP or a real per- A man wearing a
spacesuit
son, and regardless of gender. Besides, we can control the
proportion of this ID in the new generated ID by controlling
the corresponding ID input quantity or prompt weighting. Figure 5. Identity mixing. We are able to generate the im-
We show this ability in Fig. 10-11 in the appendix. age with a new ID while preserving input identity characteristics.
We prepare a prompt template <original prompt>, with
Stylization. In Fig. 6, we demonstrate the stylization capa- a face blended with <name:A> and <name:B> for
bilities of our method. We can see that, in the generated im- SDXL. (Zoom-in for the best view)
ages, our PhotoMaker not only maintains good ID fidelity A <class> in a comic A painting of a <class>, A Ukiyo-e painting of a
References A sketch of a <class> book in Van Gogh style <class>
but also effectively exhibits the style information of the in-
put prompt. This reveals the potential for our method to
drive more applications. Additional results are shown in the
Fig. 12 within the appendix.
4.3. Ablation study
We shortened the total number of training iterations by eight
times to conduct ablation studies for each variant.
The influence about the number of input ID images. We
explore the impact that forming the proposed stacked ID Figure 6. The stylization results of our PhotoMaker. The sym-
embedding through feeding different numbers of ID im- bol <class> denotes it will be replaced by man or woman ac-
ages. In Fig. 7, we visualize this impact across different cordingly. (Zoom-in for the best view)
metrics. We conclude that using more images to form a
stacked ID embedding can improve the metrics related to ID Upon the input of an increasing number of ID images, the
fidelity. This improvement is particularly noticeable when growth rate of the values in the ID-related metrics signif-
the number of input images is increased from one to two. icantly decelerates. Additionally, we observe a linear de-
8

)DFH6LPLODULW\

&/,37
&/,3,
',12

,PDJHV ,PDJHV ,PDJHV ,PDJHV
(a) (b) (c) (d)
Figure 7. The impact of the number of input ID images on (a) CLIP-I, (b) DINO, (c) CLIP-T, and (d) Face Similarity, respectively.
CLIP-T↑ DINO↑ Face Sim.↑ Face Div.↑ CLIP-T↑ DINO↑ Face Sim.↑ Face Div.↑
Average 28.7 47.0 48.8 56.3 Single embed 27.9 50.3 50.5 56.1
Linear 28.6 47.3 48.1 54.6 Single image 27.3 50.3 60.4 51.7
Stacked 28.0 49.5 53.6 55.0 Ours 28.0 49.5 53.6 55.0
(a) Embedding composing choices. (b) Training data sampling strategy.
Table 2. Ablation studies for the proposed PhotoMaker. The best results are marked in bold.
References 1 image 2 images 6 images 10 images
its effectiveness. Besides, such a way offers greater flexibil-
ity than others, including accepting any number of images
and better controlling the mixing process of different IDs.
A photo of a man The benefits from multiple embeddings during training.
We explore two other training data sampling strategies to
demonstrate that it is necessary to input multiple images
with variations during training. The first is to choose only
A woman wearing a one image, which can be different from the target image,
red sweater
to form the ID embedding (see “single embed” in Tab. 2b).
Our multiple embedding way has advantages in ID fidelity.
The second sampling strategy is to regard the target im-
A photo of a woman
age as the input ID image (to simulate the training way of
most embedding-based methods). We generate multiple im-
Figure 8. The impact of varying the quantity of input images ages based on this image with different data augmentation
on the generation results. It can be observed that the fidelity of methods and extract corresponding multiple embeddings.
the ID increases with the quantity of input images. In Tab. 2b, as the model can easily remember other irrele-
vant characteristics of the input image, the generated facial
area lacks sufficient changes (low diversity).
cline on the CLIP-T metric. This indicates there may ex-
ist a trade-off between text controllability and ID fidelity.
From Fig. 8, we see that increasing the number of input im-
5. Conclusion
ages enhances the similarity of the ID. Therefore, the more We have presented PhotoMaker, an efficient personalized
ID images to form the stacked ID embedding can help the text-to-image generation method that focuses on generat-
model perceive more comprehensive ID information, and ing realistic human photos. Our method leverages a sim-
then more accurately represent the ID to generate images. ple yet effective representation, stacked ID embedding,
Besides, as shown by the Dwayne Johnson example, the for better preserving ID information. Experimental results
gender editing capability decreases, and the model is more have demonstrated that our PhotoMaker, compared to other
prone to generate images of the original ID’s gender. methods, can simultaneously satisfy high-quality and di-
The choices of composing multiple embeddings. We ex- verse generation capabilities, promising editability, high in-
plore three ways to compose the ID embedding, including ference efficiency, and strong ID fidelity. Besides, we also
averaging the image embeddings, adaptively projecting em- have found that our method can empower many interesting
beddings through a linear layer, and our stacking way. From applications that previous methods are hard to achieve, such
Tab. 2a, we see the stacking way has the highest ID fidelity as changing age or gender, bringing persons from old pho-
while ensuring a diversity of generated faces, demonstrating tos or artworks back to reality, and identity mixing.
9
References [16] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia,
and Stefanos Zafeiriou. Retinaface: Single-shot multi-level
[1] Photo ai. https://photoai.com/. Accessed: 2023- face localisation in the wild. In CVPR, 2020. 5, 6
12-08. 2 [17] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik,
[2] Low-rank adaptation for fast text-to-image diffusion fine- Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An
tuning. https : / / github . com / cloneofsimo / image is worth one word: Personalizing text-to-image gen-
lora, 2022. 2, 3, 4, 5 eration using textual inversion. In ICLR, 2023. 2, 3, 4, 6, 7,
[3] Prompt weighting. https : / / huggingface . co / 1
docs/diffusers/using-diffusers/weighted_ [18] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano,
prompts, 2023. 3, 5, 2 Gal Chechik, and Daniel Cohen-Or. Designing an encoder
[4] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a for fast personalization of text-to-image models. arXiv
matter of style: Age transformation using a style-based re- preprint arXiv:2302.12228, 2023. 3
gression model. TOG, 2021. 5 [19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
[5] Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Cohen-Or, Ariel Shamir, and Amit H Bermano. Domain- Yoshua Bengio. Generative adversarial networks. ACM
agnostic tuning-encoder for fast personalization of text-to- Communications, 2020. 2
image models. TOG, 2023. 2 [20] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar,
Dimitris Metaxas, and Feng Yang. Svdiff: Compact param-
[6] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-
eter space for diffusion fine-tuning. In ICCV, 2023. 3
Or, and Dani Lischinski. Break-a-scene: Extracting multiple
[21] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
concepts from a single image. In SIGGRAPH Asia, 2023. 6
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image
[7] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and An- editing with cross attention control. In ICLR, 2023. 3, 5, 2
drew Zisserman. Vggface2: A dataset for recognising faces [22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
across pose and age. In FG, 2018. 5 Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, two time-scale update rule converge to a local nash equilib-
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- rium. In NeurIPS, 2017. 6
ing properties in self-supervised vision transformers. In [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
ICCV, 2021. 6, 1 sion probabilistic models. In NeurIPS, 2020. 3
[9] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu [24] Matthew Honnibal, Ines Montani, Sofie Van Landeghem,
Soricut. Conceptual 12m: Pushing web-scale image-text pre- and Adriane Boyd. spacy: Industrial-strength natural lan-
training to recognize long-tail visual concepts. In CVPR, guage processing in python, 2020. 6
2021. 3 [25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-
[10] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Lora: Low-rank adaptation of large language models. In
Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion ICLR, 2022. 2, 5
transformer for photorealistic text-to-image synthesis. arXiv [26] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade
preprint arXiv:2310.00426, 2023. 3 Gordon, Nicholas Carlini, Rohan Taori, Achal Dave,
[11] Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Vaishaal Shankar, Hongseok Namkoong, John Miller, Han-
Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing naneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open-
Liu, Kang Du, et al. Photoverse: Tuning-free image clip, 2021. 3
customization with text-to-image diffusion models. arXiv [27] Xuhui Jia, Yang Zhao, Kelvin CK Chan, Yandong Li, Han
preprint arXiv:2309.05793, 2023. 2, 3 Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, and
Yu-Chuan Su. Taming encoder for zero fine-tuning image
[12] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui
customization with text-to-image diffusion models. arXiv
Jia, Ming-Wei Chang, and William W Cohen. Subject-driven
preprint arXiv:2304.02642, 2023. 3
text-to-image generation via apprenticeship learning. arXiv
[28] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei
preprint arXiv:2304.00186, 2023. 3
Zhang, and Qiang Xu. HumanSD: A native skeleton-guided
[13] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, diffusion model for human image generation. In ICCV, 2023.
and Hengshuang Zhao. Anydoor: Zero-shot object-level im- 2
age customization. arXiv preprint arXiv:2307.09481, 2023. [29] Tero Karras, Samuli Laine, and Timo Aila. A style-based
3 generator architecture for generative adversarial networks. In
[14] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- CVPR, 2019. 2, 3, 5
der Kirillov, and Rohit Girdhar. Masked-attention mask [30] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen
transformer for universal image segmentation. In CVPR, Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic:
2022. 5 Text-based real image editing with diffusion models. In
[15] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos CVPR, 2023. 3
Zafeiriou. Arcface: Additive angular margin loss for deep [31] Diederik P Kingma and Jimmy Ba. Adam: A method for
face recognition. In CVPR, 2019. 5, 1 stochastic optimization. In ICLR, 2015. 6
10
[32] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Peter J Liu. Exploring the limits of transfer learning with a
Shechtman, and Jun-Yan Zhu. Multi-concept customization unified text-to-text transformer. JMLR, 2020. 2, 3
of text-to-image diffusion. In CVPR, 2023. 3, 5 [47] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
[33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. and Mark Chen. Hierarchical text-conditional image gen-
Blip-2: Bootstrapping language-image pre-training with eration with clip latents. arXiv preprint arXiv:2204.06125,
frozen image encoders and large language models. In ICML, 2022. 3
2023. 5 [48] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence
[34] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- embeddings using siamese bert-networks. In EMNLP, 2019.
wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 6
Gligen: Open-set grounded text-to-image generation. In [49] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
CVPR, 2023. 3 Patrick Esser, and Björn Ommer. High-resolution image syn-
[35] Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, thesis with latent diffusion models. In CVPR, 2022. 3
Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey [50] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
Tulyakov. Hyperhuman: Hyper-realistic human gener- Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
ation with latent structural diffusion. arXiv preprint tuning text-to-image diffusion models for subject-driven
arXiv:2310.08579, 2023. 2, 3, 5 generation. In CVPR, 2023. 2, 3, 4, 6, 7, 1
[36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. [51] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei,
Deep learning face attributes in the wild. In ICCV, 2015. Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein,
2, 3, 5 and Kfir Aberman. Hyperdreambooth: Hypernetworks for
[37] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. fast personalization of text-to-image models. arXiv preprint
Subject-diffusion: Open domain personalized text-to-image arXiv:2307.06949, 2023. 2, 3
generation without test-time fine-tuning. arXiv preprint
[52] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
arXiv:2307.11410, 2023. 3
Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
[38] Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, and
Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,
Jiaying Liu. Unified multi-modal latent diffusion for joint
Rapha Gontijo Lopes, et al. Photorealistic text-to-image
subject and text conditional image generation. arXiv preprint
diffusion models with deep language understanding. In
arXiv:2303.09319, 2023. 3
NeurIPS, 2022. 2, 3
[39] Andrew Melnik, Maksim Miasayedzenkau, Dzianis
[53] Florian Schroff, Dmitry Kalenichenko, and James Philbin.
Makarovets, Dzianis Pirshtuk, Eren Akbulut, Dennis Holz-
Facenet: A unified embedding for face recognition and clus-
mann, Tarek Renusch, Gustav Reichert, and Helge Ritter.
tering. In CVPR, 2015. 6
Face generation and editing with stylegan: A survey. arXiv
preprint arXiv:2212.09102, 2022. 2 [54] Christoph Schuhmann, Richard Vencu, Romain Beaumont,
Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo
[40] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhon-
Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m:
gang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning
Open dataset of clip-filtered 400 million image-text pairs.
adapters to dig out more controllable ability for text-to-image
arXiv preprint arXiv:2111.02114, 2021. 3
diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2,
3 [55] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
[41] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
Daniel Cohen-Or. Mystyle: A personalized generative prior. man, et al. Laion-5b: An open large-scale dataset for
TOG, 2022. 2, 3, 5, 6, 1 training next generation image-text models. arXiv preprint
[42] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On arXiv:2210.08402, 2022. 2, 3, 5
aliased resizing and surprising subtleties in gan evaluation. [56] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant-
In CVPR, 2022. 6 booth: Personalized text-to-image generation without test-
[43] William Peebles and Saining Xie. Scalable diffusion models time finetuning. arXiv preprint arXiv:2304.03411, 2023. 2,
with transformers. In ICCV, 2023. 3 3
[44] Dustin Podell, Zion English, Kyle Lacey, Andreas [57] Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro
Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang,
Robin Rombach. Sdxl: Improving latent diffusion mod- Glenn Entis, Yuanzhen Li, et al. Styledrop: Text-to-image
els for high-resolution image synthesis. arXiv preprint generation in any style. In NeurIPS, 2023. 3
arXiv:2307.01952, 2023. 2, 3, 5, 6, 1 [58] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya ing diffusion implicit models. In ICLR, 2021. 3, 6
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, [59] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- Chen, Liang Lin, and Meng Yang. Toward characteristic-
ing transferable visual models from natural language super- preserving image-based virtual try-on network. In ECCV,
vision. In ICML, 2021. 2, 3, 6 2018. 2
[46] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, [60] Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang,
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change
11
Loy. Mead: A large-scale audio-visual dataset for emotional
talking-face generation. In ECCV, 2020. 3, 5
[61] Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei
Zhang, and Wangmeng Zuo. Elite: Encoding visual con-
cepts into textual embeddings for customized text-to-image
generation. In ICCV, 2023. 3, 5
[62] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo
Durand, and Song Han. Fastcomposer: Tuning-free multi-
subject image generation with localized attention. arXiv
preprint arXiv:2305.10431, 2023. 2, 3, 4, 5, 6, 7, 1
[63] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-
adapter: Text compatible image prompt adapter for text-to-
image diffusion models. arXiv preprint arXiv:2308.06721,
2023. 3, 6, 7, 1, 2
[64] Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li,
Chenyang Qi, Xintao Wang, Ying Shan, and Huicheng
Zheng. Inserting anybody in diffusion models via celeb ba-
sis. In NeurIPS, 2023. 3
[65] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In
ICCV, 2023. 2, 3
[66] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman,
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR, 2018. 6
[67] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang,
Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker:
Learning realistic 3d motion coefficients for stylized audio-
driven single image talking face animation. In CVPR, 2023.
2
[68] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dong-
dong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming
Zeng, and Fang Wen. General facial representation learning
in a visual-linguistic manner. In CVPR, 2022. 3, 5
12
Appendix
3KRWR0DNHU2XUV ,3$GDSWHU )DVW&RPSRVHU 'UHDP%RRWK
Evaluation IDs
,')LGHOLW\
1 Alan Turing 14 Kamala Harris
2 Albert Einstein 15 Marilyn Monroe
3 Anne Hathaway 16 Mark Zuckerberg 4XDOLW\
4 Audrey Hepburn 17 Michelle Obama
5 Barack Obama 18 Oprah Winfrey 'LYHUVLW\
6 Bill Gates 19 Renée Zellweger
7 Donald Trump 20 Scarlett Johansson
7H[W)LGHOLW\
8 Dwayne Johnson 21 Taylor Swift
9 Elon Musk 22 Thomas Edison 3HUFHQWDJH
10 Fei-Fei Li 23 Vladimir Putin Figure 9. User preferences on ID fidelity, generation quality,
11 Geoffrey Hinton 24 Woody Allen face diversity, and text fidelity for different methods. For ease
12 Jeff Bezos 25 Yann LeCun of illustration, we visualize the proportion of total votes that each
13 Joe Biden method has received. Our PhotoMaker occupies the most signifi-
cant proportion in these four dimensions.
Table 3. ID names used for evaluation. For each name, we col-
lect four images totally.
image pairs for each user. Each of them includes a refer-
ence image of the input ID and corresponding text prompt.
A. Dataset Details We have four randomly generated images of each method
for each text-image pair. Each user is requested to answer
Training dataset. Based on Sec. 3.3 in the main paper, fol- four questions for these 20 sets of results: 1) Which method
lowing a sequence of filtering steps, the number of images is most similar to the input person’s identity? 2) Which
in our constructed dataset is about 112K. They are classified method produces the highest quality generated images? 3)
by about 13,000 ID names. Each image is accompanied by Which method generates the most diverse facial area in the
a mask for the corresponding ID and an annotated caption. images? 4) Which method generates images that best match
Evaluation dataset. The image dataset used for evaluation the input text prompt? We have anonymized the names of
comprises manually selected additional IDs and a portion all methods and randomized the order of methods in each
of MyStyle [41] data. For each ID name, we have four im- set of responses. We had a total of 40 candidates partici-
ages that serve as input data for comparative methods and pating in our user study, and we received 3,200 valid votes.
for the final metric evaluation (i.e., DINO [8], CLIP-I [17], The results are shown in Fig. 9.
and Face Sim. [15]). For single-embedding methods (i.e., We find that our PhotoMaker has advantages in terms of
FastComposer [62] and IPAdapter [63]), we randomly se- ID fidelity, generation quality, diversity, and text fidelity, es-
lect one image from each ID group as input. Note that the pecially the latter three. In addition, we found that Dream-
ID names exist in the training image set, utilized for the Booth is the second-best algorithm in balancing these four
training of our method, and the test image set, do not ex- evaluation dimensions, which may explain why it was more
hibit any overlap. We list ID names for evaluation in Tab. 3. prevalent than the embedding-based methods in the past. At
For text prompts used for evaluation, we consider six fac- the same time, IPAdapter shows a significant disadvantage
tors: clothing, accessories, actions, expressions, views, and in terms of generated image quality and text consistency,
background, which make up 40 prompts that are listed in as it focuses more on image embedding during the training
the Tab. 4. phase. FastComposer has a clear shortcoming in the diver-
sity of the facial region for their single-embedding training
B. User Study pipeline. The above results are generally consistent with
Tab. 1 in the main paper, except for the discrepancy in the
In this section, we conduct a user study to make a more CLIP-T metric. This could be due to a preference for select-
comprehensive comparison. The comparative methods ing images that harmonize with the objects appearing in the
we have selected include DreamBooth [50], FastCom- text when manually choosing the most text-compatible im-
poser [62], and IPAdapter [63]. We use SDXL [44] as the ages. In contrast, the CLIP-T tends to focus on whether the
base model for both DreamBooth and IPAdapter because of object appears. This may demonstrate the limitations of the
their open-sourced implementations. We display 20 text- CLIP-T. We also provide more visual samples in Fig. 14-17
1
Category Prompt Category Prompt
General a photo of a <class word> a <class word> laughing on the lawn
a <class word> wearing a Superman outfit a <class word> frowning at the camera
a <class word> wearing a spacesuit a <class word> happily smiling, looking at the
Expression
Clothing a <class word> wearing a red sweater camera
a <class word> wearing a purple wizard outfit a <class word> crying disappointedly, with
a <class word> wearing a blue hoodie tears flowing
a <class word> wearing headphones a <class word> wearing sunglasses
a <class word> with red hair a <class word> playing the guitar in the view of
left side
a <class word> wearing headphones with red
hair a <class word> holding a bottle of red wine, up-
per body
a <class word> wearing a Christmas hat
Accessory a <class word> wearing sunglasses and neck-
a <class word> wearing sunglasses
lace, close-up, in the view of right side
a <class word> wearing sunglasses and neck-
a <class word> riding a horse, in the view of the
lace
View top
a <class word> wearing a blue cap
a <class word> wearing a doctoral cap, upper
a <class word> wearing a doctoral cap body, with the left side of the face facing the camera
a <class word> with white hair, wearing glasses a <class word> crying disappointedly, with
a <class word> in a helmet and vest riding a mo- tears flowing, with left side of the face facing the
torcycle camera
a <class word> holding a bottle of red wine a <class word> sitting in front of the camera,
a <class word> driving a bus in the desert with a beautiful purple sunset at the beach in the
a <class word> playing basketball background
Action a <class word> swimming in the pool
a <class word> playing the violin
a <class word> piloting a spaceship Background a <class word> climbing a mountain
a <class word> riding a horse a <class word> skiing on the snowy mountain
a <class word> coding in front of a computer a <class word> in the snow
a <class word> playing the guitar a <class word> in space wearing a spacesuit
(a) (b)
Table 4. Evaluation text prompts categorized by (a) general setting, clothing, accessory, action, (b) expression, view, and back-
ground. The class word will be replaced with man, woman, boy, etc. For each ID and each prompt, we randomly generated four
images for evaluation.
for reference. to control the number of input images, prompt weighting

requires fewer photos to adjust the merge ratio of differ-
C. More Ablations ent IDs, demonstrating its superior usability. Besides, the
two ways of adjusting the mixing ratio of different IDs both
Adjusting the ratio during identity mixing. For identity demonstrate the flexibility of our method.
mixing, our method can adjust the merge ratio by either
controlling the percentage of identity images within the in- D. Stylization Results
put image pool or through the method of prompt weight- Our method not only possesses the capability to generate re-
ing [3, 21]. In this way, we can control that the person gen- alistic human photos, but it also allows for stylization while
erated with a new ID is either more closely with or far away preserving ID attributes. This demonstrates the robust gen-
from a specific input ID. Fig. 10 shows how our method eralizability of the proposed method. We provide the styl-
customizes a new ID by controlling the proportion of dif- ization results in Fig. 12.
ferent IDs in the input image pool. For a better description,
we use a total of 10 images as input in this experiment. We E. More Visual Results
can observe a smooth transition of images with the two IDs.
This smooth transition encompasses changes in skin color Recontextualization. We first provide a more intuitive
and age. Next, we use four images per generated ID to con- comparison in Fig. 14. We compare our PhotoMaker
duct prompt weighting. The results are shown in Fig. 11. with DreamBooth [50], FastComposer [62], and IPAdapa-
We multiply the embedding corresponding to the images ter [63], for universal recontextualization cases. Compared
related to a specific ID by a coefficient to control its propor- to other methods, the results generated by our method can
tion of integration into the new ID. Compared to the way simultaneously satisfy high-quality, strong text controllabil-
2
80% Obama 20% Obama
20% Biden 50% 80% Biden
A man wearing a red sweater
80% Obama 20% Obama

20% Scarlett 50% 80% Scarlett
A woman happily smiling, looking at the camera
Figure 10. The impact of the proportion of images with different IDs in the input sample pool on the generation of new IDs. The
first row illustrates the transition from Barack Obama to Joe Biden. The second row depicts the shift from Michelle Obama to Scarlett
Johansson. To provide a clearer illustration, percentages are used in the figure to denote the proportion of each ID in the input image pool.
The total number of images contained in the input pool is 10. (Zoom-in for the best view)
: 1.0
0.1 0.5 0.8 0.9 1.0 1.1 1.2 1.5 2.0
A man in the snow
: 1.0
0.1 0.5 0.8 0.9 1.0 1.1 1.2 1.5 2.0
A woman wearing sunglasses
Figure 11. The impact of prompt weighting on the generation of new IDs. The first row illustrates a blend of Barack Obama and Joe
Biden. The first row from left to right represents the progressive increase in the weight of the ID image embedding corresponding to Barack
Obama in the image. The second row illustrates a blend of Elsa (Disney) and Anne Hathaway. The weight for Elsa is gradually increased.
(Zoom-in for the best view)
3
A sketch of a A <class> in a A <class> in Ghibli A painting of a <class>, A Ukiyo-e painting
References <class> comic book animation style in Van Gogh style of a <class>
Figure 12. The stylization results of our PhotoMaker with different input IDs and different style prompts. Our method can be
seamlessly transferred to a variety of styles, concurrently preventing the generation of realistic results. The symbol <class> denotes it
will be replaced by man or woman accordingly. (Zoom-in for the best view)
4
A photo of Mira Murati A photo of OpenAI CTO F. Limitations
First, our method only focuses on maintaining the ID infor-
mation of a single generated person in the image, and can-
not control multiple IDs of generated persons in one image
simultaneously. Second, our method excels at generating
half-length portraits, but is relatively not good at generating
full-length portraits. Third, the age transformation ability
A photo of chief scientist in of our method is not as precise as some GAN-based meth-
A photo of Ilya Sutskever
OpenAI
ods [4]. If the users need more precise control, modifica-
tions to the captions of the training dataset may be required.
Finally, our method is based on the SDXL and the dataset
we constructed, so it will also inherit their biases.
G. Broader Impact
In this paper, we introduce a novel method capable of gener-
Figure 13. Two examples that are unrecognizable by the SDXL. ating high-quality human images while maintaining a high
We replaced two types of text prompts (e.g., name and position) degree of similarity to the input identity. At the same time,
but were unable to prompt the SDXL to generate Mira Murati and our method can also satisfy high efficiency, decent facial
Ilya Sutskever. generation diversity, and good controllability.
For the academic community, our method provides a
strong baseline for personalized generation. Our data cre-
ation pipeline enables more diverse datasets with varied
ity, and high ID fidelity. We then focus on the IDs that poses, actions, and backgrounds, which can be instrumen-
SDXL can not generate itself. We refer to this scenario as tal in developing more robust and generalizable computer
the “non-celebrity” case. Compared Fig. 15 with Fig. 13, vision models.
our method can successfully generate the corresponding in- In the realm of practical applications, our technique has
put IDs for this setting. the potential to revolutionize industries such as entertain-
ment, where it can be used to create realistic characters
Bringing person in artwork/old photo into reality. for movies or video games without the need for extensive
Fig. 16-17 demonstrate the ability of our method to bring CGI work. It can also be beneficial in virtual reality, pro-
past celebrities back to reality. It is worth noticing that viding more immersive and personalized experiences by al-
our method can generate photo-realistic images from IDs lowing users to see themselves in different scenarios. It is
in statues and oil paintings. Achieving this is quite chal- worth noticing that everyone can rely on our PhotoMaker to
lenging for the other methods we have compared. quickly customize their own digital portraits.
However, we acknowledge the ethical considerations
Changing age or gender. We provide more visual results that arise with the ability to generate human images with
for changing age or gender in Fig. 18. As mentioned in the high fidelity. The proliferation of such technology may lead
main paper, we only need to change the class word when to a surge in the inappropriate use of generated portraits,
we conduct such an application. In the generated ID images malicious image tampering, and the spreading of false in-
changed in terms of age or gender, our method can well formation. Therefore, we stress the importance of develop-
preserve the characteristics in the original ID. ing and adhering to ethical guidelines and using this tech-
nology responsibly. We hope that our contribution will spur
Identity mixing. We provide more visual results for iden-
further discussion and research into the safe and ethical use
tity mixing application in Fig. 19. Benefiting from our
of human generation in computer vision.
stacked ID embedding, our method can effectively blend
the characteristics of different IDs to form a new ID. Subse-
quently, we can generate text controlled based on this new
ID. Additionally, our method provides great flexibility dur-
ing the identity mixing, as can be seen in Fig. 10-11. More
importantly, we have explored in the main paper that ex-
isting methods struggle to achieve this application. Con-
versely, our PhotoMaker opens up a multitude of possibili-
ties.
5
References Dreambooth FastComposer IPAdapter PhotoMaker (Ours)
A man in the snow
A man crying disappointedly, with tears flowing
A photo of a woman
A man wearing a doctoral cap
A woman climbing a mountain
A man in space wearing a spacesuit
Figure 14. More visual examples for recontextualization setting. Our method not only provides high ID fidelity but also retains text
editing capabilities. We randomly sample three images for each prompt. (Zoom-in for the best view)
6
A woman happily smiling, looking at the camera
A photo of a woman
A man in the snow
A man wearing sunglasses
A man in a helmet and vest riding a motorcycle
A man sitting in front of the camera, with a beautiful purple sunset at the beach in the background
Figure 15. More visual examples for recontextualization setting. Our method not only provides high ID fidelity but also retains text
editing capabilities. We randomly sample three images for each prompt. (Zoom-in for the best view)
7
A man wearing a Christmas hat
A man riding a horse
A man frowning at the camera
A man wearing a red sweater
A man wearing a blue cap
A man wearing sunglasses
Figure 16. More visual examples for bringing person in old-photo back to life. Our method can generate high-quality images. We
randomly sample three images for each prompt. (Zoom-in for the best view)
8
A woman wearing a spacesuit
A woman wearing sunglasses
A man holding a bottle of red wine
A man in the snow
A photo of a man
A man in space wearing a spacesuit
Figure 17. More visual examples for bringing person in artworks back to life. Our PhotoMaker can generate photo-realistic images
while other methods are hard to achieve. We randomly sample three images for each prompt. (Zoom-in for the best view)
9
References A boy happily smiling A boy wearing a blue hoodie A boy wearing sunglasses
A woman with white hair,

A woman with red hair A woman wearing a spacesuit
wearing the sunglasses
A boy wearing a red sweater A boy swimming in the pool A boy frowning at the camera
A woman wearing a doctoral

A woman in the snow A woman happily smiling
cap
Figure 18. More visual examples for changing age or gender for each ID. Our PhotoMaker, when modifying the gender and age of the
input ID, effectively retains the characteristics of the face ID and allows for textual manipulation. We randomly sample three images for
each prompt. (Zoom-in for the best view)
10
A man wearing headphones
References A man wearing a doctoral cap A man wearing a Christmas cap
with red hair
A man wearing headphones A man in space wearing a

A man in the snow
with red hair spacesuit
A man wearing sunglasses and

A man happily smiling A man wearing a red sweater
necklace
A woman wearing a red A woman wearing a blue A woman wearing a Christmas

sweater hoodie cap
A woman in wearing a A woman wearing headphones

A woman swimming in the pool
spacesuit with red hair
A woman in wearing a A woman wearing a red

A woman piloting a spaceship
spacesuit sweater
Figure 19. More visual results for identity mixing applications. Our PhotoMaker can maintain the characteristics of both input IDs in
the new generated ID image, while providing high-quality and text-compatible generation results. We randomly sample three images for
each prompt. (Zoom-in for the best view)
11

Paper 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Paper 2

Uploaded by

Copyright:

Available Formats

Customizing Realistic Human Photos via Stacked ID Embedding

User inputs User inputs

(a) Attributes change

(b) Artwork / old-photo to reality

(c) Identity mixing

Abstract (ID) fidelity, and flexible text controllability. In this work,

2. Face Detection & Filtering

Stacked ID Embedding Grouped by IDs

Text Encoder(s) Training 5. Captioning & Marking

A man coding in front A woman happily smiling,

A man piloting a A girl wearing a

we can see that neither DreamBooth nor SDXL can achieve

for reference. to control the number of input images, prompt weighting

A man wearing a red sweater

80% Obama 20% Obama

A woman happily smiling, looking at the camera

0.1 0.5 0.8 0.9 1.0 1.1 1.2 1.5 2.0

A man in the snow

0.1 0.5 0.8 0.9 1.0 1.1 1.2 1.5 2.0

A woman wearing sunglasses

A man in the snow

A man crying disappointedly, with tears flowing

A man wearing a doctoral cap

A woman climbing a mountain

A man in space wearing a spacesuit

A woman happily smiling, looking at the camera

A man in the snow

A man wearing sunglasses

A man in a helmet and vest riding a motorcycle

A man wearing a Christmas hat

A man riding a horse

A man frowning at the camera

A man wearing a red sweater

A man wearing a blue cap

A man wearing sunglasses

A woman wearing a spacesuit

A woman wearing sunglasses

A man holding a bottle of red wine

A man in the snow

A man in space wearing a spacesuit

A woman with white hair,

A woman wearing a doctoral

A man wearing headphones A man in space wearing a

A man wearing sunglasses and

A woman wearing a red A woman wearing a blue A woman wearing a Christmas

A woman in wearing a A woman wearing headphones

A woman in wearing a A woman wearing a red

You might also like