(Arisandy Yudha Putra - 23150137) Research Interest
Work Experience (Research Phase)
AI Engineer (Part-Time) at Keisuugiken
Projects:
• Classification using YOLOv4
• Optical Character Recognition
3D Designer (Freelance)
Projects:
• 3D design using SolidWorks
• 3D printing (user)
Brief Topic Review:
Text-to-Image Generation
Outline
• Definition
• Recent Trends
• Potential Application
• General Challenges
• Future Directions
• References
Definition
• Text-to-image generation is a method for generating an image from a
natural language description (text) given as input.
• The model learns the mapping between textual descriptions and images.
• Approaches: GANs, VQ-VAE + Transformer-based methods, autoregressive
models, diffusion models, etc.
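As a toy illustration of the diffusion-model family listed above (a generic sketch, not any specific paper's implementation), the forward process gradually replaces an image with Gaussian noise; generation then runs this process in reverse, conditioned on the text. All sizes and the schedule below are illustrative.

```python
import numpy as np

# Forward (noising) process of a diffusion model, in closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
# Text conditioning enters only in the learned reverse process.

rng = np.random.default_rng(0)

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # a tiny stand-in "image"
betas = np.linspace(1e-4, 0.02, 1000)      # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)        # cumulative signal fraction

def noise_to_step(x0, t):
    """Sample x_t from q(x_t | x_0) without simulating every step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_early = noise_to_step(x0, 10)    # still mostly the original image
x_late = noise_to_step(x0, 999)    # essentially pure noise

print(round(float(alpha_bar[10]), 3), round(float(alpha_bar[999]), 3))  # → 0.998 0.0
```

The printed values show why the schedule works: after 10 steps almost all of the signal survives, while after 1000 steps essentially none does.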
Recent Trends
• DALL-E 1 (2021) [1]
• The goal is to train a transformer [4] to autoregressively model the text
and image tokens as a single stream of data.
• Two-stage training procedure:
1. Train a dVAE to compress each 256x256 RGB image into a 32x32
grid of image tokens.
2. Concatenate 256 BPE-encoded text tokens with the 32x32 image
tokens, and train an autoregressive transformer to model the joint
distribution over text and image tokens.
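The single-stream layout above can be sketched with plain token arithmetic. The vocabulary sizes are the ones reported for DALL-E (8192-entry dVAE codebook, 16384-entry BPE vocabulary), but the tokens here are random placeholders, not real encodings.

```python
import numpy as np

# Sketch of DALL-E's single-stream token layout: 256 BPE text tokens
# followed by a 32x32 grid of dVAE image tokens, modeled jointly by
# one autoregressive transformer.

rng = np.random.default_rng(0)

TEXT_LEN = 256
IMG_GRID = 32          # the dVAE compresses 256x256 pixels to 32x32 tokens
TEXT_VOCAB = 16384     # BPE vocabulary size
IMG_VOCAB = 8192       # dVAE codebook size

text_tokens = rng.integers(0, TEXT_VOCAB, size=TEXT_LEN)
image_tokens = rng.integers(0, IMG_VOCAB, size=IMG_GRID * IMG_GRID)

# One stream of 256 + 1024 = 1280 tokens for the transformer.
stream = np.concatenate([text_tokens, image_tokens])
print(stream.shape)  # → (1280,)
```

Compressing the image 64x in each spatial dimension's pixel count is what makes autoregressive modeling tractable: 1280 tokens instead of 196,608 raw pixel values.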
• NUWA (2022): Visual Synthesis [2]
• Works for text, images, and video
• Approach: a 3D transformer encoder-decoder framework
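The unifying idea in NUWA is a single 3D representation for all three modalities: text is a degenerate 1 x 1 x L grid, an image is H x W x 1, and a video is H x W x T. A minimal shape-level sketch (the sizes below are illustrative, not the paper's configuration):

```python
import numpy as np

# NUWA-style 3D data representation: every modality is a
# height x width x time grid of tokens, so one encoder-decoder
# can attend over all of them uniformly.

text = np.zeros((1, 1, 77))      # a 77-token caption: 1 x 1 x L
image = np.zeros((21, 21, 1))    # one frame of visual tokens: H x W x 1
video = np.zeros((21, 21, 10))   # ten frames: H x W x T

for name, x in [("text", text), ("image", image), ("video", video)]:
    # The 3D transformer flattens and attends over each grid.
    print(name, x.shape, x.size)
```

Because all inputs share this layout, the same attention machinery handles text-to-image, text-to-video, and video prediction without per-task architectures.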
• Parti (2022) [5]
• Focuses on:
• High-fidelity photorealistic images
• Content-rich synthesis involving complex composition and world
knowledge
• Sequence-to-sequence modeling (autoregressive)
• Approach:
1. A transformer-based image tokenizer, ViT-VQGAN, encodes
images as sequences of discrete tokens.
2. Scaling the encoder-decoder Transformer model up to 20B
parameters yields consistent quality improvements.
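The sequence-to-sequence decoding loop above can be sketched as follows. The real decoder is a 20B-parameter Transformer; `toy_logits` here is a random stand-in, and the grid and codebook sizes are illustrative.

```python
import numpy as np

# Sketch of Parti-style autoregressive image generation: the decoder
# emits ViT-VQGAN image tokens one at a time, conditioned on the text
# and on the tokens generated so far.

rng = np.random.default_rng(0)
IMG_VOCAB = 8192   # illustrative ViT-VQGAN codebook size
SEQ_LEN = 16       # a tiny 4x4 token grid for the sketch

def toy_logits(text_tokens, generated):
    """Stand-in for the Transformer decoder's next-token logits."""
    return rng.standard_normal(IMG_VOCAB)

text_tokens = [12, 7, 431]   # an encoded prompt (placeholder values)
generated = []
for _ in range(SEQ_LEN):
    logits = toy_logits(text_tokens, generated)
    generated.append(int(np.argmax(logits)))  # greedy decoding

# `generated` would then be mapped back to pixels by the
# ViT-VQGAN decoder.
print(len(generated))  # → 16
```

In practice sampling (with temperature or top-k) replaces the greedy `argmax`, trading determinism for diversity.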
Potential Applications
• Virtual environments: for video games, simulations, and other virtual reality applications.
• E-commerce: to generate product images for online marketplaces, based on textual descriptions
of the products.
• Fashion: to generate fashion designs and clothing images based on textual descriptions of the
desired designs.
• Accessibility: to create visual aids for people with visual impairments, such as generating images
to accompany text descriptions of visual content.
• Education: to create visual aids for educational and training purposes, such as generating images
to illustrate concepts and ideas.
• Art: to generate unique and creative artwork based on textual descriptions or prompts.
General Challenges
• Data quality and quantity: models require large amounts of high-quality data to train effectively, but
obtaining such data can be difficult, especially for specialized domains or languages.
• Ambiguity in natural language: different people may describe the same image in different ways, which
makes it difficult to create a consistent mapping between textual descriptions and images.
• Complexity of visual content: images combine textures, shapes, and lighting, among other aspects;
capturing all of these in a single model can be challenging and may require specialized architectures.
• Lack of interpretability: some text-to-image synthesis models are difficult to interpret, making it
challenging to understand how the model generates images from textual descriptions.
• Computational resources: these models require large amounts of memory and processing power;
training and inference can be costly, making them difficult to use effectively.
Future Directions
• Incorporating context and background knowledge: Text-to-image models often generate images based
solely on the input text, without taking into account other contextual information or background knowledge
that might be relevant.
• Improving visual quality and realism: While current text-to-image models can generate high-quality images,
there is still room for improvement in terms of visual quality and realism.
• Incorporating human feedback: Incorporating human feedback into the text-to-image generation process
may help to improve the quality and relevance of generated images. Future research may explore ways to
develop interactive text-to-image models that can receive feedback from human users and use this feedback
to refine image generation.
• Better understanding of image composition: Many text-to-image models generate images by using attention
mechanisms to selectively generate different regions of the image based on the text input. However, there is
still much to be learned about how to effectively compose these regions into coherent images.
References
[1] A. Ramesh et al., ‘Zero-Shot Text-to-Image Generation’, in Proceedings of the 38th
International Conference on Machine Learning, 2021, pp. 8821–8831.
[2] C. Wu et al., ‘NÜWA: Visual Synthesis Pre-Training for Neural VisUal World
CreAtion’, in ECCV, 2022, pp. 720–736.
[3] C. Saharia et al., ‘Photorealistic Text-to-Image Diffusion Models with Deep
Language Understanding’, arXiv:2205.11487, 2022.
[4] A. Vaswani et al., ‘Attention Is All You Need’, arXiv:1706.03762, 2017.
[5] J. Yu et al., ‘Scaling Autoregressive Models for Content-Rich Text-to-Image
Generation’, arXiv:2206.10789, 2022.