
Research Interest

for the Course on Random Processes


Arisandy Yudha Putra
Student ID: 23150137
Program: Ph.D.
Department: Intelligent Mechatronics Engineering
Hello, Prof. Sikandar Aftab.

Name: Yudha Putra Arisandy
Age: 25 years
Bachelor: Mechanical and Biosystem Engineering, IPB University, Indonesia
Master: Computer Science, IPB University, Indonesia
Research Interest: Text-to-Image Generation Models

Work Experience (Research Phase)
AI Engineer (Part-Time) at Keisuugiken
Project:
• Classification using YOLOv4
• Optical Character Recognition

Research Assistant at IPB University (Freelance)


Projects:
• VIS-NIR Spectroscopy Portable Monitoring Device
• Autonomous Mobile Robot for on-farm monitoring

3D Designer (Freelance)
Projects:
• 3D design using SolidWorks
• 3D printing (as a user)

Brief Topic Review:
Text-to-Image Generation
Text-to-Image Generation

Outline
• Definition
• Recent Trends
• Potential Applications
• General Challenges
• Future Directions
• References

Text-to-Image Generation

Definition
• Text-to-image generation is a method for generating images from a
natural-language description (text) given as input.
• The model learns a mapping between textual descriptions and images.
• Approaches: GANs, VQ-VAE/Transformer-based methods, autoregressive models,
diffusion models, etc.

Description (text) → Text-to-image model → Image

Example: “A couple glasses are sitting on a table” → text-to-image model
(e.g., Imagen) → generated image. A minimal usage sketch follows.
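Imagen itself is not publicly released, so as an illustration only, here is a
sketch of the same text-in, image-out interface using the open-source Hugging
Face diffusers library with Stable Diffusion as a stand-in; the model ID and
output filename are assumptions, not part of the slides.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pretrained text-to-image diffusion pipeline (stand-in for Imagen).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # The model maps a natural-language description to an image.
    image = pipe("A couple glasses are sitting on a table").images[0]
    image.save("glasses.png")  # hypothetical output path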

Text-to-Image Generation
Recent Trends
• DALL-E 1 (2021) [1]
• The goal is to train a transformer [4] to autoregressively model the text
and image tokens as a single stream of data.
• Two-stage training procedure:
1. Train a dVAE to compress each 256x256 RGB image into a 32x32
grid of image tokens.
2. Concatenate 256 BPE-encoded text tokens with the 32x32 image
tokens, and train an autoregressive transformer to model the joint
distribution over text and image tokens (sketched below).
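As a rough illustration of stage 2 (not DALL-E's actual code), the sketch
below assembles the combined token stream and the shifted inputs/targets used
for next-token prediction; the vocabulary sizes follow the paper [1], but the
tensors are random placeholders.

    import torch

    # 256 BPE text tokens plus a 32x32 = 1024-token grid from the dVAE encoder.
    text_tokens = torch.randint(0, 16384, (1, 256))   # BPE text vocabulary
    image_tokens = torch.randint(0, 8192, (1, 1024))  # dVAE codebook entries

    # Concatenate into a single stream; the autoregressive transformer is
    # trained to predict each next token given all previous ones.
    stream = torch.cat([text_tokens, image_tokens], dim=1)  # shape (1, 1280)
    inputs, targets = stream[:, :-1], stream[:, 1:]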

Text-to-Image Generation
Recent Trends
• NUWA (2022): visual synthesis [2]
• Works across text, image, and video
• Approach: a 3D transformer encoder-decoder framework

• Imagen (2022): text-to-image diffusion model [3]

• Combines a large transformer language model (for text understanding)
with diffusion models (for high-fidelity image generation), achieving:
• an unprecedented degree of photorealism
• a deep level of language understanding
• Limitations of previous work:
• DALL-E 2: complex; needs to learn a latent prior (the successor to
DALL-E 1) [3]
• GLIDE: insufficient image fidelity
• XMC-GAN: relatively small text encoder
(A classifier-free guidance sketch follows.)
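One core mechanism behind Imagen-style diffusion sampling is classifier-free
guidance; the toy sketch below shows only the guidance formula, with a
stand-in "U-Net" invented so the code runs, not Imagen's real architecture.

    import torch

    def cfg_noise(unet, x_t, t, text_emb, null_emb, guidance_scale=7.5):
        # Classifier-free guidance: blend the conditional and unconditional
        # noise predictions; larger scales push samples toward the prompt.
        eps_cond = unet(x_t, t, text_emb)
        eps_uncond = unet(x_t, t, null_emb)
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Toy stand-in for a denoising U-Net so the sketch runs end to end.
    toy_unet = lambda x, t, emb: 0.1 * x + 0.01 * emb.mean()
    eps = cfg_noise(toy_unet, torch.randn(1, 3, 64, 64), t=500,
                    text_emb=torch.randn(1, 128), null_emb=torch.zeros(1, 128))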

Text-to-Image Generation
Recent Trends
• Parti (2022) [5]
• Focuses on:
• high-fidelity, photorealistic images
• content-rich synthesis involving complex composition and world
knowledge
• Sequence-to-sequence modeling (autoregressive)
• Approach:
1. A transformer-based image tokenizer, ViT-VQGAN, encodes images
as sequences of discrete tokens.
2. The encoder-decoder transformer is scaled up to 20B parameters,
yielding consistent quality improvements (a toy sketch follows).
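A toy stand-in for Parti's sequence-to-sequence formulation (nowhere near the
real 20B-parameter model): an encoder-decoder transformer maps text features
to image-token features, which a ViT-VQGAN decoder would then turn into
pixels. All sizes below are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Tiny encoder-decoder transformer: text in, image-token features out.
    model = nn.Transformer(d_model=256, nhead=4, num_encoder_layers=2,
                           num_decoder_layers=2, batch_first=True)
    text = torch.randn(1, 64, 256)   # embedded text prompt
    img = torch.randn(1, 1024, 256)  # embedded 32x32 grid of image tokens
    out = model(src=text, tgt=img)   # decoder states used to predict tokens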

Text-to-Image Generation

Potential Applications
• Virtual environments: for video games, simulations, and other virtual reality applications.

• E-commerce: to generate product images for online marketplaces, based on textual descriptions
of the products.

• Fashion: to generate fashion designs and clothing images based on textual descriptions of the
desired designs.

• Accessibility: to create visual aids for people with visual impairments, such as generating images
to accompany text descriptions of visual content.

• Education: to create visual aids for educational and training purposes, such as generating images
to illustrate concepts and ideas.

• Art: to generate unique and creative artwork based on textual descriptions or prompts.

Text-to-Image Generation

General Challenges
• Data quality and quantity: models require large amounts of high-quality paired data to train
effectively, but obtaining such data can be difficult, especially for specialized domains or languages.

• Ambiguity in natural language: different people may describe the same image in different ways, which
makes it difficult to learn a consistent mapping between textual descriptions and images.

• Complexity of visual content: images combine textures, shapes, and lighting, among other factors;
capturing all of these aspects in a single model can be challenging and may require specialized architectures.

• Lack of interpretability: some text-to-image synthesis models can be difficult to interpret, making it
challenging to understand how the model generates images from textual descriptions.

• Computational resources: these models require large amounts of memory and processing power, so training
and running them can be costly, making them difficult to use effectively.

Text-to-Image Generation

Future Directions
• Incorporating context and background knowledge: Text-to-image models often generate images based
solely on the input text, without taking into account other contextual information or background knowledge
that might be relevant.

• Improving visual quality and realism: While current text-to-image models can generate high-quality images,
there is still room for improvement in terms of visual quality and realism.

• Incorporating human feedback: Incorporating human feedback into the text-to-image generation process
may help to improve the quality and relevance of generated images. Future research may explore ways to
develop interactive text-to-image models that can receive feedback from human users and use this feedback
to refine image generation.

• Better understanding of image composition: Many text-to-image models generate images by using attention
mechanisms to selectively generate different regions of the image based on the text input. However, there is
still much to be learned about how to effectively compose these regions into coherent images (a minimal
cross-attention sketch follows).
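A minimal cross-attention sketch of the region-wise conditioning described
above, assuming nothing beyond standard PyTorch; the grid and embedding sizes
are made up for illustration.

    import torch
    import torch.nn as nn

    # Image-region queries attend over text-token keys/values, so each
    # spatial region is conditioned on the relevant words of the prompt.
    attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
    regions = torch.randn(1, 64, 128)  # 8x8 grid of image-region features
    text = torch.randn(1, 16, 128)     # encoded text tokens
    conditioned, weights = attn(query=regions, key=text, value=text)
    # `weights` reveals which words each region attends to.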

Text-to-Image Generation

References
[1] A. Ramesh et al., ‘Zero-Shot Text-to-Image Generation’, in Proceedings of the 38th
International Conference on Machine Learning, 2021, pp. 8821–8831.
[2] C. Wu et al., ‘NÜWA: Visual Synthesis Pre-Training for Neural VisUal World
CreAtion’, in ECCV, 2022, pp. 720–736.
[3] C. Saharia et al., ‘Photorealistic Text-to-Image Diffusion Models with Deep
Language Understanding’, arXiv:2205.11487, 2022.
[4] A. Vaswani et al., ‘Attention Is All You Need’, arXiv:1706.03762, 2017.
[5] J. Yu et al., ‘Scaling Autoregressive Models for Content-Rich Text-to-Image
Generation’, arXiv:2206.10789, 2022.

