(Arisandy Yudha Putra - 23150137) Research Interest
Work Experience (Research Phase)
AI Engineer (Part-Time) at Keisuugiken
Projects:
• Classification using YOLOv4
• Optical Character Recognition
3D Designer (Freelance)
Projects:
• 3D design using SolidWorks
• 3D printing (user)
Brief Topic Review:
Text-to-Image Generation
Outline
• Definition
• Recent Trends
• Potential Application
• General Challenges
• Future Directions
• References
Definition
• Text-to-image generation is a method for generating an image from a
natural language description (text) given as input.
• The model learns the mapping between textual descriptions and images.
• Approaches: GANs, VQ-VAE + Transformer-based methods, autoregressive
models, diffusion models, etc.
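As a toy illustration of the diffusion-model family listed above (a generic sketch, not any specific paper's implementation), the forward process gradually replaces an image with Gaussian noise; generation then runs this process in reverse, conditioned on the text. All sizes and the schedule below are illustrative.

```python
import numpy as np

# Forward (noising) process of a diffusion model, in closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
# Text conditioning enters only in the learned reverse process.

rng = np.random.default_rng(0)

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # a tiny stand-in "image"
betas = np.linspace(1e-4, 0.02, 1000)      # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)        # cumulative signal fraction

def noise_to_step(x0, t):
    """Sample x_t from q(x_t | x_0) without simulating every step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_early = noise_to_step(x0, 10)    # still mostly the original image
x_late = noise_to_step(x0, 999)    # essentially pure noise

print(round(float(alpha_bar[10]), 3), round(float(alpha_bar[999]), 3))  # → 0.998 0.0
```

The printed values show why the schedule works: after 10 steps almost all of the signal survives, while after 1000 steps essentially none does.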
Recent Trends
• DALL-E 1 (2021) [1]
• The goal is to train a transformer [4] to autoregressively model the text
and image tokens as a single stream of data.
• Two-stage training procedure:
1. Train a dVAE to compress each 256x256 RGB image into a 32x32
grid of image tokens.
2. Concatenate 256 BPE-encoded text tokens with the 32x32 image
tokens, and train an autoregressive transformer to model the joint
distribution over text and image tokens.
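The single-stream layout above can be sketched with plain token arithmetic. The vocabulary sizes are the ones reported for DALL-E (8192-entry dVAE codebook, 16384-entry BPE vocabulary), but the tokens here are random placeholders, not real encodings.

```python
import numpy as np

# Sketch of DALL-E's single-stream token layout: 256 BPE text tokens
# followed by a 32x32 grid of dVAE image tokens, modeled jointly by
# one autoregressive transformer.

rng = np.random.default_rng(0)

TEXT_LEN = 256
IMG_GRID = 32          # the dVAE compresses 256x256 pixels to 32x32 tokens
TEXT_VOCAB = 16384     # BPE vocabulary size
IMG_VOCAB = 8192       # dVAE codebook size

text_tokens = rng.integers(0, TEXT_VOCAB, size=TEXT_LEN)
image_tokens = rng.integers(0, IMG_VOCAB, size=IMG_GRID * IMG_GRID)

# One stream of 256 + 1024 = 1280 tokens for the transformer.
stream = np.concatenate([text_tokens, image_tokens])
print(stream.shape)  # → (1280,)
```

Compressing the image 64x in each spatial dimension's pixel count is what makes autoregressive modeling tractable: 1280 tokens instead of 196,608 raw pixel values.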
• NUWA (2022): Visual Synthesis [2]
• Works for text, images, and video
• Approach: a 3D transformer encoder-decoder framework
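The unifying idea in NUWA is a single 3D representation for all three modalities: text is a degenerate 1 x 1 x L grid, an image is H x W x 1, and a video is H x W x T. A minimal shape-level sketch (the sizes below are illustrative, not the paper's configuration):

```python
import numpy as np

# NUWA-style 3D data representation: every modality is a
# height x width x time grid of tokens, so one encoder-decoder
# can attend over all of them uniformly.

text = np.zeros((1, 1, 77))      # a 77-token caption: 1 x 1 x L
image = np.zeros((21, 21, 1))    # one frame of visual tokens: H x W x 1
video = np.zeros((21, 21, 10))   # ten frames: H x W x T

for name, x in [("text", text), ("image", image), ("video", video)]:
    # The 3D transformer flattens and attends over each grid.
    print(name, x.shape, x.size)
```

Because all inputs share this layout, the same attention machinery handles text-to-image, text-to-video, and video prediction without per-task architectures.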
• Parti (2022) [5]
• Focuses on:
• High-fidelity photorealistic images
• Content-rich synthesis involving complex composition and world
knowledge
• Sequence-to-sequence modeling (autoregressive)
• Approach:
1. A transformer-based image tokenizer, ViT-VQGAN, encodes
images as sequences of discrete tokens.
2. Scaling the encoder-decoder Transformer model up to 20B
parameters yields consistent quality improvements.
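The sequence-to-sequence decoding loop above can be sketched as follows. The real decoder is a 20B-parameter Transformer; `toy_logits` here is a random stand-in, and the grid and codebook sizes are illustrative.

```python
import numpy as np

# Sketch of Parti-style autoregressive image generation: the decoder
# emits ViT-VQGAN image tokens one at a time, conditioned on the text
# and on the tokens generated so far.

rng = np.random.default_rng(0)
IMG_VOCAB = 8192   # illustrative ViT-VQGAN codebook size
SEQ_LEN = 16       # a tiny 4x4 token grid for the sketch

def toy_logits(text_tokens, generated):
    """Stand-in for the Transformer decoder's next-token logits."""
    return rng.standard_normal(IMG_VOCAB)

text_tokens = [12, 7, 431]   # an encoded prompt (placeholder values)
generated = []
for _ in range(SEQ_LEN):
    logits = toy_logits(text_tokens, generated)
    generated.append(int(np.argmax(logits)))  # greedy decoding

# `generated` would then be mapped back to pixels by the
# ViT-VQGAN decoder.
print(len(generated))  # → 16
```

In practice sampling (with temperature or top-k) replaces the greedy `argmax`, trading determinism for diversity.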
Potential Applications
• Virtual environments: for video games, simulations, and other virtual reality applications.
• E-commerce: to generate product images for online marketplaces, based on textual descriptions
of the products.
• Fashion: to generate fashion designs and clothing images based on textual descriptions of the
desired designs.
• Accessibility: to create visual aids for people with visual impairments, such as generating images
to accompany text descriptions of visual content.
• Education: to create visual aids for educational and training purposes, such as generating images
to illustrate concepts and ideas.
• Art: to generate unique and creative artwork based on textual descriptions or prompts.
General Challenges
• Data quality and quantity: models require large amounts of high-quality data to train effectively, but
obtaining such data can be difficult, especially for specialized domains or languages.
• Ambiguity in natural language: different people may describe the same image in different ways, which
makes it difficult to create a consistent mapping between textual descriptions and images.
• Complexity of visual content: images combine textures, shapes, and lighting, among other aspects;
capturing all of these in a single model can be challenging and may require specialized architectures.
• Lack of interpretability: some text-to-image synthesis models are difficult to interpret, making it
challenging to understand how the model generates images from textual descriptions.
• Computational resources: these models require large amounts of memory and processing power;
training and inference can be costly, making them difficult to use effectively.
Future Directions
• Incorporating context and background knowledge: Text-to-image models often generate images based
solely on the input text, without taking into account other contextual information or background knowledge
that might be relevant.
• Improving visual quality and realism: While current text-to-image models can generate high-quality images,
there is still room for improvement in terms of visual quality and realism.
• Incorporating human feedback: Incorporating human feedback into the text-to-image generation process
may help to improve the quality and relevance of generated images. Future research may explore ways to
develop interactive text-to-image models that can receive feedback from human users and use this feedback
to refine image generation.
• Better understanding of image composition: Many text-to-image models generate images by using attention
mechanisms to selectively generate different regions of the image based on the text input. However, there is
still much to be learned about how to effectively compose these regions into coherent images.
References
[1] A. Ramesh et al., ‘Zero-Shot Text-to-Image Generation’, in Proceedings of the 38th
International Conference on Machine Learning, 2021, pp. 8821–8831.
[2] C. Wu et al., ‘NÜWA: Visual Synthesis Pre-Training for Neural VisUal World
CreAtion’, in ECCV, 2022, pp. 720–736.
[3] C. Saharia et al., ‘Photorealistic Text-to-Image Diffusion Models with Deep
Language Understanding’, arXiv:2205.11487, 2022.
[4] A. Vaswani et al., ‘Attention Is All You Need’, arXiv:1706.03762, 2017.
[5] J. Yu et al., ‘Scaling Autoregressive Models for Content-Rich Text-to-Image
Generation’, arXiv:2206.10789, 2022.