
RAID’24 Summer Project SOP

Personalized Image Editing of Generated Images

Personal Information
● Full Name: Mendpara Laksh Alpeshbhai
● Roll Number: B23CS1037
● Contacts:
○ Email: b23cs1037@iitj.ac.in
○ GitHub Username: Laksh-Mendpara
○ WhatsApp no.: +91 7778894104
● Technical Skills:
○ Convolutional Neural Networks (CNNs)
○ Generative Adversarial Networks (GANs)
○ Variational Autoencoders (VAEs)
○ Transformer Networks and Self-Attention Mechanisms
○ Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and
Gated Recurrent Units (GRU)
○ Langchain
○ Autoencoders
○ Neural Networks and Multilayer Perceptron (MLP)

Implementation Details
High-Level Overview
To address the problem statement of personalized image editing using text prompts, we
could implement a solution based on the "Forgedit" method, which leverages diffusion
models and embedding manipulation for high-quality image editing guided by text
prompts (based on the Forgedit research paper - Link for the research paper).

Methods and Approach
1. Diffusion Model Fine-tuning:
a. Start from a pre-trained Stable Diffusion model, which forms the backbone
for image manipulation.
b. Fine-tune this model using the original images and embeddings derived
from both BLIP and CLIP models. This process enhances the model's
ability to interpret and integrate textual prompts into the editing process.
2. Vector Subtraction/Projection Mechanism:
a. Subtraction Operation: Calculate the editing vector by subtracting the
source text embedding from the target text embedding. This vector
represents the specific modifications or adjustments desired in the image.
b. Projection Operation: Calculate the editing vector by projecting the target
text embedding onto the source text embedding and keeping the orthogonal
component. This variant preserves more of the source semantics while still
encoding the desired modification.
3. Forgetting Strategy:
a. Incorporate a UNet-based forgetting strategy within the diffusion model.
b. Optimize the UNet's encoder to focus on spatial structure learning, which
preserves the fundamental features and layout of the original image.
c. Tailor the UNet's decoder to emphasize appearance and identity
maintenance, ensuring that edited images retain coherence and fidelity to
the original while accommodating user-defined changes.
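To make the forgetting strategy concrete, here is a minimal PyTorch sketch of how the UNet's encoder and decoder could be given separate optimization treatment. The `down_blocks`/`up_blocks` attribute names follow the diffusers UNet convention and are assumptions for illustration, not a committed design:

```python
import torch.nn as nn

def build_param_groups(unet: nn.Module,
                       encoder_lr: float = 1e-5,
                       decoder_lr: float = 1e-6):
    """Split UNet parameters so the encoder (spatial structure) and the
    decoder (appearance/identity) can be fine-tuned at different rates."""
    encoder_params, decoder_params = [], []
    for name, param in unet.named_parameters():
        if name.startswith("down_blocks"):    # encoder half of the UNet
            encoder_params.append(param)
        elif name.startswith("up_blocks"):    # decoder half of the UNet
            decoder_params.append(param)
    return [
        {"params": encoder_params, "lr": encoder_lr},
        {"params": decoder_params, "lr": decoder_lr},
    ]

# The groups plug directly into an optimizer, e.g.:
# optimizer = torch.optim.AdamW(build_param_groups(unet))
```

Setting `decoder_lr` lower (or to zero) is one way to keep the decoder close to its pre-trained appearance prior while the encoder adapts to the source image's layout.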

Tools and Technologies


● Programming Languages and Frameworks: Python for implementation, PyTorch
for deep learning functionalities.
● Models: BLIP and CLIP for text-based operations; GANs, autoencoders, and Stable
Diffusion for image manipulation.
● Supporting Libraries: NumPy, PyTorch, pandas, scikit-learn, TensorFlow, and
Matplotlib for visualization.

Unique Methodologies and Innovations


● Vector Subtraction/Projection Mechanism: To obtain the desired image, we will
first use the subtraction mechanism; if the model then forgets too much of the
original image's content, we will switch to the projection mechanism instead,
which preserves more of the source semantics.
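The two editing-vector operations can be sketched in a few lines of PyTorch. The exact projection formulation in the Forgedit paper may differ; this sketch assumes 1-D text embeddings and an orthogonal-component projection:

```python
import torch

def editing_vector(e_src: torch.Tensor, e_tgt: torch.Tensor,
                   mode: str = "subtract") -> torch.Tensor:
    """Compute the editing vector from source/target text embeddings.

    subtract: e_tgt - e_src, the full difference between the prompts.
    project:  e_tgt minus its projection onto e_src (the component of
              e_tgt orthogonal to e_src) -- a gentler edit that keeps
              more of the source semantics.
    """
    if mode == "subtract":
        return e_tgt - e_src
    if mode == "project":
        coeff = (e_tgt @ e_src) / (e_src @ e_src)
        return e_tgt - coeff * e_src  # orthogonal component
    raise ValueError(f"unknown mode: {mode}")
```

In practice the choice between the two modes would be made per-edit, depending on how much of the source image's identity the subtraction variant destroys.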

● DDIM Sampling: Facilitates iterative refinement of edited images, integrating
textual cues seamlessly into the visual domain for coherent and contextually
relevant outputs.
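For illustration, one deterministic DDIM update step (eta = 0) could look like the following sketch; in practice the noise prediction `eps_pred` would come from the fine-tuned UNet conditioned on the edited text embedding:

```python
import torch

@torch.no_grad()
def ddim_step(x_t: torch.Tensor, eps_pred: torch.Tensor,
              alpha_bar_t: torch.Tensor,
              alpha_bar_prev: torch.Tensor) -> torch.Tensor:
    """One deterministic DDIM update (eta = 0).

    x_t:           current noisy latent at timestep t
    eps_pred:      noise predicted by the UNet for timestep t
    alpha_bar_*:   cumulative noise-schedule products for t and for the
                   previous timestep in the (possibly strided) schedule
    """
    # Predict the clean latent from the current one and the noise estimate.
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    # Re-noise it to the previous timestep along the deterministic trajectory.
    return alpha_bar_prev.sqrt() * x0_pred + (1 - alpha_bar_prev).sqrt() * eps_pred
```

Iterating this update over a strided timestep schedule (e.g. 50 steps instead of the full 1000) refines the latent from noise to the edited image.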

Timeline Detail
● Learning the tech stack required for the project:
○ Duration - 15 to 20 days
○ Deliverables - Comprehensive understanding of GANs and diffusion models, and
acquiring knowledge of data preprocessing, augmentation, and handling.
● Exploring different approaches and starting data collection:
○ Duration - 15 days
○ Deliverables - Dataset and selection of the most suitable model for the
project with a detailed plan outlining the approach for image editing using
text prompts.
● Model Implementation, training and optimization:
○ Duration - 20 days
○ Deliverables - Implementation of the selected text-to-image generation
model and preliminary results showcasing the model's ability to edit images
based on text prompts.

● Refinement and creating the inference pipeline:
○ Duration - 7 days
○ Deliverables - Fine-tuning the model for improved accuracy and reliability,
and building the inference pipeline.
● Documentation:
○ Duration - 7 days
○ Deliverables - Final model with a proper report and GitHub repository.

About Myself
I'm a first-year Computer Science student at IIT-Jodhpur deeply passionate about
Machine Learning. Since starting at IITJ, I've explored various ML domains like
computer vision, NLP, and Generative AI. I learned about GANs, VAEs and am
really curious about Generative AI because of its challenging nature in building
real-world applications. This project on diffusion models and Generative AI aligns
perfectly with my goal to explore innovative solutions and tackle emerging
challenges in AI, aiming to contribute meaningfully to this dynamic field.

Experience
● I was part of the support team for the Inter-IIT competition, working on the problem
statement of DevRev.
● In WARP projects, I worked on the project - FIFA Auction Player-Worth Determination.
● I developed a CNN from scratch in the C language as part of my ICS major project (GitHub
link).

Time Commitment
I am fully capable of dedicating 28-30 hours per week during the summer break and 16-20 hours
per week during regular classes. I am willing to invest additional time to ensure the project is
completed to the highest standard.
