Creating Talking Head Videos With Generative AI - by Sau Sheong
1/6/24, 1:16 PM Creating talking head videos with generative AI | by Sau Sheong | Dec, 2023 | Medium
https://medium.com/@sausheong/creating-talking-head-videos-with-generative-ai-2df3947fd506

Talking heads are exactly what they sound like: a person talking in front of a video camera, showing mostly the head and sometimes up to the shoulders or even the torso. If you watch any TV or social media, you're likely to have seen them. They are pretty popular among social media video creators and are often used for product reviews, training videos, explainers, newscasting, reporting and so on.
There are a number of AI services that can create pretty amazing talking head
videos that speak in any number of languages, and they are all quite mind-
bogglingly good. As usual, I was curious — how did they do it, and can I recreate
something similar?
Well, the answer is obviously yes since I’m writing this article. Of course, my
attempt is far from the efforts of those well-funded companies, but I believe I come
reasonably close.
Let me give you a couple of quick samples of the output before jumping in. Here’s
one closer to the season with Santa Claus sending his Christmas greetings.
Here’s another one, in Chinese, just to show it works in multiple languages, sending
greetings for the upcoming Chinese New Year.
The overall approach takes a few steps:

1. Generate the speech from the text

2. Generate a still image of the talking head

3. Animate the still image using a driver video

4. Generate the moving lips based on the speech and superimpose them on the talking head video, along with the speech

5. Improve the quality of the final video

The last bit is optional, but if you're uploading your video to social media you'll want decent quality to share.
If you're amazed that I was able to come up with all of the above, I should say upfront that I didn't do any of it on my own. I just took existing algorithms and code and stitched them together in a way that makes sense to generate the talking head. Each algorithm and project is amazing on its own, but putting them all together has quite a different effect altogether. It was deeply satisfying too.

I put all the code in a project called Persona. I'll show how it works in a bit, and how to use it to generate talking head videos later.

For now, let's take a look at the first step: generating the speech.
The code is not mine; I just tweaked it. It's pretty straightforward, though. First, I create a TextToSpeech instance. Then I load the voices from the library of voices provided by Tortoise-TTS. You can actually create your own voices: you just need to record at least 3 snippets of at least 10 seconds each into WAV files and place them in the voices directory.

Then I use the voices to generate the speech, and finally I use torchaudio to save it to a file.
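The original code listing doesn't survive in this copy, so here is a minimal sketch of that flow, assuming Tortoise-TTS's standard Python API (TextToSpeech, load_voices, tts_with_preset) and its 24 kHz output. The helper names and the voice name here are mine, not necessarily Persona's:

```python
import os

# The three generation presets mentioned below, fastest to slowest
PRESETS = ("ultra_fast", "fast", "standard")

def speech_path(path_id, filename="temp.wav"):
    # Temporary files live under temp/<path_id>/ (see the Running Persona section)
    return os.path.join("temp", path_id, filename)

def generate_speech(text, voice="myvoice", preset="ultra_fast", out="temp.wav"):
    # Heavy imports stay inside the function so the module loads without a GPU
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voices

    tts = TextToSpeech()
    # load the reference clips for the chosen voice from the voices directory
    voice_samples, conditioning_latents = load_voices([voice])
    gen = tts.tts_with_preset(text,
                              voice_samples=voice_samples,
                              conditioning_latents=conditioning_latents,
                              preset=preset)
    # Tortoise generates 24 kHz audio; save it as a WAV file
    torchaudio.save(out, gen.squeeze(0).cpu(), 24000)
```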
The code will generate a WAV file called temp.wav which contains the speech. The ultra_fast preset is the fastest; you can also use fast or standard (which is the slowest).

Tortoise-TTS is powerful but unfortunately can be quite slow. If you prefer to use another text-to-speech generator, please feel free. While this code produces WAV files, you can use other formats such as MP3 as input too. I'll show you how later.
I chose the SDXL-Turbo model because it’s an exciting new model that promises to
generate images really quickly. I was quite blown away the first time I tried it and
even though it didn’t run as fast as I thought it would on my own hardware, it was
fast enough.
The code was also dead simple, using HuggingFace’s diffusers library.
```python
avatar_description = "Young Indian man with short dark hair, serious look"
generate_image(path_id, imgfile, f"hyperrealistic digital avatar, centered, \
```
You might be wondering why I broke the prompt up into 2 pieces. It's basically to ensure that whatever the input is later on, I'll always create a talking head.
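Here's a sketch of how the two-piece prompt and the diffusers call can fit together, using AutoPipelineForText2Image with SDXL-Turbo's recommended single inference step and no guidance. The exact prefix wording beyond what the fragment above shows is an assumption:

```python
def build_prompt(avatar_description):
    # A fixed prefix keeps the output a centered talking-head portrait,
    # whatever description the user supplies later
    return "hyperrealistic digital avatar, centered, " + avatar_description

def generate_image(imgfile, prompt):
    # Heavy imports stay inside the function so the module loads without a GPU
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    # SDXL-Turbo is built for a single inference step with guidance disabled
    image = pipe(prompt=prompt, num_inference_steps=1,
                 guidance_scale=0.0).images[0]
    image.save(imgfile)
```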
I decided to look for ways to animate the head, and I found an interesting project. In fact, I found a project that combined 2 other projects to produce the effect of an animated talking head that I wanted.

The combined project, called Face Animation in Real Time, consists of 2 separate projects:

1. One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

2. GAN Prior Embedded Network for Blind Face Restoration in the Wild

I took that combined project and tweaked it to make it work for me. Let's take a step back and explain what these 2 projects do.
The first one, One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing, takes a single still image and, using another video (called a driver video), reproduces the motion of that video on the still image. As you can tell, this is the crux of what animating the talking head is all about.

The second project, GAN Prior Embedded Network for Blind Face Restoration in the Wild, does blind face restoration on the video output of the first project, making the end result a lot more natural.
This is the animate_face function that collates all the necessary pieces together and
generates an animated video file from a still image and a driver video.
```python
tmpfile = f"temp/{path_id}/tmp.mp4"
duration = get_audio_duration(os.path.join("temp", path_id, audiofile))
hms = seconds_to_hms(duration)

# read every frame of the (trimmed) driver video
capture = cv2.VideoCapture(tmpfile)
fps = capture.get(cv2.CAP_PROP_FPS)
frames = []
_, frame = capture.read()
while frame is not None:
    frames.append(frame)
    _, frame = capture.read()
capture.release()

# apply the face animation model to each frame
output_frames = []
for frame in tqdm(frames):
    result = faceanimation.inference(frame)
    output_frames.append(result)

# write the animated frames out as an H.264 video
writer = imageio.get_writer(os.path.join("temp", path_id, animatedfile),
                            fps=fps, quality=9, macro_block_size=1,
                            codec="libx264", pixelformat="yuv420p")
for frame in output_frames:
    writer.append_data(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
writer.close()
```
A driver video is nothing more than a short snippet of any talking head video. It
doesn’t need any audio. The driver video’s head and facial motions (blinking eyes,
raised eyebrows etc) will be used on top of the still image.
First, I create an instance of the FaceAnimationClass with the still image. Then I take the driver video and convert it into frames. For every frame, I use the instance of FaceAnimationClass to generate a new frame based on the still image. Finally, I take all the frames and write them to a new video.

Notice that before I start using the driver video, I use ffmpeg to trim it down to a smaller temporary driver file. This is because my driver video is about 1 minute long, and if I used it directly it would take more time to process. To reduce processing time, I trimmed the driver video to the same length as the speech.
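get_audio_duration appears later in the article; seconds_to_hms can be implemented as below, and the trim itself is a one-line ffmpeg call. The exact ffmpeg flags here are my assumption, not necessarily Persona's:

```python
def seconds_to_hms(seconds):
    # Convert a duration in seconds to ffmpeg's HH:MM:SS.mmm format
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}"

def trim_command(driver, hms, tmpfile):
    # -t cuts the driver video to the speech duration; -an drops its audio,
    # since the driver's sound is never used (flag choice is an assumption)
    return ["ffmpeg", "-y", "-i", driver, "-t", hms, "-an", tmpfile]
```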
A driver video (no sound needed) that I took from a news snippet.
This is the still image I used, generated by SDXL-Turbo based on the prompt “Young
Indian man with short dark hair, serious look”.
And this is the animated face that is produced by applying the driver video on the
still image.
You might notice that the lip movement is quite small. That doesn't matter, because we're going to superimpose another set of lips on it.
The modify_lips function puts together the speech and the animated video to
produce the combined output video.
```python
# read the animated video's frames
video_stream = cv2.VideoCapture(animatedfilePath)
fps = video_stream.get(cv2.CAP_PROP_FPS)
full_frames = []
while 1:
    still_reading, frame = video_stream.read()
    if not still_reading:
        video_stream.release()
        break
    if resize_factor > 1:
        frame = cv2.resize(frame, (frame.shape[1] // resize_factor,
                                   frame.shape[0] // resize_factor))
    if rotate:
        frame = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)
    full_frames.append(frame)

# mel was computed from the WAV speech with librosa (listing elided above)
if np.isnan(mel.reshape(-1)).sum() > 0:
    raise ValueError('Mel contains nan!')

# break the mel spectrogram up into one chunk per video frame
mel_chunks = []
mel_idx_multiplier = 80. / fps
i = 0
while 1:
    start_idx = int(i * mel_idx_multiplier)
    if start_idx + mel_step_size > len(mel[0]):
        mel_chunks.append(mel[:, len(mel[0]) - mel_step_size:])
        break
    mel_chunks.append(mel[:, start_idx : start_idx + mel_step_size])
    i += 1

full_frames = full_frames[:len(mel_chunks)]
batch_size = wav2lip_batch_size
gen = datagen(full_frames.copy(), mel_chunks)

frame_h, frame_w = full_frames[0].shape[:-1]
out = cv2.VideoWriter(os.path.join("temp", path_id, "result.avi"),
                      cv2.VideoWriter_fourcc(*'DIVX'), fps,
                      (frame_w, frame_h))  # exact output path/codec may differ

# feed each batch of frames and mel chunks to the Wav2Lip model
for img_batch, mel_batch, frames, coords in gen:
    img_batch = torch.FloatTensor(np.transpose(img_batch,
                                               (0, 3, 1, 2))).to(device)
    mel_batch = torch.FloatTensor(np.transpose(mel_batch,
                                               (0, 3, 1, 2))).to(device)
    with torch.no_grad():
        pred = model(mel_batch, img_batch)
    pred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.
    # paste each predicted mouth crop back into its frame
    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
        f[y1:y2, x1:x2] = p
        out.write(f)
out.release()
```
First, I take the animated video from before. Then I take the speech file, convert it into WAV format (this is why any other input format works: it gets converted into WAV first), and then into a mel spectrogram using the librosa library. The mel spectrogram is then broken up into chunks and converted into batches, alongside the frames from earlier.

These batches are then fed into the model to generate a new set of frames with the correct lip movements according to the speech. These frames are finally compiled into a video, together with the original speech file, to create the output video.
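The chunking logic can be exercised on its own with a synthetic spectrogram. Wav2Lip's usual values of 80 mel bins and a mel_step_size of 16 are assumed here:

```python
import numpy as np

def chunk_mel(mel, fps, mel_step_size=16):
    # Split an (80, T) mel spectrogram into one fixed-width chunk per video frame
    chunks = []
    mel_idx_multiplier = 80.0 / fps  # mel frames advanced per video frame
    i = 0
    while True:
        start_idx = int(i * mel_idx_multiplier)
        if start_idx + mel_step_size > mel.shape[1]:
            # last chunk: back off so it still has exactly mel_step_size columns
            chunks.append(mel[:, mel.shape[1] - mel_step_size:])
            break
        chunks.append(mel[:, start_idx:start_idx + mel_step_size])
        i += 1
    return chunks

mel = np.zeros((80, 200))        # synthetic spectrogram: 80 bins, 200 mel frames
chunks = chunk_mel(mel, fps=25)  # one (80, 16) chunk per output video frame
```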
Persona talking head: Apple closing its Infinite Loop retail store (talking head created by Persona, on youtube.com).
Improving the video

To improve the video quality, we need to break the video down into frames first, then use a technique to make each frame image look better, then reassemble the frames back into a video.
There are a number of techniques that can be used to make the images higher resolution; many of them are GAN-based image restoration techniques. I found that Real-ESRGAN works pretty well for me, so that's the one I used.
First of all, we need to break down the video into frames using vid2frames .
Now that we have a bunch of image files in a directory, we need to take each of them
and improve them using Real-ESRGAN.
I used t_map to wrap tqdm around the real_esrgan function in order to show
progress.
The improved image files are placed in another directory (this can take some time, since I didn't run it in parallel). Once it's done, I can use restore_frames to combine the improved images and the speech audio file into a final output video.
```python
import os
from mutagen.wave import WAVE  # assuming the WAVE class comes from mutagen

def get_audio_duration(audioPath):
    audio = WAVE(audioPath)
    duration = audio.info.length
    return duration

def count_files(directory):
    return len([name for name in os.listdir(directory)
                if os.path.isfile(os.path.join(directory, name))])
```
Running Persona
Now that we have all the pieces in place, let’s see how to run Persona. Remember,
this is not a web application, it’s just a script that puts various pieces of code
together by calling functions.
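At the top level, the script essentially chains the steps from the earlier sections. This is my sketch of that flow; the function signatures and file names are assumptions (only generate_image, animate_face and modify_lips are named in the article, and generate_speech is a name I made up):

```python
import uuid

def make_talking_head(text, avatar_description, driver_video):
    # Each run gets its own path_id so intermediate files don't collide
    path_id = str(uuid.uuid4())
    generate_speech(text)                                      # 1. Tortoise-TTS
    generate_image(path_id, "avatar.jpg", avatar_description)  # 2. SDXL-Turbo
    animate_face(path_id, "avatar.jpg", driver_video)          # 3. face animation
    modify_lips(path_id, "temp.wav", "animated.mp4")           # 4. Wav2Lip
    return path_id
```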
First, you need to clone the repo from GitHub and install the required
dependencies.
Next, you need to download the following weights and PyTorch files.
3. Real-ESRGAN weights — create a folder named weights and place the file in
there
Once all dependencies are installed, you can run Persona by calling the persona.py script.

```shell
$ python persona.py
```
Running it the first time takes a while because other than the files you downloaded
earlier, the script will automatically download other necessary weights.
A new folder named temp will be created to store all temporary files, and a new folder named results will be created to store the final resulting videos. Inside the temp folder, a new folder with a path_id will be created to store all the intermediate files generated while creating the talking head video.
If you want to use your own image for your talking head video, you can do this.
If you want to use your own speech file, you can do this.
Running it this way only creates the smaller videos in the results directory. If you
want the larger video, you can do this.
This runs the normal generation, followed by the improvement step. The
improvement step can be pretty slow (> 20 minutes).
What if you generated a smaller video but now want to improve it into a larger one? I've got you covered for that too.
The skipgen flag tells Persona to skip all the generation, and the improve flag tells
Persona to improve the video. However you still need to tell Persona which video
file to use and where to store the temporary frames, so you need to provide a
path_id as well.
Hardware
A note about the hardware I ran this on. So far I've only tried this on an Intel x86_64 machine with Nvidia GPUs, using CUDA. I spun up an n1-highmem-16 (16 vCPUs, 8 cores, 104 GB memory) instance on Google Cloud with 2 x T4 GPUs and ran my experiments on it.
Depending on the amount of text spoken, the speed can differ greatly. In fact, I usually find generating the speech to be the slowest part. If you're impatient and want to use something else, you can try OpenAI's text-to-speech, which is pretty fast (I have some commented-out code in speech.py you can uncomment). You can also try any other text-to-speech generator, or even record your own!
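For reference, an OpenAI text-to-speech call is only a few lines. The model and voice names below are OpenAI's published ones; whether this matches the commented-out code in speech.py is an assumption:

```python
def generate_speech_openai(text, out="temp.mp3"):
    # Lazy import so the module loads without the openai package installed
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text)
    # MP3 output is fine here: Persona converts inputs to WAV anyway
    response.write_to_file(out)
```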
I also tried other GPUs; the higher-end ones generally run faster, but they are also a LOT more expensive, so beware.
Final thoughts
I spent a couple of days over the Christmas break mucking around with talking head videos. It was a fascinating journey and I learnt a lot. I hope you enjoy playing around with the project as well!
Libraries
Here are the libraries used.
Generating speech
GitHub - yangxy/GPEN (github.com)

GitHub - zhanglonghao1992/One-Shot_Free-View_Neural_Talking_Head_Synthesis: PyTorch implementation of the paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" (github.com)