Creating Talking Head Videos With Generative AI - by Sau Sheong
1/6/24, 1:16 PM Creating talking head videos with generative AI | by Sau Sheong | Dec, 2023 | Medium
https://medium.com/@sausheong/creating-talking-head-videos-with-generative-ai-2df3947fd506

Talking heads are exactly what they sound like: a person talking in front of a video camera, showing mostly the head and sometimes up to the shoulders or even the torso. If you watch any TV or social media, you're likely to have seen them. They are pretty popular among social media video creators and are often used for product reviews, training videos, explainers, newscasting, reporting and so on.
There are a number of AI services that can create pretty amazing talking head
videos that speak in any number of languages, and they are all quite mind-
bogglingly good. As usual, I was curious — how did they do it, and can I recreate
something similar?
Well, the answer is obviously yes since I’m writing this article. Of course, my
attempt is far from the efforts of those well-funded companies, but I believe I come
reasonably close.
Let me give you a couple of quick samples of the output before jumping in. Here’s
one closer to the season with Santa Claus sending his Christmas greetings.
Here’s another one, in Chinese, just to show it works in multiple languages, sending
greetings for the upcoming Chinese New Year.
The overall approach takes a few steps:

1. Generate the speech from the text

2. Generate a still image of the talking head

3. Animate the still image using a driver video

4. Generate the moving lips based on the speech and superimpose them on the talking head video, along with the speech

5. Improve the quality of the final video

The last bit is optional, but if you're uploading your video to social media you'll want decent quality to share.
If you're amazed that I was able to come up with all of the above, I should say upfront that I didn't do any of it on my own. I just took existing algorithms and code and stitched them together in a way that makes sense to generate the talking head. Each algorithm and project is amazing on its own, but putting them all together has quite a different effect altogether. It was deeply satisfying too.

I put all the code in a project called Persona. I'll show how it works in a bit, and how to use it to generate talking head videos later.

For now, let's take a look at the first step: generating the speech.
The code is not mine; I just tweaked it. It's pretty straightforward, though. First, I create a TextToSpeech instance. Then I load the voices from the library of voices provided by Tortoise-TTS. You can actually create your own voices: you just need to record at least 3 snippets of at least 10 seconds each into WAV files and place them in the voices directory.

Then I use the voices to generate the speech, and finally I use torchaudio to save it to a file.
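The original code listing doesn't survive in this copy, so here is a minimal sketch of that flow, assuming Tortoise-TTS's standard Python API (TextToSpeech, load_voices, tts_with_preset) and its 24 kHz output. The helper names and the voice name here are mine, not necessarily Persona's:

```python
import os

# The three generation presets mentioned below, fastest to slowest
PRESETS = ("ultra_fast", "fast", "standard")

def speech_path(path_id, filename="temp.wav"):
    # Temporary files live under temp/<path_id>/ (see the Running Persona section)
    return os.path.join("temp", path_id, filename)

def generate_speech(text, voice="myvoice", preset="ultra_fast", out="temp.wav"):
    # Heavy imports stay inside the function so the module loads without a GPU
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voices

    tts = TextToSpeech()
    # load the reference clips for the chosen voice from the voices directory
    voice_samples, conditioning_latents = load_voices([voice])
    gen = tts.tts_with_preset(text,
                              voice_samples=voice_samples,
                              conditioning_latents=conditioning_latents,
                              preset=preset)
    # Tortoise generates 24 kHz audio; save it as a WAV file
    torchaudio.save(out, gen.squeeze(0).cpu(), 24000)
```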
The code will generate a WAV file called temp.wav which contains the speech. The ultra_fast preset is the fastest; you can also use fast or standard (which is the slowest).

Tortoise-TTS is powerful but unfortunately can be quite slow. If you prefer to use another text-to-speech generator, please feel free. While this code produces WAV files, you can use other formats such as MP3 as input too. I'll show you how later.
I chose the SDXL-Turbo model because it’s an exciting new model that promises to
generate images really quickly. I was quite blown away the first time I tried it and
even though it didn’t run as fast as I thought it would on my own hardware, it was
fast enough.
The code was also dead simple, using HuggingFace’s diffusers library.
```python
avatar_description = "Young Indian man with short dark hair, serious look"
generate_image(path_id, imgfile, f"hyperrealistic digital avatar, centered, \
```
You might be wondering why I broke the prompt up into 2 pieces. It's basically to ensure that whatever the input is later on, I'll always create a talking head.
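Here's a sketch of how the two-piece prompt and the diffusers call can fit together, using AutoPipelineForText2Image with SDXL-Turbo's recommended single inference step and no guidance. The exact prefix wording beyond what the fragment above shows is an assumption:

```python
def build_prompt(avatar_description):
    # A fixed prefix keeps the output a centered talking-head portrait,
    # whatever description the user supplies later
    return "hyperrealistic digital avatar, centered, " + avatar_description

def generate_image(imgfile, prompt):
    # Heavy imports stay inside the function so the module loads without a GPU
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
    pipe.to("cuda")
    # SDXL-Turbo is built for a single inference step with guidance disabled
    image = pipe(prompt=prompt, num_inference_steps=1,
                 guidance_scale=0.0).images[0]
    image.save(imgfile)
```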
I decided to look for ways to animate the head, and I found an interesting project. In fact, I found a project that combined 2 other projects to produce the effect of an animated talking head that I wanted.

The combined project, called Face Animation in Real Time, consists of 2 separate projects:

1. One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

2. GAN Prior Embedded Network for Blind Face Restoration in the Wild

I took that combined project and tweaked it to make it work for me. Let's take a step back and explain what these 2 projects do.
The first one, One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing, takes a single still image and, using another video (called a driver video), reproduces the motion of that video on the still image. As you can tell, this is the crux of what animating the talking head is all about.

The second project, GAN Prior Embedded Network for Blind Face Restoration in the Wild, does blind face restoration on the video output of the first project, making the end result a lot more natural.
This is the animate_face function that collates all the necessary pieces together and
generates an animated video file from a still image and a driver video.
```python
tmpfile = f"temp/{path_id}/tmp.mp4"
duration = get_audio_duration(os.path.join("temp", path_id, audiofile))
hms = seconds_to_hms(duration)

# read every frame of the (trimmed) driver video
capture = cv2.VideoCapture(tmpfile)
fps = capture.get(cv2.CAP_PROP_FPS)
frames = []
_, frame = capture.read()
while frame is not None:
    frames.append(frame)
    _, frame = capture.read()
capture.release()

# apply the face animation model to each frame
output_frames = []
for frame in tqdm(frames):
    result = faceanimation.inference(frame)
    output_frames.append(result)

# write the animated frames out as an H.264 video
writer = imageio.get_writer(os.path.join("temp", path_id, animatedfile),
                            fps=fps, quality=9, macro_block_size=1,
                            codec="libx264", pixelformat="yuv420p")
for frame in output_frames:
    writer.append_data(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
writer.close()
```
A driver video is nothing more than a short snippet of any talking head video. It
doesn’t need any audio. The driver video’s head and facial motions (blinking eyes,
raised eyebrows etc) will be used on top of the still image.
First, I create an instance of the FaceAnimationClass with the still image. Then I take the driver video and convert it into frames. For every frame, I use the instance of FaceAnimationClass to generate a new frame based on the still image. Finally, I take all the frames and write them to a new video.

Notice that before I start using the driver video, I use ffmpeg to trim it down to a smaller temporary driver file. This is because my driver video is about 1 minute long, and if I used it directly it would take more time to process. To reduce processing time, I trimmed the driver video to the same length as the speech.
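get_audio_duration appears later in the article; seconds_to_hms can be implemented as below, and the trim itself is a one-line ffmpeg call. The exact ffmpeg flags here are my assumption, not necessarily Persona's:

```python
def seconds_to_hms(seconds):
    # Convert a duration in seconds to ffmpeg's HH:MM:SS.mmm format
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}"

def trim_command(driver, hms, tmpfile):
    # -t cuts the driver video to the speech duration; -an drops its audio,
    # since the driver's sound is never used (flag choice is an assumption)
    return ["ffmpeg", "-y", "-i", driver, "-t", hms, "-an", tmpfile]
```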
A driver video (no sound needed) that I took from a news snippet.
This is the still image I used, generated by SDXL-Turbo based on the prompt “Young
Indian man with short dark hair, serious look”.
And this is the animated face that is produced by applying the driver video on the
still image.
You might notice that the lip movement is quite small. That doesn't matter, because we're going to superimpose another set of lips on it.
The modify_lips function puts together the speech and the animated video to
produce the combined output video.
```python
# read the animated video's frames
video_stream = cv2.VideoCapture(animatedfilePath)
fps = video_stream.get(cv2.CAP_PROP_FPS)
full_frames = []
while 1:
    still_reading, frame = video_stream.read()
    if not still_reading:
        video_stream.release()
        break
    if resize_factor > 1:
        frame = cv2.resize(frame, (frame.shape[1] // resize_factor,
                                   frame.shape[0] // resize_factor))
    if rotate:
        frame = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)
    full_frames.append(frame)

# mel was computed from the WAV speech with librosa (listing elided above)
if np.isnan(mel.reshape(-1)).sum() > 0:
    raise ValueError('Mel contains nan!')

# break the mel spectrogram up into one chunk per video frame
mel_chunks = []
mel_idx_multiplier = 80. / fps
i = 0
while 1:
    start_idx = int(i * mel_idx_multiplier)
    if start_idx + mel_step_size > len(mel[0]):
        mel_chunks.append(mel[:, len(mel[0]) - mel_step_size:])
        break
    mel_chunks.append(mel[:, start_idx : start_idx + mel_step_size])
    i += 1

full_frames = full_frames[:len(mel_chunks)]
batch_size = wav2lip_batch_size
gen = datagen(full_frames.copy(), mel_chunks)

frame_h, frame_w = full_frames[0].shape[:-1]
out = cv2.VideoWriter(os.path.join("temp", path_id, "result.avi"),
                      cv2.VideoWriter_fourcc(*'DIVX'), fps,
                      (frame_w, frame_h))  # exact output path/codec may differ

# feed each batch of frames and mel chunks to the Wav2Lip model
for img_batch, mel_batch, frames, coords in gen:
    img_batch = torch.FloatTensor(np.transpose(img_batch,
                                               (0, 3, 1, 2))).to(device)
    mel_batch = torch.FloatTensor(np.transpose(mel_batch,
                                               (0, 3, 1, 2))).to(device)
    with torch.no_grad():
        pred = model(mel_batch, img_batch)
    pred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.
    # paste each predicted mouth crop back into its frame
    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
        f[y1:y2, x1:x2] = p
        out.write(f)
out.release()
```
First, I take the animated video from before. Then I take the speech file, convert it into WAV format (this is why any other input format works: it gets converted into WAV first), and then into a mel spectrogram using the librosa library. The mel spectrogram is then broken up into chunks and converted into batches, alongside the frames from earlier.

These batches are then fed into the model to generate a new set of frames with the correct lip movements according to the speech. These frames are finally compiled into a video, together with the original speech file, to create the output video.
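The chunking logic can be exercised on its own with a synthetic spectrogram. Wav2Lip's usual values of 80 mel bins and a mel_step_size of 16 are assumed here:

```python
import numpy as np

def chunk_mel(mel, fps, mel_step_size=16):
    # Split an (80, T) mel spectrogram into one fixed-width chunk per video frame
    chunks = []
    mel_idx_multiplier = 80.0 / fps  # mel frames advanced per video frame
    i = 0
    while True:
        start_idx = int(i * mel_idx_multiplier)
        if start_idx + mel_step_size > mel.shape[1]:
            # last chunk: back off so it still has exactly mel_step_size columns
            chunks.append(mel[:, mel.shape[1] - mel_step_size:])
            break
        chunks.append(mel[:, start_idx:start_idx + mel_step_size])
        i += 1
    return chunks

mel = np.zeros((80, 200))        # synthetic spectrogram: 80 bins, 200 mel frames
chunks = chunk_mel(mel, fps=25)  # one (80, 16) chunk per output video frame
```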
Persona talking head: Apple closing its Infinite Loop retail store (talking head created by Persona, on youtube.com).
Improving the video

To improve the video quality, we need to break the video down into frames first, then use a technique to make each frame image look better, then reassemble the frames back into a video.
There are a number of techniques that can be used to make the images higher resolution; many of them are GAN-based image restoration techniques. I found that Real-ESRGAN works pretty well for me, so that's the one I used.
First of all, we need to break down the video into frames using vid2frames .
Now that we have a bunch of image files in a directory, we need to take each of them
and improve them using Real-ESRGAN.
I used t_map to wrap tqdm around the real_esrgan function in order to show
progress.
The improved image files are placed in another directory (this can take some time, since I didn't run it in parallel). Once it's done, I can use restore_frames to combine the improved images and the speech audio file into a final output video.
```python
import os
from mutagen.wave import WAVE  # assuming the WAVE class comes from mutagen

def get_audio_duration(audioPath):
    audio = WAVE(audioPath)
    duration = audio.info.length
    return duration

def count_files(directory):
    return len([name for name in os.listdir(directory)
                if os.path.isfile(os.path.join(directory, name))])
```
Running Persona
Now that we have all the pieces in place, let’s see how to run Persona. Remember,
this is not a web application, it’s just a script that puts various pieces of code
together by calling functions.
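At the top level, the script essentially chains the steps from the earlier sections. This is my sketch of that flow; the function signatures and file names are assumptions (only generate_image, animate_face and modify_lips are named in the article, and generate_speech is a name I made up):

```python
import uuid

def make_talking_head(text, avatar_description, driver_video):
    # Each run gets its own path_id so intermediate files don't collide
    path_id = str(uuid.uuid4())
    generate_speech(text)                                      # 1. Tortoise-TTS
    generate_image(path_id, "avatar.jpg", avatar_description)  # 2. SDXL-Turbo
    animate_face(path_id, "avatar.jpg", driver_video)          # 3. face animation
    modify_lips(path_id, "temp.wav", "animated.mp4")           # 4. Wav2Lip
    return path_id
```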
First, you need to clone the repo from GitHub and install the required
dependencies.
Next, you need to download the following weights and PyTorch files.
3. Real-ESRGAN weights — create a folder named weights and place the file in
there
Once all dependencies are installed, you can run Persona by calling the persona.py script.

```shell
$ python persona.py
```
Running it the first time takes a while because other than the files you downloaded
earlier, the script will automatically download other necessary weights.
A new folder named temp will be created to store all temporary files, and a new folder named results will be created to store the final resulting videos. Inside the temp folder, a new folder with a path_id will be created to store all the intermediate files generated while creating the talking head video.
If you want to use your own image for your talking head video, you can do this.
If you want to use your own speech file, you can do this.
Running it this way only creates the smaller videos in the results directory. If you
want the larger video, you can do this.
This runs the normal generation, followed by the improvement step. The
improvement step can be pretty slow (> 20 minutes).
What if you generated a smaller video but now want to improve it into a larger one? I've got you covered for that too.
The skipgen flag tells Persona to skip all the generation, and the improve flag tells
Persona to improve the video. However you still need to tell Persona which video
file to use and where to store the temporary frames, so you need to provide a
path_id as well.
Hardware
A note about the hardware I ran this on. So far I've only tried this on an Intel x86_64 machine with Nvidia GPUs, using CUDA. I spun up an n1-highmem-16 (16 vCPUs, 8 cores, 104 GB memory) instance on Google Cloud with 2 x T4 GPUs and ran my experiments on it.
Depending on the amount of text spoken, the speed can differ greatly. In fact, I usually find generating the speech to be the slowest part. If you're impatient and want to use something else, you can try OpenAI's text-to-speech, which is pretty fast (I have some commented-out code in speech.py you can uncomment). You can also try any other text-to-speech generator, or even record your own!
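For reference, an OpenAI text-to-speech call is only a few lines. The model and voice names below are OpenAI's published ones; whether this matches the commented-out code in speech.py is an assumption:

```python
def generate_speech_openai(text, out="temp.mp3"):
    # Lazy import so the module loads without the openai package installed
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text)
    # MP3 output is fine here: Persona converts inputs to WAV anyway
    response.write_to_file(out)
```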
I also tried other GPUs; the higher-end ones generally run faster, but they are also a LOT more expensive, so beware.
Final thoughts
I spent a couple of days over the Christmas break mucking around with talking head videos. It was a fascinating journey and I learnt a lot. I hope you enjoy playing around with the project as well!
Libraries
Here are the libraries used.
Generating speech
GitHub - yangxy/GPEN (github.com)

GitHub - zhanglonghao1992/One-Shot_Free-View_Neural_Talking_Head_Synthesis: PyTorch implementation of the paper "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" (github.com)