Presentation On A Deep Learning Approach To Learn Lip Sync From Audio

Pre-defense

A Deep Learning Approach to Learn Lip Sync from Audio

Presented by:
Milon Mahato (171-15-1472)
Nazmul Hassan (171-15-1487)
Habibur Rahman (171-15-1471)
Mazharul Islam (171-15-1425)

Supervised by:
Md. Reduanul Haque, Sr. Lecturer, Department of CSE, Daffodil International University
Md. Mahfujur Rahman, Lecturer, Department of CSE, Daffodil International University

Saturday, 05 December 2020


Table of Contents
❖ Introduction
❖ Motivation
❖ Objectives
❖ Methodology
❖ Outcome
❖ References

Introduction
With the rapid advancement of information technology, media has undergone an epoch-making shift in recent years toward a new era of audio-visual content, driven by the growing popularity and importance of creative media generation.

In particular, deep-learning-based voice-to-video synchronization, language dubbing with accurate lip synchronization, 3D film and animation production, and game characters driven by the voices of well-known figures are in very high demand. In practice, however, creating such content remains complex and challenging.

Figure 1: Examples of face manipulation

Motivation
• Our approach synthesizes video from audio only in the region around the mouth, and uses compositing techniques to borrow the rest of the head and torso from stock video (a minimal compositing sketch follows below).

• Our compositing approach is similar to Wav2Lip and Face2Face, although Face2Face transfers the mouth region from another video, whereas we synthesize the mouth shape directly from audio.
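
The compositing idea above can be illustrated with a short sketch. This is a hedged illustration only, not the exact pipeline of this work: the file names, patch placement, and mask geometry are assumptions, and OpenCV's seamless cloning stands in for whatever blending the final system uses.

# Hedged sketch: blend a synthesized lower-face patch into a stock frame.
# File names and coordinates below are illustrative assumptions.
import cv2
import numpy as np

target = cv2.imread("stock_frame.png")        # head and torso from stock video
mouth_patch = cv2.imread("synth_mouth.png")   # mouth region synthesized from audio

# In a real pipeline, (x, y) would come from detected face landmarks.
x, y = 180, 240
h, w = mouth_patch.shape[:2]
center = (x + w // 2, y + h // 2)

# Soft elliptical mask so the seam around the jaw and chin blends smoothly.
mask = np.zeros((h, w), dtype=np.uint8)
cv2.ellipse(mask, (w // 2, h // 2), (w // 2 - 4, h // 2 - 4), 0, 0, 360, 255, -1)

composite = cv2.seamlessClone(mouth_patch, target, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composited_frame.png", composite)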

Objectives
• Generating photorealistic mouth texture that preserves fine detail in the lips and teeth, and reproduces time-varying wrinkles and dimples around the mouth and chin.

• Synthesizing mouth shape from audio with a network trained on millions of video frames, in a way that is significantly simpler than prior methods (a minimal sketch follows below).
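
As a rough illustration of the second objective, the sketch below maps per-frame audio features (e.g., MFCCs) to a small set of mouth landmarks with a single LSTM, in the spirit of Suwajanakorn et al. [4]. It is an assumption for illustration only; the layer sizes, the number of MFCC coefficients, and the 18-landmark mouth representation are hypothetical, not the trained model of this work.

# Hedged sketch: audio features -> mouth landmark coordinates, per frame.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_landmarks=18):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)   # (x, y) per landmark

    def forward(self, mfcc):              # mfcc: (batch, time, n_mfcc)
        out, _ = self.lstm(mfcc)
        return self.head(out)             # (batch, time, n_landmarks * 2)

model = AudioToMouth()
dummy_audio = torch.randn(1, 100, 13)     # 100 audio frames of 13 MFCCs
mouth_shapes = model(dummy_audio)
print(mouth_shapes.shape)                 # torch.Size([1, 100, 36])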

Methodology
Our Wav2Lip-based model produces significantly more accurate lip synchronization in dynamic, unconstrained talking-face videos. Quantitative metrics indicate that the lip sync in our generated videos is almost as good as that of real, naturally synced videos.

(credit: Cornell University, New York)
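
The quantitative evaluation can be sketched as follows. The published Wav2Lip evaluation uses a pretrained SyncNet to compute lip-sync scores; the toy function below only illustrates the idea of scoring audio-video embedding distance across temporal offsets. The embeddings, the offset range, and the exact score definitions here are illustrative assumptions, not the official metric code.

# Hedged sketch of a SyncNet-style lip-sync score: distance at the best
# audio-video offset (lower is better) and a confidence margin over offsets.
import torch
import torch.nn.functional as F

def lip_sync_scores(audio_emb, video_emb, max_offset=15):
    """audio_emb, video_emb: (time, dim) unit-normalised embeddings."""
    dists = []
    for off in range(-max_offset, max_offset + 1):
        a = audio_emb[max(off, 0): len(audio_emb) + min(off, 0)]
        v = video_emb[max(-off, 0): len(video_emb) + min(-off, 0)]
        dists.append(F.pairwise_distance(a, v).mean())
    dists = torch.stack(dists)
    dist_score = dists.min()                   # distance at the best offset
    confidence = dists.median() - dist_score   # margin over typical offsets
    return dist_score.item(), confidence.item()

audio_emb = F.normalize(torch.randn(200, 512), dim=1)
video_emb = F.normalize(torch.randn(200, 512), dim=1)
print(lip_sync_scores(audio_emb, video_emb))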

Outcome

References

[1] A. Jamaludin, J. S. Chung and A. Zisserman, "You said that?: Synthesising talking faces from
audio," International Journal of Computer Vision, vol. 127, no. 11-12, pp. 1767-1779, 2019.

[2] Y. Chen, W. Gao, Z. Wang, J. Miao and D. Jiang, "Mining audio/visual database for speech driven
face animation," in 2001 IEEE International Conference on Systems, Man and Cybernetics. e-
Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236), 2001.

[3] T. Karras, T. Aila, S. Laine, A. Herva and J. Lehtinen, "Audio-driven facial animation by joint end-to-end learning of pose and emotion," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1-12, 2017.

[4] S. Suwajanakorn, S. M. Seitz and I. Kemelmacher-Shlizerman, "Synthesizing Obama: Learning lip sync from audio," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1-13, 2017 (SIGGRAPH 2017).
