
A

PROJECT REPORT
ON

“VOICE CHANGING SYSTEM USING MACHINE LEARNING”

SUBMITTED IN PARTIAL FULFILLMENT OF THE


REQUIREMENTS OF DEGREE OF
BACHELOR OF ENGINEERING
BY

SAHIL BONDE BE A 11

SUPERVISOR
PROF. SNEHAL SHINDE

DEPARTMENT OF COMPUTER ENGINEERING


Pillai HOC College of Engineering and Technology, Rasayani
Pillai’s HOC Educational Campus, HOCL Colony,
Rasayani, Tal: Khalapur, Dist: Raigad- 410 207
UNIVERSITY OF MUMBAI
[2023-24]
Mahatma Education Society’s
Pillai HOC College of Engineering and Technology,
Rasayani-410207
[2023-24]

Certificate

This is to certify that the project entitled “Voice Changing system using
machine learning” is a bonafide work done by Sahil Bonde and is
submitted in partial fulfillment of the requirements for the degree of
Bachelor of Engineering in Computer Engineering to the University of
Mumbai.

Prof. Snehal Shinde Dr. Rajashree Gadhave


Supervisor Project Coordinator

Prof. Rohini Bhosale Dr. Jagdish W. Bakal


Head of Department Principal
Project Report Approval for B.E.

This project report entitled “Voice Changing system using machine
learning” by Sahil Bonde is approved for the degree of Bachelor of
Engineering in Computer Engineering of the University of Mumbai.

Examiners:

Internal Examiner Name and Sign:

External Examiner Name and Sign:

Date:

Place:
Declaration

We declare that this written submission represents our ideas in our own
words and, where others’ ideas or words have been included, we have
adequately cited and referenced the original sources. We also declare that we
have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in our
submission. We understand that any violation of the above will be cause for
disciplinary action by the Institute and can also evoke penal action from the
sources which have thus not been properly cited or from whom proper
permission has not been taken when needed.

Sahil Bonde
Abstract

Machine learning, a transformative field within artificial intelligence,
empowers computers to learn from data and make predictions or decisions
without explicit programming. Real-time voice conversion using machine
learning is a ground-breaking technology, facilitating the instant
transformation of one person’s voice into another’s during live communication.
Such a system can adeptly adjust pitch, tone, accent, and even gender,
creating a seamlessly natural voice transformation. Complementing this, pitch
extraction algorithms play a fundamental role in speech and audio processing,
identifying the fundamental frequency associated with voice pitch or musical
notes. The Crepe technique is a crucial component in real-time voice
conversion projects: it works by extracting the harmonic content of a source
speaker’s voice and modifying it to match a target speaker’s pitch. The
Harvest technique, vital for real-time pitch analysis and modification,
further elevates the capabilities of voice transformation. Voice conversion
involves transferring the fundamental characteristics of one speaker’s
identity to another speaker while maintaining the original speech content.
Through efficient computational techniques and resource management, the
system endeavors to offer an economical solution for users desiring superior
voice transformations.

Keywords: Generative Adversarial Networks (GANs); Convolutional Neural
Network (CNN); Frechet Inception Distance (FID); Retrieval-Based Voice
Conversion (RVC).

Abbreviations

CNN-VC - Convolutional Neural Network Voice Conversion

DNN-VC - Deep Neural Network Voice Conversion

FID - Frechet Inception Distance

RVC - Retrieval-Based Voice Conversion

VCM - Voice Conversion Model

List of Figures

2.1 Literature Survey ..................................................................... 6
4.2 Gantt Chart .............................................................................. 13
4.3 Implemented System ............................................................... 14
5.1 Use Case Diagram ................................................................... 16
5.2 Class Diagram .......................................................................... 19
5.3 Activity Diagram ..................................................................... 20
5.4 Sequence Diagram ................................................................... 21
6.1.1 DFD Level 0 ......................................................................... 23
6.1.2 DFD Level 1 ......................................................................... 23
6.2 Flow Chart ............................................................................... 24
7.1 System Architecture ................................................................ 26
8.1.1 User Interface ....................................................................... 30
8.1.2 Configure Settings ................................................................ 30
8.1.3 Upload Trained RVC Models ............................................... 31
8.1.4 Select RVC Models .............................................................. 31
Plagiarism Report ................................................................... 37

List of Tables

2.1 Literature Survey Table ............................................................ 7

5.1 UseCase Document ......................................................................... 18

Contents

Abstract i

Abbreviations ii
List of Figures ................................................................................... iii
List of Tables ..................................................................................... iv

1 Introduction 1
1.1 Background ............................................................................... 2
1.2 Relevance .................................................................................. 2
1.3 Organization of Project Report ................................................. 2

2 Literature survey 4
2.1 Related Work ............................................................................ 5
2.2 Existing System ........................................................................ 7
2.3 Problem Statement .................................................................... 8

3 Requirement Gathering 9
3.1 Software and Hardware Requirements ....................................... 10

4 Plan Of Project 11
4.1 Methodology.............................................................................. 12
4.2 Project Plan (Gantt Chart) .......................................................... 13
4.3 Implemented System.................................................................. 14

5 Project Analysis 15
5.1 Use Case Diagram ..................................................................... 16
5.2 Use Case Document ................................................................... 17
5.3 Class Diagram ........................................................................... 19
5.4 Activity Diagram ....................................................................... 20
5.5 Sequence Diagram ..................................................................... 21
6 Project Design 22
6.1 Data Flow Diagram .................................................................... 23
6.2 Flow Chart ................................................................................. 24

7 Implemented System 25
7.1 System Architecture ................................................................... 26
7.2 Sample code............................................................................... 27

8 Result Analysis 29
8.1 Result Analysis .......................................................................... 30

9 Conclusion And Future Scope 32


9.1 Conclusion ................................................................................. 33
9.2 Future Scope .............................................................................. 33

References 34

Acknowledgment 35

Appendix I: List of Publication 36

Appendix II: Plagiarism Report of Paper 37


Chapter 1

Introduction


1.1 Background

The emergence of generative Artificial Intelligence (AI) in recent years has
brought about significant implications. Cutting-edge systems can now convert a
speaker’s voice into another in real time through microphones and sophisticated
deep learning models. This capability, once confined to science fiction, is now
achievable using consumer-level computing technology. While this technology
holds promise for entertainment purposes, its advancements also present a
formidable security threat. Voice serves as a primary method for humans to
recognize each other in social contexts, often without question. Moreover, voice
recognition plays a vital role in biometric authentication. Consequently, the
ability to clone and manipulate voices could be exploited unethically, leading to
breaches in privacy and security. This manipulation opens the door to potential
misrepresentation and identity theft, necessitating urgent attention from the
scientific community. Therefore, addressing these ethical and security concerns
related to voice conversion technology is paramount. This entails developing
robust safeguards, such as improved authentication methods or encryption
techniques, to mitigate the risks of exploitation.

1.2 Relevance

Entertainment Industry: Voice changing systems are widely used in the
entertainment industry for dubbing, character voice generation in animations and
video games, and creating voice effects in movies and music. Machine learning
techniques enable realistic and flexible manipulation of voices, enhancing the
creative possibilities for content creators.
Telecommunications and Voice Assistants: In telecommunications, voice
changing systems can improve user experience by allowing users to personalize
their voice during calls or video conferences. Additionally, in voice assistant
applications, such as virtual agents and chatbots, voice conversion technology can
enable the creation of more engaging and diverse conversational interfaces.
Security and Privacy: Voice changing systems have implications for security
and privacy. They can be used to anonymize speakers’ identities in sensitive
communications or protect individuals’ privacy in public broadcasts or
online interactions. Conversely, they also raise concerns about voice
impersonation and the potential for misuse in fraud or social engineering attacks,
highlighting the need for robust authentication and detection mechanisms.
Accessibility and Assistive Technology: Voice conversion technology can
benefit individuals with speech disabilities by enabling them to communicate using
synthesized voices that better match their identities or preferences. Moreover, it
can aid language learners by providing opportunities to practice speaking in
different accents or dialects.
Research and Development: Voice changing systems serve as a valuable
research tool for studying speech synthesis, voice perception, and
human-computer interaction. Advancements in machine learning techniques for
voice conversion contribute to our understanding of speech processing and
pave the way for future innovations in artificial intelligence and natural
language processing.

1.3 Organization of Project Report

The remainder of this report is organized as follows. Chapter 2 surveys the
related literature and states the problem. Chapter 3 lists the software and
hardware requirements. Chapter 4 presents the project plan and methodology.
Chapter 5 analyzes the project through UML diagrams, and Chapter 6 presents
the project design. Chapter 7 describes the implemented system, Chapter 8
analyzes the results, and Chapter 9 concludes and outlines the future scope.


Chapter 2

Literature survey


2.1 Related Work

In recent years, the field of voice conversion has seen increasing attention,
driven by advances in deep learning and speech synthesis. In this section,
we review related work in the area of voice conversion, with a focus on deep
learning-based approaches and their challenges.
1. Voice Conversion using Deep Learning
In this research, Albert Aparicio Isarn and Antonio Bonafonte present a first
attempt at a voice conversion system based on deep learning in which the
alignment between the training data is intrinsic to the model. The system is
structured in three main blocks: the first performs a vocoding of the speech
(using the Ahocoder tool) and a normalization of the data.
2. An overview of voice conversion and its challenges
Sisman, B., Yamagishi, J., King, S., and Li, H. (2020) focus on the analysis of
voice conversion (VC), a significant aspect of artificial intelligence. It is the
study of how to convert one’s voice to sound like that of another without
changing the linguistic content. Voice conversion belongs to a general
technical field of speech synthesis, which converts text to speech or changes
the properties of speech.
3. A comparative study of voice conversion techniques
Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2021) describe voice
conversion using artificial intelligence as an essential field of science: the
science of transforming one voice to sound like another person’s voice without
changing the linguistic content [1]. Voice conversion belongs to the general
technical field known as speech synthesis, which converts text into speech or
changes speech properties, such as voice identity, emotion, or accent.
4. Voice Conversion Based on Deep Neural Networks for Time-Variant
Linear Transformations
Chien-Yu Huang, Yist Y. Lin, Hung-Yi Lee, and Lin-Shan Lee (2020) note that
to achieve a high-quality conversion, it is a reasonable strategy to collect a
large amount of training data, annotate it, and train a big conversion model.
However, considering the cost of data collection and of human and
computational resources, it is also worth investigating how to build a
conversion model from a limited amount of data.


Figure 2.1: Literature Survey


2.2 Existing System

Summary of literature survey:

The literature survey examines various approaches to voice conversion. One
study presents a deep learning-based voice conversion system in which the
alignment between training data is intrinsic to the model, built around
vocoding and data normalization. Another provides an overview of voice
conversion as a branch of speech synthesis, highlighting the challenge of
changing speaker identity without altering the linguistic content. A
comparative study situates voice conversion within the broader field of
speech synthesis, covering properties such as voice identity, emotion, and
accent. Finally, work on deep neural network-based conversion identifies the
cost of data collection and the difficulty of building a conversion model
from a limited amount of training data as significant obstacles in this field.


2.3 Problem Statement

In the context of voice-to-voice conversion, the current state-of-the-art
systems exhibit limitations in accurately preserving the linguistic content and
naturalness of the converted speech. Furthermore, the lack of robustness in
handling diverse speaking styles, emotional variations, and linguistic
contexts poses a significant challenge in achieving seamless and high-fidelity
voice conversion across different speakers and scenarios. The existing
systems may fail to adequately account for the variations in pitch, timbre, and
prosodic features, resulting in unnatural and distorted converted speech that
does not convincingly resemble the desired target speaker. The system should
aim to achieve a high degree of perceptual similarity to the target speaker
while maintaining the coherence and consistency of the converted output
across different linguistic contexts and speaking scenarios.


Chapter 3

Requirement Gathering


3.1 Software and Hardware Requirements

Here we will discuss everything we will need in order to execute the project.
Below we list the necessary hardware and software requirements.
1. Software Requirements:

• Programming Language: Python

• Environment: Windows command line (CMD)

2. Hardware Requirements:

• Desktop/Laptop Processor: Intel Core i7 processor or above

• Processor Speed: Minimum 2 GHz; Recommended 3 GHz or more

• Ethernet connection (LAN) or a wireless adapter (Wi-Fi)

• Hard Drive: Minimum 500 GB; Recommended 1 TB or more

• Memory (RAM): Minimum 16 GB; Recommended GB or above


Chapter 4

Plan Of Project


4.1 Methodology

Crepe algorithm: CREPE (“Convolutional Representation for Pitch Estimation”)
is a deep learning-based method designed for real-time or near real-time
estimation of the fundamental frequency (pitch) of monophonic audio signals.
Crepe offers promising results in pitch estimation tasks, particularly in
scenarios requiring real-time processing and robustness to noise. Its deep
learning-based approach allows it to capture complex temporal patterns in
the audio signal, leading to accurate and efficient pitch estimation
performance.
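
As a minimal illustration (not part of the implemented system’s code), the
open-source crepe Python package exposes this estimator directly; the file
name input.wav below is a placeholder:

import crepe
from scipy.io import wavfile

# Load a mono WAV file (input.wav is a placeholder name).
sr, audio = wavfile.read("input.wav")

# CREPE returns a pitch track: timestamps, F0 in Hz, a per-frame
# confidence score, and the raw network activations; viterbi=True
# smooths the pitch trajectory over time.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)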
Harvest Algorithm: The Harvest algorithm is a pitch (F0) estimation method
used in speech analysis-synthesis systems such as the WORLD vocoder. It is
specifically designed for estimating the fundamental frequency of speech
signals. Overall, the Harvest algorithm plays a critical role in extracting
pitch information from speech signals, enabling a wide range of applications
in voice processing and analysis.
Training Process: Our model underwent training for 500,000 steps, equivalent
to 53 epochs, utilizing a solitary RTX 3090 GPU over a span of three days,
with a batch size of 9. Employing an AdamW optimizer alongside an exponential
learning rate scheduler, we stabilized training by clipping gradient norms
to 1.
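
As an illustrative PyTorch sketch of this setup (the network and loss below
are stand-ins; the optimizer, scheduler type, batch size, and gradient
clipping follow the description above, while the learning rate and decay
factor are assumed values):

import torch

model = torch.nn.Linear(256, 256)  # stand-in for the conversion network
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)  # learning rate assumed
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)  # decay assumed

for step in range(1000):  # the actual run used 500,000 steps
    batch = torch.randn(9, 256)        # batch size 9, as described above
    loss = model(batch).pow(2).mean()  # placeholder loss
    opt.zero_grad()
    loss.backward()
    # Stabilize training by clipping gradient norms to 1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()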
Dio Algorithm: DIO is a fast fundamental frequency (F0) estimation algorithm
for speech signals. It is commonly employed in conjunction with the StoneMask
refinement algorithm for high-quality speech synthesis, particularly in the
context of the WORLD vocoder. The DIO algorithm plays a crucial role in
accurately estimating the fundamental frequency of speech signals, enabling a
wide range of applications in voice processing, analysis, and synthesis.
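
Both estimators are available in the pyworld package, which the sample code
in Section 7.2 installs. A minimal sketch, assuming a mono file input.wav:

import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("input.wav")                   # placeholder file name
x = np.ascontiguousarray(x, dtype=np.float64)  # pyworld expects float64

f0_dio, t = pw.dio(x, fs)                # fast initial F0 estimate
f0_dio = pw.stonemask(x, f0_dio, t, fs)  # StoneMask refinement

f0_harvest, t2 = pw.harvest(x, fs)       # slower but more robust estimate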
Evaluation Metrics: Crucial for assessing the retrieval-based voice conversion
system’s performance, evaluation metrics should provide a comprehensive
evaluation. A combination of objective and subjective metrics captures both
human perception of voice quality and quantitative measures of various
aspects of the voice conversion process.
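
One common objective metric for voice conversion is mel-cepstral distortion
(MCD) between converted and target speech. The sketch below is a simplified
illustration using pyworld; the file names are placeholders, and frames are
aligned by simple truncation, whereas real evaluations typically align frames
with dynamic time warping:

import numpy as np
import pyworld as pw
import soundfile as sf

def mel_cepstrum(path, order=24):
    # Extract a mel-cepstral representation of the spectral envelope.
    x, fs = sf.read(path)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    return pw.code_spectral_envelope(sp, fs, order)

ref = mel_cepstrum("target.wav")
conv = mel_cepstrum("converted.wav")
n = min(len(ref), len(conv))   # crude alignment by truncation
diff = ref[:n] - conv[:n]
mcd = (10.0 / np.log(10)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
print(f"Mel-cepstral distortion: {mcd:.2f} dB")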


4.2 Project Plan (Gantt Chart)

Figure 4.2: Gantt Chart

A Gantt chart is a visual project management tool that illustrates a project
schedule. It displays tasks or activities as bars along a timeline, showing
their start and end dates. Gantt charts help project managers and teams track
progress, allocate resources, and manage dependencies, aiding in effective
planning and execution.


4.3 Implemented System

Figure 4.3: Implemented System

The process involves several steps, including feature extraction, speaker
embedding, and synthesis. Feature extraction involves analyzing the input
speech to extract relevant features, such as pitch and spectral envelope.
Speaker embedding involves mapping the extracted features onto a
low-dimensional space representing the speaker’s unique characteristics.
Finally, synthesis involves using the speaker embedding to retrieve
prerecorded speech samples from the database and combine them to create the
output speech.
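
The extract-modify-synthesize flow can be illustrated with the WORLD vocoder
via pyworld. This sketch only shifts the pitch contour rather than performing
full speaker conversion; the four-semitone shift and the file names are
assumptions for demonstration:

import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("source.wav")
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)          # pitch contour
sp = pw.cheaptrick(x, f0, t, fs)   # spectral envelope
ap = pw.d4c(x, f0, t, fs)          # aperiodicity

f0_shifted = f0 * 2 ** (4 / 12)    # raise pitch by four semitones (assumed)
y = pw.synthesize(f0_shifted, sp, ap, fs)
sf.write("converted.wav", y, fs)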


Chapter 5

Project Analysis


5.1 Use Case Diagram

Figure 5.1: Use Case Diagram

This use case is composed of the following elements:

Real-Time Voice Conversion: Represents the system or software application
responsible for real-time voice conversion. Start Voice Conversion: A use
case where the user initiates the process of real-time voice conversion.
This might involve activating the voice converter system to begin processing
input audio in real time. Stop Voice Conversion: A use case where the user
terminates the real-time voice conversion process. This might involve
deactivating the voice converter system and stopping the processing of input
audio. Settings: A use case where the user adjusts settings or parameters
related to the real-time voice conversion process. This might include
configuring options such as pitch shifting, timbre adjustment, or other
transformation settings. Actors: The actors in this diagram could include
end-users interacting with the real-time voice conversion system.


5.2 Use Case Document

Use Case Document: Voice Changing system using machine learning

1. Title: Voice Changing system using machine learning

Accent Conversion in Language Learning Apps:
Use Case: Language learning platforms can utilize voice conversion to help
learners mimic accents. For instance, a learner trying to acquire a British
accent could input their voice recordings, which the ML model could then
convert to sound more British.
Benefit: Learners can improve their pronunciation by listening to their
converted voices, aiding in more accurate accent emulation.
Accessibility for Visually Impaired Individuals:
Use Case: Voice conversion can be employed to convert text to speech in a
more personalized manner for visually impaired individuals. ML models can
convert generic synthesized voices into voices resembling those of the user’s
friends or family.
Benefit: Users may find it more engaging and comforting to listen to voices
they are familiar with, enhancing their overall user experience.
Character Voice Generation in Gaming and Entertainment:
Use Case: In video games or animated films, developers can use voice
conversion to synthesize dialogue in characters’ voices. ML models can
convert voice actors’ recordings into various character voices, saving time
and resources.
Benefit: This approach allows for a wider range of character voices without
the need for multiple voice actors, providing more diverse and immersive
experiences for players or viewers.
Personalized Voice Assistants:
Use Case: Voice-controlled virtual assistants like Siri or Alexa could
utilize voice conversion to tailor responses to individual users. The
assistant could mimic the user’s voice to provide responses, making
interactions more personalized.
Benefit: Users may find it more intuitive and engaging to interact with a
virtual assistant that sounds like them, enhancing the overall user
experience and fostering a stronger sense of connection.




5.3 Class Diagram

Figure 5.2: Class Diagram

• VoiceConverter: Represents the main class responsible for converting voice.
It contains input and output audio streams and methods to set input audio,
perform conversion, and retrieve output audio.
• AudioStream: Represents an audio stream, which contains audio data (a byte
array) and metadata about the audio format (sample rate, sample size, etc.).
It provides methods to get audio data and format information.
• AudioFormat: Represents the format of audio data, including sample rate,
sample size, number of channels, encoding type, frame size, frame rate, and
endianness. It provides methods to retrieve information about the audio
format. This is a basic representation, and depending on the complexity of
the voice converter system, it may need to be expanded to include more
classes and relationships, for instance classes for different audio
processing algorithms, user interfaces, or file I/O operations, as sketched
below.
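
A minimal Python sketch of these classes follows; the names mirror the
diagram, while the method bodies are placeholders rather than the project’s
actual implementation:

from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioFormat:
    sample_rate: int = 16000     # Hz
    sample_size: int = 16        # bits per sample
    channels: int = 1
    encoding: str = "PCM_SIGNED"

@dataclass
class AudioStream:
    data: bytes
    fmt: AudioFormat

    def get_audio_data(self) -> bytes:
        return self.data

class VoiceConverter:
    def __init__(self) -> None:
        self.input_stream: Optional[AudioStream] = None
        self.output_stream: Optional[AudioStream] = None

    def set_input_audio(self, stream: AudioStream) -> None:
        self.input_stream = stream

    def convert(self) -> None:
        # Placeholder: a real implementation would run the conversion
        # model on the input stream here.
        self.output_stream = self.input_stream

    def get_output_audio(self) -> Optional[AudioStream]:
        return self.output_stream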


5.4 Activity Diagram

Figure 5.3: Activity Diagram

An activity diagram is a visual representation of the flow of actions or
activities within a system, process, or workflow. It uses standardized
symbols and notation to depict the sequence of tasks, decisions, and
interactions among different components or actors. These diagrams help in
understanding the chronological order of activities, their dependencies, and
decision points within a system, aiding in system analysis, design, and
documentation. Activity diagrams are particularly useful in modeling business
processes, software workflows, and complex systems, providing stakeholders
with a clear and concise view of system behavior.


5.5 Sequence Diagram

Figure 5.4: Sequence Diagram

Voice Conversion tool:

EchoFetch is a retrieval-based voice conversion tool whose graphical user
interface simplifies the process of voice conversion. The sequence diagram
shows the flow of the project: users upload trained voice models, specify
source and target voices, and convert live voices from one to another. The
tool extracts voice characteristics from the RVC model, pitch-shifts the
input, and converts the user’s voice into the target speaker’s voice.


Chapter 6

Project Design


6.1 Data Flow Diagram

• DFD (Level 0):

Figure 6.1.1: DFD Level 0

In this Level 0 DFD of the voice conversion system, the user sends voice
input to the application, the application forwards it to the admin process,
and the admin process converts the voice and sends it back to the user.
• DFD (Level 1):

Figure 6.1.2: DFD level 1

This Level 1 DFD illustrates the key interactions and components of the voice
conversion system: the user, the admin, the processing models, and the target
database.
– Voice Converter: The component responsible for performing the voice
conversion on the uploaded audio.
– User: Represents the user performing voice conversion using ML algorithms.
First, trained RVC voice models are uploaded to the EchoFetch voice
conversion tool; the tool then extracts and pitch-shifts the input sound and
produces output that sounds like the RVC model’s speaker.


6.2 Flow Chart

Figure 6.2: Flow Chart

Figure 6.2 represents EchoFetch, a retrieval-based voice conversion tool
whose graphical user interface simplifies the process of voice conversion.
Users upload trained voice models, specify source and target voices, and
convert live voices from one to another. The tool extracts voice
characteristics from the RVC model, pitch-shifts the input, and converts the
user’s voice into the target speaker’s voice.


Chapter 7

Implemented System


7.1 System Architecture

Figure 7.1: System Architecture

The typical voice conversion scheme consists of the training and conversion
processes shown in Fig. 7.1. During the training process, the acoustic
features related to speaker identity are extracted from the source and target
speech signals. Next, each source acoustic feature is matched to the
corresponding target feature by a frame alignment method to build a
source-target transfer function. Finally, a mapping function is learned from
the aligned source-target feature pairs. During the conversion process, the
mapping function is applied to the acoustic features extracted from the
source speech to produce a converted feature matrix. The converted feature
matrix is then passed to a synthesizer to reconstruct a speech signal.
Retrieval-based Voice Conversion (hereinafter referred to as RVC) is a new
type of voice conversion program developed in June 2023. Compared with
traditional timbre imitation programs, RVC reduces timbre leakage by
replacing input source features with training-set features using top-1
retrieval. At the same time, RVC has the following characteristics: training
is fast even on relatively poor graphics cards, and training with a small
amount of data can also achieve good results.


7.2 Sample code

# Colab notebook cell. Several lines were truncated in the original PDF
# extraction; completions below are best-effort reconstructions.
%cd /content/
!pip install colorama --quiet
from colorama import Fore, Style
import os

print(f"{Fore.CYAN}> Cloning the repository...{Style.RESET_ALL}")
!git clone https://github.com/w-okada/voice-changer.git --quiet
print(f"{Fore.GREEN}> Successfully cloned the repository!{Style.RESET_ALL}")
%cd voice-changer/server/
print(f"{Fore.CYAN}> Installing libportaudio2...{Style.RESET_ALL}")
!apt-get -y install libportaudio2 -qq
print(f"{Fore.CYAN}> Installing pre-dependencies...{Style.RESET_ALL}")
# Install dependencies that are missing from requirements.txt
!pip install faiss-gpu fairseq pyngrok --quiet
!pip install pyworld --no-build-isolation --quiet
print(f"{Fore.CYAN}> Installing dependencies from requirements.txt...{Style.RESET_ALL}")
!pip install -r requirements.txt --quiet
print(f"{Fore.GREEN}> Successfully installed all packages!{Style.RESET_ALL}")

Token = '2WNe6ETalPYMTrD6NdHK1QwB4cx_4V6rJmFvWmvgjQ2g7Xuxw'  # ngrok auth token (truncated in source)
Region = "us - United States (Ohio)"  # @param (full region list truncated in source)
#@markdown **5** - *(optional)* Other options:
ClearConsole = True  # @param {type:"boolean"}

%cd /content/voice-changer/server
from pyngrok import conf, ngrok

MyConfig = conf.PyngrokConfig()
MyConfig.auth_token = Token
MyConfig.region = Region[0:2]
conf.set_default(MyConfig)

import subprocess, threading, time, socket, urllib.request

PORT = 8000
ngrokConnection = ngrok.connect(PORT)  # open a public tunnel to the server
public_url = ngrokConnection.public_url

from IPython.display import clear_output

def wait_for_server():
    # Poll the local port until the server accepts connections,
    # then print the public URL.
    while True:
        time.sleep(0.5)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        result = sock.connect_ex(('127.0.0.1', PORT))
        sock.close()
        if result == 0:
            break
    if ClearConsole:
        clear_output()
    print("--------- SERVER READY! ----------------- ")
    print("Your server is available at:")
    print(public_url)
    print(" ----------------------------------------------------------- ")

threading.Thread(target=wait_for_server, daemon=True).start()

!python3 MMVCServerSIO.py \
  -p {PORT} \
  --https False \
  --content_vec_500 pretrain/checkpoint_best_legacy_500.pt \
  --content_vec_500_onnx pretrain/content_vec_500.onnx \
  --content_vec_500_onnx_on true \
  --hubert_base pretrain/hubert_base.pt \
  --hubert_base_jp pretrain/rinna_hubert_base_jp.pt \
  --hubert_soft pretrain/hubert/hubert-soft-0d54a1f4.pt \
  --nsf_hifigan pretrain/nsf_hifigan/model \
  --crepe_onnx_full pretrain/crepe_onnx_full.onnx \
  --crepe_onnx_tiny pretrain/crepe_onnx_tiny.onnx \
  --rmvpe pretrain/rmvpe.pt \
  --model_dir model_dir \
  --samples samples.json

ngrok.disconnect(ngrokConnection.public_url)


Chapter 8

Result Analysis


8.1 Result Analysis

Figure 8.1.1: User Interface

User Interface:


This section represents the main screen of the EchoFetch voice conversion
tool. It allows users to configure and modify various settings for the
conversion.

Figure 8.1.2: Configure Settings

Configure Settings:
Users can easily upload trained RVC models to EchoFetch. The results
demonstrate the system’s ability to achieve high-quality voice conversions
with significantly reduced system requirements compared to existing setups.
Users can also modify tune and chunk settings and select the GPU.

Figure 8.1.3: Upload Trained RVC Models


Upload Trained RVC Models:

This section shows the interface for uploading trained RVC models to
EchoFetch; once uploaded, the models become available for selection and
conversion.

Figure 8.1.4: Select RVC Models

In this section, the user can select an uploaded RVC model and convert the
voice. The section also provides pre-installed anime and cartoon voices, and
photos can be added after uploading an RVC model.


Chapter 9

Conclusion And Future Scope


9.1 Conclusion

In conclusion, this report has presented the EchoFetch Retrieval-Based Voice
Conversion System, emphasizing its accuracy, live conversion capabilities,
and customization options. Evaluation showcases its effectiveness through
quantitative metrics and qualitative analyses. Notably, the system’s live
voice conversion feature allows real-time transformations for diverse
applications. The commitment to transparency, with publicly available code
and setups, reinforces its practicality. The conclusion highlights ongoing
advancements and underscores the importance of ethical considerations in the
evolving field of voice conversion technology.

9.2 Future Scope

The future work for the Retrieval-Based Voice Conversion System encompasses
a multifaceted approach. It involves integrating advanced algorithms,
exploring state-of-the-art techniques in machine learning, expanding the
voice database with diverse voices and challenging scenarios, and optimizing
for real-time conversion without compromising quality. The focus extends to
enhancing hyperparameter tuning through advanced techniques and developing a
user-friendly interface for broader accessibility. Cross-linguistic
capabilities, robustness improvements, and interactive voice conversion
methods are key objectives, along with incorporating user feedback, emotional
variability, and ethical considerations. The roadmap also includes evaluating
the system’s generalization on voices not in the training dataset, exploring
cross-modal conversions, addressing privacy concerns, and ensuring ethical
use.


References

(a) Sisman, B., Yamagishi, J., King, S., Li, H. (2020). "An overview of voice
conversion and its challenges: From statistical modeling to deep learning."
IEEE/ACM Transactions on Audio, Speech and Language Processing, 29, 132-157.
doi: 10.1109/TASLP.2020.3038524
(b) Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. (2020). "Parallel
WaveGAN: A fast waveform generation model based on generative adversarial
networks with multi-resolution spectrogram." In ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, pp. 6199-6203.
(c) Zhang, A., Lipton, Z. C., Li, M., Smola, A. J. (2021). "Dive into deep
learning." arXiv preprint arXiv:2106.11342.
(d) Chien-Yu Huang, Yist Y. Lin, Hung-Yi Lee, Lin-Shan Lee. (2020).
"Defending your voice: Adversarial attack on voice conversion." arXiv,
abs/2005.08781.
(e) Seung-Won Park, Doo-Young Kim, Myun-Chul Joe. (2020). "Cotatron:
Transcription-guided speech encoder for any-to-many voice conversion without
parallel data." arXiv, abs/2005.03295.


Acknowledgment
It is a privilege for us to have been associated with Prof. Snehal Shinde,
our guide, during this project work. We have greatly benefited from her
valuable suggestions and ideas. It is with great pleasure that we express our
deep sense of gratitude to her for her valuable guidance, constant
encouragement, and patience throughout this work. We are also indebted to our
guide for extending help with the academic literature. We express our
gratitude to Dr. Rajashree Gadhave (Project Coordinator), Prof. Rohini
Bhosale (Head of Department of Computer Engineering), and Dr. J. W. Bakal
(Principal) for their constant encouragement, cooperation, and support. We
take this opportunity to thank all our classmates for their company during
the course work and for the useful discussions we had with them. We would be
failing in our duties if we did not make a mention of our family members,
including our parents, for providing moral support, without which this work
would not have been completed.

Thanking You,

Sahil Bonde


List of Publications
Journal
(a) Sahil Bonde, Kiran Jagdale, Sayana Maity, Prof. Snehal Shinde,
“EchoFetch: Retrieval-Based Voice Conversion”, ISSN: 2456-4184, April 2024,
Volume 9.

[Status: Submitted]


Plagiarism Report

Figure: Plagiarism Report

