Data Sorting Guideline

This document was created to serve as a guideline on how to sort and prepare the data that will be used to train a new voice model.

Things you’ll need beforehand:

 ffmpeg;
 WLM;
 Python 3.8;
 Anaconda;
 WAV audio files;
 LJSpeech Tools;

Link to LJSpeech Tools -> https://github.com/lalalune/LJSpeechTools


Click on Code and then click on Download ZIP.

To install ffmpeg -> sudo apt install ffmpeg

To install the other LJSpeech Tools -> pip install resemblyzer SpeechRecognition

Note: the separator only needs to be used if there’s more than one speaker; otherwise
you won’t need it.

WAV files need to be snipped into individual sentences, no longer than 5-6 seconds each.

If you wish to create your own data, you will need a Digital Audio Workstation (DAW)
to snip the source file into smaller ones. Audacity is a free option, but
feel free to use your preferred DAW.
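If you want to sanity-check your snipped clips, a short Python sketch (standard library only; the helper names are my own, not part of LJSpeech Tools) can flag any WAV file that exceeds the 6-second limit:

```python
import wave
from pathlib import Path

MAX_SECONDS = 6.0  # per the guideline above: clips no longer than 5-6 seconds

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def find_too_long(folder):
    """List every .wav in the folder that exceeds MAX_SECONDS."""
    return [p.name for p in sorted(Path(folder).glob("*.wav"))
            if wav_duration(p) > MAX_SECONDS]
```

Run it against your “wavs” folder before sorting, and re-snip anything it reports.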

After everything is installed and you have your data at hand, follow these steps:

Place your data in the “wavs” folder (which you extracted from LJSpeech Tools).

If there’s more than one speaker, place a few sample audio files from the speaker you want
in the “target” folder, and place samples from the unwanted speakers in the “ignore” folder.

Run the separator in Anaconda -> python separator.py --threshold=0.65

(the threshold can be adjusted between 0.6 and 0.9)
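Under the hood, the separator compares voice embeddings of each clip against your target samples, and the threshold acts roughly as a similarity cutoff. A toy sketch of the idea (the vectors and function names here are made up for illustration; the real tool uses Resemblyzer embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_target_speaker(clip_emb, target_emb, threshold=0.65):
    """Keep the clip only if it is similar enough to the target voice."""
    return cosine_similarity(clip_emb, target_emb) >= threshold
```

Raising the threshold toward 0.9 sorts more strictly (fewer wrong-speaker files kept, but more correct files thrown out); lowering it toward 0.6 keeps more files but leaves you more manual review.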

After you run this, the files will be sorted: the ones that match your speaker will stay in the “wavs”
folder, and the files that do not match your speaker will be moved to the “not_speaker” folder.

Note: this script is fairly reliable, but it’s not perfect. You will still need to go through all the remaining
files in the wavs folder and review them to see if there are any audio files that don’t belong there.

After you’re done sorting, it’s time to use the transcriber.

Run the transcriber in Anaconda -> python transcriber.py

After the transcriber is done running, you’ll see that a metadata.csv file has been created. This file
contains the sentence spoken in each audio file, with each sentence assigned to its
corresponding audio file.
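In the LJSpeech convention, metadata.csv is a pipe-delimited text file where each line pairs a clip name with its transcript (the exact columns your transcriber emits may differ, so treat this as a sketch). A small reader for that format:

```python
from pathlib import Path

def read_metadata(path="metadata.csv"):
    """Parse pipe-delimited metadata lines into (clip_id, text) pairs."""
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # split on the first "|" only, in case the transcript itself varies
        clip_id, text = line.split("|", 1)
        pairs.append((clip_id, text))
    return pairs
```

Loading the file like this makes it easy to print each transcript next to its clip name while you review.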

Like the separator, the transcriber is not perfect, and you’ll need to review whether the sentences match the
audio they’re assigned to. The AI is really sensitive, so this part needs to be as accurate as possible.
Sometimes it switches words because the pronunciation wasn’t perfect; sometimes it doesn’t catch
the first word if it’s a short one like “the” or “an”.

Please note that no special characters should be present in the sentences; you don’t need to include
commas, periods, exclamation marks, or question marks.

Note: if there’s a word that is written the same but pronounced a different way, you can specify
this by using phonemes.

e.g.: In the files I was working on, there was a character named Date (pronounced dah-tay), which is
written the same way as date (as in the date of the month, year, etc.), so I had to specify to the AI how to
pronounce it. You can do this by using curly brackets and specifying the phonemes inside them, like
this: {dah-tay}
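When cleaning transcripts by hand, it is easy to strip punctuation and accidentally mangle a {phoneme} override at the same time. A small helper (hypothetical, not part of LJSpeech Tools) that removes punctuation while leaving anything inside curly brackets untouched:

```python
import re

PUNCT = re.compile(r"[,.!?;:]")

def clean_sentence(text):
    """Strip punctuation but leave {phoneme} overrides untouched."""
    # split keeps the brace groups as separate parts because of the capture group
    parts = re.split(r"(\{[^}]*\})", text)
    return "".join(p if p.startswith("{") else PUNCT.sub("", p) for p in parts)
```

For example, `clean_sentence("his name is {dah-tay}, not date.")` drops the comma and period but keeps the `{dah-tay}` override intact.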
