SPEECH TO TEXT

Data curation
Lê Phương Anh, Nguyễn Thy Ngọc, Lê Hoàng Linh Chi, Nguyễn Phương Nhi
01.
DETERMINE THE POSSIBLE INTERNAL AND EXTERNAL DATASETS
Internal Datasets
● Transcripts of spoken Vietnamese:
- Could be compiled in-house by recording and transcribing speech in various settings, such as news broadcasts, interviews, and other spoken content.
- Provide the text data needed to train the speech-to-text system.

● Audio recordings:
- Will be used to train the system to recognize and transcribe spoken words and phrases.
External Datasets
● Mozilla’s Common Voice Dataset:
- Publicly available dataset of voice recordings and transcripts collected by Mozilla.
- The Vietnamese portion of the dataset contains recordings from many speakers, each labeled with the corresponding text of what was spoken, and is available for download (a loading sketch follows this list).

● Academic datasets:
- The Vietnamese Speech Corpus (VSC) is a speech dataset from the Institute of Information Technology in Vietnam. It may be smaller than the Common Voice dataset, but could still be useful for training a speech-to-text system.
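
A minimal sketch of how the Common Voice download could be scripted, assuming the Hugging Face datasets library is installed. The dataset id "mozilla-foundation/common_voice_13_0" is an assumption to verify on the Hub, and Common Voice requires accepting its terms with an authenticated account.

```python
# Sketch (assumptions noted above): load the Vietnamese Common Voice subset.
from datasets import load_dataset, Audio

# Dataset id and config are assumptions; check the Hugging Face Hub.
cv_vi = load_dataset("mozilla-foundation/common_voice_13_0", "vi", split="train")

# Decode audio at 16 kHz, a common rate for speech-to-text training.
cv_vi = cv_vi.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_vi[0]
print(sample["sentence"])               # transcript of what was spoken
print(sample["audio"]["array"].shape)   # decoded waveform samples
```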
02.
DESCRIBE THE DATASETS
Attributes

● Format: The audio recordings in the dataset will likely be stored in a specific format, such as WAV or MP3. The audio data will have a specific sampling rate, bit depth, and channel count, which will need to be considered during pre-processing. For example, the sampling rate might be 16 kHz or 44.1 kHz, the bit depth might be 16 or 24 bits, and the channel count might be mono or stereo (a format-inspection sketch follows this list).

● Quality: The audio recordings could have varying quality, depending on factors such as the recording equipment and environment.
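
A minimal sketch of checking the format attributes above before pre-processing, using Python's standard-library wave module; the file path is a hypothetical example.

```python
# Sketch: inspect sampling rate, bit depth, and channel count of a WAV file.
import wave

with wave.open("recordings/clip_0001.wav", "rb") as wav:  # hypothetical path
    channels = wav.getnchannels()      # 1 = mono, 2 = stereo
    sample_width = wav.getsampwidth()  # bytes per sample (2 -> 16-bit)
    frame_rate = wav.getframerate()    # sampling rate in Hz, e.g. 16000
    n_frames = wav.getnframes()

    print(f"channels:      {channels}")
    print(f"bit depth:     {sample_width * 8} bits")
    print(f"sampling rate: {frame_rate} Hz")
    print(f"duration:      {n_frames / frame_rate:.2f} s")
```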
Features
● Audio features: Properties extracted when analyzing audio data such as speech or music, including acoustic characteristics (pitch, frequency, and volume), background noise and other environmental factors, and language-specific phonemes, tones, and other linguistic features. One common technique is Mel-frequency cepstral coefficients (MFCCs), which break the audio down into a series of features that can be used to train a speech-to-text system. These features can be of fixed or variable length, depending on the length of the audio clip being analyzed (an MFCC sketch follows this list).
● Text features: The text data will consist of sequences of characters, which can be represented as one-hot vectors or embeddings. The text features may also include information about the context of the text, such as punctuation or capitalization (an encoding sketch follows this list).
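
A minimal sketch of MFCC extraction, assuming the librosa library is installed; the file path and the choice of 13 coefficients are illustrative assumptions.

```python
# Sketch: compute MFCC features for one clip.
import librosa

# Load as a mono waveform at 16 kHz (librosa resamples if needed).
y, sr = librosa.load("recordings/clip_0001.wav", sr=16_000)  # hypothetical path

# 13 coefficients per frame is a common, but not mandatory, choice.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Shape is (n_mfcc, n_frames): fixed feature dimension, variable number
# of frames depending on the clip's duration.
print(mfcc.shape)
```

And a minimal sketch of the character-level text encoding described above, using a hypothetical two-sentence corpus; one-hot vectors or embeddings would then be built on top of these integer ids.

```python
# Sketch: map transcript characters (including Vietnamese diacritics) to ids.
transcripts = ["xin chào", "cảm ơn bạn"]  # hypothetical corpus

vocab = sorted(set("".join(transcripts)))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

encoded = [[char_to_id[ch] for ch in text] for text in transcripts]
print(encoded[0])  # one integer id per character of "xin chào"
```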
Labeling

● The audio recordings will be labeled data, as they correspond to the text transcriptions in the speech-to-text dataset. The labeling may include information about the context of the speech, such as the speaker or the type of spoken content (e.g. news, interview, lecture). The labels can be used to train the speech-to-text system to recognize and transcribe different types of spoken content.
● It's also possible to have some audio recordings without corresponding text transcriptions, which can be used for unsupervised learning or other tasks such as speaker diarization. In this case, the audio recordings will be labeled as "no label" (a manifest sketch follows this list).
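
A minimal sketch of how this labeling could be stored as a JSON-lines manifest, with None standing in for the "no label" case; all file names and fields are hypothetical.

```python
# Sketch: write a manifest pairing audio clips with transcripts and context.
import json

manifest = [
    {"audio": "clips/news_0001.wav", "transcript": "bản tin thời sự",
     "speaker": "spk01", "content_type": "news"},
    {"audio": "clips/raw_0042.wav", "transcript": None,  # the "no label" case
     "speaker": None, "content_type": None},
]

with open("manifest.jsonl", "w", encoding="utf-8") as f:
    for entry in manifest:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Route supervised examples and unlabeled audio to different pipelines.
labeled = [e for e in manifest if e["transcript"] is not None]
unlabeled = [e for e in manifest if e["transcript"] is None]
```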
