
Internship Report

Field of Study :

Specialized Master's in Big Data and Cloud Computing

Transcription of Audios in Moroccan Darija

Performed by:

Idir Yasmine

Under the supervision of:

Mr Imade Benelallam

Academic Year 2022-2023


Contents

1 Introduction
  1.1 Overview of the Real-Time Speech Recognition Application
  1.2 Objective and Purpose of the Report
2 Technologies Used
  2.1 FastAPI Framework
  2.2 PyAudio Library
  2.3 Kafka Messaging System
  2.4 Azure Cognitive Services - Speech SDK
  2.5 Python's Threading Module
  2.6 Jinja2Templates for HTML Rendering
  2.7 KafkaProducer and KafkaConsumer
  2.8 Azure Speech Configuration
3 Application Architecture
  3.1 Workflow Description
  3.2 Scalability and Future Considerations
4 Real-Time Transcription Process
  4.1 Audio Capture and Data Streaming
  4.2 Speech Recognition Workflow
5 Future Enhancements
  5.1 Potential Improvements
    5.1.1 Expanded Language Support
    5.1.2 Enhanced Accuracy
    5.1.3 Interactive Features
    5.1.4 Integration with AI Assistants
    5.1.5 Real-Time Collaboration Tools
6 Conclusion
Abstract
This document is the result of our work carried out as part of our end-
of-studies project. The purpose of this report is to explore audio transcrip-
tion in Moroccan Darija using an artificial intelligence-based approach. It
addresses the challenge of transcribing this specific language and high-
lights the importance of this task. The report details the methodology
implemented, from data preprocessing to the use of a model to achieve
high accuracy. Furthermore, it assesses the model's performance
and provides concrete examples of successful transcriptions in Moroccan
Darija. This report presents an innovative and effective solution for tran-
scribing this language, demonstrating its relevance and utility in various
contexts.

Keywords: Transcription, Artificial Intelligence, Model, Preprocessing, Data, Evaluation, Performance

1 Introduction
1.1 Overview of the Real-Time Speech Recognition Application
The Real-Time Speech Transcription Application for the Darija language lever-
ages cutting-edge technologies to provide instantaneous transcription services.
Designed to cater specifically to the Darija language, this application facilitates
real-time speech-to-text conversion, allowing users to obtain accurate transcrip-
tions of spoken Darija content. By utilizing Kafka as a messaging system and
integrating Azure Cognitive Services for speech recognition, this application
addresses the growing need for efficient transcription services in the Darija lan-
guage.
The primary functionalities of the application include audio capture, stream-
ing, and recognition of speech input in Darija. Users capture live audio through
a microphone, the audio is processed by Azure Cognitive Services for speech
recognition, and the resulting text is displayed in real time in a user-friendly
web interface. By delivering live transcriptions of spoken content, the applica-
tion fosters accessibility and communication in the Darija-speaking community.
Key functionalities of the application include:

• Audio Capture (PyAudio): The application utilizes PyAudio to capture
audio data in chunks from the system's microphone. This data is then
processed for speech recognition.
• Kafka Integration: Kafka, as a distributed event streaming platform,
facilitates the seamless transfer of audio data chunks from the recording
module to the Azure Cognitive Services component. This ensures efficient
and reliable communication between different parts of the application.
• Azure Cognitive Services - Speech SDK: Azure Cognitive Services
provide the Speech SDK, which is employed for real-time speech recogni-
tion. This SDK processes the incoming audio data from Kafka, performs
speech-to-text conversion, and provides live transcriptions of the spoken
content.
• User Interface (FastAPI and Jinja2Templates): The application’s
user interface is developed using FastAPI, a modern web framework, com-
bined with Jinja2Templates for HTML rendering. This interface displays
the live transcriptions obtained from the speech recognition process, al-
lowing users to interact with the system easily.

1.2 Objective and Purpose of the Report


The objective of this report is to provide a comprehensive understanding of the
Real-Time Speech Transcription Application's architecture, functionalities, and
technologies employed for real-time transcription in the Darija language using
Kafka. It aims to dissect the application's components, workflow, challenges
encountered, and recommendations for potential enhancements.
This report will delve into the intricate details of the application’s architec-
ture, outlining the core components such as PyAudio for audio capturing, Kafka
for seamless message streaming, Azure Cognitive Services for real-time speech
recognition in Darija, and the FastAPI framework combined with Jinja2Templates
for web-based user interface rendering. Additionally, it will explore how these
components collaboratively enable the application’s capability to perform in-
stantaneous transcription in the Darija language.
Furthermore, the report will discuss the scalability options and potential en-
hancements in the application’s architecture to meet the demands of increased
usage and to enhance its overall performance. This report serves multiple pur-
poses aimed at providing a comprehensive understanding of the application:

1. Technical Insights: It will delve into the technical intricacies of each
component utilized in the application's architecture, including FastAPI,
PyAudio, Kafka, Azure Cognitive Services, threading, and more.
2. Functional Workflow Description: The report will detail the step-by-
step workflow, explaining how audio capture, data streaming via Kafka,
speech recognition, and UI rendering are interconnected and function in
real-time.
3. API Endpoint Overview: It will offer a detailed breakdown of the API
endpoints, elucidating their functionalities, expected input/output, and
the flow of data through these endpoints.
4. Considerations and Recommendations: Addressing various aspects
like error handling strategies, scalability considerations, security measures,
and UI enhancements to suggest improvements or optimizations for the
application.

By covering these aspects, the report aims to provide a comprehensive insight
into the technologies used, functionalities, workflow, and potential areas for
improvement within the real-time speech recognition application.

2 Technologies Used
2.1 FastAPI Framework

FastAPI is a modern web framework for building APIs with Python 3.7+. It
offers several features such as:

• High Performance: FastAPI delivers high performance through its asyn-
chronous design built on Starlette, while Python type hints drive auto-
matic validation and documentation.
• Automatic API Documentation: It generates interactive API docu-
mentation automatically, facilitating API exploration.
• Data Validation and Serialization: FastAPI performs automatic data
validation and serialization using Pydantic models.

In the application, FastAPI serves as the core framework to create API end-
points, handle HTTP requests, and generate responses for controlling the record-
ing and retrieving transcriptions.

2.2 PyAudio Library

PyAudio is a set of Python bindings for PortAudio, enabling audio input and
output. Key functionalities include:

• Audio Input/Output: PyAudio facilitates audio capture from the sys-
tem's microphone and audio playback.
• Configuration Control: It allows configuration of audio input settings
like sample format, channels, and sampling rate.

In the application, PyAudio is used to capture audio data in chunks from the
microphone, which is further processed for speech recognition.
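A capture loop along these lines illustrates the idea (the chunk size and sample rate are typical values for speech, not the report's mandated settings; the pyaudio import is deferred so the sketch reads without PortAudio installed):

```python
# Illustrative PyAudio capture sketch; CHUNK and RATE are common defaults
# for speech work, assumed here rather than taken from the application.
CHUNK = 1024        # frames per buffer
CHANNELS = 1        # mono microphone input
RATE = 16000        # 16 kHz sampling, typical for speech recognition


def chunk_duration_ms(frames: int = CHUNK, rate: int = RATE) -> float:
    """Duration of one audio chunk in milliseconds."""
    return 1000.0 * frames / rate


def record_chunks(num_chunks: int):
    """Yield raw 16-bit PCM chunks from the default microphone."""
    import pyaudio  # deferred import: requires PortAudio at runtime
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=CHANNELS,
                     rate=RATE, input=True, frames_per_buffer=CHUNK)
    try:
        for _ in range(num_chunks):
            yield stream.read(CHUNK)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```

With these values each chunk covers 64 ms of audio, small enough to keep the downstream transcription feeling live.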

2.3 Kafka Messaging System

Kafka is an open-source distributed event streaming platform known for its
role as a message broker. Its integration in the application involves:
• Message Queuing: Kafka enables the streaming of audio data chunks
from the recording module to the Azure Cognitive Services component.
• Reliable Communication: It ensures reliable communication between
different components of the application by handling data streaming effi-
ciently.
Kafka’s message queuing mechanism facilitates the smooth transfer of audio
data for further processing in the application.
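The producing side of that transfer can be sketched as follows (assuming the kafka-python client, a broker on localhost:9092, and an illustrative topic name; none of these are confirmed by the report):

```python
# Producer-side sketch; broker address and topic name are assumptions.
TOPIC = "audio-chunks"


def make_producer(bootstrap: str = "localhost:9092"):
    from kafka import KafkaProducer  # requires kafka-python at runtime
    return KafkaProducer(bootstrap_servers=bootstrap)


def publish_chunk(producer, chunk: bytes) -> None:
    # Each captured audio chunk becomes one Kafka message on the topic.
    producer.send(TOPIC, value=chunk)
```

Because `publish_chunk` only relies on a `send` method, the recording module stays decoupled from the concrete Kafka client.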

2.4 Azure Cognitive Services - Speech SDK

Azure Cognitive Services provide the Speech SDK for real-time speech recog-
nition. Key features include:


• Real-time Speech Recognition: The Speech SDK processes audio data
in near real-time, converting speech to text.
• Integration Flexibility: It seamlessly integrates with applications to
provide live transcriptions of spoken content.
In the application, Azure’s Speech SDK handles the actual speech recognition
process, providing live transcriptions of the captured audio.

2.5 Python’s Threading Module

Python's Threading Module allows concurrent execution of code, managing
multiple tasks simultaneously. Its use involves:
• Concurrent Tasks: Threading manages concurrent processes within the
application, such as audio recording and continuous speech recognition.
• Non-blocking Execution: It ensures non-blocking execution, prevent-
ing tasks from halting the entire application.

Threading in the application ensures the smooth functioning of different pro-
cesses without blocking the main execution flow.
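A minimal standard-library sketch of this pattern (the queue hand-off between threads is assumed as the coordination mechanism; the real application may wire its threads differently):

```python
# Stdlib-only sketch: the recorder runs on its own thread and hands
# chunks to the rest of the pipeline without blocking the main flow.
import queue
import threading


def recorder(chunks, out_queue: queue.Queue) -> None:
    # Stand-in for the PyAudio capture loop: pushes each chunk onto a queue.
    for chunk in chunks:
        out_queue.put(chunk)


audio_queue: queue.Queue = queue.Queue()
worker = threading.Thread(
    target=recorder, args=([b"chunk-1", b"chunk-2"], audio_queue),
    daemon=True)
worker.start()
worker.join()  # the real application keeps serving requests instead
```

Marking the thread as a daemon means it never prevents the web server process from shutting down.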

2.6 Jinja2Templates for HTML Rendering


Jinja2Templates is a templating engine used with FastAPI for dynamic HTML
content generation. Its functionalities include:
• Dynamic HTML Generation: Jinja2Templates dynamically generates
HTML content based on provided data.

• Template Rendering: It renders HTML templates, allowing for easy in-
tegration with the FastAPI framework.
In the application, Jinja2Templates is utilized to render the user interface, dis-
playing live transcriptions and status updates.
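The rendering step can be illustrated with plain Jinja2, which FastAPI's Jinja2Templates wraps around template files on disk (the template string and variable names here are illustrative):

```python
# Self-contained Jinja2 sketch; the markup and variable names are
# assumptions, not the application's actual template.
from jinja2 import Environment

env = Environment(autoescape=True)
template = env.from_string(
    "<h1>Live Transcription</h1><p>Status: {{ status }}</p>"
    "<ul>{% for line in lines %}<li>{{ line }}</li>{% endfor %}</ul>")

# In the application the status and lines would come from the
# recognition pipeline rather than literals.
html = template.render(status="recording", lines=["salam", "labas"])
```

Enabling autoescape ensures that recognized text cannot inject markup into the rendered page.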

2.7 KafkaProducer and KafkaConsumer


KafkaProducer and KafkaConsumer are components from the Kafka library used
for producing and consuming messages to/from Kafka topics.
• Message Production: KafkaProducer sends audio data chunks to a
Kafka topic for further processing.
• Message Consumption: KafkaConsumer consumes these chunks for
speech recognition or other processing tasks.
These components facilitate the communication and data streaming within the
application through Kafka’s message queuing system.
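The consuming side might be sketched as follows (again assuming kafka-python and illustrative broker/topic names; the import is deferred so the pure helper stays usable without a running broker):

```python
# Consumer-side sketch; broker address and topic name are assumptions.
def make_consumer(topic: str = "audio-chunks",
                  bootstrap: str = "localhost:9092"):
    from kafka import KafkaConsumer  # requires kafka-python at runtime
    return KafkaConsumer(topic, bootstrap_servers=bootstrap)


def collect_audio(messages) -> bytes:
    # Concatenates the payload of each Kafka message into one audio buffer.
    return b"".join(message.value for message in messages)
```

Iterating over a KafkaConsumer yields messages as they arrive, so the same `collect_audio` shape works for both batch and streaming consumption.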

2.8 Azure Speech Configuration


Azure Speech Configuration involves setting up the configuration for Azure
Cognitive Services' Speech SDK.

• Configuration Parameters: It includes the subscription key, service re-
gion, and language settings for the Speech SDK.
• Speech Recognition Setup: The configuration sets up the Speech SDK
with the necessary credentials and language preferences.

In the application, this configuration ensures proper setup and functionality of
the Azure Speech recognizer for accurate speech recognition.
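A configuration sketch along these lines shows the pieces involved; the key, region, and locale tag are placeholders, and "ar-MA" is only assumed here as the closest locale tag for Moroccan Arabic:

```python
# Placeholder credentials; "ar-MA" is an assumed locale tag, not
# necessarily the one used by the report's application.
def make_recognizer(key: str, region: str, language: str = "ar-MA"):
    import azure.cognitiveservices.speech as speechsdk
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_recognition_language = language
    # A push stream lets audio bytes arriving from Kafka be fed in.
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=config, audio_config=audio_config)
    return recognizer, stream
```

Returning both the recognizer and its push stream lets the Kafka-consuming component write audio in while the recognizer emits text events.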

3 Application Architecture
3.1 Workflow Description
The architecture of this application revolves around a seamless interaction among
several essential components, each playing a crucial role in the real-time tran-
scription process in Darija:

• Audio Capture and Streaming to Kafka: The PyAudio module acts as
an intermediary, capturing audio streams from the user's microphone.
These audio segments are then forwarded to the Kafka messaging system
for processing by other components.
• Integration with Azure Cognitive Services: The Kafka messaging
system serves as a central mechanism, transmitting the captured audio
segments to Azure Cognitive Services specialized in Darija voice recogni-
tion. These services perform the conversion of spoken language into text.
• User Interface Rendering: The dynamic user interface is facilitated
by FastAPI and Jinja2Templates. It displays the resulting transcriptions
from voice recognition in real-time, allowing users to immediately view
transcribed content.

3.2 Scalability and Future Considerations


Scalability of the architecture is essential to anticipate application growth and
meet increasing user demands:

• Horizontal Scalability: Exploring solutions like Kafka partitioning is piv-
otal to effectively distribute workload across multiple nodes. This ensures
better management of incoming and outgoing audio streams.
• Data Flow Optimization: Improving data transfer mechanisms be-
tween components is necessary to reduce latency. The aim is to optimize
processing speed, ensuring a responsive and smooth user experience.

4 Real-Time Transcription Process


4.1 Audio Capture and Data Streaming
The process of real-time transcription begins with the capture of audio data,
which is then seamlessly streamed to the Kafka messaging system. This phase
involves the PyAudio module, responsible for capturing audio segments origi-
nating from the user’s microphone.
PyAudio serves as the bridge, converting raw audio signals into digital data
chunks, which are subsequently transmitted in real-time to Kafka. These chunks
of audio data are sent as messages to specific Kafka topics, facilitating a smooth
flow of information for further processing.

4.2 Speech Recognition Workflow
The speech recognition phase involves Azure Cognitive Services, specifically
tailored to enable real-time speech-to-text conversion for the Darija language.
Once the audio data reaches Kafka, it is received by components responsible for
integration with Azure Cognitive Services.
Azure Cognitive Services employs sophisticated algorithms and language
models trained for Darija voice recognition. Upon receiving audio segments,
these services execute a series of recognition processes, analyzing and interpret-
ing spoken language patterns to generate accurate textual representations.
The process involves breaking down the received audio into recognizable
speech components, utilizing various linguistic and acoustic models. The result
is the transformation of spoken Darija language into text, which is then relayed
back for further utilization, such as displaying live transcriptions on the user
interface.
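The workflow above can be sketched as glue code between the Kafka consumer and the Speech SDK's push stream; the recognizer, stream, and consumer objects are passed in, so the sketch itself is library-agnostic, and all names are illustrative:

```python
# Glue sketch between Kafka and the Speech SDK; object names are
# assumptions, and the callback shape mirrors the SDK's event API.
def run_recognition(recognizer, stream, consumer, transcripts: list) -> None:
    """Feed Kafka audio messages into the recognizer's push stream and
    collect finalized text segments."""
    recognizer.recognized.connect(
        lambda evt: transcripts.append(evt.result.text))
    recognizer.start_continuous_recognition()
    try:
        for message in consumer:          # each message carries one audio chunk
            stream.write(message.value)   # raw bytes go to the recognizer
    finally:
        stream.close()                    # signals end-of-audio
        recognizer.stop_continuous_recognition()
```

Because the recognizer reports results through events, transcriptions accumulate asynchronously while audio keeps flowing in from Kafka.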

5 Future Enhancements
5.1 Potential Improvements
5.1.1 Expanded Language Support
To broaden the application’s accessibility, consider extending its language sup-
port beyond Darija. This expansion aims to make the application more inclusive
by accommodating a wider range of languages. By integrating voice recogni-
tion for various languages, the application will cater to a more diverse audience,
enhancing its usefulness and appeal to a broader user base.

5.1.2 Enhanced Accuracy


Continuous refinement of voice recognition models is crucial to ensure greater
accuracy in capturing and transcribing voice content. By investing in training
models with larger and more diverse datasets, the goal is to enhance transcrip-
tion reliability. Increased accuracy will strengthen user trust in the application’s
transcription precision.

5.1.3 Interactive Features


Introducing interactive features such as the ability to pause/resume transcrip-
tion, speaker identification, or translation options can significantly diversify the
user experience. These features will provide users with more control and cus-
tomization, enhancing the application’s utility in various usage scenarios.

5.1.4 Integration with AI Assistants


Exploring integration with AI assistants will equip the application with addi-
tional functionalities, such as contextual understanding, smart responses, and
more personalized interactions. This integration can significantly enhance the
application's usefulness by offering more sophisticated assistance and interac-
tion to users.

5.1.5 Real-Time Collaboration Tools


The integration of real-time collaboration tools, allowing simultaneous sharing
and editing of transcriptions among multiple users, will foster cooperation and
efficiency in collaborative work environments. This feature will strengthen the
application’s practicality, particularly for teams working on projects requiring
real-time interaction.

6 Conclusion
This project has presented a real-time speech transcription application designed
specifically for the Darija language, utilizing Kafka and Azure Cognitive Services.
The application’s architecture, workflow, and functionalities were comprehen-
sively detailed, covering audio capture, streaming, speech recognition, and user
interface rendering.
In conclusion, this application stands as a promising solution in overcoming
language barriers. Its real-time transcription capabilities hold significant poten-
tial in facilitating effective communication. With continuous advancements and
future enhancements, the application is poised to become an indispensable tool
for enabling inclusive and efficient real-time transcription, fostering seamless
communication across diverse linguistic backgrounds.
