
Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR051

PDF to Voice Using Deep Learning


1 S. Sarjun Beevi; 2 Tayi Gopi Chand; 3 Tamatam Hemanth Reddy; 4 Tammana Rama Naga Sai Gokul; 5 Alamuru Harika

1 Assistant Professor, School of Computing, Department of Computer Science and Engineering, Bharath Institute of Higher Education and Research, Chennai, India - 600073.
2,3,4,5 Student, School of Computing, Department of Computer Science and Engineering, Bharath Institute of Higher Education and Research, Chennai, India - 600073.

Abstract:- Audiobooks are a great option for people who would rather listen than read. Printed books cannot always be bought and stored in a home library, and audiobooks are an excellent way to rest the eyes and take a break from the constant stimulation of digital devices. They also save time: for example, a reader can continue a book while doing other tasks. This will not only reduce the strain on millennials but will also be a valuable aid for people with visual impairments. The ability to transform any content into an audiobook is a real gift, and our technology can be used to build such tools. Text-to-speech and other read-aloud applications are widely used to help students develop reading comprehension skills. The PDF to Audio System is a screen reader designed and developed for effective audio communication. The International Organization for Standardization (ISO) has specified PDF as an open file format, and it is one of the most convenient formats for digital communication and data exchange. This matters if we want to improve screen-reader accessibility by adding audio to our content. PDF files can contain hyperlinks and buttons as well as embedded audio and video. Multiple languages can be supported using PDF-to-audio technology, which lets users hear the text read aloud.

Keywords:- Text to Speech, Image Extraction, OCR Technology, and Speech Recognition.
I. INTRODUCTION

Speaker authentication has attracted considerable interest in recent years due to the popularity of voice-access and security applications (Zhang et al., 2017; Hanili, 2018). Speaker authentication can be divided into speaker identification and speaker verification. In speaker identification, a person's voice is compared against a set of known speakers and assigned to one of them. In speaker verification, the speaker's voice is accepted or rejected as the voice of a specific individual. A speaker verification system may be text-dependent or text-independent. A text-dependent system assumes that the speaker correctly reads a given utterance or prompt; it then evaluates two similarities: first, the similarity between the spoken and expected words, and second, the perceived manner and voice of the speaker.

This project is an attempt to develop an advanced software engine that can extract text from clean and degraded PDFs, document images, and handwritten content, and convert the corresponding digital records into speech signals. At the core of this tool is an Optical Character Recognition (OCR) module, which performs the necessary morphological operations on the digital images before conversion. The extracted text is then converted into speech signals using various synthesis strategies. Text-to-speech synthesis here rests largely on OCR, the process of converting scanned or printed images of text, or handwritten text, into editable text for further processing. This paper presents a robust approach to extracting and re-voicing such printed records. The first computer speech synthesis systems were developed in the late 1950s, and the first complete text-to-speech systems were finished in 1968; early approaches were based on formant rules and accent sets, and later ones on diphone sets. In the 1990s, unit selection synthesis greatly improved the naturalness of the output. As machine learning methods became popular, neural network models were trained to predict words and sentences. The fundamental characteristics of text-to-speech are naturalness, accuracy, and clarity. In this project, we used a simple technique to create TTS using Python. LORO can convert any text-based PDF file into an MP3 file. It presents only the most significant content by using deep learning to examine the relationship between a page's title and body, and it can strip repeated author names, file indexes, and other unnecessary material from the page, so that you hear only what you actually intended to hear. LORO uses Google Cloud services such as the Vision, AutoML, and Text-to-Speech APIs, which helps it produce a best-in-class result. Its usefulness is further increased by its ability to recognize text in handwritten documents in addition to converting typed text into audio.
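The paper does not list code for this pipeline. As a minimal sketch only, and assuming recent versions of the google-cloud-vision and google-cloud-texttospeech client libraries, a page image already exported from the PDF, and an English voice, the Cloud pairing named above could look roughly as follows; the file names are placeholders, not details taken from LORO.

# Sketch: OCR a page image with Cloud Vision, then synthesize speech (assumed libraries).
from google.cloud import vision, texttospeech

def page_image_to_mp3(image_path: str, mp3_path: str) -> None:
    # Recognize printed or handwritten text in the page image.
    ocr_client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = ocr_client.document_text_detection(image=image)
    text = response.full_text_annotation.text

    # Convert the recognized text into MP3 audio.
    tts_client = texttospeech.TextToSpeechClient()
    result = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3),
    )
    with open(mp3_path, "wb") as out:
        out.write(result.audio_content)

page_image_to_mp3("page_1.png", "page_1.mp3")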
II. OBJECTIVE

A hidden Markov model, or a similar probabilistic configuration, is used when the object is only partially observed, or when not all of the information needed for the decision is available from the sensors (as in speech recognition, or when identifying a speaker).


An example is the phonological picture, where the program must use semantic units to estimate the likelihood of what was heard. The main goal of the program is to help computer-applications graduate students understand the basics of programming languages. After the system was completed, the following results related to this goal were achieved. Students pursuing a master's degree in computer applications should be able to become familiar with the code and the application; once they can install and run the app, they can explore it by executing the various objects loaded when the application is launched. For the user to follow the read-aloud text, they should be able to select the desired PDF, convert it to sound, and see the text displayed in a PyQt5 label. Students with dyslexia or other reading difficulties should also be able to use it.

III. RELATED WORK

A. Travel Expressions: Hindi-English Speech-to-Speech Translation System
Mrinalini K. et al. (2015). Speech-to-speech translation systems convert speech signals in the source language into spoken translations in the target language and handle spoken input directly.

B. Speech-to-Speech Translation: A Review
Mahak Dureja and Sumanlata Gautam, Department of CSE, The NorthCap University, Gurgaon, November 2015. Speech-to-speech translation (S2ST) involves translating speech from one language to another. It can be performed with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS) subsystems, all of which are text-centric.

C. "Text-to-speech conversion system for Spanish in real time," acoustics, speech, and signal processing
J. C. Olabe, A. Santos, R. Martinez, M. Martinez, E. Munoz, A. Quilis, and J. Bernstein, 1984. The goal of the research was to develop a text-to-speech converter for Spanish that accepts a continuous stream of alphanumeric characters (up to 250 words per minute) and preserves the basic natural quality of the Spanish language.

D. "A Dynamic Time Warp Integrated Circuit for a 1000-Word Recognition System"
Kavaler, R., et al., IEEE Journal of Solid-State Circuits, vol. SC-22, no. 1, February 1987. The chip implements a dynamic time warping algorithm and forms part of the recognition board of the speech system.

IV. EXISTING SYSTEM

The introduction of voice control systems has changed the way people interact with computers. Voice or speech recognition lets users send requests to a hands-free computer, which processes each request and returns an appropriate response. After years of research and improvement in artificial intelligence and human cognition, voice-controlled techniques have become far more practical and are used in a wide range of situations for human-to-human and human-to-computer interaction and control. One example is voice-based email: blind people can use email more easily thanks to the application developed in the voice-based email system initiative. The primary goal of that system is to provide the essential features, such as voice-based communication and email composition, reading, sending, and receiving, including the ability to send emails by text or by voice. This makes email far easier for blind users.

Because such a system does not need a keyboard or mouse, users can enter data simply by speaking the message. This also makes the structure we are building distinct from those that already exist: unlike other systems that focus primarily on people with visual impairments, our system is not aimed at only one group of users. The system receives instructions from the user and executes them, and it prompts the user to perform certain actions so that the correct services are used.
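The voice-based email systems described above accept spoken commands instead of keyboard input. As a rough illustration only (the paper does not name a library), hands-free input of this kind can be prototyped with the SpeechRecognition package; the Google Web Speech backend used here is an assumption for the sketch.

# Sketch: capture one spoken command from the microphone (assumed SpeechRecognition package).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # reduce background noise
    print("Speak a command, e.g. 'compose', 'read', or 'send'...")
    audio = recognizer.listen(source)

try:
    command = recognizer.recognize_google(audio)  # free web API, prototyping only
    print("Heard:", command.lower())
except sr.UnknownValueError:
    print("Could not understand the audio.")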
V. PROPOSED SYSTEM

In this system, the PDF file is supplied by the user and processed by the OCR model. This includes image pre-processing, image recognition, signature recognition, and feature extraction; the recognized text is then converted into audio format.

Most people are too busy to read books or to convert PDF files into MP3s themselves, which is why third-party applications and web applications are preferred. In this system, we build the application in Python: it decodes a PDF file into an audio file and reads it to the user. The app is more user-friendly because it does not require separate audio files or an MP3 player; the user only has to select the PDF file they want to listen to.
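The proposed Python application is not reproduced in the paper. A minimal sketch of the text-PDF path (without OCR), assuming the PyPDF2 (3.x) and pyttsx3 packages, might look as follows; the file name is illustrative.

# Sketch: extract text from a PDF and read it aloud (assumed PyPDF2 + pyttsx3).
import PyPDF2
import pyttsx3

def read_pdf_aloud(pdf_path: str) -> None:
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        # Concatenate the text of every page.
        text = "\n".join(page.extract_text() or "" for page in reader.pages)

    engine = pyttsx3.init()          # offline TTS engine
    engine.setProperty("rate", 150)  # speaking speed (words per minute)
    engine.say(text)
    engine.runAndWait()

read_pdf_aloud("sample.pdf")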
 Advantages
A speaker test requires a sample of the target speaker and an example that captures the general characteristics of other speakers.

VI. BLOCK DIAGRAM

Fig. 1 Block Diagram


 Applications
It is also useful for blind people who cannot read well, and the conversion allows students to listen to handwritten notes when preparing for exams.

VII. MODULES

 Obtaining images
 Initial processing
 Extraction of features
 Dividing
 Grouping

 Obtaining Images
The process of obtaining an image from a source is known as image acquisition. Hardware such as cameras, as well as databases, encoders, and sensors, can be used to accomplish this.
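As a rough sketch of this step (the paper does not specify a library), an image can be acquired either from a file on disk or directly from an attached camera using OpenCV; the device index 0 and the file names are assumptions.

# Sketch: acquire a page image from disk or from a camera (assumed OpenCV).
import cv2

# From a stored file or database export:
page = cv2.imread("scanned_page.png")

# Or from an attached camera (device 0):
camera = cv2.VideoCapture(0)
ok, frame = camera.read()
camera.release()
if ok:
    cv2.imwrite("captured_page.png", frame)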
 Initial Processing
The primary goal of image pre-processing is to improve the data, such as images, by minimizing undesired distortions or enhancing certain features; in other words, by removing undesired disturbances from the image.
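A minimal pre-processing pass along these lines, assuming OpenCV, might convert the page to grayscale, suppress noise, and binarize it before recognition; the exact operations used by the system are not stated in the paper.

# Sketch: clean up a page image before OCR (assumed OpenCV).
import cv2

image = cv2.imread("captured_page.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # drop colour information
gray = cv2.medianBlur(gray, 3)                   # remove salt-and-pepper noise
# Otsu's method picks a global threshold automatically.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("preprocessed_page.png", binary)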

 Extraction of Features
This is a step in the dimensionality reduction process that splits an initial collection of raw data and reduces it into more manageable groups.

 Dividing
This is the process of segmenting an image's pixels into a labelled image, so that only the significant portions of the image are processed rather than the whole image.
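Segmentation of this kind can be sketched with OpenCV contours: connected dark regions of the binarized page are labelled and cropped so that only significant regions are passed on. This is an illustration under those assumptions, not the paper's exact method.

# Sketch: label and crop candidate text regions from the binarized page (assumed OpenCV).
import cv2

binary = cv2.imread("preprocessed_page.png", cv2.IMREAD_GRAYSCALE)
inverted = cv2.bitwise_not(binary)  # contours are found on white-on-black images
contours, _ = cv2.findContours(inverted, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

regions = []
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w * h > 100:                     # ignore specks of noise
        regions.append(binary[y:y + h, x:x + w])
print(f"Found {len(regions)} candidate text regions.")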

 Grouping
This is the task of precisely recognizing what is in the picture. The model carries out this step once it has been trained to identify different classes; for instance, a model could be taught to identify three distinct species in an image.

VIII. IMPLEMENTATION

The user selects PDF files through the PDF to Audio Converter GUI and must click the play button to extract and read the text from the PDF file. The app also has a text display label and play, pause, and stop buttons. The application is designed so that playback cannot be stopped until the speaker has finished reading the extracted text, and the text in the display label becomes visible only after the extracted text has been read.

In this PDF to audio converter, the user selects a PDF file from the desired location and then clicks the play button. If the PDF file contains page numbers, the text is extracted from the file and printed to the console, and the extracted text is then read aloud. After it has been read, the text is printed to the QtLabel in the GUI. If the PDF file does not contain page numbers, these steps are not performed.
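The GUI itself is not reproduced in the paper. A stripped-down sketch of such a converter window, assuming PyQt5, PyPDF2, and pyttsx3, is shown below; only the select/play path and the text label are illustrated, and the class and widget names are invented for the example.

# Sketch: minimal PDF-to-audio converter window (assumed PyQt5 + PyPDF2 + pyttsx3).
import sys
import PyPDF2
import pyttsx3
from PyQt5.QtWidgets import (QApplication, QFileDialog, QLabel,
                             QPushButton, QVBoxLayout, QWidget)

class PdfToAudioWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("PDF to Audio Converter")
        self.text = ""
        self.label = QLabel("No PDF selected.")
        self.label.setWordWrap(True)
        select_button = QPushButton("Select PDF")
        play_button = QPushButton("Play")
        select_button.clicked.connect(self.select_pdf)
        play_button.clicked.connect(self.play_text)
        layout = QVBoxLayout(self)
        for widget in (select_button, play_button, self.label):
            layout.addWidget(widget)

    def select_pdf(self):
        path, _ = QFileDialog.getOpenFileName(self, "Choose PDF", "", "PDF files (*.pdf)")
        if path:
            with open(path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                self.text = "\n".join(p.extract_text() or "" for p in reader.pages)
            self.label.setText("PDF loaded. Press Play.")

    def play_text(self):
        engine = pyttsx3.init()
        engine.say(self.text)
        engine.runAndWait()              # blocks until speech finishes
        self.label.setText(self.text)    # show the text after it has been read

app = QApplication(sys.argv)
window = PdfToAudioWindow()
window.show()
sys.exit(app.exec_())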


IX. RESULT AND DISCUSSION

This is not an age in which one can carry a book everywhere; carrying and reading a printed book is becoming harder every day, and this application was built as an answer to that. The app helps convert hard-copy books into PDF format: when the user photographs a hard-copy page with a mobile camera, it is converted into PDF form, and optical character recognition and text-to-speech are used to read the PDF-formatted pages aloud. The software lets the user update the e-book with one click on the hard copy and saves the resulting audio document in the database. By installing this application, users can read books on their mobile devices and convert notes and books into audio documents in a few clicks. The app also helps blind users by allowing them to listen to documents anytime, anywhere.
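As a hedged illustration of the hard-copy path described above (the paper credits Google Cloud Vision, but an offline sketch is easier to reproduce), a phone-camera photo can be recognized with Tesseract and saved as audio; pytesseract and gTTS are assumptions here, not the authors' stated stack.

# Sketch: photo of a printed page -> text -> MP3 (assumed pytesseract + gTTS).
from PIL import Image
import pytesseract
from gtts import gTTS

photo = Image.open("book_page_photo.jpg")       # captured with a phone camera
text = pytesseract.image_to_string(photo)       # requires the Tesseract binary

if text.strip():
    gTTS(text=text, lang="en").save("book_page.mp3")
    print("Saved book_page.mp3")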
 Screen Shots:

Fig. 2 Home page

Fig. 3 Interface

Fig. 4 Uploading PDF

Fig. 5 Downloaded File

Fig. 6 Result

X. CONCLUSION

In this work, we presented a novel speaker mapping approach using deep learning. The weight matrix is extracted with a Restricted Boltzmann Machine (RBM) and then fed to a Convolutional Neural Network (CNN). In this context we make three main contributions. First, the extracted features can be used for speaker identification without spectrograms. Second, a new approach to handling target and non-target speakers in the speaker verification task is provided. Third, we propose a new strategy to generate additional speaker data using the Universal Background Model (UBM). The three proposed strategies were evaluated experimentally on the THUYG-20 SRE corpus at three noise levels (M1: CNN/UBM, M2: M1+M2, CNN/NEARC), and the results show that the proposed approach outperforms existing techniques; the equal error rate (EER) is reduced significantly when M1 and M2 are used together. This article also presented an OCR methodology for recognizing text in PDF files and converting it to MP3 audio. OCR is a powerful technology with a great deal of untapped potential. Our goal is to refine the model further so that it can process any type of document, and to give every picture a voice so that visually impaired individuals and young children can understand the content with ease. For the vast majority of users the model will be easier to use, although improving the listener's comfort with the synthetic voice will require a good understanding of sound.
PDF file contains Take a PDF extract text and print it in
CMD Windows. Examine the retrieved text. QtLabel PDF
print text cannot be extracted. Exit-Exit Converter Page no
e- No page no'sISSN: 2582-5208 Vol. 02, Issue 12,
December 2020, International Research Journal of
Modernization in Engineering Technology and Science
Factor of Impact: 5.354 irjmets.com International Research
Journal of Modernization in Engineering, Technology and
Science [566] can be found at www.irjmets.com. The written
work shall be incorporated (PyQt5: Label) It can be further
Fig . 6 Result enhanced to display text while the speaker reads the
extracted text, which makes it easier for students to
X. CONCLUSION understand, but only after reading the complete extracted
information.
In this work, we furnished a unique speaker mapping
approach using deep gaining knowledge of systems. The
weight matrix is extracted using a Restricted Boltzmann
Machine (RBM) and then fed to a Convolutional Neural
Network (CNN). In this context, we make 3 most important
contributions. First, Rhonchus can be used for speaker


