Capstone Project Report (ATS)


TEXT SUMMARIZATION

A PROJECT REPORT

Submitted by

Nikhil Murugesan (18BCE10173)


Muskan Sharma (18BCE10166)
Jayasurya M (18BCE10126)

in partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE AND ENGINEERING

SCHOOL OF COMPUTING SCIENCE AND ENGINEERING


VIT BHOPAL UNIVERSITY
KOTHRIKALAN, SEHORE
MADHYA PRADESH - 466114

APRIL 2022
VIT BHOPAL UNIVERSITY, KOTHRIKALAN, SEHORE
MADHYA PRADESH – 466114

BONAFIDE CERTIFICATE

Certified that this project report titled “TEXT SUMMARIZATION” is the bonafide

work of Nikhil Murugesan (18BCE10173), Muskan Sharma (18BCE10166), and

Jayasurya M (18BCE10126), who carried out the project work under my supervision.

Certified further that, to the best of my knowledge, the work reported herein does

not form part of any other project/research work on the basis of which a degree or award

was conferred on an earlier occasion on this or any other candidate.

PROGRAM CHAIR
Dr. Sandip Mal, Senior Assistant Professor
School of Computing Science and Engineering
VIT BHOPAL UNIVERSITY

PROJECT GUIDE
Dr. AVR Mayuri, Senior Assistant Professor
School of Computing Science and Engineering
VIT BHOPAL UNIVERSITY

The Project Exhibition I Examination is held on _______________


ACKNOWLEDGEMENT

First and foremost, I would like to thank the Lord Almighty for His presence and immense blessings

throughout the project work.

I wish to express my heartfelt gratitude to Dr. Sandip Mal, Head of the Department, School of

Computing Science and Engineering, for his valuable support and encouragement in

carrying out this work.

I would like to thank my internal guide, Dr. AVR Mayuri, for continually guiding and actively

participating in my project and giving valuable suggestions to complete the project work.

I would like to thank all the technical and teaching staff of the School of Computing Science and

Engineering, who directly or indirectly extended their support.

Last, but not least, I am deeply indebted to my parents who have been the greatest support while I

worked day and night for the project to make it a success.


LIST OF ABBREVIATIONS

ATS - Automatic Text Summarization

NLP - Natural Language Processing

RNN - Recurrent Neural Network

GRU - Gated Recurrent Unit

LSTM - Long Short-Term Memory


LIST OF FIGURES

FIGURE NO.  TITLE                    PAGE NO.

1           System Architecture           18

2           User Interface Design         18


ABSTRACT

Navigating through hundreds of documents to find interesting information is a

tough job and a waste of time and effort. It is very difficult for human beings to

manually extract the summary of a large text document. There is plenty of text

material available on the internet, so there is the twin problem of searching for

relevant documents among the many available and of absorbing the relevant

information from them. To solve these two problems, automatic text

summarization is very much necessary.

Automatic text summarization is a technique for creating a compressed form

of a single document or of multiple documents to tackle this problem. The most

important benefits of using a summary are its reduced reading time and the

quick guide it provides to the interesting information. Automatic text summarization

techniques aim to find the most important text units and present them as a summary

of the original document.
TABLE OF CONTENTS

CHAPTER NO.  TITLE                                             PAGE NO.

             List of Abbreviations                                    4
             List of Figures                                          5
             Abstract                                                 6

1            CHAPTER 1: PROJECT DESCRIPTION AND OUTLINE
             1.1 Introduction                                         9
             1.2 Motivation for the Work                              9
             1.3 Problem Statement                                   10
             1.4 Aim & Objective                                     10

2            CHAPTER 2: RELATED WORK INVESTIGATION                  11
             2.1 Introduction                                        11
             2.2 Existing Approaches/Methods                         12
             2.2.1 Approaches/Methods - 1                            12
             2.3 Pros and Cons of the Stated Approaches/Methods
             2.4 Issues/Observations from Investigation

3            CHAPTER 3: REQUIREMENT ARTIFACTS                       13
             3.1 Introduction                                        13
             3.2 Hardware and Software Requirements
             3.3 Specific Project Requirements                       14
             3.3.1 Data Requirement                                  14
             3.3.2 Functions Requirement
             3.3.3 Performance and Security Requirement
             3.3.4 Look and Feel Requirements                        15

4            CHAPTER 4: DESIGN METHODOLOGY AND ITS NOVELTY          16
             4.1 Methodology and Goal                                17
             4.2 Functional Modules Design and Analysis              18
             4.3 System Architecture Design                          18
             4.4 User Interface Designs

5            CHAPTER 5: TECHNICAL IMPLEMENTATION & ANALYSIS
             5.1 Technical Coding and Code Solutions                 19
             5.2 Test and Validation                              22-37
             5.3 Performance Analysis                                39

6            CHAPTER 6: PROJECT OUTCOME AND APPLICABILITY           39
             6.1 Key Implementation Outlines of the System           40
             6.2 Significant Project Outcomes                        40
             6.3 Project Applicability on Real-World Applications

7            CHAPTER 7: CONCLUSIONS AND RECOMMENDATION
             7.1 Limitations/Constraints of the System               41
             7.2 Future Enhancements                                 42

8            References                                              43

CHAPTER 1

PROJECT DESCRIPTION AND OUTLINE

1.1 Introduction

Text summarization refers to the technique of shortening long pieces of text. Before turning to
text summarization, we must first know what a summary is. A summary is a short text, formed
from one or more source texts, that conveys the important information in the original. The
purpose of automatic text summarization is to present the source text as a shorter version that
preserves its meaning. A summary reduces reading time. The intention is to create a coherent
and fluent summary containing only the main points outlined in the document.

1.2 Motivation

An enterprise produces a huge amount of data every day, and most of it is either unstructured
or very long. It takes a lot of effort and time to process this data manually. Text summarization
refers to the technique of shortening long pieces of text using machine learning and natural
language processing. The intention is to create a coherent and fluent summary containing only
the main points outlined in the document. Text summarization is increasingly being used in the
commercial sector, for example in the telecommunication industry, in data mining of text
databases, in web-based information retrieval, and in word-processing tools. Approaches differ
in how they formulate the problem. Automatic text summarization is an important step for
information management tasks: it solves the problem of selecting the most important portions
of the text. High-quality summarization requires sophisticated NLP techniques.

1.3 Problem Statement

Text summarization is one of those applications of Natural Language Processing (NLP) that
is bound to have a huge impact on our lives. With growing digital media and ever-growing
publishing, there is rarely time to go through entire articles, documents, and books to decide
whether they are useful or not. The explosion of electronic documents has made it difficult for
users to extract useful information from them. Because of the sheer volume of information,
users leave many relevant and interesting documents unread. This demands an automatic text
summarizer that can generate a concise and meaningful summary of text from multiple
sources such as books, news articles, blog posts, research papers, emails, and tweets.

1.4 Aim & Objective

Today, our world is inundated with the gathering and dissemination of huge amounts of data.

The International Data Corporation (IDC) projects that the total amount of digital data
circulating annually around the world will grow from 4.4 zettabytes in 2013 to 180
zettabytes in 2025. With such a large amount of data circulating in the digital space, there is a
need to develop machine learning algorithms that can automatically shorten longer texts and
deliver accurate summaries that fluently convey the intended message.
Our objective is to apply text summarization to reduce reading time, accelerate the process
of researching information, and increase the amount of information that can fit in a given area.

CHAPTER-2

RELATED WORK INVESTIGATION

2.1 Introduction

With the advancement of technology, the internet is accessible through various devices
like smartphones and smartwatches, and is within the reach of the common people. That
puts a vast amount of information within reach through the World Wide Web (WWW).
With so much information on the internet, it becomes difficult to select only the required
information from large texts. Because of this flood of information, manual summarization
is very challenging and also a time-consuming task. To overcome this challenge, the idea
of building a working automatic text summarizer was born. ATS uses NLP to generate
small summaries of big text documents in a few minutes.

2.2 Existing Approaches/Methods

2.2.1 Approaches/Methods -1

Numerous approaches for identifying important content for automatic text summarization have been
developed to date. Topic representation approaches first derive an intermediate representation of the
text that captures the topics discussed in the input. Based on these representations of topics,
sentences in the input document are scored for importance. In contrast, in indicator representation
approaches, the text is represented by a diverse set of possible indicators of importance which do not
aim at discovering topicality. These indicators are combined, very often using machine learning
techniques, to score the importance of each sentence. Finally, a summary is produced either by
selecting sentences greedily, choosing the sentences that will go into the summary one by one, or by
globally optimizing the selection, choosing the best set of sentences to form a summary. One of the
most common approaches is extractive summarization for short, paragraph-length summaries: such
summarizers identify the most important sentences in the input, which can be either a single
document or a cluster of related documents, and string them together to form a summary. The
decision about what content is important is driven primarily by the input to the summarizer. A
minimal frequency-based extractive scorer is sketched below.
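
This sketch is ours, not a method from the surveyed literature; the function name and the choice of word-frequency scoring are illustrative assumptions:

# A minimal sketch of frequency-based extractive summarization (illustrative
# only; assumes the NLTK 'punkt' tokenizer data has been downloaded).
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

def extractive_summary(text, n=3):
    sentences = sent_tokenize(text)
    # Intermediate representation: corpus-wide word frequencies.
    freq = Counter(w.lower() for w in word_tokenize(text) if w.isalpha())
    # Score each sentence by the frequencies of the words it contains.
    scores = [sum(freq[w.lower()] for w in word_tokenize(s)) for s in sentences]
    # Greedily keep the n highest-scoring sentences, in document order.
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n])
    return " ".join(sentences[i] for i in top)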

2.3 Pros and cons of the stated Approaches/Methods

It has been observed that, in the context of multi-document summarization of news articles,
extraction may be inappropriate because it may produce summaries which are overly verbose or
biased towards some sources, whereas abstractive summarization gives us a short, correct
summary of the document.

2.4 Issues/observations from investigation

As the availability of data in the form of text increases day by day, it becomes difficult to read
whole textual documents to find the required information, which is both difficult and time-consuming
for a human being. ATS therefore plays an important role by providing a summary of a whole text
document, extracting only the useful information and sentences. There are different approaches to
text summarization. Real-world applications of text summarization include document
summarization, news and article summarization, review systems, recommendation systems, social
media monitoring, and survey response systems. This chapter provides a literature review of various
research works in the field of automatic text summarization. The area can be explored further by
examining existing systems and working on new techniques of NLP and machine learning.

Related work

Developing learning algorithms for distributed compositional semantics of words has been a
longstanding open problem at the intersection of language understanding and machine learning. In
recent years, several approaches have been developed for learning composition operators that map
word vectors to sentence vectors, including recursive networks, recurrent networks, convolutional
networks, and recursive-convolutional methods, among others. There are many methods that
summarize documents by first finding the topics of the document and then scoring the individual
sentences with respect to those topics. Sentence clustering has been successfully applied in document
summarization to discover the topics conveyed in a document collection. All of these methods
produce sentence representations that are passed to a supervised task and depend on a class label to
backpropagate through the composition weights. Consequently, these methods learn high-quality
sentence representations but are tuned only for their respective task. Our model is an alternative to
the above models in that it can learn unsupervised sentence representations by introducing a
distributed sentence indicator as part of a neural language model.

CHAPTER-3

REQUIREMENT ARTIFACTS

3.1 Hardware Requirements

● Multi-core CPU with Core i7 or higher

● 16GB RAM or higher

● 128GB SSD or higher

3.2 Software Requirements

● Visual Studio Code

Visual Studio Code is a lightweight but powerful source code editor. It comes with built-in
support for JavaScript, TypeScript, and Node.js and has a rich ecosystem of extensions for
other languages such as C++, C#, Java, Python, PHP, and Go. In our project, this source code
editor is used as our development environment for React.js, the Django framework, and the
REST API framework.

● Anaconda

Anaconda is a distribution of the Python and R programming languages for scientific
computing that aims to simplify package management and deployment. The distribution
includes data-science packages suitable for Windows, Linux, and macOS. We have used it
to develop and test our text summarization model.

● Google Chrome

Google Chrome is a cross-platform web browser developed by Google. We have used this to
debug our front end and server responses. Google Chrome includes a built-in console in
which the process and the responses can be logged. This eased the testing process of the
project.

3.3 Specific Project requirements

3.3.1 Data requirement

● The system shall never display the contents of the documents to anyone on the internet.

● The system's backend shall be encrypted.

● The system’s backend shall only be accessible to the administrators.

3.3.2 Functions requirement


1. DOCUMENT MANAGEMENT:

● Ability to add, remove, and update the document for summarization.

● Ability to upload a document multiple times to obtain different summaries.

3.3.3 Performance and security requirement

● The project shall be web-based and depends upon the optimization capabilities of the
web server.

● The initial load time shall depend upon the strength of the network the user is using to
access the Internet.

● The performance of the project shall not be affected by the hardware specifications of the
user.

● Secure sockets are to be used in all transactions involving any confidential information of the
user.

● The application shall not leave any cookies on the user’s system (computer/laptop/phone)
containing the user’s credentials.

3.3.4 Look and Feel Requirement

● The web application provides storage of all uploaded documents on redundant computers
with automatic switchover.

● The backup of the server is constantly maintained and updated to reflect the most recent
changes.

● A commercial deployment site is used to host the application, and the application server
takes care of the site.

● In case of failure, the program will be re-initialized.

CHAPTER-4

DESIGN METHODOLOGY AND NOVELTY

4.1 Methodology and goal

Our summarization model is an encoder-decoder model. That is, an encoder maps words to a
sentence vector and a decoder is used to generate the surrounding sentences. Encoder-decoder
models have gained a lot of traction for neural machine translation. In that setting, an encoder is used
to map, for example, an English sentence into a vector, and the decoder then conditions on this
vector to generate a translation of the source English sentence.

4.2 Functional Module designs and analysis

The system interface is developed using the Django framework with the REST API framework.
Django framework manages the backend working of the website. It is responsible to process the
requests from the client then process the PDF and send the input to the summarization model . REST
API is responsible for transporting the JSON format input and output between the backend and the
client system. REST API logs all the requests received from the client in a JSON string format. This
API view can be accessed by the admin for maintenance purposes since the view provides the
creation, update, reset and delete record option.
For Text Summarization, we first train a model that takes in a dataset and then convert them
into sentence tuples. Given a tuple (si-1; si; si+1) of contiguous sentences, with si the ith sentence of
the dataset the sentence si is encoded into a vector representation and tries to reconstruct the
previous sentence si-1 and next sentence si+1. We then freeze this model and save it as an encoder.
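
A minimal sketch of building such contiguous-sentence tuples from a sentence list (the helper name make_tuples is our own, not the project's):

# Sketch: build (s_{i-1}, s_i, s_{i+1}) training tuples from a sentence list.
def make_tuples(sentences):
    return [(sentences[i - 1], sentences[i], sentences[i + 1])
            for i in range(1, len(sentences) - 1)]

# e.g. make_tuples(["A.", "B.", "C.", "D."]) yields
# [("A.", "B.", "C."), ("B.", "C.", "D.")]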

In our model, we use a recurrent neural network (RNN) encoder with gated recurrent unit
(GRU) activations and an RNN decoder with a conditional GRU. This model combination is nearly
identical to the RNN encoder-decoder used in neural machine translation. GRU has been shown to
perform as well as LSTM on sequence modeling tasks while being conceptually simpler: GRU units
have only two gates and do not require the use of a memory cell. While we use RNNs for our model,
any encoder and decoder can be used so long as we can backpropagate through them.

4.3 System Architecture design

PREPROCESSING: There are three steps in preprocessing. First, stop words are removed from the
text. Stop words are frequently occurring words such as 'a', 'an', and 'the' that carry little meaning and
contribute noise; they are predefined and stored in an array. Second, tokenization separates the input
text into individual tokens; punctuation marks, spaces, and word terminators are the word-breaking
characters. Third, word stemming converts every word into its root form by removing its prefix and
suffix for comparison with other words. A sketch of these steps appears below.
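
A minimal sketch of the three steps, assuming NLTK (illustrative only, not the project's exact preprocessing code):

# Sketch of the three preprocessing steps using NLTK (illustrative only;
# assumes the 'punkt' and 'stopwords' data have been downloaded).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess_tokens(text):
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text)                                          # tokenization
    tokens = [t for t in tokens if t.isalpha() and t.lower() not in stop] # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                              # stemming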

Encoder: The encoder is typically a GRU-RNN which generates a fixed-length vector representation
h(i) for each sentence S(i) in the input. The encoded representation h(i) is obtained by passing the
final hidden state of the GRU cell (i.e., after it has seen the entire sentence) through multiple dense
layers. The encoder produces vectors in batches of sentences of the same length for optimization
purposes. Each vector is a NumPy array with as many rows as the length of the sentence.

Decoder: The decoder is a neural language model which conditions on the encoder output. The
computation is like that of the encoder, except that we introduce matrices that are used to bias the
update gate, reset gate, and hidden state computation by the sentence vector. One decoder is used for
the next sentence, while a second decoder is used for the previous sentence; separate parameters are
used for each decoder.

4.4 User Interface design

The user interface is the front end of the webpage. The webpage consists of a drop-zone area where
the client can drop or browse a PDF file to summarize. This PDF is sent to the backend framework
through the REST API in JSON format. The JSON file is received by the Django framework, and the
PyPDF2 module then extracts the text from the PDF file. The text is sent to the text summarization
model, and the result is the summary. The front end converts this string into a JavaScript object from
which the information is displayed on the client user interface, and the client can then download the
summary. During this whole process, the user interface reports the status of the backend processing
once the client chooses to process the PDF file; the user has to wait for some time before the result is
displayed on the webpage. A hedged sketch of this endpoint follows.
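
As a sketch of this flow (not the project's exact code; the view name SummarizeView and the summarize() call are our assumptions), the upload endpoint could look like this with Django REST Framework and a recent PyPDF2:

# Sketch of the upload endpoint, assuming Django REST Framework and PyPDF2 >= 2.0.
from rest_framework.views import APIView
from rest_framework.response import Response
from PyPDF2 import PdfReader

class SummarizeView(APIView):                      # hypothetical view name
    def post(self, request):
        pdf = PdfReader(request.FILES["document"]) # PDF from the front-end drop zone
        text = " ".join(page.extract_text() or "" for page in pdf.pages)
        summary = summarize(text)                  # hypothetical call into the model
        return Response({"summary": summary})      # JSON consumed by the React client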
CHAPTER-5

TECHNICAL IMPLEMENTATION & ANALYSIS

5.1 Technical coding and code solutions


from nltk.tokenize import sent_tokenize
import re

# Dictionary of contraction words and their expansions

contractions = {
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

# Replace these contractions in the given text.
# Sort keys longest-first so that e.g. "can't've" matches before "can't".
contractions_re = re.compile('(%s)' % '|'.join(
    sorted(map(re.escape, contractions.keys()), key=len, reverse=True)))

def expand_contractions(s, contractions_dict=contractions):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

# Read the input document, expand contractions and split it into sentences.
with open("text.txt", "r") as file:
    data = file.read().replace("\n", " ")  # join lines with a space so words do not merge
passage = "".join(data)
sentences = sent_tokenize(passage)
sentences = [expand_contractions(i) for i in sentences]
sentences = [re.sub('\n', '', i) for i in sentences]

# Loading the pre-trained skip-thoughts model

import skipthoughts

model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
encoded = encoder.encode(sentences)

# K-means clustering and summary formation

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

n_clusters = int(np.ceil(len(encoded) ** 0.6))
print(n_clusters)

kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

# Average position in the document of the sentences in each cluster,
# used later to order the summary sentences.
avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))

# Pick the sentence closest to each cluster centre and join the picks
# in document order.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])
print(summary)

'''
Skip-thought vectors
'''
import os
# os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=cuda,floatX=float32"

import theano
import theano.tensor as tensor

import pickle as pkl

import numpy
import nltk
import warnings
from collections import OrderedDict, defaultdict
from scipy.linalg import norm
from nltk.tokenize import word_tokenize

profile = False

# ---------------------------------------------------------------------------- #
# Specify model and table locations here
# ---------------------------------------------------------------------------- #
path_to_models = 'C:\\model files\\'
path_to_tables = 'C:\\model files\\'
# ---------------------------------------------------------------------------- #

path_to_umodel = path_to_models + 'uni_skip.npz'
path_to_bmodel = path_to_models + 'bi_skip.npz'

def load_model():
    """
    Load the model with saved tables
    """
    # Load model options
    print('Loading model parameters...')
    with open('%s.pkl' % path_to_umodel, 'rb') as f:
        uoptions = pkl.load(f)
    with open('%s.pkl' % path_to_bmodel, 'rb') as f:
        boptions = pkl.load(f)

    # Load parameters
    uparams = init_params(uoptions)
    uparams = load_params(path_to_umodel, uparams)
    utparams = init_tparams(uparams)
    bparams = init_params_bi(boptions)
    bparams = load_params(path_to_bmodel, bparams)
    btparams = init_tparams(bparams)

    # Extractor functions
    print('Compiling encoders...')
    embedding, x_mask, ctxw2v = build_encoder(utparams, uoptions)
    f_w2v = theano.function([embedding, x_mask], ctxw2v, name='f_w2v')
    embedding, x_mask, ctxw2v = build_encoder_bi(btparams, boptions)
    f_w2v2 = theano.function([embedding, x_mask], ctxw2v, name='f_w2v2')

    # Tables
    print('Loading tables...')
    utable, btable = load_tables()

    # Store everything we need in a dictionary
    print('Packing up...')
    model = {}
    model['uoptions'] = uoptions
    model['boptions'] = boptions
    model['utable'] = utable
    model['btable'] = btable
    model['f_w2v'] = f_w2v
    model['f_w2v2'] = f_w2v2

    return model

def load_tables():
    """
    Load the tables
    """
    words = []
    utable = numpy.load(path_to_tables + 'utable.npy',
                        allow_pickle=True, encoding='bytes')
    btable = numpy.load(path_to_tables + 'btable.npy',
                        allow_pickle=True, encoding='bytes')
    f = open(path_to_tables + 'dictionary.txt', 'rb')
    for line in f:
        words.append(line.decode('utf-8').strip())
    f.close()
    utable = OrderedDict(zip(words, utable))
    btable = OrderedDict(zip(words, btable))
    return utable, btable

class Encoder(object):
    """
    Sentence encoder.
    """

    def __init__(self, model):
        self._model = model

    def encode(self, X, use_norm=True, verbose=True, batch_size=128,
               use_eos=False):
        """
        Encode sentences in the list X. Each entry will return a vector
        """
        return encode(self._model, X, use_norm, verbose, batch_size, use_eos)

def encode(model, X, use_norm=True, verbose=True, batch_size=128,
           use_eos=False):
    """
    Encode sentences in the list X. Each entry will return a vector
    """
    # first, do preprocessing
    X = preprocess(X)

    # word dictionary and init
    d = defaultdict(lambda: 0)
    for w in model['utable'].keys():
        d[w] = 1
    ufeatures = numpy.zeros((len(X), model['uoptions']['dim']),
                            dtype='float32')
    bfeatures = numpy.zeros((len(X), 2 * model['boptions']['dim']),
                            dtype='float32')

    # length dictionary
    ds = defaultdict(list)
    captions = [s.split() for s in X]
    for i, s in enumerate(captions):
        ds[len(s)].append(i)

    # Get features. Encodes by length, in order to avoid wasting computation
    for k in ds.keys():
        if verbose:
            print(k)
        numbatches = int(len(ds[k]) / batch_size + 1)
        for minibatch in range(numbatches):
            caps = ds[k][minibatch::numbatches]

            if use_eos:
                uembedding = numpy.zeros((k + 1, len(caps),
                                          model['uoptions']['dim_word']), dtype='float32')
                bembedding = numpy.zeros((k + 1, len(caps),
                                          model['boptions']['dim_word']), dtype='float32')
            else:
                uembedding = numpy.zeros((k, len(caps),
                                          model['uoptions']['dim_word']), dtype='float32')
                bembedding = numpy.zeros((k, len(caps),
                                          model['boptions']['dim_word']), dtype='float32')
            for ind, c in enumerate(caps):
                caption = captions[c]
                for j in range(len(caption)):
                    if d[caption[j]] > 0:
                        uembedding[j, ind] = model['utable'][caption[j]]
                        bembedding[j, ind] = model['btable'][caption[j]]
                    else:
                        uembedding[j, ind] = model['utable']['UNK']
                        bembedding[j, ind] = model['btable']['UNK']
                if use_eos:
                    uembedding[-1, ind] = model['utable']['<eos>']
                    bembedding[-1, ind] = model['btable']['<eos>']
            if use_eos:
                uff = model['f_w2v'](uembedding,
                                     numpy.ones((len(caption) + 1, len(caps)), dtype='float32'))
                bff = model['f_w2v2'](bembedding,
                                      numpy.ones((len(caption) + 1, len(caps)), dtype='float32'))
            else:
                uff = model['f_w2v'](uembedding,
                                     numpy.ones((len(caption), len(caps)), dtype='float32'))
                bff = model['f_w2v2'](bembedding,
                                      numpy.ones((len(caption), len(caps)), dtype='float32'))
            if use_norm:
                for j in range(len(uff)):
                    uff[j] /= norm(uff[j])
                    bff[j] /= norm(bff[j])
            for ind, c in enumerate(caps):
                ufeatures[c] = uff[ind]
                bfeatures[c] = bff[ind]

    features = numpy.c_[ufeatures, bfeatures]
    return features

def preprocess(text):
    """
    Preprocess text for encoder
    """
    X = []
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    for t in text:
        sents = sent_detector.tokenize(t)
        result = ''
        for s in sents:
            tokens = word_tokenize(s)
            result += ' ' + ' '.join(tokens)
        X.append(result)
    return X

def nn(model, text, vectors, query, k=5):
    """
    Return the nearest neighbour sentences to query
    text: list of sentences
    vectors: the corresponding representations for text
    query: a string to search
    """
    qf = encode(model, [query])
    qf /= norm(qf)
    scores = numpy.dot(qf, vectors.T).flatten()
    sorted_args = numpy.argsort(scores)[::-1]
    sentences = [text[a] for a in sorted_args[:k]]
    print('QUERY: ' + query)
    print('NEAREST: ')
    for i, s in enumerate(sentences):
        print(s, sorted_args[i])

def word_features(table):
    """
    Extract word features into a normalized matrix
    """
    features = numpy.zeros((len(table), 620), dtype='float32')
    keys = list(table.keys())  # list() so the keys can be indexed in Python 3
    for i in range(len(table)):
        f = table[keys[i]]
        features[i] = f / norm(f)
    return features

def nn_words(table, wordvecs, query, k=10):
    """
    Get the nearest neighbour words
    """
    keys = list(table.keys())  # list() so the keys can be indexed in Python 3
    qf = table[query]
    scores = numpy.dot(qf, wordvecs.T).flatten()
    sorted_args = numpy.argsort(scores)[::-1]
    words = [keys[a] for a in sorted_args[:k]]
    print('QUERY: ' + query)
    print('NEAREST: ')
    for i, w in enumerate(words):
        print(w)

def _p(pp, name):
    """
    make prefix-appended name
    """
    return '%s_%s' % (pp, name)

def init_tparams(params):
    """
    initialize Theano shared variables according to the initial parameters
    """
    tparams = OrderedDict()
    for kk, pp in params.items():
        tparams[kk] = theano.shared(params[kk], name=kk)
    return tparams

def load_params(path, params):
    """
    load parameters
    """
    pp = numpy.load(path)
    for kk, vv in params.items():
        if kk not in pp:
            warnings.warn('%s is not in the archive' % kk)
            continue
        params[kk] = pp[kk]
    return params

# layers: 'name': ('parameter initializer', 'feedforward')
layers = {'gru': ('param_init_gru', 'gru_layer')}

def get_layer(name):
    fns = layers[name]
    return (eval(fns[0]), eval(fns[1]))

def init_params(options):
    """
    initialize all parameters needed for the encoder
    """
    params = OrderedDict()

    # embedding
    params['Wemb'] = norm_weight(options['n_words_src'], options['dim_word'])

    # encoder: GRU
    params = get_layer(options['encoder'])[0](options, params,
                                              prefix='encoder',
                                              nin=options['dim_word'],
                                              dim=options['dim'])
    return params

def init_params_bi(options):
    """
    initialize all parameters needed for the bidirectional encoder
    """
    params = OrderedDict()

    # embedding
    params['Wemb'] = norm_weight(options['n_words_src'], options['dim_word'])

    # encoder: GRU (forward and reverse directions)
    params = get_layer(options['encoder'])[0](options, params,
                                              prefix='encoder', nin=options['dim_word'], dim=options['dim'])
    params = get_layer(options['encoder'])[0](options, params,
                                              prefix='encoder_r', nin=options['dim_word'], dim=options['dim'])
    return params

def build_encoder(tparams, options):
    """
    build an encoder, given pre-computed word embeddings
    """
    # word embedding (source)
    embedding = tensor.tensor3('embedding', dtype='float32')
    x_mask = tensor.matrix('x_mask', dtype='float32')

    # encoder
    proj = get_layer(options['encoder'])[1](tparams, embedding, options,
                                            prefix='encoder',
                                            mask=x_mask)
    ctx = proj[0][-1]

    return embedding, x_mask, ctx

def build_encoder_bi(tparams, options):
    """
    build bidirectional encoder, given pre-computed word embeddings
    """
    # word embedding (source)
    embedding = tensor.tensor3('embedding', dtype='float32')
    embeddingr = embedding[::-1]
    x_mask = tensor.matrix('x_mask', dtype='float32')
    xr_mask = x_mask[::-1]

    # encoder
    proj = get_layer(options['encoder'])[1](tparams, embedding, options,
                                            prefix='encoder',
                                            mask=x_mask)
    projr = get_layer(options['encoder'])[1](tparams, embeddingr, options,
                                             prefix='encoder_r',
                                             mask=xr_mask)

    ctx = tensor.concatenate([proj[0][-1], projr[0][-1]], axis=1)

    return embedding, x_mask, ctx

# some utilities
def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    u, s, v = numpy.linalg.svd(W)
    return u.astype('float32')

def norm_weight(nin, nout=None, scale=0.1, ortho=True):
    if nout is None:
        nout = nin
    if nout == nin and ortho:
        W = ortho_weight(nin)
    else:
        W = numpy.random.uniform(low=-scale, high=scale, size=(nin, nout))
    return W.astype('float32')

def param_init_gru(options, params, prefix='gru', nin=None, dim=None):
    """
    parameter init for GRU
    """
    if nin is None:
        nin = options['dim_proj']
    if dim is None:
        dim = options['dim_proj']
    # W and b: input-to-gate weights and biases for the reset and update gates
    W = numpy.concatenate([norm_weight(nin, dim),
                           norm_weight(nin, dim)], axis=1)
    params[_p(prefix, 'W')] = W
    params[_p(prefix, 'b')] = numpy.zeros((2 * dim,)).astype('float32')
    # U: hidden-to-gate weights
    U = numpy.concatenate([ortho_weight(dim),
                           ortho_weight(dim)], axis=1)
    params[_p(prefix, 'U')] = U
    # Wx, Ux, bx: weights for the candidate hidden state
    Wx = norm_weight(nin, dim)
    params[_p(prefix, 'Wx')] = Wx
    Ux = ortho_weight(dim)
    params[_p(prefix, 'Ux')] = Ux
    params[_p(prefix, 'bx')] = numpy.zeros((dim,)).astype('float32')

    return params

def gru_layer(tparams, state_below, options, prefix='gru', mask=None,
              **kwargs):
    """
    Forward pass through GRU layer
    """
    nsteps = state_below.shape[0]
    if state_below.ndim == 3:
        n_samples = state_below.shape[1]
    else:
        n_samples = 1

    dim = tparams[_p(prefix, 'Ux')].shape[1]

    if mask is None:
        mask = tensor.alloc(1., state_below.shape[0], 1)

    def _slice(_x, n, dim):
        if _x.ndim == 3:
            return _x[:, :, n * dim:(n + 1) * dim]
        return _x[:, n * dim:(n + 1) * dim]

    # Precompute the input contributions to the gates and the candidate state
    state_below_ = (tensor.dot(state_below, tparams[_p(prefix, 'W')]) +
                    tparams[_p(prefix, 'b')])
    state_belowx = (tensor.dot(state_below, tparams[_p(prefix, 'Wx')]) +
                    tparams[_p(prefix, 'bx')])
    U = tparams[_p(prefix, 'U')]
    Ux = tparams[_p(prefix, 'Ux')]

    def _step_slice(m_, x_, xx_, h_, U, Ux):
        preact = tensor.dot(h_, U)
        preact += x_

        # reset gate r and update gate u
        r = tensor.nnet.sigmoid(_slice(preact, 0, dim))
        u = tensor.nnet.sigmoid(_slice(preact, 1, dim))

        # candidate hidden state, with the previous state scaled by the reset gate
        preactx = tensor.dot(h_, Ux)
        preactx = preactx * r
        preactx = preactx + xx_

        h = tensor.tanh(preactx)

        # interpolate between the previous state and the candidate
        h = u * h_ + (1. - u) * h
        h = m_[:, None] * h + (1. - m_)[:, None] * h_

        return h

    seqs = [mask, state_below_, state_belowx]
    _step = _step_slice

    rval, updates = theano.scan(_step,
                                sequences=seqs,
                                outputs_info=[tensor.alloc(0., n_samples, dim)],
                                non_sequences=[tparams[_p(prefix, 'U')],
                                               tparams[_p(prefix, 'Ux')]],
                                name=_p(prefix, '_layers'),
                                n_steps=nsteps,
                                profile=profile,
                                strict=True)
    rval = [rval]
    return rval

5.2 Test and validation

5.2.1 Test approach

The individual segments of the application were tested first, and the text summarizer was tested
against its models. The final application was then integrated and checked with sample documents
containing PDF and plain text.
5.2.2 Features tested

Among the text summarization algorithms we evaluated, skip-thought vectors gave the highest
accuracy and the closest resemblance to a human-written summary. Word vectors improve the
outcome significantly compared to other methods because their main benefit, arguably, is that they
do not require expensive annotation but can be derived from large unannotated corpora that are
readily available. Pre-trained embeddings can then be used in downstream tasks that use only small
amounts of labelled data.

5.2.3 Testing tools and Environment

The majority of the testing was done using Google Chrome and Visual Studio Code. Google Chrome
was used to test the front end of the project, whereas Visual Studio Code was used to test the back
end.

Visual Studio Code is a freeware source-code editor made by Microsoft for Windows, Linux, and
macOS. Features include support for debugging, syntax highlighting, intelligent code completion,
snippets, code refactoring, and embedded Git. We chose this editor because it supports all the
modules and programming languages involved in our project, which eased the development and
testing phases.

5.2.4 Test cases

The test cases involve summarizing various PDF files with different numbers of pages. The benefit
of testing PDF files with different page counts is that it reveals the time difference between
summarizing these files. It also helps in judging the quality of the summary produced, since more
sentences yield more K-means clusters from which to form a summary.

5.2.5 Test procedure


The test procedure involves uploading PDF files that differ in number of pages and format. All the
communication between the server and the client is monitored in the Google Chrome console to
ensure that the right data is being exchanged between both ends.

While the models run, all outputs are logged in the integrated terminal of Visual Studio Code to
ensure that each model works as intended. Once the summarization and the image captioning are
done, the response is logged in the Visual Studio Code terminal to ensure that the output is sent to
the client in the proper format. Once the response is received by the client, it is logged in the Google
Chrome console to ensure that the data received was in the correct format.

5.3 Performance Analysis


Similar to how Word2Vec embeddings are trained by predicting the surrounding words, skip-thought
vectors are trained by predicting the surrounding sentences at positions t-1 and t+1. As this model is
trained, the learned representation (hidden layer) places similar sentences closer together, which
enables higher-performance clustering. A performance issue arises during K-means clustering
because we had to manually provide the best number of clusters in each session to get the highest
performance. After trial and error with several documents, we found that, for the best performance,
the number of clusters should be about 30% of the total number of sentences in the document.
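
In code, that heuristic amounts to the following sketch (note that the script in Section 5.1 uses a related power-law rule, int(np.ceil(len(encoded) ** 0.6)), rather than a fixed fraction):

import numpy as np

# Heuristic from our experiments: about 30% of the sentence count, at least 1 cluster.
n_clusters = max(1, int(np.ceil(0.3 * len(sentences))))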

CHAPTER-6:
PROJECT OUTCOME AND APPLICABILITY

6.1 Key implementation outlines of the System

We successfully implemented ATS using the encoder-decoder algorithm. The key implementation of
our project comprises the master, contraction, and skip-thoughts modules. These three form the
backend (server) of our project, which takes the text from the documents, tokenizes the words,
reduces them, and generates a fresh summary for us.
Another key implementation was the front end of the project, which the user interacts with. We
made our website simple, direct, and extremely user-friendly.

6.2 Significant project outcomes

By the end of this project, we implemented an encoder-decoder algorithm and created an ATS model
for users.

6.3 Project applicability on Real-world applications

Various organisations today, be it online shopping, private-sector organisations, government, the
tourism and catering industry, or any other institute that offers customer services, are all concerned
with learning their customers' feedback each time their services are used.

Considering that these companies receive an enormous amount of feedback and data every single
day, it becomes quite a tedious task for management to analyse each of these data points and come
up with insights. Using machine learning, models have become capable of understanding human
language with the help of NLP (Natural Language Processing).

We can also summarize case studies, research papers, theses, and essays exceeding 700 words using
NLP. This saves a lot of time and energy and increases work efficiency.

CHAPTER-7:
CONCLUSIONS AND RECOMMENDATION
7.1 Constraints of the System

The text summarization model and the image caption model require high computational power to
produce a result in as little time as possible and to handle multiple requests from clients at the same
time. The server requires fast local storage to process all the model features and load the model
constraints when the server starts. Involving GPU processing units improves the performance
significantly. Therefore, if the project is deployed on a full-fledged server with GPU arrays, the
results can be produced within a few seconds even with large PDF files and multiple requests from
clients.

The vanishing gradient problem is a constraint of the recurrent neural network (RNN) model of text
summarization. In an RNN, information travels through the network from the input neurons to the
output neurons, while the error is calculated and propagated back through the network to update the
weights. The cost function compares the network's outputs to the desired outputs, and these values
exist throughout the time series, for every single output.

Essentially, every neuron that participated in the calculation of an output associated with this cost
function should have its weight updated to minimize that error. The thing with RNNs is that it is not
just the neurons directly below the output layer that contributed, but all of the neurons far back in
time. So, you have to propagate back through time to those neurons. The problem relates to updating
the recurrent weight (w_rec), the weight used to connect the hidden layers to themselves in the
unrolled temporal loop.

For instance, to get from x_{t-3} to x_{t-2} we multiply x_{t-3} by w_rec; then, to get from x_{t-2}
to x_{t-1}, we again multiply x_{t-2} by w_rec. So we multiply by the same weight multiple times,
and this raises a problem: when we repeatedly multiply something by a small number, the value
decreases very quickly. The lower the gradient is, the harder it is for the network to update the
weights and the longer it takes to get a result. To mitigate this problem we have used GRU units,
whose gating limits how far back through the network the RNN has to backpropagate. This ensures
that the encoder does not run in an infinite loop and thus produces the result.
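
A quick numeric illustration of the vanishing effect (ours, not from the report): backpropagating t steps through the same recurrent weight scales the gradient roughly like w_rec ** t, which collapses toward zero whenever w_rec < 1.

# Gradient magnitude after t steps through the same recurrent weight w_rec.
w_rec = 0.5
for t in (1, 5, 10, 20):
    print(t, w_rec ** t)   # 0.5, 0.03125, ~0.00098, ~9.5e-07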
7.2 Future Enhancements

The model can be enhanced in a few areas, such as the quality of the hardware on which the
summaries are computed. Our project's speed depends entirely on the performance of the model, the
CPU, and the storage unit. The model can also be deployed on a GPU with better storage units to
improve the speed of the outcome; in theory, performance is up to 10x faster on a GPU with fast
solid-state drives. Alternatively, the model could be built with LSTM units instead of the current
gated GRU units, but in theory the performance improvement of LSTM over GRU units is not
significant.
REFERENCES

[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption
Generator," CVPR 2015.

[2] H. Fang et al., "From Captions to Visual Concepts and Back," CVPR 2015.

[3] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the Long-Short Term Memory
Model for Image Caption Generation," ICCV 2015.

[4] J. R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler,
"Skip-Thought Vectors," NIPS 2015.

[5] M. J. Garbade, "A Quick Introduction to Text Summarization in Machine Learning,"
towardsdatascience.com, 2018.

[6] L. Gonçalves, "Automatic Text Summarization with Machine Learning — An Overview,"
medium.com, 2020.

[7] Sinha et al., "Extractive Text Summarization using Neural Networks," 2018.

[8] Lin et al., "Global Encoding for Abstractive Summarization," 2018.

[9] Y.-C. Chen and M. Bansal, "Fast Abstractive Summarization with Reinforce-Selected
Sentence Rewriting," 2018.

[10] M. Plotz, "Paper Summary: Skip-Thought Vectors," medium.com, 2018.

[11] H. Pedamallu, "RNN vs GRU vs LSTM," medium.com, 2020.

[12] M. Phi, "Illustrated Guide to LSTM and GRU: A Step by Step Explanation,"
towardsdatascience.com, 2018.

[13] S. Yang, X. Yu, and Y. Zhou, "LSTM and GRU Neural Network Performance Comparison
Study: Taking Yelp Review Dataset as an Example," 2020.

[14] Material UI documentation, 2021.

[15] yuvaleros.github.io/material-ui-dropzone, documentation and samples.

[16] Django documentation, Django framework, 2021 (docs.djangoproject.com).

[17] V. Gagliardi, "Tutorial: Django REST with React (and a Sprinkle of Testing)," 2021.

[18] SaaS Pegasus, "Build a Single-Page React Application in a Hybrid Django Project," 2021.

[19] S. Bai and S. An, "A Survey on Automatic Image Caption Generation," 2018.

[20] R. Alguliyev, R. Aliguliyev, and N. Isazade, "A Model for Text Summarization," 2017.

[21] S. A. Babar and P. D. Patil, "Improving Performance of Text Summarization," 2014.
