Data Science Lab Mini Project Report
Topic: Text Summarization
Name: Vemula Yaminee Jyothsna
Roll No: 20BM6JP44
Introduction:
Many sectors today, such as online shopping, government, private organizations, and tourism, rely on customer feedback to improve their services and products. These organizations receive huge volumes of data every day, and this data is not in a structured format. Manually evaluating the reviews and feedback of every customer in order to satisfy their needs would be a very hectic job.
Machine learning, a booming field, can help us understand human language and summarize content in such a way that it highlights the important information in a large text. This can be achieved using a field of machine learning known as Natural Language Processing, abbreviated as NLP.
Some applications of text summarization are:
• Newsletters
• Media monitoring
• Understanding customer satisfaction in the form of reviews
• Social media monitoring
• Video scripting
• Helping disabled people
• Help desks and customer support.
There are two main approaches to text summarization:
• Extractive summarization
• Abstractive summarization
Extractive summarization:
In this approach, the important sentences in the original text are identified and extracted verbatim to form the summary.
Abstractive summarization:
In this approach, new sentences are generated from the original text to form the summary. The sentences in the summary might not appear in the original text at all, in contrast to extractive summarization.
Objective:
To generate a summary for each Amazon fine food customer reviews available at
https://www.kaggle.com/snap/amazon-fine-food-reviews
Dataset Description:
The Amazon Fine Food Reviews dataset, available on Kaggle, is used. This dataset has 10 columns: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text. The Text column contains the customer's detailed review in the form of a paragraph. The dataset contains 568,454 reviews in total.
For this project, we consider only the Text column as input, and the Summary column is used to validate the results from our model.
Algorithms Used:
Two Algorithms have been used for extractive summarization. They are:
• Word frequency
• Term frequency-inverse document frequency (tf-idf)
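The tf-idf variant is not described step by step in this report; it weights words by how rare they are across sentences rather than by raw frequency, so sentences containing distinctive words score higher. A minimal pure-Python sketch, treating each sentence as a "document" for the idf computation (an assumption, since the actual implementation is not stated):

```python
import math
import re

def tfidf_summarize(text, top_k=1):
    # Split into sentences and tokenize each one.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sent_tokens = [re.findall(r'[a-z]+', s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    df = {}
    for tokens in sent_tokens:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    # Inverse document frequency: rarer words weigh more.
    idf = {w: math.log(n / d) + 1 for w, d in df.items()}
    # Score each sentence by the sum of its tf-idf weights.
    scores = []
    for tokens in sent_tokens:
        if not tokens:
            scores.append(0.0)
            continue
        tf = {w: tokens.count(w) / len(tokens) for w in set(tokens)}
        scores.append(sum(tf[w] * idf[w] for w in set(tokens)))
    # Keep the top_k highest-scoring sentences, in original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:top_k]))
```

In practice a library such as scikit-learn's TfidfVectorizer could replace the hand-rolled tf and idf computations above.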
Word frequency:
In this algorithm, we calculate a score for each sentence and build the summary from the highest-scoring sentences. The steps are as follows:
Step-1: Tokenize the text into sentences and words.
Step-2: Remove stop words.
Step-3: Create a frequency table with a score for each remaining word.
Step-4: Calculate the score of each sentence from the frequency table.
Step-5: Calculate the average sentence score and keep the sentences that score above it.
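The steps above can be sketched in a few lines of Python. The small hard-coded stop-word list and the above-average selection threshold are assumptions for illustration (the report does not state which stop-word list or cutoff was used):

```python
import re
from collections import Counter

# Assumed stop-word list; in practice NLTK's list would be used.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "in", "on",
              "of", "to", "and", "it", "this", "that", "for", "with"}

def summarize(text, threshold=1.0):
    # Step 1: tokenize into sentences.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Steps 2-3: frequency table over non-stop-words.
    words = [w for w in re.findall(r'[a-z]+', text.lower())
             if w not in STOP_WORDS]
    freq = Counter(words)
    # Step 4: score each sentence by summing its words' frequencies.
    scores = {s: sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()))
              for s in sentences}
    # Step 5: keep sentences scoring above (threshold x) the average.
    avg = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores[s] > threshold * avg)
```

Raising `threshold` above 1.0 produces shorter summaries; lowering it keeps more of the original text.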
Evaluation measure:
For text summarization in this project, we have used the ROUGE score as the evaluation
measure.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics for
evaluating the summary generated by an algorithm against the original reference summary.
The three components of the ROUGE score are:
• Precision
• Recall
• F-measure
Precision and recall in terms of ROUGE are as follows:
Precision = (number of overlapping words) / (total number of words in the generated summary)
Recall = (number of overlapping words) / (total number of words in the original summary)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
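The three formulas above can be checked on a toy example. Here overlap is counted over unigrams (ROUGE-1), with repeated words clipped to their count in the other summary, which matches how ROUGE implementations count matches:

```python
from collections import Counter

def rouge1(generated, reference):
    # Unigram counts for each summary.
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: min count of each shared word.
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

# All generated words appear in the reference, so precision is 1.0;
# the reference has 6 words but only 3 are matched, so recall is 0.5.
p, r, f = rouge1("the cat sat", "the cat sat on the mat")
```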
Observations:
• It is observed that the precision, recall, and f-measure scores improved for the tf-idf model compared to the word-frequency model.
• The precision, recall, and f-measure scores for the tf-idf algorithm were obtained with a 50% information-retrieval value. They can be improved by increasing the information-retrieval value.
• Cosine similarity between the generated summary and the Summary column in the data was also calculated, but the scores were very low, in the range of 0.01 to 0.2. The cause might be the high dimensionality of the generated summary. This could be improved by performing PCA on the text to reduce the dimensionality of the generated summary.
• The generated summaries would have been more meaningful if larger text documents had been considered. Since the customer reviews here are very short paragraphs, there is not much difference between the original review and the generated summary.
• BERT-based models might give more concise summaries, but they could not be implemented because of installation problems.
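The cosine-similarity check mentioned above can be reproduced with simple bag-of-words vectors over the union vocabulary of the two summaries (a sketch; the report does not state which vectorization was actually used):

```python
import math
import re
from collections import Counter

def cosine_similarity(a, b):
    # Bag-of-words count vectors for each text.
    va = Counter(re.findall(r'[a-z]+', a.lower()))
    vb = Counter(re.findall(r'[a-z]+', b.lower()))
    # Dot product over the shared vocabulary.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical texts score 1.0 and texts with no shared words score 0.0, so the 0.01-0.2 range reported above indicates very little word overlap between the generated and reference summaries.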