AI 60003180058 Exp10

60003180058
Varun Vora
EXPERIMENT 10
AIM: - To perform text summarization.
THEORY: -
Text summarization refers to the technique of shortening long pieces of text. The intention is
to create a coherent and fluent summary having only the main points outlined in the
document. Automatic text summarization is a common problem in machine learning and
natural language processing (NLP). Machine learning models are usually trained to
understand documents and distill the useful information before outputting the required
summarized texts.
There are two main approaches to summarizing text in NLP:
Extraction-based summarization
The extractive text summarization technique involves pulling keyphrases from the source
document and combining them to make a summary. The extraction is made according to the
defined metric without making any changes to the texts.
Here is an example:
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In
the city, Mary gave birth to a child named Jesus.
Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.
As shown above, words have been extracted from the source text and joined to create a
summary, although the result can sometimes be grammatically strange.
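The extractive idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it extracts whole sentences rather than individual words, it uses summed word frequency as the "defined metric", and the stopword list, the naive sentence splitter, and the function names are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list used only for this illustration.
STOPWORDS = {"a", "an", "and", "the", "to", "in", "on", "of", "is", "was", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, unchanged and in original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    # Score of a sentence = summed document frequency of its (non-stopword) words.
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]


source = ("Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. "
          "In the city, Mary gave birth to a child named Jesus.")
print(extractive_summary(source, num_sentences=1))
```

Because the score is a plain sum, longer sentences are favoured; dividing by sentence length would be one way to reduce that bias.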
Abstraction-based summarization
The abstraction technique entails paraphrasing and shortening parts of the source document.
When abstraction is applied for text summarization in deep learning problems, it can
overcome the grammar inconsistencies of the extractive method.
The abstractive text summarization algorithms create new phrases and sentences that relay
the most useful information from the original text — just like humans do.
Therefore, abstraction can produce more fluent summaries than extraction. However, the
algorithms required for abstraction are more difficult to develop, which is why extraction
is still popular.
Here is an example:
Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.
Working:
Usually, text summarization in NLP is treated as a supervised machine learning problem
(where future outcomes are predicted based on provided data).
Typically, the extraction-based approach to summarizing text works as follows:
1. Introduce a method to extract candidate keyphrases from the source document. For
example, you can use part-of-speech tagging, word sequences, or other linguistic patterns to
identify the keyphrases.
2. Gather text documents with positively-labeled keyphrases. The keyphrases should be
compatible with the chosen extraction technique. To increase accuracy, you can also create
negatively-labeled keyphrases.
3. Train a binary machine learning classifier to perform the text summarization. Some of the
features you can use include:
• Length of the keyphrase
• Frequency of the keyphrase
• The most recurring word in the keyphrase
• Number of characters in the keyphrase
4. Finally, in the test phase, generate all the candidate keyphrases and sentences and
classify them.
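The features listed in step 3 could be computed roughly as follows. This is a minimal sketch, not the method used in this experiment: the function name `keyphrase_features` is hypothetical, "frequency" is assumed to mean the number of times the phrase occurs in the document, and "most recurring word" is assumed to mean the keyphrase word that occurs most often in the document.

```python
import re
from collections import Counter


def keyphrase_features(keyphrase, document):
    """Compute the four features listed above for one candidate keyphrase."""
    doc_lower = document.lower()
    phrase_lower = keyphrase.lower()
    words = phrase_lower.split()
    doc_word_counts = Counter(re.findall(r"\w+", doc_lower))
    # Assumed reading: the keyphrase word that appears most often in the document.
    most_recurring = max(words, key=lambda w: doc_word_counts[w]) if words else ""
    return {
        "length_in_words": len(words),
        # Simple substring count; a real system would match on token boundaries.
        "frequency": doc_lower.count(phrase_lower),
        "most_recurring_word": most_recurring,
        "num_characters": len(keyphrase),
    }


doc = "Text summarization shortens text. Automatic text summarization is common."
features = keyphrase_features("text summarization", doc)
print(features)
```

A feature dictionary like this, one per candidate keyphrase, together with its positive or negative label, is the kind of input a binary classifier would be trained on.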
TEXT RANK ALGORITHM:

TextRank is an algorithm based on PageRank, which is often used in keyword extraction and
text summarization.
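For reference, the score that TextRank iterates on a weighted graph is usually written as below. The symbols are notation introduced here, not taken from this report: WS(V_i) is the score of vertex (sentence) V_i, w_{ji} is the weight of the edge from V_j to V_i, and d is the damping factor, commonly set to 0.85.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j)
```

Each sentence therefore receives a share of the score of every sentence it is similar to, in proportion to the similarity between them.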
FLOW:
• The first step is to concatenate all the text contained in the articles
• Then split the text into individual sentences
• In the next step, find a vector representation (word embeddings) for each and
every sentence
• Similarities between sentence vectors are then calculated and stored in a matrix
• The similarity matrix is then converted into a graph, with sentences as vertices and
similarity scores as edges, for sentence rank calculation
• Finally, a certain number of top-ranked sentences form the final summary
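The flow above can be sketched end to end with the standard library only. This is a simplified illustration under stated assumptions: plain word counts stand in for word embeddings, sentences are split with a naive rule, and the names `cosine` and `textrank_summary` are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def textrank_summary(articles, top_n=2, damping=0.85, steps=50):
    # 1. Concatenate all article text.
    text = " ".join(articles)
    # 2. Split into sentences (naive rule: break after ., ! or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    # 3. Vector representation: word counts stand in for embeddings here.
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # 4. Similarity matrix with a zero diagonal.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # 5. PageRank-style iteration; edge weights are normalised per source sentence.
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(steps):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j]
            )
            for i in range(n)
        ]
    # 6. Keep the top-ranked sentences, returned in their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]


articles = [
    "Machine learning uses data. Machine learning answers questions with data.",
    "Bananas are yellow.",
]
print(textrank_summary(articles, top_n=2))
```

The sentence about bananas shares no words with the others, so it has no edges in the graph and keeps only the base score of 1 - damping.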
CODE: -
import subprocess

from ibm_watson import SpeechToTextV1
from ibm_watson.websocket import RecognizeCallback, AudioSource
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator


def sp_to_text():
    # Extract the audio track of the video into a WAV file
    command = 'ffmpeg -i MLintro.mp4 -ab 160k -ar 44100 -vn ml_audio.wav'
    subprocess.call(command, shell=True)

    apikey = 'WjfWKa1PpOvlmyiaEcUjWWH_Ksy2LhTtTEZn_D_EPKML'
    url = 'https://api.eu-gb.speech-to-text.watson.cloud.ibm.com/instances/5dbc1611-dbad-48e5-8f64-efc1ef8d4ff0'

    # Setup service
    authenticator = IAMAuthenticator(apikey)
    stt = SpeechToTextV1(authenticator=authenticator)
    stt.set_service_url(url)

    with open('ml_audio.wav', 'rb') as f:
        res = stt.recognize(audio=f, content_type='audio/wav',
                            model='en-AU_NarrowbandModel', continuous=True).get_result()

    # One line per recognized result: capitalize the first letter and end with a period
    text = [result['alternatives'][0]['transcript'].rstrip() + '.\n' for result in res['results']]
    text = [para[0].title() + para[1:] for para in text]
    transcript = ''.join(text)

    with open('output.txt', 'w') as out:
        out.writelines(transcript)
    return transcript
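# Assumption: `summarizer` is the TextRank summarizer from the `summa` package
from summa import summarizer


def summary(inp_text):
    # With split=True and scores=True, summarize() returns (sentence, score) pairs
    summarized_text = summarizer.summarize(
        inp_text, ratio=0.4, language="english", split=True, scores=True)
    summary = ''
    for sentence, score in summarized_text:
        summary = summary + " " + sentence
    return summary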
INPUT TEXT:
The world is filled with data a lot of their pictures music words spreadsheets videos and it
doesn't look like it's going to slow down anytime soon.
Machine learning brings the promise of deriving meaning from all of that data.
Either C. Clark famously once said any sufficiently advanced technology is indistinguishable
from magic I have found machine learning not to be magic but rather towards and technology
so you can utilise to answer questions with your data this is cloud adventures my name is
your bank well and each episode we will be exploring the order science and talk of machine
learning along the way we'll see just how easy it is to create amazing experiences and your
valuable insights.
The value of machine learning is only just beginning to show itself there is a lot of data in the
world today generated not only by people but also about computers phones and other devices
this will only continue to grow in the years to come traditionally humans have analysed data
and adapted systems to the changes in the the parents however as the farm of data surpasses
the ability for humans to make sense of it and manually rate those rules will turn increasingly
to automated systems that can learn from the dealer importantly but changes and data to adapt
to a shifting landscape with the machine running all around us and the products were used
today however it isn't always apparent that machine learning is behind it all.
While things like tagging objects have been people inside of photos are clearly machine
learning a place it may not be immediately apparent that recommending the next video to
watch is also powered by machine learning.
Of course perhaps the biggest example of all is Google search every time you use Google
search you're using a system that has many machine learning systems at its core from
understanding the text of your query to adjusting the results based on your personal interests
such as knowing which results to show you first when searching for job depending on
whether you're a couple extra or a developer perhaps your books today machine learning's
immediate applications are already quite wide ranging including image recognition fraud
detection and recommendation systems as well as text and speech systems to these powerful
capabilities can be applied to a wide range of fields from diabetic retinopathy and skin cancer
detection to retail and of course transportation in the form of self parking and self driving
vehicles.
It wasn't that long ago that when a company or product had machine learning in its offerings
he was considered normal now every company is pivoting to use machine learning in their
products in some way it's rapidly becoming well and expected future just as we expect
companies to have a website that works on your mobile device or perhaps an app the day will
soon come when it will be expected at our technology will be personalised instead for and
self correcting.
As we used machine learning to make human tasks better faster and easier than before we can
also look further into the future when machine learning can help us do tasks that we never
could have achieved on our on.
For you it's not hard to take advantage of machine learning today between has gotten quite
good all you need is data developers and a willingness to take the plunge for our purposes I
have shortened the definition of machine learning touches five words using data to answer
questions well I wouldn't use it a short answer for an essay prompt on exam it's a really useful
purpose for us here in particular we can split the definition into two parts using data and
answer questions these two pieces broadly outlined the two sides of machine learning both of
them you call important using data is what we refer to as transfer well yes in question is
referred to as making predictions or inference now let's roll into those two sides briefly for a
little bit.
Training refers to using a data to inform the creation and fine tune of a predictive model this
predictive model can then be used to several productions I'm previously unseen data and
answer those questions as more data is gathered at the model can be improved over time and
new predictive models deployed.
As you may have noticed the key component of this entire process is did everything hinges
under didn't secure to unlocking machine learning just as much as machine learning is the
kids unlocking that hidden inside indeed.
This was just a high level overview of machinery why it's useful and some of its applications
machine learning is a broad field spending an entire family of techniques for inferring
answers from the so in future episodes willing to give you a better sense of what approaches
to use for a given dataset and question you want to answer as well as provide the tools for
how to accomplish it.
And I next episode we'll dive right in to the country process of doing machine learning in
more detail going through step by step formula for how to approach machine learning
problems.
OUTPUT SUMMARY: -
TEXTRANK ALGO:
CODE:
import re

import numpy as np
from nltk import sent_tokenize, word_tokenize
from nltk.cluster.util import cosine_distance

MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)


def normalize_whitespace(text):
    """
    Translates multiple whitespace into single space character.
    If there is at least one new line character chunk is replaced
    by single LF (Unix new line) character.
    """
    return MULTIPLE_WHITESPACE_PATTERN.sub(_replace_whitespace, text)


def _replace_whitespace(match):
    text = match.group()
    if "\n" in text or "\r" in text:
        return "\n"
    else:
        return " "


def is_blank(string):
    """
    Returns `True` if string contains only white-space characters
    or is empty. Otherwise `False` is returned.
    """
    return not string or string.isspace()


def get_symmetric_matrix(matrix):
    """
    Get Symmetric matrix
    :param matrix:
    :return: matrix
    """
    return matrix + matrix.T - np.diag(matrix.diagonal())


def core_cosine_similarity(vector1, vector2):
    """
    measure cosine similarity between two vectors
    :param vector1:
    :param vector2:
    :return: 0 <= cosine similarity value <= 1 (for non-negative count vectors)
    """
    return 1 - cosine_distance(vector1, vector2)


class TextRank4Sentences():
    def __init__(self):
        self.damping = 0.85  # damping coefficient, usually is .85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 100  # iteration steps
        self.text_str = None
        self.sentences = None
        self.pr_vector = None

    def _sentence_similarity(self, sent1, sent2, stopwords=None):
        if stopwords is None:
            stopwords = []

        sent1 = [w.lower() for w in sent1]
        sent2 = [w.lower() for w in sent2]

        all_words = list(set(sent1 + sent2))

        vector1 = [0] * len(all_words)
        vector2 = [0] * len(all_words)

        # build the vector for the first sentence
        for w in sent1:
            if w in stopwords:
                continue
            vector1[all_words.index(w)] += 1

        # build the vector for the second sentence
        for w in sent2:
            if w in stopwords:
                continue
            vector2[all_words.index(w)] += 1

        return core_cosine_similarity(vector1, vector2)

    def _build_similarity_matrix(self, sentences, stopwords=None):
        # create an empty similarity matrix
        sm = np.zeros([len(sentences), len(sentences)])

        for idx1 in range(len(sentences)):
            for idx2 in range(len(sentences)):
                if idx1 == idx2:
                    continue
                sm[idx1][idx2] = self._sentence_similarity(sentences[idx1], sentences[idx2],
                                                           stopwords=stopwords)

        # Get symmetric matrix
        sm = get_symmetric_matrix(sm)

        # Normalize matrix by column
        norm = np.sum(sm, axis=0)
        # Skip columns whose sum is 0; `out` keeps those entries at 0 instead of
        # leaving them uninitialized
        sm_norm = np.divide(sm, norm, out=np.zeros_like(sm), where=norm != 0)

        return sm_norm

    def _run_page_rank(self, similarity_matrix):
        pr_vector = np.array([1] * len(similarity_matrix))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr_vector = (1 - self.damping) + self.damping * np.matmul(similarity_matrix,
                                                                      pr_vector)
            if abs(previous_pr - sum(pr_vector)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr_vector)

        return pr_vector

    def _get_sentence(self, index):
        try:
            return self.sentences[index]
        except IndexError:
            return ""

    def get_top_sentences(self, number=5):
        top_sentences = []

        if self.pr_vector is not None:
            # Sentence indices sorted by rank, highest first
            sorted_pr = np.argsort(self.pr_vector)
            sorted_pr = list(sorted_pr)
            sorted_pr.reverse()

            index = 0
            # Never ask for more sentences than the text contains
            for epoch in range(min(number, len(sorted_pr))):
                sent = self.sentences[sorted_pr[index]]
                sent = normalize_whitespace(sent)
                top_sentences.append(sent)
                index += 1

        return top_sentences

    def analyze(self, text, stop_words=None):
        self.text_str = text
        self.sentences = sent_tokenize(self.text_str)

        tokenized_sentences = [word_tokenize(sent) for sent in self.sentences]

        similarity_matrix = self._build_similarity_matrix(tokenized_sentences, stop_words)

        self.pr_vector = self._run_page_rank(similarity_matrix)


text_str = '''
In this video, I'm going to define what is probably the most common type of Machine
Learning problem, which is Supervised Learning. I'll define Supervised Learning more
formally later, but it's probably best to explain or start with an example of what it is, and we'll
do the formal definition later. Let's say you want to predict housing prices. A while back a
student collected data sets from the City of Portland, Oregon, and let's say you plot the data
set and it looks like this. Here on the horizontal axis, the size of different houses in square
feet, and on the vertical axis, the price of different houses in thousands of dollars. So, given
this data, let's say you have a friend who owns a house that is say 750 square feet, and they
are hoping to sell the house, and they want to know how much they can get for the house. So,
how can the learning algorithm help you? One thing a learning algorithm might be want to do
is put a straight line through the data, also fit a straight line to the data. Based on that, it looks
like maybe their house can be sold for maybe about $150,000. But maybe this isn't the only
learning algorithm you can use, and there might be a better one. For example, instead of
fitting a straight line to the data, we might decide that it's better to fit a quadratic function, or
a second-order polynomial to this data. If you do that and make a prediction here, then it
looks like, well, maybe they can sell the house for closer to $200,000. One of the things we'll
talk about later is how to choose, and how to decide, do you want to fit a straight line to the
data? Or do you want to fit a quadratic function to the data? There's no fair picking whichever
one gives your friend the better house to sell. But each of these would be a fine example of a
learning algorithm. So, this is an example of a Supervised Learning algorithm.'''
tr4sh = TextRank4Sentences()
tr4sh.analyze(text_str)
print(tr4sh.get_top_sentences(6))
OUTPUT:
CONCLUSION: - Hence, we have performed text summarization (extractive).
