Chapters 10, 11, 12, 13, 14


Ch10

NLP Pipeline

Problem of Information Overload

Information overload: a state of being overwhelmed by the amount of data presented to one's attention for processing.

Solution approaches:
- Information retrieval
- Information extraction
- Document categorization & clustering
- Text summarization
Text summarization: the process of creating a shorter version of text documents that contains the most important information from the original document.
Types of summaries:
- Indicative summary: provides a general overview of the text; identifies the main points and overall message. (The main points!)
- Informative summary: provides more detailed information about the text; goes beyond the main points to provide more context and explanation. (More details!)
- Critical summary: provides the author's perspective on the text and gives the reader the chance to engage with it critically; helps the reader think more critically about the text and develop their own understanding of it. (Perspective!)

• Indicative summary: This article reports on a new study that has found that eating chocolate can help you lose weight. The study found that participants who ate chocolate on a regular basis lost more weight and had smaller waistlines than those who did not eat chocolate.
• Informative summary: The study involved 102 participants who were randomly assigned to either a group that ate chocolate on a regular basis or a group that did not eat chocolate. The participants in the chocolate group ate 70 grams of dark chocolate per day for 12 weeks. The participants in the control group did not eat chocolate for the 12-week period. At the end of the study, the participants in the chocolate group had lost an average of 5 pounds and had smaller waistlines than the participants in the control group.
• Critical summary: Overall, this is a good article that provides valuable information about the potential benefits of eating chocolate for weight loss. However, it is important to note that more research is needed to confirm these findings.
Automatic Text Summarization (ATS) categories (two main categories):

• Extractive summarization approach:
  - Selects the most important sentences from the input text and concatenates them to form the summary (see the sketch after this list).
  - Suits indicative summaries.
• Abstractive summarization approach:
  - Generates the summary from scratch using its own language model.
  - Generates sentences describing the content of the text.
  - More complex than extractive summarization approaches, but can produce more informative and coherent summaries.
  - Suits informative and critical summaries.
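To make the extractive approach concrete, here is a minimal sketch (my illustration, not from the notes; the average-word-frequency scoring is just one simple heuristic) that scores sentences and concatenates the top scorers in their original order:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score each sentence by the average corpus frequency of its words
    (a simple heuristic), then concatenate the top n in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(s):
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / (len(words) or 1)
    top = sorted(sorted(range(len(sentences)),
                        key=lambda i: score(sentences[i]),
                        reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)

doc = ("The study involved 102 participants. Participants ate dark chocolate "
       "daily for 12 weeks. Participants eating chocolate lost more weight. "
       "The control group did not eat chocolate.")
print(extractive_summary(doc))
```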

Dimensions:
- Single-document vs. multi-document
- Context: query-specific vs. query-independent

Mani & Maybury (1999)


• Generic summarization: generating summaries of text documents for a general purpose. Examples: movie summaries, biographies, news articles, research papers.
• Query-focused summarization: generating summaries that are relevant to a specific query.
• Update summarization: generating summaries of text documents that contain new information relevant to a previously generated summary. Examples: news updates, movie/TV series summaries.
• Abstractive summarization: generating summaries that are faithful to the meaning of the original document, but which may not contain any of the original sentences. Example: headlines.
Stages:
1. Content identification: involves identifying important information in the input text (i.e. extracting keywords and phrases, named entities, the main topic).
2. Conceptual organization: involves organizing the identified content into a coherent structure (i.e. finding relations between pieces of information and grouping related ones).
3. Realization: involves generating the summary text based on the conceptual organization (i.e. selecting existing sentences and/or generating new ones).

Automatic abstracting: a method of generating summaries of text using computer algorithms.

Summarization system evaluation techniques: e.g., the compression ratio of summary to original text (such as 5:10).
Ch11

Text categorization (TC) is an important part of text mining.

Text categorization/classification: the task of assigning categories to free-text documents.

Example: spam filtering.

Text categorization draws on NLP, data mining, and machine learning together: a document is assigned to one or more predefined categories (class/category 1, category 2, category 3, ...).

Learning algorithms for TC:
- Bayesian (Naive Bayes)
- Relevance feedback (Rocchio)
- Nearest neighbor (case-based)
- Neural networks
- Rule-based (RIPPER)
- Support Vector Machines (SVM)

The vector-space model:
- Each term is a dimension; the number of dimensions t = |vocabulary|.
- Categories are regions separated by decision boundaries.

Term weights:
- Term frequency (TF): terms that occur more often in a document get higher weights.
- Inverse document frequency (IDF): terms that appear in many documents get lower weights.
- TF-IDF weighting combines the two.
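A minimal sketch of TF-IDF weighting (illustrative only; the toy corpus and the exact tf * log(N/df) variant are my assumptions, since the notes do not fix a formula):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf_idf(docs):
    """Weight term t in document d as tf(t, d) * log(N / df(t))."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

for w in tf_idf(docs):
    print(w)
```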
Similarity measure: a function that computes the degree of similarity between two vectors (e.g., cosine similarity, the Dice coefficient). It is used to compare the query with one document at a time, term by term.
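A small sketch of cosine similarity between two term-weight vectors (the dictionary representation of sparse vectors is my choice, not the notes'):

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); insensitive to vector length."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = {"cat": 1.0, "mat": 1.0}
doc = {"cat": 0.8, "sat": 0.5, "mat": 0.6}
print(cosine_similarity(query, doc))
```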

Using relevance feedback (Rocchio):
1. Use standard TF-IDF weighted vectors (normalized by the maximum term frequency).
2. For each category, compute a prototype vector by summing the vectors of the training documents in that category.
3. Assign test documents to the category with the closest prototype vector, based on cosine similarity.

In the vector-space model, the decision boundaries are defined by the centroids (prototype vectors) computed for the categories.
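A minimal sketch of the Rocchio procedure above (toy vectors; reuses the `cosine_similarity` helper defined earlier):

```python
from collections import defaultdict

def train_rocchio(labeled_docs):
    """Sum the TF-IDF vectors of each category's training documents to form
    one prototype vector per category (no length normalization needed,
    since cosine similarity ignores vector length)."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, category in labeled_docs:
        for term, w in vec.items():
            prototypes[category][term] += w
    return prototypes

def classify_rocchio(prototypes, vec):
    """Assign the document to the category with the most similar prototype."""
    return max(prototypes, key=lambda c: cosine_similarity(dict(prototypes[c]), vec))

train = [({"cheap": 2.0, "pills": 1.5}, "spam"),
         ({"meeting": 1.0, "agenda": 2.0}, "ham")]
protos = train_rocchio(train)
print(classify_rocchio(protos, {"cheap": 1.0, "agenda": 0.1}))  # -> spam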

Rocchio properties:
- Forms a simple generalization of the examples in each class (a prototype).
- Classification is based on similarity to the class prototypes.
- Does not guarantee a consistent hypothesis.
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
Nearest-neighbor learning algorithm (AKA case-based, memory-based, lazy learning): uses only the closest example(s) to determine the categorization.
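A 1-nearest-neighbor sketch over the same kind of dictionary vectors (again reusing the `cosine_similarity` helper; k = 1 matches the "closest example" wording in the notes):

```python
def classify_1nn(examples, vec):
    """Label a document with the category of its single most similar
    training example (lazy learning: all work happens at query time)."""
    _, best_cat = max(examples, key=lambda pair: cosine_similarity(pair[0], vec))
    return best_cat

examples = [({"cheap": 2.0, "pills": 1.5}, "spam"),
            ({"meeting": 1.0, "agenda": 2.0}, "ham")]
print(classify_1nn(examples, {"pills": 1.0}))  # -> spam
```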

Naive Bayes for text classification:
- Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, ..., wm} based on the probabilities P(wj | ci).
- Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
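A compact sketch of multinomial Naive Bayes with the smoothing described above (with p = 1/|V| and m = |V|, the Laplace m-estimate reduces to add-one smoothing; the toy data is mine):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Estimate log P(c) and log P(w|c) with add-one (Laplace) smoothing:
    P(w|c) = (count(w, c) + 1) / (total_words(c) + |V|)."""
    vocab = {w for words, _ in labeled_docs for w in words}
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for words, c in labeled_docs:
        word_counts[c].update(words)
        class_counts[c] += 1
    def log_prob(words, c):
        total = sum(word_counts[c].values())
        lp = math.log(class_counts[c] / len(labeled_docs))  # log prior
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return lp
    return lambda words: max(class_counts, key=lambda c: log_prob(words, c))

classify = train_nb([(["cheap", "pills", "now"], "spam"),
                     (["meeting", "agenda", "notes"], "ham")])
print(classify(["cheap", "meeting", "pills"]))  # -> spam
```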
Ch12

Clustering: the process of partitioning (grouping) examples from heterogeneous into homogeneous subsets.

Clustering approaches: agglomerative vs. divisive clustering.
- Agglomerative (bottom-up): methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
- Divisive (partitional, top-down): methods separate all examples immediately into clusters.

Direct clustering methods require a specification of the number of clusters, k, desired. A clustering evaluation function assesses which value of k gives the best partition (the best number of clusters to split into).
Cluster similarity (between two points of two clusters):
- Single link: similarity of the two most similar members.
- Complete link: similarity of the two least similar members.
- Group average: average similarity between members.
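These three linkage criteria are what hierarchical clustering libraries implement; a small illustration using SciPy (the SciPy usage and toy points are my addition, not from the notes):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five 2-D points: two tight groups plus one in-between point.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])

for method in ("single", "complete", "average"):       # the three criteria above
    tree = linkage(points, method=method)               # agglomerative merge tree
    labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```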

Non-hierarchical clustering: randomly choose k instances as seeds, one per cluster (e.g., with 100 examples and 5 seeds, 95 examples remain to be assigned).
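A minimal k-means sketch following that seeding idea: pick random seeds, then alternate between assigning points and recomputing centroids (the toy data and fixed iteration count are mine):

```python
import random

def kmeans(points, k, iters=20):
    """Seed with k random instances, then repeatedly assign each point
    to its nearest centroid and recompute centroids as cluster means."""
    centroids = random.sample(points, k)  # one seed per cluster
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

random.seed(0)
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```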

Issues in clustering evaluation:
- Internal: tightness and separation of clusters (e.g., the k-means objective); fit of a probabilistic model to the data.
- External: compare to known class labels on benchmark data.
Ch13

Machine Translation (MT): the task of automatically converting one natural language into another (e.g., Arabic to English).

MT is good for: web pages, emails, first-pass translation in computer-aided human translation (where the rough output is fixed by post-editing).
MT is bad for: literature, meeting/court recordings, medical use in hospitals, war tactics.

Challenges of MT:
- Translation is difficult, even for humans.
- Different languages vary across multiple dimensions.
- Fully automated machine translation is still not possible, except in limited sub-language domains:
  - Limited vocabulary and few basic phrase types; ambiguity can be resolved using local context.
  - Examples: weather forecasting, air travel queries, restaurant recommendation.
- Current systems can be used to speed up human translation, where the output is fixed by post-editing.
Language Divergence

Typology: the study of systematic cross-linguistic similarities and divergences; the study of the structure of the world's languages for the purpose of classification, comparison, and analysis.

Language divergences can be classified into:
- Lexical divergences: differences in vocabulary and phrases
- Morphological variation: differences in word structures (morphemes)
- Syntactic variation: differences in sentence structures
- Semantic variation: differences in meaning of words and phrases
- Segmentation variation: differences in sound patterns
- Inferential load: differences in context inferences and assumptions

Lexical Divergences

MT Approaches

Rule-Based MT (RBMT): a traditional approach to machine translation that utilizes a set of manually crafted (hand-written) rules to translate text from one language to another.

Three RBMT paradigms: direct translation, the transfer model, and interlingua.

• Direct translation:
  - Involves directly translating words from the source language to the target language using a bilingual dictionary or word list.
  - Also known as word-for-word translation.
  - A straightforward approach, but it often produces inaccurate translations due to differences in grammar, word order, and idiomatic expressions between languages.
  - Example: input sentence in English: "The cat sits on the mat"; output sentence in German: "Der Katze sitzt auf der Matte".
• Transfer model:
  - Involves translating the source language text into an intermediate representation, then into the target language.
  - Steps:
    - Analysis: syntactically parse the source language.
    - Transfer: apply rules to turn this parse into a parse for the target language.
    - Generation: generate the target sentence from the parse tree.
  - Pros: can produce more accurate translations than direct translation by using language-specific rules at each stage of the translation process.
• Interlingua:
  - Utilizes a semantic intermediate representation of the source language, called an interlingua, which is used to generate the target language text using language-specific rules.
  - The interlingua is a language-independent formal semantic representation that captures the meaning of the source.
  - Steps:
    - Translate the source sentence into a meaning representation.
    - Generate the target sentence from the meaning representation.

Statistical MT (SMT): an approach to machine translation that utilizes statistical models to predict the most likely translation for a given word, phrase, or sentence.
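The classic noisy-channel formulation usually given for SMT (standard background; the notes do not spell it out): choose the target sentence e that maximizes P(e | f) for a source sentence f, which by Bayes' rule factors into a translation model and a language model:

    e* = argmax_e P(e | f) = argmax_e P(f | e) * P(e)

where P(f | e) is the translation model, P(e) is the language model, and the normalizing P(f) is constant in e and can be dropped.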
Evaluation of MT:
Human evaluation is expensive and very slow; we need an automatic evaluation metric that takes seconds, not months.

Automatic metrics: BLEU, NIST, TER, METEOR.
- BLEU (Bilingual Evaluation Understudy): a precision-based metric that measures the overlap between n-grams (sequences of n words) in the machine-translated output and the reference translation (see the sketch after this list).
- NIST: named after the National Institute of Standards and Technology.
- TER: Translation Error Rate.
- METEOR: Metric for Evaluation of Translation with Explicit ORdering.
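A sketch of the modified n-gram precision at the heart of BLEU (simplified by me: full BLEU combines n = 1..4 with a geometric mean and a brevity penalty, omitted here):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clip each candidate n-gram's count by its count in the reference,
    then divide by the total number of candidate n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

mt_output = "the cat is on the mat"
reference = "there is a cat on the mat"
print(modified_ngram_precision(mt_output, reference, n=1))  # unigram precision
print(modified_ngram_precision(mt_output, reference, n=2))  # bigram precision
```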
Ch14

Speech: the most common analog signal produced by humans.

Speech production: involves physiology, cognition, and acoustics to produce meaningful sounds to communicate.

Speech visualization: combines the science of speech production with the art of visual representation.

Automatic Speech Recognition (ASR): the process of converting spoken language into text.
ASR Challenges

Input factors:
- Acoustic signal: captured through microphones; contains the speaker's voice, but can also be mixed with background noise and other environmental factors.
  - Microphone: close-mic, throat-mic, microphone array.
  - Sources: band-limited, background noise.
  - Speaker: speaker-dependent vs. speaker-independent.
- Language: the specific spoken language, with its vocabulary, grammar, and pronunciation rules, guides the interpretation of the acoustic signal.

Processing factors:
- Feature extraction: identifying the relevant aspects of the acoustic signal that represent the spoken words, like pitch, formants, and spectral energy.
- Acoustic modeling: this stage decodes the sequence of sounds (phonemes) from the extracted features, essentially recognizing the building blocks of speech.
- Language modeling: this stage decodes the sequence of words from the recognized sounds, using knowledge of the language.

Output factors:
- Text transcript: the final product, the text version of the spoken language, is the result of the previous stages.
- Accuracy and fluency of sentences are crucial, but factors like keywords, punctuation, and speaker identification can also be part of the output.

ASR performance evaluation:
- Accuracy: percentage of tokens correctly recognized.
- Error rate: percentage of errors made by the system (the inverse of accuracy).
- Speed: latency, i.e. the time taken to process the speech and generate the output text (transcript).
- Resource consumption: the amount of memory, processing power, and other resources required by the system.
- User experience: subjective factors like ease of use, clarity of transcripts, and overall satisfaction with the system.
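The standard instantiation of the error rate above is word error rate (WER); the notes do not name it, so this sketch is my addition. WER counts substitutions, insertions, and deletions via edit distance over words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a hat"))  # 2 errors / 6 words = 0.333
```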
ASR approaches: template-based ASR worked only for isolated words, a single user, and a small dictionary (on the order of 30-50 words).

You might also like