Big Data Analytics

RECOMMENDATION SYSTEM FOR TITLE WORDS

Sindhu Abro (221-17-0016)
A BRIEF SURVEY OF TEXT MINING:
CLASSIFICATION, CLUSTERING AND EXTRACTION
TECHNIQUES
 Knowledge Discovery vs. Data Mining
 Knowledge Discovery in Databases (KDD) is the nontrivial extraction of
implicit, valid, new and potentially useful information from data,
whereas Data Mining is the application of particular algorithms for
extracting patterns from data. KDD aims at discovering hidden
patterns and connections in the data.
 KDD refers to the overall process of discovering useful knowledge
from data, while data mining refers to a specific step in this
process.
 Text Mining Approaches
 Information Retrieval (IR): focuses mostly on facilitating access to
information rather than analyzing it and finding hidden patterns
 Natural Language Processing (NLP): aims at understanding natural
language using computers
 Information Extraction from text (IE): the task of automatically
extracting information or facts from unstructured or semi-structured
documents, e.g., entity extraction
 Many others: Summarization, Text Streams and Social Media Mining,
Opinion Mining and Sentiment Analysis
 Text Preprocessing
 Tokenization is the task of breaking a character
sequence up into pieces (tokens).
 Filtering is usually done on documents to remove
some of the words. A common form of filtering is
stop-word removal (e.g., prepositions, conjunctions, etc.).
 Stemming methods aim at obtaining the stem (root) of
derived words.
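
The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the surveyed paper: the stop-word list, the suffix list and the function names (`tokenize`, `remove_stop_words`, `crude_stem`, `preprocess`) are all assumptions, and the suffix stripper is only a stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "or", "the", "to"}

# Suffixes removed by the crude stemmer below, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Break a character sequence into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix; a stand-in for a real stemmer such as Porter's."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The miners are mining texts in the documents"))
```

Note that the crude stemmer maps "miners" and "mining" to different stems, which is exactly the kind of case a proper stemming algorithm is designed to handle better.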
 CLASSIFICATION aims to assign predefined
classes to text documents
 Naive Bayes Classifier
 Nearest Neighbor Classifier
 Decision Tree classifiers
 Support Vector Machines
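
As an illustration of the first classifier in the list, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The class name, the toy training documents and the labels are assumptions made for this example; documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocabulary)
        for label, doc_count in self.class_doc_counts.items():
            # Start from the log prior, log P(class).
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add log P(token | class), smoothed so unseen tokens are not zero.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (hypothetical data, for illustration only).
train_docs = [
    ["goal", "match", "team"],
    ["team", "score", "win"],
    ["election", "vote", "party"],
    ["party", "policy", "vote"],
]
train_labels = ["sports", "sports", "politics", "politics"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(classifier.predict(["team", "win"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.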
 CLUSTERING is the task of finding groups of
similar documents in a collection of
documents
 k-means Clustering
 Probabilistic Clustering and Topic Models
(Probabilistic Latent Semantic Analysis (pLSA)
and Latent Dirichlet Allocation (LDA))
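
The k-means approach can be sketched as follows on simple term-count vectors. This is a simplified illustration under stated assumptions: the function names are hypothetical, squared Euclidean distance is used, and the starting centroids are passed in explicitly (`initial_indices`) so the result is reproducible, whereas practical implementations usually choose them randomly.

```python
from collections import Counter


def vectorize(documents):
    """Turn token lists into term-count vectors over a shared vocabulary."""
    vocabulary = sorted({token for doc in documents for token in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([float(counts[term]) for term in vocabulary])
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_indices, max_iterations=100):
    """Cluster vectors; centroids start at the vectors named by initial_indices."""
    centroids = [list(vectors[i]) for i in initial_indices]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no vector changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["goal", "team", "match"],
    ["team", "match", "win"],
    ["vote", "party", "election"],
    ["party", "election", "policy"],
]
print(k_means(vectorize(docs), initial_indices=[0, 2]))
```

The returned list gives one cluster index per document, in input order.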
AUTOMATIC KEYWORD EXTRACTION FOR TEXT
SUMMARIZATION: A SURVEY
 Given the sheer volume of data, there is a need for an
automatic summarizer that can summarize data, especially
textual data, without losing any critical point of the
original document.
 The summarization process depends heavily on keyword
extraction.
 Automatic Keyword Extraction is the process of
selecting the words and phrases from a text document
that best convey the core sentiment of the document,
without any human intervention, depending on the
model.
 Recent literature on automatic keyword extraction:
 Simple Statistical Approach
 These strategies are rough and simple, and generally
require no training sets.
 Linguistics Approach
 This approach utilizes the linguistic features of the
words for keyword detection and extraction in text
documents.
 It incorporates lexical analysis, syntactic
analysis, discourse analysis, etc.
 Machine Learning Approach
 Keyword extraction can also be seen as a learning
problem. This approach requires manually
annotated training data and trained models.
 Hybrid Approach: a mixture of the above
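
To make the simple statistical approach concrete, here is a minimal sketch that ranks the terms of one document by TF-IDF. TF-IDF is used only as one common example of a statistical measure; the slides do not name a specific one, and the function name and toy documents are assumptions. No training data is involved, only counts over the collection.

```python
import math
from collections import Counter


def tf_idf_keywords(documents, doc_index, top_n=3):
    """Rank the terms of one document by TF-IDF against the whole collection."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    term_counts = Counter(documents[doc_index])
    doc_length = len(documents[doc_index])
    scores = {
        term: (count / doc_length) * math.log(n_docs / doc_freq[term])
        for term, count in term_counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


# Toy collection of tokenized documents (hypothetical data).
collection = [
    ["text", "mining", "keyword", "keyword", "extraction"],
    ["text", "mining", "clustering"],
    ["text", "classification", "mining"],
]
print(tf_idf_keywords(collection, doc_index=0, top_n=2))
```

Terms that occur in every document ("text", "mining") score zero, so the ranking favors terms that are frequent in the chosen document but rare elsewhere.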
AN EMPIRICAL STUDY OF IMPORTANT KEYWORD
EXTRACTION TECHNIQUES FROM DOCUMENTS
 The primary goal of important keyword extraction
is to extract a specific group of words or keywords
that highlight the main content of the documents.
 Basic data mining applications related to keyword
extraction:
 automatic clustering
 automatic filtering
 automatic indexing
 automatic summarization
 information visualization
 topic detection and tracking
 The study examined various algorithms for finding
important keywords in a document, such as:
 support vector machine (SVM)
 conditional random fields (CRF)
 NP-chunk
 n-grams
 multiple linear regression
 logistic regression
-> SVM shows better results
YAKE! COLLECTION-INDEPENDENT AUTOMATIC
KEYWORD EXTRACTOR

 YAKE! does not rely on dictionaries or thesauri, nor
is it trained against any corpora.
 It follows an unsupervised approach that builds upon
features extracted from the text.
 Keyword Extraction Pipeline: Six Steps →
1. Text Preprocessing (Tokenization, Stemming, Stop
Word Removal)
2. Feature Extraction
 Casing → lower/upper case
 Word Positional → words that occur at the start of
the document
 Word Frequency → how often a word occurs
 Word Relatedness to Context → the same / different
words that occur to the left and right of the
candidate word
 Word DifSentence → how often a candidate word
occurs in different sentences
3. Individual Term Score (calculated from the above
features)
4. Candidate Keywords List Generation (based on
term score)
5. Data Deduplication (remove duplicates using
Levenshtein distance)
6. Ranking (based on individual term score)
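
Steps 5 and 6 can be sketched as below. This is an illustrative sketch, not YAKE!'s actual code: the function names, the similarity normalization and the 0.8 threshold are assumptions, and the candidate scores are made up. It also assumes that a lower score means a more relevant keyword; flip the sort order if your scores run the other way.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize edit distance to a 0..1 similarity (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def deduplicate_and_rank(scored_candidates, threshold=0.8):
    """Keep the best-scoring candidate from each group of near-duplicates.

    Assumes a lower score means a more relevant keyword.
    """
    kept = []
    # Visit candidates from best to worst so the best of each group survives.
    for keyword, score in sorted(scored_candidates.items(), key=lambda kv: kv[1]):
        if all(similarity(keyword, other) < threshold for other, _ in kept):
            kept.append((keyword, score))
    return kept


# Hypothetical candidate keywords with made-up scores.
candidates = {
    "keyword extraction": 0.10,
    "keyword extractions": 0.12,
    "text mining": 0.20,
}
print(deduplicate_and_rank(candidates))
```

Because candidates are visited in ranked order, the list that comes back is already sorted, so deduplication and ranking fall out of a single pass.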
 The results can be explored through three
different functionalities:
1. Annotated text -> shows the text annotated
with the top 10 keywords retrieved by
YAKE!
2. Word cloud -> uses the relevance score of
each keyword retrieved by YAKE! to
generate a word cloud, where more
important keywords are shown in a larger size
3. Comparison -> compares YAKE! against IBM NLU and RAKE
TEXT SUMMARIZATION WITH AUTOMATIC
KEYWORD EXTRACTION IN TELUGU E-
NEWSPAPERS
 Automatic Keyword Extraction
 The main aim of automatic keyword extraction is to identify
the set of words or phrases that best represents the
document.
 The extraction (testing) model is shown in Figure 2. The articles
are passed to the POS tagger. A score is calculated for each
candidate term, and the few top-scoring terms are
selected as keywords.
DATASET
 PLOS open access journal research articles
 Format: XML
 Contains complete information related to
research papers
 Size: 5 GB
 Instances: more than 2 lakh (200,000)
METHODOLOGY
1. Input: paper abstract from the corpus
2. Preprocessing: applying preprocessing techniques (stop-word removal, etc.)
3. Recommendation System for Title Words
REFERENCES:
 https://arxiv.org/abs/1707.02919
 https://arxiv.org/abs/1704.03242
 http://ieeexplore.ieee.org/abstract/document/8122154/
 https://www.researchgate.net/publication/323167464_YAKE_Collection-Independent_Automatic_Keyword_Extractor
 https://www.researchgate.net/publication/314239171_Text_Summarization_with_Automatic_Keyword_Extraction_in_Telugu_e-Newspapers