
Text Mining

• Text mining typically aims to extract or generate new information
from textual information but does not necessarily need to
understand the text itself
• NLP: analyses language structure within texts, e.g. PoS tagging, n-grams, etc.
• Pre-processing
• Text corpus: a collection of text documents
• Text database: grammatical parsing and pre-processing steps
transform the unstructured text corpus into a semi-structured format
• Term-document matrix: a structured representation
• Bag-of-words mechanism containing term frequencies for all documents in the
corpus
• Vector and feature generation
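The step from corpus to term-document matrix can be sketched as follows. This is a minimal illustration in plain Python, not the R packages used later in these slides; the two-document corpus, the `tokenize` helper, and the regex-based tokenization are assumptions made only for this example.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not from the slides): document id -> raw text.
corpus = {
    "d1": "Text mining extracts information from text.",
    "d2": "Mining text needs pre-processing.",
}

def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z]+", text.lower())

# Bag of words per document: term -> frequency.
counts = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}

# Vocabulary: every distinct term seen anywhere in the corpus.
vocabulary = sorted(set().union(*counts.values()))

# Term-document matrix: one row per term, one column per document.
tdm = {term: [counts[doc_id][term] for doc_id in corpus] for term in vocabulary}

for term, row in tdm.items():
    print(f"{term:12s} {row}")
```

Each row is the frequency vector of one term across the documents; reading the matrix column-wise gives the feature vector of one document.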
Text Mining Applications
• Text classification
• Sentiment mining
• Syntax analysis
• Analysing the syntactic structure of texts
• Relationship identification
• Finding connections and similarities between distinct subsets of documents in the
corpus
• Information extraction and retrieval
• Search engines and web robots
• Document/Text summarization
• Extracting relevant and representative keywords, phrases, and sentences from texts
• Dimensionality Reduction and Topic Modeling
Example
Business intelligence (BI) is the set of techniques and tools for
the transformation of raw data into meaningful and useful
information for business analysis purposes. Business Intelligence
(BI) technologies are capable of handling large amounts of
unstructured data to help identify, develop and otherwise create
new strategic business opportunities. The goal of Business
Intelligence (BI) is to allow for the easy interpretation of these
large volumes of data. Identifying new opportunities and
implementing an effective strategy based on insights can provide
businesses with a competitive market advantage and long-term
stability.
Bag of words
• Business intelligence (BI) *** techniques *** tools *** information *** business *** Business Intelligence (BI) *** business *** Business Intelligence (BI) ***
Frequency
• Business – 5 (counted case-insensitively; "businesses" is a separate token)
• Intelligence – 3
• ……….
• P(business in this document) = 5 / total word count.
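The frequency count can be reproduced with a short sketch. It is written in plain Python rather than with the R packages named in these slides, and the tokenization rule (lower-casing, letters only, no stemming) is an assumption; a different rule, such as stemming "businesses" to the same root, would give different counts.

```python
import re
from collections import Counter

# The example paragraph from the slide.
text = """Business intelligence (BI) is the set of techniques and tools for
the transformation of raw data into meaningful and useful
information for business analysis purposes. Business Intelligence
(BI) technologies are capable of handling large amounts of
unstructured data to help identify, develop and otherwise create
new strategic business opportunities. The goal of Business
Intelligence (BI) is to allow for the easy interpretation of these
large volumes of data. Identifying new opportunities and
implementing an effective strategy based on insights can provide
businesses with a competitive market advantage and long-term
stability."""

# Assumed tokenization: lower-case, keep runs of letters, no stemming.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)
total = len(tokens)

# Relative frequency, i.e. P(term in this document) = count / total word count.
p_business = counts["business"] / total

print(counts["business"], counts["intelligence"], total)
print(round(p_business, 4))
```

With this tokenization "business" occurs 5 times and "intelligence" 3 times; "businesses" is counted as a separate token.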
Text Clustering
• Packages required: tm, NLP, SnowballC, RColorBrewer, wordcloud
• Options
• header
• stringsAsFactors – set to FALSE to keep character variables as they are, rather than
converting them to factors.
• fileEncoding – character strings in R can be declared to be encoded as “latin1” or
“UTF-8” (U for Universal Coded Character Set, TF for Transformation Format, 8 for 8-bit).
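• Read the text file
• read.delim() – reads a delimited file into a data frame; the default separator is a tab.
• read.table() – better to specify sep explicitly (it can be “,”, “\t”, etc.).
• readLines() – readLines(filename, n = -1) reads all lines of the file into a character vector.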
Term Frequency-Inverse Document Frequency
• Doc1: HRM students XLRI
• Doc2: HRM students placement
• Doc3: Business Management XLRI
TF(x) = (no. of times term x occurs in the document) / (total number of terms in
the document)
IDF(x) = log2(total number of documents / no. of documents containing term x)
IDF value: HRM occurs in 2 of the 3 documents, so IDF(HRM) = log2(3/2) ≈ 0.585.
TF value: HRM occurs once in Doc1, which has 3 terms, so TF(HRM, Doc1) = 1/3.
TF-IDF value of HRM in Doc1: (1/3) × 0.585 ≈ 0.195.
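The worked example can be checked with a small sketch. It is plain Python, not the R packages named in these slides, and the helper names `tf`, `idf`, and `tf_idf` are chosen only for this illustration. It assumes whitespace tokenization with lower-casing, and that the queried term occurs in at least one document (otherwise the IDF denominator is zero).

```python
import math

# The three documents from the example.
docs = {
    "Doc1": "HRM students XLRI",
    "Doc2": "HRM students placement",
    "Doc3": "Business Management XLRI",
}

# Assumed tokenization: lower-case and split on whitespace.
tokens = {name: text.lower().split() for name, text in docs.items()}

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term / total terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_tokens):
    """Inverse document frequency with base-2 log; term must occur in some document."""
    n_docs = len(all_tokens)
    n_containing = sum(1 for toks in all_tokens.values() if term in toks)
    return math.log2(n_docs / n_containing)

def tf_idf(term, doc_name):
    """TF-IDF weight of a term in one named document."""
    return tf(term, tokens[doc_name]) * idf(term, tokens)

print(round(idf("hrm", tokens), 3))      # 0.585
print(round(tf_idf("hrm", "Doc1"), 3))   # 0.195
```

A term that appears in only one document, such as "placement", gets a higher IDF (log2(3) ≈ 1.585), which is how the weighting favours terms that distinguish documents.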
