Data Mining and BI - Student Notes 2
Knowledge
Value
Gain insights
Information
Investigate cause - effect
Create a model
Velocity
• Batch processes
• Near real time data
• Real time data
• Streaming/ periodic
Variety
• Structured – ERP, CRM
• Semi structured – XML
• Unstructured – AV, Doc, PDF, Email, IoT
Google processes 20 PB/ day, Facebook logs are 60 TB/day, eBay has 7 PB user data
Understanding and preparing the data for analysis
Sources and cleansing
Types of data - structured
Id    Name  Location   Assignment  Course  Course title
1234  A     Mumbai     Mumbai      F1234   Fintech
2345  B     Pune       Kolkata     M1234   Fin Mgmt
3456  C     Bengaluru  Ahmedabad   S1234   Strategy
4567  D     Nagpur     Delhi       ST1234  Statistics
Advantages
• Centralised databases
• Concurrent operations
• CRUD rights
• ODBC interfaces
• Batches/ real time
• Robust transactions
Automated tools need proper mapping, correction rules and workflow for acceptance of corrections
Data analysis
How does data analysis help
• Percentage of repeat visitors and the reasons for them to come back
Detection – based on pattern of past data decide if an event is an outlier – credit card default, redhead league
Regression – to find out or forecast specific value of an outcome – predict the price of a scrip, sales discount rate
Reinforcement – to help in decision making – out of 100 past events, a decision not to lend money was correct
Introduction to probability
Joint, Marginal and Conditional Probability
Bucket/ Fruit Orange Apple Row Total
Red 30 10 40
Blue 15 45 60
Column Total 45 55 100
• Probability of randomly selecting red bucket is 40% and that of blue bucket is 60%
• Joint Probability – the probability of two events occurring at the same time – e.g. P(Red, Orange) = 30/100
• It is written as P(X = xi, Y = yj) = nij / N, where nij is the count in that cell and N is the total sample size
• Marginal Probability – the probability of one event happening irrespective of the other – e.g. P(Orange) = 45/100
• It is written as P(X = xi) = ci / N or P(Y = yj) = rj / N
• Conditional Probability – the probability of one event occurring given that the other has occurred – e.g. P(Red | Orange) = 30/45 above
• It is written as P(Y = yj | X = xi) = nij / ci or P(X = xi | Y = yj) = nij / rj, where ci and rj are the column and row totals for those events
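The three probabilities can be checked directly against the bucket and fruit table. The following is a minimal sketch, assuming the counts are stored in a dictionary keyed by (bucket, fruit); the function names are illustrative and not part of the notes.

```python
# Counts taken from the bucket/fruit table in the notes.
counts = {
    ("Red", "Orange"): 30, ("Red", "Apple"): 10,
    ("Blue", "Orange"): 15, ("Blue", "Apple"): 45,
}
total = sum(counts.values())  # N = 100

def joint(bucket, fruit):
    """P(bucket, fruit) = n_ij / N."""
    return counts[(bucket, fruit)] / total

def marginal_bucket(bucket):
    """P(bucket) = row total / N."""
    return sum(n for (b, _), n in counts.items() if b == bucket) / total

def marginal_fruit(fruit):
    """P(fruit) = column total / N."""
    return sum(n for (_, f), n in counts.items() if f == fruit) / total

def bucket_given_fruit(bucket, fruit):
    """P(bucket | fruit) = n_ij / column total."""
    return joint(bucket, fruit) / marginal_fruit(fruit)

print(joint("Red", "Orange"))               # joint: 30/100
print(marginal_fruit("Orange"))             # marginal: 45/100
print(bucket_given_fruit("Red", "Orange"))  # conditional: 30/45
```

Dividing the joint probability by the marginal gives the same result as dividing the cell count by the column total, because N cancels.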
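$$I(P_i, N_i) = \frac{-P}{P_{Attr} + N_{Attr}} \times \log_2 \frac{P}{P_{Attr} + N_{Attr}} - \frac{N}{P_{Attr} + N_{Attr}} \times \log_2 \frac{N}{P_{Attr} + N_{Attr}}$$

$$\text{Entropy of each Attribute} = \sum_i \frac{P_i + N_i}{P_{Class} + N_{Class}} \times I(P_i, N_i)$$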
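A short sketch of how these two quantities might be computed. It assumes that P and N are the positive and negative counts for one value of the attribute, and that the class totals are the sums over all values. The data reuses the bucket table purely for illustration, treating "Orange" as the positive class, which is an assumption and not something the notes state.

```python
import math

def information(p, n):
    """I(P, N): information content of a split with p positive and n negative cases."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # a zero count contributes nothing (0 * log 0 is taken as 0)
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def attribute_entropy(splits):
    """Weighted sum of I(P_i, N_i) over the values of one attribute.

    splits is a list of (P_i, N_i) pairs, one per attribute value.
    """
    class_total = sum(p + n for p, n in splits)  # P_Class + N_Class
    return sum((p + n) / class_total * information(p, n) for p, n in splits)

# Illustrative assumption: attribute = bucket colour, positive = Orange, negative = Apple.
splits = [(30, 10), (15, 45)]  # Red bucket, Blue bucket
print(round(attribute_entropy(splits), 4))
```

An even split gives the maximum value of 1 bit, and a pure split gives 0, which is why attributes with lower entropy are preferred when building a tree.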
• Industry use –
• Google, Yahoo use clustering to cluster web pages by similarity
• Also used to identify relevance rate of search results which reduces the search time
Find the distance between points ‘a’ and ‘b’ – importance is weighted by factor ‘wi’
$$D(a, b) = \sqrt{w_1 (x_a - x_b)^2 + w_2 (y_a - y_b)^2 + w_3 (z_a - z_b)^2}$$

[Figure: scatter plot of objects a to e on X and Y axes, marking the coordinates (Xa, Ya) and (Xb, Yb)]

Object  X    Y
a       2    4
b       8    2
c       9    3
d       1    5
e       8.5  1

Object  a  b      c      d      e
a       0  6.325  7.071  1.414  7.159
b          0      1.414  7.616  1.118
c                 0      8.246  2.062
d                        0      8.500
e                               0
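The distance matrix can be reproduced from the object coordinates. This is a minimal sketch, assuming all weights are 1 (the values in the matrix are consistent with unweighted Euclidean distance); the function name `weighted_distance` is illustrative.

```python
import math

# Object coordinates from the table in the notes.
points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}

def weighted_distance(p, q, weights=None):
    """Weighted Euclidean distance; weights default to 1 for every dimension."""
    if weights is None:
        weights = [1.0] * len(p)
    return math.sqrt(sum(w * (pi - qi) ** 2 for w, pi, qi in zip(weights, p, q)))

names = list(points)
matrix = {(i, j): weighted_distance(points[i], points[j]) for i in names for j in names}

# Print the upper triangle, matching the layout of the matrix in the notes.
for row, i in enumerate(names):
    cells = [f"{matrix[(i, j)]:.3f}" for j in names[row:]]
    print(i, " ".join(cells))
```

Passing a `weights` list such as `[2.0, 1.0]` would make differences along X count twice as much as differences along Y.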
These methods are used iteratively to suitably group the data starting from K number of clusters
ML Algorithms (Unsupervised) – K-means clustering
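[Figure: scatter plot of the objects grouped into Cluster 1 and Cluster 2]

Cluster 1 = {a, b, d} has its centroid at (3.67, 3.67); Cluster 2 = {c, e} has its centroid at (8.75, 2).

$$D(b, abd) = \sqrt{(8 - 3.67)^2 + (2 - 3.67)^2} = 4.641$$

$$D(b, ce) = \sqrt{(8 - 8.75)^2 + (2 - 2)^2} = 0.750$$

'b' is closer to the centroid of Cluster 2, hence 'b' should be in Cluster 2 and not in Cluster 1.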
This is a much faster way for creating tighter groups compared to other clustering methods
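The reassignment step can be sketched as a small loop. This is an illustrative sketch, not a full K-means implementation: it assumes the starting clusters {a, b, d} and {c, e} from the worked example, uses unweighted Euclidean distance, and does not handle a cluster becoming empty.

```python
import math

points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}

def centroid(names):
    """Mean X and mean Y of the named objects."""
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def reassign(clusters):
    """Move every object to the cluster whose centroid is nearest."""
    centroids = [centroid(c) for c in clusters]
    updated = [[] for _ in clusters]
    for name, p in points.items():
        dists = [math.dist(p, c) for c in centroids]
        updated[dists.index(min(dists))].append(name)
    return updated

clusters = [["a", "b", "d"], ["c", "e"]]  # starting grouping from the example
while True:
    updated = reassign(clusters)
    if updated == clusters:  # stop once no object changes cluster
        break
    clusters = updated

print(clusters)
```

On this data the loop moves 'b' into the second cluster on the first pass and then stops, since no further object changes cluster.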
Converting variables into measurable attributes
Obs  Marital Status  Gender  Age (yrs)  Age category
a    Single          Female  15         Y
b    Married         Male    30         M
c    Separated       Male    60         O
d    Single          Female  32         M

$$D(i, j) = 1 - \frac{\text{Number of matches}}{\text{Number of attributes}}$$

Obs  a  b  c    d
a    0  1  1    1/3
b       0  2/3  2/3
c          0    1
d               0

Use of clustering for questionnaire analysis

Respondent  Q1  Q2   Q3
a           10  5    3
b           30  7.5  3.1
c           20  6    2.9
d           40  8    2.95

Correlations –
Q1 – Q2 = 0.984
Q1 – Q3 = 0.076
Q2 – Q3 = 0.23

Variable  Q1  Q2     Q3
Q1        0   0.016  0.924
Q2            0      0.770
Q3                   0
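Both dissimilarity matrices can be reproduced with a few lines. This sketch rests on two assumptions inferred from the numbers rather than stated in the notes: matches are counted over marital status, gender and age category (three attributes), and the distance between two questions is 1 minus their Pearson correlation. The helper names are illustrative.

```python
import math

# Categorical attributes per observation: (marital status, gender, age category).
obs = {
    "a": ("Single", "Female", "Y"),
    "b": ("Married", "Male", "M"),
    "c": ("Separated", "Male", "O"),
    "d": ("Single", "Female", "M"),
}

def matching_dissimilarity(i, j):
    """D(i, j) = 1 - (number of matches / number of attributes)."""
    matches = sum(x == y for x, y in zip(obs[i], obs[j]))
    return 1 - matches / len(obs[i])

# Questionnaire answers for respondents a, b, c, d.
answers = {
    "Q1": [10, 30, 20, 40],
    "Q2": [5, 7.5, 6, 8],
    "Q3": [3, 3.1, 2.9, 2.95],
}

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from deviations about the mean."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_dissimilarity(q1, q2):
    """Assumed definition: 1 - correlation, so highly correlated questions are close."""
    return 1 - pearson(answers[q1], answers[q2])

print(matching_dissimilarity("a", "d"))                  # close to 1/3
print(round(correlation_dissimilarity("Q1", "Q2"), 3))
```

Under this reading, Q1 and Q2 sit very close together and would be grouped first, while Q3 stays apart from both.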