
Introduction to Data Science

and Machine Learning


Prof Atul Patankar
How Data Science, Statistics and ML match up and
also differ
| Aspect | Data Science | Statistical Analysis | Machine Learning |
| Data collection | Not dependent on the data collection approach | Needs a well defined collection approach | Uses data already available |
| Data sources | All types of data | Mainly numeric | Application specific |
| Data cleansing | Can use any data and cleanse it | Needs clean data to get accurate results | Uses cleansed data |
| Process | Inductive | Deductive | Algorithms run on a model without a presumptive hypothesis |
| Data volume | Very large volume | Sample level | Large volume |
| Nature | Exploratory | Exploratory, descriptive or experimental | Concerned with application performance and optimization |
| Approach | Analyse data, extract patterns, develop a model | Generally: hypothesise, put a sample to the test and estimate for the population | Self-learning algorithms explore a training dataset and try to predict a class variable in a new data set |
| Methods | Estimation, classification, neural networks, clustering, association, visualisation etc. | Inferential and descriptive statistics: regression, ANOVA | Uses statistical tools like classification and regression to learn from a training data set |
Cross Industry Standard Process for Data Mining
(CRISP-DM)
• Business understanding: define business goals and align the data mining effort with them
• Data understanding: become familiar with the data and the domain by exploring the
knowledge base and utilizing prior experience
• Data preparation: clean and prepare the data for mining. Select relevant data
subset, find properties/attributes, check and generate new attributes. Define
appropriate attribute values and/or value discretization
• Data mining: choose the most appropriate data mining tools for summarization,
classification, regression, association, clustering and searching for patterns or
models of interest
• Interpretation of results: use patterns/ model for visualization, transformation,
removal of redundant patterns
• Use of knowledge: apply the knowledge gained from the data for business improvement
How is Data Science
scaling up?
It encompasses Analysis, Mining and ML
Gartner insights into evolution of Data Science
[Chart: value vs. difficulty. An individual record is of less value; refining Data into Information, Knowledge and finally Wisdom scales it up the value chain. The steps, in increasing difficulty: gather data (big data), analyse for obvious clues, create a model, test with historical data, test with real data, investigate cause and effect, gain insights, predict with probability.]
Gartner Hype Cycle shows Data Science is picking up
IBM – 3Vs

Volume
• From GB to TB to PB
• Around 90% of all data had been generated in the two years up to 2017-18
• Flight data, BFS transactions, medical data

Velocity
• Batch processes
• Near real time data
• Real time data
• Streaming/ periodic

Variety
• Structured – ERP, CRM
• Semi structured – XML
• Unstructured – AV, Doc,
PDF, Email, IoT

Google processes 20 PB/ day, Facebook logs are 60 TB/day, eBay has 7 PB user data
Understanding and preparing
the data for analysis
Sources and cleansing
Types of data - structured

Single table:

| Id | Name | Location | Assignment | Course | Course title |
| 1234 | A | Mumbai | Mumbai | F1234 | Fintech |
| 2345 | B | Pune | Kolkata | M1234 | Fin Mgmt |
| 3456 | C | Bengaluru | Ahmedabad | S1234 | Strategy |
| 4567 | D | Nagpur | Delhi | ST1234 | Statistics |

Advantages
• Centralised databases
• Concurrent operations
• CRUD rights
• ODBC interfaces
• Batches/ real time
• Robust transactions

The same data split into two tables:

| Id | Name | Location | Course | Course title |
| 1234 | A | Mumbai | F1234 | Fintech |
| 2345 | B | Pune | M1234 | Fin Mgmt |
| 3456 | C | Bengaluru | S1234 | Strategy |
| 4567 | D | Nagpur | ST1234 | Statistics |

| Id | Assignment | Course |
| 1234 | Mumbai | F1234 |
| 2345 | Kolkata | M1234 |
| 3456 | Ahmedabad | S1234 |
| 4567 | Delhi | ST1234 |

Issues
• Disk based operations
• Can’t fit into RAM
• Concurrency locks
• Query optimization
• Retrieval time for very large data volume
• Can’t accept unstructured data
Types of data – unstructured data
• Raw text. Issues –
• Global Regular Expression Print (GREP) pattern-based search or substring indexing with regular
expressions e.g. “[Tes]+co”
• Tf.Idf (term frequency/ inverse document frequency) – a statistic to determine how important a word is in a
collection of documents, or the relevance of the word in results for a user query (a small tf-idf sketch follows this list)
• LSI (latent semantic indexing) – is “car” similar to “auto”? – relationships between words
• Documents, charts, PDF files from digital libraries. Issues -
• Content matching from scanned documents
• PDF files may be encrypted
• Charts may have colour coding, pattern matching may be tedious
• Images, photos, videos, songs – Binary Large Objects (BLOBS). Issues –
• Identification, retrieval, colour matching
• Songs may be indexed by genre, singer, musician, instrument etc.
• Scanned images and ordinary (unscanned) photos need to be identified based on their image content
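A minimal tf-idf sketch for the raw-text case above, assuming scikit-learn (1.0 or later for `get_feature_names_out`) is available; the sample documents are made up for illustration.

```python
# Toy tf-idf example: terms frequent in one document but rare across the
# collection get the highest weights (documents are invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car drove down the road",
    "the auto dealership sold a car",
    "the chef cooked a meal",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # rows = documents, columns = terms

# Show the weights of the terms appearing in the first document.
for term, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    if score > 0:
        print(f"{term}: {score:.3f}")
```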
Data storage
• Data warehousing
• The database is stored in secure, off-site location
• Slices, snapshots, or views are made available on staging machines
• Dedicated interactive query servers allow fast user access
• The data might be processed or maintained in summarized form
• Distributed databases
• Not similar to Blockchain
• Different parts of the data held in different sites
• Some queries are local while the others are “corporate-wide”
• The database servers manage distributed queries, so users can work with them as if they were a single,
centrally located server
• The servers synchronize the parts for maintaining consistency
Data retrieval and presentation

OLAP – Online Analytical Processing
• Presents processed data in a meaningful manner rather than just the large volume of transactions from which users have to gauge the trends
• Multi-dimensional tables of aggregated data at various levels, e.g. sales in geographical regions in specific periods
• Further slicing and dicing possible - seasonal trends in different product categories or geographies
• Presentation can be managed using spreadsheets and graphs

[Diagram: a Data Warehouse with an OLAP Engine on top serves applications such as regional sales transactions, seasonal demand forecasts, and periodic targets & performance appraisals]
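As a small illustration of OLAP-style aggregation, slicing and dicing, here is a pandas pivot-table sketch; the sales data, regions and periods are made up.

```python
# Toy OLAP-style cube with pandas (data invented for illustration).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "East", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "amount":  [100, 150, 80, 120, 90, 60],
})

# Aggregate sales by region and quarter - one "slice" of a multi-dimensional cube.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)

# "Dice" further by product category within each region.
by_product = sales.pivot_table(index=["region", "product"], columns="quarter",
                               values="amount", aggfunc="sum")
print(by_product)
```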
Data cleansing - preparing the data for analysis

| Issue | Concerns | Remedy | Remarks |
| Missing values | How to interpret? Is data not available? Is the value 0? | Use zero or a constant; input mean/ mode value; random value from range | Fill in the values after analysing the root cause |
| Duplicate values | Are partial matches duplicates? Is the file corrupt? | Correct after checking; find pattern; check historical data | Verify with source |
| Inconsistent data | Multiple query results; data entry correctly done? gaps in data are left blank | Manual correction; check with the party; delete wrong entries | Master data may need cleansing |
| Stale data | Is the source accurate? Can it be used for a transaction? | Try to run the interface/ batch manually and check; specific data correction | If it recurs, the supplier may have to check |
| Outlier values | Data can’t be used as is; leads to calculation errors | Check if last session data can be used | Remove such cases from the analysis sample |

Automated tools need proper mapping, correction rules and workflow for acceptance of corrections
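A small pandas sketch of two of the remedies from the table above (mean/mode imputation and removal of exact duplicates); the column names and values are made up for illustration.

```python
# Toy cleansing example: fill missing values and drop duplicate rows.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "city":     ["Pune", "Mumbai", "Mumbai", None, "Delhi"],
    "amount":   [100.0, 250.0, 250.0, np.nan, 80.0],
})

# Missing numeric value: impute with the column mean (after checking the root cause).
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Missing categorical value: impute with the mode.
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Duplicate rows: drop exact duplicates after verifying with the source.
df = df.drop_duplicates()
print(df)
```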
Data analysis
How does data analysis help
• Percentage of repeat visitors and the reasons for them to come back

• Products bought together

• Categories - city transport, recreation, education, healthcare, water supply, sanitation

• Customer feedback on social media

• Building a prototype model based on reviews

• Data visualisation helps in quick decision making – trends, shocks etc.


Machine Learning
Types
Supervised Learning
• Supervised learning
• Data is in the form of records of attribute values i.e. given set of sample data
• The records are labelled by a Class variable to which these belong
• The objective is to find a model or classifier which will enable a new instance to be
classified e.g. ‘climate’ of countries, ‘fuel efficiency’ of cars, ‘diagnosis’ of a patient
• The models derived from other attributes are used for prediction or classification
• Method examples – Decision Tree Analysis, Rule Set Induction
• For example –
• Provide input parameters which decide an outcome e.g. when does it rain?
• Humidity > 95%
• Temperature > 35 degrees Celsius
• Wind from West side
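A toy supervised-learning sketch of the rain example: labelled records of attribute values are used to fit a classifier, which then classifies a new instance. The data, thresholds and use of scikit-learn's decision tree are illustrative assumptions, not part of the slide.

```python
# Toy supervised learning: attributes -> labelled class, then predict a new case.
from sklearn.tree import DecisionTreeClassifier

# Attributes: humidity (%), temperature (deg C), wind_from_west (1 = yes)
X = [
    [96, 36, 1],
    [97, 38, 1],
    [90, 36, 1],
    [96, 30, 0],
    [80, 25, 0],
    [60, 33, 1],
]
y = ["rain", "rain", "no rain", "no rain", "no rain", "no rain"]

model = DecisionTreeClassifier().fit(X, y)

# Classify a new, unlabelled instance (it matches all three "rain" conditions).
print(model.predict([[98, 37, 1]]))
```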
Unsupervised Learning
• Unsupervised learning
• There are no labels for data points
• The models detect patterns from unlabelled data for exploratory analysis
• Algorithms group the data in clusters to describe the structure
• Try to find unknown properties of data
• For example –
• Classify fruits
• Colour, size, weight, season
• Based on the classification, one can put labels like – citrus, sweet, nutritious
• When the computer learns on its own
• Create a list of animals by weight
• Segregate based on height, nocturnal behaviour, jaw strength, teeth etc.
Reinforcement Learning
• Reinforcement learning
• The algorithm learns on its own
• Gets confirmation about the outcome – correctness, usefulness, accuracy etc.
• The algorithm refines itself based on the feedback
• For example –
• Similar to training domesticated animals
• Do the work like classification and check if it is correct
• Get reward or negative points
• Refine the logic based on the outcome
• Mood based ambient condition controller
• Based on tone, behaviour, body function
• Turn on suitable music, lighting, adjust the temperature
Which algorithm is to be used?
Classification – to separate the data into predecided groups – is it human or machine, is it a citrus fruit or not

Detection – based on pattern of past data decide if an event is an outlier – credit card default, redhead league

Regression – to find out or forecast specific value of an outcome – predict the price of a scrip, sales discount rate

Clustering (unsupervised) – to organize objects into groups based on characteristics

Reinforcement – to help in decision making – out of 100 past events, a decision not to lend money was correct
Introduction to probability
Joint, Marginal and Conditional Probability
| Bucket/ Fruit | Orange | Apple | Row total |
| Red | 30 | 10 | 40 |
| Blue | 15 | 45 | 60 |
| Column total | 45 | 55 | 100 |

• Probability of randomly selecting red bucket is 40% and that of blue bucket is 60%

• Joint Probability – the probability of two events occurring at the same time – e.g. 30/100
• It is written as P(X = xi , Y = yj) = (nij / N) --- N is total sample size

• Marginal Probability – the probability of one event happening irrespective of the other – e.g. 45/100
• It is written as P(X = xi) = (ci / N) or P(Y = yj) = (rj / N)

• Conditional Probability – the probability of one event occurring given the other – e.g. 30/45 above
• It is written as P(Y = yj | X = xi) = (nij / ci) or P(X = xi | Y = yj) = (nij / rj) --- ci and rj are the column and row totals for those events

Marginal probability is the sum of Joint probabilities


Joint probability is product of Conditional and Marginal probabilities
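A short numeric sketch of the three probabilities, computed directly from the bucket/fruit table above (numpy assumed available).

```python
# Joint, marginal and conditional probabilities from the bucket/fruit counts.
import numpy as np

# Rows: Red, Blue buckets; columns: Orange, Apple
counts = np.array([[30, 10],
                   [15, 45]])
N = counts.sum()                    # 100

joint = counts / N                  # P(bucket, fruit), e.g. P(Red, Orange) = 0.30
p_bucket = joint.sum(axis=1)        # marginal P(bucket) = [0.40, 0.60]
p_fruit = joint.sum(axis=0)         # marginal P(fruit)  = [0.45, 0.55]

# Conditional P(Red | Orange) = joint / marginal of the conditioning event = 30/45
p_red_given_orange = joint[0, 0] / p_fruit[0]
print(joint, p_bucket, p_fruit, round(p_red_given_orange, 3))
```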
Machine Learning Algorithms –
Naïve Bayes Classifier
(Supervised)
ML Algorithms (Supervised) – Naïve Bayes Classifier
Naïve Bayes – The Bayes theorem predicts probability of an event based on prior
knowledge of conditions which might be related to the event
• Used for classification of emails, text notes, documents or a web page
• Facebook analyses status updates for +ve/ -ve emotions
• Google classifies documents, searches and pages with relevance scores
• Many websites classify articles on technology, sports, politics etc.
• Gmail allocates a value from available categories e.g. Spam/ No spam
• Employs simple classification of words based on Bayes probability theorem
• Suitable for large data volume with many independent attributes
• Performs well when variables are categorical
• Requires less volume of training data set than say logistic regression
• It works well even for multi class prediction
ML Algorithms (Supervised) – an example of Bayes

Training data (14 records):

| No | Weather | Temperature | Humidity | Windy | Tour? |
| 1 | Sunny | Hot | High | No | No |
| 2 | Sunny | Hot | High | Yes | No |
| 3 | Cloudy | Hot | High | No | Yes |
| 4 | Rainy | Mild | High | No | Yes |
| 5 | Rainy | Cold | Normal | No | Yes |
| 6 | Rainy | Cold | Normal | Yes | No |
| 7 | Cloudy | Cold | Normal | Yes | Yes |
| 8 | Sunny | Mild | High | No | No |
| 9 | Sunny | Cold | Normal | No | Yes |
| 10 | Rainy | Mild | Normal | No | Yes |
| 11 | Sunny | Mild | Normal | Yes | Yes |
| 12 | Cloudy | Mild | High | Yes | Yes |
| 13 | Cloudy | Hot | Normal | No | Yes |
| 14 | Rainy | Mild | High | Yes | No |

Priors:
• Probability of going on tour – P(Yes) = 9/14
• Probability of tour cancellation – P(No) = 5/14

Likelihood tables per attribute:

| Weather | Yes | No | P(Y) | P(N) |
| Sunny | 2 | 3 | 2/9 | 3/5 |
| Cloudy | 4 | 0 | 4/9 | 0/5 |
| Rainy | 3 | 2 | 3/9 | 2/5 |
| Total | 9 | 5 | 1 | 1 |

| Temp | Yes | No | P(Y) | P(N) |
| Hot | 2 | 2 | 2/9 | 2/5 |
| Mild | 4 | 2 | 4/9 | 2/5 |
| Cold | 3 | 1 | 3/9 | 1/5 |
| Total | 9 | 5 | 1 | 1 |

| Humidity | Yes | No | P(Y) | P(N) |
| High | 3 | 4 | 3/9 | 4/5 |
| Normal | 6 | 1 | 6/9 | 1/5 |
| Total | 9 | 5 | 1 | 1 |

| Windy | Yes | No | P(Y) | P(N) |
| No | 6 | 2 | 6/9 | 2/5 |
| Yes | 3 | 3 | 3/9 | 3/5 |
| Total | 9 | 5 | 1 | 1 |

Event – Sunny, Cold, highly humid, strong wind – should we go on tour?

| Attribute value | P(Y) | P(N) |
| Sunny | 2/9 | 3/5 |
| Cold | 3/9 | 1/5 |
| Humidity High | 3/9 | 4/5 |
| Windy Yes | 3/9 | 3/5 |
| Class prior | 9/14 | 5/14 |

P(Event) of yes = P(X|Y)P(Y) = (2/9 × 3/9 × 3/9 × 3/9) × (9/14) = 0.0053
P(Event) of no = P(X|N)P(N) = (3/5 × 1/5 × 4/5 × 3/5) × (5/14) = 0.0206
P(Event) = P(Sunny) × P(Cold) × P(High humidity) × P(Windy) of total = 5/14 × 4/14 × 7/14 × 6/14 = 0.02186

The decision – the higher of the normalised probabilities:
P(Y | Event) = 0.0053 / 0.02186 = 0.2424
P(N | Event) = 0.0206 / 0.02186 = 0.9421
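A plain-Python sketch that reproduces the hand calculation above from the 14-record tour table, using the naive independence assumption; no library is needed for this toy case.

```python
# Naive Bayes scores for the event (Sunny, Cold, High humidity, Windy).
data = [  # (weather, temp, humidity, windy, tour)
    ("Sunny","Hot","High","No","No"), ("Sunny","Hot","High","Yes","No"),
    ("Cloudy","Hot","High","No","Yes"), ("Rainy","Mild","High","No","Yes"),
    ("Rainy","Cold","Normal","No","Yes"), ("Rainy","Cold","Normal","Yes","No"),
    ("Cloudy","Cold","Normal","Yes","Yes"), ("Sunny","Mild","High","No","No"),
    ("Sunny","Cold","Normal","No","Yes"), ("Rainy","Mild","Normal","No","Yes"),
    ("Sunny","Mild","Normal","Yes","Yes"), ("Cloudy","Mild","High","Yes","Yes"),
    ("Cloudy","Hot","Normal","No","Yes"), ("Rainy","Mild","High","Yes","No"),
]

def score(event, label):
    """P(event attributes | label) * P(label), assuming attribute independence."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                     # prior: 9/14 or 5/14
    for i, value in enumerate(event):
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

event = ("Sunny", "Cold", "High", "Yes")
print(round(score(event, "Yes"), 4))   # ~0.0053
print(round(score(event, "No"), 4))    # ~0.0206
```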
Machine Learning Algorithms –
Decision Tree
(Supervised)
ML Algorithms (Supervised) – Decision Trees
• A node represents a test on the attribute. A branch is an outcome of the test and a leaf node
represents a class label or the decision after computing all the attributes
• These help to present the data in a visual and intuitive manner
• The chart shows how a different decision would have impacted the model
• Decision optimality can be cross checked by back-tracking the entire tree
• Best suited where instances are represented as attribute–value pairs
• Where multiple attributes are available, the decision tree can work with missing data/ data
errors/ outliers as it can use other attributes
• The algorithm works well with categorical variables but continuous type can also be used
• Used for Remote sensing, Loan default probability, Identification of ‘At Risk Patients’
ML Algorithms (Supervised) – Decision Trees –
introduction to important terms
• Entropy – it is the measure of randomness or unpredictability in data e.g.
data on different types of flowers or animals
• Information gain – it represents the decrease in entropy after a branch is
split e.g. data on animals has highest entropy. After these are grouped into
tall (say elephant, giraffe) and short (lion, deer), there is information gain.
The decision conditions should be such that the gain is the highest
• Root node – the first node where branches are split
• Decision node – any node in the tree where branches are split
• Leaf node – it is the last node in the tree which has no further branches
ML Algorithms (Supervised) – Decision Trees –
formulae
$$\text{Entropy of Class} = -\frac{P}{P+N}\log_2\frac{P}{P+N} \;-\; \frac{N}{P+N}\log_2\frac{N}{P+N}$$

For the i-th value of an attribute, with $P_i$ positive and $N_i$ negative records:

$$I(P_i, N_i) = -\frac{P_i}{P_i+N_i}\log_2\frac{P_i}{P_i+N_i} \;-\; \frac{N_i}{P_i+N_i}\log_2\frac{N_i}{P_i+N_i}$$

$$\text{Entropy of each Attribute} = \sum_i \frac{P_i+N_i}{P_{Class}+N_{Class}} \times I(P_i, N_i)$$

Gain = (Entropy of Class) – (Entropy of Attribute)
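A direct, minimal translation of these formulas into Python; the example counts in the last line are the ones used on the next slide (5 positive/5 negative records, an attribute whose three values split as (0,3), (2,2), (3,0)).

```python
# Entropy and information gain, as defined by the formulas above.
import math

def entropy(p, n):
    """Entropy of a group with p positive and n negative records."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:                      # convention: 0 * log2(0) = 0
            frac = count / total
            e -= frac * math.log2(frac)
    return e

def gain(class_counts, value_counts):
    """class_counts = (P, N); value_counts = [(Pi, Ni), ...] per attribute value."""
    P, N = class_counts
    attr_entropy = sum((pi + ni) / (P + N) * entropy(pi, ni)
                       for pi, ni in value_counts)
    return entropy(P, N) - attr_entropy

print(gain((5, 5), [(0, 3), (2, 2), (3, 0)]))   # 0.6
```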


ML Algorithms (Supervised) – Decision Trees

Dataset:

| Age | House | Type | Cashflow |
| Old | Yes | Working | Negative |
| Old | No | Working | Negative |
| Old | No | Retired | Negative |
| Mid | Yes | Working | Negative |
| Mid | Yes | Retired | Negative |
| Mid | No | Retired | Positive |
| Mid | No | Working | Positive |
| Young | Yes | Working | Positive |
| Young | No | Retired | Positive |
| Young | No | Working | Positive |

Entropy and information gain per attribute:

| Age | Positive | Negative | Entropy |
| Old | 0 | 3 | 0 |
| Mid | 2 | 2 | 1 |
| Young | 3 | 0 | 0 |
Entropy of ‘Age’ = 0.40, Gain = 0.60

| House | Positive | Negative | Entropy |
| Yes | 1 | 3 | 0.81 |
| No | 4 | 2 | 0.92 |
Entropy of ‘House’ = 0.8754, Gain = 0.1245

| Type | Positive | Negative | Entropy |
| Working | 3 | 3 | 1 |
| Retired | 2 | 2 | 1 |
Entropy of ‘Type’ = 1, Gain = 0

Steps to decide the tree structure:
1. Calculate the class entropy
2. Calculate the entropy of each attribute and the resulting information gain
3. The attribute having the maximum gain forms a node
4. The number of branches under the root node is decided by the number of values of that variable; here Age has three values, so there will be three branches

Worked calculation for Age:

Entropy of Class = -(5/10) × log2(5/10) - (5/10) × log2(5/10) = 1
I(Age = Old) = -(0/3) × log2(0/3) - (3/3) × log2(3/3) = 0 (taking 0 × log2 0 = 0)
Entropy of attribute ‘Age’ = (0+3)/(5+5) × 0 + (2+2)/(5+5) × 1 + (3+0)/(5+5) × 0 = 0.40
Gain (Age) = 1 – 0.4 = 0.6

Resulting root split: Age becomes the root node; the Old branch leads to Negative, the Young branch to Positive, and the Mid branch still needs a further split.
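A short sketch that computes the information gain of each attribute from the cashflow dataset above, confirming that Age (gain 0.60) has the maximum gain and becomes the root node.

```python
# Gain per attribute for the cashflow records on this slide.
import math

records = [  # (age, house, type, cashflow)
    ("Old","Yes","Working","Negative"), ("Old","No","Working","Negative"),
    ("Old","No","Retired","Negative"),  ("Mid","Yes","Working","Negative"),
    ("Mid","Yes","Retired","Negative"), ("Mid","No","Retired","Positive"),
    ("Mid","No","Working","Positive"),  ("Young","Yes","Working","Positive"),
    ("Young","No","Retired","Positive"),("Young","No","Working","Positive"),
]

def entropy(rows):
    total = len(rows)
    e = 0.0
    for label in ("Positive", "Negative"):
        count = sum(1 for r in rows if r[-1] == label)
        if count:
            e -= (count / total) * math.log2(count / total)
    return e

def gain(rows, col):
    values = {r[col] for r in rows}
    attr_entropy = sum(
        len(subset) / len(rows) * entropy(subset)
        for subset in ([r for r in rows if r[col] == v] for v in values)
    )
    return entropy(rows) - attr_entropy

for name, col in [("Age", 0), ("House", 1), ("Type", 2)]:
    print(name, round(gain(records, col), 4))   # Age 0.6, House ~0.125, Type 0.0
```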
Machine Learning Algorithms –
Apriori
(Unsupervised)
ML Algorithm (Unsupervised) – Apriori
• It searches for associations amongst the items in a set of transactions
• It uses statistical analysis – frequent occurrence of ‘A’ results in similar
occurrence of ‘B’ as well
• It collects similar events which are dependent in nature and then tries to
relate with an independent event e.g. a Tab with a cover or a stand
• Properties –
• Subset of larger set of items also follows the same properties
• Property of subset applies to superset as well – frequent/ infrequent
• Mainly used for
• Google uses it for autocomplete functionality
• Amazon develops insights on products which are purchased together – tea & sugar
• To detect side effects of drugs – health condition, effects of taking drug, diagnosis
Apriori algorithm - Terminology
• Support – represents proportion of transactions which contain both the
terms/ items/ goods
• It is represented as A -> B = Probability of both A and B happening together
• It is measured in terms of frequency of association i.e. A and B being in a basket
• Confidence – it is the strength of the association between A and B
• Represented as the probability of A and B occurring together divided by the probability of A, i.e. the conditional probability of B given A
• It is the occurrence of B in the transactions where A is also there
• Finding the association and deciding on its strength is first step
• Next, the association rule suggested by the algorithm should have business
interest e.g. if it suggests items which are not purchased frequently, then it
is of no consequence
ML Algorithm (Unsupervised) – Apriori - example

Transactions:

| Transaction id | Items |
| T1 | A, B, C |
| T2 | A, C |
| T3 | A, D |
| T4 | B, E, F |

Minimum support = 50%, minimum confidence = 50%
Value of support (count) = (Percentage) X (Number of transactions)/ 100; here it will be = 50 X 4/100 = 2

Candidate 1 (single items):

| Item | Support |
| {A} | 3 |
| {B} | 2 |
| {C} | 2 |
| {D} | 1 |
| {E} | 1 |
| {F} | 1 |

Level 1 (items meeting minimum support):

| Item | Support |
| {A} | 3 |
| {B} | 2 |
| {C} | 2 |

Candidate 2 (pairs from Level 1):

| Item | Support |
| { A, B } | 1 |
| { B, C } | 1 |
| { A, C } | 2 |

Level 2 (pairs meeting minimum support):

| Item | Support |
| { A, C } | 2 |

Association calculation formula – confidence = support of the rule / occurrence of the item on the left side

| Association rule | Support | Confidence |
| A → C | 2 | 2/3 = 66.67% |
| C → A | 2 | 2/2 = 100% |

Here both associations are valid as the confidence level is more than the given level of 50%.
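A plain-Python sketch of the support and confidence calculation on the four transactions above; for a dataset this small no library is needed.

```python
# Frequent itemsets and association rules for the toy basket data.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support_count = 2          # 50% of 4 transactions
min_confidence = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items
items = {i for t in transactions for i in t}
level1 = [frozenset({i}) for i in items if support({i}) >= min_support_count]

# Level 2: frequent pairs built from Level 1
level2 = [a | b for a, b in combinations(level1, 2)
          if support(a | b) >= min_support_count]

# Association rules from the frequent pairs
for pair in level2:
    for item in pair:
        antecedent = frozenset({item})
        confidence = support(pair) / support(antecedent)
        if confidence >= min_confidence:
            print(f"{item} -> {set(pair - antecedent)}  "
                  f"support={support(pair)}, confidence={confidence:.0%}")
```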
Machine Learning Algorithms –
Clustering
(Unsupervised)
ML Algorithms (Unsupervised) – clustering
• Clustering – it groups the data points based on various methods like
• Simple grouping models – do not provide much insights beyond grouping
• Centroid K-means – assign data points to clusters to minimize distances e.g. news
• Mean shifting – locate high density of data points with the centre as the mean
• DBSCAN – density-based clustering that grows clusters from dense regions and treats sparse points as noise
• Gaussian Mixture Models – uses standard deviation and mean for probability %
• Connectivity hierarchical – find distance of a single data point from cluster

• Industry use –
• Google, Yahoo use clustering to cluster web pages by similarity
• Also used to identify relevance rate of search results which reduces the search time

Market segmentation, product positioning, image recognition, opinion/ habit typology


ML Algorithms (Unsupervised) K – nearest neighbour

Each record is a point D(xi, yi).

Distance between points ‘a’ and ‘b’ when all the variables have equal importance:

$$D(a, b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$

Distance between points ‘a’ and ‘b’ when importance is weighted by factors $w_i$:

$$D(a, b) = \sqrt{w_1 (x_a - x_b)^2 + w_2 (y_a - y_b)^2 + w_3 (z_a - z_b)^2}$$

Coordinates:

| Object | X | Y |
| a | 2 | 4 |
| b | 8 | 2 |
| c | 9 | 3 |
| d | 1 | 5 |
| e | 8.5 | 1 |

Pairwise distances:

| Object | a | b | c | d | e |
| a | 0 | 6.325 | 7.071 | 1.414 | 7.159 |
| b | | 0 | 1.414 | 7.616 | 1.118 |
| c | | | 0 | 8.246 | 2.062 |
| d | | | | 0 | 8.500 |
| e | | | | | 0 |

• Observation points ‘a’ and ‘d’ have the least distance
• Similarly the distances between ‘b’, ‘c’ and ‘e’ are also small
• Any other combination will prove to be distant

These methods are used iteratively to suitably group the data starting from K number of clusters
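A short sketch that recomputes the pairwise Euclidean distances in the table above (Python 3.8+ for `math.dist`).

```python
# Pairwise distances between the five observation points.
import math

points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}

names = list(points)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(f"D({p},{q}) = {math.dist(points[p], points[q]):.3f}")
# D(a,d) = 1.414 is the smallest pair, matching the table above.
```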
ML Algorithms (Unsupervised) – K-means clustering

Initial assignment (K = 2):

Cluster 1:
| Obs | X | Y |
| a | 2 | 4 |
| b | 8 | 2 |
| d | 1 | 5 |
| Avg (centroid) | 3.67 | 3.67 |

Cluster 2:
| Obs | X | Y |
| c | 9 | 3 |
| e | 8.5 | 1 |
| Avg (centroid) | 8.75 | 2 |

Distances to the centroids:

$$D(a, abd) = \sqrt{(2 - 3.67)^2 + (4 - 3.67)^2} = 1.702$$
$$D(a, ce) = \sqrt{(2 - 8.75)^2 + (4 - 2)^2} = 7.040$$

‘a’ is closer to the centroid of Cluster 1.

$$D(b, abd) = \sqrt{(8 - 3.67)^2 + (2 - 3.67)^2} = 4.641$$
$$D(b, ce) = \sqrt{(8 - 8.75)^2 + (2 - 2)^2} = 0.750$$

‘b’ is closer to the centroid of Cluster 2, hence ‘b’ should be in Cluster 2 and not in Cluster 1.

This is a much faster way for creating tighter groups compared to other clustering methods
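A minimal sketch of the reassignment step illustrated above: compute each point's distance to the two current centroids and move it to the nearer one. In a full K-means loop the centroids would then be recomputed and the step repeated until stable.

```python
# One K-means reassignment step on the points from this slide.
import math

points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}
clusters = {1: ["a", "b", "d"], 2: ["c", "e"]}

def centroid(members):
    xs, ys = zip(*(points[m] for m in members))
    return (sum(xs) / len(xs), sum(ys) / len(ys))

centroids = {k: centroid(v) for k, v in clusters.items()}  # {1: (3.67, 3.67), 2: (8.75, 2.0)}

for name, p in points.items():
    nearest = min(centroids, key=lambda k: math.dist(p, centroids[k]))
    print(name, "-> Cluster", nearest)
# 'b' is nearer the Cluster 2 centroid (0.75 vs 4.64), so it is reassigned.
```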
Converting variables into measurable attributes

Categorical records:

| Obs | Marital Status | Gender | Age (yrs) | Age category |
| a | Single | Female | 15 | Y |
| b | Married | Male | 30 | M |
| c | Separated | Male | 60 | O |
| d | Single | Female | 32 | M |

D(i, j) = 1 – (Number of matches/ Number of attributes)

| Obs | a | b | c | d |
| a | 0 | 1 | 1 | 1/3 |
| b | | 0 | 2/3 | 2/3 |
| c | | | 0 | 1 |
| d | | | | 0 |

• Distance between ‘a’ and ‘d’ is shortest, hence ‘a’ and ‘d’ can be grouped to form a cluster
• Distance between ‘b’ and ‘c’ is smaller than the rest – will these be in another cluster?

Use of clustering for questionnaire analysis:
• A draft questionnaire is first tested on a few respondents

| Respondent | Q1 | Q2 | Q3 |
| a | 10 | 5 | 3 |
| b | 30 | 7.5 | 3.1 |
| c | 20 | 6 | 2.9 |
| d | 40 | 8 | 2.95 |

Correlations: Q1 – Q2 = 0.984, Q1 – Q3 = 0.076, Q2 – Q3 = 0.23

Distance between questions, i.e. (1 – correlation coefficient):

| Variable | Q1 | Q2 | Q3 |
| Q1 | 0 | 0.016 | 0.924 |
| Q2 | | 0 | 0.770 |
| Q3 | | | 0 |

• Distance between ‘Q1’ and ‘Q2’ is shortest
• The researcher may drop either Q1 or Q2 from the final questionnaire
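A small sketch of the two distance measures used on this slide: the simple matching distance for the categorical records and (1 - correlation) for the questionnaire items (numpy assumed available).

```python
# Matching distance between categorical records and correlation distance
# between questionnaire items, using the data from this slide.
import numpy as np

# Categorical observations: (marital status, gender, age category)
obs = {
    "a": ("Single", "Female", "Y"),
    "b": ("Married", "Male", "M"),
    "c": ("Separated", "Male", "O"),
    "d": ("Single", "Female", "M"),
}

def matching_distance(x, y):
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return 1 - matches / len(x)

print(matching_distance(obs["a"], obs["d"]))   # 1/3 -> 'a' and 'd' cluster together

# Questionnaire responses: columns Q1, Q2, Q3
answers = np.array([[10, 5, 3], [30, 7.5, 3.1], [20, 6, 2.9], [40, 8, 2.95]])
corr = np.corrcoef(answers, rowvar=False)      # correlations between questions
distance = 1 - corr
print(round(distance[0, 1], 3))                # Q1-Q2 ~0.016, so one of them can be dropped
```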
Implementation
methodology and systems
Process and architecture
Data mining implementation - MapReduce

• The Map function distributes records across the available computers; input data files are converted to intermediate key-value pairs by the Map function
• The Reduce function finds common keys to group the attributes, and then provides the result – here, the maximum temperature for each city

Input data:

| City | Temperature |
| Toronto | 20 |
| New York | 22 |
| Las Vegas | 24 |
| San Diego | 28 |
| Charlotte | 25 |
| San Diego | 26 |
| New York | 15 |
| Las Vegas | 22 |
| Charlotte | 23 |
| Toronto | 12 |
| New York | 25 |
| Las Vegas | 29 |
| Charlotte | 27 |
| San Diego | 30 |
| Toronto | 28 |

Reduce tasks (intermediate key-value groups and the final maximum per key):

| City | Temperatures seen | City and max Temperature |
| Toronto | 20, 12, 28 | Toronto (28) |
| New York | 22, 15, 25 | New York (25) |
| Las Vegas | 24, 22, 29 | Las Vegas (29) |
| San Diego | 28, 26, 30 | San Diego (30) |
| Charlotte | 25, 23, 27 | Charlotte (27) |
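A minimal in-process sketch of the same map/reduce flow: map each record to a (city, temperature) pair, group by key, then reduce each group with max(). In a real MapReduce system the map and reduce tasks would run in parallel on different machines.

```python
# Toy map/reduce: maximum temperature per city.
from collections import defaultdict

records = [
    ("Toronto", 20), ("New York", 22), ("Las Vegas", 24), ("San Diego", 28),
    ("Charlotte", 25), ("San Diego", 26), ("New York", 15), ("Las Vegas", 22),
    ("Charlotte", 23), ("Toronto", 12), ("New York", 25), ("Las Vegas", 29),
    ("Charlotte", 27), ("San Diego", 30), ("Toronto", 28),
]

# Map phase: emit intermediate key-value pairs grouped by key (city).
intermediate = defaultdict(list)
for city, temp in records:
    intermediate[city].append(temp)

# Reduce phase: one reducer per key computes the maximum temperature.
result = {city: max(temps) for city, temps in intermediate.items()}
print(result)   # {'Toronto': 28, 'New York': 25, 'Las Vegas': 29, 'San Diego': 30, 'Charlotte': 27}
```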
