MBA933_-_Lectures_1-2

MBA933

Data Mining
Tools & Techniques
Lectures 1‐2

Dr. Faiz Hamid


Associate Professor
Department of Management Sciences
IIT Kanpur
fhamid@iitk.ac.in
Course Structure
• 16 Sessions (8 Weeks)
– Live online sessions

• Evaluation
– Assignments and/or Project: 30%
– Surprise Quizzes: 30% (there will be no make-up quiz)
– End Term Exam: 30%
– Class Participation: 10%
Course Materials
• Books
– Data Mining: Concepts and Techniques, 3rd ed.
• Jiawei Han, Micheline Kamber & Jian Pei
– Introduction to Data Mining
• P. N. Tan, M. Steinbach & V. Kumar
– An Introduction to Statistical Learning: With Applications in
Python
• James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J.
– Hands‐On Machine Learning with Scikit‐Learn, Keras and
TensorFlow
• Aurélien Géron
– The Art of R Programming
• Norman Matloff

• Handouts and Case Studies


Course Outline
• Introduction to DM, DM Tools, functionalities and applications
• Basic Data Understanding
• Data Preparation for DM
• Supervised and Unsupervised Learning
• Decision Trees
• Artificial Neural Network
• Naive Bayes Classifier
• Classifier Evaluation and Improvement Techniques
Module 1

Introduction to Data Mining, DM Tools, Functionalities and Applications
Why Data Mining?
• Explosive growth of data: from terabytes to petabytes
– Automated data collection tools, database systems, Web, computerized society
– Growth of many application areas
• Major sources:
– Business: Web, e-commerce, transactions, stocks, …
– Science: remote sensing, bioinformatics, scientific simulation, …
– Society and everyone: news, digital cameras, YouTube

• We are drowning in an ocean of data, but starving for knowledge

• Solution: DATA MINING
Internet of Events (IoE)

• information created due to increase in knowledge
• data generated due to social interaction
• data generated by objects connected to network
• data with spatial dimension
What is Data Mining?
An iterative (many steps, passes) and interactive (human intervention) process of discovering
– novel (non-trivial),
– valid (generalized to future),
– useful (action is possible),
– comprehensive and
– understandable (leading to insight)
patterns and models in MASSIVE data sources
What is Data Mining?
• Data mining: a misnomer?
• Knowledge discovery in databases (KDD), knowledge
extraction, pattern analysis, data archeology, information
harvesting, business intelligence, etc.
• Is DM = KD?
– Knowledge Discovery ‐ Overall process of extracting knowledge from
data
– Data Mining ‐ A step in KD process, application of a specific algorithm
based on the overall goal of the KD process

• What is not data mining?
– Simple search and query processing
– Expert systems or small ML/statistical programs
Knowledge Discovery Process
[Figure: the knowledge discovery process. Labels in the figure: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Steps of KD Process
1. Learning the application domain:
– relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing: (may take 60% of
effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable
reduction, invariant representation
Steps of KD Process
5. Choosing functions of data mining
– summarization, classification, regression,
association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
9. Use of discovered knowledge
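
The data-handling steps above can be sketched as a small pipeline. This is only a minimal illustration in Python, assuming a hypothetical list of customer records: the field names (`age`, `income`, `buys`), the cleaning policy, and the toy "pattern" are invented for the example. Steps that depend on human judgment (learning the domain, choosing functions and algorithms, evaluating and using the knowledge) are not modeled.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_data = [
    {"age": 25, "income": 30000, "buys": True},
    {"age": 47, "income": None, "buys": False},
    {"age": 35, "income": 52000, "buys": True},
    {"age": 52, "income": 61000, "buys": False},
    {"age": 23, "income": 28000, "buys": True},
]

def select(records, fields):
    """Step 2: create the target data set by keeping only relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Step 3: drop records with missing values (one simple cleaning policy)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Step 4: derive a coarser feature, here an age group."""
    return [
        {**r, "age_group": "under_40" if r["age"] < 40 else "40_plus"}
        for r in records
    ]

def mine(records):
    """Step 7: a toy pattern, the share of buyers within each age group."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_group"], []).append(1 if r["buys"] else 0)
    return {g: mean(flags) for g, flags in groups.items()}

target = select(raw_data, ["age", "income", "buys"])
patterns = mine(transform(clean(target)))
print(patterns)
```

In a real project each function would be far more involved; the point is only that the output of one step feeds the next, and that the whole chain is usually rerun several times.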
Evolution of DM

• 1980s: ERP
• 1990s: CRM
• 2000s: eCommerce
• 2010s: Data Mining / Big Data Analytics
Why has the new age emerged?
• Computing Storm
– Cheaper technology
– Mobile computing
– Social networking
– Cloud computing

• Data Storm
– Volume
– Velocity
– Variety
– Veracity
Why has the new age emerged?

• 1969: the Apollo Guidance Computer (AGC) had 64 Kbyte of memory and operated at 0.043 MHz
• The TI-84 calculator, developed by Texas Instruments in 2004, is 350 times faster than the AGC and has 32 times more RAM and 14,500 times more ROM
• Even USB-C chargers are ~48 times faster than the AGC
What is Big Data?
• Data becomes large enough that it
cannot be processed using conventional
methods
• It isn’t just a description of raw volume
• Real issue is usability / accessibility
• Challenge is to develop cost‐effective
and reliable methods for extracting value
from large and complex sets of data in
real time
• Big Data analytics vs. Traditional
analytics
– Speed
– Scale
– Complexity
The 4 V’s
• Volume
– Size of the data
– Quantity of transactions, events, or amount of history
– Attributes, dimensions, …
– Terabytes to 10s of petabytes

• Velocity
– Data volume per time
– Speed at which data is created, accumulated, ingested, and processed
The 4 V’s
• Variety
– Assortment of data
– Traditional data, especially operational data, is “structured”
– Recently data has become increasingly “unstructured”
– Unstructured data does not have a predefined data model and/or does not fit well into a relational database
– Text, audio, video, image, geospatial, Internet data (click streams and log files)
• Amount of data is doubling every two years
• Most new data is unstructured (~95%)
• Unstructured data is vastly underutilized
• “We don’t have better algorithms, we just have more data” (Peter Norvig, Google Head of AI)
The 4 V’s
• Veracity
– how much uncertainty is in the data
– data could have many missing values, …
– reliable analysis a challenge
Big Data Examples
• Europe's Very Long Baseline Interferometry (VLBI) has 16
telescopes, each of which produces 1 Gigabit/second of
astronomical data over a 25‐day observation session
– storage and analysis a big problem

• AT&T handles billions of calls per day
– so much data that it cannot all be stored; analysis has to be done “on the fly”, on streaming data

• Knowledge Discovery is needed to make sense and use of data
Is Big Data analytics worth the effort?
• Competitive advantage in ultracompetitive global economy
• Nucleus Research (2011) concluded that analytics pays back
$10.66 for every dollar spent
• Media Math Co. achieved a 212% ROI in five months with an
annual revenue lift of $2.2M
• Drive top‐line and simultaneously minimize operational cost
• Big Data analytics isn’t constrained by a predefined set of questions
• “You don’t know what you don’t know”
• You don’t have to guess
– Fact based decision ‐ use data to find answers that are more specific and
significantly more useful
Barham, Husam. "Achieving competitive advantage through big data: A literature review." 2017 Portland international conference on
management of engineering and technology (PICMET). IEEE, 2017.
Types of Data Analytics
• Descriptive Analytics
– What has happened?
– Reporting
– a view of key metrics and
measures
– exploratory data analysis (EDA)
– data queries, reports, descriptive
statistics, data visualization
including data dashboards, basic
what‐if spreadsheet models
– E.g. 30% of our customers are self‐
employed
Types of Data Analytics
• Diagnostic Analytics
– Why did something happen?
– drill down and isolate the root‐
cause of a problem
– Did the latest marketing
campaign impact sales?
– Did the weather affect beer
sales?
Types of Data Analytics
• Predictive Analytics
– What is likely to happen?
– predict outcome
– being able to predict allows
one to make better decisions
– construct models based on
past data to predict future
– linear regression, time series
analysis, data‐mining
techniques, simulation, …
Types of Data Analytics
• Prescriptive Analytics
– What do I need to do?
– course of action to take: decision
– utilises an understanding of what
has happened, why it has
happened and a variety of “what‐
might‐happen” analysis to help
the user determine the best course
of action to take
– E.g. a predictive model estimates that the
probability of a customer defaulting on a
loan exceeds 0.6; prescriptive analytics
recommends not awarding the loan
– Optimization techniques, decision
analysis, utility theory, ...
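
The loan example can be written as a simple decision rule layered on top of a predictive model's output. The sketch below assumes the default probabilities have already been produced by some model; the applicant IDs and probability values are hypothetical, and only the 0.6 threshold comes from the slide.

```python
DEFAULT_THRESHOLD = 0.6  # threshold taken from the loan example

def recommend(p_default, threshold=DEFAULT_THRESHOLD):
    """Prescriptive step: turn a predicted default probability into an action."""
    return "reject" if p_default > threshold else "approve"

# Hypothetical predicted default probabilities from some predictive model
predicted = {"A101": 0.15, "A102": 0.72, "A103": 0.60}

decisions = {applicant: recommend(p) for applicant, p in predicted.items()}
print(decisions)
```

A fixed cut-off is the simplest possible policy. A fuller prescriptive analysis would more likely weigh expected profit against expected loss, which is where optimization and decision analysis come in.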
DM Application Areas
• Science
– astronomy, bioinformatics, drug discovery, …
• Business
– advertising, CRM (Customer Relationship
management), investments, manufacturing,
sports/entertainment, telecom, e‐Commerce,
targeted marketing, health care, …
• Web
– search engines, bots, …
• Government
– law enforcement, profiling tax cheaters, anti‐terror, …
DM for Customer Modeling
• Customer Tasks:
– attrition prediction
• Attrition rate of mobile phone customers is around 25‐30% a year!
• Predict who is likely to attrite next month, given customer
information for the past N months
• Estimate customer value and what is the cost‐effective offer to be
made to this customer
– targeted marketing:
• cross‐sell, customer acquisition, …
– credit‐risk
– fraud detection
• Industries
– banking, telecom, retail sales, …
Credit Risk Assessment Case
• Situation: Person applies for a loan
• Task:
– Should a bank approve the loan?
• Note:
– People who have the best credit don’t need the loans, and people
with worst credit are not likely to repay
– Bank’s best customers are in the middle

• Banks develop credit models using a variety of machine learning methods
• Mortgage and credit card proliferation are the results of
being able to successfully predict if a person is likely to
default on a loan
E‐commerce Case
• A person buys a book (product) at Amazon.com
• Task: Recommend other books (products) this person
is likely to buy
• Amazon does clustering based on books bought:
– customers who bought “Advances in Knowledge Discovery and
Data Mining”, also bought “Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations”

• Netflix movie recommendation system
– https://research.netflix.com/research-area/recommendations
– https://pyimagesearch.com/2023/07/03/netflix-movies-and-series-recommendation-systems/

• Recommendation programs are quite successful
Genomic Microarrays Case
• Given microarray (medical) data for a number of
patients, can we
– Accurately diagnose the disease?
– Predict outcome for given treatment?
– Recommend best treatment?

• Predict Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
– 38 training cases, 34 test, ~ 7,000 genes
– Use train data to build diagnostic model
– Results on test data: 33/34 correct, 1 error
Security & Fraud Detection Case
• Credit Card Fraud Detection
• Detection of Money laundering
– FAIS (US Treasury)
• Securities Fraud
– NASDAQ KDD system
• Phone fraud
– AT&T, Bell Atlantic, British
Telecom/MCI
• Bio‐terrorism detection at Salt Lake
Olympics 2002
Disaster Management
• Optimization analytics is used to direct the
correct supplies of recovery/food items
to areas where they are needed most
• Does a village need bottled water or boats, rice or wheat, shelter or toilets?

• Hurricane Frances was on its way to hit Florida’s Atlantic coast (2004)
• Wal-Mart wanted to predict which items would sell most in the path of the hurricane
– Obvious items: bottled water, flashlights
– Mined shopper history from when Hurricane Charley struck several weeks earlier
– In the past, sales of strawberry Pop-Tarts and beer increased seven times
Data Mining Functionalities
• Specify the kinds of patterns to be found in data
mining tasks
• Descriptive
• Class/Concept description: Characterization and
discrimination
– Data characterization ‐ summarization of the general
characteristics or features of a target class of data
– Data discrimination ‐ comparison of the general features of
the target class data objects against objects from one or
multiple contrasting classes
– Output ‐ pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables
Example
• AllElectronics is a successful international company with
branches around the world
• Each branch has its own set of databases
• The database has the following relational tables:
– customer – (cust_ID, name, address, age, occupation, annual_income,
credit_information, category,…)
– item – (item_ID, brand, category, type, price, place_made, supplier,
cost,...)
– branch – (branch_ID, name, address,...)
– purchases – (trans_ID, cust_ID, empl_ID, date, time, method_paid,
amount)
– items_sold – (trans_ID, item_ID, qty)
Example
• Data characterization
– Summarize the characteristics of customers who spend more than
$5000 a year at AllElectronics
– Result – a general profile of these customers, such as that they are 40
to 50 years old, employed, and have excellent credit ratings

• Data discrimination
– Compare two groups of customers—those who shop for computer
products regularly (e.g., more than twice a month) and those who
rarely shop for such products (e.g., less than three times a year)
– Result ‐ 80% of the customers who frequently purchase computer
products are 20‐40 years old and have a university education, whereas
60% of the customers who infrequently buy such products are either
seniors or youths, and have no university degree
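
Characterization and discrimination can be illustrated with a few lines of Python. This is a rough sketch on hypothetical customer rows (the IDs, ages, ratings and spend figures are made up). For brevity it contrasts customers above and below the $5000 spend level, rather than the frequent and infrequent computer buyers used in the discrimination example.

```python
from collections import Counter

# Hypothetical customer rows: (cust_ID, age, credit_rating, yearly_spend in $)
customers = [
    ("C1", 45, "excellent", 7200),
    ("C2", 23, "fair", 900),
    ("C3", 48, "excellent", 5600),
    ("C4", 41, "good", 6100),
    ("C5", 67, "fair", 1500),
    ("C6", 30, "good", 3000),
]

def characterize(rows):
    """Summarize a class of customers: size, age range, most common credit rating."""
    ages = [age for _, age, _, _ in rows]
    ratings = Counter(rating for _, _, rating, _ in rows)
    return {
        "n": len(rows),
        "min_age": min(ages),
        "max_age": max(ages),
        "top_rating": ratings.most_common(1)[0][0],
    }

# Data characterization: profile of the target class (spend > $5000 a year)
big_spenders = [r for r in customers if r[3] > 5000]
profile = characterize(big_spenders)

# Data discrimination: the same summary for the contrasting class
others = [r for r in customers if r[3] <= 5000]
contrast = characterize(others)

print(profile)
print(contrast)
```

Reading the two summaries side by side is the discrimination step: the target class here is aged 41 to 48 with mostly excellent credit, while the contrasting class spans a much wider age range.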
Data Mining Functionalities
• Mining Frequent Patterns, Associations,
and Correlations
– Patterns that occur frequently in data
– Frequent itemset – a set of items that often
appear together in a transactional data set
– What items are frequently purchased together
in Walmart?
– Association analysis
• buys(X, “computer”) ‐> buys(X, “software”)
• [support = 2%, confidence = 60%]
– Frequent sequential pattern
– Output – Association Rules
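
Support and confidence for a rule such as buys(X, "computer") -> buys(X, "software") can be computed directly from transaction data. Support is the fraction of all transactions that contain both items; confidence is the fraction of transactions containing the antecedent that also contain the consequent. The sketch below uses five hypothetical transactions, so the resulting numbers differ from the 2% and 60% quoted above.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

Here 2 of 5 transactions contain both items, and 2 of the 3 computer transactions also contain software.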
Data Mining Functionalities
• Classification and Prediction
– Finding models that describe and distinguish classes or concepts
for prediction
– Supervised: Deriving models from labeled data
– Typical methods:
• Decision trees, naïve Bayesian classification, support vector
machines, neural networks, logistic regression, …
– Typical applications:
• Credit card fraud detection, direct marketing, classifying stars,
diseases, web‐pages, …
– Output: Classification Rules (i.e., IF‐THEN rules), Decision Trees,
Neural Networks
• age(X, “youth”) AND income(X, “high”) ‐> class(X, “Buys Computer”)
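
An IF-THEN classification rule like the one above can be checked against labeled data. The sketch below is a minimal illustration with hypothetical records: it applies the rule "IF age = youth AND income = high THEN Buys Computer" and measures its coverage (how many records the rule applies to) and accuracy (how many of those it labels correctly).

```python
# Hypothetical labeled records: (age_group, income_level, buys_computer)
records = [
    ("youth", "high", True),
    ("youth", "high", True),
    ("youth", "low", False),
    ("senior", "high", False),
    ("youth", "high", False),
    ("middle_aged", "medium", True),
]

def rule_fires(age_group, income_level):
    """IF age = youth AND income = high THEN class = Buys Computer."""
    return age_group == "youth" and income_level == "high"

covered = [r for r in records if rule_fires(r[0], r[1])]
correct = [r for r in covered if r[2]]

coverage = len(covered) / len(records)  # share of records the rule applies to
accuracy = len(correct) / len(covered)  # share of covered records labeled correctly
print(coverage, accuracy)
```

A classifier such as a decision tree learns many rules of this kind from the labeled data instead of having them written by hand.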
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes
– Unsupervised learning
– Market segmentation: Identifying groups of consumers
– Maximize intra‐class similarity and minimize inter‐class similarity
[Figure: scatter plot of customers with axes Spends and Number of purchases, in three panels: trying to determine the appropriate customer for the product; apply clustering algorithm to the customer base; selling the product to the targeted customer]
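
One common clustering algorithm is k-means, shown here as a plain-Python sketch. The slides do not name a specific algorithm, so k-means is an assumption, as are the customer values and the choice of two initial centroids.

```python
from math import dist

# Hypothetical customers: (number of purchases, amount spent)
points = [(2, 100), (3, 120), (4, 90), (20, 900), (22, 950), (25, 1000)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Use two of the data points as initial centroids (a simple, deterministic choice)
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The two groups found correspond to low-spend and high-spend customers. In practice the features would usually be rescaled first, since the spend values here dominate the distance calculation.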
Data Mining Functionalities
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the
data
– an observation that deviates so much from other observations as to arouse
suspicion that it was generated by a different mechanism (Hawkins, 1980)
– Noise or exception?
– Methods: by‐product of clustering or regression analysis, …
– useful in fraud detection, rare events analysis
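
As a simple illustration of flagging an outlier, the sketch below uses a z-score rule: a value is flagged when it lies more than a chosen number of standard deviations from the mean. This is a basic statistical check, not the clustering or regression by-product methods listed above; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical card transaction amounts; one value is far from the rest
amounts = [20, 22, 19, 21, 23, 20, 250]

def zscore_outliers(values, threshold=2.0):
    """Flag values lying more than threshold sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise or a genuine exception (for instance, fraud) still has to be judged from the application context.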
Data Mining Functionalities
• Trend and evolution analysis
– Describes and models regularities or
trends for objects whose behavior
changes over time
– Trend, time series, and deviation:
regression analysis
• Stock market

– Sequential pattern mining, periodicity analysis
• Cross selling
• e.g., first buy digital camera, then buy large
SD memory cards
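
Trend analysis by regression can be sketched with an ordinary least squares line fitted to a time series. The monthly sales figures below are hypothetical, and a straight line is only one possible trend model.

```python
from statistics import mean

# Hypothetical monthly sales figures (month index 0, 1, 2, ...)
sales = [100, 112, 119, 131, 138]
months = list(range(len(sales)))

def fit_trend(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_trend(months, sales)
forecast_next = intercept + slope * len(sales)  # extrapolate one month ahead
print(slope, intercept, forecast_next)
```

The slope is the estimated change per month; large deviations of actual values from the fitted line are candidates for the deviation analysis mentioned above.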
[Figure: Machine Learning algorithm classification]
Are All Patterns Interesting?
• A data mining system has the potential to generate thousands
or even millions of patterns, or rules
• Only a small fraction of the patterns potentially generated
would actually be of interest
• What makes a pattern interesting?
– easily understood
– valid on new or test data with some degree of certainty
– potentially useful, and
– novel
• An interesting pattern represents knowledge
• Measures of pattern interestingness
– Support, confidence, accuracy, coverage, unexpectedness, actionability
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]

DM: a cross-disciplinary field that focuses on discovering properties from vast amounts of data
