MBA933_-_Lectures_1-2
Data Mining
Tools & Techniques
Lectures 1‐2
• Evaluation
– Assignments and/or Project: 30%
– Surprise Quizzes: 30% (there will be no make-up quizzes)
– End Term Exam: 30%
– Class Participation: 10%
Course Materials
• Books
– Data Mining: Concepts and Techniques, 3rd ed.
• Jiawei Han, Micheline Kamber & Jian Pei
– Introduction to Data Mining
• P. N. Tan, M. Steinbach & V. Kumar
– An Introduction to Statistical Learning: With Applications in
Python
• James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J.
– Hands‐On Machine Learning with Scikit‐Learn, Keras and
TensorFlow
• Aurélien Géron
– The Art of R Programming
• Norman Matloff
• Information created due to increase in knowledge
• Data generated due to social interaction
• Data generated by objects connected to network
• Data with spatial dimension
What is Data Mining?
An iterative and interactive process of discovering novel, valid, useful, comprehensive and understandable patterns and models in MASSIVE data sources
• Iterative: many steps, passes
• Interactive: human intervention
• Novel: non-trivial
• Valid: generalized to future
• Useful: action is possible
• Understandable: leading to insight
What is Data Mining?
• Data mining: a misnomer?
• Knowledge discovery in databases (KDD), knowledge
extraction, pattern analysis, data archeology, information
harvesting, business intelligence, etc.
• Is DM = KD?
– Knowledge Discovery ‐ Overall process of extracting knowledge from
data
– Data Mining ‐ A step in KD process, application of a specific algorithm
based on the overall goal of the KD process
[Figure: the KD process as a flow diagram. Labels: Data Warehouse, Raw Data, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]
Steps of KD Process
1. Learning the application domain:
– relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing: (may take 60% of
effort!)
4. Data reduction and transformation:
– find useful features, dimensionality/variable
reduction, invariant representation
5. Choosing functions of data mining
– summarization, classification, regression,
association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
9. Use of discovered knowledge
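The steps above can be sketched end to end on a toy data set. This is only an illustration under stated assumptions: the records, the field names (`cust`, `age`, `spend`), the age bands, and the `MIN_ROWS` threshold are all hypothetical, and a simple group average stands in for a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records (step 2 would select these from a larger store).
raw = [
    {"cust": "a", "age": 25, "spend": 120.0},
    {"cust": "b", "age": None, "spend": 80.0},   # missing age
    {"cust": "c", "age": 41, "spend": 560.0},
    {"cust": "c", "age": 41, "spend": 560.0},    # duplicate row
    {"cust": "d", "age": 38, "spend": 610.0},
]

# Step 3: cleaning - drop rows with missing values and exact duplicates.
seen, clean = set(), []
for r in raw:
    key = tuple(r.items())
    if None not in r.values() and key not in seen:
        seen.add(key)
        clean.append(dict(r))  # copy so the raw data stays untouched

# Step 4: transformation - derive a coarse age-band feature.
for r in clean:
    r["age_band"] = "under_30" if r["age"] < 30 else "30_plus"

# Steps 5-7: mining - here, a simple summarization of spend by age band.
groups = {}
for r in clean:
    groups.setdefault(r["age_band"], []).append(r["spend"])
summary = {band: mean(v) for band, v in groups.items()}

# Step 8: evaluation - keep only patterns supported by enough rows.
MIN_ROWS = 2  # assumed threshold, not from the lecture
patterns = {b: m for b, m in summary.items() if len(groups[b]) >= MIN_ROWS}
print(patterns)
```

In practice each step is revisited several times, which is why the process is described as iterative.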
Evolution of DM
1980s
• ERP
1990s
• CRM
2000s
• eCommerce
2010s
• Data Mining / Big Data Analytics
Why has the new age emerged?
• Computing Storm
– Cheaper technology
– Mobile computing
– Social networking
– Cloud computing
• Data Storm
– Volume
– Velocity
– Variety
– Veracity
Why has the new age emerged?
• Velocity
– Data volume per unit of time
– Speed at which data is created, accumulated, ingested, and processed
The 4 V’s
• Variety
– Assortment of data
– Traditional data, especially operational data, is “structured”
– Recently, data has become increasingly “unstructured”
– Unstructured data does not have a predefined data model and/or does not fit well into a relational database
– Examples: text, audio, video, image, geospatial, and Internet data (click streams and log files)
• Amount of data is doubling every
two years
• Most new data is unstructured
(~95%)
• Unstructured data is vastly
underutilized
• “We don’t have better algorithms, we just have more data” (Peter Norvig, Director of Research at Google)
The 4 V’s
• Veracity
• Hurricane Frances was on its way to hit Florida’s Atlantic coast (2004)
– Walmart wanted to predict which items would sell most in the path of the hurricane
– Obvious items: bottled water, flashlights
– It mined shopper history from when Hurricane Charley struck several weeks earlier
– Finding: ahead of that hurricane, sales of strawberry Pop-Tarts had risen to about seven times their normal rate, and beer was the top-selling item
Data Mining Functionalities
• Specify the kinds of patterns to be found in data
mining tasks
• Descriptive
• Class/Concept description: Characterization and
discrimination
– Data characterization ‐ summarization of the general
characteristics or features of a target class of data
– Data discrimination ‐ comparison of the general features of
the target class data objects against objects from one or
multiple contrasting classes
– Output ‐ pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables
Example
• AllElectronics is a successful international company with
branches around the world
• Each branch has its own set of databases
• Each database has the following relational tables:
– customer (cust_ID, name, address, age, occupation, annual_income, credit_information, category, ...)
– item (item_ID, brand, category, type, price, place_made, supplier, cost, ...)
– branch (branch_ID, name, address, ...)
– purchases (trans_ID, cust_ID, empl_ID, date, time, method_paid, amount)
– items_sold (trans_ID, item_ID, qty)
Example
• Data characterization
– Summarize the characteristics of customers who spend more than
$5000 a year at AllElectronics
– Result – a general profile of these customers, such as that they are 40
to 50 years old, employed, and have excellent credit ratings
• Data discrimination
– Compare two groups of customers: those who shop for computer products regularly (e.g., more than twice a month) and those who rarely shop for such products (e.g., fewer than three times a year)
– Result ‐ 80% of the customers who frequently purchase computer
products are 20‐40 years old and have a university education, whereas
60% of the customers who infrequently buy such products are either
seniors or youths, and have no university degree
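A minimal sketch of the two descriptive tasks above, assuming a made-up list of customers given as (age, yearly spend, frequent computer buyer) tuples. The values are invented for illustration and are not AllElectronics data.

```python
from statistics import mean

# Hypothetical customer rows: (age, yearly_spend, buys_computers_often)
customers = [
    (45, 6200, True),
    (48, 5400, False),
    (24, 900, True),
    (66, 700, False),
    (36, 5100, True),
]

# Characterization: summarize the target class (spend > 5000 per year).
target = [c for c in customers if c[1] > 5000]
profile = {"count": len(target), "mean_age": mean(c[0] for c in target)}

# Discrimination: compare the same feature across two contrasting classes.
frequent_ages = [c[0] for c in customers if c[2]]
rare_ages = [c[0] for c in customers if not c[2]]
comparison = {
    "frequent_mean_age": mean(frequent_ages),
    "rare_mean_age": mean(rare_ages),
}

print(profile)
print(comparison)
```

Characterization looks at one class in isolation, while discrimination always needs at least one contrasting class to compare against.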
Data Mining Functionalities
• Mining Frequent Patterns, Associations,
and Correlations
– Patterns that occur frequently in data
– Frequent itemset – a set of items that often
appear together in a transactional data set
– What items are frequently purchased together
in Walmart?
– Association analysis
• buys(X, “computer”) -> buys(X, “software”) [support = 2%, confidence = 60%]
– Frequent sequential pattern
– Output – Association Rules
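Support and confidence for a rule such as computer -> software can be computed directly from a transaction list. The sketch below assumes five made-up baskets. The function names `support` and `confidence` are chosen here for illustration and do not come from any particular library.

```python
# Hypothetical transactions; each is the set of items in one basket.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Support measures how often the items occur together in all transactions. Confidence measures how often the consequent appears among the transactions that contain the antecedent.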
Data Mining Functionalities
• Classification and Prediction
– Finding models that describe and distinguish classes or concepts
for prediction
– Supervised: Deriving models from labeled data
– Typical methods:
• Decision trees, naïve Bayesian classification, support vector
machines, neural networks, logistic regression, …
– Typical applications:
• Credit card fraud detection, direct marketing, classifying stars,
diseases, web‐pages, …
– Output: Classification Rules (i.e., IF‐THEN rules), Decision Trees,
Neural Networks
• age(X, “youth”) AND income(X, “high”) ‐> class(X, “Buys Computer”)
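The IF-THEN rule above can be read as a tiny classifier. This sketch hard-codes that single rule. The function name `classify` and the sample customers are assumptions made for illustration, and in practice such rules would be learned from labeled data rather than written by hand.

```python
def classify(age_group, income):
    """Apply the single IF-THEN rule; return None when it does not fire."""
    if age_group == "youth" and income == "high":
        return "Buys Computer"
    return None

# Hypothetical customers as (age_group, income) pairs.
customers = [("youth", "high"), ("senior", "high"), ("youth", "low")]
labels = [classify(age, income) for age, income in customers]
print(labels)
```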
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes
– Unsupervised learning
– Market segmentation: Identifying groups of consumers
– Maximize intra‐class similarity and minimize inter‐class similarity
[Figure: customer scatter plot with axes labeled "Spends" and "Number of purchases". Panels: (1) trying to determine the appropriate customer for the product; (2) apply clustering algorithm to the customer base; (3) selling the product to the targeted customer]
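The segmentation idea above can be sketched with a plain k-means loop. The slides do not name a clustering algorithm, so k-means is an assumption here, as are the six (number of purchases, spend) points and the choice of two starting centroids.

```python
from math import dist
from statistics import mean

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(mean(axis) for axis in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (number_of_purchases, spend)
points = [(2, 100), (3, 120), (2, 90), (20, 900), (22, 950), (21, 880)]

# Start from two of the points; real implementations choose starts more carefully.
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Spend is on a much larger scale than purchase count, so it dominates the distance here. Features are usually rescaled before clustering.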
Data Mining Functionalities
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the
data
– Definition (Hawkins, 1980): an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism
– Noise or exception?
– Methods: by‐product of clustering or regression analysis, …
– Useful in fraud detection and rare-event analysis
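One simple way to flag candidate outliers is a z-score test. This is a sketch of one common approach, not a method named in the slides. The transaction amounts and the threshold of 2.0 are assumptions, and a flagged value still needs a human to decide whether it is noise or a genuine exception.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious value.
amounts = [52, 48, 50, 51, 49, 50, 400]
suspects = z_score_outliers(amounts)
print(suspects)
```

Because the extreme value itself inflates the mean and standard deviation, z-scores can understate how unusual it is. This is one reason more robust methods are often preferred on small samples.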
Data Mining Functionalities
• Trend and evolution analysis
– Describes and models regularities or
trends for objects whose behavior
changes over time
– Trend, time series, and deviation:
regression analysis
• Stock market
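A trend line can be fitted with ordinary least squares, and the residuals then show the deviation from that trend. The sketch below assumes five made-up daily closing prices. The helper name `fit_trend` is hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept). Needs at least two values.
    """
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

prices = [100.0, 102.0, 101.0, 105.0, 107.0]  # hypothetical closing prices
slope, intercept = fit_trend(prices)

# Deviation of each observation from the fitted trend line.
residuals = [p - (intercept + slope * t) for t, p in enumerate(prices)]
print(slope, intercept)
print(residuals)
```

A positive slope indicates an upward trend over the period. A large residual marks a day that departs from it.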