
Data Mining and Analysis

Data Mining Methodologies &
A Brief Introduction to Data Science

Dr David Daqing Chen


Outline
• Data mining methodologies:
– CRISP-DM, a nonlinear process
– Key phases of CRISP-DM
• What is data science and its driving forces
• A comprehensive BI architecture

20/04/2023 DMA Lecture 02 2


Data Mining Methodology
• Why have a methodology:
– To standardise the data mining/knowledge discovery process in an
industry-neutral, application-neutral, and tool-neutral way
– To mould the activities in a typical data mining project into a
set of logical and manageable steps and tasks with expected
outcomes
– To guide the data mining process and avoid mistakes
– To promote good-quality data mining practice
• CRISP-DM (CRoss-Industry Standard Process for Data
Mining)
– A six-phase methodology, developed by a group of European
companies (http://www.statoo.com/CRISP-DM.pdf)



Data Mining Methodology: CRISP-DM
A non-linear process: loops exist
[Figure: the CRISP-DM cycle arranged around the Data: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
CRISP-DM: Phases and Tasks
(Use this as a Check-list for Your Project)
• Business Understanding
– Understanding the Business Goals & Objectives
– Assess Situation
– Determine Data Mining Goals
– Produce Project Plan
• Data Understanding
– Consider Data Requirements
– Collect Initial Data
– Describe Data
– Conduct Initial Data Exploration
– Verify Data Quality
• Data Preparation
– Select Required Data
– Clean Data
– Transform Data
– Integrate Data
– Format Data
• Modelling
– Select Modelling Technique
– Generate Test Design
– Build Model
– Assess Model
• Evaluation
– Evaluate Results
– Determine Next Steps
• Deployment
– Plan Deployment
– Plan Monitoring & Maintenance
– Produce Final Report
– Review Project
CRISP-DM Phase 1
Business Understanding
• Objectives
– To thoroughly understand, from a business perspective, what the client
really wants to accomplish, their business goals, resources, and constraints
– To translate these goals and restrictions into a data mining problem
definition
– To produce a preliminary plan for achieving the data mining goals and the
business goals
• Outputs
– Statement of primary business objectives
– Statement of data mining objectives
– Statement of success criteria
• Think from a business perspective first, not from an analytical
perspective, in order to have a set of meaningful questions to which
analytics can be applied to find possible answers/causes
CRISP-DM Phase 1
Business Understanding
• Case 1:
– Business goal: To increase first-year student progression rate by 10%
– Data mining goal:
• To identify factors affecting student progression, e.g., entry qualifications, family commitment,
p/t jobs, etc. Possible models include: cluster analysis, association analysis, correlation, etc.
• To predict and identify at-risk students. Possible models include various predictive and
descriptive models. R-F-M model?
• Case 2:
– Business goal: Acquire new customers in order to increase sales by 20% for this year
– Data mining goal:
• Identify which existing customers are most/least profitable.
• Identify the shopping patterns of the most profitable customers: what have they purchased,
and in which sequence have they purchased products?
• Identify how customer demographics are linked to their purchasing.
• Case 3:
– Business goal: How many new donors to recruit for blood donation and by which time
to recruit them?
– Data mining goal: ?
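
As an illustration of how a data mining goal such as "identify which existing customers are most/least profitable" might be made concrete, the following is a minimal sketch of the R-F-M (recency, frequency, monetary) idea mentioned in Case 1. The transactions, customer identifiers, and the function name `rfm_table` are hypothetical, and total spend is used here only as a rough proxy for profitability.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount)
TRANSACTIONS = [
    ("C1", date(2023, 1, 5), 120.0),
    ("C1", date(2023, 3, 20), 80.0),
    ("C2", date(2022, 11, 2), 40.0),
    ("C3", date(2023, 4, 1), 300.0),
    ("C3", date(2023, 4, 10), 150.0),
    ("C3", date(2023, 4, 15), 50.0),
]


def rfm_table(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, when, amount in transactions:
        # Keep the most recent purchase date seen so far for this customer.
        last_purchase[customer] = max(last_purchase.get(customer, when), when)
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((today - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


table = rfm_table(TRANSACTIONS, today=date(2023, 4, 20))
# Rank customers by total spend (the "M" in R-F-M), highest first.
ranking = sorted(table, key=lambda c: table[c][2], reverse=True)
for customer in ranking:
    print(customer, table[customer])
```

In a real project the ranking criterion would follow from the business goal agreed in Phase 1 rather than from spend alone.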
CRISP-DM Phase 2
Data Understanding
• Objectives
– To collect the initial data
– To explore the data, get familiar with it, and discover initial insights into
the data by identifying "gross" or "surface" properties
– To evaluate the quality of the data; this may require looping back to the previous phase
• Relevance
• Completeness (coverage)
• Missing values, outliers, extreme values, incomparable value ranges of variables,
inconsistencies, imbalanced classes, etc.
– To detect subsets of the data that are of interest and that may directly address the data
mining goals
• Outputs
– Data description report
– Data exploration report
– Data quality report
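
A minimal sketch of the kind of checks that could feed a data quality report: counting missing values, flagging outliers, and inspecting class balance. The student records, attribute names, and helper names are hypothetical, and the 1.5 × IQR rule is only one common rule of thumb for outliers.

```python
import statistics
from collections import Counter

# Hypothetical student records; None marks a missing value.
RECORDS = [
    {"entry_points": 112, "hours_worked": 10, "progressed": "yes"},
    {"entry_points": 96, "hours_worked": None, "progressed": "yes"},
    {"entry_points": 104, "hours_worked": 8, "progressed": "yes"},
    {"entry_points": None, "hours_worked": 12, "progressed": "no"},
    {"entry_points": 120, "hours_worked": 60, "progressed": "yes"},
    {"entry_points": 100, "hours_worked": 9, "progressed": "yes"},
]


def missing_counts(records):
    """Count missing (None) entries per attribute."""
    counts = Counter()
    for row in records:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return dict(counts)


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def class_balance(records, target):
    """Count how many records fall into each class of the target attribute."""
    return dict(Counter(row[target] for row in records))


hours = [r["hours_worked"] for r in RECORDS if r["hours_worked"] is not None]
print("missing:", missing_counts(RECORDS))
print("outliers in hours_worked:", iqr_outliers(hours))
print("class balance:", class_balance(RECORDS, "progressed"))
```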
CRISP-DM Phase 2
Data Understanding: “Surface Properties”



CRISP-DM Phase 3
Data Preparation
• The most time-consuming and labour-intensive stage: it usually takes
over 80% of the time in a data mining project
– The so-called "90-9-1" phenomenon in analytics
• Objectives
– To get your data ready for analysis
– To prepare from the initial raw data the final data set that is to be used by all
subsequent phases
– To select the records, tables and attributes (variables) that are to be analysed
and that are appropriate/relevant to the analysis
– To clean the raw data so that it is ready for the modelling tools
– To perform transformation on certain attributes
• Outputs
– A target data set(s)
– Data pre-processing report: Cleaning, reformatting, transformation,
integration, etc.
CRISP-DM Phase 4
Modelling
• Objectives
– To select and apply appropriate modelling techniques; quite often, a set of models
is created using various modelling techniques, e.g., decision trees, linear/logistic
regression, k-nearest neighbours, k-means clustering
– To adjust and refine model settings (parameters) to optimal values so as to
optimise the results
– To test the model’s quality and validity
– If necessary, to loop back to the data preparation phase to bring the format of the
data into line with the specific requirements of a particular data mining
technique
• Note:
– Different techniques may be used for the same data mining problem
– Many modelling techniques have certain assumptions about the data
• Outputs:
– Report of the actual modelling technique that is used
– Report of model description
– Report of resulting model assessment
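
To make "apply a modelling technique and test the model's quality" concrete, here is a minimal sketch of a k-nearest neighbours classifier evaluated on held-out data. The data points, labels, and function names are hypothetical; a real project would normally use an established library and a proper test design.

```python
import math
from collections import Counter

# Hypothetical labelled points: ((feature_1, feature_2), class_label)
TRAIN = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((3.0, 3.2), "high"), ((3.1, 2.9), "high"), ((2.8, 3.0), "high"),
]
# Held-out points that were not used to build the model.
TEST = [((1.1, 0.9), "low"), ((3.0, 3.0), "high")]


def knn_predict(train, point, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Share of held-out points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


print("accuracy on held-out data:", accuracy(TRAIN, TEST, k=3))
```

The parameter `k` is an example of a model setting that would be adjusted and refined during this phase.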
CRISP-DM Phase 5
Model Evaluation
• Objectives
– To understand the data mining result
– To evaluate the one or more models generated from the modelling
phase and any findings with respect to business success criteria –
from a business perspective
– To determine whether the model in fact properly achieves the
objectives set for it in the first phase
– To establish whether any important business issues have not been
sufficiently considered
– To come to a decision about the use of the data mining results or to
initiate further iterations, or to set up new data mining projects
• Output
– Report of model assessment and approved model
– Report of further actions and the rationale
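
One way to picture evaluating a model against a business success criterion is sketched below. It assumes, purely for illustration, that the business requires at least 70% of at-risk students to be identified (recall); the labels, predictions, threshold, and function names are all hypothetical.

```python
def evaluate(actual, predicted, positive):
    """Compute recall and precision for the class of business interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"recall": recall, "precision": precision}


def meets_criterion(metrics, min_recall):
    """Check the model against a (hypothetical) business success criterion."""
    return metrics["recall"] >= min_recall


ACTUAL = ["at_risk", "ok", "at_risk", "ok", "at_risk", "ok", "ok", "at_risk"]
PREDICTED = ["at_risk", "ok", "ok", "ok", "at_risk", "at_risk", "ok", "at_risk"]

metrics = evaluate(ACTUAL, PREDICTED, positive="at_risk")
approved = meets_criterion(metrics, min_recall=0.7)
print(metrics, "approved" if approved else "iterate further")
```

If the criterion is not met, the outcome of this phase would be a decision to iterate or to start a new project, as listed above.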
CRISP-DM Phase 6
Deployment
• Objectives
– To organise and present the discovered knowledge in a
way that the customer can use
– To plan deployment
– To plan monitoring and maintenance
– To generate a final report
• Output
– Deployment plan covering the deployment strategy, the
necessary steps, monitoring, and how to perform them
– Final project report
In Summary…
• Data mining is
– A non-linear process
– A trial-and-error process
– A research process starting with identifying and
understanding certain business problems
• Data mining results
– Include Models and Findings
– Should be assessed from a business perspective with
respect to business success criteria – what sense do the
models and findings make, how can they be applied to
address the business concerns/problems?
Data Science Life-cycle
[Figure: Frame the Problem → Understand the Data → Extract Features → Model & Analyse → Present Results (VIZ) / Deploy the Codes (Model)]
A data-driven process for problem solving

The Overview of the Process of
Data Science
• Frame the problem: Understand the business
scenario and generate a well-defined analytics
problem(s)
– Data science is about addressing business problems,
so ask the right questions from a business perspective.
– Identify stakeholders, determine the purpose of the
analytics work, and generate statements of work
(SOWs).



Understand the Data
• Identify the problems with the data as quickly as possible
– Should the data be collected from scratch?
– Is the data relevant to the analytics work required?
– Is the data comprehensive?
– Is the data representative?
– Is the data complete, or are there blank entries (possibly missing values)?
– How big is the data, how many attributes (variables), how many rows?
– What does each attribute mean/represent in the data?
–…

• Conduct exploratory data analysis (EDA) to identify
some features in the data by using simple statistics,
graphs, charts, and plots. Practically this is very useful.
Examples of EDA
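
As one possible illustration of EDA with simple statistics and a simple chart, the sketch below summarises a single numeric attribute and prints a text histogram. The `AGES` values and the helper names are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical values of a single numeric attribute.
AGES = [18, 19, 19, 20, 21, 21, 21, 22, 25, 34]


def summarise(values):
    """Return simple descriptive statistics for one attribute."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def text_histogram(values, bin_width):
    """Return one line per non-empty bin, with one '#' per value in the bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}"
        for start in sorted(bins)
    ]


summary = summarise(AGES)
lines = text_histogram(AGES, bin_width=5)
print(summary)
print("\n".join(lines))
```

Even this small summary shows a "surface property" worth noting: the maximum lies well above the median, which may indicate an extreme value.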



Extract Features
• To make analysis possible
• The most time-consuming process: generally known as data pre-processing
or data wrangling, i.e., converting data from its raw format
(and/or values) into a proper format (and/or values) for use
– Select the attributes relevant to intended analysis.
– Aggregate original attributes to create new attributes.
– Normalise data.
– Change data type if needed.
– If there are missing values, replace blank entries with estimated values or
disregard any record containing blank entries.
– Reduce the data volume.
– Join multiple data sets.
– Reduce the number of attributes.
– Identify the most relevant, influential attributes to an analysis task.
–…
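
Two of the steps listed above, replacing blank entries with estimated values and normalising data, can be sketched as follows. Mean imputation and min-max scaling are used here only as simple, common choices; the function names and sample values are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw = [10.0, None, 30.0, 20.0]
filled = impute_mean(raw)
scaled = min_max_normalise(filled)
print(filled)
print(scaled)
```

Normalising after imputation matters here: scaling first would fail on the missing entry.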
Model and Analyse
• Relatively simple and straightforward.
• A range of models is available
– Have a deep understanding of each model's strengths and
limitations
– Understand what sense the created model makes.
– Use a simpler model as long as it works for addressing the
business problems
– Be flexible with multiple models
– Ensemble modelling (A set of models used collectively)
– Don’t treat a model as a black-box: should understand how
a model works
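
Ensemble modelling, a set of models used collectively, can be pictured with a minimal majority-vote sketch. The three rule-based "models", their thresholds, and the attribute names are hypothetical stand-ins for trained models.

```python
from collections import Counter


# Three deliberately simple, transparent "models" (hypothetical rules).
def attendance_rule(student):
    return "at_risk" if student["attendance"] < 0.6 else "ok"


def marks_rule(student):
    return "at_risk" if student["average_mark"] < 40 else "ok"


def workload_rule(student):
    return "at_risk" if student["hours_worked"] > 20 else "ok"


def ensemble_predict(models, record):
    """Return the label predicted by the majority of the models."""
    votes = Counter(model(record) for model in models)
    return votes.most_common(1)[0][0]


MODELS = [attendance_rule, marks_rule, workload_rule]
student = {"attendance": 0.5, "average_mark": 55, "hours_worked": 25}
prediction = ensemble_predict(MODELS, student)
print(prediction)
```

Because each member is a readable rule, the ensemble is not a black box: it is easy to see which models voted for the final label.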
Present Results &
Deploy Codes (Models)
• Deliver the required information and/or the end
product to the people who need them to do their
work effectively
– Visualisation is a simple but effective approach for
information delivery – storytelling.
– Visualisation makes analytics easier for ordinary staff
who may not have the relevant knowledge of
advanced analytics.



So, What is Data Science?
• Data science is an interdisciplinary area about the frameworks,
processes, systems, and scientific approaches to extract
knowledge or insights from data in various heterogeneous
forms, either structured or unstructured.
• Essentially data science is about a data-driven process for
analytics and is considered a "fourth paradigm" of science
(empirical, theoretical, computational and now data-driven).
• The problems that data science is intended to address are
much more complex, difficult, and diverse, or even impossible to
solve with traditional approaches alone, in terms of data formats and
volumes, diversity, and dynamic, real-time information demand,
etc.
The Main Driving Forces of Data Science
• Driving forces
– Big data: New and advanced data infrastructures and
data processing approaches, such as Hadoop,
MapReduce, Spark, in-memory databases, NoSQL,
etc.
– Machine learning: Advanced algorithms for
modelling, such as deep learning networks
– Cloud-based computing facilities for complex large-
scale “calculations”



A Comprehensive BI Architecture and its Main
Components – The Skill Set?
[Figure: BI architecture diagram with the following components]
• Data Warehouse Environment (technical staff)
– Data sources: internal/external, structured and un/semi-structured
– ETL: Extraction, Transformation, Load (integration)
– Enterprise data warehouse
– Data marts
• Business Analytics Environment (business user)
– Analytical reports and summaries
– Interactive queries
– OLAP cube
– Data mining
• User interface
– Browser, portal, dashboard, scorecard
Summary
• The concept of methodology, why have a data mining methodology
• CRISP-DM
– The key stages, tasks and input/output within each stage, between
different stages
– Essential:
• Translating a business problem into a data mining problem;
• Evaluating the value of the models created from a business perspective: How can
they be used to address business concerns
• Data mining is a research process: Always ask yourself what has
been found, and what impact your findings will have on the business
Always start your analysis with a set of clearly-defined
questions/problems
No meaningful question, No analysis
