CRISP-DM

By:
ELGOUNIDI Hajar
SAFSAFI Aya
EL MALKI Ikram
AQAABICH Reda

Supervised by:
Mr Hicham RAFFAK
Overview
Introduction to CRISP-DM
Phases of the CRISP-DM
Example
Advantages
Limitations
Conclusion
Introduction
Data mining has become an essential tool for
businesses to analyze and extract valuable
insights from their data. However, data mining
projects can be complex and challenging,
requiring a systematic and structured approach
to ensure success.
The CRISP-DM framework provides a clear
and flexible process model that helps teams
execute data mining projects successfully.
Let's look at the CRISP-DM framework, go
through an overview of the process model and
its benefits, and see how it can be applied to
real-world scenarios.
What is Data Mining?
Data mining is the process of discovering patterns, trends,
correlations, or valuable information from large sets of data. It
involves analyzing and extracting useful knowledge or insights from
datasets, often using various techniques from statistics, machine
learning, and database systems. The goal of data mining is to
uncover hidden patterns and relationships within data that can be
used to make informed decisions.

Data Mining: Focuses on the overarching goal of knowledge discovery from data.
CRISP-DM: Provides a standardized framework to guide practitioners through the stages of a data mining
project.
What is the CRISP-DM Framework?
The CRISP-DM framework provides a structured and
systematic approach to data mining. It is a widely
used process model that offers several benefits for
data mining projects: it provides a clear and
standardized process for project execution, helps
manage risks and uncertainties, improves
communication and collaboration among team
members, enhances transparency and builds trust,
and increases efficiency while ensuring quality. It
consists of six phases, each with its own set of
objectives, tasks, and deliverables.
Business understanding – What does the
business need?
Data understanding – What data do we
have / need? Is it clean?
Data preparation – How do we organize
the data for modeling?
Modeling – What modeling techniques
should we apply?
Evaluation – Which model best meets the
business objectives?
Deployment – How do stakeholders access
the results?
03 CRISP-DM’S STEPS
Business understanding
Not knowing how to frame the business problem is a problem in itself.

The journey of a successful data mining project begins
with a solid foundation in business understanding. This
phase is crucial as it sets the stage for the project by
defining the problem and understanding the project
objectives in the context of the business. It involves
close interaction with stakeholders to align the data
mining goals with business objectives, ensuring that the
project delivers real value. The clear articulation of the
business problem not only guides the subsequent steps
but also helps in evaluating the success of the project.
1. Determine the Business Objectives: The objective is to understand the business perspective and
what needs to be achieved or addressed by the data mining project. This includes understanding
the business problem, identifying potential areas of improvement, and defining the success criteria
for the project.
2. Assess the Situation: This step involves more detailed fact-finding about resource availability,
requirements, assumptions, constraints, contingency plans, and other factors, including costs and
benefits, that shape the data analysis goal and project plan.
3. Determine Data Mining Goals: The purpose is to identify the specific goals and objectives of the
data mining project. This includes identifying the key questions that need to be answered and the
metrics that will be used to measure success.
4. Produce Project Plan: Based on the information gathered in the previous steps, produce a
project plan that outlines the scope of the project, the business objectives, the success criteria, and the
key business questions to be answered, along with the selection of tools and techniques.
Business understanding

Example: Online Retail Platform

Company Background:
You are working with an online retail platform that sells a variety of products, ranging from electronics to
clothing.

Business Problem:
The company has noticed a plateau in its sales growth and wants to understand customer purchasing
patterns more deeply.
The goal is to identify trends, preferences, and potential areas for improvement in order to create
targeted sales and marketing strategies.

Business Objective:
Objective: Understand customer purchasing patterns to improve sales strategies.
Data understanding
In this phase, you need to identify what data you
already have, where to get the data, what tools to
use to get it, and how much data is available.
Understanding your data from this initial phase
makes the rest of the data science project much
easier to carry out.

Always remember: before diving into a data science
project, first know and understand WHAT data to
acquire, WHERE to acquire it, HOW to acquire it,
and HOW MUCH data to acquire.
Data understanding

What is Data?
Data is a collection of objects defined by attributes. An attribute is a property or
characteristic of an object.
Examples: eye color of a person, temperature, etc.
• Other names: variable, field, characteristic, feature, predictor, etc.
A collection of attributes describes an object.
• Other names: record, point, case, sample, entity, instance, etc.
Data understanding
Types of data

1. Quantitative Data:
Quantitative data refers to information that can be expressed in numerical terms and can be
measured or counted. It deals with quantities and is often associated with variables that have a
numeric value.
Examples: Age, height, weight, temperature, income, and the number of products sold.
2. Qualitative Data:
Qualitative data is descriptive and non-numeric. It deals with qualities or characteristics that
cannot be measured in numerical terms. This type of data is often categorical and represents
attributes or labels.
Examples: Gender, color, and the type of material used in a product .
Data understanding
Data Representation:

Numeric Data: Represents quantities and can be measured in numerical terms.


Textual Data: Consists of characters and is often used to convey information through written
language.
Categorical Data: Represents categories or labels, such as colors, genders, or product types.

Data Sources :

First-party data is information that a company or organization collects directly from its own
interactions with its customers or users. This data is typically gathered through direct interactions,
such as customer purchases, website visits, feedback forms, and other first-hand experiences.
Data understanding
Data Sources :

Second-party data is essentially someone else's first-party data. It is obtained through a mutually
beneficial relationship or partnership between two organizations. In this arrangement, one
organization shares its first-party data directly with another, often for strategic or collaborative
purposes.

Third-party data is information that is acquired from external sources that are not directly
connected to the collecting organization or its users. This data is typically purchased from data
providers or vendors who aggregate and sell data from various sources.
1. Collect Initial Data: Collect the available data from the various sources that are relevant to the project
and load it into the analysis tool.
2. Describe Data: Examine the data and document its structure, including format, type, and potential
quality issues. Describe the nature of the data and any patterns or anomalies that may be present.
3. Explore Data: Assess the data to identify patterns, trends, and relationships that may exist
between different data points. This is typically done through data visualization tools or exploratory data
analysis techniques.
4. Verify Data Quality: Inspect the quality of the data to identify potential issues such as missing
data, outliers, or data entry errors. This step is important to ensure that the data is suitable for use
in the data mining process.
Data understanding
Example: Online Retail Platform

Importing Libraries:
*pandas is a powerful data manipulation library.
*matplotlib.pyplot is a library for creating static, animated, and interactive visualizations in Python.
*seaborn is a data visualization library based on Matplotlib. It provides a high-level interface for
drawing attractive and informative statistical graphics.
*StringIO acts as an in-memory file that lets you use functions designed to read from files with a
string variable.
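A minimal sketch of these imports (the slides show the code only as a screenshot, so this is a reconstruction):

import pandas as pd                # data manipulation
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical graphics built on Matplotlib
from io import StringIO            # treat an in-memory string as a file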
Data understanding
Example: Online Retail Platform

Creating a Sample DataFrame:


Data understanding
Example: Online Retail Platform

Reading Data into a DataFrame:

Displaying the First Few Rows of the DataFrame:


Data preparation
Data preparation is the process of cleaning and
transforming raw data prior to processing and analysis. It
is an important step and often involves reformatting data,
making corrections to data and the combining of data sets
to enrich data.
Feeding uncleansed or dirty data to your model will
cause it to produce wrong results or false assumptions
about the data. A New York Times article reported that data
scientists spend from 50% to 80% of their time in
collecting and preparing data before it can be used for
exploration and especially before feeding it to the model.
1. Select Data: Select the data that will be used for modeling, based on the business understanding and
data understanding developed in the previous phases.
2. Clean Data: Transform the data by addressing any missing or invalid values, removing duplicates,
and addressing any other issues that may impact the quality of the data.
3. Construct Data: Construct new variables or features derived from existing records that will be used in
modeling, guided by the business understanding. For example: area = length * width.
4. Integrate Data: Combine multiple sources of data to create a single dataset, ensuring that the data is
properly aligned and that there are no inconsistencies or errors.
5. Format Data: Ensure that the data is in the correct format for analysis or modeling.
Data preparation
Example: Online Retail Platform

df.fillna(df.mean(), inplace=True) fills missing values in the DataFrame (df) with the mean of each respective
column.

X is created by dropping the 'SpendingScore' column from the DataFrame, leaving only the features.
y is created to represent the target variable, which, in this case, is 'SpendingScore'.
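A minimal sketch of this preparation step, using hypothetical retail data (the column names and values are illustrative, not the ones from the slides):

import pandas as pd
from io import StringIO

# Hypothetical retail data with a few missing values.
csv_data = """Age,AnnualIncome,SpendingScore
25,40000,60
34,,81
45,75000,40
29,61000,
52,88000,30
"""
df = pd.read_csv(StringIO(csv_data))

# Fill missing values with the mean of each column.
df.fillna(df.mean(), inplace=True)

# Features (X) are everything except 'SpendingScore'; the target (y) is 'SpendingScore'.
X = df.drop('SpendingScore', axis=1)
y = df['SpendingScore']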
Modeling
This phase is focused on building and
validating the data mining models.
It involves selecting the best modeling
techniques, training and testing the
models, and evaluating their effectiveness.

How to choose a model?

It depends. It all depends on the goal of
your task or project, which should already
have been identified in the Business
Understanding phase of CRISP-DM.
Modeling
Machine learning is a subset of artificial intelligence (AI) that focuses on the development of
algorithms and models that enable computers to learn and make predictions or decisions without
being explicitly programmed.

Categories of Machine Learning

Supervised Learning: In supervised learning, the model is trained on a labeled dataset, where the
input data is paired with corresponding output labels. The goal is for the model to learn the mapping
from inputs to outputs, making predictions on new, unseen data (linear regression, classification)
Unsupervised Learning: Unsupervised learning involves modeling without labeled output data.
The algorithm tries to find patterns or relationships within the data. Common tasks include
clustering (grouping similar data points) and dimensionality reduction (simplifying the data while
retaining important information).
1. Select Modeling Techniques: Select the appropriate modeling techniques based on the
business goals, available data, and the problem at hand. The selection of algorithm depends on the
type of data, such as numeric, categorical, or text data. For example: regression.
2. Generate Test Design: The next step is to design a test plan that outlines the criteria for evaluating
the performance of the models. It includes deciding how to divide the available data set into training,
test, and validation sets, and which metrics to use, such as accuracy, precision, recall, and F1-score.
3. Build the Model: Build the model using the selected technique. This involves training the model
on the prepared data and adjusting its parameters.
4. Assess Model Performance: Assess the performance of the model by testing it on a
hold-out dataset. This helps to determine how well the model generalizes to new data.
Modeling

Example: Online Retail Platform

scikit-learn (sklearn) is used to perform a linear regression on the dataset.

train_test_split: This function is used to divide the dataset into training and test sets.
LinearRegression: This is the linear regression model you will use.
mean_squared_error: This function is used to calculate the mean squared error, which measures the
average of the squared deviations between the actual values and the predicted values.
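A minimal sketch of the imports described above (assuming scikit-learn is installed):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error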
Modeling

Example: Online Retail Platform

X represents the features of the dataset.
y represents the labels (target values) associated with those features.
test_size=0.2 specifies that 20% of the data will be used as a test set, and 80% as a training set.
random_state=42 fixes the seed of the random number generator, ensuring reproducibility of the
results.
Modeling

Example: Online Retail Platform

We create a LinearRegression object called model.
The model is fitted to the training data using the fit method.
The trained model is then used to make predictions on the test set (X_test).
Modeling

Example: Online Retail Platform

Real values (y_test) are compared with predictions (predictions) using the mean_squared_error
function and the result is displayed.
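Putting the modeling steps of the last few slides together, here is a self-contained sketch; the retail data is hypothetical and stands in for the X and y produced in the data preparation phase:

import pandas as pd
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical retail data standing in for the prepared dataset.
csv_data = """Age,AnnualIncome,SpendingScore
25,40000,60
34,52000,81
45,75000,40
29,61000,55
52,88000,30
41,67000,47
38,59000,52
23,35000,73
"""
df = pd.read_csv(StringIO(csv_data))
X = df.drop('SpendingScore', axis=1)   # features
y = df['SpendingScore']                # target

# 20% of the rows form the test set; random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and compare with the real values using the mean squared error.
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("Mean squared error:", mse)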
Evaluation

In this phase, the model is evaluated to determine
whether it meets the business objectives and
requirements. The evaluation phase helps in
identifying any issues with the model and provides
insights on how to improve it.
1.Evaluate results: This stage determines whether the model falls short of or produces the
intended results against the business objectives and requirements. Models are accepted if they meet
the chosen criteria after being assessed in light of the business success criteria.
2.Review Process: At this point, the process used to create the model is reviewed. It summarizes the
process evaluation and offers suggestions for tasks that were skipped or that ought to be done again.
It also aids in spotting any problems with the procedure and offers suggestions for how to make it
better.
3.Determine next steps: Based on the evaluation results and the process review, a choice needs to
be made whether to finish the project and continue to deployment if it is practicable, or initiate further
iterations, or set up new projects. The analysis of the budget and remaining resources is part of this
process and could influence the decision making.
Evaluation
Example: Online Retail Platform
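The evaluation code is not shown in the slides. One hedged illustration of how the model's error could be judged against the business objective is to compare it with a naive baseline that always predicts the mean value; all numbers below are hypothetical:

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical stand-ins for the test labels and the model's predictions.
y_test = np.array([60.0, 40.0, 55.0])
predictions = np.array([58.2, 44.1, 50.7])

# Compare the model against a baseline that always predicts the mean of y_test.
model_mse = mean_squared_error(y_test, predictions)
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_test.mean()))
print(f"Model MSE: {model_mse:.2f} | Baseline MSE: {baseline_mse:.2f}")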
Deployment

The Deployment phase of the CRISP-DM methodology
refers to the process of taking the solutions generated in
the modeling phase and putting them into action. It is
the final step in the data mining process and involves
deploying the solutions into a production environment,
where they can be used to drive business value.
Deployment
1. Plan Deployment: In this step, the project team develops a plan for deploying the model or solution.
The plan should include the necessary resources, timelines, and budget required for deployment.
2. Plan Monitoring and Maintenance: The team monitors the performance of the deployed solution
and evaluates its effectiveness in meeting the business objectives. Any issues or concerns are
addressed promptly to ensure that the solution continues to perform optimally. This could include
further improvements, modifications, or even decommissioning the solution if it is no longer useful.
3. Produce Final Report: The final written report of the data mining engagement includes arranging
and summarizing the results, in addition to all of the earlier deliverables. At the end of the
project, there is usually a closing meeting with the client to review the project's progress and results.
4. Review Project: Review the project to see what worked well, what could have been improved, and
how to do it better next time.
04 EXAMPLE
Case Study :
Stroke Prediction
1-Business Understanding
The Business Understanding phase is crucial for maintaining a strong connection between the data mining
goals and the overarching business objectives. This ensures that the subsequent phases of the project
are aligned with the mission of developing a meaningful and effective stroke prediction model.

1. Define the Business Objectives:
The objective is to develop a predictive model that can identify individuals at a high risk of stroke
based on certain health and lifestyle factors.
2. Assess the Situation:
According to the World Health Organization (WHO), stroke is the 2nd leading cause
of death globally, responsible for approximately 11% of total deaths.
3. Determine Data Mining Goals:
The goal is to leverage historical health data to identify patterns and factors
associated with stroke risk and use that information to predict future occurrences.
2-Data Understanding
The Data Understanding phase in CRISP-DM involves exploring and familiarizing yourself with
the data that will be used in your analysis.
1. Collect Initial Data:
Gather relevant data sources that may include demographic information, medical history,
lifestyle factors, and other potential predictors of stroke.
2-Data Understanding
2. Describe Data:
Here we have a description of the columns in the dataset.
2-Data Understanding
2. Explore data:
Visualize the data to identify patterns, trends, and potential outliers. This can involve creating
histograms, scatter plots, or correlation matrices.
Now we need to import the necessary libraries for data visualization (matplotlib.pyplot and
seaborn) and set up a filter to ignore warning messages during code execution. This keeps the
output clean.
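A minimal sketch of this setup, reconstructed from the description above:

import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Ignore warning messages so the output stays clean.
warnings.filterwarnings('ignore')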
2-Data Understanding
2. Explore data:
Now we need to quickly inspect and understand the structure of our dataset.

This code reads a CSV file into a DataFrame, sets the 'id' column as the index, prints the shape of the
DataFrame, and displays the first row of the dataset.
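A sketch of this step; the file name is an assumption, so adjust the path to your copy of the stroke dataset:

import pandas as pd

# Assumed file name for the stroke dataset.
df = pd.read_csv('healthcare-dataset-stroke-data.csv', index_col='id')

print(df.shape)    # number of rows and columns
print(df.head(1))  # first row of the dataset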
2-Data Understanding
2. Explore data:
Here we have a quick overview of the distribution and scale of the numerical data in the
DataFrame.

This code summarizes the central tendency and spread of the numerical features
in our dataset. It is used to identify potential outliers, understand the range of values, and
gain insights into the distribution of the data.
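Continuing from the DataFrame loaded above, a one-line sketch of this step:

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns.
print(df.describe())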
2-Data Understanding
2. Explore data:
Now we need to understand the distribution of ages in the dataset.

This code visually represents the distribution of ages in the dataset using a
histogram. The resulting plot shows the frequency of different age ranges.
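A sketch of the plot, continuing from the imports and the DataFrame above; sns.histplot with a density scale is one way to produce this kind of chart:

# Histogram of ages with a kernel density curve.
sns.histplot(df['age'], kde=True, stat='density')
plt.xlabel('age')
plt.ylabel('Density')
plt.show()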
2-Data Understanding
3. Verify data:
To understand the data type of each feature in the dataset, we will use this function.

We have here an overview of the types of data stored in each column (categorical variables or
numeric variables), which is crucial for understanding the nature of the features in the dataset.
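Continuing from the DataFrame above, a sketch of this check:

# Data type of each column: object columns are categorical, the rest are numeric.
print(df.dtypes)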
3-Data Preparation
We are now preparing the data for analysis and modeling. This phase involves several key tasks to ensure
that the data is in a suitable form for building and evaluating models.

1. Handling Missing Data:

This guides decisions on how to handle missing data during the data preparation phase.

X.isnull().sum() provides a summary of the number of missing values in each column of the
DataFrame X. The output is a Series where each entry corresponds to a column, and the value
is the count of missing values in that column.
3-Data Preparation
1. Handling Missing Data:
Now we will replace missing values in the 'bmi' column.

The code handles missing values in the 'bmi' column by filling them with the mean value of the
non-missing 'bmi' values. After this operation, the code checks and prints the count of missing
values in the 'bmi' column to confirm that there are no longer any missing values in that column.

The output "0" indicates that after the imputation process, there are no longer any missing values
in the 'bmi' column. The count of missing values has been reduced to zero. This suggests that the
missing values in the 'bmi' column were successfully replaced with the mean value of the
non-missing 'bmi' values, and the data in the 'bmi' column is now complete.
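A sketch of the two missing-data steps described on these slides; the slides work with a DataFrame called X, which is assumed here to be a copy of the dataset loaded earlier:

X = df.copy()                   # the slides refer to the working DataFrame as X
print(X.isnull().sum())         # count of missing values per column

# Impute missing 'bmi' values with the mean of the non-missing values,
# then confirm that the column no longer contains missing values.
X['bmi'] = X['bmi'].fillna(X['bmi'].mean())
print(X['bmi'].isnull().sum())  # expected output: 0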
3-Data Preparation
2. Construct Data:

The goal of this step is to gain a preliminary understanding of the 'gender' column, which is essential for
making informed decisions in subsequent stages of the modeling process.

This output provides a glimpse of the 'gender' column in your dataset. Each row in the
'gender' column is associated with a unique index, and the values represent the gender
information for each individual in the dataset. The length of 5110 indicates the total number
of entries in the 'gender' column.
3-Data Preparation
2. Construct Data:
Now we will convert categorical variables into a numerical format suitable for machine learning models.

The mapping of 'Male', 'Female', and 'Other' to numerical values (3, 1, and 2) facilitates model
compatibility. Additionally, choosing the 'uint8' data type enhances storage efficiency, particularly
when dealing with categorical variables that have a small set of distinct values.
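A sketch of the gender exploration and encoding from the last two slides, continuing from X; the 3/1/2 codes are the ones quoted in the slide text:

print(X['gender'])   # raw categorical values, 5110 entries

# Map the categories to numeric codes and store them compactly as uint8.
X['gender'] = X['gender'].replace({'Male': 3, 'Female': 1, 'Other': 2}).astype('uint8')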
3-Data Preparation
2. Construct Data:
The goal of this step is to quickly examine and understand the 'hypertension' column, providing a foundation
for subsequent analysis and decision-making in the data exploration and preparation process.

The 'hypertension' column appears to be binary, with values of 0 or 1, representing the absence (0)
or presence (1) of hypertension for each individual. The exploration of this column provides a quick
overview of the distribution of hypertension status in the first five individuals in the dataset.
3-Data Preparation
3. Integrate Data:
The goal of this step is to enhance the efficiency of memory usage and the representation of binary
categorical variables in the DataFrame. This practice is essential for efficient storage and computations,
especially in machine learning scenarios, where small data types enhance compatibility and streamline
preprocessing for analysis and modeling.

The code converts the 'hypertension' and 'heart_disease' columns to 'uint8,' optimizing memory usage for
binary categorical variables (0 or 1).
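A sketch of the hypertension check and the memory-saving conversion, continuing from X:

print(X['hypertension'].head())   # binary column: 0 = absent, 1 = present

# Convert both binary columns to uint8 to optimize memory usage.
X['hypertension'] = X['hypertension'].astype('uint8')
X['heart_disease'] = X['heart_disease'].astype('uint8')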
3-Data Preparation
3. Integrate Data:
The goal of this step in data preparation is to transform the categorical 'smoking_status' column into a
numerical format suitable for machine learning models.

The numerical values (0, 1, 2, 3, etc.) in the 'smoking_status' column represent the encoded
versions of the original categorical smoking status categories.
This transformed 'smoking_status' column is now
ready for use in machine learning models that
require numerical input, allowing for more efficient
and effective analyses and predictions.
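A sketch of this encoding, continuing from X; the exact numeric codes used in the slides are not shown, so the mapping below (based on the categories found in the public stroke dataset) is an assumption:

# Assumed mapping from smoking status categories to numeric codes.
smoking_map = {'never smoked': 0, 'formerly smoked': 1, 'smokes': 2, 'Unknown': 3}
X['smoking_status'] = X['smoking_status'].replace(smoking_map).astype('uint8')
print(X['smoking_status'].head())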
3-Data Preparation
3. Integrate Data:

The following code prepares these columns for machine learning models by converting categorical
information into a numerical format, making them suitable for inclusion in predictive modeling tasks.

For 'ever_married':
'Yes' is replaced with 1, and 'No' is replaced with 0, transforming the column into a binary representation of
marital status.
The data type is then converted to 'uint8' for optimized memory usage.
3-Data Preparation
3. Integrate Data:

For 'Residence_type':
'Rural' is replaced with 1, and 'Urban' is replaced with 0, converting the column into a binary representation of
residence type.
The data type is converted to 'uint8' for memory efficiency.
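A sketch of both binary conversions described above, continuing from X:

# 'ever_married': 'Yes' -> 1, 'No' -> 0; 'Residence_type': 'Rural' -> 1, 'Urban' -> 0.
X['ever_married'] = X['ever_married'].replace({'Yes': 1, 'No': 0}).astype('uint8')
X['Residence_type'] = X['Residence_type'].replace({'Rural': 1, 'Urban': 0}).astype('uint8')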
3-Data Preparation
3. Integrate Data:
The goal is to gather descriptive statistics about the distribution of work types, providing a foundation for
further analysis, modeling, and decision-making in the data science or business context.

Each line focuses on a specific category within the 'work_type' column and counts the occurrences
of that category using the count() method.
The code is helpful for understanding the distribution of
individuals across different work types in the dataset.
The printed counts provide a quick summary of how many
individuals fall into each work type category, offering insights
into the composition of the workforce.
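A sketch of these per-category counts, continuing from X; the category names are those of the public stroke dataset and are an assumption here:

# Count the occurrences of each work type using the count() method.
for category in ['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked']:
    print(category, X[X['work_type'] == category]['work_type'].count())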
3-Data Preparation
4. Format Data:

the goal is to prepare the 'work_type' information in a format suitable for machine learning models
while preserving the categorical nature of the original data.

This code creates dummy variables for the 'work_type' column in the DataFrame X. The resulting
DataFrame contains binary indicators (0 or 1) for each unique category in the 'work_type' column.
3-Data Preparation
4. Format Data:

This approach achieves the same result as using pd.get_dummies() alone but integrates the dummy
variables back into the original DataFrame and drops the original categorical column. This is a
common practice in preparing data for machine learning.
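A sketch of the dummy-variable step and its integration back into the DataFrame, continuing from X:

import pandas as pd

# One binary indicator column per work type.
dummies = pd.get_dummies(X['work_type'], prefix='work_type')

# Merge the indicators into X and drop the original categorical column.
X = pd.concat([X, dummies], axis=1).drop('work_type', axis=1)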
3-Data Preparation
4. Format Data:

The code handles the categorical variable 'work_type' by converting it into dummy variables
(binary variables), creating a binary representation for each work type.
The resulting DataFrame X now includes these dummy variables for each unique work type, and
the original 'work_type' column has been dropped.
3-Data Preparation
This kind of information is crucial for understanding the structure of the dataset, ensuring that the data
types are appropriate for the analysis or modeling task, and identifying any potential issues or
necessary preprocessing steps.

The dataset represented by DataFrame X has dimensions (5110, 15), with each row corresponding
to an individual and 15 columns capturing demographic, health, occupational, and target
information. Categorical variables like gender, hypertension, and others are efficiently encoded
as binary using the uint8 data type, while numeric features such as age, average glucose level, and
BMI are appropriately represented as int64 or float64. The dataset is structured for binary
classification, aiming to predict stroke occurrence, and is well-prepared for machine learning
tasks with memory-efficient encoding of categorical variables.
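Continuing from X, a sketch of this structural check:

print(X.shape)   # expected: (5110, 15)
X.info()         # column names, dtypes, and non-null counts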
3-Data Preparation
Exploring the relationships between variables in the dataset

The heatmap visually represents the correlations between different columns in the DataFrame X.
Positive correlations are represented by warmer colors, negative correlations by cooler colors, and the
intensity of color indicates the strength of the correlation.
This visualization is useful for identifying relationships and dependencies between variables, which is
valuable in understanding the dataset and making decisions about feature selection or engineering.
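A sketch of the heatmap, continuing from the imports and the encoded DataFrame X above:

# Correlation heatmap: warmer colors = stronger positive correlation, cooler = negative.
plt.figure(figsize=(12, 8))
sns.heatmap(X.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()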
3-Data Preparation
In general, if the correlation between two
variables is greater than 0.7, we say they have
a high correlation; if it is between 0.3 and 0.7,
a medium correlation; and if it is lower than
0.3, a low correlation.
Unfortunately, the variables in the heatmap do
not reveal anything particularly important: no
variable has a high correlation with the stroke
variable, and the correlations that are high
between other variables are all just common
sense. Thus, we take all the features as input
data to build our model.
3-Data Preparation
Preparing the data for machine learning, specifically for training a predictive model.

The goal of these operations is to separate the target variable ('stroke') from the feature variables in
preparation for a machine learning model.
The y variable now contains the target variable, and X contains the remaining features that will be
used to predict the target.
This separation is common in machine learning workflows to clearly distinguish between the input
features and the output variable during model training.
After this operation, y would typically be used as the target variable when training a machine learning
model, and X would be used as the feature matrix. For example, in a classification task, the model
might be trained to predict whether an individual had a stroke (y) based on the other features in the
dataset (X).
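A sketch of this separation, continuing from X:

y = X['stroke']                 # target variable
X = X.drop('stroke', axis=1)    # feature matrix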
3-Data Preparation
This preprocessing step is particularly beneficial for algorithms that are sensitive to the scale of input
features, such as gradient-based optimization algorithms commonly used in machine learning models.

The MinMaxScaler scales each feature independently, rescaling them to a specified range. By
default, it scales features to a range between 0 and 1.
Scaling is important in machine learning to ensure that features with different scales or units do not
disproportionately influence the model training process.
The fit_transform method is used to both compute the scaling parameters (using fit) and apply the
scaling to the data (using transform).
After this operation, the feature variables in X have been scaled, and the transformed X can be
used for training machine learning models.
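A sketch of the scaling step:

from sklearn.preprocessing import MinMaxScaler

# fit computes the per-feature min/max; transform rescales each feature to [0, 1].
scaler = MinMaxScaler()
X = scaler.fit_transform(X)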
3-Data Preparation
We create separate datasets for training and validation to assess the performance of a machine
learning model.

train_x and train_y represent the feature matrix and target variable for training, respectively.
val_x and val_y represent the feature matrix and target variable for validation, respectively.
The test_size parameter determines the proportion of data allocated for validation, and
random_state ensures reproducibility.
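A sketch of the split; the exact test_size used in the slides is not shown, so 0.2 is assumed here:

from sklearn.model_selection import train_test_split

# Hold out part of the data for validation; random_state makes the split reproducible.
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)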
05 CRISP-DM BENEFITS
What are its
advantages?
Benefits of using CRISP-DM Framework:

01 Improved project planning

02 Clearer understanding of data requirements

03 Improved collaboration and communication

04 Consistent and repeatable results


06 CRISP-DM LIMITATIONS
Limitations of CRISP-DM Framework

1. Lack of Flexibility: The framework is a rigid process that involves a fixed set of steps. This lack of
flexibility can sometimes be an issue, as not all projects fit neatly into the framework, though it
provides the benefits of iteration. Projects that do not follow a linear sequence of steps, or that
require more iteration and exploration, may not benefit from using the CRISP-DM framework.

2. Limited Emphasis on Data Visualization: The CRISP-DM framework does not place a strong
emphasis on data visualization, which is an important part of the data analysis process. While the
framework includes steps for data exploration and data modeling, it does not provide a structured
approach to visualizing data and communicating insights.
Limitations of CRISP-DM Framework

3. Focus on Technical Aspects: The CRISP-DM framework is heavily focused on technical aspects of
data mining and analysis, and it may not adequately address business and organizational
considerations. As a result, projects that involve significant business or organizational complexities
may require additional frameworks or methodologies to supplement the CRISP-DM framework.

4. Lack of Emphasis on Ethics: The CRISP-DM framework does not explicitly address ethical
considerations in data mining and analysis. Projects that involve sensitive or personal data may
require additional ethical guidelines and considerations beyond what is included in the framework.
07 CONCLUSION
The CRISP-DM framework is a powerful tool for data mining professionals who are looking to
develop and execute successful data mining projects. By providing a structured and flexible
approach to data mining, it can help ensure that the project stays on track and that the results
are relevant, meaningful, and actionable. Whether you are working in manufacturing,
telecommunications, banking, or any other industry, the framework can help you to extract
valuable insights from your data and drive business success.
THANKS FOR WATCHING
Made By HIRA
