Rapport

Machine Learning Rapport
Mehdi Mseddi Khlifi Midani

April 2024
1 Exercice 1 (Data Engineering)

1.1 Question a
Let’s walk through the preparation steps for the Stroke Prediction Dataset:
1. Data Cleaning:
Identify missing values: In this dataset, the ’bmi’ column contains ’N/A’ values.
Decide on a strategy for handling missing values: Since ’bmi’ is a numerical feature, we could replace
missing values with the median ’bmi’ value.
2. Handling Categorical Variables:

Identify categorical variables: ’gender’, ’ever married’, ’work type’, ’Residence type’, and ’smok-
ing status’.
Encode categorical variables: We’ll use one-hot encoding to convert categorical variables into binary
columns.
3. Feature Scaling:
Check the distribution of numerical features: ’age’, ’avg glucose level’, and ’bmi’.
Normalize these features to ensure they are on a similar scale, which is particularly important for
algorithms like k-nearest neighbors and support vector machines.
4. Feature Engineering:
No specific feature engineering is mentioned in the dataset. However, we could potentially calculate
the body mass index (BMI) from weight and height if those were available.
5. Data Splitting:
Split the dataset into training and testing sets, with the majority of the data allocated to training
to ensure the model has sufficient data to learn from.
6. Handling Imbalanced Data (if applicable):
Check if the target variable (’stroke’) is imbalanced. If so, consider techniques like oversampling,
undersampling, or using algorithms that handle class imbalance well.
7. Final Checks:
After preprocessing, perform a final check to ensure all missing values are handled, categorical vari-
ables are encoded, numerical features are scaled, and the dataset is ready for modeling.
Overall, the preparation involves cleaning the data, encoding categorical variables, scaling numerical features,
potentially engineering new features, splitting the data for training and testing, handling imbalanced data if
necessary, and performing final checks before training a machine learning model.
1.2 Question b
By loading the data into a Pandas DataFrame, we can easily manipulate and analyze it using various built-in
functions and methods provided by Pandas. This data structure is suitable for handling tabular data like the
Stroke Prediction Dataset.
1
Mseddi Mehdi and Khlifi Midani
1.3 Question c
Size: The dataset contains a total of 17, id’s, gender, age, hypertension, heart disease, ever married,
work type, Residence type, avg glucose level, bmi, smoking status, and stroke.
Features:
1. id: Unique identifier for each individual.
2. gender: Gender of the individual (Male/Female).
3. age: Age of the individual in years.
4. hypertension: Indicates whether the individual has hypertension (1 for yes, 0 for no).
5. heart disease: Indicates whether the individual has heart disease (1 for yes, 0 for no).
6. ever married: Indicates whether the individual is married (Yes/No).
7. work type: Type of work the individual is engaged in (e.g., Private, Self-employed, Govt job).
8. Residence type: Type of residence of the individual (Urban/Rural).
9. avg glucose level: Average glucose level in the blood.
10. bmi: Body mass index (BMI) of the individual.
11. smoking status: Smoking status of the individual (e.g., formerly smoked, never smoked, smokes).
12. stroke: Target variable indicating whether the individual had a stroke (1 for yes, 0 for no).
Predictive Variables:
– All features except ’id’ and ’stroke’ are predictive variables. These variables are used to predict
whether an individual is at risk of having a stroke.
Target Variable:
– The target variable is ’stroke’, which indicates whether the individual had a stroke. It’s a binary
variable with values 1 (stroke occurred) or 0 (no stroke occurred).
Feature Types:
– Categorical Features: ’gender’, ’ever married’, ’work type’, ’Residence type’, ’smoking status’.
– Numerical Features: ’age’, ’hypertension’, ’heart disease’, ’avg glucose level’, ’bmi’.
– Target Variable: ’stroke’ is binary.
This dataset contains information about individuals’ demographic characteristics, health conditions, and lifestyle
factors. The goal is to use this information to predict the likelihood of an individual having a stroke based on
their features.
1.4 Question d
Let’s apply preprocessing steps to clean and filter the data before analysis:
1. Handling Missing Values:
Replace missing values in the ’bmi’ column with the median BMI value.
2. Encoding Categorical Variables:
Convert categorical variables (’gender’, ’ever married’, ’work type’, ’Residence type’, ’smoking status’)
into numerical format using one-hot encoding.
3. Feature Scaling:
Normalize numerical features (’age’, ’avg glucose level’, ’bmi’) to ensure they are on a similar scale.
We can use techniques like Min-Max scaling or Standardization.
4. Handling Imbalanced Data (if applicable):
Check if the target variable (’stroke’) is imbalanced. If so, consider applying techniques such as
oversampling, undersampling, or using algorithms that handle class imbalance well.
5. Drop Unnecessary Columns:
If the ’id’ column does not provide any meaningful information for prediction, it can be dropped.
2
1.5 Question e
To analyze, characterize, and summarize the cleaned dataset, we’ll provide summary statistics for numerical
features and frequency counts for categorical features. We’ll also use tables and plots where appropriate to
illustrate the analysis results.
3
4
Summary Statistics for Numerical Features:

1. Age:
Mean: 0. Mean represents the average age.
Standard Deviation: 1. Standard deviation measures the dispersion of age values around the mean.
Min: -1.91. Minimum age in the dataset.
Max: 1.71. Maximum age in the dataset.
Interpretation: The age distribution is centered around the mean with a standard deviation of 1.
2. Average Glucose Level:
Mean: 0. Mean represents the average glucose level.
Standard Deviation: 1. Standard deviation measures the dispersion of glucose level values around
the mean.
Min: -1.13. Minimum glucose level in the dataset.
Max: 3.66. Maximum glucose level in the dataset.
Interpretation: The average glucose level varies across the dataset with a standard deviation of 1.
3. BMI (Body Mass Index):
Mean: 0. Mean represents the average BMI.
Standard Deviation: 1. Standard deviation measures the dispersion of BMI values around the mean.
Min: -2.41. Minimum BMI in the dataset.
Max: 8.93. Maximum BMI in the dataset.
Interpretation: The BMI distribution is centered around the mean with a standard deviation of 1.
Frequency Counts for Categorical Features:
1. Gender:
Female: 2994, Male: 2116.
2. Ever Married:
Yes: 3353, No: 1757.
3. Work Type:
Private: 2925, Self-employed: 819, Govt job: 657, Children: 687, Never worked: 22.
4. Residence Type:
Rural: 2596, Urban: 2514.
5. Smoking Status:
Never smoked: 3218, Unknown: 3566, Formerly smoked: 885, Smokes: 789.
let’s interpret the analysis results produced specifically for the given DataFrame.
1. Summary Statistics for Numerical Features:
The summary statistics provide an overview of the distribution of numerical features in the dataset,
including statistics such as mean, standard deviation, minimum, maximum, and quartiles.
2. Frequency Counts for Categorical Features:
The frequency counts display the distribution of categories within each categorical feature. For
example, it shows the count of males and females (gender), married and unmarried individuals
(ever married), different types of work (work type), residence types (Residence type), and smoking
statuses. These counts provide insights into the distribution of categories within each categorical
variable.
3. Distribution of Numerical Features:
5
Histograms are plotted to visualize the distribution of numerical features (age, average glucose level,
and BMI). Histograms show the frequency distribution of values within each numerical feature,
allowing us to understand their shape, central tendency, and spread. For instance, we can observe if
the distribution is symmetric, skewed, or multimodal.
4. Frequency Counts of Categorical Features:
Count plots are used to visualize the frequency counts of different categories within each categorical
feature. These plots show the number of occurrences of each category, providing insights into the
distribution of categorical variables. We can observe the count of different genders, marital statuses,
work types, residence types, and smoking statuses.
1.6 Question f
Age, average glucose level, and BMI exhibit variations across the dataset, as indicated by their standard
deviations.
There’s a roughly equal distribution of genders in the dataset.
Most individuals in the dataset are married.
Private work type is the most common, followed by self-employed and government job.
Residence type is evenly distributed between rural and urban areas.
Never smoked and unknown smoking status are the most prevalent categories in the dataset.
1.7 Question g
Here are some ideas for further analysis, including data transformation and data reduction techniques:
1. Data Transformation:
Normalization: Scale numerical features to a standard range (e.g., [0, 1]) to ensure that they con-
tribute equally to the model. This can be achieved using techniques like Min-Max scaling.
Log Transformation: Transform skewed numerical features to a more Gaussian-like distribution using
log transformation. This can improve the performance of models that assume normality.
Feature Engineering: Create new features by combining existing ones or extracting useful information.
For example, creating interaction terms or polynomial features can capture nonlinear relationships.
2. Data Reduction:
Principal Component Analysis (PCA): Reduce the dimensionality of the feature space while retaining
most of the variance. PCA identifies the orthogonal axes (principal components) that capture the
maximum amount of variation in the data.
Feature Selection: Select a subset of relevant features that are most informative for predicting the
target variable. Techniques like Recursive Feature Elimination (RFE) or feature importance from
tree-based models can be used for this purpose.
Manifold Learning: Explore nonlinear dimensionality reduction techniques such as t-distributed
Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP)
to visualize high-dimensional data in lower dimensions while preserving local and global structure.
Let’s conduct the suggested analysis, starting with data transformation and then moving on to data reduction
techniques.
1. Data Transformation:
(a) Normalization:
1 from sklearn.preprocessing import MinMaxScaler

2
3 # Initialize MinMaxScaler
4 scaler = MinMaxScaler()
5
6 # Normalize numerical features
7 normalized_data = scaler.fit_transform(processed_df[numeric_features])
6
8
9 # Convert normalized data back to DataFrame
10 normalized_df = pd.DataFrame(normalized_data, columns=numeric_features)
11
12 # Display the first few rows of the normalized DataFrame
13 print(normalized_df.head())
14
Explanation:
We used MinMaxScaler to scale the numerical features of the dataset to the range [0, 1].
The resulting DataFrame ’normalized df’ contains the normalized numerical features.
(b) Log Transformation (for skewed features):
1 # Apply log transformation to skewed numerical features

2 skewed_features = ['avg_glucose_level', 'bmi']
3 skewed_data = filtered_df[skewed_features].apply(np.log1p)
4
5 # Display the first few rows of the log-transformed DataFrame
6 print(skewed_data.head())
7
8
Explanation:
We applied log transformation to the skewed numerical features ’avg glucose level’ and ’bmi’
using the numpy log1p function.
The resulting DataFrame ’skewed data’ contains the log-transformed numerical features.
2. Data Reduction:
(a) Principal Component Analysis (PCA):
1 from sklearn.decomposition import PCA

2
3 # Initialize PCA with desired number of components
4 pca = PCA(n_components=2)
5
6 # Apply PCA to the normalized data
7 pca_data = pca.fit_transform(normalized_df)
8
9 # Convert PCA data to DataFrame
10 pca_df = pd.DataFrame(data=pca_data, columns=['PC1', 'PC2'])
11
12 # Display the first few rows of the PCA DataFrame
13 print(pca_df.head())
Explanation:
We used PCA to reduce the dimensionality of the normalized data to two principal components.
The resulting DataFrame ’pca df’ contains the reduced dimensions PC1 and PC2.
(b) Feature Selection:
1 from sklearn.feature_selection import SelectKBest, f_classif

2
3 # Define the target variable 'stroke'
4 y = data['stroke']
5
6 # Initialize SelectKBest with ANOVA F-value as scoring function
7 selector = SelectKBest(score_func=f_classif, k='all')
8
9 # Apply feature selection to the normalized data
10 selected_features = selector.fit_transform(normalized_df, y)
11
12 # Get selected feature indices
13 selected_indices = selector.get_support(indices=True)
14
15 # Get names of selected features
16 selected_feature_names = normalized_df.columns[selected_indices]
17
18 # Display the selected feature names
19 print(selected_feature_names)
20
7
Explanation:
We used SelectKBest with ANOVA F-value as the scoring function to select the top 5 features
that are most relevant to predicting the target variable ’stroke’.
The selected feature names are printed to identify the most informative features.
2 Exercice 2 (Model Engineering)

2.1 Question a
To split the data into training and test sets, we typically use the train test split function provided by the scikit-
learn library. This function randomly shuffles the data and splits it into two subsets: one for training the model
and the other for evaluating its performance. Here’s how we can split the data:
1 from sklearn.model_selection import train_test_split

2
3 # Split the data into features (X) and target variable (y)
4 X = processed_df.drop(columns=['stroke']) # Features
5 y = processed_df['stroke'] # Target variable
6
7 # Split the data into training and test sets (e.g., 80% train, 20% test)
8 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
9
10 # 'test_size' parameter specifies the proportion of the dataset to include in the test split.
11 # 'random_state' parameter ensures reproducibility by fixing the random seed for shuffling.
12
After splitting the data, we have:

X train: Features for training the model.
X test: Features for evaluating the model’s performance.
y train: Target variable values corresponding to the training set.
y test: Target variable values corresponding to the test set.
8
2.2 Question c
Graphical representation:
9
Textual representation:
|--- age <= 1.07

| |--- age <= 0.19
| | |--- bmi <= 3.56
| | | |--- avg_glucose_level <= -1.06
| | | | |--- avg_glucose_level <= -1.06
| | | | | |--- avg_glucose_level <= -1.07
| | | | | | |--- class: 0
| | | | | |--- avg_glucose_level > -1.07
| | | | | | |--- avg_glucose_level <= -1.06
| | | | | | | |--- class: 1
| | | | | | |--- avg_glucose_level > -1.06
| | | | | | | |--- class: 0
| | | | |--- avg_glucose_level > -1.06
| | | | | |--- class: 1
| | | |--- avg_glucose_level > -1.06
| | | | |--- age <= -0.25
| | | | | |--- age <= -1.85
| | | | | | |--- bmi <= -0.31
| | | | | | | |--- class: 0
| | | | | | |--- bmi > -0.31
| | | | | | | |--- age <= -1.86
| | | | | | | | |--- class: 0
| | | | | | | |--- age > -1.86
| | | | | | | | |--- class: 1
| | | | | |--- age > -1.85
| | | | | | |--- smoking_status_smokes <= 0.50
| | | | | | | |--- class: 0
| | | | | | |--- smoking_status_smokes > 0.50
| | | | | | | |--- avg_glucose_level <= -0.66
| | | | | | | | |--- avg_glucose_level <= -0.67
| | | | | | | | | |--- class: 0
| | | | | | | | |--- avg_glucose_level > -0.67
| | | | | | | | | |--- class: 1
| | | | | | | |--- avg_glucose_level > -0.66
| | | | | | | | |--- class: 0
| | | | |--- age > -0.25
| | | | | |--- bmi <= 0.26
| | | | | | |--- bmi <= 0.25
| | | | | | | |--- smoking_status_formerly smoked <= 0.50
| | | | | | | | | |--- avg_glucose_level <= -0.93
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- avg_glucose_level > -0.93
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- smoking_status_Unknown <= 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- smoking_status_Unknown > 0.50
| | | | | | | | | | |--- avg_glucose_level <= -0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- avg_glucose_level > -0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | |--- smoking_status_formerly smoked > 0.50
| | | | | | | | |--- age <= -0.21
| | | | | | | | | |--- work_type_Self-employed <= 0.50
| | | | | | | | | | |--- bmi <= -0.17
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- bmi > -0.17
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- work_type_Self-employed > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- age > -0.21
| | | | | | | | | |--- bmi <= 0.15
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- bmi > 0.15
| | | | | | | | | | |--- bmi <= 0.18
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- bmi > 0.18
| | | | | | | | | | | |--- class: 0
| | | | | | |--- bmi > 0.25
| | | | | | | | |--- class: 1
| | | | | | | | |--- class: 0
10
| | | | | |--- bmi > 0.26

| | | | | | |--- class: 0
| | |--- bmi > 3.56
| | | |--- age <= 0.03
| | | | |--- class: 0
| | | |--- age > 0.03
| | | | |--- class: 1
| |--- age > 0.19
| | |--- avg_glucose_level <= 0.10
| | | |--- ever_married_Yes <= 0.50
| | | | |--- bmi <= 0.09
| | | | | |--- class: 0
| | | | |--- bmi > 0.09
| | | | | | |--- bmi <= 0.16
| | | | | | | |--- class: 1
| | | | | | |--- bmi > 0.16
| | | | | | | | |--- work_type_Self-employed <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- work_type_Self-employed > 0.50
| | | | | | | | | |--- Residence_type_Urban <= 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- Residence_type_Urban > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- work_type_Govt_job <= 0.50
| | | | | | | | | |--- smoking_status_smokes <= 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- smoking_status_smokes > 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- work_type_Govt_job > 0.50
| | | | | | | | | |--- class: 0
| | | | | | |--- class: 1
| | | |--- ever_married_Yes > 0.50
| | | | |--- bmi <= 0.38
| | | | | |--- bmi <= 0.36
| | | | | | | |--- bmi <= -0.21
| | | | | | | | |--- age <= 0.23
| | | | | | | | | |--- smoking_status_never smoked <= 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- smoking_status_never smoked > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- age > 0.23
| | | | | | | | | | |--- class: 0
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | | |--- class: 0
| | | | | | | |--- bmi > -0.21
| | | | | | | | |--- bmi <= -0.04
| | | | | | | | | |--- gender_Male <= 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- gender_Male > 0.50
| | | | | | | | | | |--- smoking_status_smokes <= 0.50
| | | | | | | | | | |--- smoking_status_smokes > 0.50
| | | | | | | | |--- bmi > -0.04
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- bmi <= 0.28
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- bmi > 0.28
| | | | | | | | |--- class: 1
11

| | | | | | | | |--- smoking_status_never smoked <= 0.50
| | | | | | | | | | |--- age <= 0.65
| | | | | | | | | | |--- age > 0.65
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- bmi <= 0.32
| | | | | | | | | | |--- bmi > 0.32
| | | | | | | | |--- smoking_status_never smoked > 0.50
| | | | | | | | | |--- class: 0
| | | | | |--- bmi > 0.36
| | | | | | |--- Residence_type_Urban <= 0.50
| | | | | | | |--- class: 0
| | | | | | |--- Residence_type_Urban > 0.50
| | | | | | | |--- class: 1
| | | | |--- bmi > 0.38
| | | | | |--- age <= 0.59
| | | | | | |--- class: 0
| | | | | |--- age > 0.59
| | | | | | |--- age <= 0.63
| | | | | | | |--- Residence_type_Urban <= 0.50
| | | | | | | | | |--- smoking_status_Unknown <= 0.50
| | | | | | | | | | |--- work_type_Govt_job <= 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- work_type_Govt_job > 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- smoking_status_Unknown > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- class: 0
| | | | | | | |--- Residence_type_Urban > 0.50
| | | | | | | | |--- class: 0
| | | | | | |--- age > 0.63
| | | | | | | |--- class: 0
| | |--- avg_glucose_level > 0.10
| | | |--- avg_glucose_level <= 0.10
| | | | |--- class: 1
| | | |--- avg_glucose_level > 0.10
| | | | |--- age <= 0.85
| | | | | |--- smoking_status_smokes <= 0.50
| | | | | | |--- avg_glucose_level <= 1.22
| | | | | | | |--- bmi <= 1.65
| | | | | | | | |--- class: 0
| | | | | | | |--- bmi > 1.65
| | | | | | | | |--- bmi <= 1.91
| | | | | | | | | |--- class: 1
| | | | | | | | |--- bmi > 1.91
| | | | | | | | | |--- class: 0
| | | | | | |--- avg_glucose_level > 1.22
| | | | | | | |--- avg_glucose_level <= 1.38
| | | | | | | | |--- gender_Male <= 0.50
| | | | | | | | | |--- avg_glucose_level <= 1.28
| | | | | | | | | | |--- work_type_Private <= 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- work_type_Private > 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | | | |--- avg_glucose_level > 1.28
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- gender_Male > 0.50
| | | | | | | | | |--- class: 0
| | | | | | | |--- avg_glucose_level > 1.38
| | | | | | | | |--- bmi <= -0.08
| | | | | | | | | |--- bmi <= -0.18
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- bmi > -0.18
| | | | | | | | | | |--- avg_glucose_level <= 2.38
| | | | | | | | | | |--- avg_glucose_level > 2.38
| | | | | | | | | | | |--- class: 0
| | | | | | | | |--- bmi > -0.08
12
| | | | | | | | | | |--- age <= 0.63

| | | | | | | | | | |--- age > 0.63
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- gender_Female <= 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- gender_Female > 0.50
| | | | | | | | | | | |--- class: 0
| | | | | |--- smoking_status_smokes > 0.50
| | | | | | | |--- bmi <= 0.86
| | | | | | | | |--- bmi <= -1.01
| | | | | | | | | |--- gender_Male <= 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- gender_Male > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- bmi > -1.01
| | | | | | | | | |--- age <= 0.68
| | | | | | | | | | |--- age <= 0.28
| | | | | | | | | | |--- age > 0.28
| | | | | | | | | | | |--- class: 0
| | | | | | | | | |--- age > 0.68
| | | | | | | | | | |--- Residence_type_Urban <= 0.50
| | | | | | | | | | |--- Residence_type_Urban > 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | |--- bmi > 0.86
| | | | | | | | |--- avg_glucose_level <= 0.42
| | | | | | | | | |--- age <= 0.63
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- age > 0.63
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- avg_glucose_level > 0.42
| | | | | | | | | |--- bmi <= 1.44
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- bmi > 1.44
| | | | | | | | | | | |--- class: 0
| | | | | | | |--- class: 1
| | | | |--- age > 0.85
| | | | | |--- smoking_status_formerly smoked <= 0.50
| | | | | | |--- age <= 0.94
| | | | | | | |--- work_type_Private <= 0.50
| | | | | | | | |--- work_type_Govt_job <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- work_type_Govt_job > 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- bmi <= -0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- bmi > -0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | |--- work_type_Private > 0.50
| | | | | | | | |--- gender_Female <= 0.50
| | | | | | | | | | |--- age <= 0.90
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- age > 0.90
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- gender_Female > 0.50
| | | | | | | | | |--- class: 0
| | | | | | |--- age > 0.94
| | | | | | | |--- class: 0
| | | | | |--- smoking_status_formerly smoked > 0.50
| | | | | | | |--- age <= 1.03
| | | | | | | | |--- bmi <= 1.38
| | | | | | | | | |--- class: 1
13
| | | | | | | | |--- bmi > 1.38

| | | | | | | | | |--- class: 0
| | | | | | | |--- age > 1.03
| | | | | | | | |--- class: 0
| | | | | | | |--- bmi <= 0.96
| | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- class: 0
| | | | | | | |--- bmi > 0.96
| | | | | | | | |--- bmi <= 1.03
| | | | | | | | | |--- class: 1
| | | | | | | | |--- bmi > 1.03
| | | | | | | | | | |--- class: 0
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | | |--- class: 0
|--- age > 1.07
| |--- work_type_Private <= 0.50
| | |--- bmi <= 0.73
| | | |--- bmi <= 0.67
| | | | | |--- age <= 1.12
| | | | | | | |--- class: 0
| | | | | | | |--- class: 1
| | | | | |--- age > 1.12
| | | | | | |--- age <= 1.65
| | | | | | | | |--- bmi <= -0.17
| | | | | | | | | |--- class: 0
| | | | | | | | |--- bmi > -0.17
| | | | | | | | | |--- bmi <= -0.03
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- bmi > -0.03
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- smoking_status_formerly smoked <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- smoking_status_formerly smoked > 0.50
| | | | | | | | | |--- bmi <= 0.02
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- bmi > 0.02
| | | | | | | | | | |--- bmi <= 0.25
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- bmi > 0.25
| | | | | | | | | | | |--- class: 0
| | | | | | |--- age > 1.65
| | | | | | | | |--- class: 0
| | | | | | | | | |--- smoking_status_formerly smoked <= 0.50
| | | | | | | | | | |--- bmi <= -1.24
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- bmi > -1.24
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- smoking_status_formerly smoked > 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- class: 0
| | | | | | | | |--- class: 1
| | | | | | | | |--- class: 0
14

| | | | | | | |--- smoking_status_Unknown <= 0.50
| | | | | | | | |--- smoking_status_smokes <= 0.50
| | | | | | | | | |--- class: 1
| | | | | | | | |--- smoking_status_smokes > 0.50
| | | | | | | | | |--- class: 0
| | | | | | | |--- smoking_status_Unknown > 0.50
| | | | | | | | |--- class: 0
| | | | | | |--- age <= 1.43
| | | | | | | |--- ever_married_Yes <= 0.50
| | | | | | | | |--- bmi <= -0.40
| | | | | | | | | |--- class: 0
| | | | | | | | |--- bmi > -0.40
| | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- class: 0
| | | | | | | |--- ever_married_Yes > 0.50
| | | | | | | | |--- smoking_status_smokes <= 0.50
| | | | | | | | | |--- bmi <= 0.51
| | | | | | | | | | |--- work_type_Self-employed <= 0.50
| | | | | | | | | | |--- work_type_Self-employed > 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | | | |--- bmi > 0.51
| | | | | | | | | | |--- work_type_Self-employed <= 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- work_type_Self-employed > 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | |--- smoking_status_smokes > 0.50
| | | | | | | | | |--- age <= 1.16
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- age > 1.16
| | | | | | | | | | |--- bmi <= -0.75
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- bmi > -0.75
| | | | | | | | | | | |--- class: 0
| | | | | | |--- age > 1.43
| | | | | | | | |--- bmi <= 0.61
| | | | | | | | | | |--- bmi <= -0.91
| | | | | | | | | | |--- bmi > -0.91
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | | |--- class: 1
| | | | | | | | |--- bmi > 0.61
| | | | | | | | | |--- bmi <= 0.65
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- bmi > 0.65
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- class: 0
| | | |--- bmi > 0.67
| | | | |--- bmi <= 0.70
| | | | | |--- class: 1
| | | | |--- bmi > 0.70
| | | | | | |--- class: 1
| | | | | | |--- class: 0
| | |--- bmi > 0.73
| | | |--- avg_glucose_level <= -0.73
| | | | | |--- class: 0
| | | | | |--- class: 1
| | | |--- avg_glucose_level > -0.73
| | | | |--- class: 0
| |--- work_type_Private > 0.50
| | |--- bmi <= 2.04
15
| | | |--- age <= 1.34

| | | | |--- avg_glucose_level <= 3.05
| | | | | |--- bmi <= 0.96
| | | | | | |--- bmi <= -0.04
| | | | | | | | | | |--- ever_married_Yes <= 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- ever_married_Yes > 0.50
| | | | | | | | | | |--- age <= 1.16
| | | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- age > 1.16
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- class: 1
| | | | | | | | |--- smoking_status_Unknown <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- smoking_status_Unknown > 0.50
| | | | | | | | | |--- age <= 1.25
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- age > 1.25
| | | | | | | | | | |--- gender_Male <= 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | | |--- gender_Male > 0.50
| | | | | | | | | | | |--- class: 0
| | | | | | |--- bmi > -0.04
| | | | | | | | |--- ever_married_Yes <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- ever_married_Yes > 0.50
| | | | | | | | | |--- class: 1
| | | | | | | | |--- class: 0
| | | | | |--- bmi > 0.96
| | | | | | |--- age <= 1.29
| | | | | | | | |--- smoking_status_Unknown <= 0.50
| | | | | | | | | |--- class: 0
| | | | | | | | |--- smoking_status_Unknown > 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- class: 1
| | | | | | |--- age > 1.29
| | | | | | | |--- class: 1
| | | | |--- avg_glucose_level > 3.05
| | | | | |--- bmi <= -0.32
| | | | | | |--- class: 0
| | | | | |--- bmi > -0.32
| | | | | | |--- class: 1
| | | |--- age > 1.34
| | | | |--- avg_glucose_level <= 2.73
| | | | | |--- bmi <= 0.58
| | | | | | |--- bmi <= 0.53
| | | | | | | | | | |--- class: 0
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- class: 1
| | | | | | | | |--- class: 0
| | | | | | |--- bmi > 0.53
| | | | | | | |--- class: 1
| | | | | |--- bmi > 0.58
16
| | | | | | |--- ever_married_No <= 0.50

| | | | | | | | |--- class: 0
| | | | | | | | |--- age <= 1.49
| | | | | | | | | |--- smoking_status_never smoked <= 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | | |--- smoking_status_never smoked > 0.50
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- age > 1.49
| | | | | | | | | |--- class: 0
| | | | | | |--- ever_married_No > 0.50
| | | | | | | |--- class: 1
| | | | |--- avg_glucose_level > 2.73
| | | | | |--- bmi <= 0.73
| | | | | | |--- age <= 1.52
| | | | | | | |--- class: 0
| | | | | | |--- age > 1.52
| | | | | | | |--- bmi <= -0.30
| | | | | | | | |--- class: 0
| | | | | | | |--- bmi > -0.30
| | | | | | | | |--- class: 1
| | | | | |--- bmi > 0.73
| | | | | | |--- class: 1
| | |--- bmi > 2.04
| | | |--- smoking_status_Unknown <= 0.50
| | | | |--- class: 1
| | | |--- smoking_status_Unknown > 0.50
| | | | |--- class: 0
2.3 Question d
To determine the most relevant features for the classification task of predicting stroke occurrence, we can
use decision tree-based methods and evaluate feature importance based on Information Gain, Gain Ratio for
Attribute Selection, and Gini index. Here’s how the overall importance of a feature in a decision tree can be
computed using these methods:
1. Information Gain:
Information Gain measures the reduction in entropy or uncertainty achieved by splitting on a par-
ticular feature.
Features with higher Information Gain are more informative for the classification task.
We can compute Information Gain for each feature by evaluating the entropy of the target variable
(stroke) before and after splitting on that feature.
2. Gain Ratio for Attribute Selection:
Gain Ratio is a modification of Information Gain that accounts for the intrinsic usefulness of a feature.
It divides the Information Gain by the split information, which measures the homogeneity of split
partitions.
Features with higher Gain Ratio are more valuable for splitting the data.
We can compute Gain Ratio for each feature and compare them to assess their relevance.
3. Gini index (CART, IBM IntelligentMiner):
Gini index measures the impurity of a dataset by calculating the probability of incorrectly classifying
a randomly chosen element if it were randomly labeled.
Features that result in lower Gini index values after splitting are considered more important.
We can compute the Gini index for each feature and analyze how much they reduce impurity when
used for splitting.
Given the provided features, we can apply these methods to determine their relevance for predicting stroke
occurrence. After training a decision tree classifier on the dataset, we can extract feature importance scores
using these techniques and identify the features that contribute most significantly to the classification task.
17
2.4 Question e
Leaf Node 1 (Class: 0)
1 |--- bmi <= -0.31

2 | |--- age <= -1.86
3 | | |--- class: 0
4
Here, the decision tree first checks if the BMI is less than or equal to -0.31. If it is, then it checks if the age is
less than or equal to -1.86. If both conditions are met, the data point falls into this leaf node, resulting in the
classification of class 0.
1 |--- age > -1.85

2 | |--- smoking_status_smokes <= 0.50
3 | | |--- class: 0
4
5
In this leaf node, the decision tree checks if the age is greater than -1.85. If it is, then it checks if the smoking
status is less than or equal to 0.50. If both conditions are met, the data point falls into this leaf node, resulting
in the classification of class 0.
1 |--- bmi <= 0.09

2 | |--- class: 0
3
4
5
For this leaf node, the decision tree only checks if the BMI is less than or equal to 0.09. If it is, the data point
falls into this leaf node, resulting in the classification of class 0.
1 |--- bmi <= 0.36

2 | |--- avg_glucose_level <= -0.22
3 | | |--- bmi <= -0.21
4 | | | |--- age <= 0.23
5 | | | | |--- smoking_status_never smoked <= 0.50
6 | | | | | |--- class: 0
7
In this leaf node, the decision tree checks several conditions sequentially. It first checks if the BMI is less than
or equal to 0.36. If it is, it further checks if the average glucose level is less than or equal to -0.22. If both
conditions are met, it continues to check other conditions until it reaches the final classification.
1 |--- avg_glucose_level > -0.13

2 | |--- class: 1
3
Here, the decision tree checks if the average glucose level is greater than -0.13. If it is, the data point falls into
this leaf node, resulting in the classification of class 1.
2.5 Question f
Training set accuracy: 1.0
2.6 Question g
Test set accuracy: 0.8962818003913894
18
2.7 Question h
Pre-pruning in decision trees involves setting constraints on tree growth before it becomes fully grown. This
helps prevent overfitting by stopping the tree from splitting nodes further if certain conditions are met. The
main parameters for pre-pruning in scikit-learn’s DecisionTreeClassifier are max depth, min samples split,
min samples leaf, and max leaf nodes.
To run the decision tree algorithm on the training data with pre-pruning, I’ll set some of these parameters.
Here’s how:
1. max depth: This parameter restricts the depth of the decision tree. By setting a maximum depth, we
control the complexity of the tree. A lower max depth prevents the tree from growing too deep and
overfitting.
2. min samples split: This parameter specifies the minimum number of samples required to split an internal
node. If a node has fewer samples than min samples split, the split will not be performed.
3. min samples leaf: This parameter sets the minimum number of samples required to be at a leaf node. If
a split results in a leaf node with fewer samples than min samples leaf, the split will not be considered.
4. max leaf nodes: This parameter limits the maximum number of leaf nodes in the tree. It helps control
the size of the tree.
To obtain optimum results, we typically use techniques like cross-validation or grid search to tune these hyper-
parameters. Cross-validation helps evaluate the model’s performance on various parameter settings, while grid
search systematically searches through a range of parameter values to find the best combination.
2.8 Question i
Size of Original Tree: 585
Size of Pruned Tree: 39
2.9 Question j
Accuracy of pruned tree on test set: 0.9393346379647749
Accuracy of unpruned tree on test set: 0.8992172211350293
Accuracy of pruned tree on training set: 0.954434250764526
Accuracy of unpruned tree on training set: 1.0
2.10 Question k
Error pruning, also known as cost-complexity pruning, is a common method used for post-pruning decision
trees. The procedure involves iteratively removing nodes from the tree while monitoring the change in error
rate or cost function on a separate validation set or using cross-validation. Here’s a step-by-step explanation of
error pruning:
1. Initial Tree: Train a decision tree using the training dataset. This tree may be overfitted to the training
data, resulting in poor generalization to unseen data.
2. Validation Set: If using error pruning with a validation set, split the original dataset into three subsets:
training set, validation set, and test set. The training set is used to train the initial decision tree, the
validation set is used for pruning, and the test set is used for final evaluation.
3. Pruning Procedure:
Starting from the leaves of the tree, evaluate the error rate or cost function on the validation set
after removing each candidate node (subtree).
Remove the node (subtree) that results in the smallest increase in error rate or cost function on the
validation set.
Continue this process iteratively until further pruning increases the error rate or cost function on the
validation set.
4. Final Pruned Tree: Once pruning is complete, the pruned tree is obtained.
5. Evaluation: Evaluate the pruned tree using the test set to assess its performance on unseen data.
19
2.11 Question l
Size of Original Tree: 583
Size of Pruned Tree: 5
2.12 Question n
I would recommend using the pruned tree to classify future data. Here’s the justification:
1. Test Set Accuracy: The pruned tree outperforms the unpruned tree on the test set with a higher accuracy
of 0.939 compared to 0.899. This indicates that the pruned tree generalizes better to unseen data, which
is the ultimate goal of any classification model.
2. Training Set Accuracy: Although the unpruned tree achieves a perfect accuracy of 1.0 on the training
set, it suffers from overfitting, as evidenced by its lower performance on the test set. The pruned tree,
with a slightly lower accuracy of 0.954 on the training set, avoids overfitting and demonstrates better
generalization ability.
2.13 Question o
Based on these observations, the post-pruned tree is recommended for classifying future data due to its better
performance on unseen data and its ability to avoid overfitting.
2.14 Question p
Combining both pre-pruning and post-pruning techniques can be an interesting approach to optimize decision
tree models. Here’s why it could be beneficial:
1. Reduced Computational Complexity: Pre-pruning helps to limit the growth of the decision tree during
training by setting constraints such as maximum depth, minimum samples per leaf, or maximum number
of leaf nodes. This reduces the computational complexity during training.
2. Improved Generalization: Post-pruning further refines the decision tree after it has been fully grown. It
removes branches that do not contribute significantly to improving the accuracy on the validation set,
leading to better generalization performance on unseen data.
3. Better Interpretability: By combining both techniques, you can create a decision tree model that is not
only accurate but also interpretable. Pre-pruning constraints can control the size and complexity of the
tree, while post-pruning can refine it to retain only the most informative branches.
1 from sklearn.tree import DecisionTreeClassifier

2 from sklearn.metrics import accuracy_score
3 from sklearn.model_selection import train_test_split
4 from sklearn.tree import plot_tree
5
6 # Define a function to compute error rate
7 def compute_error_rate(y_true, y_pred):
8 return 1 - accuracy_score(y_true, y_pred)
9
10 # Split the data into training and test sets
11 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
12
13 # Pre-Pruning: Train the decision tree classifier with pre-pruning parameters
14 tree_classifier = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
15 tree_classifier.fit(X_train, y_train)
16
17 # Post-Pruning: Perform cost-complexity pruning
18 path = tree_classifier.cost_complexity_pruning_path(X_train, y_train)
19 ccp_alphas, impurities = path.ccp_alphas, path.impurities
20
21 # Initialize an array to store pruned trees
22 pruned_trees = []
23
24 # For each alpha along the pruning path
25 for ccp_alpha in ccp_alphas:
26 # Prune the tree using the current alpha
27 pruned_tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha)
28 pruned_tree.fit(X_train, y_train)
29 pruned_trees.append(pruned_tree)
20
30
31 # Evaluate the pruned trees on the test set
32 error_rates = [compute_error_rate(y_test, tree.predict(X_test)) for tree in pruned_trees]
33
34 # Choose the pruned tree with the minimum error rate
35 best_tree_index = np.argmin(error_rates)
36 best_pruned_tree = pruned_trees[best_tree_index]
37
38 # Evaluate the best pruned tree on the test set
39 pruned_tree_accuracy = accuracy_score(y_test, best_pruned_tree.predict(X_test))
40 print("Accuracy of pruned tree:", pruned_tree_accuracy)
41
42 # Plot the pruned tree
43 plt.figure(figsize=(20,10))
44 plot_tree(best_pruned_tree, filled=True, feature_names=data.feature_names, class_names=data.target_names)
45 plt.show()
46
21

Rapport

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Rapport

Uploaded by

Copyright:

Available Formats

Machine Learning Rapport

Mehdi Mseddi Khlifi Midani

1 Exercice 1 (Data Engineering)

2. Handling Categorical Variables:

Summary Statistics for Numerical Features:

1 from sklearn.preprocessing import MinMaxScaler

1 # Apply log transformation to skewed numerical features

1 from sklearn.decomposition import PCA

1 from sklearn.feature_selection import SelectKBest, f_classif

2 Exercice 2 (Model Engineering)

1 from sklearn.model_selection import train_test_split

After splitting the data, we have:

 X test: Features for evaluating the model’s performance.

 y train: Target variable values corresponding to the training set.

 y test: Target variable values corresponding to the test set.

|--- age <= 1.07

| | | | | |--- bmi > 0.26

| | | | | | | |--- avg_glucose_level > -0.21

| | | | | | | | | | |--- age <= 0.63

| | | | | | | | |--- bmi > 1.38

| | | | | | |--- avg_glucose_level > -0.36

| | | |--- age <= 1.34

| | | | | | |--- ever_married_No <= 0.50

3. Gini index (CART, IBM IntelligentMiner):

1 |--- bmi <= -0.31

1 |--- age > -1.85

1 |--- bmi <= 0.09

1 |--- bmi <= 0.36

1 |--- avg_glucose_level > -0.13

1 from sklearn.tree import DecisionTreeClassifier

You might also like

X test: Features for evaluating the model’s performance.

y train: Target variable values corresponding to the training set.

y test: Target variable values corresponding to the test set.