20MID0209 Lab - 6


NAME: R B SHARAN

REG NO: 20MID0209


COURSE: Advanced Predictive Analysis
COURSE CODE: MDI 3003
SLOT: L41+L42

LAB ASSIGNMENT - 6
QUESTION 1:

1) Objective:
The experiment aims to identify the optimal combination of base learners and
meta-learner for stacking. By evaluating different configurations of base
learners and meta-learner parameters, the goal is to discover the ensemble
setup that maximizes the overall accuracy and generalizability of the model.

2) Overview of the Heterogeneous Ensemble Learner Method:
Heterogeneous ensembles combine various types of machine learning
algorithms, such as decision trees, neural networks, support vector machines,
or linear models. Each base model has its own strengths and weaknesses, and
by combining diverse models, the ensemble can handle different aspects of the
data, leading to improved generalization.
Different base models are selected based on their suitability for the problem
domain. For example, a random forest might capture complex nonlinear
patterns, while a linear model could handle simpler relationships. The selection
of base models depends on the dataset characteristics and the ensemble's
objectives.
Base models in a heterogeneous ensemble should be as independent and
diverse as possible. Independence ensures that the models make different
errors on different subsets of data. Diversity in predictions helps in reducing
the overall error when combined. Methods like bagging (Bootstrap
Aggregating) and boosting (AdaBoost) can be employed to create diverse base
models.
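
To make this concrete, here is a minimal sketch (the estimator counts and random seeds are illustrative assumptions, not settings from this experiment) of how bagging and boosting produce diverse base models in scikit-learn:

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Bagging: each learner (a decision tree by default) is trained on a
# different bootstrap sample, so the models tend to err on different inputs
bagged_trees = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: learners are trained sequentially, reweighting the examples that
# earlier learners misclassified, which yields a different form of diversity
boosted_trees = AdaBoostClassifier(n_estimators=50, random_state=42)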
Stacking, a common technique in heterogeneous ensembles, involves training a
meta-learner (or a higher-level model) to make predictions based on the
outputs of individual base models. The predictions of base models serve as
features for the meta-learner. Stacking helps capture higher-order patterns in
the data that might be missed by individual models, leading to improved
overall performance.
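Concretely, if f1, f2, …, fK denote the fitted base models and g the meta-learner, the stacked prediction for an input x can be written schematically (notation added here for clarity, not taken from the original report) as ŷ = g(f1(x), f2(x), …, fK(x)); the base models' outputs become the input features on which g is trained.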

3) For the implementation of the Heterogeneous Ensemble Learner, the following steps have been applied:
➢ Load the Dataset: Load the breast cancer dataset using a library
like scikit-learn.
➢ Split the Data: Split the dataset into training and testing sets to
evaluate the model's performance.
➢ Define Base Learners: Choose base learners for your ensemble.
In this case, use Decision Tree, Random Forest, XGBoost, and k-
Nearest Neighbors (KNN) classifiers.
➢ Define Meta-Learner: Select a meta-learner, which is another
machine learning model that combines the outputs of base
learners. Common choices include models like Logistic
Regression or another Random Forest classifier.
➢ Create the Stacking Ensemble Model: Use a Stacking Classifier to combine the predictions from the base learners and train the meta-learner. Set the base learners and meta-learner when creating the Stacking Classifier (a minimal sketch appears after this list).
➢ Train the Ensemble Model: Train the Stacking Ensemble Model using the training data. The Stacking Classifier handles the stacking and the training of the base learners and the meta-learner.
➢ Make Predictions: Use the trained ensemble model to make predictions on the test data.
➢ Evaluate the Model: Assess the accuracy or other relevant metrics of the ensemble model's predictions on the test set to measure its performance.
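
As a minimal sketch of these steps, scikit-learn's StackingClassifier can wire the base learners and the meta-learner together directly. The hyperparameters shown (max_depth=3, n_neighbors=5) are illustrative assumptions, and XGBoost is omitted so the sketch depends only on scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load and split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base learners: diverse model families, as described above
base_learners = [
    ('decision_tree', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('random_forest', RandomForestClassifier(random_state=42)),
]

# Meta-learner: Logistic Regression combines the base learners' predictions
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))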

4) Steps applied in this experiment according to the algorithm:

➢ Import Necessary Libraries: Import the required libraries, including the scikit-learn models, data loading, train-test split, and evaluation metrics.

➢ Load and Split the Data: Load the breast cancer dataset and split it into training and testing sets, using 80% of the data for training and 20% for testing.

➢ Create Individual Models:
• Initialize individual machine learning models: Decision Tree, K-Nearest Neighbors (KNN), XGBoost, and Random Forest. Train each individual model on the training data.

➢ Create a Heterogeneous Ensemble Model: Create a VotingClassifier with the individual models as estimators. Use 'soft' voting, which allows probability-based voting (a short sketch of this probability averaging follows this list). Fit the ensemble model on the training data.

➢ Obtain Probabilities from Ensemble: Get the class probabilities from the ensemble model's predictions on the training data.

➢ Create Meta-Learner (Logistic Regression): Initialize a Logistic Regression model as the meta-learner. Train it using the probabilities obtained from the ensemble model and the corresponding true labels (y_train).

➢ Make Predictions with Stacked Model: Get the probabilities from the ensemble model's predictions on the test data. Use the trained logistic regression model to predict the final labels from these probabilities.

➢ Evaluate the Stacked Model: Calculate the accuracy of the stacked model's predictions on the test data and print it to evaluate performance.

➢ Fit and Evaluate Individual Models:
• Suppress warnings to avoid cluttered output. Iterate through the individual models (Decision Tree, KNN, XGBoost, Random Forest).
• Fit each individual model on the training data and make predictions on the test data.
• Calculate the accuracy of each individual model.
• Print the accuracy scores and calculate the mean accuracy across all models.
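
To illustrate what 'soft' voting does internally, the following minimal sketch (an illustration with assumed base models, not the report's full ensemble) averages class probabilities across fitted models by hand; VotingClassifier with voting='soft' performs essentially this averaging, optionally weighted:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit two simple base models (illustrative choices)
models = [DecisionTreeClassifier(max_depth=3, random_state=42),
          KNeighborsClassifier(n_neighbors=5)]
for m in models:
    m.fit(X_train, y_train)

# Soft voting: average the class-probability estimates of the base models...
avg_probs = np.mean([m.predict_proba(X_test) for m in models], axis=0)

# ...then predict the class with the highest average probability
predictions = np.argmax(avg_probs, axis=1)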
5) Description of the dataset:
The Breast Cancer (Wisconsin Diagnostic) dataset bundled with scikit-learn contains 569 samples, each described by 30 numeric features computed from digitized images of fine-needle aspirates of breast masses, with a binary target (malignant or benign). Applying a Heterogeneous Ensemble Learner to this dataset involves combining diverse machine learning algorithms such as Decision Trees, Random Forest, XGBoost, and k-Nearest Neighbors (KNN). By leveraging the strengths of these different algorithms, the ensemble model can capture a broader range of patterns and nuances within the data. Stacking, a popular technique in ensemble learning, merges the predictions of the individual models through a meta-learner, often a Logistic Regression model, which makes the final decision.
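
As a quick check of these dataset characteristics, using scikit-learn's bundled copy:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)    # (569, 30): 569 samples, 30 numeric features
print(data.target_names)  # ['malignant' 'benign']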

6) Python code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual models (modified models)
decision_tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # limit the depth of the Decision Tree
ada_boost = AdaBoostClassifier(n_estimators=50, random_state=42)  # use AdaBoost as another base model

# Create a heterogeneous ensemble model using VotingClassifier (modified ensemble)
ensemble = VotingClassifier(estimators=[
    ('decision_tree', decision_tree),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('ada_boost', ada_boost),  # include AdaBoost in the ensemble
    ('random_forest', RandomForestClassifier(random_state=42))
], voting='soft')  # 'soft' voting allows for probability-based voting

# Fit the ensemble model on the training data
ensemble.fit(X_train, y_train)

# Get the class probabilities from the ensemble on the training data
ensemble_probs = ensemble.predict_proba(X_train)

# Create a Logistic Regression model to stack the ensemble's output (modified meta-learner)
logistic_regressor = LogisticRegression(C=0.1, random_state=42)  # regularize the Logistic Regression

# Train the Logistic Regression model using probabilities from the ensemble
logistic_regressor.fit(ensemble_probs, y_train)

# Make predictions on the test data
ensemble_test_probs = ensemble.predict_proba(X_test)
stacked_predictions = logistic_regressor.predict(ensemble_test_probs)

# Evaluate the stacked model using accuracy as the evaluation metric
accuracy = accuracy_score(y_test, stacked_predictions)
print("Stacked Model Accuracy:", accuracy)

OUTPUT: Stacked Model Accuracy: 0.9649122807017544

# Fit and evaluate the individual models, then display their accuracies
import warnings
warnings.filterwarnings('ignore')  # suppress warnings to avoid cluttered output

from xgboost import XGBClassifier

individual_models = [
    ('Decision Tree', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('kNN', KNeighborsClassifier(n_neighbors=5)),
    ('XGBoost', XGBClassifier(random_state=42)),
]

# Fit each model on the training data and record its accuracy on the test data
accuracies = []
for name, model in individual_models:
    model.fit(X_train, y_train)
    accuracies.append((name, accuracy_score(y_test, model.predict(X_test))))

# Display the accuracy of each individual model
for name, acc in accuracies:
    print(f"{name} Accuracy: {acc:.4f}")

# Calculate and display the overall mean accuracy
mean_accuracy = sum(acc for _, acc in accuracies) / len(accuracies)
print(f"Mean Accuracy: {mean_accuracy:.4f}")
OUTPUT:
Decision Tree Accuracy: 0.9474
Random Forest Accuracy: 0.9561
kNN Accuracy: 0.9561
XGBoost Accuracy: 0.9649
Mean Accuracy: 0.9561

7) Result:
The stacked model achieved an accuracy of 0.9649 on the test set, matching the best individual model (XGBoost, 0.9649) and exceeding the mean individual accuracy of 0.9561. This demonstrates the success of the Heterogeneous Ensemble Learner in improving overall classification performance and shows the strength of combining diverse algorithms for robust predictions, with potential to aid in diagnosing breast cancer.
