20MID0209 Lab - 6


NAME: R B SHARAN

REG NO: 20MID0209


COURSE: Advanced Predictive Analysis
COURSE CODE: MDI 3003
SLOT: L41+L42

LAB ASSIGNMENT - 6
QUESTION 1:

1) Objective:
The experiment aims to identify the optimal combination of base learners and
meta-learner for stacking. By evaluating different configurations of base
learners and meta-learner parameters, the goal is to discover the ensemble
setup that maximizes the overall accuracy and generalizability of the model.

2) Overview of the Heterogeneous Ensemble Learner Method:
Heterogeneous ensembles combine various types of machine learning
algorithms, such as decision trees, neural networks, support vector machines,
or linear models. Each base model has its own strengths and weaknesses, and
by combining diverse models, the ensemble can handle different aspects of the
data, leading to improved generalization.
Different base models are selected based on their suitability for the problem
domain. For example, a random forest might capture complex nonlinear
patterns, while a linear model could handle simpler relationships. The selection
of base models depends on the dataset characteristics and the ensemble's
objectives.
Base models in a heterogeneous ensemble should be as independent and
diverse as possible. Independence ensures that the models make different
errors on different subsets of data. Diversity in predictions helps in reducing
the overall error when combined. Methods like bagging (Bootstrap
Aggregating) and boosting (AdaBoost) can be employed to create diverse base
models.
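
To make this concrete, here is a minimal sketch (the estimator counts and random seeds are illustrative assumptions, not settings from this experiment) of how bagging and boosting produce diverse base models in scikit-learn:

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Bagging: each learner (a decision tree by default) is trained on a
# different bootstrap sample, so the models tend to err on different inputs
bagged_trees = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: learners are trained sequentially, reweighting the examples that
# earlier learners misclassified, which yields a different form of diversity
boosted_trees = AdaBoostClassifier(n_estimators=50, random_state=42)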
Stacking, a common technique in heterogeneous ensembles, involves training a
meta-learner (or a higher-level model) to make predictions based on the
outputs of individual base models. The predictions of base models serve as
features for the meta-learner. Stacking helps capture higher-order patterns in
the data that might be missed by individual models, leading to improved
overall performance.
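Concretely, if f1, f2, …, fK denote the fitted base models and g the meta-learner, the stacked prediction for an input x can be written schematically (notation added here for clarity, not taken from the original report) as ŷ = g(f1(x), f2(x), …, fK(x)); the base models' outputs become the input features on which g is trained.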

3) For the implementation of the Heterogeneous Ensemble Learner, the following steps have been applied:
➢ Load the Dataset: Load the breast cancer dataset using a library
like scikit-learn.
➢ Split the Data: Split the dataset into training and testing sets to
evaluate the model's performance.
➢ Define Base Learners: Choose base learners for your ensemble.
In this case, use Decision Tree, Random Forest, XGBoost, and k-
Nearest Neighbors (KNN) classifiers.
➢ Define Meta-Learner: Select a meta-learner, which is another
machine learning model that combines the outputs of base
learners. Common choices include models like Logistic
Regression or another Random Forest classifier.
➢ Create the Stacking Ensemble Model: Use a Stacking Classifier to combine the predictions from the base learners and train the meta-learner. Set the base learners and meta-learner when creating the Stacking Classifier (a minimal sketch appears after this list).
➢ Train the Ensemble Model: Train the Stacking Ensemble Model using the training data. The Stacking Classifier handles the stacking and the training of the base learners and the meta-learner.
➢ Make Predictions: Use the trained ensemble model to make predictions on the test data.
➢ Evaluate the Model: Assess the accuracy or other relevant metrics of the ensemble model's predictions on the test set to measure its performance.
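
As a minimal sketch of these steps, scikit-learn's StackingClassifier can wire the base learners and the meta-learner together directly. The hyperparameters shown (max_depth=3, n_neighbors=5) are illustrative assumptions, and XGBoost is omitted so the sketch depends only on scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load and split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base learners: diverse model families, as described above
base_learners = [
    ('decision_tree', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('random_forest', RandomForestClassifier(random_state=42)),
]

# Meta-learner: Logistic Regression combines the base learners' predictions
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))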

4) Steps applied in this experiment according to the algorithm:

➢ Import Necessary Libraries: Import the required libraries, including the scikit-learn models, data loading, train-test split, and evaluation metrics.

➢ Load and Split the Data: Load the breast cancer dataset and split it into training and testing sets, using 80% of the data for training and 20% for testing.

➢ Create Individual Models:
• Initialize individual machine learning models: Decision Tree, K-Nearest Neighbors (KNN), XGBoost, and Random Forest. Train each individual model on the training data.

➢ Create a Heterogeneous Ensemble Model: Create a VotingClassifier with the individual models as estimators. Use 'soft' voting, which allows probability-based voting (a short sketch of this probability averaging follows this list). Fit the ensemble model on the training data.

➢ Obtain Probabilities from Ensemble: Get the class probabilities from the ensemble model's predictions on the training data.

➢ Create Meta-Learner (Logistic Regression): Initialize a Logistic Regression model as the meta-learner. Train it using the probabilities obtained from the ensemble model and the corresponding true labels (y_train).

➢ Make Predictions with Stacked Model: Get the probabilities from the ensemble model's predictions on the test data. Use the trained logistic regression model to predict the final labels from these probabilities.

➢ Evaluate the Stacked Model: Calculate the accuracy of the stacked model's predictions on the test data and print it to evaluate performance.

➢ Fit and Evaluate Individual Models:
• Suppress warnings to avoid cluttered output. Iterate through the individual models (Decision Tree, KNN, XGBoost, Random Forest).
• Fit each individual model on the training data and make predictions on the test data.
• Calculate the accuracy of each individual model.
• Print the accuracy scores and calculate the mean accuracy across all models.
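
To illustrate what 'soft' voting does internally, the following minimal sketch (an illustration with assumed base models, not the report's full ensemble) averages class probabilities across fitted models by hand; VotingClassifier with voting='soft' performs essentially this averaging, optionally weighted:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit two simple base models (illustrative choices)
models = [DecisionTreeClassifier(max_depth=3, random_state=42),
          KNeighborsClassifier(n_neighbors=5)]
for m in models:
    m.fit(X_train, y_train)

# Soft voting: average the class-probability estimates of the base models...
avg_probs = np.mean([m.predict_proba(X_test) for m in models], axis=0)

# ...then predict the class with the highest average probability
predictions = np.argmax(avg_probs, axis=1)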
5) Description of the dataset:
The Breast Cancer (Wisconsin Diagnostic) dataset bundled with scikit-learn contains 569 samples, each described by 30 numeric features computed from digitized images of fine-needle aspirates of breast masses, with a binary target (malignant or benign). Applying a Heterogeneous Ensemble Learner to this dataset involves combining diverse machine learning algorithms such as Decision Trees, Random Forest, XGBoost, and k-Nearest Neighbors (KNN). By leveraging the strengths of these different algorithms, the ensemble model can capture a broader range of patterns and nuances within the data. Stacking, a popular technique in ensemble learning, merges the predictions of the individual models through a meta-learner, often a Logistic Regression model, which makes the final decision.
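
As a quick check of these dataset characteristics, using scikit-learn's bundled copy:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)    # (569, 30): 569 samples, 30 numeric features
print(data.target_names)  # ['malignant' 'benign']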

6) Python code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual models (modified models)
decision_tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # limit the depth of the Decision Tree
ada_boost = AdaBoostClassifier(n_estimators=50, random_state=42)  # use AdaBoost as another base model

# Create a heterogeneous ensemble model using VotingClassifier (modified ensemble)
ensemble = VotingClassifier(estimators=[
    ('decision_tree', decision_tree),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('ada_boost', ada_boost),  # include AdaBoost in the ensemble
    ('random_forest', RandomForestClassifier(random_state=42))
], voting='soft')  # 'soft' voting allows for probability-based voting

# Fit the ensemble model on the training data
ensemble.fit(X_train, y_train)

# Get the class probabilities from the ensemble on the training data
ensemble_probs = ensemble.predict_proba(X_train)

# Create a Logistic Regression model to stack the ensemble's output (modified meta-learner)
logistic_regressor = LogisticRegression(C=0.1, random_state=42)  # regularize the Logistic Regression

# Train the Logistic Regression model using probabilities from the ensemble
logistic_regressor.fit(ensemble_probs, y_train)

# Make predictions on the test data
ensemble_test_probs = ensemble.predict_proba(X_test)
stacked_predictions = logistic_regressor.predict(ensemble_test_probs)

# Evaluate the stacked model using accuracy as the evaluation metric
accuracy = accuracy_score(y_test, stacked_predictions)
print("Stacked Model Accuracy:", accuracy)

OUTPUT: Stacked Model Accuracy: 0.9649122807017544

# Fit and evaluate the individual models, then display their accuracies
import warnings
warnings.filterwarnings('ignore')  # suppress warnings to avoid cluttered output

from xgboost import XGBClassifier

individual_models = [
    ('Decision Tree', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('kNN', KNeighborsClassifier(n_neighbors=5)),
    ('XGBoost', XGBClassifier(random_state=42)),
]

# Fit each model on the training data and record its accuracy on the test data
accuracies = []
for name, model in individual_models:
    model.fit(X_train, y_train)
    accuracies.append((name, accuracy_score(y_test, model.predict(X_test))))

# Display the accuracy of each individual model
for name, acc in accuracies:
    print(f"{name} Accuracy: {acc:.4f}")

# Calculate and display the overall mean accuracy
mean_accuracy = sum(acc for _, acc in accuracies) / len(accuracies)
print(f"Mean Accuracy: {mean_accuracy:.4f}")
OUTPUT:
Decision Tree Accuracy: 0.9474
Random Forest Accuracy: 0.9561
kNN Accuracy: 0.9561
XGBoost Accuracy: 0.9649
Mean Accuracy: 0.9561

7) Result:
The stacked model achieved an accuracy of 0.9649 on the test set, matching the best individual model (XGBoost, 0.9649) and exceeding the mean individual accuracy of 0.9561. This demonstrates the success of the Heterogeneous Ensemble Learner in improving overall classification performance and shows the strength of combining diverse algorithms for robust predictions, with potential to aid in diagnosing breast cancer.
